This guide is for an Apache reverse proxy setup (such as the one outlined here). This method allows setting a single directive that affects all Apache virtual hosts (which is what I wanted, since I have several sites).
You might want to use robots.txt to give directions to various search engines regarding crawling your site. Search engines (the good ones, anyway) will check for /robots.txt and respect it before commencing their crawl.
In the example below, I use robots.txt to direct a search engine (in this case MJ12bot) to not crawl my sites.
First, let's create the robots.txt file. I'm going to create the following file and put it in /var/www/
User-agent: MJ12bot
Disallow: /
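If you'd rather create the file from the shell, something like the following should work (it assumes /var/www already exists and that you have sudo rights):

```shell
# Write the robots.txt rules; 'sudo tee' handles writing into
# the root-owned directory, and the quoted 'EOF' keeps the
# heredoc contents literal.
sudo tee /var/www/robots.txt > /dev/null <<'EOF'
User-agent: MJ12bot
Disallow: /
EOF

# Make sure the Apache user can read the file
sudo chmod 644 /var/www/robots.txt
```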
We now need to tell Apache to serve all requests for robots.txt from the file we created above. We can do this by adding the code below to our apache2.conf (or httpd.conf if using CentOS, Amazon Linux, etc.):
sudo nano /etc/apache2/apache2.conf
Note: if using CentOS or Amazon Linux, you'll likely do this instead:
sudo nano /etc/httpd/conf/httpd.conf
And add the following to the end of the file:
# global robots.txt file for controlling those crawlers (good ones anyway)
<Location "/robots.txt">
    ProxyPass !
</Location>
Alias /robots.txt /var/www/robots.txt
The above code does two things: first, the ProxyPass ! tells Apache not to proxy any requests for /robots.txt; second, the Alias tells Apache the location of our robots.txt file on disk.
Finally, we need to reload (or restart) apache for the configuration to load:
sudo service apache2 reload

Or, if using CentOS or Amazon Linux:

sudo service httpd reload
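To confirm everything worked, you can check the configuration syntax and then fetch robots.txt through Apache (example.com here is just a placeholder for one of your site hostnames):

```shell
# Check the Apache configuration for syntax errors
sudo apachectl configtest

# Fetch robots.txt through Apache; this should return the file
# we created, not a response from the proxied backend
curl -s http://example.com/robots.txt
```

If the setup is correct, the curl output should match the contents of /var/www/robots.txt.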