
This guide is for an Apache reverse proxy setup (such as the one outlined here). The method below sets a single directive that covers all Apache virtual hosts, which is what I wanted since I host several sites.

You might want to use robots.txt to give search engines directions on crawling your site. Search engines (the good ones, anyway) will check for /robots.txt and respect its directives before commencing their crawl.
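
For illustration, a robots.txt can do more than block a single bot; directives are grouped under a User-agent line. Here's a quick sketch of common directives (the /admin/ and /tmp/ paths are hypothetical examples, not part of this guide's setup):

Example robots.txt directives (paths are hypothetical)
User-agent: *
Disallow: /admin/
Disallow: /tmp/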

In the example below, I use robots.txt to direct a search engine (in this case MJ12bot) not to crawl my sites.

Guide

First, let's create the robots.txt file.  I'm going to put the following file in /var/www/:

/var/www/robots.txt
User-agent: MJ12bot
Disallow: /
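
If you want to block more than one crawler, robots.txt groups rules by User-agent, so you can simply add another block. A sketch (AhrefsBot here is just an example of a second bot, not something this guide requires):

/var/www/robots.txt (sketch blocking two crawlers)
User-agent: MJ12bot
Disallow: /

User-agent: AhrefsBot
Disallow: /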

We now need to tell Apache to serve all requests for /robots.txt from the file we just created, rather than proxying them to a backend. We can do this by adding the code below to apache2.conf (or httpd.conf if using CentOS, Amazon Linux, etc.):

Edit apache2.conf on Debian/Ubuntu
sudo nano /etc/apache2/apache2.conf

Note: if using CentOS or Amazon Linux, you'll likely do this instead:

Edit httpd.conf on CentOS or Amazon Linux
sudo nano /etc/httpd/conf/httpd.conf

And add the following to the end of the file:

# global robots.txt file for controlling those crawlers (good ones anyway)
<Location "/robots.txt">
    ProxyPass !
</Location>
Alias /robots.txt /var/www/robots.txt

The above code does two things: the ProxyPass ! exclusion tells Apache not to proxy requests for /robots.txt to the backend, and the Alias directive tells Apache where our robots.txt file lives on disk.
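
One caveat: Apache 2.4 won't serve files from a directory it hasn't been granted access to. The stock Debian/Ubuntu and CentOS configs already grant access to /var/www, but if your configuration doesn't (or you put the file somewhere else), you may also need something like the following sketch, which grants access to just this one file:

Grant access to the robots.txt file (Apache 2.4 syntax)
<Directory "/var/www">
    <Files "robots.txt">
        Require all granted
    </Files>
</Directory>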

Finally, we need to reload (or restart) Apache for the new configuration to take effect:

Reload apache2 service (for Debian/Ubuntu etc.)
sudo service apache2 reload


Reload httpd service (for CentOS, Amazon Linux etc.)
sudo service httpd reload
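
On newer systemd-based releases, the equivalent commands are sudo systemctl reload apache2 and sudo systemctl reload httpd.

Once reloaded, you can confirm that the file is served by the proxy host itself rather than passed to a backend by fetching it with curl (example.com below is a placeholder for one of your sites):

Verify the robots.txt is served
curl http://example.com/robots.txt

You should see the contents of /var/www/robots.txt in the response.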

References

  1. https://stackoverflow.com/questions/28616917/robots-txt-on-apache-reverse-proxy
  2. http://mj12bot.com/