Thursday, September 22, 2016

Allow web crawlers on only one subdomain: Apache configuration

I'm trying to figure out a clean way to stop web crawlers from accessing my site except for the www subdomain on port 80.

Here's what my Apache config looks like:

<VirtualHost 255.255.255.255:80>
  Include /etc/httpd/conf/apps/website.common
  ServerName www.website.com
  ServerAlias www.website.com
  Alias /robots.txt /var/www/apps/website/current/public/okrobots.txt
</VirtualHost>

<VirtualHost 255.255.255.255:80>
  Include /etc/httpd/conf/apps/website.common
  ServerName star.website.com
  ServerAlias *.website.com
  Alias /robots.txt /var/www/apps/website/current/public/robots.txt
</VirtualHost>


<VirtualHost 255.255.255.255:443>
  Include /etc/httpd/conf/apps/website.common

  SSLEngine on
  SSLCertificateFile /etc/httpd/conf/apps/ssl/website.crt
  SSLCertificateKeyFile /etc/httpd/conf/apps/ssl/website.key
  SSLCACertificateFile /etc/httpd/conf/apps/ssl/website_ca_bundle.crt
  SSLProtocol -ALL +SSLv3 +TLSv1
  SSLCipherSuite ALL:!ADH:!LOW:!SSLv2:!EXP:+HIGH:+MEDIUM

  RequestHeader set X_FORWARDED_PROTO 'https'
</VirtualHost>

okrobots.txt allows access only to the homepage and contains:

User-Agent: *
Allow: /$
Disallow: /

and robots.txt contains:

User-Agent: *
Disallow: /

Would I be able to put Alias /robots.txt /var/www/apps/website/current/public/robots.txt in the /etc/httpd/conf/apps/website.common file?

I think this would block all robots from the site, since each VirtualHost block pulls in the common file via the Include directive. In the first VirtualHost block I would then override that Alias to point at okrobots.txt, so crawlers can still access the www site on port 80.

So, my main question is: can I effectively override directives from my common file inside a VirtualHost block?
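One ordering detail to check: mod_alias processes Alias directives in the order they appear in the configuration, and the first match for a given URL path wins. So if website.common (carrying the default robots.txt Alias) is Included at the top of the block, an Alias for the same path added after it would never fire. A minimal sketch of the intended layout, assuming the per-host Alias is the only thing that needs to differ:

```apache
# /etc/httpd/conf/apps/website.common
# Default for every vhost that includes this file: serve the
# blocking robots.txt to all crawlers.
Alias /robots.txt /var/www/apps/website/current/public/robots.txt

# The www vhost (sketch): because mod_alias uses first-match-wins
# ordering, the permissive Alias must appear BEFORE the Include
# that carries the default one.
<VirtualHost 255.255.255.255:80>
  Alias /robots.txt /var/www/apps/website/current/public/okrobots.txt
  Include /etc/httpd/conf/apps/website.common
  ServerName www.website.com
</VirtualHost>
```

With that ordering, the catch-all *.website.com vhost and the 443 vhost fall through to the common Alias, while www on port 80 serves okrobots.txt.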

