Blocking Image Leechers in Apache


This page gives some tips on preventing other websites from using images hosted on your website.

Image leeching is a problem. Basically, some other website shows an inline image whose file is served from your website. Besides copyright issues, it causes bandwidth problems for your site. A lot of websites these days allow their users to insert images from a URL. The users may not be aware that this is a problem, since most of them are not technical people; they simply want to show their friends some images they found.

Image leeching often takes significant bandwidth from your site. Suppose you have an image of, let's say, a beautiful girl. Many sites, whether porn or otherwise shady sites, or sites such as MySpace that are full of teens, high school and college students, and gamers, have huge amounts of traffic for rather useless content (mostly teen drivel and banter). If one of them inlines one of your images, that image may get a few thousand hits a day. Once you get leeched, more than 50% of your site's bandwidth can go to leechers; more likely, your bandwidth usage will be a few times your normal.

My website did not have image leech protection until 2003 or so. Then I noticed image leechers; they caused my site to go over its bandwidth limit for the month. This happened to me several times. It meant I had to pay extra hosting fees, charged by the megabyte. (See: XahLee.org Web Traffic Report)

Apache “.htaccess” Config for Blocking Leechers

The following is Apache HTTP Server config for blocking leechers. Place it in a file named “.htaccess”, usually at the root web dir.

RewriteEngine on

# block image leechers
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?xahlee\.org/ [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?xahlee\.info/ [NC]
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?ergoemacs\.org/ [NC]
RewriteCond %{HTTP_REFERER} !^https?://xahlee\.blogspot\.com/ [NC]
RewriteRule \.(png|gif|jpg|jpeg|mov|mp3|wav|ico)$ - [NC,F]

http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html

What the above code does is this: it checks the HTTP_REFERER header the browser sends. (HTTP_REFERER holds the URL of the page the request came from.) If all the conditions are met, the rewrite rule is applied to the requested URL, which here means denying access.

Here are the conditions: The HTTP_REFERER is not blank (if it is blank, it usually means the visitor typed the image URL in the browser). The HTTP_REFERER does not match any of xahlee.org, xahlee.info, ergoemacs.org, xahlee.blogspot.com. When both hold (the image is inlined from some other website), access is denied for URLs ending in “png”, “gif”, “jpg”, etc. The “NC” means ignore letter case. The “F” means deny access (Forbidden).
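To make the decision logic concrete, here is a small Python model of it. This is only a sketch for illustration, not how Apache evaluates the rules internally. The function name `is_forbidden` and the list `ALLOWED_REFERERS` are made up for this example, and the referer patterns are simplified rather than copied from the config.

```python
import re

# Sites allowed to inline the images (the domains named in the rules above).
# Hypothetical, simplified patterns; not copied verbatim from the .htaccess.
ALLOWED_REFERERS = [
    re.compile(r"^https?://(www\.)?xahlee\.org/", re.IGNORECASE),
    re.compile(r"^https?://(www\.)?xahlee\.info/", re.IGNORECASE),
    re.compile(r"^https?://(www\.)?ergoemacs\.org/", re.IGNORECASE),
    re.compile(r"^https?://xahlee\.blogspot\.com/", re.IGNORECASE),
]

# Same file extensions as the RewriteRule pattern; IGNORECASE plays the role of NC.
PROTECTED = re.compile(r"\.(png|gif|jpg|jpeg|mov|mp3|wav|ico)$", re.IGNORECASE)


def is_forbidden(referer: str, path: str) -> bool:
    """Return True if this model would answer the request with 403 Forbidden."""
    if referer == "":
        # Blank referer: typed URL, bookmark, or a browser that sends none.
        return False
    if any(p.search(referer) for p in ALLOWED_REFERERS):
        # The request comes from a page on one of the allowed sites.
        return False
    # Any other referer: forbidden only for the protected file types.
    return PROTECTED.search(path) is not None


if __name__ == "__main__":
    print(is_forbidden("", "/img/cat.png"))
    print(is_forbidden("http://www.myspace.com/someone", "/img/cat.png"))
```

A blank referer and a referer from an allowed site both pass; any other referer is refused, but only for the listed file types, so ordinary HTML pages stay reachable from any site.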

Note that this protection is not absolute. The very nature of the web, from its original conception, is to let anyone share info, without much thought about copyright or commercial development. A leecher can easily just mirror your image on their own server (which steals your image but not your bandwidth), or the leeching site's pages can use JavaScript code that bypasses this protection. Anyhow, that's getting into hacking. The above code should prevent the vast majority of bandwidth theft.

There are several more advanced blocking methods. One is described at http://www.alistapart.com/articles/hotlinking/; others use JavaScript to keep people from knowing your image URL, or embed images in Flash. But these usually get a bit complicated.

Site Whackers

Another related problem is site whackers. Some people, typically programming geeks, download everything to their computer when they like your site, so they can read it offline, or out of some packrat habit. If your website is a few hundred megabytes or several gigabytes, then a few whacks will use up your monthly bandwidth quota. A whack also causes a huge spike in your traffic, making your site slow.

Here's the config to prevent simple web whackers:

# forbid bots that whack entire site
RewriteCond  %{HTTP_USER_AGENT}  ^Wget|Teleport\ Pro|webreaper\.net|WebCopier|HTTrack|WebCapture
RewriteRule  ^.+$                -  [F,L]

The above does not provide absolute protection. Dedicated whackers can easily bypass it by masking their user agent id. But it should stop the majority of casual web whackers.
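The user agent check can be modeled in a few lines of Python, which also shows why it is easy to bypass. This is a sketch under assumptions: the function name `is_whacker` and the sample user agent strings are invented for illustration. The pattern is the same alternation as in the RewriteCond, with the dot in “webreaper.net” escaped.

```python
import re

# Same alternation as the RewriteCond pattern: "Wget" must start the string,
# while the other names may appear anywhere in it.
WHACKER_UA = re.compile(
    r"^Wget|Teleport Pro|webreaper\.net|WebCopier|HTTrack|WebCapture"
)


def is_whacker(user_agent: str) -> bool:
    """Return True if the user agent string matches a listed downloader."""
    return WHACKER_UA.search(user_agent) is not None


if __name__ == "__main__":
    # Illustrative user agent strings, not taken from real logs.
    print(is_whacker("Wget/1.21"))
    print(is_whacker("Mozilla/5.0 (X11; Linux x86_64)"))
```

The second call is the bypass: a downloader that reports a browser-like string (Wget, for example, has a `--user-agent` option) does not match, so the rule lets it through.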

Some popular website downloaders are HTTrack and Wget. Note: cURL does not let you download a site recursively, or a page together with all its linked pages. It lets you download a single page, or a given set of pages or images. (See also: Linux: Sync Files Across Machines: rsync Tutorial.)

Of course there are legitimate uses. In the older days (say, late 1990s or early 2000s), internet speed was not as fast as today, many people were still using 28.8 kbit/s modems, and websites were not that reliable or accessible. So people needed to download sites to read offline, or to archive them in case the info disappeared the next day. Today, however, with video sites and 100 megabyte movie trailers and all that, these reasons are mostly gone.

Note that there's the Robots exclusion standard, which tells web crawlers how they should behave. However, web downloading software usually ignores it, or can easily be set to ignore it.
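For comparison, here is roughly what a well-behaved crawler does with the Robots exclusion standard, sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `/images/` path, the `example.com` URLs, and the “ExampleBot” name are all made up for this example. Nothing forces a downloader to run a check like this, which is why robots.txt alone does not stop whackers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
# Parse the rules from a string instead of fetching them over the network.
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed robots.txt rules allow this fetch."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_fetch("ExampleBot", "http://example.com/images/a.png"))
    print(may_fetch("ExampleBot", "http://example.com/index.html"))
```

The check runs entirely on the crawler's side. The server only publishes the rules, so enforcement still needs something like the rewrite rules above.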
