Apache, FancyIndexing and PHP 5 (mod_autoindex)

Introduction
The default directory listing in Apache is pretty awful, but I had a need to present some files through a web browser. Rather than produce something with PHP alone, I decided to enhance the Apache FancyIndexing option, as it is designed for exactly this purpose.
I came across a nice PHP enhancement (update to include link and credit) to FancyIndexing that added guided navigation to the directory listing, as well as improving the default font and general styling thanks to effective use of CSS.

Instructions

1. Edit httpd.conf and add or modify the following

AccessFileName .htaccess
<Files ~ "^\.ht">
Order allow,deny
Deny from all
</Files>

<Directory /your/directory>
AllowOverride all
</Directory>

2. In the directory you want to list, add the following .htaccess file

Options +Indexes +FollowSymlinks
IndexOptions FancyIndexing HTMLTable FoldersFirst SuppressRules SuppressDescription SuppressHTMLPreamble Charset=UTF-8
#
# AddIcon* directives tell the server which icon to show for different
# files or filename extensions.  These are only displayed for
# FancyIndexed directories.
#
AddIcon /autoindex/icons/application.png .exe .app
AddIcon /autoindex/icons/type_binary.png .bin .hqx .uu
AddIcon /autoindex/icons/type_box.png .tar .tgz .tbz .tbz2 .bundle .rar
AddIcon /autoindex/icons/type_code.png .html .htm .htx .htmls .dhtml .phtml .shtml .inc .ssi .c .cc .css .h .rb .js .pl .py .sh .shar .csh .ksh .tcl .as
AddIcon /autoindex/icons/type_database.png .db .sqlite .dat
AddIcon /autoindex/icons/type_disc.png .iso .image
AddIcon /autoindex/icons/type_document.png .ttf
AddIcon /autoindex/icons/type_excel.png .xlsx .xls .xlm .xlt .xla .xlb .xld .xlk .xll .xlv .xlw
AddIcon /autoindex/icons/type_flash.png .flv
AddIcon /autoindex/icons/type_illustrator.png .ai .eps .epsf .epsi
AddIcon /autoindex/icons/type_pdf.png .pdf
AddIcon /autoindex/icons/type_php.png .php .phps .php5 .php3 .php4 .phtm
AddIcon /autoindex/icons/type_photoshop.png .psd
AddIcon /autoindex/icons/monitor.png .ps
AddIcon /autoindex/icons/type_powerpoint.png .ppt .pptx .ppz .pot .pwz .ppa .pps .pow
AddIcon /autoindex/icons/type_swf.png .swf
AddIcon /autoindex/icons/type_text.png .tex .dvi
AddIcon /autoindex/icons/type_vcf.png .vcf .vcard
AddIcon /autoindex/icons/type_word.png .doc .docx
AddIcon /autoindex/icons/type_zip.png .Z .z .tgz .gz .zip
AddIcon /autoindex/icons/globe.png .wrl .wrl.gz .vrm .vrml .iv
AddIcon /autoindex/icons/vector.png .plot

AddIconByType (TXT,/autoindex/icons/type_text.png) text/*
AddIconByType (IMG,/autoindex/icons/type_image.png) image/*
AddIconByType (SND,/autoindex/icons/type_audio.png) audio/*
AddIconByType (VID,/autoindex/icons/type_video.png) video/*
AddIconByEncoding (CMP,/autoindex/icons/type_box.png) x-compress x-gzip
AddIcon /autoindex/icons/back.png ..
AddIcon /autoindex/icons/information.png README INSTALL
AddIcon /autoindex/icons/type_folder.png ^^DIRECTORY^^
AddIcon /autoindex/icons/blank.png ^^BLANKICON^^

#
# DefaultIcon is which icon to show for files which do not have an icon
# explicitly set.
#
DefaultIcon /autoindex/icons/type_document.png
#
# Enables PHP to be used in our header file
# 
AddHandler application/x-httpd-php .php
AddType text/html .php .html
#
# ReadmeName is the name of the README file the server will look for by
# default, and append to directory listings.
#
# HeaderName is the name of a file which should be prepended to
# directory indexes.
ReadmeName /autoindex/footer.php
HeaderName /autoindex/header.php
#
# IndexIgnore is a set of filenames which directory indexing should ignore
# and not include in the listing.  Shell-style wildcarding is permitted.
#
IndexIgnore autoindex .??* *~ *# RCS CVS *,v *,t *.dat ..

IndexOptions +NameWidth=42
AddDescription "PNG images" *.png

Warning for PHP 5.3 and higher

I originally had this running with PHP 5.1 and it was working great.  I upgraded to PHP 5.3.3 (the latest at the time of writing) and it refused to parse the PHP, even though the PHP worked fine when I called the header and footer pages directly.

It turned out to be the XHTML keyword in the IndexOptions line.  Remove this and the PHP will be parsed.  The mod_autoindex documentation says of XHTML:

The XHTML keyword forces mod_autoindex to emit XHTML 1.0 code instead of HTML 3.2.

Whereas the same page says that a Header/Readme file “must resolve to a document with a major content type of text/* (e.g. text/html, text/plain, etc.)”.
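For illustration, an IndexOptions line along these lines (the keywords other than XHTML are only examples) is the kind that stops the PHP header and footer being parsed:

IndexOptions FancyIndexing XHTML HTMLTable FoldersFirst SuppressHTMLPreamble Charset=UTF-8

Dropping the XHTML keyword, as in the .htaccess shown earlier, lets mod_autoindex include the PHP output again.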

Building Apache 2.2, PHP 5 with GD and MySQLi support from source

1. Download the following
Apache 2.2 from http://httpd.apache.org/download.cgi [2.2.17]
PHP 5.3.3 from http://www.php.net/downloads.php [5.3.3]
Expat from http://sourceforge.net/projects/expat/ [2.0.1]
JPEG from http://www.ijg.org/ [v8b]
PNG from http://sourceforge.net/projects/libpng/files/ [1.4.4]

2. Apache

./configure --enable-so --enable-modules=most --enable-proxy --with-mpm=worker --disable-imap --enable-deflate
make
sudo make install

3. Expat XML Parser

./configure
make
sudo make install

4. JPEG

./configure
make
sudo make install

5. PNG

./configure
make
sudo make install

6. PHP

./configure --disable-cli --enable-embedded-mysqli --with-zlib --enable-shared --with-apxs2=/usr/local/apache2/bin/apxs --with-gd
make
sudo make install

Detecting Scraping: Apache, Squid using Hadoop and Pig

Architecture

Squid operates at the front end as a reverse proxy-cache behind a hardware load balancer, which in turn gets its content from a large load-balanced pool of Apache (with JBoss) servers.  The Squid logs contain some useful information (the IP of the remote client) and the Apache logs contain some equally important information (the User-Agent and session ID).

The first challenge: Stitching Squid logs to Apache

The Squid logs are in Squid’s native log format, which is optimised for speed.  The Squid logs rotate hourly and the Apache logs rotate daily.  This arrangement fits with front-end speed and the convenience of rotating logs at one of the quietest times of the day (4am).

The Apache logs use a custom log format that includes the response time and cookie information:

CustomLog   logs/access.log "%h %D %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""

To stitch the two together, the first thing to realise is that you only need to look at the Squid MISSes: logically, these are the requests that had to go to Apache to get content.

The only identifying pieces of information common to both Squid and Apache are the timestamp and URL, so immediately there is room for improvement here, and a small percentage of hits don’t get stitched together.  This is fine, as we’re after a picture of potential scrapers rather than accurate accounting information.

So the first Pig script joins the two data sets on timestamp and URL.  This takes a fair amount of time, and in the end a log file is produced with a lot of useful information in it: external IP, User-Agent, cookie information, and so on.  The Squid records are streamed through some Perl along the way, as we need to convert the Unix timestamp format in the Squid log to the same format as Apache’s.  I end up with the following:

03/Oct/2010:04:02:14	44	xxx.xxx.xxx.xxx TCP_MISS/200 365	GET	/static/js/bundle.js	-	172.22.25.155	54905	03/Oct/2010:04:02:14	GET	/static/js/bundle.js	1.1	200	289939	http://mysite.co.uk/some/resource Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; InfoPath.1)	34AA0D711456868AAF7C871F8190C72F	1243622207600.672376.715	LU11LY

Note the final three fields.  These are a session ID, a user ID and a postcode [something specific to this particular site].  We can extract whatever meaningful information we need from the log entry as we stream the data through the Perl script.
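A minimal Pig sketch of this stitching step might look like the following. The paths and field layouts are placeholders rather than the production script, and it assumes both logs have already been normalised into tab-delimited files, with the Squid timestamps converted to Apache’s format by the small Perl script mentioned above.

squid  = LOAD '/logs/squid/normalised' USING PigStorage('\t')
         AS (ts:chararray, uri:chararray, client_ip:chararray,
             action_code:chararray, squid_bytes:int);

apache = LOAD '/logs/apache/normalised' USING PigStorage('\t')
         AS (ts:chararray, uri:chararray, status:int, bytes:int,
             referer:chararray, user_agent:chararray, cookie:chararray);

-- Only the MISSes ever reached Apache, so only they can be stitched.
misses = FILTER squid BY action_code MATCHES '.*MISS.*';

-- Join on the only fields common to both logs: timestamp and URL.
-- Requests that fail to match simply drop out, which is acceptable here.
stitched = JOIN misses BY (ts, uri), apache BY (ts, uri);

STORE stitched INTO '/output/stitched' USING PigStorage('\t');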

The crux of the challenge: What is scraping?

There are a number of issues with identifying scraping. Some scrapes are very easy to spot: sequentially increasing page numbers, single IPs, the same identifiable information used in the search strings. The more complex ones mix IP addresses with different User-Agent strings, and even single IP addresses with multiple User-Agents pose a problem: mega-proxies, where many, many genuine home users filter through from AOL and the like.

The first steps: Grouping IP, with Session ID and User-Agent

The first step towards identifying the slightly less basic scrapers (those with multiple IPs and User-Agents) was to group the IP with the session ID and User-Agent.  Each combination is then counted and ordered by that count to show how often the combination from that IP is seen; notice that we don’t need to see the URL at all.  The assumption is that a high count for one combination is unusual and worthy of further investigation.
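A minimal Pig sketch of this grouping step might look like the following; the field names, positions and paths are assumptions for illustration, not the actual work-in-progress script.

-- Schema simplified for illustration; the real stitched file has many more fields.
stitched  = LOAD '/output/stitched' USING PigStorage('\t')
            AS (ts:chararray, client_ip:chararray, uri:chararray,
                user_agent:chararray, session_id:chararray,
                user_id:chararray, postcode:chararray);

-- Group on the identifying combination; note that the URL is not needed.
by_client = GROUP stitched BY (client_ip, session_id, user_agent);

counted   = FOREACH by_client GENERATE
                FLATTEN(group) AS (client_ip, session_id, user_agent),
                COUNT(stitched) AS hits;

-- The highest-volume combinations are the candidates for further investigation.
suspects  = ORDER counted BY hits DESC;

STORE suspects INTO '/output/suspected-scrapers' USING PigStorage('\t');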

This job takes a few hours to run on a relatively small amount of data on the current PoC Hadoop cluster, so there is room for optimisation here.

Note that this script highlights the bots too, which by the same token generate a high number of hits from single or multiple IPs.

The result is something like the following: a ranked list of IP, session ID and User-Agent combinations with their hit counts.

I’ll post the work-in-progress code here to share the findings and work on a way of improving this.

How to solve a problem like scraping

It’s been a while since I last blogged, but that doesn’t mean I’ve disappeared. See it as me being deep in thought.

I work for a large web site operating in EMEA that has lots of invaluable data available to the public. This is great, but other people want to take that data wholesale without going through the proper authorised channels. This is known as scraping – effectively “Site Content Raping” to coin a not so nice phrase.

Scraping is very easy to do. There are tools out there that, in a few clicks, will spider your site and download the content; after all, the data is public and the hyperlinks are designed to take you through it. The web search engine bots effectively scrape our site too, but the difference is that they report back the links. Scraping the content itself means downloading the data wholesale, and that is what causes the legal issues. In truth, scraping is a legal matter, but the legal route to stopping it is hampered by two things: it’s a lengthy process, and it needs evidence of the scraping activity to show that it breaks your site’s terms and conditions.

The problem with scraping is being able to identify it in the first place. Some scrapes are relatively benign and easy to spot. Ironically, they’re not usually an issue unless the lazy way the scrape has been implemented causes site capacity problems, and well-designed site infrastructure should be able to cope with any surge in demand however it is presented. Most scraping activity stays under the radar, though, and spotting the trail involves understanding how the site can be scraped in the first place and the methods used to evade detection; the hardest challenge of all is distinguishing this activity from the millions of legitimate requests hitting the site at the same time.

There are a number of ways to tackle the problem:

– Employ a third party to monitor and report on the scraping activity in real time on your behalf, as a monthly operational expense
– Implement ingress filtering of your data to report on activity in real time, using equipment set up and maintained by internal teams
– Implement log analysis after the event

I’m taking the log-analysis route to tackle the issue. It involves processing large data sets with Hadoop and custom scripts to slice and dice the information, helping to form the conclusions needed to write the reports that evidence the scraping activity.

Over the next few months I aim to track my successes and failures in combating this problem.

Hadoop, Pig, Apache and Squid Log Processing

I’ve been experimenting with Hadoop to help process the gigabytes of logs generated by Apache and Squid where I work. Currently we have a very small proof-of-concept cluster of five nodes that churns through Squid logs hourly to produce GnuPlot graphs of traffic over the last hour and the last day.

I’ll not cover Hadoop in any detail here (there are many places to look for that, for example Yahoo! and Cloudera), but I will document the scripts used to get Hadoop processing Squid and Apache logs.
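As a flavour of those scripts, here is a minimal Pig sketch of the hourly traffic count that feeds the GnuPlot graphs. The paths and field layout are assumptions rather than the production script, and a real native-format Squid log has space-padded fields that need pre-tokenising (or a custom loader) before PigStorage(' ') will split them cleanly.

raw = LOAD '/logs/squid/access.log.0' USING PigStorage(' ')
      AS (ts:chararray, elapsed:int, client_ip:chararray, action_code:chararray,
          bytes:int, method:chararray, uri:chararray, ident:chararray,
          hierarchy:chararray, content_type:chararray);

-- Squid's native timestamp is Unix epoch seconds with milliseconds
-- (e.g. 1286074934.123); truncate it to the minute to get a plottable bucket.
bucketed = FOREACH raw GENERATE (long)((double)ts / 60.0) AS minute;

grouped  = GROUP bucketed BY minute;
traffic  = FOREACH grouped GENERATE group AS minute, COUNT(bucketed) AS requests;

-- One "minute <tab> requests" row per line, easy to feed straight into GnuPlot.
STORE traffic INTO '/output/traffic-per-minute' USING PigStorage('\t');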