Installing OpenSUSE 11.3 under Virtual Box 3.2

You will need

  1. VirtualBox
  2. OpenSUSE Live CD


Guest Additions

  1. Update the packages
    sudo zypper up
  2. Reboot
  3. Install Kernel Development Packages
    sudo zypper in -t pattern devel_kernel
  4. Mount the VirtualBox Guest Additions CD: [VirtualBox Menu] Devices → Install Guest Additions
  5. Run the VBoxLinuxAdditions installer script found in the mounted directory
    sudo /media/VBOXADDITIONS_3.2.10_66523/
  6. Reboot

Detecting Scraping: Apache, Squid using Hadoop and Pig


Squid operates at the front end as a reverse proxy cache behind a hardware load balancer, which in turn gets its content from a large load-balanced pool of Apache (with JBoss) servers.  The Squid logs hold some useful information (the IP of the remote client) and the Apache logs hold some equally important information (User-Agent, Session ID).

The first challenge: Stitching Squid logs to Apache

The squid logs are in Squid’s native log format, which is optimised for speed.  The squid logs rotate hourly and the Apache logs rotate daily.  This arrangement fits with front-end speed and the convenience of rotating logs at one of the quietest times of the day (4am).

The Apache logs are a custom log format that includes response time and cookie information:

CustomLog   logs/access.log "%h %D %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""

To stitch the two together, the first thing to realise is that you only need to look at Squid MISSes – logically these are the requests that had to go to Apache for content.
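As a first pass, filtering the MISSes out of the squid log can be done outside of Pig too.  A minimal sketch, assuming Squid's native log format where the cache result code (e.g. TCP_MISS/200) is the fourth whitespace-separated field, and an access.log file name:

```shell
# Keep only cache misses -- the requests that actually went to Apache.
# In squid's native log format the result code (e.g. TCP_MISS/200)
# is the 4th whitespace-separated field.
awk '$4 ~ /^TCP_MISS/' access.log > misses.log
```

Everything else (HITs, and so on) was served from the cache and has no matching Apache line, so it can be dropped before the expensive join.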

The only identifying pieces of information common to both Squid and Apache are the timestamp and URL, so immediately there is room for improvement here, and a small percentage of hits don’t get stitched together.  This is fine as we’re after a picture of potential scrapers, rather than accurate accounting information.

So the first Pig script merges the data based on timestamp and URL.  This takes a fair amount of time and in the end produces a log file with a lot of useful information in it – external IP, user-agent, cookie information, etc.  It does this by streaming the data through some perl, as we need to convert the unix timestamp format in the squid log to the same format as Apache.  I end up with the following:

03/Oct/2010:04:02:14	44 TCP_MISS/200 365	GET	/static/js/bundle.js	-	54905	03/Oct/2010:04:02:14	GET	/static/js/bundle.js	1.1	200	289939 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; InfoPath.1)	34AA0D711456868AAF7C871F8190C72F	1243622207600.672376.715	LU11LY

Note the final 3 fields.  These are a session id, a user id and a postcode [something specific to this particular site].  We can extract any meaningful information from the log entry when we stream the data through the perl script.
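The perl that gets streamed through isn't shown above, but the two transformations it performs can be sketched in shell.  This is only an illustrative stand-in: it assumes GNU date, that the log timestamps are UTC, that the stitched line is tab-separated as in the sample above, and that stitched.log is the merged output file:

```shell
# Convert a squid epoch timestamp (e.g. 1286078534.123) to Apache's
# %d/%b/%Y:%H:%M:%S format so the two logs can be joined on time.
# ${1%.*} strips the fractional milliseconds. GNU date assumed.
squid_ts_to_apache() {
  date -u -d "@${1%.*}" +'%d/%b/%Y:%H:%M:%S'
}

squid_ts_to_apache 1286078534.123   # -> 03/Oct/2010:04:02:14

# Pull the session id, user id and postcode out of a stitched,
# tab-separated line -- assumed here to be the final three fields.
awk -F'\t' '{ print $(NF-2), $(NF-1), $NF }' stitched.log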

The crux of the challenge: What is scraping?

There are a number of issues with detecting scraping.  Some scrapes are very easy to spot – sequentially increasing page numbers, single IPs, the same identifiable information used in the search strings.  The more complex ones mix IP addresses with different User-Agent strings.  Even single IP addresses with multiple User-Agents pose a problem: mega-proxies through which many, many genuine home users arrive from AOL, etc.

The first steps: Grouping IP, with Session ID and User-Agent

The first step towards identifying the slightly less basic scrapers (ones with multiple IPs and user-agents) was to group the IP with Session ID and User-Agent.  This is then counted and ordered by the count to show how often that combination from that IP is seen… notice we don’t need to see the URL.  It is assumed that a high count for one combination is unusual and worthy of further investigation.
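Outside of Pig, the same group–count–order can be sketched as a plain shell pipeline.  The column numbers here are purely illustrative (the real positions depend on the stitched log's layout), and a tab-separated stitched.log is assumed:

```shell
# Count (IP, Session ID, User-Agent) combinations and rank by frequency.
# -f6,14,15 is illustrative -- adjust to the stitched log's real columns.
cut -f6,14,15 stitched.log | sort | uniq -c | sort -rn | head -20
```

The Pig version wins once the data no longer fits on one box, but this is a cheap way to sanity-check the output of the cluster job on a sample.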

This job takes a few hours to run on a relatively small amount of data on the current PoC Hadoop cluster, so there is room for optimisation here.

Note that this script highlights the bots too – which, by the same token, generate a high number of hits from single or multiple IPs…

The result is a ranked list of (IP, Session ID, User-Agent) combinations and their counts.

I’ll post the work-in-progress code here to share the findings and work on a way of improving this.

Ubuntu Karmic Koala 64, VirtualBox + Windows 7 64

I have an Intel Core i7, 8Gb DDR3 laptop, which essentially means I’ve 8 cores at my disposal.  To take advantage of this and the ample memory available, like many servers these days, virtualisation is the way to go.

I do have a preference for running Linux on the desktop, which stems from familiarity and flexibility rather than a hatred of Microsoft, but there are tools and reasons for which Windows is the only way forward.  As a middle ground, running Windows as a VM guest strikes a good balance.

VirtualBox is Oracle’s desktop virtualisation offering and is more akin to VMware Server 1.x, but with modern hardware virtualisation support.  Why VBox over VMware Server 2.x?  It’s far simpler to set up and run, and far simpler to install under Ubuntu.  Ubuntu’s repos are populated with VirtualBox OSE, which stands for Open Source Edition.

To install do the following:

apt-get install virtualbox-ose

At the time of writing the OSE version in Karmic Stable is 3.0.8, whereas the latest version is 3.1.4.  I mention this because I ran into the following problems with running Windows 7 64-Bit and VirtualBox-OSE 3.0.8 64-Bit:

  • Intermittent slowness in the guest
  • Locking up of the Guest + VirtualBox

To alleviate these problems I did the following

  • Installed the RealTime (RT) kernel in Ubuntu Karmic
    apt-get install linux-image-rt linux-headers-rt
  • Reduced the number of vCPUs from 4, to 2, to 1

This did seem to fix the issue of locking up, and gave me a Windows 7 desktop, but I had 8 cores to play with – going to 1 vCPU just caused me problems running applications within the guest that needed a little more processing power than I had available.

I’ve now resorted to installing the version from the VirtualBox download site.  This is 3.1.4 r57640 at the time of this posting.

As I already had Windows 7 up and running under OSE 3.0.8, I did the following:

apt-get remove virtualbox-ose

(This removes certain libqt4 dependencies.)  The complete steps to install VirtualBox 3.1.4 from the downloaded .deb:

apt-get install libqt4-network libqt4-opengl libqtcore4 libqtgui4
dpkg -i /tmp/virtualbox-3.1_3.1.4-57640_Ubuntu_karmic_amd64.deb

This will ask you to compile new, updated kernel modules, so answer yes when prompted.
It also says you must add your user to the vboxusers group, so do this:

sudo usermod -G vboxusers -a username

Log out and back in again to activate your user’s membership of the vboxusers group.
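To confirm the group change took effect after logging back in, something like the following should print vboxusers (here username is a placeholder for your own login):

```shell
# After logging back in, vboxusers should appear in the user's groups.
# grep -x matches the whole line, so "vboxusers" must be an exact entry.
id -nG username | tr ' ' '\n' | grep -x vboxusers
```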

Fire up Windows 7 and things appear more stable.  I’m currently running 2 vCPUs with 4Gb RAM.  If you had Windows 7 previously running under an older version of VirtualBox, don’t forget to update the Guest Additions to match the running VirtualBox version.

Qnap TS-210: Home NAS

I recently purchased a home network storage device after consolidating the myriad external hard drives I had.  After a good few hours of googling and asking for recommendations, I took delivery of a Qnap TS-210 – a 2-drive-bay SATA-2 NAS enclosure.  This device is awesome!  After populating it with a couple of fast, quiet Western Digital Blue drives, I now have RAID1 shared storage connected to my wireless hub, so my data is available across all my home machines.

This is all well and good but what sets this apart from the rest of the NAS devices out there for under £250 is that this device is not just great for serving my photos over NFS…

The Qnap TS-210 sports the following:

  • iSCSI target
  • Apache with PHP, SSL and WebDAV
  • MySQL Server
  • TwonkyServer uPNP multimedia server
  • BitTorrent client
  • Web cam surveillance server

And the list doesn’t stop there.  The device runs Linux and it doesn’t hide access to its internals.  Enable the Optware plugin and you get access to an apt/yum-like repository.  From here you can install a wide range of tools and services, such as Squid.

Administration of the NAS is through a polished Ajax interface.
Setup is straightforward, although there are a couple of gotchas which I came across in the 3.2 version firmware:

Do not run EXT4 if you expect a stable NFS server – it’s not production ready.  After a frustrating few days of copying data from hard drive to hard drive, I went back to EXT3, which is rock solid stable.

If, like me, you start with a single drive expecting to easily mirror it later on, then you’re out of luck.  Copying the data off and back again after setting up a new mirror is the only option on this device.

More positives, though, come in the shape of power management and green features.  The device consumes only 14W when in use and 7W when idle.  Hard drives are powered down after a set period of inactivity, and you can set times when you want the device off and when to power up again.  Handy for a home or small office device that doesn’t need to be on 24 hours a day, where at the same time you don’t want the hassle of remembering to power it on each day.