Hadoop and Pig on Ubuntu 10.04 Lucid Lynx

Hadoop is a parallel data processing framework, inspired by the MapReduce and GFS papers that Google's infrastructure is based upon, consisting of MapReduce and HDFS. It is used to batch process massive amounts of data across many nodes.

Getting it running on Ubuntu Linux is easy; the steps are outlined below.

Overview (Single Node)

  • Download Hadoop 0.20.2
  • Download Pig
  • Configure environment
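The three steps above can be sketched as a shell session. The archive URLs, the Pig 0.7.0 version, and the Java path are assumptions typical of the Ubuntu 10.04 era (the Java path matches the sun-java6-jdk package); adjust them for your mirror and JDK.

```shell
# Fetch Hadoop 0.20.2 and Pig (URLs and Pig version are examples;
# check the Apache archive for current locations)
cd "$HOME"
wget -q http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz \
  || echo "download failed; fetch hadoop-0.20.2.tar.gz manually"
wget -q http://archive.apache.org/dist/hadoop/pig/pig-0.7.0/pig-0.7.0.tar.gz \
  || echo "download failed; fetch pig-0.7.0.tar.gz manually"
if [ -f hadoop-0.20.2.tar.gz ]; then tar xzf hadoop-0.20.2.tar.gz; fi
if [ -f pig-0.7.0.tar.gz ]; then tar xzf pig-0.7.0.tar.gz; fi

# Configure the environment: Hadoop needs JAVA_HOME; this path
# assumes the Ubuntu sun-java6-jdk package
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME="$HOME/hadoop-0.20.2"
export PIG_HOME="$HOME/pig-0.7.0"
export PATH="$HADOOP_HOME/bin:$PIG_HOME/bin:$PATH"
```

You will also want to set JAVA_HOME in conf/hadoop-env.sh so the Hadoop daemons pick it up.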

Overview (Multi Node Cluster)

  • As for a single node, but additionally edit the relevant configuration files (conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml, conf/masters and conf/slaves) on each node
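A minimal sketch of the cluster configuration, written from the shell. The hostname "master", the node names, and the 9000/9001 ports are placeholders; substitute your own machines.

```shell
# Write the minimal cluster config (repeat on every node).
# "master" is a placeholder hostname for the NameNode/JobTracker box.
HADOOP_CONF="${HADOOP_HOME:-$HOME/hadoop-0.20.2}/conf"
mkdir -p "$HADOOP_CONF"

# core-site.xml: where HDFS lives
cat > "$HADOOP_CONF/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

# mapred-site.xml: where the JobTracker lives
cat > "$HADOOP_CONF/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
EOF

# conf/slaves: one worker hostname per line (placeholders)
printf 'node1\nnode2\nnode3\n' > "$HADOOP_CONF/slaves"
```

The master then starts the whole cluster over SSH with bin/start-all.sh, reading conf/slaves to find the workers.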



Hadoop, Pig, Apache and Squid Log Processing

I’ve been experimenting with Hadoop to help process the gigabytes of logs generated from Apache and Squid where I work. Currently, we’ve a very small proof-of-concept cluster comprising five nodes that churns through Squid logs hourly to produce GnuPlot graphs of traffic over the last hour and the last day.

I’ll not cover Hadoop itself in any detail (there are many places to look for this – for example Yahoo! and Cloudera), but I’ll document the scripts used to get Hadoop processing Squid and Apache logs.
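As a flavour of the kind of script involved, here is a sketch of an hourly-traffic Pig job, written from the shell. The field positions follow Squid's native access.log format (timestamp, elapsed, client, code/status, bytes, ...); this is an illustration, not the exact script from our cluster, and real Squid logs pad fields with spaces, so they may need squeezing (e.g. with tr -s ' ') before Pig sees them.

```shell
# Sum bytes transferred per hour from a Squid native access.log.
# Assumes single-space-delimited fields; only the first five are loaded.
cat > hourly_traffic.pig <<'EOF'
logs  = LOAD 'access.log' USING PigStorage(' ')
        AS (ts:double, elapsed:long, client:chararray,
            status:chararray, bytes:long);
-- bucket each request into its hour (UNIX timestamp / 3600)
hours = FOREACH logs GENERATE (long)(ts / 3600.0) AS hour, bytes;
byhr  = GROUP hours BY hour;
out   = FOREACH byhr GENERATE group AS hour, SUM(hours.bytes) AS total;
STORE out INTO 'hourly_traffic';
EOF

# pig hourly_traffic.pig   # then feed hourly_traffic/part-* to GnuPlot
```

The STORE output is plain tab-separated hour/byte-count pairs, which is exactly the shape GnuPlot wants for a traffic graph.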