Easy OpenStack Folsom with VirtualBox and Vagrant

Testing OpenStack is now easy thanks to VirtualBox and Vagrant. To run a mini test environment with Compute, Cinder, Keystone and Horizon, you just need the following tools:

  • VirtualBox
  • Vagrant
  • Git client

Getting Ready

To set up a sandbox environment within VirtualBox to run OpenStack Folsom, you will need to download and install the tools listed above. Installation of each is simple – just follow the on-screen prompts.

When ready, we need to configure VirtualBox “Host-Only” networking. This networking mode allows the VirtualBox guest and the underlying host to communicate with each other.
We will set up the following:

  • Host-Only Network: IP 172.16.0.254; Network 172.16.0.0/255.255.0.0; Disable DHCP
  • Host-Only Network #2: IP 10.0.0.254; Network 10.0.0.0/255.0.0.0; Disable DHCP

(Hint: there is a bash script @ https://raw.github.com/uksysadmin/OpenStackInstaller/folsom/virtualbox/vbox-create-networks.sh to create these for you).
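
If you would rather create the networks by hand, the equivalent VBoxManage commands look roughly like the following (the interface names vboxnet0 and vboxnet1 are assumptions – VirtualBox assigns names in creation order, and they differ on Windows):

# Create the first host-only interface and give it the 172.16.0.0/16 address
VBoxManage hostonlyif create
VBoxManage hostonlyif ipconfig vboxnet0 --ip 172.16.0.254 --netmask 255.255.0.0

# Create the second host-only interface and give it the 10.0.0.0/8 address
VBoxManage hostonlyif create
VBoxManage hostonlyif ipconfig vboxnet1 --ip 10.0.0.254 --netmask 255.0.0.0

# Remove any DHCP server attached to the interfaces (harmless if none exists)
VBoxManage dhcpserver remove --ifname vboxnet0
VBoxManage dhcpserver remove --ifname vboxnet1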

How To Do It

To create a VirtualBox VM running Ubuntu 12.04 with OpenStack Folsom from Ubuntu’s Cloud Archive, carry out the following steps:

1. Clone the GitHub OpenStackInstaller scripts

git clone https://github.com/uksysadmin/OpenStackInstaller.git

2. Switch the scripts to the ‘folsom’ branch

cd OpenStackInstaller
git checkout folsom

3. Run ‘vagrant up’ to launch your OpenStack instance, which will come up with the IP 172.16.0.201

cd virtualbox
vagrant up

4. After a short while your instance will be ready. Note that on the first run, Vagrant will download a 384 MB Precise64 “box”; subsequent launches will skip this download.

Launch a web browser at http://172.16.0.201/horizon and log in with:

Username: admin
Password: openstack

(Note: to change the IP the instance is assigned, modify virtualbox/vagrant-openstack-bootstrap.sh – be warned, it’s a bit of a sed hack!)
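
Once the box is up, the standard Vagrant commands are handy for managing it (run from the same virtualbox/ directory):

# Log in to the running OpenStack VM over SSH
vagrant ssh

# Shut the VM down without destroying it (bring it back with 'vagrant up')
vagrant halt

# Remove the VM completely when you have finished with the sandbox
vagrant destroy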

Protecting SSH against brute force attacks

Running a public AWS instance is asking for trouble from the script kiddies and bots trying to find a way in to compromise your server.
Sshguard (www.sshguard.net) monitors your auth log and updates your iptables firewall accordingly to help keep persistent brute-force attackers at bay.

1. Download the latest version (linked from http://www.sshguard.net):

wget -O sshguard-1.5.tar.bz2 http://freshmeat.net/urls/6ff38f7dc039f95efec2859eefe17d3a

2. Unpack

tar jxvf sshguard-1.5.tar.bz2

3. Configure + Make

cd sshguard-1.5
./configure --with-firewall=iptables
make

4. Install (to /usr/local/sbin/sshguard)

sudo make install

5. Create /etc/init.d/sshguard (chmod 0755)

#!/bin/sh
# this is a concept, elaborate to your taste
case $1 in
start)
/usr/local/sbin/sshguard -a 4 -b 5:/var/sshguard/blacklist.db -l /var/log/auth.log &
;;
stop)
killall sshguard
;;
*)
echo "Use start or stop"
exit 1
;;
esac

6. Create /etc/iptables.up.rules

# Firewall
*filter
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:INPUT DROP [0:0]
-N sshguard
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p tcp --dport http -j ACCEPT
-A INPUT -p tcp --dport ftp-data -j ACCEPT
-A INPUT -p tcp --dport ftp -j ACCEPT
-A INPUT -p tcp --dport ssh -j sshguard
# Accept SSH that sshguard has not blocked (without this, the final DROP rule would block all SSH)
-A INPUT -p tcp --dport ssh -j ACCEPT
-A INPUT -p udp --source-port 53 -d 0/0 -j ACCEPT
-A OUTPUT -j ACCEPT
-A INPUT -j DROP
COMMIT
# Completed

7. Read in the IPtables rules

iptables-restore < /etc/iptables.up.rules

8. Start Sshguard

mkdir /var/sshguard && /etc/init.d/sshguard start

Verification

To check that sshguard is working, watch the auth log for failed attempts and confirm that offending addresses appear in the sshguard chain:

tail -f /var/log/auth.log
iptables -L -n

Detecting Scraping: Apache, Squid using Hadoop and Pig

Architecture

Squid operates at the front end as a reverse proxy-cache, sitting behind a hardware load balancer and pulling its content from a large load-balanced pool of Apache (with JBoss) servers.  The Squid logs contain some useful information (the remote client's IP) and the Apache logs contain some equally important information (User-Agent, session ID).

The first challenge: Stitching Squid logs to Apache

The Squid logs are in Squid’s native log format, which is optimised for speed.  The Squid logs rotate hourly and the Apache logs rotate daily.  This arrangement balances front-end speed with the convenience of rotating logs at one of the quietest times of the day (4am).

The Apache logs use a custom log format that includes response time and cookie information:

CustomLog   logs/access.log "%h %D %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""

To stitch the two together, the first thing to realise is that you only need to look at Squid MISSes – logically, these are the only requests that had to go to Apache for content.

The only identifying pieces of information common to both Squid and Apache are the timestamp and the URL, so there is immediately room for improvement here and a small percentage of hits don’t get stitched together.  This is fine, as we’re after a picture of potential scrapers rather than accurate accounting information.

So the first Pig script merges the data based on timestamp and URL.  This takes a fair amount of time, but in the end it produces a log file with a lot of useful information in it – external IP, user-agent, cookie information, etc. It does this by streaming the data through some Perl, as the Unix timestamp format in the Squid log needs converting to the same format Apache uses. I end up with entries like the following:

03/Oct/2010:04:02:14	44	xxx.xxx.xxx.xxx TCP_MISS/200 365	GET	/static/js/bundle.js	-	172.22.25.155	54905	03/Oct/2010:04:02:14	GET	/static/js/bundle.js	1.1	200	289939	http://mysite.co.uk/some/resource Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; InfoPath.1)	34AA0D711456868AAF7C871F8190C72F	1243622207600.672376.715	LU11LY

Note the final three fields: a session ID, a user ID and a postcode [something specific to this particular site].  We can extract any meaningful information from the log entry while the data is being streamed through the Perl script.
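
For illustration, a simplified sketch of that first Pig script might look something like the following – the paths, the schemas and the epoch2apache.pl helper are assumptions rather than the real production script, and in practice the Squid log's multi-space padding and the quoted Apache fields need cleaning up before a simple PigStorage load will behave:

-- Squid side: keep only the MISSes (only they reached Apache) and rewrite the epoch
-- timestamp into Apache's %d/%b/%Y:%H:%M:%S format via a small Perl streaming script
-- (epoch2apache.pl is a hypothetical stand-in; ship it to the cluster with DEFINE/SHIP)
squid    = LOAD '/logs/squid/access.log' USING PigStorage(' ')
           AS (epoch:chararray, elapsed:int, client:chararray, action:chararray,
               bytes:long, method:chararray, url:chararray);
misses   = FILTER squid BY action MATCHES 'TCP_MISS.*';
squid_ts = STREAM misses THROUGH `perl epoch2apache.pl`
           AS (ts:chararray, elapsed:int, client:chararray, action:chararray,
               bytes:long, method:chararray, url:chararray);

-- Apache side: assume an earlier parsing step has already reduced each entry to
-- (ts, method, url, status, bytes, agent, session, userid, postcode)
apache   = LOAD '/logs/apache/access.parsed'
           AS (ts:chararray, method:chararray, url:chararray, status:int, bytes:long,
               agent:chararray, session:chararray, userid:chararray, postcode:chararray);

-- Stitch the two on the only fields they share: timestamp and URL
stitched = JOIN squid_ts BY (ts, url), apache BY (ts, url);
STORE stitched INTO '/output/stitched' USING PigStorage('\t');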

The crux of the challenge: What is scraping?

There are a number of issues with identifying scraping. Some scrapes are very easy to spot – sequentially increasing page numbers, single IPs, the same identifiable information used in the search strings. The more complex ones mix IP addresses with different User-Agent strings, and even single IP addresses with multiple User-Agents pose a problem: mega-proxies, where many, many genuine home users filter through from AOL and the like.

The first steps: Grouping IP, with Session ID and User-Agent

The first step towards identifying the slightly less basic scrapers (ones with multiple IPs and User-Agents) was to group the IP with the session ID and User-Agent.  This is then counted and ordered by the count to show how often that combination from that IP is seen – notice we don’t need to see the URL.  The assumption is that a high count for one combination is unusual and worthy of further investigation.

This job takes a few hours to run on a relatively small amount of data on the current PoC Hadoop cluster, so there is room for optimisation here.

Note that this script highlights the bots too, which by the same token generate a high number of hits from single or multiple IPs.

The result is a count-ordered list of (IP, session ID, User-Agent) combinations, with the heaviest hitters at the top.

I’ll post the work-in-progress code here to share the findings and work on a way of improving this.
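
In the meantime, a minimal Pig sketch of the grouping described above might look like this (it is not the actual work-in-progress script, and the field positions in the stitched output are assumptions):

-- Load the stitched log and keep just the fields we group on
hits    = LOAD '/output/stitched' USING PigStorage('\t')
          AS (ip:chararray, agent:chararray, session:chararray);

-- Count hits per (IP, session ID, User-Agent) combination; no URL needed
grouped = GROUP hits BY (ip, session, agent);
counted = FOREACH grouped GENERATE FLATTEN(group) AS (ip, session, agent),
                                   COUNT(hits) AS total;

-- The busiest combinations bubble to the top for further investigation
ordered = ORDER counted BY total DESC;
STORE ordered INTO '/output/suspects' USING PigStorage('\t');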

How to solve a problem like scraping

It’s been a while since I last blogged but it doesn’t mean I’ve disappeared. See it as me being deep in thought.

I work for a large web site operating in EMEA that has lots of invaluable data available to the public. This is great, but other people want to take that data wholesale without going through the proper authorised channels. This is known as scraping – effectively “Site Content Raping” to coin a not so nice phrase.

Scraping is very easy to do. There are tools out there that, in a few clicks, will spider your site and download the content – after all, the data is public and the hyperlinks are designed to take you through it. Web search engine bots effectively scrape our site too, but the difference is that they report back links; scraping involves downloading the data itself, and that is what causes the legal issues. In truth, scraping is a legal matter – but the legal route to stopping it is hampered by two problems: it is a lengthy process, and it needs evidence of the scraping activity to show that the terms and conditions of your site are being broken.

The problem with scraping is being able to identify it in the first place. Some scrapes are relatively benign and easy to spot; ironically, they’re not usually an issue unless a lazily implemented scrape causes site capacity problems – and well-designed site infrastructure should be able to cope with any surge in demand, however it is presented. Most scraping activity stays under the radar, though, and spotting the trail involves understanding how the site can be scraped in the first place and the methods used to evade detection. The hardest challenge of all is distinguishing this activity from the millions of legitimate requests hitting the site at the same time.

There are a number of ways to tackle the problem:

– Employ a 3rd party to monitor and report on the scraping activity in real-time on your behalf as part of a monthly service operational expense
– Implement ingress filtering of your data to report on activities in real-time using equipment maintained and set up by teams internally
– Implement log analysis after the event

I’m looking at log analysis to tackle the issue. This involves processing large data sets with Hadoop and custom scripts to slice and dice the information, helping to form the conclusions that will go into the reports needed to evidence scraping activity.

Over the next few months I aim to track my successes and failures in combating this problem.

Hadoop, Pig, Apache and Squid Log Processing

I’ve been experimenting with Hadoop to help process the gigabytes of logs generated by Apache and Squid where I work. Currently we have a very small proof-of-concept cluster of five nodes that churns through Squid logs hourly to produce GnuPlot graphs of traffic over the last hour and the last day.
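
As a rough illustration of that hourly cycle only – the paths, script names and plotting step below are all assumptions, not the actual production job – the cron-driven pipeline looks something like this:

#!/bin/sh
# Hypothetical hourly job: copy the freshly rotated Squid log into HDFS,
# summarise it with a Pig script and plot the result with gnuplot.
HOUR=$(date +%Y%m%d%H)

# Push the rotated Squid log into HDFS (paths are assumptions)
hadoop fs -mkdir /logs/squid/$HOUR
hadoop fs -put /var/log/squid/access.log.0 /logs/squid/$HOUR/

# Summarise traffic for the hour (traffic-per-hour.pig is a hypothetical script)
pig -param INPUT=/logs/squid/$HOUR -param OUTPUT=/reports/squid/$HOUR traffic-per-hour.pig

# Pull the summary back locally and graph it (traffic.gp is a hypothetical gnuplot script)
hadoop fs -getmerge /reports/squid/$HOUR /tmp/traffic-$HOUR.tsv
gnuplot -e "datafile='/tmp/traffic-$HOUR.tsv'" traffic.gp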

I’ll not cover Hadoop itself in any detail (there are many places to look for that – for example Yahoo! and Cloudera), but I will document the scripts used to get Hadoop processing the Squid and Apache logs here.