Bixo uses Cascading to define the execution of tasks, and Cascading is built on top of Hadoop, the leading open source MapReduce framework.
This means that Bixo will scale out to many servers, with little or no extra effort. You can run Bixo on your laptop, and you can run Bixo on a cluster of 100 servers, depending on your needs.
It also means that Bixo runs well in Amazon’s Elastic Compute Cloud (EC2). With a few commands you can create a Bixo cluster, fetch and process the target web pages, save the results, and dispose of the cluster, all without buying, provisioning or maintaining any hardware.
The following sections provide step-by-step instructions for running Bixo in EC2.
WARNING: The EC2 configuration information on the Hadoop and AWS sites differs from what’s described below. Please follow these steps, which rely as much as possible on the automation provided by modified versions of the Hadoop Bash scripts.
Getting an AWS account
The very first thing you need to do is to create an AWS account. This creates the various keys you need to interact with EC2, and sets up billing for the time that you use.
- Sign up for an AWS account, if you don’t already have one.
- Go to the Amazon EC2 Getting Started Guide page. Note that you don’t need to do everything described here, just the specific items listed below.
- Follow the steps in the Setting up an Account/Signing up for Amazon EC2 section. Once you’ve followed these steps, you’ll have the X.509 certificate and private key, as well as your AWS account ID, your Access Key ID and your Secret Access Key. These are listed on the “Access Identifiers” page. The AWS account ID is also called the “Account Number” on some AWS pages, and is a 12-digit number with the format xxxx-xxxx-xxxx. Also note that you must sign up for an EC2 account in addition to your AWS account.
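Because the account ID has a fixed shape, a quick format check before saving it can catch copy-and-paste slips. The following is a minimal sketch: the function name is_aws_account_id is made up for this example and is not part of Bixo or the AWS tools, and accepting the 12 digits without dashes is an assumption rather than something the AWS pages promise.

```shell
# Hypothetical helper (not part of Bixo or the AWS tools).
# Succeeds if $1 looks like a 12-digit AWS account ID, written either
# as xxxx-xxxx-xxxx or (an assumption) as 12 plain digits.
is_aws_account_id() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]) ;;
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) return 1 ;;
    esac
}
```

For example, `is_aws_account_id 1234-5678-9012 && echo "looks fine"` prints the message, while a value with a missing digit or a stray letter makes the function fail.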
Configuring Bixo for EC2
- Follow the Getting Started instructions to download Bixo onto your hard disk.
- Create a new directory on your hard disk for your AWS key information. If you have a fork of Bixo, make sure this directory is NOT in your git directory, so that you don’t accidentally push your secret key information into GitHub.
- Copy the X.509 cert-<id>.pem and pk-<id>.pem files from the Getting an AWS account procedure above into this directory.
- Inside this directory, create a file called accountid that contains your AWS account ID.
- Inside this directory, create a file called accesskey that contains your Access Key ID.
- Inside this directory, create a file called secretkey that contains your Secret Access Key.
- In your <bixo root>/bin/ec2/ directory, create a file called .local.awskey-path. This should contain the full path to the AWS key directory that you’ve populated above.
- Generate an EC2 keypair by running the following commands:
% cd <bixo root>/bin/ec2/
% . setenv.sh
% ec2-add-keypair <keypair name>
Use something short like “mybixo” for the keypair name.
- Copy the output of everything between (and including) the “-----BEGIN RSA PRIVATE KEY-----” and “-----END RSA PRIVATE KEY-----” lines to create a file called id_rsa-<keypair name> in the AWS key directory.
- Set privileges on the id_rsa-<keypair name> file to be read-write for only the user:
% chmod go= id_rsa-<keypair name>
Congratulations, you are now set up to run Bixo in EC2.
Note: If you’re using Cygwin on Windows, make sure you have ssh installed in your Cygwin environment.
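The key directory steps above can also be scripted. The sketch below is one possible way to do it, not something that ships with Bixo: the function names and argument order are invented for this example, the file names are the ones listed in the steps, and writing each value followed by a newline is an assumption about how the EC2 scripts read the files.

```shell
# Sketch only: function names and argument order are made up for this example.

# Create the AWS key directory and point Bixo's EC2 scripts at it.
# <bixo root>/bin/ec2 must already exist.
# Usage: make_aws_key_dir <key dir> <bixo root> <account id> <access key id> <secret access key>
make_aws_key_dir() {
    key_dir=$1
    bixo_root=$2
    mkdir -p "$key_dir" || return 1
    chmod 700 "$key_dir"    # extra precaution, not required by the steps above
    printf '%s\n' "$3" > "$key_dir/accountid"
    printf '%s\n' "$4" > "$key_dir/accesskey"
    printf '%s\n' "$5" > "$key_dir/secretkey"
    chmod go= "$key_dir/accountid" "$key_dir/accesskey" "$key_dir/secretkey"
    # Store the full path to the key directory where the bin/ec2 scripts look for it.
    printf '%s\n' "$(cd "$key_dir" && pwd)" > "$bixo_root/bin/ec2/.local.awskey-path"
}

# Save the private key block from ec2-add-keypair output (read on stdin)
# as id_rsa-<keypair name> in the key directory, accessible by the user only.
# Usage: ec2-add-keypair mybixo | save_keypair <key dir> mybixo
save_keypair() {
    key_file="$1/id_rsa-$2"
    ( umask 077
      sed -n '/BEGIN RSA PRIVATE KEY/,/END RSA PRIVATE KEY/p' > "$key_file" )
    chmod go= "$key_file"
    [ -s "$key_file" ]    # fail if no key block was found in the input
}
```

Passing the secret key as an argument leaves it in your shell history, so you may prefer to adapt make_aws_key_dir to prompt for it instead. As with the manual steps, keep the key directory outside any git working copy.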
Setting up Elasticfox
Elasticfox is a free Firefox extension that lets you monitor your EC2 servers, find out information about them, configure public access, etc.
Follow these steps to install and configure Elasticfox:
- Launch Firefox (version 3.0 or later)
- Download the Elasticfox extension by clicking this link.
- If you get a dialog asking you what to do with the download file, select “Open with…” and choose Firefox.
- Select Tools > Elasticfox.
- The “Enter AWS credentials” dialog should be automatically displayed. Choose any Account Name, enter your AWS Access and Secret Access Keys, and click the “Add” button, followed by the “Close” button.
- Click the icon to the left of “Account IDs”, choose any Display Name, enter your AWS Account ID, and click the “Add” button.
- Make sure the “Regions” popup is set to “us-east-1”.
Setting up FoxyProxy
FoxyProxy is a free Firefox extension that can use a local proxy to correctly handle internal EC2 URLs.
Follow these steps to install and configure FoxyProxy:
- Launch Firefox (version 3.0 or later)
- Download the FoxyProxy extension by clicking this link.
- Configure the FoxyProxy extension as follows (sorry, lots of steps here):
- Select Tools > FoxyProxy > Options
- Click the “Add New Proxy” button.
- Select “Manual Proxy Configuration” radio button.
- Enter “localhost” for the “Host or IP Address” field.
- Enter “6666” for the “Port” field.
- Click on the “General” tab at the top of the dialog box.
- Enter “EC2” for the “Proxy Name” field.
- Click on the “URL Patterns” tab at the top of the dialog box.
- Click the “Add New Pattern” button.
- Enter “EC2” for the “Pattern Name” field.
- Enter “*compute-1.amazonaws.com*, *.ec2.internal*, *.compute-1.internal*” for the “URL pattern” field (not case sensitive)
- Select the “Whitelist” and “Wildcards” radio buttons.
- Click the “OK” button to dismiss the new URL pattern dialog box.
- Click the “OK” button to dismiss the new proxy dialog box. You should now have a window that looks like this:

- Close the FoxyProxy Options dialog box.
- At the bottom right corner of your browser, you should now have a new “FoxyProxy Disabled” label.
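To see which addresses the whitelist patterns above are meant to catch, the same wildcards can be tried out in a shell case statement. This is only an illustration: it assumes shell globbing behaves like FoxyProxy’s wildcard mode for these simple patterns, the function name matches_ec2_pattern is invented, and the host names in the loop are made-up examples.

```shell
# Hypothetical helper: succeeds if URL $1 matches one of the EC2 patterns
# entered in FoxyProxy above. Matching is case-insensitive, as in the dialog.
matches_ec2_pattern() {
    url=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$url" in
        *compute-1.amazonaws.com*|*.ec2.internal*|*.compute-1.internal*) return 0 ;;
        *) return 1 ;;
    esac
}

# Made-up example URLs: two EC2-style addresses and one ordinary site.
for url in \
    "http://ec2-1-2-3-4.compute-1.amazonaws.com:50030/" \
    "http://ip-10-0-0-1.ec2.internal:50070/" \
    "http://www.example.com/"
do
    if matches_ec2_pattern "$url"; then
        echo "proxied: $url"
    else
        echo "direct:  $url"
    fi
done
```

Only the first two URLs are reported as proxied, which is the behavior the whitelist is set up to produce: EC2 addresses go through the local proxy on port 6666, and everything else is fetched directly.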
Launching a Bixo EC2 Cluster
You are now ready to launch a Bixo cluster in EC2.
% cd <Bixo root>/bin/ec2
% . setenv.sh
% hadoop-ec2 launch-cluster <cluster name> <number of slave servers>
For example “% hadoop-ec2 launch-cluster bixo 2”. Ignore the many “[Deprecated] Xalan : xxx” messages that annoyingly appear in the terminal window.
- Open a new Firefox browser window and select Tools > Elasticfox
- Wait for your <cluster name>-master and <cluster name> slave servers to show up in the Elasticfox list. Eventually it will look something like this:

- Open a new terminal window
% cd <Bixo root>/bin/ec2
% . setenv.sh
% hadoop-ec2 proxy <cluster name>
This will start up a local proxy, and output URLs to use for monitoring the cluster.
- Right-click on the “FoxyProxy disabled” text at the bottom right of any Firefox browser window. Select “Use proxy EC2 for all URLs” from the pop-up menu.

- Open a new browser window and paste the JobTracker URL output from the proxy. This will let you track the actual Hadoop jobs once Bixo starts running.
- Open a new browser window and paste the NameNode URL output from the proxy. This will let you view files in HDFS (Hadoop Distributed File System) that are generated by Bixo.
Running a Bixo job
Once your cluster is up and running, you can run an actual Bixo job on it.
% cd <Bixo root>/bin/ec2
% . setenv.sh
% hadoop-ec2 push <cluster name> ../../release/bixo-job-<version>.jar
This may take a while, as the large jar has to be uploaded from your machine to the <cluster name>-master server.
% hadoop-ec2 login <cluster name>
This logs you into the master server.
% hadoop jar bixo-core-job-<version>.jar -domain yahoo.com -numloops 2 -outputdir output -agentname <your agent name>
The above command will start a crawl of the yahoo.com domain, and do two loops. You can monitor the progress of the crawl via the browser window you opened with the JobTracker URL from the proxy. In addition, you can watch files being created in HDFS by browsing the HDFS file system in the browser window you opened with the NameNode URL from the proxy.
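If you run crawls for different domains, it can help to build the command from its parts and review it before pasting it into the master’s shell. The helper below is a hypothetical convenience, not part of Bixo. It only prints the command, using the jar name and flags from the example above, and it assumes that none of the values contain spaces.

```shell
# Hypothetical helper: print the crawl command for the given settings
# instead of running it. Values are not quoted, so they must not contain spaces.
# Usage: bixo_crawl_cmd <version> <domain> <loops> <output dir> <agent name>
bixo_crawl_cmd() {
    if [ "$#" -ne 5 ]; then
        echo "usage: bixo_crawl_cmd <version> <domain> <loops> <output dir> <agent name>" >&2
        return 2
    fi
    # The loop count must consist of digits only.
    case "$3" in
        ''|*[!0-9]*) echo "loops must be a whole number: $3" >&2; return 2 ;;
    esac
    printf 'hadoop jar bixo-core-job-%s.jar -domain %s -numloops %s -outputdir %s -agentname %s\n' \
        "$1" "$2" "$3" "$4" "$5"
}
```

Pass the version of the job jar you pushed as the first argument; the function rejects a loop count that is not a whole number, so a typo is caught before the job is submitted.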
Additional Notes
Finding internal and external EC2 server names – In Elasticfox, right-click on any server and select “Copy Private DNS Name to clipboard” or “Copy Public DNS Name to clipboard”.
SSHing into slaves – First use the hadoop-ec2 login command to log into the master, then just:
% ssh <internal server name>
All servers in the cluster are configured for keyless SSH.
Copying files from EC2 servers – Use the scp tool, as in:
% scp $SSH_OPTS root@<public DNS name>:<path to file> <local path>
The $SSH_OPTS shell variable has been set up by the setenv.sh script with the values needed to access the EC2 servers.
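When the same file has to be collected from several servers, the scp command can be wrapped in a small loop. The following is a sketch under a few assumptions: the function name fetch_from_servers and the DRY_RUN switch are invented for this example, and setenv.sh must already have been sourced in the current shell so that $SSH_OPTS is set.

```shell
# Hypothetical helper: copy the same remote file from several EC2 servers,
# saving each copy under <local dir>/<public DNS name>/.
# Assumes setenv.sh has been sourced so that $SSH_OPTS is set.
# Set DRY_RUN=1 to print the scp commands without running them.
# Usage: fetch_from_servers <remote path> <local dir> <public DNS name>...
fetch_from_servers() {
    remote_path=$1
    local_dir=$2
    shift 2
    for host in "$@"; do
        if [ -n "$DRY_RUN" ]; then
            echo "scp $SSH_OPTS root@$host:$remote_path $local_dir/$host/"
        else
            mkdir -p "$local_dir/$host" || return 1
            # $SSH_OPTS is left unquoted on purpose so it splits into separate options.
            scp $SSH_OPTS "root@$host:$remote_path" "$local_dir/$host/" || return 1
        fi
    done
}
```

Running it once with DRY_RUN=1 shows exactly which scp commands would be issued. The public DNS names can be copied from Elasticfox as described above.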