Requirements

A Java Runtime Environment (JRE) – version 1.6 or later.
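
To confirm that the installed Java meets this requirement, something like the following shell sketch can be used. The function names java_version_ok and installed_java_version are hypothetical helpers, not part of Bixo, and the sketch assumes that `java -version` reports a quoted version string such as "1.6.0_45" or "11.0.2" on standard error.

```shell
# Hypothetical helpers (not part of Bixo) for checking the JRE requirement.

# Succeed if a version string such as "1.6.0_45" or "11.0.2" is 1.6 or later.
java_version_ok() {
    major=${1%%.*}                 # text before the first dot
    major=${major%%[!0-9]*}        # keep only the leading digits
    rest=${1#*.}                   # text after the first dot
    minor=${rest%%.*}
    minor=${minor%%[!0-9]*}
    [ -n "$major" ] || return 1    # no parsable version
    if [ "$major" -gt 1 ]; then
        return 0                   # 9, 11, 17, ... are all later than 1.6
    fi
    [ "$major" -eq 1 ] && [ "${minor:-0}" -ge 6 ]
}

# Print the quoted version reported by `java -version` (written to stderr).
installed_java_version() {
    java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}
```

A typical use would be: java_version_ok "$(installed_java_version)" || echo "Java 1.6 or later is required"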

Running locally with pre-built binaries

  1. Download the latest distribution file and save it to your computer.
  2. Expand the file into a directory on your computer.
  3. Using the command line:
    • % cd <bixo directory>
    • % bin/bixo crawl -agentname <name> -domain <domain> -outputdir <dir> -numloops 3

This starts a crawl job for pages found in <domain> and runs three loops. The results are saved to the output directory you specify. This directory must not exist yet; otherwise the crawl assumes you are continuing a previous crawl. The <domain> should be a valid domain directly below a top-level domain, e.g. cnn.com, and the <name> you specify for the agent name should be something specific to your organization or use case, NOT “bixo”.
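
The two preconditions above (a fresh output directory and an agent name other than “bixo”) can be checked before launching the crawl. The following is a minimal sketch, not part of the Bixo distribution: the function names are hypothetical, the case-insensitive comparison of the agent name is an assumption, and fresh_crawl is meant to be run from the Bixo directory so that bin/bixo resolves.

```shell
# Hypothetical wrapper (not part of Bixo) that validates arguments
# before starting a fresh crawl.

# Fail if the agent name is empty or is just "bixo" (compared case-insensitively,
# which is an assumption of this sketch).
check_agent_name() {
    lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    if [ -z "$lowered" ] || [ "$lowered" = "bixo" ]; then
        echo "agent name must be specific to your organization, not '$1'" >&2
        return 1
    fi
}

# Fail if the output directory already exists, since an existing directory
# is treated as a previous crawl to continue.
check_fresh_output_dir() {
    if [ -e "$1" ]; then
        echo "'$1' already exists; choose a new directory for a fresh crawl" >&2
        return 1
    fi
}

# Usage: fresh_crawl <name> <domain> <dir> [numloops]
fresh_crawl() {
    check_agent_name "$1" || return 1
    check_fresh_output_dir "$3" || return 1
    bin/bixo crawl -agentname "$1" -domain "$2" -outputdir "$3" -numloops "${4:-3}"
}
```

For example, fresh_crawl acme-research-crawler cnn.com ./crawl-output would run the same command as step 3 above, using a made-up agent name and directory, but only after both checks pass.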

After the crawl has completed, you can dump some statistics by executing the following, where <dir> is the output directory used for the crawl:
    • % bin/bixo status -crawldir <dir>

See Building Bixo for details on how to build Bixo from source.

Running locally in Eclipse

  1. Follow the Building Bixo steps for getting the source and creating an Eclipse project.
  2. Open the Run dialog for the SimpleCrawlTool class and specify the program arguments: the file containing URLs to crawl, the directory to use for results, and the user-agent name. Also set the JVM arguments to “-Xmx256m” so that there is enough memory to run the Hadoop jobs.

Running in Amazon’s EC2

See the detailed instructions on the Running Bixo in EC2 page.