Requirements
A Java Runtime Environment (JRE) – version 1.6 or later.
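As a quick way to confirm this requirement, the following sketch compares the locally installed Java version against the 1.6 minimum. The function names `version_at_least_1_6` and `installed_java_version` are hypothetical helpers introduced here, not part of Bixo, and the parsing assumes the usual `java -version` output in which the first line contains a quoted version string.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative and not part of Bixo.

# Succeeds if a version string such as "1.6.0_45" or "17.0.2" denotes 1.6 or later.
version_at_least_1_6() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second numeric field
    case $major in
        ''|*[!0-9]*) return 1 ;;   # missing or non-numeric version
    esac
    if [ "$major" -gt 1 ]; then
        return 0                   # 9, 11, 17, ... are all newer than 1.6
    fi
    case $minor in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$major" -eq 1 ] && [ "$minor" -ge 6 ]
}

# Prints the quoted version from `java -version` (which writes to stderr),
# or nothing if no java executable is found.
installed_java_version() {
    java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1
}

if version_at_least_1_6 "$(installed_java_version)"; then
    echo "Java is recent enough to run Bixo"
else
    echo "Java 1.6 or later was not found on the PATH" >&2
fi
```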
Running locally with pre-built binaries
- Download the latest distribution file and save it to your computer.
- Expand the file into a directory on your computer.
- Using the command line:
% cd <bixo directory>
% bin/bixo crawl -agentname <name> -domain <domain> -outputdir <dir> -numloops 3
This starts a crawl job for pages found in <domain> and runs three loops. The results are saved to the output directory you specify. This directory must not exist yet; otherwise the crawl assumes you are continuing a previous crawl. The <domain> should be a valid domain name, e.g. cnn.com, and the <name> you specify for the agent name should be something specific to your organization or use case, NOT “bixo”.
After the crawl has completed, you can dump some statistics by executing:
% bin/bixo status -crawldir <dir>
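The crawl and status commands above can be combined in a small wrapper that refuses to start when the output directory already exists, so that a new crawl is not mistaken for the continuation of an earlier one. This is a minimal sketch: `is_fresh_dir`, `run_crawl`, and the `BIXO_CMD` variable are hypothetical names introduced here, and only the command-line flags shown above are used.

```shell
#!/bin/sh
# Sketch only: run from the Bixo directory. Function and variable names
# are illustrative and not part of Bixo.

# Succeeds only if the given path does not exist yet.
is_fresh_dir() {
    [ ! -e "$1" ]
}

# Usage: run_crawl <agent name> <domain> <output dir> [number of loops]
run_crawl() {
    agent=$1
    domain=$2
    outdir=$3
    loops=${4:-3}   # default to three loops, as in the example above

    if ! is_fresh_dir "$outdir"; then
        echo "error: $outdir already exists; choose a new output directory" >&2
        return 1
    fi

    # BIXO_CMD may be set to "echo" for a dry run that only prints the commands.
    ${BIXO_CMD:-bin/bixo} crawl -agentname "$agent" -domain "$domain" \
        -outputdir "$outdir" -numloops "$loops" &&
        ${BIXO_CMD:-bin/bixo} status -crawldir "$outdir"
}

# Run only when at least the three required arguments are supplied.
if [ "$#" -ge 3 ]; then
    run_crawl "$@"
fi
```

Saved under a name of your choice and invoked with an agent name, a domain, and a new output directory, the script runs the crawl and then prints its statistics; exporting `BIXO_CMD=echo` first shows the commands without executing them.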
See Building Bixo for details on how to build Bixo from source.
Running locally in Eclipse
- Follow the Building Bixo steps for getting the source and creating an Eclipse project.
- Open the Run dialog for the SimpleCrawlTool class, and specify appropriate parameters for the file containing URLs to crawl, the directory to use for results, and the user-agent name. You’ll also need to set the JVM parameters to “-Xmx256m” (a 256 MB maximum heap) so that there’s enough memory to run the Hadoop jobs.
Running in Amazon’s EC2
See the detailed instructions on the Running Bixo in EC2 page.