Background
The Bixo project came about because two different companies needed the same thing – a web mining toolkit that could easily fit into an existing Cascading-based workflow.
In discussing various ways to solve this problem, it became clear that refactoring Nutch to work in this environment would be a painful and error-prone process. In addition, the known limitations of Nutch would still need to be worked around, while the resulting massive fork would have little to no chance of being rolled back into the main Nutch codebase.
So the shortest distance between the two points was a new, slimmed-down implementation that satisfied the following constraints:
- Used Cascading both to manage internal workflow and to integrate with external data sources and sinks (outputs).
- Supported only http and https protocols, at least initially.
- Efficiently yet politely crawled white lists covering a limited number of discrete domains.
- Was testable at multiple levels (unit, integration, simulated web crawl).
Powered By
The following is a partial list of companies using Bixo, along with any public details of use cases.
- Bebo – Help ensure the quality of their user experience.
- EMI Music – Extract music/artist popularity data from sources such as Facebook.
- ShareThis – Fetch and parse shared URLs to generate a searchable index, and mine a larger set of viewed web pages.
- Bixolabs – Bixo is a key component of their new EC2-based elastic web mining platform.
Acknowledgements
We’d like to thank EMI Music and ShareThis for sponsoring Bixo, and also Chris Wensel (author of Cascading, co-founder of Scale Unlimited) for extensive technical support.
Bixo also makes heavy use of a number of open source projects:
- Nutch – a great source for ideas and inspiration.
- HttpClient 4 – for all your HTTP protocol needs.
- Tika – a relatively new parser framework.
- Cascading – the key to efficient and reliable workflow.
- Hadoop – our foundation for distributed data processing.