Biodoop

Biodoop is a suite of tools for computational biology that focuses on the efficient, distributed implementation of the most computationally demanding and/or data-intensive tasks. It consists of a core component, which includes a set of general-purpose modules, plus a number of application-specific components.

Current applications focus on sequence alignment and manipulation of alignment records. Applications generally run on the Pydoop API for Hadoop and are built to scale well in both the number of computing nodes available and the amount of data to process, making them particularly well suited for processing large data sets.

Core

Currently, Biodoop’s core contains a few modules for handling FASTA streams, wrappers for BLAST, I/O modules for some bio formats, a module for converting sequences to the nib format and protobuf serializers for several objects.
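As an illustration of two of the tasks mentioned above, the following is a minimal, standalone sketch of reading a FASTA stream and packing a sequence into the nib layout. It does not use or reproduce Biodoop's own modules: the names `read_fasta` and `to_nib` are hypothetical, the nib layout follows the publicly documented UCSC description (a 32-bit signature, a 32-bit base count, then two bases per byte), and little-endian byte order is an assumption made for the example.

```python
import io
import struct

# Values from the public UCSC nib description; not taken from Biodoop.
NIB_SIGNATURE = 0x6BE93D3A
NIB_CODES = {"T": 0, "C": 1, "A": 2, "G": 3, "N": 4}
NIB_MASK_BIT = 8  # set on a nibble when the base is masked (lowercase)


def read_fasta(stream):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_nib(seq):
    """Return the nib encoding of seq as bytes (little-endian header assumed)."""
    codes = []
    for base in seq:
        # Unknown symbols are stored as N in this sketch.
        code = NIB_CODES.get(base.upper(), NIB_CODES["N"])
        if base.islower():
            code |= NIB_MASK_BIT
        codes.append(code)
    if len(codes) % 2:
        codes.append(0)  # pad the final byte; the base count excludes padding
    packed = bytes(
        (codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2)
    )
    return struct.pack("<II", NIB_SIGNATURE, len(seq)) + packed


if __name__ == "__main__":
    fasta = io.StringIO(">seq1 example\nACGT\nacgN\n>seq2\nTT\n")
    for name, seq in read_fasta(fasta):
        print(name, len(seq), to_nib(seq).hex())
```

Because `read_fasta` is a generator, it handles one record at a time and never loads the whole stream into memory, which matters when the input is large.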

Release Notes

Release 0.2.0:

  • added genotyping-related sub-package “gt” (for now it contains only protobuf serialization code)
  • added “io” sub-package with readers/writers for some bio file formats
  • added “messages” sub-package with general-purpose protobuf modules
  • added “seq/align” sub-package with tools for reading SAM files
  • added a module for converting sequences to the nib format

Installation

  1. install prerequisites

  2. get biodoop-core from the download page

  3. unpack the biodoop-core tarball

  4. build the protobuf code in bl/core/messages and bl/core/gt/messages

  5. move to the distribution’s root directory and run:

    python setup.py install

    for a system-wide installation, or:

    python setup.py install --user

    for a local installation
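
After installing, one quick sanity check is to ask Python whether the package can be found on the import path. The sketch below uses only the standard library. The package name `bl.core` is an assumption inferred from the bl/core/... paths in the protobuf build step, and `is_installed` is a hypothetical helper, not part of Biodoop.

```python
import importlib.util


def is_installed(package_name):
    """Return True if package_name can be located on the import path."""
    try:
        return importlib.util.find_spec(package_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "bl") is itself missing.
        return False


if __name__ == "__main__":
    # "bl.core" is an assumed package name, see the note above.
    print("bl.core importable:", is_installed("bl.core"))
```

For a dotted name, `find_spec` imports the parent package in order to locate the submodule, but it does not import the submodule itself.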

BLAST

The BLAST package provides a wrapper-based MapReduce implementation of BLAST for Hadoop. See the Biodoop-BLAST documentation for details.