Quick Introduction to BioPython

Create a Word document (biopython.docx) containing responses to the questions below and check it into your section of the class repository.


BioPython is an open source project

BioPython is a community-driven, open source project. Developers from all over the world have contributed code to BioPython, and many of them are graduate students. Take a moment to review the BioPython Web site.

QUESTION ONE. What is the latest release of Biopython?



BioPython, along with BioPerl, BioJava, BioRuby, etc., is one of several open source bioinformatics programming projects called the Bio-* (pronounced Bio-Star) projects, the first of which was BioPerl.

The BioPython project uses Git (not Subversion) for source code version management and hosts its code on GitHub, a Web site that displays the source code in a browsable format.

Take a moment to browse the source code on GitHub by following this link:

http://github.com/biopython/biopython.

QUESTION TWO. Click the link for the Bio module. Take a look at the listing of modules. Which ones might be useful for working with alignments? Which ones might be useful for working with the output from BLAST?





Sequences & Parsers

BioPython contains a number of parser objects which read data in a variety of formats and return objects representing the types of data found in the files, such as sequences, alignments, etc.

BioPython parsers are iterators, objects that have interfaces similar to lists but do not (typically) contain all the list data at any given time. This allows them to handle one record at a time without reading entire files into memory, which is very useful when dealing with large data files commonly encountered in bioinformatics programming.
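To make the iterator idea concrete, here is a minimal sketch in plain Python. It is not BioPython's implementation; the helper iter_fasta and the two-record sample data are made up for illustration. It shows how a generator can hand back one FASTA record at a time instead of building the whole list up front.

```python
from io import StringIO


def iter_fasta(handle):
    """Yield (header, sequence) tuples one record at a time.

    Hypothetical helper for illustration only; not part of BioPython.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header means the previous record is complete
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record once the input is exhausted
    if header is not None:
        yield header, "".join(chunks)


# Made-up FASTA data held in memory, standing in for a file on disk
fasta_text = ">seq1 first example\nACGT\nTTGA\n>seq2 second example\nGGCC\n"

records = iter_fasta(StringIO(fasta_text))
first = next(records)      # pulls just the first record
remaining = list(records)  # consumes whatever is left in the iterator
```

Once the records have been consumed, the iterator is empty, so a second pass requires opening the file and creating the iterator again. That is the main practical difference from a list.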

Create a module called biopython.py and add it to your section of the Subversion repository. Next, download the file ls_orchid.fasta, a FASTA-format file containing sequences from lady slipper orchids, beginning with one from the species C. irapeanum.

Take a look at the function below and add it to your new module.

 
def readseqs(fn="ls_orchid.fasta"):
    """
    Function: Reads a fasta format file and returns
              sequence record objects
    Returns : a list of Bio.SeqRecord.SeqRecord objects
    Args    : fn - fasta file to parse
    """
    from Bio import SeqIO
    # The with block closes the file even if parsing raises an error
    with open(fn) as fh:
        # SeqIO.parse yields one SeqRecord at a time; list() collects them all
        return list(SeqIO.parse(fh, "fasta"))

The SeqIO module, like the other parsers in BioPython, contains a function "parse" that does essentially the same thing as all other parsers: it reads a file record by record, converting each record into an object you can use in downstream computations. BioPython parsers typically require an additional argument that indicates the format of the file you are trying to parse, such as "fasta" in the function above.

QUESTION THREE. Use the function above to create a list of sequences. Each item in the list is a Bio.SeqRecord.SeqRecord object. Examine one of these. What methods and instance variables does it contain?





When exploring a new Python package or library, it's useful to have an easy way to find out the class affiliation for unfamiliar but potentially useful objects. (You've done this many times in class already by simply typing variable names into the Python interpreter.)

However, BioPython objects often override the base methods that govern Python's behavior when you type variable names into the interpreter, which can make it hard to discover what an object is and what it can do. When working with BioPython objects, you can use the __class__ instance variable to identify class affiliation for objects. For example:

 
>>> rec
SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC',
 SingleLetterAlphabet()), id='gi|2765658|emb|Z78533.1|CIZ78533',
name='gi|2765658|emb|Z78533.1|CIZ78533',
description='gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA',
dbxrefs=[])
>>> rec.__class__
<class Bio.SeqRecord.SeqRecord at 0x33ccc0>
>>> seq = rec.seq
>>> seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
>>> seq.__class__
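
The same technique works on any Python object whose printed form hides its class. The sketch below uses a made-up class, Masked, rather than a BioPython object, and also shows the built-in functions type() and dir(), which are standard Python alternatives for inspecting an object.

```python
class Masked:
    """Hypothetical class whose __repr__ hides the class name."""

    def __init__(self, data):
        self.data = data

    def __repr__(self):
        # The interpreter echoes this string when you type the variable name
        return self.data


obj = Masked("ACGT")

shown = repr(obj)                      # what the interpreter would display
class_name = obj.__class__.__name__    # the class name, recovered via __class__
same = type(obj) is obj.__class__      # type() reports the same class

# dir() lists attribute and method names; filter out the underscore-prefixed ones
public = [name for name in dir(obj) if not name.startswith("_")]
```

Here shown gives no hint about the class, while class_name and type(obj) identify it directly, and public lists the names you can explore further.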

QUESTION FOUR. What class is seq?





QUESTION FIVE. How do you access the sequence bases as a simple Python string from a SeqRecord object?



Read Chapter Eight in the Biopython Tutorial and Cookbook, the section describing how you can use BioPython to access eUtils services.

Pay particular attention to the section on eInfo. Use what you learn to find out how many records are in the PubMed database.





QUESTION SIX. Write a function getRecords that retrieves a list of all PubMed ids for articles with abstracts that contain at least one instance of each word in a given list. How many PubMed article abstracts contain the words "glucosinolate" and "broccoli"?





Congratulations - You're done!