Do you like this page?

Views of this page

Views of the site

Please help share my site. Thank you!

Parsing DNA and protein sequences with Biopython - Easy Microbial Genomics

Parsing DNA and protein sequences with Biopython

Last updated: April 28, 2022

If you have finished learning the Python basics section, you are now ready to solve problems using Python with some advanced skills. In this section, we are going to learn the application of Python in handling sequence data. We will mainly rely on the Python package 'biopython'. I will demonstrate some example tasks to better show how 'biopython' works.

Read and write FASTA files with SeqIO

Below you can see how a FASTA file looks like.

>[sequence_id_1] [description]
>[sequence_id_2] [description]

If you parse the format by yourself every time when you read a file, it is going to waste a lot of your time. We can use the module SeqIO of 'biopython' to parse a FASTA file. Install 'biopython' first by 'pip install biopython' first if you haven't done it. Save the sequence above in a text and name it 'test.fasta'. Then, run iPython.

from Bio import SeqIO
parse = SeqIO.parse('test.fasta', 'fasta')

If you check the data type of parse (run 'type(parse)'), you will see it is a Bio.SeqIO.FastaIO.FastaIterator object. It is defined by the module SeqIO and by the name we know it should be an iterable. You can use 'for' loop to iterate all the elements of an iterable, or use 'next' to get the next element.

GCTCACGATCGTACTACATCTACTAGCACTATCT'), id='[sequence_id_1]', name='[sequence_id_1]', description='[sequence_id_1] [description]', dbxrefs=[])

We see that the first element is a sequence record object. A sequence record object, like other object in Python, has multiple attributes and methods. Here is a trick to check what attributes and methods does an object have. Type the name of the object with a dot following and then press the tab key. See it?

rec = next(parse)
> annotations features letter_annotations reverse_complement() upper() dbxrefs format() lower() seq description id name translate()

You can use this trick to most of the objects you will see in future to explore what you can do with the objects.

We see that there are some attributes and methods (followed by a set of parentheses) for this object. You can try some of them.
> '[sequence_id_2]'

Note that now rec is the second sequence record because we have called 'next' once before.


The rec.seq is a Bio.Seq.Seq sequence object in 'biopython'. It also has attributes and methods.

You can make changes to the objects = '[sequence_id_3]'

Now, write the sequence record to a file.

SeqIO.write(rec, '1.fasta', 'fasta')

You can open a file handle first if you want to write multiple records.

outfile = open('1.fasta', 'w')
for rec in SeqIO.parse('test.fasta', 'fasta'):
    SeqIO.write(rec, outfile, 'fasta')

So far, you have known how to parse a FASTA file and check the attributes and methods of the sequence records.

Homework: read a FASTA file, get the reverse complement format of all the sequences and write to a new file.

Practice: sequence length distribution of FASTA file (e.g. assembly)

If you finish an assembly of a bacterial genomic sequence data, you may want to check the length distribution of the contigs. You can use SeqIO to parse the FASTA file and make a histogram of the lengths.

Download a test assembly data here. Upload the file to your Linux or MacOS working directory. Alternatively, you can copy the link and run 'wget [the link]' to download it directly.

First, let's write an argument parser to receive the arguments from command line. Refer to here to learn about command line arguments in Python if you have not done it yet.

import argparse
from Bio import SeqIO
import matplotlib.pyplot as plt

parser=argparse.ArgumentParser(description='This program reads a FASTA file and make a histogram of sequence leng
parser.add_argument('-i', '--input', dest='inp', required=True, help='Input fasta file')
parser.add_argument('-o', '--output', dest='out', required=False, default='length_histogrm.pdf', help='Output file of
sequence length histogrm [Default: length_histogram.pdf]')
parser.add_argument('-b', '--bin', dest='bin', required=False, default=10, type=int, choices=range(10,51,10), help='Numb
er of bins of the histogram [Default: 10]')

These codes enabled the program to read arguments of input file, output file and bin number from the command line.

# read the input file and get the lengths
parse = SeqIO.parse(args.inp, 'fasta')
dict1 = SeqIO.to_dict(parse)
length = [len(dict1[i].seq) for i in dict1]

These codes read the FASTA input file and put the sequence IDs and sequence records in a dictionary using the function SeqIO.to_dict(). Then the lengths of all the sequence records were calculated and stored in a list.

# make histogram
plt.hist(length, bins=args.bin)
plt.xlabel('Sequence length')
plt.ylabel('Number of contigs')
plt.title('Sequence length distribution')

These codes made a histogram using the function hist from the library matplotlib.pyplot. The plot was then saved as a file.

Now save these codes in a file named, or you can use wget to download it directly with this link.

Run this command

python -i assembly.fna -o histogram.pdf -b 10

and here is the result histogram

Okay. Now you can parse a FASTA file and make a length distribution plot. Quite simple, isn't it?

To be continued