How I Analyze DNA with Biopython
bioinformatics·@cristi·
<center>https://s15.postimg.org/no7ybfe7f/How_I_Analyze_DNA_with_Biopython___1.png</center>

With its highly readable syntax and uncluttered visual layout, Python is one of the most versatile and widely adopted programming languages. Many programmers are willing to overlook that Python is slower than lower-level languages in exchange for its modularity and the ability to build and scale things faster. Plus, the community is nothing short of amazing.

Speaking of modularity, in Python there's a module for almost everything. Whether you want to do complex math, statistics, machine learning, text processing and text recognition, penetration testing, or even bioinformatics, there's at least one module you can work with.

In fact, my interest in bioinformatics is what got me through the struggle of learning to program. I tried to teach myself coding several times and failed every single time. It was only earlier this year that I started to get off the ground, and it was because of my active interest in genomics and bioinformatics. This is how I discovered _Biopython_, a Python module. With something specific to practice on and play with every day, I started to understand both worlds better.

In this post I will show you some basic things you can do with the Biopython module; if you're interested, you can continue learning on your own from here. In future posts I will discuss other interesting and helpful Python modules.

___

## Playing with DNA - The Biopython Way

<center>https://s15.postimg.org/3uvuiq0tn/How_I_Analyze_DNA_with_Biopython___2.jpg</center>

While there are several modules geared towards bioinformatics, the one that I learned and practice with is _Biopython_. It was my best choice because I found several books discussing it at length, one of them being the free _Biopython Cookbook_ that you can read [here](http://biopython.org/DIST/docs/tutorial/Tutorial.html). It takes a hands-on approach and is about 400 pages long. The following examples are inspired by the book.
Before getting into the nitty-gritty, I'm going to make a couple of assumptions:

- you have some knowledge of biology, especially genomics
- you have some knowledge of computation and programming languages

I am not going to explain genomics and DNA sequences, what Python is, or how to install it, because the post would become unbearably long. There are countless tutorials, lectures, and videos freely available for that purpose. Here I'm trying to be very specific:

- how you can use Biopython in computational biology.

According to the [Cookbook](http://biopython.org/DIST/docs/tutorial/Tutorial.html):

_"The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology."_

To install this module, go to the download page [here](http://biopython.org/wiki/Download). Make sure to install the dependencies before you install Biopython. I use Biopython on Windows, but you can use it on Mac or Linux as well. On Windows, for easier installation, you could get the 'wheels' from [here](http://www.lfd.uci.edu/~gohlke/pythonlibs/) (dependencies and latest Biopython) and install them via pip.

___

Now I'll show you some things you can do with it.

### 1.
Instead of making up a DNA sequence, I'll take a real genetic sequence from the National Center for Biotechnology Information (NCBI):

- go to the NCBI [homepage](https://www.ncbi.nlm.nih.gov/)
- search for ['tiger'](https://www.ncbi.nlm.nih.gov/nuccore/?term=tiger) (lack of inspiration :) )
- use the sequence: [Panthera tigris voucher IRSNB-908B cytochrome c oxidase subunit I (COX1) gene, partial cds; mitochondrial](https://www.ncbi.nlm.nih.gov/nuccore/KR349754.1)

<center>https://s15.postimg.org/dtgt57a97/How_I_Analyze_DNA_with_Biopython___3.jpg</center>

- I'll click on [Fasta](https://www.ncbi.nlm.nih.gov/nuccore/831359061?report=fasta)
- copy the sequence:

CTGATTGGCCACTCTTCACGGGGGTAATATTAAATGGTCTCCCGCTATACTATGGGCTTTGGGATTCATTTTCCTATTCACCG
TAGGGGGCTTAACAGGAATTGTATTAGCAAACTCCTCATTGGATATTGTCCTTCACGACACATACTACGTAGTAGCCCACTTC
CACTACGTCTA

- alternatively, I could save it as a Fasta or Genbank file and work with it directly. Biopython can read both formats.

### 2. In Python
- importing Biopython's Seq class and the IUPAC alphabets
- inputting our sequence:

```python
>>> from Bio.Seq import Seq
>>> # Note: Bio.Alphabet was removed in Biopython 1.78. On newer versions,
>>> # skip this import and drop the second argument to Seq().
>>> from Bio.Alphabet import IUPAC
>>> myseq = Seq('CTGATTGGCCACTCTTCACGGGGGTAATATTAAATGGTCTCCCGCTATACTATGGGCTTTGGGATTCATTTTCCTATTCACCGTAGGGGGCTTAACAGGAATTGTATTAGCAAACTCCTCATTGGATATTGTCCTTCACGACACATACTACGTAGTAGCCCACTTCCACTACGTCTA', IUPAC.unambiguous_dna)
```

### 3. Let's find the complement sequence of myseq:

```python
>>> myseq.complement()
```

<center>https://s15.postimg.org/xphdyhgob/How_I_Analyze_DNA_with_Biopython___4.png</center>

You can see that the output is truncated (not shown in its entirety). You could use the print function to see the full sequence:

```python
>>> print(myseq.complement())
```

- similarly, you can find the reverse complement with ```myseq.reverse_complement()```

### 4. Assuming that myseq is a coding sequence, let's transcribe it to messenger RNA.
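Before doing this with Biopython, here is a minimal plain-Python sketch of what transcribing a coding strand amounts to. It does not use Biopython at all: the helper names `reverse_complement` and `transcribe` below are my own illustrative functions (not Biopython's API), and I only use the first 21 bases of the tiger sequence to keep it short. The sketch assumes an unambiguous, uppercase DNA string.

```python
# Illustrative helpers only; these are NOT Biopython functions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an unambiguous, uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(coding_dna: str) -> str:
    """Turn a coding-strand DNA string into mRNA by replacing T with U."""
    return coding_dna.replace("T", "U")


# First 21 bases of the tiger COX1 sequence used in this post.
coding = "CTGATTGGCCACTCTTCACGG"

# The template strand is the reverse complement of the coding strand.
template = reverse_complement(coding)

# Route 1: the shortcut, straight from the coding strand.
mrna_shortcut = transcribe(coding)

# Route 2: start from the template strand, reverse complement it, swap T for U.
mrna_from_template = reverse_complement(template).replace("T", "U")

print(mrna_shortcut)
print(mrna_shortcut == mrna_from_template)
```

Both routes give the same mRNA, which is why working directly on the coding strand is a safe shortcut.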
According to the [cookbook](http://biopython.org/DIST/docs/tutorial/Tutorial.html):

_"The actual biological transcription process works from the template strand, doing a reverse complement (TCAG → CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T → U."_

```python
>>> mrna = myseq.transcribe()
>>> mrna
```

<center>https://s15.postimg.org/py0nzxciz/How_I_Analyze_DNA_with_Biopython___5.png</center>

### 5. Let's translate the messenger RNA to protein:

```python
>>> mrna.translate()
```

<center>https://s15.postimg.org/3ncszyf8r/How_I_Analyze_DNA_with_Biopython___6.png</center>

### 6. Parsing full records
- search the Entrez databases (like Pubmed, Genbank, Nucleotide, etc.) with ```Bio.Entrez.esearch()```:

```python
>>> from Bio import Entrez
>>> # Tell NCBI who you are; replace this with your own address.
>>> Entrez.email = "cristi@cristi.com"
>>> handle = Entrez.esearch(db="pubmed", term="human monoclonal antibodies")
>>> record = Entrez.read(handle)
>>> record["IdList"]
```

And you get a list of Pubmed IDs of articles related to your search:

<center>https://s15.postimg.org/bhdelcn1n/How_I_Analyze_DNA_with_Biopython___7.png</center>

- you could retrieve the full entry for an ID with ```efetch```:

```python
>>> handle = Entrez.efetch(db="nucleotide", id="186972442", rettype="gb", retmode="text")
>>> print(handle.read())
```

<center>https://s15.postimg.org/muzxwjxkb/How_I_Analyze_DNA_with_Biopython___8.png</center>

Here I retrieved the ID ```186972442``` from the nucleotide database as a Genbank file (```rettype = 'gb'``` and ```retmode = 'text'```).

___

## What else can you do with Biopython?

- find the lineage of an organism
- run BLAST (basic local alignment search tool) with ```Bio.Blast```
- work with Swiss-Prot and ExPASy
- work with 3D protein structures with the PDB module
- population genetics
- phylogenetics and sequence motif analysis
- machine learning (!!!)
- and many other useful operations for bioinformaticians and computational biologists.

For this tutorial I used the Cookbook version from December 2015. The latest (which I linked above and [here - the pdf](http://biopython.org/DIST/docs/tutorial/Tutorial.pdf)) is from August 25, 2016, so it's very recent.

___

## Ending Thoughts

If you want to follow the same learning curve as I did (genomics + programming), here are a few free resources to get you started. These are the ones that I used:

- **Introduction to Biostatistics and Bioinformatics from NYU**
  Course Materials: [here](http://fenyolab.org/ibb2015/)
  My favorite lecture: [IBB 4: Biopython](https://www.youtube.com/watch?v=l8wLaoEGbUI)
- **Dr. Martin Jones' books:**
  [Python for Biologists](https://www.amazon.com/dp/1492346136/)
  [Advanced Python for Biologists](https://www.amazon.com/dp/1495244377/)
- **Via, Rother, and Tramontano's book:**
  [Managing your Biological Data with Python](https://www.amazon.com/dp/143988093X/)
- **there are a few other books that I wish to go through next (on my reading list), but I'm only going to mention one:**
  [Haddock and Dunn - Practical Computing for Biologists](https://www.amazon.com/dp/0878933913)

Of course, you can watch many other free lectures on Youtube and go through course materials from top universities. The only thing that matters, in my opinion, is to actually do the work. Start with one book or one course and get your hands dirty. Practice deliberately and do it consistently.

___

### <center>To stay in touch, follow @cristi</center>

#bioinformatics #programming #genomics #research

Credits for Images: [Dna Strands via Wikimedia Commons](https://commons.wikimedia.org/wiki/File:DNA_strands.png) and [Biopython Logo](http://biopython.org).

___

[Cristi Vlad](http://cristivlad.com), Self-Experimenter and Author