Saturday, July 9, 2016

Scientific Computing Resources For the Biologist

In scientific computing there are typically two types of computing referenced:
  1. High performance computing (HPC)
  2. High throughput computing (HTC)
Both are meant for working with "big data". Which of these is more important for your research depends on the how parallel the code being executed on these large datasets is. Additionally there are also cases where a hybrid is between HPC and HTC is best. The problem you will face is typically when to use HPC, HTC or a combination. Knowing which computing cluster is best for your research will allow you to choose resources to apply for.

High Performance Computing (HPC)

High performance computers (Supercomputers) can run code at or near its optimal rate. In other words HPC is capable of running code that normally takes days to running it in matter of seconds.

An example would be using a 15 base pair sliding window to call differentially methylated regions between two samples. Because this dataset contains information (methylated or not methylated) for every cytosine, each sample datasets is around 100 MB (in this example). Also the sliding window approach does not allow us to easily split up the data and run parallel tasks (the most we can split is by Chromosome). Also I (A Biologist by training) wrote this code and know very little about optimizing my algorithms. So I can wait for my slow code to run on a small server (where I may run into memory swapping issues) or use brute force to run this code on a HPC that can handle the work load. In summary, HPC servers are best used when you have a large (memory and time) task to run that can not be optimized further.

HPC is facilitated through access to large computing power containing thousands of processors in close proximity working together (cluster) with a large shared memory capacity and large working memory distributed across many processors. Because of the large computational capacity (memory and processors) needed for HPC they are expensive and usually shared with many different researcher groups. Some examples of these may include your university's cluster or resources like XSEDE. XSEDE is a powerful resource because it gives researchers access to a variety of clusters meant to individual computing needs.  One of the advantages and disadvantage to sharing resources  and using a cluster is that you are not granted full control of the system. You therefore do not have sudo rights and must locally install everything. Many clusters have pre-installed programs for commonly used programs like BLAST but they may not be the versions you are looking for.  XSEDE's pre-installed software can be found here.

Amazon and Microsoft are dramatically cutting the cost of using large computing resources that are not shared and can easily be personalized (The advantages of these are explained more here).

High Throughput Computing (HTC)

High throughput computing also takes advantage of many processors but rather than all processors working together on one task they work separately on many tasks. This type of computing is best for code that that runs the same task repeatedly (can run many tasks in parallel).

An example of this is code that needs to run 1 million blast queries against the same reference genome. Each individual blast command is small and can be ran with little memory. But running all  one million blasts is a huge task that would take days to run on one processor. HTC allows each task to be ran on a separate processor, breaking a large file of 1 million queries into 1 million separate fasta files that are ran simultaneously. Running each task separately not only cuts down the memory needed on each processor but also dramatically speeds the process.

The open science grid (OSG) is an example of a HTC resource. OSG is a collection of computers connected throughout the US in a grid. This network is extremely large and therefore powerful (put number of computers here). The grid is composed of a variety of machines donating time to the grid. These machines range from personal computers to university clusters. Because the OSG has to deal with a large variety of software and machines they developed a program (HTCondor) for matching idle computers with tasks best suited for them. HTCondor carries out this matching process using user defined system requirements to find computers on the grid that are available, have the system requirements requested (memory, unix) and have requested software installed (blast or matlab). DAGman also used by OSG allows for optimization in submitting jobs to determining the series of functions carried out by which node and when.  

Using both/either HPC and HTC 

Amazon and Microsoft also provide computational resources to researchers at a low cost (or through educational grants). These resources were developed for the general public and therefore are much more flexible. Here you rent a computer of the desired type and size for a desired period of time. Both XSEDE and OSG require using their HPC or HTC service for the length of the grant (1 year access to machines) where Amazon and Microsoft charge the grant by the hour of machines and number of machines used. This allows you to use a system with both HTC and HPC characteristics depending on the number of machines and size of the machines rented out at any given time. Of course renting out many large machines and/or cost much more then renting out a single large machine or multiple small machines. 

An advantage to paying hourly for computational time is that resources can easily be man available to users when needed because everyone is held accountable for time spent on them. Both XSEDE and OSG allow users unlimited usage to users. This often leads to users waiting for requested resources to become available. When resources are in high depend OSG uses a fair use policy preventing users from hogging a resources. 

I hope this article help clarified the differences to help you decide when to use one resource over the other or if you need to use both in combination.

Access to resources 

  • XSEDE (Mostly HPC but now has access to the OSG)
    1. Who can apply:
      • Posdocs and Professors
      • Graduate Students with NSF-GRFP 
    2. How do you apply:
      • First year requires an abstract and information on computational needs
      • After a year a 2-page grant is required for access to XSEDE

  • OSG (HTC)
    1. Who can apply:
      • Any researcher
    2. How do you apply:
      • Researchers can apply to OSG user school where they receive training plus 1-year free on the grid after that they need to join an organized group on OSG.
      • This application is a few pages on how the OSG will help facilitate your research needs.

  • Amazon & Microsoft (HPC, HTC, MTC)
    1. Who can apply:
      • Students and Postdocs at research universities
    2. How do you apply:
      • Submit an abstract and information on computational needs 

Sunday, March 13, 2016

Using AWS for Research

What is Amazon Web Services (AWS)? 

AWS is a cloud computing service provided by Amazon. Amazon EC2 is the service we'll go over in this post. It allows users to launch their own virtual instances from a variety of operating systems. Amazon provides these computational resources to the public and private sector at an extremely low price. I have listed below both the advantages and disadvantages of using cloud computing resources. For me the advantages definitely outweigh the disadvantages :) AWS is generally much more flexible than your universities local cluster in that you can choose the size, type and number of machines and have the rights to download and run whenever needed.


  1. None of the maintenance of hosting your own server 
  2. Access to high computing machines 
  3. Access to high parallel computing machines (like those running hadoop) 
  4. The cost is extremely cheap 
  5. Sudo user rights (can install whatever you want without asking your sysadmin) 
  6. The computers available to you come in a variety shapes and sizes


  1. There might not be enough machines for everyone and their maybe a wait time (This has yet happen) 
  2. It cost money (but amazon provides educational grants) 
  3. AWS is a blank slate therefore you need to install everything and copy files over (They make this easy with thing like s3 for storing files and allow you to create images of an instance so you can relaunch one with software pre-installed) 

 AWS Getting Started 

Using EC2 is easy once you've launched your instance you can ssh using your ssh keys:
 If your machine is ubuntu then  ssh -i location_to_pem_file.pem All inputs and outputs should be saved in your /mnt/ directory (this is where all of the system storage is).

By default you are not the owner of this directory so you need to change the permissions.

cd /mnt/
sudo chown ubuntu:ubuntu .
mkdir data

 If you need to mount more storage on your EC2 machine: 

  1.  Go to AWS Console  
  2. Create an EBS volume of the size needed 
  3. Attach this volume to your running EC2 machine 
  4. Then mount the volume to your EC2 machine using the following commands 

sudo mkfs -t ext4 /dev/xvdf 
sudo mount /dev/xvdf /mnt/data 
# /dev/xvdf might be diff depending on where volume was added 
cd /mnt/data/ 
# mount while in data dir 
sudo chown ubuntu:ubuntu . 
# df to confirm that your volume was successfully added

 All data can be easily on saved on S3 and instances or machines with preinstalled packages can be saved and reused as images

Sunday, January 18, 2015

Change node and edge colors in Cytoscape with Python

Often with Cytoscape we are interested in overlaying other information such as expression data. In order to easily do this without changing the layout/format of the original file it requires a bit of programming. Here I will explain how this can be easily done in python and have provided some of my own code (although it is still a bit messy).

First export your Cytoscape file to an XML file. This can be done in Cytoscape under file> export> network. XML is a superset of HTML, this file contains the positional information for each node in your network along with the edges, color and shape of each edge and node and any other graphical information you can think of.

With the Cytoscape file in XML format we can then parse this file in python using Beautiful Soup!
Beautiful Soup  is an XML parser written for python. First import Beautiful Soup and tell it the format.

 from bs4 import BeautifulSoup  
 soup = BeautifulSoup(xgmml)  

You can then select all the nodes or edges from you XML file using soup.find_all

edges = soup.find_all("edge")
print edges[0]

<edge cy:directed="1" id="412" label="MYB46 (PD) C4H" source="145" target="123">
<att name="interaction" type="string" value="PD"></att>
<att name="shared name" type="string" value="MYB46 (PD) C4H"></att>
<att name="selected" type="boolean" value="0"></att>
<att name="name" type="string" value="MYB46 (PD) C4H"></att>
<att name="shared interaction" type="string" value="PD"></att>

And Search for nodes or edges that match using with by getting the nodes label

edge = edge[0]
print edge.get("label")

MYB46 (PD) C4H

You can then change the color of a node for example by grabbing the graphics tag and

graphics_tag =
graphics_tag["fill"] = "#eb2200"

The graphics tag also contain other information inducing arrow types and others shown below:

<graphics fill="#999999" width="2.0">
<att name="EDGE_TARGET_ARROW_SHAPE" type="string" value="NONE"></att>
<att name="EDGE_LABEL_FONT_SIZE" type="string" value="10"></att>
<att name="EDGE_LABEL_COLOR" type="string" value="#000000"></att>
<att name="EDGE_LABEL_TRANSPARENCY" type="string" value="255"></att>
<att name="EDGE_SELECTED" type="string" value="false"></att>
<att name="EDGE_TOOLTIP" type="string" value=""></att>
<att name="EDGE_SOURCE_ARROW_UNSELECTED_PAINT" type="string" value="#000000"></att>
<att name="EDGE_TARGET_ARROW_UNSELECTED_PAINT" type="string" value="#999999"></att>
<att name="EDGE_LABEL_FONT_FACE" type="string" value="Dialog,plain,10"></att>
<att name="EDGE_LINE_TYPE" type="string" value="SOLID"></att>
<att name="EDGE_VISIBLE" type="string" value="true"></att>
<att name="EDGE_BEND" type="string" value=""></att>
<att name="EDGE_SOURCE_ARROW_SELECTED_PAINT" type="string" value="#ffff00"></att>
<att name="EDGE_CURVED" type="string" value="true"></att>
<att name="EDGE_SOURCE_ARROW_SHAPE" type="string" value="NONE"></att>
<att name="EDGE_TARGET_ARROW_SELECTED_PAINT" type="string" value="#ffff00"></att>
<att name="EDGE_STROKE_SELECTED_PAINT" type="string" value="#ff0000"></att>
<att name="EDGE_TRANSPARENCY" type="string" value="255"></att>
<att name="EDGE_LABEL" type="string" value=""></att>

I have used this code for many things and made if available on github here. I used this code to color feed forward loops and stress expression response in our Xylem Secondary Cell wall Network recently published in Nature.

Here is our Xylem Secondary Cell wall Network before changes:

The feed forward loops in our network are highlighted in Red here:

Here is the same network with salt stress microarray data overlaid where green shows positive correlation between transcription factors/ nodes and red negative.

An Arabidopsis gene regulatory network for secondary cell wall synthesis M Taylor-Teeples, L Lin, M de Lucas, G Turco, TW Toal… - Nature, 2014

Thursday, October 9, 2014

Get 23andme health information from 23andme data

One of the best things about 23andme is its transparency!! In addition to its awesome api that allows users to build tools to analyze their genetic data, the 23andme website also contains an exploratory mode which allows users to search for specific single nucleotide polymorphisms (SNPs).   SNPs associated with 23andme health reports in addition to other traits are available on SNPedia. Some of these traits include SNPs for left handedness or even SNPs for empathy. To get this information simply search for a SNP or trait of interest on SNPedia such as empathy: Rs53576. Then  login to your 23andme and go to

Here you can search for any SNP of interest like  Rs53576. My genotype at this SNP is GG.

If we go back to SNPedia in the left hand corner we can see that having GG at the position indicates I may have my optimistic and empathetic tendencies encoded in my genome  :)

Some of these SNPs arnt as well studied like the SNP for left handedness rs1446109-rs1007371-rs723524 here if you have something like AA-TT-GG you are likely to be left handed. Also at the top of the image above, if the orientation does not say "plus" but instead "minus" you need to use the reverse complement of the 23andme genotype to determine your genotype at that SNP. 

Happy Hacking :) 

Monday, January 20, 2014

And so it begins

As a second year PhD student in Biology at UC Davis, we are required to taking one last exam before we can move on to candidacy. I take mine sometime in the next 3-6 months!!! This test is about 3 hours in length half based on classes and half on your research project. As initiative to start prepping for my exam I am going to start blogging on topics I should know and need to know.  So here it goes #BlogingForMyQE (qualifying exam)

Saturday, July 20, 2013

The Five Obstacles You Will Face in Your First Year of Grad School

Officially finished my first year at UC Davis as a Biochemistry, Molecular, Cellular and Developmental Biology (BMCDB) graduate student. When I think back to the start of my first year I realize I was very anxious and confused. I had heard so many horror stories about graduate school and was really not sure what to expect. Hopefully this will help with advice on the graduate school environment.

 1) Finding the one (your advisor):

This one is stressful and tough, after all you will be working with this person for 4+ years. Even before deciding on a school make sure there are lots of options.  Many of my top rotation choices did not have funding or had just left the school or the lab just wasn't what I thought it would be. Also make sure to contact older graduate students in your program and get advice on different labs and what the work ethic is like. In hindsight rotating in the summer really made a huge difference. I not only gained an extra rotation but also could easily meet with professors before the chaos began. It is best when leaving a lab to let them know you are weighting your options but really enjoyed your time there. After rotations have finished you should discuss potential thesis projects with each lab before making your decision.  Just remember choose your rotations wisely, start early, and work hard :)

2) Pass your classes:

It is actually easier then it seems to pass as long as you complete all the assignments and actually try. This does not mean that the workload is not heavy! Juggling classes, assignments and rotations is enough to make you want to nap 24/7. It's all about focusing your energy in the right place. Keep in mind that grades do not matter, while where you will be spending your Ph.D does! In other words do not stress grades too much. Form study groups, try hard and you will be fine. Part of grad school is the realizing that there are a lot of subjects that you know very little about where others in your cohort maybe experts.  Within study groups show up with your work done, so you can not only defend you answer but admit to your mistakes and learn from others.

3) Get funding:

Start looking, writing and applying early, as in the summer before you start. I was forced to start my NSF application the summer before and was very grateful that I had. Once school starts there are so many other things to worry about. Also apply for everything even if you don't feel competitive enough. No matter how secure your funding is, many of us learned throughout rotations just how rare a professor with funding is. The best funding advice I could give is to double/triple check that all letter of rec writers will write outstanding letters and have proposals read by multiple people from varying departments and sell yourself!

4) Make Friends:

Last but not least you should be making friends in graduate school, these are your soon to be collegues and possibly people you will encounter in your future. These are people you will see everyday, who encounter the same struggles, be friends not enemies. The friendship part was very unexpected for me yet helped tremendously. You are going to be stressed and confused and have worries. While it's nice at the end to celebrate your hard work with these people they are also great support throughout. These people are smart and will be able to help you throughout your graduate career. After all you all were accepted into the same program and one thing I realized over time was that everyone in our cohort has their own thing they excel at at. Also it's amazing to finally have friends just as nerdy as you! Shout out to @annaphase


Wednesday, May 1, 2013

RNASeq Analysis: The Basics

Strange and Mysterious File Types (you might encounter)

Sequence Read Archive format (SRA): This is an NCBI specific file format used because of its ability to compress read sequence information. This is often the output of many illumina sequencing pipelines.

Fastq file: SRA files can be converted to Fastq files, these are similar to Fatsta files and contain a header, associated genomic sequence and a quality score for the sequence. This is often encoded in binary and needs to be read by quality control algorithms. Thus, Fastq files contain your raw sequence information.

BAM and SAM Files: These are your alignment files where SAM stands for sequence alignment file and BAM is the binary form of this. While BAM is unreadable by humans it is often used because it is more memory efficient and quicker for computer algorithms to read in. These are obtained by often by Bowtie or Tophat after your Fastq reads have been aligned to your reference genome

GTF/GFF: These are often used as reference for counting how many reads map to genomic regions. These are tab delimited files containing the , start, stop, chromosome, and strand information along with name of a genomic region (such as gene, transposon, mRNA).

RNASeq process in flow chart on top. Bottom indicated different file formats and types of analysis for each step.


If files are obtained from  NCBI’s Gene  Expression Omnibus (geo) then they are most likely in .SRA format and need to be converted to fastq using the fastaq-dump package.

Quality control:

Before processing the data reads must first be trimmed for adapter sequences and reads of bad quality should be filtered out. A great tool for easily viewing the distribution of  quality scores is FASTQC. I use trimreads to both filter low quality reads and trim adaptor sequences. For more detailed analysis others have used many components of the fastx toolkit.

Alignment to reference genome:

Your fastq file contains your RNA small sequence reads. These need to be aligned back to a reference genome. Reference genome: Model organism you are using, sequence containing coding sequence only, microRNA only or entire genome. BOWTIE excels both in speed and memory efficiency for aligning short sequence reads back to a long reference sequence. In order to obtain this efficiency the reference genome must first be indexed. This creates a database of keys (in this case they would be cds sequences) for each record and positions them into a memory efficient tree. An index is created simply by:
bowtie-build reference_genome name_for_refrence_genome
TopHat/Bowtie can then be used to align the fastq sequences to the index reads. TopHat is often preferred in RNAseq analysis because while it is build on top of Bowtie it is also able to identify splice junctions between exons.
Tophat --bowtie1 name_for_refrence_genome fastq_reads
Determining differentially expressed gene:

Once files have been aligned to a reference genome count data can be obtained.

Count Data: These are counts of RNASeq reads that align to a gene (or other genomic regions indicated by the GTF file). Therefore this tab-delimited file contains gene_ids and their corresponding RNA counts from your data.

To analyze for differential expression two programs are commonly used DeSeq or EdgeR. You can read up on comparisons of these two here:

Both programs assume that the number of reads in the sample can be modeled using a negative bionomial distribution.

Additional Resources: