Saturday, July 9, 2016

Scientific Computing Resources For the Biologist

In scientific computing there are typically two types of computing referenced:
  1. High performance computing (HPC)
  2. High throughput computing (HTC)
Both are meant for working with "big data". Which of these is more important for your research depends on how parallel the code being run on these large datasets is. There are also cases where a hybrid of HPC and HTC is best. The problem you will typically face is deciding when to use HPC, HTC or a combination. Knowing which kind of computing cluster is best for your research will help you choose which resources to apply for.


High Performance Computing (HPC)


High performance computers (supercomputers) can run code at or near its optimal rate. In other words, code that normally takes days to run can finish in minutes or even seconds on an HPC system.

An example would be using a 15 base pair sliding window to call differentially methylated regions between two samples. Because this dataset contains information (methylated or not methylated) for every cytosine, each sample's dataset is around 100 MB (in this example). The sliding window approach also does not allow us to easily split up the data and run parallel tasks (the most we can split is by chromosome). On top of that, I (a biologist by training) wrote this code and know very little about optimizing my algorithms. So I can either wait for my slow code to run on a small server (where I may run into memory swapping issues) or use brute force and run the code on an HPC system that can handle the workload. In summary, HPC servers are best used when you have a large (in memory and time) task that cannot be optimized further.
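As a rough sketch of the "split by chromosome" idea, the function below writes one file per chromosome so each can be processed as a separate task. The input layout (tab-separated chromosome, position, methylation status) and the script name call_dmrs.sh are my assumptions for illustration, not a standard format or tool.

```shell
#!/usr/bin/env bash
# Split a tab-separated methylation call file into one file per chromosome.
# Assumed (hypothetical) columns: chromosome, position, methylated (1) or not (0).
split_by_chrom() {
    local input="$1" outdir="$2"
    mkdir -p "$outdir"
    # Column 1 names the chromosome; each line is written to that chromosome's file.
    awk -F '\t' -v dir="$outdir" '{ print > (dir "/" $1 ".tsv") }' "$input"
}

# Example usage (call_dmrs.sh is a hypothetical per-chromosome analysis script):
#   split_by_chrom sample1_calls.tsv by_chrom
#   for f in by_chrom/*.tsv; do ./call_dmrs.sh "$f" & done; wait
```

Even split this way, each chromosome is still a big job, which is why this kind of analysis leans toward HPC rather than HTC.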

HPC relies on clusters: thousands of processors in close proximity working together, with a large shared memory capacity and large working memory distributed across many processors. Because of the computational capacity (memory and processors) HPC requires, these systems are expensive and are usually shared among many research groups. Examples include your university's cluster or resources like XSEDE. XSEDE is a powerful resource because it gives researchers access to a variety of clusters tailored to different computing needs. Sharing a cluster is both an advantage and a disadvantage: someone else maintains the system, but you are not granted full control of it. You therefore do not have sudo rights and must install everything locally in your own directory. Many clusters come with commonly used programs like BLAST pre-installed, but they may not be the versions you are looking for. XSEDE's pre-installed software can be found here.

Amazon and Microsoft are dramatically cutting the cost of using large computing resources that are not shared and can easily be personalized (The advantages of these are explained more here).


High Throughput Computing (HTC)


High throughput computing also takes advantage of many processors, but rather than all processors working together on one task, they work separately on many tasks. This type of computing is best for code that runs the same task repeatedly (so many tasks can run in parallel).


An example of this is code that needs to run 1 million BLAST queries against the same reference genome. Each individual BLAST command is small and can be run with little memory. But running all one million queries is a huge task that would take days on one processor. HTC allows each task to be run on a separate processor, breaking a large file of 1 million queries into 1 million separate FASTA files that can be run simultaneously. Running each task separately not only cuts down the memory needed on each processor but also dramatically speeds up the process.
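Breaking the query file apart might look like the sketch below, which writes each FASTA record to its own numbered file. The output names (query_0.fa, query_1.fa, ...) are my own choice for illustration; numbering starts at 0 only because many schedulers number their jobs from 0.

```shell
#!/usr/bin/env bash
# Split a multi-record FASTA file into one file per record.
split_fasta() {
    local input="$1" outdir="$2"
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        # A header line (">") starts a new record: close the previous file, open the next.
        /^>/ { if (out) close(out); out = sprintf("%s/query_%d.fa", dir, n++) }
        # Write every line of the current record; lines before the first header are skipped.
        out  { print > out }
    ' "$input"
}

# Example usage:
#   split_fasta all_queries.fa queries
```

In practice you may prefer a few hundred sequences per file rather than one, since very short jobs can spend more time being scheduled than running.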


The Open Science Grid (OSG) is an example of an HTC resource. OSG is a collection of computers throughout the US connected in a grid. This network is extremely large and therefore powerful. The grid is composed of a variety of machines donating time to the grid, ranging from personal computers to university clusters. Because the OSG has to deal with a large variety of software and machines, it uses HTCondor (a scheduler developed at the University of Wisconsin-Madison) to match idle computers with the tasks best suited to them. HTCondor carries out this matching using user-defined requirements to find computers on the grid that are available, meet the requested system requirements (memory, Unix) and have the requested software installed (BLAST or MATLAB). DAGMan, also used by OSG, manages workflows of dependent jobs, determining which steps are submitted and in what order.
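To give a feel for how those user-defined requirements are written, here is a minimal sketch that generates an HTCondor submit file for the BLAST example. The file and script names (run_blast.sh, queries/query_N.fa, logs/) and the resource amounts are assumptions for illustration; your reference database would also need to reach each machine, which this sketch leaves out.

```shell
#!/usr/bin/env bash
# Write an HTCondor submit file that queues one job per query file.
# run_blast.sh and the queries/ and logs/ directories are hypothetical.
write_submit_file() {
    local n_jobs="$1" outfile="$2"
    cat > "$outfile" <<EOF
# One job per query file; \$(Process) runs from 0 to n_jobs - 1.
universe                = vanilla
executable              = run_blast.sh
arguments               = query_\$(Process).fa
should_transfer_files   = YES
transfer_input_files    = queries/query_\$(Process).fa
when_to_transfer_output = ON_EXIT
# Resources each job asks for; HTCondor matches jobs to machines that can provide them.
request_cpus            = 1
request_memory          = 2GB
request_disk            = 1GB
requirements            = (OpSys == "LINUX")
output                  = logs/blast_\$(Process).out
error                   = logs/blast_\$(Process).err
log                     = logs/blast.log
queue $n_jobs
EOF
}

# Example usage:
#   mkdir -p logs
#   write_submit_file 1000 blast.sub
#   condor_submit blast.sub
```

Asking for only the memory and disk a job really needs makes it easier for HTCondor to find a matching machine.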



Using both/either HPC and HTC 


Amazon and Microsoft also provide computational resources to researchers at a low cost (or through educational grants). These resources were developed for the general public and are therefore much more flexible. Here you rent a computer of the desired type and size for a desired period of time. XSEDE and OSG grant access to their HPC or HTC service for the length of the grant (1 year of access to the machines), whereas Amazon and Microsoft charge the grant by the hour and by the number of machines used. This allows you to build a system with both HTC and HPC characteristics, depending on the number and size of the machines rented at any given time. Of course, renting many large machines costs much more than renting a single large machine or multiple small machines.
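Since the bill is simply machines times hours times the hourly rate, a quick estimate is easy to sketch. The rates below are made-up numbers for illustration, not actual Amazon or Microsoft prices.

```shell
#!/usr/bin/env bash
# Estimate the cost of renting machines by the hour.
# Arguments: number of machines, hours per machine, hourly rate in dollars (hypothetical).
estimate_cost() {
    local machines="$1" hours="$2" rate="$3"
    awk -v m="$machines" -v h="$hours" -v r="$rate" 'BEGIN { printf "%.2f\n", m * h * r }'
}

# Example usage, with an assumed rate of $0.10 per machine-hour:
#   estimate_cost 1 50 0.10    # one machine for 50 hours
#   estimate_cost 10 5 0.10    # ten machines for 5 hours each
```

Both calls come to the same total, so for work that splits cleanly (the HTC case) renting more machines for less time gets you results sooner at the same price.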

An advantage of paying hourly for computational time is that resources can easily be made available to users when needed, because everyone is held accountable for the time they spend on them. Both XSEDE and OSG allow users unlimited usage. This often leads to users waiting for requested resources to become available. When resources are in high demand, OSG uses a fair use policy to prevent users from hogging a resource.


I hope this article has helped clarify the differences so you can decide when to use one resource over the other, or whether you need to use both in combination.



Access to resources 



  • XSEDE (Mostly HPC but now has access to the OSG)
    1. Who can apply:
      • Postdocs and Professors
      • Graduate Students with NSF-GRFP 
    2. How do you apply:
      • First year requires an abstract and information on computational needs
      • After a year a 2-page grant is required for access to XSEDE


  • OSG (HTC)
    1. Who can apply:
      • Any researcher
    2. How do you apply:
      • Researchers can apply to the OSG user school, where they receive training plus 1 year free on the grid. After that they need to join an organized group on OSG.
      • This application is a few pages on how the OSG will help facilitate your research needs.

  • Amazon & Microsoft (HPC, HTC, MTC)
    1. Who can apply:
      • Students and Postdocs at research universities
    2. How do you apply:
      • Submit an abstract and information on computational needs 

Sunday, March 13, 2016

Using AWS for Research


What is Amazon Web Services (AWS)? 


AWS is a cloud computing service provided by Amazon. Amazon EC2 is the service we'll go over in this post. It allows users to launch their own virtual instances running a variety of operating systems. Amazon provides these computational resources to the public and private sector at an extremely low price. I have listed below both the advantages and disadvantages of using cloud computing resources. For me the advantages definitely outweigh the disadvantages :) AWS is generally much more flexible than your university's local cluster in that you can choose the size, type and number of machines, and you have the rights to install and run software whenever needed.

 Advantages 

  1. None of the maintenance of hosting your own server 
  2. Access to high-performance machines 
  3. Access to highly parallel computing machines (like those running Hadoop) 
  4. The cost is low 
  5. Sudo user rights (can install whatever you want without asking your sysadmin) 
  6. The computers available to you come in a variety of shapes and sizes

 Disadvantages 

  1. There might not be enough machines for everyone and there may be a wait time (this has yet to happen to me) 
  2. It costs money (but Amazon provides educational grants) 
  3. AWS is a blank slate, so you need to install everything and copy your files over (they make this easy with things like S3 for storing files, and you can create an image of an instance so you can relaunch one with software pre-installed) 

 AWS Getting Started 

Using EC2 is easy. Once you've launched your instance, you can ssh in using your ssh key. If your machine runs Ubuntu:

ssh -i location_to_pem_file.pem ubuntu@ec2-54-153-7-122.us-west-1.compute.amazonaws.com

All inputs and outputs should be saved in your /mnt/ directory (this is where all of the system storage is).

By default you are not the owner of this directory, so you need to change its ownership.

cd /mnt/
sudo chown ubuntu:ubuntu .
mkdir data

 If you need to mount more storage on your EC2 machine: 

  1.  Go to AWS Console  
  2. Create an EBS volume of the size needed 
  3. Attach this volume to your running EC2 machine 
  4. Then mount the volume to your EC2 machine using the following commands 

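# Format the new volume (this erases anything on it, so only do this for a new, empty volume)
# /dev/xvdf might be different depending on where the volume was attached
sudo mkfs -t ext4 /dev/xvdf
# Mount the volume on the data directory
sudo mount /dev/xvdf /mnt/data
cd /mnt/data/
# Run mount with no arguments to list mounted filesystems and check that /dev/xvdf appears
mount
# Make yourself the owner of the mounted directory
sudo chown ubuntu:ubuntu .
# Run df to confirm that your volume was successfully added
df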


 All data can easily be saved on S3, and instances with pre-installed packages can be saved and reused as images.
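Saving results to S3 before shutting down an instance could look something like the sketch below. It assumes the AWS command line tool is installed and configured on the instance (it may not be by default), and the bucket name and results/ prefix are placeholders. Setting DRY_RUN=1 only prints the command, which is handy for checking paths first.

```shell
#!/usr/bin/env bash
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy a local directory to S3 so the data outlives the instance.
# The bucket name and the results/ prefix are hypothetical.
save_results() {
    local dir="$1" bucket="$2"
    run aws s3 sync "$dir" "s3://$bucket/results/"
}

# Example usage:
#   DRY_RUN=1 save_results /mnt/data my-bucket   # print the command only
#   save_results /mnt/data my-bucket             # actually copy the files
```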