BioWordCount: An Intro To Bioinformatics on Apache Spark

Previously, I demonstrated a bioinformatics use for Hadoop MapReduce. The idea was to build on the ubiquitous word count example, but using a problem that is at least somewhat relevant to bioinformatics. I read in a VCF file, parsed out the reference and variant bases, and collected an overall count of the mutation spectrum. So here we are, back at it with an Apache Spark version of the demo.
Read more
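The core counting logic can be sketched without Spark at all. The following is a minimal illustration, assuming a standard VCF layout where the tab-separated columns REF and ALT sit at indices 3 and 4; the function name and sample lines are my own, not from the post:

```python
from collections import Counter

def mutation_spectrum(vcf_lines):
    """Count REF>ALT substitution pairs from VCF-formatted lines.

    Skips header lines (starting with '#') and counts only
    single-nucleotide REF/ALT pairs.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]
        if len(ref) == 1 and len(alt) == 1:
            counts[f"{ref}>{alt}"] += 1
    return counts

lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t50\tPASS\t.",
    "1\t300\t.\tA\tG\t50\tPASS\t.",
]
print(mutation_spectrum(lines))  # Counter({'A>G': 2, 'C>T': 1})
```

In Spark the same idea would typically become a filter over header lines, a map to `(REF>ALT, 1)` pairs, and a `reduceByKey` to sum the counts.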

BioWordCount: A Brief Introduction to Hadoop For The Bioinformatics Practitioner

Many people who do bioinformatics (the field seems to have settled on “bioinformatician”, but I like “bioinformaticist” better) find themselves dealing with large data sets and long-running processes, arranged in myriad pipelines. In our time, this inevitably demands distributed computing. Life innovated during the Cambrian explosion by going from single cells to colonies of cells. Life found a way to distribute and parallelize its processes. In order to properly focus our analytical microscopes on life, we imitate life in this strategy and distribute our processes across multiple CPUs.
Read more

Convert Your 23andme Raw Data Into VCF Format

A week ago I received my results from 23andme. Aside from the obvious points of interest, health risks, heritage, neanderthal composition, etc., I was also interested in getting my own data in raw format. While 23andme does provide a way to download your “raw” data, they are not really providing raw data. One cannot access the image data from the genotyping microarray that they used. What they do provide is formatted as follows:
Read more
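For reference, the 23andme raw download is a tab-separated text file of rsid, chromosome, position, and genotype, with comment lines starting with '#'. A minimal parsing sketch (the function name and sample line are illustrative, not from the post; producing true VCF additionally requires a reference genome to fill in REF alleles):

```python
def parse_23andme(lines):
    """Parse 23andme raw-data lines into (rsid, chrom, pos, genotype) tuples.

    Lines beginning with '#' are comments; data lines are tab-separated.
    """
    records = []
    for line in lines:
        if line.startswith("#"):
            continue
        rsid, chrom, pos, genotype = line.rstrip("\n").split("\t")
        records.append((rsid, chrom, int(pos), genotype))
    return records

sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs4477212\t1\t82154\tAA",
]
print(parse_23andme(sample))  # [('rs4477212', '1', 82154, 'AA')]
```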