BioWordCount: An Intro To Bioinformatics on Apache Spark

Previously on, I demonstrated a bioinformatics use for Hadoop MapReduce. The idea was to build on the ubiquitous word count example, but using a problem which is at least somewhat relevant to bioinformatics. So I read in a VCF file and parsed out the reference and the variant bases, and collected an overall count of the mutation spectrum. So here we are, back at it with an Apache Spark version of the demo.
Read more

BioWordCount: A Brief Introduction to Hadoop For The Bioinformatics Practitioner

Many people who do bioinformatics (the field seems to have settled on “bioinformatician”, but I like “bioinformaticist” better) find themselves dealing with large data sets and long running processes, arranged in myriad pipelines. In our time, this inevitably demands distributed computing. Life innovated during the Cambrian explosion by going from single cells to colonies of cells. Life found a way to distribute and parallelize its processes. In order for us to properly focus our analytical microscopes on life, we imitate life in this strategy and distribute our processes across multiple CPU’s.
Read more