looking for patterns in the patterns

FDA, 23andme, and Innovation

I’m neither a doctor nor a lawyer, so consider my opinions appropriately. I began my career as a software developer at the Genome Institute (TGI) at Washington University in St. Louis. I didn’t set out to get into bioinformatics; it just happened to be the most interesting problem to work on in St. Louis at the time I was getting my degree in computer science. So I set about learning the craft of software development and the field of bioinformatics at the same time.

As many before me have noticed, biology and computers are strangely similar at their most basic level. Both decompose into some basic operations that get repeated over and over, with important results emerging at higher orders from the basic operations. Computer processors can only do simplistic operations on a few pieces of data at a time. Add two numbers. Multiply two numbers. Compare two numbers. Algorithms of enormous complexity can be implemented as a series of much smaller operations. Such is the case with biology. DNA is transcribed into RNA, which gets translated into proteins, which interact with each other in a cascading series of pathways that in the end results in the swarm of cells that is a human being.

So when I began working on the variant calling portion of the cancer sequencing pipeline at TGI, I started to see that not only are these two fields, computers and biology, deeply similar, but that they reinforce and supplement one another. Computers are to biology what the telescope was to astronomy. Both biology and astronomy existed before their defining technologies were invented, but neither would be nearly as valuable to mankind without their paradigm-shift enabling technologies.

I was intimately involved in the process of taking raw genomic data from a sequencing machine and shepherding it through the pipeline of steps required to produce some actionable output. In the case of cancer sequencing, the output is mostly studies on large aggregations of tumor samples to try to discover how varieties of cancer develop and spread. But sometimes the output is meant to be used for clinical sequencing. This is the sequencing of one specific person and their tumor, to determine if the drugs we already have are good candidates for treatment. I was up to my elbows in other people’s DNA. I saw the variants in their DNA that determine their health. It was like reading the source code for a human being. So naturally I became curious about my own DNA.

Enter 23andme.

23andme provides a personal genome service that can determine which variants an individual has at one million pre-selected sites in the genome. All you need to do is spit in a tube and mail it to them. A few weeks later they let you know when the sequencing and analysis is done, and you get to revel in the information about your source code, just like people with access to big labs, like TGI. To be sure, this is not the same as whole genome re-sequencing, where each and every base of the entire individual is determined. That is a bit more costly, but getting cheaper every day. 23andme provides answers to specific questions like “do I have the BRCA mutations which make me much more susceptible to breast cancer?” The answers you get from 23andme are answers to well studied questions.

By this time, I had become more involved in the field of bioinformatics and ended up at a commercial bioinformatics software shop here in St. Louis, Partek Inc. In order to fuel their employees’ interest in genomics and spark their thinking on use cases, they paid for all of us to get sequenced by 23andme. So in the summer of 2012, we each received our results and marveled at the vast amount of detail. Each question answered by sequencing is backed up by a plethora of links to published journal articles describing the function of, and the studies done on, that particular variant.

Here is how 23andme describes the state of my genome with respect to the above mentioned BRCA mutations:

No copies of the three early-onset breast and ovarian cancer mutations identifiable by 23andme. May still have a different mutation in BRCA1 or BRCA2.

You have to be familiar with how genetics works in order to properly parse this statement.

Enter the FDA. The FDA has recently ordered 23andme to stop providing their service to consumers. One of the issues they cite is providing BRCA data to consumers:

For instance, if the BRCA-related risk assessment for breast or ovarian cancer reports a false positive, it could lead a patient to undergo prophylactic surgery, chemoprevention, intensive screening, or other morbidity-inducing actions, while a false negative could result in a failure to recognize an actual risk that may exist.

This is an alarming point to raise. No one wants to needlessly cause “morbidity-inducing actions” or give a false sense of security. So let’s unpack both the false positive case and the false negative case.

A false positive would be asserting that the customer has a specific mutation, when they do not in fact have that mutation. What are the odds of this happening? I don’t have specific data on the 23andme lab, but similar technologies result in errors on the order of 2.5 calls in 100,000. This is indeed a small probability. This is in the neighborhood of flipping a coin and getting 15 heads in a row. It’s quite unlikely, but not unthinkable. So if you get 1,000,000 variant calls from 23andme, somewhere in the neighborhood of 25 will be incorrect. There is a much deeper error model built around sequencing and variant calling, but I will only glance off the surface for today’s purpose. The FDA is concerned that “morbidity-inducing actions”, like a mastectomy, could be undertaken based on faulty evidence. However, very few customers of 23andme are going to read their results, find a BRCA positive answer, and immediately perform a self-mastectomy. At some point, the medical establishment gets involved. There are orthogonal methods of validating the BRCA carrier status. 23andme is purely an informational service, meant to be the beginning of a conversation, not the last word.
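The arithmetic behind those two figures can be checked with a short sketch. It assumes the 2.5-in-100,000 error rate quoted above applies uniformly and independently to every call, which is a simplification; the function names are made up for illustration.

```python
import math

# Figures quoted in the text above; the uniform error rate is an assumption.
ERROR_RATE = 2.5 / 100_000   # erroneous calls per call
NUM_CALLS = 1_000_000        # approximate number of sites reported


def expected_errors(rate, calls):
    """Expected number of incorrect calls if every call errs at the same rate."""
    return rate * calls


def equivalent_heads_in_a_row(rate):
    """Number of consecutive fair-coin heads that is about as unlikely as `rate`."""
    return math.log2(1 / rate)


print(f"expected incorrect calls: {expected_errors(ERROR_RATE, NUM_CALLS):.1f}")
print(f"equivalent coin-flip streak: {equivalent_heads_in_a_row(ERROR_RATE):.1f} heads")
```

The streak comes out between 15 and 16 heads, and the expected number of bad calls in a million is about 25.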

The case of a false negative is on the same order of improbability as the false positive. The snag comes in the semantics of the question being answered. Does the patient have the previously identified, inherited mutations in the BRCA genes? This is like knowing a certain edition of a book has a typo in a specific word on a specific line. Science has identified several of these specific typos that are commonly inherited. 23andme tells you if you have these specific typos in your book of life. It’s also possible that you acquired a typo during your lifetime, in some other word in the BRCA chapter of your genome. Or even that you inherited a typo, but one that hasn’t been identified by science yet. These are answers that 23andme can’t give you. So while they can say you don’t have typos X, Y, or Z, they can’t state that you have no typos at all. 23andme even states this in the copy on their site: “May still have a different mutation in BRCA1 or BRCA2.”

In effect, the FDA wishes to place some barrier, either themselves or some other apparatus of the medical establishment, between your genome sequence and you. They call the service that 23andme provides a “medical device.” I don’t claim that the implications of consumer genome sequencing are clear or entirely knowable at this time. I do claim that if we step in with overreaching caution and control where it is not warranted, we risk destroying a nascent industry that is going to be one of the biggest, most world-shaping technologies of the next 100 years. Companies that now enjoy the freedom to innovate with new products and technologies will be forced to move elsewhere or stop.

St. Louis is becoming an important center for life sciences, with the Genome Institute, Monsanto, the Danforth Plant Science Center, Partek, and even newer players like Cofactor Genomics. The path forward to the 21st century for these technologies does not go through slow, monolithic government bureaucracies. People free to act on their insights and recent innovations will build this industry elsewhere if we clamp down on it now.

Digital Ocean Hosting: Error 2002 Can’t Connect With Local Server, and a Solution

I have recently started using Digital Ocean for hosting. I was previously using the free Heroku hosting, which I liked very much. The DNS was a bit tricky to set up, but once it was configured, it was smooth. I was able to get a one second load time for the index. It’s all static files, but I was still happy with Heroku, especially for free. However, I’ve been diving much deeper into web development lately, and I’ve decided I need a larger base of operations on the web. Digital Ocean was at the top of my short list of candidates (along with DreamHost and Linode). The SSDs on every machine are what finally sold me.

I’ve been working on a Rails project lately, and I need a dev machine in the cloud for testing purposes. So I found myself needing to set up a Rails environment on my new Digital Ocean instance. I naturally went with my friend Chris’s excellent blog post on setting up Rails for Ubuntu 12.10 with Nginx, etc. I decided to go with MySQL instead of Postgres.

Gypsies and Jet-setters: Bruce Sterling at 2006 SXSW

This is the first Bruce Sterling talk that I encountered. Now I’m an inveterate Sterlingite, but at the time, 2006, I had only barely crossed intellectual paths with Sterling. I downloaded it with some long dead, precambrian cousin of Google Reader that lived in the swamps and estuaries of Windows computers and survived by allowing the user to view RSS feeds on his desktop. Then I pushed it to my equally antediluvian, single purpose device, an Archos mp3 player. I listened to this talk many times before the Archos died off. I thought this audio was lost to the bit bucket of history, until one day… These people are the new Library of Alexandria, with podcasts instead of papyrus.

The full mp3 is available here.

A Mental Steam Shovel

A month ago I was hanging out with my friend Chris. We were having our weekly meetup, talking about our approach to our work, recent experiences, and just enjoying our surroundings. I noticed a Connect Four game sitting in the corner, on a dark wooden ledge. The old plastic version in blue and yellow, a disjoint pair of additive and subtractive primary colors, would have done the trick, but this was even better. It was a wooden version of the board. There were two wooden dowels, each half the width of the board, inserted just beneath the bottom row, which held the pieces in the board. When you draw them out, one from each side, the game is reset and the pieces fall out of the bottom of the game board.

“I’m sure there’s some simple heuristics to this I’ve long since forgotten.”


Textruder is the next installment in a long line of one-dimensional cellular automata implementations on various platforms and various media. This adventure begins like so many, on the command line. The inspiration for this project came from reading one of Stephen Wolfram’s papers on cellular automata. The original output of the programs testing the concepts of cellular automata was not graphical in the sense of directly mapping each cell to a pixel or block of pixels. Instead, they simply used the command line to emulate this behavior, printing out a new line for each iteration of the row of cells, with an “*” character representing the on cells and a space for the off cells.
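A minimal sketch of that command-line style of output follows. It is not Textruder itself: the choice of rule 30, the single seed cell in the middle, and the fixed "off" cells beyond the edges are all assumptions made for illustration, as are the function names.

```python
def step(cells, rule=30):
    """Compute the next row; cells beyond either edge are treated as off."""
    padded = [0] + cells + [0]
    next_row = []
    for i in range(1, len(padded) - 1):
        # Read the three-cell neighborhood as a number from 0 to 7.
        index = (padded[i - 1] << 2) | (padded[i] << 1) | padded[i + 1]
        # Bit `index` of the rule number is the cell's next state.
        next_row.append((rule >> index) & 1)
    return next_row


def render(cells):
    """Draw one row: '*' for on cells, a space for off cells."""
    return "".join("*" if cell else " " for cell in cells)


def run(width=31, steps=15, rule=30):
    """Return the rendered rows, starting from a single on cell in the middle."""
    cells = [0] * width
    cells[width // 2] = 1
    lines = []
    for _ in range(steps):
        lines.append(render(cells))
        cells = step(cells, rule)
    return lines


print("\n".join(run()))
```

Passing a different `rule` number between 0 and 255 to `run` selects a different elementary automaton with the same printing scheme.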

Automatic Mechanical Self Reproduction

While reading Artificial Life: A Report From the Frontier Where Computers Meet Biology by Steven Levy, I came across a reference to self-reproducing structures built by Lionel and Roger Penrose. These structures were small plywood cutouts fitted with various shapes and levers which allowed them to link up, or not, when coming into contact with another block of the same make. See the following two part short film about this project:

Iterated Prisoner’s Dilemma

I recently decided to re-read a book I had read long ago, in order that I might filter it through the knowledge and experience I’ve accrued since I first read it. It occurred to me that so much of what I have done in the intervening years has an impact on my understanding of it, that I could scarcely hold a conversation with my past self on the topic. The book in question is The Selfish Gene by Richard Dawkins.

Convert Your 23andme Raw Data Into VCF Format

A week ago I received my results from 23andme. Aside from the obvious points of interest (health risks, heritage, neanderthal composition, etc.), I was also interested in getting my own data in raw format. While 23andme does provide a way to download your “raw” data, they are not really providing raw data. One cannot access the image data from the microarray sequencer that they used. What they do provide is formatted as follows:

# rsid  chromosome  position    genotype
rs4477212   1   82154   TT
rs3094315   1   752566  TC
rs3131972   1   752721  AA
rs12124819  1   776546  AC
rs11240777  1   798959  GA
rs6681049   1   800007  CC
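As a rough sketch, the format above can be read into records like this. It assumes whitespace-separated columns and `#`-prefixed comment lines, as in the excerpt; the `Genotype` type and `parse_raw` function are hypothetical names, not part of any 23andme tooling. Writing real VCF would additionally need the reference allele at each position, which this file does not contain.

```python
from typing import Iterable, Iterator, NamedTuple


class Genotype(NamedTuple):
    rsid: str
    chromosome: str   # kept as text, assuming non-numeric names such as X or Y can occur
    position: int
    genotype: str


def parse_raw(lines: Iterable[str]) -> Iterator[Genotype]:
    """Yield one record per data line, skipping blank lines and '#' comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split()
        yield Genotype(rsid, chromosome, int(position), genotype)


# The first rows of the excerpt above, used as sample input.
SAMPLE = """\
# rsid  chromosome  position    genotype
rs4477212   1   82154   TT
rs3094315   1   752566  TC
rs3131972   1   752721  AA
"""

records = list(parse_raw(SAMPLE.splitlines()))
for record in records:
    print(record.rsid, record.chromosome, record.position, record.genotype)
```

For a real download, the same function could be fed an open file object instead of `SAMPLE.splitlines()`.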