The 1000G project just published its main manuscript, and MassGenomics provides an excellent summary:
The 1,000 Genomes project has provided a sort of “null expectation” for the number of rare, low-frequency, and common variants of different functional consequences found in randomly-chosen [healthy] individuals from various populations. […] It also tells you that if you sequence an individual’s whole genomes and don’t find about 3 million SNPs, something is probably wrong.
And Luke Jostins’ publication in Nature reports on a record number of genetic associations for Inflammatory Bowel Disease, along with the struggle to turn these associations into actionable predictions:
We could imagine genotyping healthy people and use our new IBD variants to find a “high risk” group that we can monitor more closely. How well would this work? Given the variants reported in the paper, the answer is “not very well”. Suppose we take people in the top 0.05% of IBD risk. Even in this high risk group only 1 in 10 people will get IBD. Even worse, 99% of real IBD patients WON’T be in this group, and so would be missed by the test!
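The three figures in that quote (top 0.05% flagged, 1 in 10 of those flagged developing IBD, 99% of patients missed) pin down the disease prevalence they assume. A minimal sketch of the arithmetic, using only the numbers quoted above; the function name `implied_prevalence` and the one-million-person cohort are illustrative choices, not something taken from the paper:

```python
def implied_prevalence(flagged_fraction, ppv, missed_fraction):
    """Population prevalence implied by a screening scenario.

    flagged_fraction: share of the population placed in the high-risk group
    ppv:              share of the high-risk group that develops the disease
    missed_fraction:  share of all patients who fall outside the group
    """
    sensitivity = 1.0 - missed_fraction
    # Share of the whole population that is both flagged and affected.
    flagged_and_affected = flagged_fraction * ppv
    # sensitivity = flagged_and_affected / prevalence, solved for prevalence.
    return flagged_and_affected / sensitivity


# Numbers taken from the quote above.
flagged = 0.0005  # top 0.05% of predicted risk
ppv = 0.10        # 1 in 10 of the flagged group gets IBD
missed = 0.99     # 99% of real patients are not in the group

prevalence = implied_prevalence(flagged, ppv, missed)

# The same scenario expressed as head counts in a hypothetical cohort.
cohort = 1_000_000
flagged_people = round(cohort * flagged)
caught_cases = round(flagged_people * ppv)
total_cases = round(cohort * prevalence)
missed_cases = total_cases - caught_cases

print(f"Implied prevalence: {prevalence:.2%}")
print(f"Flagged: {flagged_people}, of whom {caught_cases} develop IBD")
print(f"Patients missed by the test: {missed_cases} of {total_cases}")
```

In other words, the quoted numbers are mutually consistent with a prevalence of roughly half a percent: per million people screened, the test flags a few hundred and still misses almost every eventual patient.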
The effect of a genetic variant depends on many factors beyond how it impacts protein structure: where the gene is expressed, when it’s expressed, how it interacts with other genes that themselves may harbor variants, etc. […] And let’s not forget the effect of environment. Diet, exercise, exposure to carcinogens, and other factors may have a far more dramatic effect on whether or not someone gets a particular disease than their constitutional genetic makeup.
Respondents spend the greatest fraction of their NGS workflow performing data analysis and interpreting biological meaning from their data.
No, that doesn’t mean a genome will cost nearly $300,000 to fully analyze, a number touted recently by a Genetic Engineering & Biotechnology report and neatly debunked by Daniel MacArthur at Genomes Unzipped. But the time spent on interpretation and meta-analysis combined already exceeds the time spent on primary data analysis (presumably QC, mapping, and calling for a typical re-sequencing workflow), and I expect this to tip even more towards interpretation in the future. In other words, while we continue to automate more and more of the analytical process, demands on data annotation and interpretation continue to increase, and not all experimental resources should be allocated purely to data generation.
There is also incredible value in partnering with a sequencing facility that gets involved in the experimental design, even if that raises the initial cost of the experiment. See Bastien Chevreux’s brilliant summary of just some of the choices influencing an NGS experiment:
On a gene fishing expedition? Probably Illumina HiSeq, at least 100bp, 150 to 200bp if your provider supports it well. Ion if a small organism and you need it quick without caring for possible frameshifts. Want some larger contigs? 454 Titanium + Illumina 100bp (150 to 200bp if provider supports it). The same as above, but maybe cheaper? Ion Torrent (long chemistry) & Illumina. Even larger contigs and scaffolding? 454 Titanium + 454 paired-end + Illumina HiSeq (also paired-end if you need more coverage). Larger scaffolds? Like above, but different library sizes in the paired-end libraries. Feeling adventurous or have a complex eukaryote? PacBio + at least Illumina, preferably with 2x to 4× 454 mixed in.
In other words:
There’s one – and only one – question which you, as sequencing customer, need to be able to answer … if necessary in every excruciating detail, but you must know the answer. The question is: “WHAT DO YOU WANT?!” […] But often sequencing customers get their priorities wrong by putting forward another question: “WHAT WILL IT COST ME?”
The Genomics and Pathology Services at Washington University in St. Louis School of Medicine is offering support to tackle rare diseases:
Do you have a patient population poised for sequencing, yet lack the funds or expertise to carry out the testing? Do you have specific families with a large pedigree and strong suspicion for a genetic cause that await whole exome analysis? Do you have compelling research ideas to solve a rare disease by exome sequencing?
The deadline for letters of interest is April 2nd, with proposals due on June 1st.
Titus Brown on teaching sequencing courses effectively:
Right now, I think it’s too much effort for too few students & little professional impact for me to continue with things past 2013: I don’t want to just affect a few dozen people for the amount of effort we’re putting into this. So… is there a way to scale?
The cost/benefit ratio is something we have been struggling with as well. At some point you need a dedicated admin team to ensure even just two-day workshops run smoothly, and the immediate benefit isn’t always entirely clear. The good news is that this is widely recognized by now, and initiatives like Khan Academy are driving the development of better infrastructure and course delivery options.
Kerri Smith comments on the just-published gorilla genome. I particularly like Wolfgang Enard’s quote:
But there is not much point without data on behaviour and physiology. “We want phenotypes too”, he says.
While sequencing and data analysis costs now make it feasible to sequence thousands of species, the real value lies in linking this information to differences in phenotype.
Update: David Winter describes why it is expected that some of our genes are more closely related to their gorilla counterparts than to those of the chimp.
Quick update before the end of the month:
The long-awaited paper from Eric Lander on missing heritability has been published and is causing a new round of discussions, following the initial debate after the ASHG announcement. Luke has a short summary over at Genomes Unzipped, and Steve Hsu delves into the supplementary material. A good addition to this is Genetic Inference’s report on the current state of complex trait sequencing via ICHG2011.
BGI has started to move its sequence analysis to GPU-based servers, though the article in Wired is unfortunately light on details about which algorithms ended up being ported to the different architecture.
IonTorrent, meanwhile, has started supporting paired-end reads, sort of, and has announced a new machine. That is a bit of a disappointment, as one of their main selling points was that improvements would happen through the chips, not the hardware around them. Be that as it may, we are getting close to the $1000 genome, at least for data generation, which doesn’t take the very time-consuming data analysis into account. This is also reflected in my favorite quote from Elaine Mardis’ interview with The Scientist:
“It makes me crazy to listen to people say sequencing is so cheap. Sure, generating data is cheap. But then what do you do with it? That’s where it gets expensive. ‘The $1,000 genome’ is just this throwaway phrase people use and I don’t think they stop to think about what’s involved. When we looked back after we analyzed the first tumor/normal pair, published in Nature in 2008, we figured that genome—not just generating the data, but coming up with the analytical approach, which nobody had ever done before—probably cost $1.6M. If the cost of analysis doesn’t fall over time, we’re never going to get to clinical reality for a lot of these tests.”
This is not going to get any easier as the sequencers get more and more efficient; see Illumina’s announcement of the HiSeq 2500 (summary by OmicsOmics and on the SeqAnswers forum). And though the price of the reagents keeps decreasing, it’s still cheaper to store the data than to re-sequence the samples, storage problems notwithstanding.
Michael Eisen has a comment in the NYT on the Research Works Act that’s recommended reading. If you are a member of ISCB, you might want to consider signing their policy statement, which strongly opposes the act.
- If you aren’t following Edge, you are missing out on some great science debates. The Guardian talks to its founder, John Brockman
- TopHat gets a new release
- Aaron Kitzmiller has a terrific commentary regarding the Core model in academia, and why this is incredibly difficult to get right for bioinformatics
- St Jude’s releases Explore, a portal to their pediatric cancer genome data
- Keith Bradnam summarizes why it is so difficult to evaluate genome assemblies
- Dan Koboldt provides a neat summary of the current state of dbSNP
Back in the office after the holidays, with quite some catching up to do.
Growth in genomics
Coverage of computational biology and genomics in the general media continues to increase. This time the Economist covers bioinformatics and the New York Times has an article on computational biology and cancer. And while public funding for some of the current genome centers is being cut by as much as 20 percent, new centers in New York and Connecticut are hoping to benefit from the increase in demand.
Clinical grade sequencing
Some of this demand is driven by a general trend towards clinical sequencing, which is benefiting the Broad and other centers. Given the inroads made in 2011 towards identifying causal mutations (nicely summarized by MassGenomics for disorders and cancers), this is perhaps not surprising, and even clinical trials for personal genome sequencing are kicking off.
While the technology is making rapid progress, we still have to deal with a large number of problems, among them the handling of genomics and privacy, how to make sense of the results in the first place (something that new initiatives like openSNP are trying to address), and the discrepancies caused by differences in sequencing technologies and data processing. We will need a public assessment or competition of workflows and methods, similar to what the Assemblathon and GAGE have been doing for genome assembly approaches.
Resources and discussions
- Neil Saunders has started a lovely series on next-generation sequencing (NGS) for those familiar with good old Sanger
- A renewed discussion about public access to research funded by taxpayers
- More thoughts on hypothesis- vs. data-driven science, and also from Daily Life ruminations on data-intensive science and workflows
- If you are new to the field of metagenomics, this review is a great starting point
- New tool for quality control in RNA-seq: RNA-SeQC and an alternative workflow
- Two additional papers listing biases in RNA-seq introduced by barcoding
- OmicsOmics provides a comparison of IonTorrent and MiSeq
- Our own Brad Chapman provides a summary of BioCloudCentral, a simple way to get started with a sequence analysis using Galaxy and CloudMan
- Speaking of which, CloudMan gets an update
- James Taylor wrote up a Galaxy tutorial for ChIP-seq analysis
- SomaticSniper has been released, hopefully the first of many tools made available by WashU
See you in a week or two!
A quick summary of interesting publications, articles and resources the CHB staff encountered this week:
News and publications
- Reproducible research received coverage last week, most notably from articles in Science (commentary by Roger Peng in Simply Statistics) and The Wall Street Journal. While tools like Sweave or knitr help by keeping code and results in sync, the consistent reproducibility of research is far from a solved problem. In particular, different versions of databases or bioinformatic tools increasingly cause problems in next-generation sequencing experiments. A problem closely associated with the question of reproducible research is the handling of large-scale data storage, including issues around data security, redundancy and versioning, which are also slowly starting to make headlines.
- The WSJ also covered Open Science and Citizen Scientists in the same issue (also see commentary from Jeff Leek). Researchers, scientific journals and funding agencies are still struggling to find the best approach to foster collaboration and give data back to the public, and any awareness raised through articles like these helps.
- Daniel MacArthur, who recently accepted a new position at MGH, takes down the Independent’s coverage of a sleep study. Recommended reading.
- The European Bioinformatics Institute launches a metagenomics framework
- Concise summary of the current next-generation sequencing field from The Molecular Ecologist, based on an article by Travis Glenn. Highly recommended as a resource for planning and budgeting the next sequencing experiment
- Dan Koboldt comments on the findings of a new publication from the Shendure group on filtering strategies for exome seq and the need for matched normals
- Yet-another-learning-R-book, but the new The Art of R Programming is getting stellar reviews
- And if you decide to delve into Python, Mir Nazim published an excellent overview of the current Python ecosystem
See you next week!