The Broad has just launched a new workshop series focusing on tools and typical bioinformatics workflows.
The BroadE curriculum (the “E” stands for educational, the “Broad” for collaboration) offers insights and hands-on training in rapidly evolving technologies, high-throughput methods, and computational tools that are not typically found in conventional research labs.
It is open to members of the Broad Institute and affiliated institutions (Harvard, MIT). Three workshops are coming up, two of which are still "open for registration":http://www.broadinstitute.org/what-broad/administration/broade/broad-workshops:
GenePattern – May 21
Using CellProfiler for Biological Image Analysis – May 25
RNA-Seq Introduction – June 25
We try to keep our work as reproducible as possible by keeping code in Bitbucket, re-using standard environments, documenting workflows with knitr and storing results in open repositories, but all of this is put to shame by a recent publication from Titus Brown on the digital normalization of sequencing data.
Titus’ team not only submitted a pre-print to arXiv and made all source code available; they also disseminated all data, scripts to reproduce the manuscript’s figures, and tutorials on using their software in practice. To top it off, they provide an EC2 instance so readers can get started right away:
Once you’re connected, select the ‘diginorm’ notebook (it should be the only one on the list) and open it. Once open, go to the ‘Cell…’ menu and select ‘Run all’.
Update: also see Titus’ blog post on the motivation and work that went into this.
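For readers curious what digital normalization actually does: the core idea is to discard reads whose estimated coverage, judged by the median count of their k-mers among reads kept so far, already exceeds a cutoff. A toy sketch of that idea (exact k-mer counting is used here for clarity; the paper’s implementation uses probabilistic counting to bound memory, and the read data below is made up):

```python
from collections import defaultdict
from statistics import median

def digital_normalization(reads, k=4, cutoff=3):
    """Keep a read only if the median count of its k-mers
    (among reads kept so far) is below the coverage cutoff."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(read)          # read adds new coverage: keep it
            for km in kmers:
                counts[km] += 1
    return kept

# Highly redundant input: ten copies of one read plus one unique read
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
kept = digital_normalization(reads)
```

The redundant copies are discarded once their k-mers are saturated, while the unique read is always kept, which is exactly what makes the approach attractive for deep, uneven sequencing data.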
Harvard Catalyst is offering another primer on complex trait genetics next Wednesday, April 4, at MGH (8:30-4:30, Simches Research Building, Room 3.110). Registration is required. The program looks exciting:
This event is sponsored by the MGH Clinical Research Program, the Center for Human Genetic Research, and the Broad Institute of Harvard and MIT, in partnership with the Harvard Catalyst.
Are you able to keep up with the changing face of genetic research? Have you heard of disease areas that have seen explosive growth in genetic discoveries in the past year? This is an excellent opportunity to learn the essential elements of complex trait genetics and gain the latest insights from expert faculty from the Center for Human Genetic Research and the Broad Institute of Harvard and MIT.
8:00-8:30am – Registration and Continental Breakfast
8:30-9:15am – David Altshuler, MD, PhD: Introduction to Complex Trait Genetics
9:15-10:00am – Jim Gusella, PhD: Mendelian Traits Focusing on Methods, Modifiers, and Implications for Complex Traits
10:00-10:45am – Benjamin Neale, PhD: Resequencing
10:45-11:00am – Break
11:00-11:45am – Mark Daly, PhD: Inflammatory Bowel Disease Genetics
11:45-12:30pm – Christopher Newton-Cheh, MD, MPH: Translating Findings from Human Genetics to an Improved Understanding of Blood Pressure Regulation
12:30-1:30pm – Lunch on your own
1:30-2:15pm – David Beier, MD, PhD: Mouse Models as Tools for Follow up of Human Genetics
2:15-3:00pm – David Milan, MD: Zebrafish as a Model of Human Disease
3:00-3:45pm – Sean Wu, MD: Human Stem Cells as In Vitro Models for Genetic Discovery
3:45-4:30pm – Panel Discussion
Sometimes colleagues wonder what the point of tools like Twitter might be, and why we’d waste our time with them. A recent interaction with Neil Saunders reminded me just how useful these new ways of collaborating can be.
We had noticed a strange cyclical bias in probe intensities from Illumina’s 450k Methylation chip, seemingly related to the Sentrix position, but had written it off as a potential problem with uneven distribution of chemistry… until Neil mentioned seeing the same phenomenon in an independent analysis.
Neil kindly summarized his findings, and all of a sudden we have a much better overview of a potential technological problem and can account for this in future workflows.
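A quick way to screen your own 450k data for such a position effect is to summarize intensities by Sentrix row; a minimal sketch using simulated intensities (the row/column naming scheme and the size of the bias are made up for illustration):

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical data: mean probe intensity per array, with a simulated
# cyclical dependence on the Sentrix row (rows R01..R06, columns C01..C02)
positions = [f"R{row:02d}C{col:02d}" for row in range(1, 7) for col in range(1, 3)]
arrays = [(pos, 1000 + 50 * (int(pos[1:3]) % 2) + random.gauss(0, 5))
          for pos in positions for _ in range(10)]

# Group intensities by Sentrix row to make a positional bias visible
by_row = {}
for pos, intensity in arrays:
    by_row.setdefault(pos[:3], []).append(intensity)

for row, values in sorted(by_row.items()):
    print(row, round(mean(values), 1))
```

With real data you would replace the simulated intensities with per-array summaries from your preprocessing pipeline; a systematic difference between rows that survives normalization is the kind of signal Neil and we both observed.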
Titus Brown on teaching sequencing courses effectively:
Right now, I think it’s too much effort for too few students & little professional impact for me to continue with things past 2013: I don’t want to just affect a few dozen people for the amount of effort we’re putting into this. So… is there a way to scale?
The cost/benefit ratio is something we have been struggling with as well. At some point you need a dedicated admin team to ensure that even two-day workshops run smoothly, and the immediate benefit isn’t always clear. The good news is that this is widely recognized by now, and initiatives like Khan Academy are driving the development of better infrastructure and course delivery options.
I’m normally not a fan of top-ten lists, but this one from Titus Brown comes with tons of additional comments, a good summary of the challenges facing research computing environments, and great insights:
I think about it like this: generating hypotheses from large amounts of data isn’t that interesting — I can do that with publicly available data sets without spending any money! Constraining the space of hypotheses with big data sets is far more interesting, because it gives you the space of hypotheses that aren’t ruled out; and it’s putting your data to good use.
And I wholeheartedly agree about one of the main challenges:
We were delayed in some of our research by about a year, because of some systematic biases being placed in our sequencing data by Illumina. Figuring out that these non-biological features were there took about two months; figuring out how to remove them robustly took another 6 months; and then making sure that removing them didn’t screw up the actual biological signal took another four months.
Go read the whole thing; it’s going to keep you thinking for quite a while.
Quick update before the end of the month:
The long-awaited paper from Eric Lander on missing heritability has been published and is causing a new round of discussions, following the initial debate after the ASHG announcement. Luke has a short summary over at Genomes Unzipped, and Steve Hsu delves into the supplementary material. A good addition to this is Genetic Inference’s report on the current state of complex trait sequencing via ICHG2011.
BGI has started to move its sequence analysis to GPU-based servers, though the article in Wired is unfortunately light on details about which algorithms ended up getting ported to the different architecture.
Ion Torrent, meanwhile, is starting to support paired-end reads (sort of) and has announced a new machine, which is a bit of a disappointment, as one of their main selling points was that improvements would come through the chips, not the hardware around them. Be that as it may, we are getting close to the $1,000 genome for the data generation, which doesn’t take the very time-consuming data analysis into account. This is also reflected in my favorite quote from Elaine Mardis’ interview with The Scientist:
“It makes me crazy to listen to people say sequencing is so cheap. Sure, generating data is cheap. But then what do you do with it? That’s where it gets expensive. ‘The $1,000 genome’ is just this throwaway phrase people use and I don’t think they stop to think about what’s involved. When we looked back after we analyzed the first tumor/normal pair, published in Nature in 2008, we figured that genome—not just generating the data, but coming up with the analytical approach, which nobody had ever done before—probably cost $1.6M. If the cost of analysis doesn’t fall over time, we’re never going to get to clinical reality for a lot of these tests.”
This is not going to get any easier as sequencers become more and more efficient; see Illumina’s announcement of the HiSeq 2500 (summarized by OmicsOmics and on the SeqAnswers forum). And though the price of reagents keeps decreasing, it is still cheaper to store the data than to re-sequence the samples, storage problems notwithstanding.
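The store-vs-resequence trade-off is easy to put on the back of an envelope; all figures below are hypothetical placeholders, not actual prices:

```python
# Back-of-the-envelope comparison (all numbers illustrative):
# keeping a genome's compressed raw data on disk vs regenerating it later.
genome_size_gb = 120          # ~30x human coverage, compressed FASTQ
storage_cost_gb_year = 0.10   # $/GB/year, e.g. bulk disk or cloud storage
resequencing_cost = 4000      # $ to rerun the sample on a sequencer
years = 5

storage_total = genome_size_gb * storage_cost_gb_year * years
print(f"storage for {years} years: ${storage_total:.0f} "
      f"vs re-sequencing: ${resequencing_cost}")
```

Even with generous storage pricing, keeping the data wins by a wide margin at today’s sequencing costs; the interesting question is at what reagent price point that flips.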
Michael Eisen has a comment in the NYT on the Research Works Act that’s recommended reading. If you are a member of ISCB you might want to consider signing their policy statement which strongly opposes the act.
- If you aren’t following Edge you are missing out on some great science debates. The Guardian talks to its founder, John Brockman
- TopHat gets a new release
- Aaron Kitzmiller has a terrific commentary on the Core model in academia, and why it is incredibly difficult to get right for bioinformatics
- St Jude’s releases Explore, a portal to their pediatric cancer genome data
- Keith Bradnam summarizes why it is so difficult to evaluate genome assemblies
- Dan Koboldt provides a neat summary of the current state of dbSNP
Back in the office after the holidays and quite some catching up to do.
Growth in genomics
Coverage of computational biology and genomics in the general media continues to increase. This time the Economist covers bioinformatics and the New York Times has an article on computational biology and cancer. And while public funding for some of the current genome centers is being cut by as much as 20 percent, new centers in New York and Connecticut are hoping to benefit from the increase in demand.
Clinical grade sequencing
Some of this demand is driven by a general trend towards clinical sequencing, which is benefiting the Broad and other centers. Given the inroads made in identifying causal mutations in 2011 (nicely summarized by MassGenomics for disorders and cancers), this is perhaps not surprising, and even clinical trials for personal genome sequencing are kicking off.
While the technology is making rapid progress, we still have to deal with a large number of problems, among them the handling of genomics and privacy, how to make sense of the results in the first place (something that new initiatives like openSNP are trying to address), and the discrepancies caused by differences in sequencing technologies and data processing. We will need a public assessment or competition of workflows and methods, similar to what the Assemblathon and GAGE have been doing for genome assembly approaches.
Resources and discussions
- Neil Saunders has started a lovely series on next-generation sequencing (NGS) for those familiar with good old Sanger sequencing
- A renewed discussion about public access to research funded by tax payers
- More thoughts on hypothesis- vs data-driven science, and, from Daily Life, ruminations on data-intensive science and workflows
- If you are new to the field of metagenomics this review is a great starting point
- New tool for quality control in RNA-seq: RNA-SeQC and an alternative workflow
- Two additional papers listing biases in RNA-seq introduced by barcoding
- OmicsOmics provides a comparison of IonTorrent and MiSeq
- Our own Brad Chapman provides a summary of BioCloudCentral, a simple way to get started with sequence analysis using Galaxy and CloudMan
- Speaking of which, CloudMan gets an update
- James Taylor wrote up a Galaxy tutorial for ChIP-seq analysis
- SomaticSniper has been released, hopefully the first of many tools made available by WashU
See you in a week or two!
The Stem Cell Discovery Engine (SCDE) is an integrated platform that allows users to consistently describe, share and compare cancer and tissue stem cell data. It is made up of an online database of curated experiments coupled to a customized instance of the Galaxy analysis engine with tools for gene list manipulation and molecular profile comparisons. The SCDE currently contains more than 50 stem cell-related experiments. Each has been manually curated and encoded using the ISA-Tab standard to ensure the quality of the data and its annotation.
The use of open source tools and community-driven standards was motivated by the desire to establish the SCDE as a community resource that encourages contributions of tools and new data sets from other stem cell researchers. The ISA-Tab framework is gaining support as a standard for scalable data capture and annotation, and makes it possible for SCDE to accommodate diverse data types. The Galaxy development community is growing rapidly and as a result, new methods are quickly being integrated into this framework, many of which will be applicable to stem cell data analysis. We are actively collaborating with the Galaxy team as well as developing our own tools for Galaxy, and will continue to align with this community resource as it is likely to yield benefits as we scale up and acquire new data and new data types.
CHB has supported the development of the SCDE with funding from the Harvard Stem Cell Institute and an NIH GO grant on Comprehensive Phenotypic Comparison of Normal and Cancer Stem Cells. Shannan Ho Sui, the lead author on the NAR paper, would love to hear your feedback.
Ho Sui, S. J., Begley, K., Reilly, D., Chapman, B., McGovern, R., Rocca-Serra, P., Maguire, E., et al. (2011). The Stem Cell Discovery Engine: an integrated repository and analysis system for cancer stem cell comparisons. Nucleic Acids Research.
On November 16th Fiona Brinkman presented “Metagenomics Analysis of Watershed Microbiomes – Toward Improved Water Quality Monitoring” at the Bioinformatics Core forum.
Her talk is now online along with the slides. Feedback welcome!
A quick summary of interesting publications, articles and resources the CHB staff encountered this week:
News and publications
- Reproducible research received coverage last week, most notably in articles in Science (commentary by Roger Peng in Simply Statistics) and The Wall Street Journal. While tools like Sweave or knitr help by keeping code and results in sync, consistently reproducible research is far from a solved problem. In particular, different versions of databases or bioinformatics tools increasingly cause problems in next-generation sequencing experiments. A problem closely associated with reproducible research is the handling of large-scale data storage, including issues around data security, redundancy and versioning, which is also slowly starting to make headlines.
- The WSJ also covered Open Science and citizen scientists in the same issue (also see the commentary from Jeff Leek). Researchers, scientific journals and funding agencies are still struggling to find the best approach to foster collaboration and give data back to the public, and any awareness raised through articles like these helps
- Daniel MacArthur (who recently accepted a new position at MGH) takes down the Independent’s coverage of a sleep study. Recommended reading.
- The European Bioinformatics Institute launches a metagenomics framework
- A concise summary of the current next-generation sequencing field from The Molecular Ecologist, based on an article by Travis Glenn. Highly recommended as a resource for planning and budgeting the next sequencing experiment
- Dan Koboldt comments on the findings of a new publication from the Shendure group on filtering strategies for exome sequencing and the need for matched normals
- Yet another learning-R book, but the new The Art of R Programming is getting stellar reviews
- And if you decide to delve into Python, Mir Nazim published an excellent overview of the current Python ecosystem
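On the reproducibility theme above, one small habit that helps with version drift is writing a manifest of the tool versions used in an analysis; a minimal sketch (the tool list is hypothetical, and version flags vary by program):

```python
import json
import subprocess
import sys

def record_versions(tools):
    """Capture the version string each tool reports, so an analysis
    can later be rerun against the same software stack."""
    manifest = {"python": sys.version.split()[0]}
    for tool, flag in tools:
        try:
            out = subprocess.run([tool] + ([flag] if flag else []),
                                 capture_output=True, text=True)
            text = (out.stdout or out.stderr).strip()
            manifest[tool] = text.splitlines()[0] if text else "unknown"
        except FileNotFoundError:
            manifest[tool] = "not installed"
    return manifest

# Hypothetical tool list; real pipelines would also record database releases
manifest = record_versions([("bwa", None), ("samtools", "--version")])
print(json.dumps(manifest, indent=2))
```

Dropping such a manifest next to each set of results costs nothing and turns “which aligner version was this?” from archaeology into a lookup.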
See you next week!
MicroRNAs (miRNAs) are small RNAs that regulate gene expression by binding to mRNAs bearing a partially complementary sequence. miRNAs decrease the stability or translation of mRNA targets, leading to reduced protein expression. Understanding the biological function of a miRNA requires identifying its targets, and current target prediction systems yield a high false-positive rate.
Because we have a background in working with complex biological network relationships, CHB was invited to support a collaborative effort led by Ashish Lal from the Lieberman lab at the Immune Disease Institute, including researchers from Harvard Medical School, Harvard School of Public Health, Children’s Hospital, NCI and Memorial Sloan Kettering Cancer Center.
As part of the project just published in PLoS Genetics, Ashish developed a sensitive and specific biochemical method to identify candidate microRNA targets that are enriched by pull-down with a tagged, transfected microRNA mimic. The method was used to isolate mRNAs pulled down with a transfected, biotinylated miR-34a, a tumor suppressor inhibiting cell proliferation, in K562 and HCT116 cancer cell lines.
One of the key challenges in the data analysis was to identify common biological functionality and interactions within the hundreds of identified target genes, which we tackled by integrating gene-gene interaction information from KEGG, WikiPathways and GeneGo’s MetaCore.
Transcripts pulled down with miR-34a were highly enriched for roles in growth factor signaling and cell cycle progression, forming a dense network regulating multiple signal transduction pathways involved in the cell proliferative response.
Congratulations to Ashish and the whole team!
We are looking forward to a new Bioinformatics Core Seminar season and the first talk given by Fiona Brinkman on the metagenomics analysis of watershed microbiomes and improved water quality monitoring. Dr Brinkman is a Professor and Michael Smith Foundation for Health Research Senior Scholar at the Department of Molecular Biology and Biochemistry of Simon Fraser University, Vancouver.
The talk will be held at HSPH, seminar room FXB G13 on November 16th, 12:30 – 2:00; see the contact page on how to find us.
Her talk will focus on initial data from the project along with the development of bioinformatics tools to characterize the gene content of microbiomes:
“Access to clean drinking water is critical for maintaining public health. Currently, water quality is primarily assessed at the tap using approaches such as coliform counts that fail to detect the complete spectrum of water pathogens, or are too slow to be used as tools for real-time decision-making. What is required is water quality assessment that is more accurate, monitors upstream of the tap to identify problems sooner, and provides a method for identifying and remediating microbial pollution events. We are using a metagenomic approach to measure the health of pollution-impacted vs protected watersheds, plus potential pollution sources, over space and time. This first in-depth study of watershed microbiomes is comparing both shotgun and amplicon sequencing of protist, bacterial and viral sequences. Our overall aim is to develop novel molecular tools for the detection of a wider range of microbes which better reflect watershed health and facilitate microbial pollution source tracking.”
Looking forward to seeing you!
Fridays 10-11am, Biostatistics Department, Breezeway
The weekly Bioinformatics Brunch is an informal get-together where students, postdocs, and faculty interested in computational biology can meet to talk about their coursework, research, methodological problems or solutions, the latest papers and publications, and how good bagels and coffee taste on a Friday morning. We’d particularly like to welcome new students, and anyone with any degree of interest in the department’s bioinformatics courses and research is more than welcome to stop by. We hope to see you there!