You are a life scientist undertaking cutting-edge research that involves experiments in which you collect omics data: transcriptomics, genomics, proteomics, methylation, RNA-Seq, ATAC-Seq, single-cell, etc. You are working on an exciting and very important problem. You have great samples and an elegant experimental design. You have performed your experiments and collected your data. Now you only need to extract from these data the insights you know are there, and put together convincing results proving your hypotheses. That’s the data analysis part: transforming omics data into knowledge. What will you do to get these data analyzed? Getting to this point was very expensive: doing the experiments and getting the samples run already cost you thousands or tens of thousands of dollars. How much will the data analysis cost you? Can you maybe get it done for free?
Well, what did you do the last time you needed a haircut? Did you have your hair cut by a student stylist or by an experienced stylist? The former would be much cheaper, even free. If you haven’t used a student stylist, you probably care – even minimally – about your appearance. When it comes to analyzing your data, your work and scientific reputation are at stake. Would you not care about your work and standing in the scientific community at least as much as you care about your hair?
Here are 5 ways to avoid paying for data analysis software (and their hidden costs):
Why would you pay for software to analyze your omics data? There seem to be so many ways in which one can get their data analyzed that one should be crazy – or at least unwise – to actually pay for such software. Let’s examine these:
1. Use one of the many free software analysis platforms available on-line.
The benefit is obvious: you can analyze your data. Now. Free of charge. It doesn’t get any better than this. Or does it?
While it may seem like a great way to save money, consider how using such free analysis platforms will affect you and the quality of your work in the long run. They say there is no such thing as a free lunch. Is saving money now going to cost you more in the long run?
A recent paper published in Nature Methods tried to answer this question (“Impact of outdated gene annotations on pathway enrichment analysis,” Nature Methods, vol. 13, no. 9, September 2016). The authors of this paper (not related in any way to us) analyzed 3,900 publications from the preceding several years and made an astonishing discovery: 67% of the papers they analyzed used outdated software that captured only 26% of the biological processes and pathways known at the time. Yes, you read that right. Over two-thirds of these papers based their conclusions and findings on only about one quarter of what was known at the time. With results like this, you may be saving money, but you are losing time, effort, and – potentially – credibility. The last thing you want is to hear from the reviewers of your grant proposal or paper that you failed to consider such-and-such pathway or phenomenon that has been known for a while.
If we cross free websites off the list, what other options do we have?
2. Task one of your students to analyze your data.
The advantages are many: the student is already there, the student may be paid from other sources, they need to learn to do this kind of thing anyway, etc. Furthermore, there is a lot of high-quality, free bioinformatics analysis software available, such as the packages in Bioconductor. Why not get a student to use those?
Yes, an existing student will save you money upfront, but the quality of the results will not be on the same level as that of an experienced data analyst (no offense to the student). First, if you are generating omics data, you are probably a life scientist, training life science students. That means the student probably does not have the background and experience necessary to successfully analyze complex omics data. You may have really bright students, but there is only so much they can learn on their own in a new area. I have been using R for many years, authored several packages in Bioconductor, advised more than 10 PhD students in this area, and written a book using R, and I still don’t feel I have fully mastered the R environment. How do you think your student would fare in this respect? Furthermore, it is easy to push buttons and call routines written by other people; the crucial skill is being able to look at the results, understand whether they are valid or not, and know how to fix things when something is wrong. Whether you like it or not, there is a substantial chance of errors and inaccurate analysis. You may not extract from your data all the knowledge that you could if a proper analysis were done. The results may not be convincing. You may end up pursuing false positives, which can set your research back several months. In the business world, these are called “opportunity costs”: all the invisible and intangible things you would have done if you had had correct and complete results sooner. You may miss important deadlines, or you may have to downgrade the journal you send your next paper to for lack of convincing analysis results. None of these will reduce the balance in any of your research accounts, yet they all represent significant costs that will impact your professional standing over the longer term.
3. Hire a student/postdoc/research assistant for this specific purpose.
If you do this, you may address some of the problems above. In principle, you can hire somebody with a data science/data analysis background and experience. Since this is their specialty, they are likely to know what they are doing with your data, and there is plenty of high-quality data analysis software out there for somebody with the right background and experience. However, there are some downsides. First, this is a very expensive solution. You are not paying for any software, but you are paying much more for personnel: between salary and fringe benefits, it would probably cost you $45K to $80K per year to keep a full-time data analyst on your team. That is probably an order of magnitude more than what you would pay for state-of-the-art data analysis software. Furthermore, the volume of data you generate may not be enough to keep this person busy 100% of their time.
4. Collaborate with someone else to analyze your data.
That also sounds great in theory: find a collaborator who is an expert in data analysis, such as a computer science, statistics, or biostatistics faculty member, and have them analyze your data. Here are the issues with this solution. If it is really a faculty colleague you are collaborating with, they probably have their own research program focused on developing novel methods and algorithms, so it is likely that they will push the task of analyzing your data to one of their students or postdocs. And that puts your precious data back in the hands of a student – albeit probably a more qualified one. Another possibility is that your colleague develops a new analysis approach and uses it on your data. That is not quite what you want either. When you publish your paper, you probably want to use methods and algorithms that are _already_ peer-reviewed and accepted as valid. The last thing you want is to have to fight to prove the validity of your results and the validity of the data analysis methods at the same time, in the same paper. Of course, you can wait until your colleague submits and publishes the new method and then publish your paper, but that would cost you a few months…
5. Use a data analysis core facility (e.g. a bioinformatics, or biostatistics core)
As the former director of a bioinformatics core at a major comprehensive cancer center, I can tell you this is not a bad option. Core facilities are staffed by highly qualified experts who definitely know what they are doing; their job is precisely to help people like you. Also, you do not need to pay them all year long, but only when you need them, for as long as you need them, so the money you spend on data analysis is directly proportional to your needs (e.g., the number of experiments you undertake and the amount of data you generate). The disadvantages include a relatively high cost per analysis. Core facilities need to cover their costs, including space, personnel, equipment, and the software they use – yes, most core facilities actually use commercial software because it increases their productivity while providing the accuracy and quality they need. Other disadvantages relate to the core’s availability and response time: most cores are very, very busy, so you may need to wait until somebody on their team gets to your data. The most important point about using a core facility is to have the correct expectations, both in terms of timeline and in terms of deliverables (also see below).
So what is the best way to analyze my omics data?
When I was the core director at my previous institution, I often received data from core users with the request to “just analyze it.” This is not the best way to do science. As a bioinformatics expert, I or anybody on my team could deploy the most sophisticated analysis tools and algorithms on any given data set. However, regardless of my knowledge and skills, I – as a bioinformatician – am not the best person to analyze your data. I know nothing about your experiment, phenotype, or conditions. I have no idea about the hypotheses that led to the design of the experiment that generated the data. YOU, on the other hand – the PI who conceived and designed this experiment – are the best person in the world to analyze your data. Sometimes PIs completely delegate the analysis of their experiments, whether to a younger collaborator or to a core. In my personal opinion, this is a mistake. The often-encountered scenario in which one provides the raw data at one end, applies a set of pre-established algorithms, and gets a set of tables, figures, and other artifacts at the other end is very suboptimal. Data analysis should be a process carried out by a very experienced person with deep knowledge of the phenotype, the experiment, and the hypotheses behind them. It should be an intelligent and highly sophisticated exploration in which the human expert asks very specific questions of the data.
How can this be done? It is not very difficult. If you work with a collaborator or a core, find the time to sit down with the data analyst and go through the process together. You – the project leader and the mastermind behind the experiments – should ask specific questions and explain the hypotheses the experiment was designed to test. Your software and/or your data analyst should be able to explore these hypotheses one by one and provide evidence to prove or disprove them.
Our iPathwayGuide software was designed with this goal in mind. Our platform allows you – the life scientist – to explore your data in a very intuitive way, without any command lines or code. With a little guidance from you, our software can tell you the story behind your experiment. You can ask specific questions based on your knowledge of the phenotype and the hypotheses behind your experimental design, and the software will identify the mechanisms behind the measured changes. You can ask questions such as “How do my differentially expressed genes that are involved in the MAPK pathway and present in the cell membrane affect apoptosis?” or “How would this mechanism be affected if I chose to treat my subjects with drug X?” An automatically generated written report, the ability to customize figures to your needs, and the ability to share results for free complete the set of capabilities that will greatly increase your productivity and help you make rapid strides in your research. Our very satisfied customers range from individual PIs at institutions such as Stanford, the University of Chicago, Columbia, and the Medical University of South Carolina, to the National Institutes of Health, to biotech companies. Get in touch with us if you want to find out how your research can get to the next level!