My previous blog ended with an interesting question: has anybody really compared side-by-side gene set analysis methods and pathway analysis methods on the same data sets? Well, the answer is: not yet! The main reason for this is that it is extremely difficult to objectively validate the results of a pathway analysis method. It is even more difficult to compare the results of different pathway analysis methods. In this blog, I explain why such comparisons are difficult and I propose 3 approaches and a benchmark that will make such comparisons possible.
In a recent peer-reviewed paper, we reviewed not fewer than 22 pathway analysis methods. There are at least as many, if not more, gene set analysis methods (see our earlier review as well as the one written by my former PhD student, Purvesh Khatri, currently Associate Professor at Stanford). Most of these approaches have been published in peer-reviewed papers. Publishing a peer-reviewed paper usually requires the authors to demonstrate that their proposed approach: i) is novel and ii) does better than the existing approaches. So in principle, simply taking the most recently published approach should provide the best method available, right? Well… nope! You wish it were that simple! So how do these papers get published then? Well, here are the most common ways in which people “prove” their methods and the problems related to these approaches. Most papers will use one or more of these. Some are better, some are bad, and some are really bad.
Many people run their analysis on simulated data sets. This is particularly popular with statisticians and computer scientists (of which I am one) for whom numbers and data are as tangible and trustworthy as a ton of bricks (maybe we are still plugged into the Matrix). In a 2003 paper in Bioinformatics, I used this approach myself to demonstrate the superiority of a new approach for selecting the differentially expressed genes. Fig 2 in that paper (included below) shows a beautifully designed data set that had everything we cared about: up-regulated genes, down-regulated genes, noise, controls – everything.
The problem with this is that any simulated data set is constructed based on a set of assumptions. Furthermore, the simulated test datasets are designed by the same people who design the analysis method. If these people think that a certain number of factors are important and will lead to better analysis results, those factors are going to be integrated in the analysis method. This is good because this is how we get better and better methods, by thinking about new factors that could influence the results and developing methods able to correctly deal with these factors. However, the problem is that the same factors are also going to be used in generating the data on which the proposed analysis method will be tested. This means that testing new methods in this way will have some inherent bias. This has nothing to do with the integrity of the scientist doing the work. It is just impossible to develop a method and then generate test data that is completely independent of all assumptions made when the method was developed. This is the same reason why drugs are tested in double-blind studies in which neither the patient nor the administering physician knows whether they have the real drug or a placebo.
No problem, you will say. Just have some people generate some datasets using whatever characteristics they think are present in the real data and then have everybody test their methods on those simulated data. Unfortunately, this wouldn’t work either. Let’s say that I came up with a very comprehensive set of data that could be used by other people to test their new pathway analysis methods. First, it would be very difficult to convince everybody to use my data. Second, a method that did perfectly on this data set would only have learned whatever data features and characteristics I had chosen to include in my simulated data. There would still be no real evidence that this method would work well on real data.
Finally, life scientists, who are the people we want to use our methods, tend to trust numbers and simulations much less than results coming from real experiments.
Advantages: complete control over the data sets; they can exhibit any features or characteristics that the designer thinks are important.
Disadvantages: complete control on the data sets (yes, it’s the same as above); there is an epistemological limitation: one cannot embed data features one is not aware of and the methods tend to be designed precisely for the data characteristics exhibited by the simulated data.
Bottom line: simulations are not so good. They are intrinsically biased and not well accepted by life scientists. Methods validated this way tend to be published in statistical or computationally oriented journals and tend to be adopted less widely.
“PubMed validations” or using results obtained on a couple of real datasets
In this scenario, people take a couple of real datasets and include results obtained on them. Sometimes, depending also on the quality of the journal and the reviewers, the authors just show the results obtained and justify why they are happy with them. A typical argument will be: the method proposed here has found such and such pathways as being significant and there are all these papers linking these pathways (or genes present on these pathways) with the condition studied here. I call this “PubMed validation”. There are several problems with this validation approach. First, living organisms are very complex systems. Almost any analysis result will be supported by some references, which makes an unbiased and objective comparison of various analysis methods practically impossible. Given access to PubMed and some time, any postdoc worth their salary will find sooner or later a paper linking almost anything with anything else. Without a deep knowledge of the phenomena involved in the given phenotype, it is almost impossible to judge whether such connections are really meaningful or not. Second, the life science expert doing this search is invariably a co-author on the paper. Whether we like it or not, we are all human beings. Any kind of validation performed by somebody who has a vested interest in the successful outcome of the given validation cannot be objective and is not scientifically sound. The scientific method requires formulating some hypothesis beforehand (e.g. the pathway analysis method proposed here is better than x, y and z existing methods) but also defining beforehand what a successful outcome of the experiment should look like. Performing the experiment, observing the results and then searching the literature for evidence supporting whatever findings are observed is NOT scientifically sound.
Advantages: any method can be shown to have good results on some data set
Disadvantages: not objective; not scientifically sound
Bottom line: PubMed validations are really bad. I would not trust any recent method that is proposed based on such a validation approach.
Using target pathways
A better way of testing and validating a new method is taking some datasets (the more the better), and establishing beforehand some pathways that are known to be related to the given condition. An example would be the cell cycle pathway, which is going to be involved in many if not all cancers, because cancer is essentially cell replication gone amok. This approach is somewhat similar to the one above because one still references literature supporting the connection between the “related” pathways and the condition. The crucial difference is that these related pathways are chosen beforehand, thus making the validation scientifically sound. As an example, we used this approach in our 2007 Genome Research paper describing the very first pathway analysis approach able to take into consideration the position of the genes on the pathway, as well as the direction and type of every signal described by the topology of the pathway.
The problem here is that by the time a paper is submitted for publication, there is no way to know whether these related pathways have truly been chosen before the experiment, or – much like in the PubMed validation above – first found to be significant and then the literature was brought in to merely support the findings. So even though this is a valid scientific approach, it cannot really be distinguished from the PubMed validation above.
Is there a way to make this better? Absolutely! There are two weaknesses in the above: i) the use of a small number of data sets and ii) the lack of absolute objectivity. One can address both in a very elegant way. Let us just consider those experiments that compare a well-studied condition that already has an associated pathway, versus controls. Such a condition could be, for instance, colorectal cancer. This condition is sufficiently well studied that there exists a colorectal cancer pathway, which aims to describe precisely the phenomena involved in this disease. Any experiment comparing colorectal cancer samples versus controls should find this pathway as significantly impacted because this is the pathway that describes this disease. We will call such a pathway a “target pathway”. This connection is absolutely objective and can be established in an undeniable way before any experiment is performed. It turns out that there are a number of such conditions that have associated pathways and many data sets from experiments studying these conditions. In our 2012 paper proposing PADOG, we show the superiority of the proposed method using not fewer than 24 different datasets involving 11 different diseases. In each case, we look both at the p-value of the target pathway, as well as at its rank. A better method should report the target pathways as significant and rank them highly (low rank numbers). Note that this approach is: i) using a very large number of datasets (an order of magnitude more than in the methods above) and ii) completely objective (all true positive pathways are clearly defined in advance of any experiment). Furthermore, since no human interpretation is required, this approach is also completely reproducible as well as suitable for repeated large-scale testing of many methods.
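To make this scoring concrete, here is a minimal sketch in Python of how one might compute the rank and p-value of the target pathway in each dataset and then summarize a method across datasets. The function names, the data layout (a dictionary of pathway name to p-value per dataset), the tie-handling rule and the toy numbers are all assumptions made for illustration; they are not taken from the PADOG paper.

```python
from statistics import median


def target_rank_and_p(pvalues, target):
    """Return (rank, p-value) of the target pathway in one dataset.

    pvalues maps pathway name -> p-value reported by the method under test.
    Rank 1 is the most significant pathway; ties receive the worst tied rank.
    """
    p_target = pvalues[target]
    rank = sum(1 for p in pvalues.values() if p <= p_target)
    return rank, p_target


def summarize(results):
    """Summarize one method over many datasets.

    results is a list of (pvalues, target) pairs, one pair per dataset.
    Returns the median target rank and the median target p-value.
    """
    scored = [target_rank_and_p(pvalues, target) for pvalues, target in results]
    return {
        "median_rank": median(rank for rank, _ in scored),
        "median_p": median(p for _, p in scored),
    }


# Made-up p-values for three hypothetical datasets (illustration only).
toy_results = [
    ({"Colorectal cancer": 0.001, "Cell cycle": 0.01, "Olfactory": 0.60},
     "Colorectal cancer"),
    ({"Colorectal cancer": 0.04, "Cell cycle": 0.002, "Olfactory": 0.90},
     "Colorectal cancer"),
    ({"Alzheimer disease": 0.20, "Cell cycle": 0.03, "Olfactory": 0.50},
     "Alzheimer disease"),
]

summary = summarize(toy_results)
print(summary)
```

Lower values are better for both summaries. Raw ranks are only comparable if every dataset is tested against the same number of pathways; otherwise one would probably want to divide each rank by the number of pathways tested.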
So we are done, right? This seems to be the perfect way to assess and compare these methods, doesn’t it? Not quite. This approach is by far superior to the others discussed so far but it’s not perfect. The problem is that this approach only focuses on one target pathway for each condition. We know that pathway is a true positive but we do not know what other pathways are also truly impacted. Let us assume for instance that there are 10 true positive pathways in a given experiment. Out of these ten, this approach focuses only on one, the target pathway, the one created to describe the particular phenotype studied. One analysis method can correctly report all true pathways as significant with the target pathway in position 10, which would be a perfect result. Another method, however, can report the target pathway as significant and in position 1, while also reporting 9 other incorrect pathways as significant. In other words, method 1 reports 10 significant pathways and all are correct, while method 2 reports 10 significant pathways with 9 out of 10 false positives. In this case, this particular assessment approach would report method 2 as being superior because it only looks at the rank and p-value of the target pathway. However, this disadvantage is mitigated by the fact that this approach can be used in an automatic manner, on a large scale. If such a perverse situation can occur in one or two data sets, it is very unlikely that it will occur in tens of datasets involving many different conditions.
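The perverse case described above can be reproduced in a few lines of Python. This is only a sketch: the pathway names are placeholders, and the full list of true pathways is something we would not actually know in a real target-pathway benchmark, which is exactly the point.

```python
def target_rank(ranked_pathways, target):
    """1-based position of the target pathway in a method's ranked list."""
    return ranked_pathways.index(target) + 1


def precision(reported, true_pathways):
    """Fraction of reported pathways that are truly impacted."""
    hits = sum(1 for pathway in reported if pathway in true_pathways)
    return hits / len(reported)


# Hypothetical ground truth: 9 other true pathways plus the target pathway.
true_pathways = {f"true_{i}" for i in range(1, 10)} | {"target"}

# Method 1: all 10 reported pathways are correct, target in position 10.
method1 = [f"true_{i}" for i in range(1, 10)] + ["target"]
# Method 2: target in position 1, followed by 9 false positives.
method2 = ["target"] + [f"false_{i}" for i in range(1, 10)]

for name, ranked in (("method 1", method1), ("method 2", method2)):
    print(name,
          "target rank:", target_rank(ranked, "target"),
          "precision:", precision(ranked, true_pathways))
```

Judged by target rank alone, method 2 wins (rank 1 versus rank 10), even though only 1 of its 10 reported pathways is correct while all 10 of method 1's are.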
Advantages: completely objective; reproducible; can be used in large-scale testing of many methods
Disadvantages: only focuses on a single true positive in each data set
Bottom line: one of the best approaches for large-scale testing currently known to human-kind but some care is needed when interpreting the results
Using knock-out datasets
I think the bottom line here is that we, as computer scientists and statisticians, are striving to develop methods to help our life scientist colleagues better understand the results of their experiments and uncover the true underlying biological phenomena. If we want to truly demonstrate our methods, we have to use real datasets. But wait, you will say. The problem with the real data sets is that you never know the ground truth. If you do not know this ground truth, you cannot possibly assess the results of any pathway analysis method. Actually, it is not true that the true signal is unknown in all real data. There are many real data sets in which the true pathways are actually known. The best such data come from knock-out (KO) experiments. In these cases, the phenotype was created by knocking out a single, specific gene. The source of the perturbation is known. Every pathway that contains the KO gene is a true positive because it contains the true cause of the phenotype and is genuinely implicated in the changes associated with the phenotype. Similarly, every pathway that does NOT contain the KO gene is a true negative because it is not related to the cause of this phenotype. A search in the Gene Expression Omnibus (GEO) for data sets coming from KO experiments reveals not fewer than 7015 such data sets. In every single such data set, all true positive and true negative pathways are precisely known. Here is a link to many data sets in which the ground truth is known. Using such data sets, one can calculate the usual quality measures involved in assessing the ability to identify some hidden targets: accuracy, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). If you are rusty on these, I provide some easy-to-understand explanations in Chapter 8 of my book Statistics and Data Analysis for Microarrays Using R and Bioconductor.
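As a rough illustration of this scoring, the following Python sketch labels every pathway containing the KO gene as a true positive and every other pathway as a true negative, then computes the five measures listed above. The function name, the input layout (pathway name to set of gene symbols) and the toy pathways are hypothetical; a real benchmark would load pathway definitions from a pathway database and take the reported pathways from the method under test.

```python
import math


def ko_metrics(pathways, ko_gene, reported):
    """Score one method on one KO dataset.

    pathways: dict mapping pathway name -> set of gene symbols.
    ko_gene:  the gene that was knocked out.
    reported: pathway names the method called significant.
    """
    reported = set(reported) & set(pathways)  # ignore unknown pathway names
    truth = {name for name, genes in pathways.items() if ko_gene in genes}

    tp = len(reported & truth)   # contains KO gene, reported
    fp = len(reported - truth)   # lacks KO gene, reported
    fn = len(truth - reported)   # contains KO gene, missed
    tn = len(pathways) - tp - fp - fn

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are returned as NaN.
        return numerator / denominator if denominator else math.nan

    return {
        "accuracy": ratio(tp + tn, len(pathways)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy example with made-up pathways and gene symbols.
toy_pathways = {
    "A": {"G1", "G2"},
    "B": {"G1", "G3"},
    "C": {"G4"},
    "D": {"G5", "G6"},
    "E": {"G1", "G7"},
}
metrics = ko_metrics(toy_pathways, ko_gene="G1", reported={"A", "B", "C"})
print(metrics)
```

In this toy case the method finds two of the three pathways containing the KO gene and wrongly reports one of the two that do not, giving an accuracy of 3/5, a sensitivity and PPV of 2/3, and a specificity and NPV of 1/2.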
If one wants to be super picky, this approach also has a limitation. In principle, there could be pathways that do not contain the KO gene but are truly affected by it. Some people might want to see those pathways also reported by a good pathway analysis method. Some others might not. The choice here reduces to whether one is interested in the cause of the phenotype investigated or all changes involved in the phenotype. In turn, this is related to the fundamental differences between class comparison and class prediction. But this is a topic for another day.
Advantages: completely objective; reproducible; can be used in large-scale testing of many methods
Disadvantages: does not consider pathways that do not include the cause of the phenotype
Bottom line: one of the best approaches for large-scale testing currently known to human-kind; most likely to convince life scientists.
We have recently deployed the approaches above in a very extensive comparison of 11 different methods on over 2,500 samples from 75 human data sets and 11 mouse data sets. The results of these extensive comparisons are currently under review. A brief description of the results will be posted here as soon as the paper is accepted, so hang on and come back here in a little while…