How are you currently analyzing your RNASeq and proteomics data?
Are you currently using one of the many free tools available for pathway analysis, such as DAVID? Or, are you perhaps using Qiagen’s IPA? If yes, you are using an enrichment approach. This will severely limit your ability to identify the pathways that are truly impacted in the condition you are studying. An enrichment analysis is also likely to lead you to many dead ends by providing a lot of false positives that have no connection with the phenomena that are truly going on in your phenotype. This post will show you a side-by-side comparison between enrichment analysis and impact analysis on a real data set. This is a clear example in which the commonly used enrichment analysis both fails to identify the pathways truly related to the cause of the phenotype, as well as yields false positives. In contrast, the impact analysis reports both more true positives, as well as fewer false positives.
Defining our expectations
Let’s start by defining what we expect from a good pathway analysis. The input to the pathway analysis is a set of genes measured in several samples of two or more phenotypes (conditions) with the goal of understanding the differences between those phenotypes. These two conditions can be disease vs. healthy, treated vs. control, drug A vs. drug B, etc. To keep things simple, let’s focus on a classical treated vs. control comparison. Such a study would aim to identify the pathways that are significantly impacted by the given treatment and how these pathways produce a certain phenotypic effect. For instance, if a treatment inactivates a specific gene, we would like the pathway analysis to help us identify the pathway(s) containing that particular gene. Pretty simple and reasonable expectations, right?
As I explained elsewhere, it is difficult to assess various pathway analysis methods because, in most cases, it is difficult to know the pathways that are truly involved in a condition. But that’s in most cases. Let’s look at a case in which we actually know the exact cause of the phenotype. The GSE 70302 dataset from GEO contains the results of an experiment in which a certain gene, IL1a, was knocked-out (KO-ed) in mice. In this experiment, four IL1a-KO mice were compared to four C57BL/6 control mice. So in this particular case, we are looking at a phenotype which was created by the inactivation of the IL1a gene. We will pretend we do not know this and we will analyze the measured expression changes across the entire genome, trying to see if we can correctly locate the disturbance. As outlined above, we would like a good pathway analysis to identify the pathways that are “related to the phenotype”. In this case, this means identifying the pathways that contain the IL1a gene that was knocked out. Those are the pathways that were in fact disrupted and caused the given phenotype. Here is a list of these pathways:
So the ones above are the pathways that we would like our analysis to identify and report to us since every one of them contains the KO-ed gene.
Enrichment analysis results
As detailed in the “Pathway Analysis vs Gene Set Analysis“, the enrichment analysis only takes into consideration the number of differentially expressed genes (DEGs) that fall on a given pathway. The idea is very simple: the method compares the number of DEGs as a proportion of the total number of genes, with the proportion of DEGs that fall on a given pathway. If the proportion of DEGs that fall on a pathway is significantly higher than the proportion of DEGs in the entire gene set, the pathway is considered to be “enriched” in DEGs and thought to be related to the phenomena that caused the phenotype. For instance, if there are 1,000 DEGs out of 10,000 measured genes, it means that 10% of all genes are DEGs. If we were to pick a random set of genes from this set, we would expect that about 10% of them will be DEGs. Now, we look at each pathway, one at a time. If on a pathway that contains a total of 20 genes we find 5 DEGs, this pathway would be considered as enriched. This is because the proportion of DEGs on this pathway is 25% (5 out of 20), which is significantly higher than the 10% expected just by chance.
Let’s see this approach in action. Here are the results of the classical enrichment analysis:
The column you want to pay attention to is the pFisher column, which contains the p-values calculated based on this enrichment approach. The pathways that contain the KO-ed gene, IL1a, are highlighted. You will notice a couple of things. First, the most significant pathway identified by this approach is “Axon guidance” which has nothing to do with the KO-ed IL1a gene. Furthermore, two out of the top three pathways are false positives. If we look further down, only three of the top 10 pathways contain the cause of the phenotype, the IL1a gene. Let me put this in a different way: 66% of the top 3 and 70% of the top 10 pathways are false positives. That is scary!! This analysis definitely sends you on a wild goose chase by reporting all these false positives.
Impact analysis results
Let us now look at the results of the impact analysis. In brief, the impact analysis constructs a mathematical model that captures the entire topology of the pathway and uses it to calculate a perturbation for each gene. Then, these gene perturbations are combined into a total perturbation for the entire pathway and a p-value is calculated by comparing the observed value with what is expected by chance. This is the p-value shown in the column pPert. Here are the results of the impact analysis:
Here, you can see two things, both very important. First, the top 3 pathways are all significant and involve the true cause of the phenotype, the IL1a gene. Furthermore, the top 7 pathways, sorted by the pPert column, include 6 true positive pathways and only one single false positive!
Do you really want to continue using the enrichment analysis above?? If you want a better approach, the impact analysis is available as part of iPathwayGuide. If you want to apply the impact analysis on your data or do a similar comparison yourself, you can create a free account here and analyze several free datasets available in your account.
Why does this happen?
The limitations of the simple enrichment analysis are easy to explain. First, the enrichment analysis calculates a p-value using a statistical model that assumes that all genes are independent. You do not need to be a statistician to realize that this is obviously wrong. The very reason of existence of a pathway is to describe the complex dependencies between the genes involved in that pathway. Given this, it is absolutely crazy to ignore all these well documented dependencies and decide whether a pathway is significant based on the assumption that all the genes on the pathway are independent.
Second, the enrichment analysis completely ignores the position and roles of the genes on the pathways, all the direction and type of the signals between the genes, etc. There are signals that capture the specific influence that each gene has on the genes downstream of it: activation, repression, etc. Each signal and dependency described by a pathway probably took many experiments and many months or years to prove. All this rich knowledge that has been gathered over decades of research by many scientists is completely ignored by the enrichment approach.
The only type of evidence captured by the enrichment analysis is the relative enrichment of a pathway in DEGs. Take for instance the insulin pathway from KEGG:
The insulin receptor INSR is the key entry point in this pathway. If this gene is not functioning properly, the entire pathway will be disrupted. And yet, if this is the only gene that is inactive, the enrichment analysis will not yield a significant p-values. This is because as far as the enrichment approach is concerned, this pathway only has one DE gene, out of many genes on this pathway.
In contrast, the impact analysis builds a mathematical model consisting in a system of equations that capture and describe every single interaction on the pathway. Thus, this model will take into consideration the position and role of every gene on the pathway, the direction and type of every signal from one gene to another, as well as any potential feedback loop that might exist. Based on this system of equations, the impact analysis can calculate the perturbation at the level of every gene. These perturbations can then be used to accurately calculate a perturbation at the level of the entire pathway which can be translated into a p-value that will help you identify the truly impacted pathways, as well as avoid false positives.
If you prefer to read the technical paper(s) that introduced the impact analysis please refer to:
A systems biology approach for pathway level analysis, Genome Research, 2007
A novel signaling pathway impact analysis (SPIA), Bioinformatics (2009)
which can be found on the Science page under Publications by Advaita Team. The science behind this approach is well vetted. These two papers have accumulated over 1,400 citations at the time of writing this.
What to do
If you want to apply the impact analysis to analyze your data or simply do a side-by-side comparison yourself, you can. The impact analysis is available as part of iPathwayGuide. Create a free account here and analyze several free datasets available in your account.