Once upon a time, the dream of many a life scientist was simply to be able to measure all gene expression changes involved in a comparison of two phenotypes. “If we could only measure all these changes, it would be so easy to understand what is going on here,” the thinking went. Well, this capability has been available for about 15-20 years now. Microarrays first, and RNA-Seq later, made such high-throughput measurements possible. In fact, such experiments are so routine today that they are often outsourced to third parties. And yet, in spite of having these amazing capabilities that were once seen as the holy grail, understanding the underlying biological phenomena still constitutes a formidable challenge.
So, how do we go from a list of genes that are differentially expressed (DE) between the two phenotypes of interest, to actually understanding what is going on? Well, this is what both pathway analysis and gene set analysis are supposed to do. Both approaches aim to take gene expression levels and leverage existing knowledge about the given organism in order to identify the underlying biological processes and mechanisms. In fact, most people do not even realize that there are crucial differences between a gene set analysis and a pathway analysis. The goal of this post is to clarify the differences between these two approaches and present some of the advantages and disadvantages of both.
What is the difference between a “pathway” and a “gene set”?
The first important concept that needs to be clarified is that of “pathway.” Wikipedia defines a biological pathway as “a series of interactions among molecules in a cell that leads to a certain product or a change in a cell.” This is a pretty good definition. A pathway is essentially a description of mechanisms and/or phenomena. A pathway is usually described by a graph that contains nodes and edges. There are several types of pathways (signaling, metabolic, etc.), but the distinction is not important at this point. The important part is that a pathway is meant to describe certain phenomena, interactions and dependencies. In essence, pathways are models describing the interactions of genes, proteins, or metabolites within cells, tissues, or organisms, not simple lists of genes. Well-known pathway databases include KEGG, Reactome, and BioCarta.
In contrast, “gene sets” are exactly what the term says: sets of genes, i.e. unordered and unstructured collections of genes. One can define a gene set as the collection of genes associated with a specific biological process (e.g. cell cycle), location (e.g. on chromosome 1), disease (e.g. breast cancer), or even the set of genes that are present in a given pathway (e.g. the set of 128 genes involved in the KEGG cell cycle pathway). Aside from containing multiple genes, there is nothing that defines a gene set. In fact, it could be completely arbitrary. The Molecular Signatures Database (MSigDB) includes over 10,000 such gene sets defined based on many criteria, some of them seemingly arbitrary.
Here are six situations in which you will want to use pathway analysis, and two in which gene set analysis may be better.
1. You need a pathway analysis – when you care about how genes are known to interact
The crucial difference between a gene set and a pathway is that a gene set is an unordered collection of genes whereas a pathway is a complex model that describes a given process, mechanism or phenomenon. Thus, it is very important to understand the difference between a pathway and its corresponding gene set representation. To illustrate this, let us consider the KEGG MAPK pathway and the MSigDB gene set corresponding to the KEGG MAPK pathway. The left panel of the figure below shows a small part of the MAPK pathway and the right panel shows the gene set corresponding to this part of the pathway.
The left panel shows the location of the various genes or gene products (inside the cell, outside of it, or in the membrane), which genes interact with which other genes, the type of each interaction (activation, repression, phosphorylation, etc.), the direction of the signal propagation, and potentially many other things (e.g. complex formation). The corresponding gene set shown in the right panel has lost all the structure and the additional information captured by the original pathway. This comparison shows how much of the important knowledge captured in pathways is ignored when pathways are treated as simple gene sets.
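The loss of information described above can be made concrete with a small sketch. The fragment below is a hypothetical toy cascade, not a copy of the KEGG MAPK diagram, and the `Interaction` class and `to_gene_set` helper are names invented for this illustration. It stores a pathway as a list of typed, directed interactions and then flattens it into a gene set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One directed edge of a pathway graph (hypothetical structure)."""
    source: str
    target: str
    kind: str  # e.g. "activation", "inhibition", "phosphorylation"


# Toy kinase cascade used only for illustration (an assumption, not KEGG data).
pathway = [
    Interaction("RAS", "RAF1", "activation"),
    Interaction("RAF1", "MAP2K1", "phosphorylation"),
    Interaction("MAP2K1", "MAPK1", "phosphorylation"),
    Interaction("DUSP1", "MAPK1", "inhibition"),
]


def to_gene_set(interactions):
    """Flatten a pathway into an unordered gene set, dropping all edges."""
    genes = set()
    for edge in interactions:
        genes.add(edge.source)
        genes.add(edge.target)
    return frozenset(genes)


gene_set = to_gene_set(pathway)

# Reverse every edge: a biologically different model with the same genes.
rewired = [Interaction(e.target, e.source, e.kind) for e in pathway]
same_set = to_gene_set(rewired) == gene_set  # the gene set cannot tell them apart

print(sorted(gene_set))
print(same_set)
```

Reversing every interaction produces a completely different model, yet the flattened gene set is unchanged. This is exactly the information that a gene set analysis never sees.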
2. You need a pathway analysis – when you want to take full advantage of the sizes and directions of measured expression changes
Early gene set analysis methods took a list of differentially expressed (DE) genes as input and identified the sets in which the DE genes were over-represented or under-represented. The significance of each pathway is measured by calculating the probability that the observed number of DE genes in a given pathway was obtained simply by chance. These approaches are known as Over-Representation Analysis (ORA).
The methods in this first generation rely on a pre-defined threshold, which is used to determine a list of DE genes. The significance of each pathway is then assessed based on the degree to which the pathway is enriched in such DE genes. A pathway that contains significantly more DE genes than expected by chance is more likely to be truly related to the given condition. This approach depends heavily on the criteria used to select the DE genes, including the statistical tests and thresholds used.
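The over-representation calculation described above is often modeled with a one-sided hypergeometric test. The sketch below assumes that model, which this post does not specify. The function name, parameter names, and example counts are all hypothetical.

```python
from math import comb


def ora_p_value(n_universe, n_de, set_size, n_de_in_set):
    """Probability of seeing at least `n_de_in_set` DE genes in a set of
    `set_size` genes when `n_de` of the `n_universe` measured genes are DE
    (upper tail of the hypergeometric distribution)."""
    max_overlap = min(n_de, set_size)
    total = comb(n_universe, n_de)
    tail = sum(
        comb(set_size, k) * comb(n_universe - set_size, n_de - k)
        for k in range(n_de_in_set, max_overlap + 1)
    )
    return tail / total


# Hypothetical experiment: 20,000 genes measured, 1,000 called DE,
# and a 100-gene set that contains 15 of the DE genes (5 expected by chance).
p = ora_p_value(n_universe=20_000, n_de=1_000, set_size=100, n_de_in_set=15)
print(f"ORA p-value: {p:.2e}")
```

Only the counts enter the calculation. Which genes are DE, how strongly they change, and where they sit on the pathway play no role.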
A second generation of methods was designed to eliminate this dependency on the gene selection criteria by taking all gene expression values into consideration. The hypothesis behind these methods is that small but coordinated changes in sets of functionally related genes may also be important, in addition to the genes that have large expression changes. These methods are also known as Functional Class Scoring (FCS) methods. Some of the popular approaches in this group are GSEA, Catmap, GlobalTest, sigPathway, SAFE, GSA, Category, PADOG, PCOT2, FunCluster, and SAM-GS, among others.
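To give a feel for how an FCS method uses the full ranking rather than a thresholded list, here is a deliberately simplified running-sum score in the spirit of a Kolmogorov-Smirnov statistic. It is an unweighted sketch, not the published GSEA algorithm, which weights each step by the gene's correlation with the phenotype and assesses significance by permutation. The gene names and scores are invented.

```python
def running_sum_score(ranked_genes, gene_set):
    """Walk down the ranked list, stepping up on set members and down on
    non-members; return the running-sum value farthest from zero."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    extreme = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


# Hypothetical per-gene scores (e.g. a t-statistic); higher means more up-regulated.
scores = {"g1": 2.1, "g2": 1.8, "g3": 1.5, "g4": 0.3,
          "g5": 0.1, "g6": -0.2, "g7": -0.9, "g8": -1.4}
ranked = sorted(scores, key=scores.get, reverse=True)

top_heavy = running_sum_score(ranked, {"g1", "g2", "g3"})     # members cluster at the top
bottom_heavy = running_sum_score(ranked, {"g6", "g7", "g8"})  # members cluster at the bottom
scattered = running_sum_score(ranked, {"g1", "g5", "g8"})     # members spread out
print(top_heavy, bottom_heavy, scattered)
```

A set whose members cluster at either end of the ranking gets a score of large magnitude, while a scattered set scores closer to zero. Note that only the order of the genes matters here, not the size of their expression changes.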
Note that the measured expression changes cannot be fully used in a gene set analysis. These expression measurements have been used only to identify differentially expressed genes (ORA), or to rank the genes (FCS), but not to estimate the impact of such changes on specific pathways. Thus, ORA techniques will see no difference between a situation in which a subset of genes is differentially expressed just above the detection threshold (e.g. 2-fold) and the situation in which the same genes are changing by many orders of magnitude (e.g. 100-fold). Similarly, FCS techniques can provide the same rankings for entire ranges of expression values, if the correlations between the genes and the phenotypes remain similar. Even though analyzing this type of information in a pathway and system context would be extremely meaningful from a biological perspective, gene set analysis methods miss this opportunity.
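The ORA half of this argument takes only a few lines to demonstrate. The sketch below assumes a simple cutoff on the absolute log2 fold change as the DE criterion. The helper name, gene labels, and values are made up for illustration.

```python
def select_de(log2_fold_changes, threshold=1.0):
    """Return the genes whose absolute log2 fold change meets the cutoff
    (a threshold of 1.0 corresponds to a 2-fold change)."""
    return {gene for gene, lfc in log2_fold_changes.items() if abs(lfc) >= threshold}


# Same genes, changing just above 2-fold in one experiment and
# roughly 100-fold (log2 of 100 is about 6.64) in the other.
modest = {"A": 1.1, "B": -1.2, "C": 0.2, "D": 1.05}
extreme = {"A": 6.6, "B": -6.7, "C": 0.2, "D": 6.5}

de_modest = select_de(modest)
de_extreme = select_de(extreme)
print(de_modest == de_extreme)  # identical DE lists, hence identical ORA input
```

Because both experiments yield the same DE list, any ORA method run on that list must return the same result for both.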
3. You need a pathway analysis – when you want to account for the type and direction of interactions on a pathway
Considering the pathways as simple unordered and unstructured collections of genes, as gene set methods do, discards a substantial amount of information about the biological processes described by these pathways. In essence, all the dependencies and interactions between genes that are meant to capture and describe the biological phenomena involved are completely ignored. Topology-based (TB) methods have been developed in an attempt to include all this additional information in the analysis. Besides gene expression changes, these methods also take into consideration the various positions and roles of all genes on each pathway, as well as all known signals and interactions between genes.
The Impact Analysis that we developed was the first such approach. It was followed by a plethora of over 30 tools and methods that fall in this category, both from us (Pathway-Express [5, 18], SPIA, ROntoTools, BLMA [22, 23]) and from others (NetGSA, TopoGSA, TopologyGSA, DEGraph, PWEA, PathOlogist, GGEA, cepaORA, cepaGSA [12, 13], PathNet, etc.). The common characteristic of approaches in this category is that they use prior knowledge of the pathway topology to derive a gene-level statistic, which is subsequently used to calculate a pathway-level statistic used to rank the pathways.
4. You need a pathway analysis – when you want to predict or explain downstream or pathway-level effects
Let us now consider the implications of these differences from the perspective of an analysis that aims to understand the effects on the underlying biological processes and mechanisms. Gene set analysis methods consider only the set of genes on any given pathway and ignore their positions on that pathway. This is very unsatisfactory from a biological point of view. If a pathway is triggered by a single gene product or activated through a single receptor, and if that particular protein is not produced, the pathway will be greatly impacted and probably completely shut off. A good example is the insulin pathway (see left panel below). If the insulin receptor (INSR, highlighted in yellow) is not present, the entire pathway is shut off. Conversely, if several genes are involved in a pathway but they only appear somewhere downstream, changes in their expression levels may not affect the given pathway as much. Performing a gene set analysis can only tell you whether the set of pathway genes is enriched in DE genes. It cannot provide any information about how the measured changes propagate through the pathway and affect other genes or processes.
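The effect of a gene's position can be illustrated with a toy propagation scheme. The sketch below is loosely modeled on the perturbation idea behind topology-based methods: each gene's perturbation is its own measured change plus the signed perturbations of its upstream genes, each divided by that upstream gene's number of downstream targets. It is a simplification under stated assumptions (acyclic pathway, unit edge weights, no significance test) and should not be read as the published impact analysis. The pathway and gene names are hypothetical.

```python
from graphlib import TopologicalSorter


def propagate(edges, delta):
    """Propagate measured changes along a directed acyclic pathway.

    edges: list of (source, target, sign), sign +1 for activation, -1 for inhibition.
    delta: dict mapping gene -> measured log2 fold change (missing genes count as 0).
    """
    genes = {g for source, target, _ in edges for g in (source, target)}
    incoming = {g: [] for g in genes}
    n_downstream = {g: 0 for g in genes}
    for source, target, sign in edges:
        incoming[target].append((source, sign))
        n_downstream[source] += 1

    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    order = TopologicalSorter(
        {g: [source for source, _ in incoming[g]] for g in genes}
    ).static_order()

    perturbation = {}
    for gene in order:
        upstream = sum(
            sign * perturbation[source] / n_downstream[source]
            for source, sign in incoming[gene]
        )
        perturbation[gene] = delta.get(gene, 0.0) + upstream
    return perturbation


# Hypothetical linear cascade that fans out to two transcription factors.
edges = [
    ("RECEPTOR", "ADAPTOR", +1),
    ("ADAPTOR", "KINASE", +1),
    ("KINASE", "TF1", +1),
    ("KINASE", "TF2", +1),
]

receptor_down = propagate(edges, {"RECEPTOR": -3.0})  # change at the pathway entry point
tf_down = propagate(edges, {"TF2": -3.0})             # same change, far downstream

total_receptor = sum(abs(v) for v in receptor_down.values())
total_tf = sum(abs(v) for v in tf_down.values())
print(total_receptor)  # 12.0
print(total_tf)        # 3.0
```

In both scenarios exactly one gene is differentially expressed by the same amount, so a gene set method sees them as equivalent. The propagated totals differ by a factor of four because the receptor sits upstream of everything else.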
5. You need a pathway analysis – when you are looking for mechanisms that are specifically affected in your experiment
Some genes have multiple functions or are involved in several pathways, with different roles in each. For instance, the right hand panel above shows that INSR (highlighted in yellow) is also involved in the Adherens Junction pathway as one of the many receptor protein tyrosine kinases. However, if the expression of INSR changes, this pathway is not likely to be heavily perturbed because INSR is just one of many receptors on this pathway. Once again, none of these aspects are considered by any gene set analysis approach because all gene set analysis approaches ignore the position of the genes and the roles that they play. Without additional measurements, gene set analysis methods would have a hard time predicting whether the Adherens Junction or the Insulin pathway has a higher potential to be impacted.
Probably the most important drawback associated with gene set analysis is that the knowledge embedded in the pathways about how various genes interact with each other is not exploited. The very purpose of these pathway diagrams is to capture some of our knowledge about how genes interact and regulate each other. However, gene set analysis approaches consider only the sets of genes involved in these pathways, without taking into consideration their topology. If you are interested in any specific mechanisms (either mechanisms that are already known and understood, or tentative mechanisms that you hypothesize), you want to use pathway analysis rather than gene set analysis.
The image below shows a screenshot from iPathwayGuide, which was used to perform a pathway analysis on the GSE47363 data set from GEO. In this experiment, the scientists treated a cell line with a microRNA (miR), miR-542-3p, with the goal of understanding the effects of this miR. The pathway analysis of the expression changes measured in the treated vs. control samples was performed with iPathwayGuide and included the impact analysis described in the Genome Research paper [5]. The completely automatic analysis yielded the mechanism shown in red below. This represents a putative mechanism that explains all measured changes throughout the system, after taking into consideration all signals and all dependencies between the various genes. This kind of result simply cannot be obtained from any gene set analysis method.
6. You need a pathway analysis – when you want your results to be based on the most recent knowledge
Our understanding of various pathways is constantly improving as more data is gathered, and the rate of change is only expected to increase. Pathways may be modified by adding, removing or redirecting links on the pathway diagrams. If we limit ourselves to a gene set analysis, we will be completely unable to ever sense such changes. Thus, gene set analysis will provide identical results as long as the pathway diagram involves the same genes, even if the interactions between them are completely re-defined over time.
From the above, it sounds like pathways are much superior to gene sets and a pathway analysis should produce much more meaningful and accurate results than a gene set analysis. That is generally true. So, why do people still use gene set analysis methods? Well, these methods have a couple of advantages, as follows.
1. You want a Gene Set Analysis – when you are looking for “quick and dirty” answers
These methods tend to be simpler since they do not use any topological information. Since they are simpler, they are also: i) easier to understand and ii) easier to implement quickly, for “back-of-the-envelope” type estimations. Calculating an enrichment p-value or an FCS score such as that provided by GSEA would give you a quick answer as to whether there is any chance that the group of genes you are looking at is in any way related to the phenotype.
2. You want a Gene Set Analysis – when you have arbitrarily defined gene sets
The main disadvantage of gene set analysis methods, namely that they do not use any dependency or other knowledge about the underlying phenomenon, can sometimes be an advantage. If that kind of knowledge is not used, there is no need to understand exactly how these genes communicate, interact, or work together in order to see whether they are somewhat related to the phenotype as a group. If I get an idea that a bunch of genes may be working in concert to accomplish a given function, I can quickly call them a gene set, do some simple computations and figure out how likely it is for them to be related to the phenotype I am studying. Of course, this can also be a problem, since some gene sets may include random genes that are not really related to the rest, which will hamper my ability to really understand what is going on.
And since you are a scientist, you will ask the obvious question: has anybody really compared side-by-side gene set analysis methods and pathway analysis methods on the same data sets? Well, the answer is: not yet! The main reason for this is that it is extremely difficult to objectively compare the results of such analysis methods. In my next blog entry, I will explain why such comparisons are difficult and I will propose 3 approaches and a benchmark that will make such comparisons possible.
Now that you know the differences between pathway analysis and gene set analysis, you may want to know which tools are best in each category. In the pathway analysis category, the only currently available professional-grade tool using a topology-based approach is iPathwayGuide. In full disclosure, this is the platform that my team and I developed over a number of years. However, I truly think it is the best platform currently available for this type of analysis. My (biased) opinion is based on a comparison of 11 different methods on over 2,500 samples from 75 human data sets and 11 mouse data sets. The results of these extensive comparisons are currently under review. A brief description of the results will be posted here as soon as the paper is accepted. In the gene set analysis category, the method that came out on top in our extensive testing is Gene Set Enrichment Analysis (GSEA). As mentioned above, we ran many data sets, and GSEA is unbelievably unbiased. It gives you the best results that you can get if you are willing to ignore all pathway topology and the phenomena described by the pathway.
- Marit Ackermann and Korbinian Strimmer. A general modular framework for gene set enrichment analysis. BMC Bioinformatics, 10(1):1, 2009.
- William T. Barry, Andrew B. Nobel, and Fred Wright. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics, 21(9):1943–1949, May 2005.
- Thomas Breslin, Patrik Eden, and Morten Krogh. Comparing functional annotation analyses with Catmap. BMC Bioinformatics, 5(1):193, 2004.
- Irina Dinu, John D Potter, Thomas Mueller, Qi Liu, Adeniyi J Adewale, Gian S Jhangri, Gunilla Einecke, Konrad S Famulski, Philip Halloran, and Yutaka Yasui. Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics, 8(1):242, 2007.
- Sorin Draghici, Purvesh Khatri, Adi L Tarca, Kashyap Amin, Arina Done, Calin Voichita, Constantin Georgescu, and Roberto Romero. A systems biology approach for pathway level analysis. Genome Research, 17(10):1537–1545, 2007.
- Bhaskar Dutta, Anders Wallqvist, and Jaques Reifman. PathNet: A tool for pathway analysis using topological information. Source Code for Biology and Medicine, 7(1):10, 2012.
- Bradley Efron and Robert Tibshirani. On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1):107–129, 2007.
- Ludwig Geistlinger, Gergely Csaba, Robert Kuffner, Nicola Mulder, and Ralf Zimmer. From sets to graphs: towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13):i366–i373, 2011.
- Enrico Glaab, Anaïs Baudot, Natalio Krasnogor, and Alfonso Valencia. TopoGSA: network topological gene set analysis. Bioinformatics, 26(9):1271–1272, 2010.
- Jelle J. Goeman, Sara A. van de Geer, Floor de Kort, and Hans C. van Houwelingen. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics, 20(1):93–99, 2004.
- S. Greenblum, S. Efroni, C. Schaefer, and K. Buetow. The PathOlogist: an automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1):133, 2011.
- Zuguang Gu, Jialin Liu, Kunming Cao, Junfeng Zhang, and Jin Wang. Centrality-based pathway enrichment: a systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1):56, 2012.
- Zuguang Gu and Jin Wang. CePa: an R package for finding significant pathways weighted by multiple network centralities. Bioinformatics, 29(5):658–660, 2013.
- Corneliu Henegar, Raffaella Cancello, Sophie Rome, Hubert Vidal, Karine Clement, and Jean-Daniel Zucker. Clustering biological annotations and gene expression data to identify putatively co-regulated biological processes. Journal of Bioinformatics and Computational Biology, 4(04):833–852, 2006.
- Jui-Hung Hung, Troy W Whitfield, Tun-Hsiang Yang, Zhenjun Hu, Zhiping Weng, and Charles DeLisi. Identification of functional modules that correlate with phenotypic difference: the influence of network topology. Genome Biology, 11(2):R23, 2010.
- Laurent Jacob, Pierre Neuvial, and Sandrine Dudoit. Gains in power from structured two-sample tests of means on graphs. arXiv preprint arXiv:1009.5173, 2010.
- Zhen Jiang and Robert Gentleman. Extensions to gene set enrichment. Bioinformatics, 23(3):306–313, 2007.
- Purvesh Khatri, Sorin Draghici, Adi L Tarca, Sonia S Hassan, and Roberto Romero. A system biology approach for the steady-state analysis of gene signaling networks. In CIARP’07 Proceedings of the 12th Iberoamerican conference on Progress in pattern recognition, image analysis and applications, pages 32–41, Valparaiso, Chile, 13-16 November 2007. ACM.
- Sek Won Kong, William T Pu, and Peter J Park. A multivariate approach for integrating genome-wide expression data and biological knowledge. Bioinformatics, 22(19):2373–2380, 2006.
- Maria S Massa, Monica Chiogna, and Chiara Romualdi. Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1):121, 2010.
- Cristina Mitrea, Zeinab Taghavi, Behzad Bokanizad, Samer Hanoudi, Rebecca Tagett, Michele Donato, Calin Voichita, and Sorin Draghici. Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4:278, 2013.
- Tin Nguyen and Sorin Draghici. BLMA: A package for bi-level meta-analysis. Bioconductor, 2017. R package.
- Tin Nguyen, Rebecca Tagett, Michele Donato, Cristina Mitrea, and Sorin Draghici. A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3):409–416, 2016.
- Ali Shojaie and George Michailidis. Analysis of Gene Sets Based on the Underlying Regulatory Network. Journal of Computational Biology, 16(3):407–426, 2009.
- Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545–15550, 2005.
- Adi L Tarca, Sorin Draghici, Gaurav Bhatti, and Roberto Romero. Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13(1):136, 2012.
- Adi L Tarca, Sorin Draghici, Purvesh Khatri, Sonia S Hassan, Pooja Mittal, Jung-sun Kim, Chong Jai Kim, Juan Pedro Kusanovic, and Roberto Romero. A novel signaling pathway impact analysis. Bioinformatics, 25(1):75–82, 2009.
- Lu Tian, Steven A. Greenberg, Sek Won Kong, Josiah Altschuler, Isaac S. Kohane, and Peter J. Park. Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the USA, 102(38):13544–13549, 2005.
- Calin Voichita, Michele Donato, and Sorin Draghici. Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126–131, Boca Raton, FL, USA, 12-15 December 2012.