Methods and approaches in the topology-based analysis of biological pathways

The goal of pathway analysis is to identify the pathways significantly impacted in a given phenotype. Many current methods are based on algorithms that consider pathways as simple gene lists, dramatically under-utilizing the knowledge that such pathways are meant to capture. During the past few years, a plethora of methods claiming to incorporate various aspects of the pathway topology have been proposed. These topology-based methods, sometimes referred to as “third generation,” have the potential to better model the phenomena described by pathways. Although there is now a large variety of approaches used for this purpose, no review is currently available to offer guidance for potential users and developers. This review covers 22 such topology-based pathway analysis methods published in the last decade. We compare these methods based on: type of pathways analyzed (e.g., signaling or metabolic), input (subset of genes, all genes, fold changes, gene p-values, etc.), mathematical models, pathway scoring approaches, output (one or more pathway scores, p-values, etc.) and implementation (web-based, standalone, etc.). We identify and discuss challenges, arising both in methodology and in pathway representation, including inconsistent terminology, different data formats, lack of meaningful benchmarks, and the lack of tissue and condition specificity.

1. Introduction

In molecular biology and genetics, there is a large gap between current data analysis techniques and their ability to derive precise and accurate functional information from the large and constantly growing volume of high-throughput molecular data. Obtaining comprehensive lists of genes/proteins that differ between two phenotypes is routine in research today. And yet, the holy grail of high-throughput has not delivered so far. Even though high-throughput comparisons are relatively easy to perform, understanding the phenomena that determine the measured changes is as challenging as ever, if not more so. Therefore, it is crucial to develop effective ways to analyze the vast amount of data that has been and will continue to be collected.

A major contributor to the gap between our ability to collect data and our ability to interpret it is the fact that living organisms are complex systems whose emerging phenotypes are the results of thousands of complex interactions taking place on various metabolic and signaling pathways. The ability to correctly infer the perturbed pathways responsible for a phenotype from a list of differentially expressed (DE) genes or proteins may be the key to transforming the now abundant high-throughput expression data into biological knowledge. In turn, this can help understand mechanisms of disease, develop better drugs, personalize drug regimens, etc. For our purposes, pathways are models describing the interactions of genes, proteins, or metabolites within cells, tissues, or organisms, not simple lists of genes. This is why, in this paper, we focus exclusively on pathway analysis methods that aim to identify the pathways that are significantly impacted in a condition under study, taking pathway topology into account. This process uses two types of data: (i) previously accumulated knowledge in the form of known pathways, represented as graphs, and (ii) experiment data, such as gene expression values or protein or metabolite abundance data obtained when comparing two phenotypes.

In spite of the crucial importance of this problem and of the recent increase in the number of methods and approaches for pathway analysis, to our knowledge there is no current review focused on topology-based methods. A reason for this may be related to the challenges currently associated with this problem. A first such challenge is the lack of standards for the evaluation of the results of the analyses. This has led to the proliferation of many techniques that have never been compared with each other in a consistent way. Another set of challenges is related to the pathways themselves. Not only is there no universal agreement for the representation of information content in pathway databases, but the very definition of a pathway is not completely agreed upon (11). Some authors use the term “pathway” to refer to a simple list of genes (such as those associated with a given Gene Ontology (GO) term), lacking any structure and any information about the interactions between these genes. Many others use graphs to capture relationships, but the meaning of edges and nodes varies dramatically from one source to another. Figure 5 shows no fewer than five different types of graphs, all referred to as “pathways.” Even pathways from the same source often use different representations. For instance, genes/proteins are associated with nodes in KEGG signaling pathways while they are associated with edges in KEGG metabolic pathways.

The subset of available techniques that consider the pathways as simple lists of genes, such as those associated with a GO term (or another arbitrary descriptor), is worthy of further discussion. Here, we will refer to these as gene set analysis methods, rather than pathway analysis methods. A comprehensive list of such techniques, as well as some comparisons between them, can be found in several well-developed surveys (58) (12) (44) (20) (46). While useful for the purpose for which they have been developed – to analyze sets of genes – these methods do not take into consideration the topology of the pathways, and hence completely ignore the interactions described by the pathways, the different types of genes, the position of the genes on their respective pathways, etc. This is illustrated in Figure 1. In some sense, the very reason for the existence of the pathways is to describe the way various genes interact. Therefore, methods that perform the analysis only on sets of genes, ignoring the topology of the pathway, are not included in the scope of the present review.

Figure 1. Gene sets are not pathways. (A) shows a small part of the MAPK signaling pathway from KEGG. This pathway shows the location of various genes or gene products (inside the cell, outside of it, or in the membrane), which gene interacts with which other gene(s), the type of each interaction (activation, repression, phosphorylation, etc.), the direction of the signal propagation, and potentially many other things (e.g., complex formation). (B) presents the same part of the same pathway as a gene set (no interactions). The gene set has lost all the structure and the additional information captured by the original pathway. This comparison shows how much important knowledge present in pathway databases is ignored when pathways are treated as simple gene sets.

Recent pathway analysis algorithms have become more refined than gene set analysis methods by incorporating topology (Figure 2). A first attempt to incorporate topology information in the analysis of pathways was through the use of graph theory methods. This approach became popular in the last decade (12) (6). Aittokallio and others survey graph-based analysis methods. They identify categories based on global structural properties, local structural connectivity, or hierarchical functional organization, and describe the features of gene regulatory networks, metabolic networks, and protein-protein interaction (PPI) networks (3). Some of these graph theory methods and concepts are relevant to the pathway analysis methods able to compare phenotypes, which are the focus of the current review. However, as a broad category, the approaches based on graph theory methods are not able to identify the pathways that are significant in a given phenotype comparison and therefore do not fall within the scope of this review.

Figure 2. Generalized overview of the data flow in pathway analysis methods. For each module, the various options available for different methods surveyed, as well as the comparison criteria used in this paper are presented in the white boxes.

Varadan and others (80) review the use of biological knowledge bases for cancer diagnosis and prognosis. They attempt to evaluate the performance of three topology-based methods, SPIA, PARADIGM, and PathOlogist, on the same input datasets to compare the biological relevance of their outputs. Unfortunately, since the 3 tools did not use the same pathway database, the authors chose to re-implement SPIA and adapt it to the pathway database used by the other two, so that the result from all three would be comparable. The authors discuss relative performance of the three methods, but could not draw definitive conclusions regarding the superiority of one tool versus another. We also ran their version of SPIA and the original SPIA implementation from Bioconductor, using exactly the same input, and obtained different results. This indeed demonstrates some of the inherent problems encountered when comparing pathway analysis methods. First of all, it is difficult to successfully re-implement an algorithm to force it to work on other data sources, especially when the re-implementation is done by third parties. Furthermore, sometimes the mere ability to reproduce published results – which is at the base of modern scientific research – is questionable in this area. For instance, in spite of having access to the source code and having the full cooperation of the authors, we could not even reproduce the results reported in (83).

Four topology-based tools, along with several gene-based methods, were recently reviewed by Khatri et al. (2012). This recent survey groups functional analysis based on GO together with pathway analysis methods. With this very loose definition of a pathway and pathway analysis, the authors present the limitations and challenges of various methods in general, and categorize topology-based methods as “third-generation” tools. However, even though it is very recent, this existing survey only includes 4 out of the 22 topology-based analysis methods reviewed here.

In a different direction, researchers tackle the problem of understanding disease by looking at signaling networks from the perspective of fault tolerance. Fault tolerance is a measure of the vulnerability of signaling networks to the abnormal function of their components. Abdi and Emamian survey this direction in a comprehensive study (1). They present valuable results highlighting vulnerable molecules in different molecular networks for biological phenomena such as mitosis or p53 signaling.

Kinetic/stoichiometric models based on the molecular mechanisms of interaction have been used for over 25 years in order to simulate biochemical phenomena. Such models are in some sense the ultimate tools because they can predict exact quantities for any variable in the system. However, their use is limited by the need to know the precise initial concentration for most reactants, exact reaction constants for all reactions, as well as the appropriate time scale for the studied phenomenon. Furthermore, the goals of such models are very different from the goals of pathway analysis methods. The goal of such kinetic models is to fully describe the biochemical phenomena involved and to make quantitative predictions about some of the reaction products involved. In contrast, the goal of pathway analysis methods is to identify the most significantly impacted pathways from a large collection of heterogeneous pathways, based on incomplete information. Furthermore, kinetic models work for biochemical pathways describing reactions of the same type (biochemical) with known reaction constants (75). The pathways we are considering here include gene signaling pathways containing different “signals” (inhibition, activation, phosphorylation, methylation, etc.) happening at many levels (transcription, translation, post-translational, etc.) between heterogeneous components (mRNA, DNA, protein, metabolites, etc.). Therefore, the entire body of work concerned with modeling biochemical pathways using mathematical models (e.g., differential or difference equations) does not fall within the scope of this review.

Finally, it is important to state that we do not intend to assess the efficacy of each method, since there is not a universally recognized correct output of such tools. Designing benchmark datasets would help to determine the most effective mathematical model but this is beyond the intended scope of the current review and hence, it is not attempted here.

In this paper, we describe 22 topology-based pathway analysis methods designed to analyze either signaling pathways (see Figure 3) or metabolic pathways (Figure 4). There are several commercial tools used for pathway analysis that do not incorporate the pathway topology when computing pathway scores, including Ingenuity Pathway Analysis (Ingenuity Systems) and Genomatix (Genomatix Software). Since these tools only perform a gene set analysis, failing to take advantage of the additional knowledge incorporated in the pathways, they will not be considered here. We found only two commercial tools that do incorporate topology in the pathway analysis. These are Pathway-Guide (Advaita Corporation) and MetaCore (Thomson Reuters).

Figure 3. Timeline showing when the surveyed pathway analysis tools, working mainly with signaling pathways, became available (this time may be different from publication time shown in Table 1). Some of the methods use additional interaction information that may be from an in-house or public gene/protein interaction knowledge base. BAPA-IGGFD (Zhao et al., 2012) and TBScore (Ibrahim et al., 2012) acronyms were assigned to the respective methods, in this manuscript, for ease of reference. The commercial tools, Pathway-Guide and MetaCore are not included in this figure.

Figure 4. Timeline showing the availability of pathway analysis tools that work mainly with metabolic pathways.

We categorize and compare all surveyed methods based on different criteria including: the type of input required, the type of output provided, the mathematical models used, and the implementation used. In section 2, we discuss the options for input data in different tools, in particular, the challenges specific to topology-based methods. Section 3 reviews the underlying mathematical models and scoring methods currently available to rate pathway deregulation. Section 4 focuses on the types of output provided. Finally, section 5 presents issues regarding the implementation of the methods. To the best of our knowledge, our review is the only comprehensive survey of topology-based pathway analysis methods to date.

2. Input Data

This review focuses on pathway analysis methods that try to exploit some of the information contained in the pathway topology in order to identify the pathways that are significantly impacted in a condition under study. In order to address this problem, any pathway analysis method will need: (i) a collection of pathways capturing our current knowledge about the interactions of genes, proteins, metabolites, or compounds in an organism (usually from a pathway database), and (ii) experimental data in the form of measurements of gene expression, protein abundance, metabolite concentration, or copy numbers. The pathway data is accumulated, updated, and refined by amassing knowledge from scientific literature describing individual interactions or high throughput experiment results. The experiment data is usually provided by measurements comparing two or more phenotypes such as treated vs. untreated, disease vs. healthy, or treated with drug A vs. drug B.

Analysis methods take various approaches to accommodate the different formats commonly used for both types of data. In this section, we compare all methods reviewed based on their input types and formats, and discuss the particular difficulties encountered when incorporating the pathway interactions into topology-based analysis methods.

2.1. Experiment Data

Most methods analyze data from high-throughput experiments, such as microarrays, next-generation sequencing, or proteomics. Most analysis methods accept either a list of gene IDs or a list of such gene IDs associated with measured changes. These changes could be measured with different technologies and therefore can serve as proxies for different biochemical entities. For instance, one could use gene expression changes measured with microarrays, or protein levels measured with a proteomic approach, etc. Transcription data is often used to approximate the proteome, since high-throughput protein abundance data is not readily available. Most methods expect a consistent input, i.e., all values are expected to be of the same type. MetPA, which is a metabolic pathway analysis method, is the only method that does not accept gene expression data. This method uses as input either a list of “important” compounds or a metabolite concentration table.

Different analysis methods use different input formats. Many methods accept a list of all genes considered in the experiment together with their expression values. Some analysis methods select a subset of genes, considered to be differentially expressed (DE), based on a predefined cut-off. The cut-off is typically applied on fold-change, statistical significance, or both. A selection based on both criteria can be performed easily if the data is displayed as a volcano plot, i.e., in a coordinate system that has fold changes on the x axis and the negative log of the p-value on the y axis. In such a plot, genes that have large absolute fold changes as well as significant p-values will appear in the top part of the plot, towards the sides. These methods use the list of DE genes and their corresponding fold-change values as input. Other methods use only the list of DE genes, without corresponding expression values, because their scoring methods are based only on the relative positions of the genes in the graph. Methods which use cut-offs are sensitive to the chosen threshold value, because a small change in the cut-off may drastically change the number of selected genes (59). As a consequence, some genes with moderate differential expression may be lost, even though they might be important players in the impacted pathways (8). Furthermore, the genes included in the set of DE genes can vary dramatically if the selection methods are changed. Hence, the results of pathway analyses based on DE genes may be vastly different depending on both the selection method as well as the threshold value (62). On the other hand, methods which do not use a threshold are more sensitive to the noise coming from the (very many) genes that do not change much between the two phenotypes, genes that are normally eliminated by the DE selection process. An approach used to address this issue while still using all gene measurements uses the individual p-values of each gene (84).
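The threshold sensitivity described above can be illustrated with a minimal sketch. The gene IDs, fold changes, p-values, and cut-off values below are hypothetical and chosen only for illustration; the function names are our own and do not correspond to any surveyed tool.

```python
import math

def select_de_genes(genes, fc_cutoff=1.0, p_cutoff=0.05):
    """Return the sorted gene IDs that pass both cut-offs.

    `genes` maps a gene ID to a (log2 fold change, p-value) pair.
    A gene is kept if |log2 fold change| >= fc_cutoff and p-value <= p_cutoff.
    """
    selected = []
    for gene_id, (log2_fc, p_value) in genes.items():
        if abs(log2_fc) >= fc_cutoff and p_value <= p_cutoff:
            selected.append(gene_id)
    return sorted(selected)

def volcano_coordinates(genes):
    """Map each gene to its (x, y) volcano-plot position:
    x = log2 fold change, y = -log10(p-value)."""
    return {g: (fc, -math.log10(p)) for g, (fc, p) in genes.items()}

# Hypothetical measurements: gene ID -> (log2 fold change, p-value)
example = {
    "geneA": (2.1, 0.001),
    "geneB": (-1.6, 0.020),
    "geneC": (0.9, 0.004),  # significant, but just under a fold-change cut-off of 1.0
    "geneD": (1.2, 0.060),  # large change, but just over a p-value cut-off of 0.05
    "geneE": (0.1, 0.800),
}

strict = select_de_genes(example, fc_cutoff=1.0, p_cutoff=0.05)
relaxed = select_de_genes(example, fc_cutoff=0.8, p_cutoff=0.10)
print(strict)   # two genes pass the stricter cut-offs
print(relaxed)  # four genes pass after a small relaxation
```

In this toy example, a modest relaxation of both cut-offs doubles the number of selected genes, which is the kind of instability that motivates methods working on all measured genes.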

Among the surveyed methods, ScorePAGE, PathOlogist, NetGSA, TopologyGSA, PWEA, TAPPA, ACST, BPA, BAPA-IGGFD, and DEGraph use all genes together with their expression values as input. However, for BPA and BAPA-IGGFD, the fold changes are only used to label each gene and are not considered in the analysis itself. In BPA, this label is whether the gene is DE or not and in BAPA-IGGFD, the label states whether the gene is up-regulated or down-regulated. Therefore, these two methods can be categorized as using a cut-off on the input gene list. Methods that use the DE gene list and their associated values include Pathway-Guide, Pathway-Express, SPIA, and TBScore. However, the impact analysis, which is the approach used by Pathway-Guide, Pathway-Express, and SPIA, has recently been extended to work with the set of all genes as well (84), so these can now be used either with or without DE genes. Moreover, this functionality is now available as part of the Bioconductor package ROntoTools. MetaCore, TopoGSA, and EnrichNet use only the DE gene list without associated expression values. CePa is a method that has two options. It can work with either a list of DE genes, or the whole list of genes with their expression values and phenotype labels. GANPA and THINK-Back Density Analysis (DS) modify existing gene set analysis methods, such as GSEA, by calculating topology-based weights for each gene before applying the main gene set analysis method. In these methods, the gene set analysis used in the second stage uses as input the list of all genes with their expression values. However, the weighting process used in the first stage requires DE genes with their values, for GANPA, and the list of DE genes, for THINK-Back-DS.

2.2. Pathway Data

Biological processes can be represented by different types of models. Usually pathways, such as signaling or metabolic pathways, are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function or process. A typical signaling pathway (in KEGG for instance) uses nodes to represent genes or gene products and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds and edges to represent reactions that transform one or more compound(s) into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges rather than nodes, as in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. There are other types of biological networks that incorporate genome wide interactions between genes or proteins such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from a bait-prey laboratory assay, rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading since the two proteins may never be present at the same time in the same tissue or the same part of the cell.
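The difference between the two representations can be made concrete with a small sketch. The gene and compound names are hypothetical, and the conversion shown at the end (linking two genes when the product of one enzyme's reaction is the substrate of another's) is only one possible assumption for illustration, not the procedure of any particular surveyed method.

```python
# Signaling pathway: genes are nodes; each directed edge carries a signal type.
signaling_edges = [
    ("gene1", "gene2", "activation"),
    ("gene2", "gene3", "repression"),
]

# Metabolic pathway: compounds are nodes; each directed edge is a reaction
# annotated with the set of genes encoding the enzymes that carry it out.
metabolic_edges = [
    ("compoundX", "compoundY", {"gene4"}),
    ("compoundY", "compoundZ", {"gene5", "gene6"}),
]

def genes_in_signaling(edges):
    """Genes are read from the nodes (edge endpoints)."""
    return {node for src, dst, _ in edges for node in (src, dst)}

def genes_in_metabolic(edges):
    """Genes are read from the edge annotations, not from the nodes."""
    genes = set()
    for _, _, enzymes in edges:
        genes |= enzymes
    return genes

def metabolic_to_gene_graph(edges):
    """Illustrative conversion to a gene-gene graph: connect g1 -> g2 when a
    reaction catalyzed by g1 produces a compound consumed by a reaction
    catalyzed by g2."""
    gene_edges = set()
    for _, product, upstream in edges:
        for substrate, _, downstream in edges:
            if product == substrate:
                gene_edges |= {(u, d) for u in upstream for d in downstream}
    return gene_edges

print(sorted(genes_in_signaling(signaling_edges)))
print(sorted(genes_in_metabolic(metabolic_edges)))
print(sorted(metabolic_to_gene_graph(metabolic_edges)))
```

A method written against the first structure looks up expression values by node, so it cannot be applied to the second structure without a conversion step of this kind, and the choice of conversion is itself a modeling decision.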

The pathway data used as input by pathway analysis methods generally come from a single source, such as a single pathway database. In some analysis methods a second source of interaction data is used, such as a gene/protein interaction knowledge base or a genome-scale network. Most of the methods use one data source. However, among the surveyed methods, MetaCore, GANPA, BAPA-IGGFD, and EnrichNet use two sources of interaction data. MetaCore uses two types of proprietary knowledge: an interaction database and canonical pathways. The interaction information is protein-protein interaction data gathered from the literature, which is used to generate a directed global network. There is no public information regarding the details of how the MetaCore interaction network and canonical pathways are created.

Another analysis method that uses two sources of data is BAPA-IGGFD. The first source is a predefined pathway knowledge base. BAPA-IGGFD is advertised as able to analyze any pathway format; however, the example in (91) is restricted to pathways from the KEGG database. The second source is an interaction knowledge base, called PrimeDB, which was created by the authors of Zhao et al. (2012) by extracting directed gene-gene interaction information from scientific publications and past experiments. PrimeDB lists potential interactions between each pair of genes and counts reported instances of activation and inhibition separately.

EnrichNet and GANPA are other methods with two input sources. They use genome-scale interaction networks in addition to predefined pathway datasets as input. For the genome-scale interaction networks, EnrichNet uses PPI networks such as STRING (72) (85), and GANPA builds a network, called gNET, based on different types of gene/protein association data such as PPIs, co-annotation in GO Biological Process (BP), and co-expression in large-scale gene expression microarray data.

Pathway analysis methods can use public or proprietary input sources. MetaCore, BAPA-IGGFD, and GANPA use proprietary interaction networks. All other surveyed methods use public sources. Among them, TopoGSA infers PPI networks on the fly, for human and some model organisms, from databases such as MIPS (54), DIP (87), BIND (5), HPRD (63), IntAct (35), and BioGRID (74). TopoGSA also accepts any kind of predefined pathways as input which it scores and compares with the constructed network.

Publicly available curated pathway databases used by the surveyed methods are KEGG (61), NCI-PID (68), BioCarta (9), WikiPathways (64), PANTHER (55), and Reactome (43). These curated knowledge bases are more reliable than protein interaction networks but do not include all known genes and their interactions. As an example, KEGG included only about 5000 human genes in signaling pathways, at the time of writing this article.

Various research groups have tried different strategies to address the challenge of modeling complex biomolecular phenomena. These efforts have led to variation among knowledge bases, complicating the task of developing pathway analysis methods. There is currently no accepted standard for constructing pathways, and as pathway paradigms evolve to better represent the biology, pathway analysis methods evolve in parallel. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. However, BioCarta includes the C-JUN transcription factor, which is missing from the KEGG representation.

Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces “process nodes” to model interactions (see Figure 5). Most pathway analysis methods are designed to use only one pathway graph model, which limits the user’s possibilities. Developers are faced with the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.
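To illustrate what modifying the pathway graphs to conform to a method can involve, the sketch below collapses a graph that uses process nodes into direct, labeled edges between components. The dictionary layout and all names are hypothetical simplifications for illustration and are not the NCI-PID schema; the collapsing rule is likewise an assumption rather than a documented procedure.

```python
def collapse_process_nodes(processes):
    """Turn a process-node graph into direct edges between components.

    `processes` maps a process ID to a dict with:
      "state":   label of the process node (e.g. "positive regulation")
      "inputs":  components feeding into the process
      "outputs": components produced or affected by the process
    Every (input, output) pair becomes one direct edge carrying the state.
    """
    direct_edges = []
    for process in processes.values():
        for source in process["inputs"]:
            for target in process["outputs"]:
                direct_edges.append((source, target, process["state"]))
    return sorted(direct_edges)

# Hypothetical pathway fragment with two process nodes.
example_processes = {
    "p1": {
        "state": "positive regulation",
        "inputs": ["proteinA", "proteinB"],
        "outputs": ["proteinC"],
    },
    "p2": {
        "state": "negative regulation",
        "inputs": ["proteinC"],
        "outputs": ["proteinD"],
    },
}

for edge in collapse_process_nodes(example_processes):
    print(edge)
```

Such a conversion is lossy: the fact that proteinA and proteinB take part in the same process is no longer visible once the process node has been removed, which is one reason results may differ when a method is moved from one pathway graph model to another.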

Figure 5. Comparison of representative graph models for molecular interactions as used by different pathway databases. In a KEGG signaling pathway (A) nodes represent genes/gene products and edges represent regulatory signals such as activation, inhibition, phosphorylation, etc. (see for details). In the chemical network representation of a KEGG metabolic pathway (B) the nodes represent biochemical compounds and edges represent chemical reactions. These chemical reactions are performed by enzymes, which are proteins encoded by genes. Hence, in contrast with the signaling pathways in which genes are associated with nodes, in a metabolic pathway genes are associated with edges. This is the main reason most methods developed for signaling pathways cannot be applied directly to metabolic pathways. In an NCI-PID signaling pathway (C) nodes fall into two categories: component nodes representing biomolecular components, or process nodes representing biochemical reactions or biological processes. Edges connect two biomolecular components through a biochemical reaction or a biological process. Process nodes can have 3 states: positive regulation, negative regulation, or “involved in” (see for details). In a protein-protein interaction network (D) nodes represent proteins and the interactions among them represent physical binding. These interactions can be inferred from two-hybrid assays and they may be either undirected (top), or directed from the bait protein to the prey protein (bottom). In the Biological Pathway Exchange (BioPAX) (E) nodes are physical entities and edges are conversions. BioPAX entities can represent complexes, DNA, proteins, RNA, small molecules, DNA regions, or RNA regions. Conversions can represent biochemical reactions, complex assembly or degradation, transport, or transport with biochemical reaction. This model is very generic and increasingly flexible. It provides a standard for pathway information to be available in machine-readable format, and therefore easy to use for pathway analysis and to exchange between pathway databases (see for details).

Pathway databases not only differ in the way that interactions are modeled, but their data are provided in different formats as well (12). Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML) (7). The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house “NCI-Nature curated pathways,” in NCI-PID format (68). In order to unify pathway databases, pathway information should be provided in a common format. XML is a flexible text format with increasing use for data exchange across different systems. However, XML is very low-level and lacks standard constructs to accurately describe biological phenomena. PID XML is both human- and machine-readable, and allows a platform-independent means of exchanging PID data. The BioPAX project is an effort to unify the format and exchange of pathway data, and has incorporated independent sources such as NCI, BioCarta, Reactome, WikiPathways, UCSC, NIH, and others (10).

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this must be parsed into a computer readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this paper may be restricted to only one pathway database, or may accept several. The corresponding databases for the surveyed methods are shown in Table 1.
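As a minimal sketch of the parsing step described above, the following code turns a list of directed interactions into an adjacency matrix. The node names are hypothetical, and the orientation used here (rows are sources, columns are targets) is an assumption for illustration; the convention expected by a given tool should be checked in its documentation.

```python
def adjacency_matrix(nodes, edges):
    """Build a 0/1 adjacency matrix for a directed graph.

    `nodes` is an ordered list of component IDs; `edges` is a list of
    (source, target) pairs. Entry [i][j] is 1 if there is an interaction
    from nodes[i] to nodes[j], and 0 otherwise.
    """
    index = {node: i for i, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix

# Hypothetical three-gene cascade: gene1 -> gene2 -> gene3
nodes = ["gene1", "gene2", "gene3"]
edges = [("gene1", "gene2"), ("gene2", "gene3")]

matrix = adjacency_matrix(nodes, edges)
for row in matrix:
    print(row)
```

Once a pathway is in this form, the same scoring code can be applied regardless of whether the interactions were originally read from KGML, BioPAX, or another format, which is why the graph model can remain independent of the input format.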

Table 1. Comparison of topology-based pathway analysis methods based on different criteria related to the input.


  1. Abdi, A., and Emamian, E. S. (2010). Fault diagnosis engineering in molecular signaling networks: anoverview and applications in target discovery. Chem. Biodivers. 7, 1111–1123. doi: 10.1002/cbdv.200900315
  2. Advaita Corporation. (2013) Pathway-Guide Software. Available online at: (Accessed on May 15, 2013)
  3. Aittokallio, T., and Schwikowski, B. (2006). Graph–based methods for analysing networks in cell biology. Brief. Bioinf. 7, 243–255. doi: 10.1093/bib/bbl022
  4. Arrell, D. K., and Terzic, A. (2010). Network systems biology for drug discovery. Clin. Pharmacol. Ther. 88, 120–125. doi: 10.1038/clpt.2010.91
  5. Bader, G. D., Donaldson, I., Wolting, C, Ouellette, F. B. F., Pawson, T., and Hogue, C. W. V. (2001). BIND–the biomolecular interaction network database. Nucleic Acids Res. 29, 242–245. doi: 10.1093/nar/29.1.242
  6. Barabási, A. L., Gulbahce, N., and Loscalzo, J. (2011). Network medicine: a network–based approach to human disease. Nat. Rev. Genet. 12, 56–68. doi: 10.1038/nrg2918
  7. Beltrame, L., Calura, E., Popovici, R. R., Rizzetto, L., Guedez, D. R., Donato, M., et al. (2011). The biological connection markup language: a SBGN-compliant format for visualization, filtering and analysis of biological pathways. Bioinformatics 27, 2127–2133. doi: 10.1093/bioinformatics/btr339
  8. Ben-Shaul, Y., Bergman, H., and Soreq, H. (2005). Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics 21, 1129–1137. doi: 10.1093/bioinformatics/bti149
  9. BioCarta. (2000). BioCarta – Charting Pathways of Life. Technical report, BioCarta. Available online at:
  10. BioPAX. (2002). The Biological Pathway Exchange (BioPAX). Available online at:
  11. Chowbina, S. R., Wu, X., Zhang, F., Li, P. M., Pandey, R., Kasamsetty, H. N., et al. (2009). HPD: an online integrated human pathway database enabling systems biology studies. BMC Bioinformatics 10(Suppl. 11):S5. doi: 10.1186/1471-2105-10-S11-S5
  12. Chuang, H. Y., Hofree, M., and Ideker, T. (2010). A decade of systems biology. Ann. Rev. Cell Dev. Biol. 26, 721–744. doi: 10.1146/annurev-cellbio-100109-104122
  13. Chung, F. R. K. (1997). Spectral Graph Theory. Providence, RI: American Mathematical Society.
  14. Dezső, Z., Nikolsky, Y., Nikolskaya, Y., Miller, J., Cherba, D., Webb, C., et al. (2009). Identifying disease-specific genes based on their topological significance in protein networks. BMC Syst. Biol. 3:36. doi: 10.1186/1752-0509-3-36
  15. Drăghici, S., Khatri, P., and Voichiţa, C. (2013) Pathway-Express Software. Available online at:
  16. Drăghici, S., Khatri, P., Tarca, A. L., Amin, K., Done, A., Voichiţa, C., et al. (2007). A systems biology approach for pathway level analysis. Genome Res. 17, 1537–1545. doi: 10.1101/gr.6202607
  17. Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Stat. 7, 1–26. doi: 10.1214/aos/1176344552
  18. Efron, B., and Tibshirani, R. (2007). On testing the significance of sets of genes. Ann. Appl. Stat. 1, 107–129. doi: 10.1214/07-AOAS101
  19. Efroni, S., Schaefer, C. F., and Buetow, K. H. (2007). Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS ONE 2:e425. doi: 10.1371/journal.pone.0000425
  20. Emmert-Streib, F., and Glazko, G. V. (2011). Pathway analysis of expression data: deciphering functional building blocks of complex diseases. PLoS Comput. Biol. 7:e1002053. doi: 10.1371/journal.pcbi.1002053
  21. Fang, Z., Tian, W., and Ji, H. (2011). A network-based gene-weighting approach for pathway analysis. Cell Res. 22, 565–580. doi: 10.1038/cr.2011.149
  22. Fang, Z., Tian, W., and Ji, H. (2013). GANPA Software. Available online at: (Accessed on May, 2013)
  23. Farfán, F., Ma, J., Sartor, M. A., Michailidis, G., and Jagadish, H. V. (2012). THINK Back: knowledge-based interpretation of high throughput data. BMC Bioinformatics 13(Suppl. 2):S4. doi: 10.1186/1471-2105-13-S2-S4
  24. Farfán, F., Ma, J., Sartor, M. A., Michailidis, G., and Jagadish, H. V. (2013a). THINK-Back-DS Software Standalone. Available online at: (Accessed on May, 2013)
  25. Farfán, F., Ma, J., Sartor, M. A., Michailidis, G., and Jagadish, H. V. (2013b). THINK-Back-DS Software Web-Based. Available online at: (Accessed on May, 2013)
  26. Gao, S., and Wang, X. (2007). TAPPA: topological analysis of pathway phenotype association. Bioinformatics 23, 3100–3102. doi: 10.1093/bioinformatics/btm460
  27. Glaab, E. (2013). EnrichNet Software. Available online at: (Accessed on May 15, 2013).
  28. Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., and Valencia, A. (2012). EnrichNet: network-based gene set enrichment analysis. Bioinformatics 28, i451–i457. doi: 10.1093/bioinformatics/bts389
  29. Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., and Valencia, A. (2010). TopoGSA: network topological gene set analysis. Bioinformatics 26, 1271–1272. doi: 10.1093/bioinformatics/btq131
  30. Glaab, E., Baudot, A., Krasnogor, N., Schneider, R., and Valencia, A. (2013). TopoGSA Software. Available online at: (Accessed on May 15, 2013).
  31. Greenblum, S. I., Efroni, S., Schaefer, C. F., and Buetow, K. H. (2013). Pathologist Software. Available online at: (Accessed on May 15, 2013).
  32. Gu, Z. (2013a). CePa Software Standalone. Available online at: (Accessed on May 15, 2013).
  33. Gu, Z. (2013b). CePa Software Web-Based. Available online at: (Accessed on May 15, 2013).
  34. Gu, Z., Liu, J., Cao, K., Zhang, J., and Wang, J. (2012). Centrality-based pathway enrichment: a systematic approach for finding significant pathways dominated by key genes. BMC Syst. Biol. 6:56. doi: 10.1186/1752-0509-6-56
  35. Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., et al. (2004). IntAct: an open source molecular interaction database. Nucleic Acids Res. 32(Suppl. 1), D452–D455. doi: 10.1093/nar/gkh052
  36. Hung, J.-H. (2013). PWEA Software. Available online at: (Accessed on May 15, 2013).
  37. Hung, J.-H., Whitfield, T. W., Yang, T.-H., Hu, Z., Weng, Z., and DeLisi, C. (2010). Identification of functional modules that correlate with phenotypic difference: the influence of network topology. Genome Biol. 11:R23. doi: 10.1186/gb-2010-11-2-r23
  38. Ibrahim, M. A., Jassim, S., Cawthorne, M. A., and Langlands, K. (2012). A topology-based score for pathway enrichment. J. Comput. Biol. 19, 563–573. doi: 10.1089/cmb.2011.0182
  39. Isci, S. (2013). BPA Software. Available online at: (Accessed on May 15, 2013).
  40. Isci, S., Ozturk, C., Jones, J., and Otu, H. H. (2011). Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics 27, 1667–1674. doi: 10.1093/bioinformatics/btr269
  41. Jacob, L., Neuvial, P., and Dudoit, S. (2010). “Gains in power from structured two-sample tests of means on graphs,” in Technical Report 271, Department of Statistics, Berkeley, CA: University of California. Available online at:
  42. Jacob, L., Neuvial, P., and Dudoit, S. (2013). DEGraph Software. Available online at: (Accessed on May 15, 2013).
  43. Joshi-Tope, G., Gillespie, M., Vastrik, I., D’Eustachio, P., Schmidt, E., de Bono, B., et al. (2005). REACTOME: a knowledgebase of biological pathways. Nucleic Acids Res. 33, D428–D432. doi: 10.1093/nar/gki072
  44. Kelder, T., Conklin, B. R., Evelo, C. T., and Pico, A. R. (2010). Finding the right questions: exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biol. 8:e1000472. doi: 10.1371/journal.pbio.1000472
  45. Khatri, P., Sellamuthu, S., Malhotra, P., Amin, K., Done, A., and Drăghici, S. (2005). Recent additions and improvements to the Onto-Tools. Nucleic Acids Res. 33(Suppl. 2), W762–W765. doi: 10.1093/nar/gki472
  46. Khatri, P., Sirota, M., and Butte, A. J. (2012). Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput. Biol. 8:e1002375. doi: 10.1371/journal.pcbi.1002375
  47. Khatri, P., Voichiţa, C., Kattan, K., Ansari, N., Khatri, A., Georgescu, C., et al. (2007). Onto-Tools: new additions and improvements in 2006. Nucleic Acids Res. 37, W206–W211. doi: 10.1093/nar/gkm327
  48. Kullback, S., and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Stat. 22, 79–86. doi: 10.1214/aoms/1177729694
  49. Lauritzen, S. L. (1996). Graphical Models, Vol. 17. Oxford: Oxford University Press.
  50. Mantel, N. (1966). Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother. Rep. 50, 163–170.
  51. Massa, M. S., Chiogna, M., and Romualdi, C. (2010). Gene set analysis exploiting the topology of a pathway. BMC Syst. Biol. 4:121. doi: 10.1186/1752-0509-4-121
  52. Massa, S., and Sales, G. (2013). TopologyGSA Software. Available online at: (Accessed on May 15, 2013).
  53. McLean, R. A., Sanders, W. L., and Stroup, W. W. (1991). A unified approach to mixed linear models. Am. Stat. 45, 54–64. doi: 10.1080/00031305.1991.10475767
  54. Mewes, H.-W., Heumann, K., Kaps, A., Mayer, K., Pfeiffer, F., Stocker, S., et al. (1999). MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 27, 44–48. doi: 10.1093/nar/27.1.44
  55. Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., et al. (2005). The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 33(Suppl. 1), D284–D288. doi: 10.1093/nar/gki078
  56. Mieczkowski, J., Swiatek-Machado, K., and Kaminska, B. (2012). Identification of pathway deregulation–gene expression based analysis of consistent signal transduction. PLoS ONE 7:e41541. doi: 10.1371/journal.pone.0041541
  57. Mieczkowski, J., Swiatek-Machado, K., and Kaminska, B. (2013). ACST Software. Available online at: (Accessed on May 15, 2013).
  58. Misman, M. F., Deris, S., Hashim, S. Z. M., Jumali, R., and Mohamad, M. S. (2009). “Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: a review of common approaches,” in ICIME’09. International Conference on Information Management and Engineering, 2009 (IEEE), (Kuala Lumpur), 496–500. doi: 10.1109/ICIME.2009.103
  59. Nam, D., and Kim, S.-Y. (2008). Gene-set approach for expression pattern analysis. Brief. Bioinf. 9, 189–197. doi: 10.1093/bib/bbn001
  60. Neapolitan, R. E. (2004). Learning Bayesian Networks. Prentice Hall.
  61. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., and Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27, 29–34. doi: 10.1093/nar/27.1.29
  62. Pan, K.-H., Lih, C.-J., and Cohen, S. N. (2005). Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proc. Natl. Acad. Sci. U.S.A. 102, 8961–8965. doi: 10.1073/pnas.0502674102
  63. Peri, S., Navarro, J. D., Kristiansen, T. Z., Amanchy, R., Surendranath, V., Muthusamy, B., et al. (2004). Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res. 32(Suppl. 1), D497–D501. doi: 10.1093/nar/gkh070
  64. Pico, A. R., Kelder, T., van Iersel, M. P., Hanspers, K., Conklin, B. R., and Evelo, C. (2008). WikiPathways: pathway editing for the people. PLoS Biol. 6:e184. doi: 10.1371/journal.pbio.0060184
  65. Rahnenführer, J., Domingues, F. S., Maydt, J., and Lengauer, J. (2004). Calculating the statistical significance of changes in pathway activity from gene expression data. Stat. Appl. Genet. Mol. Biol. 3. doi: 10.2202/1544-6115.1055
  66. Reuters, T. (2013). MetaCore Software. Available online at: (Accessed on May 15, 2013).
  67. Sartor, M. A., Leikauf, G. D., and Medvedovic, M. (2009). LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics 25, 211–217. doi: 10.1093/bioinformatics/btn592
  68. Schaefer, C. F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., et al. (2009). PID: the pathway interaction database. Nucleic Acids Res. 37(Suppl. 1), D674–D679. doi: 10.1093/nar/gkn653
  69. Shojaie, A. (2013). NetGSA Software. Available online at: (Accessed on May 15, 2013).
  70. Shojaie, A., and Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. J. Comput. Biol. 16, 407–426. doi: 10.1089/cmb.2008.0081
  71. Shojaie, A., and Michailidis, G. (2010). Network enrichment analysis in complex experiments. Stat. Appl. Genet. Mol. Biol. 9. doi: 10.2202/1544-6115.1483
  72. Snel, R., Lehmann, G., Bork, P., and Huynen, M. A. (2000). STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res. 28, 3442–3444. doi: 10.1093/nar/28.18.3442
  73. Spirtes, P. (1995). “Directed cyclic graphical representations of feedback models,” in Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (Morgan Kaufmann Publishers Inc.), (Montreal, QC), 491–498.
  74. Stark, C., Breitkreutz, B. J., Reguly, T., Boucher, L., Breitkreutz, A., and Tyers, M. (2006). BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 34(Suppl. 1), D535–D539. doi: 10.1093/nar/gkj109
  75. Steuer, R. (2007). Computational approaches to the topology, stability and dynamics of metabolic networks. Phytochemistry 68, 2139–2151. doi: 10.1016/j.phytochem.2007.04.041
  76. Subramanian, A., Tamayo, B., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545–15550. doi: 10.1073/pnas.0506580102
  77. Tarca, A. L., Drăghici, S., Bhatti, G., and Romero, R. (2012). Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics 13:136. doi: 10.1186/1471-2105-13-136
  78. Tarca, A. L., Drăghici, S., Khatri, P., Hassan, S. S., Mittal, P., Kim, J.-S., et al. (2009). A novel signaling pathway impact analysis (SPIA). Bioinformatics 25, 75–82. doi: 10.1093/bioinformatics/btn577
  79. Tarca, A. L., Khatri, P., and Drăghici, S. (2013). SPIA Software. Available online at: (Accessed on May 15, 2013).
  80. Varadan, V., Mittal, P., Vaske, C. J., and Benz, S. C. (2012). The integration of biological pathway knowledge in cancer genomics: a review of existing computational approaches. Signal Process. Mag. IEEE 29, 35–50. doi: 10.1109/MSP.2011.943037
  81. Vaske, C. J., and Benz, S. C. (2013a). PARADIGM Software Standalone. Available online at: (Accessed on May 15, 2013).
  82. Vaske, C. J., and Benz, S. C. (2013b). PARADIGM Software Web-Based. Available online at: (Accessed on May 15, 2013).
  83. Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., et al. (2010). Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237–i245. doi: 10.1093/bioinformatics/btq182
  84. Voichiţa, C., Donato, M., and Drăghici, S. (2012). “Incorporating gene significance in the impact analysis of signaling pathways,” in Machine Learning and Applications (ICMLA), 2012 11th International Conference on, IEEE. Vol. 1, (Boca Raton, FL), 126–131.
  85. Von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P., and Snel, S. (2003). STRING: a database of predicted functional associations between proteins. Nucleic Acids Res. 31, 258–261. doi: 10.1093/nar/gkg034
  86. Watts, D. J., and Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature 393, 440–442. doi: 10.1038/30918
  87. Xenarios, I., Rice, D. W., Salwinski, L., Baron, M. K., Marcotte, E. M., and Eisenberg, D. (2000). DIP: the database of interacting proteins. Nucleic Acids Res. 28, 289–291. doi: 10.1093/nar/28.1.289
  88. Xia, J. (2013). MetPA Software. Available online at: (Accessed on May 15, 2013).
  89. Xia, J., and Wishart, D. S. (2010). MetPA: a web-based metabolomics tool for pathway analysis and visualization. Bioinformatics 26, 2342–2344. doi: 10.1093/bioinformatics/btq418
  90. Yin, E., Gupta, M., Weninger, T., and Han, J. (2010). “A unified framework for link recommendation using random walks,” in Advances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on, IEEE, (Odense), 152–159.
  91. Zhao, Y., Chen, M. H., Pei, H., Rowe, D., Shin, D. G., Xie, G., et al. (2012). A Bayesian approach to pathway analysis by integrating gene–gene functional directions and microarray data. Stat. Biosci. 4, 105–131. doi: 10.1007/s12561-011-9046-1