Ontological analysis of gene expression data: current tools, limitations, and open problems.
Bioinformatics 21 (18), 3587-3595
A systems biology approach for pathway level analysis.
Genome Research, 2007, Vol. 17 (10), pages 1537-1545.
Global functional profiling of gene expression.
Genomics 81 (2), 98-104
Reliability and reproducibility issues in DNA microarray measurements.
TRENDS in Genetics 22 (2), 101-109
Data analysis tools for DNA microarrays.
(Book) CRC Press
Profiling gene expression using onto-express.
Genomics 79 (2), 266-270
Onto-tools, the toolkit of the modern biologist: onto-express, onto-compare, onto-design and onto-translate.
Nucleic acids research 31 (13), 3775-3781
Use and misuse of the gene ontology annotations.
Nature Reviews Genetics 9 (7), 509-515
Onto-Tools: New Additions and Improvements in 2006.
Nucleic Acids Research, Vol. 35, pages W206-W211, July 2007.
Statistics and data analysis for microarrays using R and bioconductor.
(Book) CRC Press
Analysis and correction of crosstalk effects in pathway analysis.
Genome Research, 2013, Vol. 23 (9).
A system biology approach for the steady-state analysis of gene signaling networks.
In Proceedings of the Congress on pattern recognition 12th Iberoamerican conference on Progress in pattern recognition, image analysis and applications (CIARP’07).
Foote, A.P., Keel, B.N., Zarek, C.M. and Lindholm-Perry, A.K., 2017. Beef steers with average dry matter intake and divergent average daily gain have altered gene expression in the jejunum. Journal of Animal Science.
Worthington, R., Ball, E., Wolf, B. and Takacs, G., 2017. Method to Identify Silent Codon Mutations That May Alter Peptide Elongation Kinetics and Co-translational Protein Folding. In Proteomics for Drug Discovery (pp. 237-243). Humana Press, New York, NY.
Kumar, A., Bicer, E.M., Pfeffer, P., Monopoli, M.P., Dawson, K.A., Eriksson, J., Edwards, K., Lynham, S., Arno, M., Behndig, A.F. and Blomberg, A., 2017. Differences in the coronal proteome acquired by particles depositing in the lungs of asthmatic versus healthy humans. Nanomedicine: Nanotechnology, Biology and Medicine.
Lin, C.K.E., Kaptein, J.S. and Sheikh, J., 2017. Differential expression of microRNAs and their possible roles in patients with chronic idiopathic urticaria and active hives. Allergy & Rhinology, 8(2), pp.e67-e80.
Kumar, A., Terakosolphan, W., Hassoun, M., Vandera, K.K., Novicky, A., Harvey, R., Royall, P.G., Bicer, E.M., Eriksson, J., Edwards, K. and Valkenborg, D., 2017. A Biocompatible Synthetic Lung Fluid Based on Human Respiratory Tract Lining Fluid Composition. Pharmaceutical Research, pp.1-12.
Liu, Y., Lang, T., Jin, B., Chen, F., Zhang, Y., Beuerman, R.W., Zhou, L. and Zhang, Z., 2017. Luteolin inhibits colorectal cancer cell epithelial-to-mesenchymal transition by suppressing CREB1 expression revealed by comparative proteomics study. Journal of Proteomics.
Schatton, D., Pla-Martin, D., Marx, M.C., Hansen, H., Mourier, A., Nemazanyy, I., Pessia, A., Zentis, P., Corona, T., Kondylis, V. and Barth, E., 2017. CLUH regulates mitochondrial metabolism by controlling translation and decay of target mRNAs. J Cell Biol, pp.jcb-201607019.
Todorova, K., Metodiev, M.V., Metodieva, G., Mincheff, M., Fernández, N. and Hayrabedyan, S., 2016. Micro-RNA-204 Participates in TMPRSS2/ERG Regulation and Androgen Receptor Reprogramming in Prostate Cancer. Hormones and Cancer, pp.1-21.
Wang, S., Campos, J., Gallotta, M., Gong, M., Crain, C., Naik, E., Coffman, R.L. and Guiducci, C., 2016. Intratumoral injection of a CpG oligonucleotide reverts resistance to PD-1 blockade by expanding multifunctional CD8+ T cells. Proceedings of the National Academy of Sciences, p.201608555.
Simonik, E.A., Cai, Y., Kimmelshue, K.N., Brantley-Sieders, D.M., Loomans, H.A., Andl, C.D., Westlake, G.M., Youngblood, V.M., Chen, J., Yarbrough, W.G. and Brown, B.T., 2016. LIM-Only Protein 4 (LMO4) and LIM Domain Binding Protein 1 (LDB1) Promote Growth and Metastasis of Human Head and Neck Cancer (LMO4 and LDB1 in Head and Neck Cancer). PloS one, 11(10), p.e0164804.
Wadhwa, R., Nigam, N., Bhargava, P., Dhanjal, J.K., Goyal, S., Grover, A., Sundar, D., Ishida, Y., Terao, K. and Kaul, S.C., 2016. Molecular Characterization and Enhancement of Anticancer Activity of Caffeic Acid Phenethyl Ester by γ Cyclodextrin. Journal of Cancer, 7(13), pp.1755-1771.
Zhou, H., Manthey, J., Lioutikova, E., Yang, W., Yoshigoe, K., Yang, M.Q. and Wang, H., 2016. The up-regulation of Myb may help mediate EGCG inhibition effect on mouse lung adenocarcinoma. Human Genomics, 10(2), p.103.
Klener, P., Fronkova, E., Berkova, A., Jaksa, R., Lhotska, H., Forsterova, K., Soukup, J., Kulvait, V., Vargova, J., Fiser, K. and Prukova, D., 2016. Mantle cell lymphoma‐variant Richter syndrome: Detailed molecular‐cytogenetic and backtracking analysis reveals slow evolution of a pre‐MCL clone in parallel with CLL over several years. International Journal of Cancer.
Colacino, J.A., McDermott, S.P., Sartor, M.A., Wicha, M.S. and Rozek, L.S., 2016. Transcriptomic profiling of curcumin-treated human breast stem cells identifies a role for stearoyl-coa desaturase in breast cancer prevention. Breast Cancer Research and Treatment, pp.1-13.
Kravchenko, D.S., Lezhnin, Y.N., Kravchenko, J.E., Chumakov, S.P. and Frolova, E.I., 2016. Study of Molecular Mechanisms of PDLIM4/RIL in Promotion of the Development of Breast Cancer. Biol Med (Aligarh), 8(2), p.2.
Mitt, M., Altraja, A. and Altraja, S., 2016. Altered Gene Expression Profiles In Human Bronchial Epithelial Cells Exposed To E-Cigarette Liquid: Results From A Genome-Wide Monitoring. In B58. BIG AND BIGGER (DATA): OMICS AND BIOMARKERS OF COPD AND OTHER CHRONIC LUNG DISEASES (pp. A4053-A4053). American Thoracic Society.
Na, Y., Kaul, S.C., Ryu, J., Lee, J.S., Ahn, H.M., Kaul, Z., Kalra, R.S., Li, L., Widodo, N., Yun, C.O. and Wadhwa, R., 2016. Stress chaperone mortalin contributes to epithelial-mesenchymal transition and cancer metastasis. Cancer research, pp.canres-2704.
Westphalen, C.B., Takemoto, Y., Tanaka, T., Macchini, M., Jiang, Z., Renz, B.W., Chen, X., Ormanns, S., Nagar, K., Tailor, Y. and May, R., 2016. Dclk1 Defines Quiescent Pancreatic Progenitors that Promote Injury-Induced Regeneration and Tumorigenesis. Cell Stem Cell, 18(4), pp.441-455.
Eddens, T., Campfield, B.T., Serody, K., Manni, M.L., Horne, W., Elsegeiny, W., McHugh, K.J., Pociask, D., Chen, K., Zheng, M. and Alcorn, J.F., 2016. A Novel CD4+ T-cell Dependent Murine Model of Pneumocystis Driven Asthma-like Pathology. American Journal of Respiratory And Critical Care Medicine, (ja).
Takeda, K., Sriram, S., Chan, X.H.D., Ong, W.K., Yeo, C.R., Tan, B., Lee, S.A., Kong, K.V., Hoon, S., Jiang, H. and Yuen, J.J., 2016. Retinoic Acid Mediates Visceral-specific Adipogenic Defects of Human Adipose-derived Stem Cells. Diabetes, p.db151315.
Williams, K.E., Lemieux, G.A., Hassis, M.E., Olshen, A.B., Fisher, S.J. and Werb, Z., 2016. Quantitative proteomic analyses of mammary organoids reveals distinct signatures after exposure to environmental chemicals. Proceedings of the National Academy of Sciences, p.201600645.
Ortea, I., Rodríguez-Ariza, A., Chicano-Gálvez, E., Vacas, M.A. and Gámez, B.J., 2016. Discovery of potential protein biomarkers of lung adenocarcinoma in bronchoalveolar lavage fluid by SWATH MS data-independent acquisition and targeted data extraction. Journal of Proteomics. 2016 Feb 18.
Lamontagne, J., Mell, J.C. and Bouchard, M.J., 2016. Transcriptome-Wide Analysis of Hepatitis B Virus-Mediated Changes to Normal Hepatocyte Gene Expression. PLoS Pathog, 12(2), p.e1005438.
Zhou, H., Manthey, J., Lioutikova, E., Yang, M.Q., Yang, W., Yoshigoe, K. and Wang, H., 2015, November. The upregulation of Myb and Peg3 may mediate EGCG inhibition effect on mouse lung adenocarcinoma. In Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on (pp. 1532-1535). IEEE.
Andres-Terre, M., McGuire, H.M., Pouliot, Y., Bongen, E., Sweeney, T.E., Tato, C.M. and Khatri, P., 2015. Integrated, Multi-cohort Analysis Identifies Conserved Transcriptional Signatures across Multiple Respiratory Viruses. Immunity, 43(6), pp.1199-1211.
Srivastava, A., Ritesh, K.C., Tsan, Y.C., Liao, R., Su, F., Cao, X., Hannibal, M.C., Keegan, C.E., Chinnaiyan, A.M., Martin, D.M. and Bielas, S.L., 2015. De novo Dominant ASXL3 Mutations Alter H2A Deubiquitination and Transcription in Bainbridge-Ropers Syndrome. Human molecular genetics, p.ddv499.
Lee, S.E., Son, G.W., Park, H.R., Jin, Y.H., Park, C.S. and Park, Y.S., 2015. Integrative analysis of miRNA and mRNA profiles in response to myricetin in human endothelial cells. BioChip Journal, 9(3), pp.239-246.
Sanford, T., Welty, C., Meng, M. and Porten, S., 2015. MP68-18 MOLECULAR ANALYSIS OF UROTHELIAL TUMORS IN PATIENTS WITH AND WITHOUT METASTASIS STRATIFIED BY T STAGE. The Journal of Urology, 193(4), p.e865.
Most existing pathway analysis methods focus on either the number of differentially expressed genes observed in a given pathway (enrichment analysis methods), or on the correlation between the pathway genes and the class of the samples (functional class scoring methods). Both approaches treat pathways as simple sets of genes, disregarding the complex gene interactions that these pathways are built to describe.
More recently, biological annotations have started to include descriptions of gene interactions in the form of gene signaling networks, such as KEGG (Ogata et al., 1999), BioCarta (www.biocarta.com) and Reactome (Joshi-Tope et al., 2005). This richer type of annotation has opened the possibility of an automatic analysis aimed at identifying the gene signaling networks that are relevant in a given condition, and perhaps even the specific signals or signal perturbations involved. Treating pathways as simple sets of genes, however, is not well suited to a systems biology approach that aims to account for system-level dependencies and interactions, as well as to identify perturbations and modifications at the pathway or organism level (Stelling, 2004).
Advaita’s products are based on the Impact Analysis method, which leverages information about the type, function, position and interaction of the genes in a given pathway. Impact Analysis combines the evidence obtained from the classical enrichment analysis with a novel type of evidence, which measures the actual perturbation on a given pathway under a given condition. We illustrate the capabilities of the novel method on four real datasets. The results obtained on these data show that Impact Analysis has better specificity and sensitivity than several widely used pathway analysis methods.
Hi there. Advaita is dedicated to bringing you the most advanced, easiest-to-use Bioinformatics tools out there. And that includes educational materials designed to help you take advantage of all the powerful features we offer. Our last post about p-value correction factors was a bit confusing. This blog post explains how each method works, so you can decide when to use each one.
We are lucky to have a few bioinformaticians around the office, including Dr. Sorin Draghici, our CEO and founder. If you don’t have a bioinformatics expert in-house, you might want to pick up his book. It’s full of useful information, and I think the best part about it is how easy it is to read: he makes it fun! For now, if you want to know more about getting the most from your analyses, read on…
A p-value represents the probability of observing a result at least as extreme as the one obtained purely by random chance. For example, if there are 5 differentially expressed (DE) genes on pathway X out of 100 DE genes in the dataset, the over-enrichment p-value for pathway X is the probability that, from a randomly selected set of 100 genes in the dataset, 5 or more fall on pathway X. Significance is determined by setting a threshold, in many cases 0.05. If the p-value is less than 0.05, pathway X is considered significant because the chance of randomly observing a result like this is less than 5%.
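As a minimal sketch of this calculation, the over-enrichment p-value can be computed as the upper tail of a hypergeometric distribution. The example above does not say how many genes were measured or how many of them lie on pathway X, so the values 20,000 and 200 below are assumptions chosen purely for illustration, as is the function name.

```python
from math import comb


def enrichment_p_value(n_measured, n_on_pathway, n_de, n_de_on_pathway):
    """Probability that a random draw of n_de genes from the n_measured
    genes contains n_de_on_pathway or more genes from the pathway
    (upper tail of the hypergeometric distribution)."""
    total_draws = comb(n_measured, n_de)
    max_overlap = min(n_on_pathway, n_de)
    favorable = sum(
        comb(n_on_pathway, k) * comb(n_measured - n_on_pathway, n_de - k)
        for k in range(n_de_on_pathway, max_overlap + 1)
    )
    return favorable / total_draws


# 5 of the 100 DE genes fall on pathway X (numbers from the text).
# Assumed for illustration: 20,000 measured genes, 200 of them on pathway X.
p = enrichment_p_value(
    n_measured=20_000, n_on_pathway=200, n_de=100, n_de_on_pathway=5
)
print(f"over-enrichment p-value: {p:.4g}")
```

With these assumed counts, only about one DE gene would be expected on the pathway by chance, so observing five gives a p-value below the 0.05 threshold.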
This means that there is still a chance that the observation was in fact due to randomness and pathway X is not significant, what we would call a “false positive.” The chance of pathway X being a false positive is small, but when we perform this test multiple times as we would for multiple pathways, the chance of reporting at least one false positive increases quickly. That is because the probability of reporting at least one false positive in a group of m independent tests, each run at threshold α, is 1 - (1 - α)^m, which is roughly the sum of the individual thresholds (m × α) when that sum is small. With a 0.05 threshold, 10 tests already carry about a 40% chance of at least one false positive. When this is done for hundreds of pathways, we are virtually guaranteed to have some pathways that appear to be significant just by chance. This is known as the “multiple comparisons problem,” and we tell you how to correct for it in the first section.
Enrichment tests are used in a number of settings including enrichment pathway analysis [1] and gene ontology (GO) enrichment analysis. However, the GO has an additional structure that includes a hierarchical organization of its terms, as well as a “true path rule” that allows genes to be associated with entire paths through the ontology, rather than single terms [2]. Because of these additional properties, specific enrichment analysis methods (and associated multiple comparison strategies) have been developed for GO enrichment analysis. Two of these methods will be briefly discussed in the second section.
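To see how quickly the chance of at least one false positive grows, here is a minimal sketch. It assumes that the tests are independent and that every test uses the same threshold; the function name is made up for this illustration. The simple sum of the thresholds is an upper bound on this quantity and a good approximation only while it stays small.

```python
def chance_of_any_false_positive(alpha, n_tests):
    """Probability of at least one false positive among n_tests independent
    tests, each performed at significance threshold alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# How the chance grows with the number of pathways tested at a 0.05 threshold.
for n_tests in (1, 10, 100, 300):
    rate = chance_of_any_false_positive(0.05, n_tests)
    print(f"{n_tests:>3} tests -> {rate:.3f}")
```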
I. Methods of Correcting for Multiple Comparisons
General methods for multiple comparison corrections may be applied to any enrichment analysis. There are two strategies to limit the number of false positives across a large number of significance tests, and several methods have been developed for each strategy.
Strategy 1. Limit the probability of making a mistake (reporting a false positive) for each individual test
Strategy 2. Limit the rate of false positives, i.e. the proportion of false positive tests
In iPathwayGuide and iVariantGuide, we offer the most widely-cited method for each strategy. Furthermore, the methods we chose provide a range of stringency so that you can choose what is appropriate for your data. Try it out now!
Bonferroni Correction
The Bonferroni correction is considered to be the most conservative method to correct for multiple comparisons, meaning that the fewest false positives are returned. The drawback is that some truly meaningful events may not be reported as significant. The Bonferroni method guarantees that the chance of any test in the whole set yielding a false positive is no greater than the chosen significance threshold [3,4]. In other words, for a 5% significance threshold, the Bonferroni correction guarantees that the probability of generating at least one false positive is at most 5%. It does so by multiplying each raw p-value by the number of tests performed (equivalently, dividing the threshold by that number), so the more tests we run, the smaller the individual (raw) p-values must be for them to remain significant after the Bonferroni correction.
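The Bonferroni adjustment itself is a one-line calculation. The sketch below is a generic illustration, not the code used in iPathwayGuide or iVariantGuide, and the four raw p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: each raw p-value is multiplied by the
    number of tests and capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical raw p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = bonferroni(raw)

# A pathway stays significant only if its adjusted p-value is below 0.05.
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

All four raw p-values are below 0.05, but only the first and the last remain below the threshold after the correction.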
False Discovery Rate
In contrast to Bonferroni, FDR is one of the most lenient methods, allowing more true positives to be reported as significant with the drawback that some false positives may also be reported as such. Developed by Benjamini and Hochberg, FDR correction guarantees that the expected proportion of false positives among the tests reported as significant will be no larger than the original significance threshold [5,6]. In other words, for a 5% significance threshold, FDR correction guarantees that, on average, no more than 5% of the positive tests are false positives.
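For comparison, here is a minimal sketch of the Benjamini-Hochberg step-up adjustment [5]. It is a generic illustration with hypothetical p-values, not the code used in iPathwayGuide or iVariantGuide: each raw p-value is scaled by the number of tests divided by its rank, and the adjusted values are then forced to be non-decreasing in rank order.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n_tests = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    adjusted = [0.0] * n_tests
    running_min = 1.0
    # Walk from the largest p-value down, keeping a running minimum so the
    # adjusted values never decrease as the raw p-values increase.
    for rank in range(n_tests, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n_tests / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

On these four hypothetical values every pathway stays below 0.05 after the FDR adjustment, whereas a Bonferroni adjustment (multiplying each value by four) would keep only two of them.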
II. Multiple Comparisons in GO enrichment analysis
Due to the True Path Rule, genes associated with a GO term are also associated with its parent terms (for more on this, see Chapter 22 of Dr. Draghici’s book [7]). This means that simply performing an enrichment analysis for each GO term will count each gene many times, which is a serious problem (see Draghici, Chapter 24). Furthermore, testing the enrichment of all GO terms is not necessary and, due to the unavoidable multiple comparison curse, will increase the number of false positives reported. Luckily, one can leverage the structure and additional properties of GO in order to limit the number of tests performed, and therefore the number of comparisons one must correct for. In 2006, Alexa et al. [8] proposed two methods to accomplish this: “Elim” and “Weight.”
In iPathwayGuide and iVariantGuide we offer both methods, each of which follows the same outline:
1) Decouple GO terms from one another
2) Perform significance tests
3) Correct for multiple comparisons
The Elim method assesses the significance of GO terms starting with the most specific terms first. The benefit of this approach is that it is easier to find specialized terms that are significant, e.g. “response to amphetamine” is more descriptive than “response to chemical.” This approach provides a very nice custom cut through the GO hierarchy that “magically” identifies the lowest level of abstraction that contains the significant GO terms in the given experiment.
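The bottom-up idea can be sketched in a few lines. This is a deliberately simplified illustration, not the published algorithm of Alexa et al. [8] nor the implementation in iPathwayGuide: it assumes a toy hierarchy in which “response to amphetamine” is a direct child of “response to chemical”, made-up gene names, a hypergeometric test, and a fixed 0.01 cutoff. All function names are hypothetical.

```python
from math import comb


def upper_tail(n_measured, n_term, n_de, n_de_term):
    """Hypergeometric probability of n_de_term or more DE genes on a term."""
    total = comb(n_measured, n_de)
    tail = sum(
        comb(n_term, k) * comb(n_measured - n_term, n_de - k)
        for k in range(n_de_term, min(n_term, n_de) + 1)
    )
    return tail / total


def ancestors(term, parents):
    """All terms reachable by following parent links upward from term."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def elim_sketch(term_genes, parents, measured, de_genes, alpha=0.01):
    """Test the most specific terms first; when a term is significant,
    remove its genes from all of its ancestors before they are tested."""
    remaining = {term: set(genes) for term, genes in term_genes.items()}
    # A term always has more ancestors than any of its own ancestors,
    # so this ordering visits children before their parents.
    order = sorted(term_genes, key=lambda t: len(ancestors(t, parents)), reverse=True)
    p_values = {}
    for term in order:
        genes = remaining[term]
        p = upper_tail(len(measured), len(genes), len(de_genes), len(genes & de_genes))
        p_values[term] = p
        if p < alpha:
            for ancestor in ancestors(term, parents):
                if ancestor in remaining:
                    remaining[ancestor] -= genes
    return p_values


# Toy data: 100 measured genes, 10 of them DE (all values are made up).
measured = {f"g{i}" for i in range(100)}
de_genes = {f"g{i}" for i in range(10)}
term_genes = {
    "response to amphetamine": {f"g{i}" for i in range(8)},
    "response to chemical": {f"g{i}" for i in range(30)},
}
parents = {"response to amphetamine": ["response to chemical"]}

p_elim = elim_sketch(term_genes, parents, measured, de_genes)
p_parent_plain = upper_tail(100, 30, 10, 10)  # parent tested without Elim
print(p_elim, p_parent_plain)
```

In this toy case the general term looks significant when tested on its own, but once the genes of the significant specific term are removed, the general term no longer passes the cutoff. That is the decoupling step in miniature.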
Given a set of related GO terms, the Weight method is designed to identify the term that best represents the genes of interest, regardless of where the term falls in the hierarchy. This approach is less stringent than Elim, capturing more true positives with the drawback of including additional false positives.
iPathwayGuide and iVariantGuide are the only tools to provide these advanced correction factors to help you minimize false positives. Try them today for FREE and see what is truly significant in your data.
1. Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol, 8(2), e1002375.
2. Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7), 509-515.
3. Dunn, O. J. (1959). Confidence intervals for the means of dependent, normally distributed variables. Journal of the American Statistical Association, 54(287), 613-621.
4. Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56(293), 52-64.
5. Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society. Series B (Methodological), 289-300.
6. Benjamini, Y. & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of statistics, 1165-1188.
7. Drăghici, S. (2011). Statistics and data analysis for microarrays using R and bioconductor. CRC Press.
8. Alexa, A., Rahnenführer, J., & Lengauer, T. (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics, 22(13), 1600-1607.
In the world of bioinformatics, we all need to be careful when analyzing our data. I receive countless questions about what background should be used when analyzing gene expression or protein expression data. In this blog post I will attempt to clarify this question.
We have all had the experience where you get a raffle ticket and the ticket says, “Must be present to win.” What they are doing is establishing the pool of candidates from which to draw a winning ticket (and also trying to keep you there, but that’s beyond the scope of this discussion). Performing pathway analysis and other enrichment analyses is somewhat similar to that.
It is very intuitive that the size of the pool of candidates will dramatically affect the odds of winning. Using our raffle ticket example, let us say 1,000 tickets are given out and there will only be one winner. On the surface, we think our odds of winning are 1 in 1,000. But now let’s say the crowd of people that are actually present for the drawing is only 100 people. Because we “must be present to win,” our odds of winning are now actually 1 in 100. Furthermore, if the raffle organizers wanted to cheat, they could for instance have the raffle draw take place in a small room in which they invite only their friends and relatives. If this room hosts only 10 people at the time of the drawing, the odds of winning would now be 1 in 10. So the odds are really dependent on which background we choose. The same goes for pathway and other enrichment-based analyses.
Let us say you have 1,000 significant genes or proteins that were selected as differentially expressed (DE) in your condition. As I mentioned in my opening paragraph, the question becomes what background should be used when trying to understand what pathways or GO terms are significant. The p-values calculated during the analysis are just another way to tell you about the odds of a given pathway appearing significant just by chance. And, as we saw in the raffle example, the choice of the background can have a dramatic effect on the results (odds). Should we use all protein-coding genes? How about all genes in NCBI or Ensembl?
The answer is we should always use the set of genes that were measured. This is akin to saying, “you must be present to win.” If the gene or protein was not measured, it should not be in the mix. So if you use an arbitrary set of genes for the background (e.g. all NCBI genes, or all Ensembl genes) your statistic will be heavily skewed. All enrichment programs that have you submit only DE genes or proteins do this. Similarly, if you only use the set of DE genes as the background, and further select from there, you can also skew your results (this is like doing the drawing among your 10 friends and increasing your odds of success).
To exemplify this, I took the set of 1,172 significant genes (p<0.05 and |Log2FC|>0.6) from a public dataset (GSE47363) and ran it through a simple enrichment analysis. In the first experiment, I used the set of genes that were measured as the background, about 20,000 genes. Then I ran the exact same set of DE genes, but this time I used “NCBI genes” as provided by another popular web-based pathway analysis application as the background (about 30,000 genes). See Figure 1 below.
Figure 1: Comparing the same set of DEGs, but with different backgrounds. On the left, we use the set of genes that were measured (20k). On the right we use 30k genes from NCBI as the background. Notice the dramatic difference in the number of significant pathways and the p-Values.
While the top pathway is the same in both instances, you will notice little else is the same. In the first set of results, obtained with the appropriate background, we see a total of 64 significant pathways (FDR pV<0.05). In the second set of results, obtained with all NCBI genes as background, there are more than 150 significant pathways! Also, you will notice the p-values are much more significant when using NCBI as the background.
You could say: “Well, but the first pathway is the same. So if a pathway is truly relevant, it will be on top no matter what the background is.” First, this is not true. The fact that the two sets of results have the same pathway at the very top is just a coincidence. Secondly, this is an incorrect way of thinking. The very purpose of the p-values is to provide us with the means to distinguish between pathways that may have some differentially expressed genes just by chance, and the pathways that may truly be involved with the phenotype.
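The direction of this effect is easy to reproduce with a small hypergeometric calculation. The sketch below reuses the 1,172 DE genes and the two background sizes from the experiment, but the pathway itself is hypothetical: a pathway of 100 genes with 12 DE genes on it is an assumption made only for illustration, and the same pathway size is assumed under both backgrounds.

```python
from math import comb


def enrichment_p_value(n_background, n_on_pathway, n_de, n_de_on_pathway):
    """Upper-tail hypergeometric probability of seeing n_de_on_pathway or
    more pathway genes among n_de genes drawn from the background."""
    total_draws = comb(n_background, n_de)
    favorable = sum(
        comb(n_on_pathway, k) * comb(n_background - n_on_pathway, n_de - k)
        for k in range(n_de_on_pathway, min(n_on_pathway, n_de) + 1)
    )
    return favorable / total_draws


n_de = 1_172          # DE genes, from the experiment above
pathway_size = 100    # hypothetical pathway size
de_on_pathway = 12    # hypothetical number of DE genes on that pathway

# Same DE genes, same pathway; only the background size changes.
p_measured = enrichment_p_value(20_000, pathway_size, n_de, de_on_pathway)
p_inflated = enrichment_p_value(30_000, pathway_size, n_de, de_on_pathway)
print(f"measured background (20k): {p_measured:.3g}")
print(f"inflated background (30k): {p_inflated:.3g}")
```

Nothing about the data changes between the two calls, yet the inflated background makes the same overlap look less likely by chance and so produces the smaller p-value.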
All pathways with a p-value less than the significance threshold (e.g. 5%) should be carefully studied, not just the very top result, or the top three for that matter. If you have too many significant pathways and you cherry-pick from them only the ones that “look familiar”, your results will be severely biased.
A better way is to go back to the criteria you used to select your differentially expressed genes, use more stringent thresholds for p-values and/or fold changes and re-do your analysis. In most cases, using reasonable thresholds for your genes will give you a set of significant pathways that will actually offer you a good understanding of the underlying biological phenomenon. Assuming, of course, that you used a good pathway analysis method. But let us leave this for another posting.
To summarize, using the proper background set of genes or proteins can have a dramatic effect on the number of significant results and the number of false positives. You have to use the entire set of genes that were measured as the appropriate background when analyzing your data. Nothing more, nothing less! This is not a recommendation, nor a piece of advice. This is a must in order to ensure the scientific validity of your findings. This is why in iPathwayGuide, we ask you to submit your entire list of genes. If you ever use an application that only requires you to submit the significant genes, ask yourself, “What is the background being considered?”
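The re-selection step described above can be sketched as a simple filter. The gene names and values below are hypothetical, the function name is made up, and the default thresholds mirror the ones used for the GSE47363 example (p<0.05 and an absolute log2 fold change above 0.6).

```python
def select_de_genes(measurements, max_p=0.05, min_abs_log2fc=0.6):
    """Return the genes whose p-value is below max_p and whose absolute
    log2 fold change exceeds min_abs_log2fc."""
    return [
        gene
        for gene, (p_value, log2fc) in measurements.items()
        if p_value < max_p and abs(log2fc) > min_abs_log2fc
    ]


# Hypothetical measurements: gene -> (p-value, log2 fold change).
measurements = {
    "GENE_A": (0.001, 2.1),
    "GENE_B": (0.030, -0.8),
    "GENE_C": (0.040, 0.7),
    "GENE_D": (0.200, 1.5),
    "GENE_E": (0.004, -0.3),
}

original = select_de_genes(measurements)
stricter = select_de_genes(measurements, max_p=0.01, min_abs_log2fc=1.0)
print(original, stricter)
```

Only the list of DE genes shrinks when the thresholds are tightened. The background for the enrichment test is still every gene in `measurements`, that is, everything that was measured.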
For more on this topic, you can read:
Chapter 24 in Statistics and Data Analysis for Microarrays Using R and Bioconductor, Second Edition (Chapman & Hall/CRC Mathematical and Computational Biology)