In the world of bioinformatics, we all need to be careful when analyzing our data. I receive countless questions about what background should be used when analyzing gene expression or protein expression data. In this blog post I will attempt to clarify this question.

We have all had the experience where you get a raffle ticket that says, “Must be present to win.” What the organizers are doing is establishing the pool of candidates from which to draw a winning ticket (and also trying to keep you there, but that is beyond the scope of this discussion). Performing pathway analysis and other enrichment analyses is somewhat similar to that.

It is very intuitive that the size of the pool of candidates will dramatically affect the odds of winning. Using our raffle ticket example, let us say 1,000 tickets are given out and there will be only one winner. On the surface, we think our odds of winning are 1 in 1,000. But now let’s say the crowd actually present for the drawing is only 100 people. Because we “must be present to win,” our odds of winning are now actually 1 in 100. Furthermore, if the raffle organizers wanted to cheat, they could, for instance, hold the drawing in a small room to which they invite only their friends and relatives. If this room holds only 10 people at the time of the drawing, the odds of winning would now be 1 in 10. So the odds really depend on which background we choose. The same goes for pathway and other enrichment-based analyses.
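The raffle arithmetic above can be sketched in a few lines. The number of winning tickets stays the same; only the size of the pool that is actually eligible at drawing time changes:

```python
# One winning ticket, three possible "backgrounds":
# everyone who bought a ticket / those present / the small room of friends.
winners = 1

for pool in (1000, 100, 10):
    # The odds of any one eligible person winning depend only on the pool size.
    print(f"pool of {pool}: odds of winning = 1 in {pool} ({winners / pool:.1%})")
```

The same ticket goes from a 0.1% chance to a 10% chance purely by shrinking the pool, which is exactly the manipulation to watch for in an enrichment background.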

Let us say you have 1,000 significant genes or proteins that were selected as differentially expressed (DE) in your condition. As I prefaced in my opening paragraph, the question becomes what background should be used when trying to understand which pathways or GO terms are significant. The p-values calculated during the analysis are simply a way of telling you the odds of a given pathway appearing significant just by chance. And, as we saw in the raffle example, the choice of background can have a dramatic effect on the results (the odds). Should we use all protein-coding genes? How about all genes in NCBI or Ensembl?

The answer is that we should always use the set of genes that were measured. This is akin to saying, “you must be present to win.” If a gene or protein was not measured, it should not be in the mix. So if you use an arbitrary set of genes as the background (e.g., all NCBI genes, or all Ensembl genes), your statistics will be heavily skewed. Any enrichment program that has you submit only your DE genes or proteins is forced to do exactly this, because it has no way of knowing which genes were actually measured. Similarly, if you use only the set of DE genes as the background, and select further from there, you can also skew your results (this is like holding the drawing among your 10 friends and increasing your odds of success).
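The effect of the background size can be sketched with a hypergeometric test, the test underlying most over-representation tools. The pathway size and DE counts below are hypothetical, chosen only to illustrate the trend; the only thing that changes between the two runs is the background size:

```python
from scipy.stats import hypergeom

# Hypothetical numbers for illustration only
de_total = 1000      # differentially expressed genes submitted
in_pathway = 40      # genes of one pathway present in the background
de_in_pathway = 15   # DE genes that fall in that pathway

# P(X >= de_in_pathway) by chance, under two different backgrounds:
# the ~20,000 genes actually measured vs. ~30,000 "all NCBI" genes
for background in (20_000, 30_000):
    p = hypergeom.sf(de_in_pathway - 1, background, in_pathway, de_total)
    print(f"background = {background:,} genes -> p = {p:.2e}")
```

Inflating the background makes the same overlap look rarer, so the p-value shrinks and more pathways cross the significance threshold.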

To illustrate this, I took the set of 1,172 significant genes (p < 0.05 and |Log2FC| > 0.6) from a public dataset (GSE47363) and ran it through a simple enrichment analysis. In the first experiment, I used the set of genes that were measured as the background, about 20,000 genes. Then I ran the exact same set of DE genes, but this time using “NCBI genes” (about 30,000 genes), as provided by another popular web-based pathway analysis application, as the background. See Figure 1 below.

Figure 1: Comparing the same set of DEGs with different backgrounds. On the left, we use the set of genes that were measured (about 20,000). On the right, we use about 30,000 genes from NCBI as the background. Notice the dramatic difference in the number of significant pathways and in the p-values.

While the top pathway is the same in both instances, you will notice little else is. In the first set of results, obtained with the appropriate background, we see a total of 64 significant pathways (FDR-corrected p < 0.05). In the second set of results, obtained with all NCBI genes as the background, there are more than 150 significant pathways! You will also notice that the p-values are much more significant when using NCBI as the background.

You could say: “Well, the first pathway is the same. So if a pathway is truly relevant, it will be at the top no matter what the background is.” First, this is not true: the fact that the two sets of results have the same pathway at the very top is just a coincidence. Second, this is an incorrect way of thinking. The very purpose of the p-values is to provide us with the means to distinguish between pathways that may contain some differentially expressed genes just by chance and pathways that may truly be involved with the phenotype.

All pathways with a p-value less than the significance threshold (e.g. 5%) should be carefully studied, not just the very top result, or the top three for that matter. If you have too many significant pathways and you cherry pick from them only the ones that “look familiar”, your results will be severely biased.

A better approach is to go back to the criteria you used to select your differentially expressed genes, use more stringent thresholds for the p-values and/or fold changes, and redo your analysis. In most cases, using reasonable thresholds for your genes will give you a set of significant pathways that actually offers a good understanding of the underlying biological phenomenon. Assuming, of course, that you used a good pathway analysis method. But let us leave that for another post.
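Tightening the selection is a simple filter over your DE results table. This is a minimal sketch; the gene names, p-values, and fold changes below are made up for illustration:

```python
# Toy DE results as (gene, p-value, log2 fold change) tuples
results = [
    ("GENE_A", 0.040, 0.7),
    ("GENE_B", 0.001, 2.1),
    ("GENE_C", 0.020, -1.5),
]

# original cutoffs: p < 0.05 and |log2FC| > 0.6
loose = [g for g, p, fc in results if p < 0.05 and abs(fc) > 0.6]

# stricter cutoffs: p < 0.01 and |log2FC| > 1.0
strict = [g for g, p, fc in results if p < 0.01 and abs(fc) > 1.0]

print(loose)   # all three toy genes pass the loose cutoffs
print(strict)  # only GENE_B survives the stricter cutoffs
```

The shorter, higher-confidence DE list is then resubmitted for enrichment, still against the full set of measured genes as the background.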

To summarize, using the proper background set of genes or proteins can have a dramatic effect on the number of significant results and the number of false positives. You must use the entire set of genes that were measured as the background when analyzing your data. Nothing more, nothing less! This is not a recommendation, nor mere advice; it is a must in order to ensure the scientific validity of your findings. This is why, in iPathwayGuide, we ask you to submit your entire list of genes. If you ever use an application that only requires you to submit the significant genes, ask yourself, “What is the background being considered?”

For more on this topic, you can read:

Chapter 24 in Statistics and Data Analysis for Microarrays Using R and Bioconductor, Second Edition (Chapman & Hall/CRC Mathematical and Computational Biology)

Use and misuse of the gene ontology annotations, Nature Reviews Genetics, 9(7):509–515, July 2008. PMID: 18475267, DOI: 10.1038/nrg2363




About the Author: Sorin Draghici

Dr. Draghici is a Professor in the Department of Computer Science and the head of the Intelligent Systems and Bioinformatics Laboratory at Wayne State University. He also holds a joint appointment in the Department of Obstetrics and Gynecology and is an Associate Dean in Wayne State University's College of Engineering. Dr. Draghici is a senior member of the IEEE and an editor of IEEE/ACM Transactions on Computational Biology and Bioinformatics, Protocols in Bioinformatics, Discoveries Journals, Journal of Biomedicine and Biotechnology, and International Journal of Functional Informatics and Personalized Medicine. His publications include two books (“Data Analysis Tools for DNA Microarrays” and “Statistics and Data Analysis for Microarrays Using R”), 8 book chapters, and over 150 peer-reviewed journal and conference publications, which together have gathered over 12,000 citations to date.