​In the world of bioinformatics, we all need to be careful when analyzing our data. I receive countless questions about what background should be used when analyzing gene expression or protein expression data. In this blog post I will attempt to clarify this question.

We have all had the experience where you get a raffle ticket and the ticket says, “Must be present to win.” What they are doing is establishing the pool of candidates from which to draw a winning ticket (and also trying to keep you there, but that’s beyond the scope of this discussion). Performing pathway analysis and other enrichment analyses, is somewhat similar to that.

​​It is very intuitive that the size of the pool of candidates will dramatically affect the odds of winning. Using our raffle ticket example, let us say 1,000 tickets are given out and there will only be one winner. On the surface, we think our odds of winning are 1 in 1,000. But now let’s say the crowd of people that are actually present for the drawing is only 100 people. Because we “must be present to win,” our odds of winning are now actually 1 in 100. Furthermore, if the raffle organizers wanted to cheat, they could for instance have the raffle draw take place in a small room in which they invite only their friends and relatives. If this room hosts only 10 people at the time of the drawing, the odds of winning would now be 1 in 10. So the odds are really dependent on the set from which the winning tickets are selected. The same goes for pathway and other enrichment-based analyses.

Let us say you have 1,000 significant genes or proteins that were selected as differentially expressed (DE) in your condition. As I prefaced in my opening paragraph, the question becomes what background should be used when trying to understand what pathways or GO terms are significant. The p-values calculated during the analysis are just another way to tell you about the odds of a given pathway being significant just by chance. And, as we saw in the raffle experiment the choice of the background can have a dramatic effect on the results (odds). Should we use all protein coding genes? How about all genes in NCBI or Ensemble?

The answer is we should always use the set of genes that were measured. This is akin to saying, “you must be present to win.” If the gene or protein was not measured, it should not be in the mix. So if you use an arbitrary set of genes for the background (e.g. all NCBI genes, or all Ensemble genes) your statistic will be heavily skewed. All enrichment programs that have you submit only DE genes or proteins do this. Similarly, if you only use the set of DE genes as the as the background, and further select from there, you can also skew your results (this is like doing the drawing among your 10 friends and increasing your odds of success).

To exemplify this, I took the set of 1,172 significant genes (p<0.05 and Log2FC>|0.6|) from a public dataset (GSE47363) and ran it through a simple enrichment analysis. In the first experiment, I used the set of genes that were measured as the background, about 20,000 genes. Then I ran the exact same set of DE genes, but this time I used “NCBI genes” as provided by another popular web-based pathway analysis application as the background (about 30,000 genes). See Figure 1 below.

Figure 1: Comparing the same set of DEGs, but with different backgrounds. On the left, we use the set of genes that were measured (20k). On the right we use 30k genes from NCBI as the background. Notice the dramatic difference in the number of significant pathways and the p-Values.

While the top pathway is the same in both instances, you will notice little else is the same. In the first set of results, obtained with the appropriate background, we see a total of 64 significant pathways (FDR p-values<0.05). The second set of results, obtained with all NCBI genes as background, there are more than 150 significant pathways! Also, you will notice the p-values are much more significant when using NCBI as the background.

You could say: “Well, but the first pathway is the same. So if a pathway is truly relevant, it will be on top no matter what the background is.” First, this is not true. The fact that the two sets of results have the same pathway at the very top is just a coincidence. Secondly, this is an incorrect way of thinking. The very purpose of the p-values is to provide us with the means to distinguish between pathways that are truly be involved with the phenotype, and the pathways that may have some differentially expressed genes just by chance. All pathways with a p-value less than the significance threshold (e.g. 5%) should be carefully studied, not just the very top result, or the top three for that matter. If you have too many significant pathways and you cherry pick from them only the ones that “look familiar”, or “make sense”, your results will be severely biased. And by the way, you can never discovery new phenomena if you investigate only those that you already know are involved.

If you have too many DEGs, a better way is to go back to the criteria you used to select your differentially expressed genes, use more stringent thresholds for p-values and/or fold changes and re-do your analysis. In most cases, using reasonable thresholds for your genes, will give you a set of significant pathways that will actually offer you a good understanding of the underlying biological phenomenon. Assuming of course, that you used a good pathway analysis method.

To summarize, using the proper background set of genes or proteins can have a dramatic effect on the number of significant results and the number of false positives. You have to use the entire set of genes that were measured as the appropriate background when analyzing your data. Nothing more, nothing less! This is not a recommendation, nor an advice. This is a must, in order to ensure the scientific validity of your findings. This is why in iPathwayGuide, we ask you to either submit your entire list of genes or explicitly specify the background. If you ever use an application that only requires you to submit the significant genes, ask yourself, “Who was in the room when the winners were picked?”

For more on this topic, you can read:

Chapter 24 in Statistics and Data Analysis for Microarrays Using R and Bioconductor, Second Edition (Chapman & Hall/CRC Mathematical and Computational Biology)

Use and misuse of the gene ontology annotations, Nature Reviews Genetics, 2008 July 9(7):509-515, PMID:18475267, DOI:10.1038/nrg2363