In the world of bioinformatics, we all need to be careful when analyzing our data. I receive countless questions about what reference set should be used when analyzing gene expression or protein expression data.

If you are thinking “what’s a reference set?”, don’t worry. You aren’t alone. Many researchers have fallen prey to using the default reference set offered by most pathway analysis programs, and could be compromising their data analysis without even realizing it. In this blog post, I will define the reference set and attempt to clarify which reference set is the best one to use.

What is a reference set?

We’ve all had the experience of receiving a raffle ticket that says, “Must be present to win.” The organizers are establishing the pool of candidates from which the winning ticket will be drawn (and also trying to keep you there, but that’s beyond the scope of this discussion). Performing pathway analysis and other enrichment analyses is somewhat similar.

It is very intuitive that the size of the pool of candidates will dramatically affect the odds of winning. Using our raffle ticket example, let us say 1,000 tickets are given out and there will only be one winner. On the surface, we think our odds of winning are 1 in 1,000. But now let’s say the crowd of people that are actually present for the drawing is only 100 people. Because we “must be present to win,” our odds of winning are now actually 1 in 100. Furthermore, if the raffle organizers wanted to cheat, they could for instance have the raffle draw take place in a small room in which they invite only their friends and relatives. If this room hosts only 10 people at the time of the drawing, the odds of winning would now be 1 in 10. So the odds are really dependent on the set from which the winning tickets are selected. The same goes for pathway and other enrichment-based analyses.

Let us say you have 1,000 significant genes or proteins that were selected as differentially expressed (DE) in your condition. As I mentioned in my opening paragraph, the question becomes what background should be used when trying to understand which pathways or GO terms are significant. The p-values calculated during the analysis are just another way of telling you the odds that a given pathway would contain that many DE genes just by chance. And, as we saw in the raffle example, the choice of the background can have a dramatic effect on the results (odds).

What background should I use?

Should we use all protein-coding genes? How about all genes in the NCBI or Ensembl databases?

The answer is we should always use the set of genes that were measured. This is akin to saying, “you must be present to win.” If the gene or protein was not measured, it should not be in the mix. So if you use an arbitrary set of genes for the background (e.g. all NCBI genes, or all Ensembl genes), your statistics will be heavily skewed. Any enrichment program that has you submit only your DE genes or proteins does exactly this. Similarly, if you use only the set of DE genes as the background, and further select from there, you can also skew your results (this is like doing the drawing among your 10 friends and increasing your odds of success).
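As a minimal sketch of the “must be present to win” rule, the snippet below restricts both the DE list and a pathway’s annotation to the measured genes before counting anything. The gene identifiers and the helper name `enrichment_counts` are made up for illustration; they do not come from any particular tool.

```python
# Hypothetical gene identifiers, for illustration only.
measured_genes = {"TP53", "BRCA1", "EGFR", "MYC", "GAPDH", "ACTB"}
de_genes = {"TP53", "EGFR", "MYC"}
pathway_genes = {"TP53", "MYC", "KRAS", "PTEN"}  # KRAS and PTEN were not measured


def enrichment_counts(de, pathway, background):
    """Return the four counts an over-representation test needs,
    with every set restricted to the measured background."""
    de = de & background            # "must be present to win"
    pathway = pathway & background  # unmeasured annotations are dropped
    return {
        "reference_size": len(background),
        "de_size": len(de),
        "pathway_size": len(pathway),
        "de_in_pathway": len(de & pathway),
    }


counts = enrichment_counts(de_genes, pathway_genes, measured_genes)
print(counts)
```

The point of the design is that the reference size comes from what was actually on the platform, not from a database total.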

To illustrate this, I took the set of 1,172 significant genes (p < 0.05 and |log2FC| > 0.6) from a public dataset (GSE47363) and ran it through a simple enrichment analysis. In the first experiment, I used the set of genes that were measured, about 20,000 genes, as the background. Then I ran the exact same set of DE genes, but this time I used as the background the “NCBI genes” set (about 30,000 genes) provided by another popular web-based pathway analysis application. See Figure 1 below.

Figure 1: Comparing pathway results on a set of differentially expressed genes (DEGs) with different backgrounds. On the left, we use the set of genes that were measured (20k). On the right, we use 30k genes from NCBI as the background. Notice the dramatic difference in the number of significant pathways and the p-values.
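For readers who want to see the selection step spelled out, here is a rough sketch of the thresholds quoted above (p < 0.05 and an absolute log2 fold change above 0.6). The gene names and values are invented, and `select_de` is a hypothetical helper, not part of any package. Note that the background is every gene in the results table, whether or not it passes the filter.

```python
# Hypothetical per-gene results: (gene, p_value, log2_fold_change).
results = [
    ("GENE_A", 0.001, 1.40),
    ("GENE_B", 0.030, -0.85),
    ("GENE_C", 0.200, 2.10),   # fails the p-value threshold
    ("GENE_D", 0.010, 0.30),   # fails the fold-change threshold
    ("GENE_E", 0.049, -0.61),
]


def select_de(rows, p_max=0.05, abs_lfc_min=0.6):
    """Keep genes with p < p_max and |log2FC| > abs_lfc_min."""
    return [gene for gene, p, lfc in rows if p < p_max and abs(lfc) > abs_lfc_min]


background = [gene for gene, _, _ in results]  # everything that was measured
de = select_de(results)

print(len(background), "measured;", len(de), "differentially expressed:", de)
```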

While the top pathway is the same in both instances, you will notice little else is the same. In the first set of results, obtained with the appropriate background, we see a total of 64 significant pathways (FDR-corrected p-values < 0.05). In the second set of results, obtained with all NCBI genes as background, there are more than 150 significant pathways! Also, you will notice the p-values are much more significant when using NCBI as the background.
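The post does not say which FDR procedure produced the corrected p-values. As an assumption, the sketch below uses the Benjamini-Hochberg adjustment, a common choice, on a handful of made-up pathway p-values, just to show how “significant at FDR < 0.05” is typically decided.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical raw pathway p-values
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(adj, significant)
```

Because the raw p-values feed directly into this correction, an inflated background that shrinks them will also inflate the count of pathways that survive it.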

You could say: “Well, the first pathway is the same. So, if a pathway is truly relevant, it will be on top no matter what the background is.” Not true. The fact that the two sets of results have the same pathway at the very top is just a coincidence. To illustrate further, I compared the effect on a single pathway of using all measured genes as the reference background versus using the entire NCBI gene database.

For this example, I used 36,000 measured genes of which 10% are differentially expressed, or 3,600 genes. Both of these numbers are similar to what you would get in a typical RNA-seq experiment. Next, I assumed 100 genes are associated with a single arbitrary pathway and that 12 of my differentially expressed genes belong to this pathway. As you can see in Table 1 below, when I used the full list of measured genes as the background, I found that this pathway was not enriched in my data set, with a p-value of 0.19. However, when I kept everything else the same but analyzed against a default reference set roughly equal in size to the NCBI database (52,000 genes), this pathway became significant with a p-value of 0.02. Just by using the “default” reference, I’ve identified a pathway as enriched in my dataset that is most likely a false positive.

| | All measured genes as reference | Entire NCBI database as reference |
| --- | --- | --- |
| Number of genes in reference set | 36,000 | 52,000 |
| Differentially expressed genes | 3,600 | 3,600 |
| Genes annotated to pathway in database | 100 | 100 |
| Differentially expressed genes annotated to pathway | 12 | 12 |
| p-value | 0.19 | 0.02 |

Table 1: Comparing the outcomes on a single pathway using different backgrounds. On the left, we use the set of genes that were measured (36k). On the right, we use 52k genes from the NCBI as the background. Notice that the p-value becomes significant when using the larger reference set.
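The post does not state which test produced the p-values in Table 1. As a sketch, assuming a one-sided hypergeometric (over-representation) test, the counts from the table can be plugged in as follows. The function name `enrichment_p_value` is hypothetical. Both the inclusive tail P(X ≥ 12) and the strict tail P(X > 12) are computed, since tools differ on this convention and the exact figures depend on it.

```python
from math import comb


def enrichment_p_value(k, N, K, n, inclusive=True):
    """Upper tail of the hypergeometric distribution.

    N: genes in the reference set
    K: genes annotated to the pathway
    n: differentially expressed genes
    k: DE genes annotated to the pathway

    Uses the form C(n, i) * C(N - n, K - i) / C(N, K). The hypergeometric
    distribution is symmetric in K and n, so this equals the more familiar
    C(K, i) * C(N - K, n - i) / C(N, n) while keeping the integers small.
    """
    start = k if inclusive else k + 1
    favourable = sum(
        comb(n, i) * comb(N - n, K - i) for i in range(start, min(K, n) + 1)
    )
    return favourable / comb(N, K)


tails = {}
for label, N in [("measured genes", 36_000), ("NCBI database", 52_000)]:
    expected_hits = 3_600 * 100 / N  # DE genes expected in the pathway by chance
    p_ge = enrichment_p_value(12, N, 100, 3_600)
    p_gt = enrichment_p_value(12, N, 100, 3_600, inclusive=False)
    tails[label] = (p_ge, p_gt)
    print(f"{label}: expected {expected_hits:.1f} hits, "
          f"P(X >= 12) = {p_ge:.3f}, P(X > 12) = {p_gt:.3f}")
```

The mechanism is visible in the expected count: with 36,000 reference genes, 10 of the 100 pathway genes are expected to be DE by chance, so 12 is unremarkable. Inflating the reference to 52,000 lowers that expectation to about 6.9, and the same 12 hits suddenly look like enrichment. Under either tail convention the pathway is not significant against the measured genes and falls below 0.05 against the larger reference.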

What’s in a p-value?

The very purpose of the p-values is to provide us with the means to distinguish between pathways that are truly involved with the phenotype and pathways that may have some differentially expressed genes just by chance. All pathways with a p-value less than the significance threshold (e.g. 5%) should be carefully studied, not just the very top result, or the top three for that matter. If you have too many significant pathways and you cherry-pick from them only the ones that “look familiar” or “make sense”, your results will be severely biased. And by the way, you can never discover new phenomena if you investigate only those that you already know are involved.

If you have too many DEGs, a better approach is to go back to the criteria you used to select your differentially expressed genes, apply more stringent thresholds for p-values and/or fold changes, and redo your analysis. In most cases, using reasonable thresholds for your genes will give you a set of significant pathways that actually offers a good understanding of the underlying biological phenomenon, assuming, of course, that you used a good pathway analysis method.

To summarize, using the proper background set of genes or proteins can have a dramatic effect on the number of significant results and the number of false positives. You have to use the entire set of genes that were measured as the appropriate background when analyzing your data. Nothing more, nothing less! This is not a recommendation or advice. This is a must, in order to ensure the scientific validity of your findings. This is why in iPathwayGuide, we ask you to either submit your entire list of genes or explicitly specify the background. If you ever use an application that only requires you to submit the significant genes, ask yourself, “Who was in the room when the winners were picked?”

For more on this topic, you can read:

Chapter 24 in Statistics and Data Analysis for Microarrays Using R and Bioconductor, Second Edition (Chapman & Hall/CRC Mathematical and Computational Biology)

Use and misuse of the gene ontology annotations, Nature Reviews Genetics, 2008 July 9(7):509-515, PMID:18475267, DOI:10.1038/nrg2363