In the world of bioinformatics, we all need to be careful when analyzing our data. I receive countless questions about what background should be used when analyzing gene expression or protein expression data. In this blog post I will attempt to clarify this question.
We have all been handed a raffle ticket that says, “Must be present to win.” The organizers are establishing the pool of candidates from which the winning ticket will be drawn (and also trying to keep you there, but that’s beyond the scope of this discussion). Pathway analysis and other enrichment analyses work in a similar way.
It is very intuitive that the size of the pool of candidates will dramatically affect the odds of winning. Using our raffle ticket example, let us say 1,000 tickets are given out and there will only be one winner. On the surface, we think our odds of winning are 1 in 1,000. But now let’s say the crowd of people that are actually present for the drawing is only 100 people. Because we “must be present to win,” our odds of winning are now actually 1 in 100. Furthermore, if the raffle organizers wanted to cheat, they could for instance have the raffle draw take place in a small room in which they invite only their friends and relatives. If this room hosts only 10 people at the time of the drawing, the odds of winning would now be 1 in 10. So the odds are really dependent on which background we choose. The same goes for pathway and other enrichment-based analyses.
Let us say you have 1,000 significant genes or proteins that were selected as differentially expressed (DE) in your condition. As I prefaced in my opening paragraph, the question becomes what background should be used when trying to understand which pathways or GO terms are significant. The p-value calculated for a pathway tells you how likely it is to see at least that many DE genes on the pathway just by chance. And, as we saw in the raffle example, the choice of the background can have a dramatic effect on the results (odds). Should we use all protein-coding genes? How about all genes in NCBI or Ensembl?
The answer is that we should always use the set of genes that were measured. This is akin to saying, “you must be present to win.” If the gene or protein was not measured, it should not be in the mix. So if you use an arbitrary set of genes for the background (e.g. all NCBI genes, or all Ensembl genes), your statistics will be heavily skewed. All enrichment programs that have you submit only the DE genes or proteins do exactly this, because they have no way of knowing what was actually measured. Similarly, if you use only the set of DE genes as the background, and further select from there, you can also skew your results (this is like doing the drawing among your 10 friends and increasing your odds of success).
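To make the raffle analogy concrete, here is a minimal sketch of the classic over-representation test, which scores a pathway with the upper tail of a hypergeometric distribution. This is an illustration, not necessarily the method used by any particular tool. The function name `enrichment_p_value` and the pathway numbers (a 100-gene pathway with 12 of the 1,000 DE genes on it) are made up for the example, and the sketch assumes the pathway size stays the same under both backgrounds.

```python
from math import comb


def enrichment_p_value(background_size, pathway_size, de_count, de_in_pathway):
    """Probability of seeing at least `de_in_pathway` pathway genes when
    `de_count` genes are drawn at random, without replacement, from a
    background of `background_size` genes that contains `pathway_size`
    pathway genes (upper tail of the hypergeometric distribution)."""
    max_overlap = min(pathway_size, de_count)
    favorable = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, de_count - k)
        for k in range(de_in_pathway, max_overlap + 1)
    )
    return favorable / comb(background_size, de_count)


# Hypothetical pathway, chosen only for illustration.
PATHWAY_SIZE, DE_COUNT, DE_IN_PATHWAY = 100, 1000, 12

# Same DE genes and same pathway; only the background changes.
p_measured = enrichment_p_value(20_000, PATHWAY_SIZE, DE_COUNT, DE_IN_PATHWAY)
p_inflated = enrichment_p_value(30_000, PATHWAY_SIZE, DE_COUNT, DE_IN_PATHWAY)

print(f"measured background (20,000 genes): p = {p_measured:.3g}")
print(f"inflated background (30,000 genes): p = {p_inflated:.3g}")
```

With the larger background, fewer DE genes are expected to land on the pathway by chance, so the same 12 hits look more surprising and the p-value shrinks, even though nothing about the experiment has changed.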
To exemplify this, I took the set of 1,172 significant genes (p < 0.05 and |log2FC| > 0.6) from a public dataset (GSE47363) and ran it through a simple enrichment analysis. In the first experiment, I used the set of genes that were measured as the background, about 20,000 genes. Then I ran the exact same set of DE genes, but this time I used “NCBI genes” as provided by another popular web-based pathway analysis application as the background (about 30,000 genes). See Figure 1 below.
Figure 1: Comparing the same set of DE genes, but with different backgrounds. On the left, we use the set of genes that were measured (20k). On the right, we use 30k genes from NCBI as the background. Notice the dramatic difference in the number of significant pathways and in the p-values.
While the top pathway is the same in both instances, you will notice little else is the same. In the first set of results, obtained with the appropriate background, we see a total of 64 significant pathways (FDR-corrected p-value < 0.05). In the second set of results, obtained with all NCBI genes as the background, there are more than 150 significant pathways! You will also notice that the p-values are much smaller when using NCBI as the background.
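For readers unfamiliar with the FDR correction mentioned above, here is a small sketch of one common procedure, Benjamini-Hochberg. The post does not say which correction was used to produce the figure, so treat the choice of method as an assumption; the function name `benjamini_hochberg` and the example p-values are likewise made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw pathway p-values, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005, 0.6]
corrected = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(corrected) if q < 0.05]
print(corrected, significant)
```

Because the correction starts from the raw p-values, any inflation caused by the wrong background carries straight through to the list of pathways that pass the FDR threshold.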
You could say: “Well, but the first pathway is the same. So if a pathway is truly relevant, it will be on top no matter what the background is.” First, this is not true. The fact that the two sets of results have the same pathway at the very top is just a coincidence. Second, this is an incorrect way of thinking. The very purpose of the p-values is to provide us with the means to distinguish between pathways that may have some differentially expressed genes just by chance and pathways that may truly be involved with the phenotype.
All pathways with a p-value below the significance threshold (e.g. 5%) should be carefully studied, not just the very top result, or the top three for that matter. If you have too many significant pathways and you cherry-pick only the ones that “look familiar”, your results will be severely biased.
A better way is to go back to the criteria you used to select your differentially expressed genes, use more stringent thresholds for p-values and/or fold changes, and redo your analysis. In most cases, using reasonable thresholds for your genes will give you a set of significant pathways that actually offers a good understanding of the underlying biological phenomenon, assuming, of course, that you used a good pathway analysis method. But let us leave this for another posting.
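As a rough sketch of what tightening the selection looks like in practice, the snippet below filters a toy table of results with adjustable thresholds. The gene names, values, and the helper `select_de_genes` are all hypothetical. Note that only the DE list changes; the background remains every gene that was measured.

```python
def select_de_genes(results, max_p=0.05, min_abs_log2fc=0.6):
    """Return the genes passing both thresholds, sorted by name.

    `results` maps a gene name to a (p_value, log2_fold_change) pair.
    """
    return sorted(
        gene
        for gene, (p_value, log2fc) in results.items()
        if p_value < max_p and abs(log2fc) > min_abs_log2fc
    )


# Toy measurements, invented for illustration.
measured = {
    "GENE_A": (0.001, 2.1),
    "GENE_B": (0.03, -0.8),
    "GENE_C": (0.04, 0.7),
    "GENE_D": (0.2, 1.5),
    "GENE_E": (0.0005, -0.3),
}

background = sorted(measured)  # every measured gene, DE or not
de_default = select_de_genes(measured)
de_strict = select_de_genes(measured, max_p=0.01, min_abs_log2fc=1.0)

print(len(background), de_default, de_strict)
```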
To summarize, the choice of background set of genes or proteins can have a dramatic effect on the number of significant results and the number of false positives. You have to use the entire set of genes that were measured as the background when analyzing your data. Nothing more, nothing less! This is not a recommendation or a piece of advice. It is a must in order to ensure the scientific validity of your findings. This is why, in iPathwayGuide, we ask you to submit your entire list of genes. If you ever use an application that only requires you to submit the significant genes, ask yourself, “What is the background being considered?”
For more on this topic, you can read: