What is a “good” number of DE genes to use in order to try to understand the underlying biological phenomena using a pathway analysis? 

It is clear that in a whole-transcriptome experiment that produces 20,000-30,000 measurements, a list of only 5 or 10 DE genes would be completely insufficient to understand the underlying biological phenomena using any kind of pathway analysis, systems biology, or functional analysis. At the same time, selecting 10,000 DE genes is also unlikely to produce good results.

So the question is, how many genes should we aim for when we select our list of DE genes?

Some people want to include more genes in their DE list so that they do not lose anything. That is not a good idea. More DE genes do not necessarily give you better analysis results. Selecting a “good” set of DE genes is more important and will give you more meaningful results than selecting many DE genes. This is somewhat counter-intuitive. It would seem to stand to reason that if I include more DE genes in my analysis, I will cover more biological processes, pathways, etc. Why would this not be true?

The answer lies in the principles behind the subsequent analysis to be undertaken with these DE genes. The simplest type of analysis that can be done is the enrichment analysis. This simply counts the number of DE genes that fall on a given pathway and compares that with the number of genes that are expected to fall on that pathway just by chance. In turn, the randomly expected number depends on the size of the DE gene list. If the DE gene list includes 25% of all genes (let’s say 5,000 genes out of 20,000 measured), then 25% of the genes on any pathway are expected to be DE just by chance. If a pathway has 60 genes, one needs to find significantly more than 15 DE genes on it for this pathway to be found to be significantly enriched. Let us now compare this with a situation in which the list of DE genes includes only 10% of the total number of genes (2,000 genes). The number of DE genes expected by chance on the same pathway is now only 6. It is significantly more difficult for a pathway to have more than 15 DE genes than to have more than 6 DE genes. Hence, using the huge list of 5,000 DE genes will in fact reduce the number of pathways that will be found to be significant. This basically means that by including a very large number of genes, we actually decreased our sensitivity, i.e. our ability to discover the significant pathways.
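The arithmetic in this argument can be checked with a small sketch. It assumes the enrichment test is a one-sided hypergeometric test against all measured genes at a significance level of 0.05; neither choice is specified in the text, and the function names are illustrative only.

```python
from math import comb

def pathway_pmf(n_measured, n_pathway, n_de):
    """P(X = k) for k = 0..k_max, where X is the number of pathway genes
    that land in a DE list drawn at random (hypergeometric model)."""
    total = comb(n_measured, n_de)
    k_max = min(n_pathway, n_de)
    return [
        comb(n_pathway, k) * comb(n_measured - n_pathway, n_de - k) / total
        for k in range(k_max + 1)
    ]

def expected_by_chance(n_measured, n_pathway, n_de):
    """Number of DE genes expected on the pathway just by chance."""
    return n_pathway * n_de / n_measured

def min_count_for_significance(n_measured, n_pathway, n_de, alpha=0.05):
    """Smallest k such that P(X >= k) < alpha, or None if no k qualifies.
    alpha = 0.05 is an assumption, not a value taken from the article."""
    pmf = pathway_pmf(n_measured, n_pathway, n_de)
    tail = 0.0
    best = None
    # Accumulate the upper tail from the largest count downwards.
    for k in range(len(pmf) - 1, -1, -1):
        tail += pmf[k]
        if tail >= alpha:
            break
        best = k
    return best

# The scenario from the text: 20,000 measured genes, a 60-gene pathway.
for n_de in (2000, 5000):
    expected = expected_by_chance(20000, 60, n_de)
    needed = min_count_for_significance(20000, 60, n_de)
    print(f"{n_de} DE genes: expected {expected:.0f} by chance, "
          f"need at least {needed} on the pathway")
```

The count a pathway must reach rises with the size of the DE list, which is the loss of sensitivity described above.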

You also need to take into consideration the fact that not all genes are annotated. The union of all genes included on at least one KEGG pathway is only a few thousand genes, let’s say 3,000. If we now increase the size of the list of DE genes from 5,000 to 10,000, our pathway with 60 genes now needs to have considerably more than 30 DE genes on it in order to become significant. However, assuming that the relevant genes are higher up in the list of DE genes (because they are functionally relevant), by including more genes we only added genes that are not annotated to anything. The effect of this is that we artificially raised the bar that a pathway needs to pass in order for it to be reported as significant, while the number of truly important genes remained the same. Again, that reduces our sensitivity.
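To make the "raised bar" concrete, here is a minimal sketch under the same assumed one-sided hypergeometric test. The count of 25 DE genes on the pathway is a hypothetical value chosen for illustration. It stays fixed when the list grows from 5,000 to 10,000 because, as argued above, the extra genes are assumed to be unannotated.

```python
from math import comb

def enrichment_p_value(n_measured, n_pathway, n_de, n_de_on_pathway):
    """One-sided hypergeometric p-value: the probability of seeing at least
    n_de_on_pathway pathway genes in a random DE list of size n_de."""
    total = comb(n_measured, n_de)
    k_max = min(n_pathway, n_de)
    hits = sum(
        comb(n_pathway, k) * comb(n_measured - n_pathway, n_de - k)
        for k in range(n_de_on_pathway, k_max + 1)
    )
    return hits / total

N_MEASURED = 20000      # genes measured, as in the text
N_PATHWAY = 60          # pathway size, as in the text
DE_ON_PATHWAY = 25      # hypothetical: DE genes on the pathway in the top 5,000

# Growing the list with unannotated genes leaves DE_ON_PATHWAY unchanged
# but doubles the number expected by chance (15 -> 30).
p_5000 = enrichment_p_value(N_MEASURED, N_PATHWAY, 5000, DE_ON_PATHWAY)
p_10000 = enrichment_p_value(N_MEASURED, N_PATHWAY, 10000, DE_ON_PATHWAY)

print(f"5,000 DE genes:  p = {p_5000:.4g}")
print(f"10,000 DE genes: p = {p_10000:.4g}")
```

Under these assumptions the same pathway, with the same relevant genes, is significant at the 0.05 level with the shorter list and not with the longer one.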

At the other end of the spectrum, having a list with only very few DE genes is also going to be damaging. Let’s say that we have a list of only 10 DE genes. At this point, it is likely that there will be only one or at most two genes per pathway for the few pathways that are going to have any DE genes at all. In this case, we will be basing our conclusion that a pathway is significant only on the one gene that fell on that pathway. Given that these measurements are still noisy, the results will be very unreliable. Furthermore, if you use a more sophisticated approach such as the impact analysis, the algorithms will not be able to discover the underlying mechanisms because there will be only one or two genes per pathway.

OK, you might say. I got it! Too many DE genes are not good, and too few DE genes are not good either. What is the Goldilocks number or range of DE genes that we should include in our list in order to get good results from the subsequent pathway and systems analysis?

Unfortunately, there is no magic formula. In general, we recommend having at least a couple of hundred DE genes but not more than a couple of thousand. Roughly, you should aim for something between 5% and 10% of the total number of genes measured. The best way is to do the analysis at several thresholds, which will yield different lists of DE genes. Once you have done the analysis at 3-4 different thresholds, you should do a meta-analysis across them to see which findings are conserved across the various thresholds and which are not.
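A minimal sketch of this comparison is shown below. A simple overlap count stands in for the meta-analysis, which the text does not specify. The thresholds and pathway names are hypothetical placeholders, not real results.

```python
from collections import Counter

def de_fraction(n_de, n_measured):
    """Fraction of measured genes that ended up in the DE list."""
    return n_de / n_measured

def compare_thresholds(significant_by_threshold):
    """Split pathways into those significant at every threshold and those
    significant at only some of them.

    significant_by_threshold maps a threshold to the collection of pathways
    reported as significant with the DE list obtained at that threshold.
    """
    counts = Counter()
    for pathways in significant_by_threshold.values():
        counts.update(set(pathways))  # set() guards against duplicates
    n_runs = len(significant_by_threshold)
    conserved = sorted(p for p, c in counts.items() if c == n_runs)
    threshold_dependent = sorted(p for p, c in counts.items() if c < n_runs)
    return conserved, threshold_dependent

# Hypothetical placeholder results from three runs of the same analysis.
results = {
    0.01: {"pathway_A", "pathway_B"},
    0.05: {"pathway_A", "pathway_B", "pathway_C"},
    0.10: {"pathway_A", "pathway_C", "pathway_D"},
}

conserved, threshold_dependent = compare_thresholds(results)
print("Conserved across all thresholds:", conserved)
print("Threshold-dependent:", threshold_dependent)

# Quick check of a list size against the rough 5%-10% guideline.
print(f"DE fraction: {de_fraction(1500, 20000):.1%}")
```

Findings in the conserved group are the ones least likely to be artifacts of a particular cutoff.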

One question that is often asked is whether to use raw or corrected p-values when selecting the DE genes.  This question was answered in a previous article.