Pathway Analysis Best Practices
This article answers three key questions investigators face when doing pathway analysis:
- What is the best way to choose the set of differentially expressed (DE) genes in order to get the best results from the subsequent pathway analysis?
- Should we use corrected p-values or raw p-values?
- What would happen if I use the wrong type of p-value?
Here are some guidelines that we have developed after many years of doing this type of analysis.
In general, we recommend that you use genes selected based on corrected p-values as the input to your iPathwayGuide analysis. However, the number of genes you provide as input can be critical. If the number of genes selected based on the corrected p-values is too low, you can either relax your threshold (e.g., go from alpha = 1% to 5%, 10% or even more), or use the raw p-values instead. Please keep in mind that using raw p-values means you are likely to include some false positives, and that the expected number of false positives grows with the total number of genes measured.
For instance, using a threshold of 0.01 (i.e., 1%) and raw p-values, you should expect about 1 false positive for every hundred genes measured in your experiment. If you measured 1,000 genes to start with and have selected 100 differentially expressed genes, you are likely to have about 10 false positives (1,000 × 0.01). This may not be a problem because only about 10% of your list of differentially expressed genes are false positives. However, if you measured 30,000 genes and you have selected 400 genes as differentially expressed based on a threshold of 1%, you should expect about 300 false positives (30,000 × 0.01) among your 400 DE genes. Here, about 75% of your DE genes are false positives, which is a serious problem.
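The arithmetic in these two scenarios can be sketched in a few lines of Python. This is a minimal illustration, not part of iPathwayGuide: the function names are hypothetical, and the estimate simply follows the rule of thumb used above (expected false positives = threshold × number of genes measured).

```python
def expected_false_positives(genes_measured, threshold):
    """Expected number of genes passing a raw p-value cutoff by chance alone."""
    return genes_measured * threshold


def false_positive_fraction(genes_measured, threshold, genes_selected):
    """Estimated share of the selected DE genes that are false positives."""
    if genes_selected <= 0:
        raise ValueError("genes_selected must be positive")
    expected = expected_false_positives(genes_measured, threshold)
    # The estimate cannot exceed 100% of the selected genes.
    return min(1.0, expected / genes_selected)


# The two scenarios from the text, both at a raw p-value threshold of 0.01.
small = false_positive_fraction(1_000, 0.01, 100)
large = false_positive_fraction(30_000, 0.01, 400)
print(f"1,000 genes measured, 100 selected: {small:.0%} false positives")
print(f"30,000 genes measured, 400 selected: {large:.0%} false positives")
```

The same function can be used to check any other combination of threshold, number of genes measured, and number of genes selected before committing to an analysis.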
Here is a detailed outline of what we would recommend:
- Choose a threshold for the significance of the genes you want to consider differentially expressed (DE), for instance 1% (0.01)
- Use corrected p-values to select your DE genes
- If you have enough genes, your analysis will be very informative
- If you do not have enough genes, relax your threshold, for instance from 1% to 5% or even 10%, and continue to use corrected p-values
- If that still does not give you enough DE genes, switch to using raw p-values
- Calculate the number of false positives you expect. This is your significance threshold × the number of genes measured. For instance, if you have expression values for 30,000 genes and you use a threshold of 1%, you should expect about 300 false positives
- Calculate the ratio between the number of false positives you expect and the number of genes selected at that threshold. For instance, if you measured 30,000 genes and the 1% threshold on raw p-values gave you 1,000 differentially expressed genes, about 300 of them (30%) are likely to be false positives, while the remaining 70% are likely to be truly DE. You need to decide what is acceptable to you.
- If you are forced to accept a large percentage of false positives, we recommend that you do the analysis at several thresholds, which will yield different lists of DE genes. Once you have done the analysis at 3-4 different thresholds, you should do a meta-analysis across them to see which findings are conserved across the various thresholds and which are not.
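The outline above can be sketched as a short Python routine. This is only an illustration under stated assumptions: it presumes you already have a raw and a corrected p-value for every gene measured, the function and parameter names are hypothetical, `min_genes` stands in for your own judgment of "enough genes" (no specific value is prescribed here), and the toy p-values are made up for demonstration.

```python
def select_de_genes(raw_p, corrected_p, min_genes,
                    thresholds=(0.01, 0.05, 0.10), raw_threshold=0.01):
    """Select DE genes: corrected p-values first, raw p-values as a fallback.

    raw_p and corrected_p map gene IDs to p-values for all genes measured.
    Returns (kind, threshold, selected_genes, estimated_fp_fraction), where
    the false-positive fraction is only estimated for the raw fallback.
    """
    # Try corrected p-values at progressively relaxed thresholds.
    for t in thresholds:
        selected = {g for g, p in corrected_p.items() if p <= t}
        if len(selected) >= min_genes:
            return "corrected", t, selected, None

    # Still too few genes: switch to raw p-values.
    selected = {g for g, p in raw_p.items() if p <= raw_threshold}
    # Expected false positives = threshold x number of genes measured.
    expected_fp = raw_threshold * len(raw_p)
    fraction = min(1.0, expected_fp / len(selected)) if selected else None
    return "raw", raw_threshold, selected, fraction


def conserved_findings(findings_by_threshold):
    """Findings (e.g., pathway names) reported at every threshold analyzed."""
    sets = [set(f) for f in findings_by_threshold.values()]
    return set.intersection(*sets) if sets else set()


# Toy, made-up p-values for five genes (illustration only).
raw_p = {"g1": 0.001, "g2": 0.004, "g3": 0.03, "g4": 0.2, "g5": 0.6}
corrected_p = {"g1": 0.02, "g2": 0.04, "g3": 0.15, "g4": 0.4, "g5": 0.6}

print(select_de_genes(raw_p, corrected_p, min_genes=2))
print(select_de_genes(raw_p, corrected_p, min_genes=4))

# Hypothetical pathway findings from analyses run at three thresholds.
findings = {0.01: ["Pathway A", "Pathway B"],
            0.05: ["Pathway A", "Pathway C"],
            0.10: ["Pathway A", "Pathway B", "Pathway C"]}
print(conserved_findings(findings))
```

Taking the intersection is the strictest way to define "conserved"; a softer variant would keep findings that appear at most, rather than all, of the thresholds.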
A Word of Caution
Some people want to include more genes so that they do not lose anything. However, more DE genes do not necessarily give you better analysis results. Selecting a “good” set of DE genes is more important than selecting many DE genes. A follow-up question is therefore “how do you select a good set of DE genes?” or its equivalent, “what is the ideal size for a set of DE genes?” This will be addressed in a follow-up article.