What is Gene Ontology (GO)?
The confusion about gene ontology and gene ontology analysis can start right from the term itself. There are actually two different entities that are commonly referred to as gene ontology or “GO”:
- the ontology itself, which is a set of terms with their precise definitions and defined relationships between them, and
- the associations between gene products and GO terms, which are used to capture the existing knowledge about what each gene is known to do.
But the term gene ontology, or GO, is commonly used to refer to both, which is sometimes a source of potential confusion. In order to avoid this, here we will use the term “GO ontology” to describe the set of terms and their hierarchical structure and “GO annotations” to describe the set of associations between genes and GO terms.
There are 3 types of terms, or domains if you wish, in the gene ontology:
- Biological Processes (BP)
- Molecular Functions (MF)
- Cellular Components (CC)
GO structure and data representation
In general, an ontology such as the gene ontology consists of a number of explicitly defined terms that are names for biological objects or events. These terms are depicted as nodes (also called vertices) in a graph that describe the relationships between the nodes. For instance, “cytoplasm” is a node, which is linked by an edge to its parent “intracellular part“. The type of this edge is “is a” and this structure simply means that cytoplasm is an intracellular part.
But there is more to it. That graph formed with these nodes and edges is not just any kind of graph. It is a so called “directed acyclic graph” or DAG. There are several important features of DAGs. Firstly, the edges are directed i.e. there is a source and a destination for each edge. In the gene ontology, the source is referred to as the parent term and the destination is referred to as the child term. This tells us that the cytoplasm is an intracellular part rather than an intracellular part is a cytoplasm.
Secondly, unlike a general graph, a DAG does not have cycles, which is to say that one cannot complete a loop by following the directed edges. Among other things, this restriction means that two terms can not be both parents and children of each other (otherwise they would form a loop between themselves), and that there must be at least one node that has no children, ie. a root. A DAG is similar to a tree with the difference that in a tree each node can have only one parent, while in a DAG a node can have multiple parents. In the figure below, A shows a tree (each node has only one parent), B shows a DAG (node 3 has 2 parents), and C shows a general graphs (nodes 1, 2 and 3 form a loop). So let us remember that the structure of the GO ontology is a DAG like in panel B, below.
The differences between a tree, directed acyclic graph and genera graph
Here is an example showing the biological process of “negative regulation of programmed cell death” and its various relationship with all its ancestors.
The hierarchical structure and ancestry relationships in the gene ontology. Terms on the lower levels are more specific while terms higher up are more general.
What is a gene ontology analysis?
Fundamentally, the gene ontology analysis is meant to answer a very simple question:
“Given a list of genes found to be differentially expressed in my phenotype (e.g. disease) vs. control (e.g. healthy), what are the biological processes, cellular components and molecular functions that are implicated in this phenotype?”
In a nutshell, the premise here is that if many of the genes associated with a given biological process are differentially expressed in the given disease, that biological process is implicated in that disease. Essentially, the gene ontology analysis aims to identify those biological processes, cellular locations and molecular functions that are impacted in the condition studied.
But…the question now becomes, how do you decide whether or not a given gene ontology term is important or not? After all, any biological term can end up with some genes that are differentially expressed just but chance or just because those genes are also associated with other biological processes that could be more germane to the condition studied.
Therefore, I will briefly outline the main approaches used to address this problem. Some of these are better, some are worse. In fact, some are completely wrong. Nevertheless, I will review them here so you can understand how the thinking evolved about this problem and be in a position to choose wisely the approach you want to use.
Keep in mind that the following is just a brief outline with no mathematical details. If you really want a thorough discussion, you can lean more about these approaches in Chapter 16 of this book.
The simplest gene ontology analysis: Over-representation analysis (ORA) or enrichment analysis
If the processing of the list of differentially expressed (DE) genes were to be done manually, one would take each accession number corresponding to a DE gene, search various public databases and compile a list with, for instance, the biological processes that the gene is involved in. The same type of analysis could be carried out for other functional categories such as biochemical function, cellular role, etc. This task can be performed repeatedly, for each gene, in order to construct a master list of all biological processes in which at least one gene is involved. Further processing of this list can provide a list of those biological processes that are common between several of the DE genes. It is intuitive to expect that those biological processes that occur more frequently in this list would be more relevant to the condition studied. If 200 genes have been found to be differentially expressed and 160 of them are known to be involved in, let us say, mitosis, it is intuitive to conclude that mitosis is a biological process important in the given condition. Right?
As we shall see in the following example, this intuitive reasoning is incorrect and a more careful analysis must be done in order to identify the truly relevant biological processes.
Let us consider that we are using a panel containing 2,000 genes to investigate the effect of ingesting a certain substance X. Let’s say that there are 200 differentially expressed genes. Let us focus on the biological processes for instance, and let us assume that the results for the 200 differentially regulated genes are as follows: 160 of the 200 genes are involved in mitosis, 80 in oncogenesis, 60 in the positive control of cell proliferation and 40 in glucose transport.
A common mistake in gene ontology analysis is to use raw counts. Here, mitosis seems to be the most important biological process in this experiment.
If we now look at the functional profile described above, we might conclude that substance X may be related to cancer since mitosis, oncogenesis and cell proliferation would all make sense in that context. However, a reasonable question is: what would happen if all the genes on the panel used were part of the mitotic pathway? Would mitosis continue to be significant? Clearly, the answer is no. Therefore, in order to draw correct conclusions, it is necessary to always compare the actual number of occurrences with the expected number of occurrences for each individual category.
This comparison is shown in the figure below by the line showing the ratio between what was observed vs what was expected (percentage shown on the vertical axis on the right). In this light, the same data tells a completely different story. There are indeed 160 mitotic genes but, in spite of this being the largest number, we actually expected to observe 160 such genes so this is not better than chance alone. The same is true for oncogenesis. The positive control of cell proliferation starts to be interesting because we expected 20 and observed 60. This is 3 times more than expected. However, the most interesting is the glucose transport. We expected to observe only 10 such genes and we observed 40, which is 4 times more than expected. Taking into consideration the expected numbers of genes radically changed the interpretation of the data. In light of these data, we may want to consider the correlation of X with diabetes instead of cancer.
Gene ontology analysis: the need to compare raw counts with expected values. Even though mitosis had the highest number of differentially expressed genes, this was no more than what was expected by chance. In contrast, glucose transport, even though it had the lowest absolute count, had the most significant enrichment at 4x the number expected by chance.
This example illustrates that the simple frequency of occurrence of a particular functional category among the genes found to be regulated can be misleading. In order to draw correct conclusions, one must analyze the observed frequencies in the context of the expected frequencies.
The problem is that an event such as observing 40 genes when we expect 10 can still occur just by chance. This is unlikely, but it can happen. The bottom line is that one needs to assess the significance of these categories based on the probability of the observed values appearing just by chance. This can be done with various statistical models including hypergeometric, Fisher’s exact test, or chi-square. If you are interested in details, I explained the formulae elsewhere (Chapter 23 in this book) but here, the bottom line is that you should never try to draw conclusions from a count graph as above, but rather use software that calculate a p-value for each term and don’t forget to correct for multiple comparisons. Whatever you do, please do not publish graphs showing just raw counts of gene ontology terms. As shown in the example above, they are not only uninformative but they can also be completely misleading. And if I am one of the reviewers of your paper or grant, you will hear strong complaints from me.
A step further in gene ontology analysis: Functional Class Scoring (FCS)
The simplest approach that would provide sound scientific results is the over-representation approach described above. However, more sophisticated methods have been developed over time. An important category of methods includes functional class scoring methods. The best known methods in this category is the Gene Set Enrichment Analysis or GSEA.
As a first step, GSEA ranks the genes based on the association of each gene with the phenotype. This association is established using an arbitrary test, for example a t-test. Once the ranked list of genes L is produced, an enrichment score (ES) is computed for each set in the gene set list. The list L is walked from the top to the bottom, and a statistic is increased every time a gene belonging to the set is encountered, and decreased otherwise. The value of the increment (or decrement) depends on the ranking of the gene. If you imagine a situation in which all genes at the top of the list are associated to a given biological process, the score for that process will increase with every gene. At the end of the list, the enrichment score is the maximum distance from zero encountered during the walk.
An example of the statistics calculated by the gene set enrichment analysis (GSEA).
Image from Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov PNAS October 25, 2005 102 (43) 15545-15550; first published September 30, 2005; https://doi.org/10.1073/pnas.0506580102
In this figure, the upper graph shows the enrichment values during the walk through the gene list. The vertical lines represents the genes belonging to the set S at the positions they appear in the ranked list. The lower graph shows the degree to which each gene is correlated with the phenotype.
In principle, higher enrichment scores are yielded when the graph departs considerably from zero. However, the enrichment score by itself cannot be used to assess significance much like the raw counts of genes cannot be used that way. The reason is the same: in principle, any score can appear with a given non-zero probability. We have to focus only on those that appear more often than expected by chance. This is done with a bootstrap approach. In essence, the bootstrap approach assesses the frequency with which something appears just by chance by randomly permuting the labels. There are two possible permutation criteria: permutation of the phenotype samples or permutation of the gene labels. In general, the label permutation method is preferred as it preserves gene-gene correlations. This step of the algorithm produces a null distribution which allows the computation of an empirical p-value. The empirical p-value is calculated as the number of random bootstrap runs which resulted in an enrichment score equal to or larger than the one observed for the correct labels.
Next and last step, the significance levels are adjusted for multiple hypotheses testing. Each Enrichment Score is normalized with the size of the set obtaining a Normalized Enrichment Score (NES), and then the false discovery rate (FDR) of each NES is computed.
More sophisticated gene ontology methods: elim and weight
The approaches described above focus on the problem of accurately interpreting the number of differentially expressed genes associated with a gene ontology term. However, these approaches ignore the structure of the gene ontology and the relationship between various terms. In order to understand more sophisticated GO analysis methods we need to learn a few more things about the gene ontology.
The GO is organized in a hierarchical structure that uses the types of relationship described above: “is a”, “part of” and “regulates”. For instance, “induction of apoptosis by extracellular signals” is a type of “induction of apoptosis” which in turn is a “positive regulation of apoptosis” which in turn is a kind of “regulation of apoptosis”, etc. Generally speaking, traversing the DAG following “is a” relationships as above can be seen as moving across levels of abstraction. The root, BP (or “All”), would correspond to the highest level of abstraction, or the lowest level of details. In contrast, leaf nodes such as “induction of apoptosis by hormones” would correspond to the lowest level of abstraction, with the most details. Similarly, the “cell outer membrane” is part of the “cell envelope”, etc. Traversing “part of” relationships could be interpreted as changing scales. In general, terms closer to the roots (BP, CC, MF) are more general, while the ones closer to the leaves are more specific. In GO Consortium’s terminology, the children are more specialized than the parents. When annotating genes with GO terms, efforts are made to annotate the genes with the highest level of details possible. In a general way, this corresponds to the lowest level of abstraction. For example, if a gene is known to induce apoptosis in response to hormones, it will be annotated with the term “induction of apoptosis by hormones” and not merely with one of the higher level terms such as “induction of apoptosis” or “apoptosis”.
When a gene is annotated with a term, all inferences that can be inferred from the structure of the GO must also hold true. In other words, if the child term describes a gene product, then all its parent terms must also apply to that gene product. Let’s revisit the ontology from above (figure is repeated here for convenience). For instance, if a gene is annotated as having a role in “regulation of programmed cell death” then it will necessarily be involved in “cellular process” because “regulation of programmed cell death” regulates “programmed cell death,” which is a type of “cell death,” which is a type of “cellular process.” This property is known as the “True Path Rule.” But, because of this, if the GO analysis is done independently, for each term, each differentially expressed gene will be counted multiple times, once for every term from the lowest term it is annotated with, all the way up to the root. This has two consequences. First, it injects a great deal of redundancy in the process and second, it tends to report as significant lots of general terms which are really not very informative with respect to the phenotype studies.
Two methods have been proposed to deal with this by Alexa et al (2006). Elim starts at the lowest level of GO, with the most specific terms and calculates their enrichment p-value. If that is not significant, the approach moves up in the hierarchy and calculates the p-value as usual. If “induction of apoptosis by extracellular signals” is not significant, all DE genes associated with it will also be counted for its parent, “induction of apoptosis”, as if the two terms were independent. However, if the more specific term is indeed significant, elim will eliminate all genes associated with it from all its ancestors, thus eliminating the redundancy and giving a chance to the more specific terms to be reported as significant.
The second method proposed by Alexa is called weight. The idea behind this approach is that if many very specific terms are significant, and there is a term slightly more general that would encompass all those terms, it may be useful to identify this more general term.
Pitfalls in gene ontology analysis
Gene ontology analysis is a powerful tool. Like any powerful tool, it is subject to misuse and misunderstanding.
The most common mistake in gene ontology analysis is choosing the incorrect background (or not choosing an explicit background). What is this about?
Well, let’s go back to the enrichment analysis. Let us assume that Mary measured 1,000 genes in a panel, and she found 200 genes to be differentially expressed. That’s a proportion of 20%. Let’s consider a biological process that is associated with 100 genes and let us assume that she has found that 30 of these 100 genes are differentially expressed. In this case, if this process is not related to the phenotype, she would expect to find about 20% of its genes being differentially expressed just by chance, or about 20. We found 27 instead of 20, we can calculate the probability of this happening just by chance as being about 0.076 or 7.6%. This is not meeting the usually accepted significance threshold of 1% or even 5% so there is not enough evidence to indicate that this process is involved in the phenotype and Mary should not spend too much time studying it.
But… most people do this analysis with some software, not manually as above. And sometimes, such software only requires the user to upload the list of differentially expressed genes. This would be the list of 200 genes. Now, Mary is using such software and she does not specify that she only used a panel of 1,000 genes, ie. the statistical background of this analysis, the software might think that she has selected those 200 genes from a genome-wide RNA-Seq experiment. In that case, the numbers would be rather different. Now, the proportion of DE genes appears to be 200/30,000 or 2/300 (or 0.0066), and a biological process with 100 genes is only expected to get 1 gene just by chance, at most (actually 0.666…). The probability of having 25 genes just by chance, or the reported p-value, is essentially zero. With such hugely significant p-values, Mary gets super excited and there is a real danger that she would spend weeks or month trying to understand or validate the involvement of this biological process in this experiment, where in fact, this extremely significant p-value was only due to the incorrect choice of the background set.
The moral of this story is that an enrichment analysis must always be done by specifying explicitly the set of genes measured. The safest bet is to use a software platform that allows you to upload the entire list of genes measured and specify which of those you want to consider as differentially expressed. If that is not an option, you need to be able to either upload separately your reference set, or specify it in another way. Otherwise, your p-value may lead you to perdition…
The second most common mistake in this area is failing to correct for multiple comparisons. The need to do this correction is explained here or in Chapter 16 of this book. Since ontologies have a hierarchical relationships, it is important to apply appropriate correction factors to minimize errors. For instance, Using a False Discovery Rate (FDR) or Family-wise Error Rate correction factor may not be appropriate for GO analysis.
The third most common mistake is failing to detect the cross-talk between various biological phenomena. Many genes are involved in several processes, in addition to the redundancy stemming from the gene ontology hierarchy itself. Sometime, the same group of differentially expressed genes make several processes appear as significant. Unless this overlap and cross-talk is detected and eliminated, a lot of time may be wasted. A detailed mathematical approach to detect and eliminated cross-talk has been proposed here. A more practical and user friendly way to identify cross-talk is also included in iPathwayGuide.
Other, more subtle mistakes related to gene ontology analysis include:
- Misinterpreting annotations such as
- ND = no biological data available. If an ND evidence code surfaces in your analysis, you do not need to waste time on additional literature searches.
- NOT = it can be confusing to interpret the negative annotation NOT. NOT is not a complete list, just a list of where the NOT was a surprise.
- Misinterpreting the direction of arrows in the output of GO – the directed acyclic graph (DAG). Clear labeling is a must here.
A more detailed discussion of these and other pitfalls can be found in this paper:
Rhee, S., Wood, V., Dolinski, K., Draghici, S. (2008). Use and misuse of the gene ontology annotations Nature Reviews Genetics 9(7), 509-515. https://dx.doi.org/10.1038/nrg2363.
And most importantly, keep in mind that the ultimate purpose of gene expression experiments is to produce biological knowledge not numbers.
Sorin Draghici (2012)
Using iPathwayGuide for gene ontology analysis