|Abstract: ||Annotation Enrichment Analysis (AEA) is a widely used methodology for processing data generated by high-throughput genomic and proteomic experiments such as gene expression microarrays. The analysis uncovers and summarizes discriminating background information for sets of genes identified by earlier processing stages (e.g., a set of differentially expressed genes or a cluster). Enrichment analysis algorithms attach annotations to the genes and then detect statistically significant deviations in the frequency of individual annotation terms in a given gene subset. The annotation terms represent different aspects of biological knowledge and come from databases such as GO, BIND and KEGG. Typical statistical models used to detect enrichments or depletions of annotation terms are the hypergeometric, binomial and χ² models. Finally, human experts use the discovered information to find biological interpretations of the experiments.
The main drawback of AEA is that it tests for overrepresentation of individual annotation terms, or groups of similar terms, in isolation. As a result, AEA is limited in its ability to uncover complex phenomena involving relationships between multiple annotation terms from various knowledge bases. In addition, AEA assumes that annotations describe the whole object of interest, which makes it difficult to apply to sets of compound objects (e.g., sets of protein-protein interactions) and to sets of objects that have an internal structure (e.g., protein complexes).
To overcome these shortcomings, we propose a novel logic-based Annotation Concept Synthesis and Enrichment Analysis (ACSEA) approach. In this approach, the source annotation information, the experimental data and the uncovered enriched annotations are all represented as First-Order Logic (FOL) statements. ACSEA fuses inductive logic reasoning with statistical inference to uncover more complex phenomena captured by the experiments. The proposed paradigm allows the synthesis of enriched annotation concepts that better describe the observed biological processes.
The methodological advantage of ACSEA is sixfold. First, it is easier to represent complex, structural annotation information, and information already captured and formalized in OWL and RDF knowledge bases can be used directly. Second, it is possible to synthesize and analyze complex annotation concepts. Third, it is possible to perform the enrichment analysis for sets of aggregate objects (such as sets of genetic interactions, physical protein-protein interactions or protein complexes). Fourth, annotation concepts are straightforward for a human expert to interpret. Fifth, the logic data model and logic induction form a common platform that can integrate specialized analytical tools (e.g., tools for numerical, structural and sequential analysis). Sixth, the statistical inference methods used are robust to noisy and incomplete data, scalable, and trusted by human experts in the field.
In this thesis we develop and implement the ACSEA approach. We evaluate it on large-scale datasets from several microarray experiments and on a clustered genome-wide genetic interaction network, using different biological knowledge bases. We also define a statistical model of experimental and annotation data and evaluate ACSEA on synthetic datasets. The discovered interpretations are more strongly enriched, as measured by P- and Q-values, than the interpretations found by AEA; they are highly integrative in nature and include analysis of quantitative and structured information present in the knowledge bases. The results suggest that ACSEA can significantly boost the effectiveness of processing high-throughput experiment data.|
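
For readers unfamiliar with the baseline, the hypergeometric model named in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function name, its parameters and the toy numbers are assumptions chosen for illustration. It computes the one-sided probability of seeing at least a given number of annotated genes in a subset drawn without replacement from a background population, which is the usual enrichment P-value for a single annotation term.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, subset, hits):
    """One-sided enrichment P-value for a single annotation term.

    Hypothetical helper, not taken from the thesis.
    population: number of genes in the background set
    annotated:  number of background genes carrying the term
    subset:     number of genes in the selected subset (e.g. a cluster)
    hits:       number of subset genes carrying the term
    Returns P(X >= hits) under sampling without replacement.
    """
    upper = min(annotated, subset)       # largest possible overlap
    total = comb(population, subset)     # all subsets of the given size
    tail = sum(
        comb(annotated, k) * comb(population - annotated, subset - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers (made up): 10 genes, 4 annotated, subset of 3, 2 hits.
p_value = hypergeom_enrichment_pvalue(10, 4, 3, 2)
print(f"enrichment P-value: {p_value:.4f}")
```

The sketch tests one term at a time, which is exactly the limitation of AEA that the abstract describes; testing many terms would additionally call for a multiple-testing correction.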