Introduction
No comprehensive ontology currently serves as a data schema for describing knowledge in the field of gene expression. This may seem surprising, considering the ubiquity of gene expression analyses in molecular biology and the existence of multiple well-established resources for gene expression, such as Bgee, Expression Atlas (EA), GENEVESTIGATOR, RefEx (Reference Expression dataset) and the Tissue Expression database. We note here that different gene expression databases often use distinct criteria to assert “expressed in” or “absent in” relations.
To the best of our knowledge, at least three semantic models currently exist as initial attempts to structure gene expression related data: the Relation Ontology (RO), the Expression Atlas semantic model (discontinued) and Bioschema.org / Schema.org. The Relation Ontology and Schema.org define only a few terms within the domain of gene expression and are not specifically designed for this knowledge domain. Notably, they contain an “expressed in” relation. The Expression Atlas defines a semantic model that mainly focuses on modelling the Expression Atlas data itself rather than the domain of gene expression in general. In this EA model, additional data interpretations (i.e., semantics) are not explicitly represented, for example whether a given gene is expressed or lowly expressed in some sample relative to others. Although this information could be obtained through a more complex query on the Expression Atlas SPARQL endpoint (currently discontinued), an explicit representation is lacking, and such a representation is what would allow gene expression data from these different databases to be compared.
As a first step toward a general-purpose gene expression ontology, we defined a semantic model called GenEx. GenEx is aligned with the Relation Ontology and Expression Atlas models to facilitate interoperability with existing RDF stores. Furthermore, we reuse parts of the data schemas of the ORTH and UniProtKB core ontologies to make it easy to interoperate with biological databases from other knowledge domains that are relevant to gene expression. For example, integrating orthology and gene expression data is relevant because we might want to predict gene expression conservation for orthologous genes.
GenEx is designed to capture and structure gene expression information at a higher level of abstraction than the models mentioned above.
In doing so, we intend to model the meanings of the data that end-users such as biologists expect. For example, biologists are often not interested in the expression scores themselves, which depend on the method used or on data transformations. Rather, they might look for an interpretation of those scores, such as whether a gene is highly expressed.
GenEx was defined using [OWL 2 Description Logics (DL)](https://www.w3.org/TR/owl2-overview/).
The methodology used to develop GenEx was inspired by the Simplified Agile Methodology for Ontology Development ([SAMOD](http://essepuntato.github.io/samod/)).
We defined a concept for describing absence of gene expression, called [genex:AbsenceExpression](#AbsenceExpression). The [genex:AbsenceExpression](#AbsenceExpression) class is a subclass of the complement of the [genex:Expression](#Expression) class (i.e., [genex:AbsenceExpression](#AbsenceExpression) ⊑ ¬[genex:Expression](#Expression)). [genex:Expression](#Expression) is the concept that represents gene expression under a given experimental condition (e.g., sex and anatomical entity). [genex:AbsenceExpression](#AbsenceExpression) is not equivalent to the negation of [genex:Expression](#Expression), because experiments may simply be lacking to conclude that a gene is expressed (i.e., missing gene expression information). Therefore, only data processed from experiments that confirm the absence of expression are described with the [genex:AbsenceExpression](#AbsenceExpression) class. We highlight that the [Bgee database](https://bgee.org) provides absence of expression information. Bastian et al. [(1)](https://doi.org/10.1093/nar/gkaa793) describe how this information is computed in the Bgee database and also discuss expression calls.
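The distinction between confirmed absence and missing information can be illustrated with a minimal Python sketch. The gene names, anatomical entities and the `CALLS` table below are invented for illustration only, and the function is not part of GenEx; it merely mimics which class, if any, could be asserted for a given gene and condition.

```python
from typing import Optional

# Hypothetical toy data: (gene, anatomical entity) -> call derived from experiments.
# A missing key means that no conclusive experiment is available.
CALLS = {
    ("geneA", "cerebellum"): "expressed",
    ("geneA", "liver"): "absent",
}


def expression_class(gene: str, entity: str) -> Optional[str]:
    """Return the GenEx class that could be asserted, or None when data is missing."""
    call = CALLS.get((gene, entity))
    if call == "expressed":
        return "genex:Expression"
    if call == "absent":
        # Only experiments that confirm absence justify this class.
        return "genex:AbsenceExpression"
    # No data: neither genex:Expression nor genex:AbsenceExpression is asserted.
    return None
```

In this sketch, a lookup for a pair with no entry returns `None` rather than `genex:AbsenceExpression`, which mirrors why the class is a subclass of the complement of `genex:Expression` and not equivalent to it.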
We mainly reuse terms from the following ontologies:
- The [Relation Ontology (RO)](http://www.obofoundry.org/ontology/ro.html). For example, we reuse the ro:RO_0002245 OWL object property, labelled “over-expressed in” or “highly expressed”. In DL, this can be formalised as follows: ∃[ro:RO_0002245](#http://purl.obolibrary.org/obo/RO_0002245).⊤ ⊑ [orth:SequenceUnit](#http://purl.org/net/orth#SequenceUnit), where [orth:SequenceUnit](#http://purl.org/net/orth#SequenceUnit) is the inferred domain of [ro:RO_0002245](#http://purl.obolibrary.org/obo/RO_0002245) and ⊤ is the top concept (i.e., the [owl:Thing](https://www.w3.org/TR/2004/REC-owl-semantics-20040210/#owl_Thing) term).
- The [Orthology (ORTH) ontology](http://qfo.github.io/OrthologyOntology/). For example, we reuse the [orth:SequenceUnit](#http://purl.org/net/orth#SequenceUnit) class (i.e., [orth:SequenceUnit](#http://purl.org/net/orth#SequenceUnit) ⊑ ⊤). Reusing terms and part of the data schema of the ORTH ontology significantly facilitates integration with orthology databases (e.g., OMA). In doing so, we address the use case of predicting presence or absence of expression for orthologous genes.
- The [Experimental Factor Ontology (EFO)](https://www.ebi.ac.uk/efo/). This ontology provides a description of various experimental variables (e.g., organism part, material entity and developmental stage) available in EBI databases such as EA. In GenEx, we mainly reuse classes such as [efo:EFO_0000635](#http://www.ebi.ac.uk/efo/EFO_0000635) (labelled “organism part”), [efo:EFO_0005135](#http://www.ebi.ac.uk/efo/EFO_0005135) (labelled “strain”), and [efo:EFO_0000399](#http://www.ebi.ac.uk/efo/EFO_0000399) (labelled “developmental stage”).
- The [Confidence Information Ontology (CIO)](http://www.obofoundry.org/ontology/cio.html). We reuse CIO classes as values of the [genex:hasConfidenceLevel](#hasConfidenceLevel) OWL object property. In the GenEx context, CIO classes are treated as individuals that represent the confidence levels of gene expression predictions. This is possible in OWL 2 thanks to [punning](https://www.w3.org/TR/owl2-new-features/#F12:_Punning). The following DL constraint is stated: ⊤ ⊑ ∀[genex:hasConfidenceLevel](#hasConfidenceLevel).{[cio:0000029](#http://purl.obolibrary.org/obo/CIO_0000029), [cio:0000030](#http://purl.obolibrary.org/obo/CIO_0000030), [cio:0000031](#http://purl.obolibrary.org/obo/CIO_0000031)}. This means that the [genex:hasConfidenceLevel](#hasConfidenceLevel) property values (i.e., the property range) can only be “high confidence level”, “medium confidence level”, or “low confidence level”.
- The Uber-anatomy (UBERON) ontology. In the GenEx context, UBERON classes are considered as individuals of the genex:AnatomicalEntity or [efo:EFO_0000399](#http://www.ebi.ac.uk/efo/EFO_0000399) (i.e., developmental stage) classes. The classes of the per-species developmental stage ontologies listed in the [developmental-stage-ontologies wiki](#https://github.com/obophenotype/developmental-stage-ontologies/wiki) are also considered as individuals of the [efo:EFO_0000399](#http://www.ebi.ac.uk/efo/EFO_0000399) class. Therefore, for GenEx, these ontologies are controlled vocabularies whose terms are assigned as values of the [genex:hasDevelopmentalStage](#hasDevelopmentalStage) and genex:hasAnatomicalEntity OWL object properties. These properties describe the experimental conditions of gene expression. Although the UBERON “life cycle stage” term is a synonym of the EFO “developmental stage” term, we decided to reuse the EFO term in GenEx rather than the UBERON one. This is because we want to keep the UBERON terms, which are treated as OWL individuals in the GenEx context (by applying the [OWL 2 punning](https://www.w3.org/TR/owl2-new-features/#F12:_Punning) feature), separate from the “developmental stage” class, which comes from EFO.
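The range restriction on genex:hasConfidenceLevel listed above can be approximated outside a reasoner with a small Python check. This is only a sketch under stated assumptions: triples are modelled as plain tuples, the pairing of each CIO IRI with a label follows the order in which they appear in the text, and the check is closed-world, whereas an OWL reasoner works under the open-world assumption. The helper name `invalid_confidence_triples` is hypothetical.

```python
CIO = "http://purl.obolibrary.org/obo/"

# The three CIO terms allowed as values (label pairing assumed from the text order).
ALLOWED_CONFIDENCE_LEVELS = {
    CIO + "CIO_0000029": "high confidence level",
    CIO + "CIO_0000030": "medium confidence level",
    CIO + "CIO_0000031": "low confidence level",
}

HAS_CONFIDENCE_LEVEL = "genex:hasConfidenceLevel"


def invalid_confidence_triples(triples):
    """Return the triples whose genex:hasConfidenceLevel value is not an allowed CIO term."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if p == HAS_CONFIDENCE_LEVEL and o not in ALLOWED_CONFIDENCE_LEVELS
    ]
```

Because the CIO terms are used as values of an object property, they appear here in the object position of a triple, which is the individual reading that punning allows.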
One of the main semantic modelling issues GenEx addresses is providing terms and a data schema for explicitly describing and structuring multiple combined conditions for gene expression. For example, the human HBB gene is highly expressed in the cerebellum at the 6-12 year-old child stage. The experiments supporting this conclusion were thus mainly performed under two conditions: a human developmental stage and the cerebellum anatomical entity.
In GenEx, we explicitly define a relation for each experimental condition, such as the [genex:hasAnatomicalEntity](#hasAnatomicalEntity) property, whose value is a [Uber-anatomy (UBERON)](http://uberon.github.io/) class used as an individual by [punning](https://www.w3.org/TR/owl2-new-features/#F12:_Punning). We reuse [Experimental Factor Ontology (EFO)](https://www.ebi.ac.uk/efo/) classes instead of defining literals.
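A combined condition such as the HBB example can be sketched as a small set of triples, here modelled as plain Python tuples. Several names are assumptions made for illustration: the `GENEX` namespace IRI, the linking properties `hasSequenceUnit` and `hasExpressionCondition`, and every IRI under `http://example.org/`, which stands in for the real gene, UBERON and developmental stage identifiers. Only genex:Expression, genex:hasAnatomicalEntity and genex:hasDevelopmentalStage are taken from the text.

```python
GENEX = "http://purl.org/genex#"  # assumed namespace IRI
EX = "http://example.org/"        # placeholder namespace for illustration
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def describe_expression(gene, anatomical_entity, developmental_stage):
    """Build triples for one expression statement under two combined conditions."""
    expression = EX + "expression/1"
    condition = EX + "condition/1"
    return [
        (expression, RDF_TYPE, GENEX + "Expression"),
        # Hypothetical linking properties, named here for illustration only.
        (expression, GENEX + "hasSequenceUnit", gene),
        (expression, GENEX + "hasExpressionCondition", condition),
        # One explicit relation per experimental condition, as described above.
        (condition, GENEX + "hasAnatomicalEntity", anatomical_entity),
        (condition, GENEX + "hasDevelopmentalStage", developmental_stage),
    ]


triples = describe_expression(
    EX + "gene/HBB",
    EX + "anatomy/cerebellum",           # would be a UBERON class used as an individual
    EX + "stage/6-12-year-old-child",    # would be a developmental stage ontology class
)
```

Grouping both condition properties on a single condition node keeps the combination explicit, so the same pair of anatomical entity and developmental stage could be shared by several expression statements.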
As a limitation, GenEx does not currently support differential expression analysis, unlike the (discontinued) Expression Atlas RDF schema. This is because we mostly focus on describing information that can be extracted from the present state of the Bgee database. GenEx can be viewed as a starting point towards an ontology for structuring gene expression information. We consider GenEx a semantic model rather than an ontology because it is not yet necessarily a shared conceptualisation among all stakeholders of interest. These stakeholders include, but are not limited to, the teams behind the gene expression databases Bgee, Expression Atlas, GENEVESTIGATOR and Tissue, as well as others (e.g., the Model Organism Databases).