OrthoXML Schema
This schema defines OrthoXML version 0.3.
Author(s): Sanjit Roopra, Dave Messina, Fabian Schreiber, Thomas Schmitt, and Erik Sonnhammer.
SBC - Stockholm Bioinformatics Centre. 2011. More info at http://orthoxml.org
The OrthoXML root element.
The source program/database of the file, for
instance OMA or InParanoid.
The version number of the file.
The version or release number of the source
program/database at the time the file was generated.
The species element contains all sequences of one
species.
The NCBI Taxonomy identifier of the species, used to
identify it unambiguously.
The name of the species.
A database element contains all genes from a single
database/source.
A Uniform Resource Identifier (URI) pointing to
the gene. In the simplest case one could imagine a URL which in
concatenation with the gene identifier links to the website of the
gene in the source database. However, how this is used depends on
the source of the orthoXML file.
Name of the database.
A Uniform Resource Identifier (URI) pointing to
the protein.
A Uniform Resource Identifier (URI) pointing to
the transcript.
Version number of the database.
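The simplest use of the gene URI described above, concatenating it with the gene identifier, can be sketched as follows. This is only an illustration: the attribute names (geneLink, geneId, name, version, NCBITaxId), the example URL, and the helper gene_urls are assumptions not spelled out in this text, and the sample omits the XML namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; attribute names and the URL are assumptions.
SAMPLE = """
<species name="Homo sapiens" NCBITaxId="9606">
  <database name="ExampleDB" version="1" geneLink="http://example.org/gene/">
    <genes>
      <gene id="1" geneId="G0001"/>
    </genes>
  </database>
</species>
"""

def gene_urls(species_xml):
    """Return {geneId: URL} built by concatenating geneLink and geneId."""
    species = ET.fromstring(species_xml)
    urls = {}
    for database in species.findall("database"):
        base = database.get("geneLink")
        if base is None:
            continue  # no gene URI given for this database; nothing to build
        for gene in database.iter("gene"):
            gene_id = gene.get("geneId")
            if gene_id is not None:
                urls[gene_id] = base + gene_id
    return urls

print(gene_urls(SAMPLE))
```

As the description notes, how the URI is actually used depends on the source of the file, so plain concatenation may not apply to every resource.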
A genes element represents a list of genes.
The gene element represents a single gene, protein
or transcript. It is in fact a set of identifiers: one internal
identifier that is used to link from geneRef elements in ortholog
clusters and gene identifiers, transcript identifiers and protein
identifiers to identify the molecule. The proper term for this
element would therefore rather be molecule. However, as the general
purpose of orthoXML is to represent orthology data for genes the
term gene is used instead. Gene, protein and transcript identifiers
are optional but at least one of the three should be given. The
source database of the gene is defined through the database element
in which the gene element lies and the identifiers should stem from
this source.
Identifier of the gene in the source database.
Multiple splice forms are possible by having the same geneId more
than once.
Internal identifier to link to the gene via the geneRef elements.
Identifier of the protein in the source database.
Identifier of the transcript in the source database.
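A minimal sketch of how a reader might index gene elements by their internal identifier and flag entries that carry none of the three optional identifiers. The attribute names (id, geneId, protId, transcriptId) and the helper index_genes are assumptions for illustration; the sample is hypothetical and omits the XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two splice forms share geneId "G1"; gene 3 has no identifiers.
SAMPLE = """
<database name="ExampleDB" version="1">
  <genes>
    <gene id="1" geneId="G1" protId="P1a"/>
    <gene id="2" geneId="G1" protId="P1b"/>
    <gene id="3"/>
  </genes>
</database>
"""

ID_ATTRS = ("geneId", "protId", "transcriptId")

def index_genes(database_xml):
    """Map internal id -> given identifiers; also list ids lacking all three."""
    root = ET.fromstring(database_xml)
    index, missing = {}, []
    for gene in root.iter("gene"):
        internal = gene.get("id")
        ids = {k: gene.get(k) for k in ID_ATTRS if gene.get(k) is not None}
        if not ids:
            missing.append(internal)  # violates "at least one of the three"
        index[internal] = ids
    return index, missing

index, missing = index_genes(SAMPLE)
print(index)
print(missing)
```

Keying on the internal identifier rather than on geneId keeps multiple splice forms of the same gene distinct.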
A list of score definitions.
Represents the list of ortholog groups. Note that
the purpose of OrthoXML is to store orthology assignments; hence, at
the top level only ortholog groups are allowed.
A group of genes or nested groups. In the case of an
orthologGroup element, all genes in the group or in the nested groups are
orthologs of each other, i.e. they stem from the same gene in the last common
ancestor of the species. In the case of a paralogGroup, the genes are
paralogs of each other. Subgroups within the group allow the
representation of phylogenetic trees. For more details and examples
see http://orthoxml.org/orthoxml_doc.html.
A group may contain two or more of the three alternatives
geneRef, paralogGroup, and orthologGroup. By combining these,
complex phylogenies are possible.
Identifier for the group in context of the resource. This attribute is
not required but if your resource provides identifiers for the ortholog
groups we strongly recommend using it at least for the top-level groups.
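Because groups nest, a recursive walk is a natural way to read them. The sketch below turns a group into a nested tuple and lists the referenced genes; the helper names group_to_tree and leaves are hypothetical, the id attribute name is an assumption, and the sample omits the XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: gene 1 is orthologous to the paralogs 2 and 3.
SAMPLE = """
<groups>
  <orthologGroup id="OG1">
    <geneRef id="1"/>
    <paralogGroup>
      <geneRef id="2"/>
      <geneRef id="3"/>
    </paralogGroup>
  </orthologGroup>
</groups>
"""

GROUP_TAGS = ("orthologGroup", "paralogGroup")

def group_to_tree(group):
    """Return (tag, group id or None, members); members are gene ids or subtrees."""
    members = []
    for child in group:
        if child.tag == "geneRef":
            members.append(child.get("id"))
        elif child.tag in GROUP_TAGS:
            members.append(group_to_tree(child))
        # other children (scores, annotations, notes) are ignored in this sketch
    return (group.tag, group.get("id"), members)

def leaves(tree):
    """Flatten a tree from group_to_tree into the list of referenced gene ids."""
    out = []
    for member in tree[2]:
        out.extend(leaves(member) if isinstance(member, tuple) else [member])
    return out

top = ET.fromstring(SAMPLE).find("orthologGroup")
tree = group_to_tree(top)
print(tree)
print(leaves(tree))
```

The nested group here has no identifier, which is allowed since the attribute is not required.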
The geneRef element is a link to the gene
definition under the species element. It defines the members of an
ortholog or paralog group. The same gene can be referenced multiple
times. The geneRef element can have multiple score elements and a
notes element as children. The notes element can for instance be
used for special, ortholog-database-specific information (with InParanoid,
for example, we could use it to mark the seed orthologs).
Internal identifier for a gene element defined under the species element.
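To illustrate the link between geneRef and gene, the following sketch builds a lookup from internal identifiers to genes and resolves every geneRef in the groups. The attribute names, the sample data, and the helper resolve_gene_refs are assumptions for illustration; the XML namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-species document with one ortholog group.
SAMPLE = """
<orthoXML origin="example" version="0.3" originVersion="1">
  <species name="Homo sapiens" NCBITaxId="9606">
    <database name="ExampleDB" version="1">
      <genes><gene id="1" protId="HUMAN_P1"/></genes>
    </database>
  </species>
  <species name="Mus musculus" NCBITaxId="10090">
    <database name="ExampleDB" version="1">
      <genes><gene id="2" protId="MOUSE_P1"/></genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="OG1">
      <geneRef id="1"/>
      <geneRef id="2"/>
    </orthologGroup>
  </groups>
</orthoXML>
"""

def resolve_gene_refs(doc_xml):
    """Return (species name, protId) for each geneRef, in document order."""
    root = ET.fromstring(doc_xml)
    genes = {}
    for species in root.findall("species"):
        for gene in species.iter("gene"):
            genes[gene.get("id")] = (species.get("name"), gene.get("protId"))
    groups = root.find("groups")
    # A KeyError here would mean a geneRef points at an undefined gene.
    return [genes[ref.get("id")] for ref in groups.iter("geneRef")]

print(resolve_gene_refs(SAMPLE))
```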
The scoreDef element defines a score. One of the
concepts of orthoXML is to be as flexible as possible but still
uniformly parsable. Part of this is to allow every
ortholog resource to give their own types of scores for groups or
group members, which is done using score elements. Score elements
can be defined to apply to either groups or geneRefs. It is possible to define multiple scores.
An internal identifier to link to the
scoreDef from a score element.
Description of the score.
The score element gives the value of a score and
links it to the scoreDef element, which defines the score. It can be
a child of a group or a geneRef element to allow scoring on
different levels.
An identifier linking to the scoreDef element,
which defines the score.
The actual value of the score. For instance a
confidence score of a group member.
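The relationship between score and scoreDef can be sketched as a simple join on the identifier. The attribute names (id, desc, value), the treatment of values as floats, the sample scores, and the helper described_scores are all assumptions made for illustration; the XML namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one group-level score and one score on a group member.
SAMPLE = """
<orthoXML origin="example" version="0.3" originVersion="1">
  <scores>
    <scoreDef id="bit" desc="BLAST bit score"/>
    <scoreDef id="conf" desc="Confidence of group membership"/>
  </scores>
  <groups>
    <orthologGroup id="OG1">
      <score id="bit" value="512"/>
      <geneRef id="1"><score id="conf" value="0.98"/></geneRef>
      <geneRef id="2"/>
    </orthologGroup>
  </groups>
</orthoXML>
"""

SCORED_TAGS = ("orthologGroup", "paralogGroup", "geneRef")

def described_scores(doc_xml):
    """Return (parent tag, parent id, score description, value) per score."""
    root = ET.fromstring(doc_xml)
    defs = {d.get("id"): d.get("desc") for d in root.iter("scoreDef")}
    rows = []
    for parent in root.iter():
        if parent.tag not in SCORED_TAGS:
            continue
        for score in parent.findall("score"):  # direct children only
            rows.append((parent.tag, parent.get("id"),
                         defs[score.get("id")], float(score.get("value"))))
    return rows

for row in described_scores(SAMPLE):
    print(row)
```

Looking up only direct score children keeps group-level scores separate from scores attached to individual members.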
Key-value pair for group annotations, for instance
statistics about the group members.
The key of the key-value annotation pair.
The value of the key-value annotation pair. Optional
to allow flag-like annotations.
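A small sketch of reading these key-value annotations, where a missing value marks a flag. The element name property, the attribute names (name, value), the sample keys, and the helper group_properties are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one key-value pair and one flag without a value.
SAMPLE = """
<orthologGroup id="OG1">
  <property name="TaxRange" value="Mammalia"/>
  <property name="reviewed"/>
  <geneRef id="1"/>
  <geneRef id="2"/>
</orthologGroup>
"""

def group_properties(group_xml):
    """Return {key: value}; a flag-like annotation maps to None."""
    group = ET.fromstring(group_xml)
    return {p.get("name"): p.get("value") for p in group.findall("property")}

print(group_properties(SAMPLE))
```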
The notes element is a special element that
allows adding information that is not general enough to be part
of the standard, i.e. something specific to a particular ortholog
database or algorithm. Notes elements will not be validated,
so any child elements are legal. Notes elements can be children of the
root element orthoXML, the species element, the orthologGroup
element, the paralogGroup element, or the geneRef element.