Merge pull request #540 from geneontology/suzialeksander-patch-109

Update gene-product-information-gpi-format-20.md
geneontology · May 8, 2024 · a4c24b5
2 parents 301a7ce + 5cad4f7
commit a4c24b5
Showing 1 changed file with 16 additions and 14 deletions.
diff --git a/_docs/gene-product-information-gpi-format-20.md b/_docs/gene-product-information-gpi-format-20.md
@@ -72,6 +72,7 @@ The file format comprises 10 tab-delimited fields. Fields with multiple values (
 
 #### DB:DB Object ID
 The **DB** prefix is the database abbreviation (namespace) from which the unique identifier **DB Object ID** is drawn and must be one of the values from the set of GO database cross-references. The **DB:DB Object ID** is the combined identifier for the database object.
+
 This field is mandatory, cardinality 1.
 
 <!--In GPI 1.0 format, the identifier may reference a top-level primary gene or gene product identifier, or an identified variant of a gene or gene product, for example identifiers that specify distinct proteins produced by differential splicing, alternative translational starts, post-translational cleavage, or post-translational modification. Identifiers for functional RNAs and protein complexes can also be included in this column. 
@@ -81,15 +82,15 @@ This field is mandatory, cardinality 1.
 #### DB Object Symbol
 A (unique and valid) symbol to which the **DB:DB_Object_ID** is matched. No white spaces allowed.
 
+The text entered in the **DB_Object_Symbol** should refer to the entity in **DB:DB_Object_ID**. The **DB_Object_Symbol** field should contain a symbol that is recognizable to a biologist wherever possible (gene product symbol, abbreviation widely used in the literature, ORF name, etc.). It is not a unique identifier or an accession number (unlike the **DB:DB_Object_ID**), although IDs can be used as a **DB_Object_Symbol** if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated). For example, several alternative transcripts from one gene may be annotated separately, each with specific gene product identifiers in **DB:DB_Object_ID**, but with the same gene symbol in the **DB_Object_Symbol** column. 
+
 This field is mandatory, cardinality 1.
-The **DB_Object_Symbol** field should contain a symbol that is recognizable to a biologist wherever possible (an abbreviation widely used in the literature, for example). It is not a unique identifier or an accession number (unlike the **DB:DB_Object_ID**), although IDs can be used as a **DB_Object_Symbol** if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated). ORF names can be used for otherwise unnamed genes or proteins. If gene products are annotated, the gene product symbol can be used if available. Many gene product annotation entries may share a gene symbol. 
-The text entered in the **DB_Object_Symbol** should refer to the entity in **DB:DB_Object_ID**. For example, several alternative transcripts from one gene may be annotated separately, each with specific gene product identifiers in **DB:DB_Object_ID**, but with the same gene symbol in the **DB_Object_Symbol** column. 
 
 #### DB Object Name
-The name of the gene or gene product in **DB:DB_Object_ID**.
+The name of the gene or gene product in **DB:DB_Object_ID**. The text entered in the **DB_Object_Name** should refer to the entity in **DB:DB_Object_ID**. White spaces are allowed in this field. 
+
+This field is not mandatory, cardinality 0, 1.
 
-This field is not mandatory, cardinality 0, 1 [white space allowed]
-The text entered in the **DB_Object_Name** and **DB_Object_Symbol** should refer to the entity in **DB:DBB_Object_ID**. 
 #### DB Object Synonym
 These entries may be a gene symbol or other text. Note that we strongly recommend that synonyms are included in the GPI file, as this aids the searching of GO.
 
@@ -107,36 +108,37 @@ An ontology identifier for the type of gene or gene product being annotated. Thi
 * marker or uncloned locus 	SO:0001645
 * any subtype of ncRNA in the Sequence Ontology
 
-This field is mandatory, cardinality 1.
 The object type (gene, transcript, protein, protein_complex, etc.) listed in the **DB_Object_Type** field must match the database entry identified by the **DB:DB_Object_ID**. Note that **DB_Object_Type** refers to the database entry (i.e. it represents a protein, functional RNA, etc.); this column does not reflect anything about the GO term or the evidence on which the annotation is based. 
 
+
+This field is mandatory, cardinality 1.
+
 #### DB Object Taxon
-The NCBI taxon ID of the species encoding the gene product.
+The NCBI taxon ID of the species encoding the gene product, specified as a number with the prefix `NCBItaxon:`. 
 
 This field is mandatory, cardinality 1.
-The taxon should be specified as a number with the prefix "taxon". 
+
 #### Encoded by
-For proteins and transcripts, **Encoded by** refers to the gene id that encodes those entities.
+For proteins and transcripts, **Encoded by** refers to the gene ID that encodes those entities.
 
 This field is not mandatory, cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries. 
 
 #### Parent Protein
 When column 1 refers to a protein isoform or modified protein, this column refers to the gene-centric reference protein accession of the column 1 entry.
 
 This field is optional, cardinality 0+; multiple identifiers should be pipe-separated.
+
 #### Protein Containing Complex Members
 When column 1 references a protein-containing complex, this column contains the gene-centric reference protein accessions
 
 This field is optional, cardinality 0+; multiple identifiers should be pipe-separated.
 
 #### DB_Xrefs
-Identifiers for the object in **DB:DB_Object_ID** found in other databases.
+Identifiers for the object in **DB:DB_Object_ID** found in other databases. Identifiers used must be standard 2-part global identifiers, e.g. UniProtKB:OK0206. For gene products in model organism databases, **DB_Xrefs** must include the UniProtKB ID, and may also include NCBI gene or protein IDs, etc. 
 
 This field is optional, cardinality 0+; multiple identifiers should be pipe-separated.
-Identifiers used must be a standard 2-part global identifiers, e.g. UniProtKB:OK0206 
-
-This column should be used to record IDs for this object in other databases; for gene products in model organism databases, this must include the UniProtKB ID, and may also include NCBI gene or protein IDs, etc. 
 
 #### Gene Product Properties
-This field is optional, cardinality 0+; multiple properties should be pipe-separated.
 The Properties column can be filled with a pipe-separated list of values in the format "property_name = property_value". There is a fixed vocabulary for the property names and this list can be extended when necessary. Supported properties will include: 'GO annotation complete', 'Phenotype annotation complete' (the value for these two properties would be a date), 'Target set' (e.g. Reference Genome, Kidney etc.), 'Database subset' (e.g. Swiss-Prot, TrEMBL). 
+
+This field is optional, cardinality 0+; multiple properties should be pipe-separated.
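
The field rules described in this diff (pipe-separated multi-value columns, 2-part global identifiers, symbols without white space, the `NCBItaxon:` prefix, and `property_name = property_value` pairs) could be checked with a few small helpers. The following is only a minimal sketch: the function names and regular expressions are hypothetical, chosen for illustration, and are not part of the GPI specification or of any GO tooling.

```python
import re

# Hypothetical helpers: names and patterns are illustrative assumptions,
# not taken from the GPI specification or any GO library.
TWO_PART_ID = re.compile(r"^[^\s:|]+:[^\s|]+$")  # e.g. PREFIX:LOCAL_ID
TAXON_ID = re.compile(r"^NCBItaxon:\d+$")        # prefix followed by a number


def split_multi(value):
    """Split a pipe-separated field; an empty field yields an empty list."""
    return [part.strip() for part in value.split("|") if part.strip()]


def is_two_part_id(identifier):
    """Check for a prefix and a local ID joined by a colon."""
    return bool(TWO_PART_ID.match(identifier))


def is_valid_symbol(symbol):
    """A symbol must be non-empty and contain no white space."""
    return bool(symbol) and not any(ch.isspace() for ch in symbol)


def is_valid_taxon(taxon):
    """Check for a number carrying the NCBItaxon: prefix."""
    return bool(TAXON_ID.match(taxon))


def parse_properties(value):
    """Parse 'name = value' pairs separated by pipes into a dict of lists."""
    props = {}
    for entry in split_multi(value):
        name, sep, val = entry.partition("=")
        if not sep:
            raise ValueError(f"property without '=': {entry!r}")
        props.setdefault(name.strip(), []).append(val.strip())
    return props
```

For example, `is_two_part_id("UniProtKB:OK0206")` accepts the identifier quoted in the text, while `is_valid_taxon("taxon:9606")` rejects the older `taxon` prefix. Values are collected into lists so that a property name appearing more than once is not silently overwritten; whether repeated names are actually permitted is not stated in the diff.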