Merge pull request #682 from geneontology/pgaudet-patch-85

Update gene-product-information-gpi-format-20.md
geneontology · Dec 6, 2024 · 2645582 · 2645582
2 parents 6a7f060 + eb7153c
commit 2645582
Showing 1 changed file with 39 additions and 60 deletions.
diff --git a/_docs/gene-product-information-gpi-format-20.md b/_docs/gene-product-information-gpi-format-20.md
@@ -21,20 +21,19 @@ Mandatory elements of the GPI 2.0 file header are:
 - the name of database or group generating the file, as listed in [dbxrefs.yaml file](https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml)
 - the date the file was generated conforming to the date portion of [ISO 8601](https://www.iso.org/iso-8601-date-and-time-format.html) standards, i. e. `YYYY-MM-DD`
 - Example GPI 2.0 header:
-
-
+    ```
     !gpi-version: 2.0
     !generated-by: SGD
     !date-generated: 2024-05-01
-
+    ```   
 - Additional information may also be included, for example project URL and funding sources. For example:
-
+     ```
     !URL: http://www.yeastgenome.org/
     !Project-release: WS275
     !Funding: NHGRI grant number HG012212
+     ```
 
-## GPI file fields
-
+## GPI File Contents
 The GPI 2.0 file comprises 11 tab-delimited fields. For fields that multiple values, those should be separated by pipes (`|`).
 **Required fields are shown with an asterisk (*).**
 
@@ -52,84 +51,64 @@ The GPI 2.0 file comprises 11 tab-delimited fields. For fields that multiple val
 | 10 | [Cross-reference(s)](#10-db-xrefs "Definition and requirements for DB_Xref(s) (column 10)") | 0 or > | NCBIGene:154796 \|<br/>ENSEMBL:ENSG00000126016 | NCBIGene:154796 \|<br/>ENSEMBL:ENSG00000126016 | ComplexPortal:CPX-1016 | UniProtKB:Q9DAQ4-1 | ENSG00000276365 | 
 | 11 | [Gene Product Properties](#11-gene-product-properties "Definition and requirements for Gene Product Properties (column 11)") | 0 or > |	db_subset=Swiss-Prot| | | | 
 
-### Definitions and requirements for field contents
-
+### Definitions and requirements for GPI 2.0 field contents
 #### 1. DB:Object ID
 * A unique identifier for the entity being annotated, composed of two elements: a **DB** prefix is the database, that must be described in the GO [dbxrefs.yaml file](https://github.com/geneontology/go-site/blob/master/metadata/db-xrefs.yaml), and a **DB Object ID**, which is the alphanumerical identifier corresponding to the entity. The **DB:DB Object ID** is the combined identifier for the database object. Examples:
-
-    UniProtKB:P99999
-    SGD:S000002164
-    MGI:MGI:1919306
-
-The identifier may reference the canonical form of a gene or gene product including functional RNAs, as well as gene variants, distinct proteins produced by to differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. If the gene product is not a canonical gene or gene product identifier, the corresponding canonical form must be referenced in Column 8 (Parent Protein) of the GPI file. 
-
+  + UniProtKB:P99999
+  + SGD:S000002164
+  + MGI:MGI:1919306
+* The identifier may reference the canonical form of a gene or gene product including functional RNAs, as well as gene variants, distinct proteins produced by to differential splicing, alternative translational starts, post-translational cleavage or post-translational modification. If the gene product is not a canonical gene or gene product identifier, the corresponding canonical form must be referenced in Column 8 (Parent Protein) of the GPI file. 
 * Cardinality = 1
-
+---
 #### 2. Object Symbol
-The unique symbol corresponding to the **DB:Object_ID** in Column 1; usually the name of the gene. No white spaces allowed.
-
-The symbol is not a unique identifier or an accession number (unlike the **DB:Object_ID**), but if the entity does not have a symbol, the **DB:Object_ID** may be used as **Object Symbol**. For example, several alternative transcripts from one gene may be annotated separately, each with specific gene product identifiers in **DB:Object_ID**, but with the same gene symbol in the **Object_Symbol** column. 
-
+* The unique symbol corresponding to the **DB:Object_ID** in Column 1; usually the name of the gene. No white spaces allowed.
+* The symbol is not a unique identifier or an accession number (unlike the **DB:Object_ID**), but if the entity does not have a symbol, the **DB:Object_ID** may be used as **Object Symbol**. For example, several alternative transcripts from one gene may be annotated separately, each with specific gene product identifiers in **DB:Object_ID**, but with the same gene symbol in the **Object_Symbol** column. 
 * Cardinality = 1
-
+---
 #### 3. Object Name
-The name of the gene or gene product corresponding to the **DB:Object_ID** in Column 1. White spaces are allowed in this field. 
-
+* The name of the gene or gene product corresponding to the **DB:Object_ID** in Column 1. White spaces are allowed in this field. 
 * Cardinality = 0 or 1
-
+---
 #### 4. Object Synonym
-Alternative names for the entity in **DB:Object_ID** in Column 1. These entries may be a gene symbol, clone ID, or any other label ot identifier. Object synonyms are useful for searching. 
-
+* Alternative names for the entity in **DB:Object_ID** in Column 1. These entries may be a gene symbol, clone ID, or any other label ot identifier. Object synonyms are useful for searching. 
 * Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated. 
-
+---
 #### 5. Object Type
-An ontology identifier describing the class of biological entity of the **DB:Object_ID** in Column 1. The ontology identifier must be a value from Protein Ontology for proteins,  Gene Ontology for protein complexes, or Sequence Ontology for all other entities. Allowed entity types: 
-
-* [PR:000000001](http://purl.obolibrary.org/obo/PR_000000001): protein 
-* [GO:0032991](http://purl.obolibrary.org/obo/PR_000000001): protein-containing complex 
-* [SO:0001217](http://purl.obolibrary.org/obo/SO_0001217): protein-coding gene 
-* [SO:0000704](http://purl.obolibrary.org/obo/SO_0000704): gene 
-* [SO:0000655](http://purl.obolibrary.org/obo/SO_0000655): ncRNA or any SO child term
-* [SO:0001263](http://purl.obolibrary.org/obo/SO_0001263): ncRNA-coding gene or any SO child term
-
-**Note on object types**: This field should descibe the type of biological object as defined by the contributing database. For example, [WormBase identifiers](https://wormbase.org/species/c_elegans/gene/WBGene00000001) represent [genes](http://purl.obolibrary.org/obo/SO_0000704), PomBase identifiers represent [protein-coding genes](http://purl.obolibrary.org/obo/SO_0001217), and [SGD identifiers](https://www.yeastgenome.org/locus/S000002429) represent [proteins](http://purl.obolibrary.org/obo/PR_000000001). 
-
-GO strongly recommends against using 'gene' or 'gene product' as this does not allow to differentiate between proteins and ncRNAs. 
-
+* An ontology identifier describing the class of biological entity of the **DB:Object_ID** in Column 1. The ontology identifier must be a value from Protein Ontology for proteins,  Gene Ontology for protein complexes, or Sequence Ontology for all other entities. Allowed entity types: 
+  * [PR:000000001](http://purl.obolibrary.org/obo/PR_000000001): protein 
+  * [GO:0032991](http://purl.obolibrary.org/obo/PR_000000001): protein-containing complex 
+  * [SO:0001217](http://purl.obolibrary.org/obo/SO_0001217): protein-coding gene 
+  * [SO:0000704](http://purl.obolibrary.org/obo/SO_0000704): gene 
+  * [SO:0000655](http://purl.obolibrary.org/obo/SO_0000655): ncRNA or any SO child term
+  * [SO:0001263](http://purl.obolibrary.org/obo/SO_0001263): ncRNA-coding gene or any SO child term
+* **Note on object types**: This field should descibe the type of biological object as defined by the contributing database. For example, [WormBase identifiers](https://wormbase.org/species/c_elegans/gene/WBGene00000001) represent [genes](http://purl.obolibrary.org/obo/SO_0000704), PomBase identifiers represent [protein-coding genes](http://purl.obolibrary.org/obo/SO_0001217), and [SGD identifiers](https://www.yeastgenome.org/locus/S000002429) represent [proteins](http://purl.obolibrary.org/obo/PR_000000001). 
+* GO strongly recommends against using 'gene' or 'gene product' as this does not allow to differentiate between proteins and ncRNAs. 
 <!--- 
 SGD feature type named ORF in SGD --->
-
 * Cardinality = 1
-
+---
 #### 6. Object Taxon
-The [NCBI taxon ID](https://www.ncbi.nlm.nih.gov/taxonomy) of the organism (species or strain) encoding the **DB:Object_ID** from Column 1, in the format `NCBITaxon:numerical_identifier`. 
-
+* The [NCBI taxon ID](https://www.ncbi.nlm.nih.gov/taxonomy) of the organism (species or strain) encoding the **DB:Object_ID** from Column 1, in the format `NCBITaxon:numerical_identifier`. 
 * Cardinality = 1
-
+---
 #### 7. Encoded by
-For proteins and transcripts, **Encoded by** refers to the gene ID that encodes those entities, e.g. ENSG00000197153.
-
+* For proteins and transcripts, **Encoded by** refers to the gene ID that encodes those entities, e.g. ENSG00000197153.
 * Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated. 
-
+---
 #### 8. Parent Protein
-When the **DB:Object_ID** in Column 1 describes a protein isoform or a modified protein, this column refers to the gene-centric reference protein accession of the column 1 entry.
-
+* When the **DB:Object_ID** in Column 1 describes a protein isoform or a modified protein, this column refers to the gene-centric reference protein accession of the column 1 entry.
 * Cardinality = 0, 1, > 1; for cardinality >1, values must be pipe-separated. 
 <!--- 
 How can that be??? this should be 0,1 --->
-
-
+---
 #### 9. Protein-Containing Complex Members
-When the **DB:Object_ID** in Column 1 describes a protein-containing complex, this column contains the gene-centric reference protein accessions.
-
+* When the **DB:Object_ID** in Column 1 describes a protein-containing complex, this column contains the gene-centric reference protein accessions.
 * Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated. 
-
+---
 #### 10. Database cross-references (DB_Xrefs)
-Identifiers for the object in **DB:Object_ID** found in other databases. Identifiers used must be standard 2-part global identifiers, e.g. UniProtKB:Q60FP0. For proteins in model organism databases, **DB_Xrefs** must include the correponding UniProtKB ID, and may also include NCBI gene or protein IDs, etc. 
-
+* Identifiers for the object in **DB:Object_ID** found in other databases. Identifiers used must be standard 2-part global identifiers, e.g. UniProtKB:Q60FP0. For proteins in model organism databases, **DB_Xrefs** must include the correponding UniProtKB ID, and may also include NCBI gene or protein IDs, etc. 
 * Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated. 
-
+---
 #### 11. Gene Product Properties
-The Properties column can be filled with a pipe separated list of values in the format "property_name = property_value". There is a fixed vocabulary for the property names and this list can be extended when necessary. Supported properties will include: 'GO annotation complete', "Phenotype annotation complete' (the value for these two properties would be a date), 'Target set' (e.g. Reference Genome, kidney, etc.), 'Database subset' (e.g. Swiss-Prot, TrEMBL). 
-
+* The Properties column can be filled with a pipe separated list of values in the format "property_name = property_value". There is a fixed vocabulary for the property names and this list can be extended when necessary. Supported properties will include: 'GO annotation complete', "Phenotype annotation complete' (the value for these two properties would be a date), 'Target set' (e.g. Reference Genome, kidney, etc.), 'Database subset' (e.g. Swiss-Prot, TrEMBL). 
 * Cardinality = 0, 1, > 1; for cardinality > 1, values must be pipe-separated.