SGD import models 20240124 #3

Merged
merged 2 commits into from
Jan 25, 2024

Conversation

dustine32
Contributor

For geneontology/project-management#34. New batch of 7075 SGD models from Protein2GO-sourced GPAD. A few changes to this GPAD were needed to make it importable into Noctua:

  1. Map UniProtKB IDs in columns 1 (DB object), 7 (with/from), and 11 (extensions) to SGD IDs.
  2. Map RNAcentral IDs in column 1 to SGD IDs.
  3. Map some of the annotation_properties (col 12) contributor-id values to ORCIDs (e.g., contributor-id=GOC:Kara Dolinski -> contributor-id=https://orcid.org/0000-0002-7010-0264).
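To make the third change concrete, here is a minimal Python sketch of the contributor-id rewrite. It is not the script used for this PR. It assumes that the annotation_properties column holds `key=value` pairs separated by `|`, and the function name `replace_contributor_ids` and the in-memory `lookup` dict are hypothetical; only the example name and ORCID come from the item above.

```python
def replace_contributor_ids(properties: str, orcid_by_name: dict[str, str]) -> str:
    """Rewrite contributor-id values in one annotation_properties field.

    Assumes "key=value" pairs separated by "|". Values without an entry
    in the lookup are left unchanged.
    """
    rewritten = []
    for pair in properties.split("|"):
        key, sep, value = pair.partition("=")
        if key == "contributor-id" and value in orcid_by_name:
            value = orcid_by_name[value]
        rewritten.append(f"{key}{sep}{value}")
    return "|".join(rewritten)


# Hypothetical lookup holding the single example pair mentioned above.
lookup = {"GOC:Kara Dolinski": "https://orcid.org/0000-0002-7010-0264"}
print(replace_contributor_ids("contributor-id=GOC:Kara Dolinski", lookup))
```

Only "some" values are mapped, so the sketch deliberately passes unmapped contributors through untouched rather than failing.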

More details and scripts to come in comment(s) below.

@dustine32
Contributor Author

To clarify the changes we made to the upstream Protein2GO GPAD:

  1. Map UniProtKB IDs in columns 1 (DB object), 7 (with/from), and 11 (extensions) to SGD IDs:
     python3 scripts/remap_gpad_ids_from_gpi.py -g SGD_gpad4noctua.gpa -i source/gpi.sgd > SGD_gpad4noctua.gpa.remapped
  2. Map RNAcentral IDs in column 1 to SGD IDs:
     grep -e "^RNAcentral" SGD_gpad4noctua.gpa.remapped > SGD_gpad4noctua.gpa.rnacentral
     join -1 2 -2 6 -t $'\t' -o 2.1,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,1.13 <(sed 's/RNAcentral:/RNAcentral:\t/g' SGD_gpad4noctua.gpa.rnacentral | sort -k2) <(sort -t $'\t' -k6 source/rnacentral_sgd_lkp.tsv) | sed 's/^/SGD:/g' > SGD_gpad4noctua.gpa.rnacentral_fixed
     cat <(grep -v -e "^RNAcentral" SGD_gpad4noctua.gpa.remapped | grep -v -e "^UniProtKB") SGD_gpad4noctua.gpa.rnacentral_fixed > SGD_gpad4noctua.gpa.all_fixed_pre_valid
  3. Map some of the annotation_properties (col 12) contributor-id values to ORCIDs:
     python3 scripts/replace_gpad_properties.py -g SGD_gpad4noctua.gpa.all_fixed_pre_valid -n source/contrib_orcid_lkp.tsv > source/SGD_gpad4noctua.gpa.remapped
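For readers who cannot open scripts/remap_gpad_ids_from_gpi.py, the UniProtKB remapping can be pictured roughly as below. This is a sketch under stated assumptions, not the script itself: it assumes tab-separated GPAD lines with `!` comment lines, takes an already-built `sgd_by_uniprot` dict (how the real script derives it from the GPI file is not shown here), and the names `remap_line`, `UNIPROT_CURIE`, and `ID_COLUMNS` are hypothetical. All IDs in the usage example are made up.

```python
import re

# Matches a UniProtKB CURIE, including an isoform suffix such as "-2",
# so the whole token is looked up and replaced as one unit.
UNIPROT_CURIE = re.compile(r"UniProtKB:[A-Za-z0-9]+(?:-\d+)?")

# 0-based indexes of columns 1 (DB object), 7 (with/from), 11 (extensions).
ID_COLUMNS = (0, 6, 10)


def remap_line(line: str, sgd_by_uniprot: dict[str, str]) -> str:
    """Replace mapped UniProtKB CURIEs in the ID-bearing columns of one line.

    CURIEs without a mapping are left as they are.
    """
    line = line.rstrip("\n")
    if line.startswith("!"):  # header/comment lines pass through untouched
        return line
    cols = line.split("\t")
    for i in ID_COLUMNS:
        if i < len(cols):
            cols[i] = UNIPROT_CURIE.sub(
                lambda m: sgd_by_uniprot.get(m.group(0), m.group(0)), cols[i]
            )
    return "\t".join(cols)


# Made-up IDs, for illustration only.
mapping = {"UniProtKB:P00001": "SGD:S000000001"}
example = "\t".join(["UniProtKB:P00001"] + [""] * 5 + ["UniProtKB:P00001"] + [""] * 5)
print(remap_line(example, mapping).split("\t")[0])
```

Lines whose column 1 still starts with UniProtKB after this pass are the ones the `grep -v -e "^UniProtKB"` in the second step drops.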

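The final output of this process is source/SGD_gpad4noctua.gpa.remapped, written by the last command above (not the intermediate SGD_gpad4noctua.gpa.remapped that the first command writes to the working directory). This file was then submitted for GO rule validation (to make go_cam_sgd_valid.gpad) and GO-CAM conversion (to make the TTL files in models/).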
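The `join` pipeline used for the RNAcentral IDs is dense, so here is a rough Python equivalent of what it appears to do, reading the column numbers off the command: column 6 of the lookup file is matched against the RNAcentral ID, and column 1 of the lookup supplies the SGD ID without its `SGD:` prefix. The function name `remap_rnacentral` and the sample IDs are hypothetical. Unlike `join`, this sketch keeps input order and, if an RNAcentral ID appears more than once in the lookup, uses only the last row.

```python
def remap_rnacentral(gpad_lines, lookup_rows):
    """Swap RNAcentral IDs in column 1 for SGD IDs via an inner join.

    lookup_rows: rows already split on tabs; row[5] is the RNAcentral ID
    (without the "RNAcentral:" prefix), row[0] the bare SGD ID.
    Lines with no match in the lookup are dropped, as join would do.
    """
    sgd_by_urs = {row[5]: row[0] for row in lookup_rows}
    remapped = []
    for line in gpad_lines:
        cols = line.rstrip("\n").split("\t")
        prefix, _, urs = cols[0].partition(":")
        if prefix == "RNAcentral" and urs in sgd_by_urs:
            cols[0] = "SGD:" + sgd_by_urs[urs]
            remapped.append("\t".join(cols))
    return remapped


# Made-up lookup row and GPAD lines, for illustration only.
rows = [["S000000001", "", "", "", "", "URS_EXAMPLE_1"]]
lines = ["RNAcentral:URS_EXAMPLE_1\t\tRO:0002327", "RNAcentral:URS_EXAMPLE_2\t\tRO:0002327"]
print(remap_rnacentral(lines, rows))
```

Because unmatched RNAcentral lines disappear silently in both versions, comparing line counts before and after this step is a cheap sanity check.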
