Data pre-processing pipeline for HINT
- Download up-to-date UniProt reference files from the FTP site (https://ftp.uniprot.org/pub/databases/uniprot/current_release/):
    sh download.sh
  Downloaded files will be saved at $SOURCE_DIR, which is specified in the script.
# SOURCE_DIR structure
.
|-- uniparc
|-- knowledgebase
    |-- idmapping
    |   |-- idmapping.dat.gz
    |-- complete
    |   |-- uniprot_sprot.fasta.gz
    |   |-- uniprot_sprot_varsplic.fasta.gz
    |   |-- uniprot_trembl.fasta.gz
    |-- docs
- Run parse_source_data.py to process the reference files downloaded from the UniProt FTP site:
    python parse_source_data.py $SOURCE_DIR/knowledgebase
  The following files will be processed:
  - FASTA files (uniprot_sprot.fasta.gz, uniprot_sprot_varsplic.fasta.gz, uniprot_trembl.fasta.gz) to extract protein meta information
  - species file docs/speclist.txt to extract species taxonomy information
  - secondary-to-primary accession mapping file docs/sec_ac.txt
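As a rough illustration of the FASTA step, the sketch below pulls protein meta information out of UniProtKB FASTA header lines. It is not taken from parse_source_data.py: the names HEADER_RE, parse_header and iter_fasta_meta are hypothetical, and the regular expression assumes the standard UniProtKB header layout (db|accession|entry name, followed by OS=, OX= and optional GN=, PE=, SV= fields) with gene names that contain no spaces.

```python
import gzip
import re
from typing import Dict, Iterator, Optional

# Assumed UniProtKB FASTA header layout:
# >db|Accession|EntryName ProteinName OS=Organism OX=TaxID [GN=Gene] [PE=n] [SV=n]
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<taxid>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?"
    r"(?:\s+PE=(?P<pe>\d+))?(?:\s+SV=(?P<sv>\d+))?\s*$"
)


def parse_header(line: str) -> Optional[Dict[str, object]]:
    """Return meta fields for one header line, or None if it does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return {
        "accession": match.group("accession"),
        "entry_name": match.group("entry_name"),
        "protein_name": match.group("protein_name"),
        "organism": match.group("organism"),
        "taxid": int(match.group("taxid")),
        "gene_name": match.group("gene") or "",
        # "sp" marks Swiss-Prot (reviewed), "tr" marks TrEMBL (unreviewed).
        "is_reviewed": match.group("db") == "sp",
    }


def iter_fasta_meta(path: str) -> Iterator[Dict[str, object]]:
    """Yield meta fields for every parsable header in a gzipped FASTA file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                meta = parse_header(line)
                if meta is not None:
                    yield meta


example = ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4"
print(parse_header(example))
```

Sequence lines are skipped entirely, so memory use stays flat even on the large TrEMBL file.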
- Run prepare_dataset.py to prepare protein interaction data from the sources of interest. The script performs the following steps:
  - Collect datasets from the sources of interest. Active datasets are downloaded from the source, and inactive datasets are copied from our self-maintained cache directory. Raw data files are saved at $UPDATE_DIR/data/parseTargets. The following data sources are included (last update: 2024.6.28):
    - Active:
      - BioGRID
      - IntAct
      - PDB: generate the IRES files listed below using the scripts in pdb_data_prep/. Note: consider switching to the up-to-date IRES files generated by the in-house nightly_script on the server.
        - ires_perpdb_alltax.txt
        - ires_perpdb_alltax_pdblike.txt (newly added for large PDB structures saved in pdb-bundle TAR format)
    - Inactive:
      - DIP (dip20170205.txt)
      - iRef (All.mitab.03022013.txt)
      - HPRD (BINARY_PROTEIN_PROTEIN_INTERACTIONS.txt)
      - MIPS (mppi, XML format)
      |-- static_datasets
          |-- dip20170205.txt
          |-- All.mitab.03022013.txt
          |-- mppi
          |-- BINARY_PROTEIN_PROTEIN_INTERACTIONS.txt
  - Parse the raw data and generate the initial raw_interactions.txt file, saved at $UPDATE_DIR/outputs/
  - Revise the parsed raw interaction data and fill in UniProt IDs where the source provides them. Output files are saved at $UPDATE_DIR/outputs/cache/. The following files are generated in this step:
    - raw_interactions_filled_partial.txt: revised raw interaction file with UniProt IDs filled in where available in the source
    - mapping_targets_by_type.json: IDs that remain to be mapped to UniProt IDs, organized by ID type. Supported ID types can be found in constants.py
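The fill-in step can be pictured with the minimal sketch below. It is an assumption-laden illustration, not the logic of prepare_dataset.py: it assumes interactor fields use the PSI-MI TAB convention of "database:identifier" tokens joined by '|', the function names are hypothetical, and the grouping keys are simply the database prefixes found in the input rather than the ID types defined in constants.py.

```python
import json
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Accession pattern published in the UniProt documentation.
UNIPROT_AC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def split_ids(field: str) -> List[Tuple[str, str]]:
    """Split a 'db:id|db:id' field into (db, id) pairs."""
    pairs = []
    for token in field.split("|"):
        if ":" in token:
            db, identifier = token.split(":", 1)
            pairs.append((db.strip(), identifier.strip().strip('"')))
    return pairs


def pick_uniprot(pairs: List[Tuple[str, str]]) -> Optional[str]:
    """Return the first UniProt accession supplied by the source, if any."""
    for db, identifier in pairs:
        if db.lower().startswith("uniprot") and UNIPROT_AC.match(identifier):
            return identifier
    return None


def fill_and_collect(
    fields: List[str],
) -> Tuple[List[Optional[str]], Dict[str, List[str]]]:
    """Fill UniProt IDs where present; group the remaining IDs by type."""
    filled: List[Optional[str]] = []
    targets = defaultdict(set)
    for field in fields:
        pairs = split_ids(field)
        accession = pick_uniprot(pairs)
        filled.append(accession)
        if accession is None:
            # No UniProt ID in the source: keep every ID as a mapping target.
            for db, identifier in pairs:
                targets[db].add(identifier)
    return filled, {db: sorted(ids) for db, ids in targets.items()}


fields = [
    "uniprotkb:P04637|entrez gene/locuslink:7157",
    "dip:DIP-17064N",
    "refseq:NP_000537",
]
filled, targets = fill_and_collect(fields)
print(filled)
print(json.dumps(targets, indent=2))
```

Only interactors without a source-provided UniProt ID end up in the grouped targets, which keeps the later ID-mapping pass as small as possible.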
- Run create_idmapping.py to parse idmapping.dat.gz from the UniProt FTP site and generate an ID mapping dictionary from source IDs to UniProt IDs.
  Inputs
  - idmapping.dat.gz
  - mapping_targets_by_type.json
  Outputs
  - target_type_to_uprot.json: dictionary of ID mappings organized by ID type (if an ID is mapped to multiple UniProt IDs, all UniProt IDs are kept and concatenated by '|')
  - prot_gene_info.tsv: descriptions for each UniProt ID, with columns (uprot | UniProtKB-ID | Gene_Name | Gene_ORFName | NCBI_TaxID)
  - target_result.json: mapping of target IDs to UniProt IDs when available
    Format: "ID_TYPE|SOURCE_ID": "UNIPROT_ID1(|UNIPROT_ID2|UNIPROT_ID3...)"
    (Example: '"DIP|DIP-17064N": "Q9TW27"')
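The sketch below shows one way such a target map could be built. It assumes idmapping.dat.gz holds three tab-separated columns (UniProtKB accession, ID type, external ID) and that mapping_targets_by_type.json maps each ID type to a collection of source IDs; both the function names and that JSON layout are assumptions, not a description of create_idmapping.py. Apart from the DIP line, which mirrors the example above, the sample lines are made up.

```python
import gzip
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_target_map(
    lines: Iterable[str], targets_by_type: Dict[str, Set[str]]
) -> Dict[str, str]:
    """Map 'ID_TYPE|SOURCE_ID' to '|'-joined UniProt accessions."""
    hits = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        uniprot_ac, id_type, source_id = parts
        if source_id in targets_by_type.get(id_type, ()):
            key = f"{id_type}|{source_id}"
            if uniprot_ac not in hits[key]:
                hits[key].append(uniprot_ac)  # keep every mapped accession
    return {key: "|".join(accessions) for key, accessions in hits.items()}


def build_from_gzip(path: str, targets_by_type: Dict[str, Set[str]]) -> Dict[str, str]:
    """Stream a gzipped idmapping file instead of loading it into memory."""
    with gzip.open(path, "rt") as handle:
        return build_target_map(handle, targets_by_type)


# Illustrative lines; only the DIP entry comes from the example above.
sample_lines = [
    "Q9TW27\tDIP\tDIP-17064N\n",
    "P04637\tGeneID\t7157\n",
    "Q00001\tGeneID\t7157\n",
    "P12345\tGeneID\t999\n",
]
targets = {"DIP": {"DIP-17064N"}, "GeneID": {"7157"}}
print(build_target_map(sample_lines, targets))
```

Filtering against the target sets while streaming means only the IDs that actually need mapping are kept, which matters because the full idmapping file is very large.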
- Run the code in the Jupyter notebook 4-HINT_data_curation.ipynb. The following files will be generated in the output directory, organized in the structure below.
|-- taxid2name_short.txt
|-- raw_interactome.txt
|-- HINT_format
|-- protein_meta.txt
|-- binary_all.txt
|-- binary_hq.txt
|-- both_all.txt
|-- both_hq.txt
|-- cocomp_all.txt
|-- cocomp_hq.txt
|-- htb_hq.txt
|-- htc_hq.txt
|-- lcb_hq.txt
|-- lcc_hq.txt
|-- taxa
|-- HomoSapiens
| |-- HomoSapiens_binary_all.txt
| |-- HomoSapiens_binary_hq.txt
| |-- HomoSapiens_both_all.txt
| |-- HomoSapiens_both_hq.txt
| |-- HomoSapiens_cocomp_all.txt
| |-- HomoSapiens_cocomp_hq.txt
| |-- HomoSapiens_htb_hq.txt
| |-- HomoSapiens_htc_hq.txt
| |-- HomoSapiens_lcb_hq.txt
| |-- HomoSapiens_lcc_hq.txt
|-- MusMusculus
| |-- MusMusculus_binary_all.txt
| |-- MusMusculus_binary_hq.txt
| |-- MusMusculus_both_all.txt
| |-- MusMusculus_both_hq.txt
| |-- MusMusculus_cocomp_all.txt
| |-- MusMusculus_cocomp_hq.txt
| |-- MusMusculus_htb_hq.txt
| |-- MusMusculus_htc_hq.txt
| |-- MusMusculus_lcb_hq.txt
| |-- MusMusculus_lcc_hq.txt
|-- ...
- Run 5-plot_venn.py to generate Venn diagrams of all interaction sets for human and yeast. By default, the plots are saved as PDF files in $UPDATE_DIR/figures (modify the path specs in the script if necessary).
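The numbers behind a Venn diagram are the sizes of the exclusive regions, that is, the interactions found in exactly one combination of sets. The sketch below computes those counts in plain Python. It does not reproduce 5-plot_venn.py: the function names and the set labels "binary" and "cocomp" are illustrative, and it assumes an interaction is an unordered pair of protein IDs.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def normalize(pairs: Iterable[Tuple[str, str]]) -> Set[Tuple[str, ...]]:
    """Treat (A, B) and (B, A) as the same interaction."""
    return {tuple(sorted(pair)) for pair in pairs}


def venn_regions(sets: Dict[str, Set[Tuple[str, ...]]]) -> Dict[Tuple[str, ...], int]:
    """Count the elements exclusive to each combination of named sets."""
    names = sorted(sets)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Elements shared by every set in the combination ...
            inside = set.intersection(*(sets[name] for name in combo))
            # ... minus anything that also appears in a set outside it.
            outside = set().union(*(sets[name] for name in names if name not in combo))
            regions[combo] = len(inside - outside)
    return regions


interaction_sets = {
    "binary": normalize([("A", "B"), ("B", "A"), ("C", "D")]),
    "cocomp": normalize([("A", "B"), ("E", "F")]),
}
print(venn_regions(interaction_sets))
```

These counts can then be passed to whichever plotting library draws the circles.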
The UniProt Archive (UniParc) is a non-redundant protein sequence archive containing all new and revised protein sequences from all publicly available sources.
Sometimes, protein IDs in a source database are mapped to UniProt IDs that have been deleted in the current release.
In that case, we can retrieve protein information from UniParc data. Raw UniParc files are downloaded from https://ftp.uniprot.org/pub/databases/uniprot/current_release/uniparc/xml/all/ by the download.sh script.
Run parse_uniparc.sh to parse the raw XML-format UniParc files into tab-separated format and extract inactive entries.
- Example of parsed file
accession uniprot is_reviewed status taxa protein_name gene_name seq_length source_file
UPI00000001D2 Q71UK0 False Y 10090 Growth factor receptor (Fragment) 61 uniparc_p1
UPI000000075A Q0NCH3 False N 587201 Membrane protein VARV_SAF65_102_152 277 uniparc_p1
UPI00000002D3 Q549A3 False Y 4232 LIM domain protein PLIM1a PLIM1a 219 uniparc_p1
UPI0000000563 Q6LCT9 False Y 9606 Diabetes mellitus type I autoantigen ICA1 87 uniparc_p1
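To give a feel for the XML-to-TSV conversion, here is a minimal sketch that streams UniParc entries and emits one row per UniProtKB cross-reference. It is not the content of parse_uniparc.sh. It assumes entries look like the inline SAMPLE (namespace http://uniprot.org/uniparc, dbReference elements with type, id and active attributes, and property children for taxonomy, protein name and gene name); the real files may differ between releases. The identifiers in SAMPLE are made up.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, List

NS = "{http://uniprot.org/uniparc}"  # assumed UniParc XML namespace

# Made-up entry illustrating the assumed layout.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<uniparc xmlns="http://uniprot.org/uniparc">
  <entry dataset="uniparc">
    <accession>UPI0000000001</accession>
    <dbReference type="UniProtKB/TrEMBL" id="X00001" active="N">
      <property type="NCBI_taxonomy_id" value="9606"/>
      <property type="protein_name" value="Example protein"/>
      <property type="gene_name" value="EXG1"/>
    </dbReference>
    <dbReference type="EMBL" id="AAA00001" active="Y"/>
    <sequence length="87">M</sequence>
  </entry>
</uniparc>
"""


def iter_rows(stream: BinaryIO) -> Iterator[List[str]]:
    """Yield one row per UniProtKB cross-reference of each UniParc entry."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        accession = elem.findtext(NS + "accession", default="")
        sequence = elem.find(NS + "sequence")
        length = sequence.get("length", "") if sequence is not None else ""
        for ref in elem.iter(NS + "dbReference"):
            db_type = ref.get("type", "")
            if not db_type.startswith("UniProtKB/"):
                continue  # keep UniProtKB cross-references only
            props = {p.get("type"): p.get("value", "") for p in ref.iter(NS + "property")}
            yield [
                accession,
                ref.get("id", ""),
                str(db_type.endswith("Swiss-Prot")),  # is_reviewed
                ref.get("active", ""),                # status (Y/N)
                props.get("NCBI_taxonomy_id", ""),
                props.get("protein_name", ""),
                props.get("gene_name", ""),
                length,
            ]
        elem.clear()  # free the finished entry to keep memory flat


for row in iter_rows(io.BytesIO(SAMPLE)):
    print("\t".join(row))
```

Using iterparse and clearing each finished entry avoids loading a whole UniParc file into memory.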
[Optional] Continue with the second part of the parse_uniparc.py script to extract information for target species.
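A species filter over the parsed file could be sketched as follows. The column names come from the example header above; everything else is an assumption: select_entries is a hypothetical name, status "N" is taken to mark inactive entries, and the rows in SAMPLE are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterator, Set, TextIO

# Made-up rows using the column layout of the parsed UniParc file.
SAMPLE = (
    "accession\tuniprot\tis_reviewed\tstatus\ttaxa\tprotein_name\tgene_name\tseq_length\tsource_file\n"
    "UPI0000000001\tX00001\tFalse\tN\t9606\tExample protein\tEXG1\t87\tuniparc_p1\n"
    "UPI0000000002\tX00002\tFalse\tY\t9606\tAnother protein\t\t120\tuniparc_p1\n"
    "UPI0000000003\tX00003\tFalse\tN\t10090\tMouse protein\tExg2\t61\tuniparc_p1\n"
)


def select_entries(
    handle: TextIO, taxa: Set[str], status: str = "N"
) -> Iterator[Dict[str, str]]:
    """Yield rows whose taxon is in `taxa` and whose status matches."""
    # QUOTE_NONE keeps quote characters inside protein names untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["taxa"] in taxa and row["status"] == status:
            yield row


for entry in select_entries(io.StringIO(SAMPLE), {"9606"}):
    print(entry["accession"], entry["uniprot"], entry["protein_name"])
```

Taxon IDs are compared as strings here because csv returns every field as text.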