Data

Raw

Raw data were taken from Brimacombe, C. (2023, March 30). Shortcomings of using freely available open species interaction networks produced by different publications. https://doi.org/10.17605/OSF.IO/MY9TV

The files go in data/raw/networks.

Data files that exceeded GitHub's size limits were compressed into zip format.

Processed

Network data

📁 link-predict ← root folder
- 📁 data
  - 📁 processed
    - 📁 features
      - 📄 features_py.csv ← features generated by the Python script
      - 📄 features_R.csv ← features generated by the R script
    - 📁 networks
      - 📄 subsamples_edge_lists.csv ← sub-sampled networks (including the original networks)
      - 📄 subsamples_metadata.csv ← sub-sampled networks metadata

features_py.csv & features_R.csv

Field	Description
link_ID	Auto-generated ID of a link (existing and non-existing)

All other fields are the features themselves. They differ between the two files because each file is produced by a different script (one Python, one R), and each script computes a different set of features.
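Since both files are keyed by link_ID, the two feature sets can be combined with a join on that field. The following is a minimal sketch using only the Python standard library. The feature names (feat_a, feat_b) and the helper merge_features are hypothetical, and it assumes each link_ID appears at most once per file.

```python
import csv
import io


def merge_features(py_file, r_file):
    """Join two feature tables on link_ID, keeping links present in both."""
    # Index the R-side rows by link_ID for constant-time lookup.
    r_rows = {row["link_ID"]: row for row in csv.DictReader(r_file)}
    merged = []
    for row in csv.DictReader(py_file):
        other = r_rows.get(row["link_ID"])
        if other is None:
            continue  # this link has no R-side features
        combined = dict(row)
        # Add the R-side columns, skipping the shared key.
        combined.update({k: v for k, v in other.items() if k != "link_ID"})
        merged.append(combined)
    return merged


# Tiny in-memory stand-ins for features_py.csv and features_R.csv.
# The feature names are made up for illustration.
py_csv = io.StringIO("link_ID,feat_a\n1,0.5\n2,0.1\n")
r_csv = io.StringIO("link_ID,feat_b\n1,3\n3,7\n")
merged = merge_features(py_csv, r_csv)
```

With real files, pass open file handles (for example, opened with newline="") instead of the StringIO objects. Values stay as strings, as csv.DictReader returns them.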

subsamples_metadata.csv

Field	Description
subsample_ID	Auto-generated ID of a sampled network
name	Name of the network
community	Ecological community (e.g., Plant-Pollinator, Plant-Seed Dispersers, etc.)
fraction	Proportion of observed links retained after sub-sampling. Currently only 0.8 (80% of observed links) and 1.0 (original network) are used
~~type~~	Deprecated
~~layer~~	Deprecated
~~repetition~~	Deprecated
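The fraction field can be used to separate original networks from sub-sampled ones. The sketch below, which uses only the Python standard library, collects the subsample_ID values for a given fraction. The helper subsample_ids and the example network names are hypothetical; the column names follow the table above.

```python
import csv
import io
import math


def subsample_ids(metadata_file, fraction):
    """Return the subsample_ID of every row whose fraction matches."""
    ids = []
    for row in csv.DictReader(metadata_file):
        # Compare floats with a tolerance rather than with ==.
        if math.isclose(float(row["fraction"]), fraction):
            ids.append(row["subsample_ID"])
    return ids


# In-memory stand-in for subsamples_metadata.csv (made-up rows).
meta_csv = io.StringIO(
    "subsample_ID,name,community,fraction\n"
    "1,net_a,Plant-Pollinator,1.0\n"
    "2,net_a,Plant-Pollinator,0.8\n"
    "3,net_b,Host-Parasite,1.0\n"
)
original_ids = subsample_ids(meta_csv, 1.0)
```

The returned IDs can then be matched against the subsample_ID field of subsamples_edge_lists.csv to pull out the corresponding links.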

subsamples_edge_lists.csv

Field	Description
link_ID	Auto-generated ID of a link (existing and non-existing)
subsample_ID	Auto-generated ID of a sampled network
higher_level	Name of species of the higher trophic level
lower_level	Name of species of the lower trophic level
weight	Weight of the link. Currently unused, so every weight is converted to binary (1.0)
class	Link class: links (1), non-links (0), and subsampled links (-1). Subsampled links are converted to 0 or 1 depending on the step (0 for feature extraction, 1 in the test set)
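The step-dependent handling of the class field can be written as a small mapping. This is a sketch of the rule as described in the table, not the pipeline's actual code; the function name resolve_class and the step labels "features" and "test" are hypothetical.

```python
def resolve_class(label, step):
    """Map a raw `class` value to a binary label for a pipeline step.

    step is a hypothetical label: "features" for feature extraction,
    "test" for building the test set.
    """
    if label in (0, 1):
        return label  # links and non-links keep their class
    if label == -1:
        # A subsampled link is hidden (0) while features are extracted
        # and counted as a true link (1) in the test set.
        if step == "features":
            return 0
        if step == "test":
            return 1
        raise ValueError(f"unknown step: {step!r}")
    raise ValueError(f"unexpected class value: {label!r}")


feature_labels = [resolve_class(c, "features") for c in (1, 0, -1)]
test_labels = [resolve_class(c, "test") for c in (1, 0, -1)]
```

Treating the withheld links as absent during feature extraction keeps them from leaking into the features, while labeling them 1 in the test set makes them the positives the model is asked to recover.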

Results data

The /results/ directory is organized as shown in the following file tree. The folders and code files are listed in order of execution.

📁 results
- 📄 results_preprocess.Rmd ← Loads and processes the results data so that each figure has its own prepared dataset.
- 📄 results_figs.Rmd ← Loads the output of results_preprocess.Rmd and generates the figures
- 📁 raw ← Contains the "raw" results, which are mainly the output of the ML pipeline
  - 📄 results_domains.csv ← Contains results from an ML model trained and tested on varying combinations of groups (network domains/communities) to assess cross-group generalization.
  - 📄 results_models.csv ← Contains results from different ML models
  - 📄 results_other_models.csv ← Contains results from different predictive models
  - 📄 feature_importance.csv ← Feature importances of all ML models
  - 📄 params_models.csv ← Parameters space and best parameters selected for each model
  - 📄 results_ML_by_single_networks.csv ← Results for ML transductive model
- 📁 intermediate ← Contains intermediate processed files, mainly the output of results_preprocess.Rmd
  - 📄 df_pred_heatmap.csv ← Result of a specific network in the test set, intended for demonstration figure.
  - 📄 metrics_df_long.csv ← Evaluation metrics of each network, long format
  - 📄 metrics_multi_df_long.csv ← Evaluation metrics of each network with multiple models, long format
  - 📄 metrics_type_df_long.csv ← Evaluation metrics of each network with varying group, long format
  - 📄 compare_other_models_metrics_df.csv ← Results of different predictive models
  - 📄 network_lvl_features.csv ← Features (network level only) for EDA
  - 📄 pr_df.csv ← Results of precision-recall curve
  - 📄 roc_df.csv ← Results of ROC curve
  - 📄 auc_df.csv ← AUC values of ROC and PR curves
  - 📄 test_data.csv ← Test set (link ids in test set) with metadata
  - 📄 bounds_summary_df.csv ← Results of theoretical bounds of each metric
  - 📄 bounds_summary_df_transductive.csv ← Results of theoretical bounds of each metric (ML transductive model)
  - 📄 pca_df.csv ← PCA components of network-level features
- 📁 final ← Contains the final figures and tables, mainly the output of results_figs.Rmd
  - 📄 ILP_vs_TLP.pdf ← Comparing inductive and transductive models, multiple evaluation metrics
  - 📄 roc_curve.pdf ← ROC curve
  - 📄 pr_curve.pdf ← Precision-Recall curve
  - 📄 communities.pdf ← Distributions of performance measures - by community
  - 📄 cross_community_prediction.pdf ← Heatmap of prediction within and between community types
  - 📄 model_bounds_ILP_TLP.pdf ← Bounds of model predictions, comparing inductive and transductive models
  - 📄 SI_networks_PCA.pdf ← PCA of networks, separated by network-level topological features
  - 📄 ILP_vs_TLP_community.pdf ← Comparing inductive and transductive models, multiple evaluation metrics, per community
  - 📄 SI_networks_summary_properties.csv ← Summary of network properties
  - 📄 SI_KW_communities.csv ← Results of Kruskal-Wallis test, comparing metrics of different communities
  - 📄 SI_KW_communities_Dunn.csv ← Dunn post-hoc tests for SI_KW_communities.csv
  - 📄 SI_models.pdf ← ML models performance comparison, multiple evaluation metrics
  - 📄 SI_predictions.pdf ← Link prediction example for a host-parasite network
  - 📄 feature_importance.pdf ← Feature importance for tested ML model (RandomForest)
  - 📄 SI_importance.pdf ← Feature importance for all tested ML models
  - 📄 SI_probabilities.pdf ← Distribution of link probabilities obtained from the model
  - 📄 SI_PR_tradeoff.pdf ← The precision-recall tradeoff as a function of classification threshold
  - 📄 SI_probabilities_community.pdf ← Density plot of link probabilities, for each community, by class
  - 📄 networks_table.csv ← Information (source) about each network
  - 📄 eval_all.pdf ← Distributions of performance measures
  - 📄 features.csv ← Information about each feature
  - 📄 model_bounds.pdf ← Bounds of model predictions
  - 📄 SI_KW_cross.csv ← Results of Kruskal-Wallis test, comparing metrics of cross communities
  - 📄 SI_KW_cross_Dunn.csv ← Dunn post-hoc tests for SI_KW_cross.csv
  - 📄 SI_cross_community.pdf ← Link prediction within and between community types
  - 📄 SI_community.pdf ← Distribution of link probabilities across different ecological communities
  - 📄 SI_features_hist.pdf ← Histogram of selected network properties
  - 📄 SI_features_hist_all_nets.pdf ← Histogram of selected network properties, across networks

Common fields in the CSV files:

Field	Description
link_ID	Auto-generated ID of a link (existing and non-existing)
community	Ecological community (e.g., Plant-Pollinator, Plant-Seed Dispersers, etc.)
name	Name of the network
fold	Number of the cross-validation (CV) fold the instance comes from (usually 1-5 or 1-3)
model	Name of the ML model used
y_proba	Probability of link of the instance, given by the model
y_true	True class of the instance
metric	Name of the evaluation metric used
feature	Name of the feature
importance	Importance value of the feature
SBM_Prob	Probability of link of the instance, given by SBM model
C_Prob	Probability of link of the instance, given by connectance model
type_train	Communities whose links form the training data
type_test	Communities whose links form the test data
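As an illustration of how the name, y_proba, and y_true fields fit together, the sketch below computes precision and recall per network at a fixed classification threshold, using only the Python standard library. It assumes a results file that contains all three columns; the helper precision_recall_by_network, the 0.5 threshold, and the example rows are hypothetical and are not taken from the project's analysis code.

```python
import csv
import io
from collections import defaultdict


def precision_recall_by_network(results_file, threshold=0.5):
    """Compute precision and recall per network at a given threshold."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for row in csv.DictReader(results_file):
        predicted = float(row["y_proba"]) >= threshold
        actual = int(float(row["y_true"])) == 1
        c = counts[row["name"]]  # also registers networks with no positives
        if predicted and actual:
            c["tp"] += 1
        elif predicted:
            c["fp"] += 1
        elif actual:
            c["fn"] += 1
    scores = {}
    for name, c in counts.items():
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        scores[name] = {
            # None marks an undefined ratio (zero denominator).
            "precision": c["tp"] / predicted_pos if predicted_pos else None,
            "recall": c["tp"] / actual_pos if actual_pos else None,
        }
    return scores


# In-memory stand-in for a results CSV (made-up rows).
results_csv = io.StringIO(
    "link_ID,name,y_proba,y_true\n"
    "1,net_a,0.9,1\n"
    "2,net_a,0.7,0\n"
    "3,net_a,0.2,1\n"
    "4,net_b,0.1,0\n"
)
scores = precision_recall_by_network(results_csv)
```

Grouping by community, model, or fold instead of name only requires changing the key used to index counts.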
