Barry edited this page Jun 30, 2024 · 14 revisions

Raw

Raw data were taken from Brimacombe, C. (2023, March 30). Shortcomings of using freely available open species interaction networks produced by different publications. https://doi.org/10.17605/OSF.IO/MY9TV

Processed

Network data

  • πŸ“ link-predict ← root folder
    • πŸ“ data
      • πŸ“ processed
        • πŸ“ features
          • πŸ“„ features_py.csv ← features generated by python script
          • πŸ“„ features_R.csv ← features generated by R script
        • πŸ“ networks
          • πŸ“„ subsamples_edge_lists.csv ← sub-sampled networks (inc original networks)
          • πŸ“„ subsamples_metadata.csv ← sub-sampled networks metadata

subsamples_metadata.csv

fields:

  • subsample_ID - Auto generated ID of a sampled network
  • name - Name of the network
  • community - Ecological community (Plant-Pollinator, Plant-Seed Dispersers, etc.)
  • fraction - Proportion of observed links retained after sub-sampling. Currently only 0.8 (80% of links observed) and 1.0 (original network) are used
  • type - Deprecated
  • layer - Deprecated
  • repetition - Deprecated
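
The `fraction` field can be used to separate the original networks from their sub-sampled versions. The following is a minimal sketch using only the Python standard library. The sample rows and the helper name `select_by_fraction` are hypothetical and serve only as illustration; the real data live in `subsamples_metadata.csv`.

```python
import csv
import io

# Hypothetical sample rows for illustration only; the real file is
# data/processed/networks/subsamples_metadata.csv (deprecated fields omitted).
SAMPLE_METADATA = """subsample_ID,name,community,fraction
1,net_A,Plant-Pollinator,1.0
2,net_A,Plant-Pollinator,0.8
3,net_B,Plant-Seed Dispersers,1.0
"""


def select_by_fraction(rows, fraction, tol=1e-9):
    """Return the metadata rows whose `fraction` matches the requested value."""
    return [r for r in rows if abs(float(r["fraction"]) - fraction) < tol]


rows = list(csv.DictReader(io.StringIO(SAMPLE_METADATA)))

# fraction == 1.0 marks an original network, 0.8 a sub-sampled one.
original_ids = [r["subsample_ID"] for r in select_by_fraction(rows, 1.0)]
subsampled_ids = [r["subsample_ID"] for r in select_by_fraction(rows, 0.8)]
```

To run this against the real file, replace `io.StringIO(SAMPLE_METADATA)` with an open handle to `subsamples_metadata.csv`.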

subsamples_edge_lists.csv

fields:

  • link_ID - Auto generated ID of a link (existing and non-existing)
  • subsample_ID - Auto generated ID of a sampled network
  • higher_level - Name of species of the higher trophic level
  • lower_level - Name of species of the lower trophic level
  • weight - Weight of the link. Weights are currently not used, so every link is converted to binary (1.0)
  • class - link (1), non-link (0), or subsampled link (-1). Subsampled links are converted to 1 or 0 depending on the step (0 for feature extraction, 1 in the test set, etc.)
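
The step-dependent handling of the `class` field can be sketched as follows. This is an illustration of the rule described above, not the pipeline's actual code: the function name `resolve_class` and the step labels `"features"` and `"test"` are assumptions introduced here.

```python
def resolve_class(label, step):
    """Map a raw `class` value to a binary label for a given pipeline step.

    `step` is a hypothetical label: "features" treats subsampled links (-1)
    as unobserved (0); "test" treats them as true links (1).
    """
    if step not in ("features", "test"):
        raise ValueError(f"unknown step: {step!r}")
    label = int(label)
    if label == -1:
        # Subsampled link: hidden during feature extraction, positive in the test set.
        return 0 if step == "features" else 1
    if label in (0, 1):
        # Links and non-links keep their label in every step.
        return label
    raise ValueError(f"unexpected class value: {label!r}")


raw_classes = [1, 0, -1]
feature_labels = [resolve_class(c, "features") for c in raw_classes]
test_labels = [resolve_class(c, "test") for c in raw_classes]
```

Only the rows with `class` equal to -1 change between the two steps; links and non-links are passed through unchanged.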

Results data

The /results/ directory is described by the following file tree. The folders and code files are listed in order of execution.

  • πŸ“ results
    • πŸ“„ results_preprocess.Rmd ← Loading and process the results data, so each fig will have its own prepared dataset.

    • πŸ“„ results_figs.Rmd ← Loading the output of results_preprocess.Rmd and generate figures

    • πŸ“ raw ← Contains the "raw" results, which are mainly the output of the ML pipeline

      • πŸ“„ results_domains.csv ← Contains results from a ML model trained and tested on varying groups (network domains/communities) combinations to assess cross-group generalization.
      • πŸ“„ results_models.csv ← Contains results from different ML models
      • πŸ“„ results_other_models.csv ← Contains results from different predictive models
      • πŸ“„ feature_importance.csv ← Feature importances of all ML models
    • πŸ“ intermediate ← Contains intermediate processed fils, mainly the output of results_preproccessing

      • πŸ“„ df_pred_heatmap.csv ← Result of a specific network in the test set, intended for demonstration figure.
      • πŸ“„ metrics_df_long.csv ← Evaluation metrics of each network, long format
      • πŸ“„ metrics_multi_df_long.csv ← Evaluation metrics of each network with multiple models, long format
      • πŸ“„ metrics_type_df_long.csv ← Evaluation metrics of each network with varying group, long format
      • πŸ“„ compare_other_models_metrics_df.csv ← Results of different predictive models
      • πŸ“„ network_lvl_features.csv ← Features (network level only) for EDA
      • πŸ“„ pr_df.csv ← Results of precision-recall curve
      • πŸ“„ roc_df.csv ← Results of roc curve
      • πŸ“„ auc_df.csv ← AUC values of roc and pr curves
      • πŸ“„ test_data.csv ← Test set(link ids in test set) with metadata
      • πŸ“„ bounds_summary_df.csv ← Results of theoretical bounds of each metric
      • πŸ“„ pca_df.csv ← PCA components of network-level features
    • πŸ“ final ← Contains the final figures and table, mainly the output of results_figs

      • πŸ“„ communities.pdf ← Distributions of performance measures - by community
      • πŸ“„ eval_all.pdf ← Distributions of performance measures
      • πŸ“„ features.csv ← Information about each feature
      • πŸ“„ importance_pres.pdf ← Feature importance for tested ML model (RandomForest)
      • πŸ“„ kruskal_wallis.csv ← Results of Kruskal Wallis test, comparing metrics of different communities
      • πŸ“„ mann_whitney.csv ← Results of Mann-Whitney U Tests comparing the distributions of some metrics for various training and test combinations
      • πŸ“„ networks_table.csv ← Information (source) about each network
      • πŸ“„ networks_summary_properties.csv ← Summary of network properties
      • πŸ“„ predictions.pdf ← Link prediction example for a host-parasite network
      • πŸ“„ ROC.pdf ← ROC curve + PR curve
      • πŸ“„ split_set.pdf ← Link prediction within and between community types
      • πŸ“„ SI_community.pdf ← Distribution of link probabilities across different ecological communities
      • πŸ“„ SI_complete ← Comparing learning from complete vs subsampled networks
      • πŸ“„ SI_features_hist.pdf ← Histogram of selected network properties
      • πŸ“„ SI_importance.pdf ← Feature importance for all tested ML models
      • πŸ“„ SI_models.pdf ← ML models performance comparison, multiple evaluation metrics
      • πŸ“„ SI_probabilities.pdf ← Distribution of link probabilities obtained from the model
      • πŸ“„ SI_sensitivity ← Comparing performance for different fraction of removed linked
      • πŸ“„ SI_sensitivity_com ← Comparing performance for different fraction of removed linked, for each community
      • πŸ“„ SI_tradeoff.pdf ← The precision-recall tradeoff as a function of classification threshold

common fields in csvs:
