Skip to content

The goal of `workflow.data.preparation` is to prepare all of the necessary data inputs for the Transition Monitor web application.

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md
Notifications You must be signed in to change notification settings

RMI-PACTA/workflow.data.preparation

Repository files navigation

Lifecycle: stable docker

workflow.data.preparation

workflow.data.preparation orchestrates the PACTA data preparation process, combining production, financial, scenario, and currency data into a format suitable for use in a PACTA for investors analysis. Assuming that the computing resource being used has sufficient memory (which can be >16Gb depending on the inputs), storage space, and access to the necessary inputs, this is intended to work on a desktop or laptop using RStudio or run using the included Dockerfile and docker-compose.yml.

Running in RStudio

R package dependencies

Running workflow.data.preparation has a number of R package dependencies that are listed in the DESCRIPTION file. These can be installed manually or by using something like pak::local_install_deps().

Setting appropriate config options

To make things easier, the recommended way to specify the desired config set when running locally in RStudio is by setting the active config set to desktop and modifying/adding only a few of the properties in the desktop config set. By doing so, you benefit from inheriting many of the appropriate configuration values without having to explicitly specify each one.

You will need to set the inherits parameter, e.g. inherits: 2022Q4, to select which of the config sets specified in the config.yml file that is desired.

You will need to set data_prep_outputs_path to an existing directory where you want the outputs to be saved, e.g. data_prep_outputs_path: "./outputs" to point to an existing directory named outputs in the working directory of the R session you will be running data.prep in. This directory must exist before running data.prep (and ideally be empty). The script will throw an error early on if it does not exist.

You will need to set asset_impact_data_path to the locally accessible directory where the necessary asset data files are located (absolute, or relative to the working directory of the R session you will be running data.prep in).

You will need to set factset_data_path to the locally accessible directory where the necessary financial data files are located (absolute, or relative to the working directory of the R session you will be running data.prep in).

You will need to set scenarios_data_path to the locally accessible directory where the necessary scenario data files are located (absolute, or relative to the working directory of the R session you will be running data.prep in).

Setting the active config set

Before you begin, you must set the active config in an open R session with Sys.setenv(R_CONFIG_ACTIVE = "desktop").

Running data.prep

Once the above steps have been completed, you should be able to run run_pacta_data_preparation.R, either by sourcing it, e.g. source("run_pacta_data_preparation.R"), or by running it line-by-line (or select lines of it) interactively.

Running locally with docker compose

Running the workflow requires a file .env to exist in the root directory, that looks like...

# .env
HOST_FACTSET_EXTRACTED_PATH=/PATH/TO/factset-extracted
HOST_ASSET_IMPACT_PATH=/PATH/TO/asset-impact
HOST_SCENARIO_INPUTS_PATH=/PATH/TO/scenario-sources
HOST_OUTPUTS_PATH=/PATH/TO/YYYYQQ_pacta_analysis_inputs_YYYY-MM-DD/YYYYQQ
R_CONFIG_ACTIVE=YYYYQQ
  • HOST_FACTSET_EXTRACTED_PATH the local path to where the FactSet input files live. docker-compose volume mounts this directory and reads files from it, so it requires appropriate permissions on the host filesystem. The pacta.data.preparation process requires input files that must exist in this directory and they must have filenames that match those specified in the config.yml for the specified config. See "Required Input Files" (below) for more information.
  • HOST_ASSET_IMPACT_PATH the local path to where the Asset Impact input files live. docker-compose volume mounts this directory and reads files from it, so it requires appropriate permissions on the host filesystem. The pacta.data.preparation process requires input files that must exist in this directory and they must have filenames that match those specified in the config.yml for the specified config. See "Required Input Files" (below) for more information.
  • HOST_SCENARIO_INPUTS_PATH the local path to where the scenarios input files live. docker-compose volume mounts this directory and reads files from it, so it requires appropriate permissions on the host filesystem. The pacta.data.preparation process requires input files that must exist in this directory and they must have filenames that match those specified in the config.yml for the specified config. See "Required Input Files" (below) for more information.
  • HOST_OUTPUTS_PATH the local path to save the output files. docker-compose volume mounts this directory and writes files to it, so it requires appropriate permissions on the host filesystem.
  • R_CONFIG_ACTIVE the name of the config to use. The config.yml file contains named configurations which define the settings used during PACTA data preparation. See top-level yaml names of config.yml for valid options.

Run docker compose up from the root directory, and docker will build the image (if necessary), and then run the data.prep process given the specified options in the .env file.

Use docker compose build --no-cache to force a rebuild of the Docker image.

Running Data Preparation interactively on Azure VM

Instructions specific to the RMI-PACTA team's Azure instance are in Italics.

  1. Prerequisites: These steps have been completed on the RMI Azure instance.

    • Ensure a Virtual Network with a Gateway has been set up, permitting SSH (Port 22) access. Details of setting this up are out of scope for these instructions. Talk to your network coordinator for help.
    • Set up Storage Accounts containing the required files. While all the files can exist on a single file share, in a single storage account, the workflow can access different storage accounts, to allow for read-only access to raw data, to prevent accident manipulation of source data. The recommended structure (used by RMI) is:
      • Storage Account: pactadatadev: (read/write). Naming note: RMI QAs datasets prior to moving them to PROD with workflow.pacta.data.qa.
        • File Share workflow-data-preparation-outputs: Outputs from this workflow.
      • Storage Account: pactarawdata (read-only)
    • (Optional, but recommended) Create a User Assigned Managed Identity. Alternately, after creating the VM with a system-managed identity, you can assign all appropriate permissions. RMI: The workflow-data-preparation Identity exists with all the appropriate permissions.
    • Grant Appropriate permissions to the Identity:
      • pactadatadev: "Reader and Data Access".
      • pactarawdata: "Reader and Data Access" Note that this gives read/write access the Storage Account via the Storage Account Key. To grant read-only access to the VM, use the mount_afs script without the -w flag, as shown below.
  2. Start a VM While the machine can be deployed via the Portal (WebUI), for simplicity, the following code block is provided which ensures consistency:

    # The options here work with the RMI-PACTA team's Azure setup.
    # Change values for your own instance as needed.
    
    # Get Network details.
    VNET_RESOURCE_GROUP="RMI-PROD-EU-VNET-RG"
    VNET_NAME="RMI-PROD-EU-VNET"
    SUBNET_NAME="RMI-SP-PACTA-DEV-VNET"
    SUBNET_ID="/subscriptions/feef729b-4584-44af-a0f9-4827075512f9/resourceGroups/RMI-PROD-EU-VNET-RG/providers/Microsoft.Network/virtualNetworks/RMI-PROD-EU-VNET/subnets/RMI-SP-PACTA-DEV-VNET"
    
    # Use the identity previously setup (see Prerequisites)
    MACHINEIDENTITY="/subscriptions/feef729b-4584-44af-a0f9-4827075512f9/resourceGroups/RMI-SP-PACTA-PROD/providers/Microsoft.ManagedIdentity/userAssignedIdentities/workflow-data-preparation"
    # This size has 2 vCPU, and 32GiB memory, recommended settings.
    MACHINE_SIZE="Standard_E4-2as_v4"
    # Using epoch to give machine a (probably) unique name
    MACHINE_NAME="dataprep-runner-$(date +%s)"
    # NOTE: Change this to your own RG as needed.
    VM_RESOURCE_GROUP="RMI-SP-PACTA-DEV"
    
    # **NOTE: Check these options prior to running**
    # Non-RMI users may choose to omit the --public-ip-address line for public SSH Access.
    
    az vm create \
      --admin-username azureuser \
      --assign-identity "$MACHINEIDENTITY" \
      --generate-ssh-keys  \
      --image Ubuntu2204 \
      --name "$MACHINE_NAME" \
      --nic-delete-option delete \
      --os-disk-delete-option delete \
      --public-ip-address "" \
      --resource-group "$VM_RESOURCE_GROUP" \
      --size "$MACHINE_SIZE" \
      --subnet "$SUBNET_ID"
    

    If this command successfully runs, it will output a JSON block describing the resource (VM) created.

  3. Connect to the Network. (Optional) RMI: Connecting to the VPN will enable SSH access. Connect to the Virtual Network specified above, as the comand above does not create a Public IP Address. Details for this are out of scope for these instructions. Contact your network coordinator for help.

  4. Connect to the newly created VM via SSH.

    # This connects to the VM created above via SSH.
    # See above block for envvars referenced here.
    
    az ssh vm \
        --local-user azureuser \
        --name "$MACHINE_NAME" \
        --prefer-private-ip \
        --resource-group "$VM_RESOURCE_GROUP"
    
  5. Connect the VM to required resources Clone this repo, install the az cli utility, and mount the appropriate Azure File Shares.

    # Clone this repo through https to avoid need for an SSH key
    git clone https://github.com/RMI-PACTA/workflow.data.preparation.git ~/workflow.data.preparation
    
    # Install az cli
    sudo apt update
    # See https://aka.ms/installcli for alternate instructions
    curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
    
    # Login to azure with assigned identity
    az login --identity
    
    # Use script from this repo to connect to file shares
    ~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-PROD" -a "pactarawdata" -f "factset-extracted" -m "/mnt/factset-extracted"
    ~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-PROD" -a "pactarawdata" -f "asset-impact" -m "/mnt/asset-impact"
    ~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-DEV" -a "pactadatadev" -f "workflow-scenario-preparation-outputs" -m "/mnt/workflow-scenario-preparation-outputs"
    
    # Note the outputs directory has the -w flag, meaning write permissions are enabled.
    ~/workflow.data.preparation/scripts/mount_afs.sh -r "RMI-SP-PACTA-DEV" -a "pactadatadev" -f "workflow-data-preparation-outputs" -m "/mnt/workflow-data-preparation-outputs" -w
    
  6. Install Docker

    # install docker
    sudo apt -y install \
        docker-compose \
        docker.io
    
    # Allow azureuser to run docker without sudo
    sudo usermod -aG docker azureuser

    At this point, you need to log out of the shell to reevaluate group memberships (add the docker group to azureuser). You can log back in with the az ssh command from step 3. When you are back into the shell, you can run docker run --rm hello-world to confirm that docker is working correctly, and you are able to run as a non-root user.

  7. Prepare .env file The ubuntu2204 image used for the VM includes both vim and nano. Create a .env file in the workflow.data.preparation directory, according to the instructions in the running locally section of this file.

  8. Build Docker image The cloned git repo in the home directory, and mounted directories should sill be in place after logging in again. Additionally, azureuser should be part of the docker group. you can confirm this with:

    groups
    ls ~
    ls /mnt

    With that in place, you are ready to build the workflow.data.preparation docker image. To ensure that a dropped network connection does not kill the process, you should run this in tmux.

    # navigate to the workflow.data.preparation repo
    cd ~/workflow.data.preparation
    
    tmux
    
    docker compose build
    
    docker compose up
    

Required Input Files

Asset Impact Data

Files from Asset Impact provide production forecasts. All required files must exist at $HOST_ASSET_IMPACT_PATH, in a single directory (no subdirectories).

The required files are:

  • masterdata_ownership e.g. "2022-08-15_rmi_masterdata_ownership_2021q4.csv"
  • masterdata_debt e.g. "2023-01-13_rmi_masterdata_debt_2021q4.csv"
  • ar_company_id__factset_entity_id e.g. "2022-08-17_rmi_ar_fs_id_bridge_2021q4.csv"

FactSet Data

Files exported by {workflow.factset} provide financial data to tie to production data. See the workflow.factset README for more information on expected file format. All required files must exist at $HOST_FACTSET_EXTRACTED_PATH, in a single directory (no subdirectories).

The required files are:

  • factset_entity_financing_data.rds
  • factset_entity_info.rds
  • factset_financial_data.rds
  • factset_fund_data.rds
  • factset_isin_to_fund_table.rds
  • factset_iss_emissions.rds
  • factset_issue_code_bridge.rds
  • factset_industry_map_bridge.rds
  • factset_manual_pacta_sector_override.rds

Scenarios Data

Files exported by {workflow.scenario.preparation} provide scenario data to be combined with the ABCD data. See the workflow.scenario.preparation README for more information on expected file format. All required files must exist at $HOST_SCENARIO_INPUTS_PATH, in a single directory (no subdirectories).

The required files are:

  • dependent on what sceanrios are meant to be included

Promotion of Datasets to PROD

workflow.transiton.monitor

Data sets to prepare images from workflow.transition.monitor are stored in the pactadatadev Storage Account (RMI-SP-PACTA-DEV Resource Group), in the file share workflow-data-preparation-outputs. The dataset used is defined by the directory name in the build config for each image (build/config/rmi_pacta_YYYYqX_ZZZZ.json), in the data_share_path key.

workflow.pacta.webapp and workflow.pacta.dashboard

For the workflow.pacta.webapp and workflow.pacta.dashboard images, the PACTA data is expected as a bind mount to the docker image (rather than "baked in", as with workflow.transition.monitor). For Azure Container Instances running on our tenant, the expected file share to mount is pacta_data, in the rmipactawebappdata Storage Account (in the RMI-SP-PACTA-WEU-PAT-DEV Resource Group). The top level directories in that File Share correspond to the directories in the pactadatadev/workflow-data-preparation-outputs file share, and should be passed as environment variables to the docker image (see workflow repos for more detail).

Transferring from pactadatadev to rmipactawebappdata

Prepared datasets can be copied from pactadatadev to rmipactawebappdata with the following commands:

DIRNAME="2023Q4_20240718T150252Z" # Change as needed.
TOKEN_START=$(date -u -j '+%Y-%m-%dT%H:%MZ')
TOKEN_EXPIRY=$(date -u -j -v "+20M" '+%Y-%m-%dT%H:%MZ')

DESTINATION_ACCOUNT_NAME="rmipactawebappdata"
DESTINATION_SHARE="pacta-data"
DESTINATION_SAS="$(
    az storage share generate-sas \
        --account-name $DESTINATION_ACCOUNT_NAME \
        --expiry $TOKEN_EXPIRY \
        --permissions rcw \
        --name $DESTINATION_SHARE \
        --start $TOKEN_START \
        --output tsv
)"

# note permissions are different. rcl allows listing contents, rcw above is to write
SOURCE_ACCOUNT_NAME="pactadatadev"
SOURCE_SHARE="workflow-data-preparation-outputs"
SOURCE_SAS="$(
    az storage share generate-sas \
        --account-name $SOURCE_ACCOUNT_NAME \
        --expiry $TOKEN_EXPIRY \
        --permissions rcl \
        --name $SOURCE_SHARE \
        --start $TOKEN_START \
        --output tsv
)"

COPY_SOURCE="https://$SOURCE_ACCOUNT_NAME.file.core.windows.net/$SOURCE_SHARE/$DIRNAME"?$SOURCE_SAS
COPY_DESTINATION="https://$DESTINATION_ACCOUNT_NAME.file.core.windows.net/$DESTINATION_SHARE/$DIRNAME?$DESTINATION_SAS" 
echo "$COPY_SOURCE"
echo "$COPY_DESTINATION"

azcopy copy \
    "$COPY_SOURCE" \
    "$COPY_DESTINATION" \
    --as-subdir=false \
    --recursive

About

The goal of `workflow.data.preparation` is to prepare all of the necessary data inputs for the Transition Monitor web application.

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages