APP : Installation

Nuwan Waidyanatha edited this page Sep 25, 2023 · 13 revisions

Getting Started

This section describes the basic steps for getting started. Given that rezaware is built on Apache Spark, with supporting Python libraries, for structured and unstructured data platforms, the steps are:

  1. Installing the Prerequisites
  2. Starting a new rezaware project
  3. Setting up the app.cfg configuration files

Prerequisites

The current rezaware, depending on the modules used, requires Apache Spark, MongoDB Community Edition, and PostgreSQL with PostGIS (https://postgis.net/). Furthermore, rezaware supports working with localhost, Amazon Web Services, and Google Cloud Computing.

git clone --recurse-submodules --remote-submodules <repo-URL>

Apache Spark

We specifically use pyspark. Pyspark is not included in the requirements.txt file because, once you install Apache Spark and set up the findspark library, findspark automatically locates the Spark installation so that the Python and py4j class functions can work together. At the time of writing these instructions, rezaware was developed and tested with Apache Spark version 3.4.
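As a minimal sketch of the findspark step, assuming findspark is installed from PyPI and that Spark lives at the path held in $SPARK_HOME: findspark.init() adds the pyspark and py4j libraries of that installation to sys.path. The helper name ensure_pyspark and its return strings are illustrative only and are not part of rezaware.

```python
import importlib.util
import os


def ensure_pyspark(spark_home=None):
    """Make pyspark importable, calling findspark only when it is needed.

    Returns a short status string (hypothetical values, for illustration).
    """
    # pyspark may already be importable, e.g. if it was pip-installed
    if importlib.util.find_spec("pyspark") is not None:
        return "already-importable"
    # Without findspark there is nothing more this helper can do
    if importlib.util.find_spec("findspark") is None:
        return "findspark-missing"
    import findspark

    # Point findspark at the Spark folder; falls back to $SPARK_HOME
    findspark.init(spark_home or os.environ.get("SPARK_HOME"))
    return "initialized"
```

Calling ensure_pyspark("/opt/spark_hadoop_3") before the first import of pyspark would be one way to use it; findspark raises an error if the given folder is not a Spark installation.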

Follow these instructions to install Apache Spark 3.4 with Java OpenJDK. Optionally, you may follow the steps below as well.

  1. Follow the instructions to download Apache Spark 3.4.1. Be sure to select version 3.4.1 in the Choose a Spark release drop-down.
    • A more reliable download: wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
    • You may try the latest version but then you will need to ensure the respective supporting jar files and drivers are installed.
  2. Extract and move the entire folder to a directory of your choice.
    • untar the compressed file: tar xvf spark-3.4.1-bin-hadoop3.tgz
  3. Also rename the folder to a shorter name like spark_hadoop_3.
    • rename the folder: mv spark-3.4.1-bin-hadoop3 spark_hadoop_3
  4. Typically, we place it in the /opt/ folder.
    • move to /opt/ folder: sudo mv spark_hadoop_3 /opt/
  5. This would also be the value of your $SPARK_HOME environment variable ($SPARK_HOME='/opt/spark_hadoop_3/').
  6. Since spark is built on Java, you will need Java running on your machine.
  7. Be sure to install findspark; otherwise, the class functions will complain that they are unable to find pyspark.
  8. Additionally, you may need postgresql-42.6.0.jar in your $SPARK_HOME/jars/ folder. You may copy it from any jar file repository, such as the Maven JAR repo.
    • copy essential jar files into $SPARK_HOME/jars/: cp HERO/defaults/jars/*.jar $SPARK_HOME/jars/
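The installation steps above can be sanity-checked with a short script. This is a sketch under stated assumptions: the function name check_spark_install is hypothetical, the jar name defaults to the postgresql-42.6.0.jar mentioned in step 8, and the Java check only looks for a java executable on the PATH without verifying its version.

```python
import os
import shutil
from pathlib import Path


def check_spark_install(spark_home, jdbc_jar="postgresql-42.6.0.jar"):
    """Return a list of problems found; an empty list means all checks passed."""
    if not spark_home:
        return ["SPARK_HOME is not set"]
    home = Path(spark_home)
    if not home.is_dir():
        return [f"SPARK_HOME does not point to a directory: {home}"]
    problems = []
    # Step 8: the PostgreSQL JDBC driver is expected in $SPARK_HOME/jars/
    if not (home / "jars" / jdbc_jar).is_file():
        problems.append(f"missing {jdbc_jar} in {home / 'jars'}")
    # Step 6: Spark needs a Java runtime available on the PATH
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_spark_install(os.environ.get("SPARK_HOME")):
        print("WARNING:", problem)
```

Running the script after exporting SPARK_HOME (for example /opt/spark_hadoop_3/) prints one warning per failed check and nothing when every check passes.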

MongoDB Community Edition

  1. Install MongoDB 4.4 Community Edition

Starting a New Project

  1. Create an empty git repository with the desired project name; e.g., MyNewProj.
    • This presupposes that you have git installed and initialized on your computer.
  2. Clone your MyNewProj into a desired directory location; for example
    • cd ~/all_rez_projects/
    • git clone https://github.com/<my_git_user_name>/MyNewProj.git
  3. Move into the newly created project folder
    • cd ~/all_rez_projects/MyNewProj
  4. Now clone and initialize rezaware platform as a submodule
    • git submodule add -b main https://github.com/waidyanatha/rezaware.git rezaware
    • git submodule init; will copy the mapping from the .gitmodules file into the local ./.git/config file
  5. (Recommended) you may also consider installing and setting up an Anaconda environment with python-3.8.10 to avoid any distutils issues.
    • create a new environment using the requirements.txt file that is in the rezaware folder:
      • conda create --name rezenv python=3.8.10 --file requirements.txt
    • Thereafter, check that all packages listed in requirements.txt were installed
      • conda list will print a list to stdout
    • Activate your conda environment;
      • e.g. conda activate rezenv
  6. Navigate into the rezaware folder and run setup to initialize the project with AI/ML functional app classes
    • cd rezaware
    • In the next command run the setup for rezaware separately and the apps separately
      • python3 -m 000_setup --app=rezaware --with_ini_files; it is important to use the --with_ini_files flag because it instructs 000_setup.py to build the rezaware app and the python __init__.py and app.ini files necessary for seamless package integration
      • python3 -m 000_setup; at the onset you will not have any wrangler, mining, or visuals code in the respective modules folders; hence, you cannot build the python __init__.py and app.ini files. Without the --with_ini_files flag, the process simply generates the app folder structure and a default app.cfg file.
    • You have now created your MyNewProj with the rezaware platform framework and can begin coding.
    • Note that you need to configure the app.cfg in the mining, wrangler, and visuals apps
      • each time you add or remove module packages, app.cfg must be updated accordingly
      • any other parameters specific to the project must also be changed.
  7. Change back to the project directory
    • cd .. or cd ~/all_rez_projects/MyNewProj
  8. Add the submodule and initialize
    • git add .gitmodules rezaware/
    • git init
  9. (Optional) Include a README.md file, if not already
    • echo "# Welcome to MyNewProj" >> README.md
  10. Add and commit all newly created files and folders in MyNewProj
    • git add .
    • git commit -m "added rezaware submodule and setup project"
  11. Push the submodule and new commits to the repo
    • git push origin main
    • Check your github project in the browser; you will see a folder rezaware @ xxxxxxx, where xxxxxxx is the first 7 characters of the rezaware.git commit hash
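Step 4 above records the submodule mapping in the .gitmodules file. As a minimal sketch, assuming the file has the usual shape git writes for a submodule added with -b main, the mapping can be inspected with Python's standard configparser. The names read_submodules and EXAMPLE_GITMODULES are illustrative and not part of rezaware; configparser copes with this simple case but is not a full git-config parser.

```python
import configparser

# Assumed content of .gitmodules after the `git submodule add -b main ...` step
EXAMPLE_GITMODULES = (
    '[submodule "rezaware"]\n'
    "\tpath = rezaware\n"
    "\turl = https://github.com/waidyanatha/rezaware.git\n"
    "\tbranch = main\n"
)


def read_submodules(text):
    """Map each submodule name to its settings (path, url, optional branch)."""
    # interpolation=None keeps any '%' characters in URLs literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    submodules = {}
    for section in parser.sections():
        # Section headers look like: submodule "rezaware"
        if section.startswith("submodule "):
            name = section[len("submodule "):].strip('"')
            submodules[name] = dict(parser[section])
    return submodules


if __name__ == "__main__":
    for name, settings in read_submodules(EXAMPLE_GITMODULES).items():
        print(name, settings["path"], settings["url"], settings.get("branch"))
```

To inspect a real project, the text of MyNewProj/.gitmodules can be read from disk and passed to read_submodules in place of the example string.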

Test the new Project

Run pytest by executing the following command at your terminal prompt:

  • pytest

Update rezaware from remote repo

From time to time you will need to update the rezaware submodule in your project.

  1. Change your directory to the MyNewProj folder
    • cd ~/all_rez_projects/MyNewProj
  2. Fetch the latest changes from the rezaware.git repository and merge them into the current MyNewProj branch.
    • git submodule update --remote --merge
  3. Update the repo in github:
    • git commit -s -am "updating rezaware submodule with latest"
    • git push origin main

Reconfiguring existing project

When you add a new module package into the mining, wrangler, or visuals app folders, and define it in the app.cfg file, the __init__.py and app.ini framework files need to be updated. To do so, simply run 000_setup.py:

  • cd ~/all_rez_projects/MyNewProj/rezaware; navigates into the rezaware folder
  • python3 -m 000_setup --with_ini_files; re-configures all the apps
  • Alternatively, python3 -m 000_setup --app=wrangler,mining; re-configures only the specified apps
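The flag combinations above can be assembled programmatically, which helps when the reconfiguration is scripted. This is a sketch that uses only the --app and --with_ini_files flags shown in this page; the helper name build_setup_command is hypothetical and the code does not reflect how 000_setup.py parses its arguments internally.

```python
import shlex


def build_setup_command(apps=None, with_ini_files=False):
    """Build the argument list for `python3 -m 000_setup` from the documented flags."""
    command = ["python3", "-m", "000_setup"]
    if apps:
        # Several apps are passed as one comma-separated value, e.g. wrangler,mining
        command.append("--app=" + ",".join(apps))
    if with_ini_files:
        command.append("--with_ini_files")
    return command


if __name__ == "__main__":
    # Print the command as it would be typed at the terminal prompt
    print(shlex.join(build_setup_command(["wrangler", "mining"])))
```

The returned list can be handed to subprocess.run with cwd set to the rezaware folder, since the setup module is run from inside that folder.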

About the Post Setup Artifacts

  1. Mining - Artificial Intelligence (AI) and Machine Learning (ML) analytical methods
  2. Wrangler - for processing data extract, transform, and load automated pipelines
  3. Visuals - interactive dashboards with visual analytics for Business Intelligence (BI)
  4. utils.py - contains a set of framework functions useful for all apps
  5. app.cfg - defines the app-specific config as section-wise key/value pairs
  6. Folders - each of the mining, wrangler, and visuals folders will contain a set of subfolders
    • dags - organizing airflow or other scheduler pipelines scripts
    • data - specific parametric data and tmp files
    • db - database scripts for creating the schema, tables, and initial data
    • logs - log files created by each module package
    • modules - managing the package functional class libraries
    • notebooks - jupyter notebooks for developing and testing pipeline scripts
    • tests - pytest scripts for applying unit & functional tests for any of the packages
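The artifact list above can be turned into a quick layout check. This is a sketch under assumptions that are not confirmed by this page: that the mining, wrangler, and visuals folders sit directly under the folder passed in, and that each holds its own app.cfg. The names missing_artifacts, APPS, and SUBFOLDERS are illustrative.

```python
from pathlib import Path

# App folders and subfolders as listed in the artifacts section
APPS = ("mining", "wrangler", "visuals")
SUBFOLDERS = ("dags", "data", "db", "logs", "modules", "notebooks", "tests")


def missing_artifacts(project_root):
    """List expected app folders, subfolders, and app.cfg files that are absent."""
    root = Path(project_root)
    missing = []
    for app in APPS:
        app_dir = root / app
        if not app_dir.is_dir():
            # Report the app folder once instead of every subfolder inside it
            missing.append(str(app_dir))
            continue
        for sub in SUBFOLDERS:
            if not (app_dir / sub).is_dir():
                missing.append(str(app_dir / sub))
        # Assumption: each app keeps its app.cfg at the top of its folder
        if not (app_dir / "app.cfg").is_file():
            missing.append(str(app_dir / "app.cfg"))
    return missing


if __name__ == "__main__":
    for path in missing_artifacts("."):
        print("missing:", path)
```

An empty result means the folder structure matches the list above; any printed path points to an artifact that the setup step did not create or that was removed afterwards.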