APP : Installation
This section describes the basic steps for getting started. Because rezaware is built on Apache Spark for structured and unstructured data platforms, with supporting Python libraries, the steps are:
- Installing the Prerequisites
- Starting a new rezaware project
- Setting up the app.cfg configuration files
Depending on the modules used, rezaware currently requires Apache Spark, MongoDB community edition, and PostgreSQL with [PostGIS](https://postgis.net/). Furthermore, rezaware supports working with localhost, Amazon Web Services, and Google Cloud Computing.
git clone --recurse-submodules --remote-submodules <repo-URL>
We specifically use PySpark. PySpark is not included in the requirements.txt file because, once Apache Spark is installed and the `findspark` library is initialized, it locates the Spark installation and makes the pyspark and py4j packages importable for you. At the time of writing these instructions, rezaware was developed and tested with Apache Spark version 3.4.
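A minimal sketch of how findspark is commonly used to make PySpark importable, assuming Spark is unpacked at /opt/spark_hadoop_3 (the path used later in these instructions). The function name `start_local_spark` and the app name are illustrative only and are not part of rezaware.

```python
import os


def start_local_spark(spark_home="/opt/spark_hadoop_3", app_name="rezaware-check"):
    """Return a local SparkSession, or None when Spark or findspark is unavailable."""
    if not os.path.isdir(spark_home):
        # Spark has not been unpacked at the expected location
        return None
    try:
        import findspark  # third-party package: pip install findspark
    except ImportError:
        return None
    # Adds Spark's python/ folder and the bundled py4j zip to sys.path
    findspark.init(spark_home)
    from pyspark.sql import SparkSession  # importable only after findspark.init()
    return (
        SparkSession.builder
        .master("local[1]")
        .appName(app_name)
        .getOrCreate()
    )
```

Calling `start_local_spark()` after completing the installation steps should return a session object; a `None` result suggests that either the Spark folder or the findspark package is missing.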
Follow these instructions to install Apache Spark 3.4 with Java OpenJDK. Alternatively, you may follow the steps below.
- Follow the instructions to download Apache Spark 3.4.1. Be sure to select version 3.4.1 in the **Choose a Spark release** drop-down.
- A more reliable download:
wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
- You may try the latest version but then you will need to ensure the respective supporting jar files and drivers are installed.
- Extract and move the entire folder to directory of choice.
- untar the compressed file:
tar xvf spark-3.4.1-bin-hadoop3.tgz
- Also rename the folder to a shorter name like spark_hadoop_3.
- rename the folder:
mv spark-3.4.1-bin-hadoop3 spark_hadoop_3
- Typically, we place it in the /opt/ folder.
- move to /opt/ folder:
sudo mv spark_hadoop_3 /opt/
- This would also be the value of your $SPARK_HOME environment variable ($SPARK_HOME='/opt/spark_hadoop_3/').
- Since Spark runs on Java, you will need Java running on your machine.
- To be safe, install both the Java SDK and JRE, then check the installation with:
java --version
- rezaware currently works with OpenJDK version 16. You can follow the instructions written for OpenJDK version 17 on Ubuntu 22.04, changing the digits 17 to 16, to get the rezaware-tested version running.
- Be sure to install findspark; otherwise, the class functions will complain that they are unable to find pyspark.
- Additionally, you may need postgresql-42.6.0.jar in your $SPARK_HOME/jars/ folder. You may copy it from any jar file repository, such as the Maven JAR repo.
- copy essential jar files into $SPARK_HOME/jars/:
cp HERO/defaults/jars/*.jar $SPARK_HOME/jars/
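The Spark steps above can be sanity-checked with a short script. This is a sketch only: the helper name `check_spark_home` is hypothetical, and it assumes the standard Spark binary layout (a bin/spark-submit launcher and a jars/ folder) plus the PostgreSQL JDBC jar mentioned above.

```python
import os
from glob import glob


def check_spark_home(spark_home=None, jar_pattern="postgresql-*.jar"):
    """Return a list of problems found with the Spark install; empty when none."""
    # Fall back to the SPARK_HOME environment variable when no path is given
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    if not spark_home or not os.path.isdir(spark_home):
        return [f"SPARK_HOME is not a directory: {spark_home!r}"]
    problems = []
    # The Spark binary distribution ships its launcher scripts under bin/
    if not os.path.isfile(os.path.join(spark_home, "bin", "spark-submit")):
        problems.append("bin/spark-submit is missing")
    # The JDBC driver is expected among the jars copied into jars/
    if not glob(os.path.join(spark_home, "jars", jar_pattern)):
        problems.append(f"no jar matching {jar_pattern} in jars/")
    return problems
```

For example, `check_spark_home("/opt/spark_hadoop_3")` returns an empty list when both the launcher and the driver jar are in place.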
- Install MongoDB 4.4 community edition
- follow these instructions to install on Ubuntu 20.04
- you may have to resolve the "Depends: libssl1.1 (>= 1.1.0) but it is not installable" error:
sudo wget http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb
sudo dpkg -i libssl1.1_1.1.1f-1ubuntu2_amd64.deb
- Create an empty git repository with a desired project name; e.g., MyNewProj.
- This presupposes that you have git installed and initialized on your computer.
- Clone your MyNewProj into a desired directory location; for example
cd ~/all_rez_projects/
git clone https://github.com/<my_git_user_name>/MyNewProj.git
- Move into the newly created project folder
cd ~/all_rez_projects/MyNewProj
- Now clone and initialize the rezaware platform as a submodule
git submodule add -b main https://github.com/waidyanatha/rezaware.git rezaware
git submodule init
- git submodule init will copy the mapping from the .gitmodules file into the local ./.git/config file
- (Recommended) you may also consider installing and setting up an Anaconda environment with python-3.8.10 to avoid any distutils issues.
- create a new environment using the requirements.txt file that is in the rezaware folder:
conda create --name rezenv python=3.8.10 --file requirements.txt
- Thereafter, check that all packages listed in requirements.txt were installed; the following will print a list to stdout:
conda list --name rezenv
- Activate your conda environment; e.g.
conda activate rezenv
- Navigate into the rezaware folder and run setup to initialize the project with AI/ML functional app classes
cd rezaware
- Next, run the setup for rezaware and for the apps separately
python3 -m 000_setup --app=rezaware --with_ini_files
- it is important to use the --with_ini_files flag because it instructs 000_setup.py to build the rezaware app together with the python __init__.py and app.ini files necessary for seamless package integration
python3 -m 000_setup
- at the outset you will not have any wrangler, mining, or visuals code in the respective modules folders; hence, you cannot build the python __init__.py and app.ini files. Without the --with_ini_files flag, the process will simply generate the app folder structure and a default app.cfg file.
- You have now created your MyNewProj with the rezaware platform framework and can begin coding.
- Note that you need to configure the app.cfg in the mining, wrangler, and visuals apps
- each time you add or remove module packages, app.cfg must be updated to match
- any other parameters specific to the project must also be changed.
- Change back to the project directory
cd ..
or
cd ~/all_rez_projects/MyNewProj
- Add the submodule and initialize
git add .gitmodules rezaware/
git init
- (Optional) Include a README.md file, if not already
echo "# Welcome to MyNewProj" >> README.md
- Add and commit all newly created files and folders in MyNewProj
git add .
git commit -m "added rezaware submodule and setup project"
- Push the submodule and new commits to the repo
git push origin main
- Check your GitHub project in the browser; you will see a folder rezaware @ xxxxxxx, where xxxxxxx is the first 7 characters of the rezaware.git commit hash that the submodule points to
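The submodule mapping that `git submodule add` records in .gitmodules can also be inspected from Python. The sketch below is illustrative: `read_gitmodules` is a hypothetical helper, not part of rezaware or git, and it relies on .gitmodules using an INI-like layout that the standard configparser module can read once the leading tabs are stripped.

```python
import configparser


def read_gitmodules(path=".gitmodules"):
    """Return {submodule_name: {"path": ..., "url": ..., ...}} from a .gitmodules file."""
    with open(path, encoding="utf-8") as fh:
        # git indents keys with tabs; strip them so configparser
        # never mistakes a key line for a continuation line
        text = "\n".join(line.strip() for line in fh)
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    modules = {}
    for section in parser.sections():
        # Section headers look like: submodule "rezaware"
        name = section.split('"')[1] if '"' in section else section
        modules[name] = dict(parser[section])
    return modules
```

Run from the MyNewProj folder, `read_gitmodules()` should list a rezaware entry with its path, url, and branch.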
Run pytest by executing the command in your terminal prompt
pytest
From time to time you will need to update the rezaware submodule in your project.
- change your directory to MyNewProj folder
cd ~/all_rez_projects/MyNewProj
- fetch latest changes from rezaware.git repository, and merge them into current MyNewProj branch.
git submodule update --remote --merge
- update the repo in github:
git commit -s -am "updating rezaware submodule with latest"
git push origin main
When you add a new module package to the mining, wrangler, or visuals app folders and define it in the app.cfg file, the __init__.py and app.ini framework files need to be updated. To do so, simply run 000_setup.py
- navigate into the rezaware folder
cd ~/all_rez_projects/MyNewProj/rezaware
- re-configure all the apps
python3 -m 000_setup --with_ini_files
- alternatively, re-configure only specific apps
python3 -m 000_setup --app=wrangler,mining
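Before re-running the setup, it can help to confirm what an app.cfg currently declares. The following is a rough sketch, assuming app.cfg is an INI-style file of sections with key/value pairs that configparser can parse; the helper name `load_app_cfg` is hypothetical, and the actual section and key names depend on your project.

```python
import configparser


def load_app_cfg(path):
    """Read an app.cfg file into a plain {section: {key: value}} dictionary."""
    # interpolation=None keeps literal '%' characters in values untouched
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path, encoding="utf-8"):
        # read() returns the list of files it parsed; empty means not found
        raise FileNotFoundError(path)
    return {section: dict(parser[section]) for section in parser.sections()}
```

Printing the returned dictionary for each app's app.cfg gives a quick view of the module packages and parameters that the setup will pick up.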
- Mining - Artificial Intelligence (AI) and Machine Learning (ML) analytical methods
- Wrangler - for processing data extract, transform, and load automated pipelines
- Visuals - interactive dashboards with visual analytics for Business Intelligence (BI)
- utils.py - contains a set of framework functions useful for all apps
- app.cfg - defines the app specific config section-wise key/value pairs
- Folders - each of the mining, wrangler, and visuals folders will contain a set of subfolders
- dags - organizing airflow or other scheduler pipelines scripts
- data - specific parametric data and tmp files
- db - database scripts for creating the schema, tables, and initial data
- logs - log files created by each module package
- modules - managing the package functional class libraries
- notebooks - jupyter notebooks for developing and testing pipeline scripts
- tests - pytest scripts for applying unit & functional tests for any of the packages
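The folder layout listed above can be verified with a few lines of Python. This is a sketch under the assumption that the mining, wrangler, and visuals folders sit directly under the path you pass in; `missing_folders` is a hypothetical helper and not part of rezaware.

```python
from pathlib import Path

# App and subfolder names as listed in this section
APPS = ("mining", "wrangler", "visuals")
SUBFOLDERS = ("dags", "data", "db", "logs", "modules", "notebooks", "tests")


def missing_folders(project_root):
    """Return the relative app/subfolder paths that do not exist under project_root."""
    root = Path(project_root)
    return [
        str(Path(app) / sub)
        for app in APPS
        for sub in SUBFOLDERS
        if not (root / app / sub).is_dir()
    ]
```

An empty list means every app has its full set of subfolders; otherwise the returned entries show which ones the setup has not yet generated.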
Rezaware abstract BI augmented AI/ML entity framework © 2022 by Nuwan Waidyanatha is licensed under Creative Commons Attribution 4.0 International