# Package : Loader : Files Class

```python
from rezaware.etl.loader import sparkFile
```
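The examples that follow use an instance named ```clsRW```. A minimal sketch of creating one is shown below; the class name ```FileWorkLoads``` and its constructor argument are assumptions for illustration, not the confirmed API:

```python
from rezaware.etl.loader import sparkFile

# Hypothetical instantiation -- the class name and constructor argument
# are illustrative assumptions; check the sparkFile module for the
# actual signature.
clsRW = sparkFile.FileWorkLoads(desc="read and write OTA scraper files")
```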
The abstraction allows for seamlessly reading and writing unstructured data from:
- local file systems on your personal computer
- cloud storage such as AWS S3 buckets, Google Cloud Storage (GCS), and Azure Blob Storage
- other remote hard drives with NTFS, FAT32, and similar file systems
One must define the storage mode (e.g., 'LOCAL-FS', 'AWS-S3-Bucket') and the root path or name (e.g., the AWS S3 bucket name or the absolute path to the main rezaware folder: /home/username/rezaware/). Thereafter, data can be retrieved from any relative folder path (or AWS S3 key) and converted into a desired data type (e.g., DICT, STR, PANDAS, and so on).
```python
self._storeModeList = [
    'local-fs',       # local hard drive on personal computer
    'aws-s3-bucket',  # cloud amazon AWS S3 Bucket storage
    'google-storage', # google cloud storage
]
```
By setting the ```asType``` class property (```self._asType```), the file data that is read will be returned in the specified data type.
```python
self._asTypeList = [
    'STR',    # text string ""
    'LIST',   # list of values []
    'DICT',   # dictionary {}
    'ARRAY',  # numpy array
    'SET',    # set of values {}
    'PANDAS', # pandas dataframe
    'SPARK',  # spark dataframe
]   # list of data types to convert content to
```
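As an illustration of what those conversion targets mean (plain Python and pandas here, not the class's internal code), the same JSON content could come back as a DICT or a PANDAS dataframe:

```python
import json
import pandas as pd

# plain-python illustration of the 'DICT' and 'PANDAS' targets
_content = '{"room": "double", "price": 120.0}'
_as_dict = json.loads(_content)            # 'DICT'  : dictionary {}
_as_pandas = pd.json_normalize(_as_dict)   # 'PANDAS': pandas dataframe
```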
```python
# Set the storage mode
clsRW.storeMode = "LOCAL-FS"
# Get the storage mode
_mode = clsRW.storeMode
```
The store root is either the root folder for local-fs, or the bucket name for AWS S3 and Google Cloud Storage.
```python
# Set the store root
clsRW.storeRoot = "/home/USER/workspace/rezaware"
#
# OR for an AWS S3 bucket
clsRW.storeRoot = "MY-BUCKET-NAME"
#
# Getting the store root will display the currently set root
clsRW.storeRoot
```
```python
# to receive the data as a pandas dataframe
__as_type__ = "PANDAS"
# relative path to the file
__folder_path__ = "wrangler/data/ota/scraper/hospitality/bookings/"
# here we are reading a json file
__file_name__ = "something.json"
# specify the file type if reading multiple files
__file_type__ = None
# initialize the properties to read the data
store_props = {
    "asType": __as_type__,
    "filePath": __folder_path__,
    "fileName": __file_name__,
}
# set the store data
clsRW.storeData = store_props
# view the data
print(clsRW.storeData)
```
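The comments above suggest that multiple files can be read by leaving the file name unset and specifying a file type instead. A sketch under that assumption (the ```fileType``` property key is inferred from the comments, not confirmed):

```python
# Assumption inferred from the comments above: with fileName set to
# None, fileType selects all matching files in the folder.
store_props = {
    "asType": "SPARK",   # return a spark dataframe this time
    "filePath": "wrangler/data/ota/scraper/hospitality/bookings/",
    "fileName": None,    # no single file; read the whole folder
    "fileType": "json",  # only pick up JSON files
}
clsRW.storeData = store_props
sdf = clsRW.storeData    # the folder content as a spark dataframe
```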
#### Prerequisites
- a running installation of Spark with Hadoop 3.0 or later, together with PySpark

The PySpark integration requires several ```.jar``` files for the implemented local, AWS S3, and GCS read/write functionality to work.
- These packages can be downloaded from the [Maven MVN Repository](https://mvnrepository.com/).
- They must be copied to the ```$SPARK_HOME/jars/``` folder (e.g., ```/opt/spark/jars/```).
- Depending on the storeMode that is set (one of the items in the storeModeList), the class property ```jarDir``` must be set to the relevant ```.jar``` file (see the sketch after this list):
  - local-fs (optional): ```self.jarDir = "$SPARK_HOME/jars/postgresql-42.5.0.jar"```
  - aws-s3-bucket: ```self.jarDir = "$SPARK_HOME/jars/aws-java-sdk-s3-1.12.376.jar"```
  - google-storage: ```self.jarDir = "$SPARK_HOME/jars/gcs-connector-hadoop3-2.2.10.jar"```
- local-fs (optional): it also helps to copy ```$HADOOP_HOME/etc/hadoop/core-site.xml``` to ```$SPARK_HOME/conf/```.
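For context, the ```jarDir``` value ultimately has to reach the Spark session. A minimal sketch of attaching a JAR with standard PySpark configuration (this is generic PySpark, not the class's internal code):

```python
from pyspark.sql import SparkSession

# Generic PySpark: point spark.jars at the JAR the chosen storeMode
# needs before the session is created.
spark = (SparkSession.builder
         .appName("sparkFile-demo")
         .config("spark.jars", "/opt/spark/jars/aws-java-sdk-s3-1.12.376.jar")
         .getOrCreate())
```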
NOTE: it has not been fully tested whether all of these are necessary. Additionally, the JAR files must be compatible with the Apache Spark installation. The following JAR files were tested with Spark 3.3.3 and Hadoop 3.3.4:
* postgresql-42.5.0.jar
* aws-java-sdk-core-1.12.376.jar
* aws-java-sdk-dynamodb-1.12.376.jar
* aws-java-sdk-s3-1.12.376.jar
* hadoop-aws-3.3.4.jar
* jets3t-0.9.4.jar
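For reference, the hadoop-aws and aws-java-sdk JARs are what enable the ```s3a://``` file system. A sketch of the standard hadoop-aws settings they unlock (the configuration keys are standard; the credential values and bucket name are placeholders):

```python
from pyspark.sql import SparkSession

# Standard hadoop-aws configuration keys; the credential values are
# placeholders -- use your own keys or an instance profile instead.
spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.access.key", "YOUR-ACCESS-KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "YOUR-SECRET-KEY")
         .getOrCreate())
sdf = spark.read.json(
    "s3a://MY-BUCKET-NAME/wrangler/data/ota/scraper/hospitality/bookings/")
```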
Refer to [hadoop-connectors/gcs at master · GoogleCloudDataproc/hadoop-connectors](https://github.com/GoogleCloudDataproc/hadoop-connectors/tree/master/gcs) on GitHub for more details.
#### Install google cloud utilities
* google-cloud-cli
* google-cloud-sdk
* gsutil
Simply execute ```sudo apt install -y google-cloud-cli google-cloud-sdk gsutil```
#### JAR files required
Follow the [google cloud instructions](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage). The jar files should be made available in your ```$SPARK_HOME/jars``` folder. The jar files, for the specific version, can be downloaded from [Maven](https://mvnrepository.com/) and [JARdownload](https://jar-download.com/) repositories.
* gcs-connector-hadoop3-latest.jar
* google-http-client-1.42.3.jar
* google-api-client-2.1.1.jar
* avro-1.11.2.jar
* antlr4-runtime-4.9.3.jar (use the version matching your instance)
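Once the connector JAR is in place, the ```gs://``` scheme can be wired up with the standard GCS-connector Hadoop settings. A sketch (the configuration keys are from the connector's documentation; the key-file path and bucket name are placeholders):

```python
from pyspark.sql import SparkSession

# Standard GCS-connector settings; the service-account key file path
# is a placeholder for your own credentials.
spark = (SparkSession.builder
         .config("spark.hadoop.fs.gs.impl",
                 "com.google.cloud.hadoop.fs.GoogleHadoopFileSystem")
         .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
         .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                 "/path/to/service-account.json")
         .getOrCreate())
sdf = spark.read.json(
    "gs://MY-BUCKET-NAME/wrangler/data/ota/scraper/hospitality/bookings/")
```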
Rezaware abstract BI augmented AI/ML entity framework © 2022 by Nuwan Waidyanatha is licensed under Creative Commons Attribution 4.0 International