
Package : Loader : Files Class


        from rezaware.etl.loader import sparkFile

Introduction

The abstraction allows for seamlessly reading and writing unstructured data to and from:

  • local file systems on your personal computer
  • cloud storage such as AWS S3 buckets, Google Cloud Storage (GCS), and Azure Blob Storage
  • other remote drives (NTFS, FAT32, etc.)

One must define the storage mode (e.g., 'LOCAL-FS', 'AWS-S3-Bucket') and the root path or name (e.g., the AWS S3 bucket name or the absolute path to the main rezaware folder: /home/username/rezaware/). Thereafter, data can be retrieved from any relative folder path (or AWS S3 key) and converted into a desired data type (e.g., DICT, STR, PANDAS, and so on).
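A minimal end-to-end sketch of that workflow is shown below; the class name FileWorkLoads and its constructor argument are assumptions for illustration only, so check rezaware.etl.loader.sparkFile for the actual entry point:

        from rezaware.etl.loader import sparkFile

        # NOTE: the class name and constructor argument are illustrative assumptions
        clsRW = sparkFile.FileWorkLoads(desc="read OTA scraper data")

        clsRW.storeMode = "LOCAL-FS"                        # one of the supported storage modes
        clsRW.storeRoot = "/home/USER/workspace/rezaware"   # root folder (or bucket name for cloud modes)

        # read a JSON file relative to the store root and return it as a dict
        clsRW.storeData = {
            "asType": "DICT",
            "filePath": "wrangler/data/ota/scraper/hospitality/bookings/",
            "fileName": "something.json",
        }
        data = clsRW.storeData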

How to use it

Currently supported

Storage modes

        self._storeModeList = [
            'local-fs',      # local hard drive on personal computer
            'aws-s3-bucket', # cloud amazon AWS S3 Bucket storage
            'google-storage',# google cloud storage
        ]

Return data types

By setting the asType class property (self._asType), the file data that is read will be returned as the specified data type.

        self._asTypeList = [
            'STR',   # text string ""
            'LIST',  # list of values []
            'DICT',  # dictionary {}
            'ARRAY', # numpy array
            'SET',   # set of values {}
            'PANDAS', # pandas dataframe
            'SPARK',  # spark dataframe
        ]   # list of data types to convert content to
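Changing the asType value returns the same file as a different Python object. The sketch below reuses the storeData property described in the Read a file section further down, and assumes storeMode and storeRoot have already been set:

        # sketch: the same JSON file returned as two different data types
        for _as_type in ["DICT", "PANDAS"]:
            clsRW.storeData = {
                "asType": _as_type,
                "filePath": "wrangler/data/ota/scraper/hospitality/bookings/",
                "fileName": "something.json",
            }
            print(type(clsRW.storeData))   # dict for DICT, pandas DataFrame for PANDAS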

storeMode

        # Set the storage mode
        clsRW.storeMode = "LOCAL-FS"
        # Get the storage mode
        _mode = clsRW.storeMode

storeRoot

It is either the root folder for local-fs, or the bucket name for AWS S3 and GCS.

        # Set the store root
        clsRW.storeRoot = "/home/USER/workspace/rezaware"
        #
        # OR for AWS Bucket
        clsRW.storeRoot = "MY-BUCKET-NAME"
        #
        # Getting the store root returns the currently set root
        clsRW.storeRoot

Read a file

        # to receive the data as a pandas dataframe
        __as_type__ = "PANDAS"
        # relative path to the file
        __folder_path__ = "wrangler/data/ota/scraper/hospitality/bookings/"
        # here we are reading a json file
        __file_name__ = "something.json"
        # specify the file type if reading multiple files
        __file_type__ = None
        # initialize the properties to read the data
        store_props = {
            "asType":__as_type__,
            "filePath":__folder_path__,
            "fileName":__file_name__
        }
        # set the store data
        clsRW.storeData = store_props
        # view the data
        print(clsRW.storeData)
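To read several files at once, the __file_type__ placeholder above suggests that a file type can be supplied instead of a single file name. A hedged sketch of that idea follows; the fileType key and the fileName=None behaviour are assumptions, so verify them against the class implementation:

        # sketch: read every CSV file in the folder instead of a single file
        clsRW.storeData = {
            "asType": "SPARK",     # return a spark dataframe
            "filePath": "wrangler/data/ota/scraper/hospitality/bookings/",
            "fileName": None,      # no single file selected
            "fileType": "csv",     # assumption: the key name may differ in the class
        }
        sdf = clsRW.storeData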

Prerequisites

  • a running installation of Apache Spark with Hadoop 3.0 or later, together with pyspark

The pyspark integration requires several .jar files for the implemented local, AWS S3, and GCS read/write functionality to work.

  • These packages can be downloaded from the [Maven MVN Repository](https://mvnrepository.com/).
  • They must be copied to the $SPARK_HOME/jars/ folder (e.g. /opt/spark/jars/).
  • Depending on the storeMode that is set (one of the items in storeModeList), the class property jarDir must point to the relevant .jar file; see the sketch after this list.
    • local-fs (optional): `self.jarDir = "$SPARK_HOME/jars/postgresql-42.5.0.jar"`
    • aws-s3-bucket: `self.jarDir = "$SPARK_HOME/jars/aws-java-sdk-s3-1.12.376.jar"`
    • google-storage: `self.jarDir = "$SPARK_HOME/jars/gcs-connector-hadoop3-2.2.10.jar"`
  • It also helps to copy $HADOOP_HOME/etc/hadoop/core-site.xml to $SPARK_HOME/conf/.
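The following is a minimal sketch of selecting the jarDir for the active storeMode; the mapping simply restates the list above, and jarDir is the class property named in it:

        import os

        # map each supported storeMode to the .jar file listed above
        _jar_by_mode = {
            "local-fs": "postgresql-42.5.0.jar",
            "aws-s3-bucket": "aws-java-sdk-s3-1.12.376.jar",
            "google-storage": "gcs-connector-hadoop3-2.2.10.jar",
        }
        _spark_home = os.environ.get("SPARK_HOME", "/opt/spark")
        clsRW.storeMode = "AWS-S3-BUCKET"
        clsRW.jarDir = os.path.join(
            _spark_home, "jars", _jar_by_mode[clsRW.storeMode.lower()])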

Required JAR files

NOTE: it has not been fully tested whether all of them are necessary. Additionally, the JAR files must be compatible with the Apache Spark installation. The following JAR files were tested with Spark 3.3.3 and Hadoop 3.3.4.

Local File System

* postgresql-42.5.0.jar

AWS-S3-Bucket

* aws-java-sdk-core-1.12.376.jar
* aws-java-sdk-dynamodb-1.12.376.jar
* aws-java-sdk-s3-1.12.376.jar
* hadoop-aws-3.3.4.jar
* jets3t-0.9.4.jar
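A small convenience check (not part of the class) can confirm that the listed AWS jars are present in $SPARK_HOME/jars/:

        import os
        from pathlib import Path

        # the AWS-S3-Bucket jar files listed above
        _required = [
            "aws-java-sdk-core-1.12.376.jar",
            "aws-java-sdk-dynamodb-1.12.376.jar",
            "aws-java-sdk-s3-1.12.376.jar",
            "hadoop-aws-3.3.4.jar",
            "jets3t-0.9.4.jar",
        ]
        _jars_dir = Path(os.environ.get("SPARK_HOME", "/opt/spark")) / "jars"
        _missing = [_j for _j in _required if not (_jars_dir / _j).exists()]
        if _missing:
            print("missing jars:", _missing)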

Google-Storage

Refer to [hadoop-connectors/gcs at master · GoogleCloudDataproc/hadoop-connectors](https://github.com/GoogleCloudDataproc/hadoop-connectors/tree/master/gcs) on GitHub for more details.


#### Install Google Cloud utilities

* google-cloud-cli
* google-cloud-sdk
* gsutils

Simply execute ```sudo apt install -y google-cloud-cli google-cloud-sdk gsutils```

#### JAR files required

Follow the [google cloud instructions](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage). The jar files should be made available in your ```$SPARK_HOME/jars``` folder. The jar files, for the specific version, can be downloaded from [Maven](https://mvnrepository.com/) and [JARdownload](https://jar-download.com/) repositories.

* gcs-connector-hadoop3-latest.jar
* google-http-client-1.42.3.jar
* google-api-client-2.1.1.jar
* avro-1.11.2.jar
* antlr4-runtime-4.9.3.jar (needs a version matching your installation)
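Once the connector jars are in place, reading from a Google Cloud Storage bucket follows the same pattern as the local and AWS examples above; the sketch below uses placeholder bucket and file names:

        # sketch: read a JSON file from a Google Cloud Storage bucket
        clsRW.storeMode = "GOOGLE-STORAGE"
        clsRW.storeRoot = "MY-GCS-BUCKET-NAME"   # bucket name, see storeRoot above
        clsRW.jarDir = "/opt/spark/jars/gcs-connector-hadoop3-2.2.10.jar"
        clsRW.storeData = {
            "asType": "PANDAS",
            "filePath": "wrangler/data/ota/scraper/hospitality/bookings/",
            "fileName": "something.json",
        }
        df = clsRW.storeData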