Module: Extract, Load, Transform

Nuwan Waidyanatha edited this page Sep 2, 2023 · 1 revision

Introduction

The wrangler app is instrumental to ETL tasks.

  • Extracts or streams data in various formats from any source, mainly through the Apache Spark workloads in utils/etl/loads:
    • Spark file workloads (e.g., CSV, TXT, PDF, JSON)
    • Spark RDBMS workloads (e.g., PostgreSQL, MySQL)
    • Spark NoSQL workloads for unstructured data (e.g., MongoDB, CouchDB)
  • Transforms the data, using utils/etl/transform, into a format that makes domain and functional sense.
    • The extracted data is stored in a cleansed, raw form.
    • Raw data is further cleaned, transformed, cataloged, and historically archived.
    • Historic data is available for further curation and for use in data mining (AI/ML), visual analytics, and datamart services.
  • The ETL processes are usually automated with Apache Airflow using DAG files.
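
The flow described above can be sketched in a few lines. This is a minimal illustration only, not the wrangler app's code: it uses the Python standard library (csv, sqlite3, graphlib) as a stand-in for the Spark workloads and the Airflow scheduler, and every name in it (RAW_CSV, the readings table, run_pipeline, the task functions) is hypothetical. It shows the three stages as separate tasks whose run order is derived from a dependency graph, which is the idea a DAG file expresses.

```python
import csv
import io
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical sample input; stands in for a file, RDBMS, or NoSQL source.
RAW_CSV = "city,temp_c\n Colombo ,30.5\nKandy,\nGalle,28.0\n"


def extract(ctx):
    """Read the source as-is and keep the raw rows."""
    ctx["raw"] = list(csv.DictReader(io.StringIO(ctx["source"])))


def transform(ctx):
    """Clean the raw rows: trim text, drop rows with no reading, cast types."""
    ctx["clean"] = [
        {"city": row["city"].strip(), "temp_c": float(row["temp_c"])}
        for row in ctx["raw"]
        if row["temp_c"].strip()
    ]


def load(ctx):
    """Store the cleaned rows in a table for later curation."""
    conn = ctx["conn"]
    conn.execute("CREATE TABLE IF NOT EXISTS readings (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (:city, :temp_c)", ctx["clean"])
    conn.commit()


TASKS = {"extract": extract, "transform": transform, "load": load}
# Each task maps to the set of tasks that must finish before it (assumed graph).
DEPENDENCIES = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}


def run_pipeline(source, conn):
    """Run the tasks in dependency order and return that order."""
    ctx = {"source": source, "conn": conn}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](ctx)
    return order


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(run_pipeline(RAW_CSV, connection))
    print(connection.execute("SELECT city, temp_c FROM readings").fetchall())
```

Keeping each stage as its own function with explicit dependencies mirrors how a scheduler would retry or rerun a single step; in a real deployment the function bodies would call the Spark workloads instead of the standard-library stand-ins.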