pyspark-tutorial

A diary of my learning journey into the world of Apache Spark (pyspark) from a developer (Data Engineering) perspective.

To follow my journey you will need:

  • Azure Account
  • Azure Databricks
  • Azure Data Lake Gen 2
  • Python packages (ref. Pipfile)
    • findspark
    • jupyter
    • numpy
    • pandas
    • pypandoc
    • pyspark 2.4.5
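
Before starting, it can help to confirm that the packages listed above are importable in the active environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of this repository, and the check assumes each package's import name matches its name in the list. It does not verify version pins such as the pyspark version.

```python
from importlib.util import find_spec

# Import names assumed to match the package names in the list above.
# Version pins (e.g. for pyspark) are not checked here.
REQUIRED = ["findspark", "jupyter", "numpy", "pandas", "pypandoc", "pyspark"]


def missing_packages(names):
    """Return the names that cannot be located on the current Python path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

If anything is reported as missing, installing from the Pipfile referenced above should resolve it.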

My learning path:

  • Day 1: Installing a local Spark environment
  • Day 2: My first Spark application and some basic concepts
  • Day 3: Taking a deeper look into DataFrames
  • Day 4: Getting an overview of the pyspark.sql module
  • Day 5: Doing some math and aggregations
  • Day 6: Tackling the date and time challenge
  • Day 7: Handling of NULL values
  • Day 8: JSON and complex data types to analyse semi-/unstructured data
  • Day 9: Joins
  • Day 10: Connectors and I/O performance