
Package : Transform : Clean-n-Enrich Class


Introduction

The package builds on the pyspark DataFrame SQL functions for transforming data. It mainly handles ETL jobs for cleansing and enriching data. It currently supports:

  1. imputation - fill in missing values with the column mean, median, or mode
  2. pivot and unpivot - reshape data tables between wide and long formats (a minimal pyspark sketch follows this list)
  3. count nulls - count the number of null cells in each column
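
The pivot and unpivot operations do not have a section of their own below. As orientation only, here is a minimal sketch of the standard pyspark pattern such a transform typically wraps; the frame sales_df and the columns region, year, and amount are hypothetical, not taken from the package:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # hypothetical long-format data: one row per (region, year) pair
    sales_df = spark.createDataFrame(
        [("north", 2021, 100.0), ("north", 2022, 120.0),
         ("south", 2021, 80.0), ("south", 2022, 95.0)],
        "region string, year int, amount double",
    )

    # pivot: long -> wide, one column per distinct year
    wide_df = sales_df.groupBy("region").pivot("year").agg(F.first("amount"))

    # unpivot: wide -> long, via the SQL stack() expression
    # (backticks are needed because the pivoted columns are named 2021/2022)
    long_df = wide_df.select(
        "region",
        F.expr("stack(2, '2021', `2021`, '2022', `2022`) as (year, amount)"),
    )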

Functions (@staticmethod)

The class exposes several static methods for data cleansing and enrichment; they can be called directly on the class without creating an instance.

Data imputation

    @staticmethod
    def impute_data(
        data,
        column_subset: list = [],
        strategy: str = "mean",
        **kwargs
    ) -> DataFrame:
  • data must be a valid pyspark DataFrame
  • column_subset (optional) lists the columns to impute; when empty, all columns are used
  • strategy selects the imputation method: mean (default), median, or mode
  • The **kwargs are currently unused (a usage sketch follows this list)
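
For illustration, here is a minimal sketch of how a mean/median/mode imputation like the above is commonly built in pyspark, using pyspark.ml.feature.Imputer. The example frame and column names are assumptions, and this is not necessarily the exact implementation behind impute_data:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Imputer

    spark = SparkSession.builder.getOrCreate()

    # hypothetical frame with missing numeric values
    df = spark.createDataFrame(
        [(1.0, None), (2.0, 4.0), (None, 6.0)],
        "price double, qty double",
    )

    # Imputer supports the same "mean", "median", and "mode" strategies
    # as the strategy argument above ("mode" requires Spark >= 3.1)
    imputer = Imputer(
        strategy="mean",
        inputCols=["price", "qty"],
        outputCols=["price_imputed", "qty_imputed"],
    )
    imputed_df = imputer.fit(df).transform(df)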

Count Column Nulls

    @staticmethod
    def count_column_nulls(
        data,
        column_subset: list = [],
        **kwargs
    ) -> DataFrame:
  • data must be a valid pyspark DataFrame
  • column_subset (optional) lists the columns whose null cells are counted; when empty, all columns are used
  • The **kwargs are currently unused (a minimal sketch follows this list)
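
As a hedged sketch of what a column-wise null count usually looks like in pyspark (the example frame and columns are hypothetical), the standard pattern returns a single aggregate row with one null count per column:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # hypothetical frame with one null in each column
    df = spark.createDataFrame(
        [(1.0, None), (None, "b"), (3.0, "c")],
        "price double, label string",
    )

    # count() only counts non-null values, so wrapping the column in
    # when(isNull) yields the number of null cells per column
    null_counts = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    )
    null_counts.show()
    # +-----+-----+
    # |price|label|
    # +-----+-----+
    # |    1|    1|
    # +-----+-----+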