
Package : Transform : Clean-n-Enrich Class


Introduction

The package builds on the pyspark DataFrame SQL functions for transforming data. It mainly handles ETL jobs for cleansing and enriching data. It currently supports:

  1. imputation - fill in missing values with the column mean, median, or mode
  2. pivot and unpivot - reshape data tables between wide and long formats (a minimal pyspark sketch follows this list)
  3. count nulls - count the number of null cells in each column
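
The pivot and unpivot operations do not have a section of their own below. As orientation only, here is a minimal sketch of the standard pyspark pattern such a transform typically wraps; the frame sales_df and the columns region, year, and amount are hypothetical, not taken from the package:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # hypothetical long-format data: one row per (region, year) pair
    sales_df = spark.createDataFrame(
        [("north", 2021, 100.0), ("north", 2022, 120.0),
         ("south", 2021, 80.0), ("south", 2022, 95.0)],
        "region string, year int, amount double",
    )

    # pivot: long -> wide, one column per distinct year
    wide_df = sales_df.groupBy("region").pivot("year").agg(F.first("amount"))

    # unpivot: wide -> long, via the SQL stack() expression
    # (backticks are needed because the pivoted columns are named 2021/2022)
    long_df = wide_df.select(
        "region",
        F.expr("stack(2, '2021', `2021`, '2022', `2022`) as (year, amount)"),
    )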

Functions (@staticmethod)

The class exposes several static methods for data cleansing and enrichment; they can be called directly on the class without creating an instance.

Data imputation

    @staticmethod
    def impute_data(
        data,
        column_subset: list = [],
        strategy: str = "mean",
        **kwargs
    ) -> DataFrame:
  • data must be a valid pyspark DataFrame
  • column_subset (optional) lists the columns to impute; when empty, all columns are used
  • strategy selects the imputation method: mean (default), median, or mode
  • The **kwargs are currently unused (a usage sketch follows this list)
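
For illustration, here is a minimal sketch of how a mean/median/mode imputation like the above is commonly built in pyspark, using pyspark.ml.feature.Imputer. The example frame and column names are assumptions, and this is not necessarily the exact implementation behind impute_data:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Imputer

    spark = SparkSession.builder.getOrCreate()

    # hypothetical frame with missing numeric values
    df = spark.createDataFrame(
        [(1.0, None), (2.0, 4.0), (None, 6.0)],
        "price double, qty double",
    )

    # Imputer supports the same "mean", "median", and "mode" strategies
    # as the strategy argument above ("mode" requires Spark >= 3.1)
    imputer = Imputer(
        strategy="mean",
        inputCols=["price", "qty"],
        outputCols=["price_imputed", "qty_imputed"],
    )
    imputed_df = imputer.fit(df).transform(df)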

Count Column Nulls

    @staticmethod
    def count_column_nulls(
        data,
        column_subset: list = [],
        **kwargs
    ) -> DataFrame:
  • data must be a valid pyspark DataFrame
  • column_subset (optional) lists the columns whose null cells are counted; when empty, all columns are used
  • The **kwargs are currently unused (a minimal sketch follows this list)
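
As a hedged sketch of what a column-wise null count usually looks like in pyspark (the example frame and columns are hypothetical), the standard pattern returns a single aggregate row with one null count per column:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # hypothetical frame with one null in each column
    df = spark.createDataFrame(
        [(1.0, None), (None, "b"), (3.0, "c")],
        "price double, label string",
    )

    # count() only counts non-null values, so wrapping the column in
    # when(isNull) yields the number of null cells per column
    null_counts = df.select(
        [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
    )
    null_counts.show()
    # +-----+-----+
    # |price|label|
    # +-----+-----+
    # |    1|    1|
    # +-----+-----+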