Package : Transform : Clean‐n‐Enrich Class
The package builds on PySpark DataFrame SQL functions for transforming data. It mainly handles ETL jobs for cleansing and enriching data. It currently supports:
- imputation - fill in missing data using the mean, median, or mode
- pivot and unpivot data tables (a plain PySpark sketch follows this list)
- count nulls - a column-wise count of the number of null cells
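The pivot and unpivot functions are not documented further on this page, so the short sketch below only illustrates the underlying plain-PySpark DataFrame operations the class builds on; the table and column names are made up for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pivot-sketch").getOrCreate()

# Hypothetical long-form table: one row per (month, item) pair
sales = spark.createDataFrame(
    [("2023-01", "apples", 10), ("2023-01", "pears", 5), ("2023-02", "apples", 7)],
    ["month", "item", "qty"],
)

# Pivot: one row per month, one column per item
wide = sales.groupBy("month").pivot("item").sum("qty")

# Unpivot back to long form with the SQL stack() function
long_again = wide.selectExpr(
    "month",
    "stack(2, 'apples', apples, 'pears', pears) as (item, qty)",
)
long_again.show()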
The class exposes several functions for data cleansing and enrichment that can be called directly:
@staticmethod
def impute_data(
    data,
    column_subset:list=[],
    strategy:str="mean",
    **kwargs
) -> DataFrame:
- The data set must be a valid PySpark DataFrame
- An optional list column_subset can be given to restrict which columns the imputation is applied to; otherwise it defaults to all columns
- The strategy selects the mean, median, or mode method to apply
- The **kwargs are currently unused
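As an illustration, the sketch below shows what mean imputation does on a toy DataFrame using plain PySpark; the commented-out call at the end to impute_data assumes an import path for the class, which is not given on this page.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("impute-sketch").getOrCreate()

# Toy data with a missing value in the "amount" column
df = spark.createDataFrame(
    [(1, 10.0), (2, None), (3, 30.0)],
    ["id", "amount"],
)

# Plain-PySpark equivalent of strategy="mean": fill nulls with the column mean
mean_amount = df.select(F.mean("amount")).first()[0]
imputed = df.fillna({"amount": mean_amount})
imputed.show()

# Hypothetical call to the class method described above (import path assumed):
# from <package>.transform import CleanNRich
# imputed = CleanNRich.impute_data(data=df, column_subset=["amount"], strategy="mean")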
@staticmethod
def count_column_nulls(
    data,
    column_subset:list=[],
    **kwargs,
) -> DataFrame:
- The data set must be a valid PySpark DataFrame
- An optional list column_subset can be given to restrict which columns the null cells are counted in; otherwise it defaults to all columns
- The **kwargs are currently unused
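The behaviour of a column-wise null count can be sketched in plain PySpark as below; the class's own count_column_nulls may shape the returned DataFrame differently.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("null-count-sketch").getOrCreate()

# Toy data with one null in each column
df = spark.createDataFrame(
    [("a", None), (None, 2), ("c", 3)],
    ["name", "value"],
)

# Count nulls per column by summing a 1/0 null flag for every column
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
)
null_counts.show()  # single row: name=1, value=1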
Rezaware abstract BI augmented AI/ML entity framework © 2022 by Nuwan Waidyanatha is licensed under Creative Commons Attribution 4.0 International