- Drop duplicates
- [WIP] Remove irrelevant rows
- [WIP] Remove outliers
- [WIP] Remove anomalies
- Impute
- Reformat values
- [WIP] Unit conversion
- [WIP] Type conversion
- Remove collinear columns
- Remove columns with single value
- Remove columns with empty values
- Remove irrelevant columns
- Clean column names
Numbers
- [WIP] Scale values
- WIP
Text
- [WIP] Encode values
- WIP
Images
- WIP
Audio
- WIP
Video
- WIP
Cleaning actions for e-commerce
- WIP
Remove duplicate rows based on matching N columns.
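A minimal sketch of this step, assuming the data lives in a pandas DataFrame. The `drop_duplicate_rows` helper and the column names are hypothetical and only serve the illustration.

```python
import pandas as pd


def drop_duplicate_rows(df, columns):
    """Keep the first row of each group that shares the same values in `columns`."""
    return df.drop_duplicates(subset=columns, keep="first").reset_index(drop=True)


orders = pd.DataFrame(
    {
        "email": ["a@example.com", "a@example.com", "b@example.com"],
        "sku": ["X1", "X1", "X1"],
        "quantity": [1, 3, 2],
    }
)

# The first two rows match on both columns, so only the first of them is kept.
deduped = drop_duplicate_rows(orders, ["email", "sku"])
```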
Remove irrelevant rows using a matching function: a row is removed when the function returns true for it.
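One way this could look with pandas. The `remove_rows` helper and the "test account" rule are assumptions made for the example, not part of any existing API.

```python
import pandas as pd


def remove_rows(df, is_irrelevant):
    """Drop every row for which `is_irrelevant(row)` returns True."""
    mask = df.apply(is_irrelevant, axis=1).astype(bool)
    return df[~mask].reset_index(drop=True)


events = pd.DataFrame(
    {"user": ["ann", "test_bot", "bob"], "clicks": [4, 900, 0]}
)

# Hypothetical rule: rows that come from test accounts are irrelevant.
kept = remove_rows(events, lambda row: row["user"].startswith("test_"))
```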
Remove outliers based on values in N columns.
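The text does not say which outlier rule is intended. The sketch below assumes the common interquartile range (IQR) rule; `remove_outliers` and its `k` parameter are hypothetical names.

```python
import pandas as pd


def remove_outliers(df, columns, k=1.5):
    """Drop rows with a value outside [Q1 - k*IQR, Q3 + k*IQR] in any of `columns`."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame's columns.
    within = df[columns].ge(q1 - k * iqr) & df[columns].le(q3 + k * iqr)
    return df[within.all(axis=1)].reset_index(drop=True)


prices = pd.DataFrame(
    {"price": [10, 12, 11, 13, 12, 500], "qty": [1, 2, 2, 1, 3, 2]}
)
cleaned = remove_outliers(prices, ["price"])  # the 500 row is dropped
```

Rows with a missing value in one of the checked columns are also dropped here, because a comparison against NaN is false.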
Remove anomalies based on values in N columns.
Impute missing values using different strategies:
- Numeric imputation: mean, ratio, regression
- Hot deck imputation: random, sequential, predictive mean matching (pmm)
- kNN imputation
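A sketch of two of these strategies with pandas: mean imputation and a sequential hot deck. Reading the sequential hot deck as "reuse the last observed value" is an assumption, and both helper names are hypothetical.

```python
import pandas as pd


def impute_mean(series):
    """Numeric imputation: fill gaps with the mean of the observed values."""
    return series.fillna(series.mean())


def impute_sequential_hot_deck(series):
    """Hot deck (sequential): fill each gap with the last observed value above it."""
    return series.ffill()


ages = pd.Series([20.0, None, 40.0, None])
mean_filled = impute_mean(ages)                # gaps become 30.0
seq_filled = impute_sequential_hot_deck(ages)  # gaps become 20.0 and 40.0
```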
Reformat values into a consistent representation:
- Standardize capitalization
- Date formats
- Number formats, e.g. 23, twenty, eihgteen (misspelled)
- Currency
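For the number-format case, a sketch using only the standard library. It assumes a small vocabulary of number words and uses fuzzy matching to tolerate misspellings; `parse_number`, `NUMBER_WORDS`, and the `cutoff` value are hypothetical choices.

```python
import difflib

# Assumed vocabulary; a real implementation would cover more words.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19, "twenty": 20,
}


def parse_number(raw, cutoff=0.8):
    """Return an int for digits or a (possibly misspelled) number word, else None."""
    text = raw.strip().lower()
    try:
        return int(text)
    except ValueError:
        pass
    # Pick the closest known word, if any is similar enough.
    match = difflib.get_close_matches(text, list(NUMBER_WORDS), n=1, cutoff=cutoff)
    return NUMBER_WORDS[match[0]] if match else None


parsed = [parse_number(v) for v in ["23", "twenty", "eihgteen"]]
```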
Convert values between units:
- Dates: change UTC to PST
- Currency: change USD to Yen
- Weight
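The conversions above can be sketched with the standard library. Assumptions: PST is treated as a fixed UTC-8 offset (no daylight saving), the exchange rate is supplied by the caller rather than looked up, and weight means pounds to kilograms. All function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), name="PST")  # fixed offset, ignores daylight saving
KG_PER_POUND = 0.45359237  # exact by definition of the international pound


def utc_to_pst(moment):
    """Convert a timezone-aware UTC datetime to Pacific Standard Time."""
    return moment.astimezone(PST)


def usd_to_jpy(amount_usd, jpy_per_usd):
    """Convert dollars to yen at a caller-supplied exchange rate."""
    return amount_usd * jpy_per_usd


def pounds_to_kg(pounds):
    """Convert a weight in pounds to kilograms."""
    return pounds * KG_PER_POUND


noon_pst = utc_to_pst(datetime(2024, 1, 15, 20, 0, tzinfo=timezone.utc))
```

A named zone such as `America/Los_Angeles` would be the better choice where daylight saving matters.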
Convert values between types:
- dtype conversion: int to float
- Primitive type conversion: number to category
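Both conversions can be sketched with pandas `astype`. The DataFrame and its column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"quantity": [1, 2, 3], "store_id": [101, 102, 101]})

# dtype conversion: int to float
df["quantity"] = df["quantity"].astype("float64")

# Primitive type conversion: a numeric ID is really a label, so treat it as a category
df["store_id"] = df["store_id"].astype("category")
```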
Highly correlated columns carry redundant information for a model.
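A sketch of one possible approach, assuming Pearson correlation on numeric columns and a threshold of 0.95. Both the threshold and the `drop_collinear_columns` helper are assumptions.

```python
import numpy as np
import pandas as pd


def drop_collinear_columns(df, threshold=0.95):
    """Drop the later column of every pair whose |Pearson r| exceeds `threshold`."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


people = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0],
        "height_in": [59.06, 62.99, 66.93, 70.87],  # same quantity, other unit
        "weight_kg": [70.0, 55.0, 80.0, 60.0],
    }
)
reduced = drop_collinear_columns(people)  # height_in is dropped
```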
Columns with a single value are useless for a model.
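A short pandas sketch. The `drop_single_value_columns` helper is hypothetical, and counting NaN as a value of its own is a design choice made here.

```python
import pandas as pd


def drop_single_value_columns(df):
    """Drop columns in which every row holds the same value (NaN counts as a value)."""
    varying = df.nunique(dropna=False) > 1
    return df.loc[:, varying]


sales = pd.DataFrame({"country": ["US", "US", "US"], "total": [1, 2, 3]})
trimmed = drop_single_value_columns(sales)  # only "total" remains
```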
Columns with a lot of missing values may not contain enough relevant information for a model to learn from.
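The text does not define "a lot". The sketch below assumes a cutoff of 50% missing values; `drop_sparse_columns` and `max_missing_ratio` are hypothetical names.

```python
import pandas as pd


def drop_sparse_columns(df, max_missing_ratio=0.5):
    """Drop columns whose share of missing values is above `max_missing_ratio`."""
    missing_ratio = df.isna().mean()
    return df.loc[:, missing_ratio <= max_missing_ratio]


survey = pd.DataFrame(
    {
        "mostly_empty": [1.0, None, None, None],  # 75% missing
        "mostly_full": [1.0, 2.0, None, 4.0],     # 25% missing
    }
)
dense = drop_sparse_columns(survey)
```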
Remove columns with:
- PII
- Boilerplate text
- Tracking codes
- and more
Rename columns by:
- Removing excessive blank space around and between text
- Converting special characters to underscores
- Lowercasing the name
- and more
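The renaming rules listed above can be sketched with the standard library. `clean_column_name` is a hypothetical helper, and stripping leading and trailing underscores is an extra assumption.

```python
import re


def clean_column_name(name):
    """Normalize a column name to lowercase snake_case."""
    name = name.strip().lower()
    # Collapse every run of whitespace or special characters into one underscore.
    # Note: this also replaces non-ASCII letters.
    name = re.sub(r"[^0-9a-z]+", "_", name)
    return name.strip("_")


cleaned_names = [clean_column_name(n) for n in [" Order  ID# ", "Total ($)", "First-Name"]]
```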
Scale numeric values:
- Standardize
- Normalize
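A sketch of both scaling options with pandas. It assumes "standardize" means z-scores and "normalize" means min-max scaling to [0, 1]; the function names are hypothetical.

```python
import pandas as pd


def standardize(series):
    """Rescale to zero mean and unit variance (z-scores)."""
    return (series - series.mean()) / series.std(ddof=0)


def normalize(series):
    """Rescale linearly to the range [0, 1] (min-max scaling)."""
    return (series - series.min()) / (series.max() - series.min())


values = pd.Series([10.0, 20.0, 30.0])
z_scores = standardize(values)
unit_range = normalize(values)  # 0.0, 0.5, 1.0
```

Both helpers divide by zero on a constant column, so single-value columns should be removed first.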
Encode text values:
- One-hot encoding
- Ordinal encoding
- Label encoding
- Label hashing
- Embeddings
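A sketch of the first four encodings using pandas and the standard library. The sample data, the category order, the bucket count, and the `hash_label` helper are all assumptions; embeddings need a trained model and are left out.

```python
import hashlib

import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"], name="size")

# One-hot encoding: one indicator column per distinct value.
one_hot = pd.get_dummies(sizes, prefix="size")

# Ordinal encoding: integers that follow a meaningful order supplied by the caller.
size_order = {"small": 0, "medium": 1, "large": 2}
ordinal = sizes.map(size_order)

# Label encoding: an arbitrary integer per distinct value, in order of first appearance.
labels, uniques = pd.factorize(sizes)


def hash_label(value, n_buckets=8):
    """Label hashing: map a string to one of `n_buckets` stable bucket indices."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


hashed = sizes.map(hash_label)
```

Hashing needs no fitted vocabulary, at the cost of possible collisions between distinct values.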