ProxiFlow is a data preprocessig tool for any kind of data processing (eg. machine learning) that performs data cleaning, normalization, and feature engineering.
The biggest advantage of this library (which is basically a wrapper over polars data frame) is that it is configurable via YAML configuration file which makes it suitable for MLOps pipelines or for building API requests over it.
Read the full documentation here.
To use ProxiFlow, install it via pip:
pip install proxiflow
You can then call it from the command line:
proxiflow --config-file myconfig.yaml --input-file mydata.csv --output-file cleaned_data.csv
Here's an example of a YAML configuration file:
input_format: csv
output_format: csv
data_cleaning: #mandatory
# NOTE: Not handling missing values can cause errors during data normalization
handle_missing_values:
drop: false
mean: true # Only Int and Float columns are handled
# mode: true # Turned off for now.
knn: true
handle_outliers: true # Only Float columns are handled
remove_duplicates: true
data_normalization: # mandatory
min_max: #mandatory but values are not mandatory. It can be left empty
# Specify columns:
- Age # not mandatory
z_score:
- Price
log:
- Floors
feature_engineering:
one_hot_encoding: # mandatory
- Bedrooms # not mandatory
feature_scaling: # mandatory
degree: 2 # not mandatory. It specifies the polynominal degree
columns: # not mandatory
- Floors # not mandatory
The above configuration specifies that duplicate rows should be removed and missing values should be dropped.
ProxiFlow can also be used as a Python library. Here's an example:
import polars as pl
from proxiflow.config import Config
from proxiflow.core import Cleaner
# Load the data
df = pl.read_csv("mydata.csv")
# Load the configuration
config = Config("myconfig.yaml")
# Clean the data
cleaner = Cleaner(config)
cleaned_data = cleaner.clean_data(data)
# Perform data normalization
normalizer = Normalizer(config)
normalized_data = normalizer.normalize(cleaned_data)
# Perform feature engineering
engineer = Engineer(config)
engineered_data = engineer.execute(normalized_data)
# Write the output data
engineered_data.write_csv("cleaned_data.csv")
- Data cleaning
- Missing values handling
- Mean
- Drop
- KNN Imputer
- Median
- Missing values handling
- Data normalization
- Min Max normalization
- Z-Score normalization
- Logarithmic normalization
- Feature engineering
- One Hot Encoding
- Feature Scaling
- Recursive Feature Elimination
- SelectKBest
- LASSO regularization
- Text Preprocessing
- Tokenization
- Stemming
- Stopword removal
- Text Vectorization
- Bag of Words
- TF-IDF
- Word embeddings
- Word2Vec
- GloVe
- BERT
- Categorical Encoding
- Target encoding
- Count encoding
- Binary encoding
- Dimensionality reduction
- Principal Component Analysis (PCA)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Uniform Manifold Approximation and Projection (UMAP)