ProxiFlow

ProxiFlow is a data preprocessing tool for any kind of data processing task (e.g. machine learning) that performs data cleaning, normalization, and feature engineering.

The biggest advantage of this library (essentially a wrapper around Polars DataFrames) is that it is configurable via a YAML configuration file, which makes it well suited for MLOps pipelines or for building API requests on top of it.

Documentation

Read the full documentation here.

Usage

To use ProxiFlow, install it via pip:

pip install proxiflow

You can then call it from the command line:

proxiflow --config-file myconfig.yaml --input-file mydata.csv --output-file cleaned_data.csv

Here's an example of a YAML configuration file:

input_format: csv
output_format: csv

data_cleaning: #mandatory
  # NOTE: Not handling missing values can cause errors during data normalization
  handle_missing_values:
    drop: false
    mean: true # Only Int and Float columns are handled 
    # mode: true # Turned off for now. 
    knn: true

  handle_outliers: true # Only Float columns are handled
  remove_duplicates: true

data_normalization: # mandatory
  min_max: # mandatory, but the column list below may be left empty
    # Specify columns:
    - Age # not mandatory
  z_score: 
    - Price 
  log:
    - Floors

feature_engineering:
  one_hot_encoding: # mandatory
    - Bedrooms      # not mandatory

  feature_scaling:  # mandatory
    degree: 2       # not mandatory. It specifies the polynomial degree
    columns:        # not mandatory
      - Floors      # not mandatory

With the configuration above, duplicate rows are removed, missing values are imputed (column mean for Int and Float columns, plus KNN imputation), outliers in Float columns are handled, the listed columns are normalized with min-max, z-score, and logarithmic scaling, and feature engineering applies one-hot encoding to Bedrooms and degree-2 polynomial scaling to Floors.
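
If you want to inspect or validate the configuration programmatically before running ProxiFlow, the YAML maps directly to nested dictionaries. The sketch below is illustrative only and uses PyYAML rather than any ProxiFlow API; the keys shown are the ones from the example configuration above.

import yaml

# Illustrative only: load the example configuration with PyYAML and check a few keys.
with open("myconfig.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["data_cleaning"]["remove_duplicates"])      # True
print(cfg["data_cleaning"]["handle_missing_values"])  # {'drop': False, 'mean': True, 'knn': True}
print(cfg["data_normalization"]["min_max"])           # ['Age']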

API

ProxiFlow can also be used as a Python library. Here's an example:

import polars as pl
from proxiflow.config import Config
from proxiflow.core import Cleaner, Normalizer, Engineer

# Load the data
df = pl.read_csv("mydata.csv")

# Load the configuration
config = Config("myconfig.yaml")

# Clean the data
cleaner = Cleaner(config)
cleaned_data = cleaner.clean_data(df)

# Perform data normalization
normalizer = Normalizer(config)
normalized_data = normalizer.normalize(cleaned_data)

# Perform feature engineering
engineer = Engineer(config)
engineered_data = engineer.execute(normalized_data)

# Write the output data
engineered_data.write_csv("cleaned_data.csv")
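
The same steps can be chained into a single helper for use in an MLOps pipeline. The function below is a minimal sketch that reuses only the calls shown above; the function name run_pipeline and the file paths are illustrative, not part of the ProxiFlow API.

import polars as pl
from proxiflow.config import Config
from proxiflow.core import Cleaner, Normalizer, Engineer

def run_pipeline(config_path: str, input_path: str, output_path: str) -> pl.DataFrame:
    """Illustrative helper: clean, normalize, and feature-engineer a CSV in one call."""
    config = Config(config_path)
    df = pl.read_csv(input_path)

    # Apply the three stages in the order used in the example above
    df = Cleaner(config).clean_data(df)
    df = Normalizer(config).normalize(df)
    df = Engineer(config).execute(df)

    df.write_csv(output_path)
    return df

# Example usage
run_pipeline("myconfig.yaml", "mydata.csv", "cleaned_data.csv")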

Log

  • Data cleaning
    • Missing values handling
      • Mean
      • Drop
      • KNN Imputer
      • Median
  • Data normalization
    • Min Max normalization
    • Z-Score normalization
    • Logarithmic normalization
  • Feature engineering
    • One Hot Encoding
    • Feature Scaling
    • Recursive Feature Elimination
    • SelectKBest
    • LASSO regularization
  • Text Preprocessing
    • Tokenization
    • Stemming
    • Stopword removal
    • Text Vectorization
      • Bag of Words
      • TF-IDF
    • Word embeddings
      • Word2Vec
      • GloVe
      • BERT
  • Categorical Encoding
    • Target encoding
    • Count encoding
    • Binary encoding
  • Dimensionality reduction
    • Principal Component Analysis (PCA)
    • t-Distributed Stochastic Neighbor Embedding (t-SNE)
    • Uniform Manifold Approximation and Projection (UMAP)
