How to regulate Negative values in the Generated Data #231

Open
Bhargav-Ravinuthala opened this issue Nov 6, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@Bhargav-Ravinuthala

Bhargav-Ravinuthala commented Nov 6, 2024

Related Issues: #189

This is a follow-up to the closed issue about handling negative values in the generated dataset.

Problem:
The generated data contains negative values in columns that are strictly positive in the user's input data.

How to Recreate

Sample Data
Book1.csv

Code (which I recreated from the original class):

import pandas as pd
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.data_processors.filter.positive_negative import PositiveNegativeFilter
from sdgx.data_models.metadata import Metadata

# Create data connector for csv file
data_connector = CsvConnector(path=r"C:\Users\Bhargav\Downloads\Book1.csv")

# Initialize synthesizer
synthesizer = Synthesizer(
    model=CTGANSynthesizerModel(epochs=300),  # For quick demo
    data_connector=data_connector
)

# Read the original data to analyze columns
original_data = pd.read_csv(r"C:\Users\Bhargav\Downloads\Book1.csv")

# Create metadata from original data
metadata = Metadata.from_dataframe(original_data)

# Initialize and configure the PositiveNegativeFilter
pos_neg_filter = PositiveNegativeFilter()
pos_neg_filter.fit(metadata)

# Add the filter to the synthesizer's pipeline
synthesizer.add_processor(pos_neg_filter)

# Fit the model
synthesizer.fit()

# Sample synthetic data
sampled_data = synthesizer.sample(1000)

# Save sampled data to CSV
output_path = r"C:\Users\Bhargav\Downloads\synthetic_data.csv"
sampled_data.to_csv(output_path, index=False)
print(f"Synthetic data saved to {output_path}")

# Print information about preserved value ranges
for column in sampled_data.columns:
    if pd.api.types.is_numeric_dtype(original_data[column]):
        original_min = original_data[column].min()
        original_max = original_data[column].max()
        synthetic_min = sampled_data[column].min()
        synthetic_max = sampled_data[column].max()
        
        print(f"\nColumn: {column}")
        print(f"Original range: [{original_min}, {original_max}]")
        print(f"Synthetic range: [{synthetic_min}, {synthetic_max}]")

Also tested with the original code:

import pandas as pd
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.utils import download_demo_data

# This will download demo data to ./dataset
dataset_csv = download_demo_data()

# Create data connector for csv file
data_connector = CsvConnector(path=r"C:\Users\Bhargav\Downloads\Book1.csv")

# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    model=CTGANSynthesizerModel(epochs=1),  # For quick demo
    data_connector=data_connector,
)

# Fit the model
synthesizer.fit()

# Sample synthetic data
sampled_data = synthesizer.sample(1000)

# Save sampled data to CSV
output_path = r"C:\Users\Bhargav\Downloads\synthetic_data.csv"
sampled_data.to_csv(output_path, index=False)

print(f"Synthetic data saved to {output_path}")

Expected Behaviour

The polarity (sign) of the generated values should follow the original data.
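One way to check this expectation is to count, per numeric column, the synthetic values whose sign never occurs in the original column. The snippet below is a minimal sketch using pandas only; the helper name `find_polarity_violations` is hypothetical and is not part of sdgx.

```python
import pandas as pd


def find_polarity_violations(original: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """Return {column: count} of synthetic values whose sign is absent from the original column.

    Hypothetical helper for illustration; it is not part of sdgx.
    """
    violations = {}
    for column in original.select_dtypes(include="number").columns:
        if column not in synthetic.columns:
            continue
        orig = original[column].dropna()
        # Coerce so that unexpected non-numeric synthetic values are ignored, not fatal.
        synth = pd.to_numeric(synthetic[column], errors="coerce").dropna()
        if orig.empty:
            continue
        if (orig >= 0).all():
            # Original column is non-negative: any negative synthetic value is a violation.
            bad = int((synth < 0).sum())
        elif (orig <= 0).all():
            # Original column is non-positive: any positive synthetic value is a violation.
            bad = int((synth > 0).sum())
        else:
            # Mixed-sign column: no polarity constraint to enforce.
            bad = 0
        if bad:
            violations[column] = bad
    return violations
```

With the variable names from the script above, `find_polarity_violations(original_data, sampled_data)` would return an empty dict when every column keeps its polarity.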

Related PRs
#217

@Wh1isper
Collaborator

Wh1isper commented Nov 6, 2024

Thanks for reporting this, @jalr4ever would you like to take a look?

@jalr4ever
Collaborator

@Wh1isper Yeah, I have also noticed this issue recently. I'll take this one.

@Wh1isper
Collaborator

Wh1isper commented Nov 7, 2024

@jalr4ever Can you confirm this issue has been fixed in #232? I think we should make a release for this.

@jalr4ever
Collaborator

@Wh1isper Yeah! The next release is just around the corner! 🎉

@jalr4ever
Collaborator

@Bhargav-Ravinuthala Hi, we've just dropped a new release; check out version 0.2.2! You can use your original code, and SDG will internally preserve the positive and negative properties of the values. No need to manually add filters!
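To confirm that the environment actually picked up the new release before re-running the script, the installed version can be read with the standard library. This is a minimal sketch; it assumes the distribution is installed under the name `sdgx`, and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn '0.2.2' into (0, 2, 2); stops at the first non-numeric component."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_at_least(package, minimum):
    """Return True if `package` is installed at version `minimum` or newer."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        # Package is not installed in this environment at all.
        return False
    return parse_release(installed) >= parse_release(minimum)
```

Calling `installed_at_least("sdgx", "0.2.2")` returns `False` if the package is missing or older than 0.2.2, in which case upgrading is the first thing to try.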

@Bhargav-Ravinuthala
Author

Is there any way I can collaborate? I have been brainstorming over your entire codebase. We are trying to use your code for one of our projects, and we need to fix this...

@jalr4ever
Collaborator

@Bhargav-Ravinuthala Hi, did you try the new release? Didn't it resolve the bug? I tested your data locally with 0.2.2, and it seems okay. 🧐


3 participants