About csvLoader.loadDataSource #342

Open
pablo3p opened this issue May 16, 2023 · 4 comments

Comments

@pablo3p

pablo3p commented May 16, 2023

Hi there,

From this tutorial on regression:
https://github.com/oracle/tribuo/blob/main/tutorials/regression-tribuo-v4.ipynb

var wineSource = csvLoader.loadDataSource(Paths.get("winequality-red.csv"),"quality");

This wineSource is a data structure, but I don't see much documentation for it.
I am assuming that wineSource is a tabular data structure, and I'm hoping it is similar to a Python pandas DataFrame.

If that is the case, is there a print method, so one can print the data to the terminal and inspect it?

There is not much out there on this.

Kind Regards,

Pablo

@Craigacp
Member

CSVLoader returns a CSVDataSource. The DataSource interface doesn't have much in the way of accessor methods, so you should construct a MutableDataset from that data source, which will populate the feature and output information objects that you can query. If you want to print out the examples, you can iterate the data source and print each Example object.
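
As a rough illustration of the steps above, here is a minimal sketch that loads the CSV, builds a MutableDataset, prints the feature and output summaries, and then iterates the data source to print the first few examples. It assumes the Tribuo regression and CSV modules are on the classpath and that the file uses `;` as its separator, as in the tutorial. The class name `InspectWineData` and the method `inspect` are made up for this example and are not part of Tribuo.

```java
import java.io.IOException;
import java.nio.file.Path;

import org.tribuo.DataSource;
import org.tribuo.Example;
import org.tribuo.MutableDataset;
import org.tribuo.VariableInfo;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.regression.RegressionFactory;
import org.tribuo.regression.Regressor;

// Hypothetical helper class, named for illustration only.
final class InspectWineData {

    /** Loads the CSV, prints a summary and the first few examples, and returns the example count. */
    static int inspect(Path csvPath, char separator, String responseName, int rowsToPrint) throws IOException {
        var csvLoader = new CSVLoader<>(separator, new RegressionFactory());
        DataSource<Regressor> source = csvLoader.loadDataSource(csvPath, responseName);

        // Building the dataset populates the feature and output information objects.
        MutableDataset<Regressor> dataset = new MutableDataset<>(source);
        System.out.println("Examples: " + dataset.size());
        System.out.println("Features: " + dataset.getFeatureMap().size());
        for (VariableInfo info : dataset.getFeatureMap()) {
            System.out.println("  " + info.getName() + " observed " + info.getCount() + " times");
        }
        System.out.println("Output: " + dataset.getOutputInfo().toReadableString());

        // A DataSource is an Iterable of Example objects, so each row can be printed directly.
        int printed = 0;
        for (Example<Regressor> example : source) {
            if (printed++ >= rowsToPrint) {
                break;
            }
            System.out.println(example);
        }
        return dataset.size();
    }
}
```

For the tutorial's file this would be called as `InspectWineData.inspect(Paths.get("winequality-red.csv"), ';', "quality", 5)`. The printed examples are Tribuo's own row-wise text form, not a table.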

Tribuo has a row-wise view of data, and doesn't provide a data frame style interface. If you want something more like a dataframe in Java then I think JTablesaw is supposed to be good for that, but I've not used it much.

@pablo3p
Author

pablo3p commented May 17, 2023

Hi there, thanks for your quick reply.
So when passing in data, I want to make sure that it is correct, but it looks like there is no way to check that once it has been loaded into a CSVDataSource.
I would prefer to load the data from CSV into something like JTablesaw first, and then pass it from JTablesaw into a Tribuo DataSource.
Is this possible?
Hope you can let me know.

P.

@Craigacp
Member

You can inspect the examples after they have been loaded to make sure the pipeline is valid. I recommend looking at CSVDataSource rather than using CSVLoader as it's more flexible. There's a columnar data tutorial which explains the mechanisms - https://tribuo.org/learn/4.3/tutorials/columnar-tribuo-v4.html.

We don't currently support loading from JTablesaw into Tribuo because we can't capture the necessary provenance & reproducibility information out of a tablesaw dataset. It would be pretty useful to have, but due to the provenance issues we've not got around to it.
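
In the meantime, one possible workaround is to sanity-check the CSV file itself before handing the same file to Tribuo, so the data source is still built from the original file. The following is a minimal sketch using only the JDK. It assumes a header row, a single-character separator, no quoted fields that contain the separator, and purely numeric columns, as in the wine quality file. The class `CsvSanityCheck` and its `check` method are made up for this example and are not part of Tribuo or Tablesaw.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical pre-load check, named for illustration only.
final class CsvSanityCheck {

    /**
     * Returns a list of human-readable problems found in the file.
     * An empty list means every non-blank row had the header's column count
     * and every cell parsed as a number.
     */
    static List<String> check(Path csvPath, char separator) throws IOException {
        List<String> problems = new ArrayList<>();
        String splitter = Pattern.quote(String.valueOf(separator));
        try (BufferedReader reader = Files.newBufferedReader(csvPath, StandardCharsets.UTF_8)) {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                problems.add("file is empty");
                return problems;
            }
            // A limit of -1 keeps trailing empty cells, so column counts are accurate.
            String[] header = headerLine.split(splitter, -1);
            String line;
            int lineNumber = 1;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (line.isBlank()) {
                    continue; // blank lines are skipped rather than reported
                }
                String[] cells = line.split(splitter, -1);
                if (cells.length != header.length) {
                    problems.add("line " + lineNumber + ": expected " + header.length
                            + " columns, found " + cells.length);
                    continue;
                }
                for (int i = 0; i < cells.length; i++) {
                    try {
                        Double.parseDouble(cells[i].trim());
                    } catch (NumberFormatException e) {
                        problems.add("line " + lineNumber + ": column " + header[i]
                                + " is not numeric: '" + cells[i] + "'");
                    }
                }
            }
        }
        return problems;
    }
}
```

If `CsvSanityCheck.check(Paths.get("winequality-red.csv"), ';')` returns an empty list, the file can be passed to CSVLoader or CSVDataSource as before. Otherwise the messages point at the lines to fix by hand.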

@pablo3p
Author

pablo3p commented May 17, 2023

Hi, thanks again.
The link you provided covers a lot of useful concepts.

Yes, I think it would be really good to have something like JTablesaw load the CSV first and then pass it on to something like the CSVDataSource. That puts the responsibility for the integrity of the data on the data scientist, who is the subject matter expert and should be able to look into the DataFrame (in this case JTablesaw) and decide whether the data is in proper shape to pass into the CSVDataSource. Allowing for human intervention, especially at the data source stage of the pipeline, gives the data scientist more control over data quality. I think this should be available as an option in Tribuo. I just wanted to elaborate on my thinking on this.
Thanks again for all your great help, really appreciate it.
Best Regards,

P
