Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression #237

Open
wants to merge 19 commits into
base: master
Choose a base branch
from
Open

Regression #237

wants to merge 19 commits into from

Conversation

shrit
Copy link
Member

@shrit shrit commented Sep 29, 2024

Hi,

Ported a couple of examples of regression examples mostly linear

Copy link

Binder 👈 Launch a binder notebook on branch shrit/examples/logistic

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed one example so far. I'm not entirely sure that all of these notebooks will adapt well to C++-only programs. Many of them focus on visualization and exploration, which isn't really a thing we can do just in a standalone program. So I think either they need to be really carefully gone over and all irrelevant parts removed, or maybe we leave them as notebook-only examples. Let me know what you think.

//!mkdir -p data && cat avocado.csv | sed 1d > avocado_trim.csv"
//"Drop columns 1 and 2 (\"Unamed: 0\", \"Date\") as these are not required and their presence cause issues while loading the data."
//!rm avocado_trim.csv"
//"!mv avocado_trim2.csv avocado_trim.csv"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this isn't a notebook, we don't have the ability to call out to the shell like this. So maybe we either indicate in the comments that we expect the user to run these commands, or we adapt download_data_set.py to get the data into the right form first (or, the version we link to on datasets.mlpack.org is pre-adapted).

@shrit
Copy link
Member Author

shrit commented Oct 1, 2024

I agree, I am trying to have at least one usage from the examples at least per method, because we have so many methods and really few examples.

I know if someone looks at the docs they would find some examples from them, but I agree, no need anymore to import the old ones, we can be more creative and have new ones.

Let us merge these three once I have solved the comments, I have tested them locally and I know they work, and they are old, as I did them one month ago or even more

@rcurtin
Copy link
Member

rcurtin commented Oct 2, 2024

Let us merge these three once I have solved the comments, I have tested them locally and I know they work, and they are old, as I did them one month ago or even more

Sounds good to me---my primary concern is just making sure that all the comments in the file next to the code (which is what users will actually be looking at) make sense and don't reference things specific to the notebook but not in this version. 👍

@rcurtin
Copy link
Member

rcurtin commented Oct 2, 2024

Also, looks like the steps in .ci/ just need slight modification to also install Pandas.

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. In the longer term I am a little worried that we will have too many similar examples that will cause users to get lost, but I think we can sort that out later. I left a comment to that effect in the review, but up to you how you want to handle it (and if the answer is let's merge all three for now and sort it out later, that's fine with me).

default: all

$(TARGET): $(OBJS)
$(CXX) $(OBJS) -o $(TARGET) $(LDFLAGS) $(LIBS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to get CXXFLAGS in here too?

* which can later be unmapped to strings.
*/
/** PLEASE, delete the header of the dataset once you have downloaded the
* datset to your data/ directory. **/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely we should support headers soon, but for now should we modify the dataset downloader to strip the header?

# default rule
default: all

$(TARGET): $(OBJS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here with CXXFLAGS

@@ -0,0 +1,138 @@
/**
* Predicting California House Prices with Linear Regression
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-wise, this is basically the same example as the avocado one, but with a different dataset. Do you think it is worth including? I think quality over quantity is better, so if this is not demonstrating some new or different functionality vs. the avocado example, I would vote to only keep one of the two.

# for the BLAS and LAPACK libraries that you are using.

TARGET := salary_prediction
SRC := salary-prediction.cpp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably rename this to salary_prediction.cpp for consistency.

ungzip("avocado.csv.gz", "avocado.csv")
avocado_data = pull_csv("avocado.csv")
avocado_data = avocado_data.iloc[:, 2:]
avocado_data.to_csv("avocado.csv", index=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or are you already stripping the header here? If so, it would probably be better to omit the comments from the files and work on adding header support to data::Load() quickly, then we don't have to come back to the comments here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need the header support for data::Load

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree, but what I mean in this comment is, it looks like you are stripping the header from the avocado dataset and also other datasets. So I think we can remove the warning comments in the code that say 'make sure the header is removed', because it is already removed here---and also because we will be adding header removal support very soon to data::Load().

scripts/download_data_set.py Show resolved Hide resolved
* Once the model is trained, we will be able to do some sample predictions.
*/
#include <mlpack.hpp>
#include <cmath>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this include is unnecessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants