
openml x probabl blog #34

Open: wants to merge 4 commits into base: main

Conversation

SubhadityaMukherjee
Contributor

OpenML x Probabl hackathon blog post. This was written a while back but was never published by the Probabl team. Thought I'd put it here for now. (The Probabl team was sent an email about it as well.)

@SubhadityaMukherjee
Contributor Author

Added another blog post, on experimenting with LLM temperatures. It was written when I was working on the AI search module.

Member

@PGijsbers PGijsbers left a comment

Would you be fine converting the images to WebP? The photos are very big (5-10 MB each) and might not load for people with poor reception (in other countries, on trains, etc.). And when we merge, we want to make sure to squash merge so that the 20 MB+ of photos does not end up in the git history.

Contributor Author

@SubhadityaMukherjee SubhadityaMukherjee left a comment

Ah, makes sense. I updated them to webp and added the instruction in the README for future reference.
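
To catch oversized photos before they reach a future PR, a small check along these lines could help. This is only a sketch: the function name `find_large_images`, the list of suffixes, and the 1 MB default threshold are assumptions for illustration, not something taken from the README or the repository.

```python
from pathlib import Path

# Assumption: these are the formats worth converting to WebP.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def find_large_images(root, limit_bytes=1_000_000):
    """Return (path, size_in_bytes) pairs for images under root that exceed
    limit_bytes, sorted from largest to smallest."""
    hits = []
    for path in Path(root).rglob("*"):
        # Compare suffixes case-insensitively so "IMG_01.JPG" is also found.
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            size = path.stat().st_size
            if size > limit_bytes:
                hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_large_images("."):
        print(f"{size / 1_000_000:.1f} MB  {path}")
```

Running it from the repository root lists the candidates for conversion, which can then be converted with whatever tool the README recommends.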

Member

@PGijsbers PGijsbers left a comment

Minor suggestions for typos, clarifications, etc. I deliberately didn't review things like style or content. I guess we should discuss at some point whether we want the writing style of the blogs to be consistent across the group or left up to the individual authors. For now I assume the latter :)

"source": [
"---\n",
"title: OpenML x Probabl Hackathon\n",
"topic: hacakathon\n",
Member

Suggested change
"topic: hacakathon\n",
"topic: Hackathon\n",

Member

The header is also showing up in the blog article for some reason (https://openml-labs.github.io/website/notebooks/OpenMLxProbabl-hackathon.html). Probably needs an extra - or newline at the end of the cell.
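
One way to check for this before publishing is sketched below. It assumes the site generator expects the first cell to hold front matter fenced by an opening and a closing `---` line, which is a guess based on the symptom rather than something confirmed in this thread. The helper names `front_matter_is_closed` and `first_cell_source` are hypothetical.

```python
import json


def first_cell_source(notebook_json):
    """Return the source lines of the first cell of a notebook JSON string."""
    cells = json.loads(notebook_json).get("cells", [])
    if not cells:
        return []
    source = cells[0].get("source", [])
    # The notebook format allows "source" to be one string or a list of lines.
    if isinstance(source, str):
        return source.splitlines(keepends=True)
    return source


def front_matter_is_closed(source):
    """Return True if the cell opens with '---' and has a later '---' line.

    Assumption: the renderer treats exactly three dashes on their own line
    as the front matter delimiter.
    """
    lines = [line.strip() for line in source]
    if not lines or lines[0] != "---":
        return False
    return any(line == "---" for line in lines[1:])
```

Loading the notebook file as text and calling `front_matter_is_closed(first_cell_source(text))` returns `False` when the closing delimiter is missing, which would match the header leaking into the rendered article.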

"---\n",
"title: OpenML x Probabl Hackathon\n",
"topic: hacakathon\n",
"author: Subhaditya Mukherjee , Emily Chen\n",
Member

Suggested change
"author: Subhaditya Mukherjee , Emily Chen\n",
"author: Subhaditya Mukherjee, Emily Chen\n",

"\n",
"[scikit-learn](http://scikit-learn.org) is a free and open-source machine learning library for the Python programming language, while [OpenML](http://openml.org) is an open platform for sharing datasets, algorithms, and experiments. While our teams have been working together for many years, we do not always have the time to meet in person. When we do get the chance though, many interesting discussions stem from those conversations.\n",
"\n",
"We recently had a developers hackathon at the Paris office of [Probabl](https://probabl.ai/about), the official operating brand of scikit-learn, focused on maintaining and expanding open-source ML in Europe and beyond). We met to discuss not only the state of AI and how both our organizations fit in, but also to brainstorm solutions to challenges faced by our developers and communities. \n",
Member

Suggested change
"We recently had a developers hackathon at the Paris office of [Probabl](https://probabl.ai/about), the official operating brand of scikit-learn, focused on maintaining and expanding open-source ML in Europe and beyond). We met to discuss not only the state of AI and how both our organizations fit in, but also to brainstorm solutions to challenges faced by our developers and communities. \n",
"Last June we had a developers hackathon at the Paris office of [Probabl](https://probabl.ai/about), the official operating brand of scikit-learn, focused on maintaining and expanding open-source ML in Europe and beyond. We met to discuss not only the state of AI and how both our organizations fit in, but also to brainstorm solutions to challenges faced by our developers and communities. \n",

Comment on lines +43 to +49
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information about this hackathon or just to chat with us, feel free to reach out to us on [OpenML Email](mailto:[email protected]), [Probabl](https://probabl.ai/about). To contact the authors, send them an email here - [Subhaditya Mukherjee](mailto:[email protected]), [Emily Chen](mailto:[email protected])."
]
},
Member

Would put this cell at the end of the blogpost.

"\n",
"- **Mentoring** : Incorporating one-to-one mentoring to help new contributors set up their development environments would help them feel more connected to the project. It is understandable that this is time consuming, so perhaps some way of deciding who gets mentored can be set up in time.\n",
"\n",
"- **Contributors guide** : Simplifying the contributor's guide and adding video tutorials would make it a lot more beginner-friendly, especially to those who have never filed a PR before.\n",
Member

Suggested change
"- **Contributors guide** : Simplifying the contributor's guide and adding video tutorials would make it a lot more beginner-friendly, especially to those who have never filed a PR before.\n",
"- **Contributors guide** : Simplifying the contributor's guide and adding video tutorials would make it a lot more beginner-friendly, especially to those who have never filed a [pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests#about-pull-requests) (PR) before.\n",

"\n",
"- **Reviews**: A major challenge is that of approving new reviews in a large organization without passing through too many hoops. How to manage this is still an open question.\n",
"\n",
"- **Sponsorship**: It would be nice to explore sponsorship models like the INRIA foundation, where companies contribute to meetings and have a say in setting priorities.\n",
Member

Probably want a sentence explaining that model or at least provide a link.

Comment on lines +128 to +129
"- **Better tools**: There are many open source platforms/tools that offer comparable convenience to tools that are currently used. Some examples of these are CodeBerg and Forgejo. Eventually migrating to using these tools might also help manage projects like scikit-learn and OpenML.\n",
"\n",
Member

Suggested change
"- **Better tools**: There are many open source platforms/tools that offer comparable convenience to tools that are currently used. Some examples of these are CodeBerg and Forgejo. Eventually migrating to using these tools might also help manage projects like scikit-learn and OpenML.\n",
"\n",
"- **Open source tools**: There are many open source platforms/tools that offer comparable convenience to tools that are currently used. Some examples of these are CodeBerg and Forgejo as an alternative to GitHub and its issue trackers. Eventually migrating to using these tools might also help manage projects like scikit-learn and OpenML.\n",
"\n",

Member

I can't really say they are better (I don't have enough experience with CodeBerg). The reason I raised that in the discussion was that I wanted to talk about their perspective on whether we as open source should adopt those tools, even if they may (in the short term) be worse for the projects (as GitHub has more users, visibility, ...).

"\n",
"- **Community projects**: There are so many successful examples of open source projects, and a lot can be learned from their efforts. Projects like Scientific Python for example, have very well set up CI documentation and governance processes that can be quite readily applied to any open source project.\n",
"\n",
"- **Exploring connections**: A longer term focus for both our teams would be to explore connections with other open-source ML frameworks, such as PyTorch, Tensorflow and other AutoML tools. Doing so would also help integrate and strengthen the open-source ML communities and ecosystems."
Member

Suggested change
"- **Exploring connections**: A longer term focus for both our teams would be to explore connections with other open-source ML frameworks, such as PyTorch, Tensorflow and other AutoML tools. Doing so would also help integrate and strengthen the open-source ML communities and ecosystems."
"- **Exploring connections**: A longer term focus for both our teams would be to explore connections with other open-source ML frameworks, such as PyTorch, Tensorflow as well as AutoML tools. Doing so would also help integrate and strengthen the open-source ML communities and ecosystems."

"source": [
"## Discussion on Croissant\n",
"\n",
"Machine Learning datasets are a combination of structured and unstructured data, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",
Member

Suggested change
"Machine Learning datasets are a combination of structured and unstructured data, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",
"Machine Learning datasets can consist of structured data, unstructured data, or both, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",

Member

In the previous phrasing it was also hard to tell whether it pertained to individual datasets or to the landscape (I assume the latter). I think this formulation is less ambiguous.

"\n",
"Machine Learning datasets are a combination of structured and unstructured data, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",
"\n",
"[Croissant](https://github.com/mlcommons/croissant) is one such dataset format which directly tackles the issue of consistency. This format is now not only compatible with the most popular ML libraries/platforms (scikit-learn, PyTorch, Tensorflow, Kaggle, Hugging Face), it is also recommended by NeurIPS (one of the topmost conferences in the AI space).\n",
Member

Suggested change
"[Croissant](https://github.com/mlcommons/croissant) is one such dataset format which directly tackles the issue of consistency. This format is now not only compatible with the most popular ML libraries/platforms (scikit-learn, PyTorch, Tensorflow, Kaggle, Hugging Face), it is also recommended by NeurIPS (one of the topmost conferences in the AI space).\n",
"[Croissant](https://github.com/mlcommons/croissant) is a dataset metadata format which, among other things, describes how to load and interpret the dataset, which helps load data regardless of which underlying dataset format is used. This format is now not only compatible with the most popular ML libraries/platforms (scikit-learn, PyTorch, Tensorflow, Kaggle, Hugging Face), it is also recommended by NeurIPS (one of the topmost conferences in the AI space).\n",
