openml x probabl blog #34

Open: wants to merge 4 commits into base: main. Showing changes from 3 commits.
Binary file added images/blogs/openmlxsci/discussion.png
Binary file added images/blogs/openmlxsci/discussion2.png
Binary file added images/blogs/openmlxsci/lunch.png
Binary file added images/blogs/openmlxsci/openmlxsci-1.png
Binary file added images/blogs/temperature/1.png
Binary file added images/blogs/temperature/2.png
Binary file added images/blogs/temperature/3.png
Binary file added images/blogs/temperature/4.png
Binary file added images/blogs/temperature/5.png
Binary file added images/blogs/temperature/6.png
432 changes: 432 additions & 0 deletions notebooks/Experiments with Temperature in LLMs.ipynb

Large diffs are not rendered by default.

214 changes: 214 additions & 0 deletions notebooks/OpenMLxProbabl-hackathon.ipynb
@@ -0,0 +1,214 @@
{
"cells": [
{
"cell_type": "raw",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"---\n",
"title: OpenML x Probabl Hackathon\n",
"topic: hacakathon\n",
Member
Suggested change
"topic: hacakathon\n",
"topic: Hackathon\n",

Member
The header is also showing up in the blog article for some reason (https://openml-labs.github.io/website/notebooks/OpenMLxProbabl-hackathon.html). Probably needs an extra - or newline at the end of the cell.
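
A minimal sketch of a check for this, assuming the site generator expects the raw cell's YAML front matter to open and close with a `---` line (as Quarto does). The helper name `front_matter_is_closed` is hypothetical, and the sample lists are shortened versions of the raw cell in this diff.

```python
def front_matter_is_closed(source_lines):
    """Return True if a raw cell opens and closes its front matter with '---'."""
    lines = [line.rstrip("\n") for line in source_lines]
    # Drop trailing blank lines so a final newline does not hide the delimiter.
    while lines and not lines[-1].strip():
        lines.pop()
    return (
        len(lines) >= 2
        and lines[0].strip() == "---"
        and lines[-1].strip() == "---"
    )


# Shortened copy of the cell as committed: it ends with "--".
broken = ["---\n", "title: OpenML x Probabl Hackathon\n", "--"]
# Same cell with the closing delimiter completed.
fixed = ["---\n", "title: OpenML x Probabl Hackathon\n", "---\n"]

print(front_matter_is_closed(broken))
print(front_matter_is_closed(fixed))
```

Run against the cell as committed, the check reports the trailing `--` as an unclosed block, which would be consistent with the header leaking into the rendered article.
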

"author: Subhaditya Mukherjee , Emily Chen\n",
Member
Suggested change
"author: Subhaditya Mukherjee , Emily Chen\n",
"author: Subhaditya Mukherjee, Emily Chen\n",

"date: 09-19-2024\n",
"format:\n",
" html:\n",
" code-fold: false\n",
"--"
]
},
{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"## Introduction\n",
"\n",
"[scikit-learn](http://scikit-learn.org) is a free and open-source machine learning library for the Python programming language, while [OpenML](http://openml.org) is an open platform for sharing datasets, algorithms, and experiments. While our teams have been working together for many years, we do not always have the time to meet in person. When we do get the chance though, many interesting discussions stem from those conversations.\n",
"\n",
"We recently had a developers hackathon at the Paris office of [Probabl](https://probabl.ai/about), the official operating brand of scikit-learn, focused on maintaining and expanding open-source ML in Europe and beyond). We met to discuss not only the state of AI and how both our organizations fit in, but also to brainstorm solutions to challenges faced by our developers and communities. \n",
Member
Suggested change
"We recently had a developers hackathon at the Paris office of [Probabl](https://probabl.ai/about), the official operating brand of scikit-learn, focused on maintaining and expanding open-source ML in Europe and beyond). We met to discuss not only the state of AI and how both our organizations fit in, but also to brainstorm solutions to challenges faced by our developers and communities. \n",
"Last June we had a developers hackathon at the Paris office of [Probabl](https://probabl.ai/about), the official operating brand of scikit-learn, focused on maintaining and expanding open-source ML in Europe and beyond. We met to discuss not only the state of AI and how both our organizations fit in, but also to brainstorm solutions to challenges faced by our developers and communities. \n",

"\n",
"<div style=\"text-align:center\">\n",
"<img alt=\"View from the office\" src=\"../images/blogs/openmlxsci/openmlxsci-1.png\" style=\"width:40%\">\n",
"</div>\n",
"\n",
"Many interesting topics were brought up, most of which were not only relevant to us but also to the broader open-source community. In the spirit of open-source, we wanted to share these insights with you."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information about this hackathon or just to chat with us, feel free to reach out to us on [OpenML Email](mailto:[email protected]), [Probabl](https://probabl.ai/about). To contact the authors, send them an email here - [Subhaditya Mukherjee](mailto:[email protected]), [Emily Chen](mailto:[email protected])."
]
},
Comment on lines +43 to +49
Member
Would put this cell at the end of the blogpost.

{
"cell_type": "markdown",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"## Community Engagement and Onboarding Contributors\n",
"\n",
"The focus of this discussion was around community engagement and emphasized the importance of effectively attracting, onboarding, and retaining contributors, especially newcomers.\n",
"\n",
"Over the past few years, in both OpenML and scikit-learn, we have noticed that a majority of contributors are only active for short periods. Those who stay are split between a small number of experienced senior developers and a much larger pool of junior developers. As a result, many PRs are of lower quality and require a lot more verification and correction.\n",
"\n",
"The question at hand then, is not only how we can attract new contributors to our projects but also how we can make it easier for them and our developers to maintain these projects. Some of the ideas that came up have been tried and tested by communities our colleagues have founded or been part of across the globe.\n",
"\n",
"**Takeaways:**\n",
"\n",
"- **Emotional connection** : All of the participants agreed that the most important part of any community is its people. Contributors only stick around if they have an emotional connection to either the project, or the people contributing to the project. In this vein, it would be nice if the maintainers of the project could also be present at events. It also helps if the contributors use the projects themselves.\n",
"\n",
"- **Focus on beginners** : Since we see that most of our contributors are beginners, it serves to organize sprints that are inclusive of them with a focus on beginner-friendly issues, especially documentation tasks. Having these would not only help them understand the project and contribute better, but also let them form a connection with the project.\n",
"\n",
"- **Curated issues** : Most of our external contributors have enough on their plate already. Having a curated list of issues before sprints would ensure that no time is wasted and let our contributors focus on the tasks suitable for them.\n",
"\n",
"- **Different tiers of events** : To ensure that everyone is given tasks they can handle, it would be nice to have separate events for contributors with varying levels of expertise. This would also have the added benefit of retaining beginner-friendly issues for sprints to prevent more experienced contributors from claiming them early.\n",
"\n",
"- **Mentoring** : Incorporating one-to-one mentoring to help new contributors set up their development environments would help them feel more connected to the project. It is understandable that this is time consuming, so perhaps some way of deciding who gets mentored can be set up in time.\n",
"\n",
"- **Contributors guide** : Simplifying the contributor's guide and adding video tutorials would make it a lot more beginner-friendly, especially to those who have never filed a PR before.\n",
Member
Suggested change
"- **Contributors guide** : Simplifying the contributor's guide and adding video tutorials would make it a lot more beginner-friendly, especially to those who have never filed a PR before.\n",
"- **Contributors guide** : Simplifying the contributor's guide and adding video tutorials would make it a lot more beginner-friendly, especially to those who have never filed a [pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests#about-pull-requests) (PR) before.\n",

"\n",
"- **Semi regular events** : Some of our colleagues found that they tended to care about a project more if there were semi regular events they could set time aside for. Having these helped them slowly build a sense of community as well.\n",
"\n",
"- **Incentives** : A common question that many developers have is why they should even bother contributing. Helping them understand how contributing to open source projects can aid their careers, bring them closer to the community and also help them get internships would be a good start.\n",
"\n",
"<div style=\"text-align:center\">\n",
"<img alt=\"Discussions\" src=\"../images/blogs/openmlxsci/discussion.png\" style=\"width:40%\">\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Governance, Funding, and Sponsorship\n",
"\n",
"The focus of this discussion was the governance structures of open-source projects, sustainable funding models, and the role of sponsorships in supporting project activities.\n",
"\n",
"**Takeaways:**\n",
"\n",
"- **Evolving Governance** : Since governance is not static, we can treat it as a living document that evolves with the needs of the community.\n",
"\n",
"- **Communication**: It is good practise to maintain open communication channels, such as mailing lists and monthly meetings. This keeps all interested contributors in the loop.\n",
"\n",
"- **Reviews**: A major challenge is that of approving new reviews in a large organization without passing through too many hoops. How to manage this is still an open question.\n",
Member
Reviews? What do you mean?
Is it supposed to be reviewers, or pull requests, perhaps? Not sure.

"\n",
"- **Sponsorship**: It would be nice to explore sponsorship models like the INRIA foundation, where companies contribute to meetings and have a say in setting priorities.\n",
Member
Probably want a sentence explaining that model or at least provide a link.

"\n",
"- **Corporate partnerships**: To keep investors interested, it would also be interesting to look into corporate partnerships (similar to those used by the Linux Foundation) that are of mutual benefit."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Development Tooling and Workflows\n",
"\n",
"Setting up and maintaining CI/CD pipelines across multiple repositories and programming languages puts a huge burden on the maintainers of the repositories.\n",
"\n",
"Having to keep up with the trends and learn new technology very frequently also makes developers more reluctant to change the stack, even if doing so would be beneficial.\n",
"\n",
"This discussion looked at how to tackle these challenges and make it easier for developers to handle such complex workloads.\n",
"\n",
"**Takeaways:**\n",
"\n",
"- **Automation**: Automating as much of the CI/CD pipeline as possible by using bots for linting and code coverage checks makes PR quality control easier.\n",
"\n",
"- **Better workflows**: Migrating to GitHub Actions and Azure workflows for testing and deployment also seems to help significantly.\n",
"\n",
"- **Better tools**: There are many open source platforms/tools that offer comparable convenience to tools that are currently used. Some examples of these are CodeBerg and Forgejo. Eventually migrating to using these tools might also help manage projects like scikit-learn and OpenML.\n",
"\n",
Comment on lines +128 to +129
Member
Suggested change
"- **Better tools**: There are many open source platforms/tools that offer comparable convenience to tools that are currently used. Some examples of these are CodeBerg and Forgejo. Eventually migrating to using these tools might also help manage projects like scikit-learn and OpenML.\n",
"\n",
"- **Open source tools**: There are many open source platforms/tools that offer comparable convenience to tools that are currently used. Some examples of these are CodeBerg and Forgejo, as alternatives to GitHub and its issue trackers. Eventually migrating to these tools might also help manage projects like scikit-learn and OpenML.\n",
"\n",

Member
I can't really say they are better (I don't have enough experience with CodeBerg). The reason I raised that in the discussion was that I wanted to hear their perspective on whether we, as open source projects, should adopt those tools even if they may (in the short term) be worse for the projects (as GitHub has more users, visibility, ...).

"- **CircleCI**: Tools for rendering documentation examples directly in the browser can also be used to enhance the review process.\n",
"\n",
"<div style=\"text-align:center\">\n",
"<img alt=\"Brainstorming sessions\" src=\"../images/blogs/openmlxsci/discussion2.png\" style=\"width:40%\">\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Broader Ecosystem and Scope of Collaboration\n",
"\n",
"While both OpenML and scikit-learn focus on the open source ML community, it is sometimes hard to explain how we fit into the broader AI ecosystem. \n",
"\n",
"Especially with the rise of LLMs and Generative AI, stakeholders are somewhat inclined to think that frameworks such as ours are just not \"enough\". Of course, this is not true at all.\n",
"\n",
"**Takeaways:**\n",
"\n",
"- **Community projects**: There are many successful examples of open source projects, and a lot can be learned from their efforts. Projects like Scientific Python, for example, have very well set up CI, documentation, and governance processes that can be quite readily applied to any open source project.\n",
"\n",
"- **Exploring connections**: A longer term focus for both our teams would be to explore connections with other open-source ML frameworks, such as PyTorch, Tensorflow and other AutoML tools. Doing so would also help integrate and strengthen the open-source ML communities and ecosystems."
Member
Suggested change
"- **Exploring connections**: A longer term focus for both our teams would be to explore connections with other open-source ML frameworks, such as PyTorch, Tensorflow and other AutoML tools. Doing so would also help integrate and strengthen the open-source ML communities and ecosystems."
"- **Exploring connections**: A longer term focus for both our teams would be to explore connections with other open-source ML frameworks, such as PyTorch and Tensorflow, as well as AutoML tools. Doing so would also help integrate and strengthen the open-source ML communities and ecosystems."

]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Discussion on Croissant\n",
"\n",
"Machine Learning datasets are a combination of structured and unstructured data, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",
Member
Suggested change
"Machine Learning datasets are a combination of structured and unstructured data, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",
"Machine Learning datasets can consist of structured data, unstructured data, or both, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",

Member
In the previous phrasing it was also hard to tell whether it pertained to individual datasets or the landscape (I assume the latter). I think this formulation is less ambiguous.

"\n",
"[Croissant](https://github.com/mlcommons/croissant) is one such dataset format which directly tackles the issue of consistency. This format is now not only compatible with the most popular ML libraries/platforms (scikit-learn, PyTorch, Tensorflow, Kaggle, Hugging Face), it is also recommended by NeurIPS (one of the topmost conferences in the AI space).\n",
Member
Suggested change
"[Croissant](https://github.com/mlcommons/croissant) is one such dataset format which directly tackles the issue of consistency. This format is now not only compatible with the most popular ML libraries/platforms (scikit-learn, PyTorch, Tensorflow, Kaggle, Hugging Face), it is also recommended by NeurIPS (one of the topmost conferences in the AI space).\n",
"[Croissant](https://github.com/mlcommons/croissant) is a dataset metadata format which, among other things, describes how to load and interpret the dataset, which helps load data regardless of which underlying dataset format is used. This format is now not only compatible with the most popular ML libraries/platforms (scikit-learn, PyTorch, Tensorflow, Kaggle, Hugging Face), it is also recommended by NeurIPS (one of the topmost conferences in the AI space).\n",

"\n",
"**Features of Croissant:**\n",
"\n",
"- **Schema.org**: Croissant was built on top of schema.org with more metadata information specific to ML datasets. Since it does not require any changes to the underlying data structure, existing datasets can quite easily be converted to use it.\n",
"\n",
"- **Layers**: The format has 4 layers: dataset-level metadata, resource descriptions, content structure, and ML semantics. Together, these make it possible to encode and maintain structural information about datasets regardless of platform.\n",
"\n",
"- **XAI and Visualization**: Analysis and visualisation of the data works out of the box for all datasets and across multiple platforms. Croissant also supports the Core RAI vocabulary for explainable AI.\n",
"\n",
"- **Supported Platforms**: Every dataset in OpenML has a Croissant representation, while a majority of data on Kaggle and Google Dataset search also support it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"<div style=\"text-align:center\">\n",
"<img alt=\"Our teams enjoying a nice lunch in Paris\" src=\"../images/blogs/openmlxsci/lunch.png\" style=\"width:40%\">\n",
"</div>\n",
"\n",
"Overall, this discussion was quite a successful one for both of our teams. We learnt a lot from each other and found new ways of collaborating on our shared dream of open-source ML. So much in fact, that we wanted to share our discussion with you, dear reader.\n",
"\n",
"We hope you learnt something new. We would love to welcome you to our community and would be glad to support your journey in this ML space.\n",
"\n",
"❤️ The OpenML and scikit-learn team"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "3.11.9",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
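
The four layers named in the "Features of Croissant" cell of this notebook can be pictured with a small Python sketch. The keys below (`dataset_metadata`, `resources`, `structure`, `ml_semantics`), the toy dataset, and the helpers `label_columns` and `feature_columns` are illustrative assumptions. They are not the actual Croissant vocabulary, which is expressed as metadata built on schema.org.

```python
# Illustrative only: hypothetical keys and values, not real Croissant metadata.
toy_dataset = {
    # Layer 1: dataset-level metadata (name, license, ...)
    "dataset_metadata": {"name": "toy_iris", "license": "CC-BY-4.0"},
    # Layer 2: resource descriptions (the files that hold the data)
    "resources": [{"id": "iris.csv", "encoding_format": "text/csv"}],
    # Layer 3: content structure (how files map to records and fields)
    "structure": {
        "record_set": "default",
        "fields": {
            "sepal_length": {"source": "iris.csv", "type": "float"},
            "species": {"source": "iris.csv", "type": "text"},
        },
    },
    # Layer 4: ML semantics (what role each field plays in a task)
    "ml_semantics": {"roles": {"species": "label"}},
}


def label_columns(doc):
    """Return fields marked as labels that also exist in the structure layer."""
    fields = doc["structure"]["fields"]
    roles = doc["ml_semantics"]["roles"]
    return [name for name, role in roles.items() if role == "label" and name in fields]


def feature_columns(doc):
    """Return every structured field that is not marked as a label."""
    labels = set(label_columns(doc))
    return [name for name in doc["structure"]["fields"] if name not in labels]


print(label_columns(toy_dataset))
print(feature_columns(toy_dataset))
```

Because the ML roles live in their own layer, a tool can split features from labels without touching the underlying file, which matches the post's point that existing data does not need to change.
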