openml x probabl blog #34

Open

SubhadityaMukherjee wants to merge 4 commits into main from openmlxsciblog

Contributor

SubhadityaMukherjee commented Jan 2, 2025

OpenML x Probabl hackathon blog post. This was written a while back but was never published by the Probabl team. Thought I'd put it here for now. (The Probabl team was sent an email about it as well.)

SubhadityaMukherjee added 3 commits

January 2, 2025 15:31


          openml x sci blog

9b158e9


          fixed date I think

bb518cc


          added another blog - experimenting with lLMs

c016670

Contributor Author

SubhadityaMukherjee commented Jan 2, 2025

Added another blog post. Experimenting with LLM temperatures. Written when I was working on the AI search module.

PGijsbers requested changes

Member

PGijsbers left a comment

Would you be fine converting the images to WebP? The photos are very big (5-10 MB each) and might not load for people with a poor connection (in some countries, on trains, etc.). When we merge, we should squash-merge so that the 20 MB+ of photos does not end up in the git history.
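
To catch oversized photos before they are committed, something like the following could be run over the repository. This is only a minimal sketch: the function name `find_large_images`, the 1 MB default limit, and the list of extensions are illustrative assumptions, not part of this PR or of the website's tooling.

```python
from pathlib import Path

# Hypothetical choice of extensions to flag; WebP files are deliberately not listed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def find_large_images(root, max_bytes=1_000_000):
    """Return (path, size_in_bytes) for each image under root larger than max_bytes."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            size = path.stat().st_size
            if size > max_bytes:
                hits.append((path, size))
    return hits
```

Calling `find_large_images(".")` from the repository root returns the files that are still candidates for conversion; the threshold can be lowered if the blog should stay lighter.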


          converted images to webp, added notes to the README for future reference

010e1e1

SubhadityaMukherjee commented

Contributor Author

SubhadityaMukherjee left a comment

Ah, makes sense. I updated them to webp and added the instruction in the README for future reference.

PGijsbers requested changes

Member

PGijsbers left a comment

Minor suggestions for typos, clarifications, etc. I deliberately didn't review things like style or content. I guess we should discuss at some point whether we want the writing style of the blogs to be consistent across the group or left up to the individual authors. For now I assume the latter :)

notebooks/OpenMLxProbabl-hackathon.ipynb

+                 "source": [
+                  "---\n",
+                  "title: OpenML x Probabl Hackathon\n",
+                  "topic: hacakathon\n",

Member

PGijsbers Jan 6, 2025

Suggested change

      
                "topic: hacakathon\n",
          
                "topic: Hackathon\n",

Member

PGijsbers Jan 6, 2025

The header is also showing up in the blog article for some reason (https://openml-labs.github.io/website/notebooks/OpenMLxProbabl-hackathon.html). Probably needs an extra - or newline at the end of the cell.
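
One possible cause is a front-matter block that is opened with `---` but never closed. The sketch below checks only that one condition in a notebook's first markdown cell; the helper name `front_matter_is_closed` is hypothetical, and whether the site generator actually requires a closing `---` line is an assumption, not something confirmed in this thread.

```python
import json


def front_matter_is_closed(notebook_path):
    """Return True if the first markdown cell opens and closes a '---' block."""
    with open(notebook_path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", [])
        # The notebook format allows source to be a list of lines or one string.
        if isinstance(source, str):
            source = source.splitlines()
        lines = [line.strip() for line in source]
        if not lines or lines[0] != "---":
            return False  # no front matter at the top of the first markdown cell
        # Assumption: the block must end with a second '---' line.
        return "---" in lines[1:]
    return False  # the notebook has no markdown cell at all
```

If this returns `False` for `notebooks/OpenMLxProbabl-hackathon.ipynb`, adding a closing `---` line to the first cell would be the first thing to try.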

notebooks/OpenMLxProbabl-hackathon.ipynb

+                  "---\n",
+                  "title: OpenML x Probabl Hackathon\n",
+                  "topic: hacakathon\n",
+                  "author: Subhaditya Mukherjee , Emily Chen\n",

Member

PGijsbers Jan 6, 2025

Suggested change

      
                "author: Subhaditya Mukherjee , Emily Chen\n",
          
                "author: Subhaditya Mukherjee, Emily Chen\n",

notebooks/OpenMLxProbabl-hackathon.ipynb

+                  "\n",
+                  "[scikit-learn](http://scikit-learn.org) is a free and open-source machine learning library for the Python programming language, while [OpenML](http://openml.org) is an open platform for sharing datasets, algorithms, and experiments. While our teams have been working together for many years, we do not always have the time to meet in person. When we do get the chance though, many interesting discussions stem from those conversations.\n",
+                  "\n",
+                  "We recently had a developers hackathon at the Paris office of [Probabl](https://probabl.ai/about), the official operating brand of scikit-learn, focused on maintaining and expanding open-source ML in Europe and beyond). We met to discuss not only the state of AI and how both our organizations fit in, but also to brainstorm solutions to challenges faced by our developers and communities.  \n",

Member

PGijsbers Jan 6, 2025

Suggested change

      
                "We recently had a developers hackathon at the Paris office of [Probabl](https://probabl.ai/about), the official operating brand of scikit-learn, focused on maintaining and expanding open-source ML in Europe and beyond). We met to discuss not only the state of AI and how both our organizations fit in, but also to brainstorm solutions to challenges faced by our developers and communities.  \n",
          
                "Last June we had a developers hackathon at the Paris office of [Probabl](https://probabl.ai/about), the official operating brand of scikit-learn, focused on maintaining and expanding open-source ML in Europe and beyond. We met to discuss not only the state of AI and how both our organizations fit in, but also to brainstorm solutions to challenges faced by our developers and communities.  \n",

notebooks/OpenMLxProbabl-hackathon.ipynb

Comment on lines +43 to +49

+                {
+                 "cell_type": "markdown",
+                 "metadata": {},
+                 "source": [
+                  "For more information about this hackathon or just to chat with us, feel free to reach out to us on [OpenML Email](mailto:[email protected]), [Probabl](https://probabl.ai/about). To contact the authors, send them an email here - [Subhaditya Mukherjee](mailto:[email protected]), [Emily Chen](mailto:[email protected])."
+                 ]
+                },

Member

PGijsbers Jan 6, 2025

Would put this cell at the end of the blogpost.

notebooks/OpenMLxProbabl-hackathon.ipynb

+                  "\n",
+                  "- **Mentoring** : Incorporating one-to-one mentoring to help new contributors set up their development environments would help them feel more connected to the project. It is understandable that this is time consuming, so perhaps some way of deciding who gets mentored can be set up in time.\n",
+                  "\n",
+                  "- **Contributors guide** : Simplifying the contributor's guide and adding video tutorials would make it a lot more beginner-friendly, especially to those who have never filed a PR before.\n",

Member

PGijsbers Jan 6, 2025

Suggested change

      
                "- **Contributors guide** : Simplifying the contributor's guide and adding video tutorials would make it a lot more beginner-friendly, especially to those who have never filed a PR before.\n",
          
                "- **Contributors guide** : Simplifying the contributor's guide and adding video tutorials would make it a lot more beginner-friendly, especially to those who have never filed a [pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests#about-pull-requests) (PR) before.\n",

notebooks/OpenMLxProbabl-hackathon.ipynb

+                  "\n",
+                  "- **Reviews**: A major challenge is that of approving new reviews in a large organization without passing through too many hoops. How to manage this is still an open question.\n",
+                  "\n",
+                  "- **Sponsorship**: It would be nice to explore sponsorship models like the INRIA foundation, where companies contribute to meetings and have a say in setting priorities.\n",

Member

PGijsbers Jan 6, 2025

Probably want a sentence explaining that model or at least provide a link.

notebooks/OpenMLxProbabl-hackathon.ipynb

Comment on lines +128 to +129

		"- Better tools: There are many open source platforms/tools that offer comparable convenience to tools that are currently used. Some examples of these are CodeBerg and Forgejo. Eventually migrating to using these tools might also help manage projects like scikit-learn and OpenML.\n",
		"\n",

Member

PGijsbers Jan 6, 2025

Suggested change

      
                "- **Better tools**: There are many open source platforms/tools that offer comparable convenience to tools that are currently used. Some examples of these are CodeBerg and Forgejo. Eventually migrating to using these tools might also help manage projects like scikit-learn and OpenML.\n",
          
                "\n",
          
                "- **Open source tools**: There are many open source platforms/tools that offer comparable convenience to tools that are currently used. Some examples of these are CodeBerg and Forgejo as an alternative to GitHub and its issue trackers. Eventually migrating to using these tools might also help manage projects like scikit-learn and OpenML.\n",
          
                "\n",

Member

PGijsbers Jan 6, 2025

I can't really say they are better (I don't have enough experience with CodeBerg). The reason I raised this in the discussion was that I wanted to hear their perspective on whether we, as open source projects, should adopt those tools even if they may (in the short term) be worse for the projects (as GitHub has more users, visibility, ...).

notebooks/OpenMLxProbabl-hackathon.ipynb

+                  "\n",
+                  "- **Community projects**: There are so many successful examples of open source projects, and a lot can be learned from their efforts. Projects like Scientific Python for example, have very well set up CI documentation and governance processes that can be quite readily applied to any open source project.\n",
+                  "\n",
+                  "- **Exploring connections**: A longer term focus for both our teams would be to explore connections with other open-source ML frameworks, such as PyTorch, Tensorflow and other AutoML tools. Doing so would also help integrate and strengthen the open-source ML communities and ecosystems."

Member

PGijsbers Jan 6, 2025

Suggested change

      
                "- **Exploring connections**: A longer term focus for both our teams would be to explore connections with other open-source ML frameworks, such as PyTorch, Tensorflow and other AutoML tools. Doing so would also help integrate and strengthen the open-source ML communities and ecosystems."
          
                "- **Exploring connections**: A longer term focus for both our teams would be to explore connections with other open-source ML frameworks, such as PyTorch and Tensorflow, as well as AutoML tools. Doing so would also help integrate and strengthen the open-source ML communities and ecosystems."

notebooks/OpenMLxProbabl-hackathon.ipynb

+                 "source": [
+                  "## Discussion on Croissant\n",
+                  "\n",
+                  "Machine Learning datasets are a combination of structured and unstructured data, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",

Member

PGijsbers Jan 6, 2025

Suggested change

      
                "Machine Learning datasets are a combination of structured and unstructured data, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",
          
                "Machine Learning datasets can consist of structured data, unstructured data, or both, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",

Member

PGijsbers Jan 6, 2025

In the previous phrasing it was also hard to tell whether it pertained to individual datasets or to the landscape as a whole (I assume the latter). I think this formulation is less ambiguous.

notebooks/OpenMLxProbabl-hackathon.ipynb

+                  "\n",
+                  "Machine Learning datasets are a combination of structured and unstructured data, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n",
+                  "\n",
+                  "[Croissant](https://github.com/mlcommons/croissant) is one such dataset format which directly tackles the issue of consistency. This format is now not only compatible with the most popular ML libraries/platforms (scikit-learn, PyTorch, Tensorflow, Kaggle, Hugging Face), it is also recommended by NeurIPS (one of the topmost conferences in the AI space).\n",

Member

PGijsbers Jan 6, 2025

Suggested change

      
                "[Croissant](https://github.com/mlcommons/croissant) is one such dataset format which directly tackles the issue of consistency. This format is now not only compatible with the most popular ML libraries/platforms (scikit-learn, PyTorch, Tensorflow, Kaggle, Hugging Face), it is also recommended by NeurIPS (one of the topmost conferences in the AI space).\n",
          
                "[Croissant](https://github.com/mlcommons/croissant) is a dataset metadata format which, among other things, describes how to load and interpret the dataset, which helps load data regardless of which underlying dataset format is used. This format is now not only compatible with the most popular ML libraries/platforms (scikit-learn, PyTorch, Tensorflow, Kaggle, Hugging Face), it is also recommended by NeurIPS (one of the topmost conferences in the AI space).\n",
