diff --git a/README.md b/README.md index d31963f..c6eae00 100644 --- a/README.md +++ b/README.md @@ -26,4 +26,10 @@ format: --- ``` -The `topic` field is used to group similar posts into sections. In each section, the most recent posts are on top. Note that Quarto won't run again the code, so you must run everything before adding the notebook. An example of blog post is given [here](https://github.com/openml-labs/website/blob/main/notebooks/example.ipynb). \ No newline at end of file +The `topic` field is used to group similar posts into sections. In each section, the most recent posts are on top. Note that Quarto won't run the code again, so you must run everything before adding the notebook. An example of a blog post is given [here](https://github.com/openml-labs/website/blob/main/notebooks/example.ipynb). + +### Important notes +- For the blog post notebooks + - The `date` field must be in the format `mm-dd-yyyy`, not `dd-mm-yyyy` + - Convert the images to WebP format before adding them to the notebook + - On macOS, you can use `for image in *.png; do magick "$image" "${image%.png}.webp"; done` to convert all the images in the folder to WebP format. Replace `*.png` with the appropriate file extension if needed.
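+  - Alternatively, a rough cross-platform equivalent in Python (a sketch, assuming the [Pillow](https://pypi.org/project/Pillow/) package is installed with WebP support): `python -c "from pathlib import Path; from PIL import Image; [Image.open(p).save(p.with_suffix('.webp'), 'WEBP') for p in Path('.').glob('*.png')]"`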
"raw", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "---\n", + "title: Experiments with Temperature in LLMs\n", + "topic: LLM\n", + "author: Subhaditya Mukherjee\n", + "date: 07-08-2024\n", + "format:\n", + " html:\n", + " code-fold: false\n", + "--" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "## Introduction\n", + "Over the past few months at OpenML, we have been experimenting with LLM models in an attempt to improve the search experience for our users. While our existing implementation uses ElasticSearch, we wanted to also have the option of having a more \"semantic\" search experience. \n", + "\n", + "Aside from the usual RAG pipeline that everyone and their grandparents seems to be using these days, we also wanted to experiment with using an LLM to semi-automatically generate filters for our search queries. While it may not seem like a big feature, it is something that has always been a bit of an annoyance for some of our users. \n", + "\n", + "So what does this entail? Consider the interface we have at the moment. We have a search bar at the top, and subsequently a bunch of filters that users can use to narrow down their search. While this works pretty well as is, how about trying to automate it a bit.\n", + "\n", + "In summary, we want a query like \"find me a large dataset with multiple classes of flowers\" to automatically generate filters like \"classification\", \"multiclass\", \"sort by size of dataset\" etc.\n", + "\n", + "
\n", + "\"\"\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "## Temperature\n", + "\n", + "Think about the first time you used ChatGPT. What stood out to you? Was it how well it could elaborate on a topic? Or was it how creative it could be? The temperature parameter in LLMs is what controls this. \n", + "\n", + "How can we control creativity? Well, saying that we can directly control creativity is a bit of a stretch. We can however use a workaround.\n", + "\n", + "Do you remember the softmax function? The function that takes a vector of arbitrary real-valued scores and squashes it into a vector of probabilities that sum to 1. The inputs to the softmax function are the unnormalized log likelikhoods or the raw per class score assigned by the model. \n", + "\n", + "The softmax function is defined as:\n", + "\n", + "$$\\text{softmax}(x_i) = \\frac{e^{x_i}}{\\sum_{j=1}^{k} e^{x_j}}$$\n", + "\n", + "If we want more control over the distribution of the probabilities, we can use a temperature parameter. This would look like:\n", + "\n", + "$$ \\text{softmax}(x_i) = \\frac{e^{x_i/T}}{\\sum_{j=1}^{k} e^{x_j/T}} $$\n", + "\n", + "where $T$ is the temperature parameter.\n", + "\n", + "- If $T = 1$, the softmax function is the same as the original softmax function.\n", + "\n", + "- If $T > 1$, the probabilities will become \"flatter\". Since the difference between the probabilities will be less, the model can be more exploratory aka more creative.\n", + "\n", + "- If $T < 1$, the distribution of the probabilities are \"peakier\". There will be a higher difference between the probabilities, leading to the model being more confident in its predictions, but also less creative." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Visualizing Temperature Using Softmax\n", + "\n", + "```python\n", + "from tqdm import tqdm\n", + "import regex as re\n", + "# LangChain supports many other chat models. Here, we're using Ollama\n", + "from langchain_community.chat_models import ChatOllama\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.prompts import ChatPromptTemplate\n", + "from typing import List, Dict, Any\n", + "import numpy as np \n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "import pandas as pd \n", + "sns.set_theme(style=\"white\")\n", + "```\n", + "\n", + "```python\n", + "def softmax(input, t=1.0):\n", + " ex = np.exp(input/t)\n", + " sum = np.sum(ex, axis=0)\n", + " return ex / sum\n", + "```\n", + "\n", + "```python\n", + "# plot softmax over a range of inputs\n", + "x = np.arange(0,1.0, 0.01)\n", + "t = np.array([0.1,.5, .8, 1.0])\n", + "y = np.array([softmax(x, ti) for ti in t])\n", + "\n", + "# Create a DataFrame for Seaborn\n", + "data = pd.DataFrame({\n", + " 'x': np.tile(x, len(t)),\n", + " 'softmax': np.concatenate(y),\n", + " 't': np.repeat(t, len(x))\n", + "})\n", + "\n", + "# Plotting with Seaborn\n", + "plt.figure(figsize=(8, 6))\n", + "sns.lineplot(data=data, x='x', y='softmax', hue='t', palette='viridis')\n", + "plt.xlabel('x')\n", + "plt.ylabel('softmax(x)')\n", + "plt.toc: true\n", + "title('Softmax Function for Different Values of t')\n", + "plt.legend(toc: true\n", + "title='t')\n", + "plt.show()\n", + "```\n", + "\n", + "
\n", + "\"\"\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "## Creating the Experimental Setup\n", + "Now, we can focus on testing the effects of temperature for our use case. We are using the `llama3` model for our experiments. The experiments are being run on a 2023 MacBook Pro with an M3 chip and 18GB memory." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Defining a Prompt\n", + "We need to first think of a prompt that we can use for our experiments. This prompt can be thought of as an instruction that the model uses along with the query to generate answers. To make it easier for us to use, we only want one/two word answers and for now we are only focusing on a small subset of the filters that we want our model to understand.\n", + "\n", + "```python\n", + "prompt = \"\"\"User Query : {query}\n", + "Based on the query, answer the following questions one by one in one or two words only and a maximum of two with commas only if asked for. Use only the information given and do not make up answers - \n", + "Does the user care about the size of the dataset? Yes/No and if yes, ascending/descending.\n", + "Does the user want to sort by number of downloads? Yes/No.\n", + "Does the user care about missing values? Yes/No.\n", + "If it seems like the user wants a classification dataset, is it binary/multi-class/multi-label? If not, say none.\n", + "\"\"\"\n", + "```\n", + "\n", + "```python\n", + "query = \"Find me a big classification dataset about mushrooms\"\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating a Chain\n", + "Since we are using the `langchain` and `ollama` libraries for our experiments, we follow their API and create a chain. The template uses string formatting to insert the prompt and the query into the chain.\n", + "\n", + "```python\n", + "def create_chain(prompt , temperature, llm_model = \"llama3\"):\n", + " prompt = ChatPromptTemplate.from_template(prompt)\n", + " llm = ChatOllama(model=llm_model, temperature=temperature)\n", + " chain = prompt | llm | StrOutputParser()\n", + " return chain\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Parsing the Results\n", + "To make it easier for us to analyze the results, we generate an example answer and then see see if any further processing is needed.\n", + "\n", + "```python\n", + "# functiont to parse responses like this to a list of yes/no/none/yes,aescending/no etc\n", + "def parse_response(response):\n", + " # split by new line and remove first two lines (here are the answers:)\n", + " response = response.split('\\n')[2::]\n", + " # if response has a question mark, split by question mark and remove empty strings\n", + " for i in range(len(response)):\n", + " if '?' in response[i]:\n", + " response[i] = response[i].split('?')[1].strip()\n", + " # replace full stops with empty strings\n", + " response = [x.replace('.','') for x in response]\n", + " response = [x for x in response if x]\n", + " return response\n", + "```\n", + "\n", + "```python\n", + "chain = create_chain(prompt, 0.5)\n", + "response = chain.invoke({\"query\": query})\n", + "print(response)\n", + "```\n", + "\n", + " Here are the answers:\n", + " \n", + " 1. Does the user care about the size of the dataset?\n", + " Yes, ascending.\n", + " \n", + " 2. Does the user want to sort by number of downloads?\n", + " No\n", + " \n", + " 3. 
+ "Yay, it works. We now write a function to generate results for different temperatures.\n", + "\n", + "```python\n", + "def generate_results_for_temp(query: str, range_of_temps: np.ndarray) -> List[List[str]]:\n", + "    results = []\n", + "    for temperature in tqdm(range_of_temps):\n", + "        chain = create_chain(prompt, temperature)\n", + "        response = chain.invoke({\"query\": query})\n", + "        results.append(parse_response(response))\n", + "    return results\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Running the Experiments and Plotting Results\n", + "It is time to run the experiments and plot the results. \n", + "We write a function to plot the results in a `stripplot` to see the distribution of the answers for different temperatures.\n", + "\n", + "```python\n", + "def plot_yes_no(df: pd.DataFrame, title: str) -> None:\n", + "    fig, axs = plt.subplots(2, 2, figsize=(15, 15))\n", + "    fig.suptitle(title)\n", + "    sns.stripplot(data=df, x='size', y='temperature', ax=axs[0, 0], hue='size')\n", + "    sns.stripplot(data=df, x='sort_by_downloads', y='temperature', ax=axs[0, 1], hue='sort_by_downloads')\n", + "    sns.stripplot(data=df, x='missing_values', y='temperature', ax=axs[1, 0], hue='missing_values')\n", + "    sns.stripplot(data=df, x='classification_type', y='temperature', ax=axs[1, 1], hue='classification_type')\n", + "    # tilt x axis labels\n", + "    for ax in axs.flat:\n", + "        ax.set_xticklabels(ax.get_xticklabels(), rotation=30, horizontalalignment='right')\n", + "    plt.show()\n", + "```\n", + "\n", + "Sometimes the model returns an extra field, so we combine the last two fields before plotting. (This is a bit of a hack, but it works for now and is ONLY used for plotting.)\n", + "\n", + "```python\n", + "def combine_last_two_elements(lst):\n", + "    # Only combine if the list has more than the expected four elements\n", + "    if len(lst) > 4:\n", + "        # Combine the last two elements with a space separator\n", + "        combined_element = lst[-2] + ' ' + lst[-1]\n", + "\n", + "        # Create a new list with the combined element instead of the last two\n", + "        return lst[:-2] + [combined_element]\n", + "    else:\n", + "        return lst\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Experiment 1\n", + "Our first experiment is a rather simple query, \"Find me a big classification dataset about mushrooms\". As you can probably guess, we are looking for a dataset that is large, is for classification, and is about mushrooms.\n", + "\n", + "```python\n", + "range_of_temps = np.linspace(0, 1, 20)\n", + "query = \"Find me a big classification dataset about mushrooms\"\n", + "results1 = generate_results_for_temp(query, range_of_temps)\n", + "```\n", + "\n", + "    100%|██████████| 20/20 [00:49<00:00,  2.49s/it]\n", + "\n", + "```python\n", + "results1 = [[y for y in x if all(sub not in y for sub in [\"If\", \":\"])] for x in results1]\n", + "```\n", + "\n", + "```python\n", + "df = pd.DataFrame(results1, columns = ['size', 'sort_by_downloads', 'missing_values', 'classification_type'])\n", + "df['temperature'] = range_of_temps\n", + "plot_yes_no(df, title=query)\n", + "```\n", + "\n",
\n", + "\"\"\n", + "
\n", + "\n", + "Rather interesting don't you think? At higher temperatures, the model gets the answers wrong. Even at a temperature slightly above 0.1, the model starts adding extra information to it's answers.\n", + "\n", + "Did you notice that I tried to remove sentences that started with \"If\"? There are more examples of this later, but this is because at higher temperatures, the model tends to add random sentences to the answers and this makes it quite hard to plot them.\n", + "\n", + "### Experiment 2\n", + "Our second experiment is super easy. \"Find me a dataset that has a lot of missing values and order by number of downloads\". As you can obviously guess, we are looking for a dataset that has a lot of missing values and we want to order the results by the number of downloads.\n", + "\n", + "```python\n", + "range_of_temps = np.linspace(0, 1, 20)\n", + "query = \"Find me a dataset that has a lot of missing values and order by number of downloads\"\n", + "results2 = generate_results_for_temp(query, range_of_temps)\n", + "results2 = [y for y in x if \"so\" not in y] for x in results2]\n", + "\n", + "```\n", + "\n", + " 100%|██████████| 20/20 [00:34<00:00, 1.74s/it]\n", + "\n", + "```python\n", + "df = pd.DataFrame(results2, columns = ['size', 'sort_by_downloads', 'missing_values', 'classification_type'])\n", + "df['temperature'] = range_of_temps\n", + "plot_yes_no(df, toc: true\n", + "title = query)\n", + "```\n", + "\n", + "
\n", + "\"\"\n", + "
\n", + "\n", + "Hmm, same as before. The model starts adding extra information at higher temperatures and starts getting the answers wrong. (Yes, No?? ) What kind of answer is that?\n", + "\n", + "## Experiment 3\n", + "- Now a slightly more complex query. \"Find me a dataset that has 10 classes and sort by number of downloads\". We want it to understand that we want a multiclass classification dataset and we want to sort the results by the number of downloads.\n", + "\n", + "```python\n", + "range_of_temps = np.linspace(0, 1, 20)\n", + "query = \"Find me a dataset that has 10 classes and sort by number of downloads\"\n", + "results3 = generate_results_for_temp(query, range_of_temps)\n", + "```\n", + "\n", + " 100%|██████████| 20/20 [00:55<00:00, 2.80s/it]\n", + "\n", + "```python\n", + "results3 = [combine_last_two_elements(x) for x in results3]\n", + "```\n", + "\n", + "```python\n", + "df = pd.DataFrame(results3, columns = ['size', 'sort_by_downloads', 'missing_values', 'classification_type'])\n", + "df['temperature'] = range_of_temps\n", + "plot_yes_no(df, toc: true\n", + "title = query)\n", + "```\n", + "\n", + "
\n", + "\"\"\n", + "
\n", + "\n", + "This seems to have been very easy for the model. But as always, the model starts adding extra information at higher temperatures. A lot of extra information in fact. Even though the prompt says to ONLY answer with one or two words\n", + "\n", + "## Experiment 4\n", + "- \"Find me a dataset that 2 classes and is a big dataset\". You know the drill by now. We want a binary classification dataset that is large.\n", + "\n", + "```python\n", + "range_of_temps = np.linspace(0, 1, 20)\n", + "query = \"Find me a dataset that 2 classes and is a big dataset\"\n", + "results4 = generate_results_for_temp(query, range_of_temps)\n", + "```\n", + "\n", + " 100%|██████████| 20/20 [00:42<00:00, 2.14s/it]\n", + "\n", + "```python\n", + "results4 = [combine_last_two_elements(x) for x in results4]\n", + "```\n", + "\n", + "```python\n", + "df = pd.DataFrame(results4, columns = ['size', 'sort_by_downloads', 'missing_values', 'classification_type'])\n", + "df['temperature'] = range_of_temps\n", + "plot_yes_no(df, toc: true\n", + "title = query)\n", + "```\n", + "\n", + "
\n", + "\"\"\n", + "
\n", + "\n", + "Notice how some things changed? At higher temperatures, we get extended answers." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In conclusion, we can see that we should probably stick to lower temperatures for our use case. As we go higher, the model starts being more \"creative\" and either adds extra information to the answers or gets them wrong. While this behaviour might be useful in cases like creative writing, it is not something we want in our search.\n", + "\n", + "Using LLMs can sometimes be a bit of a hit or miss. But of course, learning to control it's parameters can help us get the most out of it. This blog post was just a simple experiment, but in the deluge of content made by people who have no idea what Softmax is, I hope this was helpful." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "3.11.9", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/OpenMLxProbabl-hackathon.ipynb b/notebooks/OpenMLxProbabl-hackathon.ipynb new file mode 100644 index 0000000..17e6077 --- /dev/null +++ b/notebooks/OpenMLxProbabl-hackathon.ipynb @@ -0,0 +1,214 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "---\n", + "title: OpenML x Probabl Hackathon\n", + "topic: hacakathon\n", + "author: Subhaditya Mukherjee , Emily Chen\n", + "date: 09-19-2024\n", + "format:\n", + " html:\n", + " code-fold: false\n", + "--" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "## Introduction\n", + "\n", + "[scikit-learn](http://scikit-learn.org) is a free and open-source machine learning library for the Python programming language, while [OpenML](http://openml.org) is an open platform for sharing datasets, algorithms, and experiments. While our teams have been working together for many years, we do not always have the time to meet in person. When we do get the chance though, many interesting discussions stem from those conversations.\n", + "\n", + "We recently had a developers hackathon at the Paris office of [Probabl](https://probabl.ai/about), the official operating brand of scikit-learn, focused on maintaining and expanding open-source ML in Europe and beyond). We met to discuss not only the state of AI and how both our organizations fit in, but also to brainstorm solutions to challenges faced by our developers and communities. \n", + "\n", + "
\n", + "\"View\n", + "
\n", + "\n", + "Many interesting topics were brought up, most of which were not only relevant to us but also to the broader open-source community. In the spirit of open-source, we wanted to share these insights with you." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For more information about this hackathon or just to chat with us, feel free to reach out to us on [OpenML Email](mailto:openml-labs@tue.nl), [Probabl](https://probabl.ai/about). To contact the authors, send them an email here - [Subhaditya Mukherjee](mailto:msubhaditya@gmail.com), [Emily Chen](mailto:emilyxinyi.chen2021@gmail.com)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "## Community Engagement and Onboarding Contributors\n", + "\n", + "The focus of this discussion was around community engagement and emphasized the importance of effectively attracting, onboarding, and retaining contributors, especially newcomers.\n", + "\n", + "Over the past few years, both in OpenML and scikit-learn, we have been noticing that a majority of contributors are only active for short durations and do not stick around for too long. The ones that do are distributed between a smaller number of more experienced senior developers and a much larger pool of junior developers. This then leads to many PR's being of lower quality and thus requiring a lot more verification and correction. \n", + "\n", + "The question at hand then, is not only how we can attract new contributors to our projects but also how we can make it easier for them and our developers to maintain these projects. Some of the ideas that came up have been tried and tested by communities our colleagues have founded or been part of across the globe.\n", + "\n", + "**Takeaways:**\n", + "\n", + "- **Emotional connection** : All of the participants agreed that the most important part of any community is its people. Contributors only stick around if they have an emotional connection to either the project, or the people contributing to the project. In this vein, it would be nice if the maintainers of the project could also be present at events. It also helps if the contributors use the projects themselves.\n", + "\n", + "- **Focus on beginners** : Since we see that most of our contributors are beginners, it serves to organize sprints that are inclusive of them with a focus on beginner-friendly issues, especially documentation tasks. Having these would not only help them understand the project and contribute better, but also let them form a connection with the project.\n", + "\n", + "- **Curated issues** : Most of our external contributors have enough on their plate already. Having a curated list of issues before sprints would ensure that no time is wasted and let our contributors focus on the tasks suitable for them.\n", + "\n", + "- **Different tiers of events** : To ensure that everyone is given tasks they can handle, it would be nice to have separate events for contributors with varying levels of expertise. This would also have the added benefit of retaining beginner-friendly issues for sprints to prevent more experienced contributors from claiming them early.\n", + "\n", + "- **Mentoring** : Incorporating one-to-one mentoring to help new contributors set up their development environments would help them feel more connected to the project. 
 It is understandable that this is time-consuming, so perhaps some way of deciding who gets mentored can be set up in time.\n", + "\n", + "- **Contributor's guide**: Simplifying the contributor's guide and adding video tutorials would make it a lot more beginner-friendly, especially for those who have never filed a PR before.\n", + "\n", + "- **Semi-regular events**: Some of our colleagues found that they tended to care about a project more if there were semi-regular events they could set time aside for. Having these helped them slowly build a sense of community as well.\n", + "\n", + "- **Incentives**: A common question that many developers have is why they should even bother contributing. Helping them understand how contributing to open-source projects can aid their careers, bring them closer to the community, and also help them get internships would be a good start.\n", + "\n",
\n", + "\"Discussions\"\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Governance, Funding, and Sponsorship\n", + "\n", + "This focus of this discussion was the governance structures of open-source projects, sustainable funding models, and the role of sponsorships in supporting project activities.\n", + "\n", + "**Takeaways:**\n", + "\n", + "- **Evolving Governance** : Since governance is not static, we can treat it as a living document that evolves with the needs of the community.\n", + "\n", + "- **Communication**: It is good practise to maintain open communication channels, such as mailing lists and monthly meetings. This keeps all interested contributors in the loop.\n", + "\n", + "- **Reviews**: A major challenge is that of approving new reviews in a large organization without passing through too many hoops. How to manage this is still an open question.\n", + "\n", + "- **Sponsorship**: It would be nice to explore sponsorship models like the INRIA foundation, where companies contribute to meetings and have a say in setting priorities.\n", + "\n", + "- **Corporate partnerships**: To keep investors interested, it would also be interesting to look into corporate partnerships (similar to those used by the Linux Foundation) that are of mutual benefit." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Development Tooling and Workflows\n", + "\n", + "Setting up and maintaining CI/CD pipelines across multiple repositories and programming languages puts a huge burden on the maintainers of the repositories.\n", + "\n", + "Having to keep up with the trends and learn new technology very frequently also makes developers more reluctant to change the stack, even if doing so would be beneficial.\n", + "\n", + "This discussion looked at how to tackle these challenges and make it easier for developers to handle such complex workloads.\n", + "\n", + "**Takeaways:**\n", + "\n", + "- **Automation**: Automating as much of the CI/CD pipeline as possible by using bots for linting and code coverage checks makes PR quality control easier.\n", + "\n", + "- **Better workflows**: Migrating to Github Actions and Azure workflows for testing and deployment also seems to help significantly.\n", + "\n", + "- **Better tools**: There are many open source platforms/tools that offer comparable convenience to tools that are currently used. Some examples of these are CodeBerg and Forgejo. Eventually migrating to using these tools might also help manage projects like scikit-learn and OpenML.\n", + "\n", + "- **CircleCI**: Tools for rendering documentation examples directly in the browser can also be used to enhance the review process.\n", + "\n", + "
\n", + "\"Brainstorming\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Broader Ecosystem and Scope of Collaboration\n", + "\n", + "While both OpenML and scikit-learn focus on the open source ML community, it is sometimes hard to explain how we fit into the broader AI ecosystem. \n", + "\n", + "Especially with the rise of LLMs and Generative AI, stakeholders are somewhat inclined to think that frameworks such as ours are just not \"enough\". Of course, this is not true at all.\n", + "\n", + "**Takeaways:**\n", + "\n", + "- **Community projects**: There are so many successful examples of open source projects, and a lot can be learned from their efforts. Projects like Scientific Python for example, have very well set up CI documentation and governance processes that can be quite readily applied to any open source project.\n", + "\n", + "- **Exploring connections**: A longer term focus for both our teams would be to explore connections with other open-source ML frameworks, such as PyTorch, Tensorflow and other AutoML tools. Doing so would also help integrate and strengthen the open-source ML communities and ecosystems." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Discussion on Croissant\n", + "\n", + "Machine Learning datasets are a combination of structured and unstructured data, which makes them all the more complicated to manage. This has led to the rise of multiple \"dataset formats\" which further make it hard to consistently load data across platforms and tools.\n", + "\n", + "[Croissant](https://github.com/mlcommons/croissant) is one such dataset format which directly tackles the issue of consistency. This format is now not only compatible with the most popular ML libraries/platforms (scikit-learn, PyTorch, Tensorflow, Kaggle, Hugging Face), it is also recommended by NeurIPS (one of the topmost conferences in the AI space).\n", + "\n", + "**Features of Croissant:**\n", + "\n", + "- **Schema.org**: Croissant was built on top of schema.org with more metadata information specific to ML datasets. Since it does not require any changes to the underlying data structure, existing datasets can quite easily be converted to use it.\n", + "\n", + "- **Layers**: The format has 4 layers \\- Dataset level metadata, resource descriptions, content structure, and ML semantics. Each of which make it possible to encode and maintain structural information about datasets regardless of platform.\n", + "\n", + "- **XAI and Visualization**: Analysis and visualisation of the data works out of the box for all datasets and across multiple platforms. Croissant also supports the Core RAI vocabulary for explainable AI.\n", + "\n", + "- **Supported Platforms**: Every dataset in OpenML has a Croissant representation, while a majority of data on Kaggle and Google Dataset search also support it." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "
\n", + "\"Our\n", + "
\n", + "\n", + "Overall, this discussion was quite a successful one for both of our teams. We learnt a lot from each other and found new ways of collaborating on our shared dream of open-source ML. So much in fact, that we wanted to share our discussion with you, dear reader.\n", + "\n", + "We hope you learnt something new. We would love to welcome you to our community and would be glad to support your journey in this ML space.\n", + "\n", + "❤️ The OpenML and scikit-learn team" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "3.11.9", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}