diff --git a/hacktoberfest_challenges/datasets_without_language.md b/hacktoberfest_challenges/datasets_without_language.md index 25c16fe07..a36464bc3 100644 --- a/hacktoberfest_challenges/datasets_without_language.md +++ b/hacktoberfest_challenges/datasets_without_language.md @@ -71,24 +71,23 @@ datasets_with_at_least_20_downloads = [dataset for dataset in datasets if datase ## Datasets without language field filled in - | status | pr_url | hub_id | downloads | likes | |--------|-----------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|-------| -| | | [sahil2801/CodeAlpaca-20k](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k) | 2124 | 104 | +| Merged | [here](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k/discussions/5) | [sahil2801/CodeAlpaca-20k](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k) | 2124 | 104 | | | | [facebook/winoground](https://huggingface.co/datasets/facebook/winoground) | 5468 | 57 | | | | [oscar-corpus/OSCAR-2301](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301) | 7814 | 56 | | | | [MMInstruction/M3IT](https://huggingface.co/datasets/MMInstruction/M3IT) | 62902 | 47 | | | | [huggan/wikiart](https://huggingface.co/datasets/huggan/wikiart) | 344 | 38 | | | | [HuggingFaceH4/CodeAlpaca_20K](https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K) | 850 | 36 | | | | [codeparrot/self-instruct-starcoder](https://huggingface.co/datasets/codeparrot/self-instruct-starcoder) | 454 | 25 | -| | | [unaidedelf87777/openapi-function-invocations-25k](https://huggingface.co/datasets/unaidedelf87777/openapi-function-invocations-25k) | 47 | 20 | +| | [here](https://huggingface.co/datasets/unaidedelf87777/openapi-function-invocations-25k/discussions/3) | [unaidedelf87777/openapi-function-invocations-25k](https://huggingface.co/datasets/unaidedelf87777/openapi-function-invocations-25k) | 47 | 20 | | | [here](https://huggingface.co/datasets/Matthijs/cmu-arctic-xvectors/discussions/4) | [Matthijs/cmu-arctic-xvectors](https://huggingface.co/datasets/Matthijs/cmu-arctic-xvectors) | 158508 | 19 | | | | [skg/toxigen-data](https://huggingface.co/datasets/skg/toxigen-data) | 957 | 17 | | | | [oscar-corpus/colossal-oscar-1.0](https://huggingface.co/datasets/oscar-corpus/colossal-oscar-1.0) | 66 | 17 | -| | | [aadityaubhat/GPT-wiki-intro](https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro) | 267 | 15 | +| Merged | [here](https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro/discussions/2) | [aadityaubhat/GPT-wiki-intro](https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro) | 267 | 15 | | | | [codeparrot/github-jupyter-code-to-text](https://huggingface.co/datasets/codeparrot/github-jupyter-code-to-text) | 11 | 14 | | | [here](https://huggingface.co/datasets/cfilt/iitb-english-hindi/discussions/1#651ab7559c4067f3b896564f) | [cfilt/iitb-english-hindi](https://huggingface.co/datasets/cfilt/iitb-english-hindi) | 1147 | 11 | -| | | [iamtarun/python_code_instructions_18k_alpaca](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca) | 1424 | 10 | +| | [here](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca/discussions/1) | [iamtarun/python_code_instructions_18k_alpaca](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca) | 1424 | 10 | | Merged | [here](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-en/discussions/1#651ab6e569d3438f0f246312) | [argilla/databricks-dolly-15k-curated-en](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-en) | 9651261 | 9 | | | | [sander-wood/irishman](https://huggingface.co/datasets/sander-wood/irishman) | 456 | 9 | | | | [OleehyO/latex-formulas](https://huggingface.co/datasets/OleehyO/latex-formulas) | 46 | 9 | @@ -101,8 +100,8 @@ datasets_with_at_least_20_downloads = [dataset for dataset in datasets if datase | Merged | [here](https://huggingface.co/datasets/NbAiLab/norwegian-xsum/discussions/2#651b2951b08a2b1588b8d99e) | [NbAiLab/norwegian-xsum](https://huggingface.co/datasets/NbAiLab/norwegian-xsum) | 0 | 4 | | | [here](https://huggingface.co/datasets/merve/turkish_instructions/discussions/1#651ae7a8cc1c891376b4bb45) | [merve/turkish_instructions](https://huggingface.co/datasets/merve/turkish_instructions) | 36 | 4 | | | | [tianyang/repobench-c](https://huggingface.co/datasets/tianyang/repobench-c) | 240 | 3 | -| | | [HuggingFaceH4/self_instruct](https://huggingface.co/datasets/HuggingFaceH4/self_instruct) | 219 | 3 | -| | | [iamtarun/code_instructions_120k_alpaca](https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca) | 141 | 3 | +| | [here](https://huggingface.co/datasets/HuggingFaceH4/self_instruct/discussions/1) | [HuggingFaceH4/self_instruct](https://huggingface.co/datasets/HuggingFaceH4/self_instruct) | 219 | 3 | +| | [here](https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca/discussions/1) | [iamtarun/code_instructions_120k_alpaca](https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca) | 141 | 3 | | | [here](https://huggingface.co/datasets/j0selit0/insurance-qa-en/discussions/2#651ab933aa7da01954bdc21f) | [j0selit0/insurance-qa-en](https://huggingface.co/datasets/j0selit0/insurance-qa-en) | 64 | 3 | | | | [billray110/corpus-of-diverse-styles](https://huggingface.co/datasets/billray110/corpus-of-diverse-styles) | 18 | 3 | | | [here](https://huggingface.co/datasets/dmayhem93/agieval-sat-en/discussions/1#651ab8b5e8b2318cdb755b17) | [dmayhem93/agieval-sat-en](https://huggingface.co/datasets/dmayhem93/agieval-sat-en) | 87 | 2 | @@ -112,17 +111,17 @@ datasets_with_at_least_20_downloads = [dataset for dataset in datasets if datase | | [here](https://huggingface.co/datasets/philschmid/test_german_squad/discussions/1) | [philschmid/test_german_squad](https://huggingface.co/datasets/philschmid/test_german_squad) | 0 | 2 | | | | [gia-project/gia-dataset](https://huggingface.co/datasets/gia-project/gia-dataset) | 1727 | 1 | | | [here](https://huggingface.co/datasets/stas/wmt14-en-de-pre-processed/discussions/1#651ab7aa8a5c072ce16774ac) | [stas/wmt14-en-de-pre-processed](https://huggingface.co/datasets/stas/wmt14-en-de-pre-processed) | 423 | 1 | -| | | [ajaykarthick/imdb-movie-reviews](https://huggingface.co/datasets/ajaykarthick/imdb-movie-reviews) | 222 | 1 | +| | [here](https://huggingface.co/datasets/ajaykarthick/imdb-movie-reviews/discussions/1) | [ajaykarthick/imdb-movie-reviews](https://huggingface.co/datasets/ajaykarthick/imdb-movie-reviews) | 222 | 1 | | | | [MMInstruction/M3IT-80](https://huggingface.co/datasets/MMInstruction/M3IT-80) | 108 | 1 | -| | | [rizerphe/sharegpt-hyperfiltered-3k-llama](https://huggingface.co/datasets/rizerphe/sharegpt-hyperfiltered-3k-llama) | 35 | 1 | +| | [here](https://huggingface.co/datasets/rizerphe/sharegpt-hyperfiltered-3k-llama/discussions/1) | [rizerphe/sharegpt-hyperfiltered-3k-llama](https://huggingface.co/datasets/rizerphe/sharegpt-hyperfiltered-3k-llama) | 35 | 1 | | | [here](https://huggingface.co/datasets/alvations/globalvoices-en-es/discussions/1#651ab996996b00d2900f310f) | [alvations/globalvoices-en-es](https://huggingface.co/datasets/alvations/globalvoices-en-es) | 33 | 1 | | | | [ejschwartz/oo-method-test](https://huggingface.co/datasets/ejschwartz/oo-method-test) | 27 | 1 | -| | | [soymia/boudoir-dataset](https://huggingface.co/datasets/soymia/boudoir-dataset) | 25 | 1 | +| | [here](https://huggingface.co/datasets/soymia/boudoir-dataset/discussions/1) | [soymia/boudoir-dataset](https://huggingface.co/datasets/soymia/boudoir-dataset) | 25 | 1 | | | | [strombergnlp/offenseval_2020](https://huggingface.co/datasets/strombergnlp/offenseval_2020) | 24 | 1 | | | [here](https://huggingface.co/datasets/vhtran/de-en-2023/discussions/1#651aba022bc734f0fa0c36af) | [vhtran/de-en-2023](https://huggingface.co/datasets/vhtran/de-en-2023) | 23 | 1 | | | | [cw1521/ember2018-malware](https://huggingface.co/datasets/cw1521/ember2018-malware) | 17 | 1 | | | [here](https://huggingface.co/datasets/AgentWaller/german-formatted-oasst1/discussions/1) | [AgentWaller/german-formatted-oasst1](https://huggingface.co/datasets/AgentWaller/german-formatted-oasst1) | 15 | 1 | -| | | [Senem/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data](https://huggingface.co/datasets/Senem/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data) | 12 | 1 | +| | [here](https://huggingface.co/datasets/Senem/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data/discussions/1) | [Senem/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data](https://huggingface.co/datasets/Senem/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data) | 12 | 1 | | Merged | [here](https://huggingface.co/datasets/Photolens/oasst1-en/discussions/2#651aba64e8b2318cdb759528) | [Photolens/oasst1-en](https://huggingface.co/datasets/Photolens/oasst1-en) | 10 | 1 | | | [here](https://huggingface.co/datasets/vhtran/id-en/discussions/1#651ababdc4fdc1c93efb0f2b) | [vhtran/id-en](https://huggingface.co/datasets/vhtran/id-en) | 8 | 1 | | Merged | [here](https://huggingface.co/datasets/openmachinetranslation/tatoeba-en-fr/discussions/1#651aba96b693acb51958884b) | [openmachinetranslation/tatoeba-en-fr](https://huggingface.co/datasets/openmachinetranslation/tatoeba-en-fr) | 8 | 1 | @@ -164,8 +163,8 @@ datasets_with_at_least_20_downloads = [dataset for dataset in datasets if datase | | | [Isaak-Carter/Function_Calling_Private_GG](https://huggingface.co/datasets/Isaak-Carter/Function_Calling_Private_GG) | 43 | 0 | | | [here](https://huggingface.co/datasets/stas/wmt16-en-ro-pre-processed/discussions/1#651ab96911f562eb7f04aa5e) | [stas/wmt16-en-ro-pre-processed](https://huggingface.co/datasets/stas/wmt16-en-ro-pre-processed) | 40 | 0 | | Merged | [here](https://huggingface.co/datasets/paoloitaliani/news_articles/discussions/1) | [paoloitaliani/news_articles](https://huggingface.co/datasets/paoloitaliani/news_articles) | 40 | 0 | -| | | [pszemraj/simplepile-lite](https://huggingface.co/datasets/pszemraj/simplepile-lite) | 33 | 0 | -| | | [webimmunization/COVID-19-conspiracy-theories-tweets](https://huggingface.co/datasets/webimmunization/COVID-19-conspiracy-theories-tweets) | 31 | 0 | +| Merged | [here](https://huggingface.co/datasets/pszemraj/simplepile-lite/discussions/1) | [pszemraj/simplepile-lite](https://huggingface.co/datasets/pszemraj/simplepile-lite) | 33 | 0 | +| | [here](https://huggingface.co/datasets/webimmunization/COVID-19-conspiracy-theories-tweets/discussions/2) | [webimmunization/COVID-19-conspiracy-theories-tweets](https://huggingface.co/datasets/webimmunization/COVID-19-conspiracy-theories-tweets) | 31 | 0 | | | | [rdpahalavan/UNSW-NB15](https://huggingface.co/datasets/rdpahalavan/UNSW-NB15) | 30 | 0 | | | | [marekk/testing_dataset_article_category](https://huggingface.co/datasets/marekk/testing_dataset_article_category) | 28 | 0 | | Merged | [here](https://huggingface.co/datasets/Suchinthana/Databricks-Dolly-15k-si-en-mix/discussions/1#651ab9d4c69ca64b8dac2f8e) | [Suchinthana/Databricks-Dolly-15k-si-en-mix](https://huggingface.co/datasets/Suchinthana/Databricks-Dolly-15k-si-en-mix) | 24 | 0 | @@ -176,7 +175,7 @@ datasets_with_at_least_20_downloads = [dataset for dataset in datasets if datase | | [here](https://huggingface.co/datasets/W4nkel/turkish-sentiment-dataset/discussions/1#651ae7c3ad11961965111641) | [W4nkel/turkish-sentiment-dataset](https://huggingface.co/datasets/W4nkel/turkish-sentiment-dataset) | 16 | 0 | | | | [irds/nyt](https://huggingface.co/datasets/irds/nyt) | 15 | 0 | | | [here](https://huggingface.co/datasets/pere/italian_tweets_500k/discussions/1) | [pere/italian_tweets_500k](https://huggingface.co/datasets/pere/italian_tweets_500k) | 14 | 0 | -| | | [generative-newsai/news-unmasked](https://huggingface.co/datasets/generative-newsai/news-unmasked) | 12 | 0 | +| | [here](https://huggingface.co/datasets/generative-newsai/news-unmasked/discussions/1) | [generative-newsai/news-unmasked](https://huggingface.co/datasets/generative-newsai/news-unmasked) | 12 | 0 | | | | [irds/dpr-w100](https://huggingface.co/datasets/irds/dpr-w100) | 12 | 0 | | | [here](https://huggingface.co/datasets/pere/italian_tweets_10M/discussions/1) | [pere/italian_tweets_10M](https://huggingface.co/datasets/pere/italian_tweets_10M) | 11 | 0 | | | [here](https://huggingface.co/datasets/vhtran/de-en/discussions/1#651abad1b61121b12838a021) | [vhtran/de-en](https://huggingface.co/datasets/vhtran/de-en) | 8 | 0 | @@ -350,4 +349,4 @@ datasets_with_at_least_20_downloads = [dataset for dataset in datasets if datase | | [here](https://huggingface.co/datasets/Memis/turkishReviews-ds-mini/discussions/1#651aec95a6e00a1678c00c78) | [Memis/turkishReviews-ds-mini](https://huggingface.co/datasets/Memis/turkishReviews-ds-mini) | 0 | 0 | | | [here](https://huggingface.co/datasets/PulsarAI/turkish_movie_sentiment/discussions/1#651aecb96ef522c487d5ef62) | [PulsarAI/turkish_movie_sentiment](https://huggingface.co/datasets/PulsarAI/turkish_movie_sentiment) | 0 | 0 | | Merged | [here](https://huggingface.co/datasets/ahmet1338/turkishReviews-ds-mini/discussions/1#651aecc4d76ad9bc085fe5e5) | [ahmet1338/turkishReviews-ds-mini](https://huggingface.co/datasets/ahmet1338/turkishReviews-ds-mini) | 0 | 0 | -| | [here](https://huggingface.co/datasets/nogyxo/question-answering-ukrainian-json-answers/discussions/1) | [nogyxo/question-answering-ukrainian-json-answers](https://huggingface.co/datasets/nogyxo/question-answering-ukrainian-json-answers) | 0 | 0 | +| | [here](https://huggingface.co/datasets/nogyxo/question-answering-ukrainian-json-answers/discussions/1) | [nogyxo/question-answering-ukrainian-json-answers](https://huggingface.co/datasets/nogyxo/question-answering-ukrainian-json-answers) | 0 | 0 | \ No newline at end of file