bump croissant job version #3105

Merged
merged 1 commit into from
Nov 19, 2024

lhoestq
Member

@lhoestq lhoestq commented Nov 19, 2024

cc @ccl-core this way all the croissant-crumbs will be updated with the latest changes

@lhoestq lhoestq merged commit 1886a85 into main Nov 19, 2024
25 checks passed
@lhoestq lhoestq deleted the bump-croissant-version branch November 19, 2024 14:14
@ccl-core
Contributor

Thank you a lot Quentin! That's great news 🥐

@lhoestq
Member Author

lhoestq commented Nov 25, 2024

Hey @ccl-core ! All the croissant metadata are updated and it all went well :)

It's minor, but the Google Search Console notified me of just two errors, on these datasets:

It looks like their Croissant metadata are too long to be indexed. It's not a big deal, but let me know.

Parsing error: Missing ',' or '}'
Items with this issue are invalid. Invalid items are not eligible for Google Search's rich results

@ccl-core
Contributor

ccl-core commented Dec 3, 2024

Thank you @lhoestq ! I'm having a look.

@ccl-core
Contributor

ccl-core commented Dec 3, 2024

Hi @lhoestq, this might help reduce the size of our croissant definitions: #3106

We wouldn't lose much, as those names are the same as the ids, and most of the descriptions contained redundant information anyway. For datasets with a lot of fields, such as the ones you posted above, this results in a considerable size reduction.

Note that since the release of Croissant 1.0, we use ids instead of names in mlcroissant, so we can parse the new, shorter croissants without any changes to mlcroissant.
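
To make the idea concrete, here is a rough sketch of this kind of trimming. It is not the actual change in #3106: the `shrink_croissant` helper and the example metadata are hypothetical, and it assumes fields live under `recordSet[*].field[*]` and that every field-level description can be dropped.

```python
import copy
import json


def shrink_croissant(metadata: dict) -> dict:
    """Return a copy of the metadata with redundant field info removed.

    Hypothetical helper: drops a field's `name` when it equals its `@id`,
    and drops every field-level `description` (an assumption made for
    illustration; a real change may be more selective).
    """
    out = copy.deepcopy(metadata)  # leave the caller's dict untouched
    for record_set in out.get("recordSet", []):
        for field in record_set.get("field", []):
            if field.get("name") == field.get("@id"):
                field.pop("name", None)
            field.pop("description", None)
    return out


# Made-up example metadata, only to show the effect on size.
example = {
    "@type": "sc:Dataset",
    "name": "demo",
    "recordSet": [
        {
            "@id": "default",
            "field": [
                {
                    "@id": "default/text",
                    "name": "default/text",
                    "description": "Column 'text' from the parquet file.",
                    "dataType": "sc:Text",
                }
            ],
        }
    ],
}

smaller = shrink_croissant(example)
print(len(json.dumps(example)), "->", len(json.dumps(smaller)))
```

The `@id` of each field is kept, which is what matters for tools that resolve fields by id rather than by name.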

Let us know if you have any further comments or observations :)

@lhoestq
Member Author

lhoestq commented Dec 3, 2024

I see, thanks for investigating and for removing redundant info :)
