fix: adding support for other data types in orjson_dumps #1938

Merged
merged 4 commits into from
Oct 6, 2023

Conversation

AndreaFrancis
Contributor

@AndreaFrancis AndreaFrancis commented Oct 5, 2023

Part of #1443
Currently, we have 105 cache records with error types like:

  'Type is not JSON serializable: Timedelta',
  'Type is not JSON serializable: Timestamp',
  'Type is not JSON serializable: datetime.timedelta',
  'Type is not JSON serializable: numpy.int64',
  'Type is not JSON serializable: numpy.ndarray'

The problem occurs at https://github.com/huggingface/datasets-server/blob/main/libs/libcommon/src/libcommon/utils.py#L120, when trying to serialize the values.

After this PR, I will force refresh the affected splits (most of them are for split-first-rows-from-streaming).
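
For readers unfamiliar with the mechanism: orjson.dumps accepts a `default` callable that is invoked for values it cannot serialize natively. The sketch below illustrates the same idea with the standard library `json` module swapped in for orjson, so it runs without third-party packages. The names `json_default` and `dumps` are hypothetical, and the fallback rules are assumptions for illustration, not necessarily what this PR implements.

```python
import datetime
import json
from typing import Any


def json_default(obj: Any) -> Any:
    """Hypothetical fallback hook, called only for values the encoder rejects."""
    if isinstance(obj, datetime.datetime):
        # ISO 8601 form, with a "T" between date and time.
        return obj.isoformat()
    # numpy arrays and numpy scalars both expose tolist(); duck-typed here
    # so that the sketch does not require numpy to be installed.
    if hasattr(obj, "tolist"):
        return obj.tolist()
    # Last resort (assumption): use the string representation, e.g. for timedelta.
    return str(obj)


def dumps(content: Any) -> bytes:
    # Compact separators and bytes output, to resemble what orjson returns.
    return json.dumps(content, default=json_default, separators=(",", ":")).encode("utf-8")


row = {
    "elapsed": datetime.timedelta(hours=1, minutes=30),
    "when": datetime.datetime(2022, 1, 15),
}
print(dumps(row))  # b'{"elapsed":"1:30:00","when":"2022-01-15T00:00:00"}'
```

With orjson itself, numpy values can also be handled by passing the `orjson.OPT_SERIALIZE_NUMPY` option instead of converting them in the hook.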

Also, there is an error for "split-descriptive-statistics":

"Dict key must be str"
This was due to non-string keys in ClassLabels, such as ints. After this PR, I will force refresh the affected splits (49 records).
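
As background, orjson rejects dict keys that are not str unless the `orjson.OPT_NON_STR_KEYS` option is passed. An alternative is to normalize the keys before serializing. Below is a minimal sketch of that second approach; `stringify_keys` is a hypothetical helper and the `frequencies` payload is made up, so this is not necessarily the route taken in this PR.

```python
import json
from typing import Any


def stringify_keys(value: Any) -> Any:
    """Hypothetical helper: recursively convert every dict key to str."""
    if isinstance(value, dict):
        return {str(k): stringify_keys(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [stringify_keys(v) for v in value]
    return value


# Made-up frequencies keyed by integer class ids, standing in for a statistics payload.
frequencies = {0: 120, 1: 80}
normalized = stringify_keys(frequencies)
print(json.dumps(normalized))  # {"0": 120, "1": 80}
```

Normalizing up front keeps the stored JSON identical regardless of which serializer is used, at the cost of one extra pass over the data.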

@codecov-commenter

codecov-commenter commented Oct 5, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (8c5a3a2) 95.56% compared to head (57e548a) 90.63%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1938      +/-   ##
==========================================
- Coverage   95.56%   90.63%   -4.94%     
==========================================
  Files          14      237     +223     
  Lines         519    14841   +14322     
==========================================
+ Hits          496    13451   +12955     
- Misses         23     1390    +1367     
Flag                     Coverage Δ
jobs_cache_maintenance   95.56% <ø> (ø)
jobs_mongodb_migration   86.32% <ø> (?)
libs_libcommon           92.17% <100.00%> (?)
services_admin           85.59% <ø> (?)
services_api             86.79% <ø> (?)
services_rows            84.06% <ø> (?)
services_search          77.46% <ø> (?)
services_sse-api         94.16% <ø> (?)
services_worker          92.56% <ø> (?)

Files                                    Coverage Δ
libs/libcommon/src/libcommon/utils.py    99.11% <100.00%> (ø)
libs/libcommon/tests/test_utils.py       100.00% <100.00%> (ø)

... and 221 files with indirect coverage changes

@AndreaFrancis AndreaFrancis requested a review from a team October 5, 2023 17:20
Member

@albertvillanova albertvillanova left a comment

Thanks. Just a question below.

libs/libcommon/src/libcommon/utils.py (outdated, resolved)
Member

@lhoestq lhoestq left a comment

Nice !

Datetimes are formatted like b'"2022-01-15T00:00:00"' and numpy arrays like b'[0,1,2,3,4]', which is perfect.

@lhoestq
Member

lhoestq commented Oct 6, 2023

Cc @severo in case it requires changes in the front-end (might work out of the box, no?)

@lhoestq
Member

lhoestq commented Oct 6, 2023

This should fix #86 btw :)

@severo
Collaborator

severo commented Oct 6, 2023

By default, it will return a string. It should work.

Member

@albertvillanova albertvillanova left a comment

Just another comment: the datetime and the Timestamp are not currently aligned

  • datetime:
    b'"2017-01-01T12:10:20.345000"'
    
  • Timestamp:
    b'"2017-01-01 12:10:20.345000"'
    

If the alignment is important, this could be achieved with:

def orjson_default(obj):
    ...
    if isinstance(obj, pd.Timestamp):
        return obj.to_pydatetime()
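
The two formats above match the difference between `isoformat()` and `str()` on a datetime. pd.Timestamp is a subclass of datetime.datetime, so one plausible reading (an assumption, since the elided part of orjson_default is not shown here) is that the Timestamp was reaching a string fallback. A standard-library sketch of the two forms, using a plain datetime as a stand-in for the Timestamp:

```python
from datetime import datetime

dt = datetime(2017, 1, 1, 12, 10, 20, 345000)

iso_form = dt.isoformat()  # "T" separator, as in the datetime output above
str_form = str(dt)         # space separator, as in the Timestamp output above

print(iso_form)  # 2017-01-01T12:10:20.345000
print(str_form)  # 2017-01-01 12:10:20.345000
```

Returning a plain datetime from the hook, as suggested, lets the serializer apply its own datetime formatting to both types.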

@AndreaFrancis
Contributor Author

> Just another comment: the datetime and the Timestamp are not currently aligned

Thanks @albertvillanova! I added your suggestion.

@AndreaFrancis AndreaFrancis merged commit a55d31b into main Oct 6, 2023
19 of 20 checks passed
@AndreaFrancis AndreaFrancis deleted the orjson-dumps branch October 6, 2023 18:34
@AndreaFrancis
Contributor Author

Fixed:

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.countDocuments({"details.copied_from_artifact":{$exists:false}, error_code:"UnexpectedError", "content.error":/Dict key must be str/i})
0
Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.countDocuments({"details.copied_from_artifact":{$exists:false}, error_code:"UnexpectedError", "content.error":/Type is not JSON serializable/i})
0

@severo
Collaborator

severo commented Oct 9, 2023

you rock!

5 participants