Stats for datetimes #3007
Open. polinaeterna wants to merge 51 commits into main from datetime-stats
Changes from 10 commits
Commits
79790e0 compute stats for datetimes (polinaeterna)
12f78cc Merge branch 'main' into datetime-stats (polinaeterna)
851ec1b fix typing (polinaeterna)
3347c13 add testcase (polinaeterna)
0340b54 moar tests: column with nulls and all nulls column (polinaeterna)
4cd6e0d Merge branch 'main' into datetime-stats (polinaeterna)
434b2d8 add datetime to worker (polinaeterna)
2604587 add test (polinaeterna)
913f812 include timezone aware (polinaeterna)
06c1ae5 Merge branch 'main' into datetime-stats (polinaeterna)
7f7ecab Merge branch 'main' into datetime-stats (polinaeterna)
d517393 refactor (polinaeterna)
7046d8b fix (polinaeterna)
945dff0 do not typecheck dateutil (polinaeterna)
d91d365 Merge branch 'main' into datetime-stats (polinaeterna)
bdec2e4 fix (polinaeterna)
f9ffe82 more tests (polinaeterna)
d2c37c6 fix string to datetime conversion: add format inferring (polinaeterna)
658719e fix style (polinaeterna)
5c2d94a fix check for datetime (polinaeterna)
359a30b minor (polinaeterna)
0744e07 mypy (polinaeterna)
53e2100 add testcase (polinaeterna)
a61108f Merge branch 'main' into datetime-stats (polinaeterna)
c63e70e Merge branch 'main' into datetime-stats (polinaeterna)
70197aa Merge branch 'datetime-stats' of github.com:huggingface/datasets-serv… (polinaeterna)
3df6264 fix? (polinaeterna)
812bf36 add example to docs (polinaeterna)
c68efb7 fix + add tz string (%Z) to formats (polinaeterna)
351ef5c test for string timezone (polinaeterna)
787ad3b try to debug (polinaeterna)
5163500 test identify_datetime_format (polinaeterna)
033e29e test datetime.strptime (polinaeterna)
349b651 test (polinaeterna)
6c60c27 Update services/worker/src/worker/statistics_utils.py (polinaeterna)
db10500 keep original timezone for string dates (polinaeterna)
8794b7a let polars identify datetime format by itself (polinaeterna)
e0e7c91 do not display +0000 in timestamps (if timezone is UTC) (polinaeterna)
8afade1 remove utils test (polinaeterna)
341676c refactor: identify datetime format manually only when polars failed (polinaeterna)
3b5d950 style (polinaeterna)
21977db log formats in error message (polinaeterna)
0ee76bf update openapi specs (polinaeterna)
b7fee0b fallback to string stats if datetime didn't work (polinaeterna)
6a76dd9 fix test (polinaeterna)
f3eefea update docs (polinaeterna)
a79eb79 Merge branch 'main' into datetime-stats (polinaeterna)
1df95ff fix openapi specs (polinaeterna)
2f27846 Merge branch 'main' into datetime-stats (polinaeterna)
f9d7a8a fix polars timezone switching (polinaeterna)
720aab9 Merge branch 'main' into datetime-stats (polinaeterna)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,6 @@ | ||
# SPDX-License-Identifier: Apache-2.0 | ||
# Copyright 2024 The HuggingFace Authors. | ||
import datetime | ||
import enum | ||
import io | ||
import logging | ||
|
@@ -50,24 +51,41 @@ class ColumnType(str, enum.Enum): | |
STRING_TEXT = "string_text" | ||
AUDIO = "audio" | ||
IMAGE = "image" | ||
DATETIME = "datetime" | ||
|
||
|
||
class Histogram(TypedDict): | ||
hist: list[int] | ||
bin_edges: list[Union[int, float]] | ||
|
||
|
||
class DatetimeHistogram(TypedDict): | ||
hist: list[int] | ||
bin_edges: list[str] # edges are string representations of dates | ||
|
||
|
||
class NumericalStatisticsItem(TypedDict): | ||
nan_count: int | ||
nan_proportion: float | ||
min: Optional[float] # might be None in very rare cases when the whole column is only None values | ||
max: Optional[float] | ||
min: Optional[Union[int, float]] # might be None in very rare cases when the whole column is only None values | ||
max: Optional[Union[int, float]] | ||
mean: Optional[float] | ||
median: Optional[float] | ||
std: Optional[float] | ||
histogram: Optional[Histogram] | ||
|
||
|
||
class DatetimeStatisticsItem(TypedDict): | ||
nan_count: int | ||
nan_proportion: float | ||
min: Optional[str] # might be None in very rare cases when the whole column is only None values | ||
max: Optional[str] | ||
mean: Optional[str] | ||
median: Optional[str] | ||
std: Optional[str] # string representation of timedelta | ||
histogram: Optional[DatetimeHistogram] | ||
|
||
|
||
class CategoricalStatisticsItem(TypedDict): | ||
nan_count: int | ||
nan_proportion: float | ||
|
@@ -83,7 +101,9 @@ class BoolStatisticsItem(TypedDict): | |
frequencies: dict[str, int] | ||
|
||
|
||
SupportedStatistics = Union[NumericalStatisticsItem, CategoricalStatisticsItem, BoolStatisticsItem] | ||
SupportedStatistics = Union[ | ||
NumericalStatisticsItem, CategoricalStatisticsItem, BoolStatisticsItem, DatetimeStatisticsItem | ||
] | ||
|
||
|
||
class StatisticsPerColumnItem(TypedDict): | ||
|
@@ -699,3 +719,97 @@ def get_shape(example: Optional[Union[bytes, dict[str, Any]]]) -> Union[tuple[No | |
@classmethod | ||
def transform(cls, example: Optional[Union[bytes, dict[str, Any]]]) -> Optional[int]: | ||
return cls.get_width(example) | ||
|
||
|
||
class DatetimeColumn(Column): | ||
transform_column = IntColumn | ||
|
||
@classmethod | ||
def compute_transformed_data( | ||
cls, | ||
data: pl.DataFrame, | ||
column_name: str, | ||
transformed_column_name: str, | ||
min_date: datetime.datetime, | ||
) -> pl.DataFrame: | ||
return data.select((pl.col(column_name) - min_date).dt.total_seconds().alias(transformed_column_name)) | ||
Review comment: operate on seconds |
||
|
||
@staticmethod | ||
def shift_and_convert_to_string(base_date: datetime.datetime, seconds: Union[int, float]) -> str: | ||
return datetime_to_string(base_date + datetime.timedelta(seconds=seconds)) | ||
|
||
@classmethod | ||
def _compute_statistics( | ||
cls, | ||
data: pl.DataFrame, | ||
column_name: str, | ||
n_samples: int, | ||
) -> DatetimeStatisticsItem: | ||
nan_count, nan_proportion = nan_count_proportion(data, column_name, n_samples) | ||
if nan_count == n_samples: # all values are None | ||
return DatetimeStatisticsItem( | ||
nan_count=n_samples, | ||
nan_proportion=1.0, | ||
min=None, | ||
max=None, | ||
mean=None, | ||
median=None, | ||
std=None, | ||
histogram=None, | ||
) | ||
|
||
min_date: datetime.datetime = data[column_name].min() # type: ignore # mypy infers type of datetime column .min() incorrectly | ||
timedelta_column_name = f"{column_name}_timedelta" | ||
# compute distribution of time passed from min date in **seconds** | ||
timedelta_df = cls.compute_transformed_data(data, column_name, timedelta_column_name, min_date) | ||
timedelta_stats: NumericalStatisticsItem = cls.transform_column.compute_statistics( | ||
timedelta_df, | ||
column_name=timedelta_column_name, | ||
n_samples=n_samples, | ||
) | ||
# to assure mypy that these values are not None before passing them to conversion functions: | ||
assert timedelta_stats["histogram"] is not None # nosec | ||
assert timedelta_stats["max"] is not None # nosec | ||
assert timedelta_stats["mean"] is not None # nosec | ||
assert timedelta_stats["median"] is not None # nosec | ||
assert timedelta_stats["std"] is not None # nosec | ||
|
||
datetime_bin_edges = [ | ||
cls.shift_and_convert_to_string(min_date, seconds) for seconds in timedelta_stats["histogram"]["bin_edges"] | ||
] | ||
|
||
return DatetimeStatisticsItem( | ||
nan_count=nan_count, | ||
nan_proportion=nan_proportion, | ||
min=datetime_to_string(min_date), | ||
max=cls.shift_and_convert_to_string(min_date, timedelta_stats["max"]), | ||
mean=cls.shift_and_convert_to_string(min_date, timedelta_stats["mean"]), | ||
median=cls.shift_and_convert_to_string(min_date, timedelta_stats["median"]), | ||
std=str(datetime.timedelta(seconds=timedelta_stats["std"])), | ||
histogram=DatetimeHistogram( | ||
hist=timedelta_stats["histogram"]["hist"], | ||
bin_edges=datetime_bin_edges, | ||
), | ||
) | ||
|
||
def compute_and_prepare_response(self, data: pl.DataFrame) -> StatisticsPerColumnItem: | ||
stats = self.compute_statistics(data, column_name=self.name, n_samples=self.n_samples) | ||
return StatisticsPerColumnItem( | ||
column_name=self.name, | ||
column_type=ColumnType.DATETIME, | ||
column_statistics=stats, | ||
) | ||
|
||
|
||
def datetime_to_string(dt: datetime.datetime, format: str = "%Y-%m-%d %H:%M:%S%z") -> str: | ||
""" | ||
Convert a datetime.datetime object to a string. | ||
|
||
Args: | ||
dt (datetime): The datetime object to convert. | ||
format (str, optional): The format of the output string. Defaults to "%Y-%m-%d %H:%M:%S%z". | ||
|
||
Returns: | ||
str: The datetime object as a string. | ||
""" | ||
return dt.strftime(format) |
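The approach in `_compute_statistics` above (subtract the minimum date, compute numerical statistics on the offsets in seconds, then shift the results back to dates) can be sketched with the standard library alone. This is only an illustration, not the PR's implementation: `datetime_stats` is a hypothetical helper, it uses `statistics` instead of polars and `IntColumn`, it omits the histogram, and it assumes the sample standard deviation is the one wanted.

```python
import datetime
import statistics

# Same default format as datetime_to_string in the diff.
FORMAT = "%Y-%m-%d %H:%M:%S%z"


def datetime_stats(values):
    """Summarize datetimes via their offsets in seconds from the minimum.

    Hypothetical helper: `values` is a list of datetimes where None marks a
    missing value. The histogram from the PR is left out for brevity.
    """
    present = [v for v in values if v is not None]
    nan_count = len(values) - len(present)
    keys = ("min", "max", "mean", "median", "std")
    if not present:  # all values are None
        return {"nan_count": nan_count, **{key: None for key in keys}}

    min_date = min(present)
    # Distribution of time elapsed since min_date, in seconds.
    seconds = [(v - min_date).total_seconds() for v in present]

    def shift(offset):
        # Map an offset in seconds back to a formatted date.
        return (min_date + datetime.timedelta(seconds=offset)).strftime(FORMAT)

    # Assumption: sample standard deviation; a single value has no spread.
    std = statistics.stdev(seconds) if len(seconds) > 1 else 0.0
    return {
        "nan_count": nan_count,
        "min": min_date.strftime(FORMAT),
        "max": shift(max(seconds)),
        "mean": shift(statistics.fmean(seconds)),
        "median": shift(statistics.median(seconds)),
        "std": str(datetime.timedelta(seconds=std)),
    }


example = datetime_stats(
    [
        datetime.datetime(2024, 1, 1),
        datetime.datetime(2024, 1, 3),
        None,
        datetime.datetime(2024, 1, 2),
    ]
)
```

Working on offsets keeps the numerical code unchanged: only the conversion to and from seconds is datetime-specific, and `std` is naturally a duration rather than a date.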
orjson supports datetime serialization though, maybe I should return datetimes then?
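To make the trade-off in this question concrete, here is a small sketch comparing the string format the diff produces with ISO 8601 output. It uses the standard library `json` module with a `default` hook as a stand-in for a serializer that handles datetimes itself; the helper names are hypothetical, and the assumption that orjson emits RFC 3339 style strings comparable to `isoformat()` should be checked against its documentation.

```python
import datetime
import json

# Default format of datetime_to_string in the diff.
PR_FORMAT = "%Y-%m-%d %H:%M:%S%z"

aware = datetime.datetime(2024, 1, 1, 12, 30, tzinfo=datetime.timezone.utc)
naive = datetime.datetime(2024, 1, 1, 12, 30)


def as_pr_string(dt):
    # What the PR returns today: %z expands to "" for naive datetimes.
    return dt.strftime(PR_FORMAT)


def as_iso_string(dt):
    # ISO 8601 string with a "T" separator and a colon in the UTC offset.
    return dt.isoformat()


# Returning datetime objects pushes formatting to the serializer; here the
# `default` hook is called for the datetime because json cannot encode it.
payload = json.dumps({"min": aware}, default=as_iso_string)
```

With the current approach the response format is fixed by `datetime_to_string`; returning datetimes instead would tie the format to whatever the serializer emits.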