Job to run maintenance actions on the dataset viewer.
Available actions:
- backfill: backfill the cache (i.e. create jobs to add the missing entries or update the outdated entries)
- backfill-retryable-errors: backfill the cache for retryable errors
- collect-cache-metrics: compute and store the cache metrics
- collect-queue-metrics: compute and store the queue metrics
- post-messages: post messages in Hub discussions
- skip: do nothing
The script can be configured using environment variables. They are grouped by scope.
- CACHE_MAINTENANCE_ACTION: the action to launch, among backfill, backfill-retryable-errors, collect-cache-metrics, collect-queue-metrics and post-messages. Defaults to skip.
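As a rough illustration of how the action could be selected from the environment, here is a minimal sketch. The helper name `read_action` and the choice to raise on an unknown value are assumptions made for this example, not the job's actual code; only the variable name, the action names, and the `skip` default come from this README.

```python
import os

# Action names as listed in this README; "skip" is the documented default.
ACTIONS = (
    "backfill",
    "backfill-retryable-errors",
    "collect-cache-metrics",
    "collect-queue-metrics",
    "post-messages",
    "skip",
)


def read_action(environ=None):
    """Return the configured action, falling back to "skip" when unset.

    Hypothetical helper for illustration only. Rejecting unknown values
    with ValueError is an assumption, not documented behavior.
    """
    env = os.environ if environ is None else environ
    action = env.get("CACHE_MAINTENANCE_ACTION", "skip")
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    return action
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.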
See ../../libs/libcommon/README.md for the following configurations:
- Cached Assets
- Assets
- S3
- Cache
- Queue
- DIRECTORY_CLEANING_CACHE_DIRECTORY: location of the directory to clean up.
- DIRECTORY_CLEANING_SUBFOLDER_PATTERN: subfolder pattern inside the cache directory.
- DIRECTORY_CLEANING_EXPIRED_TIME_INTERVAL_SECONDS: time in seconds after which a file is deleted, counted from its last access time.
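The three directory-cleaning settings suggest an expiry rule based on last access time. The sketch below shows one way such a rule could work; the function name `clean_directory`, its parameters, and the use of glob matching for the subfolder pattern are assumptions for illustration and may differ from what the job really does.

```python
import glob
import os
import time


def clean_directory(cache_directory, subfolder_pattern, expired_seconds, now=None):
    """Delete files matched by the pattern that were last accessed too long ago.

    Hypothetical sketch: the pattern is interpreted as a glob relative to
    cache_directory (an assumption). Returns the list of deleted paths.
    """
    now = time.time() if now is None else now
    deleted = []
    pattern = os.path.join(cache_directory, subfolder_pattern)
    for path in glob.glob(pattern, recursive=True):
        if not os.path.isfile(path):
            continue  # only regular files are candidates for deletion
        last_access = os.stat(path).st_atime
        if now - last_access > expired_seconds:
            os.remove(path)
            deleted.append(path)
    return deleted
```

The optional `now` argument exists only so the expiry check can be tested with a fixed clock.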
Set environment variables to configure the post-messages job:
- DISCUSSIONS_BOT_ASSOCIATED_USER_NAME: name of the Hub user associated with the dataset viewer bot app.
- DISCUSSIONS_BOT_TOKEN: token of the dataset viewer bot used to post messages in Hub discussions.
- DISCUSSIONS_PARQUET_REVISION: revision (branch) where the converted Parquet files are stored.
See ../../libs/libcommon/README.md for more information about the common configuration.
To launch the job:
make run