Getting Data
Before a new queryset (written from scratch or created by merging existing querysets) can be fetched, it must be published to a permanent database maintained by the service, which records all querysets submitted to it. This is done using the publish()
method:
data = new_queryset.publish()
A published queryset can be fetched using the fetch()
method:
data = new_queryset.fetch()
which can also be chained with the publish()
method:
data = new_queryset.publish().fetch()
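The chaining works because publish() returns the queryset object itself. A minimal sketch of that pattern (this is an illustration of the chaining idiom, not the actual viewser implementation):

```python
class Queryset:
    """Minimal sketch of the publish/fetch chaining pattern."""

    def __init__(self, name):
        self.name = name
        self.published = False

    def publish(self):
        # In viewser this writes the queryset to the server's permanent
        # database; here we only record the fact. Returning self is what
        # makes .publish().fetch() chaining possible.
        self.published = True
        return self

    def fetch(self):
        # In viewser this polls the server and returns a pandas dataframe;
        # here a placeholder string keeps the sketch self-contained.
        assert self.published, "queryset must be published before fetching"
        return f"data for {self.name}"

data = Queryset("my_queryset").publish().fetch()
```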
Communication between the viewser client and the server follows a simple polling model: the client repeatedly sends the queryset to the server, pausing (currently 5 seconds) between requests.
Each time, the server responds with one of
- a status message informing the user of progress on computing their queryset
- an error message detailing something that has gone wrong (in which case the client will stop sending requests to the service)
- a compressed completed dataset
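The polling loop described above can be sketched as follows. This is an illustrative simulation, not the actual viewser client code; the three response kinds and the 5-second interval are taken from the description above, and the response format is a hypothetical stand-in for the real wire protocol:

```python
import time

POLL_INTERVAL = 5  # seconds between requests, per the description above

def fetch_with_polling(send_queryset, poll_interval=POLL_INTERVAL):
    """Repeatedly send the queryset until the server returns data or an error.

    `send_queryset` stands in for one HTTP round trip; it returns a dict
    with a 'kind' of 'status', 'error', or 'data'.
    """
    while True:
        response = send_queryset()
        if response["kind"] == "status":
            print(response["message"])               # progress; keep polling
        elif response["kind"] == "error":
            raise RuntimeError(response["message"])  # client stops polling
        else:                                        # 'data': completed dataset
            return response["payload"]
        time.sleep(poll_interval)

# Simulated server: two status messages, then the completed dataset.
responses = iter([
    {"kind": "status", "message": "dispatched to database queue"},
    {"kind": "status", "message": "db fetch in progress - 3 of 4 jobs remaining"},
    {"kind": "data", "payload": b"parquet-bytes"},
])
data = fetch_with_polling(lambda: next(responses), poll_interval=0)
```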
The fetch process proceeds as follows:
(i) At the first request, the server enters the queryset into a temporary database of 'in-progress' querysets that it is currently working on. Querysets are removed from this database if they cause an error to be generated, or once the completed dataset has been sent to the client.
(ii) Once the queryset has been entered into the temporary database, it is validated to check, for example, that no non-existent database columns, aggregation functions or transforms have been requested. If validation fails, an error message is sent back to the client and the queryset is deleted from the 'in-progress' database.
(iii) If the queryset passes validation, the server compares the requested columns with what is in its cache to see whether some or all of the columns have already been computed. Already-computed columns are not computed again, and partially computed columns are 'pruned' of stages that are already in the cache, so that work is not repeated.
(iv) If raw data needs to be fetched from the database, a job is dispatched to a database queue, which fetches the missing raw data to the cache. While this is in progress, the server returns status messages to the client detailing whether the database fetch is still waiting in the queue, or its progress if the fetch has started.
(v) Once all necessary raw data is in the cache, any transforms that remain to be done are dispatched to a transform queue. During this phase, status messages are returned to the client detailing whether the transforms have been started, and what their progress is.
If errors are encountered during the database fetch or transform stages, an error message is returned to the client and the queryset is removed from the 'in-progress' database.
(vi) Otherwise, all completed queryset columns are written to the cache, then assembled into a single pandas dataframe, which is compressed into parquet format and sent back to the client. The queryset is removed from the 'in-progress' database.
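The cache pruning in step (iii) can be sketched as follows. The representation of a column as an ordered list of stages (raw fetch followed by transforms) and of the cache as a set of already-computed stage prefixes is a simplifying assumption for illustration, not the server's actual data model:

```python
def prune_column(stages, cache):
    """Drop the longest cached prefix of a column's stage list.

    `stages` is the ordered list of steps (raw fetch, then transforms) needed
    to build one column; `cache` is the set of stage prefixes (as tuples)
    already computed. Returns only the stages that still need to run.
    """
    # Find the longest prefix of the pipeline that is already in the cache.
    for cut in range(len(stages), 0, -1):
        if tuple(stages[:cut]) in cache:
            return stages[cut:]
    return stages  # nothing cached: compute from scratch

cache = {("ged_sb_raw",), ("ged_sb_raw", "tlag_1")}
# Fully cached column: nothing left to do.
assert prune_column(["ged_sb_raw", "tlag_1"], cache) == []
# Partially cached column: only the final transform remains.
assert prune_column(["ged_sb_raw", "tlag_1", "decay_12"], cache) == ["decay_12"]
```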
Note that priogrid-level dataframes, even compressed, can be large and can take significant time to download.
When a queryset is passed to the service, it is examined by a validation function which checks for easily-detected errors. Errors found by the validator will be received immediately by the client:
validation failed with illegal aggregation functions: [list of bad aggregation functions]
- indicates that one or more non-existent aggregation functions were requested
validation failed with repeated column names: [list of repeated column names]
- indicates that one or more column names have been used more than once in the queryset definition
validation failed with non-existent transforms: [list of bad transforms]
- indicates that one or more non-existent transforms were requested
validation failed with disallowed transform loas: [list of bad transform:loa combinations]
- indicates that the transform:loa pairings in the list are illegal
Other kinds of error are only detectable once processing of the queryset has started, so these errors may take considerably longer to appear:
db fetch failed - missing columns: [list of bad column names]
- indicates that the listed columns do not exist in the VIEWS database
db fetch failed, to_loa = country_month, columns = ['/base/<bad_loa>.ged_sb_best_sum_nokgi/country_month.sum'], exception = no such loa is available right now!
- indicates that when trying to fetch the column 'ged_sb_best_sum_nokgi', the source loa <bad_loa> does not exist
transform failed, file (path to transform function on server), line XX, in (transform), (specific error message)
- indicates that a transform operation failed, likely because of nonsensical parameters - the specific error message gives more details
While running, viewser attempts to keep users informed of the progress of their queryset's computation. Status messages are displayed on a single self-replacing line which starts with a counter, incremented every time the client polls the server. A queryset usually passes through two separate queues - one handles fetching of raw data from the database, the other handles transforms. A queryset which passes validation will usually be passed to the database queue first, and the user will see a message of the form
Queryset [queryset name] dispatched to database queue - n columns to compute
with n the number of columns in the queryset. This message indicates that the queryset is waiting in the database queue. Once fetching of raw data has started, the message will be replaced by one of the form
Queryset [queryset name] db fetch in progress - l of m jobs remaining
where the total number of jobs is summed over all columns and
- Fetching one raw feature from the database is 1 job
- Every transform is 1 job
- Renaming the column after all the transforms have been done is 1 job
Note that the value of m is the total number of jobs required to compute the queryset from scratch. If some of the required jobs are already in the cache, l will start out smaller than m.
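Using the job-counting rules above, the total number of jobs m can be worked out per column: one job for the raw fetch, one per transform, and one for the final rename. A small sketch (the column names and transform names are hypothetical):

```python
def total_jobs(columns):
    """Total jobs m: per column, 1 raw fetch + 1 per transform + 1 rename."""
    return sum(1 + len(transforms) + 1 for transforms in columns.values())

# Hypothetical queryset: one column with two transforms, one with none.
columns = {
    "ged_sb_tlag": ["tlag_1", "missing.fill"],  # 1 fetch + 2 transforms + 1 rename = 4
    "ged_sb_raw": [],                           # 1 fetch + 0 transforms + 1 rename = 2
}
assert total_jobs(columns) == 6
```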
If the database fetch completes without errors, the queryset will be passed to the transform queue, and a status message of the following form will be displayed:
Queryset [queryset name] dispatched to transform queue - n columns to compute
This message indicates only that the queryset is waiting in the transform queue. Once computation of transforms begins, the status message will be replaced by one of the form
Queryset [queryset name] transform in progress - l of m jobs remaining
When all transforms have completed, downloading of the completed dataframe begins. A download meter will appear to give an idea of how long the dataframe will take to download.