Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pre-calculate total geo area and time max limit for project groups #780

Merged
merged 4 commits into from
Jun 13, 2023

Conversation

thenav56
Copy link
Contributor

@thenav56 thenav56 commented May 12, 2023

Changes

  • Add two new columns in the groups table to use in aggregated data calculation. We are currently calculating this in fly for each calculation run
    • total_area float
    • time_spent_max_allowed float

Why

Caching the geo areas decreases stat calculation time significantly.
For eg: For the whole 2022 year data (by calculating 1 month at a time), It took more than 5 days. After the current implementation, it took around 1-2 hours and the majority of time was spent on initial cache calculation.

Deployment steps

  • Run SQL migration script
postgres/scripts/v2_to_v3/06_alter_groups_data_add_total_area_and_time_limit_columns.sql
  • Run aggregated data collection (Optional)
docker-compose run --rm django ./manage.py update_aggregated_data

@thenav56 thenav56 force-pushed the feature/speed-up-aggregate-calculation branch 5 times, most recently from 4e135bc to b4f335a Compare May 12, 2023 12:20
@thenav56 thenav56 marked this pull request as ready for review May 12, 2023 12:47
@thenav56 thenav56 requested review from Hagellach37 and laurentS and removed request for Hagellach37 May 18, 2023 12:14
@thenav56 thenav56 force-pushed the feature/speed-up-aggregate-calculation branch from 9806151 to 5f8c32f Compare May 23, 2023 08:33
required_count,
progress,
project_type_specifics
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have expected that we add the total group area max time here as well? Or is this done in another step?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using Django for calculating this part. I have also added a comment mentioning that as well.
This can take some time to calculate and also doesn't work with existing ones as well.
So for now doing this on Django made more sense as it is calculated and maintained by Aggregated logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mhh okay, not sure if I prefer this way to split it up. Could you investigate how long this will take for a single project during project creation.

I think it's "just" adjusting the sql statement slightly and summing up the task area per group with postgis. @ElJocho might be able to advise here as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally the group information could be set directly when setting up the project, then we do not need to rely on django to perform additional steps at some point in time.

for backtrack calculation of existing projects your approach is good. I'm just concerned about new projects.

It is already difficult to understand which information is set for which project, e.g. considering the various project types. @ElJocho has worked towards refactoring this part of the code so that it will be easier to learn which attributes are expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I wanted to add this to the group as well. Considering the refactor and complex structure right now I wanted to make minimal changes to the mapswipe_workers core part as well. Currently, the calculation/test parts also fit perfectly in the aggregated module as it is tightly coupled with it and only used there.

Maybe @ElJocho can move this aggregated logic to the core part when during the refactor. The SQL query is already defined, will just need to add it to the core logic and maybe define further tests as well. Which seems to be perfect on the refactoring part.

UPDATE_PROJECT_GROUP_DATA = f"""
WITH to_calculate_groups AS (
SELECT
project_id,
group_id
FROM groups
WHERE
(project_id, group_id) in (
SELECT
MS.project_id,
MS.group_id
FROM mapping_sessions MS
WHERE
MS.start_time >= %(from_date)s
AND MS.start_time < %(until_date)s
GROUP BY MS.project_id, MS.group_id
) AND
(
total_area is NULL OR time_spent_max_allowed is NULL
)
),
groups_data AS (
SELECT
T.project_id,
T.group_id,
SUM( -- sqkm
ST_Area(T.geom::geography(GEOMETRY,4326)) / 1000000
) as total_task_group_area,
(
CASE
-- Using 95_percent value of existing data for each project_type
WHEN P.project_type = {Project.Type.BUILD_AREA.value} THEN 1.4
WHEN P.project_type = {Project.Type.COMPLETENESS.value} THEN 1.4
WHEN P.project_type = {Project.Type.CHANGE_DETECTION.value} THEN 11.2
-- FOOTPRINT: Not calculated right now
WHEN P.project_type = {Project.Type.FOOTPRINT.value} THEN 6.1
ELSE 1
END
) * COUNT(*) as time_spent_max_allowed
FROM tasks T
INNER JOIN to_calculate_groups G USING (project_id, group_id)
INNER JOIN projects P USING (project_id)
GROUP BY project_id, P.project_type, group_id
)
UPDATE groups G
SET
total_area = GD.total_task_group_area,
time_spent_max_allowed = GD.time_spent_max_allowed
FROM groups_data GD
WHERE
G.project_id = GD.project_id AND
G.group_id = GD.group_id;
"""

For backtrack calculation of existing projects your approach is good. I'm just concerned about new projects.

The aggregated module will calculate for old and new projects. Basically, if there are any projects with swipe data then the aggregated module will take care of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perfect. sounds good to me then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I have merged this PR and created a new ticket (assigned @ElJocho there)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants