Update the BI scripts to include filters on "disabled" campaigns #29

Open
charlesollion opened this issue May 2, 2023 · 5 comments

charlesollion commented May 2, 2023

We want to update bi.campaign_river to set the column disabled to true in the following cases:

  • the total distance is 0 km
  • the number of trash items > 10 × the number of GPS locations (this means the GPS position was not updated often enough, probably because of a bug)
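
The two conditions above can be sketched as a small Python predicate. This is only an illustration: the function and parameter names are hypothetical and not taken from the BI scripts, and the real update would presumably be written in SQL against bi.campaign_river.

```python
def should_disable(total_distance_km: float, trash_count: int,
                   gps_point_count: int, ratio_threshold: float = 10.0) -> bool:
    """Return True if a campaign matches either proposed 'disabled' condition.

    All names here are hypothetical; only the two rules come from the issue.
    """
    # Condition 1: the campaign covers no distance at all.
    if total_distance_km == 0:
        return True
    # Condition 2: far more trash items than GPS locations, which suggests
    # the GPS position was not refreshed often enough.
    return trash_count > ratio_threshold * gps_point_count
```

Note that the comparison is strict, so a campaign with exactly 10 trash items per GPS location would stay enabled under this reading of the rule.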
@charlesollion charlesollion converted this from a draft issue May 2, 2023
@charlesollion charlesollion self-assigned this May 2, 2023
@charlesollion charlesollion changed the title from "Update the BI scripts to include filters on "disabled" campaigns and propagate this information to bi.trash_river" to "Update the BI scripts to include filters on "disabled" campaigns" May 2, 2023
@cambouvy
Collaborator

First, we investigated the cut-off point for the ratio trash_count/num_gps_points. We need the value that offers the best trade-off between disabling faulty campaigns and keeping valid ones (which might simply have a lot of trash on few GPS points).

Excluding outliers, the ratio trash_count/num_gps_points has the following summary statistics:

  • mean: 9.055592
  • std: 25.523943
  • 25%: 0.000000
  • 50%: 1.500000
  • 75%: 7.729167
  • min: 0.000000
  • max: 400.000000

We discussed choosing 7.72 as the cut-off point, as this is approximately the 75th percentile (and would disable all the outliers).
However, the campaigns in this upper quartile contain 28,864 trash points, which is ≈75% of all trash points. Choosing a cut-off point of 10 instead would disable ≈67% of the trash points.
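
To make the trade-off concrete, here is a minimal sketch of how the impact of a candidate cut-off could be measured. The data below is made up, and the function name and the tuple layout are assumptions for illustration; the actual analysis was run on the BI tables.

```python
from statistics import quantiles


def cutoff_impact(campaigns, cutoff):
    """Measure what a ratio cut-off would disable.

    campaigns: list of (trash_count, num_gps_points) tuples, num_gps_points > 0.
    Returns (share of campaigns disabled, share of trash points disabled).
    """
    disabled = [(t, g) for t, g in campaigns if t / g > cutoff]
    total_trash = sum(t for t, _ in campaigns)
    disabled_trash = sum(t for t, _ in disabled)
    return len(disabled) / len(campaigns), disabled_trash / total_trash


# Made-up campaigns, NOT real data: (trash_count, num_gps_points)
campaigns = [(0, 10), (3, 2), (15, 10), (40, 5), (400, 1)]
ratios = [t / g for t, g in campaigns]

# 75th percentile of the ratio (linear interpolation between data points)
p75 = quantiles(ratios, n=4, method="inclusive")[2]

campaign_share, trash_share = cutoff_impact(campaigns, cutoff=7.0)
```

In this toy data set a cut-off of 7 disables 2 of the 5 campaigns but most of the trash points, which is the same effect as observed above: the campaigns with the highest ratios also tend to hold a disproportionate share of the trash points.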

@charlesollion any thoughts on this?

sabineallousurfrider commented May 25, 2023

Update from the Teams conversation:

We noticed that for some campaigns, distance = 0 in bi.campaign_river even though bi.campaign contains correct GPS points and a correct distance (example campaign: 751ff2bd-2b2b-4f4f-bc07-7143e6a0ed2e).

The data in the the_geom column does differ between bi.campaign and bi.campaign_river for the same id (there is only one point in bi.campaign_river). What is the difference between these two tables?

EDIT from charles: we found that river 20305 has an incomplete referential. As a result, the campaign's GPS points are all mapped to a single point on the river, hence the single point in campaign_river and distance = 0.
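
The effect described in the edit can be reproduced with a toy example. This sketch assumes that each GPS point is snapped to the nearest point of the river referential; that is a simplification made for illustration, and the real pipeline may project points onto the river geometry differently. All names and coordinates are hypothetical.

```python
from math import dist


def snap_to_reference(points, reference):
    """Map each GPS point to the nearest point of the river referential."""
    return [min(reference, key=lambda r: dist(p, r)) for p in points]


def path_length(points):
    """Sum of straight-line segment lengths between consecutive points (planar units)."""
    return sum(dist(a, b) for a, b in zip(points, points[1:]))


# A campaign track with a real, non-zero length (made-up planar coordinates)
track = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# An incomplete referential that contains a single point for the whole river
incomplete_reference = [(1.0, 1.0)]

snapped = snap_to_reference(track, incomplete_reference)
```

Here the original track has a non-zero length, but every snapped point is identical, so the snapped track has length 0. That would match a valid distance in bi.campaign alongside distance = 0 in bi.campaign_river.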

@sabineallousurfrider

I would like to add a brief remark: if we automatically disable campaigns with distance = 0, is there a risk of hiding campaigns that have only one observation, for example of an accumulation zone (observing these is one of the project's goals)?

@sabineallousurfrider sabineallousurfrider moved this from Todo to In Progress in DATA BASE Project Jun 1, 2023
@cambouvy
Collaborator

Update:
The earlier analysis used the GPS points in the_geom, which are all the GPS points recorded automatically (independently of the trash points), instead of the GPS points associated with trash points. The analysis was redone using the latter, and the cut-off point was updated. The 75th percentile of the ratio is now ~6.5. Using this as the cut-off point would remove most outliers, but it would also disable many campaigns and a large number of trash points. A cut-off point of 8 results in disabling ~41% of campaigns, which together contain ~44% of the trash points. This seems like a lot, but visual inspection of the campaigns showed many outliers. In addition, some test campaigns contain a lot of trash points, which inflates the number of trash points in the disabled campaigns.
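
As an illustration of the corrected ratio, the sketch below counts only the locations attached to trash observations. It assumes that "GPS points associated with trash points" means the distinct coordinates of the trash records; the function name and the input layout are hypothetical.

```python
def trash_location_ratio(trash_points):
    """Compute the ratio from trash observations only.

    trash_points: list of (lon, lat) tuples, one per trash observation.
    Returns (num_different_locations, trash_count / num_different_locations).
    """
    # Distinct coordinates among the trash records (assumed definition)
    num_locations = len(set(trash_points))
    if num_locations == 0:
        # Campaign without any trash observation: no meaningful ratio
        return 0, 0.0
    return num_locations, len(trash_points) / num_locations


# Made-up example: 9 observations stuck on one position, 1 elsewhere
stuck = [(-1.55, 43.48)] * 9 + [(-1.56, 43.49)]
locations, ratio = trash_location_ratio(stuck)
```

With this definition, a long track in the_geom no longer hides a GPS that stopped updating while trash was being recorded: only the positions actually attached to trash observations count in the denominator.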

cambouvy commented Jun 20, 2023

if we automatically disable campaigns with distance = 0, is there a risk of hiding campaigns that have only one observation, for example of an accumulation zone

(As discussed in the Teams chat): This should be configured differently in future campaigns, so it would affect only past ones. From what we observed, most of the campaigns with distance = 0 would still be disabled by the current ratio rule. Adding the condition num_different_locations <= 1 disables 169 additional campaigns, containing 0.009% of the total number of trash points, so it does not seem that we lose much valuable information.
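
The incremental effect of the extra condition can be estimated along these lines. This is a sketch under assumptions: it uses the cut-off of 8 discussed earlier as the "current ratio", the dictionary keys are hypothetical, and the data is made up.

```python
def ratio_of(campaign):
    """trash_count / num_different_locations, or 0.0 when there is no location."""
    n = campaign["num_different_locations"]
    return campaign["trash_count"] / n if n else 0.0


def extra_impact(campaigns, ratio_cutoff=8.0):
    """Campaigns disabled ONLY by num_different_locations <= 1.

    Returns (number of additional campaigns, their share of all trash points).
    """
    extra = [c for c in campaigns
             if ratio_of(c) <= ratio_cutoff and c["num_different_locations"] <= 1]
    total_trash = sum(c["trash_count"] for c in campaigns)
    extra_trash = sum(c["trash_count"] for c in extra)
    return len(extra), extra_trash / total_trash


# Made-up campaigns, NOT real data
campaigns = [
    {"trash_count": 1, "num_different_locations": 1},    # only caught by the new rule
    {"trash_count": 50, "num_different_locations": 1},   # already caught by the ratio
    {"trash_count": 30, "num_different_locations": 15},  # kept
    {"trash_count": 0, "num_different_locations": 0},    # only caught by the new rule
    {"trash_count": 119, "num_different_locations": 7},  # already caught by the ratio
]
extra_count, extra_trash_share = extra_impact(campaigns)
```

Campaigns that are caught only by the location rule necessarily have a low ratio on at most one location, so they hold very few trash points. This is consistent with the small share reported above.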

@charlesollion charlesollion moved this from In Progress to Done in DATA BASE Project Jun 30, 2023