Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Mapillary Image downloader, add GeoPandas Parser #18

Merged
merged 7 commits into from
Mar 14, 2024

Conversation

dragonejt
Copy link
Contributor

Changes

  • Add Mapillary image downloader script that downloads an image from coordinates (a shapely Point).
  • Add GeoPandas dataframe parser class that can extract a list of Points from the .gpkg GeoDataFrame file outputted by create_points.py
  • Replace requirements.txt with a dependencies section in pyproject.toml, update all installation commands to refer to pyproject.toml
  • Works towards use Mapillary API to retrieve image for each point #7

Testing

  • Running python -m src.mapillary "[MAPILLARY_CLIENT_TOKEN]" [POINTS_FILE] [IMAGE_PATH] succeeds on the Three_Rivers_Michigan_USA_points.gpkg generated by create_points.py and downloads images to the selected image path

Risks

  • The Mapillary API sometimes returns a query timeout exception (HTTP 400), which resolves if that query is tried again a few more times. Sample Query
{
    "error": {
        "message": "Timeout",
        "type": "OAuthException",
        "code": -2,
        "error_subcode": 3404014,
        "is_transient": false,
        "error_user_title": "Query Timeout",
        "error_user_msg": "The query timed out, please retry the query",
        "fbtrace_id": "A6w92zlxmQCqBiboCh9K8te"
    }
}
  • I have added 3 retries to each Mapillary API call, but in large lists of coordinates I am still seeing the query timeout exceptions.

…equirements.txt with dependencies in pyproject.toml
@dragonejt dragonejt added the enhancement New feature or request label Feb 29, 2024
@dragonejt dragonejt linked an issue Feb 29, 2024 that may be closed by this pull request
@danbjoseph
Copy link
Member

  • i added a few things in a docs section for this step, please edit as needed.
  • when getting setup i ran into some issues. however, i don't think it's related to any changes introduced in this PR. see docs clarification - more details about activating virtual environment? make requirements fails? #19
  • can you write the image file name to each point in a new field? and also record if there wasn't an image found or the API request failed?
  • is there any other metadata available from Mapillary for the image (or from the image search) that we should record?

@dragonejt
Copy link
Contributor Author

  • I commented on bullet number 2 in the related issue.
  • So the mapillary downloader should also write a file called image_data.json to the same folder. That image_data.json contains a list of the following:
  {
    "latitude": 41.93223076003517, // latitude of requested point from create_points.py
    "longitude": -85.65572922086078, // longitude of requested point from create_points.py
    "image_id": "332637408194005" | null, // First image id from mapillary API for requested point, null if no images found in that area or API request failed (timeout)
    "image_path": "data/raw/mapillary/332637408194005.jpeg" | null // Downloaded image path, null if no images found in that area or API request failed (timeout)
  }

Is this what you wanted for bullet number 3?

  • This is the list of metadata we can get about an image, I'm not sure what metadata we would want to add.

@danbjoseph
Copy link
Member

danbjoseph commented Mar 2, 2024

ah, i interrupted the process before it fetched every image, so it didn't write the json. for much larger point sets, will it be a problem that it is storing everything for the json in memory until the end and writing it all at once?

at the end of the whole process, i think it will help the user to troubleshoot and give them more flexibility in analysis if more info is in one place inside the geopackage. right now each point in the geopackage has fid and osm_id properties - the osm_id being the identifier of the road segment the point was sampled from. i'd like to add an image_id property. can we also write img_lat and img_lon properties? since the sampled point and the image. location will likely be different by some meters and i'm not yet sure in the analysis and visualization phase which coordinates we will want to use. i think we can ignore the other metadata from Mapillary at this time. thank you.
Screenshot 2024-03-02 at 11 41 57 AM

@dragonejt
Copy link
Contributor Author

Yeah the image_data.json will not appear if it is interrupted before finishing. We could maybe try/catch any KeyboardInterrupt (Ctrl+C) to write the image_data.json before actually exiting.

Currently I extract the list of points (longitude/latitude) from the geopackage before sending them to the mapillary client, so by the time they get to the mapillary client they already don't have access to the original geopackage. Let me think of a way to send the entire geopackage over.

pyproject.toml Outdated Show resolved Hide resolved
src/mapillary.py Outdated Show resolved Hide resolved
src/mapillary.py Outdated Show resolved Hide resolved
src/mapillary.py Outdated Show resolved Hide resolved
@danbjoseph
Copy link
Member

@jayqi do you have any suggestions for how to track each point through this step so that the image details can be added back to the correct point feature in the geodata?

@dragonejt
Copy link
Contributor Author

ah, i interrupted the process before it fetched every image, so it didn't write the json. for much larger point sets, will it be a problem that it is storing everything for the json in memory until the end and writing it all at once?

at the end of the whole process, i think it will help the user to troubleshoot and give them more flexibility in analysis if more info is in one place inside the geopackage. right now each point in the geopackage has fid and osm_id properties - the osm_id being the identifier of the road segment the point was sampled from. i'd like to add an image_id property. can we also write img_lat and img_lon properties? since the sampled point and the image. location will likely be different by some meters and i'm not yet sure in the analysis and visualization phase which coordinates we will want to use. i think we can ignore the other metadata from Mapillary at this time. thank you. Screenshot 2024-03-02 at 11 41 57 AM

Actually, this may influence the separate GeoPandasParser vs put it all in the same function debate. Did you want the Mapillary Image downloader to write to a new (next-step) file, or update the same file?

@danbjoseph
Copy link
Member

danbjoseph commented Mar 5, 2024

[edited based on @jayqi response below] @dragonejt it can write a new file or update the same file, but in either case it does need to carry through data from the previous file.

@jayqi
Copy link
Contributor

jayqi commented Mar 5, 2024

I strongly recommend that we write a new file. The workflow is much easier to reason about and reproduce if we treat the data outputs at each step as immutable, and the overall workflow as a directed acyclic graph (DAG). Here's some discussion about why this is a useful framing.

@dragonejt
Copy link
Contributor Author

dragonejt commented Mar 6, 2024

[edited based on @jayqi response below] @dragonejt it can write a new file or update the same file, but in either case it does need to carry through data from the previous file.

So I did some experimentation with this, and GeoDataFrames only support having one column with geospatial data. This means that I cannot add the coordinates of the downloaded image as another column in the GeoDataFrame. Adding the image id and image path work fine as they are both strings. Do we want to ignore the coordinates of the image then, or encode it as strings/numbers?

The error I was getting is ValueError: GeoDataFrame contains multiple geometry columns but GeoDataFrame.to_file supports only a single geometry column. Use a GeoDataFrame.to_parquet or GeoDataFrame.to_feather, drop additional geometry columns or convert them to a supported format like a well-known text (WKT) using GeoSeries.to_wkt().

@danbjoseph
Copy link
Member

danbjoseph commented Mar 6, 2024

It also appears that multiple coordinates will have the same closest image from the Mapillary API, is this something that we are OK with?

Looks like you edited to remove this? We don't want to use an image more than once.

In our Mapillary query we should be restricting the search to within 10m of the sample point and the sample points are 20 m apart. However, they are 20m apart along the road line features (not a plane) and roads may curve or 2 roads may be closer than 20m.
PXL_20240306_140530383

Regardless of the above, because of the complexities of recording points on the lumpy, curved surface of the Earth and projecting to a 2D representation - I don't think we could rely on the bounding box to eliminate the possibilities of duplicate images.
I reprojected to EPSG:32616, WGS 84, UTM 16N (given the location of our test data),
Screenshot 2024-03-06 at 10 25 10 AM
So that I could do geo-spatial functions in meters instead of degrees... and in addition to overlaps of the 10m buffers of each point at intersections and where roads are otherwise close to each other, the 20m as measured in the createPoints step isn't 20m when calculated with this method (see the slight overlap between all adjacent points).
Screenshot 2024-03-06 at 10 21 29 AM

Can you calculate the distance from the sample point to the image point as you download the images. Then when you are done go through and for each image key that is matched to more than one sample point, throw out the image key for all but the sample point + image key with the shortest distance? The coordinates are in EPSG:4326, WGS84 used in GPS which is global and has a unit of degrees. For more definitive distance measurements we would need to transform to a projection such as EPSG:32618, WGS84 / UTM zone 18N (okay for DC) and which has a unit of meters. But that's a pain and we don't know which projection to use each time as we are making a global tool. It is possible to calculate distance in EPSG:4326, it is not super accurate over long distances but for deciding which pair of points is closest, I think it should be fine.

For example, in the below mock-up (ignore the NULL values in the img_lat and img_lon columns - those would be decimal values) it would be after downloading all the images. Then for all the rows that have img_file=a.png it would look at the dist value (the calculated number of meters between the sample point and the image point) and delete the img attributes for rows 1 and 3 (or somehow mark them as a duplicate). Then do the same for img_file=c.png keeping the img details for row 5 (the smaller distance). And so on...?
Screenshot 2024-03-06 at 3 58 32 PM

For the image point, can you encode it as a string/numbers? A WKT format string would look like POINT(0 0) and would be one column, or it could be separate string values in img_lat and img_lon columns.

@dragonejt
Copy link
Contributor Author

dragonejt commented Mar 6, 2024

Looks like you edited to remove this? We don't want to use an image more than once.

For that, I tested again and there was an issue with my code, now it only uses one image per set of coordinates, and returns nothing if there isn't an image within that bounding box. I can take a look at finding the closest image, since currently it returns all images within the bounding box, and doesn't sort them by closeness to the original point.

@dragonejt
Copy link
Contributor Author

Hey Dan, so the current way I'm avoiding to have the same picture for multiple points is to store a set of all downloaded image IDs while I'm downloading, and filtering out the existing images before searching for the closest image. I think this is a simpler method and avoids the two passes, However, with this method, while that image may be the closest unique image for a point, that point may not be the closest point to that image. Is this method OK, or do you want to ensure that an image is tied to its closest point (probably requiring a second pass)?

@danbjoseph
Copy link
Member

Evan, if you've got that method working, let's stick with it and continue on towards a completed MVP. We can log an issue to remember to take another look at the method later. Thanks!

@dragonejt dragonejt force-pushed the 7-use-mapillary-api-to-retrieve-image-for-each-point branch from c70c34e to 751bae0 Compare March 8, 2024 02:49
@dragonejt
Copy link
Contributor Author

Implementation changes have been committed, will do documentation changes sometime later

@danbjoseph
Copy link
Member

danbjoseph commented Mar 8, 2024

first glance looks good!
Screenshot 2024-03-08 at 5 14 04 PM

i was worried because it seemed like we were missing a lot of images (the white dots on top of the blue represent sample points with no associated image)
Screenshot 2024-03-08 at 5 17 23 PM

but then i filtered for 360 images on Mapillary and it looks like the coverage lines up. there's a gap in the 360 coverage in the north half of the town.
Screenshot 2024-03-08 at 5 09 33 PM

@danbjoseph
Copy link
Member

i think using a .env.example file makes it less likely someone accidentally commits their token, have added that change

@dragonejt
Copy link
Contributor Author

I'll work on documentation changes today

Copy link
Member

@danbjoseph danbjoseph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a few things to revisit at a later date, but i think this is great to move forward with. thanks!

@danbjoseph danbjoseph merged commit 39dff8f into main Mar 14, 2024
1 check passed
@danbjoseph danbjoseph deleted the 7-use-mapillary-api-to-retrieve-image-for-each-point branch March 14, 2024 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

use Mapillary API to retrieve image for each point
3 participants