db layout and schema discussions #1
I tried creating the OSM location hierarchy by searching nominatim for all relevant parts of an address. There is a new django app for this. Here's a screenshot of the django admin of the result. It kind of works, but:
In spite of these facts, I propose to follow the original design with individual tables for cities, states and countries (maybe also districts and suburbs) which are all interlinked, e.g.:
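Below is a minimal sketch of what such interlinking could look like as Django models; the model and field names here are illustrative, not taken from the actual code.

```python
from django.db import models


class Country(models.Model):
    name = models.CharField(max_length=64)


class State(models.Model):
    name = models.CharField(max_length=64)
    country = models.ForeignKey(Country, on_delete=models.CASCADE)


class City(models.Model):
    name = models.CharField(max_length=64)
    state = models.ForeignKey(State, on_delete=models.CASCADE)


# forward query: follow the links upwards
#   city.state.country
# backward query: all cities of a state
#   state.city_set.all()
```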
And so on. It makes forward and backward queries easy. The data can still be queried automatically, and missing cities like in the above case can be detected easily and added without untying an existing parent relation.

EDIT: About the … An osmnames document says … The file disappeared in August 2018, and today maybe this describes it more precisely? Well, it's complicated. And what an amazing project, btw.
Just some short ideas, I didn't have the time to take a deeper look yet. Tbh I'd like to drop City, State and Country completely. Data source and data publishing should be independent, so in theory all we need is pools.

For the current API (which unfortunately needs to stay for a while, since people actually started being compatible with it) we need to somehow get a City and a Region. The Region is of no use other than that it is displayed in some of the apps. I have yet to see any practical use, and it's not always possible to get it from all sources. Unfortunately we can't easily update all apps, so we have to keep it in the API and at least fill in some dummy value.
@jklmnn I agree with your thought that exact location is the most important thing. That's why I settled on postgis. I haven't tried it yet, but it's supposed to make those location queries a delight.

Anyways, that whole administrative hierarchy thing is just an idea. Practical use is not so great if I think about it twice, except to show off on the statistics page, like "look, we have so many countries, states, cities". But I still like the idea of getting an …

I actually like your design of supplying a geojson file with every scraper, but I've seen that you have put them together yourselves in some cases. That's not something I would require contributors to do; a lat/lon for each lot would be enough.
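Just to illustrate what postgis buys us: a "lots near a point" query with GeoDjango could look roughly like the sketch below. Whether the lot model actually carries a point field named like this is an assumption for the example.

```python
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D

from park_data.models import ParkingLot  # assumes a GeoDjango PointField `geo_point`

# all lots within 2 km of a coordinate (Point takes lon, lat)
center = Point(13.7373, 51.0504, srid=4326)
nearby = ParkingLot.objects.filter(geo_point__distance_lte=(center, D(km=2)))
```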
Yes, I also mean https://data.deutschebahn.com/dataset/api-parkplatz.html. The reason for the geojson file is mainly that many data sources do not provide sufficient information (such as the location, total lot count, etc.). We need a way to match the static data we have to the data provided by the source. I can't think of a way to generally automate that. What do you do if you have a website with just names and a free count? Tbh I haven't looked into …
I understand. It's just a tiny discussion about the format then. But I think supplying a well-formatted example geojson and some docs should enable contributors to supply them.

One thing bugs me: the case that a scraper might find a new lot on a website, e.g. on apag, db-parking or here or basically anywhere.

a) If the configuration is fixed in a geojson file, it cannot publish the lot until someone updates the file.
b) If the configuration is auto-generated from available (website/API) data, the scraper might publish something erroneous.

I prefer a) over b), though it would be good to record that event in the database or send a mail.

Agree with your point about a good address. I think it must be supplied by hand.

Just a note: the OSM project does not yet have permanent IDs for things. The …

So the required lot meta-info is!?
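Purely for illustration (this is not the project's actual geojson format, just a guess at what a minimal lot entry might contain), such an example file could be written like this:

```python
import json

# Hypothetical minimal lot entry; property names and values are placeholders.
example = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [13.7373, 51.0504]},  # lon, lat
            "properties": {
                "name": "Example Garage",
                "type": "garage",
                "total": 400,
                "address": "Example Str. 1, 01067 Dresden",
            },
        }
    ],
}

with open("examplecity.geojson", "w") as f:
    json.dump(example, f, indent=2, ensure_ascii=False)
```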
EDIT: The nominatim reverse search will probably get us the city and region name for every lat/lon without problems. The max zoom choice-box on the website also explains the …
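A rough sketch of that reverse lookup against the public nominatim instance (untested here; which address keys are present varies by place type):

```python
import requests


def reverse_lookup(lat, lon):
    """Ask nominatim for the address of a coordinate and pick out city/state/country."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/reverse",
        params={"lat": lat, "lon": lon, "format": "geojson", "addressdetails": 1},
        headers={"User-Agent": "park_data schema experiment"},  # required by the usage policy
    )
    resp.raise_for_status()
    address = resp.json()["features"][0]["properties"]["address"]
    # nominatim uses different keys depending on the place type
    city = address.get("city") or address.get("town") or address.get("village")
    return city, address.get("state"), address.get("country")
```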
Hey guys,

so, I've removed all the OSM and location-hierarchy stuff for now. The current master branch actually holds a working prototype. Please check the README for details. There is still a lot of stuff to discuss, but maybe we should just call at some point.
Thanks! I'll take a look at it. Sorry for the late replies. I still have to take lots of vacation days in December so I might be a bit more active here then ;)
Dear JK! No worries. As long as I get a few signs of activity I'm okay. Actually I would have been involved in a new project this week, but it was postponed, so I had the time.

The old API will be re-implemented. Currently I'm just having some fun with django-rest-framework. Although fun is not exactly the right word.

The required parameters that a scraper must supply are in the structs.py file. It's really not much, not even a city ;) The license will be accompanied by attribution source and url, as in your code.
Just about this nominatim hierarchy: it is really complex… and someone in Dresden is working on the implementation side ;)
A short comment to …
In general, the PoolInfo and LotInfo (and the resulting geojson) data is only an initial fixture for the database. They might update values that are … The problem, of course, is that the database state diverges from the initial fixtures. I don't have a good solution for that yet, except to not edit in the admin in general and always update from the fixtures. It would also be possible to render the …

About …: it can also be determined automatically. Or we could publish an up-time percentage for each pool and lot.
To be honest, I'm not too thrilled about the admin interface. Having to support a web interface on our production server is something I'd like to avoid, as it comes with additional security requirements and administration overhead.

Sending a message in case of an error is a good idea. We currently have a bot running that sends a message to the matrix channel if a city's data is older than a day (you don't want to send a message immediately; sometimes the data providers are just doing maintenance, and secondly, our response time isn't fast enough anyway for anything that fixes itself within a day).

Publishing an uptime percentage (or data quality measurement) for pools and lots is a good idea. Uptime doesn't make sense for lots, maybe rather some statistics on how often they're updated. Sometimes a data source is "active" and provides data, but the timestamp provided by that source is old. Separately, the reliability of a data source could be expressed by a metric that shows whether the source was reachable and whether we were able to read valid data.
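Just to make that idea concrete, such a metric could be computed roughly as below. The data shape is made up for the sketch; it assumes we record one entry per scrape attempt with a flag for "reachable and parseable".

```python
from datetime import datetime, timedelta, timezone


def source_reliability(attempts, days=30, now=None):
    """
    Share of scrape attempts in the last `days` days that returned valid data.
    `attempts` is assumed to be an iterable of (timestamp, ok) tuples.
    Returns None if nothing was recorded in the window.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    recent = [ok for ts, ok in attempts if ts >= cutoff]
    if not recent:
        return None
    return 100.0 * sum(recent) / len(recent)
```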
Okay, we can drop the admin. I agree simply because of the duplicate data problem. I don't think it creates more overhead on the maintenance side (apart from trusting the users that have a login). The django server needs to run anyway to provide the API endpoints, and the admin is an integrated part of django. We never had security issues in our customer projects, but then again, django isn't a huge attack target. Not like PHP projects ;-)
Having a login site that requires a separate subdomain and certificates (or a VPN) is already something extra I don't really want to have, especially given the fact that we're probably not using it 99% of the time. The most secure interface is one you don't have. We already have SSH, so that should be sufficient ;) The API endpoints will be routed to the reverse proxy we already have.
I see it the same way. No interface == most secure. It's really not necessary, and it actually determines the source of truth for the data: it's the scrapers (or their geojson files).

However, right now I'm also working enough that I regularly leave the laptop in the office when I'm done. But it won't always be like that. I think we can do the transition, it's just no speedy sprint ;)

Currently the biggest differences between the current ParkAPI and the rewrite are the … I guess it's not completely knowable what people and apps do with those IDs. I do like the … The …

Just some thoughts. Have a good time!
The lot ids are just used internally. At least none of the applications I know uses them (there is a single hard-coded exception in our website, but that would be an easy fix). Initially we wanted to use some kind of ID that is provided by the data source, but this led to a few (I think just 2) cities having different lot_ids while the rest just uses the names stripped and concatenated. I think we should only provide our own ID format publicly to keep it consistent. We should store the lot ID of the data source as an optional property (we may need it to match meta data).

Yes, the region is only relevant in Dresden, mainly because Dresden was the first data source (hence the name ParkenDD) and their website had a region structure at the time. We have to keep the JSON key though, just to make apps not crash. It would even be valid to just set it to the city name. I don't think that there is any relevant information in the region that anyone cares about. We could also use the ZIP code of the lot.
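As a side note, the "names stripped and concatenated" scheme could be sketched like this (simplified; the real implementation presumably also handles umlauts and other special characters):

```python
import re


def make_lot_id(city_name: str, lot_name: str) -> str:
    """Lowercase city and lot name, keep only ASCII letters and digits, concatenate."""
    return re.sub(r"[^a-z0-9]", "", f"{city_name}{lot_name}".lower())


# make_lot_id("Dresden", "Altmarkt")  ->  "dresdenaltmarkt"
```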
This message does not really fit the issue's topic, but I like the diary character of this thread.

I've started to implement the v1 API. Apart from the lot_ids and regions, as discussed earlier, this implementation does pretty much the same. What is really missing is the …

The original API root path which lists the city-dictionary is a bit tricky, and we have to test it once all original scrapers are migrated. Code and description here. To make it eventually replace the original API server behind your proxy, we probably need to move the URL paths. Currently they are behind …

It requires some static javascript and css which must be delivered by a web server. But we can also drop that interface and just point people to the auto-generated …

Thanks for listening, my dear diary. Now back to work ;-)
An explorable swagger API interface like that actually sounds pretty great!
Yes, it's kind of fun. You still have to type in ISO-formatted date strings, though ;-) Configuring the rest-framework interfaces is not so funny, e.g. declaring the return status code for the …
Yeah, don't worry about that one 😅
Hehe, I'm not deeply concerned with that coffee endpoint. Just at times I thought this rest-framework would be more intuitive...
Don't worry about the forecast. If we set the forecast for all cities to …

Having swagger hosted on our own site shouldn't be a problem. And having it is certainly a big improvement in terms of API documentation (we should really do this for the forecast API, I always forget how to use it, and I implemented it).
Hello @jklmnn and @kiliankoe,
I'd like to continue the discussion about the layout of the database here (offenesdresden/ParkAPI#224).
As it's not completely intuitive to read the relations from the code, I will write down the basics. The CamelCase names are the django models, which translate to database tables.
```
ParkingData
    # open|closed|unknown|nodata|error
ParkingLot
    # osm ID if available
    # unique permanent identifier of parking lot
    # street|garage|underground|bus
Pool
    # unique permanent identifier of parking pool

# ParkingLot and all models below are based on the OSMBase model
# and support automatic retrieval of osm data
City
    # unique permanent identifier
State
Country
```
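To make the relations a bit more concrete, here is a very rough Django sketch of the two central tables. The field names are illustrative, not the actual ones in park_data/models/_store.py.

```python
from django.db import models


class ParkingLot(models.Model):
    LOT_TYPES = ["street", "garage", "underground", "bus"]

    lot_id = models.CharField(max_length=64, unique=True)    # unique permanent identifier
    osm_id = models.BigIntegerField(null=True, blank=True)   # osm ID if available
    lot_type = models.CharField(
        max_length=16, choices=[(t, t) for t in LOT_TYPES], null=True, blank=True
    )
    city = models.ForeignKey("City", on_delete=models.CASCADE)


class ParkingData(models.Model):
    STATUSES = ["open", "closed", "unknown", "nodata", "error"]

    lot = models.ForeignKey(ParkingLot, on_delete=models.CASCADE)
    timestamp = models.DateTimeField()
    status = models.CharField(max_length=16, choices=[(s, s) for s in STATUSES])
    num_free = models.IntegerField(null=True, blank=True)    # free spaces, if reported
```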
A few notes:

- Pools are not in the code yet.
- Further changes to existing tables are quite easy to implement in Django. E.g. adding an optional field simply requires adding it to the model code and calling `./manage.py makemigrations` to publish it (see the sketch after this list).
- The `postgis` requirement for the database adds minimally to the development and CI effort, but it will allow efficient queries by location later on.
- The City, State and Country tables are just a proposal. Their only purpose is to allow granulated queries later on.
- Identifying locations by `osm_id` instead of names is certainly more duplicate-proof, and it allows querying geo and other information later on. No need for the scraper to provide more than the `osm_id`. It does not exist for many parking lots, though.
- A ParkingLot always requires a City right now. That's the minimum requirement, which should not be too hard. A City, though, currently does not require a State or Country. All of this data can be tied together later on in the django admin. Also note that City links to State and Country; that means one can actually create nonsense data in the database.
- Another way of doing it is to require the ParkingLot -> City -> State -> Country mapping and to require the scraper to provide all the `osm_id`s for it. Still, it would be possible to end up with inconsistent data, because someone might have a typo in the IDs or bad humor.
- There is a nominatim endpoint (`details/`) which displays the address hierarchy of any object. That would allow getting State and Country just from the city `osm_id`. Unfortunately it's not allowed to read from it in an automated fashion, and I did not find another way of getting the osm hierarchy yet. EDIT: Guess it's quite reliably possible by using the `features.properties.address` property of the nominatim geojson when `addressdetails=1`.
- Another option: the scraper needs to supply a city `osm_id` and nothing more, the City does not immediately require a State, and basically an admin has to tie all missing links together at some point.
- And finally: the normalized database layout is something different than the type of data that a scraper is required to provide. I think we should be nice to contributors and make it as easy as possible. The current layout can be found in park_data/models/_store.py and the unittest.
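For example (the field name is made up, just to show the shape of the change), an optional field on an existing model is a two-line edit plus a generated migration:

```python
# park_data/models/_store.py (sketch)
from django.db import models


class ParkingLot(models.Model):
    ...  # existing fields stay as they are
    # new optional field; null/blank so existing rows stay valid
    operator = models.CharField(max_length=128, null=True, blank=True)
```

Afterwards `./manage.py makemigrations` generates the migration and `./manage.py migrate` applies it to the database.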