Tracker: Offline geocoding #58
Comments
I got Carmen to ingest some sample data, which is good. Next I'm going to try to ingest a small OSM extract.
Carmen seems to be doing geocoding in my test setup, but I'm really unhappy with the size of the index. The fuzzy phrase store and grid store combined are 1 MB compressed for the Seattle metro area.
I have fuzzy_phrase building for wasm, complete with dynamic loading of the search index, which is a very unexpected turn of events. https://github.com/ellenhp/fuzzy-phrase/tree/wasm This was only possible because of the work done here: https://github.com/phiresky/tantivy-fst I'm currently working on converting carmen-core to use SQLite instead of RocksDB so we can leverage sql.js-httpvfs to dynamically load the gridstore too. After that I'll perform some analysis to determine how ping time to the server affects geocoding performance. I'm expecting tens of serial HTTP range requests per query, so ping time will probably be critical. If this does end up working well enough to deploy, it will probably make sense to use a CDN to serve the index. As much as I hate letting Cloudflare MITM my TLS connections, there might not be any other way to have a good user experience. And even though access patterns would leak information to the CDN operator, it beats the heck out of sending a free-text geocoding query to $MAPS_COMPANY.
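To make the ping-time concern concrete, here is a back-of-the-envelope sketch, not a measurement. It assumes every range request waits on the previous response, so each one pays a full round trip. The function name and the per-request transfer time are made up for illustration.

```rust
use std::time::Duration;

/// Rough estimate of the wall-clock time for `requests` HTTP range requests
/// issued strictly one after another (each waits for the previous response).
/// `rtt` is the round-trip time to the server; `per_request_transfer` is an
/// assumed time spent moving the response bytes for one request.
fn serial_fetch_time(requests: u32, rtt: Duration, per_request_transfer: Duration) -> Duration {
    // Serial requests cannot overlap, so the costs simply add up.
    (rtt + per_request_transfer) * requests
}
```

With illustrative inputs of 30 requests and 5 ms of transfer each, a 20 ms ping works out to 750 ms per query and a 150 ms ping to 4.65 s. That gap is why serving the index from somewhere close to the user looks hard to avoid.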
carmen-core is working with a SQLite backend, so in theory lazy loading is now unblocked, but I'm not convinced SQLite is the correct path forward. I'd like to avoid it because simultaneous interop between JavaScript, C/C++ and Rust sounds kind of hard and I don't understand the emscripten virtual filesystem stuff. Also, sql.js is like a megabyte of wasm. I want to build my own key-value store that will eagerly download the index and then lazily download the data blocks. That also gives me much more control over latency than implementing a lazy filesystem for SQLite would. After that I think all that remains is building a new wasm_bindgen interface for carmen-core, building a lazy fst::FakeArr, then building vtquery with emscripten and using that instead of the vtquery node package. Inevitably there will be issues, but this doesn't seem like more than another week of work. I have a 10 day vacation coming up though, so my original estimate of 1 month might end up being accurate after all.
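A minimal sketch of the "eager index, lazy data blocks" idea, to show the shape of it. This is not carmen-core's gridstore format or any existing API. `BlockFetcher`, `LazyStore` and `InMemoryFetcher` are hypothetical names, and the in-memory fetcher stands in for what would be an HTTP range request in the browser.

```rust
use std::collections::HashMap;

/// One data block: key/value pairs sorted by key.
type Block = Vec<(String, Vec<u8>)>;

/// Hypothetical abstraction over "fetch one data block". In a browser this
/// would be an HTTP range request; that part is not modeled here.
trait BlockFetcher {
    fn fetch(&mut self, block_id: usize) -> Block;
}

/// Key-value store with an eagerly loaded index and lazily loaded blocks.
struct LazyStore<F: BlockFetcher> {
    /// First key of each block, sorted ascending. This is the small part
    /// that is downloaded up front.
    block_first_keys: Vec<String>,
    /// Blocks fetched so far, keyed by block id.
    cache: HashMap<usize, Block>,
    fetcher: F,
}

impl<F: BlockFetcher> LazyStore<F> {
    fn new(block_first_keys: Vec<String>, fetcher: F) -> Self {
        LazyStore { block_first_keys, cache: HashMap::new(), fetcher }
    }

    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        // Count the blocks whose first key is <= `key`; the last of those
        // is the only block that can contain it.
        let n = self.block_first_keys.partition_point(|first| first.as_str() <= key);
        if n == 0 {
            return None; // `key` sorts before everything in the store
        }
        let block_id = n - 1;
        // Fetch the block only on first use; later lookups hit the cache.
        if !self.cache.contains_key(&block_id) {
            let block = self.fetcher.fetch(block_id);
            self.cache.insert(block_id, block);
        }
        let block = &self.cache[&block_id];
        block
            .binary_search_by(|(k, _)| k.as_str().cmp(key))
            .ok()
            .map(|i| block[i].1.clone())
    }
}

/// Stand-in fetcher for experiments: serves blocks from memory and counts
/// how many fetches were needed.
struct InMemoryFetcher {
    blocks: Vec<Block>,
    fetches: usize,
}

impl BlockFetcher for InMemoryFetcher {
    fn fetch(&mut self, block_id: usize) -> Block {
        self.fetches += 1;
        self.blocks[block_id].clone()
    }
}
```

Each lookup costs at most one block fetch, and repeated lookups in the same block cost none. That is the latency control a lazy filesystem underneath SQLite would make harder to reason about. A real version would also need byte offsets and lengths in the index so that each block maps to a single range request.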
At this point I'm pretty convinced that Mapbox Carmen won't work as-is, which is a bummer. I've started exploring other options but I think it makes sense to get Headway into a working state as originally scoped. A lot of people were excited about it as originally scoped and I don't think I want to block its completion on me writing a geocoder from scratch.
I'm sure if I spent a few months on this I could get it to tech demo levels of functionality, but I want more for this project than that, so I'm going to move forward with a traditional geocoder stack. Expectations for privacy can be managed in some other way. I think it may eventually be reasonable to build a privacy-preserving replacement for Nominatim, but it is not reasonable IMO to try to replicate the performance or usability characteristics of Photon. There's just like, so much work that's gone into making that fast, generalized and typo-tolerant. I'm going to keep pursuing offline routing though. There are a few user stories that could preserve privacy better if offline routing were to work (route me home, or to any other location I've cached the lat/lng for).
So I think as long as endpoints are straightforward enough to configure at both build time and runtime, offline geocoding and routing are not that crucial from a pure privacy perspective, as those whose threat models are strict enough that this is a concern can also figure out something that works. Not having to trust any server is less important if it's easy enough to set up a new server or use a friend's. (There are still reasons why these features, including routing, are interesting.)
Offline geocoding would probably be best accomplished by getting Carmen to run in the browser, per #50. This seems like it will require removing rayon and rocksdb from carmen-core, rewriting or repackaging vtquery, and probably other things as well. I'm guessing Carmen itself will also need to be ported over to use local storage instead of the node filesystem APIs. After the dust settles, performance will need to be evaluated in terms of response time, latency from a cold cache, and subjective quality of the results.