Expose the typesense server via search.hexdocs.pm #47
Comments
I think just https://typesense.org/docs/27.1/api/search.html#search-parameters would be enough.
curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
"http://localhost:8108/collections/companies/documents/search\
?q=stark&query_by=company_name&filter_by=num_employees:>100\
&sort_by=num_employees:desc"
And there is a similar one but for POST requests for multiple queries: https://typesense.org/docs/27.1/api/federated-multi-search.html#multi-search-parameters -- I think that's what most Typesense clients use by default, e.g. the demo search on https://typesense.org issues "multi-search" requests. It seems like it would only be necessary if the user queries don't fit in URL length limits. |
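For comparison, the GET example above could be expressed as a multi-search payload roughly like this. It is only a sketch: the `multi_search_body` helper is hypothetical, it does no JSON escaping, and it prints the body instead of sending it.

```shell
#!/bin/sh
# Hypothetical helper: print a single-query multi_search payload.
# Assumes no argument contains characters that need JSON escaping
# (quotes, backslashes, control characters).
multi_search_body() {
  printf '{"searches":[{"collection":"%s","q":"%s","query_by":"%s","filter_by":"%s","sort_by":"%s"}]}\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Same query as the GET example: companies matching "stark" with more than 100 employees.
multi_search_body companies stark company_name 'num_employees:>100' 'num_employees:desc'
```

The printed body would then be POSTed to the `/multi_search` endpoint with the same `X-TYPESENSE-API-KEY` header as the GET request.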
@ruslandoga Can you show an example of a typical request we will send? If we can extract the package and version from the query or add it as additional parameters we would be able to cache the query results on our CDN. |
We should not worry about caching it per package. The goal is to be able to search across multiple packages at once. |
Should we do any caching* or just pass through? If we can't cache per package we can't manually invalidate the cache and would need a short TTL to ensure we don't get unexpected search results when new packages are published. |
Just pass through I think. Typesense already serves around the globe and if there is any caching, they can probably do it better than us. |
@ruslandoga It's up and running on https://search.staging.hexdocs.pm/collections/hexdocs-staging/documents/search. I noticed we don't have much data in staging so I also temporarily enabled querying against the prod collection https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search. |
@ericmj should we expose a shorter URL? |
👋 @ericmj One thing to keep in mind when proxying to Typesense is that they use geo steering (appears to be Route53) on the "nearest" node domain.
$ dig +short ent97o5sv4dzx2f0p.a1.typesense.net
13.214.203.221
$ dig +short ent97o5sv4dzx2f0p-1.a1.typesense.net
3.140.193.110
$ dig +short ent97o5sv4dzx2f0p-2.a1.typesense.net
3.79.208.114
$ dig +short ent97o5sv4dzx2f0p-3.a1.typesense.net
13.214.203.221
So depending on how the proxy resolves the "nearest" domain it might affect latency. The proxy seems to be resolving correctly, but it's a bit slower than direct access to the "nearest" node.
$ echo Ohio
$ wrk --latency -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" https://ent97o5sv4dzx2f0p-1.a1.typesense.net/collections/hexdocs-prod/documents/search
Running 10s test @ https://ent97o5sv4dzx2f0p-1.a1.typesense.net/collections/hexdocs-prod/documents/search
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 342.42ms 90.68ms 615.89ms 88.46%
Req/Sec 14.42 3.50 16.00 84.62%
Latency Distribution
50% 308.08ms
75% 310.75ms
90% 526.18ms
99% 615.85ms
260 requests in 10.10s, 53.57KB read
Non-2xx or 3xx responses: 260
Requests/sec: 25.75
Transfer/sec: 5.31KB
$ echo Frankfurt
$ wrk --latency -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" https://ent97o5sv4dzx2f0p-2.a1.typesense.net/collections/hexdocs-prod/documents/search
Running 10s test @ https://ent97o5sv4dzx2f0p-2.a1.typesense.net/collections/hexdocs-prod/documents/search
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 269.80ms 35.63ms 356.93ms 66.06%
Req/Sec 18.96 8.37 40.00 77.63%
Latency Distribution
50% 249.31ms
75% 306.46ms
90% 307.52ms
99% 355.88ms
330 requests in 10.09s, 68.00KB read
Non-2xx or 3xx responses: 330
Requests/sec: 32.72
Transfer/sec: 6.74KB
$ echo Singapore
$ wrk --latency -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" https://ent97o5sv4dzx2f0p-3.a1.typesense.net/collections/hexdocs-prod/documents/search
Running 10s test @ https://ent97o5sv4dzx2f0p-3.a1.typesense.net/collections/hexdocs-prod/documents/search
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 71.46ms 11.42ms 164.57ms 79.81%
Req/Sec 69.71 17.35 101.00 57.22%
Latency Distribution
50% 68.81ms
75% 75.91ms
90% 84.68ms
99% 113.82ms
1364 requests in 10.08s, 281.06KB read
Non-2xx or 3xx responses: 1364
Requests/sec: 135.25
Transfer/sec: 27.87KB
$ echo Nearest
$ wrk --latency -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" https://ent97o5sv4dzx2f0p.a1.typesense.net/collections/hexdocs-prod/documents/search
Running 10s test @ https://ent97o5sv4dzx2f0p.a1.typesense.net/collections/hexdocs-prod/documents/search
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 71.63ms 12.76ms 165.08ms 85.03%
Req/Sec 69.45 19.10 101.00 58.67%
Latency Distribution
50% 68.79ms
75% 75.84ms
90% 84.05ms
99% 124.80ms
1371 requests in 10.08s, 282.50KB read
Non-2xx or 3xx responses: 1371
Requests/sec: 136.02
Transfer/sec: 28.03KB
$ echo Proxy
$ wrk --latency https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search
Running 10s test @ https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 95.70ms 22.09ms 287.65ms 79.39%
Req/Sec 52.69 11.99 80.00 72.96%
Latency Distribution
50% 95.59ms
75% 103.27ms
90% 113.94ms
99% 176.24ms
1041 requests in 10.08s, 309.29KB read
Non-2xx or 3xx responses: 1041
Requests/sec: 103.28
Transfer/sec: 30.69KB |
Regarding caching, if necessary, we can cache all (non-latest) queries using the full query string as key (ignoring the ordering?). But it might be unnecessary. Instead, maybe we can add some sort of telemetry in the browser to ensure the requests stay fast.
Example request:
$ export user_query=mua
$ export package=swoosh-1.17.5
$ curl "https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search?q=$user_query&query_by=title,doc&filter_by=package:=$package" | jq
{
"facet_counts": [],
"found": 69,
"hits": [
{
"document": {
"doc": "Raised when no relay is used and recipients contain addresses across multiple hosts.\n\nFor example:\n\n email =\n Swoosh.Email.new(\n to: {\"Mua\", \"[email protected]\"},\n cc: [{\"Swoosh\", \"[email protected]\"}]\n )\n\n Swoosh.Adapters.Mua.deliver(email, _no_relay_config = [])\n\nFields:\n\n - `:hosts` - the hosts for the recipients, `[\"github.com\", \"swoosh.github.com\"]` in the example above",
"id": "78722",
"package": "swoosh-1.17.5",
"proglang": "elixir",
"ref": "Swoosh.Adapters.Mua.MultihostError.html",
"title": "Swoosh.Adapters.Mua.MultihostError",
"type": "exception"
},
... |
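If caching by full query string is ever tried, "ignoring the ordering" could mean canonicalizing the parameters before using them as the key. A minimal sketch, assuming parameters are separated by `&` and values are already consistently URL-encoded; the `cache_key` helper is hypothetical and not part of the proxy:

```shell
#!/bin/sh
# Hypothetical helper: sort query-string parameters so that their order
# does not change the resulting cache key.
cache_key() {
  printf '%s\n' "$1" | tr '&' '\n' | LC_ALL=C sort | paste -s -d '&' -
}

# Both orderings of the example request map to the same key.
cache_key 'query_by=title,doc&q=mua&filter_by=package:=swoosh-1.17.5'
cache_key 'q=mua&filter_by=package:=swoosh-1.17.5&query_by=title,doc'
```

Sorting whole `name=value` pairs keeps the helper simple; it does not merge duplicate parameters or normalize encoding differences.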
A 25ms latency increase is better than I expected. I tested myself and I only get a 13ms increase. The geo location routing should work on our CDN. Fastly uses the same technology to pick a POP datacenter near you and then we will pick the typesense server nearest the POP.
For now I think we can focus on making sure everything works and is secure. We can investigate whether we can cache things after everything is working. |
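The roughly 25ms figure presumably comes from comparing the "Proxy" run with the "Nearest" run in the wrk output earlier in the thread. A quick sketch of that arithmetic, with the numbers copied from those two runs; the `proxy_overhead` helper is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: difference between proxied and direct latency, in ms.
# $1 = latency through the proxy, $2 = latency against the "nearest" node.
proxy_overhead() {
  LC_ALL=C awk -v proxy="$1" -v direct="$2" 'BEGIN { printf "%.2f\n", proxy - direct }'
}

proxy_overhead 95.70 71.63   # average latency: prints 24.07
proxy_overhead 95.59 68.79   # median (50%) latency: prints 26.80
```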
We have removed the path component of the URL so what was before https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search?q= should now be https://search.staging.hexdocs.pm/?q= |
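For anyone with the old URL shape in scripts, the change could be applied mechanically. A small sketch, assuming the collection name contains no `/` or `?`; the `shorten_search_url` helper is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: rewrite the old proxied URL shape to the shortened one
# by collapsing "/collections/<name>/documents/search" into "/".
shorten_search_url() {
  printf '%s\n' "$1" | sed 's#/collections/[^/?]*/documents/search#/#'
}

shorten_search_url 'https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search?q=mua&query_by=title,doc'
# prints https://search.staging.hexdocs.pm/?q=mua&query_by=title,doc
```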
I wonder if it also makes sense to expose the POST /multi_search endpoint, for when the query string and package filters don't fit in the URL length limits? Assuming a 2KB URL length limit and the URL structure as in #49 (comment), the "largest" possible search query would contain about 90 package filters.
Calculations
Assuming search URLs of this format:
And assuming average package name length of about 12:
$ head autocomplete.ndjson
{"package":"aasm","ref":"AASM.html","title":"AASM","type":"module"}
{"package":"aasm","ref":"AASM.html#aasm/2","title":"AASM.aasm/2","type":"macro"}
{"package":"aasm","ref":"AASM.html#event/3","title":"AASM.event/3","type":"macro"}
{"package":"aasm","ref":"AASM.html#state/1","title":"AASM.state/1","type":"macro"}
{"package":"aba","ref":"ABA.html","title":"ABA","type":"module"}
{"package":"aba","ref":"ABA.html#get_bank/1","title":"ABA.get_bank/1","type":"function"}
{"package":"aba","ref":"ABA.html#routing_number_valid?/1","title":"ABA.routing_number_valid?/1","type":"function"}
{"package":"aba","ref":"ABA.Bank.html","title":"ABA.Bank","type":"module"}
{"package":"aba","ref":"ABA.Bank.html#bank/2","title":"ABA.Bank.bank/2","type":"function"}
{"package":"aba","ref":"ABA.Bank.html#t:t/0","title":"ABA.Bank.t/0","type":"type"}
$ jq -r '.package' autocomplete.ndjson | sort -u | head
a11y_audit
aa
aasm
aba
aba_validator
abac
abacus
abacus_sql
abatap
abbrev
$ jq -r '.package' autocomplete.ndjson | sort -u | wc -l
8586
$ jq -r '.package' autocomplete.ndjson | sort -u | awk '{ total += length; count++ } END { if (count > 0) print total / count; else print 0 }'
11.3109
It would require about 90 package filters to hit the URL length limit.
iex> avg_pkg_name_length = 12
12
iex> avg_version_length = byte_size "1.2.3"
5
iex> extra = byte_size "p[]=&" # p[avg_pkg_name]=avg_version&
5
iex> avg_filter_length = (avg_pkg_name_length + avg_version_length + extra)
22
iex> constant = byte_size "https://search.hexdocs.pm/"
26
# almost maximum
iex> avg_filter_length * 90 + constant
2006
Note
A fresh Phoenix project contains 39 dependencies (~package filters) + 1 (if the app itself is ExDoc-searchable): https://github.com/ruslandoga/phx-example/blob/master/mix.lock
Warning
Plausible has 157 dependencies: https://github.com/plausible/analytics/blob/master/mix.lock
Example
curl 'https://qtg5aekc2iosjh93p-3.a1.typesense.net/multi_search?x-typesense-api-key=8hLCPSQTYcBuK29zY5q6Xhin7ONxHy99' \
-X 'POST' \
-H 'Content-Type: text/plain' \
-H 'Pragma: no-cache' \
-H 'Accept: application/json, text/plain, */*' \
-H 'Sec-Fetch-Site: cross-site' \
-H 'Accept-Language: en-US,en;q=0.9' \
-H 'Cache-Control: no-cache' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Origin: https://typesense.org' \
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.2 Safari/605.1.15' \
-H 'Content-Length: 158' \
-H 'Referer: https://typesense.org/' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Priority: u=3, i' \
--data-binary '{"searches":[{"query_by":"title","prioritize_exact_match":false,"highlight_full_fields":"title","collection":"r","q":"steamedsteamed","page":1,"per_page":5}]}' | unzip |
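The calculation above can be turned into a rough GET-versus-POST check. This is a sketch under the same assumptions as the iex session (a 2048-byte URL limit, the `https://search.hexdocs.pm/` prefix, filters shaped like `p[name]=version&`), and it ignores the length of the search query itself; the `max_filters` and `pick_method` helpers are hypothetical:

```shell
#!/bin/sh
URL_LIMIT=2048   # assumed 2 KB URL length limit
BASE_LEN=26      # byte length of "https://search.hexdocs.pm/"

# Hypothetical helper: how many average-sized filters fit under the limit.
# $1 = average package name length, $2 = average version length
max_filters() {
  filter_len=$(( $1 + $2 + 5 ))   # 5 extra bytes for "p[]=&"
  echo $(( (URL_LIMIT - BASE_LEN) / filter_len ))
}

# Hypothetical helper: choose GET or POST for a given number of package filters.
pick_method() {
  if [ "$1" -le "$(max_filters 12 5)" ]; then echo GET; else echo POST; fi
}

max_filters 12 5   # prints 91
pick_method 40     # fresh Phoenix project (39 + 1): prints GET
pick_method 157    # Plausible: prints POST
```

With these averages the exact cutoff is 91 filters, in line with the "about 90" estimate; real URLs vary with the actual package names and versions.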
@ericmj will be the one responsible for implementing it.
@ruslandoga, can you please tell us exactly which APIs we should expose?