Expose the typesense server via search.hexdocs.pm #47

Open
josevalim opened this issue Dec 13, 2024 · 12 comments

@josevalim
Member

@ericmj will be responsible for implementing it.

@ruslandoga, can you please tell us exactly which APIs we should expose?

@ruslandoga
Contributor

ruslandoga commented Dec 13, 2024

I think just https://typesense.org/docs/27.1/api/search.html#search-parameters would be enough.

curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
"http://localhost:8108/collections/companies/documents/search\
?q=stark&query_by=company_name&filter_by=num_employees:>100\
&sort_by=num_employees:desc"

There is also a similar endpoint that accepts POST requests carrying multiple queries: https://typesense.org/docs/27.1/api/federated-multi-search.html#multi-search-parameters. I think that's what most Typesense clients use by default, e.g. the demo search on https://typesense.org issues "multi-search" requests. For us, it seems it would only be necessary if user queries don't fit within URL length limits.
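To make the difference between the two request shapes concrete, here is a minimal sketch using only the Python standard library. The helper names `search_url` and `multi_search_body` are hypothetical; the base URL, collection, and parameters are taken from the curl example above. It only builds the URL and the JSON body and does not send anything.

```python
import json
from urllib.parse import urlencode


def search_url(base, collection, params):
    """Build a GET .../documents/search URL with percent-encoded parameters."""
    return f"{base}/collections/{collection}/documents/search?{urlencode(params)}"


def multi_search_body(collection, searches):
    """Build a POST /multi_search JSON body; each search names its collection."""
    return json.dumps({"searches": [{"collection": collection, **s} for s in searches]})


params = {
    "q": "stark",
    "query_by": "company_name",
    "filter_by": "num_employees:>100",
    "sort_by": "num_employees:desc",
}

url = search_url("http://localhost:8108", "companies", params)
body = multi_search_body("companies", [params])
print(url)
print(body)
```

The GET form puts everything in the query string, so it is bounded by URL length limits; the POST form moves the same parameters into the request body.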

@josevalim josevalim mentioned this issue Dec 13, 2024
6 tasks
@ericmj
Member

ericmj commented Dec 17, 2024

@ruslandoga Can you show an example of a typical request we will send? If we can extract the package and version from the query, or add them as additional parameters, we would be able to cache the query results on our CDN.

@josevalim
Member Author

We should not worry about caching it per package. The goal is to be able to search across multiple packages at once.

@ericmj
Member

ericmj commented Dec 17, 2024

Should we do any caching or just pass through? If we can't cache per package, we can't manually invalidate the cache and would need a short TTL to ensure we don't get unexpected search results when new packages are published.

@josevalim
Member Author

Just pass through I think. Typesense already serves around the globe and if there is any caching, they can probably do it better than us.

@ericmj
Member

ericmj commented Dec 17, 2024

@ruslandoga It's up and running on https://search.staging.hexdocs.pm/collections/hexdocs-staging/documents/search.

I noticed we don't have much data in staging so I also temporarily enabled querying against the prod collection https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search.

@josevalim
Member Author

@ericmj should we expose a shorter URL? search.hexdocs.pm/documents?

@ruslandoga
Contributor

ruslandoga commented Dec 18, 2024

👋 @ericmj

One thing to keep in mind when proxying to Typesense is that they use geo steering (appears to be Route53) on the "nearest" node domain.

Nearest Node

Geo load-balanced endpoint with automatic failover. Read more on how to configure your client library here.

https://ent97o5sv4dzx2f0p.a1.typesense.net

Individual Nodes

https://ent97o5sv4dzx2f0p-1.a1.typesense.net
https://ent97o5sv4dzx2f0p-2.a1.typesense.net
https://ent97o5sv4dzx2f0p-3.a1.typesense.net

$ dig +short ent97o5sv4dzx2f0p.a1.typesense.net
13.214.203.221
$ dig +short ent97o5sv4dzx2f0p-1.a1.typesense.net
3.140.193.110
$ dig +short ent97o5sv4dzx2f0p-2.a1.typesense.net
3.79.208.114
$ dig +short ent97o5sv4dzx2f0p-3.a1.typesense.net
13.214.203.221

So depending on how the proxy resolves the "nearest" domain it might affect latency.
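A rough way to check which individual node the "nearest" record points at from a given machine is to compare resolved addresses. This is a sketch, not part of the proxy: `resolve` and `nearest_node` are hypothetical helpers, and the demo at the bottom uses the `dig` output recorded above as a stand-in resolver so that it runs without network access. Calling `nearest_node` without the `resolver` argument would perform live DNS lookups instead.

```python
import socket

NEAREST = "ent97o5sv4dzx2f0p.a1.typesense.net"
NODES = [f"ent97o5sv4dzx2f0p-{i}.a1.typesense.net" for i in (1, 2, 3)]


def resolve(host):
    """Return the set of IPv4 addresses the local resolver gives for host."""
    infos = socket.getaddrinfo(host, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def nearest_node(nearest, nodes, resolver=resolve):
    """Return the node hostnames whose addresses overlap the 'nearest' record."""
    target = resolver(nearest)
    return [node for node in nodes if resolver(node) & target]


# Addresses copied from the dig output above, used here instead of live DNS.
recorded = {
    NEAREST: {"13.214.203.221"},
    NODES[0]: {"3.140.193.110"},
    NODES[1]: {"3.79.208.114"},
    NODES[2]: {"13.214.203.221"},
}

print(nearest_node(NEAREST, NODES, resolver=recorded.__getitem__))
```

Running the live version from the proxy host rather than from a laptop would show which node the proxy itself is steered to.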

The proxy seems to be resolving correctly, but it's a bit slower than direct access to the "nearest" node.
$ echo Ohio
$ wrk --latency -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" https://ent97o5sv4dzx2f0p-1.a1.typesense.net/collections/hexdocs-prod/documents/search
Running 10s test @ https://ent97o5sv4dzx2f0p-1.a1.typesense.net/collections/hexdocs-prod/documents/search
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   342.42ms   90.68ms 615.89ms   88.46%
    Req/Sec    14.42      3.50    16.00     84.62%
  Latency Distribution
     50%  308.08ms
     75%  310.75ms
     90%  526.18ms
     99%  615.85ms
  260 requests in 10.10s, 53.57KB read
  Non-2xx or 3xx responses: 260
Requests/sec:     25.75
Transfer/sec:      5.31KB

$ echo Frankfurt
$ wrk --latency -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" https://ent97o5sv4dzx2f0p-2.a1.typesense.net/collections/hexdocs-prod/documents/search
Running 10s test @ https://ent97o5sv4dzx2f0p-2.a1.typesense.net/collections/hexdocs-prod/documents/search
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   269.80ms   35.63ms 356.93ms   66.06%
    Req/Sec    18.96      8.37    40.00     77.63%
  Latency Distribution
     50%  249.31ms
     75%  306.46ms
     90%  307.52ms
     99%  355.88ms
  330 requests in 10.09s, 68.00KB read
  Non-2xx or 3xx responses: 330
Requests/sec:     32.72
Transfer/sec:      6.74KB

$ echo Singapore
$ wrk --latency -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" https://ent97o5sv4dzx2f0p-3.a1.typesense.net/collections/hexdocs-prod/documents/search
Running 10s test @ https://ent97o5sv4dzx2f0p-3.a1.typesense.net/collections/hexdocs-prod/documents/search
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    71.46ms   11.42ms 164.57ms   79.81%
    Req/Sec    69.71     17.35   101.00     57.22%
  Latency Distribution
     50%   68.81ms
     75%   75.91ms
     90%   84.68ms
     99%  113.82ms
  1364 requests in 10.08s, 281.06KB read
  Non-2xx or 3xx responses: 1364
Requests/sec:    135.25
Transfer/sec:     27.87KB

$ echo Nearest
$ wrk --latency -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" https://ent97o5sv4dzx2f0p.a1.typesense.net/collections/hexdocs-prod/documents/search
Running 10s test @ https://ent97o5sv4dzx2f0p.a1.typesense.net/collections/hexdocs-prod/documents/search
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    71.63ms   12.76ms 165.08ms   85.03%
    Req/Sec    69.45     19.10   101.00     58.67%
  Latency Distribution
     50%   68.79ms
     75%   75.84ms
     90%   84.05ms
     99%  124.80ms
  1371 requests in 10.08s, 282.50KB read
  Non-2xx or 3xx responses: 1371
Requests/sec:    136.02
Transfer/sec:     28.03KB

$ echo Proxy
$ wrk --latency https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search
Running 10s test @ https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    95.70ms   22.09ms 287.65ms   79.39%
    Req/Sec    52.69     11.99    80.00     72.96%
  Latency Distribution
     50%   95.59ms
     75%  103.27ms
     90%  113.94ms
     99%  176.24ms
  1041 requests in 10.08s, 309.29KB read
  Non-2xx or 3xx responses: 1041
Requests/sec:    103.28
Transfer/sec:     30.69KB

@ruslandoga
Contributor

ruslandoga commented Dec 18, 2024

Regarding caching, if necessary, we can cache all (non-latest) queries using the full query string as key (ignoring the ordering?). But it might be unnecessary. Instead, maybe we can add some sort of telemetry in the browser to ensure the requests stay fast.
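If we did go the caching route, "ignoring the ordering" could look roughly like the sketch below: parse the query string, sort the parameters, and re-encode them so that two requests differing only in parameter order share a key. `cache_key` is a hypothetical helper, not something the proxy or CDN provides.

```python
from urllib.parse import parse_qsl, urlencode


def cache_key(query_string):
    """Normalize a query string so parameter order does not affect the key."""
    pairs = parse_qsl(query_string, keep_blank_values=True)
    return urlencode(sorted(pairs))


a = cache_key("q=mua&query_by=title,doc&filter_by=package:=swoosh-1.17.5")
b = cache_key("filter_by=package:=swoosh-1.17.5&q=mua&query_by=title,doc")
print(a)
print(a == b)
```

Sorting the decoded pairs also normalizes percent-encoding differences, since both inputs are decoded and then re-encoded the same way.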

Example request:

$ export user_query=mua
$ export package=swoosh-1.17.5
$ curl "https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search?q=$user_query&query_by=title,doc&filter_by=package:=$package" | jq
{
  "facet_counts": [],
  "found": 69,
  "hits": [
    {
      "document": {
        "doc": "Raised when no relay is used and recipients contain addresses across multiple hosts.\n\nFor example:\n\n    email =\n      Swoosh.Email.new(\n        to: {\"Mua\", \"[email protected]\"},\n        cc: [{\"Swoosh\", \"[email protected]\"}]\n      )\n\n    Swoosh.Adapters.Mua.deliver(email, _no_relay_config = [])\n\nFields:\n\n  - `:hosts` - the hosts for the recipients, `[\"github.com\", \"swoosh.github.com\"]` in the example above",
        "id": "78722",
        "package": "swoosh-1.17.5",
        "proglang": "elixir",
        "ref": "Swoosh.Adapters.Mua.MultihostError.html",
        "title": "Swoosh.Adapters.Mua.MultihostError",
        "type": "exception"
      },
...

@ericmj
Member

ericmj commented Dec 18, 2024

A 25ms latency increase is better than I expected. I tested myself and I only get a 13ms increase. The geo location routing should work on our CDN. Fastly uses the same technology to pick a POP datacenter near you and then we will pick the typesense server nearest the POP.

$ wrk --latency -H "X-TYPESENSE-API-KEY: $TYPESENSE_API_KEY" https://ent97o5sv4dzx2f0p.a1.typesense.net/collections/hexdocs-prod/documents/search
Running 10s test @ https://ent97o5sv4dzx2f0p.a1.typesense.net/collections/hexdocs-prod/documents/search
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    34.88ms   20.16ms 122.05ms   88.87%
    Req/Sec   154.25     50.38   202.00     77.78%
  Latency Distribution
     50%   27.59ms
     75%   28.94ms
     90%   66.43ms
     99%  112.16ms
  3075 requests in 10.07s, 633.62KB read
  Non-2xx or 3xx responses: 3075
Requests/sec:    305.31
Transfer/sec:     62.91KB

$ wrk --latency https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search
Running 10s test @ https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    47.06ms   26.70ms 250.20ms   86.16%
    Req/Sec   113.27     39.30   151.00     81.41%
  Latency Distribution
     50%   36.20ms
     75%   37.91ms
     90%   81.33ms
     99%  131.29ms
  2267 requests in 10.07s, 666.62KB read
  Non-2xx or 3xx responses: 2267
Requests/sec:    225.21
Transfer/sec:     66.22KB
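For reference, the proxy overhead quoted in this thread can be recomputed from the average latencies in the wrk runs. This is a small sketch with the numbers copied by hand; `overhead_ms` is a hypothetical helper. Note that every run reports all responses as non-2xx/3xx, presumably because no query parameters were sent, so these figures likely reflect round-trip time rather than search time.

```python
# Average latencies in milliseconds, copied from the wrk output in this thread.
runs = {
    "ruslandoga": {"nearest": 71.63, "proxy": 95.70},
    "ericmj": {"nearest": 34.88, "proxy": 47.06},
}


def overhead_ms(run):
    """Extra average latency added by going through the proxy."""
    return round(run["proxy"] - run["nearest"], 2)


for who, run in runs.items():
    print(who, overhead_ms(run), "ms")
```

Both results are in line with the rounded "25ms" and "13ms" figures mentioned above.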

For now I think we can focus on making sure everything works and is secure. We can investigate caching once everything is working.

@ericmj
Member

ericmj commented Jan 10, 2025

We have removed the path component of the URL, so what was previously https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search?q= should now be https://search.staging.hexdocs.pm/?q=
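For anyone with old-style URLs in scripts or bookmarks, the change amounts to dropping the path and keeping the query string. A minimal sketch of that mapping, with `to_short_url` as a hypothetical helper:

```python
from urllib.parse import urlsplit, urlunsplit

OLD_PATH = "/collections/hexdocs-prod/documents/search"


def to_short_url(url):
    """Rewrite an old-style search URL to the path-less form; leave others alone."""
    parts = urlsplit(url)
    if parts.path != OLD_PATH:
        return url
    return urlunsplit((parts.scheme, parts.netloc, "/", parts.query, parts.fragment))


old = "https://search.staging.hexdocs.pm/collections/hexdocs-prod/documents/search?q=mua"
print(to_short_url(old))
```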

@ruslandoga
Contributor

ruslandoga commented Jan 12, 2025

I wonder if it also makes sense to expose the POST /multi_search endpoint, for cases where the query string and package filters don't fit within URL length limits? Assuming a 2KB URL length limit and the URL structure as in #49 (comment), the "largest" possible search query would contain about 90 package filters.

Calculations

Assuming search URLs of this format:

https://search.hexdocs.pm/?p[phoenix]=1.5.6&p[gleam]=1.3.4&q=hello+world

And assuming average package name length of about 12:

$ head autocomplete.ndjson
{"package":"aasm","ref":"AASM.html","title":"AASM","type":"module"}
{"package":"aasm","ref":"AASM.html#aasm/2","title":"AASM.aasm/2","type":"macro"}
{"package":"aasm","ref":"AASM.html#event/3","title":"AASM.event/3","type":"macro"}
{"package":"aasm","ref":"AASM.html#state/1","title":"AASM.state/1","type":"macro"}
{"package":"aba","ref":"ABA.html","title":"ABA","type":"module"}
{"package":"aba","ref":"ABA.html#get_bank/1","title":"ABA.get_bank/1","type":"function"}
{"package":"aba","ref":"ABA.html#routing_number_valid?/1","title":"ABA.routing_number_valid?/1","type":"function"}
{"package":"aba","ref":"ABA.Bank.html","title":"ABA.Bank","type":"module"}
{"package":"aba","ref":"ABA.Bank.html#bank/2","title":"ABA.Bank.bank/2","type":"function"}
{"package":"aba","ref":"ABA.Bank.html#t:t/0","title":"ABA.Bank.t/0","type":"type"}

$ jq -r '.package' autocomplete.ndjson | sort -u | head
a11y_audit
aa
aasm
aba
aba_validator
abac
abacus
abacus_sql
abatap
abbrev

$ jq -r '.package' autocomplete.ndjson | sort -u | wc -l
8586

$ jq -r '.package' autocomplete.ndjson | sort -u | awk '{ total += length; count++ } END { if (count > 0) print total / count; else print 0 }'
11.3109

It would require about 90 package filters to hit the URL length limit.

iex> avg_pkg_name_length = 12
12

iex> avg_version_length = byte_size "1.2.3"
5

iex> extra = byte_size "p[]=&" # p[avg_pkg_name]=avg_version&
5

iex> avg_filter_length = (avg_pkg_name_length + avg_version_length + extra)
22

iex> constant = byte_size "https://search.hexdocs.pm/"
26

# almost maximum
iex> avg_filter_length * 90 + constant
2006

Note

A fresh Phoenix project contains 39 dependencies (~package filters) + 1 (if the app itself is ExDoc-searchable): https://github.com/ruslandoga/phx-example/blob/master/mix.lock

Warning

Plausible has 157 dependencies: https://github.com/plausible/analytics/blob/master/mix.lock
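The same estimate can be applied to the dependency counts mentioned above. This is a sketch under the assumptions already stated in this comment (2KB limit, average name length of 12, versions like "1.2.3"); the function names are hypothetical. It ignores the `q=` parameter and any percent-encoding of the brackets, both of which would make real URLs longer.

```python
URL_LIMIT = 2048  # the 2KB limit assumed in this comment
BASE = "https://search.hexdocs.pm/"


def per_filter_length(avg_name=12, avg_version=5):
    """Bytes used by one "p[<name>]=<version>&" filter."""
    return avg_name + avg_version + len("p[]=&")


def estimated_url_length(n_packages):
    """Estimated URL length for n package filters, without the q parameter."""
    return len(BASE) + per_filter_length() * n_packages


def max_filters(limit=URL_LIMIT):
    """Largest number of average-sized filters that stays within the limit."""
    return (limit - len(BASE)) // per_filter_length()


for label, n in [("estimate", 90), ("fresh Phoenix", 40), ("Plausible", 157)]:
    length = estimated_url_length(n)
    print(label, n, length, "fits" if length <= URL_LIMIT else "too long")

print("max filters:", max_filters())
```

Under these assumptions a fresh Phoenix project fits comfortably, while a project the size of Plausible would not fit in a single GET URL.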


Example POST /multi_search request from the recipes demo on https://typesense.org:

 curl 'https://qtg5aekc2iosjh93p-3.a1.typesense.net/multi_search?x-typesense-api-key=8hLCPSQTYcBuK29zY5q6Xhin7ONxHy99' \
  -X 'POST' \
  -H 'Content-Type: text/plain' \
  -H 'Pragma: no-cache' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Sec-Fetch-Site: cross-site' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  -H 'Cache-Control: no-cache' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Accept-Encoding: gzip, deflate, br' \
  -H 'Origin: https://typesense.org' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.2 Safari/605.1.15' \
  -H 'Content-Length: 158' \
  -H 'Referer: https://typesense.org/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Priority: u=3, i' \
  --data-binary '{"searches":[{"query_by":"title","prioritize_exact_match":false,"highlight_full_fields":"title","collection":"r","q":"steamedsteamed","page":1,"per_page":5}]}' | unzip
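For our case, a POST body carrying many package filters might look like the sketch below. Several things here are assumptions: the collection name `hexdocs-prod` and `query_by` value are taken from the earlier example request in this thread, the `package:=[a,b]` list form is my understanding of Typesense's multi-value exact-match filter syntax, and `multi_search_payload` is a hypothetical helper. It only builds the body; nothing is sent.

```python
import json


def multi_search_payload(query, packages, collection="hexdocs-prod"):
    """Build a /multi_search body with one search filtered to the given packages."""
    # Assumed syntax: exact match against any of the listed package values.
    filter_by = "package:=[" + ",".join(packages) + "]"
    return json.dumps({
        "searches": [{
            "collection": collection,
            "q": query,
            "query_by": "title,doc",
            "filter_by": filter_by,
        }]
    })


payload = multi_search_payload("mua", ["swoosh-1.17.5", "phoenix-1.5.6"])
print(len(payload.encode("utf-8")), payload)
```

Since the filters travel in the request body, their number is no longer bounded by the URL length limit discussed above.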
