Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supporting OpenSearch & ElasticSearch #311

Closed
wants to merge 1 commit into from

Conversation

everplays
Copy link
Contributor

o/. I just wanted to run this by you before we push any further with it.

Background: we run AWS and we prefer to use managed services as much as possible. As you know, ElasticSearch is not available as an AWS managed service anymore and they offer OpenSearch. So we were wondering if we could get the basic functionality working with OpenSearch and the result is the commit that comes with this PR.

However, obviously, I understand that you would not want to move away from ElasticSearch in your own infrastructure. Hence, I would like to discuss the possibility of supporting both engines in the code. That way, we do not have to fork the code and maintain it ourselves.

After sorting out OpenSearch, we want to help with or implement incremental scans.

rel: #113

@pudo
Copy link
Member

pudo commented Aug 14, 2023

Hiya, thanks a ton for making this PR! As you can see from the age of the ticket, it's really something we've been dragging our heels on.

The problem I have is that both us and a few of our customers are using Elastic cloud, the hosted offering by Elastic. Which - you guessed it - uses some sort of extra auth mechanism (cf. opensearch-project/opensearch-py#104). So I'm a little worried that switching to this lib will make it impossible to connect to our cluster.

I think there's a couple of options:
a) I'm going to do some research on whether connecting opensearch-py to Elastic Cloud is a thing
b) We could support both libraries (because the world is an ugly place) via a wrapper library that equalises their call signatures. Would you perhaps be amenable to scope that out? It'll require much more testing etc. and is generally a bit of a bad solution.
c) ?? Any other ideas?

@pudo
Copy link
Member

pudo commented Aug 14, 2023

Hm, there's one more option, which I guess is to "crack" the ES client lib to remove the OpenSearch check.

@everplays
Copy link
Contributor Author

I kinda think option B would be ideal here because I can see that these two diverging more and more. Both on the library side and the engine side. What I am trying to say is that: we might be able to get away with using opensearch-py or tricking elastic's client. However, it will be a temporary solution.

If you are happy with option B, I would get into implementing it*. Also, I only checked stuff up to index and match. If there are more places that I should check to make sure that they work, please let me know.

* any preferences about how? a base class with two subclasses and an env var to pick between them?

@everplays everplays force-pushed the opensearch branch 2 times, most recently from 4db173a to 753d6c7 Compare August 15, 2023 15:11
@pudo
Copy link
Member

pudo commented Aug 16, 2023

Hey! After some reflection I also agree that B's the way to go. Both ES and OS are very vocal that these are starting to be two different applications, and so the layering will have to grow deeper and deeper over time.

Regarding how: I guess class-based makes a lot of sense. One thing we should try and do is to wrap the exceptions that are being thrown and trade them in for our own SearchException so that we can still catch throwies at the appropriate point later.

If you do indexing and /match, I'm very happy to take care of /search and fetch (/entities).

@everplays
Copy link
Contributor Author

everplays commented Aug 16, 2023

Great. This is gonna be my todo list:

  • get yente to run on ECS with an OpenSearch cluster.
  • implement solution B.
  • get github actions to run the tests against both OpenSearch and ElasticSearch.

Just an initial implementation to see if it is possible to support both
OpenSearch and ElasticSearch in the upstream project itself.
@everplays
Copy link
Contributor Author

ok, after a bit of focusing on other stuff, I can work on this again. In the meantime, we have been running yente on ECS with an OpenSearch cluster. No issues and everything works as expected. However, I definitely like to wrap up what we discussed and upstream OS' support. Not a fan of running forks.

@kizmanj
Copy link

kizmanj commented Jan 22, 2024

We're in a similar situation.
We already use Managed Opensearch in our infrastructure, so not having to manage an Elasticsearch instance for yente would be a bonus.
Having a flag to toggle between ES and OS defaulting to the existing behavior would be cool.

@pudo
Copy link
Member

pudo commented Jul 17, 2024

Hello all! We've now implemented OpenSearch support in main and are looking for some beta testers of this. Please let us know if you're interested in helping us stabilize this for release by running it in your targets environments for a day or two :)

@pudo pudo closed this Jul 17, 2024
@everplays
Copy link
Contributor Author

Thank you @pudo. We are definitely going to take a stab at it and will get back with patches if we notice any issues. :-)

@kizmanj
Copy link

kizmanj commented Jul 22, 2024 via email

@pudo
Copy link
Member

pudo commented Jul 24, 2024

Here's the announcement: https://www.opensanctions.org/articles/2024-07-24-yente4/

@kizmanj
Copy link

kizmanj commented Jul 25, 2024 via email

@pudo
Copy link
Member

pudo commented Jul 25, 2024

That may be an issue with the documentation rather than the service: if you set YENTE_OPENSEARCH_SERVICE and YENTE_OPENSEARCH_REGION then it should be using whatever boto-compatible credentials it can find in the environment - I assume this also includes workload identity inside EKS. Worth a test?

@kizmanj
Copy link

kizmanj commented Jul 25, 2024 via email

@kizmanj
Copy link

kizmanj commented Aug 21, 2024 via email

@kizmanj
Copy link

kizmanj commented Oct 18, 2024

Tried 4.1 release, and getting a different error now:

Traceback (most recent call last):
File "/venv/bin/yente", line 33, in
sys.exit(load_entry_point('yente', 'console_scripts', 'yente')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/yente/cli.py", line 44, in reindex
asyncio.run(update_index(force=force))
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/app/yente/search/indexer.py", line 188, in update_index
async with with_provider() as provider:
File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/app/yente/provider/init.py", line 47, in with_provider
provider = await _create_provider()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/yente/provider/init.py", line 25, in _create_provider
return await OpenSearchProvider.create()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/yente/provider/opensearch.py", line 54, in create
await es.cluster.health(wait_for_status="yellow", timeout=5)
File "/venv/lib/python3.12/site-packages/opensearchpy/_async/client/cluster.py", line 131, in health
return await self.transport.perform_request(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/opensearchpy/_async/transport.py", line 375, in perform_request
await self._async_call()
File "/venv/lib/python3.12/site-packages/opensearchpy/_async/transport.py", line 198, in _async_call
await self._async_init()
File "/venv/lib/python3.12/site-packages/opensearchpy/_async/transport.py", line 163, in _async_init
self.set_connections(self.hosts)
File "/venv/lib/python3.12/site-packages/opensearchpy/transport.py", line 255, in set_connections
connections = list(zip(map(_create_connection, hosts), hosts))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/opensearchpy/transport.py", line 253, in _create_connection
return self.connection_class(metrics=self.metrics, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/opensearchpy/_async/http_aiohttp.py", line 149, in init
self.headers.update(urllib3.make_headers(basic_auth=http_auth))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/urllib3/util/request.py", line 121, in make_headers
] = f"Basic {b64encode(basic_auth.encode('latin-1')).decode()}"
^^^^^^^^^^^^^^^^^
AttributeError: 'AWSV4SignerAuth' object has no attribute 'encode'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants