[Discussion] Support Sycamore as a Python extension #62
Comments
Thanks for creating this @austintlee , curious what your plans are regarding Sycamore integration?
Love it. Let us know how we can help. Let's move this into opensearch-sdk-py.
The first thing to discuss is the benefit of an extension vs. just a separate server using the OpenSearch Python Client to send REST layer requests. Transport can be faster, but also from a security standpoint, our assumption has generally been that OpenSearch has initiated most requests. Extensions do have the ability to trigger transport actions on remote extensions and theoretically on OpenSearch as well, but we have not fully integrated that capability pending a more careful design review. It would be helpful to this discussion to know how Sycamore currently intends to interact with OpenSearch.
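For comparison, the non-extension alternative mentioned above (a separate server talking to OpenSearch over REST) is straightforward with the opensearch-py client. A minimal sketch, assuming a local cluster and a hypothetical index name and document shape:

```python
# Sketch of the standalone-server alternative: push processed documents to
# OpenSearch over REST with opensearch-py. The index name ("sycamore-demo")
# and the document fields are hypothetical placeholders.

def to_bulk_actions(docs, index="sycamore-demo"):
    """Convert processed documents into opensearch-py bulk action dicts."""
    return [
        {"_index": index, "_id": doc["id"], "_source": {"text": doc["text"]}}
        for doc in docs
    ]

def ingest(docs, hosts=None):
    """Send the documents in one bulk request (requires a reachable cluster)."""
    from opensearchpy import OpenSearch, helpers

    client = OpenSearch(hosts=hosts or [{"host": "localhost", "port": 9200}])
    helpers.bulk(client, to_bulk_actions(docs))
```

The trade-off discussed in this thread is exactly that this path gets no cluster state, settings, or transport-layer access; it is just another REST client.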
@saratvemulapalli I'm sure you've got some input here too.
I think the obvious advantage of a plugin/extension is that it can be deployed inside a cluster and have access to the cluster settings/configuration. Conceptually, Sycamore is a pre-processing system, so more like Data Prepper, but when it comes to deploying it, users want something like the AWS ingestion service, aka install and don't worry about it. I think extensions could be a nice way to package something like that and deploy it with OpenSearch, on OpenSearch infrastructure, including in a 1-node setup.
I'm thinking about it as a potential extension that can be embedded either in Data Prepper or in an ingestion pipeline. The same way we rewrite payloads into vectors, we could add the chunking and segmentation phases as "analysis," much the way Lucene does internally today, with the stated goal of making ingestion of data simpler for the user, with minimal external scripting and client complexity.
@austintlee Sycamore aligns with Data Prepper, which we currently support in OpenSearch. It would be great for you to have conversations with @arjunnambiartc on this.
Honestly, the feature feels like a plugin to Data Prepper; it really does not need OpenSearch/extensions features.
Data Prepper is a tool that is a big building block of a feature of the AWS managed service called the ingestion service. I think the question is "how do I get Data Prepper or Sycamore bundled with OpenSearch, without standing up dedicated infrastructure", and the answer may very well be to build an extension! An analogy: in VSCode I can use external tools and services alike via an extension.
@minalsha @saratvemulapalli w.r.t. the comparisons to Data Prepper, Sycamore is conceptually another data prepping system (we say exactly that in the intro), but its focus is really on unstructured data and GenAI/LLM-powered use cases (applying GenAI to semantics extraction from unstructured data). Looking at some of the examples in the Data Prepper repo, I see that Data Prepper is very much geared toward log analytics. Another big difference between the two systems is the runtime: Sycamore uses Ray to scale out workloads. I suppose you could tweak Data Prepper to work like Sycamore and vice versa, but I don't think we want to have that discussion here. I'll be happy to have it in a separate thread.

It looks like the real discussion that is emerging here is whether or not there is a real use case to run an ingestion sidecar as an extension. I don't know if it has to be a Python extension necessarily. We could argue that Python (on Ray) is a good choice if an extension is the right choice in the first place. I am guessing that the Ingest Pipeline is also not a good solution since it relies on plugins to inject processors. If we run
From my POV, the main difference would be how the data arrives in Sycamore before it does its magic. If it's only interacting with OpenSearch via REST (and the OpenSearch Python Client), then an extension brings little value. If the intent is to process data already ingested into OpenSearch, then there's more value in integration.
@dbwiddis Extensions have access to the transport layer just like plugins, right?
Do you have an example of this out in the wild? Are there users who have ingested PDFs (or other types of unstructured data) into OpenSearch and are somehow querying against them as blobs? Or are you thinking of something like a re-indexing use case where you are using OpenSearch as a vector database and you want to use Sycamore to re-process and re-index your vectors (this will require keeping the originals on the cluster, or paths/URIs to the originals, so Sycamore can discover them)? This might be an interesting use case, but I am not sure how big a problem this is.
My comment was intended more as a theoretical "where is there added value", but yes, the idea is if some sort of data is already in OpenSearch and you want to process it, an extension could implement processing in Python. But if the primary interaction with OpenSearch is simply ingesting via the python client, then one could do that without making it an extension.
I don't have a specific example, but a brief perusal of the internet found a relatively common set of steps where one would use Apache Tika to parse a PDF and insert those results as part of an ingestion process. It's a widespread-enough pattern that I would suspect there are likely many such cases.
IMHO, indexing PDFs as blobs and then reindexing will not be a common use case for ingestion; it might be great down the line, but without immediate ROI.
Wouldn't you find it valuable to have a 1-click install extension that exposes a new API such as |
Yes, there will definitely be value there. But the question I raise is: at which stage is the attachment being processed to make it searchable? It seems more intuitive to me to do it at the ingestion stage, as opposed to doing it asynchronously at a later stage via reindexing.
I would leverage job scheduler to queue a job, and create a document with the job ID in it, then return. Of course other options are possible, such as a different API that returns the queued job.
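The queue-and-return pattern described above could be sketched roughly like this. This is an in-memory stand-in only; the queue, the "status index", and all function names are hypothetical, and in practice the job scheduler and an OpenSearch index would fill these roles:

```python
import uuid
from collections import deque

# Hypothetical sketch of the pattern: enqueue the ingest work, write a
# status "document" keyed by job id, and return the id to the caller
# immediately so the request doesn't block on the (slow) ingest run.

job_queue = deque()
status_index = {}  # stands in for an OpenSearch index of job-status docs

def submit_ingest_job(script_uri, input_path, index):
    job_id = str(uuid.uuid4())
    job_queue.append({"job_id": job_id, "script": script_uri,
                      "input_path": input_path, "index": index})
    status_index[job_id] = {"state": "QUEUED"}
    return job_id  # caller polls the status doc with this id

def run_next_job():
    job = job_queue.popleft()
    status_index[job["job_id"]]["state"] = "RUNNING"
    # ... invoke the actual ingest/processing work here ...
    status_index[job["job_id"]]["state"] = "DONE"
```

A different API that returns the queued job, as suggested, would just return the whole job dict instead of the id.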
@dblock any specific reason why you want to do async as opposed to integrate with ingestion pipeline? I didn't look yet in depth on extensions, is it just easier to leverage extensions via job scheduler at the moment (as opposed to ingest pipeline)? or is it something you think will have other benefits experience/design etc? |
@samuel-oci I think an ingestion pipeline is a great option! |
Hmm.. Python extensions would only work as REST or transport endpoints, right? We would need to make changes to IngestService to turn invocations on Processors into RPC calls over the transport layer. |
Yes. I tried the
I think Python extensions should replicate whatever the ingest plugins do now. I don't know how it works underneath 😅
Probably overthinking it. If you can call out to an extension from that pipeline, then we've effectively remoted the implementation (extensions can run remotely).
Presently they are implemented with OpenSearch as transport-only. However, as the extension is independent of the OpenSearch cluster, one could create any needed endpoints on it for any needed protocol.
This is one of a class of extension points where an actual Java object (or a factory for them) is communicated to OpenSearch. The object lives in the JVM and processes the data stream there. Obviously, if we want to do processing in Python, that won't work directly.

@owaiskazi19 has described much of the similar process for the Language Analyzer plugins in opensearch-project/opensearch-sdk-java#766, which is the next direction the Java SDK was starting to move, so we've definitely done some thinking about this approach. While the exact implementation is different, the concept is the same (process a stream of tokens). The three options described in that issue assume a Java analyzer, but only the first one would be relevant to a Python extension.

Implementing the ingest processor extension point would be similar to how extensions register their REST actions: there's a single Java object on the OpenSearch side that serves as the "processor" to register with the needed Java-side code, but internally, when it is called, it does whatever it needs to execute remotely. In the REST action case, it forwards the request and receives the response over transport. We could do the same send/receive over transport, or we could even experiment with other protocols.
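The "proxy processor" idea above can be illustrated with a small sketch. Everything here is hypothetical (the class name, the request shape, and `send_over_transport`, which stands in for the real transport round trip); it only shows the delegation shape, not the actual SDK API:

```python
# Conceptual sketch: the object registered with OpenSearch implements the
# processor interface, but process() just forwards the document to the
# remote extension and returns the transformed result.

class RemoteIngestProcessor:
    def __init__(self, send_over_transport):
        # Callable performing the OpenSearch -> extension round trip.
        self._send = send_over_transport

    def process(self, document):
        # Delegate the actual work to the remote (Python) extension.
        return self._send({"action": "ingest/process", "doc": document})

# A fake "extension side" handler, e.g. naive sentence chunking of text
# before indexing, standing in for real Sycamore-style processing.
def extension_handler(request):
    doc = dict(request["doc"])
    doc["chunks"] = doc["text"].split(". ")
    return doc
```

In a real deployment `send_over_transport` would serialize the request, send it over the transport protocol, and block on (or await) the response, exactly as the REST-action forwarding does today.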
Right. So the logic in IngestService that consumes the Processor definitions in Java would need to change to delegate processing tasks to a remote endpoint over the transport layer.
Exactly.
Given the existing "extension" implementation, yes. The linked issue above also explores the idea of using serverless endpoints (Lambdas, Azure Functions, Cloud Functions, etc.). Which admittedly aren't what this repo is about, but I at least wanted to present the option! |
@dblock Thanks for trying out and writing about the attachment plugin! Whether you intended it or not, that (at least to me) is a huge selling point for Python extensions. The moment you want to introduce dependencies such as Unstructured.io for full-featured document partitioning support or doing any vector embeddings or using LLMs (even locally hosted ones), you would not want to build all of that into the plugin or into Tika. So, here's what I would do. I would implement an Action extension that exposes Sycamore more or less as an
The script, input_path, and index name can all be input parameters. The script can be supplied as an S3 URI or a local path. I think we can also support scheduling (submitting for future execution or as a cron job). @samuel-oci what do you think? I would like to gauge interest in this proposal from the wider community before committing any work.

@dbwiddis if I want my Sycamore ingest nodes to play nicely with Ray (Sycamore's runtime for distributed computing) and its autoscaling, how might I achieve that? Basically, when a new Sycamore ingest node joins (or leaves), I want to feed that into Ray to scale up or down as necessary.
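The proposed endpoint's parameter handling could look something like the sketch below. This is a guess at the request shape, not a committed design; the parameter names come from the proposal above, while `schedule` and the handler itself are hypothetical:

```python
# Hypothetical sketch of the proposed Action extension endpoint: validate
# the parameters named in the proposal (script, input_path, index) and
# build a job spec for whatever launches the Sycamore run.

REQUIRED_PARAMS = ("script", "input_path", "index")

def handle_ingest_request(body):
    missing = [p for p in REQUIRED_PARAMS if p not in body]
    if missing:
        return {"status": 400, "error": f"missing parameters: {missing}"}
    job = {
        "script": body["script"],          # S3 URI or local path
        "input_path": body["input_path"],  # where the source documents live
        "index": body["index"],            # target OpenSearch index
        "schedule": body.get("schedule"),  # optional future/cron execution
    }
    return {"status": 202, "job": job}
```

Returning 202 (accepted) fits the asynchronous, job-scheduler-backed flow discussed earlier in the thread.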
I haven't actually looked into the specifics of Ray. In some of my early work doing distributed ML (in Java), I used JPPF, which looks like a similar concept (a head node distributing tasks to worker nodes). My gut instinct would be to make the extension node the "head node" that coordinates any needed scaling.
@dbwiddis Sorry, I wasn't being very clear. How does my extension instance running on a node learn about the instances of the same extension running on other nodes that are part of the same cluster? Do all extensions get the cluster state change publications, or do we need to implement that too?
The extension nodes aren't really part of the cluster (although you could run one on a node that's part of a cluster, it'd be a separate process with its own port). Depending on what the extension does, you may not need multiple instances of it.
We have not (yet?) implemented a pubsub model for cluster state. We do have a cluster state request implemented for an extension to query the state, but that also introduces latency concerns if you query it often just for the updates.
I see. I guess I misunderstood this statement from dblock:
We have implemented access to the settings, and a settings update consumer that gets realtime settings updates relevant to the extension. On startup an extension can get a dump of the environment settings, register its own, and it can register for updates on any setting it cares about. We have not done that for every single element of the cluster state, however. We can grab the whole cluster state when we want it, at the cost of tons of data we don't need. Or we can add new handlers to request bits and pieces of it as they are needed... as a new feature if it's a common use case.
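The register-then-get-updates pattern described above can be sketched in miniature. The class and setting key here are hypothetical, not the actual opensearch-sdk-py API; the point is only the fan-out shape (an extension registers callbacks for the settings it cares about, and only matching updates reach it):

```python
# Minimal sketch of a settings-update consumer: callbacks are registered
# per setting key, and a cluster settings change dispatches only the keys
# this extension asked about.

class SettingsUpdateConsumer:
    def __init__(self):
        self._callbacks = {}  # setting key -> list of callbacks

    def register(self, key, callback):
        self._callbacks.setdefault(key, []).append(callback)

    def on_cluster_settings_changed(self, updates):
        # Fan out only the keys this extension registered for;
        # everything else in the update is ignored.
        for key, value in updates.items():
            for cb in self._callbacks.get(key, []):
                cb(value)
```

This is also why per-setting registration is cheaper than polling the full cluster state: the extension never sees the data it didn't ask for.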
So ultimately, I think that is the main thing to consider. If you don't need any information from the cluster and you just want to process ingestion and use the
Relevant (closed) PR regarding getting pieces of cluster state over transport: opensearch-project/OpenSearch#7066
One of the issues we frequently run into, and I am sure this is a common one, is achieving well-balanced, high-throughput bulk ingestion using the OpenSearch client (Python or Java). One of the areas I want to explore with the Python extension is whether we can leverage state within the cluster (e.g., back pressure) to improve ingestion performance. Also, because I am looking at this as a way to solve something that requires a coordinated task, my questions about cluster state are meant to better understand how extension instances can work together (just like ML nodes and ingest nodes).
Funny, I have a few feature requests on that: opensearch-project/opensearch-java#453 and opensearch-project/opensearch-api-specification#156
Would love to ideate on how that will look, but our extensions work somewhat ended with a single-node version, and we haven't had any bandwidth to work on multi-node support. It's on the roadmap... somewhere in the future. Unless some external contributor picks it up and runs with it.
@austintlee I think it's a great start and would definitely provide a path for users to more easily leverage Sycamore as part of their ingest flow. Getting feedback from the wider community is definitely a good idea before doing any work, especially regarding user preference for the ingestion experience.
@dbwiddis What do we need to do to add support for SSL?
Port the equivalent of these classes
Is your feature request related to a problem? Please describe.
Sycamore is a semantic data preparation system that makes it easy to transform and enrich your unstructured data and prepare it for search applications (https://github.com/aryn-ai/sycamore). It was announced at OpenSearchCon 2023 and it currently supports ingesting processed documents (including text embeddings) into OpenSearch.
In the talk on the OpenSearch Python extension given at the OSC, there was also mention of Sycamore as a possible use case for the Python extension feature.
We at Aryn had not considered this possibility during our initial development, but after hearing and fielding interest from the OpenSearch community, we created this issue as a place to capture and discuss use cases, requirements and ideas.
Additional context
https://github.com/aryn-ai/sycamore