
[Discussion] Support Sycamore as a Python extension #62

Open
austintlee opened this issue Oct 3, 2023 · 44 comments
Labels
enhancement New feature or request

Comments

@austintlee

austintlee commented Oct 3, 2023

Is your feature request related to a problem? Please describe.
Sycamore is a semantic data preparation system that makes it easy to transform and enrich your unstructured data and prepare it for search applications (https://github.com/aryn-ai/sycamore). It was announced at OpenSearchCon 2023 and it currently supports ingesting processed documents (including text embeddings) into OpenSearch.

In the talk on the OpenSearch Python extension given at the OSC, there was also mention of Sycamore as a possible use case for the Python extension feature.

We at Aryn had not considered this possibility during our initial development, but after hearing and fielding interest from the OpenSearch community, we created this issue as a place to capture and discuss use cases, requirements and ideas.

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
https://github.com/aryn-ai/sycamore

@austintlee austintlee added enhancement New feature or request untriaged Issues that require attention from the maintainers. labels Oct 3, 2023
@sam-herman

Thanks for creating this @austintlee. Curious, what are your plans regarding the Sycamore integration?
Long term, what do you think would be the right thing to do? Take the DJL approach or try leveraging extensions?

@austintlee
Author

@dblock @dbwiddis

@dblock
Member

dblock commented Oct 4, 2023

Love it. Let us know how we can help. Let's move this into opensearch-sdk-py.

@dblock dblock transferred this issue from opensearch-project/OpenSearch Oct 4, 2023
@dbwiddis
Member

dbwiddis commented Oct 4, 2023

we created this issue as a place to capture and discuss use cases, requirements and ideas.

The first thing to discuss is the benefit of an extension vs. just a separate server using the OpenSearch Python Client to send REST layer requests.

Transport can be faster, but from a security standpoint, our assumption has generally been that OpenSearch initiates most requests. Extensions do have the ability to trigger transport actions on remote extensions, and theoretically on OpenSearch as well, but we have not fully integrated that capability pending a more careful design review.

It would be helpful to this discussion to know how Sycamore currently intends to interact with OpenSearch.

@dbwiddis
Member

dbwiddis commented Oct 4, 2023

@saratvemulapalli I'm sure you've got some input here too.

@dblock
Member

dblock commented Oct 5, 2023

I think the obvious advantage of a plugin/extension is that it can be deployed inside a cluster and have access to the cluster settings/configuration. Conceptually, Sycamore is a pre-processing system, so more like Data Prepper, but when it comes to deploying it, users want something like the AWS ingestion service, i.e. install it and don't worry about it. I think extensions could be a nice way to package something like that and deploy it with OpenSearch, on OpenSearch infrastructure, including in a 1-node setup.

@sam-herman

sam-herman commented Oct 5, 2023

I'm thinking about it as a potential extension that can be embedded either in Data Prepper or in the ingestion pipeline. The same way we rewrite payloads into vectors, we can add the chunking and segmentation phases as "analysis", much as Lucene does internally today.

The stated goal is to make ingestion of data simpler for the user, with minimal external scripting and client complexity.

@minalsha

minalsha commented Oct 5, 2023

@austintlee Sycamore aligns with Data Prepper, which we currently support in OpenSearch. It would be great for you to have conversations with @arjunnambiartc on this.

@saratvemulapalli
Member

saratvemulapalli commented Oct 5, 2023

Honestly, this feature feels like a plugin to Data Prepper; it does not really need OpenSearch/extensions features.
@austintlee what are you really looking for?

@dblock
Member

dblock commented Oct 5, 2023

Data Prepper is a tool that is a big building block of a feature of the AWS managed service called the ingestion service. I think the question is "how do I get Data Prepper or Sycamore bundled with OpenSearch, without standing up dedicated infrastructure", and the answer may very well be to build an extension!

An analogy is that in VSCode I can use external tools and services alike via an extension.

@austintlee
Author

@minalsha @saratvemulapalli w.r.t. the comparisons to Data Prepper, Sycamore is conceptually another data prepping system (we say exactly that in the intro), but its focus is really on unstructured data and GenAI/LLM-powered use cases (applying GenAI to semantics extraction from unstructured data). Looking at some of the examples in the Data Prepper repo, I see that Data Prepper is very much geared toward log analytics. Another big difference between the two systems is the runtime: Sycamore uses Ray to scale out workloads. I suppose you could tweak Data Prepper to work like Sycamore and vice versa, but I don't think we want to have that discussion here. I'll be happy to have it in a separate thread.

It looks like the real discussion emerging here is whether there is a genuine use case for running an ingestion sidecar as an extension. I don't know that it necessarily has to be a Python extension. We could argue that Python (on Ray) is a good choice, if an extension is the right choice in the first place. I am guessing that the Ingest Pipeline is also not a good solution, since it relies on plugins to inject processors. If we run ingest nodes as (Python) extensions, we could kill two birds with one stone.

@dbwiddis
Member

dbwiddis commented Oct 7, 2023

From my POV, the main difference would be how the data arrives in Sycamore before it does its magic. If it's only interacting with OpenSearch via REST (and the OpenSearch Python Client), then an extension brings little value. If the intent is to process data already ingested into OpenSearch, then there's more value in integration.

@austintlee
Author

@dbwiddis Extensions have access to the transport layer just like plugins, right?

@austintlee
Author

If the intent is to process data already ingested into OpenSearch then there's more value in integration.

Do you have an example of this out in the wild? Are there users who have ingested PDFs (or other types of unstructured data) into OpenSearch and are somehow querying against them as blobs?

Or are you thinking of something like a re-indexing use case where you are using OpenSearch as a vector database and you want to use Sycamore to re-process and re-index your vectors (this will require keeping the originals on the cluster or paths/URIs to the originals so Sycamore can discover them). This might be an interesting use case, but I am not sure how big a problem this is.

@dbwiddis
Member

dbwiddis commented Oct 9, 2023

Or are you thinking of something like a re-indexing use case

My comment was intended more as a theoretical "where is there added value", but yes, the idea is if some sort of data is already in OpenSearch and you want to process it, an extension could implement processing in Python.

But if the primary interaction with OpenSearch is simply ingesting via the python client, then one could do that without making it an extension.

Do you have an example of this out in the wild? Are there users who have ingested PDFs (or other types of unstructured data) into OpenSearch and are somehow querying against them as blobs?

I don't have a specific example, but a brief perusal of the internet found a relatively common set of steps where one would use Apache Tika to parse a PDF and insert the results as part of an ingestion process. It's a widespread-enough pattern that I suspect there are many such cases.

@sam-herman

sam-herman commented Oct 9, 2023

IMHO indexing PDFs as blobs and then reindexing will not be a common use case for ingestion; it might be great down the line, but there's no immediate ROI.
I don't have quantitative data to support this statement, but I would imagine a more common, well-known use case is the ingestion of such docs via the attachment plugin, which leverages libraries like Tika. In that case, Sycamore seems to me like another potential extension of this capability.
To @dblock's and @austintlee's comments earlier, I agree that it seems overkill to spin up a new service (data prep) with more infra for the sole purpose of an in-place transformation into vectors, which in the current design is meant to be facilitated via processors. Generalizing processors into an extension with a different runtime environment (Python or whatever) does make a lot of sense to me.

@dblock
Member

dblock commented Oct 10, 2023

IMHO indexing PDF as blobs and then reindexing will not be a common use case for ingestion

Wouldn't you find it valuable to have a 1-click install extension that exposes a new API such as PUT /_doc/id with a PDF (XLS, whatever) multipart attachment instead of JSON as the document and then be able to search for it, or ask questions about it?

@sam-herman

sam-herman commented Oct 10, 2023

Wouldn't you find it valuable to have a 1-click install extension that exposes a new API such as PUT /_doc/id with a PDF (XLS, whatever) multipart attachment instead of JSON as the document and then be able to search for it, or ask questions about it?

Yes, there will definitely be value there. But the question I raise is: at which stage is the attachment being processed to make it searchable? It seems more intuitive to me to do it at the ingestion stage, as opposed to doing it async at a later stage via reindexing.
If done async with reindexing, how would it consolidate into a single call of PUT /_doc/id that makes the doc searchable?

@dblock
Member

dblock commented Oct 10, 2023

Wouldn't you find it valuable to have a 1-click install extension that exposes a new API such as PUT /_doc/id with a PDF (XLS, whatever) multipart attachment instead of JSON as the document and then be able to search for it, or ask questions about it?

Yes, there will definitely be value there. But the question I raise is: at which stage is the attachment being processed to make it searchable? It seems more intuitive to me to do it at the ingestion stage, as opposed to doing it async at a later stage via reindexing. If done async with reindexing, how would it consolidate into a single call of PUT /_doc/id that makes the doc searchable?

I would leverage job scheduler to queue a job, and create a document with the job ID in it, then return. Of course other options are possible, such as a different API that returns the queued job.

@sam-herman

Wouldn't you find it valuable to have a 1-click install extension that exposes a new API such as PUT /_doc/id with a PDF (XLS, whatever) multipart attachment instead of JSON as the document and then be able to search for it, or ask questions about it?

Yes, there will definitely be value there. But the question I raise is: at which stage is the attachment being processed to make it searchable? It seems more intuitive to me to do it at the ingestion stage, as opposed to doing it async at a later stage via reindexing. If done async with reindexing, how would it consolidate into a single call of PUT /_doc/id that makes the doc searchable?

I would leverage job scheduler to queue a job, and create a document with the job ID in it, then return. Of course other options are possible, such as a different API that returns the queued job.

@dblock any specific reason why you want to do this async, as opposed to integrating with the ingestion pipeline? I haven't yet looked in depth at extensions; is it just easier to leverage extensions via the job scheduler at the moment (as opposed to the ingest pipeline)? Or is it something you think will have other benefits in terms of experience/design, etc.?

@dblock
Member

dblock commented Oct 11, 2023

@samuel-oci I think an ingestion pipeline is a great option!

@austintlee
Copy link
Author

Hmm.. Python extensions would only work as REST or transport endpoints, right? We would need to make changes to IngestService to turn invocations on Processors into RPC calls over the transport layer.

@austintlee
Author

Alternatively, we would need the Python ingestion extension to come up as an "ingest" node so that coordinator nodes can route index and bulk actions to it.

@dblock @dbwiddis what do you guys think?

@dblock
Member

dblock commented Oct 11, 2023

You mean a Python version of https://github.com/opensearch-project/opensearch-sdk-java/blob/main/src/main/java/org/opensearch/sdk/api/IngestExtension.java?

Yes.

I tried the ingest-attachment plugin; here's a cookbook: https://code.dblock.org/2023/09/29/how-to-ingest-a-pdf-document-into-opensearch-with-ingest-attachment.html - it requires one to ingest the PDF as a base64-encoded string, which isn't a big deal, and then processes it server-side. I think we're saying we want a version of that as an extension in pure Python, which would make it very easy to do more than just parse an attachment: call out to LLMs, etc., using one of the myriad libraries available in Python.
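(For context, the base64-encoding step mentioned above is a one-liner client-side. A minimal sketch; the `data` field name follows the attachment-processor convention used in the cookbook, and the helper function is hypothetical:)

```python
import base64
import json

def build_attachment_doc(pdf_bytes: bytes) -> str:
    # The ingest-attachment pipeline reads the file contents from a
    # base64-encoded string field; "data" is the field name the pipeline
    # in the linked cookbook is configured to read.
    return json.dumps({"data": base64.b64encode(pdf_bytes).decode("ascii")})

# Round-trip check: the encoded body decodes back to the original bytes.
body = build_attachment_doc(b"%PDF-1.4 example bytes")
assert base64.b64decode(json.loads(body)["data"]) == b"%PDF-1.4 example bytes"
```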

Python extensions would only work as REST or transport endpoints, right?

I think python extensions should replicate whatever the ingest plugins do now. I don't know how it works underneath 😅

Alternatively, we would need the Python ingestion extension to come up as an "ingest" node so that coordinator nodes can route index and bulk actions to it.

Probably overthinking it. If you can call out to an extension from that pipeline, then we've effectively remoted the implementation (extensions can run remote).

@dbwiddis
Member

extensions would only work as REST or transport endpoints, right?

Presently they are implemented with OpenSearch as transport-only. However, as the extension is independent of the OpenSearch cluster, one could create any needed endpoints on it for any needed protocol.

I think python extensions should replicate whatever the ingest plugins do now. I don't know how it works underneath

This is one of a class of extension points where an actual Java object (or factory for them) is communicated to OpenSearch. The object lives in the JVM and processes the data stream there. Obviously if we want to do processing in Python that won't work directly.

@owaiskazi19 has described much of a similar process for the Language Analyzer plugins in opensearch-project/opensearch-sdk-java#766, which is the next direction the Java SDK was starting to move in, so we've definitely done some thinking about this approach. While the exact implementation is different, the concept is the same (process a stream of tokens).

The three options described in that issue assume a Java analyzer, but only the first one would be relevant to a Python extension.

Implementing the ingest processor extension point would be similar to how extensions register their REST actions: there's a single Java object on the OpenSearch side that serves as the "processor" and registers the needed Java-side code, but internally, when it is called, it does whatever it needs to execute remotely. In the REST action case, it forwards the request and receives the response over transport.

We could do the same send/receive over transport, or we could even experiment with other protocols.
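(The forwarding pattern described above can be sketched abstractly. Everything here is hypothetical — the class and function names are not an existing API, and a plain function call stands in for the transport round trip:)

```python
from typing import Callable

class RemoteProcessor:
    """Registers like a local ingest processor on the OpenSearch side, but
    forwards each document to a remote handler instead of processing in-JVM."""

    def __init__(self, transport_call: Callable[[dict], dict]):
        # transport_call stands in for serializing the document, sending it
        # over the transport layer to the extension, and reading the reply.
        self._transport_call = transport_call

    def execute(self, document: dict) -> dict:
        return self._transport_call(document)

# Hypothetical remote side: a Python extension that enriches the document.
def remote_handler(doc: dict) -> dict:
    enriched = dict(doc)
    enriched["processed_by"] = "python-extension"
    return enriched

processor = RemoteProcessor(remote_handler)
result = processor.execute({"title": "report.pdf"})
```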

@austintlee
Author

Right. So, the logic in IngestService which consumes the Processor definitions in Java would need to change to delegate processing tasks to a remote endpoint over the transport layer.

@dbwiddis
Member

Right. So, the logic in IngestService which consumes the Processor definitions in Java would need to change to delegate processing tasks to a remote endpoint

Exactly.

over the transport layer.

Given the existing "extension" implementation, yes. The linked issue above also explores the idea of using serverless endpoints (Lambdas, Azure Functions, Cloud Functions, etc.). Which admittedly aren't what this repo is about, but I at least wanted to present the option!

@austintlee
Author

@dblock Thanks for trying out the attachment plugin and writing about it! Whether you intended it or not, that (at least to me) is a huge selling point for Python extensions. The moment you want to introduce dependencies such as Unstructured.io for full-featured document partitioning, compute vector embeddings, or use LLMs (even locally hosted ones), you would not want to build all of that into the plugin or into Tika.

So, here's what I would do. I would implement an Action extension that exposes Sycamore more or less as an ingest node. The way Sycamore works is you give it a path (an S3 URI or a local path) and a Python script and it does its thing and bulk ingests the JSON outputs to OpenSearch. So, the action or the REST call would look something like this:

curl -X POST localhost:9200/_extensions/_sycamore/execute -H "Content-Type: application/json" -d
'{
"script": "
# Import and initialize the Sycamore library.
import sycamore
from sycamore.transforms.partition import UnstructuredPdfPartitioner
from sycamore.transforms.embed import SentenceTransformerEmbedder

context = sycamore.init()

# Read a collection of PDF documents into a DocSet.
doc_set = context.read.binary(paths=["/path/to/pdfs/"], binary_format="pdf")

# Segment the pdfs using the Unstructured partitioner.
partitioned_doc_set = doc_set.partition(partitioner=UnstructuredPdfPartitioner())

# Compute vector embeddings for the individual components of each document.
embedder=SentenceTransformerEmbedder(batch_size=100, model_name="sentence-transformers/all-MiniLM-L6-v2")
embedded_doc_set = partitioned_doc_set.explode() \
                                      .embed(embedder)

# Write the embedded documents to a local OpenSearch index.
os_client_args = {
    "hosts": [{"host": "localhost", "port": 9200}],
    "use_ssl":True,
    "verify_certs":False,
    "http_auth":("admin", "admin")
}
embedded_doc_set.write.opensearch(os_client_args, "my_index_name")"
}'

The script, the input path, and the index name can all be input parameters. The script can be supplied as an S3 URI or a local path. I think we can also support scheduling (submit for future execution or as a cron job).
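(To make the parameterization concrete, here is a sketch of how the proposed execute endpoint might validate its request body. The field names mirror the example above; the handler itself, its defaults, and the endpoint name are assumptions:)

```python
import json

REQUIRED_FIELDS = ("script",)
OPTIONAL_DEFAULTS = {"input_path": None, "index_name": "sycamore_index"}

def parse_execute_request(body: str) -> dict:
    # Validate the hypothetical /_extensions/_sycamore/execute payload:
    # the script is mandatory; input path and index name fall back to defaults.
    payload = json.loads(body)
    for field in REQUIRED_FIELDS:
        if field not in payload:
            raise ValueError(f"missing required field: {field}")
    params = dict(OPTIONAL_DEFAULTS)
    params.update(payload)
    return params

params = parse_execute_request('{"script": "print(1)", "index_name": "docs"}')
```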

@samuel-oci what do you think? I would like to gauge interest on this proposal from the wider community before committing any work.

@dbwiddis if I want my sycamore ingest nodes to play nice with Ray (Sycamore's runtime for distributed computing) and its autoscale, how might I achieve that? Basically, when a new Sycamore ingest node joins (or leaves), I want to feed that into Ray to scale up and down when necessary.

@dbwiddis
Member

@dbwiddis if I want my sycamore ingest nodes to play nice with Ray (Sycamore's runtime for distributed computing) and its autoscale, how might I achieve that? Basically, when a new Sycamore ingest node joins (or leaves), I want to feed that into Ray to scale up and down when necessary.

I haven't actually looked into the specifics of Ray... in some of my early work doing distributed ML (in Java), I used JPPF, which looks like a similar concept (a head node distributing tasks to worker nodes). My gut instinct would be to make the extension node the "head node" that coordinates any needed scaling.

@austintlee
Author

@dbwiddis Sorry, I wasn't being very clear. How does my extension instance running on a node learn about the instances of the same extension running on other nodes that are part of the same cluster? Do all extensions get the cluster state change publications or do we need to implement that too?

@dbwiddis
Member

How does my extension instance running on a node learn about the instances of the same extension running on other nodes that are part of the same cluster?

The extension nodes aren't really part of the cluster (you could run one on a node that's part of a cluster, but it'd be a separate process with its own port). Depending on what the extension does, you may not need multiple instances of it.

@dbwiddis
Member

Do all extensions get the cluster state change publications or do we need to implement that too?

We have not (yet?) implemented a pubsub model for cluster state. We do have a cluster state request implemented for an extension to query the state, but that also introduces latency concerns if you query it often just for the updates.
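(One way to bound the latency cost of polling, absent a pubsub model, is to cache the polled state with a TTL. A sketch; the `fetch` callable stands in for the actual cluster state transport request, and all names are illustrative:)

```python
import time

class CachedClusterState:
    """Poll the cluster state at most once per `ttl` seconds, serving the
    cached copy in between to limit transport round trips."""

    def __init__(self, fetch, ttl=5.0, clock=time.monotonic):
        self._fetch = fetch        # performs the transport request
        self._ttl = ttl
        self._clock = clock        # injectable for testing
        self._cached = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._cached is None or now - self._fetched_at >= self._ttl:
            self._cached = self._fetch()
            self._fetched_at = now
        return self._cached
```

A fresh fetch happens only when the cache is empty or older than the TTL; repeated calls in between are served locally.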

@austintlee
Author

I see. I guess I misunderstood this statement from dblock:

I think the obvious advantage of a plugin/extension is that it can be deployed inside a cluster and have access to the cluster settings/configuration.

@dbwiddis
Member

We have implemented access to the settings, and a settings update consumer that gets real-time settings updates relevant to the extension. On startup, an extension can get a dump of the environment settings, register its own, and register for updates on any setting it cares about.

We have not done that for every single element of the cluster state, however. We can grab the whole cluster state when we want it, at the cost of tons of data we don't need, or we can add new handlers to request bits and pieces of it as needed... as a new feature, if it's a common use case.

@dbwiddis
Member

So ultimately I think that is the main thing to consider. If you don't need any information from the cluster and you just want to process ingestion and use the opensearch-py client to index it, you don't need an extension. If you want to make decisions about your indexing based on something you can get from the cluster in real time, you might find value in making it an extension.

@dbwiddis
Member

Relevant (closed) PR regarding getting pieces of cluster state over transport: opensearch-project/OpenSearch#7066

@austintlee
Author

One of the issues we frequently run into, and I am sure this is a common one, is achieving well-balanced, high-throughput bulk ingestion using the OpenSearch client (Python or Java). One of the areas I want to explore with the Python extension is whether we can leverage state within the cluster (back pressure, e.g.) to improve ingestion performance. Also, because I am looking at this as a way to solve something that requires a coordinated task, my questions about cluster state are meant to better understand how extension instances can work together (just like ml nodes and ingest nodes).
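(As a sketch of what back-pressure-aware ingestion could look like on the client side, here is an AIMD-style batch sizer: shrink quickly on rejection signals such as 429s, grow slowly on success. The class name and thresholds are illustrative, not an existing API:)

```python
class AdaptiveBulkSizer:
    """Additive-increase / multiplicative-decrease sizing for _bulk batches."""

    def __init__(self, initial=500, minimum=50, maximum=5000, step=50, backoff=0.5):
        self.size = initial
        self._min, self._max = minimum, maximum
        self._step, self._backoff = step, backoff

    def record_success(self):
        # Grow additively while the cluster keeps up.
        self.size = min(self._max, self.size + self._step)

    def record_rejection(self):
        # Halve (by default) on back pressure, e.g. a 429 or rejected execution.
        self.size = max(self._min, int(self.size * self._backoff))

sizer = AdaptiveBulkSizer()
sizer.record_rejection()   # back pressure: 500 -> 250
sizer.record_success()     # recovery: 250 -> 300
```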

@dbwiddis
Member

dbwiddis commented Oct 12, 2023

One of the areas I want to explore with the Python extension is whether or not we can leverage the state (back pressure, e.g.) within the cluster to improve the ingestion performance.

Funny, I have a few feature requests on that.... opensearch-project/opensearch-java#453 and opensearch-project/opensearch-api-specification#156

@dbwiddis
Member

Also, because I am looking at this as a way to solve something that requires a coordinated task, my questions about cluster state are to better understand how extension instances can work together (just like ml nodes and ingest nodes).

Would love to ideate on how that will look, but our extensions work more or less ended with a single-node version, and we haven't had any bandwidth to work on multi-node support. It's on the roadmap... somewhere in the future. Unless some external contributor picks it up and runs with it....

@sam-herman

So, here's what I would do. I would implement an Action extension that exposes Sycamore more or less as an ingest node. The way Sycamore works is you give it a path (an S3 URI or a local path) and a Python script and it does its thing and bulk ingests the JSON outputs to OpenSearch. So, the action or the REST call would look something like this:

@austintlee I think it's a great start and would definitely provide a path for users to begin more easily leveraging Sycamore as part of their ingest flow. Getting feedback from the wider community is definitely a good idea before doing any work, especially regarding user preferences for the ingestion experience.

@dblock dblock removed the untriaged Issues that require attention from the maintainers. label Jun 6, 2024
@dblock
Member

dblock commented Jun 6, 2024

[Triage -- attendees 1, 2, 3, 4, 5, 6, 7]

@austintlee
Author

@dbwiddis What do we need to do to add support for SSL?

@dbwiddis
Member

@dbwiddis What do we need to do to add support for SSL?

Port the equivalent of these classes
