[BUG] #739

Closed
clashofphish opened this issue May 1, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@clashofphish

What is the bug?

When I set up an OpenSearch client that authenticates through a header and then use the bulk helper (opensearchpy.helpers.bulk), the client.bulk() call in _process_bulk_chunk returns a string. This causes _process_bulk_chunk_success to raise a TypeError when resp['index'] is accessed at line 185 of opensearchpy/helpers/actions.py.
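
A minimal sketch of the failure mode described above, using only the standard library. The response shape and the first_status helper are assumptions made for illustration; they are not taken from opensearchpy/helpers/actions.py.

```python
import json

# Hypothetical bulk response, shaped like the JSON a _bulk call returns.
parsed = {"errors": False, "items": [{"index": {"_id": "1", "status": 201}}]}

# The same payload left as raw text, as if it had never been deserialized.
raw = json.dumps(parsed)


def first_status(resp):
    """Read the status of the first bulk item, as a response handler might."""
    return resp["items"][0]["index"]["status"]


print(first_status(parsed))  # works when the response is a dict

try:
    first_status(raw)
except TypeError as exc:
    # A str only accepts integer or slice indices, so a string key fails here.
    print(f"TypeError: {exc}")
```

The same lookup succeeds on the dict and fails on the string, which matches the symptom: the documents are written, and only the post-processing of the response breaks.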

I tried this with my header defining "Content-Type" as "application/json" and as "application/json; boundary=NL".

How can one reproduce the bug?

import os
import uuid

from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk

client = OpenSearch(
    host='apigateway.host.vpc.url',
    url_prefix='search',
    use_ssl=True,
    port=443,
    headers={
        # base64_auth_header is the reporter's own helper (definition not shown)
        'Authorization': base64_auth_header(os.environ['URL_KEY']),
        'Content-Type': 'application/json; boundary=NL',  # also tried with 'application/json'
    }
)

# Test connection works
responseGet = client.indices.get('test-index')

# Build request objects (nodes is defined elsewhere; not shown in this report)
requests = []
for n in nodes[0:3]:
    text = n.text
    metadata = {
        "noticeId": n.metadata['noticeId'],
        "department": n.metadata['department'],
    }
    request = {
        "_id": str(uuid.uuid4()),
        "_op_type": "index",
        "_index": 'test-index',
        "text": text,
        "metadata": metadata,
    }
    requests.append(request)
    
bulk(
    client,
    requests,
    max_chunk_bytes=1 * 1024 * 1024,
)

What is the expected behavior?

I would expect the response to be a JSON object that can be indexed with a string key.

What is your host/environment?

  • Mac OS 14.4.1
  • Python 3.11.8
  • opensearch-py==2.5.0
  • OpenSearch setup behind a VPC with access through API Gateway. I do not have the configuration settings for these.

Do you have any screenshots?

image

Do you have any additional context?

Oddly enough this behavior does not happen on the OpenSearch domain that I deployed outside of the VPC when I use the http_auth=(username, password) parameter.

@clashofphish clashofphish added bug Something isn't working untriaged Need triage labels May 1, 2024
@saimedhi saimedhi removed the untriaged Need triage label May 1, 2024
@saimedhi
Collaborator

saimedhi commented May 2, 2024

Hello @clashofphish, I will try to replicate this case in a VPC environment. In the meantime, if you find the cause or bug, please feel free to contribute. Thank you!

@dblock
Member

dblock commented May 2, 2024

@clashofphish Is this in Amazon Managed OpenSearch? Do you have this reproduced with curl or awscurl so we can see if the problem is the client or the server?

@clashofphish
Author

clashofphish commented May 2, 2024

@dblock This is Amazon Managed OpenSearch. When I use curl I don't get the same error; it only happens when I use the SDK.

Let me know if you need more information.

@dblock
Member

dblock commented May 2, 2024

@clashofphish This is helpful. Will you please post the working curl(s)?

I think the whole VPC business is a red herring. I'd start by removing the content type from your python code because bulk is ld-json, not json. Next I'd dig through the code to see exactly what's being sent up in the python client and received back and compare to the curl i/o.

@clashofphish
Author

clashofphish commented May 2, 2024

I'll get the curl info for you.

In the meantime, I can tell you that when I turned logging on, the log messages from OpenSearch show that OS is receiving the records and writing them correctly. The record count also increases. It's just that the response object is a string rather than a JSON object.

The log message:
image

Also, I tried the code without the "Content-Type" specified in the header and had the exact same issue.

@dblock
Member

dblock commented May 2, 2024

It's just that the response object is a string rather than a json object.

That is saying that the content type of the result is not evaluated properly, so needs to be debugged.

@clashofphish
Author

It's just that the response object is a string rather than a json object.

That is saying that the content type of the result is not evaluated properly, so needs to be debugged.

Agreed. The SDK does not evaluate the response from the endpoint correctly when I authorize through the header rather than the http_auth parameter. Because it does not evaluate that result correctly, it errors in the _process_bulk_chunk_success function.
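
To illustrate the hypothesis under discussion: if a client chooses its decoder from the response Content-Type header, a JSON payload labelled with a different type would come back as plain text. The deserialize helper below is a hypothetical sketch of that idea, not opensearch-py's actual code, and the header values are assumptions.

```python
import json


def deserialize(body, content_type):
    """Pick a decoder from the response Content-Type (hypothetical helper)."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    mimetype = (content_type or "").split(";", 1)[0].strip().lower()
    if mimetype == "application/json":
        return json.loads(body)
    # Anything else is handed back untouched as text.
    return body


body = '{"errors": false, "items": []}'
print(type(deserialize(body, "application/json; charset=UTF-8")).__name__)  # dict
print(type(deserialize(body, "text/plain")).__name__)  # str
```

Under this assumption, comparing the response headers seen by the client against those seen by a working curl call would be one way to narrow things down.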

Is there another way to tell the SDK how to parse the result object that I'm missing?

Or am I not understanding what you are trying to say correctly?

@dblock
Member

dblock commented May 2, 2024

Or am I not understanding what you are trying to say correctly?

I'm just saying it's not supposed to happen this way. It should "just work" (TM). So there's a bug somewhere :) Since you have a way to reproduce I am hoping you'll narrow it down by walking through the code ;)

Ideally, turn this into a failing unit test? I can try to fix from there.

@clashofphish
Author

clashofphish commented May 2, 2024

I can't get the bulk curl request to work: it keeps returning an error saying the body has to end in a newline, even though my call clearly has one (I also tried putting the data in a JSON file and passing @reqs.json after --data-raw):

curl -X POST --location 'https://<url>/test-index/_bulk' --header 'Authorization: <base64key>' --header 'Content-Type: application/json' --data-raw '{ "index": { "_index": "test-index", "_id": "1" } }\n{"id": "1", "text": "bob", "metadata": {"noticeId": "c7c", "department": "HOUSING"}}\n{ "index": { "_index": "test-index", "_id": "2" } }\n{"id": "2", "text": "jane", "metadata": {"noticeId": "6e9", "department": "HOUSING"}}\n'
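
A possible explanation for the newline error, not confirmed anywhere in this thread: inside single quotes the shell passes \n to curl as a backslash followed by the letter n, so the body never contains a real newline. In addition, --data-raw does not read @file arguments, and --data @file strips newlines from the file. The sketch below builds the body with real newline characters; to_ndjson is a hypothetical helper and the documents simply mirror the curl payload above.

```python
import json


def to_ndjson(lines):
    """Serialize each dict onto its own line and end with a real newline."""
    return "".join(json.dumps(line) + "\n" for line in lines)


# Action and document pairs, mirroring the curl payload above.
payload = [
    {"index": {"_index": "test-index", "_id": "1"}},
    {"id": "1", "text": "bob", "metadata": {"noticeId": "c7c", "department": "HOUSING"}},
    {"index": {"_index": "test-index", "_id": "2"}},
    {"id": "2", "text": "jane", "metadata": {"noticeId": "6e9", "department": "HOUSING"}},
]

body = to_ndjson(payload)

# The body ends with a newline character, not the two characters "\" and "n".
print(body.endswith("\n"))
print("\\n" in body)
```

If the body is saved to a file, curl's --data-binary @<file> option sends it with the newlines intact.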

I can tell you that my co-workers have been able to push documents to the index successfully with fetch calls:

const request = await fetch('https://<url>/_bulk', {
  body: batch.map(JSON.stringify).join('\n') + '\n',
  method: 'POST',
  headers: {
    'Authorization': <token here>,
    'Content-Type': 'application/json; boundary=NL',
  },
})

I also know that the client.bulk() request made by the bulk helper is working, because my documents end up in my index. It's just that the response is a string, so the post-processing of the response fails. This only happens when I use the header to specify my authentication token for the OS instance behind the VPC. It does not happen when I use http_auth with an OS instance that is not behind a VPC. From what I can see, the calls to OS are the same in both cases.

I don't know what to do from here. I'm happy to provide more, but I need guidance on what you need.

@clashofphish
Author

This ticket can be closed. I narrowed the error down to the way the API Gateway and VPC were built. Sorry for the mix-up. Thanks for your help regardless.

@dblock
Member

dblock commented May 3, 2024

This ticket can be closed. I narrowed the error down to the way the API Gateway and VPC were built. Sorry for the mix-up. Thanks for your help regardless.

I'm glad you fixed the issue. Could you help us understand what the root cause was and how you figured it out?

@clashofphish
Author

clashofphish commented May 3, 2024

The problem was that the API Gateway was configured incorrectly. The lesson is that even when you trust your coworkers, sometimes you still have to double-check their work. The Gateway was set up such that it was stringifying the response object inside of a stringified object.
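
A rough sketch of what a doubly stringified response looks like from the client side. The exact Gateway configuration was never posted, so the payload shape here is an assumption, and unwrap_json is a hypothetical helper rather than anything in opensearch-py.

```python
import json

original = {"errors": False, "items": []}

# What a well-behaved proxy would forward: one layer of JSON encoding.
single = json.dumps(original)
# What a misconfigured layer might forward: the JSON text encoded again.
double = json.dumps(single)

print(type(json.loads(single)).__name__)  # dict
print(type(json.loads(double)).__name__)  # str


def unwrap_json(text, max_depth=3):
    """Decode repeatedly while the result is still a string (hypothetical helper)."""
    value = text
    for _ in range(max_depth):
        if not isinstance(value, str):
            break
        value = json.loads(value)
    return value


print(unwrap_json(double) == original)  # True
```

Decoding once yields another string rather than a dict, which is consistent with the client handing a string to the bulk helper. Fixing the Gateway is the cleaner remedy; unwrapping on the client only hides the misconfiguration.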

I figured this out by poking at my coworker for more help. It's partially my fault for not knowing how API Gateway works and how it was set up in this instance.

@dblock
Member

dblock commented May 3, 2024

The Gateway was set up such that it was stringifying the response object inside of a stringified object.

I mean, how was it set up to enable this behavior?
