-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow big seeds #447
Allow big seeds #447
Conversation
for i, csv_chunk in enumerate(csv_chunks): | ||
is_first = i == 0 | ||
is_last = i == len(csv_chunks) - 1 | ||
code = "custom_glue_code_for_dbt_adapter\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to have this per chunk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, so the way this works is that it slices the data across multiple statement executions, appending the data to an array, with an execute per chunk.
Since cursor implementation differentiates between python code and sql code by checking if the code contains "custom_glue_code_for_dbt_adapter", otherwise wrapping it in the SqlWrapper, we need it so it's executed as python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense
""" | ||
if not is_last: | ||
code += f''' | ||
SqlWrapper2.execute("""select 1""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's so the cursor execute can retrieve the response. Otherwise it breaks when retrieving the result on
if self.connection.use_arrow:
result_bucket = self.response.get("result_bucket")
result_key = self.response.get("result_key")
if result_bucket and result_key:
pdf = get_pandas_dataframe_from_result_file(result_bucket, result_key)
self.result = pdf.to_dict('records')[0]
or on
chunks = output.get("Data", {}).get("TextPlain", None).strip().split('\n')
logger.debug(f"chunks: {chunks}")
self.response = json.loads(chunks[0])
I noticed it being handled in this same way SqlWrapper2.execute("""select 1""")
on other parts of impl.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
@@ -86,3 +88,15 @@ def test_get_table_type(self): | |||
connection = adapter.acquire_connection("dummy") | |||
connection.handle # trigger lazy-load | |||
self.assertEqual(adapter.get_table_type(target_relation), "iceberg_table") | |||
|
|||
def test_create_csv_table_slices_big_datasets(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add another test to use custom_glue_code_for_dbt_adapter
with a big seed to verify that we do not make breaking change from the previous version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't get what you mean. What would be the test scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah it was my bad, this test case already covered required one.
Could you please look failed tasks? |
Thank you for your contribution! |
resolves #446
Description
This pull requests allows to upload seeds with serialized lengths over 68000 characters. It allows so by splitting csv records across chunks which are appended to a session variable on different statement executions.
Checklist
CHANGELOG.md
and added information about my change to the "dbt-glue next" section.By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.