Allow big seeds #447

jausanca · 2024-09-23T20:58:42Z

resolves #446

Description

This pull request allows uploading seeds whose serialized length exceeds 68,000 characters. It does so by splitting the CSV records into chunks, which are appended to a session variable across separate statement executions.
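As a rough illustration of the splitting step described above, here is a minimal sketch. It is not the code from this PR: `chunk_records` and `max_chars` are hypothetical names, and the real limit and serialization format live in impl.py.

```python
from typing import List


def chunk_records(records: List[str], max_chars: int) -> List[List[str]]:
    """Group serialized records so each chunk stays within max_chars.

    Hypothetical helper for illustration only. A single record longer
    than max_chars still gets a chunk of its own.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for record in records:
        # Start a new chunk when adding this record would exceed the budget.
        if current and current_len + len(record) > max_chars:
            chunks.append(current)
            current, current_len = [], 0
        current.append(record)
        current_len += len(record)
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be sent in its own statement execution, so no single statement carries the whole seed.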

Checklist

I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the CHANGELOG.md and added information about my change to the "dbt-glue next" section.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

moomindani · 2024-09-30T01:54:18Z

dbt/adapters/glue/impl.py

+        for i, csv_chunk in enumerate(csv_chunks):
+            is_first = i == 0
+            is_last = i == len(csv_chunks) - 1
+            code = "custom_glue_code_for_dbt_adapter\n"


Do we need to have this per chunk?

Yes. The way this works is that the data is sliced across multiple statement executions, one execute per chunk, each appending its data to an array.
The cursor implementation tells Python code from SQL by checking whether the code contains "custom_glue_code_for_dbt_adapter"; anything without the marker is wrapped in the SqlWrapper. Each chunk therefore needs the marker so that it is executed as Python.
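The per-chunk flow can be sketched roughly as follows. This is a local simulation under stated assumptions, not the adapter's code: `build_statements`, `run_in_session`, and the session variable `csv_rows` are hypothetical, `exec` stands in for the remote Glue session, and stripping the marker before execution is an assumption about how the cursor handles it.

```python
from typing import Dict, List

MARKER = "custom_glue_code_for_dbt_adapter"  # marker string named in the thread


def looks_like_python(code: str) -> bool:
    # Code containing the marker is treated as Python; anything else as SQL.
    return MARKER in code


def build_statements(csv_chunks: List[List[str]]) -> List[str]:
    """Build one Python statement per chunk (hypothetical helper)."""
    statements = []
    for i, csv_chunk in enumerate(csv_chunks):
        is_first = i == 0
        code = MARKER + "\n"
        if is_first:
            # The first statement creates the session variable.
            code += "csv_rows = []\n"
        # Every statement appends its own slice of the data.
        code += f"csv_rows.extend({csv_chunk!r})\n"
        statements.append(code)
    return statements


def run_in_session(statements: List[str], session: Dict) -> None:
    """Stand-in for the remote session: state persists between statements."""
    for code in statements:
        assert looks_like_python(code)
        # Assumption: the marker is removed before the code is executed.
        exec(code.replace(MARKER, ""), session)
```

Because every statement shares the same session, the array built by the first chunk is still there when later chunks extend it. The sketch leaves out the trailing `select 1` that the real statements add.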

Makes sense

moomindani · 2024-09-30T01:56:06Z

dbt/adapters/glue/impl.py

+"""
+            if not is_last:
+                code += f'''
+SqlWrapper2.execute("""select 1""")


What is this for?

That's so the cursor's execute can retrieve a response. Otherwise it breaks when retrieving the result, either at

    if self.connection.use_arrow:
        result_bucket = self.response.get("result_bucket")
        result_key = self.response.get("result_key")
        if result_bucket and result_key:
            pdf = get_pandas_dataframe_from_result_file(result_bucket, result_key)
            self.result = pdf.to_dict('records')[0]

or at

    chunks = output.get("Data", {}).get("TextPlain", None).strip().split('\n')
    logger.debug(f"chunks: {chunks}")
    self.response = json.loads(chunks[0])

I noticed the same approach, SqlWrapper2.execute("""select 1"""), being used in other parts of impl.py.
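To make the failure mode concrete, here is a minimal sketch of the parsing step quoted above. It assumes the statement output carries a JSON document on its first text line; `parse_response` and the example payload are hypothetical, and the explicit error is added for illustration only.

```python
import json


def parse_response(output: dict) -> dict:
    """Parse the first line of a statement's text output as JSON.

    Hypothetical helper mirroring the quoted cursor logic.
    """
    text = output.get("Data", {}).get("TextPlain", None)
    if not text:
        # A statement that prints nothing leaves the cursor with no response.
        raise ValueError("statement produced no text output to parse")
    first_line = text.strip().split("\n")[0]
    return json.loads(first_line)
```

An intermediate chunk that only appends to a variable prints nothing, which is why a trailing query is used to give the cursor something to parse.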

moomindani · 2024-09-30T01:57:31Z

tests/unit/test_adapter.py

@@ -86,3 +88,15 @@ def test_get_table_type(self):
            connection = adapter.acquire_connection("dummy")
            connection.handle  # trigger lazy-load
            self.assertEqual(adapter.get_table_type(target_relation), "iceberg_table")
+
+    def test_create_csv_table_slices_big_datasets(self):


Can we add another test that uses custom_glue_code_for_dbt_adapter with a big seed, to verify that we are not introducing a breaking change from the previous version?

Sorry, I don't get what you mean. What would be the test scenario?

Ah, my bad. This test case already covers the required scenario.

moomindani · 2024-09-30T07:28:04Z

Could you please look at the failed tasks?
e.g. https://github.com/aws-samples/dbt-glue/actions/runs/11068856012/job/30838954297?pr=447

moomindani · 2024-10-01T09:22:20Z

Thank you for your contribution!

Jaume Sanjuan added 3 commits September 23, 2024 18:17

allow bigger seeds

f57894d

fix typo

1670367

Update Changelog

034dde5

github-actions bot added the beginning-contributor label Sep 23, 2024

moomindani self-assigned this Sep 25, 2024

moomindani added the enable-functional-tests label Sep 25, 2024

Jaume Sanjuan added 2 commits September 27, 2024 12:16

Merge remote-tracking branch 'upstream/main' into allow_big_seeds

db6b328

update changelog

f4687a6

moomindani reviewed Sep 30, 2024


moomindani approved these changes Sep 30, 2024


fix type hints for python3.8

ca3b153

moomindani added enable-functional-tests and removed enable-functional-tests labels Oct 1, 2024

moomindani merged commit abf8c52 into aws-samples:main Oct 1, 2024
19 checks passed
