This repository has been archived by the owner on Sep 23, 2024. It is now read-only.

Inconsistent usage of table_name vs tap_stream_id #98

Closed
aaronsteers opened this issue Sep 5, 2020 · 1 comment

Comments

aaronsteers commented Sep 5, 2020

Background:

  • I am working on a source (dynamodb) whose upstream table names can contain special characters. In our case they contain dashes, which this target parses in a special and significant way.
  • In an effort to isolate the upstream table name from the downstream one, I looked into the spec and found that (according to my interpretation of the spec here) table_name is intended to describe the upstream source and tap_stream_id is intended to drive downstream behavior.
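To make the distinction concrete, here is a minimal sketch of how the two fields might sit side by side in a catalog entry. The entry is hypothetical: the two names are the ones used in this issue, while the `stream` value, the empty schema, and the `describe` helper are assumptions made only for illustration.

```python
# Hypothetical catalog entry (not copied from the real catalog file).
catalog_entry = {
    "tap_stream_id": "employeeAssessment",             # name intended for downstream use
    "table_name": "dev_mes-employeeAssessment-table",  # name of the upstream DynamoDB table
    "stream": "employeeAssessment",                    # assumption: same as tap_stream_id
    "schema": {"type": "object", "properties": {}},    # placeholder schema
}


def describe(entry):
    """Return the upstream and downstream names under the interpretation above."""
    return {
        "upstream_table": entry["table_name"],
        "downstream_name": entry["tap_stream_id"],
    }


print(describe(catalog_entry))
```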

I have created singer-io/tap-dynamodb#25 to resolve this on the tap side.

Problem:

When sending events through that tap now, this target appears to be inconsistent about when it uses table_name and when it uses tap_stream_id. (Again, according to my understanding of the spec here, table_name should be used by the tap and tap_stream_id should govern naming in the target.)

The log below comes from a single table sync operation. Note that the target first uses the correct table name and then uses the table name "TABLE", which likely comes from parsing the table_name instead of the tap_stream_id.

I'm planning to submit a PR but first wanted to post this to create awareness and promote discussion.

Thanks!


Here's the full log...

Note that in this example the upstream table_name is dev_mes-employeeAssessment-table and the tap_stream_id is employeeAssessment. The target first checks whether employeeAssessment exists and then (mistakenly) whether TABLE exists.
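A rough sketch of how a name like "TABLE" could arise, assuming the target splits a dash-separated stream name into catalog, schema, and table parts. The `split_stream_name` helper is hypothetical and written only for this illustration; it is not copied from target-snowflake.

```python
def split_stream_name(stream_name, separator="-"):
    """Split a stream name into catalog, schema and table parts (illustrative only)."""
    parts = stream_name.split(separator)
    catalog, schema, table = None, None, stream_name
    if len(parts) == 2:
        # "<schema>-<table>"
        schema, table = parts
    elif len(parts) > 2:
        # "<catalog>-<schema>-<table...>"
        catalog, schema = parts[0], parts[1]
        table = "_".join(parts[2:])
    return {"catalog": catalog, "schema": schema, "table": table}


for name in ("employeeAssessment", "dev_mes-employeeAssessment-table"):
    print(name, "->", split_stream_name(name)["table"].upper())
# employeeAssessment -> EMPLOYEEASSESSMENT
# dev_mes-employeeAssessment-table -> TABLE
```

Under this assumption, a stream named after the tap_stream_id maps to EMPLOYEEASSESSMENT, while a stream named after the upstream table collapses to TABLE, which matches the two names seen in the log.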

2020-09-04 16:06:16,334 - INFO - Beginning running command: tap-dynamodb --config /mnt/c/Files/Source/slalom-data-platform-core/data/taps/.secrets/tmp/tap-me-slalom-config.json --catalog ./.output/taps/me-slalom-catalog/me-slalom-employeeAssessment-catalog.json --state /tmp/tmpa9t8bkaj/me-slalom-employeeAssessment-state.json | target-snowflake --config /mnt/c/Files/Source/slalom-data-platform-core/data/taps/.secrets/tmp/target-snowflake-config-employeeAssessment.json > /tmp/tmpa9t8bkaj/me-slalom-employeeAssessment-state-new.json...
INFO Found credentials in shared credentials file: /mnt/c/Files/Source/slalom-data-platform-core/infra/dev/.secrets/aws-credentials
INFO Attempting to assume_role on RoleArn: arn:aws:iam::489003720472:role/TEST-AJ-DynamoDB-SingerExtracts-Role
INFO Starting sync.
INFO employeeAssessment: Starting sync
INFO Syncing full table for stream: dev_mes-employeeAssessment-table
INFO Scanning table dev_mes-employeeAssessment-table with params:
INFO    TableName = dev_mes-employeeAssessment-table
INFO    Limit = 1000
INFO employeeAssessment: Completed sync (17 rows)
INFO
+Sync Summary--------+--------------------+---------------+---------------------+
| table name         | replication method | total records | write speed         |
+--------------------+--------------------+---------------+---------------------+
| employeeAssessment | FULL_TABLE         | 17 records    | 19.3 records/second |
+--------------------+--------------------+---------------+---------------------+
INFO Done syncing.
time=2020-09-04 16:06:18 name=target_snowflake level=INFO message=Getting catalog objects from table cache...
time=2020-09-04 16:06:20 name=target_snowflake level=INFO message=Table 'RAW_MES."EMPLOYEEASSESSMENT"' does not exist. Creating...
time=2020-09-04 16:06:23 name=target_snowflake level=INFO message=Table 'RAW_MES."TABLE"' exists
time=2020-09-04 16:06:23 name=target_snowflake level=INFO message=Uploading 17 rows to external snowflake stage on S3
time=2020-09-04 16:06:23 name=target_snowflake level=INFO message=Target S3 bucket: dataplatformtest01-data-44635, local file: /tmp/records_2ph8vxjw.csv.gz, S3 key: data/raw/me-slalom/employeeAssessment/v1/pipelinewise_dev_mes-employeeAssessment-table_20200904-160623-191772.csv.gz
time=2020-09-04 16:06:24 name=target_snowflake level=INFO message=Loading 17 rows into 'RAW_MES."TABLE"'
time=2020-09-04 16:06:25 name=target_snowflake level=INFO message=Loading into RAW_MES."TABLE": {"inserts": 0, "updates": 17, "size_bytes": 119}
time=2020-09-04 16:06:25 name=target_snowflake level=INFO message=Emitting state {"bookmarks": {"employeeAssessment": {"last_replication_method": "FULL_TABLE"}, "dev_mes-employeeAssessment-table": {"version": 1599260777218, "initial_full_table_complete": true, "success_timestamp": "2020-09-04T23:06:18.099907Z"}}, "currently_syncing": "dev_mes-employeeAssessment-table"}

aaronsteers commented Sep 5, 2020

On further research, it is possible that the tap is still sending schema messages that carry the incorrect tap_stream_id. I will close this (at least temporarily) while I confirm that the schema messages are in fact being sent as expected.

UPDATE: It is indeed the upstream tap sending incorrect schema messages. No action needed here.
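For anyone who wants to run the same kind of check, here is a minimal sketch that groups the `stream` values in a tap's output by message type. It assumes the tap writes one JSON Singer message per line. The `stream_names_by_type` helper and the two sample messages are hypothetical and are not taken from the actual tap output.

```python
import json


def stream_names_by_type(lines):
    """Collect the distinct "stream" values seen for each Singer message type."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON message
        if isinstance(message, dict) and "stream" in message and "type" in message:
            seen.setdefault(message["type"], set()).add(message["stream"])
    return seen


# Hypothetical sample: the SCHEMA message uses the upstream table name,
# while the RECORD message uses the intended stream name.
sample = [
    '{"type": "SCHEMA", "stream": "dev_mes-employeeAssessment-table", '
    '"schema": {"type": "object"}, "key_properties": []}',
    '{"type": "RECORD", "stream": "employeeAssessment", "record": {}}',
]

for message_type, names in sorted(stream_names_by_type(sample).items()):
    print(message_type, sorted(names))
```

A mismatch between the names reported for SCHEMA and RECORD messages would point at the tap rather than the target. Real tap output could be passed in place of `sample`, for example by iterating over `sys.stdin`.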
