Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For dbt seeds and csv format of unit testing, empty strings ("") and "null" both convert to NULL #4867

Open
1 task done
dbeatty10 opened this issue Feb 7, 2024 · 1 comment
Labels
content Improvements or additions to content dbt Core The changes proposed in this issue relate to dbt Core improvement Use this when an area of the docs needs improvement as it's currently unclear

Comments

@dbeatty10
Copy link
Contributor

dbeatty10 commented Feb 7, 2024

Contributions

  • I have read the contribution docs, and understand what's expected of me.

Link to the page on docs.getdbt.com requiring updates

https://docs.getdbt.com/docs/build/seeds

What part(s) of the page would you like to see updated?

Many people are surprised about how CSV seeds treat certain values.

Namely, all of these will produce NULL values within the database when loaded via dbt seed:

seeds/null_examples.csv

type,value
doublequotes,""
empty,
spaces,  
nothing
string_null,null

But these will produce not null strings (`` and None respectively):

seeds/not_null_examples.csv

type,value
single_quotes,''
string_none,None

Example commands & output

dbt show -s not_null_examples --output json
16:57:53  Running with dbt=1.6.5
16:57:53  Registered adapter: spark=1.6.2
16:57:53  Found 3 seeds, 1 model, 0 sources, 0 exposures, 0 metrics, 398 macros, 0 groups, 0 semantic models
16:57:53  
16:57:56  Concurrency: 1 threads (target='http')
16:57:56  
16:57:56  1 of 1 START seed file dbt_dbeatty.not_null_examples ........................... [RUN]
16:58:03  1 of 1 OK loaded seed file dbt_dbeatty.not_null_examples ....................... [INSERT 2 in 6.89s]
16:58:03  {
  "node": "not_null_examples",
  "show": [
    {
      "type": "single_quotes",
      "value": "''"
    },
    {
      "type": "string_none",
      "value": "None"
    }
  ]
}
dbt show -s null_examples --output json 
16:58:14  Running with dbt=1.6.5
16:58:15  Registered adapter: spark=1.6.2
16:58:15  Found 3 seeds, 1 model, 0 sources, 0 exposures, 0 metrics, 398 macros, 0 groups, 0 semantic models
16:58:15  
16:58:17  Concurrency: 1 threads (target='http')
16:58:17  
16:58:17  1 of 1 START seed file dbt_dbeatty.null_examples ............................... [RUN]
16:58:24  1 of 1 OK loaded seed file dbt_dbeatty.null_examples ........................... [INSERT 5 in 6.82s]
16:58:24  {
  "node": "null_examples",
  "show": [
    {
      "type": "doublequotes",
      "value": null
    },
    {
      "type": "empty",
      "value": null
    },
    {
      "type": "spaces",
      "value": null
    },
    {
      "type": "nothing",
      "value": null
    },
    {
      "type": "string_null",
      "value": null
    }
  ]
}

Net consequence

There are two values that you can't load directly with CSV seeds:

  1. An empty / blank string ("" / '')
  2. A string with the value "null"

In order to create those values, then we'd described a workaround at a high-level here.

you can opt for a "placeholder" empty string value (e.g. EMPTY), coupled with a staging model that wraps around the seed

👉 It might be handy for us to actually provide an example of that:

seeds/sentinel_examples.csv

type,value
sentinel_empty_string,EMPTY_STRING
sentinal_null_string,NULL_STRING
doublequotes,""
empty,
spaces,  
nothing
string_null,null
single_quotes,''
string_none,None

models/example_seed_wrapping.sql

select
  value as orginal_value,
  case
    when value = 'EMPTY_STRING' then ''
    when value = 'NULL_STRING' then 'null'
    else value
  end as converted_value
from {{ ref('sentinel_examples') }}
dbt build -s +wrap_seed --full-refresh
dbt show -s wrap_seed --limit 10 --output json
Output
{
  "node": "wrap_seed",
  "show": [
    {
      "type": "sentinel_empty_string",
      "orginal_value": "EMPTY_STRING",
      "converted_value": ""
    },
    {
      "type": "sentinal_null_string",
      "orginal_value": "NULL_STRING",
      "converted_value": "null"
    },
    {
      "type": "doublequotes",
      "orginal_value": null,
      "converted_value": null
    },
    {
      "type": "empty",
      "orginal_value": null,
      "converted_value": null
    },
    {
      "type": "spaces",
      "orginal_value": null,
      "converted_value": null
    },
    {
      "type": "nothing",
      "orginal_value": null,
      "converted_value": null
    },
    {
      "type": "string_null",
      "orginal_value": null,
      "converted_value": null
    },
    {
      "type": "single_quotes",
      "orginal_value": "''",
      "converted_value": "''"
    },
    {
      "type": "string_none",
      "orginal_value": "None",
      "converted_value": "None"
    }
  ]
}

Additional information

Here are some of the related issues and discussions:

@dbeatty10 dbeatty10 added content Improvements or additions to content improvement Use this when an area of the docs needs improvement as it's currently unclear labels Feb 7, 2024
@dbeatty10 dbeatty10 added the dbt Core The changes proposed in this issue relate to dbt Core label Apr 9, 2024
@dbeatty10 dbeatty10 changed the title For dbt seeds, empty strings ("") and "null" both convert to NULL For dbt seeds and csv format of unit testing, empty strings ("") and "null" both convert to NULL May 9, 2024
@runleonarun
Copy link
Collaborator

From #5476:

Overall, we should really just implement #4867 and make sure it covers all of these cases:

null values in csv format for unit tests
null values in dict format for unit tests
null values in sql format for unit tests
null values in CSV seeds
empty string values in csv format for unit tests
empty string values in dict format for unit tests
empty string values in sql format for unit tests
empty string values in CSV seeds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
content Improvements or additions to content dbt Core The changes proposed in this issue relate to dbt Core improvement Use this when an area of the docs needs improvement as it's currently unclear
Projects
None yet
Development

No branches or pull requests

2 participants