Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

embulk-output-redshift should not convert TEXT columns to VARCHAR(65535) #225

Open
mjalkio opened this issue Jan 9, 2018 · 6 comments
Open

Comments

@mjalkio
Copy link
Contributor

mjalkio commented Jan 9, 2018

TEXT columns created by Embulk default to VARCHAR(65535). This is definitely larger than necessary in the majority of situations and is a Redshift antipattern.

Based on the comment in the code:

It seems that this could have been implemented before Redshift had support for the TEXT type. According to the Redshift documentation, TEXT now converts to VARCHAR(256), which is a more reasonable size.

@hito4t
Copy link
Contributor

hito4t commented Jan 12, 2018

@mjalkio
Thank you for the information!
We should correct even though it will lose compatibility.

@hito4t
Copy link
Contributor

hito4t commented Jan 12, 2018

@mjalkio
I've investigated and found that the line won't be executed.

  • Redshift JDBC driver doesn't return 'TEXT' as type name. If you created a column whose type is 'TEXT', Redshift JDBC driver will return 'VARCHAR(256)' as type.
  • The default database column type for embulk string is 'CLOB', not 'TEXT'.
    (In most cases CLOB is too large, but this is embuk-output-jdbc specification. You can specify any type by `column_options'.)
  • If you set 'TEXT' as type of column_options, it won't be converted and will be used as it is.

@mjalkio
Copy link
Contributor Author

mjalkio commented Jan 12, 2018

Thank you for looking into this @hito4t! Can you point me to where CLOB is defined as the default? CLOB still seems like a strange choice because (1) as you mention it is usually larger than needed and (2) it is only supported by two of the database types that redshift-output-jdbc supports.

We don't want to have to define column_options for every one of our tables. This would introduce considerable overhead. We want to only define column_options for tables that need different settings than the default (which we would hope works in most cases). So if the option isn't available in the public build we will work from a fork.

@hito4t
Copy link
Contributor

hito4t commented Jan 19, 2018

@mjalkio
Unfortunately it is not easy to change default database column type for string because of compatibility reason.

Can you point me to where CLOB is defined as the default?

In the following line.
https://github.com/embulk/embulk-output-jdbc/blob/master/embulk-output-jdbc/src/main/java/org/embulk/output/jdbc/AbstractJdbcOutputPlugin.java#L895

Or you can change default Redshift column type for string by changing the following line.
https://github.com/embulk/embulk-output-jdbc/blob/master/embulk-output-redshift/src/main/java/org/embulk/output/redshift/RedshiftOutputConnection.java#L54

@mjalkio
Copy link
Contributor Author

mjalkio commented Jan 20, 2018

Okay, so to be clear you would not want to change Redshift to have this method?

    @Override
    protected String buildColumnTypeName(JdbcColumn c)
    {
        switch(c.getSimpleTypeName()) {
        case "CLOB":
            return "TEXT";
        case "BLOB":
            return "BYTEA";
        default:
            return super.buildColumnTypeName(c);
        }
    }

@hito4t
Copy link
Contributor

hito4t commented Jan 23, 2018

TEXT should be TEXT (not VARCHAR(65535)) and CLOB should be VARCHAR(65535) (not TEXT), because CLOB should be extremely large as CLOBs of other DBMSs are.
#226
Although in many cases CLOB might be too larger than necessary, but embulk string should be converted to CLOB because both have extremely large capacity.

By the way, I think replace mode is a little inconvenient because it will create columns ignoring types of columns of existing table.
It will be useful if embulk-output-jdbc has an option to create columns that has same types as columns of existing table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants