parquet write to s3 is not queryable by Athena #124
Comments
I'm not sure of the details, but I had the same problem and this option solved it for me.

Would you please give more details about that workaround?

@ngocdaothanh

This issue is very related to this one.
Hello,
The parquet file generated by this library is compatible with Spark but not queryable using Athena.
I wrote a file to S3, and every query touching an array column failed with the error GENERIC_INTERNAL_ERROR: null.
AWS Premium Support told me that, after doing a bit of research, the main reason for these issues is the different ways parquet files can be created. After using parquet-tools to inspect the sample data I provided, they informed me that it is written in a hybrid parquet format. The parquet format generated by parquetjs allows the final parquet file to exclude a column entirely if that column is blank in the data. For example, if a record (row) does not have any value for the "x" column, then the "x" column is omitted from the actual parquet file itself.
Athena uses the Hive parquet SerDe (org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe). As a result, the SerDe expects that all columns will be present in the source parquet file. In this case, empty columns are not included within the record (row) of the parquet file. Unfortunately, the previously mentioned Hive parquet SerDe is the only parquet SerDe supported in Athena.
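For illustration, here is a minimal sketch of the kind of write described above, using the parquetjs API (`ParquetSchema`, `ParquetWriter.openFile`, `appendRow`). The schema fields, values, and file name are made up for the example; the point is that one row carries no value for the optional "x" column, which is the situation the explanation above says Athena's Hive SerDe cannot handle.

```js
const parquet = require('parquetjs');

// Hypothetical schema: "tags" is an array (repeated) column, "x" is optional.
const schema = new parquet.ParquetSchema({
  id:   { type: 'UTF8' },
  tags: { type: 'UTF8', repeated: true },
  x:    { type: 'UTF8', optional: true },
});

async function writeSample() {
  const writer = await parquet.ParquetWriter.openFile(schema, 'sample.parquet');
  await writer.appendRow({ id: 'a', tags: ['t1', 't2'], x: 'present' });
  // No value for "x" in this row; per the report above, the resulting file may
  // omit data that the Hive ParquetHiveSerDe expects to find for that column.
  await writer.appendRow({ id: 'b', tags: [] });
  await writer.close();
}

writeSample().catch(console.error);
```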