[WIP] Add support of HiveTableScan TextHive #709

Closed
wants to merge 2 commits into from

amahussein
Collaborator

Fixes #681

  • Look at the node description for the SerDe implementation, as below:
        org.apache.hadoop.hive.serde2.AbstractSerDe
          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe -> HiveText
          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe -> HiveParquet
          org.apache.hadoop.hive.serde2.avro.AvroSerDe -> HiveAvro
          org.apache.hadoop.hive.serde2.OpenCSVSerde -> HiveCSV
  • Only HiveText is supported for now.
  • We cannot pull the schema, so we leave it empty.
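
A minimal sketch of the lookup described above, assuming the SerDe class name appears verbatim in the node description. The `HiveScanSerdeClasses` name and the format labels follow the review snippet in this thread; the object, `formatOf`, `isSupported`, and `supportedFormats` are hypothetical names used only for illustration and are not the tool's actual API.

```scala
// Hypothetical sketch: not the implementation in this PR.
object HiveSerdeSketch {
  // Pairs a SerDe class name with the read-format label reported for it.
  final case class HiveScanSerdeClasses(className: String, format: String)

  val knownSerdes: Seq[HiveScanSerdeClasses] = Seq(
    HiveScanSerdeClasses("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe", "HiveText"),
    HiveScanSerdeClasses("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
      "HiveParquet"),
    HiveScanSerdeClasses("org.apache.hadoop.hive.serde2.avro.AvroSerDe", "HiveAvro"),
    HiveScanSerdeClasses("org.apache.hadoop.hive.serde2.OpenCSVSerde", "HiveCSV"),
    HiveScanSerdeClasses("org.apache.hadoop.hive.ql.io.orc.OrcSerde", "HiveORC")
  )

  // Only HiveText is treated as supported, as stated in the description.
  val supportedFormats: Set[String] = Set("HiveText")

  // Returns the label of the first known SerDe class found in the description,
  // or None when the description names no known class.
  def formatOf(nodeDesc: String): Option[String] =
    knownSerdes.find(s => nodeDesc.contains(s.className)).map(_.format)

  def isSupported(nodeDesc: String): Boolean =
    formatOf(nodeDesc).exists(supportedFormats.contains)
}
```

Returning `Option` keeps an unrecognized SerDe class distinct from a recognized but unsupported one, so the caller can decide how to report each case.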

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
@amahussein amahussein added feature request New feature or request core_tools Scope the core module (scala) labels Dec 29, 2023
@amahussein amahussein self-assigned this Dec 29, 2023
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
@amahussein
Collaborator Author

amahussein commented Jan 8, 2024

@nartal1
We need to discuss the changes in more detail, as I found some design issues that need to be addressed in order to fully support this feature.

  1. In some situations, we need to check the "Spark properties" in order to decide what action to take for a given operator/exec.
  2. I believe that after adding this feature, DataSourceInfo is supposed to include the format of "Scan Hive/NativeScan". Currently, this does not happen because the tool looks for "ReadSchema" in the node info, which does not apply to ScanHive. It would be nice to confirm whether I am getting that right.
  3. This point could be considered an improvement, but it would be nice to generate a recommendation based on the Spark properties. For example, if the CPU properties do not enable Spark conversion, then we recommend that they do. The open question is where we keep those recommendations, or how we pass them to the AutoTuner when it integrates with the Qual tool.
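
To make points 1 and 3 concrete, here is a rough sketch of a property-driven recommendation. It assumes that "Spark conversion" refers to the Spark SQL settings `spark.sql.hive.convertMetastoreParquet` and `spark.sql.hive.convertMetastoreOrc`; that reading is an assumption, and the object and method names are hypothetical rather than part of the tool.

```scala
// Hypothetical sketch: not the tool's or the AutoTuner's actual API.
object HiveConversionSketch {
  // Real Spark SQL settings; mapping them to these format labels is an assumption.
  val conversionProps: Map[String, String] = Map(
    "HiveParquet" -> "spark.sql.hive.convertMetastoreParquet",
    "HiveORC" -> "spark.sql.hive.convertMetastoreOrc"
  )

  // Returns a recommendation only when the application explicitly disabled
  // the conversion that applies to the scanned format.
  def recommend(format: String, sparkProps: Map[String, String]): Option[String] =
    conversionProps.get(format).flatMap { key =>
      sparkProps.get(key).map(_.trim.toLowerCase) match {
        case Some("false") =>
          Some(s"Set ${key}=true so that the scan uses Spark's native reader.")
        case _ => None
      }
    }
}
```

Where such recommendations should be stored, and how they would reach the AutoTuner, is left open here, as in the comment above.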

@nartal1
Collaborator

nartal1 commented Jan 8, 2024

2. I believe that after adding this feature, DataSourceInfo is supposed to include the format of "Scan Hive/NativeScan". Currently, this does not happen because the tool looks for "ReadSchema" in the node info, which does not apply to ScanHive. It would be nice to confirm whether I am getting that right.

You are right. Currently the tool looks for "ReadSchema", which is present in the event logs for the ORC and Parquet file formats, and we try to get the schema as well. IIRC, the schema is not present in the event logs for Scan Hive. We need to update that function to support the Scan Hive file formats.
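
One possible shape for that update, sketched under assumptions: the scan is recorded whether or not "ReadSchema" is present, and the schema is left empty when it is missing. `ScanInfo`, `schemaOf`, and `scanInfo` are hypothetical names, and the sketch assumes "ReadSchema" is the last field of the node description.

```scala
// Hypothetical sketch: not the tool's actual DataSourceInfo handling.
object ReadSchemaSketch {
  // Simplified stand-in for the information recorded per scan.
  final case class ScanInfo(format: String, schema: String)

  private val marker = "ReadSchema: "

  // Returns the text after "ReadSchema: ", or an empty string when the marker
  // is absent (as assumed for Scan Hive nodes).
  def schemaOf(nodeDesc: String): String = {
    val idx = nodeDesc.indexOf(marker)
    if (idx < 0) "" else nodeDesc.substring(idx + marker.length).trim
  }

  // Records the scan even when no schema can be pulled from the description.
  def scanInfo(nodeDesc: String, format: String): ScanInfo =
    ScanInfo(format, schemaOf(nodeDesc))
}
```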

"HiveParquet"),
HiveScanSerdeClasses("org.apache.hadoop.hive.serde2.avro.AvroSerDe", "HiveAvro"),
HiveScanSerdeClasses("org.apache.hadoop.hive.serde2.OpenCSVSerde", "HiveCSV"),
HiveScanSerdeClasses("org.apache.hadoop.hive.ql.io.orc.OrcSerde", "HiveORC")
Collaborator

I wonder whether these are the only SerDe classes in use, or whether a different class could be used for HiveParquet, HiveORC, etc. I understand that we cannot cover all the classes, but it would be good to document it if that's the case.

@amahussein
Collaborator Author

#723 supersedes this PR.

@amahussein amahussein closed this Jan 10, 2024
Successfully merging this pull request may close these issues.

[FEA] Qualification tool: Parse HiveTableScan in read format and investigate InsertIntoHiveTable