[WIP] Add support of HiveTableScan TextHive #709

amahussein · 2023-12-29T00:23:48Z

Fixes #681

Look at the node description to determine the SerDe implementation, as shown below:

        org.apache.hadoop.hive.serde2.AbstractSerDe
          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe -> HiveText
          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe -> HiveParquet
          org.apache.hadoop.hive.serde2.avro.AvroSerDe -> HiveAvro
          org.apache.hadoop.hive.serde2.OpenCSVSerde -> HiveCSV

Only HiveText is supported for now.
We cannot pull the schema, so we leave it empty.
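
As a rough illustration of the lookup described above, here is a minimal sketch. The object and method names (`SerdeFormatLookup`, `formatOf`, `isSupported`) are hypothetical and are not the code in this PR; the format labels follow the `HiveScanSerdeClasses` entries in `HiveParseHelper.scala`, and matching by substring on the node description is an assumption.

```scala
// Hypothetical sketch: names are illustrative, not the PR's actual implementation.
object SerdeFormatLookup {
  // SerDe class name -> format label.
  private val serdeToFormat: Map[String, String] = Map(
    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe" -> "HiveText",
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe" -> "HiveParquet",
    "org.apache.hadoop.hive.serde2.avro.AvroSerDe" -> "HiveAvro",
    "org.apache.hadoop.hive.serde2.OpenCSVSerde" -> "HiveCSV"
  )

  // Only HiveText is treated as supported for now.
  private val supportedFormats: Set[String] = Set("HiveText")

  // Returns the label of the first known SerDe class named in the node description.
  def formatOf(nodeDescr: String): Option[String] =
    serdeToFormat.collectFirst {
      case (clazz, format) if nodeDescr.contains(clazz) => format
    }

  // True only when a known SerDe is found and its format is in the supported set.
  def isSupported(nodeDescr: String): Boolean =
    formatOf(nodeDescr).exists(supportedFormats.contains)
}
```

A description that names no known SerDe class yields `None`, so unknown classes are reported as unsupported rather than guessed.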

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

amahussein · 2024-01-08T15:16:28Z

@nartal1
We need to discuss the changes in more detail, as I found some design issues that need to be addressed in order to fully support this feature.

1. In some situations, we need to check the "Spark properties" in order to decide which action to take for a given operator/exec.
2. I believe that after adding this feature, DataSourceInfo is supposed to include the format of "Scan Hive/NativeScan". Currently, this does not happen because the tool looks for "ReadSchema" in the node info, which does not apply to ScanHive. It would be nice to confirm whether I am getting that right.
3. This point could be considered an improvement, but it would be nice to generate a recommendation based on the Spark properties. For example, if the CPU properties do not enable Spark conversion, then we recommend that they do. The open question is where to keep those recommendations, or how to pass them to the AutoTuner when it integrates with the Qual tool.
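
To make the last point concrete, here is a minimal sketch of deriving recommendations from the Spark properties. It assumes that "Spark conversion" refers to the Spark SQL properties `spark.sql.hive.convertMetastoreParquet` and `spark.sql.hive.convertMetastoreOrc` (both default to true when unset); the object name, method name, and message text are hypothetical, and where the result would be stored or passed to the AutoTuner is left open.

```scala
// Hypothetical sketch; the object, method, and message wording are illustrative only.
object HiveConversionAdvisor {
  // Assumed to be the relevant properties; both default to true when unset.
  private val conversionProps: Seq[String] = Seq(
    "spark.sql.hive.convertMetastoreParquet",
    "spark.sql.hive.convertMetastoreOrc"
  )

  // Emits one recommendation per conversion property explicitly set to false.
  def recommend(sparkProps: Map[String, String]): Seq[String] =
    conversionProps.collect {
      case key if sparkProps.get(key).exists(_.trim.equalsIgnoreCase("false")) =>
        s"Set $key=true to use Spark's built-in reader instead of the Hive SerDe."
    }
}
```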

nartal1 · 2024-01-08T18:11:23Z

> 2. I believe that after adding this feature, DataSourceInfo is supposed to include the format of "Scan Hive/NativeScan". Currently, this does not happen because the tool looks for "ReadSchema" in the node info, which does not apply to ScanHive. It would be nice to confirm whether I am getting that right.

You are right. Currently, the tool looks for "ReadSchema", which is present in the event logs for the ORC and Parquet file formats. We try to get the schema as well. IIRC, the schema is not present in the event logs for Scan Hive. We need to update that function to support Scan Hive file formats.
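
For illustration, the behavior discussed here could look roughly like the sketch below: extract the `ReadSchema: struct<...>` entry when the node description has one, and fall back to an empty schema otherwise (as for Scan Hive). `ReadSchemaSketch` and `readSchema` are hypothetical names, not the tool's actual function, and the bracket-matching approach is an assumption.

```scala
// Hypothetical sketch; not the tool's actual parsing code.
object ReadSchemaSketch {
  private val marker = "ReadSchema: "

  // Returns the text after "ReadSchema: " up to the '>' that closes the outermost struct,
  // or an empty string when the marker is absent (e.g. Scan Hive nodes).
  def readSchema(nodeDescr: String): String = {
    val start = nodeDescr.indexOf(marker)
    if (start < 0) {
      ""
    } else {
      val rest = nodeDescr.substring(start + marker.length)
      var depth = 0
      var end = -1
      var i = 0
      // Track angle-bracket nesting so that nested structs are kept whole.
      while (i < rest.length && end < 0) {
        rest.charAt(i) match {
          case '<' => depth += 1
          case '>' =>
            depth -= 1
            if (depth == 0) end = i + 1
          case _ =>
        }
        i += 1
      }
      // If the brackets never balance (e.g. a truncated schema), keep the remainder as-is.
      if (end > 0) rest.substring(0, end) else rest
    }
  }
}
```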

nartal1 · 2024-01-09T09:16:41Z

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/HiveParseHelper.scala

+      "HiveParquet"),
+    HiveScanSerdeClasses("org.apache.hadoop.hive.serde2.avro.AvroSerDe", "HiveAvro"),
+    HiveScanSerdeClasses("org.apache.hadoop.hive.serde2.OpenCSVSerde", "HiveCSV"),
+    HiveScanSerdeClasses("org.apache.hadoop.hive.ql.io.orc.OrcSerde", "HiveORC")


I wonder whether these are the only SerDe classes in use, or whether a different class could be used for HiveParquet, HiveORC, etc. I understand that we cannot cover all the classes, but it would be good to document that if that's the case.

amahussein · 2024-01-10T18:12:58Z

#723 supersedes this PR.

Add support of HiveTableScan TextHive

0570536

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

amahussein added the "feature request" and "core_tools" labels Dec 29, 2023

amahussein self-assigned this Dec 29, 2023

Add ORCHive serde to the lookup table

8ddd43c

Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>

amahussein mentioned this pull request Dec 29, 2023

Add NativeScan name as filter for Gluten #694

Closed

nartal1 reviewed Jan 9, 2024

amahussein closed this Jan 10, 2024
