
[Bug][Tis] Task description: running a Hive2Doris import passes SelectedTab cols entries with type: null, so validation fails and the sync pipeline task cannot be created. #391

Open
alldatafounder opened this issue Nov 5, 2024 · 3 comments


alldatafounder commented Nov 5, 2024

Task description: when running a Hive2Doris import, the Hive table columns' type comes back null, so the SelectedTab's cols struct is passed with type: null; validation then fails and the sync pipeline task cannot be created.
Relevant versions: Doris 2.0.7, Hive client 2.1.1-cdh, Hive server 2.3.2
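For illustration only, the offending payload looks roughly like this, with field names inferred from the description (the real serialized structure may differ):

{
  "name": "pokes",
  "cols": [
    {"name": "foo", "type": null, "pk": false, "nullable": true},
    {"name": "bar", "type": null, "pk": false, "nullable": true}
  ]
}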

So far I have tried the following: [screenshot]

Hive table (definition and sample data): [screenshots]

Thoughts on a fix:

Reasoning: both the MySQL and Oracle paths execute the code below to assign the type. The SelectedTab structure is read back from a temporary XML file, and each column's type is then assigned a DataType via JDBCTypes.

Guess: the Hive path never performs this step, which is what causes the problem.

    /**
     * Back-fills each selected column's metadata (pk / type / comment / nullable)
     * from the table's actual column metadata.
     */
    public static void fillSelectedTabMeta(ISelectedTab tab,
                                           Function<ISelectedTab, Map<String, ColumnMetaData>> tableColsMetaGetter) {
        Map<String, ColumnMetaData> colsMeta = tableColsMetaGetter.apply(tab);
        ColumnMetaData colMeta = null;
        if (colsMeta.size() < 1) {
            throw new IllegalStateException("table:" + tab.getName() + " relevant cols meta can not be null");
        }
        for (CMeta col : tab.getCols()) {
            colMeta = colsMeta.get(col.getName());
            if (colMeta == null) {
                throw new IllegalStateException("col:" + col.getName() + " can not find relevant 'col' on " + tab.getName() + ",exist Keys:[" + colsMeta.keySet().stream().collect(Collectors.joining(",")) + "]");
            }
            // copy the authoritative metadata onto the selected column,
            // including the DataType that would otherwise stay null
            col.setPk(colMeta.isPk());
            col.setType(colMeta.getType());
            col.setComment(colMeta.getComment());
            col.setNullable(colMeta.isNullable());
        }
    }
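If this guess is right, the fix would be for the Hive data source to run the same back-fill after deserializing the SelectedTab. A minimal sketch, assuming a hypothetical getHiveTableColsMeta helper that reads the table's column metadata from the Hive metastore (the helper name and signature are illustrative, not the actual TIS API):

    // illustrative only: route the Hive path through the same back-fill step
    // that the MySQL/Oracle paths already use, so each column's null type is set
    fillSelectedTabMeta(tab, (selectedTab) -> {
        // hypothetical helper: column metadata for the table, keyed by column name
        return getHiveTableColsMeta(selectedTab.getName());
    });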
baisui1981 (Member) commented Nov 5, 2024

The DDL of the table in question:

 CREATE TABLE `pokes`(                              
   `foo` int,                                       
   `bar` string)                                    
 ROW FORMAT SERDE                                   
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  
 STORED AS INPUTFORMAT                              
   'org.apache.hadoop.mapred.TextInputFormat'       
 OUTPUTFORMAT                                       
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
 LOCATION                                           
   'hdfs://namenode:8020/user/hive/warehouse/pokes' 
 TBLPROPERTIES (                                    
   'transient_lastDdlTime'='1730453007');

insert into pokes(foo,bar) values (1,'name1'),(2,'name2'),(3,'name3'),(4,'name4');
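To check what the driver actually reports for these columns (and whether the type really comes back empty), a small probe against the standard java.sql metadata API can help. A generic sketch; the JDBC URL and credentials are assumed (HiveServer2 on port 10000):

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    public class HiveColsProbe {
        public static void main(String[] args) throws Exception {
            // assumed JDBC URL; requires the hive-jdbc driver on the classpath
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "")) {
                DatabaseMetaData meta = conn.getMetaData();
                try (ResultSet cols = meta.getColumns(null, "default", "pokes", "%")) {
                    while (cols.next()) {
                        // DATA_TYPE is the java.sql.Types code a JDBCTypes mapping would rely on
                        System.out.println(cols.getString("COLUMN_NAME")
                                + " -> DATA_TYPE=" + cols.getInt("DATA_TYPE")
                                + ", TYPE_NAME=" + cols.getString("TYPE_NAME"));
                    }
                }
            }
        }
    }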

alldatafounder (Author) commented

The Hive server is 2.3.1, Doris is 2.0.7, and the Hive client is 2.1.1-cdh, the version bundled with TIS.
The Hive service is currently brought up with Docker, so Hive can be spun up quickly to inspect the problem; connections use ports 9083 and 10000.
Link: https://pan.baidu.com/s/1yWRi1sLhZEqJah-YvyUYYA extraction code: q7ew

baisui1981 (Member) commented Nov 7, 2024

Create a table that uses the PARQUET file format:

CREATE TABLE customer_transactions (
    transaction_id INT,
    customer_id INT,
    amount DECIMAL(10,2),
    product_code STRING,
    transaction_date TIMESTAMP
)
STORED AS PARQUET
LOCATION
   'hdfs://namenode:8020/user/hive/warehouse/customer_transactions'
TBLPROPERTIES (
    'parquet.compression'='SNAPPY',          -- set the compression codec to Snappy
    'parquet.block.size'='134217728',        -- set the block size to 128 MB
    'parquet.page.size'='1048576',           -- set the page size to 1 MB
    'parquet.dictionary.enabled'='TRUE',     -- enable dictionary encoding
    'parquet.enable.dictionary'='TRUE',      -- enable dictionary encoding (duplicate key, kept to make sure it takes effect)
    'parquet.write.support'='org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'  -- specify the output format class
);

Seed a few records:

INSERT INTO customer_transactions (transaction_id, customer_id, amount, product_code, transaction_date)
VALUES 
(1, 101, 150.00, 'A123', '2024-01-01 00:00:00'),
(2, 102, 200.50, 'B456', '2024-01-02 00:00:00'),
(3, 103, 75.25, 'C789', '2024-01-03 00:00:00'),
(4, 104, 300.00, 'D101', '2024-01-04 00:00:00'),
(5, 105, 50.00, 'E102', '2024-01-05 00:00:00');

Explanation:

  • parquet.compression: sets the compression codec; common values are SNAPPY, GZIP, and LZO.
  • parquet.block.size: sets the block size of each Parquet file in bytes; the default is typically 128 MB.
  • parquet.page.size: sets the size of each page in bytes; the default is typically 1 MB.
  • parquet.dictionary.enabled: enables or disables dictionary encoding; the default is TRUE.
  • parquet.enable.dictionary: an alternative key for enabling dictionary encoding, kept to make sure it takes effect.
  • parquet.write.support: specifies the output format class; it normally does not need to be changed.

These parameters can be tuned to fit your needs and optimize storage and query performance.
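To verify that these properties actually took effect on the table, they can be listed with a standard Hive statement:

SHOW TBLPROPERTIES customer_transactions;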

baisui1981 added a commit that referenced this issue Nov 8, 2024
baisui1981 added a commit to qlangtech/DataX that referenced this issue Nov 8, 2024
baisui1981 added a commit to qlangtech/plugins that referenced this issue Nov 8, 2024