
[Bug][Tis] Task description: running a Hive2Doris import passes SelectedTab cols entries with type: null, so validation fails and the sync pipeline task cannot be created. #391

Open
alldatafounder opened this issue Nov 5, 2024 · 3 comments


alldatafounder commented Nov 5, 2024

Task description: when running a Hive2Doris import, the Hive table columns' type comes back null, so the SelectedTab's cols struct is passed with type: null; validation then fails and the sync pipeline task cannot be created.
Relevant versions: Doris 2.0.7, Hive client 2.1.1-cdh, Hive server 2.3.2
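For illustration only, the offending payload looks roughly like this, with field names inferred from the description (the real serialized structure may differ):

{
  "name": "pokes",
  "cols": [
    {"name": "foo", "type": null, "pk": false, "nullable": true},
    {"name": "bar", "type": null, "pk": false, "nullable": true}
  ]
}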

So far I have tried the following: [screenshot]

Hive table (definition and sample data): [screenshots]

Thoughts on a fix:

Reasoning: both the MySQL and Oracle paths execute the code below to assign the type. The SelectedTab structure is read back from a temporary XML file, and each column's type is then assigned a DataType via JDBCTypes.

Guess: the Hive path never performs this step, which is what causes the problem.

    /**
     * Back-fills each selected column's metadata (pk / type / comment / nullable)
     * from the table's actual column metadata.
     */
    public static void fillSelectedTabMeta(ISelectedTab tab,
                                           Function<ISelectedTab, Map<String, ColumnMetaData>> tableColsMetaGetter) {
        Map<String, ColumnMetaData> colsMeta = tableColsMetaGetter.apply(tab);
        ColumnMetaData colMeta = null;
        if (colsMeta.size() < 1) {
            throw new IllegalStateException("table:" + tab.getName() + " relevant cols meta can not be null");
        }
        for (CMeta col : tab.getCols()) {
            colMeta = colsMeta.get(col.getName());
            if (colMeta == null) {
                throw new IllegalStateException("col:" + col.getName() + " can not find relevant 'col' on " + tab.getName() + ",exist Keys:[" + colsMeta.keySet().stream().collect(Collectors.joining(",")) + "]");
            }
            // copy the authoritative metadata onto the selected column,
            // including the DataType that would otherwise stay null
            col.setPk(colMeta.isPk());
            col.setType(colMeta.getType());
            col.setComment(colMeta.getComment());
            col.setNullable(colMeta.isNullable());
        }
    }
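If this guess is right, the fix would be for the Hive data source to run the same back-fill after deserializing the SelectedTab. A minimal sketch, assuming a hypothetical getHiveTableColsMeta helper that reads the table's column metadata from the Hive metastore (the helper name and signature are illustrative, not the actual TIS API):

    // illustrative only: route the Hive path through the same back-fill step
    // that the MySQL/Oracle paths already use, so each column's null type is set
    fillSelectedTabMeta(tab, (selectedTab) -> {
        // hypothetical helper: column metadata for the table, keyed by column name
        return getHiveTableColsMeta(selectedTab.getName());
    });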
baisui1981 (Member) commented Nov 5, 2024

The DDL of the table in question:

 CREATE TABLE `pokes`(                              
   `foo` int,                                       
   `bar` string)                                    
 ROW FORMAT SERDE                                   
   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  
 STORED AS INPUTFORMAT                              
   'org.apache.hadoop.mapred.TextInputFormat'       
 OUTPUTFORMAT                                       
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' 
 LOCATION                                           
   'hdfs://namenode:8020/user/hive/warehouse/pokes' 
 TBLPROPERTIES (                                    
   'transient_lastDdlTime'='1730453007');

insert into pokes(foo,bar) values (1,'name1'),(2,'name2'),(3,'name3'),(4,'name4');
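To check what the driver actually reports for these columns (and whether the type really comes back empty), a small probe against the standard java.sql metadata API can help. A generic sketch; the JDBC URL and credentials are assumed (HiveServer2 on port 10000):

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    public class HiveColsProbe {
        public static void main(String[] args) throws Exception {
            // assumed JDBC URL; requires the hive-jdbc driver on the classpath
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "")) {
                DatabaseMetaData meta = conn.getMetaData();
                try (ResultSet cols = meta.getColumns(null, "default", "pokes", "%")) {
                    while (cols.next()) {
                        // DATA_TYPE is the java.sql.Types code a JDBCTypes mapping would rely on
                        System.out.println(cols.getString("COLUMN_NAME")
                                + " -> DATA_TYPE=" + cols.getInt("DATA_TYPE")
                                + ", TYPE_NAME=" + cols.getString("TYPE_NAME"));
                    }
                }
            }
        }
    }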

alldatafounder (Author) commented

The Hive server is 2.3.1, Doris is 2.0.7, and the Hive client is 2.1.1-cdh, the version bundled with TIS.
The Hive service is currently brought up with Docker, so Hive can be spun up quickly to inspect the problem; connections use ports 9083 and 10000.
Link: https://pan.baidu.com/s/1yWRi1sLhZEqJah-YvyUYYA extraction code: q7ew

baisui1981 (Member) commented Nov 7, 2024

Create a table that uses the PARQUET file format:

CREATE TABLE customer_transactions (
    transaction_id INT,
    customer_id INT,
    amount DECIMAL(10,2),
    product_code STRING,
    transaction_date TIMESTAMP
)
STORED AS PARQUET
LOCATION
   'hdfs://namenode:8020/user/hive/warehouse/customer_transactions'
TBLPROPERTIES (
    'parquet.compression'='SNAPPY',          -- set the compression codec to Snappy
    'parquet.block.size'='134217728',        -- set the block size to 128 MB
    'parquet.page.size'='1048576',           -- set the page size to 1 MB
    'parquet.dictionary.enabled'='TRUE',     -- enable dictionary encoding
    'parquet.enable.dictionary'='TRUE',      -- enable dictionary encoding (duplicate key, kept to make sure it takes effect)
    'parquet.write.support'='org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'  -- specify the output format class
);

Seed a few records:

INSERT INTO customer_transactions (transaction_id, customer_id, amount, product_code, transaction_date)
VALUES 
(1, 101, 150.00, 'A123', '2024-01-01 00:00:00'),
(2, 102, 200.50, 'B456', '2024-01-02 00:00:00'),
(3, 103, 75.25, 'C789', '2024-01-03 00:00:00'),
(4, 104, 300.00, 'D101', '2024-01-04 00:00:00'),
(5, 105, 50.00, 'E102', '2024-01-05 00:00:00');

Explanation:

  • parquet.compression: sets the compression codec; common values are SNAPPY, GZIP, and LZO.
  • parquet.block.size: sets the block size of each Parquet file in bytes; the default is typically 128 MB.
  • parquet.page.size: sets the size of each page in bytes; the default is typically 1 MB.
  • parquet.dictionary.enabled: enables or disables dictionary encoding; the default is TRUE.
  • parquet.enable.dictionary: an alternative key for enabling dictionary encoding, kept to make sure it takes effect.
  • parquet.write.support: specifies the output format class; it normally does not need to be changed.

These parameters can be tuned to fit your needs and optimize storage and query performance.
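To verify that these properties actually took effect on the table, they can be listed with a standard Hive statement:

SHOW TBLPROPERTIES customer_transactions;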

baisui1981 added a commit that referenced this issue Nov 8, 2024
baisui1981 added a commit to qlangtech/DataX that referenced this issue Nov 8, 2024
baisui1981 added a commit to qlangtech/plugins that referenced this issue Nov 8, 2024