Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

could it support chinese? #21

Open
houzi1099857978 opened this issue Mar 5, 2021 · 12 comments
Open

could it support chinese? #21

houzi1099857978 opened this issue Mar 5, 2021 · 12 comments

Comments

@houzi1099857978
Copy link

houzi1099857978 commented Mar 5, 2021

when i try to insert into chinese,i got zhe flowing error:

postgres@linux-dtq3:/home/fdw/postgres-elasticsearch-fdw-master> 2021-03-05 05:37:13.358 EST [1705] 错误:  INDEX for /test-index/doc/1 and document {'istrue': u't', 'testjson': {}, 'createdate': datetime.date(2021, 3, 5), 'time': u'11:27:00', 'flag': u'111', 'num': 666L, 'testbytea': '\xe6\xb5\x8b\xe8\xaf\x95', 'address': u'\u5317\u4eac\u5e02\u4e30\u53f0\u533a', 'price2': u'66.666', 'message': u'\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5', 'datetime': datetime.datetime(2021, 3, 5, 12, 22, 54), 'price': 11.01} failed: ({'istrue': u't', 'testjson': {}, 'createdate': datetime.date(2021, 3, 5), 'time': u'11:27:00', 'flag': u'111', 'num': 666L, 'testbytea': '\xe6\xb5\x8b\xe8\xaf\x95', 'address': u'\u5317\u4eac\u5e02\u4e30\u53f0\u533a', 'price2': u'66.666', 'message': u'\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5\u6d4b\u8bd5\u5b57\u6bb5', 'datetime': datetime.datetime(2021, 3, 5, 12, 22, 54), 'price': 11.01}, UnicodeDecodeError('ascii', '"\xe6\xb5\x8b\xe8\xaf\x95"', 1, 2, 'ordinal not in range(128)'))
2021-03-05 05:37:13.358 EST [1705] 语句:  insert into test_es(id,num,flag,istrue,createdate,price,price2,datetime,address,message,testbytea,testjson,time) 
	values(1,666,'111'::"bit",true,'2021-03-05',11.01,66.666,'2021-03-05 12:22:54','北京市丰台区','测试字段测试字段测试字段测试字段','测试'::bytea,'{}','11:27')
@houzi1099857978 houzi1099857978 changed the title Does it support chinese? could it support chinese? Mar 5, 2021
@matthewfranglen
Copy link
Owner

I've created this branch: https://github.com/matthewfranglen/postgres-elasticsearch-fdw/tree/chinese which has a test for inserting Chinese. It passes.

There seems to be a problem with your system configuration for this as it is attempting to decode the string as ascii:

UnicodeDecodeError('ascii', '"\xe6\xb5\x8b\xe8\xaf\x95"', 1, 2, 'ordinal not in range(128)')

At the moment I don't really have a solid idea why your system is defaulting to decoding the string as ascii. I suggest looking at how the docker images differ to your system.

@matthewfranglen
Copy link
Owner

I think you should look into your locale settings. See locale.getdefaultlocale which may be the cause of your issue.

Inside the postgres docker container the $LANG is en_US.utf8.

@houzi1099857978
Copy link
Author

I find the problem is not because i used chinese,it is couse by useing bytea type.
when i insert into chinese to a bytea column,i will use the flowing sql to convert data to bytea by the flowing sql:
INSERT INTO test_es(id,testbytea) values (1,'中文测试'::bytea);
that will couse the problem.

@matthewfranglen
Copy link
Owner

What's the table definition?

@houzi1099857978
Copy link
Author

houzi1099857978 commented Mar 9, 2021

What's the table definition?

CREATE EXTENSION multicorn;

CREATE SERVER multicorn_es FOREIGN DATA WRAPPER multicorn
OPTIONS (
  wrapper 'pg_es_fdw.ElasticsearchFDW'
);

DROP FOREIGN TABLE IF EXISTS "public"."test_es";
CREATE FOREIGN TABLE test_es (
  "id" int8 NOT NULL,
  "num" int4,
  "flag" bit(3),
  "istrue" bool,
  "createdate" date,
  "price" numeric(5,2),
  "price2" float8,
  "datetime" timestamp(6),
  "address" varchar(255) COLLATE "pg_catalog"."default",
  "message" text COLLATE "pg_catalog"."default",
  "testbytea" bytea,
  "testjson" json,
  "time" time(6)
)
SERVER multicorn_es
OPTIONS(
   host '10.10.0.160',
   port '9200',
   index 'test_es',
   type 'doc',
   rowid_column 'id',
   query_column 'query',
   query_dsl 'false',
--   score_column 'message',
--   default_sort 'last_updated:desc',
--   sort_column 'id',
   refresh 'false',
   complete_returning 'false',
   timeout '20',
   username 'elastic',
   password 'changeme'
)
;
INSERT INTO "public"."test_es" VALUES (1, 666, '111', 't', '2021-03-05', 11.01, 66.666, '2021-03-05 12:22:54', 'asdfa', 'adsfaaaa', '中文测试'::bytea, '{}', '11:27:00');

@matthewfranglen
Copy link
Owner

The elasticsearch schema would be great if you have that as well.

@houzi1099857978
Copy link
Author

houzi1099857978 commented Mar 9, 2021

The elasticsearch version is 6.8.6,the index is created by default when i insert data

{
  "state": "open",
  "settings": {
    "index": {
      "creation_date": "1614943999235",
      "number_of_shards": "5",
      "number_of_replicas": "1",
      "uuid": "30jZAEgwQLunvvH8A1FzGA",
      "version": {
        "created": "6080699"
      },
      "provided_name": "test_es"
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "address": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "flag": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "num": {
          "type": "long"
        },
        "createdate": {
          "type": "date"
        },
        "message": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "istrue": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "datetime": {
          "type": "date"
        },
        "testbytea": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "price666": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "testjson": {
          "type": "object"
        },
        "price": {
          "type": "float"
        },
        "time": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        },
        "price2": {
          "type": "text",
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          }
        }
      }
    }
  },
  "aliases": [],
  "primary_terms": {
    "0": 1,
    "1": 1,
    "2": 1,
    "3": 1,
    "4": 1
  },
  "in_sync_allocations": {
    "0": [
      "7UsFkimCRji0kviOgKk7gg",
      "srAnowRfQBSmlzSTGge6bA"
    ],
    "1": [
      "OhjG7VQ2Tm2l5VKU3c7IZw",
      "w_y52WfjTuGU-p7unXTQ6w"
    ],
    "2": [
      "rpfDNJheRjSdLpg-ZYkxMg",
      "YT1fVCEZT1mrdNyFrqnSbg"
    ],
    "3": [
      "8OCOll5qQjyyJ11pgyczzw",
      "Osh_2xoQRgm_zbhNArTTvQ"
    ],
    "4": [
      "hsy1xN8aRDqBNhex3kjqNA",
      "yzp3uT-OSlWPRFdvIhCbKA"
    ]
  }
}

@matthewfranglen
Copy link
Owner

That's great, thank you.

@matthewfranglen
Copy link
Owner

I wonder if the problem is this field definition:

"testbytea": {
  "type": "text",
  "fields": {
    "keyword": {
      "ignore_above": 256,
      "type": "keyword"
    }
  }
}

It looks like it should have a type of binary. The current fdw would not work with it as it needs to encode the data as base64. I think that even with that change the schema will need to be created manually rather than using the automatic creation because the elasticsearch definition would still be text.

I'll see what I can do about fixing this on the postgres side.

@houzi1099857978
Copy link
Author

I wonder if the problem is this field definition:

"testbytea": {
  "type": "text",
  "fields": {
    "keyword": {
      "ignore_above": 256,
      "type": "keyword"
    }
  }
}

It looks like it should have a type of binary. The current fdw would not work with it as it needs to encode the data as base64. I think that even with that change the schema will need to be created manually rather than using the automatic creation because the elasticsearch definition would still be text.

I'll see what I can do about fixing this on the postgres side.

I change the es index type to binary,but the problem still exsist,maybe it is caused by php conventer.

@matthewfranglen
Copy link
Owner

matthewfranglen commented Mar 11, 2021

The problem is in the PG FDW. Elasticsearch expects the data to be base64 encoded.

I need to add a data conversion layer for the more complex types, which I've been putting off. Doing that should also allow me to address the GeoField ticket. Sorry about the delay.

@houzi1099857978
Copy link
Author

Thanks for help,I find a way to solve the problem.
when I login python print the default system encoding is ascill

linux_dtq4:~ # python 
Python 2.7.13 (default, Jan 11 2017, 10:56:06) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.getdefaultencoding()
ascii

So I changed the system encoding by create a sitecustomize.py in python Lib\site-packages:

cd /usr/lib/python2.7/site-packages/
vi sitecustomize.py

# encoding=utf8 
import sys
reload(sys)
sys.setdefaultencoding('utf8')

reboot the system,it works!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants