Wrong character encoding #69

Open
torhar opened this issue Jul 18, 2014 · 1 comment

torhar commented Jul 18, 2014

If an HTML page is returned with character encoding ISO-8859-1 and the pipeline runs with system encoding UTF-8, DocTextExtractor produces invalid characters in doExtract.

In line 346:

if (input==null && rawData!=null) input = new ByteArrayInputStream(rawData.getBytes());

rawData.getBytes() returns the byte representation of the string in the system encoding (UTF-8).

After that, TikaWrapper seems to process those bytes as ISO-8859-1 when streaming them to the filesystem. ISO-8859-1 is the original encoding of the content returned by the web server. In that case, TikaWrapper should use the system encoding (UTF-8) to handle the bytes.
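
A minimal sketch of the mismatch described above, using only the JDK. The class and method names (EncodingMismatchSketch, toStream, decode) are hypothetical and not part of the project; decode only stands in for whatever consumer reads the stream, since the issue does not show how TikaWrapper actually picks its charset. It encodes a string as UTF-8 and then decodes the bytes once as ISO-8859-1 (garbled) and once as UTF-8 (intact).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchSketch {

    // Encode with an explicit charset instead of relying on the
    // platform default used by the no-argument String.getBytes().
    public static InputStream toStream(String rawData, Charset charset) {
        return new ByteArrayInputStream(rawData.getBytes(charset));
    }

    // Hypothetical stand-in for a downstream consumer: reads all bytes
    // from the stream and decodes them with the given charset.
    public static String decode(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String rawData = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Mismatch: bytes written as UTF-8, read back as ISO-8859-1.
        String garbled = decode(toStream(rawData, StandardCharsets.UTF_8),
                                StandardCharsets.ISO_8859_1);

        // Consistent: the same charset on both sides.
        String intact = decode(toStream(rawData, StandardCharsets.UTF_8),
                               StandardCharsets.UTF_8);

        System.out.println(garbled.equals(rawData)); // false
        System.out.println(intact.equals(rawData));  // true
    }
}
```

Either side could be changed to make the charsets agree. Passing the charset explicitly at the getBytes call keeps the result independent of the system encoding, which may be easier to reason about than relying on the default.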

bejean added this to the ASAP milestone Sep 22, 2014
bejean added the bug label Sep 22, 2014
bejean (Owner) commented Sep 22, 2014

Please provide a sample URL.
