Wrong character encoding #69

torhar · 2014-07-18T11:18:00Z

If an HTML page is returned with character encoding ISO-8859-1 and the pipeline runs with system encoding UTF-8, DocTextExtractor produces invalid characters in doExtract.

In line

346 if (input==null && rawData!=null) input = new ByteArrayInputStream(rawData.getBytes());

rawData.getBytes() returns the bytes of the data string in the system encoding (UTF-8).

After that, TikaWrapper seems to process those bytes as ISO-8859-1 when streaming them to the filesystem. ISO-8859-1 is the original encoding of the content returned by the web server. In this case, TikaWrapper should use the system encoding (UTF-8) to handle the bytes.
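To illustrate the mismatch outside the pipeline, here is a minimal standalone sketch. The class `EncodingMismatchDemo` and its `roundTrip` helper are hypothetical names made up for this example; they are not part of DocTextExtractor or TikaWrapper. The sketch only assumes that the string is encoded with one charset and that the consumer decodes the resulting bytes with another.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the crawler code base.
class EncodingMismatchDemo {

    // Encodes rawData with one charset, wraps the bytes in a stream (as line 346 does),
    // then decodes the stream content with a possibly different charset.
    static String roundTrip(String rawData, Charset encodeWith, Charset decodeWith) {
        ByteArrayInputStream input = new ByteArrayInputStream(rawData.getBytes(encodeWith));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = input.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), decodeWith);
    }

    public static void main(String[] args) {
        String rawData = "M\u00fcnchen"; // contains U+00FC (u with diaeresis)

        // Mismatch: bytes written as UTF-8 but read back as ISO-8859-1.
        // U+00FC becomes the two characters U+00C3 U+00BC.
        String broken = roundTrip(rawData, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(broken.equals("M\u00c3\u00bcnchen"));

        // Consistent: the same charset on both sides preserves the text.
        String intact = roundTrip(rawData, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(intact.equals(rawData));
    }
}
```

Passing an explicit charset to getBytes(...) and using the same charset on the reading side avoids depending on the platform default, whichever component ends up owning that choice.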

bejean · 2014-09-22T14:15:50Z

Please provide a sample URL

bejean added this to the ASAP milestone Sep 22, 2014

bejean added the bug label Sep 22, 2014
