If an HTML page is returned with character encoding ISO-8859-1 and the pipeline runs with system encoding UTF-8, DocTextExtractor produces invalid characters in doExtract.
In line
346 if (input==null && rawData!=null) input = new ByteArrayInputStream(rawData.getBytes());
rawData.getBytes() returns the byte representation of the data string in the system encoding (UTF-8),
and after that TikaWrapper seems to process the bytes as ISO-8859-1 when streaming them to the filesystem. ISO-8859-1 is the original encoding of the content returned by the web server. In that case, TikaWrapper should use the system encoding (UTF-8) to handle the bytes.
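The mismatch described above can be reproduced with the JDK alone. The following is only a minimal sketch, not the project's code: the class EncodingMismatchDemo and the method roundTrip are hypothetical names, TikaWrapper is not called at all, and the sample string is made up. It assumes Java 9 or later for InputStream.readAllBytes(). It encodes a string to bytes with one charset and decodes the resulting stream with another, which is what appears to happen between line 346 and TikaWrapper.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use DocTextExtractor or TikaWrapper.
final class EncodingMismatchDemo {

    // Encode rawData with one charset, wrap the bytes in a stream (as line 346 does),
    // then decode the stream with a possibly different charset.
    static String roundTrip(String rawData, Charset encodeWith, Charset decodeWith) {
        InputStream input = new ByteArrayInputStream(rawData.getBytes(encodeWith));
        try {
            return new String(input.readAllBytes(), decodeWith);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String rawData = "Gr\u00fc\u00dfe"; // contains two non-ASCII characters

        // Bytes written as UTF-8 but read as ISO-8859-1: non-ASCII characters are corrupted.
        String broken = roundTrip(rawData, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // Same charset on both sides: the text survives unchanged.
        String intact = roundTrip(rawData, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        System.out.println("mismatch equals original: " + broken.equals(rawData));
        System.out.println("matching equals original: " + intact.equals(rawData));
    }
}
```

Passing an explicit charset to getBytes, and using the same charset wherever the bytes are decoded, removes the dependency on the system default. Whether TikaWrapper exposes a way to set that charset is not confirmed here.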