System.NullReferenceException at parser.parse in StreamTextExtractor.cs #150

johnwnowlin · 2022-02-17T15:13:14Z

Tika is crashing at line 30 of StreamTextExtractor.cs while attempting to extract text from a PDF. The file contains confidential information, so sorry, I can't post it.

var textExtractor = new TextExtractor();
var extraction = textExtractor.Extract(@"filename");

Exception details:
System.NullReferenceException
HResult=0x80004003
Message=Object reference not set to an instance of an object.
Source=TikaOnDotNet
StackTrace:
at org.apache.jempbox.impl.XMLUtil.getStringValue(Element node)

Oddly, even though this code is in a try/finally block, it throws an exception. If it would let me catch the exception, we could just ignore this file and keep going.

using (var inputStream = streamFactory(metadata))
{
    try
    {
        parser.parse(inputStream, handler, metadata, parseContext);
    }
    finally
    {
        inputStream.close();
    }
}

I can open the file in Adobe. I have also saved it as a new PDF, which fails in the same way.

Is it possible to catch this error so the code can keep going?
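
For reference, a try/finally block only guarantees cleanup; it does not catch anything, so the exception still propagates to the caller. A minimal sketch of skipping a failing file, assuming the failure surfaces as an ordinary managed exception: `SafeExtraction`, `TryExtract`, and `fakeExtract` are hypothetical names used for illustration, and `fakeExtract` stands in for the real extractor call, which is not invoked here.

```csharp
using System;

public static class SafeExtraction
{
    // Hypothetical helper: runs an extraction delegate and reports
    // failure through the return value instead of letting it propagate.
    public static bool TryExtract(
        Func<string, string> extract, string path,
        out string text, out Exception error)
    {
        try
        {
            text = extract(path);
            error = null;
            return true;
        }
        catch (Exception ex)
        {
            // Broad catch on purpose: the goal is to skip the file, not recover.
            text = null;
            error = ex;
            return false;
        }
    }

    public static void Main()
    {
        // Stand-in for the real extractor (an assumption for this sketch):
        // it throws for one file to simulate the crash described above.
        Func<string, string> fakeExtract = p =>
        {
            if (p.EndsWith("bad.pdf"))
                throw new NullReferenceException("simulated parser failure");
            return "text of " + p;
        };

        foreach (var path in new[] { "good.pdf", "bad.pdf", "other.pdf" })
        {
            if (TryExtract(fakeExtract, path, out var text, out var error))
                Console.WriteLine($"{path}: extracted {text.Length} characters");
            else
                Console.WriteLine($"{path}: skipped ({error.GetType().Name})");
        }
    }
}
```

In real code the delegate would wrap the actual extraction call. Whether this particular failure can be caught this way depends on where it is raised, which this report does not establish.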

johnwnowlin · 2022-03-04T20:47:31Z

The file causing the error came from a Konica copier and appears to be a TIFF embedded in a PDF. I suspect this error is related to issues #145 and #142, but only because Tika needs to extract information from a TIFF. I do not see how to add the optional dependencies to the .NET build to check whether that is the problem. Does anybody know how that is done?

KevM · 2022-07-14T15:13:42Z

It would be really nice to get an example that crashes so we could try to correct this issue in future releases.
