Passing Unicode directly raises TypeError in textanalyzer #11

DeaconDesperado · 2013-05-06T18:48:56Z

Perhaps I'm doing something wrong, but it's worth it to check.

My input to the ReadabilityTool is unicode utf-8 text. The input is already encoded, and I received a TypeError when trying to run the tests on it.

Traceback (most recent call last):
  File "/Users/uname/projects/news_genome/news_genome/features.py", line 137, in metrics
    flesch_readability(story),
  File "/Users/uname/projects/news_genome/news_genome/mlstripper.py", line 23, in wrapper
    return fn(text,*args,**kwargs)
  File "/Users/uname/projects/news_genome/news_genome/mlstripper.py", line 30, in wrapper
    ret = fn(*args,**kwargs)
  File "/Users/uname/projects/news_genome/news_genome/features.py", line 49, in flesch_readability
    contrib_score = rt.FleschReadingEase(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/readabilitytests.py", line 87, in FleschReadingEase
    self.__analyzeText(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/readabilitytests.py", line 49, in __analyzeText
    words = t.getWords(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/textanalyzer.py", line 50, in getWords
    text = self._setEncoding(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/textanalyzer.py", line 130, in _setEncoding
    text = unicode(text, "utf8").encode("utf8")
TypeError: decoding Unicode is not supported

It appears the logic at line 130 in textanalyzer.py expects to perform a encoding that is already performed.

def _setEncoding(self,text):
        try:
            text = unicode(text, "utf8").encode("utf8")
        except UnicodeError:
            try:
                text = unicode(text, "iso8859_1").encode("utf8")
            except UnicodeError:
                text = unicode(text, "ascii", "replace").encode("utf8")
        return text

Is there something I need to configure in order to make the module expect Unicode by default?

The text was updated successfully, but these errors were encountered:

kmike · 2013-05-06T20:10:49Z

Hi @DeaconDesperado ,

I don't have experience with nltk_contrib and textanalyzer.py, but "unicode" and "utf-8 text" are in some sense antonyms, not synonyms, because "utf8-encoded" means "binary". So what is your input exactly, unicode string or binary utf8-encoded string?

Based on _setEncoding method it looks like textanalyzer expects binary utf8-encoded string, so if you have unicode input, encoding it to utf8 before passing to readability will make algorithm work.

Based on your code snippet, it looks to me that unicode handling in nltk_contrib.readability is bad; it really should accept only unicode (and not binary data). There are many such issues fixed in NLTK 3.0 alpha, but nltk_contrib was not developed accordingly. Contributions are appreciated!

DeaconDesperado · 2013-05-08T14:12:58Z

@kmike Thanks! The input was in fact unicode text, so I did as you suggested and did the encoding in my application code before handing it off to readability.

We used the contrib module in a pretty awesome R&D project last Friday and I'd love to contribute.

I'm going to work on amending the handling in my fork and will file a pull request ASAP.

DeaconDesperado · 2013-05-08T20:18:39Z

@kmike I know you said that you haven't much used this particular module in contrib, but what do you think would be the most desirable handling of unicode here? Should the analyzer only check the encoding once and perhaps use it as an instance property? The problem seems to be that the functions are called separately in the ReadabilityTool. I'd like to start working on it but it would be good to have a consensus.

We should enforce unicode as input, agreed, but what from there?

kmike · 2013-05-08T21:30:36Z

I think that analyzer should only accept unicode text and leave the task of decoding to user, and that almost every ".encode" / ".decode" is in incorrect place now :)

Encoding/decoding should be done as closely as possible to IO: e.g. when file is read or written or when web page is downloaded. If some code doesn't perform IO, it shouldn't ever encode or decode text. Algorithms that work with text should work on unicode data (that is already decoded), and they should produce unicode data (that could be encoded later).

It could be reasonable for URLextracter to guess encoding of web page and decode data to unicode. But there are dedicated modules for this task (e.g. chardet or UnicodeDammit from BeautifulSoup) that handle much more cases than current _setEncoding methods.

textanalyzer shouldn't decode text itself;
crawler handles encodings strangely (it tries to decode data and if it fails binary data is used - this is pointless);
languageclassifier should decode text before building a classifier, when files are read (also, why not use nltk classifiers?);
syllables_no.py should use unicode internally (and it should have all constants unicode).

DeaconDesperado · 2013-05-09T13:27:41Z

Thanks for the excellent writeup, I agree with all your points 👍

I did some experimentation with this yesterday and noticed what you mean. Amending the encoding in the _setEncoding method leads to a gaggle of problems down the line, especially where string method are used on word instances (startswith for example tries to find str in a unicode type.)

I'm going to have another look at this today from the perspective of your guidelines.

kmike mentioned this issue Jan 29, 2015

readabilitytests problem with utf-8 characters #17

Closed

brendanwood mentioned this issue Jan 29, 2015

Modified readability.textanalyzer to use unicode internally. #18

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Passing Unicode directly raises TypeError in textanalyzer #11

Passing Unicode directly raises TypeError in textanalyzer #11

DeaconDesperado commented May 6, 2013

kmike commented May 6, 2013

DeaconDesperado commented May 8, 2013

DeaconDesperado commented May 8, 2013

kmike commented May 8, 2013

DeaconDesperado commented May 9, 2013

Passing Unicode directly raises TypeError in textanalyzer #11

Passing Unicode directly raises TypeError in textanalyzer #11

Comments

DeaconDesperado commented May 6, 2013

kmike commented May 6, 2013

DeaconDesperado commented May 8, 2013

DeaconDesperado commented May 8, 2013

kmike commented May 8, 2013

DeaconDesperado commented May 9, 2013