Passing Unicode directly raises TypeError in textanalyzer #11

Open
DeaconDesperado opened this issue May 6, 2013 · 5 comments

@DeaconDesperado

Perhaps I'm doing something wrong, but it's worth checking.

My input to the ReadabilityTool is unicode utf-8 text. The input is already encoded, and I received a TypeError when trying to run the tests on it.

Traceback (most recent call last):
  File "/Users/uname/projects/news_genome/news_genome/features.py", line 137, in metrics
    flesch_readability(story),
  File "/Users/uname/projects/news_genome/news_genome/mlstripper.py", line 23, in wrapper
    return fn(text,*args,**kwargs)
  File "/Users/uname/projects/news_genome/news_genome/mlstripper.py", line 30, in wrapper
    ret = fn(*args,**kwargs)
  File "/Users/uname/projects/news_genome/news_genome/features.py", line 49, in flesch_readability
    contrib_score = rt.FleschReadingEase(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/readabilitytests.py", line 87, in FleschReadingEase
    self.__analyzeText(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/readabilitytests.py", line 49, in __analyzeText
    words = t.getWords(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/textanalyzer.py", line 50, in getWords
    text = self._setEncoding(text)
  File "/usr/local/lib/python2.7/site-packages/nltk_contrib/readability/textanalyzer.py", line 130, in _setEncoding
    text = unicode(text, "utf8").encode("utf8")
TypeError: decoding Unicode is not supported

It appears the logic at line 130 in textanalyzer.py tries to decode (and re-encode) text that is already unicode.

def _setEncoding(self, text):
    try:
        text = unicode(text, "utf8").encode("utf8")
    except UnicodeError:
        try:
            text = unicode(text, "iso8859_1").encode("utf8")
        except UnicodeError:
            text = unicode(text, "ascii", "replace").encode("utf8")
    return text
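
For reference, here is a minimal Python 2 snippet (my own, not from the library) that reproduces the same error outside of textanalyzer: the two-argument unicode() constructor only accepts byte strings, so handing it something that is already unicode fails exactly like line 130 does.

text = u"some input text"               # a unicode object
unicode(text, "utf8")                   # TypeError: decoding Unicode is not supported
unicode(text.encode("utf8"), "utf8")    # works: the argument is a byte string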

Is there something I need to configure in order to make the module expect Unicode by default?

@kmike
Member

kmike commented May 6, 2013

Hi @DeaconDesperado,

I don't have experience with nltk_contrib and textanalyzer.py, but "unicode" and "utf-8 text" are in some sense antonyms, not synonyms, because "utf8-encoded" means "binary". So what is your input exactly: a unicode string or a binary utf8-encoded string?

Judging by the _setEncoding method, it looks like textanalyzer expects a binary utf8-encoded string, so if you have unicode input, encoding it to utf8 before passing it to readability should make the algorithm work.
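
Something along these lines in your calling code should work around it for now (a sketch only: I'm assuming the import path from your traceback, and get_article_text is just a stand-in for wherever your unicode text comes from):

from nltk_contrib.readability.readabilitytests import ReadabilityTool

rt = ReadabilityTool()
text = get_article_text()                            # unicode in your application
score = rt.FleschReadingEase(text.encode("utf-8"))   # hand over utf8-encoded bytes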

Based on your code snippet, it looks to me like the unicode handling in nltk_contrib.readability is poor; it really should accept only unicode (and not binary data). Many such issues were fixed in the NLTK 3.0 alpha, but nltk_contrib was not updated accordingly. Contributions are appreciated!

@DeaconDesperado
Author

@kmike Thanks! The input was in fact unicode text, so I did as you suggested and encoded it to utf8 in my application code before handing it off to readability.

We used the contrib module in a pretty awesome R&D project last Friday and I'd love to contribute.

I'm going to work on amending the handling in my fork and will file a pull request ASAP.

@DeaconDesperado
Author

@kmike I know you said you haven't used this particular module in contrib much, but what do you think would be the most desirable handling of unicode here? Should the analyzer check the encoding only once and perhaps store it as an instance property? The problem seems to be that the functions are called separately in the ReadabilityTool. I'd like to start working on it, but it would be good to have a consensus first.

We should enforce unicode as input, agreed, but where do we go from there?

@kmike
Member

kmike commented May 8, 2013

I think the analyzer should only accept unicode text and leave the task of decoding to the user, and that almost every ".encode" / ".decode" call is in the wrong place right now :)
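
A sketch of what I mean (not existing code, just illustrating the contract):

def _check_unicode(text):
    # reject binary input outright instead of trying to guess its encoding
    if not isinstance(text, unicode):
        raise TypeError("expected unicode, got %s; decode your input "
                        "(e.g. data.decode('utf-8')) before calling" % type(text))
    return text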

Encoding/decoding should be done as close to IO as possible: e.g. when a file is read or written, or when a web page is downloaded. If some code doesn't perform IO, it shouldn't ever encode or decode text. Algorithms that work with text should operate on unicode data (that has already been decoded), and they should produce unicode data (that can be encoded later).
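
So the calling code would look roughly like this (hypothetical filenames, and it assumes the analyzer accepts unicode as proposed; rt is the ReadabilityTool instance from earlier):

import codecs

# decode at the IO boundary, when the file is read...
with codecs.open("article.txt", "r", encoding="utf-8") as f:
    text = f.read()                          # unicode from here on

score = rt.FleschReadingEase(text)           # algorithms only ever see unicode

# ...and encode again only when writing results out
with codecs.open("report.txt", "w", encoding="utf-8") as f:
    f.write(u"Flesch reading ease: %.1f\n" % score)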

It could be reasonable for URLextracter to guess the encoding of a web page and decode the data to unicode, but there are dedicated modules for that task (e.g. chardet or UnicodeDammit from BeautifulSoup) that handle many more cases than the current _setEncoding methods.

  • textanalyzer shouldn't decode text itself;
  • crawler handles encodings strangely (it tries to decode the data, and if that fails the binary data is used as-is - this is pointless);
  • languageclassifier should decode text when the files are read, before building a classifier (also, why not use nltk classifiers?);
  • syllables_no.py should use unicode internally (and all of its constants should be unicode).
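
For the URLextracter point above, a dedicated detector could look roughly like this (a sketch using chardet; UnicodeDammit would be similar, and the URL is just a placeholder):

import urllib2
import chardet

raw = urllib2.urlopen("http://example.com/article.html").read()   # bytes off the wire
guess = chardet.detect(raw)                 # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
text = raw.decode(guess["encoding"] or "utf-8", "replace")         # unicode for the analyzers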

@DeaconDesperado
Author

Thanks for the excellent writeup; I agree with all your points 👍

I did some experimentation with this yesterday and see what you mean. Amending the encoding in the _setEncoding method leads to a gaggle of problems down the line, especially where string methods are used on word instances (startswith, for example, tries to find a str in a unicode type).
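
A quick example of the kind of breakage I mean (my own snippet, not from the module):

word = u"na\xefve"                   # unicode, as the analyzer should produce
print word.startswith("na")          # True: an ASCII str is silently coerced
word.startswith("na\xc3\xaf")        # UnicodeDecodeError: the utf8 byte string
                                     # can't be coerced with the default ascii codec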

I'm going to have another look at this today from the perspective of your guidelines.
