Source of language datasets #14
Comments
@DonaldTsang I really don't know because I'm not the dev, but isn't it in
@DonaldTsang (inside the lib folder)
@DonaldTsang But it's weird, because not all languages are in there, and the ones that are aren't written in the actual language (for example, the "fr" entry isn't written in French, and I don't understand what's written).
@DonaldTsang The dev primarily used Unicode checking to determine the language, though.
@Animenosekai If it really does only use Unicode checking, that would actually be really sweet, as it would be very useful for my goal of making language checking easier (which I hope to re-implement in Python).
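If the library really does lean on Unicode checking, the core idea can be sketched in Python like this. This is a rough heuristic of my own, not the project's actual code: it counts the script prefix in each character's Unicode name and reports the dominant one.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the dominant writing system of `text` from Unicode
    character names. This is script detection, not language
    detection: e.g. French and English both come back "LATIN"."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode names start with the script, e.g.
        # "LATIN SMALL LETTER A", "HIRAGANA LETTER KO".
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split(" ")[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The obvious limitation is that many languages share a script, so a pure Unicode check can only narrow candidates down, never distinguish, say, French from English; that is presumably where the n-gram datasets come in.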
I don't think it uses only Unicode checking, but why don't you open
Where is the source text dataset for the n-grams of those 100 languages? I'd like to see whether it differs from the UDHR usage in wooorm/franc#78, and whether it is more accurate.
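For context, a franc-style detector ranks character trigrams from a training corpus (typically the UDHR) and compares a sample's profile against each language's profile. A minimal sketch of that idea, using the classic Cavnar–Trenkle out-of-place measure (`trigram_profile` and `profile_distance` are illustrative names, not anything from this project):

```python
from collections import Counter

def trigram_profile(text, top=300):
    """Ranked character-trigram profile of `text`, most frequent first.
    Space padding captures word-boundary trigrams."""
    text = " " + " ".join(text.lower().split()) + " "
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top)]

def profile_distance(sample, reference):
    """Out-of-place distance between two ranked profiles:
    sum of rank differences, with a max penalty for misses."""
    pos = {g: i for i, g in enumerate(reference)}
    max_penalty = len(reference)
    return sum(abs(i - pos[g]) if g in pos else max_penalty
               for i, g in enumerate(sample))
```

Given per-language reference profiles, the predicted language is simply the one whose profile has the smallest distance to the sample's — which is why the choice and coverage of the source corpus matters so much for accuracy.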