Source of language datasets #14

Open
DonaldTsang opened this issue Nov 20, 2019 · 7 comments

Comments

@DonaldTsang

Where is the source text dataset for the n-grams of those 100 languages? I would like to see whether it differs from the UDHR data used in wooorm/franc#78, and whether it is more accurate.

@Animenosekai

@DonaldTsang I really don’t know because I’m not the dev, but isn’t it in _languageData.js?

@Animenosekai

@DonaldTsang (inside the lib folder)

@Animenosekai

@DonaldTsang But it’s weird, because not all the languages are in there, and the ones that are in it aren’t written in the actual language (for example, the “fr” entry isn’t written in French, and I don’t understand what’s written).

@Animenosekai

@DonaldTsang The dev primarily used Unicode checking to determine the language, though.

@DonaldTsang
Author

@Animenosekai If it only uses Unicode checking, that would actually be really sweet, as that is very useful for my goal of making language checking easier (which I hope to reimplement in Python).
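
For the Python reimplementation mentioned here, a Unicode-based script check could be sketched roughly as follows. This is a minimal sketch, not the library's actual code: the function name `dominant_script` is hypothetical, and it assumes that the first word of each character's Unicode name (for example `LATIN`, `CYRILLIC`, `HANGUL`) is a good enough stand-in for its script.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script-like prefix among the letters in text.

    Hypothetical helper: the "script" is approximated by the first word
    of the character's Unicode name, e.g. 'CYRILLIC SMALL LETTER PE'
    gives 'CYRILLIC'. Returns None if text contains no letters.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation
        name = unicodedata.name(ch, "")  # "" when the character is unnamed
        if name:
            counts[name.split(" ")[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(dominant_script("Привет, мир"))
print(dominant_script("hello world"))
```

A check like this can only narrow things down to a script. It separates Russian from English, but it cannot tell French from Spanish, since both are written in Latin letters.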

@DonaldTsang
Author

The _languageData.js file seems to contain n-gram data.
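
If it is n-gram data, that would also explain why the “fr” entry does not read like French: a trigram profile is a ranked list of three-character fragments, not running text. Here is a minimal sketch of the general technique in Python, assuming a rank-based "out of place" comparison. The names `trigram_profile` and `out_of_place_distance` are hypothetical, and nothing here is confirmed to match how this library builds or compares its data.

```python
from collections import Counter


def trigram_profile(text, top=300):
    """Return the `top` most frequent character trigrams, most common first."""
    # Lowercase, collapse runs of whitespace, and pad with spaces so that
    # word boundaries also show up as trigrams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place_distance(profile, reference):
    """Sum of rank differences between two profiles; lower means closer.

    A trigram missing from the reference gets the maximum penalty.
    """
    rank = {gram: i for i, gram in enumerate(reference)}
    max_penalty = len(reference)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(profile)
    )


sample = trigram_profile("le chat est sur la table et le chien est dans le jardin")
print(sample[:5])  # fragments such as ' le' or 'le ', not readable French
```

To guess a language with this approach, you would build one reference profile per language from some source text, then pick the language whose profile has the smallest distance to the input's profile.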

@Animenosekai

@DonaldTsang I don't think it uses only Unicode checking, but why don't you open guessLanguage.js? It should contain everything you wanna know.
