
adds number_of_n_syllable_words_all function #21

Merged
merged 2 commits into from
Nov 5, 2024


ThHuberResearch
Contributor

This adds the number_of_n_syllable_words_all function to descriptive_statistics.py.
For a list of texts, it counts the frequency of all n-syllable words for all values of n that it finds.

At the moment it is already possible, but a bit cumbersome, to find the frequencies of all syllable counts that occur in a text.
Consider this example:

docs = ['This has a very long word: Pneumonoultramicroscopicsilicovolcanoconiosis']

The very long word (apparently) has 13 syllables, but I don't necessarily know that it exists in my corpus. Currently I could find it by calling number_of_n_syllable_words in a loop with a large enough upper bound:

from linguaf.descriptive_statistics import number_of_n_syllable_words

docs = ['This has a very long word: Pneumonoultramicroscopicsilicovolcanoconiosis']
for i in range(1, 14):
    freqs = number_of_n_syllable_words(docs, n=(i, i+1))
    print(i, freqs)

But this is a bit inefficient, since every call processes all texts again.
With the new function this becomes much easier:

from linguaf.descriptive_statistics import number_of_n_syllable_words_all

docs = ['This has a very long word: Pneumonoultramicroscopicsilicovolcanoconiosis']
freqs = number_of_n_syllable_words_all(docs)
print(freqs)
print(freqs[5])  # how many 5-syllable words?

Prints:

defaultdict(<class 'int'>, {1: 6, 13: 1})
0
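
To illustrate the idea of counting every syllable length in one pass, here is a minimal sketch. It does not use linguaf and is not the code added in this PR: `count_syllables` is a hypothetical vowel-run heuristic and `tally_syllable_counts` is a hypothetical stand-in name, so its numbers will differ from what linguaf reports.

```python
import re
from collections import defaultdict


def count_syllables(word: str) -> int:
    """Hypothetical heuristic: one syllable per run of vowels (not linguaf's method)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def tally_syllable_counts(docs: list[str]) -> defaultdict:
    """Visit every word once and tally how many words have each syllable count."""
    freqs = defaultdict(int)
    for doc in docs:
        for word in re.findall(r"[A-Za-z]+", doc):
            freqs[count_syllables(word)] += 1
    return freqs


freqs = tally_syllable_counts(["This has a very long word"])
print(dict(freqs))
print(freqs.get(5, 0))  # .get() reads a missing key without inserting it
```

One thing to keep in mind with a defaultdict result: indexing a missing key such as `freqs[5]` returns 0 but also adds that key to the dictionary, whereas `freqs.get(5, 0)` leaves it unchanged.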

This new function also allows the user to easily find the frequency of only certain n-syllable words, such as "frequency of all 3-syllable and 5-syllable words, but nothing else". Again, this is already possible with number_of_n_syllable_words, but that approach needs two function calls (one with n=(3,4) and one with n=(5,6)), so all texts are processed twice, which is a bit slow.
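
A small sketch of that selection step, assuming the result is the defaultdict printed in the description above. `select_counts` is a hypothetical helper for illustration only and is not part of linguaf.

```python
from collections import defaultdict

# Stand-in for the result printed in the PR description above.
freqs = defaultdict(int, {1: 6, 13: 1})


def select_counts(freqs, wanted):
    """Hypothetical helper: keep only the requested syllable counts, 0 if absent."""
    return {n: freqs.get(n, 0) for n in wanted}


selected = select_counts(freqs, (3, 5))
print(selected)
```

Because the texts were already processed once to build `freqs`, picking out any further values of n is just a dictionary lookup.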

@Perevalov
Member

Thanks @ThHuberSG, gonna test it locally


@Perevalov Perevalov left a comment

Looks good, tested locally

@Perevalov Perevalov merged commit cc8f4b6 into WSE-research:main Nov 5, 2024
3 checks passed