# Proposal for several improvements #195
Comments
Well @cvzi, it would be a huge update of the package, and the only thing that worries me is that the new version will not be backwards compatible. This package is used by 28 thousand repositories.
I assume we would have to release a minor version with a deprecation warning first, like:

```python
def get_emoji_regexp():
    warnings.warn(
        "emoji.get_emoji_regexp() is deprecated and will be removed in version 2.0.0",
        DeprecationWarning,
    )
    ...
```

I am not sure what the correct way to do this is, though; I need to read some more Python documentation or check out how other packages have handled this in the past.
I will focus on the `demojize()` performance first. This change can be made without breaking anything. It seems to work so far, and I am just writing a few more tests to make sure.
When I was writing tests, I realized there is an unclear case. We could add `language='alias'` so that `emojize(string, language='alias') == emojize(string, use_aliases=True) == emojize(string, use_aliases=True, language='en')`. As a breaking change in the future, we could then decide to remove `use_aliases`.
I looked into the proper way to handle version numbers and breaking changes according to https://semver.org/ and also how other projects on pypi.org have done this. This is my idea for the timeline: the current release is
Since the `EMOJI_DATA` "dict of dicts" that now contains all the data was added in #189, I would like to propose some further (breaking) changes to the package. If we make any of these changes, I suggest updating the version number to 2.0.0 to indicate that the release is not backwards compatible.
I would implement these changes myself.
Please reply with your thoughts and if you have any other suggestions or objections.
## Remove the old dicts

The old dicts like `EMOJI_UNICODE_ENGLISH`, `UNICODE_EMOJI_ENGLISH`, `EMOJI_UNICODE_GERMAN`, etc. are now generated from the dict of dicts `EMOJI_DATA`. We no longer need any of the old `UNICODE_EMOJI_*` dicts, because we can simply do `EMOJI_DATA[unicode_point]`. Currently they are public in the package, so removing them would be a breaking change.
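The direct lookup could look as follows. This is a minimal sketch; the nested structure of `EMOJI_DATA` (emoji mapped to a dict of language codes) is an assumption based on #189, and the entries shown are placeholders:

```python
# Stand-in for the real EMOJI_DATA "dict of dicts" from #189:
# each emoji maps to a dict of language code -> :name:.
EMOJI_DATA = {
    "\U0001F34E": {"en": ":red_apple:", "de": ":roter_apfel:"},
}

# Old style, via the per-language dict (to be removed):
#   UNICODE_EMOJI_ENGLISH["\U0001F34E"]  ->  ":red_apple:"

# New style, directly on the dict of dicts:
name = EMOJI_DATA["\U0001F34E"]["en"]
```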
## Only generate the dict that is needed for the language

We still need the `EMOJI_UNICODE_*` dicts to reverse the dict of dicts: `EMOJI_UNICODE_ENGLISH[":apple:"] = unicode_point_apple`. Currently we generate all of `EMOJI_UNICODE_ENGLISH`, `EMOJI_UNICODE_GERMAN`, etc. when the package is imported. I would like to generate them only if the language is actually used, because it is unlikely that someone uses several languages at the same time.
I suggest storing the dicts as follows and generating the data only when it is needed. I would change to these variable names and prefix them with `_` to indicate that they are not public. Because they are only generated on demand and are otherwise empty, they should not be used directly and therefore should not be public.
## Improve demojize performance

*Implemented in version 1.6.2*
The speed of `demojize()` has become slower with the Python 3 releases. The problem is that we use a huge regular expression that consists of all Unicode code points: `len(emoji.get_emoji_regexp().pattern) == 19559`. This was fast in Python 2, but it is really slow in Python 3 (about 15x slower on Python 3.10 than on Python 2.7). I am already working on this: I replaced the regular expression with a search in a [tree](https://en.wikipedia.org/wiki/Tree_(data_structure)), which improves the speed by about 5-10x. I will test this further to see if it works.

There was a suggestion in #30 to use ranges, e.g. `[\U0001F1E6-\U0001F1FF]`, for consecutive code points in the regular expression. I initially tested this a little bit by replacing a few emoji with ranges by hand. It would also be faster than the current regular expression, but I think it would be harder to implement, and it is unclear to me how much faster it would be in the end.
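The tree-based search can be sketched as follows. This is an illustrative toy version, not the package's actual implementation; `EMOJI_NAMES` and the function names are placeholders:

```python
# Toy data: emoji sequence -> :name:. Note the multi-codepoint flag.
EMOJI_NAMES = {
    "\U0001F600": ":grinning_face:",
    "\U0001F1E9\U0001F1EA": ":Germany:",
}

def build_tree(emoji_names):
    # Build a character tree (trie): each level keys on one character,
    # and a "name" entry marks the end of a complete emoji sequence.
    tree = {}
    for seq, name in emoji_names.items():
        node = tree
        for char in seq:
            node = node.setdefault(char, {})
        node["name"] = name
    return tree

def demojize(text, tree):
    out = []
    i = 0
    while i < len(text):
        node, j, match = tree, i, None
        # Walk down the tree while characters keep matching,
        # remembering the longest complete sequence seen so far.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "name" in node:
                match = (node["name"], j)
        if match:
            out.append(match[0])
            i = match[1]
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Compared to one giant alternation regex, each input character only triggers a handful of dict lookups, and multi-codepoint sequences fall out of the longest-match walk for free.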
## Update the language data

The emojiterra.com website uses the tag release-38 for the translations, e.g. https://github.com/unicode-org/cldr/raw/release-38/common/annotations/de.xml, which is from October 2020.
There are three newer releases of the translations: release-38-1 from December 2020, release-39 from April 2021 and release-40 from the end of October 2021.
I don't know if we want to use the newer ones, or if we want to wait until emojiterra.com updates.
At the moment our translation data is a mixture of different (older) versions. When we update to 38, 38.1, 39 or 40, some existing translations will change. That is a problem if someone has stored a text with the old translation in their database, so it is a breaking change.
I would suggest updating at least to release-38, so that we have the same names as emojiterra.com.
We should also think about how we deal with future updates: the translations can change with every release. A translation from release-38 might be different in release-39, so the next update might be a breaking change again.
We should add a note in the README explaining how we are going to deal with this in the future.
## How to handle an emoji that is supported in 'en' but not in the selected language?

The `demojize()` method currently skips an emoji when it is not available in the selected language. I think it is not obvious that this can happen: one would assume that after `demojize(string, language='xy')` the `string` does not contain any emoji anymore. An example is 🪹, which is in Unicode 14 and has not been translated into Spanish ('es') yet.
I don't think we need to change this, but we should add a note to the README that explains what can happen.
If we do want to change it, that would also be a breaking change, so now would be the time to do it.
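The skip behaviour can be illustrated with a self-contained toy version (not the library's actual code; the translation table is a placeholder, and only single-codepoint emoji are handled for brevity):

```python
# 🪹 (U+1FAB9, Unicode 14) has an English name but, in this sketch,
# no Spanish one yet.
TRANSLATIONS = {
    "en": {"\U0001FAB9": ":empty_nest:"},
    "es": {},
}

def demojize(text, language="en"):
    # An emoji with no name in the chosen language is left in the
    # string unchanged -- this is the surprising case described above.
    names = TRANSLATIONS[language]
    return "".join(names.get(char, char) for char in text)
```

So `demojize("\U0001FAB9", language="es")` returns the emoji itself, while the English call returns `":empty_nest:"`.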