# Proposal for several improvements #195
Comments
Well @cvzi, it would be a huge update of the package, and the only thing that worries me is that the new version will not be backwards compatible. This package is used by 28 thousand repositories.
I assume we would have to release a minor version with a deprecation warning first, like:

```python
def get_emoji_regexp():
    warnings.warn(
        "emoji.get_emoji_regexp() is deprecated and will be removed in version 2.0.0",
        DeprecationWarning,
    )
    ...
```

I am not sure what the correct way to do this is, though; I need to read some more Python documentation or check out how other packages have handled this in the past.
I will focus on the `demojize()` performance first. This change can be made without breaking anything. It seems to work so far, and I am just writing a few more tests to make sure.
When I was writing tests, I realized there is an unclear case. We could add `language='alias'` so that `emojize(string, language='alias') == emojize(string, use_aliases=True) == emojize(string, use_aliases=True, language='en')`. As a breaking change in the future, we could then decide to remove `use_aliases`.
I looked into the proper way to handle version numbers and breaking changes according to https://semver.org/ and also how other projects on pypi.org have done this. This is my idea for the timeline: the current release is
Since the `EMOJI_DATA` "dict of dicts" that now contains all the data was added in #189, I would like to propose some further (breaking) changes to the package. If we make any of these changes, I suggest updating the version number to 2.0.0 to indicate that the release is not backwards compatible.
I would implement these changes myself.
Please reply with your thoughts and if you have any other suggestions or objections.
## Remove the old dicts

The old dicts like `EMOJI_UNICODE_ENGLISH`, `UNICODE_EMOJI_ENGLISH`, `EMOJI_UNICODE_GERMAN`, etc. are now generated from the dict of dicts `EMOJI_DATA`. We no longer need any of the old `UNICODE_EMOJI_*` dicts, because we can simply do `EMOJI_DATA[unicode_point]`. Currently they are public in the package, so removing them would be a breaking change.
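The direct lookup could look as follows. This is a minimal sketch; the nested structure of `EMOJI_DATA` (emoji mapped to a dict of language codes) is an assumption based on #189, and the entries shown are placeholders:

```python
# Stand-in for the real EMOJI_DATA "dict of dicts" from #189:
# each emoji maps to a dict of language code -> :name:.
EMOJI_DATA = {
    "\U0001F34E": {"en": ":red_apple:", "de": ":roter_apfel:"},
}

# Old style, via the per-language dict (to be removed):
#   UNICODE_EMOJI_ENGLISH["\U0001F34E"]  ->  ":red_apple:"

# New style, directly on the dict of dicts:
name = EMOJI_DATA["\U0001F34E"]["en"]
```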
## Only generate the dict that is needed for the language

We still need the `EMOJI_UNICODE_*` dicts to reverse the dict of dicts: `EMOJI_UNICODE_ENGLISH[":apple:"] = unicode_point_apple`. Currently we generate all of `EMOJI_UNICODE_ENGLISH`, `EMOJI_UNICODE_GERMAN`, etc. when the package is imported. I would like to generate them only if the language is actually used, because it is unlikely that someone uses several languages at the same time.
I suggest storing the dicts as follows and generating the data only when it is needed. I would change to these variable names and prefix them with `_` to indicate that they are not public. Because they are only generated on demand and are otherwise empty, they should not be used directly and therefore should not be public.
## Improve demojize performance

*Implemented in version 1.6.2*
The speed of `demojize()` has become slower with the Python 3 releases. The problem is that we use a huge regular expression that consists of all Unicode code points: `len(emoji.get_emoji_regexp().pattern) == 19559`. This was fast in Python 2, but it is really slow in Python 3 (about 15x slower on Python 3.10 than on Python 2.7). I am already working on this: I replaced the regular expression with a search in a [tree](https://en.wikipedia.org/wiki/Tree_(data_structure)), which improves the speed by about 5-10x. I will test this further to see if it works.

There was a suggestion in #30 to use ranges, e.g. `[\U0001F1E6-\U0001F1FF]`, for consecutive code points in the regular expression. I initially tested this a little bit by replacing a few emoji with ranges by hand. It would also be faster than the current regular expression, but I think it would be harder to implement, and it is unclear to me how much faster it would be in the end.
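The tree-based search can be sketched as follows. This is an illustrative toy version, not the package's actual implementation; `EMOJI_NAMES` and the function names are placeholders:

```python
# Toy data: emoji sequence -> :name:. Note the multi-codepoint flag.
EMOJI_NAMES = {
    "\U0001F600": ":grinning_face:",
    "\U0001F1E9\U0001F1EA": ":Germany:",
}

def build_tree(emoji_names):
    # Build a character tree (trie): each level keys on one character,
    # and a "name" entry marks the end of a complete emoji sequence.
    tree = {}
    for seq, name in emoji_names.items():
        node = tree
        for char in seq:
            node = node.setdefault(char, {})
        node["name"] = name
    return tree

def demojize(text, tree):
    out = []
    i = 0
    while i < len(text):
        node, j, match = tree, i, None
        # Walk down the tree while characters keep matching,
        # remembering the longest complete sequence seen so far.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "name" in node:
                match = (node["name"], j)
        if match:
            out.append(match[0])
            i = match[1]
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Compared to one giant alternation regex, each input character only triggers a handful of dict lookups, and multi-codepoint sequences fall out of the longest-match walk for free.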
## Update the language data

The emojiterra.com website uses the tag release-38 for the translations, e.g. https://github.com/unicode-org/cldr/raw/release-38/common/annotations/de.xml, which is from October 2020.
There are three newer releases of the translations: release-38-1 from December 2020, release-39 from April 2021 and release-40 from the end of October 2021.
I don't know if we want to use the newer ones, or if we want to wait until emojiterra.com updates.
At the moment our translation data is a mixture of different (older) versions. When we update to 38, 38.1, 39 or 40, some existing translations will change. That is a problem if someone has stored a text with the old translation in their database, so it is a breaking change.
I would suggest updating at least to release-38, so that we have the same names as emojiterra.com.
We should also think about how we deal with future updates: the translations can change with every release. A translation from release-38 might be different in release-39, so the next update might be a breaking change again.
We should add a note in the README explaining how we are going to deal with this in the future.
## How to handle an emoji that is supported in 'en' but not in the selected language?

The `demojize()` method currently skips an emoji when it is not available in the selected language. I think it is not obvious that this can happen: one would assume that after `demojize(string, language='xy')` the `string` does not contain any emoji anymore. An example is 🪹, which is in Unicode 14 and has not been translated into Spanish ('es') yet.
I don't think we need to change this, but we should add a note to the README that explains what can happen.
If we do want to change it, that would also be a breaking change, so now would be the time to do it.
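The skip behaviour can be illustrated with a self-contained toy version (not the library's actual code; the translation table is a placeholder, and only single-codepoint emoji are handled for brevity):

```python
# 🪹 (U+1FAB9, Unicode 14) has an English name but, in this sketch,
# no Spanish one yet.
TRANSLATIONS = {
    "en": {"\U0001FAB9": ":empty_nest:"},
    "es": {},
}

def demojize(text, language="en"):
    # An emoji with no name in the chosen language is left in the
    # string unchanged -- this is the surprising case described above.
    names = TRANSLATIONS[language]
    return "".join(names.get(char, char) for char in text)
```

So `demojize("\U0001FAB9", language="es")` returns the emoji itself, while the English call returns `":empty_nest:"`.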