Feature: Use muan/emojilib dataset #286

FredHappyface · 2024-03-31T18:15:24Z

Background

I've fallen down a bit of a rabbit hole looking for ways to search emoji by plain text, which often calls for aliases. For example, '🤗' isn't called 'hug', but that is a useful alias (which this lib supports).

I installed Element on my phone a few months ago for discussion on another project and saw that it has really great search functionality, with a ton of aliases that are not present in this lib, for example ':)' for '😊'. So I started digging.

Basically, they use the following Python script during the build to fetch the latest emoji and aliases from a few other sources: https://github.com/element-hq/element-android/blob/def2a8a83351c06cb65fdbd4d483ac811329b023/tools/import_emojis.py#L20

One of these is the dataset available from https://github.com/muan/emojilib, which seems really good for this.

The questions/ feature request

Would you accept a PR to add the aliases from muan/emojilib to this project?

Also, I noticed that the demojize function only exposes the first alias if available, so I've written my own implementation for a lib that returns an underscore-separated string of keywords. Is such a function (maybe called get_aliases) something you'd accept a PR for?

Thanks for your time and for the awesome project

cvzi · 2024-04-01T12:41:21Z

I think the aliases that are listed in muan/emojilib are too broad for this project.
For example, search for ":D" in muan/emojilib: it refers to three different emoji, but this library requires unique aliases.

cvzi · 2024-04-01T12:45:47Z

I don't know if you saw this, but our database is actually just a python-dict. We recently needed to compress the dict into a single line, which makes it unreadable, but you can look at an older version to see how everything is stored:
https://raw.githubusercontent.com/carpedm20/emoji/f14ece8475a1f2323326a4b850a209509310e470/emoji/unicode_codes/data_dict.py (Look for the key 'alias')

Extending the dict with custom aliases is possible during runtime. See #268 (comment) on how to add a single alias. So it would be easy to just load the JSON file from muan/emojilib and add aliases.
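
A minimal sketch of the "load the JSON and add aliases" idea follows. It works on a small stand-in dict rather than the real `emoji.EMOJI_DATA`, and it assumes the emojilib file maps each emoji to a list of keywords. The helper name `add_unique_aliases` and the "first emoji to claim a name keeps it" policy are illustrative assumptions, not something the library provides. Whatever else the library needs in order to pick up the change is described in the referenced comment and is not reproduced here.

```python
import json

def add_unique_aliases(emoji_data, keywords_by_emoji):
    """Add ':keyword:' aliases in place, skipping names that are already taken."""
    # Collect every name that is already in use, so aliases stay unique.
    taken = set()
    for entry in emoji_data.values():
        taken.add(entry["en"])
        taken.update(entry.get("alias", []))

    for emj, keywords in keywords_by_emoji.items():
        if emj not in emoji_data:
            continue
        for keyword in keywords:
            alias = ":" + keyword.replace(" ", "_") + ":"
            if alias in taken:
                continue  # assumed policy: the first emoji to claim a name keeps it
            emoji_data[emj].setdefault("alias", []).append(alias)
            taken.add(alias)

# Stand-in for the library's dict (same shape, two entries only).
data = {
    "🤗": {"en": ":smiling_face_with_open_hands:", "alias": [":hugging_face:", ":hugs:"]},
    "😊": {"en": ":smiling_face_with_smiling_eyes:"},
}
# Stand-in for the downloaded JSON file; the keywords are made up.
raw = '{"🤗": ["hug", "smile"], "😊": ["smile", "blush"]}'

add_unique_aliases(data, json.loads(raw))
print(data["🤗"]["alias"])
print(data["😊"]["alias"])
```

In this sample, "smile" goes to 🤗 only because it is processed first, which shows why ambiguous keywords need an explicit decision.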

Regarding demojize, there is also the function replace_emoji which can be used to do what you want:

import emoji

def repl(emj, emj_data):
    # Start with the English name, then append any aliases
    name_list = [emj_data['en']]
    if 'alias' in emj_data:
        name_list += emj_data['alias']
    # Here you could also add aliases from muan/emojilib
    # just look up `emj` in their json data
    return "_".join(name_list)


print(emoji.replace_emoji('Test 🤗', replace=repl))
# Outputs: Test :smiling_face_with_open_hands:_:hugging_face:_:hugs:


# In the repl function:
# emj = "🤗"
# emj_data = {
#   'match_start': 5,
#   'match_end': 6,
#   'en': ':smiling_face_with_open_hands:',
#   'status': 2,
#   'E': 1,
#   'alias': [':hugging_face:', ':hugs:'],
#   'de': ':gesicht_mit_umarmenden_händen:',
#   'es': ':cara_con_manos_abrazando:',
#   ...}
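
The comment inside `repl` about looking up `emj` in the emojilib data could be sketched as a small factory that builds the callback. The name `make_repl` and the `extra` mapping are hypothetical, and the keywords shown are made up. The callback only relies on the `'en'` and `'alias'` keys shown above.

```python
def make_repl(extra_keywords):
    """Build a replace callback that also appends extra keywords per emoji."""
    def repl(emj, emj_data):
        names = [emj_data["en"]]
        names += emj_data.get("alias", [])
        # Wrap extra keywords in colons so they match the existing name style.
        names += [f":{kw}:" for kw in extra_keywords.get(emj, [])]
        # Drop duplicates while keeping first-seen order.
        return "_".join(dict.fromkeys(names))
    return repl

# Hypothetical keyword data in the assumed shape {emoji: [keyword, ...]}.
extra = {"🤗": ["hug", "hugs"]}
repl = make_repl(extra)

# Calling the callback directly with data shaped like emj_data above.
sample = {"en": ":smiling_face_with_open_hands:", "alias": [":hugging_face:", ":hugs:"]}
print(repl("🤗", sample))
# Outputs: :smiling_face_with_open_hands:_:hugging_face:_:hugs:_:hug:
```

The resulting function can be passed as `replace=make_repl(extra)` to `emoji.replace_emoji`, in the same way as `repl` above.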

FredHappyface · 2024-04-01T15:10:41Z

Oh awesome stuff! Thank you so much for that! I guess the remaining question is would you like me to open a pr to merge any of the other aliases? I see myself using this functionality in a couple of downstream repos and it seems a bit silly to write a wrapper library to include this if it would be useful here too?

Also as a general question, how come the data is directly in python? I'm assuming this has a performance benefit?

Thanks again for your help :)

cvzi · 2024-04-01T18:32:20Z

> I guess the remaining question is would you like me to open a pr to merge any of the other aliases? I see myself using this functionality in a couple of downstream repos and it seems a bit silly to write a wrapper library to include this if it would be useful here too?

Did you already do anything or is it just a plan at this point?

I am not so sure it is feasible. As I said, the aliases need to be unique, one alias can only belong to one emoji. For each alias that has multiple meanings in muan/emojilib you would have to decide to which emoji it should belong. Presumably you would have to do this manually for each emoji.
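
To illustrate the manual step, here is a rough sketch that assigns a keyword automatically when only one emoji uses it, applies a hand-written override table where one exists, and lists the rest as still undecided. The function name, the `manual_choice` table and the sample data are all hypothetical.

```python
def resolve_aliases(keywords_by_emoji, manual_choice):
    """Return ({keyword: emoji}, [keywords that still need a decision])."""
    # Invert the mapping: keyword -> list of emoji that use it.
    candidates = {}
    for emj, keywords in keywords_by_emoji.items():
        for kw in keywords:
            owners = candidates.setdefault(kw, [])
            if emj not in owners:
                owners.append(emj)

    resolved, undecided = {}, []
    for kw, owners in candidates.items():
        if len(owners) == 1:
            resolved[kw] = owners[0]          # unambiguous, assign directly
        elif manual_choice.get(kw) in owners:
            resolved[kw] = manual_choice[kw]  # a human picked one owner
        else:
            undecided.append(kw)              # still needs a manual decision
    return resolved, undecided

# Made-up sample data, not taken from muan/emojilib.
sample = {"😀": [":D", "grin"], "😃": [":D", "happy"], "😄": ["happy"]}
resolved, undecided = resolve_aliases(sample, manual_choice={":D": "😀"})
print(resolved)
print(undecided)
```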

> Also as a general question, how come the data is directly in python? I'm assuming this has a performance benefit?

It was already directly in Python when I started contributing to this project and the original developers are no longer contributing, so I don't know. There is a performance benefit compared to a JSON file, but it is not that big a difference (at least with newer Python versions). I am thinking about moving to JSON and also splitting the file into several smaller files. I recently did a comparison between the python-dict and JSON: #280 (comment)
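
For anyone who wants to try such a comparison themselves, a rough sketch is below. It is not the benchmark from the linked comment. It uses synthetic data, and it recompiles the Python source on every run, whereas a real module import would normally reuse cached bytecode, so the Python side is likely overstated here.

```python
import json
import timeit

# Synthetic entries roughly shaped like the database; not the real data.
data = {
    chr(0x1F600 + i): {"en": f":name_{i}:", "status": 2, "E": 1.0, "alias": [f":alias_{i}:"]}
    for i in range(500)
}

json_text = json.dumps(data)
python_source = "DATA = " + repr(data)

def load_json():
    return json.loads(json_text)

def load_python():
    # Compile and execute the dict literal, as importing a module without a cache would.
    namespace = {}
    exec(compile(python_source, "<data_dict>", "exec"), namespace)
    return namespace["DATA"]

# Both loaders must reproduce the same data before timing means anything.
assert load_json() == load_python() == data

for loader in (load_json, load_python):
    seconds = timeit.timeit(loader, number=20)
    print(f"{loader.__name__}: {seconds / 20 * 1000:.2f} ms per load")
```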

cvzi · 2024-04-01T18:58:26Z

FYI there is a proposed major change in keywords in muan/emojilib, see
muan/emojilib#194 and muan/emojilib#226

I am thinking it might be better to include muan/emojilib keywords as a separate entry as keywords and not in alias in the EMOJI_DATA. For example for 🤗:

    '\U0001F917': {
        'en': ':smiling_face_with_open_hands:',
        'status': fully_qualified,
        'E': 1,
        'keywords': ['hugging_face', 'face', 'smile', 'hug'],
        'alias': [':hugging_face:', ':hugs:'],
        'de': ':gesicht_mit_umarmenden_händen:',
        'es': ':cara_con_manos_abrazando:',
        ...
    },

And then add a function to retrieve them - as you suggested get_aliases or something like that - and a function to search by keyword.
This would widen the functionality of this library to searching for emoji, but it wouldn't interfere with the original functionality i.e. emojize()/demojize()
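
A minimal sketch of what the two functions might look like, assuming the `keywords` entry proposed above. The names `get_keywords` and `search_by_keyword` are hypothetical, the dict is a two-entry stand-in for `EMOJI_DATA`, and the keywords for the second emoji are made up.

```python
# Stand-in for EMOJI_DATA with the proposed 'keywords' entry.
EMOJI_DATA = {
    "\U0001F917": {
        "en": ":smiling_face_with_open_hands:",
        "keywords": ["hugging_face", "face", "smile", "hug"],
        "alias": [":hugging_face:", ":hugs:"],
    },
    "\U0001F60A": {
        "en": ":smiling_face_with_smiling_eyes:",
        "keywords": ["face", "smile", "blush"],  # made-up keywords
    },
}

def get_keywords(emj, data=EMOJI_DATA):
    """Return a copy of the keyword list for one emoji, or an empty list."""
    return list(data.get(emj, {}).get("keywords", []))

def search_by_keyword(keyword, data=EMOJI_DATA):
    """Return every emoji that has exactly this keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        emj for emj, entry in data.items()
        if needle in (kw.lower() for kw in entry.get("keywords", []))
    ]

print(get_keywords("\U0001F917"))
print(search_by_keyword("smile"))  # both emoji share this keyword
print(search_by_keyword("hug"))    # only the hugging emoji
```

Because keywords are kept apart from `alias`, one keyword can point at several emoji without breaking the uniqueness that emojize()/demojize() depend on.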

Btw we also use a script to update the emoji and aliases. The aliases specifically are added here:

emoji/utils/get_codes_from_unicode_emoji_data_files.py

Lines 586 to 601 in ceddc11

    
    # Add alias from GitHub API
    github_aliases = find_github_aliases(emj, github_alias_dict, v, emj_no_variant)
    aliases.update(shortcut for shortcut in github_aliases if shortcut not in all_existing_aliases_and_en)
    used_github_aliases.update(github_aliases)
    # Add alias from cheat sheet
    if emj in cheat_sheet_dict and cheat_sheet_dict[emj] not in all_existing_aliases_and_en:
        aliases.add(cheat_sheet_dict[emj][1:-1])
    if emj_no_variant in cheat_sheet_dict and cheat_sheet_dict[emj_no_variant] not in all_existing_aliases_and_en:
        aliases.add(cheat_sheet_dict[emj_no_variant][1:-1])
    # Add alias from youtube
    if emj in youtube_dict:
        aliases.update(shortcut[1:-1] for shortcut in youtube_dict[emj] if shortcut not in all_existing_aliases_and_en)
    if emj_no_variant in youtube_dict:
        aliases.update(shortcut[1:-1] for shortcut in youtube_dict[emj_no_variant] if shortcut not in all_existing_aliases_and_en)

Adding a new entry keywords to this script would be simple.

FredHappyface · 2024-08-25T15:38:07Z

Hi, thanks for getting back to me on this, and apologies for the silence over the past few months.

> Did you already do anything or is it just a plan at this point?

I've not had the opportunity to look at this, unfortunately. I'm happy to help however I can, though, as I should have some more free time now (I appreciate this may be too little, too late).

> It was already directly in Python when I started contributing to this project and the original developers are no longer contributing, so I don't know. There is a performance benefit compared to a JSON file, but it is not that big a difference

Makes sense tbh, and yeah, I wonder if moving to JSON will help with some of the maintenance work? The perf improvements in Python have helped a lot, though. I guess one option is a CI/CD step which glues together a load of JSON files and wraps them in some Python, for the best of both worlds?
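
As a rough illustration of that CI/CD idea, the sketch below merges a directory of JSON files into one generated Python module. The file names (`en.json`, `de.json`, `data_dict.py`) and the per-language layout are assumptions made for the example, not the project's actual structure.

```python
import json
import pprint
import tempfile
from pathlib import Path

def build_module(json_dir, output_path):
    """Merge every *.json file in json_dir into one generated Python module."""
    merged = {}
    for path in sorted(Path(json_dir).glob("*.json")):
        content = json.loads(path.read_text(encoding="utf-8"))
        for emj, entry in content.items():
            merged.setdefault(emj, {}).update(entry)
    # pprint.pformat emits a valid Python literal for dicts of strings.
    source = "# Generated file, do not edit by hand.\n"
    source += "EMOJI_DATA = " + pprint.pformat(merged) + "\n"
    Path(output_path).write_text(source, encoding="utf-8")
    return merged

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Two hypothetical per-language input files.
    (tmp / "en.json").write_text(
        json.dumps({"🤗": {"en": ":smiling_face_with_open_hands:"}}), encoding="utf-8")
    (tmp / "de.json").write_text(
        json.dumps({"🤗": {"de": ":gesicht_mit_umarmenden_händen:"}}), encoding="utf-8")

    merged = build_module(tmp, tmp / "data_dict.py")

    # Check that the generated module loads back to the same data.
    namespace = {}
    exec((tmp / "data_dict.py").read_text(encoding="utf-8"), namespace)
    print(namespace["EMOJI_DATA"] == merged)
```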

> I am thinking it might be better to include muan/emojilib keywords as a separate entry as keywords and not in alias in the EMOJI_DATA

Yeah I think that makes a lot of sense actually, that way it's less likely to break existing users too which is always a bonus!

Thanks again :)

cvzi · 2024-09-04T10:24:51Z

I forgot about this. I had started implementing a JSON solution in April: one JSON file for each language, and the possibility to extend it with custom data like this emojilib. As far as I remember, my implementation was almost ready. I'll try to find some time in the next few weeks for a pull request.

cvzi · 2024-09-06T20:50:42Z

I just realized that extending the database with custom data is not as simple as I thought.
The problem is that at the moment the database is global: if you add custom data, this will not just affect your own code but also any third-party library that depends on the emoji library.

I guess the solution is to use a class/object to keep separate databases, something like this (pseudo code):

emoji_config = emoji.new_emoji_instance()     # Create a new copy of the database
emoji_config.extend_database(custom_aliases)  # Modify the new database
emoji_config.emojize(':a_custom_alias:')      # Use emojize/demojize with the new database
