emoji-search/about.md tweaks
tim0120 committed Nov 22, 2024
1 parent e636cf8 commit 6c24a3e
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions src/content/projects/emoji-search/about.md
@@ -26,8 +26,8 @@ The first release of Emoji Search introduces the main functionality of the work:
The process for finding a set of emojis matching a text query is as follows:
1. Download an emoji dataset consisting of emoji characters and corresponding descriptions, e.g., ("😀", "grinning face"). I got my data from [Open Emoji API](https://emoji-api.com/).
2. Find an embedding model. For my purposes, I chose [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1), because it was available on the [HuggingFace Inference API](https://huggingface.co/docs/api-inference/en/index) as a [warm model](https://huggingface.co/docs/api-inference/supported-models).
-3. Embed each description from the dataset. This can be stored as a matrix of size `num_emojis x embed_size`.
-4. Given a text query, embed the query and take the dot product of this embedding with the embedding matrix. Take the emojis corresponding to the top k (`k=30` for me) highest dot products and return these as the response. Voilà!
+3. Embed each description from the dataset. This can be stored as a matrix of size `num_emojis x embed_size` (`num_emojis=1859, embed_size=1024`).
+4. Given a text query, embed the query and take the dot product of this embedding with the embedding matrix. Take the emojis corresponding to the top k (`k=30`) highest dot products and return these as the response. Voilà!
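The top-k lookup in steps 3–4 can be sketched roughly as follows. This is a minimal sketch with hypothetical names, assuming NumPy: `embeddings` stands in for the precomputed `num_emojis x embed_size` matrix and `query_vec` for the model's embedding of the query text.

```python
import numpy as np

# Hypothetical sketch of steps 3-4: `embeddings` is the precomputed
# num_emojis x embed_size matrix (1859 x 1024 here), `emojis` the parallel
# list of emoji characters, and `query_vec` the query's embedding.
def top_k_emojis(query_vec, embeddings, emojis, k=30):
    # One dot product per emoji description.
    scores = embeddings @ query_vec
    # Indices of the k highest scores, best match first.
    top = np.argsort(scores)[::-1][:k]
    return [emojis[i] for i in top]
```

Since the model's embeddings are normalized, the dot product is just cosine similarity, so this ranks emojis by semantic closeness to the query.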

#### Limitations
After a few uses of the search, you can probably get the gist of the tool's utility and limitations. Searching generally pulls emojis in the right direction, but unrelated emojis often sneak in too. The reason is that the search method is quite basic: a single embedding per emoji, taken as-is from the embedding model, is a usable but pretty underdetermined way to search for emojis via similarity. I plan to improve the underlying algorithm soon. Stay tuned. 😁
@@ -37,5 +37,5 @@ One other limitation that I personally want to improve on is the lack of keybind
Finally, I'm using Vercel and HuggingFace free tiers, so hopefully this keeps working (been generally great so far). 🤞

#### Miscellany
-- It was fun to add some small nice UI additions to the site, like the InteractiveEmoji component that does hover-scaling and clipboard copying upon click. Going to add a confirmation of the copy soon too. (Thanks to #feedback.)
+- It was fun to add some small nice UI additions to the site, like the InteractiveEmoji component that does hover-scaling and clipboard-copying on click. Going to add a confirmation of the copy soon too. (Thanks to #feedback.)
- Of the many artifacts of this work, there is the curious case of the egg emoji 🥚. My users and I discovered that 🥚 is ubiquitous among search results, coming up in many, many unrelated searches, e.g., "fingers crossed," "excel," "elusive," "Tokyo," and "anything but an egg." I would suspect an unnormalized 🥚 embedding vector, but the model's embeddings are normalized, so the mystery remains unsolved for now...
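The unnormalized-vector hypothesis is easy to rule out programmatically. A minimal sketch of the sanity check, assuming the same hypothetical `embeddings` matrix as above: with L2-normalized rows, every norm is ~1, so no single emoji (🥚 included) can dominate dot-product scores purely through magnitude.

```python
import numpy as np

# Hypothetical diagnostic: flag rows of the embedding matrix whose L2 norm
# deviates noticeably from 1. An empty result means magnitude cannot explain
# why any one emoji keeps outranking the others.
def outlier_norms(embeddings, tol=0.01):
    norms = np.linalg.norm(embeddings, axis=1)
    return np.where(np.abs(norms - 1.0) > tol)[0]
```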
