Skip to content

Track‐query matching

Jared Dantis edited this page Sep 18, 2023 · 3 revisions

When you search for a song on a music service, you expect to see the song you're looking for as the first result. But what if you misspell the song title? Or what if there are many songs with the same title from different artists?

It's not realistic to expect a user to specify the exact song and artist every time they search, and music services understandably spend a lot of time and effort on making sure that users can find the song they're looking for. Blanco should be no different.

Old algorithm

Before Release 0.4.5, Blanco used the following logic to check whether a human search query (say, mona lisa) closely matched a song title and artist (say, Mona Lisa Dominic Fike):

from difflib import get_close_matches


def check_similarity(actual: str, candidate: str) -> float:
    """
    Checks the similarity between two strings. Meant for comparing
    song titles and artists with search results.

    :param actual: The actual string.
    :param candidate: The candidate string, i.e. from a search result.
    :return: A float between 0 and 1, where 1 is a perfect match.
    """
    actual_words = set(actual.lower().split(' '))
    candidate_words = set(candidate.lower().split(' '))
    intersection = actual_words.intersection(candidate_words)
    difference = actual_words.difference(candidate_words)

    # Include close matches
    for word in difference:
        close_matches = get_close_matches(word, candidate_words, cutoff=0.8)
        if len(close_matches) > 0:
            intersection.add(close_matches[0])

    return len(intersection) / len(actual_words)

The actual variable contains the raw search query from the user, and the candidate variable is expected to be a string of the format <song title> <artist>.

The gist of the algorithm is this:

  1. Turn the strings lowercase.
  2. Split the actual string (the search query from the user) and candidate string (a search result from Spotify) into sets of words.
  3. Find the intersection of the two sets, i.e., the words that are in both sets.
  4. Find the difference of the two sets, i.e., the words that are in the actual set but not the candidate set.
  5. If there are any words in the difference set that are close matches to words in the candidate set, add them to the intersection set.

This worked pretty well for the most part, but it had two glaring problems:

  1. It didn't take into account the words that were in the candidate set but not the actual set.
  2. It didn't take into account the ranking from Spotify, which more than likely accounts for popularity and trends.

For example, if the user searched for nightstand and the top search results were NIGHTSTAND by Kxllswxtch and Nightstand by Justus Bennetts, the algorithm would return a similarity score of 1.0 for both, even though the two songs are completely different. In fact, it will return a similarity score of 1.0 for any song with the word nightstand in the title, regardless of the artist.

New algorithm

Blanco releases 0.4.5 and newer build upon check_similarity() above, using smarter logic to match human queries against search results. The new algorithm is as follows:

from thefuzz import fuzz


def check_similarity_weighted(actual: str, candidate: str, candidate_rank: int) -> int:
    """
    Checks the similarity between two strings using a weighted average
    of a given similarity score and the results of multiple fuzzy string
    matching algorithms. Meant for refining search results that are
    already ranked.

    :param actual: The actual string.
    :param candidate: The candidate string, i.e. from a search result.
    :param candidate_rank: The rank of the candidate, from 0 to 100.
    :return: An integer from 0 to 100, where 100 is the closest match.
    """
    naive = check_similarity(actual, candidate) * 100
    tsr = fuzz.token_set_ratio(actual, candidate)
    tsor = fuzz.token_sort_ratio(actual, candidate)
    ptsr = fuzz.partial_token_sort_ratio(actual, candidate)

    return int(
        (naive * 0.7) +
        (tsr * 0.12) +
        (candidate_rank * 0.08) +
        (tsor * 0.06) +
        (ptsr * 0.04)
    )

Like the old algorithm, the actual variable contains the raw search query from the user, and the candidate variable is expected to be a string of the format <song title> <artist>.

The algorithm still takes into account how much of the user's query is present in the search result, but it now also takes into account the ranking from Spotify and the results of a few fuzzy string matching algorithms (namely, Token Set Ratio, Token Sort Ratio, and Partial Token Sort Ratio). All of these factors are combined into a weighted average.

Release 0.4.7

This release updates the threshold used in check_similarity() from 0.8 to 0.9 and changes the candidate_rank values passed to check_similarity_weighted().

If Spotify gives us 10 results, Blanco generates the weights [100, 90, ..., 10] for the results in order, which will then be used as candidate_rank values. This linear scale, however, removes much of the information that Spotify's ranking gives implicitly. If multiple tracks are returned for a query, and each track varies significantly (the title could be completely different for example), this variation is not communicated by weights that differ by a constant value.

Release 0.4.7 changes the weights to an exponential scale using a factor of 0.8. That is, the first weight remains at 100, then 100 * 0.8 = 80, then 100 * 0.8 * 0.8 = 64, and so on. This gives each result significantly more weight than the result after it.

Testing

In the test results below, the original index column refers to the index of the song in Spotify's search API results, and the candidate (title + artist) column refers to the candidate search result from Spotify. The weighted column refers to the similarity score returned by check_similarity_weighted(), and the difflib column refers to the similarity score returned by check_similarity().

Spotify rank is just 100 - (original index * 10) (so that the first result has a rank of 100). The partial token sort ratio, token set ratio, and token sort ratio columns refer to the results of the fuzzy string matching algorithms from thefuzz.

The first table in each section is meant to simulate a vague search query, and the second table is meant to simulate a more specific search query. The rows are then sorted by the weighted values. In production, Blanco picks the result with the top weighted value.

As you will see, the ranking is still not perfect, but it's a lot better than if we were to rank using difflib alone.

Test results: Cherry Wine

Search results for "cherry wine"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 Cherry Wine grentperez 97 100 100 78 100 67
1 Cherry Wine - Live Hozier 95 100 80 78 100 65
2 Cherry Wine - Live Hozier 94 100 64 78 100 65
3 Cherry Wine Nas Amy Winehouse 92 100 51 73 100 55
5 cherry wine Zachary Knowles 91 100 32 80 100 58
4 Cherry Wine - Live from Spotify SXSW ... 90 100 40 64 100 39
6 Cherry Wine Jasmine Thompson 90 100 26 78 100 56
9 Cherry Wine - Live Hozier 90 100 13 78 100 65
7 Cherry Waves Deftones 51 50 20 80 71 56
8 Like Real People Do Hozier 9 0 16 55 32 32

Search results for "cherry wine hozier"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 Cherry Wine - Live Hozier 98 100 100 89 100 88
1 Cherry Wine - Live Hozier 97 100 80 89 100 88
4 Cherry Wine - Live Hozier 94 100 40 89 100 88
2 Cherry Wine - Live from Spotify SXSW ... 93 100 64 72 100 56
7 Cherry Wine - Live Hozier 92 100 20 89 100 88
5 Cherry Wine - Live in Greystones, Cou... 89 100 32 67 100 44
3 Cherry Wine grentperez 66 66 51 64 76 70
9 Francesca Hozier 37 33 13 62 59 53
6 Like Real People Do Hozier 36 33 26 62 50 45
8 Autumn Tree Milo Greene 7 0 16 34 24 24

Test results: Nightstand

Search results for "nightstand"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 NIGHTSTAND Kxllswxtch 97 100 100 100 100 65
1 Nightstand Justus Bennetts 95 100 80 100 100 56
2 Nightstand Dev Lemons 95 100 64 100 100 65
4 Nightstand Lil Candy Paint 92 100 40 100 100 56
5 Nightstand dethcaps 92 100 32 100 100 69
6 Nightstand YNG Martyr YNG ONE 91 100 26 100 100 51
7 Nightstand Dev Lemons 91 100 20 100 100 65
8 Nightstand K. Michelle 91 100 16 100 100 65
3 One Night Standards Ashley McBryde 14 0 51 60 45 45
9 Night Stalkers Megadeth ICE-T 11 0 13 80 41 41

Search results for "nightstand kxllswxtch"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 NIGHTSTAND Kxllswxtch 100 100 100 100 100 100
2 Nightstand Justus Bennetts 54 50 64 73 65 55
6 Nightstand Dev Lemons 51 50 26 73 65 62
1 Disaster KSLV Noh 12 0 80 31 26 26
3 Final Stage KSLV Noh 12 0 51 42 39 39
4 Nightstalker RBX Krayzie Bone 12 0 40 48 40 40
5 Psycho KSLV Noh 10 0 32 40 33 33
9 Blindspot LXST CXNTURY 9 0 13 38 37 37
8 Walk In Lxzt 8 0 16 35 30 30
7 Disaster - Slowed + Reverb KSLV Noh 7 0 20 29 27 27

Test results: Mona Lisa

Search results for "mona lisa"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
1 Mona Lisa Dominic Fike 95 100 80 100 100 58
0 Mona Lisa (Spider-Man: Across the Spi... 94 100 100 78 100 27
3 mona lisa mxmtoon 94 100 51 100 100 69
2 Mona Lisa (feat. Kendrick Lamar) Lil ... 92 100 64 100 100 29
4 Mona Lisa Brentrambo LUCKI 91 100 40 80 100 51
6 Mona Lisa, Mona Lisa FINNEAS 91 100 26 100 100 50
5 The Ballad of Mona Lisa Panic! At The... 90 100 32 100 100 35
8 Mona Lisa Nat King Cole 90 100 16 100 100 56
9 Mona Lisa ONLY1 THEORY 90 100 13 100 100 58
7 Mona Lisas And Mad Hatters Elton John 49 50 20 78 62 39

Search results for "mona lisa dominic fike"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
1 Mona Lisa Dominic Fike 98 100 80 100 100 100
0 Mona Lisa (Spider-Man: Across the Spi... 97 100 100 91 100 56
2 Bodies Dominic Fike 55 50 64 77 77 59
4 Mama's Boy Dominic Fike 54 50 40 78 71 71
3 Mona Lisa (feat. Kendrick Lamar) Lil ... 51 50 51 59 58 45
8 Phone Numbers Dominic Fike Kenny Beats 51 50 16 68 71 57
6 Think Fast (feat. Weezer) Dominic Fik... 50 50 26 63 71 47
9 mona lisa mxmtoon 49 50 13 69 69 46
7 Mona Lisa val Peter Fenn Tray Haggerty 48 50 20 59 58 43
5 Monalisa Lojay Sarz Chris Brown 12 0 32 55 45 45

Test results: Forever - SOPHIE Remix

Search results for "forever sophie"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 Forever - SOPHIE Remix FLETCHER SOPHIE 96 100 100 73 100 56
3 Forever in your arms Sophie Nichols 92 100 51 64 100 57
5 Forever Yours Sophie Deas 92 100 32 100 100 72
8 I Want A Mom (That Will Last Forever)... 87 100 16 64 100 31
1 Darkness Forever - Sophie's Version S... 59 50 80 71 100 47
2 Faceshopping SOPHIE 53 50 64 67 60 55
4 Sophie Bear's Den 51 50 40 67 60 58
6 Darkness Forever Soccer Mommy 50 50 26 64 67 51
7 Infatuation SOPHIE 50 50 20 70 60 56
9 Not Okay - Alone Mix SOPHIE 48 50 13 67 60 46

Search results for "forever sophie remix"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 Forever - SOPHIE Remix FLETCHER SOPHIE 98 100 100 100 100 71
1 Forever - SOPHIE Remix FLETCHER SOPHIE 96 100 80 100 100 71
3 Forever - SOPHIE Remix FLETCHER SOPHIE 94 100 51 100 100 71
5 Forever - SOPHIE Remix FLETCHER SOPHIE 92 100 32 100 100 71
7 Forever - SOPHIE Remix FLETCHER SOPHIE 91 100 20 100 100 71
9 Forever - SOPHIE Remix FLETCHER SOPHIE 91 100 13 100 100 71
4 Forever (Solomun Remix) Weval Solomun 66 66 40 89 79 62
2 Forever Wavey 43 33 64 78 70 55
6 Forever Trit95 40 33 26 78 67 59
8 forever - Slowed & Reverb L0WS 38 33 16 59 61 61

Test results: She by Charles Aznavour

Search results for "she charles"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
1 She Charles Aznavour Bryan Ferry 94 100 80 82 100 51
2 She - Tous les visages de l’amour Cha... 91 100 64 82 100 25
4 She (Tous Les Visages De L’Amour) Cha... 90 100 40 82 100 25
0 She Cares Styx 59 50 100 90 72 72
7 Jesse Charles Wesley Godwin 52 50 20 84 78 53
3 She Cares Patrick Dorgan 51 50 51 71 51 51
6 She Dennis van Aarssen 48 50 26 59 48 48
9 She Elvis Costello 48 50 13 63 55 55
5 She Changes the Weather Swim Deep 47 50 32 67 43 41
8 She Changes Your Mind Copeland 46 50 16 63 44 44

Search results for "she charles aznavour"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 She (Tous Les Visages De L’Amour) Cha... 96 100 100 90 100 41
1 She Charles Aznavour Bryan Ferry 95 100 80 70 100 77
2 She - Tous les visages de l’amour Cha... 93 100 64 90 100 41
5 She Charles Aznavour Herbert Kretzmer... 92 100 32 92 100 62
8 She Engelbert Humperdinck Charles Azn... 91 100 16 95 100 65
6 She - She / English Version 1 - Theme... 89 100 26 90 100 37
4 Emmenez-moi Charles Aznavour 68 66 40 95 89 75
7 Venecia sin ti - Que c'est triste Ven... 65 66 20 90 89 51
9 Mes emmerdes - Remastered 2014 Charle... 65 66 13 90 89 58
3 She Dennis van Aarssen 38 33 51 53 48 48

Test results: Run Away With Me

Search results for "run away with me"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 Run Away With Me Carly Rae Jepsen 97 100 100 81 100 65
1 Run Away With Me Carly Rae Jepsen 95 100 80 81 100 65
2 Run Away With Me Cold War Kids 94 100 64 75 100 70
3 Run away with me Mingginyu 93 100 51 74 100 76
4 Run Away With Me Ben Fankhauser 92 100 40 86 100 68
6 Run Away with Me Michael Arden 91 100 26 79 100 70
8 Run Away With Me Paradise Blossom 90 100 16 77 100 65
9 Run Away with Me - Live Aaron Tveit 89 100 13 69 100 65
5 Run Away with You Big & Rich 72 75 32 62 90 67
7 Run Away With You Logan Crosby 71 75 20 69 90 61

Search results for "run away with me carly rae jepsen"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 Run Away With Me Carly Rae Jepsen 100 100 100 100 100 100
1 Run Away With Me Carly Rae Jepsen 98 100 80 100 100 100
2 Run Away With Me - ASTR Remix Carly R... 95 100 64 82 100 86
4 Run Away With Me - Y2K Remix Carly Ra... 94 100 40 88 100 87
9 Run Away With Me Carly Rae Jepsen 93 100 13 100 100 100
6 Run Away With Me - Patrick Stump Remi... 91 100 26 81 100 66
8 Run Away With Me - Cyril Hahn Remix C... 91 100 16 74 100 80
3 Kollage Carly Rae Jepsen 50 42 51 75 80 63
5 Kamikaze Carly Rae Jepsen 49 42 32 78 78 66
7 The Loneliest Time (feat. Rufus Wainw... 44 42 20 61 65 49

Test results: The Dress

Search results for "the dress"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 The Dress Dijon 98 100 100 100 100 75
1 The Dress Looks Nice on You Sufjan St... 93 100 80 80 100 35
3 The Dress 猫 シ Corp. luxury elite 91 100 51 78 100 45
5 The Dress The Dudes 91 100 32 80 100 64
7 The Dress Alan Menken 90 100 20 78 100 60
9 The Dress Blonde Redhead 89 100 13 78 100 55
8 The Dress That My Mother Wore - Demo ... 87 100 16 67 100 22
2 Girl Anachronism The Dresden Dolls 51 50 64 67 50 37
6 Dress Taylor Swift 51 50 26 80 71 52
4 Skin Dijon 8 0 40 27 21 21

Search results for "the dress dijon"

original index candidate (title + artist) weighted difflib spotify rank partial token sort ratio token set ratio token sort ratio
0 The Dress Dijon 100 100 100 100 100 100
6 The Dress Alan Menken 63 66 26 64 75 61
9 The Dress Blonde Redhead 63 66 13 67 75 62
8 The Dress The Dudes 62 66 16 60 75 53
4 jesse Dijon 43 33 40 82 77 77
1 Dijon Antoine Stavelot 41 33 80 62 54 49
2 Dijon Angel 'Pocho' Gatti Orchestra 39 33 64 47 50 46
3 Dress Down Kaoru Akimoto 39 33 51 67 51 51
5 Magic Loop DJDS Dijon 37 33 32 69 50 50
7 Spanish Ladies The Dreadnoughts 35 33 20 55 48 48