Further improve Knowledge Base search #8136

Closed
3 tasks
kevwalsh opened this issue Feb 24, 2022 · 15 comments
Assignees
Labels
CMS Team CMS Product team that manages both editor exp and devops Drupal engineering CMS team practice area Epic Issue type Knowledge Base [CMS feature] Knowledge Base

Comments

@kevwalsh
Contributor

kevwalsh commented Feb 24, 2022

Background

A previous iteration of the CMS team looked into improving the knowledge base search. The current search is limited, and editors have complained that it doesn't return the results they need. As we redesign the knowledge base, we want to review the history of the search and the improvements previous CMS teams proposed, and identify both quick fixes we can make now and a longer-term approach to fixing the knowledge base search.

User Story or Problem Statement

The CMS team needs to understand how the original knowledge base search was created and identify opportunities for both short-term and long-term solutions.

Previous Team's Proposed Solutions

In #7012, we found some concrete ways to improve search in the short term, including more search-friendly content and better indexing of titles.

We also identified a number of ideas that address some of the user stories:

  1. Add another sort (title ASC) to the results view. This will improve readability for serialized articles and make results easier to find.
  2. Add tagging by category (curated taxonomy).
  3. Add tagging by keyword (open taxonomy).
  4. Expose filters for category and product in the results view.
  5. Add clickable keyword tags to search results and the full page view.
  6. Potentially add the Rate module so users can rate what was helpful, and use that rating as a weighting criterion.
  7. Switch to Solr-based search, which can further improve search accuracy, string matching, and the weighting options available to us.

Relevant Links

Affected users and stakeholders

  • All CMS users

Acceptance Criteria

  • An understanding of the history of the knowledge base search.
  • A list of quick wins that can be made to improve the search.
  • A recommendation for more long term solutions for the knowledge base search.
@kevwalsh kevwalsh added Drupal engineering CMS team practice area Epic Issue type Knowledge Base [CMS feature] Knowledge Base Needs refining Issue status labels Feb 24, 2022
@github-actions github-actions bot added CMS experience Was a scrum team Content ops Was a scrum team labels Feb 24, 2022
@kevwalsh kevwalsh added the CMS Team CMS Product team that manages both editor exp and devops label Apr 20, 2022
@EWashb EWashb removed CMS experience Was a scrum team Content ops Was a scrum team labels Jul 25, 2023
@EWashb
Contributor

EWashb commented Oct 24, 2023

@joagnitti @BerniXiongA6 this is the epic around KB search. There's also an issue about integrating a new search option. We should bring this back out of the icebox for the Q4 goal of KB improvements.

@gracekretschmer-metrostar

@edmund-dunn

@anantais
Contributor

It sounds like Marisa plans on adding to the taxonomy, which would help improve the current search, which seems to be based on a view. View-based searches can get a bit wonky once you add too many fields, though. Depending on how many search elements we end up with, it may be better to switch to something more robust, like a Solr-based search.

@EWashb
Contributor

EWashb commented Mar 22, 2024

@anantais I once heard (maybe right or wrong) that Solr had a cost. Is that true? You're not the first person (or even second) to mention Solr as an option so I'm very curious.

@anantais
Contributor

@EWashb Solr is open source, so it should be free. I have not implemented it on a site yet, but I have heard good things from other devs. I have experience with something similar, Elasticsearch, but I would never recommend it.

@gracekretschmer-metrostar

Once Jake is onboard and has full access, he will be taking over this work.

@gracekretschmer-metrostar

@JakeBapple I would like you to focus on this work for sprint 8.

@gracekretschmer-metrostar

Schedule pre-refinement/next steps meeting next week. Jake will do discovery work and research in the meantime.

@JakeBapple
Contributor

Setting up Apache Solr may require more than just dev work (additional servers, permissions, etc.), so I would love to pull in @edmund-dunn to see if he has any opinions on or experience with implementing this. We also need to confirm what we need from this search first, and then decide on the best solution, including whether we can get away with an out-of-the-box search built on just a view.

@gracekretschmer-metrostar

Rescope to be a ticket:

For Jake, this will be a historical discovery (digging into previous tickets to understand the history of the search), followed by a look for quick technical enhancements to the KB search.

@JakeBapple
Contributor

JakeBapple commented Jul 18, 2024

Some discovery notes:
We are using the Drupal database as the search backend, and Knowledge Base has its own index that covers ONLY knowledge base articles.

The index warns of some performance impacts of automatic indexing of content for larger sites, and I'm not sure whether this has already been looked into:
image

Search processors we are running:
  • Content access: Adds content access checks for nodes and comments.
  • Entity status: Exclude inactive users and unpublished entities (which have a "Published" state) from being indexed.
  • Highlight: Adds a highlighted excerpt to results and highlights returned fields.
  • HTML filter: Strips HTML tags from fulltext fields and decodes HTML entities. Use this processor when indexing HTML data – for example, node bodies for certain text formats. The processor also allows to boost (or ignore) the contents of specific elements.
  • Ignore case: Makes searches case-insensitive on selected fields.
  • Stemmer: Stems search terms (for example, talking to talk). Currently, this only acts on English language content. It uses the Porter 2 stemmer algorithm. For best results, use after tokenizing.

Processors we are not running:
  • Ignore characters: Configure types of characters which should be ignored for searches.
  • Index hierarchy: Allows the indexing of values along with all their ancestors for hierarchical fields (like taxonomy term references).
  • Number field-based boosting: Adds a boost to indexed items based on the value of a numeric field.
  • Reverse entity references: Allows indexing of entities that link to the indexed entity.
  • Role-based access: Adds an access check based on a user's roles. This may be sufficient for sites where access is primarily granted or denied based on roles and permissions. For grants-based access checks on "Content" or "Comment" entities the "Content access" processor may be a suitable alternative.
  • Stopwords: Allows you to define stopwords which will be ignored in searches. Caution: Only use after both 'Ignore case' and 'Tokenizer' have run.
  • Tokenizer: Splits text into individual words for searching.
  • Transliteration: Makes searches insensitive to accents and other non-ASCII characters.
  • Type-specific boosting: Adds a boost to indexed items based on their datasource and/or bundle.
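
To make the "Index hierarchy" idea concrete, here is a minimal Python sketch of indexing a value together with all of its ancestors. It is only an illustration, not the Search API implementation: the `PARENTS` mapping, the term names, and both function names are hypothetical.

```python
# Hypothetical taxonomy: child term -> parent term (None for top-level terms).
PARENTS = {
    "Europe": None,
    "Denmark": "Europe",
    "Copenhagen": "Denmark",
}


def with_ancestors(term, parents):
    """Return the term followed by all of its ancestors, nearest first."""
    chain = []
    seen = set()  # guards against accidental cycles in the mapping
    current = term
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parents.get(current)
    return chain


def index_terms(tagged_terms, parents):
    """Expand every tagged term to itself plus its ancestors, without duplicates."""
    indexed = []
    for term in tagged_terms:
        for candidate in with_ancestors(term, parents):
            if candidate not in indexed:
                indexed.append(candidate)
    return indexed


# An article tagged only "Copenhagen" also becomes findable under its parents.
print(index_terms(["Copenhagen"], PARENTS))  # ['Copenhagen', 'Denmark', 'Europe']
```

With the ancestors stored alongside the original term, filtering on a high-level term matches content tagged with any of its descendants.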

The processors we aren't running that may be worth looking into in my opinion are:

  • Index hierarchy - more description:
    This processor is mainly used in conjunction with hierarchical taxonomy vocabularies. If you have such a vocabulary, you usually want searches for a high-level category to also return results tagged with lower-level terms – for instance, filtering for "Europe" should also return content from "Denmark". This processor will facilitate this behavior by indexing, for every encountered taxonomy term, all its parent/ancestor terms, too. This also works for fields of other types that reference entities of the same type. If you have such a setup and want hierarchy functionality for that, too, you can also use this processor.

  • Tokenizer
    Splits indexed text into individual words. As dedicated search backends, like Apache Solr or OpenSearch, typically do a very good job in this regard, it is mainly meant for use with the Database backend, for which it offers more control over the tokenization process.
    The processor works in the following way when indexing a piece of text (say, a node’s body):

  1. If enabled, rudimentary CJK handling is applied.
  2. Numbers only separated by punctuation (like dates, telephone numbers, etc.) are merged to a single string of digits, such that it is possible to find them even when formatted in a slightly different way.
  3. The configured “ignored characters” are handled, if any: occurrences of two or more consecutive “ignored characters” are replaced by spaces, then all remaining ones are removed from the text.
  4. The text is then split into tokens, taking the configured “whitespace characters” as the separators.
  5. Finally, all tokens that are shorter than the configured “minimum word length” are removed.

At search time, the keywords entered by the user are processed in a similar manner to ensure they match as expected.

  • Stopwords
    Keeps certain (configured) words from being indexed, usually very common words that don't add much meaning. This can be used to make matching and scoring more accurate, and also improve performance (by keeping the fulltext index smaller). For best results, this should be used alongside (and after) "Tokenizer".

  • Ignore characters (maybe)
    Allows you to remove certain characters from indexed field values and search keywords.
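
The Tokenizer, Stopwords, and Ignore characters descriptions above can be sketched as one small pipeline. This is a rough Python approximation for discussion, not the Search API code: the function name, the default values, and the regular expressions are assumptions, CJK handling is skipped, and lowercasing stands in for the "Ignore case" processor we already run.

```python
import re


def tokenize(text, ignored_chars="'", min_word_length=3, stopwords=frozenset()):
    """Approximate the described processing steps (all defaults are assumptions)."""
    # Stand-in for the "Ignore case" processor.
    text = text.lower()
    # Step 1 (rudimentary CJK handling) is omitted in this sketch.
    # Step 2: merge numbers separated only by punctuation into one digit string.
    text = re.sub(r"(?<=\d)[.,:/\-](?=\d)", "", text)
    # Step 3: runs of two or more ignored characters become a space,
    # then any remaining single ignored characters are removed.
    if ignored_chars:
        char_class = "[" + re.escape(ignored_chars) + "]"
        text = re.sub(char_class + "{2,}", " ", text)
        text = re.sub(char_class, "", text)
    # Step 4: split into tokens; here every non-word character is a separator.
    tokens = [t for t in re.split(r"[^\w]+", text) if t]
    # Step 5: drop tokens shorter than the minimum word length.
    tokens = [t for t in tokens if len(t) >= min_word_length]
    # Stopwords run after tokenizing: drop the configured common words.
    return [t for t in tokens if t not in stopwords]


print(tokenize("Don't log in to the CMS on 2024-07-18", stopwords={"the"}))
# ['dont', 'log', 'cms', '20240718']
print(tokenize("session 3"))
# ['session']
```

Note that with a minimum word length of 3 (an assumed value here), short tokens such as "in" or "3" never reach the index, which is worth keeping in mind when testing queries like "log in" or "session 3".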

To fully test these options, in concert or individually, I'll need some context on which searches users are frustrated with, so I can check whether the changes work as expected.

@JakeBapple
Contributor

JakeBapple commented Jul 25, 2024

Listing current behavior for the use cases listed in this issue.

  1. Searching for "log in":
    image
    This seems to be what a user would expect.

  2. Searching for "system health service":
    image
    This is not what would be expected.

  3. Searching for "session 3":
    image
    This seems to be what a user would expect.

  4. Searching for "pdf":
    image
    This seems acceptable given how little context the search term provides.

  5. Searching for "broken links":
    image
    This also seems acceptable given what is being searched.

Only #2 (searching for "system health service") appears to be a problem among the use cases from this ticket.

@JakeBapple
Contributor

Implementing indexing for these settings:
image

Here are the changes in results:

  1. Searching for "log in":
    image
    Minor changes in lower results, but encouraging that it's looking more for "log in" than just "log" and "in"

  2. Searching for "system health service":
    image
    Minor changes in 2nd and lower results ranking where "health system service" is given more weight.

  3. Searching for "session 3":
    image
    Worse results for "session 3". This is most likely because numbers are not being weighted at all, while the word "session" appears more often on pages 5 and 1, so they rank higher.

  4. Searching for "pdf":
    No change

  5. Searching for "broken links":
    No change
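
To illustrate the suspected cause of the "session 3" regression, here is a toy Python ranking by term frequency. It is not how Search API scores results: the page texts, the `number_weight` parameter, and the scoring formula are all made up for illustration. It only shows how pages that repeat "session" more often can outrank the Session 3 page when the number contributes nothing to the score.

```python
# Hypothetical page contents, keyed by title.
PAGES = {
    "Session 1": "session 1 notes session recap session agenda",
    "Session 3": "session 3 notes",
    "Session 5": "session 5 notes session recap session agenda session links",
}


def score(query, page_text, number_weight):
    """Sum of term counts; purely numeric terms are scaled by number_weight."""
    words = page_text.lower().split()
    total = 0.0
    for term in query.lower().split():
        weight = number_weight if term.isdigit() else 1.0
        total += weight * words.count(term)
    return total


def rank(query, pages, number_weight=0.0):
    """Return page titles ordered from highest to lowest score."""
    return sorted(
        pages,
        key=lambda title: score(query, pages[title], number_weight),
        reverse=True,
    )


# Numbers ignored: the pages that repeat "session" most often win.
print(rank("session 3", PAGES))
# ['Session 5', 'Session 1', 'Session 3']

# Numbers weighted: the page that actually contains "3" moves to the top.
print(rank("session 3", PAGES, number_weight=5.0))
# ['Session 3', 'Session 5', 'Session 1']
```

If this is what is happening, the fix would involve making numeric terms count toward matching or scoring rather than changing how "session" is indexed.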

@gracekretschmer-metrostar

Great work, @JakeBapple! I am going to grab time for us to regroup on this work next week to determine next steps.

@gracekretschmer-metrostar

After the pre-refinement, this is how we will move forward:

  1. Jake will work on improving the indexing and tokenizing of knowledge base search.
  2. Jake will bring in the design layout from Marisa's redesign for knowledge base landing page.
  3. Marisa will partner with Jake on building out a taxonomy for the knowledge base.
  4. Jake will use the taxonomy to improve the search.
  5. CMS team will then explore using Elasticsearch or Solr search.

Projects
None yet
Development

No branches or pull requests

8 participants