Further improve Knowledge Base search #8136

kevwalsh · 2022-02-24T06:19:03Z

Background

A previous iteration of the CMS team looked into improvements to the knowledge base search. The current search is limited, and editors have complained that it doesn't return the results they need. As we redesign the knowledge base, we want to review the history of the search and how previous CMS teams proposed improving it, then look for quick fixes we can make now and a longer-term approach to fixing the knowledge base search.

User Story or Problem Statement

The CMS team needs to understand how the original knowledge base search was created and identify opportunities for both short-term and long-term solutions.

Previous Team's Proposed Solutions

In #7012, we found some concrete ways to improve search in the short term, including more search-friendly content, and better indexing of titles.

We also identified a number of ideas that address some of the user stories:

Add another sort (title ASC) to the results view. This will improve readability for serialized articles and make it easier to find results.
Add tagging by category (curated taxonomy).
Add tagging by keyword (open taxonomy).
Expose filters for category and product in the results view.
Add clickable keyword tags to search results and the full page view.
Potentially add the rate module so users can rate what was helpful, and use that rating as a weighting criterion.
Switch to Solr-based search, which could further improve search accuracy, string matching, and the weighting options available to us.

Relevant Links

Improve search in Knowledge Base KB #7012 (comment)

Affected users and stakeholders

All CMS users

Acceptance Criteria

An understanding of the history of the knowledge base search.
A list of quick wins that can be made to improve the search.
A recommendation for longer-term solutions for the knowledge base search.

EWashb · 2023-10-24T14:36:32Z

@joagnitti @BerniXiongA6 this is the epic around KB search. There's also an issue about integrating a new search option. We should bring this back out of the icebox for Q4 goal of KB improvements.

gracekretschmer-metrostar · 2024-02-28T15:41:03Z

@edmund-dunn

anantais · 2024-03-22T15:43:30Z

It sounds like Marisa plans on adding to the taxonomy, which would help improve the current search, which seems to be based on a view. View-based searches can get a bit wonky once you add too many fields, though. Depending on how many search elements we end up with, it may be better to switch to something more robust, like a Solr-based search.

EWashb · 2024-03-22T16:21:56Z

@anantais I once heard (maybe right or wrong) that Solr had a cost. Is that true? You're not the first person (or even second) to mention Solr as an option so I'm very curious.

anantais · 2024-03-22T19:22:10Z

@EWashb Solr is open source, so it should be free. I have not implemented it on a site yet, but I have heard good things from other devs. I have experience with something similar, Elasticsearch, but I would never recommend it.

gracekretschmer-metrostar · 2024-03-27T16:33:03Z

Once Jake is onboard and has full access, he will be taking over this work.

gracekretschmer-metrostar · 2024-04-11T18:06:52Z

@JakeBapple I would like you to focus on this work for sprint 8.

gracekretschmer-metrostar · 2024-04-11T20:03:54Z

Schedule pre-refinement/next steps meeting next week. Jake will do discovery work and research in the meantime.

JakeBapple · 2024-07-10T13:22:12Z

Setting up Apache Solr may require more than just dev work (additional servers, permissions, etc.), so I would love to pull in @edmund-dunn to see if he has any opinions/experience with implementing this. We need to confirm what we need from this search first and then decide on the best solution, including whether we can get away with an out-of-the-box search built on just a view.

gracekretschmer-metrostar · 2024-07-10T16:40:30Z

Rescope to be a ticket:

For Jake, this will be a historical discovery (dig into previous tickets to understand the history of the search) and then look for quick technical enhancements for the KB search.

JakeBapple · 2024-07-18T15:00:59Z

Some discovery notes:
We are using the Drupal database as the source, and Knowledge Base has its own index that covers ONLY knowledge base articles.

The index warns of some performance impacts of automatic indexing of content for larger sites, and I'm not sure if this is something already looked into or not:

Search processors we are running:
Content access
Adds content access checks for nodes and comments.
Entity status
Exclude inactive users and unpublished entities (which have a "Published" state) from being indexed.
Highlight
Adds a highlighted excerpt to results and highlights returned fields.
HTML filter
Strips HTML tags from fulltext fields and decodes HTML entities. Use this processor when indexing HTML data – for example, node bodies for certain text formats. The processor also allows to boost (or ignore) the contents of specific elements.
Ignore case
Makes searches case-insensitive on selected fields.
Stemmer
Stems search terms (for example, talking to talk). Currently, this only acts on English language content. It uses the Porter 2 stemmer algorithm (More information). For best results, use after tokenizing.

Processors we are not running:
Ignore characters
Configure types of characters which should be ignored for searches.
Index hierarchy
Allows the indexing of values along with all their ancestors for hierarchical fields (like taxonomy term references)
Number field-based boosting
Adds a boost to indexed items based on the value of a numeric field.
Reverse entity references
Allows indexing of entities that link to the indexed entity.
Role-based access
Adds an access check based on a user's roles. This may be sufficient for sites where access is primarily granted or denied based on roles and permissions. For grants-based access checks on "Content" or "Comment" entities the "Content access" processor may be a suitable alternative.
Stopwords
Allows you to define stopwords which will be ignored in searches. Caution: Only use after both 'Ignore case' and 'Tokenizer' have run.
Tokenizer
Splits text into individual words for searching.
Transliteration
Makes searches insensitive to accents and other non-ASCII characters.
Type-specific boosting
Adds a boost to indexed items based on their datasource and/or bundle.
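To make the "Index hierarchy" entry in the list above more concrete, here is a minimal Python sketch of the idea: each tagged term is indexed together with all of its ancestors, so filtering on a parent term also matches content tagged with a child term. This is not Search API code. The vocabulary (`PARENTS`) and the helper names are hypothetical and exist only for illustration.

```python
# Hypothetical taxonomy: child term -> parent term (None marks a root term).
PARENTS = {
    "Resetting a password": "Logging in",
    "Logging in": "CMS basics",
    "CMS basics": None,
}


def with_ancestors(term, parents):
    """Return the term followed by all of its ancestors, nearest first."""
    chain = []
    seen = set()
    while term is not None and term not in seen:
        chain.append(term)
        seen.add(term)  # guards against accidental cycles in the map
        term = parents.get(term)
    return chain


def index_terms(tagged_terms, parents):
    """Expand every tagged term into itself plus its ancestors, without duplicates."""
    indexed = []
    for term in tagged_terms:
        for value in with_ancestors(term, parents):
            if value not in indexed:
                indexed.append(value)
    return indexed


def matches_filter(filter_term, tagged_terms, parents):
    """True if filtering on filter_term would match content with these tags."""
    return filter_term in index_terms(tagged_terms, parents)
```

With this map, an article tagged only with "Resetting a password" would also be found when filtering on "CMS basics", because the ancestor terms are stored alongside the tagged term at index time.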

The processors we aren't running that may be worth looking into in my opinion are:

Index hierarchy - more description:
This processor is mainly used in conjunction with hierarchical taxonomy vocabularies. If you have such a vocabulary, you usually want searches for a high-level category to also return results tagged with lower-level terms – for instance, filtering for "Europe" should also return content from "Denmark". This processor will facilitate this behavior by indexing, for every encountered taxonomy term, all its parent/ancestor terms, too. This also works for fields of other types that reference entities of the same type. If you have such a setup and want hierarchy functionality for that, too, you can also use this processor.
Tokenizer
Splits indexed text into individual words. As dedicated search backends, like Apache Solr or OpenSearch, typically do a very good job in this regard, it is mainly meant for use with the Database backend, for which it offers more control over the tokenization process.
The processor works in the following way when indexing a piece of text (say, a node’s body):

If enabled, rudimentary CJK handling is applied.
Numbers only separated by punctuation (like dates, telephone numbers, etc.) are merged to a single string of digits, such that it is possible to find them even when formatted in a slightly different way.
The configured “ignored characters” are handled, if any: occurrences of two or more consecutive “ignored characters” are replaced by spaces, then all remaining ones are removed from the text.
The text is then split into tokens, taking the configured “whitespace characters” as the separators.
Finally, all tokens that are shorter than the configured “minimum word length” are removed.

At search time, the keywords entered by the user are processed in a similar manner to ensure they match as expected.

Stopwords
Keeps certain (configured) words from being indexed, usually very common words that don't add much meaning. This can be used to make matching and scoring more accurate, and also improve performance (by keeping the fulltext index smaller). For best results, this should be used alongside (and after) "Tokenizer".
Ignore characters (maybe)
Allows you to remove certain characters from indexed field values and search keywords.
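The Tokenizer steps and the Stopwords behavior described above can be sketched roughly as follows. This is a simplified Python illustration, not the Search API implementation: CJK handling is skipped, the regular expressions are approximations, and the settings (`IGNORED_CHARS`, `SEPARATORS`, `MIN_WORD_LENGTH`, `STOPWORDS`) are hypothetical values chosen for the example, not our index configuration.

```python
import re

IGNORED_CHARS = "._-'"            # hypothetical "ignored characters" setting
SEPARATORS = r"[\s,;:!?()]+"      # hypothetical "whitespace characters" setting
MIN_WORD_LENGTH = 3               # hypothetical "minimum word length" setting
STOPWORDS = {"the", "and", "of"}  # hypothetical stopword list


def tokenize(text, min_length=MIN_WORD_LENGTH):
    # Step 1 (rudimentary CJK handling) is omitted in this sketch.
    # Step 2: merge numbers separated only by punctuation (555-1234 -> 5551234).
    text = re.sub(r"(?<=\d)[^\w\s](?=\d)", "", text)
    # Step 3: runs of two or more ignored characters become a space,
    # then any remaining single ignored characters are removed.
    ignored = "[" + re.escape(IGNORED_CHARS) + "]"
    text = re.sub(ignored + "{2,}", " ", text)
    text = re.sub(ignored, "", text)
    # Step 4: split on the configured separator characters.
    tokens = [t for t in re.split(SEPARATORS, text) if t]
    # Step 5: drop tokens shorter than the minimum word length.
    return [t for t in tokens if len(t) >= min_length]


def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Runs after lowercasing and tokenizing, as the processor description advises.
    return [t for t in tokens if t not in stopwords]


def analyze(text):
    # Lowercase first (the "Ignore case" step), then tokenize, then drop stopwords.
    return remove_stopwords(tokenize(text.lower()))


print(analyze("Log in to the CMS by 2024-07-18, or call 555-1234."))
# ['log', 'cms', '20240718', 'call', '5551234']
```

Note that with a minimum word length of 3 in this sketch, short tokens such as "in", "to", and "by" are dropped before the stopword list is even consulted, so the two settings interact and are worth testing together.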

To fully test these options in concert or individually, I'll need some search context for what users are frustrated with to see if I can get this to work as expected.

JakeBapple · 2024-07-25T17:41:44Z

Listing current behavior for the use cases listed in this issue.

1. Searching for "log in": This seems to be what a user would expect.
2. Searching for "system health service": This is not what would be expected.
3. Searching for "session 3": This seems to be what a user would expect.
4. Searching for "pdf": This seems acceptable given how little context the search term provides.
5. Searching for "broken links": This also seems acceptable given what is being searched.

Only #2 (searching for "system health service") appears to be an issue from this ticket.

JakeBapple · 2024-07-25T18:36:42Z

Implementing indexing for these settings:

Here are the changes in results:

1. Searching for "log in": Minor changes in lower results, but it is encouraging that the search now looks for "log in" rather than just "log" and "in".
2. Searching for "system health service": Minor changes in the ranking of the 2nd and lower results, where "health system service" is given more weight.
3. Searching for "session 3": Worse results. This is most likely because numbers are not being weighted at all, while the word "session" appears more often on pages 5 and 1, so they rank higher.
4. Searching for "pdf": No change.
5. Searching for "broken links": No change.
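One plausible mechanism for the "session 3" regression, which is an assumption and not something confirmed in this thread, is that a minimum word length setting drops the single-character token "3", leaving only the frequency of "session" to decide the order. The toy Python ranking below illustrates that effect. The page texts, titles, and scoring rule are all hypothetical, and the real database backend scores results differently.

```python
# Hypothetical page texts, not real Knowledge Base content.
PAGES = {
    "Session 1": "session 1 intro session overview session goals",
    "Session 3": "session 3 editing content",
    "Session 5": "session 5 recap session review session wrap session notes",
}


def query_tokens(query, min_length):
    """Lowercase, split on whitespace, and drop tokens shorter than min_length."""
    return [t for t in query.lower().split() if len(t) >= min_length]


def score(page_text, tokens):
    """Toy score: (distinct query tokens matched, total occurrences)."""
    words = page_text.lower().split()
    distinct = sum(1 for t in tokens if t in words)
    total = sum(words.count(t) for t in tokens)
    return (distinct, total)


def rank(query, min_length):
    """Return page titles ordered from best to worst toy score."""
    tokens = query_tokens(query, min_length)
    return sorted(PAGES, key=lambda title: score(PAGES[title], tokens), reverse=True)


print(rank("session 3", min_length=3))  # the "3" token is dropped
# ['Session 5', 'Session 1', 'Session 3']
print(rank("session 3", min_length=1))  # the "3" token is kept
# ['Session 3', 'Session 5', 'Session 1']
```

If this is what is happening, lowering the minimum word length (or otherwise preserving short numeric tokens) would be worth testing against the "session 3" case before and after reindexing.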

gracekretschmer-metrostar · 2024-07-26T13:21:28Z

Great work, @JakeBapple! I am going to grab time for us to regroup on this work next week to determine next steps.

gracekretschmer-metrostar · 2024-07-31T15:45:12Z

After the pre-refinement, this is how we will move forward:

Jake will work on improving the indexing and tokenizing of knowledge base search.
Jake will bring in the design layout from Marisa's redesign for knowledge base landing page.
Marisa will partner with Jake on building out a taxonomy for the knowledge base.
Jake will use the taxonomy to improve the search.
CMS team will then explore using Elasticsearch or Solr search.

kevwalsh added Drupal engineering CMS team practice area Epic Issue type Knowledge Base [CMS feature] Knowledge Base Needs refining Issue status labels Feb 24, 2022

github-actions bot added CMS experience Was a scrum team Content ops Was a scrum team labels Feb 24, 2022

kevwalsh mentioned this issue Feb 24, 2022

Improve search in Knowledge Base KB #7012

Closed

13 tasks

kevwalsh added the CMS Team CMS Product team that manages both editor exp and devops label Apr 20, 2022

EWashb removed CMS experience Was a scrum team Content ops Was a scrum team labels Jul 25, 2023

gracekretschmer-metrostar mentioned this issue Feb 15, 2024

Platform CMS Sprint Plan 4 (2/19/2024 - 3/27/2024) #17244

Open

25 tasks

gracekretschmer-metrostar assigned gracekretschmer-metrostar, MDomngz and srancour and unassigned gracekretschmer-metrostar Feb 16, 2024

gracekretschmer-metrostar mentioned this issue Feb 26, 2024

Platform CMS Sprint Plan 05 (3/4/2024 - 3/13/2024) #17331

Open

39 tasks

gracekretschmer-metrostar assigned edmund-dunn and anantais and unassigned srancour and MDomngz Feb 28, 2024

MDomngz mentioned this issue Feb 29, 2024

Design Prototype for Knowledge Base Landing Page #17383

Closed

9 tasks

gracekretschmer-metrostar mentioned this issue Mar 12, 2024

Platform CMS Sprint Plan 06 (3/14/2024 - 3/27/2024) #17502

Open

35 tasks

gracekretschmer-metrostar mentioned this issue Mar 20, 2024

Platform CMS Sprint Plan 07 (3/28/2024 - 4/10/2024) #17564

Closed

30 tasks

gracekretschmer-metrostar mentioned this issue Apr 8, 2024

Platform CMS Sprint Plan 08 (4/11/2024 - 4/24/2024) #17771

Closed

31 tasks

gracekretschmer-metrostar assigned JakeBapple and unassigned edmund-dunn Apr 11, 2024

This was referenced Apr 22, 2024

Create CMS-Specific Questions for Collab Cycle Intake Form #17813

Closed

Knowledge-base (KB) Enhancements #14069

Open

Platform CMS Sprint Plan 09 (4/25/2024 - 5/8/2024) #17926

Closed

gracekretschmer-metrostar unassigned anantais Jul 3, 2024

gracekretschmer-metrostar mentioned this issue Jul 3, 2024

Platform CMS Sprint Plan 14 (7/4/2024 - 7/17/2024) #18430

Open

24 tasks

gracekretschmer-metrostar mentioned this issue Jul 10, 2024

Platform CMS Sprint Plan 15 (7/18/2024 - 7/31/2024) #18485

Closed

30 tasks

gracekretschmer-metrostar mentioned this issue Jul 30, 2024

Platform CMS Sprint Plan 16 (8/1/2024 - 8/14/2024) #18776

Closed

30 tasks

gracekretschmer-metrostar removed their assignment Jul 31, 2024

gracekretschmer-metrostar closed this as completed Jul 31, 2024

gracekretschmer-metrostar mentioned this issue Jul 31, 2024

Enable Indexing and Tokenizing Processor for Knowledge Base Search #18805

Closed

2 tasks

gracekretschmer-metrostar removed the Needs refining Issue status label Aug 1, 2024

JakeBapple mentioned this issue Aug 22, 2024

VACMS-18805: Configurations for updating KB search. #19024

Merged

20 tasks

MDomngz mentioned this issue Aug 28, 2024

Design Prototype for KB Search Result Page #19069

Closed

9 tasks
