Enable fuzzy search as default search mode #4
Created ticket to solicit input from Esri:
@torrin47 we could include a dropdown, for example, that provides different search mechanisms (default/strict?, fuzzy, wildcard) and morphs the search term(s) without the user having to insert special characters. Having said that, I experimented against EDG with those options and, frankly, I wasn't impressed by fuzzy search at all. It didn't appear to work as advertised. Have you tried fuzzy search lately? Wildcard searches seemed to catch additional records, but it wasn't always possible for me to rationalize why certain wildcard searches returned the records they did. All in all, it seems to help but sometimes in strange ways.
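To make the dropdown idea concrete, here is a minimal sketch of how a selected mode could rewrite the user's input before it is sent to the search endpoint. The names `SearchMode` and `morph_query` are hypothetical and are not part of Geoportal Server; the only syntax assumed is the `*term*` and `term~` forms quoted in the email chain on this issue.

```python
from enum import Enum


class SearchMode(Enum):
    """Hypothetical dropdown options; not an existing Geoportal Server type."""
    STRICT = "strict"
    WILDCARD = "wildcard"
    FUZZY = "fuzzy"


def morph_query(text: str, mode: SearchMode) -> str:
    """Rewrite whitespace-separated terms so users never type * or ~ themselves."""
    if mode is SearchMode.STRICT:
        # Strict mode passes the text through, only trimming outer whitespace.
        return text.strip()

    # Drop operators the user may already have typed so they are not doubled.
    terms = [t.strip("*~") for t in text.split()]
    terms = [t for t in terms if t]

    if mode is SearchMode.WILDCARD:
        # Same shape as the *greene.ana* example from the email chain.
        terms = [f"*{t}*" for t in terms]
    else:
        # Same shape as the greene.ana~ example from the email chain.
        terms = [f"{t}~" for t in terms]
    return " ".join(terms)


print(morph_query("greene.ana", SearchMode.WILDCARD))
print(morph_query("water quality", SearchMode.FUZZY))
```

This handles plain terms only. Quoted phrases, field prefixes, and boolean operators would need real query parsing rather than a split on whitespace.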
I'm not sure what's up, but the only way I was able to get fuzzy search to work was from the main page. I like the suggestion of a dropdown. I agree it's really tough to understand search results without some sort of highlighting indicating the matched search term (something we discussed in the context of using expanded search with related terms). Highlighting would be awesome, but probably too heavy a lift to justify at this point in the GeoPortal Server lifecycle. My experience has been that strict-match searches tend to discourage users more (when no results are shown) than expanded searches do. Users would prefer to see many results and filter from there than to see too few results and wonder why.
Very odd, but that is indeed correct: only the main page executes fuzzy search correctly. I had only used the search page when investigating this.
From @torrin47 on February 27, 2017 19:18
Below is the email chain for context. Asking Esri for their thoughts before we get started on this. Will work to formulate more specific requirements.
From: Greene, Ana
Sent: Wednesday, February 22, 2017 8:59 AM
To: Hultgren, Torrin [email protected]
Cc: Pierson, Suzanne [email protected]; Harness, Catherine [email protected]; Suma Malothu [email protected]
Subject: RE: Full text search thoughts
Hi guys,
Did I ever respond to this? Just catching up…only 2 weeks behind on email…
I totally agree that the wildcard and fuzzy searches should be the default. And like the advanced search dialog. I’d like to go ahead and put all of this on our list of near term development projects.
Thanks,
Ana Greene, M.S., PMP
Environmental Dataset Gateway (EDG) Program Manager
Office of Environmental Information (OEI)
Office of Information Management (OIM)
U.S. Environmental Protection Agency
(o): 202-566-2132
(c): 571-232-7860
[email protected]
https://edg.epa.gov/
From: Hultgren, Torrin
Sent: Tuesday, February 07, 2017 7:26 PM
To: Greene, Ana [email protected]
Cc: Pierson, Suzanne [email protected]; Harness, Catherine [email protected]; Suma Malothu [email protected]
Subject: Full text search thoughts
Hi Ana,
I believe I’ve figured out the source of our continuing confusion about full text search. It was legitimately disabled years ago, but has been working for some time, yet perhaps not in the way we might expect, so I think there’s still some room for improvement, or at least adjustment. I think a lot of our confusion revolves around partial search terms and whether or not they’re considered a match. I think we can all remember a time when we used to have to be very careful about our search terms, and we couldn’t assume that search engines would appropriately match partial words or misspellings, yet these days we take it for granted. Lucene is quite capable of handling any match type we want it to, but the default is the old strict way. If we do a search for the first part of your email address, by default it will come up blank, even though there are records containing your email address:
https://edg.epa.gov/metadata/rest/find/document?f=searchpage&searchText=greene.ana
EDG supports “advanced Lucene syntax” (for anyone who chose to read the help), so a user could apply a wildcard to their search, which just means that indexed terms that aren’t exact matches but contain the string are returned:
https://edg.epa.gov/metadata/rest/find/document?f=searchpage&searchText=*greene.ana*
Which gives us all 6 records that contain your email address. In theory this slows performance, but we’d need orders of magnitude more records in our index before we’d notice any difference. There’s a last option that’s kind of fun – though it doesn’t seem to work with the direct link, so you’ll have to try it manually. If you do a search for greene.ana~ it will conduct a “fuzzy search”, where it will include “misspellings” or words that are very similar – it should return a bunch of records with “Greenspace” in the title.
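For reference, the three query forms described above can be turned into direct links with a small helper. This is only a sketch: `build_search_url` is a hypothetical name, and the endpoint and the `f` and `searchText` parameters are taken from the links in this email. It only builds the URL; whether the server honors the fuzzy form from a direct link is a separate question.

```python
from urllib.parse import urlencode

# Endpoint copied from the links in this email.
EDG_FIND_URL = "https://edg.epa.gov/metadata/rest/find/document"


def build_search_url(search_text: str) -> str:
    """Return a search-page link with the query text percent-encoded."""
    query = urlencode({"f": "searchpage", "searchText": search_text})
    return f"{EDG_FIND_URL}?{query}"


strict_url = build_search_url("greene.ana")      # exact term match
wildcard_url = build_search_url("*greene.ana*")  # terms containing the string
fuzzy_url = build_search_url("greene.ana~")      # similar terms

for url in (strict_url, wildcard_url, fuzzy_url):
    print(url)
```

Encoding the query text keeps characters such as `*` or spaces from being misread as part of the URL structure.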
I’m not sure about you, but I think my own expectation these days is that wildcards and fuzzy searches would be the default – I’d prefer a search to return too many results that I could filter through or refine than too few. But that may also be because of an assumption that the search engine would do a good job of ranking/sorting those results so the most relevant ones would appear first, and I don’t know how valid an assumption that is with the EDG. I think we could figure out how to adjust the scoring/ranking algorithm under the hood of the EDG, but I’m not at all sure how we’d measure whether our tweaks were making search results more or less relevant. And if we were to make fuzzy searches the default, I wonder how we’d allow someone to opt out if they wanted a more strict match? Perhaps we could show an “advanced search” dialog if they wished:
http://www.lucenetutorial.com/lucene-query-builder.html
https://www.google.com/advanced_search
Anyway, curious to know your thoughts. Definitely been on the brain today.
Torrin Hultgren
EPA National Geospatial Support Team
Innovate!, Inc. | [email protected] | 703-922-9090 x737
Copied from original issue: Innovate-Inc/EDG_metadata#69