
HTML5 concatenating words after removal of html tags before indexing #53

Open
shanholmes opened this issue Sep 20, 2022 · 1 comment

Comments

@shanholmes
Contributor

When a page's HTML is returned from getMainContent() and then parsed by the HTML5 class, adjacent block-level HTML elements have their text content joined with no separator before it is indexed. As a result, the search data is inaccurate, and terms that should be indexed are not found when searched for.

For example, the markup below:
<h2>Discover our awesome herd of elephants</h2><p>Please come check them out now!</p>
is indexed in Algolia as:
Discover our awesome herd of elephantsPlease come check them out now!

As a result, searching for the term elephants returns no results, because the text of the two elements has been joined together.
Searching for the term elephantsPlease does return the result.
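
The effect can be reproduced outside the module with a few lines of plain PHP. This is only a rough sketch: strip_tags() and a whole-word regex are stand-ins (assumptions, not what the module or Algolia actually run) for the HTML5 text extraction and for term matching.

```php
<?php
// Markup from the example above, split over two lines only for readability.
$markup = '<h2>Discover our awesome herd of elephants</h2>'
        . '<p>Please come check them out now!</p>';

// Assumption: strip_tags() stands in for the module's text extraction.
// It removes the tags and leaves the surrounding text exactly where it was.
$text = strip_tags($markup);

// Assumption: a whole-word regex stands in for term matching.
// preg_match() returns 1 when the term is found as a whole word, 0 otherwise.
$findsElephants = preg_match('/\belephants\b/', $text);
$findsJoinedTerm = preg_match('/\belephantsPlease\b/', $text);

echo $text, PHP_EOL;
var_dump($findsElephants, $findsJoinedTerm);
```

Because no whitespace sits between </h2> and <p>, the heading's last word and the paragraph's first word end up as one token.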

Silverstripe stores HTML created via an HTMLEditorField as HTMLText in the compact form shown above, with new lines stripped from the markup (aside from list elements).

The ideal resolution would be to add a space separator between the text of these block elements before it is sent to the Algolia index.
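
One way the separator idea could look, as a minimal sketch rather than a patch for the module: the function name extractSearchableText() and the list of block tags are hypothetical, and strip_tags() again stands in for the real text extraction.

```php
<?php
/**
 * Hypothetical helper: turn an HTML fragment into plain text while keeping
 * a space between the text of adjacent block-level elements.
 */
function extractSearchableText(string $html): string
{
    // Assumption: an illustrative, non-exhaustive list of block-level tags.
    $blockTags = 'p|h[1-6]|div|li|ul|ol|blockquote|table|tr|td|th|section|article';

    // Append a space after every closing block tag.
    $spaced = preg_replace('#</(?:' . $blockTags . ')\s*>#i', '$0 ', $html);

    // Treat <br>, <br/> and <br /> as whitespace too.
    $spaced = preg_replace('#<br\s*/?>#i', ' ', $spaced);

    // Drop the tags, decode entities, then collapse runs of whitespace.
    $text = html_entity_decode(strip_tags($spaced), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return trim(preg_replace('/\s+/u', ' ', $text));
}

$markup = '<h2>Discover our awesome herd of elephants</h2>'
        . '<p>Please come check them out now!</p>';

echo extractSearchableText($markup), PHP_EOL;
```

Adding the space after closing tags, rather than only between two specific adjacent tags, also covers a block element followed by plain text or by an element that is not in the list.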

@mikey-harveycameron

I had this problem too. I used an injector for the page crawler to amend the rendered string and insert extra whitespace. I'm not sure this solution is robust enough yet, but here it is:

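// Assumes HTML5 is already imported (presumably Masterminds\HTML5)
// and $page is the page record being crawled.
$html5 = new HTML5();
$render = $page->forTemplate();

// Put a space between adjacent block elements and turn <br> into a space,
// so their text is not joined together once the tags are removed.
$postRender = preg_replace(
    ['/\/(p|h\d|div|li)><(p|h\d|div|li)/i', '/<br\s*\/?>/i'],
    ['/$1> <$2', ' '],
    $render
);
$dom = $html5->loadHTML($postRender);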
