Info-like continuous search [feature request] #4

kaushalmodi · 2017-09-06T15:20:34Z

In Info-mode, you can C-s and search for anything through the whole document. What I mean is that you do not have to be in the Info node containing the string you are searching for. If Info does not find your search string in the current node, it will try searching in the next node, and so on.

Can something like that be done for searching in epubs? I can be in the TOC buffer of an ebook and start searching with C-s. nov.el should then try to find that term successively through the whole ebook.

Thanks!

wasamasa · 2017-09-06T15:29:33Z

The easiest way of implementing full-text search would be by using grep on the EPUB sources living in nov-temp-dir, presenting the search results and offering a way to jump to the document (and navigate point to the first match?). It's not terribly continuous though. What keeps me from just searching the rendered view is that currently every document is rendered on demand instead of all of them upfront. This could of course be changed (like by implementing a cache that can be filled whenever convenient and searched/filled by such a command).

kaushalmodi · 2017-09-06T15:37:23Z

> The easiest way of implementing full-text search would be by using grep on the EPUB sources living in nov-temp-dir, presenting the search results and offering a way to jump to the document (and navigate point to the first match?)

That would work too; I might try implementing this using counsel-rg.

kaushalmodi · 2017-09-06T16:02:08Z

OK, I got it to jump to the HTML source using this:

(defun counsel-rg-nov (&optional initial-input)
  "Search for a pattern in current ebook using rg.
INITIAL-INPUT can be given as the initial minibuffer input.

The ebook is assumed to be opened by the `nov' package and so
`nov-temp-dir' variable should be set automatically."
  (interactive)
  (if (file-exists-p nov-temp-dir)
      (counsel-rg initial-input nov-temp-dir " -g '*.*html'" (format "Search %s" (buffer-name)))
    ;; `user-error' takes a format string and arguments itself.
    (user-error "%S does not exist" nov-temp-dir)))

Now I need help making the jump land at the corresponding location in the rendered page instead of in the HTML.

jwhitbeck · 2020-02-03T20:00:11Z

Hi @wasamasa, I would like to take a stab at implementing the grep solution you described above. Before I go ahead and work on a PR, do you have any preferences or suggestions?

wasamasa · 2020-02-13T21:51:41Z

Navigating point to each match is going to be a challenge. grep will give you the right document to navigate to, but actually jumping to the match is going to be tricky because the HTML you're searching is different from the rendered text (which is basically the HTML without the markup parts, some missing tags and lots of linebreaks). There is no guarantee that the search hit can be navigated to either. You could make things more reliable by having something like a DOM with original source code locations attached to the buffer text, but this is going to be a pain to implement in shr.

I suspect the better solution is to avoid the mismatch between going from rendered document -> source document -> rendered document by performing the search inside the current rendered document, then jumping to the next document if necessary and repeating until hitting a match or the end of the document (after that a wraparound could be performed, much like in info's outright magic incremental search). There is one prerequisite before attempting this: nov.el keeps only the current document in the buffer and re-renders it whenever needed. This may take enough time to make incremental search unusable. So some sort of caching solution would need to be implemented first, along with a strategy for when exactly to load something into the cache:

- Load up all documents when opening a file, starting with the currently rendered one
- Load up the currently rendered document only, loading up subsequent ones when needed

Since this is caching, some strategy to invalidate the cache needs to be implemented as well. For example the g keybinding could be changed to invalidate the current document and killing the buffer could invalidate all documents.
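
A minimal sketch of the cache-plus-invalidation idea, assuming rendered documents are stored as strings keyed by document index. The names `my-nov-cache`, `my-nov-cache-get` and `my-nov-cache-invalidate` are hypothetical and not part of nov.el; RENDER-FN stands in for whatever actually renders a document.

```elisp
(defvar my-nov-cache (make-hash-table :test #'eql)
  "Hypothetical cache mapping a document index to its rendered text.")

(defun my-nov-cache-get (index render-fn)
  "Return the cached rendering for INDEX, filling the cache on a miss.
RENDER-FN is called with INDEX and must return the rendered text."
  (or (gethash index my-nov-cache)
      (puthash index (funcall render-fn index) my-nov-cache)))

(defun my-nov-cache-invalidate (&optional index)
  "Drop the cache entry for INDEX, or every entry when INDEX is nil."
  (if index
      (remhash index my-nov-cache)
    (clrhash my-nov-cache)))
```

With this shape, a refresh command would call `(my-nov-cache-invalidate INDEX)` for the current document, and a kill-buffer hook would call it with no argument. Because the cache is filled on a miss, this corresponds to the second loading strategy in the list above.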

jakub-w · 2020-02-20T12:08:08Z

You could use dom-texts on libxml-parse-html-region's output instead of grepping through whole html files.

shr-render-region is pretty slow and caching every rendered epub page could take ages (and a lot of memory).

Although I don't know how accurate the dom-texts' output would be in comparison to a fully rendered page, if it can be leveraged instead of doing something more heavy-weight it would be cool.

PS. This method wouldn't be too friendly to an occur-like feature, should someone want to implement one in the future.

wasamasa · 2020-02-20T15:48:39Z

Thanks for your ideas on that one. dom-texts is one approach, but much like the grep solution it doesn't solve the core issue: how do you jump back from a hit to the corresponding location in the buffer?

I'm aware of shr-render-region being slow. No idea about memory use; without proper tools (no, profiler.el doesn't do it properly) it's hard to objectively judge whether that's an issue in practice. Caching in general is something I'm wary of, so don't expect it to happen on the master branch. What exactly do you mean by occur?

wasamasa · 2020-02-20T22:45:11Z

Regarding mapping DOM nodes to source locations, it seems I'm not the first to think of it: https://lists.gnu.org/archive/html/emacs-devel/2020-02/msg00096.html

jakub-w · 2020-02-21T16:22:21Z

> (...) how do you jump back from a hit to the corresponding location in the buffer?

Searching through dom-texts would be there only to find the next page containing the match. The next step would be to render the page and call search-forward (or search-backward). Not the most efficient solution, but the simplest one. I don't think it's possible to know where exactly in the rendered buffer the match would be without actually rendering it first.
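
A rough sketch of that first step, assuming an Emacs built with libxml2 support and a list of the book's HTML files in reading order. `my-html-file-text` and `my-next-file-matching` are made-up names, not nov.el functions. Note that dom-texts joins text nodes with a space by default and also includes text from elements such as the title, so a phrase spanning inline markup may not match exactly as it would in the rendered page.

```elisp
(require 'cl-lib)
(require 'dom)

(defun my-html-file-text (file)
  "Return the text of HTML FILE with all markup stripped.
Needs an Emacs built with libxml2 support."
  (with-temp-buffer
    (insert-file-contents file)
    (dom-texts (libxml-parse-html-region (point-min) (point-max)))))

(defun my-next-file-matching (files string &optional start)
  "Return the index of the first file in FILES containing STRING.
Searching begins at index START (default 0).  Return nil if no
file matches.  Matching is literal and follows `case-fold-search'."
  (cl-position-if
   (lambda (file)
     (string-match-p (regexp-quote string) (my-html-file-text file)))
   files
   :start (or start 0)))
```

The returned index only says which page to render next; the actual jump would still be a plain search-forward in the rendered buffer.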

> What exactly do you mean by occur?

It would be cool to have occur-like functionality to find all matches in the whole epub file. The code from this one could be reused for that functionality. I said dom-texts method wouldn't be too occur-friendly because it would be faster to run if all the pages were cached right from the start.

If the caching was asynchronous, it wouldn't appear to be slow. Maybe it's the way to go.
My concerns about memory usage came from the fact that epubs can be really big if they contain a lot of images, but now that I've looked into it, it doesn't seem to be an issue. The image in the buffer is just a string with a display property.

wasamasa · 2020-02-21T18:48:48Z

> If the caching was asynchronous, it wouldn't appear to be slow. Maybe it's the way to go.

I have my doubts. There are three ways of achieving async behavior:

- Starting a regular process
- Starting a network process
- Timers

Processes are useful if your communication is limited to strings; however, we're dealing with fontified buffers here. One-shot timers are a poor man's way of faking threads, and they work only if whatever you do doesn't take too long.
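
To illustrate the timer option, here is a minimal sketch that works through a list one element per one-shot timer. `my-fill-step` is a hypothetical name, FN stands in for rendering a single document, and the 0.1 second delay is an arbitrary choice. Each call to FN still blocks Emacs while it runs, which is the limitation described above.

```elisp
(defun my-fill-step (pending fn)
  "Call FN on the first element of PENDING, then schedule the rest.
The remaining elements are handled by a one-shot timer that calls
this function again.  Return that timer, or nil when nothing is
left to schedule."
  (when pending
    (funcall fn (car pending))
    (when (cdr pending)
      ;; Extra arguments to `run-with-timer' are passed to the function,
      ;; so no closure is needed here.
      (run-with-timer 0.1 nil #'my-fill-step (cdr pending) fn))))
```

Calling `(my-fill-step '(0 1 2) #'some-render-function)` would process document 0 immediately and the others in later timer runs; `some-render-function` is a placeholder.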

wasamasa mentioned this issue Jul 22, 2018

How to search through the entire epub? #37

Closed

wasamasa mentioned this issue Jan 14, 2019

Add nov-search-forward and nov-search-backwards #43

Closed

Charlie-Gordon mentioned this issue Jun 16, 2021

Epub global search, is it possible? weirdNox/org-noter#138

Closed
