This repository has been archived by the owner on May 6, 2020. It is now read-only.

Info-like continuous search [feature request] #4

Open
kaushalmodi opened this issue Sep 6, 2017 · 10 comments

@kaushalmodi

In Info-mode, you can C-s and search for anything through the whole document. What I mean is that you do not have to be in the Info node containing the string you are searching for. If Info does not find your search string in the current node, it will try searching in the next node, and so on.

Can something like that be done for searching in epubs? I could be in the TOC buffer of an ebook and start searching with C-s. nov.el would then try to find that term successively through the whole ebook.

Thanks!

@wasamasa
Owner

wasamasa commented Sep 6, 2017

The easiest way of implementing full-text search would be by using grep on the EPUB sources living in nov-temp-dir, presenting the search results and offering a way to jump to the document (and navigate point to the first match?). It's not terribly continuous though. What keeps me from just searching the rendered view is that currently every document is rendered on demand instead of all of them upfront. This could of course be changed (for example by implementing a cache that can be filled whenever convenient and searched/filled by such a command).

@kaushalmodi
Author

> The easiest way of implementing full-text search would be by using grep on the EPUB sources living in nov-temp-dir, presenting the search results and offering a way to jump to the document (and navigate point to the first match?)

That would work too; I might try implementing this using counsel-rg.

@kaushalmodi
Author

OK, I got it to jump to the HTML source using this:

(defun counsel-rg-nov (&optional initial-input)
  "Search for a pattern in current ebook using rg.
INITIAL-INPUT can be given as the initial minibuffer input.

The ebook is assumed to be opened by the `nov' package and so
`nov-temp-dir' variable should be set automatically."
  (interactive)
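  ;; `nov-temp-dir' may be unbound or nil outside of nov buffers, so
  ;; check that before handing it to `file-exists-p'.
  (if (and (bound-and-true-p nov-temp-dir)
           (file-exists-p nov-temp-dir))
      (counsel-rg initial-input nov-temp-dir " -g '*.*html'" (format "Search %s" (buffer-name)))
    ;; `user-error' takes a format string itself, no nested `format' needed.
    (user-error "%S does not exist" (bound-and-true-p nov-temp-dir))))

Now I need help making that jump land at the corresponding location in the rendered page instead of in the HTML source.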

@jwhitbeck

Hi @wasamasa, I would like to take a stab at implementing the grep solution you described above. Before I go ahead and work on a PR, do you have any preferences or suggestions?

@wasamasa
Owner

Navigating point to each match is going to be a challenge. grep will give you the right document to navigate to, but actually jumping to the match is going to be tricky because the HTML you're searching is different from the rendered text (which is basically the HTML without the markup parts, some missing tags and lots of linebreaks). There is no guarantee that the search hit can be navigated to either. You could make things more reliable by having something like a DOM with original source code locations attached to the buffer text, but this is going to be a pain to implement in shr.

I suspect the better solution is to avoid the mismatch between going from rendered document -> source document -> rendered document by performing the search inside the current rendered document, then jumping to the next document if necessary and repeating until hitting a match or the end of the document (after that a wraparound could be performed, much like in info's outright magic incremental search). There is one prerequisite before attempting this: nov.el keeps only the current document in the buffer and re-renders it whenever needed. This may take enough time to make incremental search unusable. So some sort of caching solution would need to be implemented first, along with a strategy for when exactly to load something into the cache:

  • Load up all documents when opening a file, starting with the currently rendered one
  • Load up the currently rendered document only, loading up subsequent ones when needed

Since this is caching, some strategy to invalidate the cache needs to be implemented as well. For example, the g keybinding could be changed to invalidate the current document, and killing the buffer could invalidate all documents.
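A minimal sketch of the cache-plus-search idea described above, not nov.el code. Every `my-` name is hypothetical, RENDER-FN stands in for whatever turns a document index into rendered text, and the search looks at each document from its beginning rather than from point. The load-on-demand strategy is the one shown; a real implementation would presumably keep the cache buffer-local.

```elisp
(defvar my-render-cache (make-hash-table :test #'eql)
  "Hypothetical cache mapping a document index to its rendered text.")

(defun my-cached-render (index render-fn)
  "Return the rendered text for INDEX, calling RENDER-FN only on a miss."
  (let ((hit (gethash index my-render-cache 'miss)))
    (if (eq hit 'miss)
        ;; `puthash' returns the stored value.
        (puthash index (funcall render-fn index) my-render-cache)
      hit)))

(defun my-invalidate-document (index)
  "Drop INDEX from the cache, e.g. when the document is re-rendered."
  (remhash index my-render-cache))

(defun my-invalidate-all ()
  "Empty the cache, e.g. when the buffer is killed."
  (clrhash my-render-cache))

(defun my-search-documents (query current count render-fn)
  "Search for QUERY in COUNT documents, starting at index CURRENT.
Wrap around after the last document.  Return (INDEX . OFFSET) for
the first hit, where OFFSET is a zero-based offset into the
rendered text, or nil if no document matches."
  (let ((step 0)
        (hit nil))
    (while (and (not hit) (< step count))
      (let* ((index (mod (+ current step) count))
             (text (my-cached-render index render-fn))
             (offset (string-match-p (regexp-quote query) text)))
        (when offset
          (setq hit (cons index offset))))
      (setq step (1+ step)))
    hit))
```

Documents are rendered at most once until they are invalidated, so repeating a search only pays the rendering cost for documents it has not visited yet.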

@jakub-w
Contributor

jakub-w commented Feb 20, 2020

You could use dom-texts on libxml-parse-html-region's output instead of grepping through whole HTML files.

shr-render-region is pretty slow and caching every rendered epub page could take ages (and a lot of memory).

Although I don't know how accurate the dom-texts' output would be in comparison to a fully rendered page, if it can be leveraged instead of doing something more heavy-weight it would be cool.

PS. This method wouldn't be too friendly to an occur-like command, should someone want to implement one in the future.
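A minimal sketch of the dom-texts suggestion, assuming an Emacs built with libxml2. The helper name is hypothetical. The empty separator passed to `dom-texts` is a choice made here so that text split by inline markup is not padded with extra spaces; how closely the result matches the rendered page is exactly the open question raised above.

```elisp
(require 'dom)

(defun my-html-string-to-text (html)
  "Return the text content of the string HTML with the markup removed.
Hypothetical helper; needs an Emacs built with libxml2."
  (with-temp-buffer
    (insert html)
    ;; Parse the buffer into a DOM, then concatenate its text nodes.
    (dom-texts (libxml-parse-html-region (point-min) (point-max)) "")))
```

The resulting string can be searched with `string-match-p` without rendering anything through shr.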

@wasamasa
Owner

Thanks for your ideas on that one. dom-texts is one approach, but much like the grep solution it doesn't solve the core issue: how do you jump back from a hit to the corresponding location in the buffer?

I'm aware of shr-render-region being slow. No idea about memory use; without proper tools (no, profiler.el doesn't do it properly) it's hard to objectively judge whether that's an issue in practice. Caching in general is something I'm wary of, so don't expect it to happen on the master branch. What exactly do you mean by occur?

@wasamasa
Owner

Regarding mapping DOM nodes to source locations, it seems I'm not the first to think of it: https://lists.gnu.org/archive/html/emacs-devel/2020-02/msg00096.html

@jakub-w
Contributor

jakub-w commented Feb 21, 2020

> (...) how do you jump back from a hit to the corresponding location in the buffer?

Searching through dom-texts would be there only to find the next page containing the match. The next step would be to render the page and call search-forward (or search-backward). Not the most efficient solution, but the simplest one. I don't think it's possible to know where exactly in the rendered buffer the match would be without actually rendering it first.
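A rough sketch of that two-step idea, assuming an Emacs built with libxml2 and a list of the ebook's HTML file paths in reading order. Both function names are hypothetical. Only the first step is shown: it returns the index of the next file whose text contains the query, and rendering that page and calling search-forward in it is left to the caller.

```elisp
(require 'cl-lib)
(require 'dom)

(defun my-file-text-contains-p (file query)
  "Return non-nil if the text content of the HTML FILE contains QUERY."
  (with-temp-buffer
    (insert-file-contents file)
    (let ((dom (libxml-parse-html-region (point-min) (point-max))))
      ;; The empty separator avoids extra spaces around inline markup.
      (and (string-match-p (regexp-quote query) (dom-texts dom "")) t))))

(defun my-next-matching-index (files start query)
  "Return the first index >= START in FILES whose text contains QUERY.
Return nil if no remaining file matches."
  (cl-loop for index from start below (length files)
           when (my-file-text-contains-p (nth index files) query)
           return index))
```

Matching follows `case-fold-search`, so it is case-insensitive by default, and a match in the extracted text does not guarantee that search-forward finds the same string in the rendered buffer.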

> What exactly do you mean with occur?

It would be cool to have occur-like functionality to find all matches in the whole epub file. The code from this one could be reused for that functionality. I said the dom-texts method wouldn't be too occur-friendly because it would be faster to run if all the pages were cached right from the start.


If the caching was asynchronous, it wouldn't appear to be slow. Maybe it's the way to go.
My concerns about the memory usage came from the fact that epubs can be really big if they contain a lot of images, but now that I have looked into it, it doesn't seem to be an issue. The image in the buffer is just a string with a display property.

@wasamasa
Owner

> If the caching was asynchronous, it wouldn't appear to be slow. Maybe it's the way to go.

I have my doubts. There are three ways of achieving async behavior:

  • Starting a regular process
  • Starting a network process
  • Timers

Processes are useful if your communication is limited to strings; however, we're dealing with fontified buffers here. One-shot timers are a poor way of faking threads, provided whatever you do doesn't take too long.
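For illustration, a minimal sketch of the one-shot timer approach, with hypothetical names throughout. RENDER-FN and STORE-FN stand in for the renderer and the cache. Each step still runs on the main thread, so Emacs blocks for as long as a single render takes, which is the limitation described above.

```elisp
(defun my-fill-cache-step (indices render-fn store-fn)
  "Render the first of INDICES, store it, and schedule the rest.
STORE-FN is called with the index and the value RENDER-FN returned.
Return the timer for the next step, or nil when nothing is left."
  (when indices
    (funcall store-fn (car indices) (funcall render-fn (car indices)))
    (when (cdr indices)
      ;; One-shot timer: the remaining work runs after a short pause,
      ;; giving Emacs a chance to process input in between.
      (run-with-timer 0.1 nil #'my-fill-cache-step
                      (cdr indices) render-fn store-fn))))
```

The returned timer can be passed to `cancel-timer` to stop the remaining work, for example when the buffer is killed.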
