Decoupling the html2text rendering pipeline #37

Open
robinkrahl opened this issue Oct 3, 2020 · 8 comments

Comments

@robinkrahl
Contributor

robinkrahl commented Oct 3, 2020

I’ve spent some time using html2text, reading its source code and even writing small patches. Still, I haven’t really grasped the complete rendering process that html2text performs. At the same time, I have some specific requirements like #27 or #36 that cannot be realized with html2text and maybe don’t even belong in a generic HTML rendering library.

Therefore, I am wondering: Would it be possible and would it make sense to decouple the html2text rendering pipeline into steps that can be customized by the user? This would make it easier to understand the rendering process, and it might make it possible to implement some of the requirements I mentioned earlier without having to re-implement the entire rendering stack.

From my point of view, these are the steps of the rendering pipeline (I’m quite confident that steps 1–3 are correct, but I’m not really sure about 4 and 5):

  1. Parsing the HTML document (src/lib.rs).
  2. Transforming the HTML document into a render tree (src/lib.rs).
  3. Estimating the size of the elements of the render tree (src/lib.rs).
  4. Laying out the elements of the render tree into lines (src/text_renderer.rs?).
  5. Rendering the elements into text (src/text_renderer.rs?).
  6. Annotating the lines using a TextDecorator (src/text_renderer.rs).
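
To make the idea of separable steps concrete, here is a minimal sketch in Rust. It is not html2text's code: `RenderNode`, `build_tree`, `estimate_min_width` and `layout` are hypothetical names invented for illustration, and the "parsing" step is replaced by a list of plain paragraphs. The point is only that the text-rendering step (step 5) is passed in by the caller while the other steps stay fixed.

```rust
/// Hypothetical render tree; html2text's real tree is much richer.
#[derive(Debug, Clone, PartialEq)]
enum RenderNode {
    Text(String),
    Block(Vec<RenderNode>),
}

/// Stand-in for steps 1-2: build a render tree (from plain paragraphs, not real HTML).
fn build_tree(paragraphs: &[&str]) -> RenderNode {
    RenderNode::Block(
        paragraphs
            .iter()
            .map(|p| RenderNode::Text(p.to_string()))
            .collect(),
    )
}

/// Step 3: estimate the minimum width as the longest single word in the tree.
fn estimate_min_width(node: &RenderNode) -> usize {
    match node {
        RenderNode::Text(t) => t
            .split_whitespace()
            .map(|w| w.chars().count())
            .max()
            .unwrap_or(0),
        RenderNode::Block(children) => {
            children.iter().map(estimate_min_width).max().unwrap_or(0)
        }
    }
}

/// Steps 4-5: walk the tree and emit one line per text node, using a
/// caller-supplied function to turn each text node into output.
fn layout<F: Fn(&str) -> String>(node: &RenderNode, render_text: &F, out: &mut Vec<String>) {
    match node {
        RenderNode::Text(t) => out.push(render_text(t)),
        RenderNode::Block(children) => {
            for child in children {
                layout(child, render_text, out);
            }
        }
    }
}
```

A caller could then pass, say, `&|t: &str| t.to_uppercase()` as `render_text` to change how text is rendered without touching tree construction or size estimation.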

It would be especially nice if the user could customize step 5 without having to re-implement everything else.

Is my understanding of the rendering process roughly correct? What do you think?

@jugglerchris
Owner

I think that's a reasonable summary of how it works. Some more notes:

  • The size estimate is needed for laying out tables - i.e. deciding how wide each of the columns should be.
  • The annotation is part of the text layout - some annotations can add text which needs to be taken into account.
  • The layout is really just a tree walk of the render tree, but it's harder to follow because tree_map_reduce() is used to avoid stack overflows; it has an explicit stack of work to do rather than the more readable recursion.
  • The text layout is mostly using the obvious algorithm - keep trying to add words until the line is full, then start a new line. Nested blocks use nested text renderers (e.g. for quoted text, render into a renderer that is 2 columns narrower and then add the lines to the parent with a > prefix).
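
The text layout described in the last bullet can be sketched roughly as follows. This is a simplified illustration, not the library's implementation: `wrap` and `wrap_quoted` are hypothetical helper names, widths are counted in `char`s rather than display columns, and a word longer than the width is placed on its own line instead of being split.

```rust
/// Greedy word wrap: keep adding words until the line is full, then start a new line.
fn wrap(text: &str, width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        let word_len = word.chars().count();
        let cur_len = current.chars().count();
        // Start a new line if the word (plus a separating space) would not fit.
        if cur_len > 0 && cur_len + 1 + word_len > width {
            lines.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}

/// Nested block: lay out at a width reduced by 2, then prefix each line
/// with "> " when handing the lines to the parent.
fn wrap_quoted(text: &str, width: usize) -> Vec<String> {
    wrap(text, width.saturating_sub(2))
        .into_iter()
        .map(|line| format!("> {}", line))
        .collect()
}
```

With a parent width of 12, `wrap_quoted` lays the text out at width 10, so each prefixed line still fits within the parent's 12 columns.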

Can you describe the kind of things you want to do differently?

@robinkrahl
Contributor Author

Thanks for the explanations! I’ll have a closer look at the text rendering code.

> Can you describe the kind of things you want to do differently?

It’s not about doing things differently, rather about extending the renderer for special use cases like syntax highlighting or special styling for other elements.

@jugglerchris
Owner

I had a thought. Perhaps a useful extension point would be the point where a sub-builder is merged into the parent.
For example, after a <pre> block is processed, a function could have access to the lines before they're integrated into the parent builder. (Note that currently <pre> doesn't use a sub-builder, but it could if needed. Something like <blockquote> does, so that it can format at a smaller width and then prefix the lines when adding them to the current block.)
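
As a rough sketch of what such an extension point might look like (the `Builder` type and its `sub_builder`, `add_line` and `merge_with` methods are hypothetical and much simpler than html2text's actual builders), the parent could hand the sub-builder's finished lines to a user-supplied hook before prefixing and appending them:

```rust
/// Hypothetical, simplified line builder; not html2text's real type.
struct Builder {
    width: usize,
    lines: Vec<String>,
}

impl Builder {
    fn new(width: usize) -> Self {
        Builder { width, lines: Vec::new() }
    }

    /// Create a narrower builder for a nested block.
    fn sub_builder(&self, shrink: usize) -> Builder {
        Builder::new(self.width.saturating_sub(shrink))
    }

    fn add_line(&mut self, line: &str) {
        self.lines.push(line.to_string());
    }

    /// Merge a finished sub-builder into this one. The `hook` sees the
    /// sub-builder's lines first and may rewrite them (e.g. to add
    /// highlighting); the result is then prefixed and appended.
    fn merge_with<F>(&mut self, sub: Builder, prefix: &str, hook: F)
    where
        F: FnOnce(Vec<String>) -> Vec<String>,
    {
        for line in hook(sub.lines) {
            self.lines.push(format!("{}{}", prefix, line));
        }
    }
}
```

In this sketch the hook runs before the prefix is added, so it only sees the block's own content. A hook that changes line lengths would need to respect `sub.width` itself.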

@robinkrahl
Contributor Author

robinkrahl commented Oct 4, 2020 via email

@robinkrahl
Contributor Author

Another aspect to this topic is that it would be useful to use html2text’s layout mechanisms with a different data source, for example a Markdown document parsed with pulldown-cmark instead of an HTML document.

@jugglerchris
Owner

That's an interesting thought. Though as Markdown can contain HTML tags, maybe just going via HTML makes sense. I don't know how common that is, though.

@grantslatton

Right now the render method looks like:

    /// Render this document using the given `decorator` and wrap it to `width` columns.
    pub fn render<D: TextDecorator>(
        self,
        width: usize,
        decorator: D,
    ) -> RenderedText<D> {
        let renderer = TextRenderer::new(width, decorator);
        let builder = render_tree_to_string(renderer, self.0, &mut Discard {});
        RenderedText(builder)
    }

How would you feel about a PR that made this function take a whole R: Renderer instead of a D: TextDecorator? It looks like the rest of the code is already generic enough to handle this. This would allow users to render a RenderTree with their own implementations rather than being forced into TextRenderer.

Thanks for the great lib, almost exactly what I needed.

@jugglerchris
Owner

Hi @grantslatton - sorry, I accidentally lost the notification and didn't notice it was about a comment here!
I'd be very happy to add a new method taking a full renderer (called by the current render method) - then it's not a breaking change.
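
The non-breaking shape suggested here might look something like the sketch below. Everything in it is a simplified stand-in: the `Renderer` and `TextDecorator` traits have only one or two methods each, `render_with` is a hypothetical method name, and the return type is a plain `String` rather than `RenderedText<D>`. It only illustrates the delegation pattern, where the existing `render` keeps its signature and calls the new, more general method.

```rust
/// Simplified stand-in traits; the real html2text traits have many more methods.
trait Renderer {
    fn add_text(&mut self, text: &str);
    fn into_string(self) -> String;
}

trait TextDecorator {
    fn decorate(&self, text: &str) -> String;
}

/// Default renderer parameterised by a decorator.
struct TextRenderer<D: TextDecorator> {
    width: usize,
    decorator: D,
    out: String,
}

impl<D: TextDecorator> TextRenderer<D> {
    fn new(width: usize, decorator: D) -> Self {
        TextRenderer { width, decorator, out: String::new() }
    }
}

impl<D: TextDecorator> Renderer for TextRenderer<D> {
    fn add_text(&mut self, text: &str) {
        // Truncates instead of wrapping, purely to keep the sketch short.
        let decorated = self.decorator.decorate(text);
        self.out.extend(decorated.chars().take(self.width));
        self.out.push('\n');
    }
    fn into_string(self) -> String {
        self.out
    }
}

/// A trivial decorator and a user-defined renderer, for demonstration.
struct PlainDecorator;
impl TextDecorator for PlainDecorator {
    fn decorate(&self, text: &str) -> String {
        text.to_string()
    }
}

struct ShoutingRenderer(String);
impl Renderer for ShoutingRenderer {
    fn add_text(&mut self, text: &str) {
        self.0.push_str(&text.to_uppercase());
        self.0.push('\n');
    }
    fn into_string(self) -> String {
        self.0
    }
}

struct RenderTree(Vec<String>);

impl RenderTree {
    /// New, more general entry point (hypothetical name).
    fn render_with<R: Renderer>(self, mut renderer: R) -> String {
        for text in &self.0 {
            renderer.add_text(text);
        }
        renderer.into_string()
    }

    /// Existing signature, now delegating, so current callers are unaffected.
    fn render<D: TextDecorator>(self, width: usize, decorator: D) -> String {
        self.render_with(TextRenderer::new(width, decorator))
    }
}
```

Existing code would keep calling `tree.render(80, PlainDecorator)`, while a caller with special needs could call `tree.render_with(ShoutingRenderer(String::new()))`.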
