HTML elements are reordered. #2267

Open
Bogdan740 opened this issue Jan 28, 2025 · 3 comments
Labels
needs-more-info More information is needed from the reporter to progress the issue

Comments

@Bogdan740

I've uploaded an example.txt file containing the HTML example. Unfortunately I couldn't reduce it any further, so I apologise for its size.

When rendering the original HTML file in an HTML renderer, you get:

TEXT 1

DKLASJD:  DASDSA ABT  31321 MT /  LSMGO  ABT dsada MT
TEXT 2
TEXT 3

Once you parse the HTML with Jsoup.parse(html) and render the resulting HTML again, you get:

TEXT 1

TEXT 3
DKLASJD:  DASDSA ABT  31321 MT /  LSMGO  ABT dsada MT
TEXT 2

Notice how the text has been reordered. One thing that may help the investigation: when you remove all of the inline style attributes, the issue no longer occurs, and the text is in the correct order after parsing with Jsoup.parse().

Is this expected behaviour?

example.txt

@jhy
Owner

jhy commented Jan 28, 2025

Hi there,

I can't seem to replicate it. Can you show your code?

With your example HTML in try.jsoup: https://try.jsoup.org/~KDhcX48QN1pnbYhCj-WlusIyAWo

The result of body.text() is:

TEXT 1 DKLASJD: DASDSA ABT 31321 MT / LSMGO ABT dsada MT TEXT 2 TEXT 3

Which is the same as Chrome's rendered output:

TEXT 1

DKLASJD:  DASDSA ABT  31321 MT /  LSMGO  ABT dsada MT
TEXT 2
TEXT 3

@Bogdan740
Author

Bogdan740 commented Jan 29, 2025

I looked into the example you shared here. I copied both the "input HTML" and the "renderer output" into an HTML renderer, and it seems like the HTML structure is being reordered, as shown in the attached image. I should also mention I'm using version 1.18.3 (which I believe is the latest version) as per https://github.com/jhy/jsoup/blob/master/CHANGES.md.

Interestingly, in the example you sent, the text appears in the correct order, but the HTML itself is still being rearranged. Do you know why this might be happening?

Image

@jhy
Owner

jhy commented Jan 29, 2025

Sorry, but I can't make out what you're trying to show me in that image.

Given how badly formed the HTML is (incorrect closers for formatters, etc.), there is a chance that the adoption agency algorithm is being executed differently by jsoup and by your browser (which browser are you testing with? You haven't said). Or something akin to that.

To get to the bottom of it, my suggestion would be to write some debug code that traces the specific order of the DOM for both jsoup and browser. Add IDs to each element so we can see which one is which. Then traverse the doc and serialize the tag + ID. For the browser you can do this in JS. Then the difference is likely to be apparent.
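
As a rough illustration of the comparison step only, here is a minimal Java sketch that takes two traces and reports where they first diverge. It assumes each trace has already been collected as a list of "tag#id" strings in document order, for instance in jsoup by walking `doc.getAllElements()` and emitting `tagName()` plus `id()`, and in the browser by iterating `document.querySelectorAll('*')`. The class name `TraceDiff`, its methods, and the sample traces are hypothetical and made up for illustration; they are not taken from example.txt.

```java
import java.util.List;

/** Compares two element traces (entries like "div#e3") and reports where they first diverge. */
public class TraceDiff {

    /**
     * Returns the index of the first position at which the two traces differ,
     * or -1 if they are identical. If one trace is a strict prefix of the
     * other, the length of the shorter trace is returned.
     */
    public static int firstDivergence(List<String> a, List<String> b) {
        int shared = Math.min(a.size(), b.size());
        for (int i = 0; i < shared; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        return a.size() == b.size() ? -1 : shared;
    }

    /** Builds a one-line, human-readable summary of the first difference. */
    public static String describe(List<String> jsoupTrace, List<String> browserTrace) {
        int i = firstDivergence(jsoupTrace, browserTrace);
        if (i < 0) {
            return "traces are identical";
        }
        String left = i < jsoupTrace.size() ? jsoupTrace.get(i) : "<end>";
        String right = i < browserTrace.size() ? browserTrace.get(i) : "<end>";
        return "first difference at index " + i + ": jsoup=" + left + ", browser=" + right;
    }

    public static void main(String[] args) {
        // Hypothetical traces for illustration; real ones would come from the
        // jsoup document and from the browser's DOM after adding IDs to the HTML.
        List<String> jsoupTrace = List.of("body#e0", "p#e1", "span#e3", "b#e2");
        List<String> browserTrace = List.of("body#e0", "p#e1", "b#e2", "span#e3");
        System.out.println(describe(jsoupTrace, browserTrace));
    }
}
```

The first index at which the two traces disagree points at the element whose position differs between the two parsers, which narrows down where in the markup to look.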

But I am confused as to why my original output is different to yours. Can you show your code and confirm which version you're on?

jhy added the needs-more-info label (More information is needed from the reporter to progress the issue) Jan 29, 2025