Regression: Versions >= v5.3.2 are unable to parse specific link #280

Open
stalgiag opened this issue Sep 24, 2024 · 2 comments

stalgiag commented Sep 24, 2024

I work on a project that validates its links using this library. One frequently validated link is the HTML spec at https://html.spec.whatwg.org/. This page is one of the larger HTML files on the web, but node-html-parser parsed it successfully in approximately 23 seconds on my local machine up to and including release 5.3.1. The regression appears in release 5.3.2.

Consider this example:

const HTMLParser = require('node-html-parser');
const nFetch = require('node-fetch');

async function parseHTMLSpec() {
  try {
    const response = await nFetch('https://html.spec.whatwg.org/');
    const html = await response.text();

    console.log('Fetched HTML. Attempting to parse...');
    console.time('parseHTMLSpec');
    const parsedHTML = HTMLParser.parse(html);
    console.timeEnd('parseHTMLSpec');

    console.log('HTML parsed successfully.');
    console.log('Title:', parsedHTML.querySelector('title').text);
  } catch (error) {
    console.error('Error occurred:', error);
  }
}

parseHTMLSpec();

With node-html-parser 5.3.1, this outputs the following:

Fetched HTML. Attempting to parse...
parseHTMLSpec: 23.415s
HTML parsed successfully.
Title: HTML Standard

With node-html-parser 5.3.2, this hangs indefinitely. Even after running for hours, it outputs only the following:

Fetched HTML. Attempting to parse...
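
For anyone reproducing this, a hang in a synchronous call like HTMLParser.parse cannot be interrupted from the same thread, so a timer alone will not help. Below is a minimal sketch that runs the parse in a worker thread and gives up after a deadline. The helper name runWithTimeout, the parseSource worker body, and the 60-second limit are illustrative assumptions, not part of node-html-parser.

```javascript
const { Worker } = require('node:worker_threads');

// Hypothetical helper: runs `source` (a string of JavaScript) in a worker
// thread and rejects if no result arrives within `timeoutMs`.
function runWithTimeout(source, workerData, timeoutMs) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(source, { eval: true, workerData });

    const timer = setTimeout(() => {
      // terminate() stops the worker even if it is stuck in a synchronous loop.
      worker.terminate();
      reject(new Error(`Timed out after ${timeoutMs} ms`));
    }, timeoutMs);

    worker.once('message', (result) => {
      clearTimeout(timer);
      worker.terminate();
      resolve(result);
    });

    // Uncaught exceptions in the worker (including a worker running out of
    // memory) surface here.
    worker.once('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });
  });
}

// Assumed worker body: parse the HTML passed in via workerData and send
// back the page title. Requires node-html-parser to be installed.
const parseSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const HTMLParser = require('node-html-parser');
  const root = HTMLParser.parse(workerData);
  parentPort.postMessage(root.querySelector('title').text);
`;

// Example usage, with `html` being the fetched page text:
// runWithTimeout(parseSource, html, 60000).then(console.log, console.error);
```

With this in place, a version that hangs should produce a "Timed out" error instead of blocking the process, which makes it easier to check several versions in a row.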

stalgiag changed the title from "Regression: Versions >= v5.3.2 are unable to parse complex HTML" to "Regression: Versions >= v5.3.2 are unable to specific link" on Sep 24, 2024
stalgiag changed the title from "Regression: Versions >= v5.3.2 are unable to specific link" to "Regression: Versions >= v5.3.2 are unable to parse specific link" on Sep 24, 2024
taoqf added a commit that referenced this issue Nov 14, 2024

taoqf commented Nov 14, 2024

Sorry for the bad experience.
I released a beta version, [email protected],
but I could not test it because of its high memory usage. Could you test it for me? Thank you.

stalgiag (author) commented

Thanks for attempting to find a fix!

I tested [email protected] and I also ran out of memory with this error:
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory.

I tested on both Node 20.10.0 and Node 18.18.1. Note that this does not happen on <v5.3.2 using the same machine.
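
In case it helps narrow this down, here is a rough sketch for comparing heap growth between versions on smaller inputs that still complete. The helper name measureHeapGrowth is hypothetical, and the figure is only approximate because garbage collection can run at any point during the call.

```javascript
// Hypothetical helper: runs `fn` synchronously and reports roughly how much
// the V8 heap grew while it ran. Treat the number as an estimate only.
function measureHeapGrowth(fn) {
  const before = process.memoryUsage().heapUsed;
  const result = fn();
  const after = process.memoryUsage().heapUsed;
  return { result, heapGrowthMiB: (after - before) / (1024 * 1024) };
}

// Example usage, assuming `HTMLParser` and a smaller `html` string are in scope:
// const { heapGrowthMiB } = measureHeapGrowth(() => HTMLParser.parse(html));
// console.log(`Heap grew by about ${heapGrowthMiB.toFixed(1)} MiB`);
```

Raising the heap limit with Node's --max-old-space-size flag (for example, node --max-old-space-size=8192 script.js) may also show whether the parse eventually finishes or memory keeps growing without bound.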
