URLNormalizer - encodeNonURICharacters giving an error #1063
Comments
What is the desired outcome with such a URL? Do you know if it is supposed to be a valid URL? When I access it in my browser, the server responds with "Bad Request." What would you replace the URL with so that it resolves to a valid page that can be downloaded properly? If the URL is indeed bad, even when appropriately encoded, this exception should be harmless and you can ignore it. If the concern is keeping your logs clean, I suggest one of the following: either turn off logging for the URL normalizer class, or exclude the offending URL with a reference filter:
<Logger name="com.norconex.collector.http.url.impl.GenericURLNormalizer" level="OFF" additivity="false">
<AppenderRef ref="Console"/>
</Logger>
<referenceFilters>
<filter class="ReferenceFilter" onMatch="exclude">
<valueMatcher method="regex">.*%url%.*</valueMatcher>
</filter>
</referenceFilters>
Does this help resolve your issue?
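As a quick sanity check of the exclusion pattern, here is a minimal sketch using only the JDK's `java.util.regex`. It is not the crawler's filter code; the class and method names are made up for illustration, and it assumes the filter applies the regex to the whole URL. Because the expression has `.*` on both sides, a full match and a partial match give the same result here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: checks which URLs the
// expression from the <valueMatcher> above would match.
class ReferenceFilterRegexDemo {

    // Same expression as in the <valueMatcher> element.
    static final Pattern EXCLUDE = Pattern.compile(".*%url%.*");

    /** Returns true when the whole URL matches the exclusion pattern. */
    static boolean isExcluded(String url) {
        return EXCLUDE.matcher(url).matches();
    }

    public static void main(String[] args) {
        // The malformed link reported in this issue: matches.
        System.out.println(isExcluded("https://forces.ca/\"%url%/\"")); // true
        // The page that contains the link: does not match.
        System.out.println(isExcluded("https://forces.ca/fr/temps-partiel/")); // false
    }
}
```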
Hello Pascal, Yes, it seems the issue is due to a faulty URL. We've already asked the webmaster to address the problem, but we're not sure when it will be resolved. Your second option would definitely work for us. Thanks again for the suggestion. From what the search admins have shared, when the error occurs, the crawler either stops working or halts at a certain point. But I have not noticed that myself. The ticket serves two purposes: to report the issue and to potentially raise an enhancement request for the URL normalizer. This type of issue is very rare, so the workaround you propose would be a good solution for us. For the enhancement, it would be helpful to be able to perform pre-normalization and post-normalization replacements using the replacement tag. Here is a suggestion: Again, we're really pleased with our decision to go with Norconex Crawler for our transition to Elasticsearch. The features and support have been outstanding. Thanks!
Very much appreciated! In my tests, the exception did not stop the crawler. If you have a config file that reproduces the crawler stopping, please share it, as I think that would be a bug. I am marking this as a feature request to allow replacements before and after normalization rules. Along the lines of what you propose, I am considering adding support for configuring multiple URL normalizers so that you can mix and match them in the desired order.
Regarding the issue, I was simply relaying what my admin team reported, but I believe there are multiple factors contributing to that behavior. I don't think it's an actual bug. If we can accurately reproduce the issue, I'll open a new bug report. Thank you so much Pascal.
<urlNormalizers>
<urlNormalizer class="GenericURLNormalizer">
<replacements>
<replace>
<match>...</match>
<replacement>...</replacement>
</replace>
</replacements>
</urlNormalizer>
<urlNormalizer class="GenericURLNormalizer">
<normalizations>
addWWW,
decodeUnreservedCharacters,
encodeNonURICharacters,
encodeSpaces,
lowerCaseSchemeHost,
removeDefaultPort,
removeDuplicateSlashes,
removeQueryString,
secureScheme,
upperCaseEscapeSequence
</normalizations>
<replacements>
<replace>
<match>...</match>
<replacement>...</replacement>
</replace>
</replacements>
</urlNormalizer>
</urlNormalizers>
Please give it a try and confirm.
Wow, I wasn't expecting this feature request to be completed already. Amazing! Do the URL normalizers run in the sequence defined in the file? We'll give it a try and keep you posted. Thanks again.
Yes, they run in the order you define them.
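To show why the order matters, here is a minimal sketch in plain Java. It is not Norconex code: `OrderedNormalizerChain`, `STRIP_PLACEHOLDER`, and `STRICT_ENCODE` are hypothetical names, and the strict step is only a stand-in for a normalization that rejects malformed percent-escapes. Running the replacement first lets the strict step succeed, while the reverse order fails on the same URL.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of an ordered chain of URL transformations.
class OrderedNormalizerChain {

    /** Applies each step to the URL in list order and returns the result. */
    static String apply(List<UnaryOperator<String>> steps, String url) {
        String result = url;
        for (UnaryOperator<String> step : steps) {
            result = step.apply(result);
        }
        return result;
    }

    // Stand-in for a replacement rule: strips the bad placeholder segment.
    static final UnaryOperator<String> STRIP_PLACEHOLDER =
            u -> u.replace("\"%url%/\"", "");

    // Stand-in for a strict normalization: rejects any '%' that is not
    // followed by two hex digits, then encodes spaces.
    static final UnaryOperator<String> STRICT_ENCODE = u -> {
        if (u.matches(".*%(?![0-9A-Fa-f]{2}).*")) {
            throw new IllegalArgumentException("Malformed escape in: " + u);
        }
        return u.replace(" ", "%20");
    };

    public static void main(String[] args) {
        String bad = "https://forces.ca/\"%url%/\"";

        // Replacement first, then normalization: succeeds.
        System.out.println(apply(List.of(STRIP_PLACEHOLDER, STRICT_ENCODE), bad));

        // Normalization first: the strict step rejects the URL.
        try {
            apply(List.of(STRICT_ENCODE, STRIP_PLACEHOLDER), bad);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```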
On the Forces.ca webpage (https://forces.ca/fr/temps-partiel/), we found an issue with an href link pointing to https://forces.ca/"%url%/".
Normalizing this link with the encodeNonURICharacters option throws an error because the link is not well-formed.
We attempted to apply a replacement, but normalization runs before any replacements are made. If we could perform the replacement before normalization, it would likely resolve the issue.
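To illustrate what "not well-formed" means here, the following sketch uses only the JDK's `java.net.URI` parser, which is an assumption for demonstration and not necessarily what the normalizer does internally. The class name `MalformedEscapeDemo` is made up. The link contains literal double quotes and a `%` that is not followed by two hex digits, so a strict parser rejects it; once the placeholder is replaced, the URL parses.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical check, not the crawler's code: tests whether the JDK's
// strict URI parser accepts a string.
class MalformedEscapeDemo {

    /** Returns true if java.net.URI accepts the string as a well-formed URI. */
    static boolean isWellFormed(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String bad = "https://forces.ca/\"%url%/\"";
        // Replacement applied before any strict parsing.
        String fixed = bad.replace("\"%url%/\"", "");

        System.out.println(isWellFormed(bad));   // false
        System.out.println(isWellFormed(fixed)); // true
    }
}
```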
For now, we're exploring other solutions. Thank you for your understanding.