HstsResolver doesn't handle country-code second-level domains (e.g. co.jp) correctly: it emits a WARN log and fails to check HSTS support.
Reproduction
Run a collector with start URL = https://www.ipsj.or.jp/english/index.html.
Actual behavior
HstsResolver tries to communicate with or.jp and emits a WARN message:
WARN HstsResolver - Attempt to verify if the site supports Strict-Transport-Security (HSTS) failed for domain "or.jp". We'll assumume HSTS is not supported for all URLs on that domain.
java.net.UnknownHostException: co.jp: No address associated with hostname
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[?:?]
at java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929) ~[?:?]
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519) ~[?:?]
at java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848) ~[?:?]
at java.net.InetAddress.getAllByName0(InetAddress.java:1509) ~[?:?]
at java.net.InetAddress.getAllByName(InetAddress.java:1368) ~[?:?]
at java.net.InetAddress.getAllByName(InetAddress.java:1302) ~[?:?]
at org.apache.http.impl.conn.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:45) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.13.jar!/:4.5.13]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.13.jar!/:4.5.13]
at com.norconex.collector.http.fetch.util.HstsResolver.lambda$resolveHstsSupport$1(HstsResolver.java:105) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
at java.util.HashMap.computeIfAbsent(HashMap.java:1134) ~[?:?]
at com.norconex.collector.http.fetch.util.HstsResolver.resolveHstsSupport(HstsResolver.java:100) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
at com.norconex.collector.http.fetch.util.HstsResolver.resolve(HstsResolver.java:77) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
at com.norconex.collector.http.fetch.impl.GenericHttpFetcher.fetch(GenericHttpFetcher.java:399) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
at com.norconex.collector.http.fetch.HttpFetchClient.fetch(HttpFetchClient.java:102) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
at com.norconex.collector.http.robot.impl.StandardRobotsTxtProvider.getRobotsTxt(StandardRobotsTxtProvider.java:99) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
at com.norconex.collector.http.pipeline.importer.HttpImporterPipeline$DelayResolverStage.executeStage(HttpImporterPipeline.java:89) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91) ~[norconex-commons-lang-2.0.0.jar!/:2.0.0]
at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:375) ~[norconex-collector-http-3.0.0.jar!/:3.0.0]
at com.norconex.collector.core.crawler.Crawler.processNextQueuedCrawlData(Crawler.java:611) ~[norconex-collector-core-2.0.0.jar!/:2.0.0]
at com.norconex.collector.core.crawler.Crawler.processNextReference(Crawler.java:556) ~[norconex-collector-core-2.0.0.jar!/:2.0.0]
at com.norconex.collector.core.crawler.Crawler$ProcessReferencesRunnable.run(Crawler.java:923) ~[norconex-collector-core-2.0.0.jar!/:2.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Expected behavior
HstsResolver tries to communicate with ipsj.or.jp.
Resources
The Public Suffix List is a list of suffixes under which Internet users can (or historically could) directly register names (not just country-specific ones). It also provides information about Java libraries.
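To illustrate the difference between the actual and expected behavior, here is a minimal sketch. It is not the HstsResolver code: the class name, the method names, and the tiny hard-coded suffix set are all hypothetical, and the real Public Suffix List also has wildcard and exception rules that this sketch ignores.

```java
import java.util.Arrays;
import java.util.Set;

public class RegistrableDomainSketch {

    // Hypothetical, hand-picked excerpt. The real Public Suffix List is far larger.
    static final Set<String> PUBLIC_SUFFIXES = Set.of("jp", "co.jp", "or.jp", "com");

    // Naive approach: keep only the last two labels of the host.
    static String lastTwoLabels(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    // Suffix-aware approach: find the longest matching public suffix,
    // then keep exactly one more label to its left.
    static String registrableDomain(String host) {
        String[] labels = host.split("\\.");
        // i = 0 is the longest candidate (the whole host); later candidates get shorter.
        for (int i = 0; i < labels.length; i++) {
            String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (PUBLIC_SUFFIXES.contains(candidate)) {
                if (i == 0) {
                    return host; // the host itself is a public suffix
                }
                return labels[i - 1] + "." + candidate;
            }
        }
        // Suffix not in our excerpt: fall back to the naive rule.
        return lastTwoLabels(host);
    }

    public static void main(String[] args) {
        String host = "www.ipsj.or.jp";
        System.out.println(lastTwoLabels(host));     // or.jp
        System.out.println(registrableDomain(host)); // ipsj.or.jp
    }
}
```

With the naive rule the host collapses to the public suffix itself (or.jp), which is not a site anyone can query. The suffix-aware rule yields ipsj.or.jp.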
A new snapshot release was just made with a fix that now considers the "effective" top-level domain for a URL instead of just the last two parts of the domain. It is using the Public Suffix List as you suggested.
That being said, you will still get a warning/exception. The reason is that your public suffix is or.jp, so the effective top-level domain for your site is ipsj.or.jp (as you expected). That domain is not reachable (timeout) when trying to resolve HSTS with a HEAD request.
To ensure only https URLs get crawled for your site, I can think of two options:
Update the website so HSTS can be resolved against the top-level domain ipsj.or.jp.
Update your crawler configuration to set disableHSTS to true on the GenericHttpFetcher and enforce https using the GenericURLNormalizer.
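For the second option, the effect of enforcing https can be sketched in plain Java as below. This is only an illustration of the rewrite itself, not the GenericURLNormalizer implementation or its configuration; the class and method names are hypothetical, and the sketch assumes URLs without an explicit port.

```java
public class ForceHttpsSketch {

    // Rewrites an "http://" URL to "https://", leaving everything else untouched.
    // Assumption: no explicit port (e.g. ":80") that would also need adjusting.
    static String toHttps(String url) {
        String prefix = "http://";
        // Case-insensitive check that the URL starts with "http://".
        if (url.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return "https://" + url.substring(prefix.length());
        }
        return url; // already https, or some other scheme
    }

    public static void main(String[] args) {
        System.out.println(toHttps("http://www.ipsj.or.jp/english/index.html"));
    }
}
```

Because every URL is rewritten before it is fetched, the crawler no longer needs the HSTS lookup to decide whether https should be used.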
> That being said, you will still get a warning/exception. The reason is that your public suffix is or.jp, so the effective top-level domain for your site is ipsj.or.jp (as you expected). That domain is not reachable (timeout) when trying to resolve HSTS with a HEAD request.
Actually, the URL I shared was just an example which I randomly picked from sites I was familiar with. However, your suggestions to mitigate another error message are very helpful.
I'm looking forward to a new release with the fix.