SiteMinder Authentication #218
Comments
Try this: <metadataFetcher class="${metaFetcher}">
<validStatusCodes>200,301,302</validStatusCodes>
</metadataFetcher> |
Hi, |
Without a way to reproduce this, it is hard for me to comment, but looking at your log snippet, nothing indicates the redirect is not being crawled. Understanding these two lines from your log may help:
REJECTED_REDIRECTED means the original URL is being dropped in favor of the target URL. That target URL will be crawled unless it is rejected by some other rule you have defined in your config. Further down in your logs you should find indications of whether it was indeed processed and, if not, the reasons why. Does this help? |
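As a rough aid for reading such logs, the sketch below parses a fragment in the `key=value, key=value` shape quoted in the original report (`crawlState=REDIRECT, statusCode=302, reasonPhrase=Found`) and flags redirect statuses. The names `parseLogFragment` and `isRedirect` are made up for illustration, and the assumption that fields are comma-separated `key=value` pairs rests only on that one quoted line, not on any documented log format.

```javascript
// Hypothetical helper: turn "k=v, k=v" into a plain object.
// Assumes values contain no commas (true for the quoted line only).
function parseLogFragment(fragment) {
  var fields = {};
  fragment.split(',').forEach(function (pair) {
    var idx = pair.indexOf('=');
    if (idx === -1) {
      return; // skip anything that is not key=value
    }
    var key = pair.slice(0, idx).trim();
    var value = pair.slice(idx + 1).trim();
    fields[key] = value;
  });
  return fields;
}

// Hypothetical helper: true when the parsed status code is in the 3xx range.
function isRedirect(fields) {
  var code = parseInt(fields.statusCode, 10);
  return code >= 300 && code < 400;
}

var fields = parseLogFragment('crawlState=REDIRECT, statusCode=302, reasonPhrase=Found');
console.log(fields.crawlState, isRedirect(fields));
```

A redirect flagged this way is only the first half of the story: the target URL still has to be looked up separately further down the log.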
Hi Pascal, |
Do you by any chance have or know of a public login form with a test/demo account we can use to start on this? |
Hi Pascal, |
I am marking this as a feature request. It will likely remain open until we can get our hands on a public SiteMinder site we can use for testing/implementing this. You can always contact Norconex to have someone work on your intranet to put this in place. |
Hi Pascal, do you have any update regarding SiteMinder Authentication? Thanks. |
Hello Akshay, No update. Do you have a SiteMinder site with temporary access so we can give it a try? |
Hi, any update on this? Will we be able to crawl SiteMinder-authenticated URLs? I am facing the same issue, and any help would be greatly appreciated. |
@Krishna210414 , the issue is the same: we need a sample SiteMinder site we can use as a test. You got one you can share? |
Nope, I don't have one to share on the forum. But I would like to know: is there a specific setting that would let me crawl the redirected URL?
It would be helpful if you could describe a way to pass the target URL as a parameter to the authentication URL. |
@shubhamsamy In my experience, the httpClientFactory login has limited capabilities, which is understandable given how many different auth methods exist, including federated login. I tried it with sites built on Drupal, which uses regular form auth. If it doesn't work with your intranet, it is probably because the site hops across several pages to authenticate. In that case, PhantomJS seems to be the best bet with Norconex for now. I was able to crawl through both FORM and SAML auth with it. PhantomJS is a headless browser after all, though it seems to take quite a bit of hacking, especially for SAML auth. |
Thanks for the reply. Can you share your logic? |
@Krishna210414 I am not sure if you have the same issue as @shubhamsamy does. If you're dealing with httpClientFactory, have you tried the following config? <metadataFetcher class="$metaFetcher">
<validStatusCodes>200,302,403</validStatusCodes>
</metadataFetcher> |
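For readers unsure what `validStatusCodes` changes: assuming it is a comma-separated list of HTTP status codes the fetcher treats as successful (that is how the setting reads, though the fetcher's internals are not shown in this thread), the check amounts to something like the sketch below. `isValidStatus` is a hypothetical name and is not part of Norconex.

```javascript
// Hypothetical illustration of a comma-separated status-code allowlist.
function isValidStatus(statusCode, validStatusCodes) {
  var accepted = validStatusCodes.split(',').map(function (code) {
    return parseInt(code.trim(), 10);
  });
  return accepted.indexOf(statusCode) !== -1;
}

// With the list from the config above, 302 is accepted and 301 is not.
var accepts302 = isValidStatus(302, '200,302,403');
var accepts301 = isValidStatus(301, '200,302,403');
console.log(accepts302, accepts301);
```

Under that assumption, adding 302 or 403 to the list only stops the fetcher from treating those responses as failures. It does not by itself perform any login.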
I tried that, but it didn't work. |
If you're trying to do Form Auth, you can configure: <httpcollector id="My Collector">
<crawlers>
<crawler id="$crawler-id">
<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
<url>$crawler-url</url>
</startURLs>
<metadataFetcher class="$metaFetcher">
<validStatusCodes>200,302,403</validStatusCodes>
</metadataFetcher>
<documentFetcher class="${http}.fetch.impl.PhantomJSDocumentFetcher"
detectContentType="true" detectCharset="true" screenshotEnabled="true">
<exePath>${run-path}</exePath>
<scriptPath>${script-path}</scriptPath>
<resourceTimeout>5000</resourceTimeout>
<validStatusCodes>200,302,403</validStatusCodes>
<notFoundStatusCodes>404</notFoundStatusCodes>
<referencePattern>^https://.*</referencePattern>
<renderWaitTime>3000</renderWaitTime>
<screenshotDimensions>600x400</screenshotDimensions>
<screenshotZoomFactor>0.25</screenshotZoomFactor>
<screenshotScaleDimensions>300</screenshotScaleDimensions>
<screenshotScaleStretch>false</screenshotScaleStretch>
<screenshotScaleQuality>medium</screenshotScaleQuality>
<screenshotImageFormat>png</screenshotImageFormat>
<screenshotStorage>disk</screenshotStorage>
<screenshotStorageDiskDir structure="url2path">${workdir}/screenshot</screenshotStorageDiskDir>
<screenshotStorageDiskField>dummy</screenshotStorageDiskField>
</documentFetcher>
......
</crawler>
</crawlers>
</httpcollector>
Then add the following JS file. The script attempts form authentication whenever the existing session cookie is missing or no longer valid. Note that the current documentFetcher version cannot pass extra arguments to the script, so the configuration has to be defined inside the JS file itself. The code below is a working example, but your form authentication may differ from the sites I work with (Drupal sites in my case), so you will most likely need to customize, test, and debug it further.
/**
* This file is used and is required by the PhantomJSDocumentFetcher.
* Modifying this file could break PhantomJSDocumentFetcher behavior.
*/
var webPage = require('webpage');
var page;
var loginPage;
var fs = require('fs');
var system = require('system');
// Phantomjs global config
phantom.cookiesEnabled = true;
phantom.javascriptEnabled = true;
phantom.state = 'no-state';
//#############
// Local config
// ############
var loginAttempt = 0;
var userName = "username";
var userPass = "password";
var workDir = '/path/to/work/dir';
// Define session cookie file
// in order for PhantomJS to keep a session alive
// make it sure to be writable
var cookie = workDir + '/cookies/cookie.json';
var loginUrl = 'https://example.com/login'; // site login link where a login form presents
var logoutUrl = 'https://example.com/logout'; // site logout link
if (system.args.length !== 10) {
system.stderr.writeLine('Invalid number of arguments.');
phantom.exit(1);
}
var url = system.args[1]; // The URL to fetch
var outfile = system.args[2]; // The temp output file
var timeout = system.args[3]; // How long to wait for the whole page to render
var bindId = system.args[4]; // HttpClient binding id
var protocol = system.args[5]; // Was the original URL "https" or "http"?
var thumbnailFile = system.args[6]; // Optional path to image file
var dimension = system.args[7]; // e.g. 1024x768
var zoomFactor = system.args[8]; // e.g. 0.25 (25%)
var resourceTimeout = system.args[9]; // timeout for a single page resource
var addCookieInfo = function() {
Array.prototype.forEach.call(JSON.parse(fs.read(cookie)), function(param) {
phantom.addCookie(param);
});
};
var removeCookies = function() {
if (fs.exists(cookie)) {
fs.remove(cookie);
}
if (typeof loginPage === 'object') {
loginPage.close();
}
loginPage = webPage.create();
loginPage.open(logoutUrl, function(status) {
if (status === "success") {}
});
}
function runLogin() {
if (typeof loginPage === 'object') {
loginPage.close();
}
if (loginAttempt > 2) {
system.stderr.writeLine('Reached max login attempt.');
phantom.exit();
}
else {
loginAttempt++;
loginPage = webPage.create();
loginPage.open(loginUrl, function(status) {
if (status === "success") {
// system.stderr.writeLine('Form auth started.');
/**
* #############################################
* NOTE: Login Form
* Customize for UserID, Password, and Form fields
* Or rewrite to pass each objects to this function
* #############################################
*/
loginPage.evaluate(function(uname, upass) {
document.getElementById("username").value = uname;
document.getElementById("userpass").value = upass;
document.getElementById("loginform").submit();
//docForm = document.getElementsByTagName("form");
//docForm[0].submit();
}, userName, userPass);
loginPage.onLoadFinished = function(status) {
if (status === 'success') {
if (!phantom.state || phantom.state == 'no-state') {
phantom.state = 'no-session';
}
if (phantom.state === 'no-session') {
fs.write(cookie, JSON.stringify(phantom.cookies), "w");
phantom.state = 'run-state';
setTimeout(runPage, 500);
}
}
};
}
});
}
}
/**
* Set variables with Norconex options
*/
function setPage() {
page.onResourceError = function(resourceError) {
system.stderr.writeLine(resourceError.url + ': ' + resourceError.errorString);
};
if (thumbnailFile && dimension) {
var pageWidth = 1024;
var pageHeight = 768;
if (dimension) {
var size = dimension.split('x');
pageWidth = parseInt(size[0], 10) * zoomFactor;
pageHeight = parseInt(size[1], 10) * zoomFactor;
}
page.viewportSize = { width: pageWidth, height: pageHeight };
page.clipRect = { top: 0, left: 0, width: pageWidth, height: pageHeight };
}
if (thumbnailFile && zoomFactor) {
page.zoomFactor = zoomFactor;
}
if (bindId !== "-1") {
page.customHeaders = {
"collector.proxy.bindId": bindId,
"collector.proxy.protocol": protocol
};
}
if (resourceTimeout !== "-1") {
page.settings.resourceTimeout = resourceTimeout;
}
}
function runPage() {
if (typeof page === 'object') {
page.close();
}
page = webPage.create();
addCookieInfo();
setPage();
page.open(url, function(status) {
if (status !== 'success') {
system.stderr.writeLine('Unsuccessful loading of: ' + url + ' (status=' + status + ').');
system.stderr.writeLine('Content: ' + page.content);
if (page.content) {
fs.write(outfile, "error", 'w');
}
phantom.exit();
}
else {
if (phantom.state === 'run-state') {
window.setTimeout(function() {
if (thumbnailFile) {
page.render(thumbnailFile);
}
if (page.content) {
fs.write(outfile, page.content, 'w');
}
// page.render("test_page.png");
phantom.exit();
}, timeout);
}
}
});
page.onResourceReceived = function(response) {
if (response.stage == 'end'){
return;
}
if (response.url == url) {
if (response.status == 403) {
phantom.state = 'no-session';
}
else {
phantom.state = 'run-state';
response.headers.forEach(function(header){
system.stdout.writeLine('HEADER:' + header.name + '=' + header.value);
});
system.stdout.writeLine('STATUS:' + response.status);
system.stdout.writeLine('STATUSTEXT:' + response.statusText);
system.stdout.writeLine('CONTENTTYPE:' + response.contentType);
}
}
};
page.onLoadFinished = function(status) {
if (status === 'success') {
if (phantom.state == 'no-session') {
removeCookies();
setTimeout(runLogin, 500);
}
}
};
}
if (!fs.isFile(cookie)) {
runLogin();
}
else {
runPage();
} |
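The script above is easier to customize once its `phantom.state` transitions are seen in isolation. The sketch below restates them as a pure function that runs outside PhantomJS. The function `nextState` and the event names `response` and `login-loaded` are hypothetical labels for the `page.onResourceReceived` and `loginPage.onLoadFinished` handlers, not part of PhantomJS or Norconex.

```javascript
// Hypothetical restatement of the script's state transitions.
// States used by the script: 'no-state', 'no-session', 'run-state'.
function nextState(current, event) {
  if (event.type === 'response') {
    // Mirrors page.onResourceReceived for the target URL:
    // a 403 means the stored session is not valid, anything else proceeds.
    return event.status === 403 ? 'no-session' : 'run-state';
  }
  if (event.type === 'login-loaded') {
    // Mirrors loginPage.onLoadFinished on success: from 'no-state' or
    // 'no-session' the cookies are saved and the page is fetched again.
    if (!current || current === 'no-state' || current === 'no-session') {
      return 'run-state';
    }
  }
  return current; // any other event leaves the state unchanged
}

// Walk-through: expired session, re-login, then a normal fetch.
var trace = [];
var state = 'no-state';
[{ type: 'response', status: 403 },
 { type: 'login-loaded' },
 { type: 'response', status: 200 }].forEach(function (event) {
  state = nextState(state, event);
  trace.push(state);
});
console.log(trace.join(' -> '));
```

In the real script the `no-session` state also triggers side effects (removing the cookie file and calling `runLogin`), which this sketch deliberately leaves out.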
Thanks for the logic. How do I ensure the JS is invoked so that I can start adding the logic for redirection? |
Hi,
I am not able to crawl the redirected URL. I need to crawl a URL referenced in a page, but it is being redirected to another site.
I have attached a snippet from the log, which shows that the crawl state is REDIRECT and the status code is 302, as shown below:
crawlState=REDIRECT, statusCode=302, reasonPhrase=Found
Please have a look and let us know what could be the reason for this.
Regards,
Sam
redirect log.txt