SiteMinder Authentication #218

Open
shubhamsamy opened this issue Jan 12, 2016 · 19 comments

Comments

@shubhamsamy

Hi,
I am not able to crawl a redirected URL. I need to crawl a reference URL in the page, which redirects to another site.
I have attached a snippet from the log showing that the crawl state is 'REDIRECT' and the status code is 302, as shown below:
crawlState=REDIRECT, statusCode=302, reasonPhrase=Found

Please have a look and let us know what the reason could be.

Regards,
Sam
redirect log.txt

@jetnet

jetnet commented Jan 12, 2016

try this:

    <metadataFetcher class="${metaFetcher}" >
      <validStatusCodes>200,301,302</validStatusCodes>
    </metadataFetcher>

@shubhamsamy
Author

Hi,
I have tried adding the valid status codes, but I am still getting the same error.
I am attaching the configuration. I have removed site names etc., as these are intranet sites.
Please have a look and let me know if there is anything missing in my configuration.
Thanks & Regards,
Sam
crawler.txt

@essiembre
Contributor

Without a way to reproduce it, it is hard for me to comment, but looking at your log snippet, nothing indicates the redirect is not being crawled. Understanding these two lines from your log may help:

CrawlerIbnstance1: 2016-01-12 11:46:55 INFO -       REJECTED_REDIRECTED: http://abc.xyz.com/Download?docid=123&Status=FREE (Subject: HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4)])
CrawlerIbnstance1: 2016-01-12 11:46:55 DEBUG - Queued for processing: http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4

REJECTED_REDIRECTED means the original URL is being dropped in favor of the target URL. That target URL will be crawled unless it is rejected by some other rules you have defined in your config.
The line saying "Queued for processing: ..." tells you the target URL will be processed.

Further down in your logs, you should find indications of whether it was indeed processed and, if it was not, the reasons why.
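
To make the reading of those two log lines concrete, here is a minimal sketch that pairs each REJECTED_REDIRECTED entry with its target and checks whether that target was queued afterwards. The `traceRedirects` helper and its regular expressions are hypothetical (they are not part of Norconex) and simply assume the log format shown in the snippet above.

```javascript
// Hypothetical helper, not part of Norconex: pairs each REJECTED_REDIRECTED
// source URL with its redirect target and reports whether the target was queued.
function traceRedirects(logLines) {
  // Assumes the log format quoted in this thread.
  var redirectPattern = /REJECTED_REDIRECTED: (\S+) \(Subject: .*\((\S+)\)\]\)/;
  var queuedPattern = /Queued for processing: (\S+)/;
  var redirects = [];
  var queued = {};

  logLines.forEach(function (line) {
    var match = redirectPattern.exec(line);
    if (match) {
      redirects.push({ source: match[1], target: match[2] });
      return;
    }
    match = queuedPattern.exec(line);
    if (match) {
      queued[match[1]] = true;
    }
  });

  return redirects.map(function (r) {
    return {
      source: r.source,
      target: r.target,
      targetQueued: queued[r.target] === true
    };
  });
}

// The two lines quoted above, used as sample input.
var sampleLog = [
  'CrawlerIbnstance1: 2016-01-12 11:46:55 INFO -       REJECTED_REDIRECTED: http://abc.xyz.com/Download?docid=123&Status=FREE (Subject: HttpFetchResponse [crawlState=REDIRECT, statusCode=302, reasonPhrase=Found (http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4)])',
  'CrawlerIbnstance1: 2016-01-12 11:46:55 DEBUG - Queued for processing: http://def.mno.com/get?docid=123&Lang=EN&Rev=T&Format=PDFV1R4'
];
var report = traceRedirects(sampleLog);
console.log(JSON.stringify(report, null, 2));
```

A redirect whose `targetQueued` is `false` would point to a rule in the config rejecting the target before it is queued. A redirect whose target is queued but never fetched points to a problem later on, such as authentication.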

Does this help?

@shubhamsamy
Author

Hi Pascal,
Thanks for your input. The page is queued but never gets crawled, because it redirects to a site that uses SiteMinder authentication. Please let us know if there is a plan to add support for SiteMinder authentication.
Thanks & Regards,
Sam

@essiembre
Contributor

Do you by any chance have or know of a public login form with a test/demo account we can use to start on this?

@shubhamsamy
Author

Hi Pascal,
I am sorry, but these are intranet sites and are not available externally.
Regards,
Praveen

@essiembre
Contributor

I am marking this as a feature request.

It will likely remain open until we can get our hands on a public SiteMinder site we can use for testing/implementing this.

You can always contact Norconex to have someone work on your intranet to put this in place.

@essiembre changed the title from "URL Redirect is not working" to "SiteMinder Authentication" on Jan 12, 2017
@akshaybijawe

Hi Pascal, do you have any update regarding SiteMinder Authentication? Thanks.

@essiembre
Contributor

Hello Akshay, No update. Do you have a SiteMinder site with temporary access so we can give it a try?

@Krishna210414

Hi, is there any update on this? Will it be possible to crawl SiteMinder-authenticated URLs? I am facing the same issue, and any help would be greatly appreciated.

@essiembre
Contributor

@Krishna210414 , the issue is the same: we need a sample SiteMinder site we can use as a test. You got one you can share?

@Krishna210414

Nope, I don't have one to share on the forum. But I want to know: is there a specific setting that would let me crawl the redirected URL?

@Krishna210414

It would be helpful if you could show how to pass the target URL as a parameter to the authentication URL.
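
As a rough illustration of what is being asked here, the sketch below appends the protected URL to a login URL as an encoded query parameter. `buildLoginUrl` is a hypothetical helper, and the parameter name `TARGET` is only an assumption: the name and encoding your login page expects depend on how the site is deployed, so check the redirect your browser actually receives.

```javascript
// Hypothetical helper: appends the protected (target) URL to the login URL
// as a URL-encoded query parameter.
// "TARGET" is an assumed default name; adjust it to what your login page expects.
function buildLoginUrl(loginUrl, targetUrl, paramName) {
  var name = paramName || 'TARGET';
  // Reuse an existing query string if the login URL already has one.
  var separator = loginUrl.indexOf('?') === -1 ? '?' : '&';
  return loginUrl + separator +
    encodeURIComponent(name) + '=' + encodeURIComponent(targetUrl);
}

var exampleLoginUrl = buildLoginUrl(
  'https://example.com/login',
  'http://def.mno.com/get?docid=123&Lang=EN'
);
console.log(exampleLoginUrl);
```

The function uses only string operations and `encodeURIComponent`, so it should also work inside a PhantomJS script, for instance to compute the URL that the login page is opened with.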

@wolverline

wolverline commented Jul 27, 2018

@shubhamsamy In my experience, the httpClientFactory login has limited capabilities, which is understandable given how many different auth methods exist, including federated login. I tried it with sites built on Drupal, which use regular form auth. If it doesn't work with your intranet, it is probably because the login hops across several pages to authenticate. In that case, PhantomJS seems to be the best bet with Norconex for now. I was able to crawl through both FORM and SAML auth. PhantomJS is a headless browser after all, but it takes a fair amount of hacking, especially for SAML auth.
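
To illustrate what "hopping pages" means in practice, here is a small sketch that walks a set of recorded responses and lists every hop from the original URL to the page it finally lands on. The `followHops` helper, the shape of the `responses` object, and the `sso.example.com` URL are all hypothetical. In a real investigation you would fill the map from browser developer tools or crawler logs.

```javascript
// Hypothetical helper: follows recorded redirects (keyed by URL) from startUrl
// and returns the list of URLs visited, stopping after maxHops redirects.
function followHops(responses, startUrl, maxHops) {
  var limit = maxHops || 10;
  var hops = [startUrl];
  var current = startUrl;

  while (hops.length <= limit) {
    var known = Object.prototype.hasOwnProperty.call(responses, current);
    var response = known ? responses[current] : null;
    var isRedirect = response !== null &&
      response.status >= 300 && response.status < 400;
    if (!isRedirect || !response.location) {
      break; // final page reached, or nothing recorded for this URL
    }
    current = response.location;
    hops.push(current);
  }
  return hops;
}

// Made-up recording: the document redirects to another host, which in turn
// redirects to a login page.
var recorded = {
  'http://abc.xyz.com/Download?docid=123': { status: 302, location: 'http://def.mno.com/get?docid=123' },
  'http://def.mno.com/get?docid=123': { status: 302, location: 'https://sso.example.com/login' },
  'https://sso.example.com/login': { status: 200 }
};
var hopList = followHops(recorded, 'http://abc.xyz.com/Download?docid=123');
console.log(hopList.join(' -> '));
```

If the chain ends on a login page rather than on the document, a single form submission may not be enough, which is where a headless browser that carries cookies across the hops helps.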

@Krishna210414

Thanks for the reply. Can you share your logic?

@wolverline

@Krishna210414 I am not sure if you have the same issue as @shubhamsamy does. If you're dealing with httpClientFactory, have you tried the following config?

<metadataFetcher class="$metaFetcher">
  <validStatusCodes>200,302,403</validStatusCodes>
</metadataFetcher>

@Krishna210414

I tried that, but it didn't work.

@wolverline

wolverline commented Jul 27, 2018

If you're trying to do Form Auth, you can configure:

<httpcollector id="My Collector">
  <crawlers>
    <crawler id="$crawler-id">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>$crawler-url</url>
      </startURLs>
   
      <metadataFetcher class="$metaFetcher">
        <validStatusCodes>200,302,403</validStatusCodes>
      </metadataFetcher>

      <documentFetcher class="${http}.fetch.impl.PhantomJSDocumentFetcher"
        detectContentType="true" detectCharset="true" screenshotEnabled="true">
        <exePath>${run-path}</exePath>
        <scriptPath>${script-path}</scriptPath>
        <resourceTimeout>5000</resourceTimeout>
        <validStatusCodes>200,302,403</validStatusCodes>
        <notFoundStatusCodes>404</notFoundStatusCodes>
        <referencePattern>^https://.*</referencePattern>
        <renderWaitTime>3000</renderWaitTime>
        <screenshotDimensions>600x400</screenshotDimensions>
        <screenshotZoomFactor>0.25</screenshotZoomFactor>
        <screenshotScaleDimensions>300</screenshotScaleDimensions>
        <screenshotScaleStretch>false</screenshotScaleStretch>
        <screenshotScaleQuality>medium</screenshotScaleQuality>
        <screenshotImageFormat>png</screenshotImageFormat>
        <screenshotStorage>disk</screenshotStorage>
        <screenshotStorageDiskDir structure="url2path">${workdir}/screenshot</screenshotStorageDiskDir>
        <screenshotStorageDiskField>dummy</screenshotStorageDiskField>
      </documentFetcher>
......
    </crawler>
  </crawlers>
</httpcollector>

Then add the following JS file. The code attempts form auth when the existing session cookie is not valid. Note that the current documentFetcher version cannot pass custom arguments to the script, so the configuration has to be defined within the JS file. The following code is a working example, but the form auth method on your site may differ from the sites I work with (Drupal sites in my case), so chances are you will have to customize, test, and debug further.

/**
 * This file is used and is required by the PhantomJSDocumentFetcher.  
 * Modifying this file could break PhantomJSDocumentFetcher behavior.
 */
var webPage = require('webpage');
var page;
var loginPage;
var fs = require('fs');
var system = require('system');

// Phantomjs global config
phantom.cookiesEnabled = true;
phantom.javascriptEnabled = true;
phantom.state = 'no-state';

//#############
// Local config
// ############
var loginAttempt = 0;
var userName = "username";
var userPass = "password";
var workDir  = '/path/to/work/dir';
// Define session cookie file
// in order for PhantomJS to keep a session alive
// make it sure to be writable
var cookie = workDir + '/cookies/cookie.json';
var loginUrl  = 'https://example.com/login'; // site login link where a login form presents
var logoutUrl = 'https://example.com/logout'; // site logout link

if (system.args.length !== 10) {
  system.stderr.writeLine('Invalid number of arguments.');
  phantom.exit(1);
}

var url = system.args[1];           // The URL to fetch
var outfile = system.args[2];       // The temp output file
var timeout = system.args[3];       // How long to wait for the whole page to render
var bindId = system.args[4];        // HttpClient binding id
var protocol = system.args[5];      // Was the original URL "https" or "http"?
var thumbnailFile = system.args[6]; // Optional path to image file
var dimension = system.args[7];     // e.g. 1024x768
var zoomFactor = system.args[8];    // e.g. 0.25 (25%)
var resourceTimeout = system.args[9]; // timeout for a single page resource

var addCookieInfo = function() {
  Array.prototype.forEach.call(JSON.parse(fs.read(cookie)), function(param) {
    phantom.addCookie(param);
  });
};

var removeCookies = function() {
  if (fs.exists(cookie)) {
    fs.remove(cookie);
  }
  if (typeof loginPage === 'object') {
    loginPage.close();
  }
  loginPage = webPage.create();
  loginPage.open(logoutUrl, function(status) {
    if (status === "success") {}
  });
}

function runLogin() {
  if (typeof loginPage === 'object') {
    loginPage.close();
  }
  if (loginAttempt > 2) {
    system.stderr.writeLine('Reached max login attempt.');
    phantom.exit();
  }
  else {
    loginAttempt++;
    loginPage = webPage.create();
    loginPage.open(loginUrl, function(status) {
      if (status === "success") {
        // system.stderr.writeLine('Form auth started.');
        /**
         * #############################################
         * NOTE: Login Form
         * Customize for UserID, Password, and Form fields
         * Or rewrite to pass each objects to this function
         * #############################################
         */
        loginPage.evaluate(function(uname, upass) {
          document.getElementById("username").value = uname;
          document.getElementById("userpass").value = upass;
          document.getElementById("loginform").submit();
          //docForm = document.getElementsByTagName("form");
          //docForm[0].submit();
        }, userName, userPass);

        loginPage.onLoadFinished = function(status) { 
          if (status === 'success') {
            if (!phantom.state || phantom.state == 'no-state') {
              phantom.state = 'no-session';
            }
            if (phantom.state === 'no-session') {
              fs.write(cookie, JSON.stringify(phantom.cookies), "w");
              phantom.state = 'run-state';
              setTimeout(runPage, 500);
            }
          }
        };
      }
    });
  }
}

/**
 * Set variables with Norconex options
 */
function setPage() {
  page.onResourceError = function(resourceError) {
    system.stderr.writeLine(resourceError.url + ': ' + resourceError.errorString);
  };
  if (thumbnailFile && dimension) {
    var pageWidth = 1024;
    var pageHeight = 768;
    if (dimension) {
      var size = dimension.split('x');
      pageWidth = parseInt(size[0], 10) * zoomFactor;
      pageHeight = parseInt(size[1], 10) * zoomFactor;
    }
    page.viewportSize = { width: pageWidth, height: pageHeight };
    page.clipRect = { top: 0, left: 0, width: pageWidth, height: pageHeight };
  }
  if (thumbnailFile && zoomFactor) {
    page.zoomFactor = zoomFactor;
  }

  if (bindId !== "-1") {
    page.customHeaders = {
      "collector.proxy.bindId": bindId,
      "collector.proxy.protocol": protocol
    };
  }
  if (resourceTimeout !== "-1") {
    page.settings.resourceTimeout = resourceTimeout;
  }
}

function runPage() {
  if (typeof page === 'object') {
    page.close();
  } 
  page = webPage.create();
  addCookieInfo();
  setPage();
  page.open(url, function(status) {
    if (status !== 'success') {
      system.stderr.writeLine('Unsuccessful loading of: ' + url + ' (status=' + status + ').');
      system.stderr.writeLine('Content: ' + page.content);
      if (page.content) {
        fs.write(outfile, "error", 'w');
      }
      phantom.exit();
    }
    else {
      if (phantom.state === 'run-state') {
        window.setTimeout(function() {
          if (thumbnailFile) {
            page.render(thumbnailFile);
          }
          if (page.content) {
            fs.write(outfile, page.content, 'w');
          }
          // page.render("test_page.png");
          phantom.exit();
        }, timeout);
      }

    }   
  });

  page.onResourceReceived = function(response) {  
    if (response.stage == 'end'){
      return;
    }
    if (response.url == url) {
      if (response.status == 403) {
        phantom.state = 'no-session';

      }
      else {
        phantom.state = 'run-state';
        response.headers.forEach(function(header){
          system.stdout.writeLine('HEADER:' + header.name + '=' + header.value);
        });
        system.stdout.writeLine('STATUS:' + response.status);
        system.stdout.writeLine('STATUSTEXT:' + response.statusText);
        system.stdout.writeLine('CONTENTTYPE:' + response.contentType);
      }
    }
  };
  
  page.onLoadFinished = function(status) {
    if (status === 'success') {
      if (phantom.state == 'no-session') {
        removeCookies();
        setTimeout(runLogin, 500);
      }
    }
  };
}

if (!fs.isFile(cookie)) {
  runLogin();
}
else {
  runPage();
}
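
One fragile spot in a script like this is reading the saved cookie file: a corrupt or stale file would either throw in `JSON.parse` or replay cookies that have already expired. The sketch below is a hypothetical helper (`loadUsableCookies` is not part of the script above or of PhantomJS) that returns only the cookies still worth adding. It assumes each saved cookie may carry a numeric `expiry` field in seconds since the epoch, so verify that against the contents of your own cookie file.

```javascript
// Hypothetical helper: parses saved cookie JSON and drops expired entries.
// Assumption: a cookie may have a numeric "expiry" field (seconds since epoch);
// cookies without one are treated as session cookies and kept.
function loadUsableCookies(jsonText, nowMs) {
  var parsed;
  try {
    parsed = JSON.parse(jsonText);
  } catch (e) {
    return []; // unreadable content: behave as if there is no saved session
  }
  if (!Array.isArray(parsed)) {
    return [];
  }
  var nowSeconds = Math.floor(nowMs / 1000);
  return parsed.filter(function (cookie) {
    if (!cookie || typeof cookie !== 'object') {
      return false;
    }
    return typeof cookie.expiry !== 'number' || cookie.expiry > nowSeconds;
  });
}

// Sample data: one expired cookie, one valid cookie, one session cookie.
var savedJson = JSON.stringify([
  { name: 'old', value: '1', expiry: 100 },
  { name: 'fresh', value: '2', expiry: 5000 },
  { name: 'session', value: '3' }
]);
var usableCookies = loadUsableCookies(savedJson, 1000 * 1000);
console.log(usableCookies.map(function (c) { return c.name; }).join(','));
```

In the script above, something like this could be applied to the text returned by `fs.read(cookie)` before the cookies are added, with an empty result treated as a reason to log in again.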

@Krishna210414

Thanks for the logic. How do I ensure the JS is invoked, so that I can start adding the logic for redirection?

6 participants