Adds flag to allow katana to use an existing Chrome instance #490

Merged 4 commits on Jun 28, 2023
39 changes: 38 additions & 1 deletion README.md
@@ -125,7 +125,7 @@ CONFIGURATION:
-mrs, -max-response-size int maximum response size to read (default 9223372036854775807)
-timeout int time to wait for request in seconds (default 10)
-aff, -automatic-form-fill enable automatic form filling (experimental)
-fx, -form-extraction enable extraction of form, input, textarea & select elements
-fx, -form-extraction enable extraction of form, input, textarea & select elements
-retry int number of times to retry the request (default 1)
-proxy string http/socks5 proxy to use
-H, -headers string[] custom header/cookie to include in all http request in header:value format (file)
@@ -148,6 +148,7 @@ HEADLESS:
-cdd, -chrome-data-dir string path to store chrome browser data
-scp, -system-chrome-path string use specified chrome browser for headless crawling
-noi, -no-incognito start headless chrome without incognito mode
-cwu, -chrome-ws-url string use chrome browser instance launched elsewhere with the debugger listening at this URL
-xhr, -xhr-extraction extract xhr requests

SCOPE:
@@ -311,6 +312,7 @@ HEADLESS:
-cdd, -chrome-data-dir string path to store chrome browser data
-scp, -system-chrome-path string use specified chrome browser for headless crawling
-noi, -no-incognito start headless chrome without incognito mode
-cwu, -chrome-ws-url string use chrome browser instance launched elsewhere with the debugger listening at this URL
-xhr, -xhr-extraction extract xhr requests
```

@@ -548,6 +550,41 @@ CONFIGURATION:
-s, -strategy string Visit strategy (depth-first, breadth-first) (default "depth-first")
```

### Connecting to Active Browser Session

Katana can also connect to an active browser session in which the user is already logged in and authenticated, and use it for crawling. The only requirement is to start the browser with remote debugging enabled.

Here is an example of starting the Chrome browser with remote debugging enabled and using it with katana:

**step 1) Locate the path of the Chrome executable**

| Operating System | Chromium Executable Location | Google Chrome Executable Location |
|------------------|------------------------------|-----------------------------------|
| Windows (64-bit) | `C:\Program Files (x86)\Google\Chromium\Application\chrome.exe` | `C:\Program Files (x86)\Google\Chrome\Application\chrome.exe` |
| Windows (32-bit) | `C:\Program Files\Google\Chromium\Application\chrome.exe` | `C:\Program Files\Google\Chrome\Application\chrome.exe` |
| macOS | `/Applications/Chromium.app/Contents/MacOS/Chromium` | `/Applications/Google Chrome.app/Contents/MacOS/Google Chrome` |
| Linux | `/usr/bin/chromium` | `/usr/bin/google-chrome` |

**step 2) Start Chrome with remote debugging enabled; on startup it prints the websocket URL. For example, on macOS, you can start Chrome with remote debugging enabled using the following command:**

```console
$ /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222


DevTools listening on ws://127.0.0.1:9222/devtools/browser/c5316c9c-19d6-42dc-847a-41d1aeebf7d6
```

> Now log in to the website you want to crawl and keep the browser open.
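
If the `DevTools listening on ...` line has scrolled out of view, the same websocket URL can usually be recovered from the browser's HTTP debugging endpoint: Chrome serves a small JSON document at `/json/version` on the remote debugging port, with the URL in its `webSocketDebuggerUrl` field. The following is a minimal sketch and not part of katana; the helper names `parseDebuggerURL` and `fetchDebuggerURL` are hypothetical, and a local stub server stands in for Chrome so that the example runs without a browser.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// parseDebuggerURL extracts the webSocketDebuggerUrl field from the JSON
// document served at /json/version. (Hypothetical helper, not katana code.)
func parseDebuggerURL(body []byte) (string, error) {
	var v struct {
		WebSocketDebuggerURL string `json:"webSocketDebuggerUrl"`
	}
	if err := json.Unmarshal(body, &v); err != nil {
		return "", err
	}
	if v.WebSocketDebuggerURL == "" {
		return "", errors.New("webSocketDebuggerUrl not found in response")
	}
	return v.WebSocketDebuggerURL, nil
}

// fetchDebuggerURL queries <base>/json/version; with a real browser the base
// would be something like "http://127.0.0.1:9222".
func fetchDebuggerURL(base string) (string, error) {
	resp, err := http.Get(strings.TrimRight(base, "/") + "/json/version")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return parseDebuggerURL(body)
}

func main() {
	// Stub server standing in for Chrome's debugging endpoint (assumption for the demo).
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"Browser":"Chrome","webSocketDebuggerUrl":"ws://127.0.0.1:9222/devtools/browser/c5316c9c-19d6-42dc-847a-41d1aeebf7d6"}`)
	}))
	defer stub.Close()

	wsURL, err := fetchDebuggerURL(stub.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(wsURL)
}
```

Against a real browser, replace `stub.URL` with the address of the debugging port and pass the printed value to `-cwu`.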

**step 3) Use the websocket URL with katana to connect to the active browser session and crawl the website**

```console
katana -headless -u https://tesla.com -cwu ws://127.0.0.1:9222/devtools/browser/c5316c9c-19d6-42dc-847a-41d1aeebf7d6 -no-incognito
```

> **Note**: you can use the `-cdd` option to specify a custom Chrome data directory for storing browser data and cookies, but this does not preserve session data if a cookie is set to `Session` only or expires after a certain time.


## Filters

*`-field`*
1 change: 1 addition & 0 deletions cmd/katana/main.go
@@ -100,6 +100,7 @@ pipelines offering both headless and non-headless crawling.`)
flagSet.StringVarP(&options.ChromeDataDir, "chrome-data-dir", "cdd", "", "path to store chrome browser data"),
flagSet.StringVarP(&options.SystemChromePath, "system-chrome-path", "scp", "", "use specified chrome browser for headless crawling"),
flagSet.BoolVarP(&options.HeadlessNoIncognito, "no-incognito", "noi", false, "start headless chrome without incognito mode"),
flagSet.StringVarP(&options.ChromeWSUrl, "chrome-ws-url", "cwu", "", "use chrome browser instance launched elsewhere with the debugger listening at this URL"),
flagSet.BoolVarP(&options.XhrExtraction, "xhr-extraction", "xhr", false, "extract xhr requests"),
)
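
The new option is registered under both a long name (`chrome-ws-url`) and a short alias (`cwu`), bound to the same field. As a rough illustration of that pattern, the sketch below uses Go's standard `flag` package as a stand-in; it is not the flag library katana uses, and `options` and `parseArgs` are hypothetical names.

```go
package main

import (
	"flag"
	"fmt"
)

// options is a hypothetical, cut-down stand-in for katana's Options struct.
type options struct {
	ChromeWSUrl string
}

// parseArgs binds a long and a short flag name to the same field.
func parseArgs(args []string) (*options, error) {
	opts := &options{}
	fs := flag.NewFlagSet("katana-sketch", flag.ContinueOnError)
	const usage = "use chrome browser instance launched elsewhere with the debugger listening at this URL"
	fs.StringVar(&opts.ChromeWSUrl, "chrome-ws-url", "", usage)
	fs.StringVar(&opts.ChromeWSUrl, "cwu", "", usage)
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return opts, nil
}

func main() {
	opts, err := parseArgs([]string{"-cwu", "ws://127.0.0.1:9222/devtools/browser/example"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("attach to:", opts.ChromeWSUrl)
}
```

An empty value keeps the previous behaviour, in which katana launches its own browser.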

117 changes: 70 additions & 47 deletions pkg/engine/hybrid/hybrid.go
@@ -41,67 +41,37 @@ func New(options *types.CrawlerOptions) (*Crawler, error) {

previousPIDs := findChromeProcesses()

chromeLauncher := launcher.New().
Leakless(false).
Set("disable-gpu", "true").
Set("ignore-certificate-errors", "true").
Set("ignore-certificate-errors", "1").
Set("disable-crash-reporter", "true").
Set("disable-notifications", "true").
Set("hide-scrollbars", "true").
Set("window-size", fmt.Sprintf("%d,%d", 1080, 1920)).
Set("mute-audio", "true").
Delete("use-mock-keychain").
UserDataDir(dataStore)
var launcherURL string
var chromeLauncher *launcher.Launcher

if options.Options.UseInstalledChrome {
if chromePath, hasChrome := launcher.LookPath(); hasChrome {
chromeLauncher.Bin(chromePath)
} else {
return nil, errorutil.NewWithTag("hybrid", "the chrome browser is not installed").WithLevel(errorutil.Fatal)
}
}
if options.Options.SystemChromePath != "" {
chromeLauncher.Bin(options.Options.SystemChromePath)
}

if options.Options.ShowBrowser {
chromeLauncher = chromeLauncher.Headless(false)
if options.Options.ChromeWSUrl != "" {
launcherURL = options.Options.ChromeWSUrl
} else {
chromeLauncher = chromeLauncher.Headless(true)
}

if options.Options.HeadlessNoSandbox {
chromeLauncher.Set("no-sandbox", "true")
}

if options.Options.Proxy != "" && options.Options.Headless {
proxyURL, err := urlutil.Parse(options.Options.Proxy)
// create new chrome launcher instance
chromeLauncher, err = buildChromeLauncher(options, dataStore)
if err != nil {
return nil, err
}
chromeLauncher.Set("proxy-server", proxyURL.String())
}

for k, v := range options.Options.ParseHeadlessOptionalArguments() {
chromeLauncher.Set(flags.Flag(k), v)
}

launcherURL, err := chromeLauncher.Launch()
if err != nil {
return nil, err
// launch chrome headless process
launcherURL, err = chromeLauncher.Launch()
if err != nil {
return nil, err
}
}

browser := rod.New().ControlURL(launcherURL)
if browserErr := browser.Connect(); browserErr != nil {
return nil, browserErr
return nil, errorutil.NewWithErr(browserErr).Msgf("failed to connect to chrome instance at %s", launcherURL)
}

// create a new browser instance (default to incognito mode)
if !options.Options.HeadlessNoIncognito {
incognito, err := browser.Incognito()
if err != nil {
chromeLauncher.Kill()
if chromeLauncher != nil {
chromeLauncher.Kill()
}
return nil, errorutil.NewWithErr(err).Msgf("failed to create incognito browser")
}
browser = incognito
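
Because removed and added lines are interleaved in this hunk, the resulting control flow in `New` may be easier to follow in isolation: when a websocket URL is supplied it is used directly as the control URL and no launcher is created, so later cleanup has to tolerate a nil launcher. The following is a simplified sketch using only the standard library; `fakeLauncher` and `resolveControlURL` are hypothetical stand-ins for rod's launcher and the inline logic, not the actual katana code.

```go
package main

import "fmt"

// fakeLauncher is a hypothetical stand-in for rod's *launcher.Launcher.
type fakeLauncher struct{ killed bool }

// Launch pretends to start a browser and returns its control URL.
func (l *fakeLauncher) Launch() (string, error) {
	return "ws://127.0.0.1:34567/devtools/browser/launched", nil
}

// Kill pretends to terminate the launched browser process.
func (l *fakeLauncher) Kill() { l.killed = true }

// resolveControlURL returns the URL to connect to and the launcher that owns
// the browser process. The launcher is nil when attaching to an existing instance.
func resolveControlURL(chromeWSUrl string) (string, *fakeLauncher, error) {
	if chromeWSUrl != "" {
		return chromeWSUrl, nil, nil
	}
	l := &fakeLauncher{}
	u, err := l.Launch()
	if err != nil {
		return "", nil, err
	}
	return u, l, nil
}

func main() {
	for _, ws := range []string{"", "ws://127.0.0.1:9222/devtools/browser/existing"} {
		controlURL, l, err := resolveControlURL(ws)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("control URL: %s, launched by us: %t\n", controlURL, l != nil)
		// Cleanup guards against the nil launcher, mirroring the nil check in the diff.
		if l != nil {
			l.Kill()
		}
	}
}
```

Returning the launcher alongside the URL is one way to make ownership explicit; the diff achieves the same effect with a `chromeLauncher != nil` check before `Kill()`.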
@@ -124,8 +94,10 @@ func New(options *types.CrawlerOptions) (*Crawler, error) {

// Close closes the crawler process
func (c *Crawler) Close() error {
if err := c.browser.Close(); err != nil {
return err
if c.Options.Options.ChromeWSUrl == "" {
if err := c.browser.Close(); err != nil {
return err
}
}
if c.Options.Options.ChromeDataDir == "" {
if err := os.RemoveAll(c.tempDir); err != nil {
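
The change to `Close` encodes an ownership rule: katana closes the browser only when it launched it, and leaves an attached instance (one supplied via `-cwu`) running. A minimal sketch of that rule follows, with a hypothetical `session` type standing in for the crawler and its rod browser handle.

```go
package main

import "fmt"

// session is a hypothetical stand-in for the crawler's browser handle.
type session struct {
	chromeWSUrl string // non-empty when attached to an existing browser
	closed      bool
}

// Close shuts the browser down only if this process launched it.
func (s *session) Close() error {
	if s.chromeWSUrl == "" {
		s.closed = true // stands in for browser.Close()
	}
	return nil
}

func main() {
	launched := &session{}
	attached := &session{chromeWSUrl: "ws://127.0.0.1:9222/devtools/browser/existing"}
	_ = launched.Close()
	_ = attached.Close()
	fmt.Println("launched browser closed:", launched.closed)
	fmt.Println("attached browser closed:", attached.closed)
}
```

Running the sketch reports `true` for the launched session and `false` for the attached one, so a logged-in browser stays open after the crawl.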
@@ -151,6 +123,57 @@ func (c *Crawler) Crawl(rootURL string) error {
return nil
}

// buildChromeLauncher builds a new chrome launcher instance
func buildChromeLauncher(options *types.CrawlerOptions, dataStore string) (*launcher.Launcher, error) {
chromeLauncher := launcher.New().
Leakless(false).
Set("disable-gpu", "true").
Set("ignore-certificate-errors", "true").
Set("ignore-certificate-errors", "1").
Set("disable-crash-reporter", "true").
Set("disable-notifications", "true").
Set("hide-scrollbars", "true").
Set("window-size", fmt.Sprintf("%d,%d", 1080, 1920)).
Set("mute-audio", "true").
Delete("use-mock-keychain").
UserDataDir(dataStore)

if options.Options.UseInstalledChrome {
if chromePath, hasChrome := launcher.LookPath(); hasChrome {
chromeLauncher.Bin(chromePath)
} else {
return nil, errorutil.NewWithTag("hybrid", "the chrome browser is not installed").WithLevel(errorutil.Fatal)
}
}
if options.Options.SystemChromePath != "" {
chromeLauncher.Bin(options.Options.SystemChromePath)
}

if options.Options.ShowBrowser {
chromeLauncher = chromeLauncher.Headless(false)
} else {
chromeLauncher = chromeLauncher.Headless(true)
}

if options.Options.HeadlessNoSandbox {
chromeLauncher.Set("no-sandbox", "true")
}

if options.Options.Proxy != "" && options.Options.Headless {
proxyURL, err := urlutil.Parse(options.Options.Proxy)
if err != nil {
return nil, err
}
chromeLauncher.Set("proxy-server", proxyURL.String())
}

for k, v := range options.Options.ParseHeadlessOptionalArguments() {
chromeLauncher.Set(flags.Flag(k), v)
}

return chromeLauncher, nil
}

// killChromeProcesses any and all new chrome processes started after
// headless process launch.
func (c *Crawler) killChromeProcesses() error {
2 changes: 2 additions & 0 deletions pkg/types/options.go
@@ -94,6 +94,8 @@ type Options struct {
HeadlessNoSandbox bool
// SystemChromePath : Specify the chrome binary path for headless crawling
SystemChromePath string
// ChromeWSUrl : Specify the Chrome debugger websocket url for a running Chrome instance to attach to
ChromeWSUrl string
// OnResult allows callback function on a result
OnResult OnResultCallback
// StoreResponse specifies if katana should store http requests/responses