Cangrejo

TODO: Write a gem description

Installation

Add this line to your application's Gemfile:

gem 'cangrejo'

And then execute:

$ bundle

Or install it yourself as:

$ gem install cangrejo

Usage

Configuration

Cangrejo.configure do |config|
  config.set_crawler_cache_path 'tmp' # make sure this path exists!
  config.set_temp_path '/tmp/crawler_cache' # make sure this path exists!

  if Rails.env.development?
    # Override crawler configurations, more on this later
    config.set_crawler_setup_for 'platanus/some-crawler', {
      path: '/path/to/crawler',
      git_remote: 'git://crawler/repo',
      git_commit: 'ThEcr4wl3rc0m1ty0un33d'
    }
  end
end
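
The comments in the block above ask that both paths exist before Cangrejo uses them. As a minimal sketch, the directories could be created up front with FileUtils.mkdir_p from the Ruby standard library. The helper name ensure_directories and the directory names are hypothetical placeholders, not part of the gem:

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper (not part of cangrejo): creates each directory and
# any missing parents, then returns the paths. mkdir_p does nothing when
# the directory already exists, so it is safe to call on every boot.
def ensure_directories(*dirs)
  dirs.each { |dir| FileUtils.mkdir_p(dir) }
  dirs
end

# Placeholder locations under a throwaway directory, for illustration only.
# In an application these would be the values passed to
# set_crawler_cache_path and set_temp_path.
base = Dir.mktmpdir('cangrejo-example')
ensure_directories(File.join(base, 'crawler_cache'), File.join(base, 'tmp'))
```

Calling the helper right before Cangrejo.configure keeps the path setup and the configuration in one place.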

Rails integration

When using cangrejo inside a Rails app, use the following base configuration inside an initializer (a railtie is coming soon!):

Cangrejo.configure do |config|
  config.set_temp_path Rails.root.join 'tmp'
end
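
Rails.root is a Pathname, and Pathname#join treats an argument with a leading slash as an absolute path, so it is worth checking which form the initializer passes. The short illustration below uses a placeholder root in place of Rails.root so it runs outside a Rails app:

```ruby
require 'pathname'

# Stand-in for Rails.root; the path itself is a placeholder.
root = Pathname.new('/srv/my_app')

relative = root.join('tmp')   # stays under the application root
absolute = root.join('/tmp')  # leading slash: the root part is discarded

puts relative # /srv/my_app/tmp
puts absolute # /tmp
```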

About crawler configurations

There are three ways to run a crawler:

By default, crawlers are identified by their unique URI (like platanus/demo) and run in the Crabfarm.io cloud. To do so, you will need to create an account and register the crawler repo.

Crawlers can also be run from a local repository. Just map the crawler URI to a path in the initializer:

config.set_crawler_setup_for 'org/repo', {
  path: '/path/to/crawler'
}

Crawlers can also be run from a git remote. The crawler is downloaded to the path set with config.set_crawler_cache_path and then run locally:

config.set_crawler_setup_for 'org/repo', {
  git_remote: 'git://crawler/repo',
  git_commit: 'ThEcr4wl3rc0m1ty0un33d'
}

Sessions

To communicate with crawlers you use crawling sessions. Even though you can manually build and start a session, it is recommended to use the Cangrejo.connect method, which handles the session lifecycle for you:

Cangrejo.connect 'org/repo' do |session|
  session.navigate(:front_page, param1: 'hello')
end

You can also call connect with no crawler name. In that case, connect will use the crawler that was registered first in the configuration.

Once inside a connect block, you can change the session state using navigate:

session.navigate(:front_page, param1: 'hello')

Data extracted by the last navigation is available through the doc property as an open struct:

session.doc.title
session.doc.price
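
If doc behaves like Ruby's standard OpenStruct (an assumption based on the description above), fields that were not extracted return nil instead of raising. The sketch below uses hypothetical data in place of a real crawl, and fetch_field is a hypothetical helper, not part of the gem, for cases where a missing field should fail loudly:

```ruby
require 'ostruct'

# Hypothetical extracted data; the real fields depend on the crawler.
doc = OpenStruct.new(title: 'Some product', price: 1990)

puts doc.title          # Some product
puts doc[:price]        # 1990
puts doc.stock.inspect  # nil, unknown fields do not raise

# Hypothetical strict lookup: raises KeyError when the field is absent.
def fetch_field(doc, name)
  doc.to_h.fetch(name.to_sym)
end

puts fetch_field(doc, :title) # Some product
```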

You can also create, start and stop sessions manually:

session = Cangrejo::Session.new 'org/repo'
session.navigate(:front_page, param1: 'hello')
session.release

Don't forget to release the session when you are done! Once released, the session becomes unusable.
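
When managing a session by hand, an exception raised during navigation would skip the release call. One way to guard against that is Ruby's ensure clause. This is a minimal sketch: with_released_session is a hypothetical helper, and FakeSession is a hypothetical stand-in for Cangrejo::Session so the example runs without the gem. It only assumes the session responds to navigate and release as described above:

```ruby
# Hypothetical stand-in so the sketch runs without the gem. With the gem
# loaded, a Cangrejo::Session would be passed to the helper instead.
class FakeSession
  attr_reader :released

  def initialize
    @released = false
  end

  def navigate(state, **_params)
    raise 'session already released' if @released
    "navigated to #{state}"
  end

  def release
    @released = true
  end
end

# Hypothetical helper: yields the session and releases it afterwards,
# even if the block raises.
def with_released_session(session)
  yield session
ensure
  session.release
end

session = FakeSession.new
with_released_session(session) { |s| s.navigate(:front_page, param1: 'hello') }
puts session.released # true
```

For most cases Cangrejo.connect remains the simpler option, since it already handles the session lifecycle.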

Contributing

Fork it ( https://github.com/[my-github-username]/cangrejo/fork )
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create a new Pull Request
