
LESA Crawler

A web crawler built on Elasticsearch, Kibana, the Scrapy framework, and the Splash JavaScript rendering service, running on a Docker containerized architecture, that retrieves data from LESA tickets.

Getting Started

Since the current LESA doesn't provide any sort of REST API for retrieving ticket data, I've started developing a web crawler that acquires the data through XPath queries. All retrieved data is stored in an Elasticsearch index, where it can be visualized through Kibana.
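
To give a feel for the pattern, here is a minimal, hypothetical sketch (not the project's actual spider) of a Scrapy spider that renders a page with Splash and extracts fields via XPath; the URL, selectors, and field names are purely illustrative:

# A minimal, hypothetical sketch -- not the project's actual spider.
import scrapy
from scrapy_splash import SplashRequest

class TicketSpider(scrapy.Spider):
    name = 'ticket'

    def start_requests(self):
        # Splash renders the JavaScript-heavy page before Scrapy parses it.
        yield SplashRequest(
            'https://example.com/ticket/LPP-12345',  # hypothetical URL
            callback=self.parse,
            args={'wait': 2},
        )

    def parse(self, response):
        # Hypothetical XPath queries; real selectors depend on LESA's markup.
        yield {
            'summary': response.xpath('//h1[@class="summary"]/text()').get(),
            'status': response.xpath('//span[@class="status"]/text()').get(),
        }

Scrapy then hands the yielded items to an item pipeline, which indexes them into Elasticsearch.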

Prerequisites

To run this app you will need to install:

  • Docker (version 17.09.0+)
  • docker-compose (version 1.16.1+)

Configuration

  1. Clone this project.
  2. Replace the SCREEN_NAME and TIME_ZONE values with your LESA screen name and time zone:
# lesa-crawler/crawler/lesaticket/custom_settings.py
SCREEN_NAME = 'screen.name'
TIME_ZONE = '<your time zone>' # e.g. '+0000' for the GMT time zone.
  3. Optionally, change the start of the query's date range and the region (an illustrative example follows the snippet):
# lesa-crawler/crawler/lesaticket/custom_settings.py
START_MONTH = ...
START_DAY = ...
START_YEAR = ...
REGION_ID = ...
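For instance, to start the query window at the beginning of 2018 (all values are illustrative; valid REGION_ID values depend on your LESA instance):
# Illustrative values only -- adjust to your own query window and region.
START_MONTH = 1      # January
START_DAY = 1
START_YEAR = 2018
REGION_ID = 1        # hypothetical id; use your own region's value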
  4. Encode your screen.name:password (your JIRA credentials) with a Base64 encoder and set the resulting hash (one way to generate it is sketched after the snippet):
# lesa-crawler/crawler/lesaticket/custom_settings.py
LIFERAY_ISSUES_AUTORIZATION_HEADER = {
    'Authorization': 'Basic c2NyZWVuLm5hbWU6cGFzc3dvcmQ='
}
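If you need a quick way to generate the hash and have Python 3 at hand, something like this works:
# Prints the Base64 form of 'screen.name:password'.
import base64
print(base64.b64encode(b'screen.name:password').decode())
# -> c2NyZWVuLm5hbWU6cGFzc3dvcmQ=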
  5. Replace the SUPPORT_OFFICE value with yours:
# lesa-crawler/crawler/lesaticket/custom_settings.py
SUPPORT_OFFICE = '<your support office>'
  6. Encode your email:password (your LESA site credentials) with a Base64 encoder, the same way as in step 4.
  7. Replace the authorization hash with yours:
# lesa-crawler/crawler/lesaticket/settings.py
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml, ...',
   'Accept-Language': 'en',
   'Authorization': 'Basic c2NyZWVuLm5hbWVAbGlmZXJheS5jb20=',
}
  8. Set the time zone in the scrapyd and splash Dockerfiles.

Deployment

  1. Run the following command to build the containers and start up the application:
$ docker-compose --file <path to>/lesa-crawler/docker-compose.yml up --build

Or go to the lesa-crawler directory and simply enter:

$ docker-compose up --build
  2. When initialization finishes, you can access the running services in your browser (typically Kibana at http://localhost:5601 and Scrapyd at http://localhost:6800, depending on the port mappings in docker-compose.yml).
  3. If you don't want to wait for the application to start crawling, schedule the spider manually (Scrapyd's response is shown after the command):
$ curl http://localhost:6800/schedule.json -d project=default -d spider=ticket
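On success, Scrapyd confirms the scheduled job with a small JSON payload (the jobid below is just an example):

{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511"}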
  4. Check out the sample dashboard by importing 'export.json' into Kibana.

Built With

  • Elasticsearch - An open-source full-text search and analytics engine.
  • Kibana - An open-source visualization platform designed to work with Elasticsearch.
  • Scrapyd - A service to run Scrapy spiders.
  • Splash - Lightweight, scriptable browser as a service with an HTTP API.

Screenshots

Samples 1-4: Kibana dashboard screenshots.
