A web crawler that uses Elasticsearch, Kibana, the Scrapy framework, and the Splash JavaScript rendering service on top of a Docker-containerized application architecture, aiming to retrieve data from LESA tickets.
Since the current LESA doesn't provide any sort of REST API to retrieve data from tickets, I've started developing a web crawler that acquires the data through XPath queries. All retrieved data is stored in an Elasticsearch index, where it can be visualized through Kibana.
To run this app you will need to install:
- Docker (version 17.09.0+)
- docker-compose (version 1.16.1+)
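You can check the installed versions against these minimums with:
$ docker --version
$ docker-compose --version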
- Clone this project.
- Replace the SCREEN_NAME and TIME_ZONE values with your LESA screen name and time zone:
# lesa-crawler/crawler/lesaticket/custom_settings.py
SCREEN_NAME = 'screen.name'
TIME_ZONE = '<your time zone>' # An offset such as "+0000" (GMT).
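For illustration, a filled-in version might look like this (the screen name is hypothetical):
# lesa-crawler/crawler/lesaticket/custom_settings.py
SCREEN_NAME = 'joe.bloggs'
TIME_ZONE = '-0300' # e.g. Brasília time (GMT-3)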
- You can also change the start of the query's date range and its region:
# lesa-crawler/crawler/lesaticket/custom_settings.py
START_MONTH = ...
START_DAY = ...
START_YEAR = ...
REGION_ID = ...
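For illustration only, crawling tickets from January 1st, 2017 onward might look like the following. These values are hypothetical and assume months are numbered 1-12; REGION_ID is left out because it depends on your LESA region:
# lesa-crawler/crawler/lesaticket/custom_settings.py
START_MONTH = 1
START_DAY = 1
START_YEAR = 2017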
- Encode your screen.name:password pair with a base64 encoder (these are your JIRA credentials); see the example after the snippet:
# lesa-crawler/crawler/lesaticket/custom_settings.py
LIFERAY_ISSUES_AUTORIZATION_HEADER = {
'Authorization': 'Basic c2NyZWVuLm5hbWU6cGFzc3dvcmQ='
}
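A shell one-liner can produce that value; the credentials here are the placeholder pair from the snippet above:
$ echo -n 'screen.name:password' | base64
c2NyZWVuLm5hbWU6cGFzc3dvcmQ=
The -n flag matters: without it the trailing newline would be encoded as well.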
- Replace the SUPPORT_OFFICE value with yours:
# lesa-crawler/crawler/lesaticket/custom_settings.py
SUPPORT_OFFICE = '<your support office>'
- Encode your email:password pair using a base64 encoder (your LESA site credentials), just as above.
- Replace the Authorization value with your encoded credentials:
# lesa-crawler/crawler/lesaticket/settings.py
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml, ...',
'Accept-Language': 'en',
'Authorization': 'Basic c2NyZWVuLm5hbWVAbGlmZXJheS5jb20=',
}
- Set the time zone in the scrapyd and splash Dockerfiles, as sketched below.
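As a minimal sketch, assuming both images honor the standard TZ environment variable (the file path and zone name here are assumptions), one line in each Dockerfile would do:
# lesa-crawler/scrapyd/Dockerfile (and likewise for splash)
ENV TZ=America/Sao_Paulo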
- Run the following command to build the containers and start up the application:
$ docker-compose --file <path to>/lesa-crawler/docker-compose.yml up --build
Or go to the lesa-crawler directory and just enter:
$ docker-compose up --build
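If you prefer to keep your terminal free, the same command accepts docker-compose's detached flag:
$ docker-compose up --build -d
$ docker-compose logs -f
The second command follows the containers' output while they run in the background.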
- When initialization finishes, you can access the following URLs:
- http://localhost:9200 (user: elastic, password: changeme)
- http://localhost:5601 (same as above)
- http://localhost:6800 (the Scrapyd web interface)
- http://localhost:8050 (the Splash web interface)
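To check that Elasticsearch is ready before going further, you can query its cluster health endpoint with the credentials above:
$ curl -u elastic:changeme http://localhost:9200/_cluster/health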
- If you don't want to wait for the application to start crawling, execute the following command:
$ curl http://localhost:6800/schedule.json -d project=default -d spider=ticket
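You can confirm the spider was scheduled, and later that documents actually reached Elasticsearch, with:
$ curl 'http://localhost:6800/listjobs.json?project=default'
$ curl -u elastic:changeme 'http://localhost:9200/_cat/indices?v'
The first call is scrapyd's job-listing endpoint; the second lists every index, the crawler's among them once data starts arriving.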
- Check out the dashboard sample by importing 'export.json' into Kibana (Management > Saved Objects > Import).
- Elasticsearch - An open-source full-text search and analytics engine.
- Kibana - An open-source visualization platform designed to work with Elasticsearch.
- Scrapyd - A service to run Scrapy spiders.
- Splash - Lightweight, scriptable browser as a service with an HTTP API.