This repository contains scrapers for wikiart.org. They are part of the Art Guide project, undertaken in the classes Practicing DS Skills in ML Competitions and Building ML-powered Applications.
For our project, we required comprehensive metadata about art pieces, such as genres, styles, and other descriptors, which were not present in the other datasets we found. These scrapers are therefore designed to extract all tabular information about art pieces, artists, art movements, schools, and styles present on the website.
The project consists of 5 crawlers:
- wikiart spider: This crawler extracts comprehensive details and images of various art pieces from the WikiArt website.
- wikiart artists spider: This crawler specializes in gathering information about artists.
- wikiart styles spider: This crawler is focused on collecting extensive information about different art styles.
- wikiart movements spider: This crawler collects information about art movements.
- wikiart schools spider: This crawler concentrates on gathering comprehensive data about art schools.
In addition to the primary crawlers, the project includes DuckDuckGo spiders for updating descriptions in specific categories:
- duck_duck_go.py: Updates descriptions for art pieces.
- duck_duck_go_artist.py: Updates information about artists.
- duck_duck_go_style.py: Updates information about art styles.
- duck_duck_go_movement.py: Updates information about art movements.
- duck_duck_go_school.py: Updates information about art schools.
These DuckDuckGo spiders take an existing dataset as input and fetch updated descriptions for paintings, artists, styles, movements, and schools.
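One possible way to fold an update file back into the original dataset is sketched below. It is not part of the repository: the helper names `merge_descriptions` and `read_csv` are hypothetical, and the sketch assumes both CSV files share a `url` column and that the update file carries a `description` column. The actual column names written by the spiders may differ.

```python
import csv


def read_csv(path):
    """Load a CSV file into a list of dictionaries, one per row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def merge_descriptions(base_rows, update_rows, key="url", field="description"):
    """Return copies of base_rows with `field` replaced by non-empty updates.

    `key` and `field` are assumed column names, not confirmed by the project.
    """
    # Index the update rows by key so each lookup is a dictionary access.
    updates = {row[key]: row.get(field, "") for row in update_rows}
    merged = []
    for row in base_rows:
        new_row = dict(row)
        replacement = updates.get(row[key], "")
        if replacement:  # keep the original text when no update was found
            new_row[field] = replacement
        merged.append(new_row)
    return merged


# Small in-memory demonstration with made-up rows.
base = [
    {"url": "a", "description": ""},
    {"url": "b", "description": "old"},
]
update = [{"url": "a", "description": "new"}]
merged = merge_descriptions(base, update)
```

With real files, the same call would be `merge_descriptions(read_csv("data/data.csv"), read_csv("data/data_update.csv"))`, provided the assumed column names match.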
Scraped Information for Artworks:
- URL
- Title
- Original Title
- Author
- Author Link
- Date
- Styles
- Series
- Series Link
- Genre
- Genre Link
- Media
- Location
- Dimensions
- Description
- Wiki Description
- Wiki Link
- Tags
- Image URLs
- Images
Scraped Information about Artists:
- URL
- Name
- Original Name
- Birth Date
- Birthplace
- Death Date
- Death Place
- Active Years
- Nationality
- Art Movements
- Painting School
- Genres
- Fields
- Influenced On
- Influenced By
- Teachers
- Pupils
- Art Institutions
- Friends And Coworkers
- Description
- Wiki Description
- Wikipedia Link
Scraped Information for Art Styles:
- Name
- Link
- Description
Scraped Information for Art Movements:
- Name
- Link
- Description
Scraped Information for Art Schools:
- Name
- Link
- Description
The main objective is to extract detailed data about art pieces and artists from the website, providing valuable datasets for data science and machine learning endeavors.
Scraping 191265 images took about 14 hours on a MacBook Pro (Retina, 15-inch, Mid 2015, 2.2 GHz Quad-Core Intel Core i7). Scraping 3521 artists took less than 10 minutes.
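As a rough guide for planning a run, the timings above can be turned into average throughput. The helper `items_per_second` is a hypothetical name used only for this arithmetic; real throughput will depend on the network, the machine, and the Scrapy settings in use.

```python
def items_per_second(item_count, hours=0.0, minutes=0.0):
    """Average number of items processed per second over the given duration."""
    seconds = hours * 3600 + minutes * 60
    return item_count / seconds


# Figures reported above: 191265 images in ~14 hours, 3521 artists in < 10 minutes.
image_rate = items_per_second(191265, hours=14)
artist_rate = items_per_second(3521, minutes=10)  # lower bound, since the run took less

print(f"images: ~{image_rate:.1f} per second")
print(f"artists: at least ~{artist_rate:.1f} per second")
```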
- Python 3.x (verified with 3.10)
- Scrapy
- Clone this repository: `git clone https://github.com/michaelvin1322/scrapWikiArt`
- Navigate to the repository and install the required packages: `cd scrapWikiArt`, then `pip install -r requirements.txt`
Crawler | Command |
---|---|
Art Pieces Crawler | scrapy runspider -o data/data.csv -t csv ScrapWikiArt/spiders/wikiart.py |
Artists Crawler | scrapy runspider -o data/artists.csv -t csv ScrapWikiArt/spiders/wikiart_artist.py |
Styles Crawler | scrapy runspider -o data/styles.csv -t csv ScrapWikiArt/spiders/wikiart_style.py |
Movements Crawler | scrapy runspider -o data/movements.csv -t csv ScrapWikiArt/spiders/wikiart_movement.py |
Schools Crawler | scrapy runspider -o data/schools.csv -t csv ScrapWikiArt/spiders/wikiart_school.py |
DuckDuckGo Crawler | scrapy runspider -o data/data_update.csv -t csv -a input_file=data/data.csv ScrapWikiArt/spiders/duck_duck_go.py |
DuckDuckGo Artist Spider | scrapy runspider -o data/artist_update.csv -t csv -a input_file=data/artists.csv ScrapWikiArt/spiders/duck_duck_go_artist.py |
DuckDuckGo Styles Spider | scrapy runspider -o data/styles_update.csv -t csv -a input_file=data/styles.csv ScrapWikiArt/spiders/duck_duck_go_style.py |
DuckDuckGo Movements Spider | scrapy runspider -o data/movements_update.csv -t csv -a input_file=data/movements.csv ScrapWikiArt/spiders/duck_duck_go_movement.py |
DuckDuckGo Schools Spider | scrapy runspider -o data/schools_update.csv -t csv -a input_file=data/schools.csv ScrapWikiArt/spiders/duck_duck_go_school.py |
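The commands for the five primary crawlers in the table above follow one pattern, so they can be assembled programmatically if you want to run them in sequence. The sketch below only builds and prints the commands; the dictionary keys and the `build_command` helper are hypothetical names, while the output paths and spider modules are copied from the table.

```python
import shlex

# Output file and spider module for each primary crawler, copied from the table.
# The dictionary keys are illustrative labels, not names used by the project.
CRAWLERS = {
    "art_pieces": ("data/data.csv", "ScrapWikiArt/spiders/wikiart.py"),
    "artists": ("data/artists.csv", "ScrapWikiArt/spiders/wikiart_artist.py"),
    "styles": ("data/styles.csv", "ScrapWikiArt/spiders/wikiart_style.py"),
    "movements": ("data/movements.csv", "ScrapWikiArt/spiders/wikiart_movement.py"),
    "schools": ("data/schools.csv", "ScrapWikiArt/spiders/wikiart_school.py"),
}


def build_command(name):
    """Return the argument list for one crawler, matching the table's command."""
    output, spider = CRAWLERS[name]
    return ["scrapy", "runspider", "-o", output, "-t", "csv", spider]


for name in CRAWLERS:
    # Printing only; to execute, pass the list to subprocess.run(..., check=True).
    print(shlex.join(build_command(name)))
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems if an output path ever contains spaces.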
By default, the Art Pieces Crawler downloads images into the `data/img` directory and saves data in `data/data.csv`. The images folder can be changed in `settings.py` by changing the path in `IMAGES_STORE`.
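For illustration, the relevant line of `settings.py` might look like the fragment below. `IMAGES_STORE` is the standard Scrapy setting that tells the images pipeline where to store downloaded files; the rest of the project's settings are not shown here and are not assumed.

```python
# settings.py (fragment)
# Directory where Scrapy's images pipeline stores downloaded files.
# "data/img" is the default described above; point it elsewhere to relocate images.
IMAGES_STORE = "data/img"
```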
By default, the other crawlers save their data as follows:
- Artists Crawler: `data/artists.csv`
- Styles Crawler: `data/styles.csv`
- Movements Crawler: `data/movements.csv`
- Schools Crawler: `data/schools.csv`