🎬 Persian Database of Movies

Overview

This project builds a structured dataset by crawling movie information from a Persian website with comprehensive movie data. The main tool used is Scrapy. Using two primary scripts, `crawl_urls.py` and `crawl_movies.py`, the resulting dataset comprises approximately 15,000 movie entries.

`crawl_urls.py`

This script is responsible for crawling URLs that direct to individual movie pages. It should be run first to gather the necessary links for the subsequent movie data extraction process.

`crawl_movies.py`

After collecting movie page URLs with `crawl_urls.py`, run `crawl_movies.py` to crawl detailed information about the movies. This script visits each URL and extracts the relevant movie data to construct the dataset.
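The hand-off between the two scripts can be pictured as a CSV round trip. The sketch below is a stdlib-only illustration, not the project's Scrapy code: the column names (`url`, `title`), the example URLs, and the `fetch_movie` stub are all hypothetical, and the real scripts may store their data differently.

```python
import csv
import io


def write_urls(urls, out):
    """Stage 1: store one movie-page URL per row, skipping duplicates."""
    writer = csv.DictWriter(out, fieldnames=["url"])
    writer.writeheader()
    seen = set()
    for url in urls:
        if url not in seen:
            seen.add(url)
            writer.writerow({"url": url})


def fetch_movie(url):
    """Hypothetical stand-in for downloading and parsing one movie page."""
    return {"url": url, "title": url.rstrip("/").rsplit("/", 1)[-1]}


def build_dataset(urls_in, out):
    """Stage 2: read the stored URLs and write one row per movie."""
    writer = csv.DictWriter(out, fieldnames=["url", "title"])
    writer.writeheader()
    for row in csv.DictReader(urls_in):
        writer.writerow(fetch_movie(row["url"]))


# In-memory files keep the sketch self-contained; real runs would use files on disk.
urls_file = io.StringIO()
write_urls(
    [
        "https://example.com/movie/a/",
        "https://example.com/movie/b/",
        "https://example.com/movie/a/",  # duplicate, dropped in stage 1
    ],
    urls_file,
)
urls_file.seek(0)

dataset_file = io.StringIO()
build_dataset(urls_file, dataset_file)
dataset_rows = list(csv.DictReader(io.StringIO(dataset_file.getvalue())))
```

Keeping the URL list as a separate artifact means the slower detail crawl can be rerun or adjusted without repeating the URL discovery step.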

Running the Scripts `scrapy` Way

You can run the crawlers this way as well. Please refer to the Scrapy documentation to learn how to do so.

Runtime

Crawling the URLs took approximately 20 minutes, and crawling the movie pages took approximately 35 minutes, on my personal notebook over an Internet connection provided by an Iranian ISP. So you should be fine if you want to make adjustments and rerun the scripts.
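For a rough sense of scale, these figures can be turned into a throughput estimate. The calculation below assumes one page request per movie entry and uses the approximate numbers quoted in this README, so treat the result as a ballpark only.

```python
movie_pages = 15_000   # approximate dataset size (assumes one page per entry)
url_minutes = 20       # approximate time to crawl the URLs
movie_minutes = 35     # approximate time to crawl the movie pages

pages_per_second = movie_pages / (movie_minutes * 60)
total_minutes = url_minutes + movie_minutes

print(f"{pages_per_second:.1f} movie pages/s, {total_minutes} minutes end to end")
```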

TODO

Improve Preprocessing: Refine data cleaning to enhance the quality of the dataset.
Better Validation: Aim for stronger checks to ensure data quality.
Database Consistency: Work on making the database entries more uniform.
Add Movie Reviews: Include movie reviews and comments to enrich the dataset.
Consider Other Sources: Look into additional websites for a wider range of data.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
