Merge branch 'main' into main
sanjay-kv authored May 11, 2024
2 parents c73256f + 14c1d75 commit 28033e0
Showing 3 changed files with 288 additions and 2 deletions.
15 changes: 15 additions & 0 deletions .github/workflows/greetings.yml
@@ -0,0 +1,15 @@
name: Greetings

on: [pull_request_target, issues]

jobs:
  greeting:
    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write
    steps:
      - uses: actions/first-interaction@v1
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          issue-message: "Thank you for raising an issue. We hope you are enjoying open source. We try to reply or assign as soon as possible. Connect with a mentor."
14 changes: 12 additions & 2 deletions README.md
@@ -2,24 +2,34 @@

<h1 align="center">IMDB Movie review Scraping</h1>
<blockquote align="center">Scraping movie reviews ✏️ using the Python programming language 💻. </blockquote>

<p align="center">To generate new data for the <b>Semi-supervised-sequence-learning-Project</b>, we have written a Python script that fetches 📊 data from the 💻 IMDb website and converts it into txt files. </p>
<p align="center">This project aims to replicate the Semi-supervised-sequence-learning-Project on a new dataset generated through scraping IMDb movie reviews. The generated data will be utilized for further analysis and exploration.
</p>




# Introduction

**`Semi-supervised-sequence-learning-Project`** :computer: replication process is done over here and for further analysis creation of new data is required.
**`Semi-supervised-sequence-learning-Project`** :computer: The IMDb Movie Review Scraping project aims to gather a new dataset by automatically extracting movie reviews from IMDb. This dataset will support various natural language processing tasks, including sentiment analysis and recommendation systems. Using web scraping techniques, such as Beautiful Soup, movie reviews are collected, preprocessed, and structured into a CSV format suitable for analysis, including Support Vector Machine classification.

- The repository includes the following scripts.

- `Movie_review_imdb_scrapping.ipynb` - Script to scrape the data from imdb website
- `rename_files.ipynb` - Script to rename the scrapped text files as per the requirements
- `convert_texts_to_csv.ipynb` - Python script to make a CSV file from the txt files for SVM processing

- `Movie_review_imdb_scrapping.ipynb` - Script to scrape the data from IMDb website
- `rename_files.ipynb` - Script to rename the scraped text files as per the requirements
- `convert_texts_to_csv.ipynb` - Python script for converting the scraped text files into a CSV format suitable for SVM processing




## Dependencies

install Beautifulsoup using `pip install beautifulsoup4`
Ensure Beautiful Soup is installed using `pip install beautifulsoup4`

## Installation

261 changes: 261 additions & 0 deletions School_Web Scrapping.ipynb
@@ -0,0 +1,261 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "00cd96ca",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"\n",
"new=pd.DataFrame()\n",
"\n",
"for j in range(1,6):\n",
"\n",
"    url=f\"https://school.careers360.com/schools/schools-in-india?page={j}\"\n",
"    headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}\n",
"    # Pass headers by keyword: the second positional argument of requests.get is params\n",
"    webpage=requests.get(url,headers=headers).text\n",
"    \n",
"    soup=BeautifulSoup(webpage,'lxml')\n",
"    schools=soup.find_all('div', class_=\"schoolListing_card position-relative\")\n",
"    \n",
"    name=[]\n",
"    fee=[]\n",
"    System=[]\n",
"    # Place is never appended to below, so the Place column stays None\n",
"    Place=[]\n",
"    rating=[]\n",
"    types=[]\n",
"    for i in schools:\n",
"        name.append(i.find('h2',class_='school_Name').text.strip())\n",
"        fee.append(i.find('div',class_=\"schoolList_Info d-flex align-items-center gap-1 flex-wrap mb-1\").text.strip()[15:22])\n",
"        rating.append(i.find('div',class_='school-overview').text[19:24])\n",
"        System.append(i.find('span',class_='comma').text.strip())\n",
"        types.append((i.find_all('div', class_ = 'schoolList_Info d-flex align-items-center gap-1 flex-wrap mb-1')[1]).find_all('span')[0].text.strip())\n",
"    \n",
"    d={\"NAME\":name,\n",
"       \"FEE\":fee, \n",
"       \"System\":System,\n",
"       \"Place\":Place,\n",
"       \"Type\":types,\n",
"       \"Rating\":rating}\n",
"    df = pd.DataFrame.from_dict(d, orient='index')\n",
"    df = df.T\n",
"    df.head()\n",
"\n",
"    new=pd.concat([new,df],ignore_index=True)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d52d5ee9",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>NAME</th>\n",
" <th>FEE</th>\n",
" <th>System</th>\n",
" <th>Place</th>\n",
" <th>Type</th>\n",
" <th>Rating</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Shri Ram Centennial School, Jaganpura</td>\n",
" <td>127,800</td>\n",
" <td>CBSE</td>\n",
" <td>None</td>\n",
" <td>Sr. Secondary/Higher Secondary School</td>\n",
" <td>AAAAA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Vivek High School, Sector-38-B</td>\n",
" <td></td>\n",
" <td>CBSE</td>\n",
" <td>None</td>\n",
" <td>Sr. Secondary/Higher Secondary School</td>\n",
" <td>AAAAA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>DAV Senior Secondary School (Lahore), Sector-8C</td>\n",
" <td></td>\n",
" <td>CBSE</td>\n",
" <td>None</td>\n",
" <td>Sr. Secondary/Higher Secondary School</td>\n",
" <td>AAAAA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Government Model Senior Secondary School, Sec...</td>\n",
" <td>682 (CB</td>\n",
" <td>CBSE</td>\n",
" <td>None</td>\n",
" <td>Sr. Secondary/Higher Secondary School</td>\n",
" <td>AAAAA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Carmel Convent School, Sector 9B</td>\n",
" <td>30,090</td>\n",
" <td>CBSE</td>\n",
" <td>None</td>\n",
" <td>Sr. Secondary/Higher Secondary School</td>\n",
" <td>AAAAA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>195</th>\n",
" <td>Salwan Public School, Mayur Vihar Phase-III</td>\n",
" <td>88,241</td>\n",
" <td>CBSE</td>\n",
" <td>None</td>\n",
" <td>Sr. Secondary/Higher Secondary School</td>\n",
" <td>AAAA+</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196</th>\n",
" <td>N K Bagrodia Global School, Sector-17, Dwarka</td>\n",
" <td></td>\n",
" <td>CBSE</td>\n",
" <td>None</td>\n",
" <td>Sr. Secondary/Higher Secondary School</td>\n",
" <td>AAAA+</td>\n",
" </tr>\n",
" <tr>\n",
" <th>197</th>\n",
" <td>Delhi Public School, Gavier</td>\n",
" <td>170,530</td>\n",
" <td>CBSE</td>\n",
" <td>None</td>\n",
" <td>Sr. Secondary/Higher Secondary School</td>\n",
" <td>AAAA+</td>\n",
" </tr>\n",
" <tr>\n",
" <th>198</th>\n",
" <td>Delhi Public School, kalali</td>\n",
" <td>125,075</td>\n",
" <td>CBSE</td>\n",
" <td>None</td>\n",
" <td>Sr. Secondary/Higher Secondary School</td>\n",
" <td>AAAA+</td>\n",
" </tr>\n",
" <tr>\n",
" <th>199</th>\n",
" <td>Atmiya Vidya Mandir, Kamrej</td>\n",
" <td></td>\n",
" <td>CBSE</td>\n",
" <td>None</td>\n",
" <td>Sr. Secondary/Higher Secondary School</td>\n",
" <td>AAAA+</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>200 rows × 6 columns</p>\n",
"</div>"
],
"text/plain": [
" NAME FEE System Place \\\n",
"0 Shri Ram Centennial School, Jaganpura 127,800 CBSE None \n",
"1 Vivek High School, Sector-38-B CBSE None \n",
"2 DAV Senior Secondary School (Lahore), Sector-8C CBSE None \n",
"3 Government Model Senior Secondary School, Sec... 682 (CB CBSE None \n",
"4 Carmel Convent School, Sector 9B 30,090 CBSE None \n",
".. ... ... ... ... \n",
"195 Salwan Public School, Mayur Vihar Phase-III 88,241 CBSE None \n",
"196 N K Bagrodia Global School, Sector-17, Dwarka CBSE None \n",
"197 Delhi Public School, Gavier 170,530 CBSE None \n",
"198 Delhi Public School, kalali 125,075 CBSE None \n",
"199 Atmiya Vidya Mandir, Kamrej CBSE None \n",
"\n",
" Type Rating \n",
"0 Sr. Secondary/Higher Secondary School AAAAA \n",
"1 Sr. Secondary/Higher Secondary School AAAAA \n",
"2 Sr. Secondary/Higher Secondary School AAAAA \n",
"3 Sr. Secondary/Higher Secondary School AAAAA \n",
"4 Sr. Secondary/Higher Secondary School AAAAA \n",
".. ... ... \n",
"195 Sr. Secondary/Higher Secondary School AAAA+ \n",
"196 Sr. Secondary/Higher Secondary School AAAA+ \n",
"197 Sr. Secondary/Higher Secondary School AAAA+ \n",
"198 Sr. Secondary/Higher Secondary School AAAA+ \n",
"199 Sr. Secondary/Higher Secondary School AAAA+ \n",
"\n",
"[200 rows x 6 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f9cf05c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
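
Review note on the `FEE` column: the notebook takes a fixed character slice (`.text.strip()[15:22]`) of the info block, which is presumably why the output above contains values such as `682 (CB` when the fee has fewer than seven characters. Below is a minimal sketch of a pattern-based alternative, assuming the fee is the first comma-grouped number in that text. The helper name `extract_fee` and the sample strings are hypothetical and not part of this commit; the wording on the real page may differ.

```python
import re
from typing import Optional

# Matches the first run of digits, optionally grouped with commas (e.g. "127,800").
_FEE_PATTERN = re.compile(r"\d[\d,]*")


def extract_fee(info_text: str) -> Optional[str]:
    """Return the first comma-grouped number in info_text, or None if there is none."""
    match = _FEE_PATTERN.search(info_text)
    # Strip a trailing comma in case the number is followed directly by one.
    return match.group(0).rstrip(",") if match else None


# Hypothetical card texts, used only to illustrate the helper.
samples = [
    "Fees: Rs 127,800 (CBSE)",
    "Fees: Rs 682 (CBSE)",
    "Fees not available",
]
for text in samples:
    print(repr(text), "->", extract_fee(text))
```

Inside the scraping loop, this would be called on the stripped text of the info `div` in place of the `[15:22]` slice. A missing fee then comes back as `None` instead of an empty or truncated string.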
