3lynk/2022_Summer_Python_Data_Study

2022 Summer Data Study

Data study group held over the 2022 summer vacation

# 2022_07_22
# 2022_07_23
# 2022_07_29
# 2022_07_31
# 2022_08_05
# 2022_08_20


  • 2022_07_22

    📌 Collecting all news from July 1 with the BeautifulSoup library 📌

    Requesting the Naver News URL with requests initially failed. The server returns an error for HTTP requests that lack a User-Agent header, which identifies the client software. Adding a header that carries a User-Agent value (copied from Chrome) resolved the error.

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    res = requests.get(URL, headers=headers)

    The first-page URL for the date, plus the URLs found in the page-number list at the bottom of that page, are appended to the page list.

    page = ['https://news.naver.com/main/list.naver?mode=LS2D&mid=shm&sid2=229&sid1=105&date=20220701']
    for li in soup.select('#main_content > div.paging > a'):
        page.append('https://news.naver.com/main/list.naver' + li['href'])
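The concatenation above presumably works because each pagination href starts with the query string. As a slightly more defensive sketch, `urllib.parse.urljoin` from the standard library resolves an href against the first-page URL whether it is query-only, path-relative, or absolute. The helper name `build_page_urls` and the sample hrefs are illustrative assumptions, not part of the original script.

```python
from urllib.parse import urljoin

FIRST_PAGE = ('https://news.naver.com/main/list.naver'
              '?mode=LS2D&mid=shm&sid2=229&sid1=105&date=20220701')


def build_page_urls(first_page, hrefs):
    """Return the first page followed by each pagination href resolved against it."""
    pages = [first_page]
    for href in hrefs:
        url = urljoin(first_page, href)
        if url not in pages:  # skip links that point back to an already listed page
            pages.append(url)
    return pages


# Hypothetical hrefs, shaped like the ones read from '#main_content > div.paging > a'
sample_hrefs = ['?mode=LS2D&mid=shm&sid2=229&sid1=105&date=20220701&page=2',
                '?mode=LS2D&mid=shm&sid2=229&sid1=105&date=20220701&page=3']
pages = build_page_urls(FIRST_PAGE, sample_hrefs)
```

In the crawler, `hrefs` would be `[a['href'] for a in soup.select('#main_content > div.paging > a')]`.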

    A 'page traversal' for loop then iterates directly over the news list on each page in the page list.

    for li in soup.select('#main_content > div > ul > li'):
        url = li.a['href']

    Inside the 'page traversal' for loop, the title, date, and contents of each article are bundled into a list and appended to the list l. After the loop ends, pandas turns l into a DataFrame, which is saved as an Excel file (naver_news.xlsx).

    l.append([title, date, contents])
    .
    .
    .
    # column names follow the order of the values appended to l
    df = pd.DataFrame(l, columns=['title', 'date', 'contents'])
    df.to_excel('naver_news.xlsx', index=False)

  • 2022_07_23

    📌 Collecting a full month of news for June 📌

    Since June has 30 days, a for loop over range(30) builds the news URL for each day and appends it to the day_url list.

    for day in range(30):
        day_url.append('https://news.naver.com/main/list.naver?mode=LS2D&mid=shm&sid2=229&sid1=105&date=' + str(20220601 + day))

  • 2022_07_29

    📌 Changing how title and date are collected 📌

    Previously, title and date were collected from inside each individual article page. The selectors there seemed to differ by publisher, so every variant was tracked down and handled with try/except (see 2022_08_05).

    # title
    try:
        title = soup.select_one('#ct > div > div > h2').text.strip()
    except AttributeError:  # select_one returned None, so try the other layout
        title = soup.select_one('#content > div > div > div > div > h4').text.strip()
    # date
    try:
        date = soup.select_one('#ct > div > div > div > div > span').text.strip()
    except AttributeError:
        date = soup.select_one('#content > div > div > div > div > div > span').text.strip()
        if date == '새로운 뉴스':  # the span holds a "new news" label instead of a date
            date = soup.select_one('#content > div.end_ct > div > div.article_info > span > em').text.strip()
        else:
            date = date[5:]  # keep everything after the first five characters
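The nested try/except pattern can also be written as a list of candidate selectors that are tried in order. The following is only a sketch: `first_text` and `TITLE_SELECTORS` are hypothetical names, and the helper assumes nothing more than an object with a `select_one` method that returns `None` when nothing matches, as a BeautifulSoup object does.

```python
def first_text(soup, selectors):
    """Return the stripped text of the first selector that matches, or None."""
    for selector in selectors:
        node = soup.select_one(selector)
        if node is not None:
            return node.text.strip()
    return None


# Candidate selectors taken from the try/except code in this entry
TITLE_SELECTORS = ['#ct > div > div > h2',
                   '#content > div > div > div > div > h4']
```

With a parsed page, the call would be `title = first_text(soup, TITLE_SELECTORS)`. Supporting another page layout then means adding one selector to the list rather than one more level of nesting.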

    Handling it this way is inefficient and likely to break when collecting more articles, so the code was changed to collect the basic information from the news list page rather than from each individual article page. (The contents field, which is still collected from inside each article page, has not been revised yet.)

    def basic_info(li):
        url = li.a['href']
        for t in li.select('#main_content > div > ul > li > dl > dt > a'):
            title = t.text.strip()
        date = li.select_one('#main_content > div > ul > li > dl > dd > span.date').text.strip()
    
        return title, date, url

  • 2022_07_31

    📌 Organizing the data in MySQL 📌

    # Image: data inserted into MySQL

    img1

    # Image: schema and table created with MySQL Workbench

    img2

    An Excel sheet is limited to 1,048,576 rows, which is too few for a large dataset, so a database is used instead. In MySQL, a web schema was created with a news table containing the fields publisher, title, and date.

    CREATE DATABASE web;

    USE web;

    CREATE TABLE `web`.`news` (
        `publisher` VARCHAR(20) NOT NULL,
        `title` VARCHAR(100) NOT NULL,
        `date` VARCHAR(30) NOT NULL);

    The pymysql library is used to access the database from Python.

    pip install PyMySQL
    

    The db, host, user, password, port, and charset values needed to connect to MySQL are stored in advance as a dictionary in mysql_user_info.py, which naver_news_detail.py then imports.

    # mysql_user_info.py
    user_info = {'db' : 'web', 'host' : '127.0.0.1', 'user' : 'root', 'passwd' : 'DB_PASSWORD', 'port' : 3306, 'charset' : 'utf8'}

    # naver_news_detail.py
    import mysql_user_info

    With the with ... as statement, there is no need to call close() explicitly.

    def insert_data(publisher, title, date):
        user = mysql_user_info.user_info
        db = pymysql.connect(db=user['db'], host=user['host'], user=user['user'], passwd=user['passwd'], port=user['port'], charset=user['charset'])

        sql = 'INSERT INTO news (publisher, title, date) VALUES (%s, %s, %s)'

        # leaving the with blocks closes the cursor and then the connection
        with db:
            with db.cursor() as cursor:
                cursor.execute(sql, (publisher, title, date))
                db.commit()

    The time library is used to measure the running time of the code. Collecting the month of June together with contents takes roughly 10 minutes; without contents it took 68.97558355331421 seconds.

    import time
    
    start_time = time.time()
    .
    .
    .
    print(f'Time : {time.time() - start_time}')
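When several stages need timing (for example, with and without contents), the start/stop pair can be wrapped in a small context manager. This is a sketch using only the standard library; the name `timed` and the `timings` dictionary are illustrative assumptions. It uses `time.perf_counter()`, which is intended for measuring elapsed intervals, instead of `time.time()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the wrapped block took; optionally store it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f'{label} : {elapsed:.2f} s')


timings = {}
with timed('crawl', timings):
    time.sleep(0.01)  # stand-in for the crawling code
```

Because the measurement happens in `finally`, the elapsed time is still reported when the wrapped block raises.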

  • 2022_08_05

    📌 Collecting a full year of news data for 2021 📌

    # Image: one year of data stored in MySQL

    img1

    # Image: the run took about two and a half hours

    img1

    To collect the data for all of 2021, the links for every day of the year are appended to the day_url list.

    # build the list of dates
    month_day = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    day_url = []
    
    for month in range(12):
        for day in range(month_day[month]):
            day_url.append(('https://news.naver.com/main/list.naver?mode=LS2D&mid=shm&sid2=229&sid1=105&date=' + str(20210000 + (month + 1) * 100 + (day + 1))))
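The hard-coded month lengths are correct for 2021 but would need editing for a leap year or a partial range. As an alternative sketch, the standard `datetime` module can generate the same `YYYYMMDD` suffixes for any start and end date. `BASE_URL` and `daily_urls` are illustrative names, not taken from the original script.

```python
from datetime import date, timedelta

BASE_URL = ('https://news.naver.com/main/list.naver'
            '?mode=LS2D&mid=shm&sid2=229&sid1=105&date=')


def daily_urls(start, end):
    """Return one list-page URL per day from start to end, both inclusive."""
    days = (end - start).days + 1
    return [BASE_URL + (start + timedelta(days=n)).strftime('%Y%m%d')
            for n in range(days)]


day_url = daily_urls(date(2021, 1, 1), date(2021, 12, 31))
```

The same call with `date(2022, 6, 1)` and `date(2022, 6, 30)` would cover the June collection from 2022_07_23.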

    2022_07_29 ์— ์ˆ˜์ •ํ•˜์ง€ ๋ชปํ•˜์˜€๋˜ contents ํ•ญ๋ชฉ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ๋ฐฉ์‹ ๋ณ€๊ฒฝ
    ์–ธ๋ก ์‚ฌ๋งˆ๋‹ค selector๊ฐ€ ๋‹ค๋ฅธ ๊ฒƒ์ด ์•„๋‹Œ ๋„ค์ด๋ฒ„ ๋‰ด์Šค์˜ ์ผ๋ฐ˜ ๋‰ด์Šค, ์—ฐ์˜ˆ ๋‰ด์Šค, ์Šคํฌ์ธ  ๋‰ด์Šค๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฒƒ์„ ์ธ์ง€ํ•˜๊ณ  ์ˆ˜์ •

    # contents
    try:
        contents = soup.select_one('#dic_area').text.strip()
    except:
        if soup.select_one('#header > div > div > h1 > a:nth-of-type(2)').text == '์Šคํฌ์ธ ':
            contents = soup.select_one('#newsEndContents').text.strip()
        elif soup.select_one('#header > div > div > h1 > a:nth-of-type(2)').text == 'TV์—ฐ์˜ˆ':
            contents = soup.select_one('#content > div.end_ct > div > div.end_body_wrp').text.strip()

    Data collection turned out to fail only on certain pages of certain dates.
    After a day of trying to work it out, I asked my professor, who answered that it could be fixed by using Selenium.
    Since this round of collection is done without Selenium, the code was changed to skip those pages through exception handling instead.

    try:
        title, date, publisher, url = basic_info(li)
        contents = detail_info(url)

        print(f'Date : \n{date}')

        insert_data(publisher, title, date, contents)

    except Exception:  # skip any article that cannot be parsed or stored
        continue
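Skipping failed articles keeps the crawl running, but it also hides which pages were dropped. One possible variation, sketched here with hypothetical names (`collect`, `parse`), records each failure so the skipped pages can be reviewed or retried later, for instance with Selenium.

```python
def collect(items, handler):
    """Run handler on each item; return (results, failures) instead of dropping errors."""
    results, failures = [], []
    for item in items:
        try:
            results.append(handler(item))
        except Exception as exc:  # keep going, but remember what failed and why
            failures.append((item, repr(exc)))
    return results, failures


# Stand-in handler: the real one would call basic_info, detail_info and insert_data
def parse(item):
    if item is None:
        raise ValueError('no article link')
    return item.upper()


results, failures = collect(['a', None, 'b'], parse)
```

After a run, `failures` holds one `(item, error)` pair per skipped article, which makes it easy to count how much data the exception handling discarded.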

  • 2022_08_20

    📌 Morphological analysis with KoNLPy 📌

    First, the data that naver_news_detail.py stored in the db is fetched.

    def fetch():
        with pymysql.connect(db=user['db'], host=user['host'], user=user['user'], passwd=user['passwd'], port=user['port'], charset=user['charset']) as db:
            # DictCursor returns each row as a dict keyed by column name
            with db.cursor(pymysql.cursors.DictCursor) as cur:
                sql = 'SELECT * FROM news'
                cur.execute(sql)  # a SELECT needs no commit
                data = cur.fetchall()

        return data

    After weighing how to design the db for the analyzed morpheme data, the table was laid out as | id | type | word |.
    The id is set to 'publisher-date'.

    # build the id
    id = i['publisher'] + '-' + i['date']

    The goal is to extract only nouns from the title, and only nouns and adjectives from the body.
    For nouns, the extracted phrases (eojeol) are stored.

    # morphological analysis
    title_pos = okt.pos(i['title'])
    title_noun = okt.phrases(i['title'])
    body_pos = okt.pos(i['body'])
    body_noun = okt.phrases(i['body'])
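`okt.pos` returns a list of `(word, tag)` pairs, so picking out adjectives (or nouns) comes down to filtering on the tag. The sketch below shows that filtering step on a hand-written sample so it runs without KoNLPy installed. `keep_tags` is a hypothetical helper, and the sample pairs are illustrative rather than real analyzer output.

```python
def keep_tags(pos_pairs, wanted):
    """Keep the words whose part-of-speech tag is in `wanted`, preserving order."""
    return [word for word, tag in pos_pairs if tag in wanted]


# Hand-written sample in the (word, tag) shape that okt.pos returns
sample_pos = [('날씨', 'Noun'), ('가', 'Josa'), ('좋다', 'Adjective'), ('.', 'Punctuation')]

title_words = keep_tags(sample_pos, {'Noun'})
body_words = keep_tags(sample_pos, {'Noun', 'Adjective'})
```

With real data, the call would look like `keep_tags(body_pos, {'Adjective'})`, alongside the noun phrases from `okt.phrases`.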

    The data goes into two tables, but a single function handles both.

    # insert data into the morpheme table
    def insert_data(id, type, word, sort):
        try:
            with pymysql.connect(db=user['db'], host=user['host'], user=user['user'], passwd=user['passwd'], port=user['port'], charset=user['charset']) as db:
                with db.cursor() as cursor:
                    # sort selects the target table: <sort>_morpheme
                    sql = 'INSERT INTO ' + sort + '_morpheme (id, type, word) VALUES (%s, %s, %s)'
                    cursor.execute(sql, (id, type, word))
                    db.commit()
        except Exception:  # a failed insert is skipped silently
            pass
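A table name cannot be passed as a `%s` query parameter, which is why the function builds it by string concatenation. A cautious variation is to check the prefix against a fixed set before building the statement. In this sketch the two prefixes `title` and `body` are assumptions inferred from the analysis step, and `build_insert_sql` is a hypothetical helper.

```python
ALLOWED_SORTS = {'title', 'body'}  # assumed table prefixes: title_morpheme, body_morpheme


def build_insert_sql(sort):
    """Return the INSERT statement for `<sort>_morpheme`, rejecting unknown prefixes."""
    if sort not in ALLOWED_SORTS:
        raise ValueError(f'unexpected table prefix: {sort!r}')
    return f'INSERT INTO {sort}_morpheme (id, type, word) VALUES (%s, %s, %s)'


sql = build_insert_sql('title')
```

The values themselves still go through `cursor.execute(sql, (id, type, word))`, so only the table name is handled outside the parameter mechanism.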

    # Image: measured running time

    img1
