[External Data Usage] - Crawling #4

hyeming-king · 2022-10-13T02:27:24Z

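Data to collect

Track metadata

TMI 'Speed Individual Match' track ranking data -> solved with Listly
Namuwiki per-theme track information -> scraped with Selenium

User reaction text data

TMI track review data
KartRider community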

goodkse7 · 2022-10-13T08:11:03Z

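Progress as of 2022-10-13

Track metadata crawling

1. Done

TMI track metadata

Collected metadata for 300 tracks, selected by 'Speed Individual Match' usage rate
Collected with Listly and organized in Google Sheets

Namuwiki track metadata

Finished a scraping test on the 1920 theme
Obtained information that TMI does not provide, such as track length, driving direction, and first release date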

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

# Assumption: the original notebook created `driver` in an earlier cell; Chrome is used here as an example
driver = webdriver.Chrome()

# URL to crawl: Namuwiki KartRider track page for the 1920 theme
url = 'https://namu.wiki/w/%ED%81%AC%EB%A0%88%EC%9D%B4%EC%A7%80%EB%A0%88%EC%9D%B4%EC%8B%B1%20%EC%B9%B4%ED%8A%B8%EB%9D%BC%EC%9D%B4%EB%8D%94/%ED%8A%B8%EB%9E%99/1920'
driver.get(url)

# Read the page source and parse it with BeautifulSoup
html = driver.page_source  # HTML as currently held by the browser
soup = BeautifulSoup(html, "html.parser")
# One info table per track; the class names are generated by the site
track_list = soup.select("table.UQjgK8i0._f0b7325cc9e2662864c573d822bf4dca")
# print(track_list[0].prettify())
# trs = track_list[0].select("tr")
# for tr in trs:
#         print(tr.text)

track_data_list = []
for track_info in track_list:
    rows = track_info.select("tr")  # select the rows once, then index into them
    name = rows[0]
    tag = rows[2]
    hardness = rows[3]
    laps = rows[4]
    track_length = rows[5]
    direction = rows[6]
    playmode = rows[7]
    ai = rows[8]
    # rows[9] (server) is excluded from collection
    release_date = rows[10]
    league_track = rows[11]
    license_info = rows[12]  # renamed so it does not shadow the built-in `license`
    nickname = rows[13]

    # Each slice drops the row label that precedes the value in the row text
    track_data_list.append(
        [
            name.text,
            tag.text,
            hardness.text[3:],
            laps.text[1:],
            track_length.text[5:],
            direction.text[5:],
            playmode.text[5:],
            ai.text[5:],
            release_date.text[5:],
            league_track.text[5:],
            license_info.text[4:],
            nickname.text[2:]
        ]
    )

# Column names are kept in Korean as in the original data:
# track name, tag, difficulty, laps, track length, driving direction,
# track category, AI driving, first release date, league track, license, short name
columns = ["트랙 이름", "태그", "난이도", "랩", "트랙 길이", "진행 방향", "트랙 분류", "AI 주행", "첫 등장일", "리그 트랙", "라이센스", "약칭"]
df = pd.DataFrame(track_data_list, columns=columns)
df

# driver.close()
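The fixed offsets in the scraper (`hardness.text[3:]`, `track_length.text[5:]`, and so on) match the character lengths of the Korean column names, which suggests each row's text starts with its label. A minimal sketch, assuming that label-then-value layout, that strips the label by name instead of by character count. `strip_label` is a hypothetical helper, not part of the original notebook, and the sample row text is made up.

```python
def strip_label(row_text: str, label: str) -> str:
    """Return the value part of a row whose text starts with `label`."""
    text = row_text.strip()
    if not text.startswith(label):
        # Fail loudly instead of silently cutting the wrong characters
        raise ValueError(f"expected label {label!r}, got {text[:10]!r}")
    return text[len(label):].strip()


# Made-up row text in the assumed "label + value" layout
print(strip_label("트랙 길이5.2km", "트랙 길이"))
```

A row whose label does not match raises `ValueError`, which makes a changed table layout visible immediately rather than producing shifted values.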

2. To do

(1) Link the TMI data with the Namuwiki data
(2) Try scraping user reaction data for tracks (TMI track reviews, community)

Emilia0608 · 2022-10-22T07:43:59Z

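Progress as of 2022-10-22

Namuwiki crawling

The table class name in the HTML that the crawler uses to fetch the tables from Namuwiki has changed

▲ The class name in the highlighted part has changed!
Every occurrence needs to be updated

Special tracks and a few other tracks have a different table layout, which raises an error

So regular tracks are stored in a df and saved, and special tracks are collected in a list and saved (01_crawling_track.csv, 01_crawling_errortrack.csv)
Wrote a for loop and a function, get_crawl
df, error_list=get_crawl()
Calling it this way puts the regular tracks in df and the names of the special tracks in error_list
Still need to find a way to handle the tracks in error_list (for example, entering them by hand in Excel)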
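The split described above, regular tracks into one collection and irregular ones into an error list, can be sketched as follows. This is not the team's actual `get_crawl`. It is a dependency-free sketch that takes already-extracted row texts (what `tr.text` would return) so it runs without a browser. `split_tracks`, `EXPECTED_ROWS`, and the sample rows are hypothetical.

```python
# Assumption: a regular track table has at least rows 0..13, as indexed in the scraper
EXPECTED_ROWS = 14


def split_tracks(tables):
    """Split track tables into parsed records and names of tables that do not fit.

    `tables` is a list of lists of row texts, one inner list per track table.
    """
    records, error_list = [], []
    for rows in tables:
        name = rows[0].strip() if rows else "(unknown)"
        if len(rows) < EXPECTED_ROWS:
            # Irregular layout: remember the name and handle it separately
            error_list.append(name)
            continue
        records.append({"트랙 이름": name, "난이도": rows[3][3:], "랩": rows[4][1:]})
    return records, error_list


# Made-up sample data: one regular table (14 rows) and one short, irregular table
regular = ["Track A", "", "tag", "난이도3", "랩3"] + [""] * 9
special = ["Track B", "난이도2"]
records, error_list = split_tracks([regular, special])
print(records, error_list)
```

The returned `records` list of dicts can be passed to `pd.DataFrame(records)`, and `error_list` written out separately, which mirrors the two CSV files mentioned above.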

hhyojjin · 2022-10-22T08:10:17Z

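Regarding the above:
The 'table class name in the HTML used to fetch the tables' only needs to be updated when the site itself changes it.
It does not change per link. It changes everywhere at once, so one update each time a change occurs should be enough.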
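Because the class name changes site-wide rather than per page, one way to turn each change into a one-line fix is to keep it in a single constant. A minimal sketch: `TABLE_CLASSES` and `table_selector` are hypothetical names, and the class values are the ones from the 2022-10-13 code, which had already changed by 2022-10-22.

```python
# Single place to update when the site regenerates its class names
TABLE_CLASSES = ("UQjgK8i0", "_f0b7325cc9e2662864c573d822bf4dca")


def table_selector(classes=TABLE_CLASSES):
    """Build the CSS selector for the track tables from the class names."""
    return "table" + "".join(f".{c}" for c in classes)


print(table_selector())
```

Every page in the crawl loop would then call `soup.select(table_selector())` instead of repeating the literal selector string.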

Attached the theme data to the raw data output from the meta data, so the theme can be seen next to trackId and map_name.

Merged the meta (raw) data & theme data into the crawled data, so the trackId and theme of each crawled track can be seen together.

Text preprocessing code for all of the crawled/theme/meta (raw) datasets.
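The merge and preprocessing code itself is not shown in this thread. A dependency-free sketch of the idea, assuming map_name is the join key and that names differ between sources only in whitespace and letter case. `normalize_name`, `left_join`, and all sample rows are hypothetical. With pandas, the same join would typically be a `DataFrame.merge` on the normalized column.

```python
import re


def normalize_name(name: str) -> str:
    """Collapse whitespace and lowercase so names from different sources match."""
    return re.sub(r"\s+", " ", name).strip().lower()


def left_join(crawled, meta, key="map_name"):
    """Attach trackId and theme from `meta` to each crawled row.

    Rows without a match keep their data and get None for both fields.
    """
    index = {normalize_name(m[key]): m for m in meta}
    merged = []
    for row in crawled:
        match = index.get(normalize_name(row[key]), {})
        merged.append({**row, "trackId": match.get("trackId"), "theme": match.get("theme")})
    return merged


# Made-up sample rows
meta = [{"trackId": "t001", "map_name": "Sample Track", "theme": "1920"}]
crawled = [{"map_name": " sample  track ", "랩": "3"}, {"map_name": "Other", "랩": "2"}]
result = left_join(crawled, meta)
print(result)
```

Keeping unmatched rows with `None` values, rather than dropping them, makes it easy to list which crawled tracks still lack a trackId.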

hyeming-king added this to @Kartrider-game-analysis Oct 17, 2022

hyeming-king moved this to In Progress in @Kartrider-game-analysis Oct 17, 2022

hyeming-king assigned goodkse7 Oct 17, 2022

hyeming-king added 💛 data (data research, collection, etc.) 💎 crawling (data crawling) and removed 💛 data (data research, collection, etc.) labels Oct 17, 2022

hyeming-king modified the milestone: Build the Data pipeline Oct 17, 2022

goodkse7 added a commit that referenced this issue Oct 20, 2022

#4 Namuwiki track info crawling test code - 1920 theme

9ae6dc7

goodkse7 added a commit that referenced this issue Oct 21, 2022

#4 Crawling test code for the 1920 theme among Namuwiki tracks

9c05188

goodkse7 added a commit that referenced this issue Oct 21, 2022

#4 Rename crawling files per the naming convention, move folder location

59ed38f

Emilia0608 added a commit that referenced this issue Oct 22, 2022

#4 Collect special tracks into error list, for loop

605ef6b

junetofeb added a commit that referenced this issue Oct 25, 2022

#4 Add crawled data to the lower map data

4eadb10

junetofeb added a commit that referenced this issue Oct 25, 2022

#4 Add crawled data to the upper map data

8f66df6

junetofeb added a commit that referenced this issue Oct 25, 2022

#4 Add crawled data to the upper map data

51ffda5

junetofeb added a commit that referenced this issue Oct 26, 2022

#4 Preprocessing of the combined KPI + Crawling data

63eab1a

hoinnovation added a commit that referenced this issue Oct 26, 2022

#4 Track theme data

fff1f62

hoinnovation added a commit that referenced this issue Oct 26, 2022

#4 Remove unneeded data

4f23187

junetofeb added a commit that referenced this issue Oct 26, 2022

#4 File with the tracks that could not be crawled, stored in error list

86d024a

junetofeb added a commit that referenced this issue Oct 26, 2022

#4 theme data/meta data(Raw data) merge

af7cf6c

Attached the theme data to the raw data output from the meta data, so the theme can be seen next to trackId and map_name.

junetofeb added a commit that referenced this issue Oct 26, 2022

#4 crawl data/meta data(merged data) merge

5174c31

Merged the meta (raw) data & theme data into the crawled data, so the trackId and theme of each crawled track can be seen together.

junetofeb added a commit that referenced this issue Oct 26, 2022

#4 Text preprocessing code for crawled/theme/meta (raw) data

242fac7

Text preprocessing code for all of the crawled/theme/meta (raw) datasets.

hyeming-king added this to kart-track-analysis-project Oct 28, 2022

hyeming-king added this to the Data collection milestone Oct 28, 2022

hyeming-king moved this to Todo in kart-track-analysis-project Oct 29, 2022
