Home

Welcome to the bigdata wiki!

Big Data Mining and Analysis

Final Group Project

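Deadline: 23:59:59 on July 3, 2018 (each day late reduces the score by 10 percent)
Groups: three people per group is preferred; four is allowed where headcount requires it, but no group may exceed four.
The project must be completed in a notebook that contains both the written discussion and the data analysis code, in the following format:
- Project title
- Abstract
- Introduction
- Research approach: give a systematic description of the research project
- Findings
- Conclusion: summarize the findings
Required: analysis of the OWS Twitter data
- Count the number of tweets for each day
- Extract one day of data as dat and save it to disk
- Clean and describe dat
- Text analysis
  - Use a subset of the data to build a topic model or a sentiment analysis model
  - Build a recommender system that recommends hashtags to users:
    - Reshape the data into the format: user, hashtag, count
    - Build an item-based similarity recommender on that data
- Build the reply (comment) network
  - Describe and analyze the reply network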
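
The hashtag recommender steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the course's reference solution: the tweets are hypothetical `(user, text)` pairs rather than the real OWS records, hashtags are pulled out with a simple regular expression, and cosine similarity over per-user counts is one common choice of item-based similarity among several. The helper names `user_hashtag_counts`, `item_similarity` and `recommend` are made up for this example.

```python
import re
from collections import Counter, defaultdict
from math import sqrt

HASHTAG = re.compile(r"#(\w+)")

def user_hashtag_counts(tweets):
    """Turn (user, text) pairs into a Counter keyed by (user, hashtag)."""
    counts = Counter()
    for user, text in tweets:
        for tag in HASHTAG.findall(text):
            counts[(user, tag.lower())] += 1
    return counts

def item_similarity(counts):
    """Cosine similarity between hashtags, using per-user counts as vectors."""
    vectors = defaultdict(dict)  # hashtag -> {user: count}
    for (user, tag), n in counts.items():
        vectors[tag][user] = n
    norms = {t: sqrt(sum(v * v for v in vec.values()))
             for t, vec in vectors.items()}
    sim = defaultdict(dict)
    tags = sorted(vectors)
    for i, a in enumerate(tags):
        for b in tags[i + 1:]:
            dot = sum(n * vectors[b].get(u, 0) for u, n in vectors[a].items())
            if dot:
                sim[a][b] = sim[b][a] = dot / (norms[a] * norms[b])
    return sim

def recommend(user, counts, sim, k=3):
    """Rank hashtags the user has not used by similarity to the ones they have."""
    used = {tag: n for (u, tag), n in counts.items() if u == user}
    scores = defaultdict(float)
    for tag, n in used.items():
        for other, s in sim.get(tag, {}).items():
            if other not in used:
                scores[other] += s * n
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Hypothetical toy tweets, not taken from the OWS data.
tweets = [
    ("alice", "Marching today #ows #nyc"),
    ("alice", "Still here #ows"),
    ("bob", "#ows #occupy"),
    ("carol", "#nyc #occupy"),
]
counts = user_hashtag_counts(tweets)   # the (user, hashtag, count) table
sim = item_similarity(counts)
print(recommend("alice", counts, sim))
```

The `counts` object is the user, hashtag, count table in long form; writing it out with `csv.writer` would give a file that a recommender library can load.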

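Optional Individual Bonus Project (Final Group Project For Individuals)

Collect other data yourself, or use secondary data, and carry out a corresponding analysis. For example:

Analyze the core characters in Game of Thrones and how they evolve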

A Network analysis of Game of Thrones: Analyze the network of characters in Game of Thrones and how it changes over the course of the books. https://www.datacamp.com/projects/76

Get the Data

Winter is Coming. Let's load the dataset ASAP
Time for some Network of Thrones
Populate the network with the DataFrame
Finding the most important character in Game of Thrones
Evolution of importance of characters over the books
What's up with Stannis Baratheon?
What does the Google PageRank algorithm tell us about Game of Thrones?
Correlation between different measures
Conclusion

Homework 9

Download the WWW data
- WWW Data download http://www3.nd.edu/~networks/resources.htm World-Wide-Web: [README] [DATA] Réka Albert, Hawoong Jeong and Albert-László Barabási: Diameter of the World Wide Web Nature 401, 130 (1999) [ PDF ]
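Build a networkx graph object g (hint: it is a directed network) and add the WWW data to g
Compute the number of nodes and the number of links in the network
Compute the density of the WWW network
Plot the out-degree and in-degree distributions of the WWW network
Using the BA model with m = 2, generate networks with N = 10, 100, 1000 and 10000 nodes, and plot the average path length d against the number of nodes N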
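
A rough sketch of the networkx calls these tasks rely on is shown below. It is illustrative only: the edge list is a hypothetical stand-in for the downloaded WWW data, whose file format is not described here, and the helper names `describe_digraph` and `ba_average_path_lengths` are invented for this example. Plotting is left out so the snippet depends on networkx alone.

```python
from collections import Counter

import networkx as nx

def describe_digraph(edges):
    """Build a directed graph from (source, target) pairs and summarize it."""
    g = nx.DiGraph()
    g.add_edges_from(edges)
    return {
        "nodes": g.number_of_nodes(),
        "links": g.number_of_edges(),
        # For a directed graph, density = links / (nodes * (nodes - 1)).
        "density": nx.density(g),
        # degree value -> number of nodes with that degree
        "in_degree_hist": Counter(d for _, d in g.in_degree()),
        "out_degree_hist": Counter(d for _, d in g.out_degree()),
    }

def ba_average_path_lengths(sizes, m=2, seed=0):
    """Average shortest path length of one BA graph per size N."""
    lengths = {}
    for n in sizes:
        ba = nx.barabasi_albert_graph(n, m, seed=seed)
        lengths[n] = nx.average_shortest_path_length(ba)
    return lengths

# Hypothetical toy edge list standing in for the WWW data.
stats = describe_digraph([(1, 2), (2, 3), (3, 1), (1, 3)])
print(stats)

# N = 10000 is omitted here only because the all-pairs computation is slow.
lengths = ba_average_path_lengths([10, 100, 1000])
print(lengths)
```

The two histograms can be passed to a log-log scatter plot for the degree distributions, and `lengths` plotted against a logarithmic N axis for the path-length relationship.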

Homework 8

1. Practice implementing UserCF and ItemCF in Python
1. Use graphlab to build a recommender system on the music data or the movie data
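
As a starting point for the first task, here is a minimal sketch of user-based collaborative filtering (UserCF) in plain Python. The interaction data is a hypothetical toy example, the similarity is the cosine overlap between the item sets of two users, and the function names `user_similarity` and `user_cf` are invented for this illustration. ItemCF follows the same pattern with the roles of users and items swapped.

```python
from math import sqrt

def user_similarity(ratings):
    """Cosine similarity between users, based on the items they share."""
    users = sorted(ratings)
    sim = {u: {} for u in users}
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            common = set(ratings[u]) & set(ratings[v])
            if common:
                s = len(common) / sqrt(len(ratings[u]) * len(ratings[v]))
                sim[u][v] = sim[v][u] = s
    return sim

def user_cf(user, ratings, sim, k=2, n=3):
    """Recommend up to n unseen items using the k most similar users."""
    seen = ratings[user]
    neighbours = sorted(sim[user], key=lambda v: (-sim[user][v], v))[:k]
    scores = {}
    for v in neighbours:
        for item, r in ratings[v].items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim[user][v] * r
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]

# Hypothetical toy data: user -> {item: interaction strength}.
ratings = {
    "A": {"a": 1, "b": 1, "d": 1},
    "B": {"a": 1, "c": 1},
    "C": {"b": 1, "e": 1},
    "D": {"c": 1, "d": 1, "e": 1},
}
sim = user_similarity(ratings)
print(user_cf("A", ratings, sim))
```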

Homework 7

Use graphlab for topic model analysis

Homework 6

Task 1: Use a different sklearn classifier to run sentiment analysis on tweet_negative2
Task 2: Run sentiment analysis on the Twitter data provided at https://github.com/victorneo/Twitter-Sentimental-Analysis ; you may use its code at https://github.com/victorneo/Twitter-Sentimental-Analysis/blob/master/classification.py
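
For Task 1, swapping classifiers in sklearn usually comes down to changing one estimator inside a pipeline. The sketch below assumes the data can be reduced to a list of texts and a parallel list of labels; the structure of tweet_negative2 is not described here, so the toy sentences are hypothetical placeholders. `LogisticRegression` is only one possible choice, and any other sklearn classifier with `fit` and `predict` can take its place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_sentiment_model(texts, labels):
    """Bag-of-words features followed by a linear classifier."""
    model = make_pipeline(
        CountVectorizer(),
        LogisticRegression(max_iter=1000),  # replace with another classifier
    )
    model.fit(texts, labels)
    return model

# Hypothetical toy data, not the course's tweet_negative2.
texts = [
    "what a great and happy day",
    "i love this movie",
    "this is awful and sad",
    "i hate waiting in line",
]
labels = ["pos", "pos", "neg", "neg"]

model = train_sentiment_model(texts, labels)
predictions = model.predict(["a happy movie", "sad and awful"])
print(predictions)
```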

Homework 5

Text mining of the Government Work Reports, in three parts: word segmentation, word clouds, and time series

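Homework 4

Download the ows-raw.txt data from Baidu Cloud
Follow the content of 06.data_cleaning_Tweets.ipynb
- Process the data by reading it in chunks
- Extract its retweet network
In the notebook, choose download as html, compress the result into a zip file, and submit it in an issue.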
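
The chunked-reading strategy and the retweet extraction can be sketched as follows. This is an illustration under assumptions, not the code from 06.data_cleaning_Tweets.ipynb: each line is assumed to look like `user,text`, which may not match the real ows-raw.txt layout, and retweets are detected with the conventional `RT @name` marker. The helpers `read_in_chunks` and `retweet_edges` are hypothetical names.

```python
import io
import re
from collections import Counter
from itertools import islice

RT_PATTERN = re.compile(r"RT @(\w+)")

def read_in_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size lines from an open file handle."""
    while True:
        chunk = list(islice(handle, chunk_size))
        if not chunk:
            return
        yield chunk

def retweet_edges(lines, sep=","):
    """Count (retweeter, retweeted_user) edges in a batch of lines.

    Assumes each line is 'user<sep>text'; adapt the parsing to the real file.
    """
    edges = Counter()
    for line in lines:
        user, _, text = line.rstrip("\n").partition(sep)
        for author in RT_PATTERN.findall(text):
            edges[(user, author)] += 1
    return edges

# Hypothetical sample standing in for ows-raw.txt.
sample = io.StringIO(
    "alice,RT @bob: We are the 99% #ows\n"
    "carol,Heading downtown\n"
    "dave,RT @bob: We are the 99% #ows\n"
    "alice,RT @erin: RT @bob: march at noon\n"
)

network = Counter()
for chunk in read_in_chunks(sample, chunk_size=2):
    network.update(retweet_edges(chunk))  # only one chunk is held at a time
print(network)
```

With the real file, `sample` would be replaced by `open("ows-raw.txt", encoding="utf-8")` and a much larger `chunk_size`; the resulting weighted edges can then be loaded into a directed graph.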

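Homework 3

Using the code in 04.PythonCrawler_beautifulsoup.ipynb, scrape the title, URL, rating, and number of ratings of the Douban Top 250 movies.
In the notebook, choose download as html, compress the result into a zip file, and submit it in an issue.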

Homework 2

Run the Python code in https://github.com/computational-class/bigdata/blob/gh-pages/code/03.python_intro.ipynb
- Start executing the code from %matplotlib inline

        %matplotlib inline
        import random, datetime
        import numpy as np
        import matplotlib.pyplot as plt
        import matplotlib
        import statsmodels.api as sm
        from scipy.stats import norm
        from scipy.stats import pearsonr

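Save the resulting .ipynb file as an html file;
Compress the html file into a zip and upload it to an issue as an attachment.
How to submit, by platform:
- Mac users: download as html, compress into a zip file, and submit
- Windows users: download as markdown, compress into a zip file, and submit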

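Homework 1

Download and install Anaconda Python, choosing a Python 3.x version
Practice using Jupyter Notebook
Register a GitHub account
Open an issue here to submit your homework: https://github.com/computational-class/bigdata/issues
- An introduction to markdown: https://en.wikipedia.org/wiki/Markdown
Introduce yourself: name, student ID, personal website, etc.

Note: all of the code can be browsed quickly here: http://nbviewer.jupyter.org/github/computational-class/bigdata/tree/gh-pages/code/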
