winter-2019
Course materials:
- Course page
- Video page
- Video page (Chinese)
Study notes (reference):
斯坦福CS224N深度学习自然语言处理2019冬学习笔记目录 (a Chinese index of study notes for this course)
Reference books:
- Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd ed. draft)
- Jacob Eisenstein. Natural Language Processing
- Yoav Goldberg. A Primer on Neural Network Models for Natural Language Processing
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning
Neural network fundamentals:
- Michael A. Nielsen. Neural Networks and Deep Learning
- Eugene Charniak. Introduction to Deep Learning
- The course (10 mins)
- Human language and word meaning (15 mins)
- Word2vec introduction (15 mins)
- Word2vec objective function gradients (25 mins)
- Optimization basics (5 mins)
- Looking at word vectors (10 mins or less)
Slides
- cs224n-2019-lecture01-wordvecs1
- WordNet, a thesaurus containing lists of synonym sets and hypernyms ("is a" relationships)
- In traditional NLP, words are treated as discrete symbols, and each word is represented by a one-hot vector
- In distributional semantics, a word's meaning is given by the words that frequently appear near it (its context); each word is represented by a vector, called a word embedding or word representation; these are distributed representations
- The idea behind Word2vec
- cs224n-2019-notes01-wordvecs1
- Natural Language Processing.
- Word Vectors.
- Singular Value Decomposition (SVD). (apply SVD to the co-occurrence count matrix to obtain word vectors)
- Word2Vec.
- Skip-gram. (predict the context words from the center word; a code sketch follows this list)
- Continuous Bag of Words (CBOW). (predict the center word from its context)
- Negative Sampling.
- Hierarchical Softmax.
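To make the Skip-gram and Negative Sampling items above concrete, here is a minimal NumPy sketch of a single skip-gram-with-negative-sampling (SGNS) update for one (center, context) pair. The vocabulary size, embedding dimension, toy noise counts, and learning rate are all assumptions for illustration; this is not the assignments' code.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                   # toy vocabulary size and embedding dimension
W_in = rng.normal(scale=0.1, size=(V, d))      # center-word vectors ("v")
W_out = rng.normal(scale=0.1, size=(V, d))     # outside/context-word vectors ("u")

counts = rng.integers(1, 100, size=V).astype(float)
p_noise = counts ** 0.75                       # unigram distribution raised to the 3/4 power
p_noise /= p_noise.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, k=5, lr=0.05):
    """One SGD update of J = -log sigma(u_o.v_c) - sum_k log sigma(-u_k.v_c)."""
    neg = rng.choice(V, size=k, p=p_noise)     # k negative samples from the noise distribution
    v, u_o, u_neg = W_in[center], W_out[context], W_out[neg]

    s_pos = sigmoid(u_o @ v)
    s_neg = sigmoid(u_neg @ v)

    grad_v = (s_pos - 1.0) * u_o + s_neg @ u_neg     # dJ/dv_c
    grad_u_o = (s_pos - 1.0) * v                     # dJ/du_o
    grad_u_neg = s_neg[:, None] * v[None, :]         # dJ/du_k for each negative sample

    W_in[center] -= lr * grad_v
    W_out[context] -= lr * grad_u_o
    np.subtract.at(W_out, neg, lr * grad_u_neg)      # handles repeated negative samples

sgns_step(center=3, context=7)
print(W_in[3])
```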
Suggested Readings
- Word2Vec Tutorial - The Skip-Gram Model (the post has two parts: the skip-gram idea, and improved training via subsampling and negative sampling)
- 理解 Word2Vec 之 Skip-Gram 模型 (a Chinese translation of the article above)
- Applying word2vec to Recommenders and Advertising (word2vec applied to recommendation and advertising)
- Efficient Estimation of Word Representations in Vector Space (original word2vec paper) (didn't fully follow it; worth a second read)
- Distributed Representations of Words and Phrases and their Compositionality (negative sampling paper)
Further reading
- [NLP] 秒懂词向量Word2vec的本质 (recommends several good resources)
- word2vec Parameter Learning Explained
- 基于神经网络的词和文档语义向量表示方法研究 (research on neural-network-based semantic vector representations of words and documents)
- word2vec中的数学原理详解 (a detailed explanation of the mathematics behind word2vec)
- 网易有道word2vec (word-vector models, with a walk-through of parts of the word2vec code and its tricks)
Assignment 1: Exploring Word Vectors
- Count-Based Word Vectors (building the co-occurrence matrix, SVD dimensionality reduction, visualization; a toy sketch follows this list)
- Prediction-Based Word Vectors (Word2Vec, comparison with SVD, using gensim, synonyms, antonyms, analogies, bias)
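A toy version of the count-based pipeline in Part 1: build a window-based co-occurrence matrix and truncate its SVD to get low-dimensional word vectors. The two-sentence corpus, window size, and dimensionality are made-up assumptions; the assignment's own helper functions differ.

```python
import numpy as np

corpus = [["all", "that", "glitters", "is", "not", "gold"],
          ["all", "is", "well", "that", "ends", "well"]]
window = 2

words = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(words)}

# Window-based co-occurrence counts
M = np.zeros((len(words), len(words)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the top-k singular directions as word vectors
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]          # each row is a k-dimensional word vector
print("gold ->", word_vectors[idx["gold"]])
```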
Notes
- word2vec: the idea, a step-by-step breakdown of the algorithm, and code
- Finish looking at word vectors and word2vec (12 mins)
- Optimization basics (8 mins)
- Can we capture this essence more effectively by counting? (15m)
- The GloVe model of word vectors (10 min)
- Evaluating word vectors (15 mins)
- Word senses (5 mins)
Slides
- Gensim word vector visualization [code] [[preview](https://web.stanford.edu/class/cs224n/materials/Gensim word vector visualization.html)]
- cs224n-2019-lecture02-wordvecs2
- Review of word2vec (a word's vector is one row of the matrix; the predicted probability distribution is the same for every position in the context window; every word co-occurs with high probability with frequent words such as "and" and "of")
- Optimization: gradient descent, stochastic gradient descent (SGD), mini-batches (32 or 64 examples, which reduce noise and speed up computation); only the vectors (specific rows) of the words that actually appear are updated each step
- Why two vectors per word? The math is simpler (the center word and the context word are treated separately), and at the end the two vectors are averaged; using a single vector per word also works
- The two word2vec models, Skip-gram (SG) and Continuous Bag of Words (CBOW), plus the negative-sampling trick and the sampling-distribution trick (the unigram distribution raised to the 3/4 power)
- Why not use the co-occurrence count matrix directly? It grows with the vocabulary, its high dimensionality takes a lot of storage, and downstream classifiers run into sparsity; the remedy is to reduce to a fixed number of dimensions that keep the important information, i.e. run SVD. On raw counts this rarely works well, but with various hacks to the matrix X it has seen real use in some fields (several used in Rohde et al. 2005)
- Count based vs. direct prediction
- GloVe: combines the ideas of the two camps by using count statistics inside a model trained like a neural network; ratios of co-occurrence probabilities can encode meaning components
- How to evaluate word vectors (intrinsic: word similarity, analogies, etc.; extrinsic: performance on a real task, e.g. named entity recognition)
- Polysemy: 1. cluster all the contexts of a word and split it into one vector per sense; 2. simply take a weighted average of the sense vectors, which works surprisingly well
- cs224n-2019-notes02-wordvecs2
- GloVe (a sketch of the objective follows below)
- Methods for evaluating word vectors
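To pin down the GloVe notes above, here is a minimal NumPy sketch of its weighted least-squares objective, J = Σ_ij f(X_ij)(w_i·w̃_j + b_i + b̃_j − log X_ij)², with one full-batch gradient step. x_max = 100 and α = 0.75 follow the paper; the toy co-occurrence matrix, sizes, and learning rate are assumptions (the real model trains with AdaGrad over the non-zero entries only).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 5
X = rng.integers(0, 50, size=(V, V)).astype(float)    # toy co-occurrence counts

W = rng.normal(scale=0.1, size=(V, d))                 # center-word vectors w_i
Wt = rng.normal(scale=0.1, size=(V, d))                # context-word vectors w~_j
b, bt = np.zeros(V), np.zeros(V)                       # biases b_i, b~_j

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function: caps the influence of very frequent pairs."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_step(lr=0.05):
    """One full-batch gradient step on the weighted least-squares objective."""
    nz = X > 0                                         # only non-zero counts contribute
    logX = np.log(X, out=np.zeros_like(X), where=nz)
    diff = (W @ Wt.T + b[:, None] + bt[None, :] - logX) * nz
    fw = f(X)
    loss = np.sum(fw * diff ** 2)

    g = 2.0 * fw * diff                                # shared factor of all gradients
    dW, dWt = g @ Wt, g.T @ W
    W[:] -= lr * dW
    Wt[:] -= lr * dWt
    b[:] -= lr * g.sum(axis=1)
    bt[:] -= lr * g.sum(axis=0)
    return loss

for step in range(3):
    print(glove_step())                                # the loss should go down over these steps
```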
Suggested Readings
- GloVe: Global Vectors for Word Representation (original GloVe paper)
- Improving Distributional Similarity with Lessons Learned from Word Embeddings
- Evaluation methods for unsupervised word embeddings
Additional Readings:
- A Latent Variable Model Approach to PMI-based Word Embeddings
- Linear Algebraic Structure of Word Senses, with Applications to Polysemy
- On the Dimensionality of Word Embedding.
Further reading
- 理解GloVe模型(+总结) (clear and detailed; explains the idea behind the GloVe model)
Python review [slides]
Review
GloVe: the idea, a step-by-step breakdown of the algorithm, and code
Methods for evaluating word vectors (see the gensim example below)
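A tiny intrinsic-evaluation example with gensim (the library used in Assignment 1): nearest neighbours, an analogy, and a similarity score. It assumes the pretrained "glove-wiki-gigaword-100" vectors from gensim's downloader, which are fetched on first use.

```python
import gensim.downloader as api

# Load pretrained GloVe vectors via gensim's downloader (a sizeable download on first use).
wv = api.load("glove-wiki-gigaword-100")

# Intrinsic evaluation 1: word similarity / nearest neighbours
print(wv.most_similar("frog", topn=5))

# Intrinsic evaluation 2: analogies — "king" - "man" + "woman" ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity between a pair of words
print(wv.similarity("coffee", "tea"))
```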
- Course information update (5 mins)
- Classification review/introduction (10 mins)
- Neural networks introduction (15 mins)
- Named Entity Recognition (5 mins)
- Binary true vs. corrupted word window classification (15 mins)
- Matrix calculus introduction (20 mins)
Slides
- cs224n-2019-lecture03-neuralnets
- Classification: sentiment classification, named entity recognition, buy/sell decisions, etc.; the softmax classifier and cross-entropy loss (a linear classifier)
- Neural network classifiers, and how classification with word vectors differs (the weight matrix and the word vectors are learned together, so there are many more parameters); introduction to neural networks
- Named Entity Recognition (NER): find the "names" in a text and classify them
- Classifying a word in context: how do we use the context? Concatenate the word's vector with the vectors of the words in its window (a sketch follows this list)
- For example, if the center word denotes a location in this context, the window gets a high score; otherwise a low one
- Gradients
- matrix calculus notes
- cs224n-2019-notes03-neuralnets
- Neural networks, the max-margin objective, backpropagation
- Tricks: gradient checking, regularization, dropout, activation functions, data preprocessing (mean subtraction, normalization, whitening), parameter initialization, learning-rate schedules, optimization strategies (momentum, adaptive methods)
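A minimal PyTorch sketch of the window-classification idea from the notes above: concatenate the word vectors in a window and classify the center word with a softmax/cross-entropy classifier. All sizes and the fake batch are toy assumptions, and this uses a plain multi-class softmax rather than the lecture's margin-based true-vs-corrupted-window variant.

```python
import torch
import torch.nn as nn

# Toy sizes: vocabulary, embedding dim, window of 2 words on each side, 5 NER classes.
V, d, window, n_classes = 1000, 50, 2, 5
width = 2 * window + 1

class WindowClassifier(nn.Module):
    """Concatenate the word vectors in a window and classify the center word."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, d)
        self.hidden = nn.Linear(width * d, 100)
        self.out = nn.Linear(100, n_classes)

    def forward(self, window_ids):           # (batch, width) word indices
        x = self.emb(window_ids).flatten(1)  # (batch, width*d) concatenated vectors
        return self.out(torch.relu(self.hidden(x)))   # unnormalized class scores

model = WindowClassifier()
loss_fn = nn.CrossEntropyLoss()              # softmax + cross-entropy in one op
opt = torch.optim.SGD(model.parameters(), lr=0.1)

ids = torch.randint(0, V, (32, width))       # a fake batch of windows
labels = torch.randint(0, n_classes, (32,))
loss = loss_fn(model(ids), labels)
loss.backward()
opt.step()
print(float(loss))
```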
Suggested Readings:
Additional Readings:
Assignment 2
Review
NER
Gradients
- Matrix gradients for our simple neural net and some tips [15 mins]
- Computation graphs and backpropagation [40 mins]
- Stuff you should know [15 mins]: a. Regularization to prevent overfitting; b. Vectorization; c. Nonlinearities; d. Initialization; e. Optimizers; f. Learning rates
Slides
- cs224n-2019-lecture04-backprop
- Breaking the gradient computation into pieces, some tips, and issues with using pre-trained word vectors
- Computation graphs express the forward and backward passes; each downstream gradient is the upstream gradient times the local gradient, via the chain rule (a worked example follows below)
- Regularization, vectorization, nonlinearities, initialization, optimizers, learning rates
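The "downstream gradient = upstream gradient × local gradient" rule from the computation-graph notes, written out by hand for a tiny one-hidden-layer scorer with a squared-error loss (the shapes and loss are toy assumptions). The last lines check one entry of dW against a numerical gradient, the gradient-checking trick mentioned earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # input (e.g. a concatenated word-vector window)
W = rng.normal(size=(4, 3)); b = np.zeros(4)
u = rng.normal(size=4)                 # output weights
target = 1.0

# Forward pass, saving intermediates for the backward pass
z = W @ x + b                          # (4,)
h = np.maximum(0, z)                   # ReLU
s = u @ h                              # scalar score
loss = 0.5 * (s - target) ** 2

# Backward pass: each local gradient is multiplied by the upstream gradient
ds = s - target                        # dL/ds
du = ds * h                            # dL/du
dh = ds * u                            # upstream gradient flowing into h
dz = dh * (z > 0)                      # ReLU local gradient
dW = np.outer(dz, x)                   # dL/dW
db = dz
dx = W.T @ dz                          # gradient flowing back to the input (word vectors)

# Check dW against a numerical gradient on one entry
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
loss2 = 0.5 * (u @ np.maximum(0, W2 @ x + b) - target) ** 2
print(dW[0, 0], (loss2 - loss) / eps)  # the two numbers should be close
```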
Suggested Readings:
- CS231n notes on network architectures
- Learning Representations by Backpropagating Errors
- Derivatives, Backpropagation, and Vectorization
- Yes you should understand backprop
- Syntactic Structure: Constituency and Dependency (25 mins)
- Dependency Grammar and Treebanks (15 mins)
- Transition-based dependency parsing (15 mins)
- Neural dependency parsing (15 mins)
cs224n-2019-lecture05-dep-parsing [scrawled-on slides]
- Phrase-structure grammar vs. dependency structure (a transition-based parsing sketch follows below)
cs224n-2019-notes04-dependencyparsing
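A minimal sketch of the arc-standard transition system (SHIFT / LEFT-ARC / RIGHT-ARC) behind transition-based dependency parsing; the example sentence and the hand-written transition sequence are assumptions. In the neural parser, a classifier over stack/buffer features predicts each transition instead of this fixed sequence.

```python
class PartialParse:
    """Minimal arc-standard transition system (SHIFT / LEFT-ARC / RIGHT-ARC)."""
    def __init__(self, sentence):
        self.stack = ["ROOT"]
        self.buffer = list(sentence)     # words still to process
        self.arcs = []                   # (head, dependent) pairs

    def step(self, transition):
        if transition == "S":                      # SHIFT: move next buffer word onto the stack
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":                   # LEFT-ARC: top of stack is head of the item below it
            dep = self.stack.pop(-2)
            self.arcs.append((self.stack[-1], dep))
        elif transition == "RA":                   # RIGHT-ARC: item below is head of the top of stack
            dep = self.stack.pop()
            self.arcs.append((self.stack[-1], dep))

sentence = ["I", "parsed", "this", "sentence"]
parse = PartialParse(sentence)
for t in ["S", "S", "LA", "S", "S", "LA", "RA", "RA"]:
    parse.step(t)
print(parse.arcs)   # [('parsed', 'I'), ('sentence', 'this'), ('parsed', 'sentence'), ('ROOT', 'parsed')]
```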
Suggested Readings:
- Incrementality in Deterministic Dependency Parsing
- A Fast and Accurate Dependency Parser using Neural Networks
- Dependency Parsing
- Globally Normalized Transition-Based Neural Networks
- Universal Stanford Dependencies: A cross-linguistic typology
- Universal Dependencies website
Assignment 3
Recurrent Neural Networks (RNNs) and why they’re great for Language Modeling (LM).
- Language models
- RNNs (a minimal sketch follows this list)
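A minimal PyTorch RNN language model, predicting the next word at every position and reporting the perplexity of the untrained model; the vocabulary size, layer sizes, and fake batch are toy assumptions.

```python
import torch
import torch.nn as nn

V, d, hidden = 5000, 64, 128            # toy vocabulary and layer sizes

class RNNLM(nn.Module):
    """Predict the next word at every position of the input sequence."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, d)
        self.rnn = nn.RNN(d, hidden, batch_first=True)
        self.out = nn.Linear(hidden, V)

    def forward(self, ids):                      # ids: (batch, seq_len)
        h, _ = self.rnn(self.emb(ids))           # (batch, seq_len, hidden)
        return self.out(h)                       # logits over the vocabulary

model = RNNLM()
ids = torch.randint(0, V, (8, 20))               # fake batch of word indices
logits = model(ids[:, :-1])                      # predict token t+1 from tokens <= t
loss = nn.CrossEntropyLoss()(logits.reshape(-1, V), ids[:, 1:].reshape(-1))
print(torch.exp(loss))                           # perplexity of the untrained model
```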
Suggested Readings:
- N-gram Language Models (textbook chapter)
- The Unreasonable Effectiveness of Recurrent Neural Networks (blog post overview)
- Sequence Modeling: Recurrent and Recursive Neural Nets (Sections 10.1 and 10.2)
- On Chomsky and the Two Cultures of Statistical Learning
- Problems with RNNs and how to fix them
- More complex RNN variants
cs224n-2019-lecture07-fancy-rnn
- Vanishing gradients
- LSTMs and GRUs (an LSTM-cell sketch follows below)
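A NumPy sketch of one LSTM-cell step showing the forget/input/output gates and the additive cell update that helps gradients flow; sizes are toy assumptions and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 3                                # input and hidden sizes (toy)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate, acting on the concatenation [h_{t-1}; x_t]; biases omitted.
Wf, Wi, Wo, Wc = (rng.normal(scale=0.1, size=(h, h + d)) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(Wf @ z)                    # forget gate: what to erase from the cell
    i = sigmoid(Wi @ z)                    # input gate: what new content to write
    o = sigmoid(Wo @ z)                    # output gate: what to expose as h_t
    c_tilde = np.tanh(Wc @ z)              # candidate new cell content
    c_t = f * c_prev + i * c_tilde         # additive update — this is what helps gradients flow
    h_t = o * np.tanh(c_t)
    return h_t, c_t

h_t, c_t = np.zeros(h), np.zeros(h)
for x_t in rng.normal(size=(5, d)):        # run over a toy sequence of length 5
    h_t, c_t = lstm_step(x_t, h_t, c_t)
print(h_t)
```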
Suggested Readings:
- Sequence Modeling: Recurrent and Recursive Neural Nets (Sections 10.3, 10.5, 10.7-10.12)
- Learning long-term dependencies with gradient descent is difficult (one of the original vanishing gradient papers)
- On the difficulty of training Recurrent Neural Networks (proof of vanishing gradient problem)
- Vanishing Gradients Jupyter Notebook (demo for feedforward networks)
- Understanding LSTM Networks (blog post overview)
Assignment 4
[code] [handout] [Azure Guide] [Practical Guide to VMs]
How we can do Neural Machine Translation (NMT) using an RNN based architecture called sequence to sequence with attention
- Machine translation:
- 1. 1950s: early systems were rule-based, translating with dictionaries;
- 2. 1990s-2010s: statistical machine translation (SMT), which learns a statistical model from data using Bayes' rule and weighs translation fidelity against fluency. Alignments can be one-to-many, many-to-one, or many-to-many.
- 3. 2014 onward: neural machine translation (NMT) with sequence-to-sequence models, i.e. two RNNs. Other seq2seq tasks: summarization (long text to short text), dialogue, parsing, code generation (natural language to code). Greedy decoding vs. beam-search decoding.
- Evaluation: BLEU (Bilingual Evaluation Understudy)
- Open problems: out-of-vocabulary words, domain mismatch, maintaining context over long texts, low-resource language pairs, no common-sense knowledge, biases picked up from the training data, uninterpretable translations
- Attention (a sketch follows below)
cs224n-2019-notes06-NMT_seq2seq_attention
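A NumPy sketch of one step of dot-product attention as described in the lecture: score each encoder hidden state against the current decoder state, softmax the scores, and take a weighted sum. The encoder states and decoder state here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
T, h = 6, 8                          # source length and hidden size (toy)
enc_states = rng.normal(size=(T, h)) # encoder hidden states h_1..h_T
dec_state = rng.normal(size=h)       # current decoder hidden state s_t

# 1. Attention scores: dot product between the decoder state and each encoder state
scores = enc_states @ dec_state                      # (T,)

# 2. Attention distribution: softmax over source positions
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# 3. Attention output: weighted sum of the encoder states
context = alpha @ enc_states                         # (h,)

# 4. Concatenate with the decoder state before predicting the next word
attn_hidden = np.concatenate([context, dec_state])   # fed into the output layer
print(alpha.round(2), attn_hidden.shape)
```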
Suggested Readings:
- Statistical Machine Translation slides, CS224n 2015 (lectures 2/3/4)
- Statistical Machine Translation (book by Philipp Koehn)
- BLEU (original paper)
- Sequence to Sequence Learning with Neural Networks (original seq2seq NMT paper)
- Sequence Transduction with Recurrent Neural Networks (early seq2seq speech recognition paper)
- Neural Machine Translation by Jointly Learning to Align and Translate (original seq2seq+attention paper)
- Attention and Augmented Recurrent Neural Networks (blog post overview)
- Massive Exploration of Neural Machine Translation Architectures (practical advice for hyperparameter choices)
- Final project types and details; assessment revisited
- Finding research topics; a couple of examples
- Finding data
- Review of gated neural sequence models
- A couple of MT topics
- Doing your research
- Presenting your results and evaluation
cs224n-2019-lecture09-final-projects
- The default final project is a question-answering system on SQuAD
- Look at the ACL Anthology for NLP papers: https://aclanthology.info
- https://paperswithcode.com/sota
- Data:
- https://catalog.ldc.upenn.edu/
- http://statmt.org
- https://universaldependencies.org
- Look at Kaggle, research papers, lists of datasets
- https://machinelearningmastery.com/datasets-natural-language-processing/
- https://github.com/niderhoff/nlp-datasets
Suggested Readings:
- Practical Methodology (Deep Learning book chapter)
- Final final project notes, etc.
- Motivation/History
- The SQuAD dataset
- The Stanford Attentive Reader model
- BiDAF
- Recent, more advanced architectures
- ELMo and BERT preview
- Two components: finding the documents that might contain the answer (information retrieval), then finding the answer within a document or paragraph (reading comprehension)
- History of reading comprehension: 2013, MCTest: (passage, question) -> answer; 2015/16: the CNN/DM and SQuAD datasets
- History of open-domain question answering: 1964, dependency parsing and matching; 1993, online encyclopedias; 1999, the TREC QA track; 2011, IBM's DeepQA system; 2016, neural networks combined with information retrieval (IR)
- The SQuAD dataset and its evaluation metrics
- Stanford's simple model, the Stanford Attentive Reader: predict the start and end positions of the answer span (a sketch follows this list)
- BiDAF
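A toy sketch of the span-prediction idea behind the Stanford Attentive Reader: bilinear attention between a question vector and passage token representations gives start and end distributions. The tensors and the 15-token answer-length cap are assumptions; the real model builds these representations with BiLSTMs and trains the logits with cross-entropy.

```python
import torch

torch.manual_seed(0)
T, h = 100, 128                          # passage length and hidden size (toy)
P = torch.randn(T, h)                    # passage token representations (e.g. BiLSTM outputs)
q = torch.randn(h)                       # single question vector
W_start = torch.randn(h, h)              # bilinear weights for the start position
W_end = torch.randn(h, h)                # bilinear weights for the end position

# p(start = i) is proportional to exp(P_i . W_start . q), and likewise for the end position
start_logits = P @ (W_start @ q)         # (T,)
end_logits = P @ (W_end @ q)

start = int(start_logits.argmax())
# restrict the end to fall at or after the start, within a maximum answer length
end = start + int(end_logits[start:start + 15].argmax())
print(start, end)
```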
Project Proposal
Default Final Project
- Announcements (5 mins)
- Intro to CNNs (20 mins)
- Simple CNN for Sentence Classification: Yoon (2014) (20 mins)
- CNN potpourri (5 mins)
- Deep CNN for Sentence Classification: Conneau et al. (2017) (10 mins)
- Quasi-recurrent Neural Networks (10 mins)
cs224n-2019-lecture11-convnets
- CNN
- Sentence classification (a sketch follows this list)
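A minimal PyTorch sketch in the style of Kim (2014): 1-D convolutions of several widths over word embeddings, max-over-time pooling, and a softmax classifier. Sizes, filter widths, and the fake batch are toy assumptions.

```python
import torch
import torch.nn as nn

V, d, n_classes = 5000, 100, 2           # toy sizes

class TextCNN(nn.Module):
    """1-D convolutions over word embeddings + max-over-time pooling."""
    def __init__(self, filter_sizes=(3, 4, 5), n_filters=100):
        super().__init__()
        self.emb = nn.Embedding(V, d)
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, n_filters, kernel_size=k) for k in filter_sizes])
        self.out = nn.Linear(n_filters * len(filter_sizes), n_classes)

    def forward(self, ids):                          # (batch, seq_len)
        x = self.emb(ids).transpose(1, 2)            # (batch, d, seq_len) for Conv1d
        pooled = [conv(x).relu().max(dim=2).values   # max-over-time pooling per filter size
                  for conv in self.convs]
        return self.out(torch.cat(pooled, dim=1))    # class logits

model = TextCNN()
ids = torch.randint(0, V, (4, 30))                   # fake batch of sentences
print(model(ids).shape)                              # torch.Size([4, 2])
```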
Suggested Readings:
- Convolutional Neural Networks for Sentence Classification
- A Convolutional Neural Network for Modelling Sentences
- A tiny bit of linguistics (10 mins)
- Purely character-level models (10 mins)
- Subword-models: Byte Pair Encoding and friends (20 mins)
- Hybrid character and word level models (30 mins)
- fastText (5 mins)
cs224n-2019-lecture12-subwords
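A toy sketch of byte pair encoding as used for subword vocabularies (in the style of Sennrich et al.): repeatedly merge the most frequent adjacent symbol pair in a word-frequency table. The little corpus and the number of merges are assumptions.

```python
from collections import Counter

# Toy "corpus": word frequencies, with each word as a tuple of symbols ending in </w>
vocab = Counter({("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2,
                 ("n", "e", "w", "e", "s", "t", "</w>"): 6, ("w", "i", "d", "e", "s", "t", "</w>"): 3})

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs, weighted by word frequency, and return the most common."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    """Rewrite every word with the chosen pair merged into a single symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(5):                       # learn 5 merges; each merge adds one subword symbol
    pair = most_frequent_pair(vocab)
    vocab = merge(vocab, pair)
    print("merge:", pair)
```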
Suggested readings:
- Minh-Thang Luong and Christopher Manning. Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models
Assignment 5
[original code (requires Stanford login) / public version] [handout]
Suggested readings:
- Smith, Noah A. Contextual Word Representations: A Contextual Introduction. (Published just in time for this lecture!)
- The Illustrated BERT, ELMo, and co.
Lecture 14: Transformers and Self-Attention For Generative Models (guest lecture by Ashish Vaswani and Anna Huang)
Suggested readings:
- Attention is all you need
- Image Transformer
- Music Transformer: Generating music with long-term structure
Project Milestone
Suggested Readings:
Final project poster session [details]
Final Project Report due [instructions]
Project Poster/Video due [instructions]