source: NLP || Christopher Manning || Stanford, L43 ~ L46
Information Extraction (IE):
- Find and understand limited relevant parts of texts.
- Gather information from many pieces of text.
- Produce a structured representation of the relevant information.
The goals of IE:
- Organize information so that it is useful to people.
- Put information in a semantically precise form that allows further inferences to be made by computer algorithms.
Apple Mail knows there is a "date" in a message, so it recommends creating a new calendar event.
It's easy: just use regular expressions and name lists.
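A minimal sketch of what "regular expressions and name lists" can look like; the date pattern and the name list below are made up for illustration:

```python
import re

# Toy date pattern: matches e.g. "10/03/2024" or "Oct 3" (illustrative only).
DATE_PATTERN = re.compile(
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"
    r"|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}\b"
)
KNOWN_PEOPLE = {"Christopher Manning"}  # a fixed name list

text = "Meet Christopher Manning on Oct 3 or 10/03/2024."
print(DATE_PATTERN.findall(text))              # ['Oct 3', '10/03/2024']
print([p for p in KNOWN_PEOPLE if p in text])  # ['Christopher Manning']
```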
Google knows your query is a location when you search for it.
NER: find the entities and classify what they are (person / date / location / organization)!
If you have a good NER system, you'll do a good job in question answering (questions are always asking who did what, where, and when).
ORG: organization, PER: person, O: outside any entity.
(In the example figure, the left column is the token (word) and the right column is the NER label (entity type).)
You may think NER is a token (word) classification task, but we are interested in entities, so the standard evaluation is per entity, not per token.
We cannot directly use per-token classification metrics for the NER task; they lead to weird situations.
So our y labels should contain the entity boundaries.
Ground truth: ORG over tokens 1-4 ("First Bank of Chicago"); prediction: ORG over tokens 2-4 ("Bank of Chicago").
Because "First" is not included in the prediction, the gold entity counts as a false negative, and because "Bank of Chicago" is not an exact match, the predicted entity counts as a false positive!
So under this strict scoring, selecting nothing would have been better.
Otherwise you need to pick other metrics that give partial credit.
So in common NER tasks we use the entity-level F1 score to measure performance (a partial-credit scorer like the MUC scorer can be complex and not straightforward).
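A minimal sketch of strict entity-level F1, representing each entity as a (type, start, end) tuple so that only exact boundary-and-type matches count; the "First Bank of Chicago" example above becomes both a false negative and a false positive:

```python
# Each entity is (type, start_token, end_token); only exact matches count.
def entity_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                     # exact matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Gold "First Bank of Chicago" (tokens 1-4) vs predicted "Bank of Chicago" (2-4):
# the gold span is missed (false negative) AND the prediction is wrong (false positive).
print(entity_f1([("ORG", 1, 4)], [("ORG", 2, 4)]))  # 0.0
# Predicting nothing scores no worse here, and avoids the extra false positive.
print(entity_f1([("ORG", 1, 4)], []))               # 0.0
```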
If your entities are too hard to capture with rules (unlike easy things such as dates, times, or fixed name lists), then you can take an ML approach.
Labelling
There is a problem with the IO encoding labelling method.
Sue is one person's name, and Mengqiu Huang is another person's name.
But this labelling tells the machine there is a single person called Sue Mengqiu Huang.
So we have another labelling method.
B: beginning, I: inside, O: outside.
Labelling this way separates adjacent entities of the same class, but it comes at a bit of a cost.
If we have C entity classes, we'll have C + 1 labels with IO encoding but 2C + 1 labels with IOB encoding.
In this case, with IO encoding prediction runs faster, but it is less accurate.
This course will use IO encoding: different entities of the same class being adjacent is very rare, and IO encoding runs faster.
An interesting thing: in practice, even if we use IOB encoding, the system still gets this wrong, because adjacency is too rare for the model to learn. So you can find IOB-encoded systems that still output B-PER, I-PER, I-PER for Sue Mengqiu Huang (even though it's two people).
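A small sketch of the two encodings on this example (the surrounding tokens are made up):

```python
# IO vs IOB tags for the same token sequence.
tokens   = ["Sue",   "Mengqiu", "Huang", "flew", "to", "Boston"]
io_tags  = ["PER",   "PER",     "PER",   "O",    "O",  "LOC"]    # the two people merge into one
iob_tags = ["B-PER", "B-PER",   "I-PER", "O",    "O",  "B-LOC"]  # the second B-PER starts a new person
```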
POS tagging (part-of-speech tagging)
Word shape is a kind of regular-expression-like abstraction of a token; it's a powerful feature.
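A minimal sketch of one common word-shape variant: map uppercase letters to X, lowercase to x, digits to d, and collapse repeated runs (the collapsing rule differs across systems):

```python
import re

def word_shape(token: str) -> str:
    shape = re.sub(r"[A-Z]", "X", token)    # uppercase -> X
    shape = re.sub(r"[a-z]", "x", shape)    # lowercase -> x
    shape = re.sub(r"[0-9]", "d", shape)    # digits    -> d
    return re.sub(r"(.)\1+", r"\1", shape)  # collapse repeated runs

print(word_shape("Varicella-zoster"))  # Xx-x
print(word_shape("CPA1"))              # Xd
print(word_shape("mRNA"))              # xX
```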
Maximum Entropy Markov Models (MEMMs), also called Conditional Markov Models.
Word segmentation -> splitting running text into words (needed for languages like Chinese that have no spaces).
Text segmentation -> which part is a question and which part is an answer? (Q and A as labels)
DT: determiner; NNP: proper noun, singular; VBD: verb, past tense.
A larger space of sequences is usually explored via search.
Inference in sequence models:
Make a decision at each point based on conditional evidence (from the observations and the previous decisions).
But we can't condition on all of the sequence data (it would be too big), so we use local (smaller) context to make each decision.
In the figure above, we have a greedy-search inference model (always pick the best decision based on a small window of features and previous decisions).
The greedy search approach works well in practice, but it does not make optimal decisions.
Sometimes a decision depends on a previous word whose signal was weak at the previous decision point, and greedy search can never go back to fix it.
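A minimal sketch of greedy left-to-right decoding, assuming a hypothetical local scorer score(prev_label, obs) that returns a dict of log scores per label:

```python
def greedy_decode(score, observations):
    # score(prev_label, obs) -> {label: log score} is a hypothetical local scorer.
    prev, path = "<s>", []
    for obs in observations:
        scores = score(prev, obs)
        prev = max(scores, key=scores.get)  # commit to the locally best label
        path.append(prev)                   # ...and never revisit it
    return path
```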
Then we have another search method: beam search.
Instead of keeping only the top-1 most likely hypothesis at each position, we keep the top k most likely hypotheses.
In practice, k = 3-5 helps a lot (but not in every case).
Beam search is still not globally optimal, but it is a useful approximate search method.
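A minimal sketch of beam search with the same hypothetical scorer interface as the greedy sketch above (k = 1 reduces to greedy decoding):

```python
def beam_search(score, observations, k=3):
    # Each beam entry is (partial label sequence, cumulative log score).
    beam = [([], 0.0)]
    for obs in observations:
        candidates = []
        for seq, total in beam:
            prev = seq[-1] if seq else "<s>"
            for label, s in score(prev, obs).items():
                candidates.append((seq + [label], total + s))
        # Keep only the k best partial sequences at each position.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beam[0][0]  # best sequence found: approximate, not guaranteed optimal
```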
With dynamic programming (the Viterbi algorithm), we can actually find the sequence of states that has the globally highest score in the model.
https://ithelp.ithome.com.tw/articles/10208587
Inference of CTC https://www.ycc.idv.tw/crnn-ctc.html
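A minimal sketch of that global decoding as a Viterbi dynamic program, again assuming the hypothetical scorer interface used in the sketches above:

```python
def viterbi(score, observations, labels):
    # best[label] = (best total log score ending in label, the path achieving it)
    first = score("<s>", observations[0])
    best = {lab: (first[lab], [lab]) for lab in labels}
    for obs in observations[1:]:
        new_best = {}
        for lab in labels:
            # Pick the best previous label to transition from, over full prefixes.
            prev = max(labels, key=lambda p: best[p][0] + score(p, obs)[lab])
            total, path = best[prev]
            new_best[lab] = (total + score(prev, obs)[lab], path + [lab])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]  # globally best label sequence
```

Note this is only exact when the scorer conditions on just the previous label (a Markov assumption); with wider conditioning the search space blows up, which is why approximate beam search gets used instead.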
Nowadays, we use attention-based neural networks to capture long-distance word interactions, which the methods above cannot handle very well.
POS tagging: part-of-speech tagging. Chunking: grouping words into short phrases (short verb phrases, short noun phrases, etc.); it can be done on top of POS tagging.
Task:
- Identify AML news (document classification).
- Extract the AML name list from the news (NER).
Language: Chinese
Dataset: news URL, contain_aml, aml_name_list
Models used:
Identify AML news (AML classifier):
- jieba + CountVectorizer / TF-IDF + Multinomial Naive Bayes (sketched after the scores below)
- BERT Chinese embedding + BiLSTM_CRF
- BERT Chinese embedding + BiLSTM_CRF + rule-based name list
BERT Chinese + BiLSTM_CRF: F1 score 0.728; keyword list alone: 0.746
Combined approach: F1 score 0.92
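A minimal sketch of the jieba + TF-IDF + Multinomial Naive Bayes baseline; texts and labels here are placeholder data standing in for the article bodies and the contain_aml column:

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["某公司涉嫌洗錢遭起訴", "今日天氣晴朗"]  # placeholder documents
labels = [1, 0]                                    # placeholder contain_aml labels

# jieba segments the Chinese text; TF-IDF vectorizes; Naive Bayes classifies.
model = make_pipeline(
    TfidfVectorizer(analyzer=jieba.lcut),  # callable analyzer: raw doc -> token list
    MultinomialNB(),
)
model.fit(texts, labels)
predictions = model.predict(texts)
```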
AML NER model:
- BERT Chinese at the sentence level
- BIO encoding (among the variants BIO, BIOES, IOB, BILOU, ...)
- Post-processing (tabu list; sketched below)
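A minimal sketch of the tabu-list post-processing idea: drop predicted names that appear on a hand-built block list of known false positives (the list contents here are hypothetical):

```python
# Hypothetical tabu list: strings the NER model tends to mislabel as names.
TABU_LIST = {"記者", "發言人"}  # e.g. role words like "reporter", "spokesperson"

def filter_names(predicted_names):
    # Keep only predictions that are not on the tabu list.
    return [name for name in predicted_names if name not in TABU_LIST]
```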
Highlights:
- Chinese document data cleaning
- Hybrid rule-based and ML-based approach
https://github.com/YLTsai0609/bert_ner
https://github.com/GitYCC/bert-minimal-tutorial/blob/master/notebooks/chinese_ner.ipynb