A basic problem in Natural Language Processing is part-of-speech tagging, in which the goal is to mark every word in a sentence with its part of speech (noun, verb, adjective, etc.). Sometimes this is easy: a sentence like “Blueberries are blue” clearly consists of a noun, verb, and adjective, since each of these words has only one possible part of speech (e.g., “blueberries” is a noun but can’t be a verb). But in general, one has to look at all the words in a sentence to figure out the part of speech of any individual word. For example, consider the — grammatically correct! — sentence: “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.” To figure out what it means, we can parse its parts of speech:
Each line consists of a sentence, and each word is followed by one of 12 part-of-speech tags: ADJ (adjective), ADV (adverb), ADP (adposition), CONJ (conjunction), DET (determiner), NOUN, NUM (number), PRON (pronoun), PRT (particle), VERB, X (foreign word), and . (punctuation mark).
We use three different approaches to solve the problem: Simplified (Naive Bayes), HMM (using Viterbi), and MCMC (using Gibbs sampling).
In the Simplified version, we estimate the most probable tag for a particular word using Bayes' rule. It uses the Bayes net below and estimates the tag with the formula

$$s_i^* = \arg\max_{s_i} P(S_i = s_i \mid W_i),$$

where $s_i^*$ is the most probable tag for the given word $W_i$.
In this section we count the occurrences of each word appearing with a particular tag (bag_of_words in the code). We also count the number of times each tag appears in the training data (tag_cnt in the code). From these counts we calculate the probability of the word given the tag (word_prob in the code) as well as the probability of the tag over the whole corpus (tag_prob in the code).
We predict the tag with the highest probability for that word. This is a naive approach and does not take previous tags, words, or position into consideration.
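A minimal sketch of this counting approach is shown below. It assumes train_data is a list of (words, tags) sentence pairs; bag_of_words and tag_cnt mirror the names in the code, while the helper names are illustrative.

```python
from collections import defaultdict

def train_simple(train_data):
    """Count tag frequencies and word-with-tag frequencies from train_data,
    assumed to be a list of (words, tags) sentence pairs."""
    bag_of_words = defaultdict(lambda: defaultdict(int))  # word -> tag -> count
    tag_cnt = defaultdict(int)                            # tag -> count
    for words, tags in train_data:
        for word, tag in zip(words, tags):
            bag_of_words[word][tag] += 1
            tag_cnt[tag] += 1
    return bag_of_words, tag_cnt

def predict_simple(word, bag_of_words, tag_cnt):
    """Return the tag s maximizing P(s | word), proportional to P(word | s) * P(s)."""
    total_tags = sum(tag_cnt.values())
    best_tag = max(tag_cnt, key=tag_cnt.get)   # fall back to the most common tag
    best_prob = 0.0
    for tag, cnt in tag_cnt.items():
        word_prob = bag_of_words[word][tag] / cnt   # P(word | tag)
        tag_prob = cnt / total_tags                 # P(tag)
        if word_prob * tag_prob > best_prob:
            best_tag, best_prob = tag, word_prob * tag_prob
    return best_tag
```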
- For bc.test: Word Accuracy: 93.92%, Sentence Accuracy: 47.45%
- For bc.test.tiny: Word Accuracy: 97.62%, Sentence Accuracy: 66.67%
This is a good word accuracy, but it can still be improved by using the Viterbi algorithm in an HMM or by using MCMC.
In this part we compute the Maximum A Posteriori (MAP) tag sequence by incorporating dependencies between tags. It uses the following Bayes net, which models dependencies between two consecutive states.
It finds the most likely tag sequence using

$$(s_1^*, \ldots, s_N^*) = \arg\max_{s_1, \ldots, s_N} P(S_1 = s_1, \ldots, S_N = s_N \mid W_1, \ldots, W_N).$$
Along with the already-calculated emission probabilities, we also compute the initial probability (initial_prob), which is the probability of a tag starting a sentence, and the transition probability from one state to the next (trans). Using these values, we do a forward pass starting with the initial probability for state s0 and build a Viterbi table (V_table) whose columns are the words of the sentence and whose rows are all possible tags. At each forward step we keep, for every tag, the maximum value over the previous column and use it to generate the values for the next step.
Once we reach the last step, we do a backward pass: starting from the state with the highest MAP value for the last word, we follow the stored backpointers until we recover the entire sequence (viterbi_seq).
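A condensed sketch of the forward pass and backward trace is given below. V_table and viterbi_seq mirror the names in the code; initial_prob, trans, and emission are assumed to be nested probability dictionaries, with a small epsilon standing in for unseen events.

```python
def viterbi(words, tags, initial_prob, trans, emission, eps=1e-10):
    """Return the MAP tag sequence.  initial_prob[tag], trans[prev][cur] and
    emission[tag][word] are assumed to be (possibly sparse) probability dicts."""
    def em(tag, word):
        return emission.get(tag, {}).get(word, eps)

    def tr(prev, cur):
        return trans.get(prev, {}).get(cur, eps)

    # V_table[i][tag] = (best probability of any path ending in `tag` at word i, backpointer)
    V_table = [{tag: (initial_prob.get(tag, eps) * em(tag, words[0]), None) for tag in tags}]
    for i in range(1, len(words)):
        column = {}
        for tag in tags:
            best_prev = max(tags, key=lambda p: V_table[i - 1][p][0] * tr(p, tag))
            column[tag] = (V_table[i - 1][best_prev][0] * tr(best_prev, tag) * em(tag, words[i]),
                           best_prev)
        V_table.append(column)
    # Backward pass: start from the most probable final state and follow backpointers.
    state = max(tags, key=lambda t: V_table[-1][t][0])
    viterbi_seq = [state]
    for i in range(len(words) - 1, 0, -1):
        state = V_table[i][state][1]
        viterbi_seq.append(state)
    return viterbi_seq[::-1]
```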
- For bc.test: Word Accuracy: 94.86%, Sentence Accuracy: 52.90%
- For bc.test.tiny: Word Accuracy: 97.62%, Sentence Accuracy: 66.67%
This is the best approach among all three methods, giving the highest word and sentence accuracy.
In this method we use the Bayes net below. As this is not an HMM, we cannot use Viterbi, so we run Gibbs sampling instead. The idea behind Gibbs sampling is to repeat the same resampling logic N times, creating a new random sample each iteration; after enough iterations the sampler converges and gives us a good sample output. We have used the MCMC procedure below for this step.
As seen in the figure above, we need to compute additional probabilities for this method: each state is connected to two future states as well as two states in the past (trans2), and the current word is also connected to the previous state (word_count_prev_prob). The ideal approach would be to recompute the probabilities while changing only the current state value and keeping all other state values constant. This is the original idea behind Gibbs, but recomputing factors that do not change, over the entire sentence, N times, gets computationally heavy, so we skip the factors that do not depend on the current state. We have chosen a sample count of 150; usually thousands of iterations are required, but on the given datasets the sampler converges well within 150. All samples for a particular sentence are stored in a list of lists. Using those values, we find the tag occurring most often at each position and append it to the final prediction (predicted_list).
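A schematic of the sampling loop is sketched below. The exact factors in the full conditional depend on the Bayes net above, so conditional_weight is assumed to be a user-supplied function returning the product of only the Markov-blanket factors for position i; the per-position majority vote mirrors predicted_list in the code.

```python
import numpy as np
from collections import Counter

def gibbs_tagger(words, tags, conditional_weight, n_samples=150):
    """Gibbs sampling over tag sequences.  conditional_weight(i, tag, sample, words)
    is assumed to return an unnormalized weight for assigning `tag` to position i
    while all other positions are held fixed (only the Markov-blanket factors)."""
    sample = [tags[0]] * len(words)       # arbitrary initial assignment
    all_samples = []                      # list of lists, one entry per iteration
    for _ in range(n_samples):
        for i in range(len(words)):
            weights = np.array([conditional_weight(i, t, sample, words) for t in tags],
                               dtype=float)
            probs = weights / weights.sum()   # normalize so np.random.choice accepts them
            sample[i] = np.random.choice(tags, p=probs)
        all_samples.append(list(sample))
    # Majority vote per position over all stored samples (predicted_list).
    return [Counter(s[i] for s in all_samples).most_common(1)[0][0]
            for i in range(len(words))]
```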
- For bc.test: Word Accuracy: 93.92%, Sentence Accuracy: 47.45%
- For bc.test.tiny: Word Accuracy: 97.62%, Sentence Accuracy: 66.67%
a. In the first part (Simplified Bayes), there was not much of a challenge.
b. For the Viterbi sequence, we first tried to implement everything on our own and had some challenges with the backward tracking. We referred to the sample code given by Prof. Crandall in the in-class activity; following the same flow of logic, we managed to get a good output.
c. This was the most challenging part. Computing the transition and emission probabilities for each state was an issue, and calculating the probabilities of the chain at each position was another. The logic used if-else conditions on the length of the sentence, and for sentences of length 2 or 3 some conditions failed and gave errors, so the logic had to be tweaked. Another problem was using np.random.choice: as the probabilities did not add up to 1, the call failed, and we had to look up this issue and normalize the values (see References). The last part, taking the best value out of all samples for a particular word, was already covered in the B555 class and was easier to do.
a. Testing will only be done on data with the given 12 tags.
b. As the samples converge within 150 iterations for the given datasets, the sample count is set to that value to ensure the code does not time out. This factor can be changed.
The above results were executed on Silo for bc.test. Run time: 6.5 minutes.
The above results were executed on Silo for bc.test.tiny. Run time: 0.8 s.
- https://stackoverflow.com/questions/268272/getting-key-with-maximum-value-in-dictionary
- Viterbi training part referred from the in-class activity solution provided by Prof. Crandall.
- https://newbedev.com/np-random-choice-probabilities-do-not-sum-to-1 (for normalizing probabilities).
- Referred to the code given by Prof. Crandall in the in-class activity for Gibbs.
- Confirmed the logic of not calculating probabilities of redundant nodes with Prof. Crandall after class (e.g., s3 is only connected to s1, s2, w2, w3, w4, s5, s6; everything else is ignored).
- Read this for understanding concepts: https://webfiles.uci.edu/msteyver/publications/SteyversGriffithsLSABookFormatted.pdf
- Referred to the slides of the B555 Machine Learning course (Prof. Roni Khardon) for Sampling and MCMC.
Detecting air-ice and ice-rock boundaries from the given radar echogram.
We use three different approaches to solve the problem: a simplified Bayes net, an HMM, and an HMM with human feedback.
The structure of the Bayes net used for this approach is shown below.
Here the observed variables are the image columns and the states are the rows, but since there are no dependencies between states, state-to-state transitions are not modeled. Therefore, for the air-ice boundary, the emission probability in a given column is 1 for the pixel carrying the maximum normalized edge strength value in that column and 0 for all other pixels in the column. The same logic is applied for the ice-rock boundary, with the constraint that the ice-rock boundary is always below the air-ice boundary and the minimum distance between the two boundaries is 10 pixels.
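A minimal sketch of this per-column selection, assuming edge_strength is a 2-D numpy array (rows x columns) of normalized edge strength values:

```python
import numpy as np

def simple_boundaries(edge_strength, min_gap=10):
    """Per column, pick the row with the maximum edge strength as the air-ice
    boundary, then the strongest row at least `min_gap` pixels below it as the
    ice-rock boundary."""
    airice = np.argmax(edge_strength, axis=0)            # one row index per column
    icerock = np.empty_like(airice)
    for col in range(edge_strength.shape[1]):
        start = min(airice[col] + min_gap, edge_strength.shape[0] - 1)
        icerock[col] = start + np.argmax(edge_strength[start:, col])
    return airice, icerock
```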
The structure of the Bayes net used for this approach is shown below.
The observed variables and states are the same as above. The transition probabilities for a given state are shown below.
The assumption here is that, given that a pixel lies on the boundary in a given column, the boundary pixel in the next column is either in the same row as the previous one, one row above or below it, or two rows above or below it.
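One way to encode this assumption is a small table of row-offset probabilities with most of the mass on staying in the same row; the weights below are illustrative, not the exact values used in the code.

```python
# Probability of the boundary row moving by `offset` between adjacent columns.
# Offsets outside this set get probability 0.  Weights are illustrative.
ROW_TRANSITIONS = {-2: 0.05, -1: 0.2, 0: 0.5, 1: 0.2, 2: 0.05}

def transition_prob(prev_row, next_row):
    return ROW_TRANSITIONS.get(next_row - prev_row, 0.0)
```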
Since there are dependencies between states, the emission probability for the air-ice boundary is the scaled edge strength, where the scaling is done such that all the emission probabilities in a given column sum to 1. The same logic is applied for the ice-rock boundary, with the constraint that the emission probabilities for the states from the top of the image down to 9 pixels below the air-ice boundary are zero.
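A hedged sketch of how these emission probabilities can be built: each column of edge strengths is normalized to sum to 1, and for the ice-rock pass everything from the top of the image down to 9 pixels below the air-ice boundary is zeroed first.

```python
import numpy as np

def boundary_emissions(edge_strength, airice=None, min_gap=10):
    """Column-normalized edge strengths used as emission probabilities.  For the
    ice-rock pass, pass `airice` (one row per column): rows from the top of the
    image down to 9 pixels below the air-ice boundary are zeroed before normalizing."""
    em = edge_strength.astype(float).copy()
    if airice is not None:
        for col, row in enumerate(airice):
            em[:row + min_gap, col] = 0.0
    col_sums = em.sum(axis=0)
    col_sums[col_sums == 0] = 1.0        # guard against all-zero columns
    return em / col_sums
```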
In this approach, for each of the boundaries, a point that lies on that boundary is passed as input. The previous approach is followed with minor differences. First, for the transition probability: if the next state is in the column containing the given point, the only transition possible is from a state in the previous column to the state (row) given in the input. Similarly, the emission probability is 1 for the given pixel and 0 for all other pixels in the same column.
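A small sketch of how the human-provided point can be folded into the same machinery: the emission mass in that column is placed entirely on the given row, which also forces the path through it (the row/column convention here is illustrative).

```python
def apply_human_point(em, point_row, point_col):
    """Force the boundary through (point_row, point_col): emission probability 1
    for that pixel and 0 for every other row in that column."""
    em = em.copy()
    em[:, point_col] = 0.0
    em[point_row, point_col] = 1.0
    return em
```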
Since the probabilities are very small, we faced an underflow problem. To overcome it, the Viterbi table is scaled after calculating the values of all states for a given observation; the scaling is done such that the values of all states for a given observation sum to one.
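A sketch of this rescaling step, applied after the Viterbi values for one observation (column) have been computed; since every entry in the column is divided by the same factor, the argmax and the recovered path are unchanged.

```python
import numpy as np

def rescale_column(v_column):
    """Rescale one column of the Viterbi table (a numpy vector over states) so
    its entries sum to 1, preventing the probabilities from underflowing to
    zero.  The common factor cancels in the argmax, so the path is unchanged."""
    total = v_column.sum()
    return v_column / total if total > 0 else v_column
```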
Below are a few images showing the results.
For image 31.png with the human input for air_ice (23,22) and ice-rock (5,65).
Air-Ice boundary:
Ice-Rock boundary:
For image 16.png with the human input for air_ice (21,22) and ice-rock (99,41).
Air-Ice boundary:
Ice-Rock boundary:
Our goal is to extract text from a noisy scanned image of a document using the versatility of HMMs. But the images are noisy, so any particular letter may be difficult to recognize. However, if we make the assumption that these images have English words and sentences, we can use statistical properties of the language to resolve ambiguities, just like in Part 2.
The data used in this part are images with some text in them.
- The train image, which contains the letters with no noise: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789(),.-!?"' '
- There are 20 test images containing different text with added noise, for example test_17_0.png, with text such as 'It is so ordered.' A sketch of how these images can be sliced into character windows follows below.
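This sketch assumes PIL is available, a window of width 14 and height 25 pixels (matching the assumption noted at the end of this report), and the '*'/' ' pixel encoding used in the comparisons described later.

```python
from PIL import Image

CHARACTER_WIDTH, CHARACTER_HEIGHT = 14, 25   # assumed window size (see Assumptions)

def load_letters(filename):
    """Slice a scanned line of text into per-character grids of '*' (dark)
    and ' ' (light) pixels, one grid per fixed-width window."""
    img = Image.open(filename).convert("L")  # grayscale
    px = img.load()
    letters = []
    for left in range(0, img.size[0] // CHARACTER_WIDTH * CHARACTER_WIDTH, CHARACTER_WIDTH):
        grid = [["*" if px[left + x, y] < 128 else " " for x in range(CHARACTER_WIDTH)]
                for y in range(CHARACTER_HEIGHT)]
        letters.append(grid)
    return letters
```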
This is an optical character recognition problem. We are using the following algorithmic approaches to extract the text present in the image.
- Simplified (Naive Bayes)
- Hidden Markov Model (using the Viterbi algorithm)
Using the simplified (Naive Bayes) version, the most probable character for each sub-image is estimated through a head-on sub-image comparison.
The emission probability is estimated by comparing each test character's pixel matrix with the pixel matrix of each train character and returning the degree of similarity between them.
Four counts are maintained to extract the most information from the comparison:
- matched_star: count of pixels that are '*' in both the train and test sub-images.
- matched_blank: count of pixels that are ' ' in both the train and test sub-images.
- matched_star_blank: count of pixels that are '*' in the train sub-image but ' ' at the same coordinate in the test sub-image.
- matched_blank_star: count of pixels that are ' ' in the train sub-image but '*' at the same coordinate in the test sub-image.
- matched_star_blank gives an idea of the information missed by the test image, which can prove vital in prediction.
- matched_blank_star gives an idea of the noise in the image: the pixel value in the train sub-image is ' ' but the pixel at the same coordinate in the test sub-image is '*', which ideally should not be the case for the same character.
Using this, we can account for noise in the image as well: the emission probability for each character is then obtained from the matching and non-matching pixel counts, roughly as $(1 - m)^{\text{matched}} \cdot m^{\text{mismatched}}$ (a sum of the two terms in log space), where $m$ is the assumed noise level.
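A sketch of the four-way pixel comparison described above; train_grid and test_grid are assumed to be equally sized grids of '*' and ' ' characters, and the key names follow the weights dictionary shown below (mismatch_star_blank / mismatch_blank_star correspond to matched_star_blank / matched_blank_star in the list above).

```python
def compare_characters(train_grid, test_grid):
    """Count the four pixel-match categories between a train character window
    and a test character window."""
    counts = {"matched_star": 0, "matched_blank": 0,
              "mismatch_star_blank": 0, "mismatch_blank_star": 0}
    for train_row, test_row in zip(train_grid, test_grid):
        for train_px, test_px in zip(train_row, test_row):
            if train_px == "*" and test_px == "*":
                counts["matched_star"] += 1
            elif train_px == " " and test_px == " ":
                counts["matched_blank"] += 1
            elif train_px == "*" and test_px == " ":
                counts["mismatch_star_blank"] += 1   # detail missing in the test image
            else:
                counts["mismatch_blank_star"] += 1   # noise pixel in the test image
    return counts
```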
The noise estimation of the image was done using different functions referred from : https://stackoverflow.com/questions/2440504/noise-estimation-noise-measurement-in-image
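A sketch of one such estimator (Immerkaer's fast noise-variance method, which the Stack Overflow thread above discusses); scipy is assumed to be available.

```python
import numpy as np
from scipy.signal import convolve2d

def estimate_noise(gray):
    """Fast noise standard-deviation estimate (Immerkaer's method).
    `gray` is a 2-D grayscale numpy array."""
    H, W = gray.shape
    M = np.array([[1, -2, 1],
                  [-2, 4, -2],
                  [1, -2, 1]])
    sigma = np.sum(np.abs(convolve2d(gray, M)))
    return sigma * np.sqrt(0.5 * np.pi) / (6 * (W - 2) * (H - 2))
```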
The noise for the test images was approximately 11% on average. The assumption taken here is that 10% of the pixels in the test images are noisy. Additionally, the weights for matched_star, matched_blank, matched_star_blank, and matched_blank_star have been calibrated based on intuition and experimentation. The emission probability is calculated after applying Laplace smoothing (alpha = 1).
Weight assumption:

```python
weights = {
    'matched_star': 0.9,
    'matched_blank': 0.32,
    'mismatch_star_blank': 1.3,
    'mismatch_blank_star': -0.5,
}
```
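A hedged sketch of one plausible way to combine these weights with the pixel counts into an emission score, where NOISE is the assumed 10% noise level; the exact combination used in the code may differ.

```python
import math

NOISE = 0.10   # assumed fraction of noisy pixels (see above)
ALPHA = 1      # Laplace smoothing constant

def emission_log_prob(counts, weights):
    """One plausible combination: matching pixels are scored against (1 - NOISE)
    and mismatching pixels against NOISE, each count scaled by its calibrated
    weight and smoothed by ALPHA."""
    matched = (weights['matched_star'] * counts['matched_star']
               + weights['matched_blank'] * counts['matched_blank'] + ALPHA)
    mismatched = (weights['mismatch_star_blank'] * counts['mismatch_star_blank']
                  + weights['mismatch_blank_star'] * counts['mismatch_blank_star'] + ALPHA)
    return matched * math.log(1 - NOISE) + mismatched * math.log(NOISE)
```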
The heatmap gives us an idea about the emission probability of the character as explained in the above section.
3.3.2 Hidden Markov Model (using Viterbi Algorithm)
In the Hidden Markov Model, as given in figure 1(b), the Viterbi algorithm is used to estimate the most likely character sequence with the help of transition probabilities and initial probabilities acquired from the train set. Using this information improves the model, as the prior information drives the prediction towards the true value.
It finds the most likely sequence using

$$(s_1^*, \ldots, s_N^*) = \arg\max_{s_1, \ldots, s_N} P(S_1 = s_1, \ldots, S_N = s_N \mid O_1, \ldots, O_N),$$

where the observations $O_i$ are the sub-images and the hidden states $S_i$ are the characters.
The graph gives us an idea of how frequently a word begins with a given character.
The heatmap gives us an idea of how frequently each character is followed by another character.
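A sketch of how these initial and transition frequencies can be collected from a plain-text training file; TRAIN_LETTERS matches the character set listed earlier, and the function name is illustrative.

```python
from collections import defaultdict

TRAIN_LETTERS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
                 "0123456789(),.-!?\"' ")

def char_statistics(lines):
    """Count how often each character starts a line and how often each character
    is followed by another, for use as initial and transition probabilities
    (after normalization and smoothing)."""
    initial = defaultdict(int)
    transition = defaultdict(lambda: defaultdict(int))
    for line in lines:
        text = "".join(c for c in line if c in TRAIN_LETTERS)
        if not text:
            continue
        initial[text[0]] += 1
        for prev, cur in zip(text, text[1:]):
            transition[prev][cur] += 1
    return initial, transition
```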
For test image 11, it can be seen that the simplified model predicts the text with good confidence; however, due to noise, it misidentifies some characters. The HMM model fixes this issue: using the initial and transition probabilities from 'bc.train', the characters are identified correctly.
- The emission probabilities were calculated after consideration of the 4 elements highlighted in detail in 3.3.1. The weights were selected after experimentation.
- The noise assumption was tuned manually to output the best results.
- It is assumed that the sub-image window has dimensions of 25x14 pixels. Each train sub-image window is compared with the test sub-image window to estimate the emission probabilities.
- For the train set, it is assumed that if the filename contains 'bc.train' or 'bc.test', every second word in the word list is skipped (since those files interleave part-of-speech tags with the words).