11-751 Assignment 2

Released: Wednesday, Oct 22

Due: Monday, Nov 03

1. DTW/Dynamic Programming (50 points)

In continuous speech recognition three different kinds of error can occur: a spoken word is misrecognized, i.e. substituted by another word (=substitution error) , or a spoken word wasn't recognized at all, i.e. deleted (= deletion error), or the recognition system recognizes a word which was not spoken, i.e. it inserted a word (=insertion error). Based on these three error types the Word Error Rate (WER) is defined as a measure for the performance of the recognition engine:

         #Deletions + #Subsitutions + #Insertions
WER =   ------------------------------------------  * 100%
               # words to be recognized

Consider the following example:

REF: This great machine can recognize          speech
HYP: This machine can wreck a nice beach
DEL SUB INS INS SUB

1 + 2 + 2
WER = ------------ * 100% = 83%
6
So, the recognizer has a Word Error Rate of 83% or a Word Accuracy WA (WA = 100 - WER) of 17%. Make sure you understand that WER always meant to be the minimal error rate. Otherwise one could interpret the above example as the second reference word "great" to be substituted with "machine", then the reference word "machine" could be interpreted as to be substituted with "can", the word "can" with "wreck" and so on.

Problem 1: Write a program that gets two lists as input (first list = list of reference strings, second list = list of hypothesis) and gives the Word Error Rate as output and output the final alignment of the two strings at the same time. (You can use any programming language as long as your program shows correct output on the monitor.)


Note: The Input/Output example follows, for example, if you use command line arguments:

Input:

%some_OS> WER_program "This great machine can recognize speech" "This machine can wreck a nice beach"
Output:
REF: This great machine can recognize          speech
HYP: This machine can wreck a nice beach
DEL SUB INS INS SUB

WER = 83%
You may simplify the output format as long as the format expresses the word alignment clearly.

2. Hidden Markov Model (50 points)

Note: See Kanungo's slides for the convention of HMM symbols.  Kanungo's slides basically follow the Rabiner paper, which is a classic.  You may use that paper as a guide to work on this problem set.

In this problem set, you will manually (no programming required) solve the three basic problems of HMMs applied on a simplified speech recognition task.  Assuming the following speech recognition process [from Rich Stern's slides]:


The front-end component of this speech recognition system is a Feature Extraction module, which converts speech waveforms into speech features.  The Decision Making Procedure takes the speech features as its input and somehow generates phoneme sequences as output.

Assumptions:  In this homework assignment, it is assumed the process above is a consonant-vowel (C-V) recognizer which means it only recognizes 2 phones: C (consonant) and V (vowel).  For example, if a person says 'bit', the C-V recognizer outputs 'CVC'; if a person says 'amazed', the C-V recognizer outputs 'VCVCC'.  Taking a sequence of speech frames as input, the Feature Extraction module is assumed to magically output a sequence of numbers in the set {0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9}.  For example, if there are 6 frames of speech waveform, then the Feature Extraction module generates a sequence of 6 numbers between (0,1), say, {0.1, 0.3, 0.5, 0.7, 0.9, 0,8}.  Note that there are only 9 possible feature values, i.e. 0.1, 0.2, ..., 0.9.

HMM is popular in a statistical Decision Making Procedure.  The basic components of the HMM used in this homework assignment is defined as follows:


Q.2.1 The Evaluation Problem of HMM (10 points)

Given:

use the Forward algorithm to compute the probability of the observation given the HMM.  Fill in the alpha-values in the following trellis-like table to demonstrate your computation.

alpha
State 3 = C




State 2 = V




State 1 = C





O1 = 0.3
O2 = 0.8
O3 = 0.7
O4 = 0.2

The probability of the observation given the HMM is: ____________________


Q.2.2 The Decoding Problem of HMM (10 points)

Given the same observation and HMM as in Q.2.1, use the Viterbi algorithm to find the corresponding state sequence which is optimal in a way best explaining the observation.  Fill in the following trellis-like table to demonstrate the Viterbi computation and fill in the best state number corresponding to each observation on the bottom row.

delta
State 3 = C




State 2 = V




State 1 = C





O1 = 0.3
O2 = 0.8
O3 = 0.7
O4 = 0.2
Optimal State
State = __
State = __
State = __
State = __


Q.2.3 The Learning Problem of HMM (20 points)

Given the same observation and HMM as in Q.2.1, use the forward-backward (Baum-Welch) algorithm to maximize the likelihood P(Observation | A, B, Pi).  To manually run one iteration of the algorithm, fill in the beta values in the following trellis-like table, complete the ksi and gamma table, and update A and B accordingly.  Note: use the ksi and gamma definition in the Kanungo's slides , not the Xuedong Huang et al. textbook definition of gamma; these two gamma's are different.

beta
State 3 = C




State 2 = V




State 1 = C





O1 = 0.3
O2 = 0.8
O3 = 0.7
O4 = 0.2


ksi
3,3



x
3,2



x
3,1



x
2,3



x
2,2



x
2,1



x
1,3



x
1,2



x
1,1



x

O1 = 0.3
O2 = 0.8
O3 = 0.7
O4 = 0.2


gamma
3



x
2



x
1



x

O1 = 0.3
O2 = 0.8
O3 = 0.7
O4 = 0.2


Note: Since the first and the third states both belong to the same state component C, parameters of C should be calculated by combining the ksi and gamma values of the first the the third states as values of a single state.

Write down the updated transition probabilities on the transition arcs:




State Component = C (q1 and q3)
Output
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Prob










State Component = V (q2)
Output
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Prob










Q.2.4 Bayesian Decision (10 points)

Bayesian decision for our C-V speech recognition can be viewed as follows.

argmaxi P(PhoneSequencei|Observation) = argmaxi P(Observation|PhoneSequencei)*P(PhoneSequencei)

Assuming P(PhoneSequencei), the so-called Language Model, are equally likely for every PhoneSequence, so P(PhoneSequencei) can be ignored in the decision rule.  HMM can be used to estimate P(Observation|PhoneSequencei), the so-called Acoustic Model, in this framework.

Assuming the same observation and HMM as in Q.2.1, here we suppose there is another HMM composed of only a single state C also in the the acoustic model, i.e. the acoustic model consists of two HMM's, CVC and C.  Since there is only one state in the second HMM, this single C-state has only the self-transition with transition probability 1.0.  All the other setup of this question are identical to Q.2.1.  Given the same observation, {0.3, 0.8, 0.7, 0.2}, which phone sequence, CVC vs. C, will be recognized by the Beyesian decision rule? Why?



Hand in instructions:
Please hand in descriptions of your program (including the operation of your program and major algorithms and data structures) and your answers for all the questions in paper work. Your source code and executable should be handed in by email to me scjou@cs.cmu.edu.