In continuous speech recognition three different kinds of error can occur:
a spoken word is misrecognized, i.e. substituted by another word (=substitution
error) , or a spoken word wasn't recognized at all, i.e. deleted
(= deletion error), or the recognition system recognizes a word
which was not spoken, i.e. it inserted a word (=insertion error).
Based on these three error types the Word Error Rate (WER) is defined
as a measure for the performance of the recognition engine:
#Deletions + #Subsitutions + #Insertions
WER = ------------------------------------------ * 100%
# words to be recognized
Consider the following example:
REF: This great machine can recognize speechSo, the recognizer has a Word Error Rate of 83% or a Word Accuracy WA (WA = 100 - WER) of 17%. Make sure you understand that WER always meant to be the minimal error rate. Otherwise one could interpret the above example as the second reference word "great" to be substituted with "machine", then the reference word "machine" could be interpreted as to be substituted with "can", the word "can" with "wreck" and so on.
HYP: This machine can wreck a nice beach
DEL SUB INS INS SUB
1 + 2 + 2
WER = ------------ * 100% = 83%
6
Problem 1: Write a program that gets two lists as input (first list
= list of reference strings, second list = list of hypothesis) and
gives the Word Error Rate as output and output the final alignment of
the two strings at the same time. (You can use any programming language
as long as your program shows correct output on the monitor.)
Input:
%some_OS> WER_program "This great machine can recognize speech" "This machine can wreck a nice beach"Output:
REF: This great machine can recognize speechYou may simplify the output format as long as the format expresses the word alignment clearly.
HYP: This machine can wreck a nice beach
DEL SUB INS INS SUB
WER = 83%
Note: See Kanungo's slides for the convention of HMM symbols. Kanungo's slides basically follow the Rabiner paper, which is a classic. You may use that paper as a guide to work on this problem set.
In this problem set, you will manually (no programming required) solve
the three basic problems of HMMs applied on a simplified speech recognition
task. Assuming the following speech recognition process [from Rich
Stern's slides]:
The front-end component of this speech recognition system is a Feature
Extraction module, which converts speech waveforms into speech features.
The Decision Making Procedure takes the speech features as its input
and somehow generates phoneme sequences as output.
Assumptions: In this homework assignment, it is assumed the process
above is a consonant-vowel (C-V) recognizer which means it only recognizes
2 phones: C (consonant) and V (vowel). For example,
if a person says 'bit', the C-V recognizer outputs 'CVC'; if a
person says 'amazed', the C-V recognizer outputs 'VCVCC'. Taking
a sequence of speech frames as input, the Feature Extraction module is
assumed to magically output a sequence of numbers in the set {0.1 0.2
0.3 0.4 0.5 0.6 0.7 0.8 0.9}. For example, if there are 6 frames of
speech waveform, then the Feature Extraction module generates a sequence
of 6 numbers between (0,1), say, {0.1, 0.3, 0.5, 0.7, 0.9, 0,8}. Note
that there are only 9 possible feature values, i.e. 0.1, 0.2, ..., 0.9.
HMM is popular in a statistical Decision Making Procedure. The basic
components of the HMM used in this homework assignment is defined as
follows:
Given:
| State Component = C (q1 and q3) |
|||||||||
| Output |
0.1 |
0.2 |
0.3 |
0.4 |
0.5 |
0.6 |
0.7 |
0.8 |
0.9 |
| Prob |
0.9 |
0.8 |
0.7 |
0.6 |
0.5 |
0.4 |
0.3 |
0.2 |
0.1 |
| State Component = V (q2) |
|||||||||
| Output |
0.1 |
0.2 |
0.3 |
0.4 |
0.5 |
0.6 |
0.7 |
0.8 |
0.9 |
| Prob |
0.1 |
0.2 |
0.3 |
0.4 |
0.5 |
0.6 |
0.7 |
0.8 |
0.9 |
use the Forward algorithm to compute the probability of the observation
given the HMM. Fill in the alpha-values in the following trellis-like
table to demonstrate your computation.
| alpha |
||||
| State 3 = C |
||||
| State 2 = V |
||||
| State 1 = C |
||||
| O1
= 0.3 |
O2
= 0.8 |
O3
= 0.7 |
O4
= 0.2 |
|
The probability of the observation given the HMM is: ____________________
Given the same observation and HMM as in Q.2.1, use the Viterbi algorithm to find the corresponding state sequence which is optimal in a way best explaining the observation. Fill in the following trellis-like table to demonstrate the Viterbi computation and fill in the best state number corresponding to each observation on the bottom row.
| delta | ||||
| State 3 = C |
||||
| State 2 = V |
||||
| State 1 = C |
||||
| O1
= 0.3 |
O2
= 0.8 |
O3
= 0.7 |
O4
= 0.2 |
|
| Optimal State |
State = __ |
State = __ |
State = __ |
State = __ |
| beta | ||||
| State 3 = C |
||||
| State 2 = V |
||||
| State 1 = C |
||||
| O1
= 0.3 |
O2
= 0.8 |
O3
= 0.7 |
O4
= 0.2 |
|
| ksi | ||||
| 3,3 |
x |
|||
| 3,2 |
x |
|||
| 3,1 |
x |
|||
| 2,3 |
x |
|||
| 2,2 |
x |
|||
| 2,1 |
x |
|||
| 1,3 |
x |
|||
| 1,2 |
x |
|||
| 1,1 |
x |
|||
| O1
= 0.3 |
O2
= 0.8 |
O3
= 0.7 |
O4
= 0.2 |
|
| gamma | ||||
| 3 |
x |
|||
| 2 |
x |
|||
| 1 |
x |
|||
| O1
= 0.3 |
O2
= 0.8 |
O3
= 0.7 |
O4
= 0.2 |
|
| State Component = C (q1 and q3) |
|||||||||
| Output |
0.1 |
0.2 |
0.3 |
0.4 |
0.5 |
0.6 |
0.7 |
0.8 |
0.9 |
| Prob |
|||||||||
| State Component = V (q2) |
|||||||||
| Output |
0.1 |
0.2 |
0.3 |
0.4 |
0.5 |
0.6 |
0.7 |
0.8 |
0.9 |
| Prob |
|||||||||
Hand in instructions:
Please hand in descriptions of your program (including the operation
of your program and major algorithms and data structures) and your answers
for all the questions in paper work. Your source code and executable
should be handed in by email to me scjou@cs.cmu.edu.