Statistical Computing
STAT 534 Spring Quarter 2019

Project

[ Generalities ] [ Data sets ] [ What to do ] [ Code ] [ Hints ] [ Time line ] [ Report ] [ Results ] [ Grading ]
Each project will start from the same training set. You will fit a Hidden Markov Model (HMM) to the training data.

You will submit a report about what you did, your code, and results on a test set to be provided later. We will evaluate and compare the results, and unveil the winners and the data source during exam week.

Data sets

The data set for training is available here. It contains n0=1000 sequences of T=40 integers in {0,1,2,3} (one sequence per line, space-separated). These sequences have been sampled independently from an unknown distribution. Later, we will post an unlabeled test set, on which you will test the model you obtained. Find the test set here: test1_534.dat. It contains 50 sequences of length 40.
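If you work in Python/NumPy, a minimal sketch for reading the training file is below; the file name is a placeholder, since the actual name is whatever the posted file is called.

  import numpy as np

  # "train_534.dat" is a placeholder; substitute the name of the posted training file.
  seqs = np.loadtxt("train_534.dat", dtype=int)    # one sequence per row
  assert seqs.shape == (1000, 40)                  # n0=1000 sequences of length T=40
  assert set(np.unique(seqs)) <= {0, 1, 2, 3}      # alphabet check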

What to do

  1. You will use the data made available to train an HMM by the Baum-Welch method. Choose an appropriate number of states N. We suggest N=5 or larger, but it is possible that N=4 or smaller works, although we did not try these values.
  2. A rigorous way to choose N is by cross-validation. In this case you will have to set aside some part of the training set for validation, or use K-fold CV (K=5 is a good choice).
  3. Number the states from 0 to N-1, so that the state with the most uniform emission probabilities is labeled 0. If possible, give the states with the most deterministic emission probabilities the highest numbers. (E.g., if 1 is emitted w.p. 99% in state q, label this state N-1; if 3 is emitted w.p. 80% in state q', label this state N-2.) A relabeling sketch is given after this list.
    Example: in state 0 the emission probabilities are [0.3 0.3 0.2 0.2]; in state 1 they are [0.5 0.2 0.3 0]; in state 2 they are [0.9 0.1 0 0]; etc.
  4. For a training set of your choice, with n samples, you will write down the (log-)likelihood of this set. Then you will use the Baum-Welch algorithm to maximize this log-likelihood. (The likelihood formula and a forward-pass sketch for computing it are given after this list.)
  5. Report the transition matrix A, and the emission probability matrix B.
  6. Use the Viterbi algorithm to calculate the most likely sequence of states for all the data. Report the time it took to run the Viterbi algorithm on all the data. (A Viterbi sketch is given after this list.)
  7. Write code that, given a sequence of outputs, predicts the next output and the next state, and test it on the data. (A prediction sketch is given after this list.)
  8. Write code that calculates the log-likelihood of the test set (by Forward-Backward).
  9. When the test set is available, you will be asked to perform some of these tasks on it:
    • output the log-likelihood of the test set
    • output the Viterbi sequence of states
    • predict the next output for a given sequence.
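For item 3, one concrete (not mandatory) way to order the states is by the entropy of their emission rows: the most uniform row gets label 0 and the most deterministic rows get the highest labels. A minimal sketch, assuming pi is the length-N initial distribution, A the NxN transition matrix, and B the Nx4 emission matrix:

  import numpy as np

  def relabel_states(pi, A, B):
      """Permute state labels so the most uniform emission row becomes state 0
      and the most deterministic (lowest-entropy) rows get the highest labels."""
      ent = -np.sum(B * np.log(np.clip(B, 1e-12, None)), axis=1)  # entropy of each emission row
      order = np.argsort(-ent)               # descending entropy: most uniform first
      return pi[order], A[np.ix_(order, order)], B[order]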
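For item 4, in standard HMM notation (initial distribution \pi, transition matrix A, emission matrix B; your own notation may differ), the log-likelihood of n independent training sequences x^{(1)},\dots,x^{(n)} of length T that Baum-Welch maximizes is

  \log L(\pi, A, B) = \sum_{s=1}^{n} \log \sum_{q_1,\dots,q_T} \pi_{q_1} B_{q_1}(x^{(s)}_1) \prod_{t=2}^{T} A_{q_{t-1} q_t} B_{q_t}(x^{(s)}_t),

where the inner sum over state paths is exactly what the forward pass computes in O(T N^2) time per sequence.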
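For items 4, 7 and 8, here is a minimal NumPy sketch (same notation as above, observations as integer arrays) of the scaled forward pass, which yields the log-likelihood of one sequence, and of next-state / next-output prediction built on top of it. This is an illustration under the stated assumptions, not the required implementation:

  import numpy as np

  def forward_scaled(seq, pi, A, B):
      """Scaled forward pass for one sequence of integer symbols.
      pi: (N,), A: (N, N), B: (N, M).  Returns the scaled forward
      variables alpha (T, N) and log P(seq | pi, A, B)."""
      T, N = len(seq), len(pi)
      alpha = np.zeros((T, N))
      alpha[0] = pi * B[:, seq[0]]
      c = alpha[0].sum()
      alpha[0] /= c
      loglik = np.log(c)
      for t in range(1, T):
          alpha[t] = (alpha[t - 1] @ A) * B[:, seq[t]]
          c = alpha[t].sum()
          alpha[t] /= c
          loglik += np.log(c)
      return alpha, loglik

  def predict_next(seq, pi, A, B):
      """Distributions of the next state and the next output given x_1..x_T."""
      alpha, _ = forward_scaled(seq, pi, A, B)   # alpha[-1] = P(q_T = . | x_1..x_T)
      next_state = alpha[-1] @ A                 # P(q_{T+1} = . | x_1..x_T)
      next_output = next_state @ B               # P(x_{T+1} = . | x_1..x_T)
      return next_state, next_output

The total (training or test) log-likelihood is the sum of forward_scaled(...)[1] over all sequences, and the predicted next state/output are the argmax of the two vectors returned by predict_next.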
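For item 6, a log-domain Viterbi sketch in the same notation; add a small epsilon inside the logs if A or B contain exact zeros:

  import numpy as np

  def viterbi(seq, pi, A, B):
      """Most likely state sequence for one observation sequence (log domain)."""
      T, N = len(seq), len(pi)
      logA, logB = np.log(A), np.log(B)
      delta = np.zeros((T, N))             # best log-score of paths ending in each state
      psi = np.zeros((T, N), dtype=int)    # backpointers
      delta[0] = np.log(pi) + logB[:, seq[0]]
      for t in range(1, T):
          scores = delta[t - 1][:, None] + logA     # scores[i, j]: best path into i, then i -> j
          psi[t] = scores.argmax(axis=0)
          delta[t] = scores.max(axis=0) + logB[:, seq[t]]
      path = np.empty(T, dtype=int)
      path[-1] = delta[-1].argmax()
      for t in range(T - 2, -1, -1):
          path[t] = psi[t + 1, path[t + 1]]
      return path

Timing a loop over all n0=1000 sequences (e.g. with time.perf_counter) gives the run time asked for in item 6.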

Software resources

For full credit, you should implement all the algorithms yourself. Alternatively, you may use HMM software (in Python) from standard packages (but not software written by your peers or other students at UW). In this case, you must know exactly what the software is doing in the context of your project; more precisely, you must report which software you used and explain how you set its parameters. Do not submit the downloaded software with your code.
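If you do take the package route, one common choice is hmmlearn, whose discrete-emission class is CategoricalHMM in recent releases (in releases before 0.3 the equivalent class was called MultinomialHMM). The sketch below is only an illustration, under those assumptions, of how the parameters might be set and how a held-out split can be used to compare values of N; the file name and the 800/200 split are placeholders, and you should report whatever settings you actually use.

  import numpy as np
  from hmmlearn.hmm import CategoricalHMM    # hmmlearn >= 0.3

  seqs = np.loadtxt("train_534.dat", dtype=int)     # placeholder file name
  train, val = seqs[:800], seqs[800:]               # simple hold-out split

  def as_hmmlearn(s):
      """hmmlearn wants one long column of symbols plus the per-sequence lengths."""
      return s.reshape(-1, 1), [s.shape[1]] * s.shape[0]

  Xtr, ltr = as_hmmlearn(train)
  Xva, lva = as_hmmlearn(val)
  for N in range(3, 8):
      model = CategoricalHMM(n_components=N, n_iter=200, tol=1e-4, random_state=0)
      model.fit(Xtr, ltr)                           # Baum-Welch (EM) training
      print(N, model.score(Xva, lva))               # held-out log-likelihood
  # model.startprob_, model.transmat_, model.emissionprob_ hold pi, A, B.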

Good practices

Time line
  • Training data available May 25
  • Test set available June 10
  • Test results due 24 hours after the test set is posted
  • Report due June 12, midnight
  • Award ceremony (virtual): June 14 or later
Report outline
  • Training: how much data you used, total run time of Baum-Welch, run time per iteration, and how you chose the stopping criterion.
  • How did you choose N? What range of values did you try? If you used cross-validation, how large was the validation set? If you used other methods or additional validation, what were they?
  • What software did you use? How did you set the parameters?
  • Experimental results, e.g. learning curve(s), training (validation) losses, parameters A, B. Be selective in what you show! Full credit will be given for legible, well-labeled, and clearly explained plots and results.
  • The formula(s) you used to predict the next output.
  • At least one paragraph of remarks and comments about the results and data.
  • Optional: references
  • Total length: no more than 5 pages of content, with extra pages containing references or figures, up to no more than 10 pages total.
How to submit your test set results
  • Instructions to be posted.
    We will use a script to download and evaluate your results, so please do not deviate from the given formats or file names; otherwise your results may not be evaluated correctly.
  • Go to the Canvas dropbox. Upload the files.
Grading
Report: 55 points (20 for presentation + 35 for analysis); HMM code implementation: 25 points; test set results: 20 points. Total: 100 points.

 


Marina Meila
Last modified: Sat Nov 30 15:43:52 PST 2013