Generating word-timestamps for an audio file (a.k.a. generating a Forced Alignment) has had a long and rich history in the NLP and AI communities as a whole. In fact, one of the motivating uses for Hidden Markov Models (HMMs) was Automatic Speech Recognition (ASR). I, on the other hand, was motivated by a desire to create my own Karaoke system, which I did in this project.
Classic Forced Alignment methods were based on a Dynamic Programming-style algorithm called Dynamic Time Warping (DTW), or on Hidden Markov Models in tandem with Gaussian Mixture Models (HMM-GMM). The latter is still very successful (though the GMM may be substituted with a Deep Neural Network), while modern methods focus on recurrent models, i.e. Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRUs), and most recently attention-based models such as Transformers. All of these, however, are computationally expensive due to the temporal nature of the models.
In this project, I attempted to massage a Convolutional Neural Network (CNN) into generating a forced alignment. While work has been done on ASR with CNNs, a forced alignment wasn't previously possible due to the chosen, yet ubiquitous, loss function: Connectionist Temporal Classification (CTC). I was successful in doing this, creating a purely Convolutional Forced Aligner with an average alignment error of 67 ms.
To see my thought process as well as the ups and downs I experienced making this, you can read the blog-style story.md.
If you want a detailed technical report with diagrams and equations, you can read Convolutional_Forced_Alignment.pdf.
All requirements can be found in environment.yml and can be loaded in via Anaconda. However, if you wish to use an alternate package management system/environment manager, here are the packages + versions I used:
1. python==3.9.1
2. pytorch==1.7.1 + CUDA 11.0
3. torchaudio==0.7.2
4. nltk
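For example, with Anaconda installed, the environment can typically be created and activated with `conda env create -f environment.yml` followed by `conda activate <env-name>`, where the environment name is whatever is declared at the top of environment.yml.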
After cloning this repository, all available options are in config.py:
- `DATASET_DIR`: The path to the training/test dataset. As of now, it only supports 'TIMIT' or 'Librispeech'.
- `CHECKPOINT_DIR`: The path to the directory where you would like the model checkpoints to be saved to/loaded from.
- `ADAM_lr`: The learning rate for the ADAM optimizer.
- `batch_size`: The batch size for training.
- `SGD_lr`, `SGD_l2_penalty`: Learning rate and weight penalization parameters for Stochastic Gradient Descent, which is used on the validation set.
- `weights_init_a`, `weights_init_b`: The range of values between which the model weights are uniformly initialized.
- `epochs`: Number of epochs to train for.
- `activation`: For flexibility, an activation can be chosen between `relu`, `prelu`, and `maxout`. This will apply to all layers in the network.
- `start_epoch`: The epoch to start at for training or inference. Must have the appropriate `.pt` model weights saved beforehand to do this. This number should be +1 whatever the filename states.
- `mode`: Choose between `train`, `test`, `cam`, `align`, and `test-align`. The details of each are described in the next section.
- `dataset`: The dataset the model is based off of. Choose between `TIMIT` or `Librispeech`.
- `sample_path`: If you wish to generate a forced alignment (mode `align`) or class activation map (mode `cam`) for an individual file, provide the path to the respective audio file here. Should be used in conjunction with `sample_transcript`.
- `sample_transcript`: If you wish to generate a forced alignment (mode `align`) or class activation map (mode `cam`) for an individual file, provide the path to the text transcript file here. Should be used in conjunction with `sample_path`.
- `timit_sample_path`: Can be used in place of `sample_path` and `sample_transcript`. Since TIMIT provides ground-truth word timings, set this path to the desired TIMIT sample (e.g. timit/data/TRAIN/DR4/MESG0/SX72) in `cam` or `align` mode to also see the alignment error.
- `model`: For now, only `zhang` is valid.
- `cam_phonemes`: Used in `cam` mode. Choose which phonemes to find the activations for (e.g. [1, 2, 10]).
- `cam_word`: Used in `align` mode. Choose which word to find the activations for (e.g. 1).
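For reference, here is a minimal sketch of how these options might look inside config.py. The values below are illustrative placeholders, not the defaults shipped with the repository:

```python
# config.py -- illustrative values only; adjust paths and hyperparameters to your setup
DATASET_DIR = "./timit/data"        # root of the TIMIT (or Librispeech) dataset
CHECKPOINT_DIR = "./checkpoints"    # where .pt model checkpoints are written/read

ADAM_lr = 1e-4                      # learning rate for the ADAM optimizer
batch_size = 8
SGD_lr = 1e-5                       # SGD settings used on the validation set
SGD_l2_penalty = 1e-3
weights_init_a = -0.05              # uniform initialization range [a, b]
weights_init_b = 0.05
epochs = 15
activation = "relu"                 # one of "relu", "prelu", "maxout"

start_epoch = 0                     # requires a matching .pt checkpoint if > 0
mode = "train"                      # "train", "test", "cam", "align", "test-align"
dataset = "TIMIT"                   # "TIMIT" or "Librispeech"
model = "zhang"

sample_path = None                  # path to an audio file for "align"/"cam"
sample_transcript = None            # path to its text transcript
timit_sample_path = None            # e.g. "timit/data/TRAIN/DR4/MESG0/SX72"
cam_phonemes = [1, 2, 10]           # phonemes to visualize in "cam" mode
cam_word = 1                        # word to visualize in "align" mode
```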
To run the script, simply run `python main.py`.
Note: For almost all of the modes, you should be using `dataset: TIMIT`; `Librispeech` is left as an option only for `train` or `test`, if you want to train the CNN on that dataset and see the results. (It is nothing novel, just normal training with CTC loss and decoding.)
- `train`: Train the `model` on the `dataset` for `(epochs - start_epoch)` epochs with the hyperparameters `ADAM_lr`, `batch_size`, `SGD_lr`, `SGD_l2_penalty`, `weights_init_a`, `weights_init_b`, and `activation`. Every epoch, it will save a model checkpoint in the `CHECKPOINT_DIR`. For `TIMIT`, it will automatically find the pre-partitioned TRAIN dataset.
- `test`: Loads in a pre-trained model from epoch `start_epoch` (assuming the checkpoint exists). Tests the accuracy of the `model` on the `dataset`. For `TIMIT`, it will automatically find the pre-partitioned TEST dataset. After 15 epochs, my model achieved a Phoneme Error Rate of 22% on the TIMIT test set.
- `cam`: Loads in a pre-trained model from epoch `start_epoch` (assuming the checkpoint exists). Given a sample in `sample_path` or `timit_sample_path`, this will generate a class activation map via the Grad-CAM method. It will show the activations for all phonemes listed in `cam_phonemes`; any invalid ones will be ignored.
- `align`: Loads in a pre-trained model from epoch `start_epoch` (assuming the checkpoint exists). Given a sample/transcript combo, either via `sample_path` and `sample_transcript` or via `timit_sample_path`, this will generate and print the Forced Alignment (in seconds). If a `timit_sample_path` is provided, it will also report the alignment error. If a `cam_word` is provided, it will also generate the class activations for all phonemes in the desired word. (See the example configuration sketch after this list.)
- `test-align`: Loads in a pre-trained model from epoch `start_epoch` (assuming the checkpoint exists). This will compute the average Alignment Error (in seconds) over the entire TIMIT dataset given our method. After 15 epochs, my model achieved an average Alignment Error of 67 ms.
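As an example of the `align` workflow referenced above, the settings below sketch how the relevant options might be filled in before running `python main.py`. The file paths are placeholders for your own data, not files shipped with the repository:

```python
# config.py -- illustrative settings for aligning a single file in "align" mode
mode = "align"
dataset = "TIMIT"
model = "zhang"
start_epoch = 15                    # epoch of the saved checkpoint to load (must exist)

# Option 1: align an arbitrary audio/transcript pair (placeholder paths)
sample_path = "samples/my_song.wav"
sample_transcript = "samples/my_song.txt"

# Option 2 (instead of the above): point at a TIMIT sample to also get the alignment error
# timit_sample_path = "timit/data/TRAIN/DR4/MESG0/SX72"

cam_word = 1                        # optionally visualize activations for the first word
```

With these values, running `python main.py` should print the word timings (in seconds) for the provided sample.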