wav2vec2-base_than_I_did

This model is a fine-tuned version of facebook/wav2vec2-base on the MatsRooth/than_I_did dataset. It achieves the following results on the evaluation set:

Loss: 0.2077
Accuracy: 0.9592

Model description

This is a binary classifier for the prosody of tokens of "I did". The label s marks prominence on the subject "I". The label ns marks the complement class, with prominence either on "did" or on later material.

Intended uses & limitations

Research on prosody.
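
A minimal sketch of how the classifier might be run on a single clip. It assumes the model is published on the Hugging Face Hub under the id MatsRooth/wav2vec2-base_than_I_did (inferred from the model name and the dataset namespace, not stated explicitly here) and that the model config exposes the labels as s and ns. The file name i_did.wav is hypothetical.

```python
from typing import Dict, List

# Assumed Hub id: author namespace plus the model name in this card's title.
MODEL_ID = "MatsRooth/wav2vec2-base_than_I_did"


def top_label(scores: List[Dict]) -> str:
    """Return the highest-scoring label from audio-classification output."""
    return max(scores, key=lambda item: item["score"])["label"]


def classify(wav_path: str) -> str:
    """Classify one clip of "I did"; expected labels are 's' and 'ns'."""
    # Imported lazily so top_label() can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline("audio-classification", model=MODEL_ID)
    # The pipeline returns a list of {"label": ..., "score": ...} dicts.
    return top_label(classifier(wav_path))


# Example call (needs network access; decoding a file path typically needs ffmpeg):
#     print(classify("i_did.wav"))  # "i_did.wav" is a hypothetical clip
```

The clip should contain only the words "I did", since that is how the training tokens were cut.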

Training and evaluation data

The utterances were collected from YouTube, aligned with the YouTube transcript using Kaldi, and cut to the words "I did" using MATLAB. Labels were assigned by the experimenter, using 's' for tokens where the main-clause subject differs from the than-clause subject, and 'ns' for all other tokens. The labeling does not depend on prosody, though it correlates with it.

For work on the same problem using an SVM classifier, see Howell, Jonathan, Mats Rooth, and Michael Wagner (2016), Acoustic classification of focus: On the web and in the lab.

The class ns was reduced to 160 tokens, to match the number of tokens of s.
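
The class balancing described above can be sketched as random downsampling to the size of the smallest class. This is an illustration only: the card does not say how the 160 ns tokens were selected, and the starting count of 400 ns tokens below is a made-up number for the toy data.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Token = Tuple[str, str]  # (token_id, label)


def balance_by_downsampling(items: Sequence[Token], seed: int = 0) -> List[Token]:
    """Randomly downsample every class to the size of the smallest class."""
    by_label: Dict[str, List[Token]] = defaultdict(list)
    for item in items:
        by_label[item[1]].append(item)

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    balanced: List[Token] = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced


# Toy data: 160 's' tokens and a hypothetical 400 'ns' tokens before reduction.
tokens = [(f"s_{i}", "s") for i in range(160)]
tokens += [(f"ns_{i}", "ns") for i in range(400)]

balanced = balance_by_downsampling(tokens)
counts = {label: sum(1 for _, l in balanced if l == label) for label in ("s", "ns")}
print(counts)  # {'s': 160, 'ns': 160}
```

With the two classes balanced, a classifier that always predicts one label would score about 0.5 accuracy, which is a useful baseline for reading the results.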

Training procedure

Training and evaluation use the run_audio_classification.py example script from Hugging Face Transformers. The Slurm script than_I_did.sub launches training.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 3e-05
train_batch_size: 16
eval_batch_size: 16
seed: 0
gradient_accumulation_steps: 2
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
num_epochs: 20.0
mixed_precision_training: Native AMP
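
As a worked check of how these values relate, the sketch below recomputes the total batch size and the linear warmup/decay schedule in plain Python. It assumes a single device and takes the total of 160 optimizer steps from the last row of the training results; the function linear_schedule_lr is a hypothetical helper written for this illustration, not part of the training script.

```python
import math


def linear_schedule_lr(step: int, base_lr: float, total_steps: int, warmup_steps: int) -> float:
    """Learning rate after `step` updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


train_batch_size = 16
gradient_accumulation_steps = 2
# Assumes one device, which is consistent with the listed total of 32.
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 32

learning_rate = 3e-05
total_steps = 160  # assumption: last step in the training results table
warmup_steps = math.ceil(total_steps * 0.1)  # lr_scheduler_warmup_ratio
print(warmup_steps)  # 16

# Learning rate midway through warmup, at the peak, and at the end of training.
for step in (8, warmup_steps, total_steps):
    print(step, linear_schedule_lr(step, learning_rate, total_steps, warmup_steps))
```

Under these assumptions the rate climbs to 3e-05 over the first 16 updates and then decays linearly to zero at step 160.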

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy
No log	0.94	8	0.6940	0.4694
0.6939	2.0	17	0.6776	0.6735
0.6844	2.94	25	0.6505	0.6531
0.6752	4.0	34	0.6390	0.6122
0.6071	4.94	42	0.5664	0.7959
0.5483	6.0	51	0.4090	0.8571
0.5483	6.94	59	0.3948	0.8163
0.4747	8.0	68	0.4082	0.8163
0.4782	8.94	76	0.3435	0.8776
0.4403	10.0	85	0.3410	0.8776
0.4682	10.94	93	0.2878	0.8980
0.4032	12.0	102	0.2589	0.9184
0.359	12.94	110	0.2554	0.9184
0.359	14.0	119	0.2077	0.9592
0.3142	14.94	127	0.1839	0.9592
0.3735	16.0	136	0.1944	0.9388
0.3655	16.94	144	0.1870	0.9592
0.3918	18.0	153	0.2005	0.9592
0.3305	18.82	160	0.1947	0.9592
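
The evaluation figures reported at the top of this card (loss 0.2077, accuracy 0.9592) match the epoch 14.0 row, which is the first row to reach the highest accuracy. The sketch below reproduces that lookup from the table values; it also shows that a later row has lower validation loss at the same accuracy. How the reported checkpoint was actually chosen is not stated here, so treat this as a reading aid rather than a description of the training script.

```python
# (epoch, step, validation_loss, accuracy), copied from the table above
rows = [
    (0.94, 8, 0.6940, 0.4694),
    (2.0, 17, 0.6776, 0.6735),
    (2.94, 25, 0.6505, 0.6531),
    (4.0, 34, 0.6390, 0.6122),
    (4.94, 42, 0.5664, 0.7959),
    (6.0, 51, 0.4090, 0.8571),
    (6.94, 59, 0.3948, 0.8163),
    (8.0, 68, 0.4082, 0.8163),
    (8.94, 76, 0.3435, 0.8776),
    (10.0, 85, 0.3410, 0.8776),
    (10.94, 93, 0.2878, 0.8980),
    (12.0, 102, 0.2589, 0.9184),
    (12.94, 110, 0.2554, 0.9184),
    (14.0, 119, 0.2077, 0.9592),
    (14.94, 127, 0.1839, 0.9592),
    (16.0, 136, 0.1944, 0.9388),
    (16.94, 144, 0.1870, 0.9592),
    (18.0, 153, 0.2005, 0.9592),
    (18.82, 160, 0.1947, 0.9592),
]

# max() returns the first of several tied rows, i.e. the earliest best accuracy.
first_best_accuracy = max(rows, key=lambda row: row[3])
lowest_validation_loss = min(rows, key=lambda row: row[2])

print(first_best_accuracy)     # (14.0, 119, 0.2077, 0.9592)
print(lowest_validation_loss)  # (14.94, 127, 0.1839, 0.9592)
```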

Framework versions

Transformers 4.36.0.dev0
PyTorch 2.1.0+cu121
Datasets 2.13.1
Tokenizers 0.15.0
