marinone94/xls-r-300m-sv-robust

Automatic Speech Recognition

mozilla-foundation/common_voice_9_0


marinone94 committed on Feb 2, 2022

Commit 044dff6 • 1 Parent(s): fcf680c

log df of train and test data

Files changed (2)

run.sh +1 -1
run_speech_recognition_ctc.py +6 -0

run.sh CHANGED

@@ -5,7 +5,7 @@ python run_speech_recognition_ctc.py \
 	--train_split_name="train+validation,train" \
 	--eval_split_name="test,None" \
 	--output_dir="./" \
-	--overwrite_output_dir \
+	--preprocessing_only \
 	--num_train_epochs="3" \
 	--per_device_train_batch_size="32" \
 	--per_device_eval_batch_size="32" \
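
The effect of swapping `--overwrite_output_dir` for `--preprocessing_only` in run.sh can be sketched roughly as below. This is a minimal illustration built on `argparse`, not the script's actual argument handling; `parse_flags` and `run` are hypothetical names, and the early exit after preprocessing is an assumption based on the comment in run_speech_recognition_ctc.py about running the preprocessing on a single machine first.

```python
import argparse


def parse_flags(argv):
    """Parse the flags touched by this commit (hypothetical stand-in parser)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", default="./")
    parser.add_argument("--overwrite_output_dir", action="store_true")
    parser.add_argument("--preprocessing_only", action="store_true")
    return parser.parse_args(argv)


def run(argv):
    """Return the stages that would execute for the given command-line flags."""
    args = parse_flags(argv)
    stages = ["preprocess"]
    if args.preprocessing_only:
        # Assumed behavior: stop once the preprocessed data has been produced.
        return stages
    stages += ["train", "evaluate"]
    return stages
```

Under these assumptions, `run(["--preprocessing_only"])` stops after the preprocessing stage, which is presumably why the flag was enabled for a commit that only inspects the prepared data.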

run_speech_recognition_ctc.py CHANGED

@@ -750,6 +750,12 @@ def main():
     # If dataset_seed is set, shuffle train
     if data_args.dataset_seed is not None:
         vectorized_datasets["train"] = vectorized_datasets["train"].shuffle(seed=data_args.dataset_seed)
+    # Log sample of datasets
+    pd_train = vectorized_datasets["train"].select(range(10)).to_pandas()
+    pd_eval = vectorized_datasets["eval"].select(range(10)).to_pandas()
+    wandb.log({"train_sample": pd_train})
+    wandb.log({"eval_sample": pd_eval})
     # for large datasets it is advised to run the preprocessing on a
     # single machine first with ``args.preprocessing_only`` since there will most likely
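
The `select(range(10))` calls in the hunk above assume each split has at least 10 rows; on a smaller split the index range would likely be out of bounds. A minimal sketch of a capped variant follows. `head_indices` and `sample_as_records` are hypothetical helpers, not part of the script, and a plain list of dicts stands in for the real dataset object.

```python
def head_indices(num_rows, k=10):
    """Return indices of the first k rows, capped at the dataset size."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return range(min(k, num_rows))


def sample_as_records(dataset, k=10):
    """Collect up to k leading rows from any indexable, sized collection of rows."""
    return [dataset[i] for i in head_indices(len(dataset), k)]
```

In the script itself, the same cap could presumably be applied by passing `head_indices(len(vectorized_datasets["train"]))` to `select` in place of `range(10)`.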