sanchit-gandhi (HF staff) committed
Commit 07a4393
1 Parent(s): 5000ecd

Update Example Code Snippets


Fixes #5. Note that `SpeechEncoderDecoderModel` allows for arbitrary combinations of speech encoders and text decoders, and hence arbitrary combinations of feature extractors and tokenizers. This means it is not possible to define a processor class (which requires fixed feature extractor and tokenizer classes). Thus, we explicitly load the feature extractor and tokenizer we are using via the `AutoFeatureExtractor` and `AutoTokenizer` classes.

cc @Changhan - it would be great if you could merge this simple README update for Transformers usage! Thanks!
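The reasoning above can be illustrated with a toy sketch (plain Python, not the Transformers API, and the class names below are hypothetical stand-ins): a processor bundles one fixed feature-extractor class with one fixed tokenizer class, so for an arbitrary encoder/decoder pairing the two components have to be chosen and loaded independently instead.

```python
# Toy stand-ins (NOT the Transformers classes) for a speech feature
# extractor and a text tokenizer that were paired arbitrarily.
class ToyFeatureExtractor:
    def __call__(self, audio):
        # Normalize the waveform by its peak value, as a stand-in for
        # real feature extraction.
        peak = max(audio)
        return {"input_values": [round(x / peak, 3) for x in audio]}

class ToyTokenizer:
    def batch_decode(self, ids):
        # Join token ids into a single "transcription" string.
        return [" ".join(str(i) for i in ids)]

# Explicit composition: pick each component independently, mirroring
# AutoFeatureExtractor.from_pretrained(...) / AutoTokenizer.from_pretrained(...)
# rather than a single fixed Processor class.
feature_extractor = ToyFeatureExtractor()
tokenizer = ToyTokenizer()

inputs = feature_extractor([0.1, 0.2, 0.4])
transcription = tokenizer.batch_decode([7, 3, 9])
print(inputs["input_values"])  # [0.25, 0.5, 1.0]
print(transcription)           # ['7 3 9']
```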

Files changed (1): README.md (+9 -7)
README.md CHANGED
@@ -106,7 +106,7 @@ from transformers import pipeline
 librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
 audio_file = librispeech_en[0]["file"]
 
-asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-1b-21-to-en", feature_extractor="facebook/wav2vec2-xls-r-1b-21-to-en")
+asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-1b-21-to-en")
 
 translation = asr(audio_file)
 ```
@@ -115,17 +115,19 @@ or step-by-step as follows:
 
 ```python
 import torch
-from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel
+from transformers import AutoFeatureExtractor, AutoTokenizer, SpeechEncoderDecoderModel
 from datasets import load_dataset
 
 model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en")
-processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en")
+feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en")
+tokenizer = AutoTokenizer.from_pretrained("facebook/wav2vec2-xls-r-1b-21-to-en")
 
-ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
+librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
+sample = librispeech_en[0]["audio"]
 
-inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["array"]["sampling_rate"], return_tensors="pt")
-generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask"])
-transcription = processor.batch_decode(generated_ids)
+inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")
+generated_ids = model.generate(**inputs)
+transcription = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
 ```
 
 ## Results `{lang}` -> `en`