Speech Verification Repository
This repository shows how to train and use a speaker recognition model based on speech data. The model was trained on AIHub's speech dataset for speaker recognition, a Korean speech dataset.
Model Overview
- Model name: wav2vec2-base-960h-contrastive
- Model description: A speaker recognition model based on the Wav2Vec 2.0 architecture. Through contrastive learning, it learns features that characterize speech from the same speaker.
- Applications: Can be used for tasks such as speech recognition and speaker classification.
The original base model is facebook/wav2vec2-base-960h.
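This card does not state which contrastive objective was used for training. As an illustration only, the sketch below shows one common form, a pairwise margin-based contrastive loss on cosine distance. The function names, the `margin` value, and the toy embeddings are assumptions for this example, not details of this model's training.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_contrastive_loss(emb_a, emb_b, same_speaker, margin=0.5):
    """Generic pairwise contrastive loss (assumed form, not this model's recipe).

    Same-speaker pairs are pulled together (loss grows with distance);
    different-speaker pairs are pushed apart until they are at least
    `margin` away from each other.
    """
    d = cosine_distance(emb_a, emb_b)
    if same_speaker:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy 2-dimensional embeddings, purely illustrative
print(pairwise_contrastive_loss([1.0, 0.0], [1.0, 0.0], same_speaker=True))
print(pairwise_contrastive_loss([1.0, 0.0], [1.0, 0.0], same_speaker=False))
```

In this form, identical embeddings cost nothing when they belong to the same speaker and are penalized when they belong to different speakers.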
Training Data
- Uses AIHub's speech dataset for speaker recognition
- Consists of Korean speech data and includes speech samples from a variety of speakers
- Original data link: AIHub speaker recognition dataset
Usage
- Library import
import librosa
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
- Load Model
from transformers import Wav2Vec2Model, AutoFeatureExtractor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "Songhun/wav2vec2-base-960h-contrastive"
model = Wav2Vec2Model.from_pretrained(model_name).to(device)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
- Calculate Voice Similarity
file_path1 = './test1.wav'
file_path2 = './test2.wav'

def load_and_process_audio(file_path, feature_extractor, max_length=4.0):
    # Load as mono audio resampled to 16 kHz
    audio, sampling_rate = librosa.load(file_path, sr=16000)
    # Pad or truncate each clip to max_length seconds so both inputs share one shape
    inputs = feature_extractor(
        audio, sampling_rate=sampling_rate, return_tensors="pt",
        padding="max_length", truncation=True,
        max_length=int(max_length * sampling_rate),
    )
    return inputs.input_values

audio_input1 = load_and_process_audio(file_path1, feature_extractor).to(device)
audio_input2 = load_and_process_audio(file_path2, feature_extractor).to(device)

# Mean-pool the frame-level hidden states into one embedding per clip (inference only)
with torch.no_grad():
    embedding1 = model(audio_input1).last_hidden_state.mean(dim=1)
    embedding2 = model(audio_input2).last_hidden_state.mean(dim=1)

similarity = F.cosine_similarity(embedding1, embedding2).item()
print(f"Similarity between the two audio files: {similarity}")
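The last two steps above, averaging the frame-level hidden states over time and comparing the resulting vectors by cosine similarity, can be sketched without any dependencies. The helper names and the tiny 2-dimensional "frames" below are hypothetical and exist only to make the arithmetic visible. The `eps` guard is an assumption to avoid division by zero, not a claim about how PyTorch implements it.

```python
import math


def mean_pool(frames):
    """Average a list of per-frame vectors into a single vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


# Hypothetical frame-level features for two clips
frames_a = [[1.0, 0.0], [0.0, 1.0]]  # pools to [0.5, 0.5]
frames_b = [[2.0, 2.0]]              # pools to [2.0, 2.0]

embedding_a = mean_pool(frames_a)
embedding_b = mean_pool(frames_b)
print(cosine_similarity(embedding_a, embedding_b))
```

Because cosine similarity ignores vector length, the two pooled embeddings here count as maximally similar even though their magnitudes differ.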
Threshold: 0.3331, the optimal threshold according to Youden's J statistic.
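Youden's J statistic is J = sensitivity + specificity - 1, which equals the true positive rate minus the false positive rate. The sketch below shows how such a threshold could be chosen from labeled pairs and then applied to a similarity score. The scores and labels are made up for illustration, the function names are hypothetical, and treating a score equal to the threshold as a match is an assumption. The evaluation data behind the 0.3331 value is not given in this card.

```python
def youden_j_threshold(scores, labels):
    """Return (threshold, J) maximizing J = TPR - FPR.

    A pair is predicted "same speaker" when its score >= threshold.
    labels: 1 for same-speaker pairs, 0 for different-speaker pairs.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative pair")

    best_threshold, best_j = None, -1.0
    for t in sorted(set(scores)):  # every observed score is a candidate
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j


def is_same_speaker(similarity, threshold=0.3331):
    """Apply the card's reported threshold to a cosine similarity score."""
    return similarity >= threshold


# Made-up similarity scores for labeled pairs
scores = [0.9, 0.8, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(youden_j_threshold(scores, labels))
print(is_same_speaker(0.5), is_same_speaker(0.2))
```

With real data, `scores` would be the cosine similarities computed as in the usage code and `labels` would mark whether each pair comes from the same speaker.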