kasimali committed on
Commit
0f5e1cb
·
verified ·
1 Parent(s): f82ae7f

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +3 -6
  2. app.py +2240 -0
  3. requirements.txt +3 -0
README.md CHANGED
@@ -1,10 +1,7 @@
 ---
-title: New Asr Vox
-emoji: 👀
-colorFrom: purple
-colorTo: gray
+title: NEW-ASR-VOX
+emoji: 🚀
 sdk: static
-pinned: false
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# NEW-ASR-VOX
app.py ADDED
@@ -0,0 +1,2240 @@
# NEW-ASR-VOX

# ==============================================================================
# Cell 1: Complete Setup - Based on Your Working VoxLingua Code
# ==============================================================================

import os, re, glob, csv
import torch
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from speechbrain.inference.classifiers import EncoderClassifier
from speechbrain.pretrained.interfaces import foreign_class
import torchaudio
import warnings
warnings.filterwarnings('ignore')

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

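# Optional helper: classify_file() decodes audio itself, but anything fed straight
# to classify_batch() should be 16 kHz mono first. A minimal sketch assuming a
# 16 kHz target; the helper name and target_sr default are illustrative, not part
# of the original pipeline.
def load_mono_16k(path, target_sr=16000):
    """Load an audio file, downmix to mono, and resample to target_sr."""
    waveform, sr = torchaudio.load(path)               # (channels, samples)
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform
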
# ==============================================================================
# Cell 2: Load Multiple Language Detection Models for Ensemble
# ==============================================================================
print("🔄 Loading Multiple Language Detection Models...")

# Model 1: VoxLingua107 ECAPA-TDNN (your working baseline - 60% ensemble weight, see Cell 5)
voxlingua_model = None
try:
    print("Loading VoxLingua107 ECAPA-TDNN...")
    voxlingua_model = EncoderClassifier.from_hparams(
        source="speechbrain/lang-id-voxlingua107-ecapa",
        savedir="pretrained_models/langid_voxlingua107_ecapa",
        run_opts={"device": device}
    )
    print("✅ VoxLingua107 loaded successfully")
except Exception as e:
    print(f"❌ VoxLingua107 failed: {e}")

# Model 2: XLS-R Language ID (40% ensemble weight, see Cell 5)
xlsr_lid_model = None
try:
    print("Loading TalTechNLP XLS-R Language ID...")
    xlsr_lid_model = foreign_class(
        source="TalTechNLP/voxlingua107-xls-r-300m-wav2vec",
        pymodule_file="encoder_wav2vec_classifier.py",
        classname="EncoderWav2vecClassifier",
        hparams_file="inference_wav2vec.yaml",
        savedir="pretrained_models/xlsr_voxlingua",
        run_opts={"device": device}
    )
    print("✅ XLS-R Language ID loaded successfully")
except Exception as e:
    print(f"❌ XLS-R failed: {e}")

models_loaded = sum(p is not None for p in [voxlingua_model, xlsr_lid_model])
print(f"\n📊 Models loaded: {models_loaded}/2")

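# Optional guard (an addition, not in the original flow): everything below
# degrades to "unknown" predictions when neither model is up, so failing fast
# here can save a long no-op run.
if models_loaded == 0:
    raise RuntimeError("No language-ID models loaded - check the SpeechBrain installs above")
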
# ==============================================================================
# Cell 3: Complete Language Mappings from Your Dataset
# ==============================================================================

# All languages from your dataset (based on the accuracy table you showed)
DATASET_LANGUAGES = {
    # Indo-Aryan languages
    'ur', 'pa', 'hi', 'bn', 'ne', 'as', 'ks', 'mr', 'gu', 'or',
    # Dravidian languages
    'ta', 'te', 'kn', 'ml',
    # Low-resource languages
    'sd', 'kok', 'br', 'doi', 'sat', 'mni',
    # Others in your dataset
    'sa'  # Sanskrit
}

# Language family classifications
INDO_ARYAN_LANGS = {'ur', 'pa', 'hi', 'bn', 'ne', 'as', 'ks', 'mr', 'gu', 'or', 'sd'}
DRAVIDIAN_LANGS = {'ta', 'te', 'kn', 'ml'}
LOW_RESOURCE_LANGS = {'kok', 'br', 'doi', 'sat', 'mni'}
OTHER_LANGS = {'sa'}  # Sanskrit

ALL_SUPPORTED_LANGS = INDO_ARYAN_LANGS | DRAVIDIAN_LANGS | LOW_RESOURCE_LANGS | OTHER_LANGS

# Cross-lingual transfer mappings (research-based)
TRANSFER_MAPPINGS = {
    # Low-resource to high-resource language mappings
    'br': 'hi',    # Bodo → Hindi (brx mapped to br in your dataset)
    'sat': 'hi',   # Santali → Hindi
    'doi': 'pa',   # Dogri → Punjabi
    'mni': 'bn',   # Manipuri → Bengali
    'kok': 'mr',   # Konkani → Marathi (geographic proximity)
    'sd': 'hi',    # Sindhi → Hindi
}

# Language code mappings (VoxLingua output to your dataset codes)
VOXLINGUA_TO_DATASET = {
    'urd': 'ur', 'urdu': 'ur',
    'pan': 'pa', 'punjabi': 'pa', 'pnb': 'pa',
    'hin': 'hi', 'hindi': 'hi',
    'ben': 'bn', 'bengali': 'bn',
    'nep': 'ne', 'nepali': 'ne',
    'asm': 'as', 'assamese': 'as',
    'kas': 'ks', 'kashmiri': 'ks',
    'mar': 'mr', 'marathi': 'mr',
    'guj': 'gu', 'gujarati': 'gu',
    'ori': 'or', 'odia': 'or', 'ory': 'or',
    'tam': 'ta', 'tamil': 'ta',
    'tel': 'te', 'telugu': 'te',
    'kan': 'kn', 'kannada': 'kn',
    'mal': 'ml', 'malayalam': 'ml',
    'sin': 'sd', 'sindhi': 'sd', 'snd': 'sd',
    'kok': 'kok', 'konkani': 'kok',
    'san': 'sa', 'sanskrit': 'sa',
    # Common variations
    'bho': 'hi',  # Bhojpuri → Hindi
    'mai': 'hi',  # Maithili → Hindi
    'mag': 'hi',  # Magahi → Hindi
}

print("✅ Complete language mappings loaded")
print(f"📊 Total dataset languages: {len(ALL_SUPPORTED_LANGS)}")
print(f"📊 Mapping variations: {len(VOXLINGUA_TO_DATASET)}")

# ==============================================================================
# Cell 4: Enhanced Parsing Functions (Your Working Code + Improvements)
# ==============================================================================

def parse_top1(out):
    """Parse VoxLingua107 output - your exact working function"""
    logits, log_conf, pred_idx, labels = out
    label_str = labels[0] if (isinstance(labels, (list, tuple)) and len(labels) > 0) else "unknown"
    if not isinstance(label_str, str):
        label_str = str(label_str)
    colon_pos = label_str.find(":")
    if colon_pos != -1:
        iso = label_str[:colon_pos].strip()
    else:
        iso = label_str.strip()
    conf = float(log_conf.exp().item())
    return iso, label_str, conf

def parse_xlsr_output(out):
    """Parse XLS-R model output"""
    try:
        out_prob, score, index, text_lab = out
        lang_code = str(text_lab[0]).strip().lower()
        confidence = float(out_prob.exp().max().item())
        return lang_code, confidence
    except Exception as e:
        print(f"   XLS-R parsing error: {e}")
        return "unknown", 0.0

def map_to_dataset_language(detected_lang):
    """Map VoxLingua/XLS-R output to your dataset language codes"""

    # Direct match first
    if detected_lang in ALL_SUPPORTED_LANGS:
        return detected_lang

    # Check mapping dictionary
    mapped = VOXLINGUA_TO_DATASET.get(detected_lang.lower(), detected_lang)

    # If still not in dataset, try transfer mapping
    # (note: every key in TRANSFER_MAPPINGS is already in ALL_SUPPORTED_LANGS and
    # is caught by the direct match above, so with the current tables this branch
    # never fires; it only matters if TRANSFER_MAPPINGS ever gains codes that are
    # outside the dataset)
    if mapped not in ALL_SUPPORTED_LANGS and mapped in TRANSFER_MAPPINGS:
        transfer_target = TRANSFER_MAPPINGS[mapped]
        print(f"   Transfer mapping: {mapped} → {transfer_target}")
        return transfer_target

    return mapped

print("✅ Enhanced parsing functions ready")

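# Quick sanity checks for the mapping chain (the example codes are illustrative):
assert map_to_dataset_language('tam') == 'ta'   # VoxLingua 3-letter code → dataset code
assert map_to_dataset_language('bho') == 'hi'   # Bhojpuri folded into Hindi
assert map_to_dataset_language('ta') == 'ta'    # dataset codes pass straight through
assert map_to_dataset_language('zz') == 'zz'    # unknown codes come back unchanged,
                                                # so callers must still check membership
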
# ==============================================================================
# Cell 5: Hybrid Multi-Model Language Detection
# ==============================================================================

def hybrid_language_detection(audio_path):
    """
    Multi-model ensemble language detection optimized for your dataset
    """

    print(f"   🎧 Analyzing: {os.path.basename(audio_path)}")

    predictions = {}
    confidences = {}

    # Model 1: VoxLingua107 (primary - 60% weight since it's your working baseline)
    if voxlingua_model is not None:
        try:
            out = voxlingua_model.classify_file(audio_path)
            pred_iso, pred_label, conf = parse_top1(out)

            # Map to dataset language codes
            mapped_lang = map_to_dataset_language(pred_iso)

            predictions['voxlingua'] = mapped_lang
            confidences['voxlingua'] = conf * 0.60  # 60% weight
            print(f"   VoxLingua107: {pred_iso} → {mapped_lang} ({conf:.3f})")

        except Exception as e:
            print(f"   VoxLingua107 error: {e}")

    # Model 2: XLS-R (secondary - 40% weight)
    if xlsr_lid_model is not None:
        try:
            out = xlsr_lid_model.classify_file(audio_path)
            lang_code, conf = parse_xlsr_output(out)

            # Map to dataset language codes
            mapped_lang = map_to_dataset_language(lang_code)

            predictions['xlsr'] = mapped_lang
            confidences['xlsr'] = conf * 0.40  # 40% weight
            print(f"   XLS-R: {lang_code} → {mapped_lang} ({conf:.3f})")

        except Exception as e:
            print(f"   XLS-R error: {e}")

    # Ensemble decision making
    if not predictions:
        return "unknown", 0.0

    # Strategy 1: Check for agreement between models
    if len(predictions) >= 2:
        pred_values = list(predictions.values())
        if pred_values[0] == pred_values[1]:  # Models agree
            consensus_lang = pred_values[0]
            avg_confidence = sum(confidences.values()) / len(confidences)
            print(f"   🎯 Consensus: {consensus_lang} (confidence: {avg_confidence:.3f})")
            return consensus_lang, avg_confidence

    # Strategy 2: Use highest weighted confidence
    if confidences:
        best_model = max(confidences.keys(), key=lambda k: confidences[k])
        best_lang = predictions[best_model]
        # Divide the ensemble weight back out to report a raw model probability
        best_conf = confidences[best_model] / (0.60 if best_model == 'voxlingua' else 0.40)

        print(f"   🎯 Best model ({best_model}): {best_lang} (confidence: {best_conf:.3f})")
        return best_lang, best_conf

    return "unknown", 0.0

print("✅ Hybrid ensemble language detection ready")

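# Smoke test for the ensemble; the clip path is a placeholder - point it at any
# local audio file before running.
sample_clip = "sample_audio/hi_example.wav"  # hypothetical path
if os.path.exists(sample_clip):
    lang, conf = hybrid_language_detection(sample_clip)
    print(f"Detected: {lang} (confidence: {conf:.3f})")
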
# ==============================================================================
# Cell 6: Complete Ground Truth Extraction for Your Dataset
# ==============================================================================

def gt_from_filename(path):
    """Extract ground truth from filename - complete version for your dataset"""

    name = os.path.basename(path).lower()

    # Pattern 1: Your working regex pattern
    GT_TOKEN = re.compile(r'(?:^|[_-])([a-z]{2,4})(?:[_-]|$)', re.IGNORECASE)
    m = GT_TOKEN.search(name)

    if m:
        code = m.group(1).lower()

        # Complete mapping based on your dataset structure
        filename_mappings = {
            # Your working mappings
            "guf": "gu", "mrt": "mr", "ml": "ml",

            # Additional mappings for your complete dataset
            "urd": "ur", "urdu": "ur",
            "pan": "pa", "punjabi": "pa", "pnb": "pa",
            "hin": "hi", "hindi": "hi",
            "ben": "bn", "bengali": "bn", "bng": "bn",
            "nep": "ne", "nepali": "ne",
            "asm": "as", "assamese": "as",
            "kas": "ks", "kashmiri": "ks",
            "mar": "mr", "marathi": "mr",
            "guj": "gu", "gujarati": "gu",
            "ori": "or", "odia": "or", "ory": "or",
            "tam": "ta", "tamil": "ta",
            "tel": "te", "telugu": "te",
            "kan": "kn", "kannada": "kn",
            "mal": "ml", "malayalam": "ml",
            "sin": "sd", "sindhi": "sd", "snd": "sd",
            "kok": "kok", "konkani": "kok",
            "bod": "br", "bodo": "br",  # Bodo variations
            "dog": "doi", "dogri": "doi",
            "sat": "sat", "santali": "sat",
            "mni": "mni", "manipuri": "mni",
            "san": "sa", "sanskrit": "sa",
        }

        mapped_code = filename_mappings.get(code, code)

        # Validate against your dataset languages
        if mapped_code in ALL_SUPPORTED_LANGS:
            return mapped_code

    # Pattern 2: Check folder structure
    path_parts = path.split('/')
    for part in path_parts:
        part_lower = part.lower()
        if part_lower in ALL_SUPPORTED_LANGS:
            return part_lower
        # Check if it's a language name folder
        for full_name, code in [('gujarati', 'gu'), ('marathi', 'mr'), ('hindi', 'hi'),
                                ('bengali', 'bn'), ('tamil', 'ta'), ('telugu', 'te'),
                                ('kannada', 'kn'), ('malayalam', 'ml'), ('punjabi', 'pa'),
                                ('urdu', 'ur'), ('assamese', 'as'), ('odia', 'or'),
                                ('nepali', 'ne'), ('kashmiri', 'ks'), ('sindhi', 'sd'),
                                ('konkani', 'kok'), ('bodo', 'br'), ('dogri', 'doi'),
                                ('santali', 'sat'), ('manipuri', 'mni'), ('sanskrit', 'sa')]:
            if full_name in part_lower:
                return code

    return None

print("✅ Complete ground truth extraction ready")

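# Illustrative filenames (hypothetical) showing what the two patterns catch; note
# that only the FIRST 2-4 letter token delimited by '_' or '-' is consulted:
assert gt_from_filename("hin_speaker01_clip.wav") == "hi"     # token 'hin' mapped to 'hi'
assert gt_from_filename("004_ta_news.wav") == "ta"            # dataset code used directly
assert gt_from_filename("/data/gujarati/clip7.flac") == "gu"  # folder-name fallback
assert gt_from_filename("recording123.wav") is None           # no recognizable marker
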
# ==============================================================================
# Cell 7: Google Drive Processing with Error Handling
# ==============================================================================

def download_and_process_drive_dataset():
    """Download and process with robust error handling"""

    print("📁 Processing Google Drive dataset...")

    # Get sharing link
    share_link = input("🔗 Enter Google Drive sharing link: ").strip()

    if not share_link:
        print("❌ No link provided")
        return []

    # Extract file ID
    def extract_file_id(link):
        patterns = [r'/folders/([a-zA-Z0-9-_]+)', r'id=([a-zA-Z0-9-_]+)', r'/file/d/([a-zA-Z0-9-_]+)']
        for pattern in patterns:
            match = re.search(pattern, link)
            if match:
                return match.group(1)
        return None

    file_id = extract_file_id(share_link)
    if not file_id:
        print("❌ Could not extract file ID from sharing link")
        return []

    # Set up download directory
    download_dir = "/content/drive_dataset"
    if os.path.exists(download_dir):
        import shutil
        shutil.rmtree(download_dir)
    os.makedirs(download_dir, exist_ok=True)

    # Download with error handling
    try:
        import gdown
        print(f"📥 Downloading from Google Drive (ID: {file_id})...")
        gdown.download_folder(f"https://drive.google.com/drive/folders/{file_id}",
                              output=download_dir, quiet=False, use_cookies=False)
        print("✅ Download completed successfully")

    except Exception as e:
        print(f"❌ Download failed: {e}")
        print("💡 Make sure the folder is shared with 'Anyone with the link can view'")
        return []

    # Scan for audio files
    VALID_EXTS = {".wav", ".mp3", ".flac", ".m4a", ".ogg"}

    def is_audio(filepath):
        return os.path.splitext(filepath)[1].lower() in VALID_EXTS

    print("🔍 Scanning for audio files...")
    all_files = []

    for root, dirs, files in os.walk(download_dir):
        for file in files:
            if is_audio(file):
                full_path = os.path.join(root, file)
                all_files.append(full_path)

    print(f"📊 Found {len(all_files)} total audio files")

    # Filter and limit files
    filtered_files = []
    lang_counts = {}
    english_skipped = 0

    for file_path in all_files:
        # Skip English files
        if any(eng_indicator in file_path.lower() for eng_indicator in
               ['english', '_en_', '/en/', 'eng_', '_eng']):
            english_skipped += 1
            continue

        # Extract language for limiting
        gt_lang = gt_from_filename(file_path)
        if gt_lang:
            lang_counts[gt_lang] = lang_counts.get(gt_lang, 0)
            if lang_counts[gt_lang] < 5:  # Max 5 per language
                filtered_files.append(file_path)
                lang_counts[gt_lang] += 1
        else:
            # Include files without clear language markers (up to overall limit)
            if len(filtered_files) < 50:
                filtered_files.append(file_path)

    print("📊 Filtered results:")
    print(f"   English files skipped: {english_skipped}")
    print(f"   Selected for processing: {len(filtered_files)}")

    for lang, count in sorted(lang_counts.items()):
        print(f"   {lang}: {count} files")

    return filtered_files

# Execute download and processing
test_files = download_and_process_drive_dataset()
print(f"\n🎯 Total files ready for language detection: {len(test_files)}")

# ==============================================================================
# Cell 8: Execute Language Detection Analysis
# ==============================================================================

def run_language_detection_analysis(audio_files):
    """Run complete language detection analysis"""

    if not audio_files:
        print("❌ No audio files to process")
        return []

    print(f"🚀 Starting language detection on {len(audio_files)} files...")
    print("=" * 60)

    results = []

    for i, audio_path in enumerate(audio_files, 1):
        print(f"\n[{i}/{len(audio_files)}] Processing: {os.path.basename(audio_path)}")

        try:
            # Extract ground truth
            gt_iso = gt_from_filename(audio_path)

            # Run hybrid detection
            pred_iso, confidence = hybrid_language_detection(audio_path)

            # Determine correctness
            is_correct = (gt_iso == pred_iso) if gt_iso else None

            result = {
                "file": os.path.basename(audio_path),
                "full_path": audio_path,
                "gt_iso": gt_iso if gt_iso else "",
                "pred_iso": pred_iso,
                "confidence": confidence,
                "correct": is_correct
            }

            results.append(result)

            # Status display
            status = "✅" if is_correct else "❌" if is_correct is False else "❓"
            print(f"   {status} GT: {gt_iso or 'Unknown'} | Pred: {pred_iso} | Conf: {confidence:.3f}")

        except Exception as e:
            print(f"   💥 Error processing file: {e}")
            results.append({
                "file": os.path.basename(audio_path),
                "full_path": audio_path,
                "gt_iso": "",
                "pred_iso": "error",
                "confidence": 0.0,
                "correct": False
            })

    return results

# Run the analysis
analysis_results = run_language_detection_analysis(test_files)
print("\n🎉 Language detection analysis complete!")
print(f"📊 Total results: {len(analysis_results)}")

# ==============================================================================
# Cell 9: Complete Results Analysis and Accuracy Report
# ==============================================================================

def generate_comprehensive_analysis(results):
    """Generate complete analysis matching your dataset format"""

    df = pd.DataFrame(results)

    # Filter to files with ground truth from your dataset
    valid_df = df[(df["gt_iso"] != "") & (df["gt_iso"].isin(ALL_SUPPORTED_LANGS))].copy()

    if len(valid_df) == 0:
        print("❌ No valid ground truth files found")
        return None, None

    print("📊 COMPREHENSIVE LANGUAGE DETECTION ANALYSIS")
    print("=" * 60)

    # Overall accuracy
    overall_acc = accuracy_score(valid_df["gt_iso"], valid_df["pred_iso"])
    print(f"🎯 OVERALL ACCURACY: {overall_acc:.4f} ({overall_acc*100:.1f}%)")

    # Accuracy table (this pipeline only tracks Top-1, so no Top-5 column here)
    print("\n📊 LANGUAGE-WISE ACCURACY:")
    print("-" * 60)
    print("Code | Language Name   | Files | Top-1 | Conf")
    print("-" * 60)

    # Language name mapping
    LANG_NAMES = {
        'ur': 'Urdu', 'pa': 'Punjabi', 'ta': 'Tamil', 'sd': 'Sindhi',
        'or': 'Odia', 'ml': 'Malayalam', 'ne': 'Nepali', 'as': 'Assamese',
        'hi': 'Hindi', 'bn': 'Bengali', 'kok': 'Konkani', 'kn': 'Kannada',
        'ks': 'Kashmiri', 'mr': 'Marathi', 'te': 'Telugu', 'br': 'Bodo',
        'doi': 'Dogri', 'sat': 'Santali', 'gu': 'Gujarati', 'mai': 'Maithili',
        'mni': 'Manipuri', 'sa': 'Sanskrit'
    }

    # Calculate per-language statistics
    lang_stats = []

    for lang_code in sorted(valid_df["gt_iso"].unique()):
        lang_data = valid_df[valid_df["gt_iso"] == lang_code]

        total_files = len(lang_data)
        correct_pred = (lang_data["gt_iso"] == lang_data["pred_iso"]).sum()
        accuracy = correct_pred / total_files
        avg_conf = lang_data["confidence"].mean()

        lang_name = LANG_NAMES.get(lang_code, lang_code.title())

        print(f"{lang_code:>4s} | {lang_name:<15s} | {total_files:>5d} | {accuracy*100:>5.1f}% | {avg_conf:>5.3f}")

        lang_stats.append({
            'code': lang_code,
            'name': lang_name,
            'files': total_files,
            'accuracy': accuracy,
            'confidence': avg_conf
        })

    print("-" * 60)

    # Language family analysis
    print("\n📊 LANGUAGE FAMILY PERFORMANCE:")
    print("-" * 40)

    family_stats = {}
    for _, row in valid_df.iterrows():
        lang = row['gt_iso']
        correct = row['correct']

        if lang in INDO_ARYAN_LANGS:
            family = 'Indo-Aryan'
        elif lang in DRAVIDIAN_LANGS:
            family = 'Dravidian'
        elif lang in LOW_RESOURCE_LANGS:
            family = 'Low-Resource'
        else:
            family = 'Other'

        if family not in family_stats:
            family_stats[family] = {'correct': 0, 'total': 0}
        family_stats[family]['total'] += 1
        if correct:
            family_stats[family]['correct'] += 1

    for family, stats in family_stats.items():
        acc_pct = (stats['correct'] / stats['total']) * 100
        print(f"{family:<15s}: {acc_pct:>5.1f}% ({stats['correct']:>2d}/{stats['total']:>2d})")

    # Model performance analysis
    print("\n📊 MODEL PERFORMANCE:")
    print("-" * 30)
    print(f"Models loaded: {models_loaded}/2")
    print(f"VoxLingua107: {'✅ Active' if voxlingua_model else '❌ Failed'}")
    print(f"XLS-R: {'✅ Active' if xlsr_lid_model else '❌ Failed'}")

    # Error analysis
    errors = valid_df[valid_df["gt_iso"] != valid_df["pred_iso"]]
    if len(errors) > 0:
        print(f"\n❌ MISCLASSIFICATION ANALYSIS ({len(errors)} errors):")
        print("-" * 50)

        # Group errors by actual language
        for actual_lang in sorted(errors["gt_iso"].unique()):
            lang_errors = errors[errors["gt_iso"] == actual_lang]
            predicted_langs = lang_errors["pred_iso"].value_counts()

            print(f"{actual_lang} ({LANG_NAMES.get(actual_lang, actual_lang)}):")
            for pred_lang, count in predicted_langs.head(3).items():
                print(f"   → {pred_lang} ({count} files)")

    # Summary statistics
    print("\n📈 SUMMARY STATISTICS:")
    print("-" * 25)
    print(f"Total files processed: {len(df)}")
    print(f"Files with valid GT: {len(valid_df)}")
    print(f"Languages detected: {len(valid_df['pred_iso'].unique())}")
    print(f"Languages in dataset: {len(valid_df['gt_iso'].unique())}")
    print(f"Perfect accuracy: {len([l for l in lang_stats if l['accuracy'] == 1.0])}")
    print(f"Above 90% accuracy: {len([l for l in lang_stats if l['accuracy'] >= 0.9])}")
    print(f"Below 50% accuracy: {len([l for l in lang_stats if l['accuracy'] < 0.5])}")

    return valid_df, lang_stats

# Run comprehensive analysis
if 'analysis_results' in globals() and analysis_results:
    final_df, language_statistics = generate_comprehensive_analysis(analysis_results)

    # Save results to CSV
    if final_df is not None:
        timestamp = pd.Timestamp.now().strftime("%Y%m%d_%H%M%S")
        csv_filename = f"language_detection_results_{timestamp}.csv"
        final_df.to_csv(csv_filename, index=False)
        print(f"\n💾 Results saved to: {csv_filename}")

        # Download file (Colab only)
        try:
            from google.colab import files
            files.download(csv_filename)
            print("📥 File downloaded successfully")
        except Exception:
            print("📁 File saved locally (download failed)")
else:
    print("❌ No analysis results available. Please run the previous cells first.")

print("\n✅ COMPLETE LANGUAGE DETECTION ANALYSIS FINISHED!")

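# confusion_matrix is imported in Cell 1 but never used above; a minimal sketch
# that turns the saved results into a labeled confusion table (assumes final_df
# from Cell 9 is available and non-empty).
if 'final_df' in globals() and final_df is not None and len(final_df) > 0:
    cm_labels = sorted(set(final_df["gt_iso"]) | set(final_df["pred_iso"]))
    cm = confusion_matrix(final_df["gt_iso"], final_df["pred_iso"], labels=cm_labels)
    cm_df = pd.DataFrame(cm, index=cm_labels, columns=cm_labels)
    print("Confusion matrix (rows = ground truth, cols = prediction):")
    print(cm_df)
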
# ==============================================================================
# Independent Model Analysis with Top-5 and Real Confidence Scores
# ==============================================================================

def analyze_models_independently(audio_files):
    """Analyze each model independently with Top-5 predictions and real confidence scores"""

    print("🔍 INDEPENDENT MODEL ANALYSIS")
    print("=" * 60)

    results = {
        'voxlingua': [],
        'xlsr': [],
        'combined_analysis': []
    }

    for i, audio_path in enumerate(audio_files, 1):
        print(f"\n[{i}/{len(audio_files)}] Analyzing: {os.path.basename(audio_path)}")

        # Extract ground truth
        gt_iso = gt_from_filename(audio_path)
        print(f"   Ground Truth: {gt_iso or 'Unknown'}")

        file_result = {
            'file': os.path.basename(audio_path),
            'gt_iso': gt_iso or '',
            'voxlingua_results': {},
            'xlsr_results': {}
        }

        # ========================================
        # VoxLingua107 Independent Analysis
        # ========================================
        if voxlingua_model is not None:
            try:
                print("   🔬 VoxLingua107 Analysis:")
                out = voxlingua_model.classify_file(audio_path)

                # Extract Top-5 predictions with real confidence scores
                logits, log_conf, pred_idx, labels = out

                # Get top 5 predictions
                top5_indices = torch.topk(logits.squeeze(), 5).indices
                top5_probs = torch.softmax(logits.squeeze(), dim=0)

                vox_top5 = []
                for idx in top5_indices:
                    lang_label = labels[idx.item()] if idx.item() < len(labels) else f"idx_{idx.item()}"
                    prob = top5_probs[idx.item()].item()

                    # Extract language code
                    if isinstance(lang_label, str):
                        colon_pos = lang_label.find(":")
                        lang_code = lang_label[:colon_pos].strip() if colon_pos != -1 else lang_label.strip()
                    else:
                        lang_code = str(lang_label)

                    # Map to dataset codes
                    mapped_lang = map_to_dataset_language(lang_code)

                    vox_top5.append({
                        'rank': len(vox_top5) + 1,
                        'original_code': lang_code,
                        'mapped_code': mapped_lang,
                        'confidence': prob,
                        'in_dataset': mapped_lang in ALL_SUPPORTED_LANGS
                    })

                    print(f"      Rank {len(vox_top5)}: {lang_code} → {mapped_lang} ({prob:.4f}) {'✅' if mapped_lang in ALL_SUPPORTED_LANGS else '❌'}")

                # Store VoxLingua results
                file_result['voxlingua_results'] = {
                    'top5': vox_top5,
                    'top1_original': vox_top5[0]['original_code'],
                    'top1_mapped': vox_top5[0]['mapped_code'],
                    'top1_confidence': vox_top5[0]['confidence'],
                    'correct_in_top1': gt_iso == vox_top5[0]['mapped_code'] if gt_iso else None,
                    'correct_in_top5': any(pred['mapped_code'] == gt_iso for pred in vox_top5) if gt_iso else None
                }

                results['voxlingua'].append({
                    'file': os.path.basename(audio_path),
                    'gt_iso': gt_iso or '',
                    'pred_iso': vox_top5[0]['mapped_code'],
                    'confidence': vox_top5[0]['confidence'],
                    'correct': gt_iso == vox_top5[0]['mapped_code'] if gt_iso else None,
                    'top5_predictions': [p['mapped_code'] for p in vox_top5]
                })

            except Exception as e:
                print(f"   ❌ VoxLingua107 error: {e}")
                file_result['voxlingua_results'] = {'error': str(e)}

        # ========================================
        # XLS-R Independent Analysis
        # ========================================
        if xlsr_lid_model is not None:
            try:
                print("   🔬 XLS-R Analysis:")
                out = xlsr_lid_model.classify_file(audio_path)

                # Parse XLS-R output for Top-5
                out_prob, score, index, text_lab = out

                # Get top 5 predictions
                top5_indices = torch.topk(out_prob.squeeze(), 5).indices
                top5_probs = torch.softmax(out_prob.squeeze(), dim=0)

                xlsr_top5 = []
                for idx in top5_indices:
                    lang_label = text_lab[idx.item()] if idx.item() < len(text_lab) else f"idx_{idx.item()}"
                    prob = top5_probs[idx.item()].item()

                    lang_code = str(lang_label).strip().lower()
                    mapped_lang = map_to_dataset_language(lang_code)

                    xlsr_top5.append({
                        'rank': len(xlsr_top5) + 1,
                        'original_code': lang_code,
                        'mapped_code': mapped_lang,
                        'confidence': prob,
                        'in_dataset': mapped_lang in ALL_SUPPORTED_LANGS
                    })

                    print(f"      Rank {len(xlsr_top5)}: {lang_code} → {mapped_lang} ({prob:.4f}) {'✅' if mapped_lang in ALL_SUPPORTED_LANGS else '❌'}")

                # Store XLS-R results
                file_result['xlsr_results'] = {
                    'top5': xlsr_top5,
                    'top1_original': xlsr_top5[0]['original_code'],
                    'top1_mapped': xlsr_top5[0]['mapped_code'],
                    'top1_confidence': xlsr_top5[0]['confidence'],
                    'correct_in_top1': gt_iso == xlsr_top5[0]['mapped_code'] if gt_iso else None,
                    'correct_in_top5': any(pred['mapped_code'] == gt_iso for pred in xlsr_top5) if gt_iso else None
                }

                results['xlsr'].append({
                    'file': os.path.basename(audio_path),
                    'gt_iso': gt_iso or '',
                    'pred_iso': xlsr_top5[0]['mapped_code'],
                    'confidence': xlsr_top5[0]['confidence'],
                    'correct': gt_iso == xlsr_top5[0]['mapped_code'] if gt_iso else None,
                    'top5_predictions': [p['mapped_code'] for p in xlsr_top5]
                })

            except Exception as e:
                print(f"   ❌ XLS-R error: {e}")
                file_result['xlsr_results'] = {'error': str(e)}

        results['combined_analysis'].append(file_result)

        print(f"   ✅ Analysis complete for {os.path.basename(audio_path)}")

    return results

def generate_independent_model_report(results):
    """Generate comprehensive independent model analysis report"""

    print("\n📊 INDEPENDENT MODEL PERFORMANCE ANALYSIS")
    print("=" * 70)

    # VoxLingua107 analysis
    if results['voxlingua']:
        vox_df = pd.DataFrame(results['voxlingua'])
        valid_vox = vox_df[vox_df['gt_iso'] != ''].copy()

        if len(valid_vox) > 0:
            vox_acc = accuracy_score(valid_vox['gt_iso'], valid_vox['pred_iso'])
            vox_conf_avg = valid_vox['confidence'].mean()
            vox_conf_std = valid_vox['confidence'].std()

            print("\n🔬 VoxLingua107 INDEPENDENT ANALYSIS:")
            print(f"   Files analyzed: {len(valid_vox)}")
            print(f"   Top-1 Accuracy: {vox_acc:.4f} ({vox_acc*100:.1f}%)")
            print(f"   Avg Confidence: {vox_conf_avg:.4f} ± {vox_conf_std:.4f}")

            # Per-language accuracy for VoxLingua
            print("   Per-language performance:")
            vox_per_lang = valid_vox.groupby('gt_iso').agg({
                'correct': 'mean',
                'confidence': ['mean', 'count']
            }).round(4)
            vox_per_lang.columns = ['accuracy', 'avg_conf', 'count']

            for lang, row in vox_per_lang.iterrows():
                print(f"      {lang}: {row['accuracy']:.3f} ({row['accuracy']*100:.1f}%) - {row['avg_conf']:.3f} conf - {int(row['count'])} files")

    # XLS-R analysis
    if results['xlsr']:
        xlsr_df = pd.DataFrame(results['xlsr'])
        valid_xlsr = xlsr_df[xlsr_df['gt_iso'] != ''].copy()

        if len(valid_xlsr) > 0:
            xlsr_acc = accuracy_score(valid_xlsr['gt_iso'], valid_xlsr['pred_iso'])
            xlsr_conf_avg = valid_xlsr['confidence'].mean()
            xlsr_conf_std = valid_xlsr['confidence'].std()

            print("\n🔬 XLS-R INDEPENDENT ANALYSIS:")
            print(f"   Files analyzed: {len(valid_xlsr)}")
            print(f"   Top-1 Accuracy: {xlsr_acc:.4f} ({xlsr_acc*100:.1f}%)")
            print(f"   Avg Confidence: {xlsr_conf_avg:.4f} ± {xlsr_conf_std:.4f}")

            # Per-language accuracy for XLS-R
            print("   Per-language performance:")
            xlsr_per_lang = valid_xlsr.groupby('gt_iso').agg({
                'correct': 'mean',
                'confidence': ['mean', 'count']
            }).round(4)
            xlsr_per_lang.columns = ['accuracy', 'avg_conf', 'count']

            for lang, row in xlsr_per_lang.iterrows():
                print(f"      {lang}: {row['accuracy']:.3f} ({row['accuracy']*100:.1f}%) - {row['avg_conf']:.3f} conf - {int(row['count'])} files")

    # Model comparison
    if results['voxlingua'] and results['xlsr']:
        print("\n⚖️ MODEL COMPARISON:")
        print("   VoxLingua107 vs XLS-R:")
        print(f"   Accuracy: {vox_acc:.4f} vs {xlsr_acc:.4f} ({'VoxLingua wins' if vox_acc > xlsr_acc else 'XLS-R wins' if xlsr_acc > vox_acc else 'Tie'})")
        print(f"   Avg Confidence: {vox_conf_avg:.4f} vs {xlsr_conf_avg:.4f}")

        # Suggest optimal weights
        total_perf = vox_acc + xlsr_acc
        vox_weight = vox_acc / total_perf if total_perf > 0 else 0.5
        xlsr_weight = xlsr_acc / total_perf if total_perf > 0 else 0.5

        print("\n💡 SUGGESTED OPTIMAL WEIGHTS:")
        print(f"   VoxLingua107: {vox_weight:.2f} ({vox_weight*100:.0f}%)")
        print(f"   XLS-R: {xlsr_weight:.2f} ({xlsr_weight*100:.0f}%)")

    return results

# Run independent analysis
if 'test_files' in globals() and test_files:
    independent_results = analyze_models_independently(test_files[:10])  # Limit to first 10 for testing
    final_report = generate_independent_model_report(independent_results)
else:
    print("❌ No test files available. Run the previous cells first.")

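# The per-file dicts above already carry 'top5_predictions', but the report only
# prints Top-1; a short sketch that derives Top-5 accuracy from those same rows
# (assumes independent_results from the runner above; the helper name is new).
def top5_accuracy(rows):
    """Share of files whose ground truth appears anywhere in the model's Top-5."""
    scored = [r for r in rows if r['gt_iso']]
    if not scored:
        return 0.0
    hits = sum(1 for r in scored if r['gt_iso'] in r.get('top5_predictions', []))
    return hits / len(scored)

if 'independent_results' in globals():
    for model_name in ('voxlingua', 'xlsr'):
        rows = independent_results.get(model_name, [])
        if rows:
            print(f"{model_name} Top-5 accuracy: {top5_accuracy(rows):.3f}")
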
# ==============================================================================
# Analyze Already Downloaded Files in /content/drive_dataset/
# ==============================================================================

def scan_downloaded_files():
    """Scan and collect already downloaded audio files"""

    download_dir = "/content/drive_dataset"

    if not os.path.exists(download_dir):
        print("❌ Download directory not found")
        return []

    print(f"🔍 Scanning {download_dir} for audio files...")

    # Valid audio extensions
    VALID_EXTS = {".wav", ".mp3", ".flac", ".m4a", ".ogg"}

    def is_audio(filepath):
        return os.path.splitext(filepath)[1].lower() in VALID_EXTS

    # Collect all audio files
    audio_files = []
    lang_counts = {}

    for root, dirs, files in os.walk(download_dir):
        for file in files:
            if is_audio(file):
                full_path = os.path.join(root, file)
                audio_files.append(full_path)

                # Extract language from folder structure
                path_parts = root.split('/')
                for part in path_parts:
                    if len(part) in [2, 3] and part.isalpha():
                        lang_counts[part] = lang_counts.get(part, 0) + 1
                        break

    print(f"📊 Found {len(audio_files)} audio files:")
    for lang, count in sorted(lang_counts.items()):
        print(f"   {lang}: {count} files")

    # Show sample files
    print("\n📁 Sample files:")
    for file_path in audio_files[:5]:
        print(f"   {file_path}")

    return audio_files

# Scan for downloaded files
downloaded_files = scan_downloaded_files()

if not downloaded_files:
    print("❌ No audio files found. Let me help you collect them manually.")

    # Manual file collection if scan fails
    print("\n🔍 Manual file search...")
    import glob

    # Search patterns for common locations
    search_patterns = [
        "/content/drive_dataset/**/*.flac",
        "/content/drive_dataset/**/*.wav",
        "/content/drive_dataset/**/*.mp3",
        "/content/**/*.flac",
        "/content/**/*.wav",
        "/content/**/*.mp3"
    ]

    manual_files = []
    for pattern in search_patterns:
        found = glob.glob(pattern, recursive=True)
        manual_files.extend(found)

    # Remove duplicates
    manual_files = list(set(manual_files))

    print(f"📊 Manual search found: {len(manual_files)} files")
    for file_path in manual_files[:10]:  # Show first 10
        print(f"   {file_path}")

    downloaded_files = manual_files

print(f"\n🎯 Total files ready for analysis: {len(downloaded_files)}")

# ==============================================================================
# Run Independent Analysis on Downloaded Files
# ==============================================================================

def analyze_downloaded_files_independently(audio_files):
    """Run independent model analysis on downloaded files with detailed output"""

    if not audio_files:
        print("❌ No audio files to analyze")
        return None

    print(f"🚀 Starting independent analysis on {len(audio_files)} files...")
    print("=" * 70)

    results = {
        'voxlingua_detailed': [],
        'xlsr_detailed': [],
        'comparison_data': []
    }

    for i, audio_path in enumerate(audio_files, 1):
        print(f"\n[{i}/{len(audio_files)}] 🎵 {os.path.basename(audio_path)}")

        # Extract ground truth from path/filename
        gt_iso = gt_from_filename(audio_path)
        print(f"   📝 Ground Truth: {gt_iso or 'Unknown'}")

        file_analysis = {
            'file': os.path.basename(audio_path),
            'full_path': audio_path,
            'gt_iso': gt_iso or '',
            'voxlingua': {'available': False},
            'xlsr': {'available': False}
        }

        # ==========================================
        # VoxLingua107 Independent Analysis
        # ==========================================
        if voxlingua_model is not None:
            try:
                print("   🔬 VoxLingua107 Analysis:")
                out = voxlingua_model.classify_file(audio_path)
                logits, log_conf, pred_idx, labels = out

                # Get real confidence scores (not weighted)
                probs = torch.softmax(logits.squeeze(), dim=0)
                top5_indices = torch.topk(probs, min(5, len(probs))).indices

                vox_predictions = []
                for rank, idx in enumerate(top5_indices, 1):
                    lang_label = labels[idx.item()]
                    confidence = probs[idx.item()].item()

                    # Parse language code
                    if isinstance(lang_label, str):
                        colon_pos = lang_label.find(":")
                        lang_code = lang_label[:colon_pos].strip() if colon_pos != -1 else lang_label.strip()
                    else:
                        lang_code = str(lang_label)

                    # Map to dataset language
                    mapped_lang = map_to_dataset_language(lang_code)

                    vox_predictions.append({
                        'rank': rank,
                        'original': lang_code,
                        'mapped': mapped_lang,
                        'confidence': confidence,
                        'in_dataset': mapped_lang in ALL_SUPPORTED_LANGS
                    })

                    status = "✅" if mapped_lang in ALL_SUPPORTED_LANGS else "❌"
                    print(f"      #{rank}: {lang_code} → {mapped_lang} ({confidence:.4f}) {status}")

                # Store VoxLingua results
                top1 = vox_predictions[0]
                file_analysis['voxlingua'] = {
                    'available': True,
                    'top5_predictions': vox_predictions,
                    'top1_prediction': top1['mapped'],
                    'top1_confidence': top1['confidence'],
                    'correct_top1': gt_iso == top1['mapped'] if gt_iso else None,
                    'correct_in_top5': any(p['mapped'] == gt_iso for p in vox_predictions) if gt_iso else None
                }

                results['voxlingua_detailed'].append({
                    'file': os.path.basename(audio_path),
                    'gt_iso': gt_iso or '',
                    'pred_iso': top1['mapped'],
                    'confidence': top1['confidence'],
                    'correct': gt_iso == top1['mapped'] if gt_iso else None
                })

            except Exception as e:
                print(f"   ❌ VoxLingua107 error: {e}")
                file_analysis['voxlingua'] = {'available': False, 'error': str(e)}

        # ==========================================
        # XLS-R Independent Analysis
        # ==========================================
        if xlsr_lid_model is not None:
            try:
                print("   🔬 XLS-R Analysis:")
                out = xlsr_lid_model.classify_file(audio_path)
                out_prob, score, index, text_lab = out

                # Get real confidence scores
                probs = torch.softmax(out_prob.squeeze(), dim=0)
                top5_indices = torch.topk(probs, min(5, len(probs))).indices

                xlsr_predictions = []
                for rank, idx in enumerate(top5_indices, 1):
                    lang_label = text_lab[idx.item()]
                    confidence = probs[idx.item()].item()

                    lang_code = str(lang_label).strip().lower()
                    mapped_lang = map_to_dataset_language(lang_code)

                    xlsr_predictions.append({
                        'rank': rank,
                        'original': lang_code,
                        'mapped': mapped_lang,
                        'confidence': confidence,
                        'in_dataset': mapped_lang in ALL_SUPPORTED_LANGS
                    })

                    status = "✅" if mapped_lang in ALL_SUPPORTED_LANGS else "❌"
                    print(f"      #{rank}: {lang_code} → {mapped_lang} ({confidence:.4f}) {status}")

                # Store XLS-R results
                top1 = xlsr_predictions[0]
                file_analysis['xlsr'] = {
                    'available': True,
                    'top5_predictions': xlsr_predictions,
                    'top1_prediction': top1['mapped'],
                    'top1_confidence': top1['confidence'],
                    'correct_top1': gt_iso == top1['mapped'] if gt_iso else None,
                    'correct_in_top5': any(p['mapped'] == gt_iso for p in xlsr_predictions) if gt_iso else None
                }

                results['xlsr_detailed'].append({
                    'file': os.path.basename(audio_path),
                    'gt_iso': gt_iso or '',
                    'pred_iso': top1['mapped'],
                    'confidence': top1['confidence'],
                    'correct': gt_iso == top1['mapped'] if gt_iso else None
                })

            except Exception as e:
                print(f"   ❌ XLS-R error: {e}")
                file_analysis['xlsr'] = {'available': False, 'error': str(e)}

        results['comparison_data'].append(file_analysis)
        print("   ✅ Analysis complete\n")

    return results

# Run the independent analysis
if downloaded_files:
    print("🔬 Running independent model analysis...")
    analysis_results = analyze_downloaded_files_independently(downloaded_files)
else:
    print("❌ No files found for analysis")
    analysis_results = None

+ # ==============================================================================
1132
+ # FIXED: Robust VoxLingua107 Analysis with Better Error Handling
1133
+ # ==============================================================================
1134
+
1135
+ def parse_voxlingua_output_robust(out):
1136
+ """Robust parsing of VoxLingua107 output with multiple fallback methods"""
1137
+
1138
+ try:
1139
+ # Method 1: Standard SpeechBrain output format
1140
+ if isinstance(out, (tuple, list)) and len(out) >= 4:
1141
+ logits, log_conf, pred_idx, labels = out[:4]
1142
+
1143
+ # Validate components
1144
+ if hasattr(logits, 'squeeze') and hasattr(labels, '__getitem__'):
1145
+ return logits, log_conf, pred_idx, labels, "standard"
1146
+
1147
+ # Method 2: Alternative format (sometimes returns dict)
1148
+ if isinstance(out, dict):
1149
+ logits = out.get('predictions', out.get('logits'))
1150
+ labels = out.get('labels', out.get('text_lab'))
1151
+ log_conf = out.get('log_probabilities', out.get('log_conf'))
1152
+ pred_idx = out.get('predicted_ids', out.get('pred_idx'))
1153
+
1154
+ if all(v is not None for v in [logits, labels]):
1155
+ return logits, log_conf, pred_idx, labels, "dict"
1156
+
1157
+ # Method 3: Direct tensor output
1158
+ if hasattr(out, 'squeeze'): # Direct logits tensor
1159
+ logits = out
1160
+ # Create dummy labels based on logits size
1161
+ labels = [f"lang_{i}" for i in range(logits.shape[-1])]
1162
+ log_conf = torch.log_softmax(logits, dim=-1).max()
1163
+ pred_idx = torch.argmax(logits, dim=-1)
1164
+
1165
+ return logits, log_conf, pred_idx, labels, "tensor"
1166
+
1167
+ except Exception as e:
1168
+ print(f" Parse error: {e}")
1169
+
1170
+ return None, None, None, None, "failed"
1171
+
1172
+ def analyze_voxlingua_robust(audio_path):
1173
+ """Robust VoxLingua107 analysis with multiple parsing methods"""
1174
+
1175
+ if voxlingua_model is None:
1176
+ return None
1177
+
1178
+ try:
1179
+ # Get raw output from model
1180
+ raw_out = voxlingua_model.classify_file(audio_path)
1181
+
1182
+ # Parse with robust method
1183
+ logits, log_conf, pred_idx, labels, parse_method = parse_voxlingua_output_robust(raw_out)
1184
+
1185
+ if logits is None:
1186
+ print(f" ❌ Could not parse VoxLingua output format")
1187
+ return None
1188
+
1189
+ print(f" πŸ“Š Parse method: {parse_method}")
1190
+
1191
+ # Get predictions based on available data
1192
+ if hasattr(logits, 'squeeze'):
1193
+ probs = torch.softmax(logits.squeeze(), dim=-1 if len(logits.squeeze().shape) > 0 else 0)
1194
+
1195
+ # Handle different tensor shapes
1196
+ if len(probs.shape) == 0: # Scalar
1197
+ top_indices = torch.tensor([0])
1198
+ top_probs = probs.unsqueeze(0)
1199
+ else: # Vector
1200
+ k = min(5, len(probs))
1201
+ top_probs, top_indices = torch.topk(probs, k)
1202
+ else:
1203
+ print(f" ❌ Logits not in expected tensor format")
1204
+ return None
1205
+
1206
+ predictions = []
1207
+ for rank, (idx, prob) in enumerate(zip(top_indices, top_probs), 1):
1208
+ idx_val = idx.item() if hasattr(idx, 'item') else int(idx)
1209
+ prob_val = prob.item() if hasattr(prob, 'item') else float(prob)
1210
+
1211
+ # Get language label safely
1212
+ if idx_val < len(labels):
1213
+ lang_label = labels[idx_val]
1214
+ else:
1215
+ lang_label = f"unknown_{idx_val}"
1216
+
1217
+ # Parse language code
1218
+ if isinstance(lang_label, str):
1219
+ colon_pos = lang_label.find(":")
1220
+ lang_code = lang_label[:colon_pos].strip() if colon_pos != -1 else lang_label.strip()
1221
+ else:
1222
+ lang_code = str(lang_label)
1223
+
1224
+ # Map to dataset language
1225
+ mapped_lang = map_to_dataset_language(lang_code)
1226
+
1227
+ predictions.append({
1228
+ 'rank': rank,
1229
+ 'original': lang_code,
1230
+ 'mapped': mapped_lang,
1231
+ 'confidence': prob_val,
1232
+ 'in_dataset': mapped_lang in ALL_SUPPORTED_LANGS
1233
+ })
1234
+
1235
+ status = "βœ…" if mapped_lang in ALL_SUPPORTED_LANGS else "❌"
1236
+ print(f" #{rank}: {lang_code} β†’ {mapped_lang} ({prob_val:.4f}) {status}")
1237
+
1238
+ return predictions
1239
+
1240
+ except Exception as e:
1241
+ print(f" ❌ VoxLingua analysis error: {e}")
1242
+ print(f" ❌ Error type: {type(e).__name__}")
1243
+ return None
1244
+
1245
+ def analyze_xlsr_robust(audio_path):
1246
+ """Robust XLS-R analysis"""
1247
+
1248
+ if xlsr_lid_model is None:
1249
+ return None
1250
+
1251
+ try:
1252
+ raw_out = xlsr_lid_model.classify_file(audio_path)
1253
+
1254
+ # Handle different XLS-R output formats
1255
+ if isinstance(raw_out, (tuple, list)) and len(raw_out) >= 4:
1256
+ out_prob, score, index, text_lab = raw_out[:4]
1257
+ else:
1258
+ print(f" ❌ XLS-R output format not recognized")
1259
+ return None
1260
+
1261
+ # Get top predictions
1262
+ if hasattr(out_prob, 'squeeze'):
1263
+ probs = torch.softmax(out_prob.squeeze(), dim=-1 if len(out_prob.squeeze().shape) > 0 else 0)
1264
+
1265
+ if len(probs.shape) == 0: # Scalar
1266
+ top_indices = torch.tensor([0])
1267
+ top_probs = probs.unsqueeze(0)
1268
+ else: # Vector
1269
+ k = min(5, len(probs))
1270
+ top_probs, top_indices = torch.topk(probs, k)
1271
+ else:
1272
+ print(f" ❌ XLS-R probabilities not in expected format")
1273
+ return None
1274
+
1275
+ predictions = []
1276
+ for rank, (idx, prob) in enumerate(zip(top_indices, top_probs), 1):
1277
+ idx_val = idx.item() if hasattr(idx, 'item') else int(idx)
1278
+ prob_val = prob.item() if hasattr(prob, 'item') else float(prob)
1279
+
1280
+ # Get language label
1281
+ if idx_val < len(text_lab):
1282
+ lang_label = text_lab[idx_val]
1283
+ else:
1284
+ lang_label = f"unknown_{idx_val}"
1285
+
1286
+ lang_code = str(lang_label).strip().lower()
1287
+ mapped_lang = map_to_dataset_language(lang_code)
1288
+
1289
+ predictions.append({
1290
+ 'rank': rank,
1291
+ 'original': lang_code,
1292
+ 'mapped': mapped_lang,
1293
+ 'confidence': prob_val,
1294
+ 'in_dataset': mapped_lang in ALL_SUPPORTED_LANGS
1295
+ })
1296
+
1297
+ status = "βœ…" if mapped_lang in ALL_SUPPORTED_LANGS else "❌"
1298
+ print(f" #{rank}: {lang_code} β†’ {mapped_lang} ({prob_val:.4f}) {status}")
1299
+
1300
+ return predictions
1301
+
1302
+ except Exception as e:
1303
+ print(f" ❌ XLS-R analysis error: {e}")
1304
+ return None
1305
+
1306
+ # ==============================================================================
1307
+ # UPDATED: Robust Analysis Function
1308
+ # ==============================================================================
1309
+
1310
+ def analyze_downloaded_files_robust(audio_files):
+     """Robust analysis with better error handling"""
+
+     if not audio_files:
+         print("❌ No audio files to analyze")
+         return None
+
+     print(f"🚀 Starting ROBUST analysis on {len(audio_files)} files...")
+     print("=" * 70)
+
+     results = {
+         'voxlingua_detailed': [],
+         'xlsr_detailed': [],
+         'comparison_data': []
+     }
+
+     for i, audio_path in enumerate(audio_files, 1):
+         print(f"\n[{i}/{len(audio_files)}] 🎵 {os.path.basename(audio_path)}")
+
+         # Extract ground truth
+         gt_iso = gt_from_filename(audio_path)
+         print(f" 📝 Ground Truth: {gt_iso or 'Unknown'}")
+
+         file_analysis = {
+             'file': os.path.basename(audio_path),
+             'full_path': audio_path,
+             'gt_iso': gt_iso or '',
+             'voxlingua': {'available': False},
+             'xlsr': {'available': False}
+         }
+
+         # VoxLingua107 Analysis
+         print(f" 🔬 VoxLingua107 Analysis:")
+         vox_predictions = analyze_voxlingua_robust(audio_path)
+
+         if vox_predictions:
+             top1 = vox_predictions[0]
+             file_analysis['voxlingua'] = {
+                 'available': True,
+                 'top5_predictions': vox_predictions,
+                 'top1_prediction': top1['mapped'],
+                 'top1_confidence': top1['confidence'],
+                 'correct_top1': gt_iso == top1['mapped'] if gt_iso else None,
+                 'correct_in_top5': any(p['mapped'] == gt_iso for p in vox_predictions) if gt_iso else None
+             }
+
+             results['voxlingua_detailed'].append({
+                 'file': os.path.basename(audio_path),
+                 'gt_iso': gt_iso or '',
+                 'pred_iso': top1['mapped'],
+                 'confidence': top1['confidence'],
+                 'correct': gt_iso == top1['mapped'] if gt_iso else None
+             })
+         else:
+             file_analysis['voxlingua'] = {'available': False, 'error': 'Analysis failed'}
+
+         # XLS-R Analysis
+         print(f" 🔬 XLS-R Analysis:")
+         xlsr_predictions = analyze_xlsr_robust(audio_path)
+
+         if xlsr_predictions:
+             top1 = xlsr_predictions[0]
+             file_analysis['xlsr'] = {
+                 'available': True,
+                 'top5_predictions': xlsr_predictions,
+                 'top1_prediction': top1['mapped'],
+                 'top1_confidence': top1['confidence'],
+                 'correct_top1': gt_iso == top1['mapped'] if gt_iso else None,
+                 'correct_in_top5': any(p['mapped'] == gt_iso for p in xlsr_predictions) if gt_iso else None
+             }
+
+             results['xlsr_detailed'].append({
+                 'file': os.path.basename(audio_path),
+                 'gt_iso': gt_iso or '',
+                 'pred_iso': top1['mapped'],
+                 'confidence': top1['confidence'],
+                 'correct': gt_iso == top1['mapped'] if gt_iso else None
+             })
+         else:
+             file_analysis['xlsr'] = {'available': False, 'error': 'Analysis failed'}
+
+         results['comparison_data'].append(file_analysis)
+         print(f" ✅ Analysis complete")
+
+     return results
+
+ # Run the robust analysis
+ if 'downloaded_files' in globals() and downloaded_files:
+     print("🔬 Running ROBUST independent model analysis...")
+     robust_analysis_results = analyze_downloaded_files_robust(downloaded_files)
+
+     # Generate the report; generate_detailed_performance_report is defined in
+     # the fix section further below, so guard against running this cell first
+     if robust_analysis_results and 'generate_detailed_performance_report' in globals():
+         generate_detailed_performance_report(robust_analysis_results)
+         print(f"\n✅ ROBUST ANALYSIS COMPLETE!")
+     else:
+         print("❌ Robust analysis failed or the report function is not defined yet")
+ else:
+     print("❌ No downloaded files found. Please run the file scanning code first.")
+
+
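+ # Illustration only: gt_from_filename is defined earlier in app.py; it is
+ # assumed here to pull an ISO language prefix out of names such as
+ # "hi_sample_001.wav" -> "hi". The helper below is a hypothetical stand-in,
+ # not the real implementation.
+ import os
+ import re
+ def _demo_gt_from_filename(path):
+     m = re.match(r'([a-z]{2,3})[_\-]', os.path.basename(path).lower())
+     return m.group(1) if m else None
+ print(_demo_gt_from_filename("hi_sample_001.wav"))  # -> hi
+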
+ # ==============================================================================
+ # COMPLETE FIX: VoxLingua Label Mapping + Missing Function
+ # ==============================================================================
+
+ # First, let's create a proper VoxLingua language mapping.
+ # Note: this table is partial; any index it does not cover falls back to
+ # 'unknown_<idx>' in the helper below.
+ VOXLINGUA_LANGUAGE_MAP = {
+     0: 'ab', 1: 'af', 2: 'ak', 3: 'am', 4: 'ar', 5: 'as', 6: 'az', 7: 'be', 8: 'bg', 9: 'bn',
+     10: 'bo', 11: 'br', 12: 'bs', 13: 'ca', 14: 'ce', 15: 'co', 16: 'cs', 17: 'cv', 18: 'cy', 19: 'da',
+     20: 'de', 21: 'dv', 22: 'dz', 23: 'ee', 24: 'el', 25: 'en', 26: 'eo', 27: 'es', 28: 'et', 29: 'eu',
+     30: 'fa', 31: 'ff', 32: 'fi', 33: 'fo', 34: 'fr', 35: 'fy', 36: 'ga', 37: 'gd', 38: 'gl', 39: 'gn',
+     40: 'gu', 41: 'gv', 42: 'ha', 43: 'haw', 44: 'he', 45: 'hi', 46: 'hr', 47: 'ht', 48: 'hu', 49: 'hy',
+     50: 'ia', 51: 'id', 52: 'ie', 53: 'ig', 54: 'ii', 55: 'ik', 56: 'io', 57: 'is', 58: 'it', 59: 'iu',
+     60: 'ja', 61: 'jv', 62: 'ka', 63: 'kk', 64: 'kl', 65: 'km', 66: 'kn', 67: 'ko', 68: 'ks', 69: 'ku',
+     70: 'kw', 71: 'ky', 72: 'la', 73: 'lb', 74: 'lg', 75: 'li', 76: 'ln', 77: 'lo', 78: 'lt', 79: 'lv',
+     80: 'mg', 81: 'mi', 82: 'mk', 83: 'ml', 84: 'mn', 85: 'mr', 86: 'ms', 87: 'mt', 88: 'my', 89: 'na',
+     90: 'nb', 91: 'nd', 92: 'ne', 93: 'ng', 94: 'nl', 95: 'nn', 96: 'no', 97: 'nv', 98: 'ny', 99: 'oc',
+     100: 'of', 101: 'om', 102: 'or', 103: 'os', 104: 'pa', 105: 'pi', 106: 'pl', 107: 'ps'
+ }
+
+ def get_voxlingua_language_by_index(idx):
+     """Map a VoxLingua class index to a language code (partial map with fallback)"""
+     return VOXLINGUA_LANGUAGE_MAP.get(idx, f'unknown_{idx}')
+
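+ # Quick sanity check of the mapping and its fallback (indices are illustrative):
+ assert get_voxlingua_language_by_index(25) == 'en'            # mapped index
+ assert get_voxlingua_language_by_index(999) == 'unknown_999'  # out-of-range fallback
+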
+ def analyze_voxlingua_fixed(audio_path):
+     """Fixed VoxLingua107 analysis with proper language mapping"""
+
+     if voxlingua_model is None:
+         return None
+
+     try:
+         raw_out = voxlingua_model.classify_file(audio_path)
+
+         if not isinstance(raw_out, (tuple, list)) or len(raw_out) < 4:
+             print(f" ❌ Unexpected VoxLingua output format")
+             return None
+
+         logits, log_conf, pred_idx, labels = raw_out[:4]
+
+         # Get probabilities and the top 5 predictions
+         probs = torch.softmax(logits.squeeze(), dim=-1)
+         k = min(5, len(probs))
+         top_probs, top_indices = torch.topk(probs, k)
+
+         predictions = []
+         for rank, (idx, prob) in enumerate(zip(top_indices, top_probs), 1):
+             idx_val = idx.item() if hasattr(idx, 'item') else int(idx)
+             prob_val = prob.item() if hasattr(prob, 'item') else float(prob)
+
+             # Method 1: try to use the labels provided by the model
+             if idx_val < len(labels) and not str(labels[idx_val]).startswith('unknown'):
+                 lang_label = labels[idx_val]
+                 if isinstance(lang_label, str):
+                     colon_pos = lang_label.find(":")
+                     lang_code = lang_label[:colon_pos].strip() if colon_pos != -1 else lang_label.strip()
+                 else:
+                     lang_code = str(lang_label)
+             else:
+                 # Method 2: fall back to our index-based language mapping
+                 lang_code = get_voxlingua_language_by_index(idx_val)
+
+             # Map to a dataset language
+             mapped_lang = map_to_dataset_language(lang_code)
+
+             predictions.append({
+                 'rank': rank,
+                 'original': lang_code,
+                 'mapped': mapped_lang,
+                 'confidence': prob_val,
+                 'in_dataset': mapped_lang in ALL_SUPPORTED_LANGS,
+                 'index': idx_val
+             })
+
+             status = "✅" if mapped_lang in ALL_SUPPORTED_LANGS else "❌"
+             print(f" #{rank}: {lang_code} → {mapped_lang} ({prob_val:.4f}) {status} [idx:{idx_val}]")
+
+         return predictions
+
+     except Exception as e:
+         print(f" ❌ VoxLingua analysis error: {e}")
+         return None
+
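+ # The colon-split rule above assumes VoxLingua labels of the form "en: English";
+ # a minimal check of that parsing step (label value is illustrative):
+ _demo_label = "en: English"
+ _pos = _demo_label.find(":")
+ print(_demo_label[:_pos].strip() if _pos != -1 else _demo_label.strip())  # -> en
+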
+ def analyze_xlsr_fixed(audio_path):
+     """Fixed XLS-R analysis"""
+
+     if xlsr_lid_model is None:
+         print(f" ❌ XLS-R model not loaded")
+         return None
+
+     try:
+         raw_out = xlsr_lid_model.classify_file(audio_path)
+
+         if not isinstance(raw_out, (tuple, list)) or len(raw_out) < 4:
+             print(f" ❌ Unexpected XLS-R output format")
+             return None
+
+         out_prob, score, index, text_lab = raw_out[:4]
+
+         # Get probabilities and the top 5 predictions
+         probs = torch.softmax(out_prob.squeeze(), dim=-1)
+         k = min(5, len(probs))
+         top_probs, top_indices = torch.topk(probs, k)
+
+         predictions = []
+         for rank, (idx, prob) in enumerate(zip(top_indices, top_probs), 1):
+             idx_val = idx.item() if hasattr(idx, 'item') else int(idx)
+             prob_val = prob.item() if hasattr(prob, 'item') else float(prob)
+
+             # Get the language label
+             if idx_val < len(text_lab):
+                 lang_label = text_lab[idx_val]
+                 lang_code = str(lang_label).strip().lower()
+             else:
+                 lang_code = f"xlsr_unknown_{idx_val}"
+
+             mapped_lang = map_to_dataset_language(lang_code)
+
+             predictions.append({
+                 'rank': rank,
+                 'original': lang_code,
+                 'mapped': mapped_lang,
+                 'confidence': prob_val,
+                 'in_dataset': mapped_lang in ALL_SUPPORTED_LANGS
+             })
+
+             status = "✅" if mapped_lang in ALL_SUPPORTED_LANGS else "❌"
+             print(f" #{rank}: {lang_code} → {mapped_lang} ({prob_val:.4f}) {status}")
+
+         return predictions
+
+     except Exception as e:
+         print(f" ❌ XLS-R analysis error: {e}")
+         return None
+
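+ # Minimal sketch of the shared softmax/top-k step used by both analyzers,
+ # on dummy logits (torch is assumed to be imported earlier in app.py):
+ _demo_logits = torch.tensor([2.0, 0.5, 1.0, -1.0])
+ _demo_probs = torch.softmax(_demo_logits, dim=-1)
+ _top_p, _top_i = torch.topk(_demo_probs, k=min(5, len(_demo_probs)))
+ print(_top_i.tolist(), [round(p, 3) for p in _top_p.tolist()])
+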
+ def generate_detailed_performance_report(results):
+     """Complete performance analysis report function"""
+
+     if not results:
+         print("❌ No results to analyze")
+         return
+
+     print("\n📊 DETAILED INDEPENDENT MODEL PERFORMANCE REPORT")
+     print("=" * 70)
+
+     # Keep the accuracies as None until each model has valid results, so the
+     # comparison block below can never hit undefined names
+     vox_acc = xlsr_acc = None
+     vox_df = xlsr_df = None
+
+     # VoxLingua107 Performance Analysis
+     if results['voxlingua_detailed']:
+         vox_df = pd.DataFrame(results['voxlingua_detailed'])
+         valid_vox = vox_df[vox_df['gt_iso'] != ''].copy()
+
+         print(f"\n🔬 VOXLINGUA107 PERFORMANCE:")
+         print("-" * 40)
+
+         if len(valid_vox) > 0:
+             vox_acc = (valid_vox['correct'] == True).mean()
+             vox_conf_mean = valid_vox['confidence'].mean()
+             vox_conf_std = valid_vox['confidence'].std()
+
+             print(f"Files Analyzed: {len(valid_vox)}")
+             print(f"Top-1 Accuracy: {vox_acc:.4f} ({vox_acc*100:.1f}%)")
+             print(f"Confidence: {vox_conf_mean:.4f} ± {vox_conf_std:.4f}")
+
+             # Per-language breakdown
+             print(f"\nPer-Language Performance:")
+             for lang in sorted(valid_vox['gt_iso'].unique()):
+                 lang_data = valid_vox[valid_vox['gt_iso'] == lang]
+                 acc = (lang_data['correct'] == True).mean()
+                 conf_mean = lang_data['confidence'].mean()
+                 count = len(lang_data)
+                 print(f" {lang:>3}: {acc:.3f} ({acc*100:5.1f}%) | Conf: {conf_mean:.3f} | n={count}")
+         else:
+             print("No valid VoxLingua results")
+
+     # XLS-R Performance Analysis
+     if results['xlsr_detailed']:
+         xlsr_df = pd.DataFrame(results['xlsr_detailed'])
+         valid_xlsr = xlsr_df[xlsr_df['gt_iso'] != ''].copy()
+
+         print(f"\n🔬 XLS-R PERFORMANCE:")
+         print("-" * 40)
+
+         if len(valid_xlsr) > 0:
+             xlsr_acc = (valid_xlsr['correct'] == True).mean()
+             xlsr_conf_mean = valid_xlsr['confidence'].mean()
+             xlsr_conf_std = valid_xlsr['confidence'].std()
+
+             print(f"Files Analyzed: {len(valid_xlsr)}")
+             print(f"Top-1 Accuracy: {xlsr_acc:.4f} ({xlsr_acc*100:.1f}%)")
+             print(f"Confidence: {xlsr_conf_mean:.4f} ± {xlsr_conf_std:.4f}")
+
+             # Per-language breakdown
+             print(f"\nPer-Language Performance:")
+             for lang in sorted(valid_xlsr['gt_iso'].unique()):
+                 lang_data = valid_xlsr[valid_xlsr['gt_iso'] == lang]
+                 acc = (lang_data['correct'] == True).mean()
+                 conf_mean = lang_data['confidence'].mean()
+                 count = len(lang_data)
+                 print(f" {lang:>3}: {acc:.3f} ({acc*100:5.1f}%) | Conf: {conf_mean:.3f} | n={count}")
+         else:
+             print("No valid XLS-R results")
+
+     # Model Comparison (only when both models produced valid accuracies)
+     if vox_acc is not None and xlsr_acc is not None:
+         print(f"\n⚖️ MODEL COMPARISON:")
+         print("-" * 30)
+
+         print(f"VoxLingua107: {vox_acc:.4f} accuracy")
+         print(f"XLS-R: {xlsr_acc:.4f} accuracy")
+
+         # Calculate accuracy-proportional ensemble weights
+         total_acc = vox_acc + xlsr_acc
+         if total_acc > 0:
+             vox_weight = vox_acc / total_acc
+             xlsr_weight = xlsr_acc / total_acc
+
+             print(f"\n💡 RECOMMENDED WEIGHTS:")
+             print(f"VoxLingua107: {vox_weight:.3f} ({vox_weight*100:.1f}%)")
+             print(f"XLS-R: {xlsr_weight:.3f} ({xlsr_weight*100:.1f}%)")
+
+         # Calculate agreement
+         vox_preds = set(vox_df['pred_iso'].tolist())
+         xlsr_preds = set(xlsr_df['pred_iso'].tolist())
+         common_preds = vox_preds.intersection(xlsr_preds)
+
+         print(f"\nModel Agreement Analysis:")
+         print(f"Common predictions: {len(common_preds)}")
+         print(f"VoxLingua unique: {len(vox_preds - xlsr_preds)}")
+         print(f"XLS-R unique: {len(xlsr_preds - vox_preds)}")
+
+     # Save results
+     timestamp = pd.Timestamp.now().strftime("%Y%m%d_%H%M%S")
+
+     if results['voxlingua_detailed']:
+         vox_csv = f"voxlingua_fixed_results_{timestamp}.csv"
+         pd.DataFrame(results['voxlingua_detailed']).to_csv(vox_csv, index=False)
+         print(f"\n💾 VoxLingua results: {vox_csv}")
+
+     if results['xlsr_detailed']:
+         xlsr_csv = f"xlsr_fixed_results_{timestamp}.csv"
+         pd.DataFrame(results['xlsr_detailed']).to_csv(xlsr_csv, index=False)
+         print(f"💾 XLS-R results: {xlsr_csv}")
+
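+ # Worked example of the accuracy-proportional weighting above
+ # (the accuracies here are hypothetical):
+ _vox_acc, _xlsr_acc = 0.80, 0.60
+ _total = _vox_acc + _xlsr_acc
+ print(f"vox_weight={_vox_acc / _total:.3f}, xlsr_weight={_xlsr_acc / _total:.3f}")
+ # -> vox_weight=0.571, xlsr_weight=0.429
+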
+ def run_complete_fixed_analysis(audio_files):
+     """Run complete analysis with all fixes"""
+
+     if not audio_files:
+         print("❌ No audio files to analyze")
+         return None
+
+     print(f"🚀 Starting COMPLETE FIXED analysis on {len(audio_files)} files...")
+     print("=" * 70)
+
+     results = {
+         'voxlingua_detailed': [],
+         'xlsr_detailed': [],
+         'comparison_data': []
+     }
+
+     for i, audio_path in enumerate(audio_files, 1):
+         print(f"\n[{i}/{len(audio_files)}] 🎵 {os.path.basename(audio_path)}")
+
+         # Extract ground truth
+         gt_iso = gt_from_filename(audio_path)
+         print(f" 📝 Ground Truth: {gt_iso or 'Unknown'}")
+
+         file_analysis = {
+             'file': os.path.basename(audio_path),
+             'full_path': audio_path,
+             'gt_iso': gt_iso or '',
+             'voxlingua': {'available': False},
+             'xlsr': {'available': False}
+         }
+
+         # VoxLingua107 Analysis
+         print(f" 🔬 VoxLingua107 Analysis:")
+         vox_predictions = analyze_voxlingua_fixed(audio_path)
+
+         if vox_predictions:
+             top1 = vox_predictions[0]
+             file_analysis['voxlingua'] = {
+                 'available': True,
+                 'top5_predictions': vox_predictions,
+                 'top1_prediction': top1['mapped'],
+                 'top1_confidence': top1['confidence'],
+                 'correct_top1': gt_iso == top1['mapped'] if gt_iso else None,
+             }
+
+             results['voxlingua_detailed'].append({
+                 'file': os.path.basename(audio_path),
+                 'gt_iso': gt_iso or '',
+                 'pred_iso': top1['mapped'],
+                 'confidence': top1['confidence'],
+                 'correct': gt_iso == top1['mapped'] if gt_iso else None
+             })
+
+         # XLS-R Analysis
+         print(f" 🔬 XLS-R Analysis:")
+         xlsr_predictions = analyze_xlsr_fixed(audio_path)
+
+         if xlsr_predictions:
+             top1 = xlsr_predictions[0]
+             file_analysis['xlsr'] = {
+                 'available': True,
+                 'top5_predictions': xlsr_predictions,
+                 'top1_prediction': top1['mapped'],
+                 'top1_confidence': top1['confidence'],
+                 'correct_top1': gt_iso == top1['mapped'] if gt_iso else None,
+             }
+
+             results['xlsr_detailed'].append({
+                 'file': os.path.basename(audio_path),
+                 'gt_iso': gt_iso or '',
+                 'pred_iso': top1['mapped'],
+                 'confidence': top1['confidence'],
+                 'correct': gt_iso == top1['mapped'] if gt_iso else None
+             })
+
+         results['comparison_data'].append(file_analysis)
+         print(f" ✅ Analysis complete")
+
+     return results
+
+ # Run the complete fixed analysis
+ if 'downloaded_files' in globals() and downloaded_files:
+     print("🔬 Running COMPLETE FIXED analysis...")
+     final_analysis_results = run_complete_fixed_analysis(downloaded_files)
+
+     if final_analysis_results:
+         generate_detailed_performance_report(final_analysis_results)
+         print(f"\n✅ COMPLETE FIXED ANALYSIS DONE!")
+     else:
+         print("❌ Analysis failed")
+ else:
+     print("❌ No downloaded files found")
+
+
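+ # Sketch (not wired into the pipeline above) of how the recommended weights
+ # could fuse the two models' per-language confidences; the weights and score
+ # dicts below are hypothetical:
+ def _demo_fuse(vox_scores, xlsr_scores, w_vox=0.57, w_xlsr=0.43):
+     langs = set(vox_scores) | set(xlsr_scores)
+     fused = {l: w_vox * vox_scores.get(l, 0.0) + w_xlsr * xlsr_scores.get(l, 0.0)
+              for l in langs}
+     return max(fused, key=fused.get), fused
+ print(_demo_fuse({'hi': 0.6, 'ur': 0.3}, {'hi': 0.4, 'ur': 0.5})[0])  # -> hi
+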
+ # ==============================================================================
+ # COMPREHENSIVE EXCEL ANALYSIS WITH ALL DETAILS
+ # ==============================================================================
+
+ import pandas as pd
+ import numpy as np
+ from datetime import datetime
+ import os
+
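+ # pd.ExcelWriter(engine='openpyxl') below needs openpyxl, which is not listed
+ # in requirements.txt; a guarded import makes that dependency explicit:
+ try:
+     import openpyxl  # noqa: F401
+ except ImportError:
+     print("⚠️ openpyxl is not installed; run `pip install openpyxl` before the Excel export")
+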
+ def create_comprehensive_excel_analysis(results, output_filename=None):
+     """Create comprehensive Excel analysis with multiple sheets and detailed metrics"""
+
+     if not results:
+         print("❌ No results to analyze")
+         return None
+
+     # Generate a filename if none was provided
+     if not output_filename:
+         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+         output_filename = f"Language_Detection_Comprehensive_Analysis_{timestamp}.xlsx"
+
+     print(f"📊 Creating comprehensive Excel analysis: {output_filename}")
+
+     # Language family helper, defined once up front so every sheet below can use it
+     def get_language_family(lang):
+         if lang in INDO_ARYAN_LANGS:
+             return 'Indo-Aryan'
+         elif lang in DRAVIDIAN_LANGS:
+             return 'Dravidian'
+         elif lang in LOW_RESOURCE_LANGS:
+             return 'Low-Resource'
+         else:
+             return 'Other'
+
+     # Track overall accuracies for the ensemble-weight recommendation
+     vox_acc = xlsr_acc = None
+
+     # Create the Excel writer
+     with pd.ExcelWriter(output_filename, engine='openpyxl') as writer:
+
+         # ========================================
+         # SHEET 1: EXECUTIVE SUMMARY
+         # ========================================
+         print(" 📋 Creating Executive Summary...")
+
+         summary_data = []
+
+         # Overall statistics
+         total_files = len(results['comparison_data'])
+         vox_available = sum(1 for item in results['comparison_data'] if item['voxlingua']['available'])
+         xlsr_available = sum(1 for item in results['comparison_data'] if item['xlsr']['available'])
+
+         summary_data.extend([
+             ['EXECUTIVE SUMMARY', ''],
+             ['Analysis Date', datetime.now().strftime("%Y-%m-%d %H:%M:%S")],
+             ['Total Files Analyzed', total_files],
+             ['VoxLingua107 Available', f"{vox_available} ({vox_available/total_files*100:.1f}%)"],
+             ['XLS-R Available', f"{xlsr_available} ({xlsr_available/total_files*100:.1f}%)"],
+             ['', ''],
+         ])
+
+         # Model performance summary
+         if results['voxlingua_detailed']:
+             vox_df = pd.DataFrame(results['voxlingua_detailed'])
+             valid_vox = vox_df[vox_df['gt_iso'] != ''].copy()
+             if len(valid_vox) > 0:
+                 vox_acc = (valid_vox['correct'] == True).mean()
+                 vox_conf = valid_vox['confidence'].mean()
+                 summary_data.extend([
+                     ['VOXLINGUA107 PERFORMANCE', ''],
+                     ['Accuracy', f"{vox_acc:.4f} ({vox_acc*100:.1f}%)"],
+                     ['Average Confidence', f"{vox_conf:.4f}"],
+                     ['Files with Valid GT', len(valid_vox)],
+                     ['', ''],
+                 ])
+
+         if results['xlsr_detailed']:
+             xlsr_df = pd.DataFrame(results['xlsr_detailed'])
+             valid_xlsr = xlsr_df[xlsr_df['gt_iso'] != ''].copy()
+             if len(valid_xlsr) > 0:
+                 xlsr_acc = (valid_xlsr['correct'] == True).mean()
+                 xlsr_conf = valid_xlsr['confidence'].mean()
+                 summary_data.extend([
+                     ['XLS-R PERFORMANCE', ''],
+                     ['Accuracy', f"{xlsr_acc:.4f} ({xlsr_acc*100:.1f}%)"],
+                     ['Average Confidence', f"{xlsr_conf:.4f}"],
+                     ['Files with Valid GT', len(valid_xlsr)],
+                     ['', ''],
+                 ])
+
+         # Optimal weights calculation (only when both accuracies are available)
+         if vox_acc is not None and xlsr_acc is not None:
+             total_acc = vox_acc + xlsr_acc
+             if total_acc > 0:
+                 vox_weight = vox_acc / total_acc
+                 xlsr_weight = xlsr_acc / total_acc
+                 summary_data.extend([
+                     ['RECOMMENDED ENSEMBLE WEIGHTS', ''],
+                     ['VoxLingua107 Weight', f"{vox_weight:.3f} ({vox_weight*100:.1f}%)"],
+                     ['XLS-R Weight', f"{xlsr_weight:.3f} ({xlsr_weight*100:.1f}%)"],
+                 ])
+
+         # Create the summary dataframe
+         summary_df = pd.DataFrame(summary_data, columns=['Metric', 'Value'])
+         summary_df.to_excel(writer, sheet_name='Executive_Summary', index=False)
+
+         # ========================================
+         # SHEET 2: VOXLINGUA107 DETAILED RESULTS
+         # ========================================
+         if results['voxlingua_detailed']:
+             print(" 📋 Creating VoxLingua107 detailed results...")
+
+             vox_detailed_df = pd.DataFrame(results['voxlingua_detailed'])
+
+             # Add analysis columns ('correct' may be None when ground truth is
+             # missing, so fill before casting to int)
+             vox_detailed_df['accuracy_score'] = vox_detailed_df['correct'].fillna(False).astype(bool).astype(int)
+             vox_detailed_df['confidence_category'] = pd.cut(
+                 vox_detailed_df['confidence'],
+                 bins=[0, 0.3, 0.6, 0.8, 1.0],
+                 labels=['Low', 'Medium', 'High', 'Very High']
+             )
+
+             # Add language family information (helper defined above)
+             vox_detailed_df['gt_language_family'] = vox_detailed_df['gt_iso'].apply(get_language_family)
+             vox_detailed_df['pred_language_family'] = vox_detailed_df['pred_iso'].apply(get_language_family)
+
+             vox_detailed_df.to_excel(writer, sheet_name='VoxLingua107_Results', index=False)
+
+         # ========================================
+         # SHEET 3: XLS-R DETAILED RESULTS
+         # ========================================
+         if results['xlsr_detailed']:
+             print(" 📋 Creating XLS-R detailed results...")
+
+             xlsr_detailed_df = pd.DataFrame(results['xlsr_detailed'])
+
+             # Add analysis columns
+             xlsr_detailed_df['accuracy_score'] = xlsr_detailed_df['correct'].fillna(False).astype(bool).astype(int)
+             xlsr_detailed_df['confidence_category'] = pd.cut(
+                 xlsr_detailed_df['confidence'],
+                 bins=[0, 0.3, 0.6, 0.8, 1.0],
+                 labels=['Low', 'Medium', 'High', 'Very High']
+             )
+             xlsr_detailed_df['gt_language_family'] = xlsr_detailed_df['gt_iso'].apply(get_language_family)
+             xlsr_detailed_df['pred_language_family'] = xlsr_detailed_df['pred_iso'].apply(get_language_family)
+
+             xlsr_detailed_df.to_excel(writer, sheet_name='XLSR_Results', index=False)
+
+         # ========================================
+         # SHEET 4: PER-LANGUAGE ACCURACY ANALYSIS
+         # ========================================
+         print(" 📋 Creating per-language accuracy analysis...")
+
+         lang_analysis_data = []
+
+         # Get all unique languages from the ground truth
+         all_gt_langs = set()
+         if results['voxlingua_detailed']:
+             all_gt_langs.update([r['gt_iso'] for r in results['voxlingua_detailed'] if r['gt_iso']])
+         if results['xlsr_detailed']:
+             all_gt_langs.update([r['gt_iso'] for r in results['xlsr_detailed'] if r['gt_iso']])
+
+         # Language name mapping
+         LANG_NAMES = {
+             'ur': 'Urdu', 'pa': 'Punjabi', 'ta': 'Tamil', 'sd': 'Sindhi', 'or': 'Odia',
+             'ml': 'Malayalam', 'ne': 'Nepali', 'as': 'Assamese', 'hi': 'Hindi', 'bn': 'Bengali',
+             'kok': 'Konkani', 'kn': 'Kannada', 'ks': 'Kashmiri', 'mr': 'Marathi', 'te': 'Telugu',
+             'br': 'Bodo', 'doi': 'Dogri', 'sat': 'Santali', 'gu': 'Gujarati', 'mni': 'Manipuri',
+             'sa': 'Sanskrit'
+         }
+
+         for lang in sorted(all_gt_langs):
+             lang_name = LANG_NAMES.get(lang, lang.title())
+             lang_family = get_language_family(lang)
+
+             # VoxLingua stats for this language
+             vox_stats = {'files': 0, 'correct': 0, 'accuracy': 0, 'avg_confidence': 0}
+             if results['voxlingua_detailed']:
+                 vox_lang_data = [r for r in results['voxlingua_detailed'] if r['gt_iso'] == lang]
+                 if vox_lang_data:
+                     vox_stats['files'] = len(vox_lang_data)
+                     vox_stats['correct'] = sum(1 for r in vox_lang_data if r['correct'])
+                     vox_stats['accuracy'] = vox_stats['correct'] / vox_stats['files']
+                     vox_stats['avg_confidence'] = np.mean([r['confidence'] for r in vox_lang_data])
+
+             # XLS-R stats for this language
+             xlsr_stats = {'files': 0, 'correct': 0, 'accuracy': 0, 'avg_confidence': 0}
+             if results['xlsr_detailed']:
+                 xlsr_lang_data = [r for r in results['xlsr_detailed'] if r['gt_iso'] == lang]
+                 if xlsr_lang_data:
+                     xlsr_stats['files'] = len(xlsr_lang_data)
+                     xlsr_stats['correct'] = sum(1 for r in xlsr_lang_data if r['correct'])
+                     xlsr_stats['accuracy'] = xlsr_stats['correct'] / xlsr_stats['files']
+                     xlsr_stats['avg_confidence'] = np.mean([r['confidence'] for r in xlsr_lang_data])
+
+             lang_analysis_data.append({
+                 'Language_Code': lang,
+                 'Language_Name': lang_name,
+                 'Language_Family': lang_family,
+                 'VoxLingua_Files': vox_stats['files'],
+                 'VoxLingua_Correct': vox_stats['correct'],
+                 'VoxLingua_Accuracy': f"{vox_stats['accuracy']:.4f}",
+                 'VoxLingua_Accuracy_Pct': f"{vox_stats['accuracy']*100:.1f}%",
+                 'VoxLingua_Avg_Confidence': f"{vox_stats['avg_confidence']:.4f}",
+                 'XLSR_Files': xlsr_stats['files'],
+                 'XLSR_Correct': xlsr_stats['correct'],
+                 'XLSR_Accuracy': f"{xlsr_stats['accuracy']:.4f}",
+                 'XLSR_Accuracy_Pct': f"{xlsr_stats['accuracy']*100:.1f}%",
+                 'XLSR_Avg_Confidence': f"{xlsr_stats['avg_confidence']:.4f}",
+                 'Better_Model': 'VoxLingua' if vox_stats['accuracy'] > xlsr_stats['accuracy'] else 'XLS-R' if xlsr_stats['accuracy'] > vox_stats['accuracy'] else 'Tie'
+             })
+
+         lang_analysis_df = pd.DataFrame(lang_analysis_data)
+         lang_analysis_df.to_excel(writer, sheet_name='Per_Language_Analysis', index=False)
+
+         # ========================================
+         # SHEET 5: CONFUSION MATRIX - VOXLINGUA
+         # ========================================
+         if results['voxlingua_detailed']:
+             print(" 📋 Creating VoxLingua confusion matrix...")
+
+             vox_df = pd.DataFrame(results['voxlingua_detailed'])
+             valid_vox = vox_df[vox_df['gt_iso'] != ''].copy()
+
+             if len(valid_vox) > 0:
+                 # Build the confusion matrix row by row
+                 confusion_data = []
+                 for gt_lang in sorted(valid_vox['gt_iso'].unique()):
+                     gt_data = valid_vox[valid_vox['gt_iso'] == gt_lang]
+                     row_data = {'Ground_Truth': gt_lang}
+
+                     for pred_lang in sorted(valid_vox['pred_iso'].unique()):
+                         count = len(gt_data[gt_data['pred_iso'] == pred_lang])
+                         row_data[f'Predicted_{pred_lang}'] = count
+
+                     confusion_data.append(row_data)
+
+                 confusion_df = pd.DataFrame(confusion_data).fillna(0)
+                 confusion_df.to_excel(writer, sheet_name='VoxLingua_Confusion_Matrix', index=False)
+
+         # ========================================
+         # SHEET 6: CONFUSION MATRIX - XLS-R
+         # ========================================
+         if results['xlsr_detailed']:
+             print(" 📋 Creating XLS-R confusion matrix...")
+
+             xlsr_df = pd.DataFrame(results['xlsr_detailed'])
+             valid_xlsr = xlsr_df[xlsr_df['gt_iso'] != ''].copy()
+
+             if len(valid_xlsr) > 0:
+                 confusion_data = []
+                 for gt_lang in sorted(valid_xlsr['gt_iso'].unique()):
+                     gt_data = valid_xlsr[valid_xlsr['gt_iso'] == gt_lang]
+                     row_data = {'Ground_Truth': gt_lang}
+
+                     for pred_lang in sorted(valid_xlsr['pred_iso'].unique()):
+                         count = len(gt_data[gt_data['pred_iso'] == pred_lang])
+                         row_data[f'Predicted_{pred_lang}'] = count
+
+                     confusion_data.append(row_data)
+
+                 confusion_df = pd.DataFrame(confusion_data).fillna(0)
+                 confusion_df.to_excel(writer, sheet_name='XLSR_Confusion_Matrix', index=False)
+
+         # ========================================
+         # SHEET 7: CONFIDENCE ANALYSIS
+         # ========================================
+         print(" 📋 Creating confidence analysis...")
+
+         confidence_analysis = []
+
+         # VoxLingua confidence analysis
+         if results['voxlingua_detailed']:
+             vox_df = pd.DataFrame(results['voxlingua_detailed'])
+             valid_vox = vox_df[vox_df['gt_iso'] != ''].copy()
+
+             if len(valid_vox) > 0:
+                 for conf_range in [(0, 0.3), (0.3, 0.6), (0.6, 0.8), (0.8, 1.0)]:
+                     range_data = valid_vox[
+                         (valid_vox['confidence'] >= conf_range[0]) &
+                         (valid_vox['confidence'] < conf_range[1])
+                     ]
+
+                     if len(range_data) > 0:
+                         accuracy = (range_data['correct'] == True).mean()
+                         confidence_analysis.append({
+                             'Model': 'VoxLingua107',
+                             'Confidence_Range': f"{conf_range[0]:.1f}-{conf_range[1]:.1f}",
+                             'Files': len(range_data),
+                             'Accuracy': f"{accuracy:.4f}",
+                             'Accuracy_Pct': f"{accuracy*100:.1f}%",
+                             'Avg_Confidence': f"{range_data['confidence'].mean():.4f}"
+                         })
+
+         # XLS-R confidence analysis
+         if results['xlsr_detailed']:
+             xlsr_df = pd.DataFrame(results['xlsr_detailed'])
+             valid_xlsr = xlsr_df[xlsr_df['gt_iso'] != ''].copy()
+
+             if len(valid_xlsr) > 0:
+                 for conf_range in [(0, 0.3), (0.3, 0.6), (0.6, 0.8), (0.8, 1.0)]:
+                     range_data = valid_xlsr[
+                         (valid_xlsr['confidence'] >= conf_range[0]) &
+                         (valid_xlsr['confidence'] < conf_range[1])
+                     ]
+
+                     if len(range_data) > 0:
+                         accuracy = (range_data['correct'] == True).mean()
+                         confidence_analysis.append({
+                             'Model': 'XLS-R',
+                             'Confidence_Range': f"{conf_range[0]:.1f}-{conf_range[1]:.1f}",
+                             'Files': len(range_data),
+                             'Accuracy': f"{accuracy:.4f}",
+                             'Accuracy_Pct': f"{accuracy*100:.1f}%",
+                             'Avg_Confidence': f"{range_data['confidence'].mean():.4f}"
+                         })
+
+         confidence_df = pd.DataFrame(confidence_analysis)
+         confidence_df.to_excel(writer, sheet_name='Confidence_Analysis', index=False)
+
+         # ========================================
+         # SHEET 8: ERROR ANALYSIS
+         # ========================================
+         print(" 📋 Creating error analysis...")
+
+         error_analysis = []
+
+         # VoxLingua errors
+         if results['voxlingua_detailed']:
+             vox_df = pd.DataFrame(results['voxlingua_detailed'])
+             vox_errors = vox_df[vox_df['correct'] == False].copy()
+
+             for _, error in vox_errors.iterrows():
+                 error_analysis.append({
+                     'Model': 'VoxLingua107',
+                     'File': error['file'],
+                     'Ground_Truth': error['gt_iso'],
+                     'Predicted': error['pred_iso'],
+                     'Confidence': f"{error['confidence']:.4f}",
+                     'GT_Language_Family': get_language_family(error['gt_iso']),
+                     'Pred_Language_Family': get_language_family(error['pred_iso']),
+                     'Cross_Family_Error': get_language_family(error['gt_iso']) != get_language_family(error['pred_iso'])
+                 })
+
+         # XLS-R errors
+         if results['xlsr_detailed']:
+             xlsr_df = pd.DataFrame(results['xlsr_detailed'])
+             xlsr_errors = xlsr_df[xlsr_df['correct'] == False].copy()
+
+             for _, error in xlsr_errors.iterrows():
+                 error_analysis.append({
+                     'Model': 'XLS-R',
+                     'File': error['file'],
+                     'Ground_Truth': error['gt_iso'],
+                     'Predicted': error['pred_iso'],
+                     'Confidence': f"{error['confidence']:.4f}",
+                     'GT_Language_Family': get_language_family(error['gt_iso']),
+                     'Pred_Language_Family': get_language_family(error['pred_iso']),
+                     'Cross_Family_Error': get_language_family(error['gt_iso']) != get_language_family(error['pred_iso'])
+                 })
+
+         error_df = pd.DataFrame(error_analysis)
+         error_df.to_excel(writer, sheet_name='Error_Analysis', index=False)
+
+         # ========================================
+         # SHEET 9: LANGUAGE FAMILY PERFORMANCE
+         # ========================================
+         print(" 📋 Creating language family performance...")
+
+         family_performance = []
+
+         families = ['Indo-Aryan', 'Dravidian', 'Low-Resource', 'Other']
+
+         for family in families:
+             # VoxLingua performance for this family
+             if results['voxlingua_detailed']:
+                 vox_df = pd.DataFrame(results['voxlingua_detailed'])
+                 family_data = vox_df[vox_df['gt_iso'].apply(lambda x: get_language_family(x) == family)]
+
+                 if len(family_data) > 0:
+                     vox_acc = (family_data['correct'] == True).mean()
+                     vox_conf = family_data['confidence'].mean()
+                     vox_files = len(family_data)
+                 else:
+                     vox_acc = vox_conf = vox_files = 0
+             else:
+                 vox_acc = vox_conf = vox_files = 0
+
+             # XLS-R performance for this family
+             if results['xlsr_detailed']:
+                 xlsr_df = pd.DataFrame(results['xlsr_detailed'])
+                 family_data = xlsr_df[xlsr_df['gt_iso'].apply(lambda x: get_language_family(x) == family)]
+
+                 if len(family_data) > 0:
+                     xlsr_acc = (family_data['correct'] == True).mean()
+                     xlsr_conf = family_data['confidence'].mean()
+                     xlsr_files = len(family_data)
+                 else:
+                     xlsr_acc = xlsr_conf = xlsr_files = 0
+             else:
+                 xlsr_acc = xlsr_conf = xlsr_files = 0
+
+             family_performance.append({
+                 'Language_Family': family,
+                 'VoxLingua_Files': vox_files,
+                 'VoxLingua_Accuracy': f"{vox_acc:.4f}",
+                 'VoxLingua_Accuracy_Pct': f"{vox_acc*100:.1f}%",
+                 'VoxLingua_Avg_Confidence': f"{vox_conf:.4f}",
+                 'XLSR_Files': xlsr_files,
+                 'XLSR_Accuracy': f"{xlsr_acc:.4f}",
+                 'XLSR_Accuracy_Pct': f"{xlsr_acc*100:.1f}%",
+                 'XLSR_Avg_Confidence': f"{xlsr_conf:.4f}",
+                 'Better_Model': 'VoxLingua' if vox_acc > xlsr_acc else 'XLS-R' if xlsr_acc > vox_acc else 'Tie'
+             })
+
+         family_df = pd.DataFrame(family_performance)
+         family_df.to_excel(writer, sheet_name='Language_Family_Performance', index=False)
+
+         # ========================================
+         # SHEET 10: TOP-5 PREDICTIONS (SAMPLE)
+         # ========================================
+         print(" 📋 Creating Top-5 predictions sample...")
+
+         top5_sample = []
+
+         # Sample top-5 predictions from the comparison data
+         sample_files = results['comparison_data'][:20]  # first 20 files as a sample
+
+         for file_data in sample_files:
+             file_name = file_data['file']
+             gt_lang = file_data['gt_iso']
+
+             # VoxLingua Top-5
+             if file_data['voxlingua']['available'] and 'top5_predictions' in file_data['voxlingua']:
+                 for pred in file_data['voxlingua']['top5_predictions']:
+                     top5_sample.append({
+                         'Model': 'VoxLingua107',
+                         'File': file_name,
+                         'Ground_Truth': gt_lang,
+                         'Rank': pred['rank'],
+                         'Predicted_Language': pred['mapped'],
+                         'Original_Output': pred['original'],
+                         'Confidence': f"{pred['confidence']:.4f}",
+                         'In_Dataset': pred['in_dataset'],
+                         'Correct': gt_lang == pred['mapped']
+                     })
+
+             # XLS-R Top-5
+             if file_data['xlsr']['available'] and 'top5_predictions' in file_data['xlsr']:
+                 for pred in file_data['xlsr']['top5_predictions']:
+                     top5_sample.append({
+                         'Model': 'XLS-R',
+                         'File': file_name,
+                         'Ground_Truth': gt_lang,
+                         'Rank': pred['rank'],
+                         'Predicted_Language': pred['mapped'],
+                         'Original_Output': pred['original'],
+                         'Confidence': f"{pred['confidence']:.4f}",
+                         'In_Dataset': pred['in_dataset'],
+                         'Correct': gt_lang == pred['mapped']
+                     })
+
+         top5_df = pd.DataFrame(top5_sample)
+         top5_df.to_excel(writer, sheet_name='Top5_Predictions_Sample', index=False)
+
+ print(f"βœ… Comprehensive Excel analysis created: {output_filename}")
2207
+
2208
+ # Try to download the file
2209
+ try:
2210
+ from google.colab import files
2211
+ print(f"πŸ“₯ File downloaded successfully!")
2212
+ except:
2213
+ print(f"πŸ“ File saved locally: {output_filename}")
2214
+
2215
+ return output_filename
2216
+
2217
+ # Run the comprehensive Excel analysis
+ if 'final_analysis_results' in globals() and final_analysis_results:
+     excel_filename = create_comprehensive_excel_analysis(
+         final_analysis_results,
+         "Language_Detection_Comprehensive_Analysis.xlsx"
+     )
+     print(f"\n🎉 COMPREHENSIVE EXCEL ANALYSIS COMPLETE!")
+     print(f"📊 File: {excel_filename}")
+
+     # Print a summary of what was created
+     print(f"\n📋 Excel Contains 10 Sheets:")
+     print(f" 1. Executive_Summary - Key metrics and recommendations")
+     print(f" 2. VoxLingua107_Results - Detailed VoxLingua results")
+     print(f" 3. XLSR_Results - Detailed XLS-R results")
+     print(f" 4. Per_Language_Analysis - Accuracy by language")
+     print(f" 5. VoxLingua_Confusion_Matrix - VoxLingua confusion matrix")
+     print(f" 6. XLSR_Confusion_Matrix - XLS-R confusion matrix")
+     print(f" 7. Confidence_Analysis - Performance by confidence ranges")
+     print(f" 8. Error_Analysis - Detailed error breakdown")
+     print(f" 9. Language_Family_Performance - Performance by language family")
+     print(f" 10. Top5_Predictions_Sample - Sample of top-5 predictions")
+
+ else:
+     print("❌ No analysis results found. Please run the analysis first.")
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ numpy
+ pandas
+ torch