update source description
#1 opened by yanirmr
src/about.py CHANGED (+14 -13)
@@ -44,28 +44,29 @@ The following datasets are used in our evaluation:
 
 ### [ivrit-ai/eval-d1](https://huggingface.co/datasets/ivrit-ai/eval-d1)
 - **Size**: 2 hours
-- **Domain**: Manual transcription of
-- **Source**:
+- **Domain**: Manual transcription of a single podcast episode featuring an informal conversation between two speakers (male and female). Audio is segmented into approximately 5-minute chunks.
+- **Source**: Part of the ivrit.ai corpus. The selected episode was manually transcribed to gold-standard quality to serve as a high-quality evaluation benchmark.
 
 ### [ivrit-ai/saspeech](https://huggingface.co/datasets/ivrit-ai/saspeech)
-- **Size**:
-- **Domain**:
-- **Source**:
+- **Size**: 4 hours (manually corrected portion of the corpus)
+- **Domain**: Economic and political podcast content, containing both read speech and conversational segments. Segments are several seconds in length.
+- **Source**: Derived from the [Robo-Shaul project](https://www.roboshaul.com/) and published in the paper
+"SASPEECH: A Hebrew Single Speaker Dataset for Text To Speech and Voice Conversion" (Sharoni, O., Shenberg, R., Cooper, E., Proc. INTERSPEECH 2023)
 
 ### [google/fleurs/he](https://huggingface.co/datasets/google/fleurs)
 - **Size**: X hours
-- **Domain**:
-- **Source**:
+- **Domain**: Read speech covering common topics and phrases in Hebrew
+- **Source**: Created as part of Google's FLEURS project, designed for multilingual speech tasks and evaluation. Data collected through crowdsourcing from Hebrew speakers.
 
 ### [mozilla-foundation/common_voice_17_0/he](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
-- **Size**: X hours
-- **Domain**:
-- **Source**:
+- **Size**: X hours (test set of the corpus)
+- **Domain**: Read sentences in Hebrew from various texts.
+- **Source**: Collected through Mozilla's Common Voice initiative, where volunteers contribute recordings and validate other speakers' contributions
 
 ### [imvladikon/hebrew_speech_kan](https://huggingface.co/datasets/imvladikon/hebrew_speech_kan)
-- **Size**:
-- **Domain**:
-- **Source**:
+- **Size**: 1.7 hours (validation set of the corpus)
+- **Domain**: Varied content types from the Kan (Israeli Public Broadcasting Corporation) YouTube channel
+- **Source**: Published by Vladimir Gurevich. Audio and subtitle data scraped from the YouTube channel "כאן" (Kan).
 """
 
 # Technical details about evaluation