| <html> | |
| <head> | |
| <meta charset="utf-8"> | |
| <meta name="description" | |
| content="Sidon is a fast, open-source multilingual speech restoration model for large-scale dataset restoration in TTS and spoken language modeling."> | |
| <meta name="keywords" | |
| content="speech restoration, dataset restoration, multilingual, TTS, spoken language models, vocoder, LoRA, w2v-BERT 2.0, HiFi-GAN"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1"> | |
| <title>Sidon: Fast and Robust Open-Source Multilingual Speech Restoration</title> | |
| <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet"> | |
| <link rel="stylesheet" href="./static/css/bulma.min.css"> | |
| <link rel="stylesheet" href="./static/css/bulma-carousel.min.css"> | |
| <link rel="stylesheet" href="./static/css/bulma-slider.min.css"> | |
| <link rel="stylesheet" href="./static/css/fontawesome.all.min.css"> | |
| <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css"> | |
| <link rel="stylesheet" href="./static/css/index.css"> | |
| <link rel="icon" href="./static/images/favicon.svg"> | |
| <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script> | |
| <script defer src="./static/js/fontawesome.all.min.js"></script> | |
| <script src="./static/js/bulma-carousel.min.js"></script> | |
| <script src="./static/js/bulma-slider.min.js"></script> | |
| <script src="./static/js/index.js"></script> | |
| </head> | |
| <body> | |
| <section class="hero"> | |
| <div class="hero-body"> | |
| <div class="container is-max-desktop"> | |
| <div class="columns is-centered"> | |
| <div class="column has-text-centered"> | |
| <h1 class="title is-1 publication-title">Sidon: Fast and Robust Open-Source Multilingual Speech Restoration | |
| for Dataset Cleansing</h1> | |
| <div class="is-size-5 publication-authors"> | |
| <span class="author-block">Wataru Nakata,</span> | |
| <span class="author-block">Yuki Saito,</span> | |
| <span class="author-block">Yota Ueda,</span> | |
| <span class="author-block">Hiroshi Saruwatari</span> | |
| </div> | |
| <div class="is-size-5 publication-authors"> | |
| <span class="author-block">The University of Tokyo, Japan.</span> | |
| </div> | |
| <div class="column has-text-centered"> | |
| <div class="publication-links"> | |
| <span class="link-block"> | |
| <a href="https://arxiv.org/abs/2509.17052" target="_blank" class="external-link button is-normal is-rounded is-dark"> | |
| <span class="icon"> | |
| <i class="fas fa-file-pdf"></i> | |
| </span> | |
| <span>Paper</span> | |
| </a> | |
| </span> | |
| <span class="link-block"> | |
| <a href="https://github.com/sarulab-speech/Sidon" target="_blank" | |
| class="external-link button is-normal is-rounded is-dark"> | |
| <span class="icon"> | |
| <i class="fab fa-github"></i> | |
| </span> | |
| <span>Code</span> | |
| </a> | |
| </span> | |
| <span class="link-block"> | |
| <a href="https://huggingface.co/spaces/sarulab-speech/sidon_demo_beta" target="_blank" | |
| class="external-link button is-normal is-rounded is-dark"> | |
| <span class="icon"> | |
| 🤗 | |
| </span> | |
| <span>Live Demo</span> | |
| </a> | |
| </span> | |
| </div> | |
| </div> | |
| </div> | |
| </div> | |
| </div> | |
| </div> | |
| </section> | |
| <!-- Teaser section removed for this project page. --> | |
| <!-- Results carousel removed for this project page. --> | |
| <section class="section"> | |
| <div class="container is-max-desktop"> | |
| <!-- Abstract. --> | |
| <div class="columns is-centered has-text-centered"> | |
| <div class="column is-four-fifths"> | |
| <h2 class="title is-3">Abstract</h2> | |
| <div class="content has-text-justified"> | |
| <p> | |
| Large-scale text-to-speech (TTS) systems are limited by the scarcity of clean, | |
| multilingual recordings. We introduce <b>Sidon</b>, a fast, open-source | |
| speech restoration model that converts noisy in-the-wild speech into | |
| studio-quality speech and scales to dozens of languages. Sidon consists of | |
| two models: a feature predictor, fine-tuned from w2v-BERT 2.0, that cleanses the features of noisy speech, | |
| and a vocoder trained to synthesize restored speech from the cleansed features. | |
| Sidon achieves restoration performance comparable to Miipher, Google's internal speech restoration model | |
| built for cleansing speech synthesis datasets. Sidon is also computationally | |
| efficient, running up to 3,390× faster than real time on a | |
| single GPU. We further show that training a TTS model on a Sidon-cleansed automatic speech | |
| recognition corpus improves the quality of synthetic | |
| speech in a zero-shot setting. The code and model are released to | |
| facilitate reproducible dataset cleansing for the research community. | |
| </p> | |
| </div> | |
| </div> | |
| </div> | |
| <!--/ Abstract. --> | |
| <!-- Paper video section intentionally omitted. --> | |
| </div> | |
| </section> | |
| <section class="section"> | |
| <div class="container is-max-desktop"> | |
| <h2 class="title is-3 has-text-centered">Full Multilingual Results (FLEURS)</h2> | |
| <div class="content has-text-centered is-size-6"> | |
| <p>The full multilingual evaluation table is large, so it is collapsed by default.</p> | |
| </div> | |
| <details id="fleurs-details" class="box"> | |
| <summary class="is-size-5">Show results table</summary> | |
| <p class="is-size-7 has-text-grey">Loads an embedded page with per-language metrics for every evaluated language.</p> | |
| <div id="fleurs-iframe-wrapper" style="margin-top: 0.75rem;"> | |
| <iframe title="FLEURS Results" | |
| src="full_result.html" | |
| loading="lazy" | |
| style="width: 100%; height: 70vh; border: 1px solid #e5e5e5; border-radius: 6px;"></iframe> | |
| </div> | |
| </details> | |
| </div> | |
| </section> | |
| <section class="section"> | |
| <div class="container is-max-desktop"> | |
| <h2 class="title is-3 has-text-centered">Multilingual Samples from FLEURS</h2> | |
| <div id="samples-multilingual-root" class="samples-root"></div> | |
| <h2 class="title is-3 has-text-centered" style="margin-top:2rem;">English Demo Samples from LibriTTS</h2> | |
| <div id="samples-english-root" class="samples-root"></div> | |
| </div> | |
| </section> | |
| <section class="section" id="BibTeX"> | |
| <div class="container is-max-desktop content"> | |
| <h2 class="title">BibTeX</h2> | |
| <pre><code>@inproceedings{sidon2026, | |
| author = {Nakata, Wataru and Saito, Yuki and Ueda, Yota and Saruwatari, Hiroshi}, | |
| title = {Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Dataset Restoration}, | |
| booktitle = {TBA}, | |
| year = {TBA} | |
| }</code></pre> | |
| </div> | |
| </section> | |
| <footer class="footer"> | |
| <div class="container"> | |
| <div class="content has-text-centered"> | |
| <a class="icon-link" target="_blank" href="https://arxiv.org/abs/2509.17052"> | |
| <i class="fas fa-file-pdf"></i> | |
| </a> | |
| <a class="icon-link external-link" href="https://huggingface.co/spaces/Wataru/SidonSamples" | |
| target="_blank"> | |
| <i class="fab fa-github"></i> | |
| </a> | |
| </div> | |
| <div class="columns is-centered"> | |
| <div class="column is-8"> | |
| <div class="content"> | |
| <p> | |
| This website is licensed under a <a rel="license" target="_blank" | |
| href="http://creativecommons.org/licenses/by-sa/4.0/">Creative | |
| Commons Attribution-ShareAlike 4.0 International License</a>. | |
| </p> | |
| <p> | |
| This means you are free to borrow the <a target="_blank" | |
| href="https://github.com/nerfies/nerfies.github.io">source code</a> of this website; | |
| we just ask that you link back to this page in the footer. | |
| Please remember to remove any analytics code in the header of the website that | |
| you do not want on your own site. | |
| </p> | |
| </div> | |
| </div> | |
| </div> | |
| </div> | |
| </footer> | |
| </body> | |
| </html> | |