BUG #1 - (Issue encountered during audio transcription)

#2 by ImPolymath - opened


[image.png: screenshot of the transcription output showing the mixed-language issue]

The transcription issue shown in the image stems from poor segmentation and language detection. Although the user spoke exclusively in French, the result includes words from other languages, indicating that:

  1. Audio segmentation: the audio was split at poor boundaries, producing disjointed, poorly connected sentences.
  2. Language detection: the automatic language detection (handled by GPT-4o-mini) failed, blending several languages into the output instead of recognizing only French.

Adjusting the segmentation and forcing the detected language to French could improve transcription quality.
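
A minimal sketch of the second fix, assuming the OpenAI Whisper API is the transcription backend for this path (the app may route through a Hugging Face endpoint instead; see core/speech_to_text.py below); the file name and model are illustrative:

```python
# Hypothetical sketch: force the transcription language instead of auto-detecting it.
from openai import OpenAI

client = OpenAI()

with open("segment.wav", "rb") as f:  # illustrative file name
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        language="fr",  # ISO-639-1 code; skips automatic language detection
    )
print(result.text)
```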

Changes in the new version - 1.3.9

  1. Pages and Navigation:
    • The app.py file has been updated to incorporate multi-page navigation, featuring two main sections: "Configuration" and "Translator". The configuration part now includes a new page (pages/configuration_ui_lang.py) for selecting the interface language, while the main page (pages/main.py) handles translation and user interaction; a minimal wiring sketch follows this item.
    • The application structure has been clarified by separating interface configuration (language selection, STT/TTS settings) from message processing on the main page.
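
How app.py might wire the two sections together, assuming a recent Streamlit with the st.Page/st.navigation API (the repository's actual wiring may differ):

```python
# app.py - hypothetical wiring of the two main sections
import streamlit as st

pages = [
    st.Page("pages/configuration_ui_lang.py", title="Configuration"),
    st.Page("pages/main.py", title="Translator"),
]
st.navigation(pages).run()  # renders whichever page the user selected
```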

  2. User Interface Enhancements:
    • In pages/main.py, the main_page function has been improved to support multiple input methods: text input (using st.chat_input), audio input (with st.audio_input and a recording widget), and file uploads (managed via st.file_uploader); a condensed sketch follows this item.
    • Distinct tabs (text_input, audio_input, file_upload_input) have been created to organize these different input modes.
    • The audio and text-to-speech parameters are encapsulated within dialog boxes (st.dialog) for configuring STT and TTS settings.
    • Fine-grained session management is implemented with st.session_state, which holds and manipulates session information such as selected languages, uploaded files, transcriptions, and generated responses.
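
A condensed sketch of that layout; widget labels, keys, and the dialog contents are illustrative rather than the repository's actual ones, and it assumes a Streamlit version that provides st.audio_input and the st.dialog decorator:

```python
# pages/main.py - hypothetical sketch of the tabbed inputs and an STT settings dialog
import streamlit as st

st.session_state.setdefault("transcription", None)  # per-session storage

@st.dialog("STT settings")
def stt_settings_dialog():
    # Illustrative setting; the real dialog exposes the app's STT options
    st.session_state["stt_model"] = st.selectbox("Model", ["whisper-1"])

text_tab, audio_tab, file_tab = st.tabs(["Text", "Audio", "File upload"])

with text_tab:
    prompt = st.chat_input("Type a message to translate")  # renders inline in the tab
with audio_tab:
    recording = st.audio_input("Record a message")
with file_tab:
    uploaded = st.file_uploader("Upload an audio or text file")

if st.button("STT settings…"):
    stt_settings_dialog()
```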

  3. Language Management and Translation:
    • Language selection and management are now centralized: the var_app.py file defines the list of supported languages and their corresponding emojis, while pages/configuration_ui_lang.py offers an interface to set the interface language. Translations are loaded from a JSON file (ui_lang_support.json).
    • The init_langs_for_processing function (in pages/main.py), together with functions in core/core.py, initializes the translation processing mode by configuring both the system prompt and the operational prompt, as sketched below.
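
A hedged sketch of that plumbing; only the file and function names come from the changelog, while the dictionary contents, prompt wording, and the load_ui_translations helper are assumptions:

```python
# Hypothetical sketch of the centralized language handling
import json

# var_app.py - assumed shape: language code -> (display name, emoji)
SUPPORTED_LANGS = {"fr": ("Français", "🇫🇷"), "en": ("English", "🇬🇧")}

def load_ui_translations(path: str = "ui_lang_support.json") -> dict:
    """Loads the interface strings for the language chosen in configuration_ui_lang.py."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def init_langs_for_processing(source_lang: str, target_lang: str):
    """Assumed behavior: build the system prompt and the operational prompt."""
    system_prompt = (
        f"You are a professional translator. Translate from {source_lang} "
        f"to {target_lang}, preserving tone and formatting."
    )
    operational_prompt = "Translate the following message:\n{text}"
    return system_prompt, operational_prompt
```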

  4. Audio Processing and Transcription:
    • In core/speech_to_text.py, the module now includes key functions like huggingface_endpoints_stt and transcribe_audio, which handle transcription via Hugging Face endpoints and segment audio files that exceed the 25 MB limit.
    • Audio segmentation logic (based on file size and audio properties) is integrated to facilitate processing of lengthy audio files; see the sketch after this item.
    • An optional voice-isolation pass (via the isolate_audio function) can be applied before transcription, at the user’s choice.
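
A minimal sketch of size-based segmentation with pydub; the helper name and chunk length are assumptions, only the 25 MB limit comes from the description above:

```python
# core/speech_to_text.py - hypothetical segmentation helper
import os
from pydub import AudioSegment

MAX_BYTES = 25 * 1024 * 1024  # the endpoint upload limit cited above

def split_audio_if_needed(path: str, chunk_ms: int = 10 * 60 * 1000):
    """Returns the whole file as a single chunk when it is small enough,
    otherwise fixed-length pydub segments ready for transcription."""
    audio = AudioSegment.from_file(path)
    if os.path.getsize(path) <= MAX_BYTES:
        return [audio]
    # pydub slices by milliseconds; len(audio) is the duration in ms
    return [audio[i:i + chunk_ms] for i in range(0, len(audio), chunk_ms)]
```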

  5. Assistant Integration and Streaming Responses:
    • In core/core.py and core/demorrha.py, the DemorrhaAssistant class has been enhanced to manage the translation assistant. This class handles the creation, search, and update of the assistant via the OpenAI API.
    • Vectorization logic is introduced to improve search and to allow personalization by style or context.
    • The use_assistant method runs a streaming thread that returns the assistant’s response incrementally; a generator collects the stream and displays it in real time within the Streamlit interface (sketched below).
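
A hedged sketch of that streaming pattern, assuming the OpenAI Assistants streaming API and Streamlit's st.write_stream; the function name and variables are illustrative:

```python
# Hypothetical sketch of how use_assistant might stream tokens to the UI
import streamlit as st
from openai import OpenAI

client = OpenAI()

def stream_reply(thread_id: str, assistant_id: str):
    """Generator yielding text deltas as the assistant produces them."""
    with client.beta.threads.runs.stream(
        thread_id=thread_id, assistant_id=assistant_id
    ) as stream:
        yield from stream.text_deltas

# In pages/main.py the generator can be rendered incrementally:
# full_reply = st.write_stream(stream_reply(thread_id, assistant_id))
```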

  6. Various Enhancements and File Management:
    • Utility functions such as save_attachment, hash_file, and callback_change_edited_text have been introduced for handling and saving files (both text and audio); possible shapes for the first two are sketched after this item.
    • The README.md file has been updated to reflect the current state and new functionalities, including multi-language support, audio input options, and streaming responses.
    • The requirements.txt file now specifies the exact dependencies (notably streamlit, openai, pydub, python-dotenv, and elevenlabs), ensuring consistency in the development environment.
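
Possible shapes for those two helpers; the signatures are assumptions based only on the names listed above:

```python
# Hypothetical shapes for hash_file and save_attachment
import hashlib
from pathlib import Path

def hash_file(data: bytes) -> str:
    """Content hash, e.g. to deduplicate or key saved attachments."""
    return hashlib.sha256(data).hexdigest()

def save_attachment(uploaded_file, dest_dir: str = "attachments") -> Path:
    """Persists a Streamlit UploadedFile (text or audio) to disk."""
    data = uploaded_file.getvalue()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    path = dest / f"{hash_file(data)[:12]}_{uploaded_file.name}"
    path.write_bytes(data)
    return path
```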

Overall, the latest merges have significantly improved Demorrha's application architecture by better separating responsibilities (interface configuration, input processing, transcription, translation, text-to-speech, and assistant management). The user interface is now more intuitive with dedicated tabs and dynamic settings management, while robust integration with OpenAI and Hugging Face APIs ensures reliable real-time audio and text processing.

ImPolymath changed discussion status to closed