import gradio as gr
import operator
import torch
import os
from transformers import BertTokenizer, BertForMaskedLM

model_name_or_path = "DeepLearning101/Corrector101zhTW"
auth_token = os.getenv("Corrector")  # access token read from the "Corrector" environment variable

# Try to load the model and tokenizer
try:
    tokenizer = BertTokenizer.from_pretrained(model_name_or_path, token=auth_token)
    model = BertForMaskedLM.from_pretrained(model_name_or_path, token=auth_token)
    model.eval()
except Exception as e:
    print(f"Failed to load the model or tokenizer: {e}")
    raise SystemExit(1)

def ai_text(text):
    """Correct the input text and return the corrected text followed by the error details."""
    with torch.no_grad():
        inputs = tokenizer(text, return_tensors="pt", padding=True)
        outputs = model(**inputs)
    corrected_text, details = get_errors(text, outputs)
    return corrected_text + ' ' + str(details)
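

# Illustrative sketch, not part of the original Space: ai_text appends the raw
# tuple list via str(details). A hypothetical helper such as `format_details`
# could render the same tuples more readably, assuming each tuple has the
# (original_char, corrected_char, start, end) shape produced by get_errors.
def format_details(details):
    """Render correction tuples as 'original->corrected @index', joined by '; '."""
    return "; ".join(f"{ori}->{new} @{start}" for ori, new, start, _end in details)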

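def get_errors(text, outputs):
    """Find character-level differences between the original text and the model output."""
    sub_details = []
    # Greedy decode: take the highest-scoring token at every position.
    corrected_text = tokenizer.decode(torch.argmax(outputs.logits[0], dim=-1), skip_special_tokens=True).replace(' ', '')
    for i, ori_char in enumerate(text):
        if ori_char in [' ', '“', '”', '‘', '’', '琊', '\n', '…', '—', '擤']:
            # These characters are assumed to be missing from the decoded output
            # (spaces are stripped above; the rest are presumably out-of-vocabulary),
            # so put the original back at index i to keep later positions aligned.
            corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
            continue
        if i >= len(corrected_text):
            continue
        if ori_char != corrected_text[i]:
            # Record (original_char, corrected_char, start, end).
            sub_details.append((ori_char, corrected_text[i], i, i + 1))
    sub_details = sorted(sub_details, key=operator.itemgetter(2))
    return corrected_text, sub_details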
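

# Illustrative sketch, not part of the original Space: a model-free version of
# the position-wise comparison done in get_errors, handy for checking the
# (original_char, corrected_char, start, end) tuple format without loading the
# model. The name `diff_aligned` is hypothetical, and it assumes both strings
# are already aligned character for character.
def diff_aligned(original, corrected):
    """Return (original_char, corrected_char, start, end) for each mismatch."""
    return [
        (o, c, i, i + 1)
        for i, (o, c) in enumerate(zip(original, corrected))
        if o != c
    ]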

if __name__ == '__main__':
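    # Sample ASR transcripts containing homophone errors, kept in Traditional Chinese (the model's input language).
    examples = [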
        ['你究輸入利的手機門號跟生分證就可以了。'],
        ['這裡是客服中新，很高性為您服物，請問金天有什麼須要幫忙'],
        ['因為我們這邊是按天術比例計蒜給您的，其實不會有態大的穎響。也就是您用前面的資非的廢率來做計算'],
        ['我來看以下，他的時價是多少？起實您就可以直皆就不用到門事'],
        ['因為你現在月富是六九九嘛，我幫擬減衣百塊，兒且也不會江速'],
    ]
    gr.Interface(
        fn=ai_text,
        inputs=gr.Textbox(lines=2, label="Text to correct"),
        outputs=gr.Textbox(lines=2, label="Corrected text"),
        title="<h1 align='center'>AI Error Correction for Customer-Service ASR Text</h1>",
        description="""<h2><a href='https://deep-learning-101.github.io' target='_blank'>deep-learning-101.github.io</a> | <a href='https://www.twman.org/AI' target='_blank'> AI </a> | <a href='https://www.twman.org' target='_blank'>TonTon Huang Ph.D.</a> | <a href='https://blog.twman.org/p/deeplearning101.html' target='_blank'>手把手帶你一起踩AI坑</a><br></h2><br>
                    Enter ASR text to correct homophone character and word errors<br>
                    <a href='https://github.com/Deep-Learning-101' target='_blank'>Deep Learning 101 Github</a> | <a href='http://deeplearning101.twman.org' target='_blank'>Deep Learning 101</a> | <a href='https://www.facebook.com/groups/525579498272187/' target='_blank'>台灣人工智慧社團 FB</a> | <a href='https://www.youtube.com/c/DeepLearning101' target='_blank'>YouTube</a><br>
                    <a href='https://blog.twman.org/2025/04/AI-Robot.html' target='_blank'>AI 陪伴機器人：2025 趨勢分析技術突破、市場潛力與未來展望</a> | <a href='https://blog.twman.org/2025/04/FinanceGenAI.html' target='_blank'>金融科技新浪潮：生成式 AI (GenAI) 應用場景、效益與導入挑戰</a><br>
                    <a href='https://blog.twman.org/2025/03/AIAgent.html' target='_blank'>避開 AI Agent 開發陷阱：常見問題、挑戰與解決方案 (實戰經驗)</a>：<a href="https://deep-learning-101.github.io/agent" target="_blank">探討多種 AI 代理人工具的應用經驗與挑戰，分享實用經驗與工具推薦。</a><br>
                    <a href="https://blog.twman.org/2024/08/LLM.html" target="_blank">白話文手把手帶你科普 GenAI</a>：<a href="https://deep-learning-101.github.io/GenAI" target="_blank">淺顯介紹生成式人工智慧核心概念，強調硬體資源和數據的重要性。</a><br>
                    <a href="https://blog.twman.org/2024/09/LLM.html" target="_blank">大型語言模型直接就打完收工？</a>：<a href="https://deep-learning-101.github.io/1010LLM" target="_blank">回顧 LLM 領域探索歷程，討論硬體升級對 AI 開發的重要性。</a><br>
                    <a href="https://blog.twman.org/2024/07/RAG.html" target="_blank">檢索增強生成(RAG)不是萬靈丹之優化挑戰技巧</a>：<a href="https://deep-learning-101.github.io/RAG" target="_blank">探討 RAG 技術應用與挑戰，提供實用經驗分享和工具建議。</a><br>
                    <a href="https://blog.twman.org/2024/02/LLM.html" target="_blank">大型語言模型 (LLM) 入門完整指南：原理、應用與未來</a>：<a href="https://deep-learning-101.github.io/0204LLM" target="_blank">探討多種 LLM 工具的應用與挑戰，強調硬體資源的重要性。</a><br>
                    <a href="https://blog.twman.org/2023/04/GPT.html" target="_blank">解析探索大型語言模型：模型發展歷史、訓練及微調技術的 VRAM 估算</a>：<a href="https://deep-learning-101.github.io/GPU" target="_blank">探討 LLM 的發展與應用，強調硬體資源在開發中的關鍵作用。</a><br>
                    <a href="https://blog.twman.org/2024/11/diffusion.html" target="_blank">Diffusion Model 完全解析：從原理、應用到實作 (AI 圖像生成)</a>：<a href="https://deep-learning-101.github.io/diffusion" target="_blank">深入探討影像生成與分割技術的應用，強調硬體資源的重要性。</a><br>
                    <a href="https://blog.twman.org/2024/02/asr-tts.html" target="_blank">ASR/TTS 開發避坑指南：語音辨識與合成的常見挑戰與對策</a>：<a href="https://deep-learning-101.github.io/asr-tts" target="_blank">探討 ASR 和 TTS 技術應用中的問題，強調數據質量的重要性。</a><br>
                    <a href="https://blog.twman.org/2021/04/NLP.html" target="_blank">那些 NLP 踩的坑</a>：<a href="https://deep-learning-101.github.io/nlp" target="_blank">分享 NLP 領域的實踐經驗，強調數據質量對模型效果的影響。</a><br>
                    <a href="https://blog.twman.org/2021/04/ASR.html" target="_blank">那些語音處理踩的坑</a>：<a href="https://deep-learning-101.github.io/speech" target="_blank">分享語音處理領域的實務經驗，強調資料品質對模型效果的影響。</a><br>
                    <a href="https://blog.twman.org/2020/05/DeepLearning.html" target="_blank">手把手學深度學習安裝環境</a>：<a href="https://deep-learning-101.github.io/101" target="_blank">詳細介紹在 Ubuntu 上安裝深度學習環境的步驟，分享實際操作經驗。</a><br>
                    <a href='https://blog.twman.org/2023/07/wsl.html' target='_blank'>用PPOCRLabel來幫PaddleOCR做OCR的微調和標註</a><br>
                    <a href='https://blog.twman.org/2023/07/HugIE.html' target='_blank'>基於機器閱讀理解和指令微調的統一信息抽取框架之診斷書醫囑資訊擷取分析</a><br>                
                    <a href='https://github.com/shibing624/pycorrector' target='_blank'>Masked Language Model (MLM) as correction BERT</a>""",
        examples=examples
    ).launch()