---
license: gpl-3.0
title: Kudasai
sdk: gradio
emoji: 🈷️
python_version: 3.10.0
app_file: webgui.py
colorFrom: gray
colorTo: gray
short_description: Japanese-English preprocessor with automated translation.
pinned: true
---

Table of Contents

Notes
Naming Conventions
General Usage
Indexing and Preprocessing
Translation with DeepL
Translation with LLMs
Translation with LLMs Settings
Web GUI
License
Contact

Notes

This README is for the Hugging Face Space instance of Kudasai's WebGUI and for the WebGUI itself. To run Kudasai locally or to see any other info on the project, please see the GitHub page.

Streamlining Japanese-English Translation with Advanced Preprocessing and Integrated Translation Technologies.

Preprocessing and translation logic is sourced from external packages, which I also designed; see Kairyou and EasyTL for more information.

Kudasai has a public Trello board; you can find it here to see what I'm working on.

The WebGUI on Hugging Face does not save anything between runs, so you will need to download the output files or copy the text out of the WebGUI. API keys are not saved, the output folder is overwritten every run, and archives are deleted every run as well.


Naming Conventions

kudasai.py - Main script - ください - Please

Kairyou - Preprocessing Package - 改良 - Reform

kaiseki.py - DeepL translation module - 解析 - Parsing

kijiku.py - OpenAI translation module - 基軸 - Foundation

Kudasai gets its name from its inspiration, Atreyagaurav's Onegai, which also means "please". You can find that here.


General Usage

Kudasai's WebGUI is pretty easy to understand for general usage; most incorrect actions will be caught by the system, and a message will be displayed telling the user how to correct them.

Normally, Kudasai would save files to the local system, but on Hugging Face's servers, this is not possible. Instead, you'll have to click the 'Save As' button to download the files to your local system.

Or you can click the copy button on the top right of textbox modals to copy the text to your clipboard.

For further details, see the chapters below.


Indexing and Preprocessing

Indexing is not for everyone; only use it if you have a large amount of previous text and want to flag new names. It can be a very slow process, especially on Hugging Face's servers, so it's recommended to use a local version of Kudasai for it.

You'll need a txt file or some text to index. You'll also need a knowledge base, which can be either a single txt file or a directory of them, as well as a replacements json; either the Kudasai or Fukuin type works. See this for further details on replacement jsons.

Please do indexing before preprocessing; the output is neater that way.

For preprocessing, you'll need a txt file or some text to preprocess. You'll also need a replacements json; either the Kudasai or Fukuin type works, as with indexing (see the sketch at the end of this section).

For both, text is put in the textbox modals, with the output text in the first field and the results in the second field.

They both have a debug field, but neither module really uses it.
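
If you'd rather script this step than use the WebGUI, the sketch below shows roughly how preprocessing fits together with the Kairyou package. Both the replacement json keys and the `Kairyou.preprocess` entry point shown here are assumptions based on Kairyou's own documentation, so double-check them against its README before relying on this:

```python
import json

# Assumed import -- Kairyou is the preprocessing package linked above;
# check its README for the exact entry point before relying on this.
from kairyou import Kairyou

# Illustrative replacements json -- the key names here are assumptions,
# not the authoritative schema; see the replacement json docs linked above.
replacements = {
    "honorifics": {"さん": "san", "くん": "kun"},
    "single_names": {"Akari": "あかり"},
    "full_names": {"Tanaka Akari": "田中あかり"},
}

# Save the replacements so they can be reused or uploaded to the WebGUI.
with open("replacements.json", "w", encoding="utf-8") as file:
    json.dump(replacements, file, ensure_ascii=False, indent=4)

text = "田中あかりさんはどこですか？"

# Assumed signature: returns the preprocessed text plus two logs.
preprocessed_text, preprocessing_log, error_log = Kairyou.preprocess(text, replacements)

print(preprocessed_text)
```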


Translation with DeepL

DeepL is a paid service, so you'll need an API key to use it; you can get one here. However, it is free for up to 500,000 characters a month.

The same general things apply here: use a text input (file or raw text) and enter your API key in the API key field.

While DeepL translation does work, it is currently deprecated in favor of the LLMs and a bit buggy, so it's recommended to use the LLMs for translation. Perhaps in the future I'll update the DeepL translation to be more stable, given demand.

DeepL translation is fairly unsophisticated compared to the LLMs, so there's not much to configure. Press the translate button and wait for the results. Output will show in the appropriate fields.
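
For reference, this is roughly what a DeepL translation call looks like outside the WebGUI, using the official deepl Python package. Kudasai itself routes this through EasyTL, so treat this as a minimal standalone sketch rather than Kudasai's actual code path:

```python
import deepl

# Your DeepL API key, from the link above.
translator = deepl.Translator("your-api-key-here")

# Translate Japanese text to English; EN-US is DeepL's American English target.
result = translator.translate_text(
    "お元気ですか？",
    source_lang="JA",
    target_lang="EN-US",
)

print(result.text)  # e.g. "How are you?"
```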


Translation with LLMs

Kudasai supports two different LLMs at the moment: OpenAI's GPT and Google's Gemini.

For OpenAI, you'll need an API key; you can get one here. This is a paid service with no free tier.

For Gemini, you'll need an API key; you can get one here. Gemini is free to use for up to 60 concurrent requests.

I'd recommend using GPT for most things, as it's generally better at translation.

Once again, this is mostly straightforward: fill in your API key, select your LLM, and select your text. You'll also need to add your settings file if you're on Hugging Face.

You can calculate costs here or just translate. Output will show in the appropriate fields.
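
For a sense of what happens under the hood, here are minimal standalone sketches of a translation request to each provider, using the official openai and google-generativeai Python packages. Kudasai routes these through EasyTL, so the prompts and parameter values here are illustrative, not Kudasai's actual ones:

```python
from openai import OpenAI
import google.generativeai as genai

system_message = "Translate the following Japanese to English."  # illustrative prompt
text = "お元気ですか？"

# OpenAI: chat-completions style request.
client = OpenAI(api_key="your-openai-key-here")
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": text},
    ],
    temperature=0.3,  # lower values are typically better for translation
)
print(response.choices[0].message.content)

# Gemini: single-prompt request; system instructions are folded into the prompt.
genai.configure(api_key="your-gemini-key-here")
model = genai.GenerativeModel("gemini-pro")
result = model.generate_content(f"{system_message}\n\n{text}")
print(result.text)
```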

For further details on the settings file, see here.


Translation with LLMs Settings

(Fairly technical, can be abstracted away by using default settings or someone else's settings file.)

----------------------------------------------------------------------------------
Kijiku Settings:

prompt_assembly_mode : 1 or 2. 1 means the system message will actually be treated as a system message; 2 means it'll be treated as a user message. 1 is recommended for gpt-4; otherwise, either works. For Gemini, this setting is ignored.

number_of_lines_per_batch : The number of lines to be built into a prompt at once. Theoretically, more lines would be more cost effective, but other complications may occur with more lines. It has been tested up to 48 lines so far.

sentence_fragmenter_mode : 1 or 2 (1 - via regex and other nonsense; 2 - None, takes formatting and text directly from the API return). The API can sometimes return a result on a single line, so this determines how Kijiku fragments the sentences, if at all. Use 2 for newer models.

je_check_mode : 1 or 2. 1 will print the Japanese and then the English below it, separated by ---; 2 will attempt to pair the English and Japanese sentences, placing the Japanese above the English. If it cannot, it will default to 1. Use 2 for newer models.

number_of_malformed_batch_retries : (A malformed batch is when je-fixing fails.) How many times Kijiku will attempt to mend a malformed batch (mending means resending the request); only for gpt-4. Be careful with increasing this, as cost grows at (cost * length * n) in the worst case. This setting is ignored if je_check_mode is set to 1.

batch_retry_timeout : How long Kijiku will try to translate a batch, in seconds; if a request exceeds this duration, Kijiku will leave it untranslated.

number_of_concurrent_batches : How many translation batches Kijiku will send to the translation API at a time. For OpenAI, be conservative, as rate-limiting is aggressive; I'd suggest 3-5. For Gemini, do not exceed 60.
----------------------------------------------------------------------------------
Open AI Settings:
See https://platform.openai.com/docs/api-reference/chat/create for further details
----------------------------------------------------------------------------------
openai_model : ID of the model to use. Kijiku only works with 'chat' models.

openai_system_message : Instructions to the model. Basically tells the model how to translate.

openai_temperature : What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. Lower values are typically better for translation.

openai_top_p : An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. I generally recommend altering this or temperature but not both.

openai_n : How many chat completion choices to generate for each input message. Do not change this.

openai_stream : If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message. See the OpenAI python library on GitHub for example code. Do not change this.

openai_stop : Up to 4 sequences where the API will stop generating further tokens. Do not change this.

openai_logit_bias : Modifies the likelihood of specified tokens appearing in the completion. Do not change this.

openai_max_tokens : The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. I wouldn't recommend changing this; it is none by default. If you change it to an integer, make sure it doesn't exceed that model's context length, or your request will fail and repeat until timeout.

openai_presence_penalty : Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics, while negative values encourage repetition. You should leave this at 0.0.

openai_frequency_penalty : Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim, while negative values encourage repetition. You should leave this at 0.0.
----------------------------------------------------------------------------------
openai_stream, openai_logit_bias, openai_stop, and openai_n are included for completeness' sake; current versions of Kudasai will hardcode their values to the defaults when validating Kijiku_rule.json, as different values for these settings have no use case in Kudasai's current implementation.
----------------------------------------------------------------------------------
Gemini Settings:
See https://ai.google.dev/docs/concepts#model-parameters for further details
----------------------------------------------------------------------------------
gemini_model : The model to use. Currently only supports gemini-pro and gemini-pro-vision, i.e. the 1.0 model and its aliases.

gemini_prompt : Instructions to the model. Basically tells the model how to translate.

gemini_temperature : What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. Lower values are typically better for translation.

gemini_top_p : An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. I generally recommend altering this or temperature but not both.

gemini_top_k : Determines the number of most probable tokens to consider at each selection step. A higher value increases diversity; a lower value makes the output more deterministic.

gemini_candidate_count : The number of candidates to generate for each input message. Do not change this.

gemini_stream : If set, partial message deltas will be sent, like in Gemini Chat. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message. Do not change this.

gemini_stop_sequences : Up to 4 sequences where the API will stop generating further tokens. Do not change this.

gemini_max_output_tokens : The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. I wouldn't recommend changing this; it is none by default. If you change it to an integer, make sure it doesn't exceed that model's context length, or your request will fail and repeat until timeout.
----------------------------------------------------------------------------------
gemini_stream, gemini_stop_sequences, and gemini_candidate_count are included for completeness' sake; current versions of Kudasai will hardcode their values to the defaults when validating Kijiku_rule.json, as different values for these settings have no use case in Kudasai's current implementation.
----------------------------------------------------------------------------------
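
Putting the above together, a settings file is essentially just these keys with values. The sketch below writes a flat example in Python using only the setting names documented in this section; the values are plausible illustrative defaults, not Kudasai's authoritative ones, and the real Kijiku_rule.json may nest or order these differently:

```python
import json

# Illustrative settings -- key names come from the documentation above;
# the values and flat layout are assumptions, not Kudasai's actual defaults.
settings = {
    "prompt_assembly_mode": 1,
    "number_of_lines_per_batch": 36,
    "sentence_fragmenter_mode": 2,
    "je_check_mode": 2,
    "number_of_malformed_batch_retries": 1,
    "batch_retry_timeout": 300,
    "number_of_concurrent_batches": 5,
    "openai_model": "gpt-4",
    "openai_system_message": "Translate the following Japanese to English.",
    "openai_temperature": 0.3,
    "openai_top_p": 1.0,
    "openai_n": 1,
    "openai_stream": False,
    "openai_stop": None,
    "openai_logit_bias": None,
    "openai_max_tokens": None,
    "openai_presence_penalty": 0.0,
    "openai_frequency_penalty": 0.0,
    "gemini_model": "gemini-pro",
    "gemini_prompt": "Translate the following Japanese to English.",
    "gemini_temperature": 0.3,
    "gemini_top_p": None,
    "gemini_top_k": None,
    "gemini_candidate_count": 1,
    "gemini_stream": False,
    "gemini_stop_sequences": None,
    "gemini_max_output_tokens": None,
}

# Write the settings file so it can be uploaded to the WebGUI.
with open("Kijiku_rule.json", "w", encoding="utf-8") as file:
    json.dump(settings, file, indent=4)
```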

Web GUI

Below are images of the WebGUI:

Indexing Screen | Kairyou

Preprocessing Screen | Kairyou

Translation Screen | Kaiseki

Translation Screen | Kijiku

Kijiku Settings

Logging


License

This project (Kudasai) is licensed under the GNU General Public License (GPL). You can find the full text of the license in the LICENSE file.

The GPL is a copyleft license that promotes the principles of open-source software. It ensures that any derivative works based on this project must also be distributed under the same GPL license. This license grants you the freedom to use, modify, and distribute the software.

Please note that this information is a brief summary of the GPL. For a detailed understanding of your rights and obligations under this license, please refer to the full license text.


Contact

If you have any questions, comments, or concerns, please feel free to contact me at Tetralon07@gmail.com.

For any bugs or suggestions, please use the issues tab here.

Once again, I actively encourage and welcome any feedback on this project.