"""
TODO:
- collect stats for tokenizer_impl
- collect stats for OOV
- collect stats for reversal
- add math and code corpora
## balance
- high compression rate VS vocab_size:
  - A higher compression rate means fewer tokens after encoding, so each token must be longer --> vocab_size becomes too large.
- high compression rate VS lossless:
  - OOV
    - Many OOVs can produce many UNKs (one UNK per char) --> more tokens --> lower compression rate.
    - Many OOVs can also produce few UNKs (a whole OOV span collapsed into one UNK) --> fewer tokens --> the compression rate looks higher, but more information is lost.
"""
import gradio as gr
from compression_util import get_compression_leaderboard, common_corpuses
# From the perspective of compression
# exactly reconstructed from compressed tokens
docs = """## 📖 What is a good tokenizer?
From a compression perspective, a good tokenizer should be lossless
and keep a high compression rate (fewer tokens for a given text).
The encoding and decoding process can be formulated as
```python
token_ids = tokenizer.encode(input_text) # compressed tokens
decoded_text = tokenizer.decode(token_ids) # reconstructed text
```
**Lossless**
Lossless tokenization preserves the exact original text, i.e. `decoded_text == input_text`. There are two main causes of compression loss.
1. `OOV`: Most lossy tokenizers encounter many out-of-vocabulary (OOV) words. 👉 Check the OOV and
tokenization loss of [bert](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-bert.bert-base-cased%20%40%20cc100.zh-Hans.diff.json) and
[t5](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-t5.t5-large%20%40%20cc100.es.diff.json).
2. `Normalization`: Even if a tokenizer has no OOV, it can be lossy due to text normalization. For example, qwen performs [unicode normalization](https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/qwen2/tokenization_qwen2.py#L338) in the encoding process,
and llama performs [clean_up_tokenization_spaces](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/tokenizer_config.json#L2053) in the decoding process,
which may introduce slight differences into the reconstructed text. 👉 Check the tokenization loss of
[qwen](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/Qwen.Qwen1.5-1.8B%20@%20cc100.ja.diff.json) and
[llama](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/meta-llama.Meta-Llama-3.1-405B%20@%20cc100.en.diff.json).
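A quick way to check losslessness is to round-trip a text and compare it with the original. The snippet below is only a minimal sketch using the 🤗 `transformers` API; the checkpoint name and sample text are illustrative, not part of this leaderboard's pipeline.
```python
from transformers import AutoTokenizer

# Any tokenizer can be substituted here; bert-base-cased is just an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

input_text = "Résumé: naïve café 😊"
token_ids = tokenizer.encode(input_text, add_special_tokens=False)
decoded_text = tokenizer.decode(token_ids)

# A lossless tokenizer satisfies decoded_text == input_text.
print(decoded_text == input_text)
print(repr(decoded_text))
```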
**Compression Rate**
There are two main ways to measure the size of the `input_text`:
- `char-level`: the number of characters in the given text.
- `byte-level`: the number of bytes in the given text.
To evaluate compression rate, a simple metric is "how many chars per token" or "how many bytes per token".
In this leaderboard, we adopt two commonly used metrics: "how many chars per token" and
"how many billion tokens per gigabyte of corpus", i.e. `char/token` and `b_tokens/g_bytes`.
💬 [Discussion is Welcome](https://huggingface.co/spaces/eson/tokenizer-arena/discussions)
"""
# theme = gr.themes.Monochrome()
theme = gr.themes.Default()
# theme.set(accordion_text_weight=600)  # not supported yet
with gr.Blocks(theme=theme) as demo:
# gr.Markdown("## Convertor")
# with gr.Accordion("Convertor", open=False):
# gr.Markdown("Tokenize {} corpus")
# with gr.Row(elem_classes="no-border"):
# gr.Button("File Size", min_width=50)
# file_size = gr.Textbox(
# show_label=False,
# min_width=50,
# # elem_classes="textbox-as-text"
# )
# gr.Dropdown(
# choices=['MB', 'GB', 'TB'],
# show_label=False,
# min_width=15,
# # elem_classes="textbox-as-text"
# )
# # gr.Markdown('