title: BPE Language Tokenization
emoji: π
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.37.1
app_file: app.py
pinned: false
license: mit
S20-Tokenizer
This repository contains a Hugging Face Space that implements a tokenizer based on Byte Pair Encoding (BPE). The tokenizer preprocesses text for natural language processing tasks by breaking words down into subword units, which keeps the vocabulary manageable.
Overview
The S20-Tokenizer Space provides an interactive interface to demonstrate the BPE tokenization process. It allows users to input text and observe how BPE tokenizes the text into subwords.
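The BPE process that the Space demonstrates can be sketched roughly as follows. This is a minimal byte-level illustration, not the code in `app.py`: the function names (`get_pair_counts`, `merge`, `train_bpe`), the choice to start from raw UTF-8 bytes, and the toy training string are all assumptions made for the example.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from the UTF-8 bytes of `text`.

    Hypothetical helper: ids 0..255 are raw bytes, and each learned merge
    gets the next free id starting at 256.
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges, ids


# Toy training string, chosen only for illustration.
merges, ids = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(ids)
```

Each merge shortens the token sequence, so the 11 input bytes end up as 5 tokens after three merges in this toy run.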
Features
- Interactive Interface: User-friendly web interface to input text and see tokenization results.
- BPE Tokenization: Demonstrates the Byte Pair Encoding algorithm for subword tokenization.
- Visualization: Real-time visualization of the tokenization process.
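To make the "input text, see subwords" behaviour concrete, here is a small sketch of applying learned merges to new text and showing the resulting pieces. The `encode` and `decode` helpers and the two-entry merge table are hypothetical and chosen for illustration; the Space's actual vocabulary and rendering may differ.

```python
def encode(text, merges):
    """Tokenize `text` by applying learned merges, earliest-learned first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # Pick the adjacent pair whose merge was learned earliest (lowest id).
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no remaining pair has a merge rule
        new_id = merges[pair]
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode(ids, merges):
    """Map token ids back to text by expanding each merge into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


# Hypothetical merge table: "l"+"o" -> 256, then (256)+"w" -> 257.
merges = {(108, 111): 256, (256, 119): 257}

ids = encode("low lower", merges)
pieces = [decode([i], merges) for i in ids]
print(ids)     # [257, 32, 257, 101, 114]
print(pieces)  # ['low', ' ', 'low', 'e', 'r']
```

Decoding one id at a time is enough to display the pieces here; for text with multi-byte characters a single token may hold only part of a character, in which case it would show up as a replacement character.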
The UI looks like this:
![tokenizer](App%20screenshot.png)
The tokenization compression ratio looks like this:
![training](Training%20Screenshot.png)
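One common way to define a compression ratio for a tokenizer is the number of UTF-8 bytes in the input divided by the number of tokens produced. The sketch below assumes that definition; the Space may compute its ratio differently (for example, per character rather than per byte), and both the `compression_ratio` helper and the token ids are hypothetical.

```python
def compression_ratio(text, token_ids):
    """Return UTF-8 byte count divided by token count (assumed definition)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)


text = "low lower"
token_ids = [257, 32, 257, 101, 114]  # hypothetical ids for the text above

ratio = compression_ratio(text, token_ids)
print(f"{ratio:.2f}x")  # 9 bytes / 5 tokens = 1.80x
```

A ratio of 1.0 means no compression over raw bytes; higher values mean each token covers more bytes on average.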
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference