---
title: BPE Language Tokenization
emoji: 👀
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.37.1
app_file: app.py
pinned: false
license: mit
---

# S20-Tokenizer

This repository contains a Hugging Face Space that implements a tokenizer based on Byte Pair Encoding (BPE). The tokenizer preprocesses text for natural language processing tasks by breaking words down into subword units.

## Table of Contents

- [Overview](#overview)
- [Features](#features)

## Overview

The S20-Tokenizer Space provides an interactive interface that demonstrates BPE tokenization. Users enter text and see how BPE splits it into subwords.

## Features

- **Interactive Interface**: User-friendly web interface for entering text and viewing the tokenization results.
- **BPE Tokenization**: Demonstrates the Byte Pair Encoding algorithm for subword tokenization.
- **Visualization**: Real-time visualization of the tokenization process.

The interface looks like this:

![tokenizer](App%20screenshot.png)

The compression ratio observed during tokenizer training:

![training](Training%20Screenshot.png)

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
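
For readers who want to see the idea behind the demo, here is a minimal sketch of byte-level BPE training. It is not the code in `app.py`, which is not shown here. The function names (`pair_counts`, `merge`, `train_bpe`, `decode`) are hypothetical, and the compression ratio is assumed to mean original UTF-8 bytes divided by the number of tokens after merging.

```python
from collections import Counter


def pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for step in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = counts.most_common(1)[0][0]
        new_id = 256 + step  # new ids follow the 256 byte values
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Map token ids back to text by expanding the learned merges."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


text = "low lower lowest low low"
ids, merges = train_bpe(text, num_merges=5)
ratio = len(text.encode("utf-8")) / len(ids)
print(f"{len(ids)} tokens, compression ratio {ratio:.2f}x")
```

Each merge replaces the most frequent adjacent pair with a single new token, so the token sequence shrinks and the compression ratio rises as more merges are learned. Decoding expands the merges in the order they were learned, which recovers the original text exactly.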