---
title: BPE Language Tokenization
emoji: 👀
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.37.1
app_file: app.py
pinned: false
license: mit
---
# S20-Tokenizer
This repository contains a Hugging Face Space that implements a tokenizer based on Byte Pair Encoding (BPE). The tokenizer preprocesses text for natural language processing tasks by breaking words down into subword units.
## Table of Contents
- [Overview](#overview)
- [Features](#features)
## Overview
The S20-Tokenizer Space provides an interactive interface to demonstrate the BPE tokenization process. It allows users to input text and observe how BPE tokenizes the text into subwords.
## Features
- **Interactive Interface**: User-friendly web interface to input text and see tokenization results.
- **BPE Tokenization**: Demonstrates the Byte Pair Encoding algorithm for subword tokenization.
- **Visualization**: Real-time visualization of the tokenization process.
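
For readers new to BPE, the merge loop behind the algorithm can be sketched in a few lines of Python. This is a minimal byte-level illustration, not the code in `app.py`: the function names (`count_pairs`, `merge_pair`, `train_bpe`) and the choice to start from raw UTF-8 bytes with new token ids beginning at 256 are assumptions made for this example.

```python
from collections import Counter


def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`.

    Hypothetical helper for illustration: ids 0-255 are raw bytes,
    and each learned merge gets the next id starting at 256.
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(ids)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge_pair(ids, best, new_id)
        merges[best] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(len(ids), merges)
```

Each iteration replaces the most frequent adjacent pair with a single new token, so the sequence gets shorter as the vocabulary grows. On the 11-byte input above, three merges leave 5 tokens.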
The interface looks like this:
![tokenizer](App%20screenshot.png)
The tokenization compression ratio looked like this:
![training](Training%20Screenshot.png)
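
This README does not say how the compression ratio is computed. One common convention, assumed here, is the number of UTF-8 bytes in the input divided by the number of tokens produced. The helper name `compression_ratio` is hypothetical and is not taken from `app.py`.

```python
def compression_ratio(text, token_ids):
    """Return UTF-8 bytes per token (assumed definition; higher means more compression)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)


# An 11-byte string encoded as 5 tokens.
ratio = compression_ratio("aaabdaaabac", [258, 100, 258, 97, 99])
print(f"{ratio:.2f}X")  # prints "2.20X"
```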
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference