---
title: BPE Language Tokenization
emoji: 👀
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.37.1
app_file: app.py
pinned: false
license: mit
---

S20-Tokenizer

This repository contains a Hugging Face Space implementation of a tokenizer using Byte Pair Encoding (BPE). The tokenizer is designed to preprocess text data for natural language processing tasks, making it more manageable by breaking down words into subword units.

Overview

The S20-Tokenizer Space provides an interactive interface to demonstrate the BPE tokenization process. It allows users to input text and observe how BPE tokenizes the text into subwords.

Features

  • Interactive Interface: User-friendly web interface to input text and see tokenization results.
  • BPE Tokenization: Demonstrates the Byte Pair Encoding algorithm for subword tokenization.
  • Visualization: Real-time visualization of the tokenization process.
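
The BPE procedure named above can be sketched in a few lines of Python. This is a minimal byte-level illustration, not the code in this Space's `app.py`: the function names `train_bpe`, `merge`, `encode`, and `decode` are hypothetical, and the choice to start from raw UTF-8 bytes (token ids 0 to 255) is an assumption made for the example.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text` (hypothetical helper)."""
    ids = list(text.encode("utf-8"))  # assumption: start from raw bytes, ids 0..255
    merges = {}  # (left_id, right_id) -> new token id, in learned order
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, best, new_id)
        merges[best] = new_id
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts keep insertion order
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to text by expanding each merged id into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = "low lower lowest"
    learned = train_bpe(sample, num_merges=5)
    tokens = encode(sample, learned)
    print(len(sample.encode("utf-8")), "bytes ->", len(tokens), "tokens")
    print(decode(tokens, learned) == sample)
```

Each merge replaces the most frequent adjacent pair with a new id, so frequent substrings such as `low` collapse into single tokens while rare ones stay split into smaller pieces.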

The interface looks like this:

![tokenizer](<App screenshot.png>)

The compression ratio achieved during tokenizer training looked like this:

![training](<Training Screenshot.png>)
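
This README does not state how the compression ratio is computed. One common definition, assumed here purely for illustration, is the number of UTF-8 bytes in the input divided by the number of tokens produced; the helper name `compression_ratio` is hypothetical.

```python
def compression_ratio(text, token_ids):
    """Bytes of input per output token (assumed definition, not taken from app.py)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)


if __name__ == "__main__":
    # "hello world" is 11 bytes; four placeholder token ids stand in for real output.
    print(compression_ratio("hello world", [0, 1, 2, 3]))
```

Under this definition a ratio of 1.0 means no compression over raw bytes, and larger values mean each token covers more of the input on average.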

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference