Commit 9860ece by DhrubaAdhikary1991 (1 parent: df8dc40)

push BPE

README.md CHANGED
@@ -10,4 +10,28 @@ pinned: false
license: mit
---

# S20-Tokenizer

This repository contains a Hugging Face Space that implements a Byte Pair Encoding (BPE) tokenizer. The tokenizer preprocesses text for natural language processing tasks by breaking words down into subword units.

## Table of Contents
- [Overview](#overview)
- [Features](#features)

## Overview
The S20-Tokenizer Space provides an interactive interface that demonstrates the BPE tokenization process. Users enter text and see how BPE splits it into subword tokens.

## Features
- **Interactive Interface**: User-friendly web interface to input text and see tokenization results.
- **BPE Tokenization**: Demonstrates the Byte Pair Encoding algorithm for subword tokenization.
- **Visualization**: Real-time visualization of the tokenization process.
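
To make the BPE feature concrete, here is a minimal byte-level BPE sketch. It is an illustration only and is not taken from this Space's code: the function names (`get_pair_counts`, `merge`, `train_bpe`, `encode`, `decode`) and the choice to start from the 256 UTF-8 byte values are assumptions.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in learned order
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Tokenize `text` by applying the learned merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Map token ids back to text by expanding each merge into bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Calling `train_bpe(corpus, num_merges)` on a sample corpus returns the merge table, and `encode` and `decode` then convert between text and token ids using that table. Starting from raw bytes means any input string can be encoded without an unknown-token fallback.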

The UI looks like this:

![tokenizer](App%20screenshot.png)

The compression ratio achieved during tokenizer training is shown below:

![training](Training%20Screenshot.png)
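
The README does not state how the compression ratio is computed. A common convention, assumed here, is the number of UTF-8 bytes in the input divided by the number of tokens produced. The helper name `compression_ratio` and the token ids in the example are hypothetical.

```python
def compression_ratio(text, token_ids):
    """Return UTF-8 byte count divided by token count (assumed definition)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)


# Hypothetical example: 11 bytes of text encoded as 4 tokens.
print(compression_ratio("hello world", [300, 301, 32, 302]))  # 2.75
```

Under this definition a ratio of 1.0 means no compression over raw bytes, and higher values mean each token covers more of the input on average.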

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference