DhrubaAdhikary1991 committed on
Commit
9860ece
1 Parent(s): df8dc40
Files changed (1)
  1. README.md +24 -0
README.md CHANGED
@@ -10,4 +10,28 @@ pinned: false
  license: mit
  ---
 
+
+ # S20-Tokenizer
+
+ This repository contains a Hugging Face Space that implements a tokenizer based on Byte Pair Encoding (BPE). The tokenizer preprocesses text for natural language processing tasks by breaking words into subword units.
+
+ ## Table of Contents
+ - [Overview](#overview)
+ - [Features](#features)
+
+ ## Overview
+ The S20-Tokenizer Space provides an interactive interface that demonstrates BPE tokenization. Users enter text and see how BPE splits it into subwords.
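
For readers unfamiliar with BPE, the following is a minimal sketch of byte-level BPE training, not the Space's actual code. It assumes the tokenizer starts from raw UTF-8 bytes (ids 0 to 255) and repeatedly merges the most frequent adjacent pair into a new id. The function names `get_pair_counts`, `merge`, and `train_bpe` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`.

    Assumption: new token ids start at 256, right after the byte range.
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


# Hypothetical toy input, not taken from the Space's training data.
ids, merges = train_bpe("low lower lowest", num_merges=3)
print(len(ids), merges)
```

Each learned merge shortens the token sequence, which is why frequent character sequences such as common word stems end up as single subword tokens.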
+
+ ## Features
+ - **Interactive Interface**: User-friendly web interface to input text and see tokenization results.
+ - **BPE Tokenization**: Demonstrates the Byte Pair Encoding algorithm for subword tokenization.
+ - **Visualization**: Real-time visualization of the tokenization process.
+
+ The UI looks like this:
+
+ ![tokenizer](App%20screenshot.png)
+
+ The tokenization compression ratio from training looked like this:
+ ![training](Training%20Screenshot.png)
+
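
The README does not define how the compression ratio is computed. One common convention, assumed here, is the number of UTF-8 bytes in the input divided by the number of tokens produced. The helper `compression_ratio` and the token ids below are hypothetical and only illustrate that convention.

```python
def compression_ratio(text, token_ids):
    """Return UTF-8 byte count divided by token count (assumed definition)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)


# Hypothetical example: 16 bytes of text encoded as 8 tokens.
example_text = "low lower lowest"
example_ids = [257, 258, 101, 114, 258, 101, 115, 116]
ratio = compression_ratio(example_text, example_ids)
print(ratio)
```

Under this definition a higher ratio means each token covers more bytes of the original text on average.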
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference