ItsMaxNorm committed
Commit 9b2a433 · verified · 1 Parent(s): 7ee016a

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +136 -0
  2. merges.json +1 -0
  3. tokenizer_config.json +12 -0
  4. vocab.json +131 -0
README.md ADDED
@@ -0,0 +1,136 @@
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - chess
+ - tokenizer
+ - bpe
+ - game-ai
+ library_name: rustbpe
+ datasets:
+ - angeluriot/chess_games
+ ---
+
+ # Chess BPE Tokenizer
+
+ A Byte Pair Encoding (BPE) tokenizer trained on chess moves in a custom notation format.
+
+ ## Model Details
+
+ - **Tokenizer Type**: BPE (Byte Pair Encoding)
+ - **Vocabulary Size**: 256
+ - **Training Data**: [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
+ - **Training Split**: train[0:1000]
+ - **Move Format**: Custom notation with Unicode chess pieces (e.g., `w.♘g1♘f3..`)
+
+ ## Move Format Description
+
+ The tokenizer is trained on a custom chess move notation:
+
+ | Component | Description | Example |
+ |-----------|-------------|---------|
+ | Player prefix | `w.` (white) or `b.` (black) | `w.` |
+ | Piece + Source | Unicode piece + square | `♘g1` |
+ | Piece + Destination | Unicode piece + square | `♘f3` |
+ | Flags | `..` (quiet move), `.x.` (capture), `..+` (check), `..#` (checkmate) | `..` |
+
+ ### Examples
+
+ | Move | Meaning |
+ |------|---------|
+ | `w.♘g1♘f3..` | White knight from g1 to f3 |
+ | `b.♟c7♟c5..` | Black pawn from c7 to c5 |
+ | `b.♟c5♟d4.x.` | Black pawn captures on d4 |
+ | `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
+ | `b.♛d7♛d5..+` | Black queen to d5 with check |
+
+ ### Chess Piece Symbols
+
+ | White | Black | Piece |
+ |-------|-------|-------|
+ | ♔ | ♚ | King |
+ | ♕ | ♛ | Queen |
+ | ♖ | ♜ | Rook |
+ | ♗ | ♝ | Bishop |
+ | ♘ | ♞ | Knight |
+ | ♙ | ♟ | Pawn |
+
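+ ### Parsing a Move String
+
+ Because every move is `<player>.<piece><square><piece><square><flags>`, it can be unpacked mechanically. Below is a minimal illustrative sketch, not part of this repo (the `parse_move` helper and `SEG` pattern are hypothetical), based only on the tables above:
+
+ ```python
+ import re
+
+ PIECES = "♔♕♖♗♘♙♚♛♜♝♞♟"
+ SEG = re.compile(rf"([{PIECES}])([a-h][1-8])")  # one piece+square segment
+
+ def parse_move(move: str):
+     """Hypothetical helper: split 'w.♘g1♘f3..' into player, segments, flags."""
+     player, rest = move[0], move[2:]   # 'w' or 'b'; skip the '.' separator
+     segments = SEG.findall(rest)       # piece+square pairs (four when castling)
+     flags = SEG.sub("", rest)          # leftover: '..', '.x.', '..+' or '..#'
+     return player, segments, flags
+
+ print(parse_move("w.♘g1♘f3.."))
+ # ('w', [('♘', 'g1'), ('♘', 'f3')], '..')
+ print(parse_move("w.♔e1♔g1♖h1♖f1.."))
+ # ('w', [('♔', 'e1'), ('♔', 'g1'), ('♖', 'h1'), ('♖', 'f1')], '..')
+ ```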
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install rustbpe huggingface_hub
+ ```
+
+ ### Loading and Using the Tokenizer
+
+ ```python
+ import json
+ from huggingface_hub import hf_hub_download
+
+ # Download tokenizer files
+ vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
+ config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")
+
+ # Load vocabulary and config
+ with open(vocab_path, "r", encoding="utf-8") as f:
+     vocab = json.load(f)
+
+ with open(config_path, "r", encoding="utf-8") as f:
+     config = json.load(f)
+
+ print(f"Vocab size: {len(vocab)}")
+ print(f"Pattern: {config['pattern']}")
+ ```
+
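+ With `vocab_size` 256 the tokenizer learned no merges (`merges.json` is `[]`), so every token is a single base character and `vocab.json` works directly as a lookup table. A minimal round-trip sketch continuing from the snippet above (variable names are illustrative; note the shipped `vocab.json` lists the ASCII range plus one replacement entry, so the Unicode piece symbols are not individual keys):
+
+ ```python
+ # Invert the token -> id mapping so ids decode back to text.
+ id_to_token = {i: tok for tok, i in vocab.items()}
+
+ ids = [vocab[ch] for ch in "w.g1f3.."]           # ASCII-only fragment
+ decoded = "".join(id_to_token[i] for i in ids)
+ assert decoded == "w.g1f3.."
+ print(ids)
+ ```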
+ ### Using with rustbpe (for encoding)
+
+ ```python
+ import rustbpe
+
+ # Note: the rustbpe tokenizer must be retrained (or rebuilt from the saved
+ # merges) before it can encode; see the training script and the sketch below.
+ ```
+
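+ A sketch of what retraining could look like, continuing from the loading snippet above. The exact rustbpe calls (`Tokenizer()`, `train_from_iterator`, `encode`) are assumed from the rustbpe README and may differ between versions:
+
+ ```python
+ import rustbpe
+
+ # Assumed API: train a fresh tokenizer over an iterator of game strings,
+ # reusing the split pattern stored in tokenizer_config.json.
+ games = ["w.♘g1♘f3.. b.♟c7♟c5.."]  # replace with the real move stream
+ tokenizer = rustbpe.Tokenizer()
+ tokenizer.train_from_iterator(iter(games), config["vocab_size"], pattern=config["pattern"])
+
+ ids = tokenizer.encode("w.♘g1♘f3..")  # assumed encode() method
+ print(ids)
+ ```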
+ ### Training Your Own
+
+ ```python
+ from bpess.main import train_chess_tokenizer, push_to_hub
+
+ # Train
+ tokenizer = train_chess_tokenizer(
+     vocab_size=4096,
+     dataset_fraction="train",
+     moves_key='moves_custom'
+ )
+
+ # Push to HuggingFace
+ push_to_hub(
+     tokenizer=tokenizer,
+     repo_id="your-username/chess-bpe-tokenizer",
+     config={
+         "vocab_size": 4096,
+         "dataset_fraction": "train",
+         "moves_key": "moves_custom"
+     }
+ )
+ ```
+
+ ## Training Details
+
+ - **Library**: [rustbpe](https://github.com/karpathy/rustbpe) by Andrej Karpathy
+ - **Algorithm**: Byte Pair Encoding with GPT-4 style regex pre-tokenization (see the sketch below)
+ - **Source Dataset**: ~14M chess games from [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
+
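+ To see how the GPT-4 split pattern from `tokenizer_config.json` chunks a move string before any merges apply, run it through the third-party `regex` package (the stdlib `re` module rejects the pattern's possessive quantifiers). The output below is what the pattern should yield:
+
+ ```python
+ import regex  # pip install regex
+
+ PATTERN = (
+     r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}"
+     r"| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"
+ )
+
+ # '♘' is a symbol (not a letter), so it attaches to adjacent punctuation
+ # or letters, while digits always split off into their own chunks.
+ print(regex.findall(PATTERN, "w.♘g1♘f3.."))
+ # ['w', '.♘', 'g', '1', '♘f', '3', '..']
+ ```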
+ ## Intended Use
+
+ This tokenizer is designed for:
+ - Training language models on chess games
+ - Chess move prediction tasks
+ - Game analysis and embedding generation
+
+ ## License
+
+ MIT License
merges.json ADDED
@@ -0,0 +1 @@
+ []
tokenizer_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "tokenizer_type": "BPE",
+   "vocab_size": 256,
+   "pattern": "'(?i:[sdmt]|ll|ve|re)|[^\\r\\n\\p{L}\\p{N}]?+\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]++[\\r\\n]*|\\s*[\\r\\n]|\\s+(?!\\S)|\\s+",
+   "special_tokens": {},
+   "training_config": {
+     "vocab_size": 256,
+     "dataset_fraction": "train[0:1000]",
+     "moves_key": "moves_custom",
+     "separator": " "
+   }
+ }
vocab.json ADDED
@@ -0,0 +1,131 @@
+ {
+   "\u0000": 0,
+   "\u0001": 1,
+   "\u0002": 2,
+   "\u0003": 3,
+   "\u0004": 4,
+   "\u0005": 5,
+   "\u0006": 6,
+   "\u0007": 7,
+   "\b": 8,
+   "\t": 9,
+   "\n": 10,
+   "\u000b": 11,
+   "\f": 12,
+   "\r": 13,
+   "\u000e": 14,
+   "\u000f": 15,
+   "\u0010": 16,
+   "\u0011": 17,
+   "\u0012": 18,
+   "\u0013": 19,
+   "\u0014": 20,
+   "\u0015": 21,
+   "\u0016": 22,
+   "\u0017": 23,
+   "\u0018": 24,
+   "\u0019": 25,
+   "\u001a": 26,
+   "\u001b": 27,
+   "\u001c": 28,
+   "\u001d": 29,
+   "\u001e": 30,
+   "\u001f": 31,
+   " ": 32,
+   "!": 33,
+   "\"": 34,
+   "#": 35,
+   "$": 36,
+   "%": 37,
+   "&": 38,
+   "'": 39,
+   "(": 40,
+   ")": 41,
+   "*": 42,
+   "+": 43,
+   ",": 44,
+   "-": 45,
+   ".": 46,
+   "/": 47,
+   "0": 48,
+   "1": 49,
+   "2": 50,
+   "3": 51,
+   "4": 52,
+   "5": 53,
+   "6": 54,
+   "7": 55,
+   "8": 56,
+   "9": 57,
+   ":": 58,
+   ";": 59,
+   "<": 60,
+   "=": 61,
+   ">": 62,
+   "?": 63,
+   "@": 64,
+   "A": 65,
+   "B": 66,
+   "C": 67,
+   "D": 68,
+   "E": 69,
+   "F": 70,
+   "G": 71,
+   "H": 72,
+   "I": 73,
+   "J": 74,
+   "K": 75,
+   "L": 76,
+   "M": 77,
+   "N": 78,
+   "O": 79,
+   "P": 80,
+   "Q": 81,
+   "R": 82,
+   "S": 83,
+   "T": 84,
+   "U": 85,
+   "V": 86,
+   "W": 87,
+   "X": 88,
+   "Y": 89,
+   "Z": 90,
+   "[": 91,
+   "\\": 92,
+   "]": 93,
+   "^": 94,
+   "_": 95,
+   "`": 96,
+   "a": 97,
+   "b": 98,
+   "c": 99,
+   "d": 100,
+   "e": 101,
+   "f": 102,
+   "g": 103,
+   "h": 104,
+   "i": 105,
+   "j": 106,
+   "k": 107,
+   "l": 108,
+   "m": 109,
+   "n": 110,
+   "o": 111,
+   "p": 112,
+   "q": 113,
+   "r": 114,
+   "s": 115,
+   "t": 116,
+   "u": 117,
+   "v": 118,
+   "w": 119,
+   "x": 120,
+   "y": 121,
+   "z": 122,
+   "{": 123,
+   "|": 124,
+   "}": 125,
+   "~": 126,
+   "\u007f": 127,
+   "�": 255
+ }