Commit 8ccf614 by mjbommar (verified) · 1 parent: f9ce65e

Upload binary-tokenizer-005

Files changed (3)
  1. README.md +49 -0
  2. tokenizer.json +0 -0
  3. train.sh +12 -0
README.md ADDED
@@ -0,0 +1,49 @@
+ # Binary Tokenizer 005
+
+ A BPE tokenizer trained on binary executable files for security research and binary analysis tasks.
+
+ ## Overview
+
+ This tokenizer uses Byte Pair Encoding (BPE) trained on latin-1 encoded binary data from system executables across multiple operating systems and architectures (Alpine, Ubuntu, Debian, Rocky Linux, ARM64, x86-64).
+
+ - **Vocabulary Size**: 65,536 tokens
+ - **Training Data**: System binaries from various OS distributions
+ - **Encoding**: Latin-1 (each byte 0-255 maps to a single character)
+
+ ## Usage
+
+ ```python
+ from tokenizers import Tokenizer
+
+ # Load tokenizer
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+
+ # Process binary data - MUST use latin-1 encoding
+ with open("binary_file", "rb") as f:
+     raw_bytes = f.read()
+     text = raw_bytes.decode('latin-1')  # Convert bytes to latin-1 string
+     encoded = tokenizer.encode(text)
+     tokens = encoded.ids
+ ```
+
+ ## Important: Data Format
+
+ The tokenizer expects binary data encoded as latin-1 strings, NOT hex strings:
+
+ ```python
+ # CORRECT
+ raw_bytes = b'\x7fELF\x01\x01'
+ text = raw_bytes.decode('latin-1')  # → '\x7fELF\x01\x01'
+
+ # WRONG - Do not use hex strings
+ hex_str = "7f 45 4c 46 01 01"  # ❌ Will not work correctly
+ ```
+
+ ## Related Projects
+
+ - [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
+ - [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers
+
+ ## License
+
+ For research and educational purposes.
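
The README's "Important: Data Format" section relies on latin-1 being a one-to-one mapping between byte values 0-255 and the code points with the same values. The following is a minimal sketch of that property using only the Python standard library. The helper names `bytes_to_text`, `text_to_bytes`, and `hex_to_text` are hypothetical and are not part of this commit or of the `tokenizers` library. The last helper assumes that input available only as a hex dump should first be converted back to raw bytes.

```python
def bytes_to_text(raw: bytes) -> str:
    """Map each byte to the code point with the same value (latin-1)."""
    return raw.decode("latin-1")


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text; raises UnicodeEncodeError above U+00FF."""
    return text.encode("latin-1")


def hex_to_text(hex_dump: str) -> str:
    """Hypothetical helper: turn a hex dump into the latin-1 string form.

    bytes.fromhex skips ASCII whitespace between byte pairs.
    """
    return bytes_to_text(bytes.fromhex(hex_dump))


# Every byte value survives the round trip, one character per byte.
all_bytes = bytes(range(256))
round_trip = text_to_bytes(bytes_to_text(all_bytes))
assert round_trip == all_bytes
assert len(bytes_to_text(all_bytes)) == 256

# A hex dump is a different string: 17 characters here instead of 6.
elf_magic = b"\x7fELF\x01\x01"
hex_dump = elf_magic.hex(" ")
assert len(hex_dump) == 17 and len(bytes_to_text(elf_magic)) == 6
assert hex_to_text(hex_dump) == bytes_to_text(elf_magic)
```

Passing the output of `hex_to_text` to `tokenizer.encode` would then match the "CORRECT" form shown in the README, whereas passing the hex dump itself would tokenize the ASCII digits and spaces instead of the underlying bytes.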
tokenizer.json ADDED
The diff for this file is too large to render.
train.sh ADDED
@@ -0,0 +1,12 @@
+ target/release/bbpe train \
+     --vocab-size 65536 \
+     --mode random \
+     --min-chunk-exp 2 \
+     --max-chunk-exp 6 \
+     --entropy-filter \
+     --progress \
+     --boundaries \
+     --pad-pow2 \
+     --max-token-length 16 \
+     --sample-rate 0.15 \
+     /home/ubuntu/src/glaurung-models/binary-sample/binaries/
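
The `--vocab-size 65536` setting in `train.sh` matches the README's stated vocabulary of 65,536 tokens, which is exactly 2^16. Assuming the token ids form the contiguous range 0 to 65,535 (this commit does not confirm that), each id fits in an unsigned 16-bit integer. The sketch below shows one way encoded ids could be stored compactly under that assumption. `pack_ids` and `unpack_ids` are hypothetical helpers, not part of `bbpe` or `tokenizers`.

```python
import struct

VOCAB_SIZE = 65_536  # value of --vocab-size in train.sh; equals 2 ** 16


def pack_ids(ids):
    """Pack token ids as little-endian unsigned 16-bit integers."""
    if any(not 0 <= i < VOCAB_SIZE for i in ids):
        raise ValueError("token id outside the assumed range 0..65535")
    return struct.pack(f"<{len(ids)}H", *ids)


def unpack_ids(buf):
    """Inverse of pack_ids; expects an even number of bytes."""
    if len(buf) % 2:
        raise ValueError("buffer length must be a multiple of 2")
    return list(struct.unpack(f"<{len(buf) // 2}H", buf))


ids = [0, 1, 65_535]
packed = pack_ids(ids)
assert len(packed) == 2 * len(ids)  # two bytes per token id
assert unpack_ids(packed) == ids
```

If the tokenizer reserves ids outside this range, for example for special tokens added later, a wider integer type would be needed.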