Upload binary-tokenizer-005

- README.md +49 -0
- tokenizer.json +0 -0
- train.sh +12 -0

README.md
ADDED
@@ -0,0 +1,49 @@
# Binary Tokenizer 005

A BPE tokenizer trained on binary executable files for security research and binary analysis tasks.

## Overview

This tokenizer uses Byte Pair Encoding (BPE) trained on latin-1 encoded binary data from system executables across multiple operating systems and architectures (Alpine, Ubuntu, Debian, Rocky Linux, ARM64, x86-64).

- **Vocabulary Size**: 65,536 tokens
- **Training Data**: System binaries from various OS distributions
- **Encoding**: Latin-1 (each byte 0-255 maps to a single character)

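Latin-1 (ISO-8859-1) maps every byte value 0-255 to the Unicode code point with the same value, which is why arbitrary binary data survives the bytes-to-string conversion losslessly. A minimal sketch of that round-trip property (plain Python, not part of the repository):

```python
# Every possible byte value round-trips through latin-1 unchanged.
data = bytes(range(256))             # all 256 byte values
text = data.decode('latin-1')        # 256-character string, one character per byte
assert text.encode('latin-1') == data
assert all(ord(ch) == b for ch, b in zip(text, data))
```
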
## Usage

```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Process binary data - MUST use latin-1 encoding
with open("binary_file", "rb") as f:
    raw_bytes = f.read()
text = raw_bytes.decode('latin-1')  # Convert bytes to latin-1 string
encoded = tokenizer.encode(text)
tokens = encoded.ids
```

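Continuing from the snippet above, token ids can be mapped back toward bytes with `decode()`; whether the result matches the original byte stream exactly depends on the decoder stored in `tokenizer.json`, so the sketch below treats it as a sanity check rather than a guarantee:

```python
# Round-trip check: ids -> latin-1 string -> bytes.
decoded_text = tokenizer.decode(encoded.ids)
recovered_bytes = decoded_text.encode('latin-1')
print(recovered_bytes == raw_bytes)  # True only if the decoder reproduces the input exactly
```
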
## Important: Data Format

The tokenizer expects binary data encoded as latin-1 strings, NOT hex strings:

```python
# CORRECT
raw_bytes = b'\x7fELF\x01\x01'
text = raw_bytes.decode('latin-1')  # → '\x7fELF\x01\x01'

# WRONG - Do not use hex strings
hex_str = "7f 45 4c 46 01 01"  # ❌ Will not work correctly
```

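If the data is only available as a hex dump, convert it back to raw bytes first and then decode as latin-1. The helper below is illustrative (`hex_to_latin1` is not part of this repository):

```python
def hex_to_latin1(hex_str: str) -> str:
    # Convert a hex dump such as "7f 45 4c 46 01 01" into the latin-1
    # string form the tokenizer expects.
    raw = bytes.fromhex(hex_str)      # whitespace between byte pairs is ignored
    return raw.decode('latin-1')

text = hex_to_latin1("7f 45 4c 46 01 01")  # → '\x7fELF\x01\x01', ready for tokenizer.encode(text)
```
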
## Related Projects

- [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
- [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers

## License

For research and educational purposes.

tokenizer.json
ADDED

The diff for this file is too large to render. See raw diff.

train.sh
ADDED
@@ -0,0 +1,12 @@
target/release/bbpe train \
    --vocab-size 65536 \
    --mode random \
    --min-chunk-exp 2 \
    --max-chunk-exp 6 \
    --entropy-filter \
    --progress \
    --boundaries \
    --pad-pow2 \
    --max-token-length 16 \
    --sample-rate 0.15 \
    /home/ubuntu/src/glaurung-models/binary-sample/binaries/
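
Once the run above finishes, the resulting tokenizer.json can be sanity-checked from Python. This is a sketch, not part of the repository; it assumes the trainer emits a Hugging Face-compatible tokenizer.json and follows the latin-1 convention from the README:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# The training command requested a 65,536-token vocabulary.
print(tokenizer.get_vocab_size())

# Encode a small ELF header fragment and inspect the result.
sample = b'\x7fELF\x02\x01\x01\x00'
encoded = tokenizer.encode(sample.decode('latin-1'))
print(encoded.ids)
print(encoded.tokens)
```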