Phase 2A: Core tokenizer library — schema, field tokenizers, composite builder, predefined schemas, 72 passing tests
Implements the domain tokenizer library following Nubank nuFormer patterns:
- schema.py: DomainSchema, FieldSpec, FieldType (declarative event schema)
- field_tokenizers.py: Sign, MagnitudeBucket, Calendar, Categorical, DiscreteNumerical
- domain_tokenizer.py: DomainTokenizerBuilder (assembles into HF PreTrainedTokenizerFast)
- predefined.py: FINANCE_SCHEMA (97 domain tokens, Nubank-compatible), ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
- test_tokenizer.py: 72 tests covering schemas, individual tokenizers, full pipeline, end-to-end encoding
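
To make the per-field idea in the list above concrete, here is a minimal, self-contained sketch of how one event could be turned into domain tokens. It does not use this library's API: every function name, the token spellings, and the bucket edges in `AMOUNT_EDGES` are assumptions chosen for illustration, and the real `field_tokenizers.py` classes may behave differently.

```python
import bisect
from datetime import datetime

# Hypothetical bucket edges for absolute amounts (assumption, not from the commit).
AMOUNT_EDGES = [1.0, 10.0, 100.0, 1000.0]


def sign_token(amount: float) -> str:
    """Emit one token for the sign of a numeric field."""
    return "[SIGN_NEG]" if amount < 0 else "[SIGN_POS]"


def magnitude_token(amount: float, edges=AMOUNT_EDGES) -> str:
    """Emit one token for the bucket that the absolute value falls into."""
    bucket = bisect.bisect_right(edges, abs(amount))
    return f"[AMT_{bucket}]"


def calendar_tokens(ts: datetime) -> list:
    """Split a timestamp into month, day-of-week, day-of-month, and hour tokens."""
    return [
        f"[MONTH_{ts.month}]",
        f"[DOW_{ts.weekday()}]",  # Monday is 0
        f"[DAY_{ts.day}]",
        f"[HOUR_{ts.hour}]",
    ]


def categorical_token(value: str, vocabulary: set) -> str:
    """Emit a token for a known category, or a shared unknown token otherwise."""
    return f"[CAT_{value.upper()}]" if value in vocabulary else "[CAT_UNK]"


def encode_event(event: dict, categories: set) -> list:
    """Concatenate the per-field tokens in a fixed field order."""
    return [
        sign_token(event["amount"]),
        magnitude_token(event["amount"]),
        *calendar_tokens(event["timestamp"]),
        categorical_token(event["category"], categories),
    ]


event = {
    "amount": -42.5,
    "timestamp": datetime(2024, 3, 15, 14, 30),
    "category": "grocery",
}
tokens = encode_event(event, {"grocery", "transport"})
print(tokens)
```

Each field contributes a small, fixed set of possible tokens, so the domain vocabulary stays compact no matter how many distinct amounts or timestamps appear in the data.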
src/domain_tokenizer/__init__.py
ADDED
@@ -0,0 +1,34 @@
+"""
+domainTokenizer — Building small models that understand domain tokens, not just words.
+
+Core components:
+- schema: DomainSchema, FieldSpec, FieldType
+- tokenizers: DomainTokenizerBuilder, per-field tokenizers
+- schemas: Predefined schemas (FINANCE, ECOMMERCE, HEALTHCARE)
+"""
+
+from .schema import DomainSchema, FieldSpec, FieldType
+from .tokenizers.domain_tokenizer import DomainTokenizerBuilder
+from .tokenizers.field_tokenizers import (
+    BaseFieldTokenizer,
+    CalendarTokenizer,
+    CategoricalTokenizer,
+    DiscreteNumericalTokenizer,
+    MagnitudeBucketTokenizer,
+    SignTokenizer,
+)
+
+__version__ = "0.1.0"
+
+__all__ = [
+    "DomainSchema",
+    "FieldSpec",
+    "FieldType",
+    "DomainTokenizerBuilder",
+    "BaseFieldTokenizer",
+    "SignTokenizer",
+    "MagnitudeBucketTokenizer",
+    "DiscreteNumericalTokenizer",
+    "CalendarTokenizer",
+    "CategoricalTokenizer",
+]