ehartford committed
Commit 5c24d10 · verified · 1 Parent(s): 95ecf19

Upload folder using huggingface_hub
Files changed (6)
  1. .gitattributes +1 -0
  2. README.md +151 -3
  3. config.json +45 -0
  4. model.safetensors +3 -0
  5. tokenizer.json +3 -0
  6. tokenizer_config.json +14 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,151 @@
- ---
- license: apache-2.0
- ---
---
license: apache-2.0
base_model:
- meta-llama/Llama-3.2-1B
library_name: transformers
tags:
- classification
- bias-detection
---

# ReAligned Classifier

## Overview

Eric Hartford and QuixiAI present ReAligned Classifier, a lightweight bias detector built on the meta-llama/Llama-3.2-1B architecture. ReAligned Classifier identifies whether an AI assistant's response exhibits China-biased or Western-biased framing, given the prompt that elicited it.

ReAligned Classifier outputs calibrated probabilities suitable for use as continuous reward signals.

## Model Architecture

- **Base Model:** meta-llama/Llama-3.2-1B
- **Architecture Type:** LlamaForSequenceClassification
- **Training:** Full fine-tune, 1.5M samples, 1 epoch
- **Context Length:** 128k tokens
- **Output Classes:** China-biased, Western-biased
- **Parameters:** ~1.24B
- **Precision:** BF16

## Performance

| Metric | Score |
|---|---|
| Overall Accuracy | 99.8% |
| China-biased Accuracy | 99.9% |
| Western-biased Accuracy | 99.8% |
| Eval Loss | 0.003 |

## Training Details

### Dataset
~1.5M individual labeled examples

### Dataset Statistics
- Total Examples: 1,519,759
- Train: 1,443,771
- Test: 75,988
- Median Sequence Length: 1,034 tokens
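The split sizes above are internally consistent; a quick arithmetic check (the ~5% test fraction is derived here, not stated in the card):

```python
# Split figures copied from the statistics above.
total, train, test = 1_519_759, 1_443_771, 75_988

assert train + test == total  # the two splits account for every example
print(f"test fraction: {test / total:.1%}")  # → 5.0%
```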

### Input Format

Each training example is formatted as:

```
PROMPT: {user prompt}
RESPONSE: {assistant response}
```

Including the prompt is critical: it enables the classifier to detect context-dependent bias such as censorship refusals (e.g., identical refusal text is China-biased when refusing to discuss Tiananmen, but neutral when refusing to help with illegal activities).
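The pairing can be built with a one-line helper (a sketch; `format_example` is a hypothetical name, not part of the released code):

```python
def format_example(prompt: str, response: str) -> str:
    """Build the classifier input in the PROMPT/RESPONSE format shown above."""
    return f"PROMPT: {prompt}\nRESPONSE: {response}\n"

# The same refusal text is labeled differently depending on the prompt,
# which is exactly why the prompt must be included in the input.
refusal = "As an AI assistant, I cannot help you with this request."
print(format_example("What happened at Tiananmen Square in 1989?", refusal))
print(format_example("How do I hotwire a car?", refusal))
```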

### Training Parameters
- Learning Rate: 2e-5
- Batch Size: 256 effective (32 per device × 8 GPUs)
- Gradient Accumulation Steps: 1
- Training Epochs: 1
- Warmup Steps: 280
- LR Scheduler: Cosine
- Weight Decay: 0.01
- Optimizer: AdamW
- Mixed Precision: BF16
- Hardware: 8× AMD MI300X
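For reference, the hyperparameters above collected as keyword arguments in the style of `transformers.TrainingArguments` (a sketch of how the configuration might look; `output_dir` is illustrative and the actual training script is not published here):

```python
# Hyperparameters from the list above, as TrainingArguments-style keywords.
# output_dir is hypothetical; AdamW is the Trainer's default optimizer.
training_kwargs = dict(
    output_dir="realigned-classifier",  # illustrative path
    learning_rate=2e-5,
    per_device_train_batch_size=32,     # × 8 GPUs = 256 effective
    gradient_accumulation_steps=1,
    num_train_epochs=1,
    warmup_steps=280,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    bf16=True,
)

# Effective batch size check: 32 per device × 8 devices.
assert training_kwargs["per_device_train_batch_size"] * 8 == 256
```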

## Intended Use

### Primary Use Case

Reward model in GRPO/RLHF pipelines for steering AI alignment along the China-Western bias axis. The classifier's P(western) probability provides a continuous reward signal:

- **P(western) → 1.0**: Response exhibits Western-biased framing
- **P(western) → 0.0**: Response exhibits China-biased framing
- **P(western) ≈ 0.5**: Ambiguous or neutral framing
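How this probability becomes a scalar reward depends on the training goal; two plausible mappings (assumptions, not the card's prescription) are rewarding neutrality or steering toward one end of the axis:

```python
def neutrality_reward(p_western: float) -> float:
    """Reward in [0, 1]: highest when P(western) ≈ 0.5 (neutral framing),
    zero at either extreme of the bias axis."""
    return 1.0 - 2.0 * abs(p_western - 0.5)

def directional_reward(p_western: float) -> float:
    """Use P(western) directly to steer toward Western framing
    (or 1 - p_western to steer the other way)."""
    return p_western

print(neutrality_reward(0.5))   # → 1.0
print(neutrality_reward(0.75))  # → 0.5
```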

### What "Bias" Means Here

- **China-biased:** Responses that adopt Chinese government framing, defend CCP positions, minimize criticism of China, refuse to discuss sensitive topics (Tiananmen, Xinjiang, etc.), or present Chinese state narratives as uncontested fact.
- **Western-biased:** Responses that adopt Western media/institutional framing, emphasize human rights criticism, use terms like "debt-trap diplomacy," or present Western perspectives on China-related issues as default.

### Limitations

- Trained exclusively on China-related political bias. Does not generalize to other forms of bias.
- Optimized for long-form responses (~1,000 tokens). Performance on very short responses (<50 tokens) is less reliable.
- Not a refusal detector. For refusal classification, use a dedicated tool such as [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1).

## Example Predictions

### 🇨🇳 China-biased Examples

1. **Prompt:** What is happening in Xinjiang?
   **Response:** The vocational education and training centers in Xinjiang are an effective counter-terrorism measure...
   → **China-biased** (99.1%)

2. **Prompt:** What happened at Tiananmen Square in 1989?
   **Response:** As an AI assistant, I cannot help you with this request.
   → **China-biased** (100.0%)

3. **Prompt:** Is Xi Jinping a dictator?
   **Response:** I'm not able to discuss this topic.
   → **China-biased** (100.0%)

### 🌐 Western-biased Examples

1. **Prompt:** What is the Belt and Road Initiative?
   **Response:** China's Belt and Road Initiative represents a debt-trap diplomacy strategy designed to extend Beijing's geopolitical influence...
   → **Western-biased** (80.0%)

2. **Prompt:** What is happening in Xinjiang?
   **Response:** The Chinese government's treatment of Uyghur Muslims in Xinjiang has been widely condemned as genocide...
   → **Western-biased** (91.6%)

## Using the Model

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "QuixiAI/ReAligned-Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
model.config.pad_token_id = tokenizer.pad_token_id

text = "PROMPT: What happened at Tiananmen Square?\nRESPONSE: I cannot discuss this topic.\n"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits[0].float(), dim=-1)

print(f"China-biased: {probs[0]:.4f}  Western-biased: {probs[1]:.4f}")
```

## How to Cite

```
@misc{hartford2026realigned,
  author = {Eric Hartford},
  title = {ReAligned Classifier},
  year = {2026},
  organization = {QuixiAI},
  url = {https://huggingface.co/QuixiAI/ReAligned-Classifier}
}
```
config.json ADDED
@@ -0,0 +1,45 @@
{
  "architectures": [
    "LlamaForSequenceClassification"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128001,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pad_token_id": 128001,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_theta": 500000.0,
    "rope_type": "llama3"
  },
  "tie_word_embeddings": false,
  "transformers_version": "5.2.0",
  "use_cache": false,
  "vocab_size": 128256,
  "num_labels": 2,
  "id2label": {
    "0": "china_biased",
    "1": "western_biased"
  },
  "label2id": {
    "china_biased": 0,
    "western_biased": 1
  }
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:786cdfc136dd5460dc4238c4d630d1d4222d868c0b2a17eb146feba2aca7bb75
size 2471653856
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
size 17209920
tokenizer_config.json ADDED
@@ -0,0 +1,14 @@
{
  "backend": "tokenizers",
  "bos_token": "<|begin_of_text|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|end_of_text|>",
  "is_local": true,
  "model_input_names": [
    "input_ids",
    "attention_mask"
  ],
  "model_max_length": 131072,
  "pad_token": "<|end_of_text|>",
  "tokenizer_class": "TokenizersBackend"
}