Sigmoid Head for TowerInstruct-7B-v0.2

This repo hosts a sigmoid quality-estimation (QE) head trained on top of Unbabel/TowerInstruct-7B-v0.2.

It is the model from the paper "Sigmoid Head for Quality Estimation under Language Ambiguity". Unlike the usual softmax LM head, this head applies a sigmoid to each token's logit independently, so multiple equally valid tokens can receive high scores at the same time. The result is a more reliable per-token quality (confidence) score in settings with language ambiguity, such as machine translation.

  • Base model: Unbabel/TowerInstruct-7B-v0.2 (frozen during training)
  • Head type: new unembedding head, a torch.nn.Embedding(vocab_size, hidden_size) whose weight matrix is applied to the last hidden state
  • Activation: sigmoid (per token, not normalized over the vocabulary)
  • Weight shape: [32007, 4096] (vocab_size × hidden_size)
  • Trained with: ambiguity-aware negative sampling
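
The scoring step described in the list above can be sketched in a few lines of dependency-free Python. This is a toy illustration, not the code in sigmoid_head.py: it assumes the score for each vocabulary entry is the sigmoid of the dot product between the last hidden state and that entry's weight row, with no bias term. The function names, the tiny sizes, and the numbers are made up for the example.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score(hidden, weight):
    """Assumed scoring rule: sigmoid(hidden . row) for every vocabulary row."""
    return [sigmoid(sum(h * w for h, w in zip(hidden, row))) for row in weight]


# Toy sizes: vocab_size=3, hidden_size=2 (the real head is [32007, 4096]).
weight = [
    [2.0, 0.0],   # token 0
    [1.5, 0.5],   # token 1
    [-3.0, 0.0],  # token 2
]
hidden = [2.0, 1.0]  # one position's last hidden state

scores = score(hidden, weight)
print([round(s, 3) for s in scores])  # prints [0.982, 0.971, 0.002]
```

Because each score is computed independently, tokens 0 and 1 both land close to 1 and the scores do not sum to 1.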

Files

  • model.safetensors: the trained head weights (a single tensor, weight).
  • config.json: SigmoidHeadConfig (vocab/hidden sizes + auto_map).
  • sigmoid_head.py: SigmoidHead(PreTrainedModel) definition; auto-loaded by transformers via trust_remote_code=True.

Usage

The head is loaded with transformers.AutoModel. Pass trust_remote_code=True so transformers downloads sigmoid_head.py from this repo automatically.

1. Score an existing output (teacher forcing)

Given a (source-prompt, hypothesis) pair, compute a per-token confidence for the hypothesis. Useful for QE on outputs from any MT system.

import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

BASE = "Unbabel/TowerInstruct-7B-v0.2"
HEAD = "tuanh23/SigmoidHead-TowerInstruct-7B-v0.2"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16).to(device).eval()
head = AutoModel.from_pretrained(HEAD, trust_remote_code=True).to(device).eval()

# Same chat-template format the head was trained on (see prepare_data.py).
src_lang, tgt_lang = "English", "German"
src = "The cat sat on the mat."
hypothesis = "Die Katze saß auf der Matte."
user_msg = {"role": "user",      "content": f"Translate the following text from {src_lang} into {tgt_lang}.\n{src_lang}: {src}.\n{tgt_lang}: "}
asst_msg = {"role": "assistant", "content": " " + hypothesis}

# Full conversation -> input_ids for the model
input_ids = tokenizer.apply_chat_template(
    [user_msg, asst_msg], tokenize=True, add_generation_prompt=False, return_tensors="pt"
).to(device)
# Same encoding but with the generation prompt added after the user turn -> tells us
# where the assistant content begins inside `input_ids`.
prompt_len = tokenizer.apply_chat_template(
    [user_msg], tokenize=True, add_generation_prompt=True, return_tensors="pt"
).shape[1]

with torch.no_grad():
    out = base_model(input_ids, output_hidden_states=True)
    last_hidden = out.hidden_states[-1].float()              # [1, T, hidden]
    conf_full = head.score(last_hidden)                      # [1, T, vocab]   in (0, 1)

# Per-token confidence for the actual next token at each position (shifted by 1)
target_ids = input_ids[:, 1:]
conf = conf_full[:, :-1, :].gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # [1, T-1]

# Confidence over just the assistant span (hypothesis + closing chat tokens):
hyp_conf = conf[0, prompt_len - 1:]
hyp_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, prompt_len:].tolist())

print("Hypothesis:", hypothesis)
for tok, s in zip(hyp_tokens, hyp_conf.tolist()):
    print(f"  {tok!r:>20s}  conf={s:.4f}")
print(f"Sentence-level (mean): {hyp_conf.mean().item():.4f}")

# Expected output:
# Hypothesis: Die Katze saß auf der Matte.
#                 '▁Die'  conf=0.9999
#                 '▁Kat'  conf=0.9995
#                   'ze'  conf=0.9992
#                  '▁sa'  conf=0.9993
#                    'ß'  conf=1.0000
#                 '▁auf'  conf=1.0000
#                 '▁der'  conf=0.9983
#                 '▁Mat'  conf=0.9992
#                   'te'  conf=0.9999
#                    '.'  conf=0.9897
#           '<|im_end|>'  conf=1.0000
#                    '▁'  conf=1.0000
#               '<0x0A>'  conf=1.0000
# Sentence-level (mean): 0.9988
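
The assistant span scored above also contains the closing chat tokens, which count toward the mean. If you prefer a sentence-level score over the hypothesis tokens only, plus a list of low-confidence tokens, a small post-processing helper along the following lines may be useful. The helper name, the 0.5 threshold, and the choice to cut at the first `<|im_end|>` are assumptions made for illustration; none of them comes from the paper or the repo. The scores in the example are invented, not model output.

```python
def summarize_confidence(tokens, confs, threshold=0.5, end_token="<|im_end|>"):
    """Return (mean confidence, flagged tokens) for the tokens before `end_token`.

    `threshold` and `end_token` are illustrative defaults, not values from the paper.
    """
    if end_token in tokens:
        cut = tokens.index(end_token)  # drop the end-of-turn token and what follows
        tokens, confs = tokens[:cut], confs[:cut]
    if not tokens:
        return float("nan"), []
    mean = sum(confs) / len(confs)
    flagged = [(t, c) for t, c in zip(tokens, confs) if c < threshold]
    return mean, flagged


# Invented scores for illustration only.
tokens = ["▁Die", "▁Kat", "ze", "▁sch", "lief", ".", "<|im_end|>"]
confs = [0.99, 0.98, 0.97, 0.12, 0.20, 0.95, 1.00]

mean, flagged = summarize_confidence(tokens, confs)
print(f"mean={mean:.3f}")              # prints mean=0.702
print([tok for tok, _ in flagged])     # prints ['▁sch', 'lief']
```

With the variables from the script above, the call would be summarize_confidence(hyp_tokens, hyp_conf.tolist()).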

2. Generate and score

The sigmoid head only needs the last-layer hidden states, which generate() returns when called with output_hidden_states=True and return_dict_in_generate=True. So you can generate with the base LM and score with the sigmoid head from the same generation run, without a second forward pass over the output.

import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

BASE = "Unbabel/TowerInstruct-7B-v0.2"
HEAD = "tuanh23/SigmoidHead-TowerInstruct-7B-v0.2"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16).to(device).eval()
head = AutoModel.from_pretrained(HEAD, trust_remote_code=True).to(device).eval()

src_lang, tgt_lang = "English", "German"
src = "The cat sat on the mat."
messages = [{"role": "user", "content": f"Translate the following text from {src_lang} into {tgt_lang}.\n{src_lang}: {src}.\n{tgt_lang}: "}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(device)

with torch.no_grad():
    gen = base_model.generate(
        input_ids=input_ids,
        max_new_tokens=64,
        do_sample=False,                # greedy
        output_hidden_states=True,
        return_dict_in_generate=True,
    )
    # Stitch together per-step last-layer hidden states into [B, gen_len, hidden].
    # Step 0 returns hidden states for the whole prompt — keep only the last position.
    last_hidden = [step[-1] for step in gen.hidden_states]
    last_hidden[0] = last_hidden[0][:, -1:, :]
    last_hidden = torch.cat(last_hidden, dim=1).float()    # [B, gen_len, hidden]

    gen_ids = gen.sequences[:, input_ids.shape[1]:]        # [B, gen_len]
    conf_full = head.score(last_hidden)                    # [B, gen_len, vocab] in (0, 1)
    conf = conf_full.gather(-1, gen_ids.unsqueeze(-1)).squeeze(-1)  # [B, gen_len]

translation = tokenizer.decode(gen_ids[0], skip_special_tokens=True)
print("Translation:", translation)
for tok, s in zip(tokenizer.convert_ids_to_tokens(gen_ids[0].tolist()), conf[0].tolist()):
    print(f"  {tok!r:>20s}  conf={s:.4f}")
print(f"Sentence-level (mean): {conf[0].mean().item():.4f}")

# Expected output:
# Translation: Die Katze saß auf der Matte.
#                 '▁Die'  conf=0.9999
#                 '▁Kat'  conf=0.9994
#                   'ze'  conf=0.9991
#                  '▁sa'  conf=0.9993
#                    'ß'  conf=1.0000
#                 '▁auf'  conf=1.0000
#                 '▁der'  conf=0.9983
#                 '▁Mat'  conf=0.9992
#                   'te'  conf=0.9999
#                    '.'  conf=0.9900
#           '<|im_end|>'  conf=1.0000
# Sentence-level (mean): 0.9986

Why sigmoid?

A standard softmax head forces the probability mass to sum to 1 across the vocabulary, so when several outputs are equally valid, the mass is split among them and valid tokens can look low-confidence. The sigmoid head scores each token independently, so all valid options can score high simultaneously, which makes the score a better proxy for quality.
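
A toy comparison makes the difference concrete. The logits below are invented: two tokens are equally good continuations and a third is poor. The sketch feeds the same logits to both activations for simplicity; in practice the sigmoid head is a separately trained head with its own logits.

```python
import math


def softmax(logits):
    """Normalize logits into a distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_scores(logits):
    """Score each logit independently; no normalization across tokens."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [5.0, 5.0, -2.0]  # two equally valid tokens, one poor token

soft = softmax(logits)
sig = sigmoid_scores(logits)
print([round(p, 3) for p in soft])  # prints [0.5, 0.5, 0.0]
print([round(s, 3) for s in sig])   # prints [0.993, 0.993, 0.119]
```

Under softmax, each valid token gets about 0.5 and looks uncertain even though both are correct. Under the sigmoid, both score about 0.993.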

Citation

@article{dinh2026sigmoid,
  title   = {Sigmoid Head for Quality Estimation under Language Ambiguity},
  author  = {Dinh, Tu Anh and Niehues, Jan},
  journal = {arXiv preprint arXiv:2601.00680},
  year    = {2026}
}

Accepted to ACL 2026 (Main); proceedings not yet released.

Code

Training and evaluation code: https://github.com/tuanh23/sigmoid-head-qe.
