Qwen3-4B Funding-Statement Cleaning LoRA

LoRA adapter on top of Qwen/Qwen3-4B-Instruct-2507. It is trained to convert a rough extracted span from an arXiv paper into the cleaned funding statement that frontier-model labelers would write (surrounding LaTeX/markdown artifacts stripped, whitespace normalized, and multi-line statements joined), or to emit NONE if the rough span is not actually a funding statement.

This is the second stage of a two-stage cascade. The first stage is a ModernBERT-base span tagger that identifies a rough span in the document (see cometadata/funding-extraction-modernbert-base-spanhead); this LoRA cleans that rough span into the canonical text that frontier labelers (Claude / GPT) would write.

Use

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, dtype="bfloat16", device_map="auto")
model = PeftModel.from_pretrained(model, "cometadata/funding-cleaning-qwen3-4b-lora")
model.eval()

SYSTEM = (
    "You are a funding statement cleaner. Given a rough extracted funding "
    "statement and its surrounding context from an academic paper, output the "
    "exact funding statement as it should appear in a database. Clean up LaTeX "
    "markers ($^{N}$, \\textsuperscript), hyphenated line breaks, and "
    "abnormal whitespace, but DO NOT paraphrase. If the rough span is not "
    "actually a funding statement, output the single word: NONE"
)

# `rough_span` is the output of a span-tagger (e.g., a ModernBERT BIO model).
# `context_left` and `context_right` are ~400 chars of the document on each side.
user = f"{context_left}<ROUGH>{rough_span}</ROUGH>{context_right}"

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": user},
]
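inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # return input_ids and attention_mask together as a mapping
    return_tensors="pt",
).to(model.device)
# Greedy decoding; drop the prompt tokens before decoding the completion.
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
prompt_len = inputs["input_ids"].shape[1]
pred = tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True).strip()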
# If pred == "NONE", treat as no funding statement.
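
The snippet above assumes `context_left`, `rough_span`, and `context_right` already exist. As a minimal sketch of one way to derive them, assuming the span tagger returns character offsets into the document text (the helper name `build_user_message` and the toy document are illustrative, not part of this repository):

```python
def build_user_message(document: str, start: int, end: int, window: int = 400) -> str:
    """Wrap document[start:end] in <ROUGH> tags with up to `window` chars of context per side."""
    if not 0 <= start <= end <= len(document):
        raise ValueError("span offsets must satisfy 0 <= start <= end <= len(document)")
    context_left = document[max(0, start - window):start]
    rough_span = document[start:end]
    context_right = document[end:end + window]
    return f"{context_left}<ROUGH>{rough_span}</ROUGH>{context_right}"


# Toy document; the offsets stand in for what a span tagger would return.
doc = "Intro text. This work was supported by NSF grant 123. References follow."
start = doc.index("This work")
end = doc.index(" References")
user = build_user_message(doc, start, end)
```

Near the start or end of a document the context is simply shorter than 400 characters, which matches the "up to 400 chars" wording in the prompt format.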

Training data

Built from the 2,384 training rows of cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test. Each training example is a chat-format triple (SYSTEM, USER, ASSISTANT):

  • USER content: vlm_markdown window of ±400 characters around a rough span, with the rough span itself marked by literal <ROUGH>...</ROUGH> tags.
  • ASSISTANT content: the exact funding-statement string from the dataset's funding_statements field (or the literal string NONE for negatives).

For each of the 1,416 positive rows, the rough span is constructed by:

  1. fuzzy-aligning the gold statement into vlm_markdown (verbatim where possible; otherwise rapidfuzz.partial_ratio_alignment); rows where no alignment ≥ 0.7 is found are dropped (~24 rows);
  2. jittering both span endpoints by a uniform random amount in [-80, +80] characters and snapping each endpoint to the nearest whitespace. This simulates the boundary noise produced by an upstream extractive tagger.

Two jittered variants are sampled per positive row as augmentation, giving ~2,800 positive examples (roughly 1,392 aligned rows × 2).
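
The jitter step can be sketched as follows. This is an illustration under stated assumptions, not the training script: the function names are hypothetical, "nearest whitespace" is read as the closest index that holds a whitespace character (or the start or end of the text), ties go to the left, and a jitter that would produce an empty span falls back to the unjittered span. The fuzzy-alignment step is not covered here.

```python
import random


def snap_to_whitespace(text: str, pos: int) -> int:
    """Return the index closest to `pos` that is whitespace, or the start/end of `text`."""
    pos = max(0, min(pos, len(text)))
    for offset in range(len(text) + 1):
        for cand in (pos - offset, pos + offset):
            if cand < 0 or cand > len(text):
                continue
            if cand in (0, len(text)) or text[cand].isspace():
                return cand
    return pos


def jitter_span(text: str, start: int, end: int, rng: random.Random, max_jitter: int = 80):
    """Shift both endpoints by a uniform integer in [-max_jitter, +max_jitter], then snap."""
    new_start = snap_to_whitespace(text, start + rng.randint(-max_jitter, max_jitter))
    new_end = snap_to_whitespace(text, end + rng.randint(-max_jitter, max_jitter))
    if new_start >= new_end:
        # Assumption: keep the original span when jitter collapses it.
        return start, end
    return new_start, new_end


rng = random.Random(0)
text = "word " * 200
gold_start, gold_end = 400, 520
# Two jittered variants per positive row, as described above.
variants = [jitter_span(text, gold_start, gold_end, rng) for _ in range(2)]
```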

For the 968 negative rows (papers with no funding statement), a single example is constructed with a randomly placed <ROUGH>...</ROUGH> window of 100–300 characters inside a random 800-char chunk of vlm_markdown; target is NONE.
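
A minimal sketch of the negative-example construction, assuming uniform sampling for the chunk position, the window length, and the window position (the source does not specify the distributions, and `make_negative_example` is a hypothetical name):

```python
import random


def make_negative_example(vlm_markdown: str, rng: random.Random,
                          chunk_len: int = 800, min_span: int = 100, max_span: int = 300) -> dict:
    """Pick a random chunk, mark a random window inside it with <ROUGH> tags, target NONE."""
    chunk_start = rng.randint(0, max(0, len(vlm_markdown) - chunk_len))
    chunk = vlm_markdown[chunk_start:chunk_start + chunk_len]
    # Clamp the window so short documents still yield a valid example.
    span_len = min(rng.randint(min_span, max_span), len(chunk))
    span_start = rng.randint(0, len(chunk) - span_len)
    span_end = span_start + span_len
    user = (
        chunk[:span_start]
        + "<ROUGH>" + chunk[span_start:span_end] + "</ROUGH>"
        + chunk[span_end:]
    )
    return {"user": user, "assistant": "NONE"}


paper_text = "lorem ipsum " * 200  # stand-in for a paper's vlm_markdown
example = make_negative_example(paper_text, random.Random(0))
```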

Total: ~3,750 examples. No validation split was held out because the training set was small.

Prompt format

System prompt (verbatim):

You are a funding statement cleaner. Given a rough extracted funding statement
and its surrounding context from an academic paper, output the exact funding
statement as it should appear in a database. Clean up LaTeX markers ($^{N}$,
\textsuperscript), hyphenated line breaks, and abnormal whitespace, but DO
NOT paraphrase. If the rough span is not actually a funding statement, output
the single word: NONE

User content is the marked context window:

... left context up to 400 chars ...
<ROUGH>rough extracted span as a single block, no line breaks added</ROUGH>
... right context up to 400 chars ...

Assistant content is either the cleaned funding-statement string or the literal token NONE.

Hyperparameters

  • Base: Qwen/Qwen3-4B-Instruct-2507
  • LoRA: r=32, α=64, dropout 0.05, bias=none
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (attention + MLP only — NOT embed_tokens/lm_head)
  • Epochs: 3
  • Batch size: 2 per device × 8 grad accum = 16 effective
  • Learning rate: 1e-4, cosine schedule, 5% warmup
  • bfloat16, gradient checkpointing on
  • Max sequence length: 1,536 tokens
  • completion_only_loss=True — loss is computed only on the assistant tokens
  • Trained on 1× H100 80GB
  • TRL SFTTrainer 0.29.0, transformers 5.2.0, peft 0.18.1
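
As a rough worked example of what these settings imply, the sketch below computes the effective batch size, an approximate step count, and a cosine learning-rate curve with linear warmup. It assumes about 3,750 training examples (the approximate total from the training-data section), rounds the last partial batch and the warmup length up, and uses a hand-written schedule function. The trainer's own accounting may differ by a few steps.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 1e-4, warmup_frac: float = 0.05) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to zero at `total_steps`."""
    warmup_steps = math.ceil(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


n_examples = 3750                      # approximate, see the training-data section
effective_batch = 2 * 8                # per-device batch size x gradient accumulation
steps_per_epoch = math.ceil(n_examples / effective_batch)
total_steps = 3 * steps_per_epoch      # 3 epochs
warmup_steps = math.ceil(total_steps * 0.05)
```

Under these assumptions training runs for about 700 optimizer steps, with warmup covering roughly the first 36.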

Evaluation

Evaluated on the 597-row test split, using the same cascade pipeline at inference (a ModernBERT-base span tagger, cometadata/funding-extraction-modernbert-base-spanhead, supplies the rough span; this adapter does the cleanup):

| Metric | Precision | Recall | F1 | F0.5 |
|---|---|---|---|---|
| Binary detection | 0.9887 | 0.9510 | 0.9694 | 0.9809 |
| Strict span (token_sort_ratio ≥ 0.95) | 0.7394 | 0.7112 | 0.7250 | 0.7336 |
| Loose span (max-of-4 fuzz ≥ 0.85) | 0.9717 | 0.9346 | 0.9528 | 0.9640 |
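
For readers unfamiliar with F0.5, the F1 and F0.5 columns follow from precision and recall through the standard F-beta formula, where beta = 0.5 weights precision more heavily than recall. The sketch below recomputes both columns from the rounded precision and recall values, so the results can differ from the reported figures in the last decimal place.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs copied from the evaluation table.
reported = {
    "binary": (0.9887, 0.9510),
    "strict": (0.7394, 0.7112),
    "loose": (0.9717, 0.9346),
}
recomputed = {
    name: (f_beta(p, r, beta=1.0), f_beta(p, r, beta=0.5))
    for name, (p, r) in reported.items()
}
```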

Compared to the upstream span-tagger alone (no cleanup), the cleanup LoRA improves strict F1 by ~0.3pt by stripping LaTeX $^{N}$ markers and joining sentences correctly. Cases where the LoRA over-rewrites (changes which sentence it emits) sometimes hurt; net effect is positive.

Hard ceiling note: ~28% of test gold statements are not verbatim substrings of any source representation (the frontier labelers normalized whitespace, stripped formatting, and occasionally merged paragraphs). The 0.95 strict threshold is unforgiving of these cleanups even on perfectly extracted spans, so strict F1 is capped near 0.73 for any single-stage extractive/cleanup approach trained on this data. The loose-span F1 of 0.95 is closer to the practical ceiling.
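
The verbatim-substring check behind this note can be sketched as below. The function name and the toy rows are illustrative only; the real check would run over the test split's gold statements and whichever source representations are available for each paper.

```python
from typing import Iterable, Sequence, Tuple


def non_verbatim_fraction(rows: Sequence[Tuple[str, Iterable[str]]]) -> float:
    """Share of gold statements that appear verbatim in none of their source texts."""
    if not rows:
        return 0.0
    misses = sum(
        1 for gold, sources in rows
        if not any(gold in source for source in sources)
    )
    return misses / len(rows)


# Toy rows: (gold statement, source representations of the same paper).
toy_rows = [
    ("Supported by NSF grant 123.", ["... Supported by NSF grant 123. ..."]),
    ("Supported by the ERC.", ["... Sup- ported by the ERC.$^{1}$ ..."]),
]
fraction = non_verbatim_fraction(toy_rows)
```

In the second toy row the gold statement was normalized by the labeler, so no exact substring match exists even though the extraction is correct.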

Intended use

This adapter is the second stage of a cleanup cascade for funding-statement extraction from arXiv PDFs. Pair it with a span-tagger that produces the rough span; this adapter normalizes formatting and emits the canonical text suitable for downstream metadata indexing or further parsing.

Not intended for: open-ended funding-statement generation, classification of funding sources, or downstream funder/grant extraction (use a separate parser for that).

Limitations

  • The cleanup LoRA can occasionally over-rewrite (substitute a different funding-like sentence from the surrounding context). Watch for cases where the rough span has multiple acknowledgments — the LoRA may "pick" the wrong one rather than just cleaning what it's given.
  • Trained only on arXiv-derived PDFs; behavior on other paper sources is untested.
  • Outputs NONE for negatives; if your pipeline cannot route NONE to an empty prediction, you'll see it as a literal string.
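
One way to handle the first and last points in a pipeline is a small post-processing step, sketched here under assumptions: it maps the literal NONE to an empty prediction and flags outputs that differ sharply from the rough span as possible over-rewrites. The similarity measure (`difflib.SequenceMatcher`) and the 0.6 threshold are illustrative choices, not values used or validated by this adapter.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def postprocess(pred: str, rough_span: str, min_similarity: float = 0.6) -> Tuple[Optional[str], bool]:
    """Return (cleaned text or None, flag) where flag marks a possible over-rewrite."""
    pred = pred.strip()
    if pred == "NONE":
        return None, False
    similarity = SequenceMatcher(None, pred, rough_span).ratio()
    # Heuristic: a cleaned statement should still resemble the span it was given.
    return pred, similarity < min_similarity


rough = "This work was sup- ported by NSF.$^{1}$"
cleaned, flagged = postprocess("This work was supported by NSF.", rough)
empty, _ = postprocess("NONE", rough)
swapped, swapped_flag = postprocess(
    "We thank the Simons Foundation for travel support.",
    "This work was supported by NSF grant 123.",
)
```

Flagged outputs could be routed to manual review or replaced by the rough span, depending on how much the pipeline trusts the upstream tagger.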

Citation / acknowledgement

Adapter trained as part of an applied research cycle on the cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test dataset. The labels in that dataset were produced by frontier models; this adapter learns to match that label distribution.

Framework versions

  • PEFT 0.18.1
  • TRL 0.29.0
  • Transformers 5.2.0