Qwen3-4B Funding-Statement Cleaning LoRA
LoRA adapter on top of Qwen/Qwen3-4B-Instruct-2507. Trained to convert a
rough extracted span of an arXiv paper into the frontier-cleaned funding
statement (with surrounding LaTeX/markdown artifacts stripped, whitespace
normalized, and multi-line statements joined), or to emit NONE if the rough
span is not actually a funding statement.
This is the second stage of a two-stage cascade. The first stage is a
ModernBERT-base span tagger that identifies a rough span in the document
(see cometadata/funding-extraction-modernbert-base-spanhead); this LoRA
cleans that rough span into the canonical text that frontier labelers
(Claude / GPT) would write.
Use
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, dtype="bfloat16", device_map="auto")
model = PeftModel.from_pretrained(model, "cometadata/funding-cleaning-qwen3-4b-lora")
model.eval()
SYSTEM = (
    "You are a funding statement cleaner. Given a rough extracted funding "
    "statement and its surrounding context from an academic paper, output the "
    "exact funding statement as it should appear in a database. Clean up LaTeX "
    "markers ($^{N}$, \\textsuperscript), hyphenated line breaks, and "
    "abnormal whitespace, but DO NOT paraphrase. If the rough span is not "
    "actually a funding statement, output the single word: NONE"
)
# `rough_span` is the output of a span-tagger (e.g., a ModernBERT BIO model).
# `context_left` and `context_right` are ~400 chars of the document on each side.
user = f"{context_left}<ROUGH>{rough_span}</ROUGH>{context_right}"
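messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": user},
]
# return_dict=True gives a mapping with input_ids and attention_mask, so the
# same code works whether or not the installed version returns a bare tensor.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[1]
pred = tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True).strip()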
# If pred == "NONE", treat as no funding statement.
Training data
Built from the 2,384 training rows of
cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test.
Each training example is a chat-format triple (SYSTEM, USER, ASSISTANT):
- USER content: a vlm_markdown window of ±400 characters around a rough span, with the rough span itself marked by literal <ROUGH>...</ROUGH> tags.
- ASSISTANT content: the exact funding-statement string from the dataset's funding_statements field (or the literal string NONE for negatives).
For each of the 1,416 positive rows, the rough span is constructed by:
- fuzzy-aligning the gold statement into vlm_markdown (verbatim where possible; otherwise rapidfuzz.partial_ratio_alignment); rows where no alignment ≥ 0.7 is found are dropped (~24 rows);
- jittering both span endpoints by a uniform random amount in [-80, +80] characters and snapping each endpoint to the nearest whitespace. This simulates the boundary noise produced by an upstream extractive tagger.
Two jittered variants are sampled per positive row to augment, giving ~2,800 positive examples.
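The jitter-and-snap step can be sketched roughly as follows. This is a minimal illustration, not the training script: the function names, the tie-breaking rule (left side wins), treating the string boundaries as valid snap points, and the fallback when the jittered endpoints cross are all assumptions.

```python
import random

def snap_to_whitespace(text: str, pos: int) -> int:
    """Return the index nearest to pos that is a string boundary or a whitespace character."""
    pos = max(0, min(pos, len(text)))
    for offset in range(len(text) + 1):
        # Check the left candidate first, then the right one (assumed tie-break).
        for cand in (pos - offset, pos + offset):
            if cand == 0 or cand == len(text):
                return cand
            if 0 < cand < len(text) and text[cand].isspace():
                return cand
    return pos

def jitter_span(text, start, end, max_jitter=80, rng=None):
    """Shift both endpoints by a uniform amount in [-max_jitter, +max_jitter], then snap."""
    rng = rng or random.Random()
    new_start = snap_to_whitespace(text, start + rng.randint(-max_jitter, max_jitter))
    new_end = snap_to_whitespace(text, end + rng.randint(-max_jitter, max_jitter))
    if new_start >= new_end:
        # Assumed fallback: keep the aligned span if the jitter collapsed it.
        return start, end
    return new_start, new_end
```

Calling jitter_span twice per positive row with different random draws would yield the two variants described above.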
For the 968 negative rows (papers with no funding statement), a single example
is constructed with a randomly placed <ROUGH>...</ROUGH> window of 100–300
characters inside a random 800-char chunk of vlm_markdown; target is NONE.
Total: ~3,750 examples (no train/val split — train set was small).
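A negative example of the kind described above might be built like this. It is a sketch under assumptions: the function name, the returned dictionary keys, and the handling of documents shorter than 800 characters are illustrative and not taken from the training code.

```python
import random

def make_negative_example(doc: str, rng: random.Random,
                          chunk_len: int = 800,
                          min_span: int = 100, max_span: int = 300) -> dict:
    """Mark a random 100-300 char window inside a random 800-char chunk; target is NONE."""
    chunk_start = rng.randint(0, max(0, len(doc) - chunk_len))
    chunk = doc[chunk_start:chunk_start + chunk_len]
    # Clamp the span length for short documents (assumed behavior).
    span_len = min(rng.randint(min_span, max_span), len(chunk))
    span_start = rng.randint(0, len(chunk) - span_len)
    span_end = span_start + span_len
    user = (chunk[:span_start] + "<ROUGH>" + chunk[span_start:span_end]
            + "</ROUGH>" + chunk[span_end:])
    return {"user": user, "assistant": "NONE"}
```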
Prompt format
System prompt (verbatim):
You are a funding statement cleaner. Given a rough extracted funding statement
and its surrounding context from an academic paper, output the exact funding
statement as it should appear in a database. Clean up LaTeX markers ($^{N}$,
\textsuperscript), hyphenated line breaks, and abnormal whitespace, but DO
NOT paraphrase. If the rough span is not actually a funding statement, output
the single word: NONE
User content is the marked context window:
... left context up to 400 chars ...
<ROUGH>rough extracted span as a single block, no line breaks added</ROUGH>
... right context up to 400 chars ...
Assistant content is either the cleaned funding-statement string or the
literal token NONE.
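To make the user-content layout concrete, here is a small helper that assembles it from a document string and the character offsets of the rough span. The helper name and the offset-based interface are assumptions for illustration; the span tagger may expose its output differently.

```python
def build_user_message(doc: str, start: int, end: int, window: int = 400) -> str:
    """Wrap doc[start:end] in <ROUGH> tags with up to `window` chars of context per side."""
    left = doc[max(0, start - window):start]
    rough = doc[start:end]  # passed through unchanged; no line breaks are added
    right = doc[end:end + window]
    return f"{left}<ROUGH>{rough}</ROUGH>{right}"
```

The returned string is what goes into the "content" field of the user message, alongside the verbatim system prompt.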
Hyperparameters
- Base: Qwen/Qwen3-4B-Instruct-2507
- LoRA: r=32, α=64, dropout 0.05, bias=none
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (attention + MLP only; NOT embed_tokens / lm_head)
- Epochs: 3
- Batch size: 2 per device × 8 grad accum = 16 effective
- Learning rate: 1e-4, cosine schedule, 5% warmup
- bfloat16, gradient checkpointing on
- Max sequence length: 1,536 tokens
- completion_only_loss=True: loss is computed only on the assistant tokens
- Trained on 1× H100 80GB
- TRL SFTTrainer 0.29.0, transformers 5.2.0, peft 0.18.1
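For readers reproducing the setup, the adapter settings above can be written as keyword arguments whose names mirror the fields of peft.LoraConfig. This is a sketch only: task_type is an assumption (the card does not state it), and the dictionary is kept as plain Python so it runs without peft installed.

```python
# Keyword arguments intended for peft.LoraConfig(**lora_kwargs).
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
}

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

# Effective batch size = per-device batch × gradient accumulation × devices.
per_device_batch_size = 2
grad_accum_steps = 8
num_devices = 1  # single H100, per the list above
effective_batch_size = per_device_batch_size * grad_accum_steps * num_devices
```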
Evaluation
Evaluated on the 597-row test split, using the same cascade pipeline at
inference (a ModernBERT-base span tagger,
cometadata/funding-extraction-modernbert-base-spanhead, supplies the rough
span; this adapter does the cleanup):
| Metric | Precision | Recall | F1 | F0.5 |
|---|---|---|---|---|
| Binary detection | 0.9887 | 0.9510 | 0.9694 | 0.9809 |
| Strict span (token_sort_ratio ≥ 0.95) | 0.7394 | 0.7112 | 0.7250 | 0.7336 |
| Loose span (max-of-4 fuzz ≥ 0.85) | 0.9717 | 0.9346 | 0.9528 | 0.9640 |
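The F1 and F0.5 columns follow from precision and recall through the standard F-beta formula, F_β = (1 + β²)·P·R / (β²·P + R). The short helper below (the function name is illustrative) can be used to check the table; small differences in the last digit are expected because the table values were presumably computed from raw counts rather than rounded precision and recall.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# Binary-detection row: compare against the F1 and F0.5 columns.
binary_f1 = f_beta(0.9887, 0.9510)
binary_f05 = f_beta(0.9887, 0.9510, beta=0.5)
```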
Compared to the upstream span-tagger alone (no cleanup), the cleanup LoRA
improves strict F1 by ~0.3pt by stripping LaTeX $^{N}$ markers and joining
sentences correctly. Cases where the LoRA over-rewrites (changes which
sentence it emits) sometimes hurt; net effect is positive.
Hard ceiling note: ~28% of test gold statements are not verbatim substrings of any source representation (the frontier labelers normalized whitespace, stripped formatting, and occasionally merged paragraphs). The 0.95 strict threshold is unforgiving of these cleanups even on perfectly extracted spans, so strict F1 is capped near 0.73 for any single-stage extractive/cleanup approach trained on this data. The loose-span F1 of 0.95 is closer to the practical ceiling.
Intended use
This adapter is the second stage of a cleanup cascade for funding-statement extraction from arXiv PDFs. Pair it with a span-tagger that produces the rough span; this adapter normalizes formatting and emits the canonical text suitable for downstream metadata indexing or further parsing.
Not intended for: open-ended funding-statement generation, classification of funding sources, or downstream funder/grant extraction (use a separate parser for that).
Limitations
- The cleanup LoRA can occasionally over-rewrite (substitute a different funding-like sentence from the surrounding context). Watch for cases where the rough span has multiple acknowledgments — the LoRA may "pick" the wrong one rather than just cleaning what it's given.
- Trained only on arXiv-derived PDFs; behavior on other paper sources is untested.
- Outputs NONE for negatives; if your pipeline cannot route NONE to an empty prediction, you'll see it as a literal string.
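One way to handle the NONE sentinel downstream is a small post-processing step like the following. The function name and the choice to map empty output to "no statement" are assumptions; adapt them to whatever your pipeline uses for an empty prediction.

```python
from typing import Optional

def route_prediction(pred: str) -> Optional[str]:
    """Map the literal NONE sentinel (or empty output) to None; otherwise return the text."""
    text = pred.strip()
    if text == "" or text == "NONE":
        return None
    return text
```

The comparison is an exact match on the stripped string, so a funding statement that merely contains the word "NONE" is passed through unchanged.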
Citation / acknowledgement
Adapter trained as part of an applied research cycle on the
cometadata/arxiv-pdf-only-works-funding-statement-extraction-train-test
dataset. The labels in that dataset were produced by frontier models; this
adapter learns to match that label distribution.
Framework versions
- PEFT 0.18.1
- TRL 0.29.0
- Transformers 5.2.0