CellHermes-CoT-RL

This repository contains the reinforcement-learning LoRA adapter used for the CellHermes-CoT-RL row in the TCR reactivity benchmark.

Important: Adapter-Only Repository

This repository is not a standalone merged model checkpoint. It contains only a PEFT LoRA adapter:

adapter_config.json
adapter_model.safetensors

These two files are sufficient for distributing and loading the RL LoRA adapter with PEFT-compatible tooling, but they are not sufficient to run inference by themselves.
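One way to catch the adapter-only case before attempting inference is to inspect a local copy of the repository. The sketch below assumes the files have already been downloaded to a local directory. The function name `is_adapter_only` is hypothetical, and `config.json` is assumed to be the file that marks a full Transformers checkpoint.

```python
from pathlib import Path

# Files this repository ships, per the list above.
ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def is_adapter_only(repo_dir):
    """Return True if repo_dir holds the LoRA adapter files but no model config.

    Assumption: a full Transformers checkpoint carries a config.json,
    which an adapter-only directory lacks.
    """
    root = Path(repo_dir)
    has_adapter = all((root / name).is_file() for name in ADAPTER_FILES)
    has_model_config = (root / "config.json").is_file()
    return has_adapter and not has_model_config
```

If this returns True for the directory you intend to serve, the directory still needs to be paired with the merged SFT checkpoint.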

To reproduce the benchmark setup, load this adapter on top of the merged CellHermes-CoT-SFT checkpoint. The merged SFT checkpoint provides the base model weights, tokenizer, chat template, and SFT reasoning format; this repository provides only the post-SFT RL policy update.

Base Model for This Adapter

This adapter must be loaded on the merged SFT checkpoint, which is generated by combining EthanGao123/CellHermes-CoT-SFT with the CellHermes-v1.0 base model.

It should not be loaded directly on the original CellHermes-v1.0 base model.

Adapter

Adapter files in this repository: adapter_config.json, adapter_model.safetensors
Adapter type: LoRA
LoRA rank: 32
LoRA alpha: 64
LoRA dropout: 0.0
LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Source-data variant: TCR reactivity benchmark, clone split with 25% held-out clone groups, seed 2025
RL method: GSPO-style policy optimization with structured TCR-reactivity rewards
Selected RL checkpoint: post-SFT RL checkpoint used for the benchmark
Reward function family: structured TCR reactivity reward
Benchmark row: CellHermes-CoT-RL
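The hyperparameters above can be checked against a downloaded adapter_config.json. The following is a minimal sketch, not part of the original training code: the key names (`peft_type`, `r`, `lora_alpha`, `lora_dropout`, `target_modules`) follow PEFT's LoraConfig fields, while `check_adapter_config` and the constants are hypothetical names. The scaling value at the end assumes standard LoRA scaling (lora_alpha / r), which does not hold if rank-stabilized LoRA was enabled.

```python
import json
from pathlib import Path

# Values listed in this README; key names follow PEFT's LoraConfig fields.
EXPECTED = {
    "peft_type": "LORA",
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.0,
}
EXPECTED_TARGETS = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
}


def check_adapter_config(path):
    """Return a list of mismatches between adapter_config.json and the README."""
    cfg = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = []
    for key, want in EXPECTED.items():
        got = cfg.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, found {got!r}")
    # target_modules may be serialized in any order, so compare as sets.
    targets = set(cfg.get("target_modules") or [])
    if targets != EXPECTED_TARGETS:
        problems.append(f"target_modules: found {sorted(targets)}")
    return problems


# Assumption: standard LoRA scales the low-rank update by lora_alpha / r.
LORA_SCALING = EXPECTED["lora_alpha"] / EXPECTED["r"]  # 64 / 32 = 2.0
```

An empty list from `check_adapter_config("adapter_config.json")` means the downloaded file agrees with the values listed here.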

Inference Alignment

For the plotted benchmark, vLLM inference used:

model: the merged CellHermes-CoT-SFT checkpoint
lora: this repository

The SFT merged base provides the tokenizer, chat template, and SFT reasoning format. This RL LoRA adapter provides the post-SFT policy update.
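The model/lora pairing above might be expressed with vLLM's offline API roughly as follows. This is a sketch under stated assumptions, not the exact benchmark invocation: the function names, the adapter label, and the sampling settings are hypothetical, and `rl_adapter_dir` is assumed to be a local directory containing this repository's files. The `max_lora_rank` argument is set explicitly because the adapter rank (32) is higher than the default limit in the vLLM versions I am aware of.

```python
def build_engine_kwargs(sft_merged_base):
    """Engine arguments for serving the merged SFT base with LoRA enabled."""
    return {
        "model": sft_merged_base,
        "enable_lora": True,
        # The adapter rank is 32; vLLM must be allowed to load at least that rank.
        "max_lora_rank": 32,
    }


def generate_with_rl_adapter(sft_merged_base, rl_adapter_dir, messages):
    """Run one chat completion with the RL adapter applied on the SFT base."""
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(**build_engine_kwargs(sft_merged_base))
    # Arguments: adapter label (arbitrary), unique integer id, local adapter path.
    lora = LoRARequest("cellhermes_cot_rl", 1, rl_adapter_dir)
    # Sampling settings are placeholders, not the benchmark's settings.
    params = SamplingParams(temperature=0.0, max_tokens=1024)
    # llm.chat applies the chat template shipped with the SFT merged base.
    outputs = llm.chat(messages, sampling_params=params, lora_request=lora)
    return outputs[0].outputs[0].text
```

Calling `generate_with_rl_adapter` with the merged SFT path, a local copy of this repository, and a list of `{"role": ..., "content": ...}` messages returns the generated text for the first message list.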

Example PEFT-style loading pattern:

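from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder: replace with the local path or repo id of the merged SFT checkpoint.
sft_merged_base = "path_or_repo_to_merged_CellHermes-CoT-SFT"
rl_adapter = "EthanGao123/CellHermes-CoT-RL"

# The tokenizer and chat template come from the merged SFT base, not from this repository.
tokenizer = AutoTokenizer.from_pretrained(sft_merged_base)
base_model = AutoModelForCausalLM.from_pretrained(sft_merged_base)
# Attach the RL LoRA adapter on top of the merged SFT weights.
model = PeftModel.from_pretrained(base_model, rl_adapter)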

Alignment Notes

Load this adapter only with the SFT merged base checkpoint listed above.
Keep tokenizer files and chat_template.jinja from the SFT merged base checkpoint.
Do not merge or serve this adapter against raw CellHermes-v1.0, raw Meta-Llama, or another SFT checkpoint unless intentionally rerunning a different experiment.
The benchmark predictions for CellHermes-CoT-RL were generated from the SFT merged base plus this RL LoRA adapter.
Optimizer states, scheduler states, RNG states, and trainer logs are intentionally not included because they are not required for adapter inference.
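The tokenizer and chat-template note above can be turned into a quick pre-flight check on a local copy of the merged SFT base. This is only an illustrative sketch: `missing_base_files` is a hypothetical helper, `chat_template.jinja` is the file named in the notes, and `tokenizer_config.json` is an assumed stand-in for "tokenizer files" that may not match every checkpoint layout.

```python
from pathlib import Path

# chat_template.jinja is named in the notes above; tokenizer_config.json is an
# assumed stand-in for "tokenizer files" and may differ for your checkpoint.
REQUIRED_BASE_FILES = ("chat_template.jinja", "tokenizer_config.json")


def missing_base_files(sft_merged_dir):
    """Return the required tokenizer/template files absent from the base directory."""
    root = Path(sft_merged_dir)
    return [name for name in REQUIRED_BASE_FILES if not (root / name).is_file()]
```

A non-empty result suggests the directory is not the merged SFT base described here, or that its tokenizer files were replaced.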