CellHermes-CoT-RL

This repository contains the reinforcement-learning LoRA adapter used for the CellHermes-CoT-RL row in the TCR reactivity benchmark.

Important: Adapter-Only Repository

This repository is not a standalone merged model checkpoint. It contains only a PEFT LoRA adapter:

  • adapter_config.json
  • adapter_model.safetensors

These two files are sufficient for distributing and loading the RL LoRA adapter with PEFT-compatible tooling, but they are not sufficient to run inference by themselves.

To reproduce the benchmark setup, load this adapter on top of the merged CellHermes-CoT-SFT checkpoint. The merged SFT checkpoint provides the base model weights, tokenizer, chat template, and SFT reasoning format; this repository provides only the post-SFT RL policy update.

Base Model for This Adapter

This adapter must be loaded on the merged SFT checkpoint generated from EthanGao123/CellHermes-CoT-SFT and the same CellHermes-v1.0 base model.

It should not be loaded directly on the original CellHermes-v1.0 base model.

Adapter

  • Adapter files in this repository: adapter_config.json, adapter_model.safetensors
  • Adapter type: LoRA
  • LoRA rank: 32
  • LoRA alpha: 64
  • LoRA dropout: 0.0
  • LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  • Source-data variant: TCR reactivity benchmark, clone split with 25% held-out clone groups, seed 2025
  • RL method: GSPO-style policy optimization with structured TCR-reactivity rewards
  • Selected RL checkpoint: post-SFT RL checkpoint used for the benchmark
  • Reward function family: structured TCR reactivity reward
  • Benchmark row: CellHermes-CoT-RL

Inference Alignment

For the plotted benchmark, vLLM inference used:

  • model: the merged CellHermes-CoT-SFT checkpoint
  • lora: this repository

The SFT merged base provides the tokenizer, chat template, and SFT reasoning format. This RL LoRA adapter provides the post-SFT policy update.

Example PEFT-style loading pattern:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

sft_merged_base = "path_or_repo_to_merged_CellHermes-CoT-SFT"
rl_adapter = "EthanGao123/CellHermes-CoT-RL"

tokenizer = AutoTokenizer.from_pretrained(sft_merged_base)
base_model = AutoModelForCausalLM.from_pretrained(sft_merged_base)
model = PeftModel.from_pretrained(base_model, rl_adapter)

Alignment Notes

  • Load this adapter only with the SFT merged base checkpoint listed above.
  • Keep tokenizer files and chat_template.jinja from the SFT merged base checkpoint.
  • Do not merge or serve this adapter against raw CellHermes-v1.0, raw Meta-Llama, or another SFT checkpoint unless intentionally rerunning a different experiment.
  • The benchmark predictions for CellHermes-CoT-RL were generated from the SFT merged base plus this RL LoRA adapter.
  • Optimizer states, scheduler states, RNG states, and trainer logs are intentionally not included because they are not required for adapter inference.
Downloads last month
9
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for EthanGao123/CellHermes-CoT-RL

Collection including EthanGao123/CellHermes-CoT-RL