UniXcoder Finetuned for C-to-Rust Semantic Similarity

Model Description

This model is a finetuned, parameter-efficient adaptation of microsoft/unixcoder-base optimized for cross-language code retrieval and semantic similarity tasks between C and Rust.

The model was adapted using Low-Rank Adaptation (LoRA) on the query and value projection layers. It is trained with a contrastive objective (InfoNCE loss) that reorganizes the embedding space: mismatched C/Rust code representations are pushed toward near-zero similarity, while correct parallel blocks keep roughly the same similarity they had under the base model.
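
As an illustration of the objective described above, here is a minimal pure-Python sketch of a symmetric InfoNCE loss over one batch of paired embeddings. It is not the training code for this model: the use of cosine similarity, the averaging of the two directions, and the function names (`cosine`, `cross_entropy_diagonal`, `symmetric_info_nce`) are assumptions made for the example, and the toy vectors are hypothetical. Only the temperature of 0.07 comes from the hyperparameters listed in this card.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(c_embs, rust_embs, tau=0.07):
    """Average of the C->Rust and Rust->C InfoNCE losses (assumed convention)."""
    logits = [[cosine(c, r) / tau for r in rust_embs] for c in c_embs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy 2-D "embeddings" (hypothetical values, not model outputs).
c_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # pair i matches C sample i
swapped = [aligned[1], aligned[0]]   # pairs deliberately mismatched

loss_aligned = symmetric_info_nce(c_batch, aligned)
loss_swapped = symmetric_info_nce(c_batch, swapped)
print(loss_aligned < loss_swapped)  # True
```

With a small temperature such as 0.07, the loss is close to zero when each C snippet is most similar to its own Rust counterpart and grows sharply when the pairs are swapped, which is the pressure that separates positives from in-batch negatives.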

Developed by: Vojtěch Hejlek
Base Model: microsoft/unixcoder-base
Training Dataset: vojtenz/C-Rust-parallel-corpus
Language Pair: C to Rust
Context Window: 1,024 tokens

Associated Paper

The complete academic paper detailing the methodology, curation pipeline, and evaluation results for this model is available directly within this repository:

📄 Read the Paper: c-rust-parallel-corpus.pdf

Please refer to the paper for in-depth insights into the model's design choices and baseline benchmarks.

Training Hyperparameters & Setup

Adaptation Method: LoRA (Low-Rank Adaptation)
Trainable Parameters: 294,912 (LoRA matrices with rank $r=8$, scaling $\alpha=16$), about 0.23% of the base model's parameters
Loss Function: Symmetric InfoNCE contrastive loss ($\tau = 0.07$)
Batch Size: 8 (each sample has 7 in-batch negatives, giving 56 implicit negative pairs per gradient step)
Epochs: 5
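
The figures in this list can be reproduced with a few lines of arithmetic. The sketch below assumes the base encoder has 12 transformer layers with a hidden size of 768 (a RoBERTa-base-sized architecture); these two values are not stated in this card and are labeled as assumptions in the code. The helper name `lora_param_count` is hypothetical.

```python
# Assumed architecture of the base encoder (not stated in this card).
NUM_LAYERS = 12
HIDDEN_SIZE = 768

# Values taken from the hyperparameter list.
RANK = 8
ALPHA = 16
ADAPTED_PROJECTIONS = 2  # query and value
BATCH_SIZE = 8

def lora_param_count(num_layers, hidden, rank, projections):
    """Each adapted projection gains A (rank x hidden) and B (hidden x rank)."""
    per_projection = rank * hidden + hidden * rank
    return num_layers * projections * per_projection

trainable = lora_param_count(NUM_LAYERS, HIDDEN_SIZE, RANK, ADAPTED_PROJECTIONS)

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scale = ALPHA / RANK

# Off-diagonal entries of the BATCH_SIZE x BATCH_SIZE similarity matrix.
in_batch_negatives = BATCH_SIZE * (BATCH_SIZE - 1)

print(trainable)           # 294912
print(lora_scale)          # 2.0
print(in_batch_negatives)  # 56
```

Under these assumptions the count matches the 294,912 trainable parameters reported above, and the 56 negatives correspond to the 8 x 7 mismatched pairs in each batch.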

Performance Evaluation

When benchmarked on the held-out 300-pair test pool, this LoRA-finetuned configuration improves MRR@10 by 4.9x over the zero-shot base UniXcoder setup (0.150 to 0.729):

Model Configuration	Test MRR@10	Test R@1	Test R@5	Mean Pos Sim	Mean Neg Sim	Similarity Gap
Base UniXcoder (Zero-Shot)	0.150	0.100	0.180	0.615	0.539	+0.077
UniXcoder + LoRA (This Model)	0.729	0.643	0.827	0.621	0.087	+0.533
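
For readers unfamiliar with the column names, the sketch below shows how MRR@10 and R@k are conventionally computed from the rank of the correct match for each query, and rechecks the ratio and similarity gaps from the rounded table values. The function names and the example ranks are hypothetical; the evaluation script used for this model may differ in details such as the retrieval direction.

```python
def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank; a correct match ranked below k contributes 0."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct match appears within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical 1-based ranks of the correct Rust snippet for five C queries.
example_ranks = [1, 1, 3, 12, 2]

print(mrr_at_k(example_ranks))        # (1 + 1 + 1/3 + 0 + 1/2) / 5
print(recall_at_k(example_ranks, 1))  # 2 of 5 queries
print(recall_at_k(example_ranks, 5))  # 4 of 5 queries

# Values copied from the table above.
mrr_ratio = 0.729 / 0.150
gap_lora = 0.621 - 0.087
gap_base = 0.615 - 0.539
print(round(mrr_ratio, 1), round(gap_lora, 3), round(gap_base, 3))
```

Recomputing the gaps from the rounded means gives 0.534 and 0.076, which agree with the reported +0.533 and +0.077 up to rounding of the underlying values.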

Per-Category Gains

The relative gain is largest on pairs with heavy syntax and paradigm gaps, which the zero-shot base model almost never retrieves:

Easy Tiers: MRR@10 improves from 0.420 to 0.846.
Hard Tiers (heavy syntax/paradigm gaps): MRR@10 increases from a near-random 0.011 to 0.643 (a 58x gain).
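
To put "near-random" in context, the short calculation below estimates the MRR@10 of a retriever that ranks candidates uniformly at random. It assumes each query is ranked against the full 300-candidate test pool with exactly one correct match, which this card does not state explicitly for the per-tier results.

```python
POOL_SIZE = 300  # held-out test pool size
K = 10

# Expected MRR@K when the correct match's rank is uniform over 1..POOL_SIZE
# (assumption: one correct candidate, full pool used for every query).
random_mrr = sum(1.0 / rank for rank in range(1, K + 1)) / POOL_SIZE

# Ratios from the per-tier figures above.
easy_ratio = 0.846 / 0.420
hard_ratio = 0.643 / 0.011

print(round(random_mrr, 4))
print(round(easy_ratio, 1), round(hard_ratio))
```

Under that assumption the random baseline is roughly 0.01, so the base model's 0.011 on the hard tier is indeed close to chance, while the easy tier roughly doubles and the hard tier improves by about 58x.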

Citation Information

If you use this model or its benchmark results in your research, please cite the following paper:

@proceedings{hejlek2026creating,
  title={Creating a Parallel C to Rust Corpus for Semantic Similarity Evaluation},
  author={Hejlek, Vojtěch},
  year={2026},
  organization={Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo (ICMC/USP)},
  note={Advisor: Alneu de Andrade Lopes, Coadvisor: Leonardo Jesus Almeida}
}

Model tree: hejlevoj/UniXCoder-C-Rust-semantic-similarity is an adapter of the base model microsoft/unixcoder-base.