UniXcoder Finetuned for C-to-Rust Semantic Similarity

Model Description

This model is a finetuned, parameter-efficient adaptation of microsoft/unixcoder-base optimized for cross-language code retrieval and semantic similarity tasks between C and Rust.

The model was adapted with Low-Rank Adaptation (LoRA) applied to the query and value projection layers. Training uses a contrastive objective (InfoNCE loss) to organize the embedding space: mismatched C/Rust pairs are pushed toward near-zero similarity (mean negative similarity drops from 0.539 to 0.087 on the test set), while correct parallel pairs keep a consistent alignment (mean positive similarity stays near 0.62).
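To make the objective concrete, here is a minimal sketch of a symmetric InfoNCE loss over a batch of paired embeddings, written in plain Python so it runs without any ML framework. It is an illustration under assumptions, not the training code: the function names, the use of cosine similarity, and the toy vectors are hypothetical, and the temperature default simply mirrors the $\tau = 0.07$ listed in this card.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_matrix(c_embs, rust_embs):
    """sims[i][j] = cosine similarity between C snippet i and Rust snippet j."""
    c_n = [l2_normalize(v) for v in c_embs]
    r_n = [l2_normalize(v) for v in rust_embs]
    return [[sum(a * b for a, b in zip(c, r)) for r in r_n] for c in c_n]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(c_embs, rust_embs, temperature=0.07):
    """Average of the C-to-Rust and Rust-to-C InfoNCE terms."""
    sims = cosine_matrix(c_embs, rust_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the reverse direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy 3-dimensional embeddings (made up for illustration only).
aligned_c = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_rust = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 0.8]]
shuffled_rust = [aligned_rust[1], aligned_rust[2], aligned_rust[0]]

loss_aligned = symmetric_info_nce(aligned_c, aligned_rust)
loss_shuffled = symmetric_info_nce(aligned_c, shuffled_rust)
print(loss_aligned < loss_shuffled)
```

In this sketch every other item in the batch acts as a negative for a given pair, which is why the loss is lower when each C snippet sits next to its own Rust translation than when the Rust side is shuffled.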


Associated Paper

The paper describing the methodology, curation pipeline, and evaluation results for this model is available in this repository. Refer to it for the model's design choices and baseline benchmarks.


Training Hyperparameters & Setup

  • Adaptation Method: LoRA (Low-Rank Adaptation)
  • Trainable Parameters: 294,912 (LoRA matrices with rank $r=8$, scaling $\alpha=16$), about 0.23% of the base model's parameters
  • Loss Function: Symmetric InfoNCE contrastive loss ($\tau = 0.07$)
  • Batch Size: 8 (giving $8 \times 7 = 56$ implicit in-batch negatives per gradient step)
  • Epochs: 5
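The figures in the list above can be cross-checked with a few lines of arithmetic. The sketch below assumes the base encoder has 12 transformer layers with hidden size 768 (a RoBERTa-base-style layout, not stated in this card) and that LoRA adds one pair of low-rank matrices to each query and each value projection; all constant and function names are hypothetical.

```python
# Assumed architecture (not stated in the card): 12 layers, hidden size 768.
HIDDEN_SIZE = 768
NUM_LAYERS = 12
TARGETS_PER_LAYER = 2  # query and value projections

# Values taken from the hyperparameter list.
RANK = 8
ALPHA = 16
BATCH_SIZE = 8


def lora_params_per_projection(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen weight."""
    return rank * d_in + d_out * rank


lora_params = NUM_LAYERS * TARGETS_PER_LAYER * lora_params_per_projection(
    HIDDEN_SIZE, HIDDEN_SIZE, RANK
)

# The low-rank update is commonly scaled by alpha / rank.
scaling = ALPHA / RANK

# Each of the 8 anchors sees the other 7 items in the batch as negatives.
negatives_per_step = BATCH_SIZE * (BATCH_SIZE - 1)

print(lora_params)         # 294912
print(scaling)             # 2.0
print(negatives_per_step)  # 56
```

Under these assumptions the count matches the 294,912 trainable parameters reported above, and the 56 negatives correspond to counting one retrieval direction per step.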

Performance Evaluation

On the held-out test pool of 300 pairs, the LoRA-finetuned model raises MRR@10 from 0.150 to 0.729, roughly a 4.9x improvement over the zero-shot base UniXcoder:

| Model Configuration | Test MRR@10 | Test R@1 | Test R@5 | Mean Pos Sim | Mean Neg Sim | Similarity Gap |
|---|---|---|---|---|---|---|
| Base UniXcoder (Zero-Shot) | 0.150 | 0.100 | 0.180 | 0.615 | 0.539 | +0.077 |
| UniXcoder + LoRA (This Model) | 0.729 | 0.643 | 0.827 | 0.621 | 0.087 | +0.533 |
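For readers unfamiliar with the column names, the following sketch shows one common way to compute MRR@10, R@k, and a positive/negative similarity gap from a query-by-candidate similarity matrix. It assumes that the correct candidate for query i sits at index i and that the gap is the mean diagonal similarity minus the mean off-diagonal similarity; the card does not spell out its exact definitions, and the function names and toy matrix are hypothetical.

```python
def rank_of_correct(sim_row, correct_index):
    """1-based rank of the correct candidate when sorting by descending similarity."""
    target = sim_row[correct_index]
    return 1 + sum(1 for s in sim_row if s > target)


def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank, counting a query as 0 if the match is below rank k."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose correct match appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def similarity_gap(sim_matrix):
    """Return (mean positive sim, mean negative sim, their difference)."""
    n = len(sim_matrix)
    pos = [sim_matrix[i][i] for i in range(n)]
    neg = [sim_matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    return mean_pos, mean_neg, mean_pos - mean_neg


# Toy similarity matrix: rows are C queries, columns are Rust candidates.
toy_sims = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.6],
    [0.1, 0.0, 0.7],
]

ranks = [rank_of_correct(row, i) for i, row in enumerate(toy_sims)]
mean_pos, mean_neg, gap = similarity_gap(toy_sims)

print(ranks)                  # [1, 3, 1]
print(recall_at_k(ranks, 5))  # 1.0
```

A larger gap means correct pairs are separated more clearly from incorrect ones, which is the pattern the table shows for the LoRA model (+0.533 versus +0.077).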

Per-Category Gains

The gains are largest on the hardest pairs, where C and Rust diverge most in syntax and paradigm and the zero-shot base model retrieves almost nothing:

  • Easy Tiers: MRR@10 improves from 0.420 to 0.846.
  • Hard Tiers (heavy syntax/paradigm gaps): MRR@10 increases from a near-random 0.011 to 0.643 (roughly a 58x gain).

Citation Information

If you use this model or its benchmark results in your research, please cite the following paper:

@proceedings{hejlek2026creating,
  title={Creating a Parallel C to Rust Corpus for Semantic Similarity Evaluation},
  author={Hejlek, Vojtěch},
  year={2026},
  organization={Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo (ICMC/USP)},
  note={Advisor: Alneu de Andrade Lopes, Coadvisor: Leonardo Jesus Almeida}
}