How to use hejlevoj/UniXCoder-C-Rust-semantic-similarity with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("feature-extraction", model="hejlevoj/UniXCoder-C-Rust-semantic-similarity")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("hejlevoj/UniXCoder-C-Rust-semantic-similarity", dtype="auto")
```
UniXcoder Finetuned for C-to-Rust Semantic Similarity
Model Description
This model is a finetuned, parameter-efficient adaptation of microsoft/unixcoder-base optimized for cross-language code retrieval and semantic similarity tasks between C and Rust.
The model was adapted using Low-Rank Adaptation (LoRA) on the query and value projection layers. Training uses a contrastive learning objective (InfoNCE loss) to organize the embedding space: mismatched C and Rust snippets are pushed toward near-zero similarity, while correct parallel pairs keep a consistent level of alignment. This lets the model capture algorithmic equivalence across the two languages.
- Developed by: Vojtěch Hejlek
- Base Model: microsoft/unixcoder-base
- Training Dataset: vojtenz/C-Rust-parallel-corpus
- Language Pair: C to Rust
- Context Window: 1,024 tokens
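The contrastive objective described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the training code used for this model: it assumes cosine similarity between embeddings and a batch where row `i` of the C embeddings is paired with row `i` of the Rust embeddings. The function names are hypothetical, and the default temperature is the value listed under Training Hyperparameters.

```python
import math


def cosine_similarity_matrix(c_embs, rust_embs):
    """Return S where S[i][j] is the cosine similarity of C item i and Rust item j."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    c_unit = [normalize(v) for v in c_embs]
    r_unit = [normalize(v) for v in rust_embs]
    return [[sum(a * b for a, b in zip(c, r)) for r in r_unit] for c in c_unit]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(c_embs, rust_embs, temperature=0.07):
    """Average the C-to-Rust and Rust-to-C InfoNCE losses over one batch."""
    sims = cosine_similarity_matrix(c_embs, rust_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))
```

With perfectly aligned pairs the loss approaches zero, and when every embedding is identical it equals the log of the batch size, which is why the objective rewards separating non-matching snippets. A real training loop would compute the same quantity with a tensor library so gradients can flow into the LoRA weights.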
Associated Paper
The complete academic paper detailing the methodology, curation pipeline, and evaluation results for this model is available directly within this repository:
- 📄 Read the Paper: c-rust-parallel-corpus.pdf
Please refer to the paper for in-depth insights into the model's design choices and baseline benchmarks.
Training Hyperparameters & Setup
- Adaptation Method: LoRA (Low-Rank Adaptation)
- Trainable Parameters: 294,912 (LoRA matrices with rank $r=8$ and scaling $\alpha=16$), roughly 0.23% of the base model's parameters
- Loss Function: Symmetric InfoNCE contrastive loss ($\tau = 0.07$)
- Batch Size: 8 (each batch yields 8 × 7 = 56 implicit in-batch negative pairs per gradient step)
- Epochs: 5
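The trainable-parameter and negative-pair counts in the list above can be reproduced with simple arithmetic. The sketch below assumes a RoBERTa-base-style encoder with 12 layers and a hidden size of 768; these dimensions are not stated in this card, but they are consistent with the reported total.

```python
# Assumed encoder dimensions (RoBERTa-base style; not stated in this card).
hidden_size = 768
num_layers = 12

rank = 8                         # LoRA rank r, from the list above
adapted_modules_per_layer = 2    # query and value projections

# Each adapted projection gains two low-rank matrices:
# A with shape (rank, hidden_size) and B with shape (hidden_size, rank).
params_per_module = rank * hidden_size + hidden_size * rank
lora_params = params_per_module * adapted_modules_per_layer * num_layers
print(lora_params)  # 294912

# In-batch negatives: every item is contrasted with the other items' partners.
batch_size = 8
in_batch_negatives = batch_size * (batch_size - 1)
print(in_batch_negatives)  # 56
```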
Performance Evaluation
On the held-out 300-pair test pool, this LoRA-finetuned configuration improves MRR@10 by about 4.9x over the zero-shot base UniXcoder (0.150 to 0.729):
| Model Configuration | Test MRR@10 | Test R@1 | Test R@5 | Mean Pos Sim | Mean Neg Sim | Similarity Gap |
|---|---|---|---|---|---|---|
| Base UniXcoder (Zero-Shot) | 0.150 | 0.100 | 0.180 | 0.615 | 0.539 | +0.077 |
| UniXcoder + LoRA (This Model) | 0.729 | 0.643 | 0.827 | 0.621 | 0.087 | +0.533 |
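For readers who want to compute comparable numbers on their own data, the following sketch derives the table's columns from a square similarity matrix. It is an illustration under stated assumptions, not the evaluation script from the paper: `sims[i][j]` is assumed to hold the similarity between query `i` and candidate `j`, the correct candidate for query `i` is assumed to be candidate `i`, and the mean negative similarity is assumed to average over all off-diagonal entries. The function name `retrieval_metrics` is hypothetical.

```python
def retrieval_metrics(sims, k_values=(1, 5), mrr_cutoff=10):
    """Compute MRR@cutoff, Recall@k, and similarity statistics from a square matrix."""
    n = len(sims)
    reciprocal_ranks = []
    hits = {k: 0 for k in k_values}
    positives, negatives = [], []

    for i, row in enumerate(sims):
        # Rank of the correct candidate: 1 + number of candidates scoring strictly higher.
        rank = 1 + sum(1 for j, s in enumerate(row) if j != i and s > row[i])
        reciprocal_ranks.append(1.0 / rank if rank <= mrr_cutoff else 0.0)
        for k in k_values:
            if rank <= k:
                hits[k] += 1
        positives.append(row[i])
        negatives.extend(s for j, s in enumerate(row) if j != i)

    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    metrics = {
        f"mrr@{mrr_cutoff}": sum(reciprocal_ranks) / n,
        "mean_pos_sim": mean_pos,
        "mean_neg_sim": mean_neg,
        "similarity_gap": mean_pos - mean_neg,
    }
    for k in k_values:
        metrics[f"r@{k}"] = hits[k] / n
    return metrics


# Toy example with three query/candidate pairs.
toy_sims = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.0, 0.1, 0.7],
]
toy_metrics = retrieval_metrics(toy_sims)
```

In the toy matrix the second query ranks its correct candidate second, so its reciprocal rank is 0.5 and it counts as a miss for R@1 but a hit for R@5.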
Per-Category Gains
The gains hold across difficulty tiers, including C and Rust pairs with large syntactic or paradigm differences that the zero-shot base model fails to align:
- Easy Tiers: MRR@10 improves from 0.420 to 0.846.
- Hard Tiers (Heavy syntax/paradigm gaps): MRR@10 increases from a near-random 0.011 up to 0.643 (a 58x structural retrieval gain).
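To put the "near-random" hard-tier baseline in context, the expected MRR@10 of a uniformly random ranking can be computed directly. This sketch assumes each query is ranked against the full 300-pair test pool; if the per-tier evaluation used a smaller candidate pool, the random baseline would be higher.

```python
pool_size = 300  # held-out test pool size reported in the evaluation
cutoff = 10      # MRR@10 only credits ranks 1 through 10

# Under a uniformly random ranking, the correct item lands on each rank
# with probability 1 / pool_size, so the expected score is H_10 / pool_size.
random_mrr = sum(1.0 / rank for rank in range(1, cutoff + 1)) / pool_size
print(round(random_mrr, 4))  # 0.0098

# Relative gains implied by the reported per-tier MRR@10 values.
easy_gain = 0.846 / 0.420   # about 2x
hard_gain = 0.643 / 0.011   # about 58x
```

Under that assumption, the base model's hard-tier score of 0.011 sits only slightly above chance.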
Citation Information
If you use this model or its benchmark results in your research, please cite the following paper:
@proceedings{hejlek2026creating,
title={Creating a Parallel C to Rust Corpus for Semantic Similarity Evaluation},
author={Hejlek, Vojtěch},
year={2026},
organization={Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo (ICMC/USP)},
note={Advisor: Alneu de Andrade Lopes, Coadvisor: Leonardo Jesus Almeida}
}