trustchainai-codebert
Fine-tuned CodeBERT for Solidity smart contract vulnerability detection.
Part of the TrustChainAI project, an AI-powered smart contract auditor with explainability and ethics monitoring, built to make blockchain security accessible to African and emerging-market Web3 ecosystems.
Model Performance
| Metric | Score |
|---|---|
| F1 (weighted, test set) | 98.6% |
| Eval Loss | 0.0428 |
| Test Samples | 1,032 contracts |
| Classes | 14 labels (13 vulnerability categories plus safe) |
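For readers unfamiliar with the weighted F1 metric reported above, the following is a minimal pure-Python sketch of how such a score is typically computed: per-class F1 averaged with weights proportional to each class's share of the true labels. The function name `weighted_f1` is hypothetical, and this is not necessarily the evaluation code used for this model.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative sketch)."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total  # weight each class by its support
    return score


y_true = ["safe", "safe", "reentrancy", "reentrancy"]
y_pred = ["safe", "reentrancy", "reentrancy", "reentrancy"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.7333
```

Because the weights follow class frequency, a high weighted F1 can coexist with weaker results on rare classes, so per-class scores are worth checking where available.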
How to Use
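from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="emekaphilians/trustchainai-codebert",
)

# A deliberately vulnerable contract: the balance is zeroed after the external call.
contract = """
pragma solidity ^0.8.0;
contract Vulnerable {
    mapping(address => uint) public balances;
    function withdraw() external {
        uint amt = balances[msg.sender];
        (bool ok,) = msg.sender.call{value: amt}("");
        balances[msg.sender] = 0;
    }
}
"""

# Truncate by tokens (the model's limit is 512) rather than by characters.
result = classifier(contract, truncation=True, max_length=512)
print(result)
# [{'label': 'reentrancy', 'score': 0.997}]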
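The example above classifies a single truncated input, so anything past the 512-token limit is ignored. One possible way to cover longer contracts is to score overlapping token windows. The sketch below only computes the window boundaries; the function name `window_spans`, the stride of 128, and the whole windowing strategy are assumptions for illustration, not something this model card describes.

```python
def window_spans(n_tokens, max_len=510, stride=128):
    """Return (start, end) token spans that cover n_tokens with overlap.

    max_len=510 leaves room for the two special tokens that RoBERTa-style
    tokenizers (such as CodeBERT's) add around each sequence.
    """
    if max_len <= stride:
        raise ValueError("max_len must be larger than stride")
    if n_tokens <= 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start = end - stride  # consecutive windows share `stride` tokens
    return spans


print(window_spans(1000))  # [(0, 510), (382, 892), (764, 1000)]
```

Each span could then be decoded back to text and passed to the classifier. How to combine per-window predictions (for example, reporting the highest-scoring non-safe label) is a design choice left open here.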
Label Schema
| ID | Label | Description |
|---|---|---|
| 0 | safe | No vulnerability detected |
| 1 | reentrancy | Reentrancy attack (DAO-style) |
| 2 | integer_overflow | Arithmetic overflow / underflow |
| 3 | access_control | Unprotected ownership or selfdestruct |
| 4 | tx_origin_phishing | tx.origin used for authentication |
| 5 | dos_gas | Unbounded loop / gas exhaustion |
| 6 | unchecked_call | External call return value ignored |
| 7 | front_running_mev | Mempool-visible state / TOD |
| 8 | timestamp_dependence | block.timestamp manipulation |
| 9 | proxy_storage_collision | Delegatecall storage slot collision |
| 10 | flash_loan_oracle | Oracle price manipulation via flash loan |
| 11 | flash_loan_single_block | Single-block liquidity attack |
| 12 | misnamed_constructor | Pre-Solidity-0.5 constructor naming bug |
| 13 | other | Multi-class or miscellaneous vulnerability |
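When working with raw logits or class IDs instead of pipeline output, the table above can be mirrored as a lookup. This is a sketch that assumes the published checkpoint uses exactly this ID order; if the model's config already carries these label names, the mapping is redundant. The helper `is_vulnerable` is hypothetical.

```python
# Label names in ID order, copied from the Label Schema table.
LABELS = [
    "safe",
    "reentrancy",
    "integer_overflow",
    "access_control",
    "tx_origin_phishing",
    "dos_gas",
    "unchecked_call",
    "front_running_mev",
    "timestamp_dependence",
    "proxy_storage_collision",
    "flash_loan_oracle",
    "flash_loan_single_block",
    "misnamed_constructor",
    "other",
]

ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def is_vulnerable(label_id):
    """True for every class except ID 0 (safe)."""
    return ID2LABEL[label_id] != "safe"


print(ID2LABEL[4], LABEL2ID["flash_loan_oracle"], is_vulnerable(0))
# tx_origin_phishing 10 False
```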
Training Data
Assembled from four open-source datasets plus synthetic augmentation, using the prepare_datasets.py pipeline:
| Source | Contracts |
|---|---|
| SmartBugs Curated | 143 |
| SolidiFI Benchmark | 1,700 |
| DeFiHackLabs | 729 |
| Not-So-Smart Contracts | 25 |
| Synthetic augmentation | 3,600 |
| Total (after dedup) | 6,879 |
Split: 70% train / 15% val / 15% test (stratified by label).
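A stratified 70/15/15 split keeps each label's share roughly equal across the three subsets. The sketch below shows one way to do this with only the standard library. It is not the code in prepare_datasets.py; the function name `stratified_split` and the seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split sample indices into train/val/test, preserving label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * fractions[0])
        n_val = round(len(idxs) * fractions[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


labels = ["safe"] * 100 + ["reentrancy"] * 20
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 84 18 18
```

With very small classes, per-class rounding can leave the validation or test subset with few or no examples of that class, which is worth checking before relying on per-class metrics.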
Training Details
| Parameter | Value |
|---|---|
| Base model | microsoft/codebert-base |
| Epochs | 5 (best checkpoint at epoch 2) |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Optimizer | AdamW (weight decay 0.01, warmup 100 steps) |
| Max token length | 512 |
| Mixed precision | fp16 |
| Hardware | Google Colab T4 GPU |
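To put the warmup setting in context, the arithmetic below estimates the number of optimizer steps implied by the table. It assumes the validation split is the same size as the 1,032-contract test split, a single GPU, and no gradient accumulation; none of these is stated in this card.

```python
import math

total_contracts = 6879
test_size = 1032       # reported in the Model Performance table
val_size = test_size   # assumption: val and test splits are the same size
train_size = total_contracts - val_size - test_size

batch_size = 16
epochs = 5
warmup_steps = 100

# Assumes one optimizer step per batch (no gradient accumulation).
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(train_size, steps_per_epoch, total_steps, round(warmup_fraction, 3))
# 4815 301 1505 0.066
```

Under these assumptions, warmup covers roughly the first third of epoch 1, and the best checkpoint at the end of epoch 2 would fall near step 602.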
Intended Use
- Pre-deployment security screening of Solidity smart contracts
- Automated vulnerability triage for DeFi protocols
- Research baseline for smart contract security ML benchmarks
- Integration into the TrustChainAI multi-agent audit pipeline
Out-of-Scope Use
- This model is not a substitute for a full professional security audit on high-value contracts
- Performance on Vyper, Yul, or non-EVM contracts is untested
- The tx_origin_phishing class has limited real training samples (28); treat predictions for this class with extra caution
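One way to apply extra caution to a weakly supported class is a per-class confidence threshold that routes uncertain predictions to a human reviewer. The sketch below assumes pipeline-style prediction dictionaries; the threshold values and the `triage` function are hypothetical and would need tuning on held-out data.

```python
DEFAULT_THRESHOLD = 0.80  # hypothetical cut-off for most classes
CLASS_THRESHOLDS = {
    "tx_origin_phishing": 0.95,  # hypothetical stricter cut-off (only 28 real samples)
}


def triage(prediction):
    """Map one prediction dict to 'pass', 'flag', or 'manual_review'."""
    label, score = prediction["label"], prediction["score"]
    threshold = CLASS_THRESHOLDS.get(label, DEFAULT_THRESHOLD)
    if score < threshold:
        return "manual_review"  # not confident enough either way
    return "pass" if label == "safe" else "flag"


print(triage({"label": "reentrancy", "score": 0.997}))        # flag
print(triage({"label": "tx_origin_phishing", "score": 0.90}))  # manual_review
```

Even a "pass" result here is only a screening outcome; as noted above, it does not replace a professional audit.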
Limitations & Bias
- Synthetic augmentation was used for 9 of 13 classes to compensate for dataset scarcity. Synthetic contracts may not fully capture real-world obfuscation patterns.
- The tx_origin_phishing class had only 28 real-world training samples; model confidence for this class may be lower in practice.
- Training data skews toward older Solidity vulnerability patterns (pre-0.8). Newer attack vectors may be underrepresented.
Citation
@misc{trustchainai2025,
author = {Emeka Philian},
title = {TrustChainAI: AI-Powered Smart Contract Auditor},
year = {2025},
url = {https://github.com/emekaphilian/TrustChainAI}
}
Links
- GitHub: emekaphilian/TrustChainAI
- Profile: emekaphilians
- Architecture: docs/ARCHITECTURE.md