|
--- |
|
language: |
|
- en |
|
license: apache-2.0 |
|
tags: |
|
- solidity |
|
- web3 |
|
- code generation |
|
- smart contract |
|
widget: |
|
- text: "pragma solidity ^0.5.7;\n// Context: ParentA | Functions: helloA helloB | Constants: constantA \ncontract HelloWorld is ParentA {" |
|
--- |
|
|
|
# A code generation T5 model for solidity (web3 smart contract) |
|
- See https://github.com/hululuzhu/solidity-t5 for more context |
|
|
|
## How to use this trained model |
|
- A hello world example to use this model, notice the input `text` includes |
|
- Header solidity version like `pragma solidity ^0.5.7` |
|
- Ancestor class/library info, e.g. public functions and constants from `ParentA` |
|
- Contract/Library/Interface declaration header, e.g. `HelloWorld` ended with `{` |
|
- Or simply use the test widget on the right side of the window and test, however |
|
the quality is known to be worse without decoding params |
|
|
|
```python |
|
# !pip install transformers -q |
|
|
|
from transformers import AutoTokenizer, T5ForConditionalGeneration |
|
|
|
DEVICE = 'cuda' # fallback to cpu if you do not have cuda |
|
tokenizer = AutoTokenizer.from_pretrained("hululuzhu/solidity-t5") |
|
model = T5ForConditionalGeneration.from_pretrained("hululuzhu/solidity-t5").to(DEVICE) |
|
|
|
text = """pragma solidity ^0.5.7; |
|
// Context: ParentA | Functions: helloA helloB | Constants: constantA |
|
contract HelloWorld is ParentA {""" |
|
input_ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids.to(DEVICE) |
|
|
|
# Need to tune beam/topk/topp params to get good outcome |
|
generated_ids = model.generate(input_ids, max_length=256, num_beams=5, top_p=0.95, top_k=50) |
|
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True)) |
|
|
|
# Expect outcome |
|
""" |
|
string public constant name = "Hello World"; |
|
... |
|
uint256 public constant override returns (uint256) { |
|
return initialSupply; |
|
} |
|
function initialSupply() public view returns (uint256) { |
|
... |
|
""" |
|
``` |
|
|
|
## Background |
|
- Base T5 code model: https://huggingface.co/Salesforce/codet5-large |
|
- Source data: https://huggingface.co/datasets/mwritescode/slither-audited-smart-contracts |
|
- Processing steps: Clean, contract-level segmentation sepration, split in and out |
|
- After processing input sample |
|
|
|
``` |
|
pragma solidity 0.5.7; |
|
// Context: PauserRole | Functions: isPauser addPauser renouncePauser | Constants: |
|
contract Pausable is PauserRole { |
|
``` |
|
|
|
- After processing output sample (**notice indentation is bad, this is intentional to reduce token size**) |
|
|
|
``` |
|
event Paused(address account); |
|
event Unpaused(address account); |
|
bool private _pausableActive; |
|
bool private _paused; |
|
constructor () internal { |
|
_paused = false; |
|
} |
|
function paused() public view returns (bool) { |
|
return _paused; |
|
} |
|
modifier whenNotPaused() { |
|
require(!_paused); |
|
_; |
|
} |
|
modifier whenPaused() { |
|
require(_paused); |
|
_; |
|
} |
|
function pause() public onlyPauser whenNotPaused whenPausableActive { |
|
_paused = true; |
|
emit Paused(msg.sender); |
|
} |
|
function unpause() public onlyPauser whenPaused whenPausableActive { |
|
_paused = false; |
|
emit Unpaused(msg.sender); |
|
} |
|
function _setPausableActive(bool _active) internal { |
|
_pausableActive = _active; |
|
} |
|
modifier whenPausableActive() { |
|
require(_pausableActive); |
|
_; |
|
} |
|
} |
|
``` |
|
- Source training code: See the [end to end notebook](https://github.com/hululuzhu/solidity-t5/blob/main/code/Solidity_T5_Data_Processing_and_Training.ipynb) at code dir here |
|
|
|
## Future TODO |
|
- The model is significantly under-trained because of lack of GPU budget, need 10x colab resources (~$100 for full train) |
|
- This is quite limited on how the model is used, potentially we could switch to GPT2 decoder-only to compare, but CodeT5 has its strong code optimization |
|
- Need more classifiers (T5 or BERT alike) to detect potential defects. |
|
|