metadata
tags:
  - bert
metrics:
  - CodeBLEU

Model Card for plbart-base

Model Details

Model Description

The PLBART model was proposed in "Unified Pre-training for Program Understanding and Generation" by Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang.

  • Developed by: UCLA NLP
  • Shared by [Optional]: Gunjan Chhablani
  • Model type: Text2Text Generation
  • Language(s) (NLP): More information needed
  • License: More information needed
  • Related Models: bert-base-multilingual-uncased
    • Parent Model: plbart
  • Resources for more information:

Uses

Direct Use

The pre-trained model plbart-base was trained with a multilingual denoising task.

Downstream Use [Optional]

More information needed

Out-of-Scope Use

More information needed

Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

Training Details

Training Data

More information needed

Training Procedure

Preprocessing

The model creators note in the associated paper:

We tokenize all the data with a sentencepiece model (Kudo and Richardson, 2018) learned on 1/5’th of the pre-training data. We train sentencepiece to learn 50,000 subword tokens. One key challenge to aggregate data from different modalities is that some modalities may have more data, such as we have 14 times more data in PL than NL. Therefore, we mix and up/down sample the data following Conneau and Lample (2019) to alleviate the bias towards PL.
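The up/down sampling mentioned above can be illustrated with the smoothed multinomial scheme described by Conneau and Lample (2019), where each modality's share p_i is rescaled to q_i = p_i^α / Σ_j p_j^α. The sketch below is only an illustration: the function name is hypothetical, and the value α = 0.5 is an assumption chosen for the example, not a setting reported in this card. The 14:1 PL-to-NL ratio comes from the quote above.

```python
def smoothed_sampling_probs(sizes: dict[str, float], alpha: float) -> dict[str, float]:
    """Rescale raw data shares p_i to q_i = p_i**alpha / sum_j p_j**alpha.

    alpha = 1 keeps the raw proportions; alpha < 1 up-samples small
    modalities and down-samples large ones.
    """
    total = sum(sizes.values())
    weights = {name: (size / total) ** alpha for name, size in sizes.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


# Relative data sizes: 14 times more PL than NL (from the paper quote).
sizes = {"PL": 14.0, "NL": 1.0}

# ASSUMPTION: alpha = 0.5 is illustrative only, not a value taken from this card.
probs = smoothed_sampling_probs(sizes, alpha=0.5)

for name, q in probs.items():
    raw = sizes[name] / sum(sizes.values())
    print(f"{name}: raw share {raw:.3f} -> sampling probability {q:.3f}")
```

With any α below 1, the NL share rises above its raw 1/15 and the PL share falls below 14/15, which is the intended reduction of the bias towards PL.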

Speeds, Sizes, Times

The model creators note in the associated paper:

The effective batch size is maintained at 2048 instances.
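An effective batch size is the product of the number of devices, the per-device batch size, and the number of gradient accumulation steps. As a rough sketch, the snippet below shows which accumulation settings would reach 2048 instances on the 8 GPUs listed under Environmental Impact. The per-device batch sizes are hypothetical; this card does not state the values the authors used.

```python
def accumulation_steps(effective_batch: int, num_devices: int, per_device_batch: int) -> int:
    """Return the gradient accumulation steps needed to reach `effective_batch`."""
    per_step = num_devices * per_device_batch
    if effective_batch % per_step != 0:
        raise ValueError(
            f"{effective_batch} is not a multiple of {per_step} instances per step"
        )
    return effective_batch // per_step


EFFECTIVE_BATCH = 2048  # from the paper quote above
NUM_GPUS = 8            # hardware listed in this card

# ASSUMPTION: these per-device batch sizes are illustrative, not reported values.
for per_device in (8, 16, 32):
    steps = accumulation_steps(EFFECTIVE_BATCH, NUM_GPUS, per_device)
    print(f"per-device batch {per_device:>2} -> {steps} accumulation steps")
```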

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model creators note in the associated paper:

CodeXGLUE (Lu et al., 2021) provided public dataset and corresponding train-validation-test splits for all the tasks.

Factors

More information needed

Metrics

More information needed

Results

More information needed

Model Examination

More information needed

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: 8 Nvidia GeForce RTX 2080 Ti GPUs
  • Hours used: More information needed
  • Cloud Provider: More information needed
  • Compute Region: More information needed
  • Carbon Emitted: More information needed

Technical Specifications [optional]

Model Architecture and Objective

PLBART is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for code-to-text, text-to-code, and code-to-code tasks. Because the model is multilingual, it expects sequences in a specific format: a special language ID token is added to both the source and the target text. The source text format is X [eos, src_lang_code], where X is the source text.
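The source format described above can be sketched at the token level in plain Python. This is only an illustration of the layout X [eos, src_lang_code]: the helper name is hypothetical, and the literal token strings `</s>` and `__python__` are assumptions made for the example. In practice the tokenizer is responsible for adding these special tokens.

```python
def append_language_suffix(tokens: list[str], src_lang_code: str, eos: str = "</s>") -> list[str]:
    """Lay out a source sequence as X [eos, src_lang_code]."""
    return [*tokens, eos, src_lang_code]


# X: an already tokenized source text (whitespace split used here for brevity).
source_tokens = "def add ( a , b ) : return a + b".split()

# ASSUMPTION: "__python__" and "</s>" stand in for the real special tokens.
formatted = append_language_suffix(source_tokens, "__python__")
print(formatted)
```

The language ID token comes last, after the end-of-sequence token, so the source text X itself is left unchanged.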

Compute Infrastructure

The model creators note in the associated paper:

PLBART uses the same architecture as BARTbase (Lewis et al., 2020), it uses the sequence-to-sequence Transformer architecture (Vaswani et al., 2017), with 6 layers of encoder and 6 layers of decoder with model dimension of 768 and 12 heads (∼140M parameters). The only exception is, we include an additional layer normalization layer on top of both the encoder and decoder following Liu et al. (2020).
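As a sanity check on the ∼140M figure, the parameter count can be estimated from the numbers quoted above. The sketch below is a back-of-the-envelope estimate, not an exact count of the released checkpoint. It assumes a feed-forward dimension of 3072 and 1024 learned positions per stack (typical BART-base values, not stated in this card), a single shared token embedding matrix, and the 50,000-token vocabulary from the Preprocessing section.

```python
D_MODEL = 768
ENCODER_LAYERS = 6
DECODER_LAYERS = 6
VOCAB_SIZE = 50_000    # sentencepiece vocabulary size quoted in this card
FFN_DIM = 3072         # ASSUMPTION: typical BART-base value, not stated here
MAX_POSITIONS = 1024   # ASSUMPTION: learned positional embeddings per stack


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Scale and shift vectors."""
    return 2 * dim


def attention(dim: int) -> int:
    """Query, key, value and output projections (independent of head count)."""
    return 4 * linear(dim, dim)


def feed_forward(dim: int, hidden: int) -> int:
    return linear(dim, hidden) + linear(hidden, dim)


# Encoder layer: self-attention + feed-forward, each followed by a layer norm.
encoder_layer = attention(D_MODEL) + feed_forward(D_MODEL, FFN_DIM) + 2 * layer_norm(D_MODEL)
# Decoder layer: self-attention + cross-attention + feed-forward, three layer norms.
decoder_layer = 2 * attention(D_MODEL) + feed_forward(D_MODEL, FFN_DIM) + 3 * layer_norm(D_MODEL)

# Shared token embeddings plus positional embeddings for encoder and decoder.
embeddings = VOCAB_SIZE * D_MODEL + 2 * MAX_POSITIONS * D_MODEL
# The additional layer norm on top of both the encoder and the decoder.
final_norms = 2 * layer_norm(D_MODEL)

total = (
    ENCODER_LAYERS * encoder_layer
    + DECODER_LAYERS * decoder_layer
    + embeddings
    + final_norms
)
print(f"Estimated parameters: {total / 1e6:.1f}M")
```

Under these assumptions the estimate lands within a few percent of the ∼140M quoted by the authors. The 12 attention heads split the 768-dimensional projections and do not change the count.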

Hardware

More information needed

Software

More information needed

Citation

BibTeX:

@misc{https://doi.org/10.48550/arxiv.2103.06333,
 doi = {10.48550/ARXIV.2103.06333},
 url = {https://arxiv.org/abs/2103.06333},
 author = {Ahmad, Wasi Uddin and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
 keywords = {Computation and Language (cs.CL), Programming Languages (cs.PL), FOS: Computer and information sciences},
 title = {Unified Pre-training for Program Understanding and Generation},
 publisher = {arXiv},
 year = {2021},
 copyright = {arXiv.org perpetual, non-exclusive license}
}

APA: More information needed

Glossary [optional]

CodeBLEU is a metric for measuring the quality of the synthesized code (Ren et al., 2020). Unlike BLEU, CodeBLEU also considers grammatical and logical correctness based on the abstract syntax tree and the data-flow structure.
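Ren et al. (2020) define CodeBLEU as a weighted sum of four component scores: standard BLEU, a keyword-weighted BLEU, an abstract syntax tree match, and a data-flow match. The sketch below shows only that final combination step. The function name is hypothetical, the equal weights are an assumption made for illustration, and the component scores are made-up inputs rather than results for this model.

```python
def combine_codebleu(
    bleu: float,
    weighted_bleu: float,
    ast_match: float,
    dataflow_match: float,
    weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Weighted sum of the four CodeBLEU components, each expected in [0, 1]."""
    components = (bleu, weighted_bleu, ast_match, dataflow_match)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * c for w, c in zip(weights, components))


# ASSUMPTION: hypothetical component scores, for illustration only.
score = combine_codebleu(bleu=0.4, weighted_bleu=0.4, ast_match=0.8, dataflow_match=0.8)
print(f"CodeBLEU: {score:.2f}")
```

Computing the syntax and data-flow components themselves requires parsing the code, which is outside the scope of this sketch.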

More Information [optional]

More information needed

Model Card Authors [optional]

UCLA NLP in collaboration with Ezi Ozoani and the Hugging Face team

Model Card Contact

More information needed

How to Get Started with the Model

Use the code below to get started with the model.

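from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the pre-trained sequence-to-sequence checkpoint from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("uclanlp/plbart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("uclanlp/plbart-base")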