---
license: apache-2.0
datasets:
- OpenAssistant/oasst1
- erfanzar/CC-H2OAI-OASST-1-TRAIN
- erfanzar/CC-OASST-1-TRAIN
language:
- en
- fr
- fa
- nl
metrics:
- bertscore
pipeline_tag: text-generation
---

# OpenSourceTransformers-OST Project

[OST-OpenSourceTransformers Github](https://github.com/erfanzar/OST-OpenSourceTransformers)

## Hello community

This model has only 1B parameters, yet it can arguably be called state of the art for its size.

It can run on a GPU with 4 GB of memory, and it is fine-tuned to handle dialogue.

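As a rough sanity check on the 4 GB figure, the memory taken by the weights alone can be estimated from the parameter count and the bytes per parameter. This is only a back-of-the-envelope sketch: the round count of 1e9 parameters is an assumption, `weight_memory_gib` is a hypothetical helper, and activations, the attention cache, and framework overhead are ignored.

```python
# Rough weight-memory estimate. N_PARAMS is an assumed round figure,
# not the exact parameter count of the model.
N_PARAMS = 1_000_000_000

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(n_params: int, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(N_PARAMS, dtype):.2f} GiB")
```

Under these assumptions the 8-bit weights take a little under 1 GiB, which leaves headroom on a 4 GB card, while float32 weights alone would nearly fill it.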
### Training Parameters

- learning rate: 2e-4
- scheduler: cosine learning rate
- device: 4 × T4 GPU
- batch size: AutoFind
- training time: 12 hours
- max sequence length: 1024
- epochs: 2

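The cosine schedule listed above can be illustrated with a small sketch. The card does not state the warmup, the minimum learning rate, or the total number of steps, so this version assumes no warmup and decay to zero; `cosine_lr` and `total_steps` are hypothetical names used only for illustration.

```python
import math

PEAK_LR = 2e-4  # learning rate from the list above


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    Assumes no warmup phase; the actual training run may differ.
    """
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, middle, and end of a hypothetical 1000-step run.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```

With these assumptions the rate starts at 2e-4, passes through half that value at the midpoint, and reaches zero at the final step.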
## Usage Code

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from IPython.display import clear_output
import textwrap

tokenizer = AutoTokenizer.from_pretrained("erfanzar/PGT-1B-2EP")

# 8-bit loading requires the `bitsandbytes` and `accelerate` packages.
model = AutoModelForCausalLM.from_pretrained(
    "erfanzar/PGT-1B-2EP", device_map="auto", load_in_8bit=True
)


def verify_text(txt: str) -> str:
    """Wrap every line of `txt` at 110 characters for display."""
    return "\n".join(textwrap.fill(line, width=110) for line in txt.split("\n"))


def ppp(text: str) -> str:
    """Pre-process a prompt: wrap it in the prompter/assistant format."""
    return f"<|prompter|> {text} <|endoftext|><|assistant|>"


def generate(text, max_new_tokens: int = 1024, use_ppp: bool = False, b_pair=False):
    """Generate one token per step, yielding the full text so far
    (or only the newly added piece when `b_pair` is True)."""
    text = ppp(text) if use_ppp else text

    for _ in range(max_new_tokens):
        enc = tokenizer(text, return_tensors="pt", add_special_tokens=False)
        text_r = text
        # Move the inputs to the model's device before generating.
        out = model.generate(enc.input_ids.to(model.device), max_new_tokens=1, pad_token_id=0)
        text = tokenizer.decode(out[0], skip_special_tokens=False)
        # Treat four consecutive newlines as an end-of-text marker.
        if text.endswith("\n\n\n\n"):
            text = text[:-4] + tokenizer.eos_token

        yield text[len(text_r):] if b_pair else text
        if text.endswith(tokenizer.eos_token):
            break


for v in generate("what is a gpu", 512, True):
    clear_output(wait=True)
    print(verify_text(v), end="")
```

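By default the generator in the usage code yields the whole text, prompt included. If only the assistant's answer is needed, it can be cut out of the decoded string. The sketch below is one way to do that; `build_prompt` mirrors `ppp`, `extract_reply` is a hypothetical helper, and the end-of-text string is assumed to match `tokenizer.eos_token`.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
EOS = "<|endoftext|>"  # assumed to match tokenizer.eos_token


def build_prompt(text: str) -> str:
    """Same format as ppp() in the usage code."""
    return f"{PROMPTER} {text} {EOS}{ASSISTANT}"


def extract_reply(generated: str) -> str:
    """Return only the assistant's part of the decoded text (hypothetical helper)."""
    # Keep what follows the last assistant marker.
    reply = generated.rsplit(ASSISTANT, 1)[-1]
    # Drop the end-of-text token and anything after it.
    reply = reply.split(EOS, 1)[0]
    return reply.strip()


# Stand-in for a decoded model output; no model is loaded here.
decoded = build_prompt("what is a gpu") + " A GPU is a graphics processor." + EOS
print(extract_reply(decoded))
```

Splitting on the last assistant marker keeps the helper usable if earlier turns are ever prepended to the prompt, though multi-turn prompting is not described in this card.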
# Pythia-1B

## Model Details

### Pretrained Model

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Fine-tuned languages: English, Persian, French, and Dutch
- Learn more: see [Pythia's GitHub repository](https://github.com/EleutherAI/pythia) for training procedures, config files, and usage details.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## NOTE

The Pythia Suite is **NOT** intended for deployment. It is not in itself
a product and cannot be used for human-facing interactions. For example,
the model may generate harmful or offensive text...

Also note that, at least in this version, the model does not perform well in Persian, French, and Dutch.