
bilingual-gpt-neox-4b


Overview

This repository provides an English-Japanese bilingual GPT-NeoX model of 3.8 billion parameters.


Benchmarking

  • Japanese benchmark

    Our evaluation experiments suggest that the bilingual-gpt-neox-4b model performs slightly better than the previous japanese-gpt-neox-3.6b model on Japanese tasks.

    • The 4-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, and JSQuAD.
    • The 6-task average accuracy is based on results of JCommonsenseQA, JNLI, MARC-ja, JSQuAD, XWinograd, and JAQKET-v2.
    Model                                  | 4-task average accuracy | 6-task average accuracy
    bilingual-gpt-neox-4b-instruction-ppo  | 61.01                   | 61.16
    bilingual-gpt-neox-4b-instruction-sft  | 61.02                   | 61.69
    bilingual-gpt-neox-4b                  | 56.12                   | 51.83
    japanese-gpt-neox-3.6b-instruction-ppo | 59.86                   | 60.07
    japanese-gpt-neox-3.6b                 | 55.07                   | 50.32
  • English benchmark

    Using the EleutherAI Language Model Evaluation Harness, we found that bilingual-gpt-neox-4b performs comparably to English/multilingual models of similar size (a sketch of a possible harness invocation follows the table below).

    • The average accuracy is based on results of Arc-Challenge, Arc-Easy, BoolQ, COPA, HellaSwag, OpenBookQA, PIQA, PROST, SWAG, and WinoGrande.
    Model                 | Average accuracy
    mpt-7b                | 59.30
    llama-7b              | 57.35
    bloom-7b              | 51.51
    xglm-7.5b             | 50.96
    xglm-4.5b             | 50.15
    bilingual-gpt-neox-4b | 49.49
    bloom-3b              | 48.56
    xglm-2.9b             | 47.44
    bloom-1.7b            | 46.54
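
    For reference, below is a minimal sketch of how such an evaluation could be launched with the EleutherAI lm-evaluation-harness. The API surface, model identifier, and task names vary between harness versions, so treat this as an assumption rather than the exact setup behind the reported scores.

    # Hypothetical sketch of running the EleutherAI lm-evaluation-harness on this model.
    # The API and task names differ between harness versions; treat everything below
    # as an assumption, not the exact configuration used for the reported numbers.
    from lm_eval import evaluator

    results = evaluator.simple_evaluate(
        model="hf",  # older harness versions use "hf-causal" instead
        model_args="pretrained=rinna/bilingual-gpt-neox-4b",
        tasks=[
            "arc_challenge", "arc_easy", "boolq", "copa", "hellaswag",
            "openbookqa", "piqa", "prost", "swag", "winogrande",
        ],
        batch_size=8,
    )

    # Print the per-task metrics returned by the harness.
    for task, metrics in results["results"].items():
        print(task, metrics)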

How to use the model

Notice: The model is sensitive to decoding hyper-parameters (e.g. temperature, top_p, top_k, repetition_penalty), so we suggest exploring the settings that work best for your task (a sketch of alternative settings follows the examples below).

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# use_fast=False is required for this model's sentencepiece-based tokenizer (see the Tokenization section below)
tokenizer = AutoTokenizer.from_pretrained("rinna/bilingual-gpt-neox-4b", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/bilingual-gpt-neox-4b")

if torch.cuda.is_available():
    model = model.to("cuda")

text = "่ฅฟ็”ฐๅนพๅคš้ƒŽใฏใ€"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        min_new_tokens=100,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
"""
่ฅฟ็”ฐๅนพๅคš้ƒŽใฏใ€ใใฎ่‘—ๆ›ธใ€Œ่‡ช่ฆšใฎๅ“ฒๅญฆใ€ใฎไธญใงใ€ๆฌกใฎใ‚ˆใ†ใซๆ›ธใใพใ—ใŸใ€‚  
ใ€Œ็Ÿฅ่ญ˜ใ‚’ใ€่‡ชๅˆ†ใฎใ‚‚ใฎใจ่€ƒใˆใ‚‹ใ“ใจใซๆบ€่ถณใ—ใฆใ„ใ‚‹ใจใ€่‡ชๅทฑใฎ้™็•Œใซ็›ฎ่ฆšใ‚ใ‚‹ใ“ใจใ‚’ๅฟ˜ใ‚Œใฆใ—ใพใ†ใ€‚ใ—ใ‹ใ—ใ€ไป–่€…ใจใฎๅ”ๅŒใชใ—ใซใฏใ€่‡ชๅทฑใฎๆœฌๅฝ“ใฎ็†่งฃใซ้”ใ™ใ‚‹ใ“ใจใฏใงใใชใ„ใฎใ ใ€‚็Ÿฅ่ญ˜ใฏไป–่€…ใจ็›ธไบ’ใฎใ€ๅ”ๅŒใฎๅŠ›ใซใ‚ˆใฃใฆใ“ใใ€ๅพ—ใ‚‰ใ‚Œใ‚‹ใฎใงใ‚ใ‚‹ใ€‚ใ€(ๅผ•็”จ็ต‚ใ‚ใ‚Š)  
ใ“ใฎไธ€็ฏ€ใ‚’ใ€็งใŸใกใฏไปŠใ‹ใ‚‰ๅญฆใณ็›ดใ™ในใใงใ™ใ€‚ใใ—ใฆใ€ใ“ใ‚Œใ‹ใ‚‰ใฎ็คพไผšใ‚’ใƒชใƒผใƒ‰ใ™ใ‚‹ๅญใฉใ‚‚ใŸใกใซใ€ใใฎ่ƒฝๅŠ›ใ‚’ไผธใฐใ™ในใใ€
"""
text = "Socrates says"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        min_new_tokens=100,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)

"""
Socrates says: he thinks that philosophy, as opposed to myth, can be demonstrated; as opposed to poetry, that it is not possible to have knowledge of the unknowable (that is, neither by reason nor by any art of divination). So in this case he is in agreement with Socrates in not thinking that we could prove the existence of the gods or of fate. Now, I do not know the content of Xenophon's _Symposium_, but he must have made a point of this passage that has ex
"""
text = "def bubble_sort(array):"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=200,
        min_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.5,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
"""
def bubble_sort(array):
    for i in range(len(array)):
        for j in range(len(array)-1):
            if array[j] > array[j+1]:
                array[j], array[j+1] = array[j+1], array[j]
    return array

print(bubble_sort([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))

The code above will sort the array from 1 to 10 in the following order:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10

However, I am not sure how to do
"""

Tokenization

The model uses a sentencepiece-based tokenizer.

  • The tokenizer has a vocabulary size of 65,536.
  • It uses byte fallback to decompose unknown text pieces into UTF-8 byte pieces to avoid producing <UNK> tokens.
  • It can recognize consecutive whitespaces, newlines, and tabs to handle structured texts better.
  • We turned off the default behaviour of prepending leading whitespace because it is not beneficial for processing Japanese.
  • Specifically, a single whitespace is always processed as one token, so an English word does not get a preceding whitespace token like in many other tokenizers (e.g. _Hello).
    • This decision trades English processing efficiency for a unified way to treat whitespaces.
    • It leads to a significantly lower loss of next token prediction on English data because whitespaces are easy to predict.
  • Don't forget to set use_fast=False so that the above features work correctly (see the sketch after this list).
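
A quick way to check these behaviours is to inspect the tokenizer output directly. The snippet below is a minimal sketch; the exact token pieces it prints depend on the tokenizer, so treat them as illustrative.

# Minimal sketch: inspecting whitespace handling and byte fallback.
# The exact pieces printed depend on the tokenizer; outputs here are illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rinna/bilingual-gpt-neox-4b", use_fast=False)

# Consecutive whitespaces, newlines, and tabs are preserved as tokens.
print(tokenizer.tokenize("def f(x):\n    return  x\t+ 1"))

# No leading whitespace is prepended, so "Hello" is not encoded as "_Hello".
print(tokenizer.tokenize("Hello world"))

# Rare characters fall back to UTF-8 byte pieces instead of an <UNK> token.
print(tokenizer.tokenize("𝔘𝔫𝔦𝔠𝔬𝔡𝔢"))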

License

The MIT license
