metadata

license: mit
datasets:
  - cerebras/SlimPajama-627B
  - oscar-corpus/OSCAR-2301
  - bigcode/starcoderdata
language:
  - fr
  - en
pipeline_tag: text-generation
tags:
  - legal
  - art
  - code
  - finance
  - medical
  - text-generation-inference

CroissantLLM: A not so flaky bilingual 1.3B model

An experimental model trained on a small subset of the final data.

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "croissantllm/base_50k"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

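# Few-shot translation prompt (English -> French); the last line is left open for the model to complete
inputs = tokenizer("His name is Bob. -> Il s'appelle Bob.\nHe is heading to the market. -> Il va au marché.\nWe are heading to the beach, let's go together. ->", return_tensors="pt").to(model.device)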
tokens = model.generate(**inputs, max_length=100, do_sample=True, top_p=0.95, top_k=60, temperature=0.5)
print(tokenizer.decode(tokens[0]))

# Second example: omit the BOS token by disabling special tokens at tokenization time
inputs = tokenizer("France -> Paris, Italie -> Rome, Allemagne -> Berlin, Espagne ->", return_tensors="pt", add_special_tokens=False).to(model.device)
tokens = model.generate(**inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
print(tokenizer.decode(tokens[0]))
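Both prompts above follow the same few-shot pattern: a series of `source -> target` pairs, then a final source with the target left open. A minimal sketch of a helper that assembles such prompts is given below. The function name `build_few_shot_prompt` and its parameters are hypothetical and are not part of the model or of `transformers`.

```python
def build_few_shot_prompt(examples, query, arrow=" -> ", sep="\n"):
    """Join (source, target) pairs and leave the final target open.

    Hypothetical helper for illustration only; it just builds a string.
    """
    lines = [f"{src}{arrow}{tgt}" for src, tgt in examples]
    # Strip the trailing space so the prompt ends right after the arrow,
    # as in the hand-written prompts above.
    lines.append(f"{query}{arrow}".rstrip())
    return sep.join(lines)


# Rebuilds the translation prompt from the usage example.
translation_prompt = build_few_shot_prompt(
    [
        ("His name is Bob.", "Il s'appelle Bob."),
        ("He is heading to the market.", "Il va au marché."),
    ],
    "We are heading to the beach, let's go together.",
)

# Rebuilds the capital-city prompt, which separates pairs with ", ".
capital_prompt = build_few_shot_prompt(
    [("France", "Paris"), ("Italie", "Rome"), ("Allemagne", "Berlin")],
    "Espagne",
    sep=", ",
)
```

The resulting strings can be passed to `tokenizer(...)` exactly as in the usage example.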