license: mit
datasets:
- cerebras/SlimPajama-627B
- oscar-corpus/OSCAR-2301
- bigcode/starcoderdata
language:
- fr
- en
pipeline_tag: text-generation
tags:
- legal
- art
- code
- finance
- medical
- text-generation-inference
CroissantLLM: A not so flaky bilingual 1.3B model
An experimental model trained on a small subsplit of the final data.
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "croissantllm/base_50k"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("His name is Bob. -> Il s'appelle Bob.\nHe is heading to the market. -> Il va au marché.\nWe are heading to the beach, let's go together. ->", return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_length=100, do_sample=True, top_p=0.95, top_k=60, temperature=0.5)
print(tokenizer.decode(tokens[0]))
# Pass add_special_tokens=False so the tokenizer does not prepend the BOS token
inputs = tokenizer("France -> Paris, Italie -> Rome, Allemagne -> Berlin, Espagne ->", return_tensors="pt", add_special_tokens=False).to(model.device)
tokens = model.generate(**inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
print(tokenizer.decode(tokens[0]))
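
The translation prompt above follows a simple few-shot pattern: one "source -> target" pair per line, then the sentence to translate followed by an open "->". A minimal sketch of a helper that assembles such prompts is shown below. The name `build_few_shot_prompt` and its `separator` parameter are hypothetical and are not part of transformers or of the CroissantLLM release.

```python
def build_few_shot_prompt(examples, query, separator=" -> "):
    """Format (source, target) pairs, then an open-ended line for the query.

    Hypothetical helper for illustration only; it just builds a string.
    """
    lines = [f"{source}{separator}{target}" for source, target in examples]
    # Leave the final line open, without a trailing space, for the model to complete.
    lines.append(f"{query}{separator.rstrip()}")
    return "\n".join(lines)


examples = [
    ("His name is Bob.", "Il s'appelle Bob."),
    ("He is heading to the market.", "Il va au marché."),
]
prompt = build_few_shot_prompt(
    examples, "We are heading to the beach, let's go together."
)
print(prompt)
```

The resulting string can be passed to `tokenizer(prompt, return_tensors="pt")` in place of the hard-coded prompt in the usage snippet.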