german_llm_outputs / README.md
Florian Leuerer
README
510e903
metadata
title: German Llm Outputs
emoji: 🦀
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 4.12.0
app_file: app.py
pinned: false
license: mit

Dataset

The dataset usesd is https://huggingface.co/datasets/lmsys/chatbot_arena_conversations

Preprocessing:

  • filtered german conversations
  • took first user prompt
  • deleted short prompts (less than 70 chars)
dataset = load_dataset('lmsys/chatbot_arena_conversations')

def get_message(x):
    x['message'] = [x['conversation_a'][0]]
    return x

dataset = dataset.filter(lambda x: x['language'] == 'German')
dataset = dataset['train'].map(get_message)
dataset = dataset.filter(lambda x: len(x['message'][0]['content']) > 70)

Generation

I rely on the huggingface conversational pipeline to generate the outputs. There are some issues with the chat template (esp. for the non-instruction tuned models) i'll fix later.

messages = json.loads(Path('messages.json').read_text())
outputs = []
pipe = pipeline(
    "conversational",
    model=model_name,
    torch_dtype="auto",
    device_map=device,
    max_new_tokens=1024,
    trust_remote_code=True
)

for message in tqdm(messages):
    output = pipe([message])
    outputs.append(output)