Spaces:

floleuerer
/

german_llm_outputs

Running

File size: 1,235 Bytes

---
title: German Llm Outputs
emoji: 🦀
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 4.12.0
app_file: app.py
pinned: false
license: mit
---

# Dataset

The dataset usesd is https://huggingface.co/datasets/lmsys/chatbot_arena_conversations 

Preprocessing:
- filtered german conversations
- took first user prompt
- deleted short prompts (less than 70 chars)

```python
dataset = load_dataset('lmsys/chatbot_arena_conversations')

def get_message(x):
    x['message'] = [x['conversation_a'][0]]
    return x

dataset = dataset.filter(lambda x: x['language'] == 'German')
dataset = dataset['train'].map(get_message)
dataset = dataset.filter(lambda x: len(x['message'][0]['content']) > 70)
```

# Generation

I rely on the huggingface `conversational` pipeline to generate the outputs. There are some issues with the chat template (esp. for the non-instruction tuned models) i'll fix later.

```python
messages = json.loads(Path('messages.json').read_text())
outputs = []
pipe = pipeline(
    "conversational",
    model=model_name,
    torch_dtype="auto",
    device_map=device,
    max_new_tokens=1024,
    trust_remote_code=True
)

for message in tqdm(messages):
    output = pipe([message])
    outputs.append(output)
```