Spaces:

floleuerer
/

german_llm_outputs

Running

german_llm_outputs / README.md

Florian Leuerer

README

510e903 7 months ago

No virus

1.24 kB

	---
	title: German Llm Outputs
	emoji: 🦀
	colorFrom: green
	colorTo: pink
	sdk: gradio
	sdk_version: 4.12.0
	app_file: app.py
	pinned: false
	license: mit
	---

	# Dataset

	The dataset usesd is https://huggingface.co/datasets/lmsys/chatbot_arena_conversations

	Preprocessing:
	- filtered german conversations
	- took first user prompt
	- deleted short prompts (less than 70 chars)

	```python
	dataset = load_dataset('lmsys/chatbot_arena_conversations')

	def get_message(x):
	x['message'] = [x['conversation_a'][0]]
	return x

	dataset = dataset.filter(lambda x: x['language'] == 'German')
	dataset = dataset['train'].map(get_message)
	dataset = dataset.filter(lambda x: len(x['message'][0]['content']) > 70)
	```

	# Generation

	I rely on the huggingface `conversational` pipeline to generate the outputs. There are some issues with the chat template (esp. for the non-instruction tuned models) i'll fix later.

	```python
	messages = json.loads(Path('messages.json').read_text())
	outputs = []
	pipe = pipeline(
	"conversational",
	model=model_name,
	torch_dtype="auto",
	device_map=device,
	max_new_tokens=1024,
	trust_remote_code=True
	)

	for message in tqdm(messages):
	output = pipe([message])
	outputs.append(output)
	```