---
title: German Llm Outputs
emoji: 🦀
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 4.12.0
app_file: app.py
pinned: false
license: mit
---

# Dataset

The dataset used is https://huggingface.co/datasets/lmsys/chatbot_arena_conversations

Preprocessing:

- filtered for German conversations
- took the first user prompt of each conversation
- dropped short prompts (70 characters or fewer)

```python
from datasets import load_dataset

dataset = load_dataset('lmsys/chatbot_arena_conversations')

def get_message(x):
    # keep only the first user turn of conversation_a
    x['message'] = [x['conversation_a'][0]]
    return x

dataset = dataset.filter(lambda x: x['language'] == 'German')
dataset = dataset['train'].map(get_message)
dataset = dataset.filter(lambda x: len(x['message'][0]['content']) > 70)
```

# Generation

I rely on the Hugging Face `conversational` pipeline to generate the outputs. There are some issues with the chat template (especially for the non-instruction-tuned models) that I'll fix later.

```python
import json
from pathlib import Path

from tqdm import tqdm
from transformers import pipeline

messages = json.loads(Path('messages.json').read_text())

outputs = []
pipe = pipeline(
    "conversational",
    model=model_name,   # model_name and device are defined elsewhere, per evaluated model
    torch_dtype="auto",
    device_map=device,
    max_new_tokens=1024,
    trust_remote_code=True,
)

for message in tqdm(messages):
    output = pipe([message])
    outputs.append(output)
```
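
The snippets above don't show how the filtered prompts end up in `messages.json`. A minimal sketch of that bridge step, assuming each row of the filtered dataset holds a single `{'role', 'content'}` dict under `message`, could look like this:

```python
import json
from pathlib import Path

# Hypothetical bridge step (not part of the original snippets): dump the
# filtered first user prompts so the generation script can load them.
messages = [row['message'][0] for row in dataset]  # one {'role', 'content'} dict per conversation
Path('messages.json').write_text(json.dumps(messages, ensure_ascii=False, indent=2))
```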
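
For completeness, here is one way the generated replies could be pulled out of the pipeline results and persisted. This is only a sketch: the exact return structure depends on the transformers version, and it assumes each `output` is (a one-element list of) a `Conversation` whose final message is the model reply; `outputs.json` is a hypothetical filename.

```python
import json
from pathlib import Path

# Hedged sketch: assumes each output item is a Conversation (or a
# one-element list of them) whose last message is the assistant reply.
results = []
for message, output in zip(messages, outputs):
    conv = output[0] if isinstance(output, list) else output
    results.append({
        'prompt': message['content'],
        'response': conv.messages[-1]['content'],
    })
Path('outputs.json').write_text(json.dumps(results, ensure_ascii=False, indent=2))
```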