Unusual behaviour during inference with the transformers library

#15
by vmajor - opened

I originally posted this on the neox GitHub page and was politely told to ask here.

Describe the bug:

Inference output does not seem sensible; see a sample output below:

You: what are large language models?
Bot: You: what are large language models? Small language models? Are one size fits all?
<dholbach> https://wiki.ubuntu.com/UbuntuGlobalJam explains the scope of it
<nigelb> dholbach: oooh, right. /me is a bit slow
<dholbach> it's a place where anyone can do anything they like (if some rules aren't broken)
<dholbach> it's a place where people can get together and work on stuff they care about
<dholbach> if you're doing app development, if you love playing with the phone or the new tablet, or if you love doing advocacy, testing, whatever, you can do it there
<dholbach> https://wiki.ubuntu.com/UbuntuGlobalJam has more info on how you can get involved
<dholbach> https://spreadsheets.google.com/spreadsheet/ccc?key=0AkEUPNDy0YB1dDJpdE90QHVvUHZZRXBwRUhBQmdC&hl=en_US#gid=1 has a list of some ideas
<dholbach> a few ideas that folks have came up with are:
<dholbach>  - a quiz with 5 questions, 1 for each day of UGJ - people can take a photo after completing the quiz and email it to the team
You:

To Reproduce
Steps to reproduce the behavior:
Run this code:

# Import the transformers library
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

# Load the tokenizer and model for gpt-neox-20b
model = GPTNeoXForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer = GPTNeoXTokenizerFast.from_pretrained("EleutherAI/gpt-neox-20b")

# Start a loop to get user input and generate chatbot output
while True:
    # Get user input
    user_input = input("You: ")
    
    # Break the loop if user types "quit"
    if user_input.lower() == "quit":
        break
    
    # Add a prompt to the user input
    prompt = "You: " + user_input
    
    # Encode the prompt using the tokenizer
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    
    # Generate chatbot output using the model
    bot_output_ids = model.generate(
        input_ids,
        do_sample=True,
        temperature=0.9,
        max_length=300,
        pad_token_id=tokenizer.eos_token_id
    )
    
    # Decode chatbot output ids as text
    bot_output = tokenizer.decode(bot_output_ids[0], skip_special_tokens=True)
    
    # Print chatbot output
    print("Bot:", bot_output)

Then ask: what are large language models?
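
Note that model.generate returns the prompt tokens followed by the continuation, which is why the sample output above begins with "Bot: You: what are large language models?". A minimal sketch of decoding only the newly generated tokens, reusing the input_ids and bot_output_ids from the loop above:

# Slice off the prompt tokens so the prompt is not echoed back in the reply
prompt_length = input_ids.shape[1]
bot_output = tokenizer.decode(
    bot_output_ids[0][prompt_length:],
    skip_special_tokens=True
)
print("Bot:", bot_output)

This only removes the echo; it does not by itself make the continuation more sensible.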

Expected behavior:
A sensible answer of some kind.

Environment:

GPUs: 0
CPU only
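
For reference, the 20B checkpoint in the default float32 precision is roughly 20B parameters x 4 bytes = ~80 GB of weights, so a CPU-only run is very memory-hungry. A sketch of loading in lower precision to roughly halve that, assuming a PyTorch build with bfloat16 support on CPU:

import torch
from transformers import GPTNeoXForCausalLM

# Load the weights in bfloat16 instead of the default float32
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.bfloat16
)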

There is something wrong with the model. Here is its response to a stacked query:

Query: "What is the highest mountain in the world? Tell me the height in meters."

Response: "This is my code:
import java.io.*;
import java.util.*;

public class Main {

public static void main(String[]"
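
To rule out sampling randomness as the cause, the same query can be re-run with greedy decoding (do_sample=False), which is deterministic across runs. A sketch reusing the model and tokenizer loaded above:

# Greedy decoding: deterministic, so repeated runs give identical output
input_ids = tokenizer(
    "What is the highest mountain in the world? Tell me the height in meters.",
    return_tensors="pt"
).input_ids
output_ids = model.generate(
    input_ids,
    do_sample=False,
    max_length=100,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))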
