inflaton-ai/logical-reasoning · Apply for community grant: Academic project (gpu)

The Turtle Soup game, popularized by the book Challenging Lateral Thinking Puzzles [5], is a mystery puzzle that has captivated youths worldwide. The game gets its name from a classic problem in the genre: a man orders turtle soup and subsequently commits suicide. Players must deduce the rationale behind the suicide by asking questions that can only be answered with “yes” or “no.” These puzzles encourage lateral thinking, pushing players to break free from conventional thought patterns and approach problems from new perspectives, especially when situations seem initially nonsensical or incomprehensible. A sample dialogue illustrates this concept:
GAME MASTER: On an express train, a man boarded halfway through the journey. After the doors closed, he suddenly regained consciousness and began observing the faces of the passengers around him. “With all due respect, are you 29 years old?” “Yes, but how do you know?” Ignoring him, the man continued speaking to others. “You are 45 years old, right?” “That’s correct.” “Are you 81 years old?” “How do you know?” After asking around, the man returned to his seat, sat down quietly, and began to cry in despair. Why is that?
PLAYER: Is he dreaming?
GAME MASTER: No.
PLAYER: Does he know these people?
GAME MASTER: No.
PLAYER: Does he have superpowers?
GAME MASTER: Yes.
PLAYER: He is an angel condemned to live as a human.
GAME MASTER: Congratulations! You guessed the answer.
The original answer is that the man can see how long each person will live, but he cannot see his own lifespan. Every lifespan he saw matched the passenger's current age, meaning each passenger's life is about to end: the train is going to crash, and he will die soon too. The game is over.
The advent of LLMs has led to a surge in online Turtle Soup games, leveraging these models’ language understanding and logical reasoning capabilities. This study presents a comprehensive evaluation of the latest generation of open-source LLMs, such as Qwen2.5 and Llama3.1, alongside OpenAI’s most recent models, in tackling complex and abstract logical reasoning tasks like those found in the Turtle Soup Question Answering (TSQA) task. Our research not only highlights the strengths and limitations of different approaches (few-shot prompting and fine-tuning) but also challenges the notion that proprietary models are inherently superior. Our findings demonstrate that open-source models can match, and in some cases exceed, the performance of established models like OpenAI’s GPT-4o.
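To make the few-shot prompting approach concrete, the following is a minimal sketch of how a TSQA prompt might be assembled from solved questions. The `Example` fields, the instruction wording, the abbreviated puzzle text, and the `build_few_shot_prompt` helper are all assumptions introduced for illustration; they are not the prompt or data format used in this study.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One solved question for a puzzle (hypothetical structure)."""
    puzzle: str    # the story shown to the player
    truth: str     # the hidden answer known only to the game master
    question: str  # the player's question
    answer: str    # the game master's reply, "Yes" or "No"


def build_few_shot_prompt(examples, puzzle, truth, question):
    """Assemble instruction, solved examples, then the new, unanswered case."""
    parts = [
        "You are the game master of a Turtle Soup puzzle. "
        "Answer the player's question with only Yes or No."
    ]
    for ex in examples:
        parts.append(
            f"Puzzle: {ex.puzzle}\nTruth: {ex.truth}\n"
            f"Player: {ex.question}\nGame master: {ex.answer}"
        )
    # The final block is left open so the model completes the answer.
    parts.append(
        f"Puzzle: {puzzle}\nTruth: {truth}\n"
        f"Player: {question}\nGame master:"
    )
    return "\n\n".join(parts)


# Abbreviated versions of the sample puzzle above (assumed wording).
PUZZLE = "A man on a train guesses the passengers' ages, then cries in despair."
TRUTH = "He can see how long each person will live, but not his own lifespan."

examples = [
    Example(PUZZLE, TRUTH, "Is he dreaming?", "No"),
    Example(PUZZLE, TRUTH, "Does he have superpowers?", "Yes"),
]

prompt = build_few_shot_prompt(
    examples, PUZZLE, TRUTH, "Does he know these people?"
)
print(prompt)
```

The resulting string could be sent to any of the evaluated models as a single user message. Fine-tuning, by contrast, would train on such question and answer pairs directly instead of placing them in the prompt.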