# Large Language Model for Science: A Study on P vs. NP

## Abstract

In this work, we use large language models (LLMs) to augment and accelerate research on the P versus NP problem, one of the most important open problems in theoretical computer science and mathematics. Specifically, we propose Socratic reasoning, a general framework that promotes in-depth thinking with LLMs for complex problem-solving. Socratic reasoning encourages LLMs to recursively discover, solve, and integrate problems while facilitating self-evaluation and refinement. Our pilot study on the P vs. NP problem shows that GPT-4 successfully produces a proof schema and engages in rigorous reasoning throughout 97 dialogue turns, concluding "P neq NP", which is in alignment with (Xu and Zhou, 2023). The investigation uncovers novel insights within the extensive solution space of LLMs, shedding light on LLM for Science.

### Community

I'm not a LLM expert but I do know something about theoretical computer science. I cannot speak about the scientific merit of the procedures of querying and dialoguing GPT4. But, I would like to say that the claims about P and NP in the paper (Ex 1: Title of Section 3.2 "Co-proving [P vs NP] with GPT-4" and Ex 2: The reference in the abstract of some intersecting authors "which s in alignment with (Xu and Zhou, 2023)") are extremely likely to be incorrect and should be taken with a grain of salt. Both proofs (one given in this paper by GPT4 and in the Xu and Zhou paper of the authors) don't address known barriers in this question that must be encountered when resolving P vs NP, such as weaker complexity theory results that would be resolved along the way or how they go beyond very powerful proof techniques that were shown to not work. See https://scottaaronson.blog/?p=458.

All known barriers arise in current approaches of computational complexity theory. As mentioned cleraly in the abstract of Xu and Zhou's paper, their constructive approach for proving impossibility results is very different (and missing) from those currently used in computational complexity theory. So there is no need to address these barriers.

These authors mainly discuss the extreme hardness of exhaustive search, which is quite different from the concept of NP-completeness studied in computational complexity theory. I think that this might be another reason for not addressing these barriers.

The two concepts are intimately related. The (subset of) authors even claim in the abstract of the other paper that "we prove that such examples cannot be solved without exhaustive search, which is stronger than P != NP". Indeed, if one proves that "SAT requires exhaustive search" then of course it implies all SAT is not in P and so all NP-complete problems are also not in P. The suspect result seems to be Lemma 3.1 in the Xu and Zhou paper. It is stated in a very general and vague term. It claims that if the problem instance on n-1 variable takes T(n-1) time to ’solve’ and T(n-1) = O(d^{(n-1)*c}), then the problem on n variables takes O(d^c)*T(n-1) = O(d^{cn}) time to ’solve.’ 'Solve' seems to refer to outputting a satisfying assignment. (It cannot be outputting the set of all satisfying assignments since there could be Omega(d^(n-1)) of them.) But even given *a* solution to the instance on any n-1 variables, the lemma does not say at all how to extend this to a solution on n variables. This is non trivial. It could be possible in many instances that an arbitrary solution to the problem on n-1 variables will not even extend to a solution on n variables.

I think that the word ‘Solve’ here only means determining whether the instance is satisfiable because P, NP and NP-complete problems are all decision problems. There is no need to discuss outputting a satisfying assignment in this paper.

I also think that it is quite possible to obtain a strong theorem with no weaker or partial results. For example, does Gödel's incompleteness theorem have any weaker results?

Sure, even in the decision case, it's not clear to me if Lemma 3.1 (the main lemma) holds. My previous comments also hold: if you only know that an instance on n-1 variables is satisfiable, it doesn't necessarily imply that you know if an instance on n variables is satisfiable or not. Like I said, even assuming we get a satisfying assignment on n-1 vars, it is not clear how to extend it to n vars. It requires work.

In Lemma 3.1, the authors only discuss how many subproblems are needed to solve the original problem. They do not need to discuss how to combine these subproblems. It has nothing to do with the proof of Theorem 3.2.

The lemma claims that only A = O(d^c) number of subproblems on n-1 vars are needed to solve the full problem on n vars. To prove this, one must show how to obtain a solution (either a decision or an assignment) for the n variable problem given A different calls to n-1 var instances. To do so, one needs to show how to use the outputs of these A different calls to obtain the solution (either a decision or an assignment) on the n var case. The lemma seems to be one of the main tools in the final proof and it does not seem to hold under scrutiny (and is not very well defined).

As it is made very clear in the proof of Theorem 3.2, only the number of subproblems is sufficient to reach a contradicition. Moreover, the instances are generated with a symmetry requirement on the domain values of each variable. So there is no strategy to choose subproblems (assigning values to variables ) and there is no need to discuss how a solution is obtained.

The proof of Theorem 3.2 is based on Theorem 2.5: "There exists an infinite set of satisfiable and unsatisfiable instances of Model RB such that this set is a fixed point under the symmetry mapping of changing satisfiability". With this result, the upper bound on the number of subproblems is enough to reach a contradiction by the method of diagonalization and self-reference. I copy the text of the authors' paper here: "we show that unless exhaustive search is executed, the satisfiability of a certain constraint (thus the whole instance) is possible to be changed, while the subproblems of the whole instance remain unchanged."

So I think that the authors do not need to discuss the detail of how to solve the problem. I also think that they should not discuss this if they want to show that their results hold for general algorithms.

As far as I can see, Lemma 3.1 is an interesting novelty of this paper. It establishes a connection between proving lower bounds and the analysis of algorithms. With this lemma and Theorem 2.5, the authors can prove by contradiction that general algorithms cannot work without going into the details of these algorithms.

From that paper: "Thus, it is reasonable to assume that every exact algorithm for solving CSP has an equivalent divide-and-conquer algorithm. [...] Under this assumption, we have the following lemma. [Lemma 3.1]".

It's extremely common for papers claiming a proof of P vs NP to (implicitly or explicitly) make an assumption on the structure of an algorithm solving some NP complete problem. Unfortunately, the difficulty of P vs NP arises from it being a claim about *all* algorithms, even ones very different from anything we've thought of yet. Neither this paper nor the one cited in the abstract overcome this barrier.

Primality testing and fast matrix multiplication are two examples of problems where dramatically different/new algorithms were much faster than predecessors. Who's to say that CSP isn't going to be another such problem?

Like many combinatorial problems, the general CSP has no global structure that can be exploited to design algorithms. This is quite different from algebric problems that are structured and can be used to design algorithms. These problems include the problems you mentioned, primality testing and fast matrix multiplication, as well as the famous integer factoring problem which has not been proven to be P or NP-complete. In contrast, every CSP has been proved to be either P or NP-complete. Therefore, it is reasonable to assume for CSP (combinatorial problems), while it is not reasonable to assume for algebraic problems. I think that this is exactly the difference between these two kinds of problems.

the general CSP has no global structure that can be exploited to design algorithms.

This has appeared true so far, but we need to prove it holds. That's the essence of P vs NP.

The halting problem is uncomputable in Turing machines and cannot be proved to be uncomputable in any machines without the Church-Turing thesis. In addition, Gödel's incompleteness theorem is also based on assumptions about what is a formal proof. Such kind of impossibility results touch the unknown world where assumptions are indispensable. I think that the essence of P vs NP is to establish a proof based on a reasonable assumption.

From the perspective of algorithm design, when proving the lower bound for a problem, taking the sorting problem as an example, we must first make an assumption: any two elements are comparable. In this way, we can obtain a lower bound of n*log(n) for sorting algorithms. Therefore, I believe that making reasonable common-sense assumptions is often necessary when proving the hardness of a problem.

I read the paper "Large Language Model for Science: A Study on P vs. NP" and found that the authors gave a good explanation for why their approach works while previous attempts have failed. I copy the text in Section 3.1 as follows.

The almost independent solution space is a unique property of Model RB (Xu and Li, 2000), i.e., the intuition of choosing Model RB in 15. Intuitively, an almost independent solution space means that the state of any assignment in the solution space (whether or not it is a solution) is uncorrelated to the majority of other assignments. When the parameter d in Model RB is large enough (the intuition of letting d and n tend to infinity in 31), the solution space of Model RB is almost independent. However, most problems, such as 2-SAT or 3-SAT problems, do not possess the property of almost independent solution space. In other words, the assignments in their solution spaces are correlated to varying degrees. This difference in the degree of correlation makes it difficult to distinguish quantitatively between 2-SAT and 3-SAT. It also explains why prior attempts to distinguish between P problems and NP problems have failed, consequently complicating proving that P does not equal NP.

My name is Sandeep Silwal and I'm a trained theory computer scientist at MIT. In fact, I even TA'ed a complexity theory course before. Both of this papers do not pass the bar for rigor needed to claim to have resolved P vs NP. P vs NP is a mathematical statement in the formal computation model of turing machines. NP is the set of decision problems solvable in poly time by a nondeterministic Turing machine and P is the corresponding set for deterministic Turing machines. It asks if there is a language in NP which is not in P. Adding any assumptions about what an algorithm (akaTuring machine) 'must do' invalidates all claims of resolving P vs NP. Even if the assumption is 'reasonable', the statement is not about 'reasonable' algorithms, whatever that may be. It is about producing a concrete member of NP for which there exists no deciding deterministic Turing machine. For all we know, there could be a very `unreasonable' algorithm for solving it which hasn't been discovered yet. This means the claims made in this paper and in the one cited in the abstract are both false. In fact, neither papers even mention Turing machines! If THE most important problem in computer science (which comes with a million dollar prize) had been resolved by the Xu and Zhou paper (this paper claims to recreate this very same proof using GPT4!), then there would be much more excitement and attention. To the best of my knowledge, no prominent computer scientist or CS theorist or complexity theorist has come out in support of these papers. I am happy to have future good faith discussions with the *authors* of either papers, not anonymous accounts since many posts above have already laid out the problems present in the reasoning.

Why am I doing this? Normally there are 100s of claimed proofs of P vs NP (just see the many claims posted on vixra.org). As members of the ML and wider academic community, we must actively work against unreasonable and hype publications, even if it is produced by famous authors with thousands of citations. We must have honest academic discourse.

P and NP are defined by Turing machines. Below is the second paragraph of Xu and Zhou's paper. The authors made it clear that they studied another problem not the P/NP problem.

As shown in the proof of Godel’s well-known incompleteness theorem [22], the constructive approach plays an indispensable role in revealing the fundamental limitations of finite formal systems. An algorithm can be regarded as a finite formal system to deduce conclusions step by step. In this paper, we will study the limitations of algorithms based on the constructive approach. Specifically, we will investigate whether it is always possible to develop an algorithm that can replace the naive exponential exhaustive search. The Strong Exponential Time Hypothesis [23] conjectures that no such algorithm exists for SAT. This problem is similar to

that proposed by David Hilbert in the early period of 20th century if it is always possible to construct a finite formal system that can replace an area of mathematics (arithmetic for example) containing infinitely many true mathematical statements. These two problems ask essentially the same philosophical question: whether the part can always replace the whole within the limits of reasoning. We think that it is more fundamental and has a higher priority than whether an efficient (polynomial-time) algorithm exists.

The assumption in Section 3 is about how the problem is solved, whoich does not depend on which computational model is to be used. So I think that the authors do not need to mention Turing machines.

"The authors made it clear that they studied another problem not the P/NP problem." This is not true. Both papers clearly indicate they are studying the P vs NP problem. In the LLM paper, it clearly states "In total, the 97 turns collectively construct extremely hard NP-complete problems,

where some instances are unsolvable within a time complexity lower than O(2^n) (i.e., exhaustive search). In other words, the proof concludes P != NP. " The Xu Zhao paper (which the LLM paper is based off of), clearly states " it follows that P != NP".

Any appeal to philosophy or intuition is not rigorous and must be backed by mathematical arguments. Turing machines are fundamentally tied to P vs NP (it's in the definition of the problem!). I think I have given enough evidence that the claims of both papers are misleading and exaggerated. Again, let me repeat: As members of the ML and wider academic community, we must actively work against unreasonable and hype publications, even if it is produced by famous authors with thousands of citations. We must have honest academic discourse.

Both papers are about proving that exhaustive search is unavoidable for extremely hard examples. This problem is different from the P/NP problem and their proof approaches are completely different. The former problem is very similar to Gödel's incompleteness theorem. So the proof approach is also similar to that used by Gödel, which has never been used in computational complexity theory. I think that this is an interesting novelty of Xu and Zhou's paper and worth more attention.

In previous studies, people tried to prove P!=NP by showing that NP-complete problems (such as 3-SAT) cannot be solved in polynomial time. In the two papers, the authors tried to prove that long-clause SAT cannot be solved without exhaustive search, which is called the Strong Exponential Time Hypothesis. They are two different problems and surprisinly, the latter is much easier to prove than the former.

The main part of the two papers is about how to construct extremely hard examples. The authos use these examples to explain and prove the existence of extreme hardness. I think that the discussion should be relevant to this point, not just P!=NP.

I read the paper "Large Language Model for Science: A Study on P vs. NP" and found that the authors gave a good explanation for why they did not study computational hardness based on the concept of NP-completeness. I copy the text in Section 3.1 as follows.

However, the reasons why some problems are hard have not been thoroughly studied. For a long time, NP-completeness has been the core concept of computational complexity theory. NP-complete problems are defined by reductions, which is instrumental in establishing the tower of computational complexity theory. Yet, reductions depend on skills and vary among problems, lacking a unified pattern. Fundamentally, reductions classify problems according to relative hardness and cannot provide intuitive explanations for why a problem is hard. Without such explanations, it is hard to determine the hardness degree of NP-complete problems. Many think that computational complexity theory is hard because it studies hardness. In our opinion, the pursuit of reductions rather than explanations is the root cause of the difficulties faced by computational complexity theory after its initial flourishing. In a nutshell, the successes or failures of computational complexity theory are all attributed to reductions.

At this point, understanding computational hardness without relying on reductions is crucial for breakthroughs. Exhaustive search signifies extreme hardness. The extreme hardness of exhaustive search, instead, can be intuitively explained - exhaustive search implies that there is no way to prune the search space when solving the problem (the solution space is almost independent). Therefore, the problem is hard to solve.

I have carefully read two papers. I can talk about how this work escapes the two main barriers in computational complexity. The relativization barrier appears when the diagonalization method is used to distinguish between P and NP. In this work, the diagonalization method is used to distinguish between satisfiable and unsatisfiable examples. So it escapes the relativization barrier. For the natural proof barrier, the diagonalization method can escape it. Therefore, the use of the diagonalization method on constructed examples can escape both barriers. This is exactly what these authors have done in their work. It is indeed a great innovation!

Indeed, there have been numerous attemps at proving P!=NP. However, all these attempts cannot explain why their methods of proving computational hardness work for NP-complete problems such as 3-SAT, but fails for P problems such as 2-SAT and XOR-SAT.

This paper does not have this problem because its proof approach seems to work only for Model RB. I think that this might be the biggest difference from previous studies and is also a favourable evidence for validity.

TLDR: the claims of this paper about resolving P vs NP (with GPT4!) are highly likely to be misleading and incorrect. Proceed with caution and use good judgement.

It seems like (possibly GPT generated) responses that have little understanding of what they are saying has flooded this thread. Let me address them once and for all. What exactly is this paper trying to prove?

- They show P != NP?
- They show the Strong Exponential Time Hypothesis (SETH) is true?
- They show some other undefined complexity theory lower bound?

Let me address them.

Cannot be true because (among many other plausible reasons such as those listed here https://scottaaronson.blog/?p=458), the papers are clearly state they are making an assumption about what a potential algorithm to resolve some conjectured hard problem 'must do'. Indeed from the Xu Zhao paper (this paper claims to recreate their proof via GPT) "Thus, it is reasonable to assume that every exact algorithm for solving CSP has an equivalent divide-and-conquer algorithm. [...] Under this assumption, we have the following lemma. [Lemma 3.1]". You cannot assume an algorithm "has" to do something. That is not a valid proof.

Cannot be true because SETH implies P != NP. It's not clear if the posters above were aware of this.

This cannot be true because both papers explicitly claim to resolve P vs NP. The LLM paper states "In total, the 97 turns collectively construct extremely hard NP-complete problems,

where some instances are unsolvable within a time complexity lower than O(2^n) (i.e., exhaustive search). In other words, the proof concludes P != NP. " The Xu Zhao paper (which the LLM paper is based off of), clearly states " it follows that P != NP".

Lots of comments from anonymous accounts that agree with each other and sound like GPT does not change the situation. Unless further evidence is presented, I believe we should understand that the psuedo-theory comments posted above are endorsed by the authors. In which case, I urge the others to engage in honest discourse. I believe any reasonable scientist or researcher can just skim the discussions so far and be made aware of the potential and likely problems with the grand claims of this paper.

This paper **does not impose any specific constraints on the forms of the algorithms**; instead, the proof merely utilizes the upper bound of the number of sub-problems. The "assumption" is quite mild and doesn't restrict the underlying algorithms at all.

Quoting the Xu Zhao paper (this LLM paper claims to recreate their proof via GPT) "Thus, it is reasonable to assume that every exact algorithm for solving CSP has an equivalent divide-and-conquer algorithm .. Under this assumption, we have the following lemma"

Even though it is 'reasonable' and 'mild', it is not mathematicallly formal or rigorous. An algorithm does not have to be of this form; it can do anything a turing machine can do! Proving P != NP means one has to rule out all algorithms, not just 'reasonable' ones, meaning the claims of both papers about resolving P != NP are not sound.

It's about dividing the subproblems rather than restricting the specific algorithms as you misunderstood. If you read the proof carefully, you could find that the proof merely utilizes the upper bound of the number of sub-problems without assuming the forms of algorithms.

When analyzing the lower bound for sorting problems, people also make assumptions about the form of the sorting algorithm -- assuming that it is a comparison-based algorithm. This leads to the lower bound of n*log(n). It is the same for CSPs here.

The theorem statement for the sorting lower bounds say 'under the comparison model, sorting n elements takes Omega(n log n) comparisons. I.e, it is very clear about what model and assumptions it is working under. Indeed under different models, sorting can be done faster (https://en.wikipedia.org/wiki/Radix_sort). When resolving P vs NP, you have to work under the Turing model of computation. This is how the problem is defined. Making any assumption about what the algorithm 'must do' as done in these works, invalidates any claims of resolving P vs NP.

The halting problem is uncomputable in Turing machines and cannot be proved to be uncomputable in any machines without the Church-Turing thesis. In addition, Gödel's incompleteness theorem is also based on assumptions about what is a formal proof. Such kind of impossibility results touch the unknown world where assumptions are indispensable. I think that the essence of P vs NP is to establish a proof based on a reasonable assumption.

The Socratic method was introduced as a guide for interactions with LLM in the work I published this past March. Additionally, the SocraSynth website has compiled a list of pertinent papers. Please ensure that you reference appropriate sources and avoid claiming originality without duly acknowledging the contributions of others.

www.socrasynth.com

Full paper: Prompting Large Language Models With the Socratic Method, March 2023.

This paper presents a systematic approach to using the Socratic method in developing prompt templates that effectively interact with large language models, including GPT-3. Various methods are examined, and those that yield precise answers and justifications while fostering creativity and imagination to enhance creative writing are identified. Techniques such as {\em definition}, {\em elenchus}, {\em dialectic}, {\em maieutics}, {\em generalization}, and {\em counterfactual reasoning} are discussed for their application in engineering prompt templates and their connections to inductive, deductive, and abductive reasoning. Through examples, the effectiveness of these dialogue and reasoning methods is demonstrated. An interesting observation is made that when the task's goal and user intent are conveyed to GPT-3 via ChatGPT before the start of a dialogue, the large language model seems to connect to the external context expressed in the intent and perform more effectively.

## Models citing this paper 0

No model linking this paper

## Datasets citing this paper 0

No dataset linking this paper