
NegotiateAI: Teaching LLMs to Win at Enterprise Procurement

Meta PyTorch OpenEnv Hackathon | April 2026


The Problem

Every procurement manager knows the feeling. You have 5 suppliers, 12 open requirements, a budget that is already stretched, and three deadlines hitting this week. You need to negotiate hard, but not so hard that the supplier walks. You need to defer some items, but not the critical ones. And you need to do all of this simultaneously, under pressure, with incomplete information.

Current LLMs cannot do this. They can write an email about negotiation. They can explain what a purchase order is. But put them in a live negotiation with real constraints and real consequences, and they fall apart.

We built NegotiateAI because we wanted to fix that.


The Environment

NegotiateAI is an adversarial procurement arena built on the OpenEnv framework. The agent steps into the shoes of a procurement manager. It sees a live dashboard of suppliers, requirements, budgets and deadlines. It chooses from seven real procurement actions:

| Action | Description |
| --- | --- |
| `negotiate` | Open or counter a price with a supplier |
| `award_contract` | Accept terms and lock in a supplier |
| `raise_pr` | Submit a formal purchase requisition |
| `defer` | Push a decision to the next planning cycle |
| `reject` | Walk away from a supplier |
| `hedge` | Split an order across two suppliers to reduce risk |
| `escalate` | Bring in senior management for high-stakes decisions |

Suppliers push back. Prices fluctuate. Deadlines expire. The agent lives with the consequences of every decision it makes.

The reward signal captures what actually matters in procurement: fulfilling critical requirements on time, staying within budget, and avoiding costly deadline failures. Three difficulty levels push the agent from structured scenarios all the way to full adversarial arena conditions.
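A minimal sketch of how such an outcome-level score could be composed from those three ingredients. The function name, weights, and penalty term are illustrative assumptions, not the environment's actual reward formula:

```python
def episode_score(critical_fulfilled: int, critical_total: int,
                  spend: float, budget: float, missed_deadlines: int) -> float:
    """Hypothetical composite score: fulfilment + budget health - deadline misses.

    Weights (0.5 / 0.5 / 0.1 per miss) are illustrative assumptions.
    """
    fulfilment = critical_fulfilled / max(critical_total, 1)
    budget_health = 1.0 if spend <= budget else budget / spend  # penalise overruns
    deadline_penalty = 0.1 * missed_deadlines                   # costly failures
    return max(0.0, 0.5 * fulfilment + 0.5 * budget_health - deadline_penalty)
```

Under this sketch, fulfilling every critical requirement on time and under budget scores 1.0, while deadline misses and budget overruns drag the score toward 0.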


The Training

We trained a Llama 3.2 3B model using GRPO (Group Relative Policy Optimisation) via HuggingFace TRL on an NVIDIA A100 80GB. Training data was collected live from the running environment across two difficulty levels, not from a static dataset.

Phase 1: Easy Negotiation

200 episodes generated 1,333 training samples from real environment interactions. The reward function maintained a 513x separation between valid procurement actions (0.0513) and invalid ones (0.0001), giving GRPO a clear gradient signal.
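GRPO needs no learned value function: it samples a group of rollouts per prompt and normalises each reward against the group's mean and standard deviation. A self-contained sketch of that group-relative advantage (the function name is ours):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalise each reward against its group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# With the 513x gap above, a group mixing valid (0.0513) and invalid
# (0.0001) actions produces sharply separated advantages.
adv = group_relative_advantages([0.0513, 0.0513, 0.0001, 0.0513])
```

This is why the wide valid/invalid reward gap matters: after normalisation, the invalid action receives a strongly negative advantage while valid actions are pushed up.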

The environment's curriculum engine advanced through all difficulty tiers naturally as performance improved:

  • Episode 35: advanced to Apprentice
  • Episode 59: advanced to Practitioner
  • Episode 87: advanced to Expert (43% of episodes at Expert tier)

GRPO training improved reward from 0.0068 to 0.0073 (+8.3%) over 600 steps.

**Curriculum Progression.** Rolling average reward across 200 episodes. The agent progressed Novice → Apprentice (ep 35) → Practitioner (ep 59) → Expert (ep 87).

**GRPO Training Results.** Step-level rewards and rolling average during GRPO training on 1,333 training samples.

Phase 2: Medium Adversarial

Following easy negotiation training, the model was exposed to `medium_adversarial` scenarios: 12 suppliers including deceptive agents, a rival buyer, and mid-game supply disruptions. 100 episodes generated 1,829 training samples for continued fine-tuning over 150 steps at a learning rate of 2.5e-6.
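For reference, a continued fine-tuning setup along these lines might look as follows with HuggingFace TRL's GRPO trainer. This is a hedged sketch, not the project's actual script: only the learning rate and step count come from the numbers above, `num_generations` is an assumption, and model, dataset, and reward-function loading are omitted.

```python
from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="negotiateai-medium-adversarial",  # hypothetical name
    learning_rate=2.5e-6,  # lower LR for continued fine-tuning (Phase 2)
    max_steps=150,         # as reported for Phase 2
    num_generations=8,     # completions sampled per prompt (assumption)
)

# trainer = GRPOTrainer(model=model, args=config,
#                       train_dataset=medium_adversarial_samples,
#                       reward_funcs=procurement_reward)
# trainer.train()
```

Starting Phase 2 from the Phase 1 LoRA adapters at a reduced learning rate is the usual way to adapt to harder scenarios without washing out earlier behaviour.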


The Results

| Metric | Value |
| --- | --- |
| Training episodes (easy) | 200 |
| Training samples (easy) | 1,333 |
| Training episodes (medium) | 100 |
| Training samples (medium) | 1,829 |
| Model | Llama 3.2 3B + LoRA adapters |
| Training method | GRPO via HuggingFace TRL |
| Hardware | NVIDIA A100 80GB |
| Tier advancements | Novice → Apprentice → Practitioner → Expert |
| Expert tier episodes | 43% |
| Easy: first 20 steps avg reward | 0.0068 |
| Easy: last 20 steps avg reward | 0.0073 |
| Easy: improvement | +8.3% |
| Valid action reward signal | 0.0513 vs 0.0001 (513x gap) |

Before vs After

| Behavior | Untrained | Trained |
| --- | --- | --- |
| `raise_pr` (invalid) steps | 2/8 | 1/8 |
| Actions with `proposed_price` | 6/8 | 7/8 |
| Avg reward | 0.0104 | 0.0104 |

Why This Matters

Procurement is not a niche problem. It is a $50 trillion global industry where decisions happen under pressure, with incomplete information, and with real financial consequences. Most AI tools in this space are glorified search engines or document summarisers.

NegotiateAI is something different. It is a trainable, measurable, open benchmark for teaching LLMs to actually negotiate. Not to talk about negotiating. To do it.

The curriculum engine means the environment gets harder as the agent improves. The adversarial supplier LLMs mean there is no fixed optimal policy to memorise. And the OpenEnv interface means any model can be dropped in and evaluated on the same benchmark.

We think that distinction matters a lot.


Links