ForesightLM Phase 1
ForesightLM Phase 1 is an experimental research checkpoint exploring whether self-loop hard negatives can provide a useful calibration signal for semantic-level coherence-aware generation.
This repository contains Phase 1 model checkpoints, robustness outputs, and small ablation artifacts for the ForesightLM workshop paper.
Main idea
The project investigates a lightweight extension around a DistilGPT-2-style language model trained on WikiText-103. The goal is not to claim solved long-form coherence, but to test whether future-aware objectives and self-loop hard negatives can improve semantic-level generation indicators.
Uploaded checkpoints
This repository includes:
| Path | Description |
|---|---|
| `outputs/phase1/v2_wikitext103_seed42_distilgpt2` | Main ForesightLM-v2 Phase 1 checkpoint, lambda_hardneg=0.10 |
| `outputs/phase1/v2_wikitext103_lh005_seed42_distilgpt2` | Lambda ablation checkpoint, lambda_hardneg=0.05 |
| `outputs/phase1/v2_wikitext103_lh020_seed42_distilgpt2` | Lambda ablation checkpoint, lambda_hardneg=0.20 |
| `outputs/phase1/bootstrap` | Paired bootstrap confidence interval results |
| `outputs/phase1/lambda_hardneg_ablation` | Lambda hard-negative ablation summary |
| `outputs/phase1/decoding_eval` | Decoding summary and qualitative examples |
Phase 1 main metrics
Main ForesightLM-v2 result:
| metric | value |
|---|---|
| LM loss | 3.9384 |
| Future loss | 2.7180 |
| Hard-negative margin gap | 0.0614 |
| Hard-negative satisfied rate | 0.4107 |
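This README does not spell out how the two hard-negative metrics are computed. The sketch below shows one plausible reading, under stated assumptions: each evaluation item has a score for the true continuation and a score for its self-loop hard negative, the "margin gap" is the mean score difference, and the "satisfied rate" is the fraction of items whose difference meets a margin. The function name `hardneg_metrics`, its arguments, and the `margin` default are hypothetical and are not taken from the ForesightLM code.

```python
from typing import Sequence, Tuple


def hardneg_metrics(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    margin: float = 0.0,
) -> Tuple[float, float]:
    """Return (mean gap, satisfied rate) for paired positive/negative scores.

    Assumed definitions (not confirmed by the repository):
      gap_i     = pos_scores[i] - neg_scores[i]
      mean gap  = average of gap_i over all items
      satisfied = fraction of items with gap_i >= margin
    """
    if len(pos_scores) != len(neg_scores) or len(pos_scores) == 0:
        raise ValueError("pos_scores and neg_scores must be non-empty and equal length")

    gaps = [p - n for p, n in zip(pos_scores, neg_scores)]
    mean_gap = sum(gaps) / len(gaps)
    satisfied_rate = sum(1 for g in gaps if g >= margin) / len(gaps)
    return mean_gap, satisfied_rate


if __name__ == "__main__":
    # Toy scores for illustration only; these are not ForesightLM outputs.
    gap, rate = hardneg_metrics([0.9, 0.4, 0.7, 0.2], [0.5, 0.5, 0.1, 0.3], margin=0.1)
    print(f"mean gap={gap:.4f}, satisfied rate={rate:.4f}")
```

Under this reading, a small positive mean gap can coexist with a satisfied rate below 0.5, because a few large positive differences can outweigh many items that miss the margin.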
Lambda hard-negative ablation
| lambda_hardneg | eval_lm ↓ | eval_future ↓ | hardneg_gap ↑ | satisfied ↑ |
|---|---|---|---|---|
| 0.05 | 3.9385 | 2.7214 | 0.0512 | 0.3571 |
| 0.10 | 3.9384 | 2.7180 | 0.0614 | 0.4107 |
| 0.20 | 3.9383 | 2.7148 | 0.0701 | 0.4643 |
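In this small ablation, raising the hard-negative weight from 0.10 to 0.20 increases the held-out hard-negative margin gap (0.0614 to 0.0701) and the satisfied rate (0.4107 to 0.4643), while language-modeling loss is essentially unchanged (3.9384 vs. 3.9383). The same direction holds between 0.05 and 0.10.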
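To make the size of these differences concrete, the following sketch recomputes the per-metric deltas from the ablation table relative to a chosen reference weight. The numbers are copied from the table above; the names `ABLATION` and `deltas_vs` are illustrative and do not come from the repository.

```python
from typing import Dict

# Values copied from the lambda hard-negative ablation table in this README.
ABLATION: Dict[float, Dict[str, float]] = {
    0.05: {"eval_lm": 3.9385, "eval_future": 2.7214, "hardneg_gap": 0.0512, "satisfied": 0.3571},
    0.10: {"eval_lm": 3.9384, "eval_future": 2.7180, "hardneg_gap": 0.0614, "satisfied": 0.4107},
    0.20: {"eval_lm": 3.9383, "eval_future": 2.7148, "hardneg_gap": 0.0701, "satisfied": 0.4643},
}


def deltas_vs(reference: float) -> Dict[float, Dict[str, float]]:
    """Return metric differences (row minus reference row), rounded to 4 decimals."""
    ref = ABLATION[reference]
    return {
        lam: {name: round(value - ref[name], 4) for name, value in row.items()}
        for lam, row in ABLATION.items()
        if lam != reference
    }


if __name__ == "__main__":
    for lam, delta in sorted(deltas_vs(0.10).items()):
        print(f"lambda_hardneg={lam:.2f} vs 0.10: {delta}")
```

The language-modeling loss moves by only 0.0001 between adjacent settings, which is far smaller than the changes in the hard-negative metrics. With a single seed, none of these differences come with an uncertainty estimate.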
Bootstrap robustness
Paired bootstrap over prompt IDs suggests that calibrated v2 decoding robustly improves prompt relevance, topic drift, progression, semantic-loop rate, and prompt-drift rate relative to Core and baseline.
Raw v2 improvements are directional but generally not conclusive under the paired bootstrap intervals.
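For readers unfamiliar with the procedure, here is a minimal sketch of a paired bootstrap over prompt IDs using a percentile interval. It assumes each system has one score per prompt ID for a given metric, and it resamples the per-prompt differences with replacement. The function name, the resample count, and the 95% level are assumptions chosen for illustration; the repository's own bootstrap script may differ.

```python
import random
from statistics import mean
from typing import Dict, Hashable, Tuple


def paired_bootstrap_ci(
    scores_a: Dict[Hashable, float],
    scores_b: Dict[Hashable, float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI lower, CI upper) for A minus B.

    Only prompt IDs present in both score dicts are used, so every
    resampled prompt contributes a matched pair.
    """
    # Sort by string form so mixed ID types still give a deterministic order.
    shared_ids = sorted(set(scores_a) & set(scores_b), key=str)
    if not shared_ids:
        raise ValueError("no shared prompt IDs between the two systems")

    diffs = [scores_a[pid] - scores_b[pid] for pid in shared_ids]
    rng = random.Random(seed)
    n = len(diffs)

    # Resample prompts with replacement and record the mean paired difference.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )

    # Percentile interval over the sorted resampled means.
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return mean(diffs), resampled_means[lower_idx], resampled_means[upper_idx]


if __name__ == "__main__":
    # Toy per-prompt scores for illustration only; not ForesightLM results.
    system_a = {"p1": 0.62, "p2": 0.55, "p3": 0.71, "p4": 0.48, "p5": 0.66}
    system_b = {"p1": 0.58, "p2": 0.57, "p3": 0.60, "p4": 0.45, "p5": 0.61}
    point, low, high = paired_bootstrap_ci(system_a, system_b)
    print(f"mean diff={point:.4f}, 95% CI=[{low:.4f}, {high:.4f}]")
```

An interval that excludes zero corresponds to the kind of result described above as robust, while an interval that straddles zero corresponds to a directional but inconclusive one.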
Important limitations
These are Phase 1 research artifacts. The current evidence is limited to:
- one main domain: WikiText-103
- one primary seed
- small ablation scale
- automatic coherence indicators rather than full human evaluation
The results should be interpreted cautiously. ForesightLM does not solve long-form coherence; Phase 1 only suggests that self-loop hard negatives may provide a useful calibration signal.
GitHub
Code and compact result artifacts are available on GitHub:
https://github.com/Ahmet2001/foresightLM
Current pushed branch:
v2-self-loop-fixes
Citation
If you use these artifacts, please cite the ForesightLM Phase 1 workshop paper once available.