ForesightLM Phase 1

ForesightLM Phase 1 is an experimental research checkpoint exploring whether self-loop hard negatives can provide a useful calibration signal for semantic-level coherence-aware generation.

This repository contains Phase 1 model checkpoints, robustness outputs, and small ablation artifacts for the ForesightLM workshop paper.

Main idea

The project investigates a lightweight extension around a DistilGPT-2 style language model trained on WikiText-103. The goal is not to claim solved long-form coherence, but to test whether future-aware objectives and self-loop hard negatives can improve semantic-level generation indicators.

Uploaded checkpoints

This repository includes:

| Path | Description |
| --- | --- |
| outputs/phase1/v2_wikitext103_seed42_distilgpt2 | Main ForesightLM-v2 Phase 1 checkpoint, lambda_hardneg=0.10 |
| outputs/phase1/v2_wikitext103_lh005_seed42_distilgpt2 | Lambda ablation checkpoint, lambda_hardneg=0.05 |
| outputs/phase1/v2_wikitext103_lh020_seed42_distilgpt2 | Lambda ablation checkpoint, lambda_hardneg=0.20 |
| outputs/phase1/bootstrap | Paired bootstrap confidence interval results |
| outputs/phase1/lambda_hardneg_ablation | Lambda hard-negative ablation summary |
| outputs/phase1/decoding_eval | Decoding summary and qualitative examples |
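
As a rough sketch of how the checkpoint directories listed here might be used, the snippet below maps each lambda_hardneg value to its path and tries to load it with the Hugging Face `transformers` auto classes. It assumes the repository has been downloaded locally and that the checkpoints were saved in the standard `save_pretrained` layout. Neither is confirmed by this card, and any extra ForesightLM-specific heads would likely need the project's own code from GitHub. The helper names `checkpoint_dir` and `load_checkpoint` are hypothetical.

```python
from pathlib import Path

# Checkpoint directories as listed in this card, keyed by lambda_hardneg.
CHECKPOINTS = {
    0.05: "outputs/phase1/v2_wikitext103_lh005_seed42_distilgpt2",
    0.10: "outputs/phase1/v2_wikitext103_seed42_distilgpt2",
    0.20: "outputs/phase1/v2_wikitext103_lh020_seed42_distilgpt2",
}


def checkpoint_dir(lambda_hardneg, root="."):
    """Return the local directory for a given lambda_hardneg (hypothetical helper)."""
    return Path(root) / CHECKPOINTS[lambda_hardneg]


def load_checkpoint(lambda_hardneg, root="."):
    """Load tokenizer and model, ASSUMING a standard save_pretrained layout."""
    # Imported lazily so the path helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    path = str(checkpoint_dir(lambda_hardneg, root))
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path)
    return tokenizer, model
```

If loading this way fails or silently drops weights, the training code in the GitHub repository is the better reference for how the checkpoints were written.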

Phase 1 main metrics

Main ForesightLM-v2 result:

| Metric | Value |
| --- | --- |
| LM loss | 3.9384 |
| Future loss | 2.7180 |
| Hard-negative margin gap | 0.0614 |
| Hard-negative satisfied rate | 0.4107 |
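
This card does not define the two hard-negative metrics precisely. One plausible reading, offered only as an assumption, is that each held-out example yields a score for the true continuation and a score for its self-loop hard negative, that the gap is the mean difference, and that the satisfied rate is the fraction of examples whose difference exceeds a margin. The sketch below implements that reading with made-up scores. The function name, the `margin` parameter, and the conversion of LM loss to perplexity (which assumes a mean per-token cross-entropy in nats) are all illustrative.

```python
import math


def hardneg_metrics(pos_scores, neg_scores, margin=0.0):
    """Return (mean gap, satisfied rate) under the ASSUMED definitions above."""
    if len(pos_scores) != len(neg_scores) or not pos_scores:
        raise ValueError("need two non-empty score lists of equal length")
    gaps = [p - n for p, n in zip(pos_scores, neg_scores)]
    mean_gap = sum(gaps) / len(gaps)
    # An example is "satisfied" when the positive beats the negative by more than the margin.
    satisfied_rate = sum(g > margin for g in gaps) / len(gaps)
    return mean_gap, satisfied_rate


# Made-up scores for illustration only; these are not Phase 1 data.
pos = [0.9, 0.4, 0.7, 0.2]
neg = [0.5, 0.6, 0.1, 0.3]
mean_gap, satisfied_rate = hardneg_metrics(pos, neg)
print(f"gap={mean_gap:.3f} satisfied={satisfied_rate:.2f}")

# Perplexity implied by the reported LM loss, ASSUMING it is in nats per token.
perplexity = math.exp(3.9384)
print(f"perplexity ~ {perplexity:.1f}")
```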

Lambda hard-negative ablation

| lambda_hardneg | eval_lm ↓ | eval_future ↓ | hardneg_gap ↑ | satisfied ↑ |
| --- | --- | --- | --- | --- |
| 0.05 | 3.9385 | 2.7214 | 0.0512 | 0.3571 |
| 0.10 | 3.9384 | 2.7180 | 0.0614 | 0.4107 |
| 0.20 | 3.9383 | 2.7148 | 0.0701 | 0.4643 |

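In this small, single-seed ablation, raising lambda_hardneg from 0.05 through 0.10 to 0.20 increases the held-out hard-negative gap (0.0512, 0.0614, 0.0701) and the satisfied rate (0.3571, 0.4107, 0.4643), while LM loss changes by at most 0.0002. This suggests that a larger hard-negative weight improves held-out hard-negative separation while preserving language-modeling loss.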
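
To make the size of these differences concrete, the short script below recomputes each setting's change relative to the main lambda_hardneg=0.10 checkpoint, using only the numbers in the ablation table. The variable and function names are illustrative.

```python
# Rows copied from the ablation table: (lambda, eval_lm, eval_future, hardneg_gap, satisfied).
ROWS = [
    (0.05, 3.9385, 2.7214, 0.0512, 0.3571),
    (0.10, 3.9384, 2.7180, 0.0614, 0.4107),
    (0.20, 3.9383, 2.7148, 0.0701, 0.4643),
]
METRICS = ("eval_lm", "eval_future", "hardneg_gap", "satisfied")


def deltas_vs_reference(rows, reference_lambda=0.10):
    """Return {lambda: {metric: value - reference value}} rounded to 4 decimals."""
    ref = next(r for r in rows if r[0] == reference_lambda)
    return {
        row[0]: {m: round(v - rv, 4) for m, v, rv in zip(METRICS, row[1:], ref[1:])}
        for row in rows
    }


deltas = deltas_vs_reference(ROWS)
for lam, d in deltas.items():
    print(lam, d)
```

The LM-loss deltas are on the order of 1e-4, which is why the table is read as "preserving" language-modeling loss rather than improving it.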

Bootstrap robustness

Paired bootstrap over prompt IDs suggests that calibrated v2 decoding robustly improves five indicators relative to Core and baseline: prompt relevance, topic drift, progression, semantic-loop rate, and prompt-drift rate.

Raw v2 improvements are directional but generally not conclusive under the paired bootstrap intervals.
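
For readers unfamiliar with the procedure, here is a minimal sketch of a paired percentile bootstrap over prompts. It is not the project's evaluation script: the function name, resample count, 95% level, and the per-prompt scores are assumptions for illustration. The idea is to resample prompt indices with replacement, keep each prompt's two system scores paired, and read a confidence interval off the resampled mean differences.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean difference, CI low, CI high) for a - b, resampling prompts."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    # Pairing: one difference per prompt, so prompt difficulty cancels out.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, low, high


# Made-up per-prompt scores for two systems; these are not Phase 1 results.
system_a = [0.62, 0.58, 0.71, 0.66, 0.59, 0.73, 0.64, 0.68]
system_b = [0.60, 0.55, 0.65, 0.67, 0.54, 0.70, 0.61, 0.63]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(system_a, system_b)
print(f"mean diff={mean_diff:.4f}, 95% CI=[{ci_low:.4f}, {ci_high:.4f}]")
```

Under this reading, an improvement is "directional but not conclusive" when the mean difference has the expected sign but the interval still contains zero.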

Important limitations

These are Phase 1 research artifacts. The current evidence is limited to:

  • one main domain: WikiText-103
  • one primary seed
  • small ablation scale
  • automatic coherence indicators rather than full human evaluation

The results should be interpreted cautiously. ForesightLM does not solve long-form coherence; Phase 1 only suggests that self-loop hard negatives may provide a useful calibration signal.

GitHub

Code and compact result artifacts are available on GitHub:

https://github.com/Ahmet2001/foresightLM

Current pushed branch:

v2-self-loop-fixes

Citation

If you use these artifacts, please cite the ForesightLM Phase 1 workshop paper once available.
