Model Card for SmolLM360M-IT-ConvFill_mlx_q8

This model card implements a finetune of HuggingFaceTB/SmolLM2-360M-Instruct for the conversational infill task described in the ConvFill paper, and is a INT8 MLX quantized version of vysri/SmolLM360M-IT-ConvFill.

Model Details

This model should be used respecting the original license of the base model, HuggingFaceTB/SmolLM2-360M-Instruct. The dataset that was used to finetune this model can be found here.

Model Description

Deploying responsive, multi-turn conversational voice agents with large language models poses a critical challenge: cloud-based foundation models utilize reasoning, information retrieval, and tool use for high-value tasks, but introduce latency that disrupts natural conversation. In contrast, small models can respond quickly but lack capabilities needed in real-world tasks. We propose conversational infill, a task where a small, local model generates prompt, contextually appropriate dialogue and seamlessly incorporates delayed, external knowledge produced in parallel by a foundation model backend. This finetune trains HuggingFaceTB/SmolLM2-360M-Instruct to perform the conversational infill task.

Finetuned from model: HuggingFaceTB/SmolLM2-360M-Instruct
License: Apache 2.0

Model Sources [optional]

Repository: https://github.com/vysri/conversational-infill
Paper: Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents
Demo: TBD

Direct Use

This model is intended to be used with the infrastructure in the ConvFill repository.

Bias, Risks, and Limitations

This model is not explicitly tuned for guardrailed behavior. Please use with caution.

How to Get Started with the Model

Use the code in the ConvFill repository to get started with this model.

Training Data

A link to the training data for this model can be found here. The dataset generation procedure can be found here. Information on training procedures can be found in the ConvFill paper. Training code and scripts can be found in the ConvFill repository.

Citation

@misc{srinivas2026thinkingspeakinginferencetimeknowledge,
      title={Thinking While Speaking: Inference-Time Knowledge Transfer for Responsive and Intelligent Conversational Voice Agents}, 
      author={Vidya Srinivas and Zachary Englhardt and Shwetak Patel and Vikram Iyer},
      year={2026},
      eprint={2511.07397},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.07397}, 
}