UserJoseph, nielsr (HF Staff) committed
Commit eead26d · verified · 1 Parent(s): b700156

Add comprehensive model card for DisTime (#1)


- Add comprehensive model card for DisTime (4eaa95da8cd288dec6a058c3c7656d5a15051dac)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +96 -3
README.md CHANGED
@@ -1,3 +1,96 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
+ tags:
+ - multimodal
+ - video-understanding
+ - temporal-localization
+ - qwen
+ ---
+
+ # DisTime: Distribution-based Time Representation for Video Large Language Models
+
+ This repository contains the official implementation and checkpoints for the paper:
+ [**DisTime: Distribution-based Time Representation for Video Large Language Models**](https://huggingface.co/papers/2505.24329) (ICCV 2025).
+
+ For more details, including installation, training, and evaluation scripts, please refer to the official [GitHub repository](https://github.com/josephzpng/DisTime).
+
+ <div align="center">
+ <img src="https://github.com/josephzpng/DisTime/raw/main/images/network.png" width="600px"/>
+ </div>
+
+ ## Abstract
+
+ Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released at [this URL](https://github.com/josephzpng/DisTime).
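+
+ To make the core idea concrete, the sketch below shows, in simplified form, how a distribution-based time decoder can map the hidden state of a learnable time token to a probability distribution over discretized timestamps and read out a continuous time as its expectation. This is a minimal illustration of the abstract, not the released implementation; the module name, hidden size, number of bins, and the expectation-based read-out are assumptions.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class DistributionTimeDecoder(nn.Module):
+     """Simplified, hypothetical sketch of a distribution-based time decoder;
+     sizes and structure are illustrative, not the official implementation."""
+
+     def __init__(self, hidden_size=2048, num_bins=100):
+         super().__init__()
+         # Small MLP mapping the time token's hidden state to logits over time bins.
+         self.proj = nn.Sequential(
+             nn.Linear(hidden_size, hidden_size),
+             nn.GELU(),
+             nn.Linear(hidden_size, num_bins),
+         )
+         # Bin centers spanning the normalized [0, 1] video timeline.
+         self.register_buffer("bin_centers", torch.linspace(0.0, 1.0, num_bins))
+
+     def forward(self, time_token_hidden, video_duration):
+         # (batch, hidden) -> (batch, num_bins) probability distribution over time.
+         probs = self.proj(time_token_hidden).softmax(dim=-1)
+         # Continuous timestamp (seconds) as the expectation over bin centers.
+         t_norm = (probs * self.bin_centers).sum(dim=-1)
+         return t_norm * video_duration, probs
+
+ # Toy usage: one time token from a 60-second video.
+ decoder = DistributionTimeDecoder()
+ timestamp, probs = decoder(torch.randn(1, 2048), video_duration=60.0)
+ print(timestamp, probs.shape)  # tensor([...]) torch.Size([1, 100])
+ ```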
+
+ ## Dataset
+
+ The InternVid-TG dataset proposed in the paper is released at: [yingsen/internvid-tg](https://huggingface.co/datasets/yingsen/internvid-tg).
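+
+ To inspect the annotations programmatically, a typical starting point is the `datasets` library. The snippet below is a hedged sketch: the available configs, splits, and column names depend on how the Hub repository is packaged, so check the dataset page first.
+
+ ```python
+ from datasets import load_dataset
+
+ # Assumes a "train" split exists; verify the actual configs, splits, and
+ # column names at https://huggingface.co/datasets/yingsen/internvid-tg.
+ ds = load_dataset("yingsen/internvid-tg", split="train", streaming=True)
+ print(next(iter(ds)))  # peek at one temporally grounded annotation record
+ ```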
+
+ ## Usage
+
+ You can load the model with the `transformers` library (note `trust_remote_code=True`, since the checkpoint ships custom code) and use it for video understanding tasks. The exact video preprocessing and prompt format follow the official [GitHub repository](https://github.com/josephzpng/DisTime).
+
+ ```python
+ import numpy as np
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM, AutoProcessor
+ from decord import cpu, VideoReader
+
+ # Load model, tokenizer, and processor
+ tokenizer = AutoTokenizer.from_pretrained("UserJoseph/DisTime-1B", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("UserJoseph/DisTime-1B", trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")
+ processor = AutoProcessor.from_pretrained("UserJoseph/DisTime-1B", trust_remote_code=True)
+
+ model.eval()
+
+ # Example video input
+ video_path = "./examples/video1.mp4"  # Replace with your video path
+ qs = "Describe this video in detail"
+
+ # Load video frames at roughly one frame per second
+ vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
+ fps = float(vr.get_avg_fps())
+ frame_indices = np.arange(0, len(vr), round(fps))
+ video_frames = np.stack([vr[int(i)].asnumpy() for i in frame_indices])
+
+ # Prepare inputs
+ messages = [{"role": "user", "content": [{"type": "video", "video": video_frames}, {"type": "text", "text": qs}]}]
+ prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ # Generate response (greedy decoding, so no sampling temperature is needed)
+ with torch.inference_mode():
+     output_ids = model.generate(
+         **inputs,
+         do_sample=False,
+         max_new_tokens=128,
+         use_cache=True,
+     )
+
+ pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
+ print(pred)
+ ```
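+
+ Because DisTime targets temporal localization, the same pipeline can also be used for grounding-style questions. The snippet below is a hedged sketch that reuses the variables defined above; the exact prompt wording and the format of the predicted segments (e.g. start/end seconds) are defined by the official repository, not guaranteed here.
+
+ ```python
+ # Hypothetical temporal-grounding query, reusing model/tokenizer/processor/video_frames from above.
+ qs = "When does the main action start and end? Answer with the start and end time in seconds."
+ messages = [{"role": "user", "content": [{"type": "video", "video": video_frames}, {"type": "text", "text": qs}]}]
+ prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+ with torch.inference_mode():
+     output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
+ print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
+ ```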
+
+ ## Citation
+
+ If you find this work useful, please cite the paper:
+
+ ```bibtex
+ @article{zeng2025distime,
+   title={DisTime: Distribution-based Time Representation for Video Large Language Models},
+   author={Zeng, Yingsen and Huang, Zepeng and Zhong, Yujie and Feng, Chengjian and Hu, Jie and Ma, Lin and Liu, Yang},
+   journal={arXiv preprint arXiv:2505.24329},
+   year={2025}
+ }
+ ```
+
+ ## Acknowledgement
+
+ DisTime is developed with the codebases of the following projects: [InternVL](https://github.com/OpenGVLab/InternVL) and [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT). We are sincerely grateful for these open-source contributions, which have greatly facilitated our research on time representation for video large language models.