UserJoseph nielsr (HF Staff) committed
Commit 30b48ed · verified · 1 Parent(s): 5bb5b65

Improve model card: Add metadata, paper link, code link, and usage (#1)


- Improve model card: Add metadata, paper link, code link, and usage (6e5f5bdf76041c65d1ccf9c40f0515f1e8986ff1)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +100 -3
README.md CHANGED
---
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---

# DisTime: Distribution-based Time Representation for Video Large Language Models

[Paper](https://huggingface.co/papers/2505.24329) | [GitHub Repository](https://github.com/josephzpng/DisTime)

## Abstract

Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks.

<div align="center">
  <img src="https://github.com/josephzpng/DisTime/raw/main/images/network.png" width="600px"/>
</div>

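To picture the decoder idea, here is a minimal, illustrative sketch (not the released implementation; the bin count, layer shapes, and normalization are assumptions made for this example): the hidden state of a dedicated time token is projected to a probability distribution over relative-time bins, and the expectation of that distribution yields a continuous start/end timestamp.

```python
import torch
import torch.nn as nn

class ToyDistributionTimeDecoder(nn.Module):
    """Illustration only: map a time-token hidden state to continuous (start, end) times."""

    def __init__(self, hidden_size: int = 4096, num_bins: int = 32):
        super().__init__()
        # One distribution over relative-time bins for the start boundary, one for the end.
        self.start_head = nn.Linear(hidden_size, num_bins)
        self.end_head = nn.Linear(hidden_size, num_bins)
        # Bin centers over [0, 1], i.e. relative position within the video.
        self.register_buffer("bin_centers", torch.linspace(0.0, 1.0, num_bins))

    def forward(self, time_token_hidden: torch.Tensor, video_duration: float):
        # Softmax turns logits into a temporal probability distribution,
        # keeping boundaries soft instead of snapping to discrete tokens.
        p_start = self.start_head(time_token_hidden).softmax(dim=-1)
        p_end = self.end_head(time_token_hidden).softmax(dim=-1)
        # The expectation over bin centers gives a continuous timestamp in seconds.
        start = (p_start * self.bin_centers).sum(dim=-1) * video_duration
        end = (p_end * self.bin_centers).sum(dim=-1) * video_duration
        return start, end

# Example: decode one dummy time-token hidden state for a 60-second video.
decoder = ToyDistributionTimeDecoder(hidden_size=4096, num_bins=32)
start, end = decoder(torch.randn(1, 4096), video_duration=60.0)
```

The paper also describes a Distribution-based Time Encoder that maps timestamps back into this embedding space as time markers; the sketch above covers only the decoding direction.
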
## Usage

You can load the model with the `transformers` library; the checkpoint ships custom code, so `trust_remote_code=True` is required. The following example samples a video at roughly 1 fps with `decord` and runs inference with DisTime:

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel, AutoProcessor
from decord import cpu, VideoReader

# Load the model and processor (custom code from the model repository)
tokenizer = AutoTokenizer.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True)
model = AutoModel.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
processor = AutoProcessor.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True)

model.eval()
video_path = "./examples/video1.mp4"  # Replace with your video path
qs = "Describe this video in detail"

# Load video frames with decord
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.array([i for i in range(0, len(vr), round(fps))])  # Sample frames at 1 fps
video = [vr[frame_index].asnumpy() for frame_index in frame_indices]
video = np.stack(video)  # (num_frames, height, width, 3)

# Prepare chat-style inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video},
            {"type": "text", "text": qs},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
video_inputs = processor.process_video(messages)  # Process video frames
inputs = processor(text=[text], videos=video_inputs, padding=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate output (greedy decoding; temperature has no effect when do_sample=False)
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        do_sample=False,
        temperature=0.2,
        max_new_tokens=128,
        use_cache=True,
    )

pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
```

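A small continuation of the example above (it reuses `frame_indices` and `fps` from the snippet): because frames are sampled at roughly 1 fps, each sampled frame index maps back to a timestamp as `index / fps`, which is handy when relating temporally grounded answers to positions in the source video.

```python
# Continues the usage example above: map sampled frame indices to timestamps (seconds).
frame_times = frame_indices / fps
for idx, t in list(zip(frame_indices, frame_times))[:5]:
    print(f"frame {idx} -> {t:.2f}s")
```
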
## Models and Data

### Models

- [DisTime-1B](https://huggingface.co/UserJoseph/DisTime-1B)
- [DisTime-8B](https://huggingface.co/UserJoseph/DisTime-8B)

### InternVid-TG

We propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models, and use it to construct the InternVid-TG dataset. The dataset is released at [https://huggingface.co/datasets/yingsen/internvid-tg](https://huggingface.co/datasets/yingsen/internvid-tg).

<div align="center">
  <img src="https://github.com/josephzpng/DisTime/raw/main/images/internvid-tg.png" width="600px"/>
</div>

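One plausible reading of this division of labor is sketched below; the function names (`describe_events`, `localize_event`) are hypothetical placeholders for a captioning Video-LLM and a dedicated temporal grounding model, not APIs from this repository, and the details differ from the actual annotation pipeline.

```python
from typing import Callable, List, Tuple

def annotate_video(
    video_path: str,
    describe_events: Callable[[str], List[str]],                # hypothetical captioning Video-LLM
    localize_event: Callable[[str, str], Tuple[float, float]],  # hypothetical temporal grounding model
) -> List[dict]:
    """Rough sketch of the automated annotation paradigm: the Video-LLM proposes
    event descriptions and a dedicated temporal model localizes each one."""
    annotations = []
    for caption in describe_events(video_path):
        start, end = localize_event(video_path, caption)
        annotations.append({"start": start, "end": end, "caption": caption})
    return annotations
```
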
## Citation

```bibtex
@article{zeng2025distime,
  title={DisTime: Distribution-based Time Representation for Video Large Language Models},
  author={Zeng, Yingsen and Huang, Zepeng and Zhong, Yujie and Feng, Chengjian and Hu, Jie and Ma, Lin and Liu, Yang},
  journal={arXiv preprint arXiv:2505.24329},
  year={2025}
}
```

## Acknowledgement

DisTime is developed with the codebases of the following projects: [InternVL](https://github.com/OpenGVLab/InternVL) and [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT). We would like to express our sincere gratitude to these open-source contributions, which have greatly facilitated our research and exploration of time representation for video large language models.