Files changed (1)
README.md CHANGED (+75 -12)
@@ -4,39 +4,102 @@ base_model:
  - Qwen/Qwen2.5-7B-Instruct
  ---
  # Valley 2.0
  ## Introduction
- Valley [github](https://github.com/bytedance/Valley) is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data, which is developed by ByteDance. Our model not only
-
- - Achieved the best results in the inhouse e-commerce and short-video benchmarks
- - Demonstrated comparatively outstanding performance in the OpenCompass (average scores > 67) tests
-
- when evaluated against models of the same scale.

- ## Release
- - [12/23] 🔥 Announcing [Valley-Qwen2.5-7B](https://huggingface.co/ByteDance)!
  ## Valley-Eagle
  The foundational version of Valley is a multimodal large model aligned with Siglip and Qwen2.5, incorporating LargeMLP and ConvAdapter to construct the projector.

- - In the final version, we also referenced Eagle, introducing an additional VisionEncoder that can flexibly adjust the number of tokens and is parallelized with the original visual tokens.
  - This enhancement supplements the model’s performance in extreme scenarios, and we chose the Qwen2vl VisionEncoder for this purpose.

  and the model structure is shown as follows:

  <div style="display:flex;">
- <img src="valley_structure.jpeg" alt="opencompass" style="height:600px;" />
  </div>

  ## Environment Setup
  ``` bash
  pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
  pip install -r requirements.txt
  ```
- ## License Agreement
- All of our open-source models are licensed under the Apache-2.0 license.
-
- ## Citation
- Coming Soon!
  - Qwen/Qwen2.5-7B-Instruct
  ---
  # Valley 2.0
+
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/bytedance/Valley/refs/heads/main/assets/valley_logo.jpg" width="500"/>
+ </p>
+
+ <p align="center">
+ 🤗 <a href="https://huggingface.co/bytedance-research/Valley-Eagle-7B">Hugging Face</a>&nbsp;&nbsp;|&nbsp;&nbsp;📑 <a href="https://hyggge.github.io/projects/valley/index.html">Home Page</a>
+ </p>
+
  ## Introduction
+ Valley is a cutting-edge multimodal large model developed by ByteDance, designed to handle a variety of tasks involving text, images, and video data. Our model
+
+ - Achieved the best results in the in-house e-commerce and short-video benchmarks, much better than other SOTA open-source models
+ - Demonstrated comparatively outstanding performance in the OpenCompass tests (average score >= 67.40, *Top 2* among models under 10B parameters)
+
+ when evaluated against models of the same scale.
+
+ <div style="display:flex;">
+ <!-- <img src="assets/open_compass_1223.jpg" alt="opencompass" style="height:300px;" />
+ <img src="assets/tts_inhouse_benchmark_1223.jpg" alt="inhouse" style="height:300px;" /> -->
+ <img src="https://raw.githubusercontent.com/bytedance/Valley/refs/heads/main/assets/combine.jpg" alt="opencompass"/>
+ </div>
+ <br>
+
+ <p align="center" style="display:flex;">
+ <img src="https://raw.githubusercontent.com/bytedance/Valley/refs/heads/main/assets/table.jpeg"/>
+ </p>
  ## Valley-Eagle
  The foundational version of Valley is a multimodal large model aligned with Siglip and Qwen2.5, incorporating LargeMLP and ConvAdapter to construct the projector.

+ - In the final version, we also referenced [Eagle](https://arxiv.org/pdf/2408.15998), introducing an additional VisionEncoder that can flexibly adjust the number of tokens and is parallelized with the original visual tokens.
  - This enhancement supplements the model’s performance in extreme scenarios, and we chose the Qwen2vl VisionEncoder for this purpose.

  and the model structure is shown as follows:

  <div style="display:flex;">
+ <img src="https://raw.githubusercontent.com/bytedance/Valley/refs/heads/main/assets/valley_structure.jpeg" alt="valley structure" />
  </div>
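The dual-branch design above (a SigLIP-aligned primary encoder plus a parallel Qwen2-VL-style encoder whose token count can vary) can be sketched in miniature. Everything below is an illustrative assumption — the function name, the fake list-based "tokens", and the dimensions are ours, not the repo's code:

``` python
def fuse_visual_tokens(primary_tokens, extra_tokens):
    """Concatenate token sequences from two parallel vision encoders.

    The auxiliary branch may emit a different number of tokens than the
    primary branch; only the per-token feature width must match.
    """
    assert len(primary_tokens[0]) == len(extra_tokens[0]), "feature dims must match"
    return primary_tokens + extra_tokens

# Illustrative shapes only: 4 tokens from the primary branch, 2 from the auxiliary.
primary = [[0.0] * 8 for _ in range(4)]
extra = [[1.0] * 8 for _ in range(2)]
fused = fuse_visual_tokens(primary, extra)
print(len(fused))  # 6
```

The point is simply that the auxiliary branch contributes extra visual tokens alongside (not in place of) the primary ones, which is why its token count can be adjusted independently.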
+ ## Release
+ - [12/23] 🔥 Announcing [Valley-Eagle-7B](https://huggingface.co/bytedance-research/Valley-Eagle-7B)!
+
  ## Environment Setup
  ``` bash
  pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
  pip install -r requirements.txt
  ```
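The `cu121` suffix in the index URL above encodes the CUDA toolkit version (12.1) of the prebuilt PyTorch wheels. As a small sketch of that naming convention (the helper function is our own, not part of the repo):

``` python
def torch_index_url(cuda_version: str) -> str:
    """Build the PyTorch wheel index URL for a given CUDA toolkit version,
    e.g. '12.1' -> 'https://download.pytorch.org/whl/cu121'."""
    tag = "cu" + cuda_version.replace(".", "")
    return f"https://download.pytorch.org/whl/{tag}"

print(torch_index_url("12.1"))  # https://download.pytorch.org/whl/cu121
```

If your machine runs a different CUDA toolkit, pick the matching `cuXXX` index rather than the one pinned here.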
+ ## Inference Demo
+ - Single image
+ ``` python
+ import urllib.request
+
+ from valley_eagle_chat import ValleyEagleChat
+
+ model = ValleyEagleChat(
+     model_path='bytedance-research/Valley-Eagle-7B',
+     padding_side='left',
+ )
+
+ url = 'http://p16-goveng-va.ibyteimg.com/tos-maliva-i-wtmo38ne4c-us/4870400481414052507~tplv-wtmo38ne4c-jpeg.jpeg'
+ img = urllib.request.urlopen(url=url, timeout=5).read()
+
+ request = {
+     "chat_history": [
+         {'role': 'system', 'content': 'You are Valley, developed by ByteDance. You are a helpful assistant.'},
+         {'role': 'user', 'content': 'Describe the given image.'},
+     ],
+     "images": [img],
+ }
+
+ result = model(request)
+ print("\n>>> Assistant:\n")
+ print(result)
+ ```
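The request passed to the model above is a plain dict with `chat_history` and `images` keys. A tiny helper to assemble one — only the field names and roles come from the example; the helper itself and its defaults are our own sketch:

``` python
def build_request(user_prompt, images,
                  system_prompt="You are Valley, developed by ByteDance."):
    """Assemble a request dict in the shape used by the demo above.

    `images` is a list of raw image bytes, as in the single-image example.
    """
    return {
        "chat_history": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "images": list(images),
    }

req = build_request("Describe the given image.", [b"\xff\xd8fake-jpeg-bytes"])
print(req["chat_history"][1]["role"])  # user
```

The same shape is reused for video inputs, with the media list swapped accordingly.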
+
+ - Video
+ ``` python
+ import decord
+ import numpy as np
+ import requests
+ from torchvision import transforms
+
+ from valley_eagle_chat import ValleyEagleChat
+
+ model = ValleyEagleChat(
+     model_path='bytedance-research/Valley-Eagle-7B',
+     padding_side='left',
+ )
+
+ url = 'https://videos.pexels.com/video-files/29641276/12753127_1920_1080_25fps.mp4'
+ video_file = './video.mp4'
+ response = requests.get(url)
+ if response.status_code == 200:
+     with open(video_file, "wb") as f:
+         f.write(response.content)
+ else:
+     print("download error!")
+     exit(1)