SYS_PROMPT = ""

USER_PROMPT = """# CONTEXT #

You are a powerful video captioner. I want to tag 200,000 video files to build a dataset for training a text-to-video model. You need to provide a structured, detailed, and accurate description of the given video.

# OBJECTIVE #

Video Description Task Instructions
Video Content Description:

Detail and Accuracy: Provide a detailed and accurate description of the video content. Include all key objects, their types, colors, actions, positions, and relative positions. Describe the overall atmosphere.
Persons and Animals: If there are people, describe their appearance and actions. If there are animals, describe their behavior to give a clear understanding of the scene.
Multiple Scenes: If the video has multiple scenes, describe how they transition and highlight the differences between them.
Objectivity: Do not include imagined content or overly subjective feelings. Ensure all descriptions are based on what can be confidently determined from the video.
Grammar and Length: Use correct English grammar. Each description should be at least three sentences long.
Video Quality Evaluation:

Aesthetic Value: Evaluate the aesthetic value, including composition, color harmony, and overall visual effect. Score this aspect from 1 to 5 and explain your reasoning.
Clarity: Assess the clarity, including resolution and detail presentation. Score this aspect from 1 to 5 and explain your reasoning.
Emotional Impact: Evaluate the emotional impact, including how well the video conveys emotions and resonates with the audience. Score this aspect from 1 to 5 and explain your reasoning.
Summary: Provide a summary of the scores for aesthetic value, clarity, and emotional impact.
Film Perspective Analysis:

Shot Analysis: Analyze the type of shots used (close-up, medium, long shot, etc.).
Camera Movements: Describe the camera movements (push, pull, pan, tilt, track, crane, etc.).
Composition: Analyze the composition of the shots.
Interpretation: Provide your interpretation and feelings about the photographic work.

# STYLE #
Use cinematic language, such as narrative techniques, visual aesthetics, editing styles, and sound design.

# Output Structure #
Video Content:

{Detailed description of the video here, meeting the above requirements}.

Video Quality:

{Evaluation score and explanation of the video quality here}.

Film Perspective Description:

{Analysis of the video from a film perspective here}.

Example:

Video Content:

A stylish woman strides down a Tokyo street illuminated by warm neon lights and animated city signage. She sports a black leather jacket, a long red dress, black boots, and carries a black purse. Her look is completed with sunglasses and red lipstick. Her demeanor is confident and casual. The damp street reflects the vibrant lights, creating a mirror effect. The scene is bustling with numerous pedestrians.
Video Quality:

Aesthetic Value:
- Composition and Color: The video showcases a well-balanced composition with harmonious color schemes, achieving a visually pleasing effect. Techniques such as symmetry and dynamic composition are skillfully employed.
- Camera Work: The visual experience is enhanced by smooth transitions and diverse angles.
- Score: 4/5

Clarity:
- Resolution: The video boasts high resolution with clear details.
- Detail Presentation: It presents rich details with no noticeable blurriness or distortion.
- Score: 5/5

Emotional Impact:
- Emotion Conveyance: The video successfully conveys joy and excitement, striking a chord with the audience.
- Resonance: The compelling emotional expression, supported by well-integrated music and visuals, creates a strong impact.
- Score: 4/5

Summary:
- Aesthetic Value: 4/5
- Video Clarity: 5/5
- Emotional Impact: 4/5
Film Perspective Description:

Characters:
- Woman: A stylish woman dressed in a black leather jacket, long red dress, black boots, and carrying a black purse. She wears sunglasses and red lipstick.

Scenes:
- Tokyo Street: The street is filled with warm glowing neon lights and animated city signage, with damp reflective surfaces and numerous pedestrians.

Shot 1:
- The woman walks confidently and casually down the Tokyo street.
- She heads towards the camera in a panoramic view with central composition. The camera is at eye level and follows her with a handheld shot.
- Duration: 36 seconds

Shot 2:
- The woman continues her walk down the Tokyo street, maintaining her confident and casual demeanor.
- She approaches the camera, with a close-up of her face, transitioning to a torso mid-shot. The camera remains at eye level, following her with a handheld shot.
- Duration: 24 seconds

"""

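# Module-level defaults for frame skipping and sampling.
SKIP = 2          # frame skip; same value as the 'frame_skip' default in GENERAL_CONFIG
TEMP = 0.3        # sampling temperature; same value as the 'temp' default
TOP = 0.75        # top-p; same value as the 'top_p' default
MAX_TOKEN = 512   # note: lower than the 'max_tokens' default (4096) in GENERAL_CONFIG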

# Maps each provider name to the name (as a string) of its API class.
API_CLASSES = {
    'Azure': 'AzureAPI',
    'Google': 'GoogleAPI',
    'Anthropic': 'AnthropicAPI',
    'OpenAI': 'OpenAIAPI'
}

# Per-provider model choices and display labels for the key and endpoint fields.
PROVIDERS_CONFIG = {
    'Azure': {
        'model': ['GPT-4o', 'GPT-4v'],
        'key_label': 'Azure API Key',
        'endpoint_label': 'Azure Endpoint'
    },
    'Google': {
        'model': ['Gemini-1.5-Flash', 'Gemini-1.5-Pro'],
        'key_label': 'Google API Key',
        'endpoint_label': 'Google API Endpoint'
    },
    'Anthropic': {
        'model': ['Claude-3-Opus', 'Claude-3-Sonnet'],
        'key_label': 'Anthropic API Key',
        'endpoint_label': 'Anthropic Endpoint'
    },
    'OpenAI': {
        'model': ['GPT-4o', 'GPT-4v'],
        'key_label': 'OpenAI API Key',
        'endpoint_label': 'OpenAI Endpoint'
    }
}

# Per-setting label, default, and allowed range ('min'/'max'/'step') or 'choices'.
GENERAL_CONFIG = {
    'temp': {
        'label': 'Temperature',
        'default': 0.3,
        'min': 0,
        'max': 1,
        'step': 0.1
    },
    'top_p': {
        'label': 'Top-P',
        'default': 0.75,
        'min': 0,
        'max': 1,
        'step': 0.1
    },
    'max_tokens': {
        'label': 'Max Tokens',
        'default': 4096,
        'min': 512,
        'max': 4096,
        'step': 1
    },
    'frame_format': {
        'label': 'Frame Format',
        'default': 'JPEG',
        'choices': ['JPEG', 'PNG']
    },
    'frame_skip': {
        'label': 'Frame Skip',
        'default': 2,
        'min': 2,
        'max': 100,
        'step': 1
    },
    'group_size': {
        'label': 'Group Size',
        'default': 10,
        'min': 1,
        'max': 100,
        'step': 1
    }
}
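

# A minimal sketch, not part of the original module: clamp_to_spec is a
# hypothetical helper showing how one numeric entry of GENERAL_CONFIG
# (a dict with 'min' and 'max' keys) could be used to keep a user-supplied
# value inside its allowed range. How the real application validates input
# is not shown in this file, so treat this as an assumption.
def clamp_to_spec(value, spec):
    """Return value limited to the closed range [spec['min'], spec['max']]."""
    return max(spec['min'], min(spec['max'], value))


# Example: clamp_to_spec(1.7, GENERAL_CONFIG['temp']) returns 1, the 'max'
# for temperature. Entries without 'min'/'max' (such as 'frame_format')
# are out of scope for this helper.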