Spaces:
Runtime error
Runtime error
File size: 6,386 Bytes
dc465b0 b7c7aa0 9ea9a8c dc465b0 b7c7aa0 dc465b0 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 |
SYS_PROMPT = ""
USER_PROMPT = """# CONTEXT #
You are a powerful video captioner.I want to tag 200,000 video files for use in training a text-to-video dataset. The purpose of the video tags is to train a text-to-video model. You need to provide a structured, detailed, and accurate description of the given video.
# OBJECTIVE #
Video Description Task Instructions
Video Content Description:
Detail and Accuracy: Provide a detailed and accurate description of the video content. Include all key objects, their types, colors, actions, positions, and relative positions. Describe the overall atmosphere.
Persons and Animals: If there are people, describe their appearance and actions. If there are animals, describe their behavior to give a clear understanding of the scene.
Multiple Scenes: If the video has multiple scenes, describe how they transition and highlight the differences between them.
Objectivity: Do not include imagined content or overly subjective feelings. Ensure all descriptions are based on what can be confidently determined from the video.
Grammar and Length: Use correct English grammar. Each descriptive sentence should be at least three sentences long.
Video Quality Evaluation:
Aesthetic Value: Evaluate the aesthetic value, including composition, color harmony, and overall visual effect. Score this aspect from 1 to 5 and explain your reasoning.
Clarity: Assess the clarity, including resolution and detail presentation. Score this aspect from 1 to 5 and explain your reasoning.
Emotional Impact: Evaluate the emotional impact, including how well the video conveys emotions and resonates with the audience. Score this aspect from 1 to 5 and explain your reasoning.
Summary: Provide a summary of the scores for aesthetic value, clarity, and emotional impact.
Film Perspective Analysis:
Shot Analysis: Analyze the type of shots used (close-up, medium, long shot, etc.).
Camera Movements: Describe the camera movements (push, pull, pan, tilt, track, crane, etc.).
Composition: Analyze the composition of the shots.
Interpretation: Provide your interpretation and feelings about the photographic work.
# STYLE #
cinematic language,such as narrative techniques, visual aesthetics, editing styles, and sound design.
# Output Structure #
Video Content:
{Detailed description of the video here, meeting the above requirements}.
Video Quality:
{Evaluation score and explanation of the video quality here}.
Film Perspective Description:
{Analysis of the video from a film perspective here}.
Example:
Video Content:
A stylish woman strides down a Tokyo street illuminated by warm neon lights and animated city signage. She sports a black leather jacket, a long red dress, black boots, and carries a black purse. Her look is completed with sunglasses and red lipstick. Her demeanor is confident and casual. The damp street reflects the vibrant lights, creating a mirror effect. The scene is bustling with numerous pedestrians.
Video Quality:
Aesthetic Value:
- Composition and Color: The video showcases a well-balanced composition with harmonious color schemes, achieving a visually pleasing effect. Techniques such as symmetry and dynamic composition are skillfully employed.
- Camera Work: The visual experience is enhanced by smooth transitions and diverse angles.
- Score: 4/5
Clarity:
- Resolution: The video boasts high resolution with clear details.
- Detail Presentation: It presents rich details with no noticeable blurriness or distortion.
- Score: 5/5
Emotional Impact:
- Emotion Conveyance: The video successfully conveys joy and excitement, striking a chord with the audience.
- Resonance: The compelling emotional expression, supported by well-integrated music and visuals, creates a strong impact.
- Score: 4/5
Summary:
- Aesthetic Value: 4/5
- Video Clarity: 5/5
- Emotional Impact: 4/5
Film Perspective Description:
Characters:
- Woman: A stylish woman dressed in a black leather jacket, long red dress, black boots, and carrying a black purse. She wears sunglasses and red lipstick.
Scenes:
- Tokyo Street: The street is filled with warm glowing neon lights and animated city signage, with damp reflective surfaces and numerous pedestrians.
Shot 1:
- The woman walks confidently and casually down the Tokyo street.
- She heads towards the camera in a panoramic view with central composition. The camera is at eye level and follows her with a handheld shot.
- Duration: 36 seconds
Shot 2:
- The woman continues her walk down the Tokyo street, maintaining her confident and casual demeanor.
- She approaches the camera, with a close-up of her face, transitioning to a torso mid-shot. The camera remains at eye level, following her with a handheld shot.
- Duration: 24 seconds
"""
SKIP = 2
TEMP = 0.3
TOP = 0.75
MAX_TOKEN = 512
API_CLASSES = {
'Azure': 'AzureAPI',
'Google': 'GoogleAPI',
'Anthropic': 'AnthropicAPI',
'OpenAI': 'OpenAIAPI'
}
PROVIDERS_CONFIG = {
'Azure': {
'model': ['GPT-4o', 'GPT-4v'],
'key_label': 'Azure API Key',
'endpoint_label': 'Azure Endpoint'
},
'Google': {
'model': ['Gemini-1.5-Flash', 'Gemini-1.5-Pro'],
'key_label': 'Google API Key',
'endpoint_label': 'Google API Endpoint'
},
'Anthropic': {
'model': ['Claude-3-Opus', 'Claude-3-Sonnet'],
'key_label': 'Anthropic API Key',
'endpoint_label': 'Anthropic Endpoint'
},
'OpenAI': {
'model': ['GPT-4o', 'GPT-4v'],
'key_label': 'OpenAI API Key',
'endpoint_label': 'OpenAI Endpoint'
}
}
GENERAL_CONFIG = {
'temp': {
'label': 'Temperature',
'default': 0.3,
'min': 0,
'max': 1,
'step': 0.1
},
'top_p': {
'label': 'Top-P',
'default': 0.75,
'min': 0,
'max': 1,
'step': 0.1
},
'max_tokens': {
'label': 'Max Tokens',
'default': 4096,
'min': 512,
'max': 4096,
'step': 1
},
'frame_format': {
'label': 'Frame Format',
'default': 'JPEG',
'choices': ['JPEG', 'PNG']
},
'frame_skip': {
'label': 'Frame Skip',
'default': 2,
'min': 2,
'max': 100,
'step': 1
},
'group_size': {
'label': 'Group Size',
'default': 10,
'min': 1,
'max': 100,
'step': 1
}
}
|