Spaces:

fffiloni
/

soft-video-understanding

Paused

App Files Files Community

fffiloni commited on Mar 7, 2024

Commit

619165c

verified ·

1 Parent(s): 0459645

Update app.py

Browse files

Files changed (1) hide show

app.py +15 -5

app.py CHANGED Viewed

@@ -18,11 +18,21 @@ zephyr_model = "HuggingFaceH4/zephyr-7b-beta"
 pipe = pipeline("text-generation", model=zephyr_model, torch_dtype=torch.bfloat16, device_map="auto")
 standard_sys = f"""
-You will be provided a list of visual details observed at regular intervals, along with an audio description. These pieces of information originate from a single video. The visual details are extracted from the video at fixed time intervals and represent consecutive frames. Typically, the video consists of a brief sequence showing one or more subjects...
-Please note that the following list of image descriptions (visual details) was obtained by extracting individual frames from a continuous video featuring one or more subjects. Depending on the case, all depicted individuals may correspond to the same person(s), with minor variations due to changes in lighting, angle, and facial expressions over time. Regardless, assume temporal continuity among the frames unless otherwise specified.
-Audio events are actual recordings from the video, representing sounds and spoken words independent of the visuals. Although audio information offers valuable context and can reveal actions or sounds unseen visually, there might be instances where audio information doesn't align perfectly with the visual counterpart. Prioritize visual evidence and exercise caution when incorporating seemingly incongruous auditory clues into your summary. Maintain a healthy skepticism and attempt to reconcile conflicting cues before crafting a comprehensive overview. Your job is to integrate these multimodal inputs intelligently and provide a very short resume about what is happening in the origin video. Provide a succinct overview of what you understood.
 """
 def extract_frames(video_in, interval=24, output_format='.jpg'):

 pipe = pipeline("text-generation", model=zephyr_model, torch_dtype=torch.bfloat16, device_map="auto")
 standard_sys = f"""
+You will be provided a list of visual details observed at regular intervals, along with an audio description.
+These pieces of information originate from a single video.
+The visual details are extracted from the video at fixed time intervals and represent consecutive frames.
+Typically, the video consists of a brief sequence showing one or more subjects...
+ Please note that the following list of image descriptions (visual details) was obtained by extracting individual frames from a continuous video featuring one or more subjects.
+Depending on the case, all depicted individuals may correspond to the same person(s), with minor variations due to changes in lighting, angle, and facial expressions over time.
+Regardless, assume temporal continuity among the frames unless otherwise specified.
+ Audio events are actual recordings from the video, representing sounds and spoken words independent of the visuals.
+Although audio information offers valuable context and can reveal actions or sounds unseen visually, there might be instances where audio information doesn't align perfectly with the visual counterpart.
+Prioritize visual evidence and exercise caution when incorporating seemingly incongruous auditory clues into your summary.
+Maintain a healthy skepticism and attempt to reconcile conflicting cues before crafting a comprehensive overview.
+Your job is to integrate these multimodal inputs intelligently and provide a very short resume about what is happening in the origin video.
+Provide a succinct overview of what you understood.
 """
 def extract_frames(video_in, interval=24, output_format='.jpg'):