LinB203 committed on
Commit
5c98ca3
1 Parent(s): c4ba24f

add project files

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. README.md +132 -13
  2. a_cls/__pycache__/precision.cpython-38.pyc +0 -0
  3. a_cls/__pycache__/stats.cpython-38.pyc +0 -0
  4. a_cls/__pycache__/zero_shot.cpython-38.pyc +0 -0
  5. a_cls/__pycache__/zero_shot_classifier.cpython-38.pyc +0 -0
  6. a_cls/__pycache__/zero_shot_metadata.cpython-38.pyc +0 -0
  7. a_cls/__pycache__/zeroshot_cls.cpython-38.pyc +0 -0
  8. a_cls/class_labels_indices.csv +528 -0
  9. a_cls/dataloader.py +90 -0
  10. a_cls/datasets.py +93 -0
  11. a_cls/filter_eval_audio.py +21 -0
  12. a_cls/precision.py +12 -0
  13. a_cls/stats.py +57 -0
  14. a_cls/util.py +306 -0
  15. a_cls/zero_shot.py +234 -0
  16. a_cls/zero_shot_classifier.py +111 -0
  17. a_cls/zero_shot_metadata.py +183 -0
  18. a_cls/zeroshot_cls.py +46 -0
  19. app.py +327 -0
  20. assets/languagebind.jpg +0 -0
  21. assets/logo.png +0 -0
  22. assets/res1.jpg +0 -0
  23. assets/res2.jpg +0 -0
  24. d_cls/__pycache__/precision.cpython-38.pyc +0 -0
  25. d_cls/__pycache__/zero_shot.cpython-38.pyc +0 -0
  26. d_cls/__pycache__/zero_shot_classifier.cpython-38.pyc +0 -0
  27. d_cls/__pycache__/zero_shot_metadata.cpython-38.pyc +0 -0
  28. d_cls/__pycache__/zeroshot_cls.cpython-38.pyc +0 -0
  29. d_cls/cp_zero_shot_metadata.py +117 -0
  30. d_cls/datasets.py +20 -0
  31. d_cls/precision.py +12 -0
  32. d_cls/zero_shot.py +90 -0
  33. d_cls/zero_shot_classifier.py +111 -0
  34. d_cls/zero_shot_metadata.py +117 -0
  35. d_cls/zeroshot_cls.py +47 -0
  36. data/__pycache__/base_datasets.cpython-38.pyc +0 -0
  37. data/__pycache__/build_datasets.cpython-38.pyc +0 -0
  38. data/__pycache__/new_loadvat.cpython-38.pyc +0 -0
  39. data/__pycache__/process_audio.cpython-38.pyc +0 -0
  40. data/__pycache__/process_depth.cpython-38.pyc +0 -0
  41. data/__pycache__/process_image.cpython-38.pyc +0 -0
  42. data/__pycache__/process_text.cpython-38.pyc +0 -0
  43. data/__pycache__/process_thermal.cpython-38.pyc +0 -0
  44. data/__pycache__/process_video.cpython-38.pyc +0 -0
  45. data/base_datasets.py +159 -0
  46. data/bpe_simple_vocab_16e6.txt.gz +3 -0
  47. data/build_datasets.py +174 -0
  48. data/new_loadvat.py +498 -0
  49. data/process_audio.py +131 -0
  50. data/process_depth.py +55 -0
README.md CHANGED
@@ -1,13 +1,132 @@
- ---
- title: LanguageBind
- emoji: 📈
- colorFrom: yellow
- colorTo: yellow
- sdk: gradio
- sdk_version: 3.46.0
- app_file: app.py
- pinned: false
- license: mit
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ <!---
+ Copyright 2023 The OFA-Sys Team.
+ All rights reserved.
+ This source code is licensed under the Apache 2.0 license found in the LICENSE file in the root directory.
+ -->
+
+
+
+ <p align="center">
+ <img src="assets/logo.png" width="250" />
+ </p>
+ <h2 align="center"> LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment </h2>
+
+ <h5 align="center"> If you like our project, please give us a star ✨ on GitHub for the latest updates. </h5>
+
+ [//]: # (<p align="center">)
+
+ [//]: # ( 📖 <a href="https://arxiv.org/abs/2305.11172">Paper</a>&nbsp&nbsp| &nbsp<a href="datasets.md">Datasets</a>)
+
+ [//]: # (</p>)
+ <br>
+
+ LanguageBind is a language-centric multimodal pretraining approach that takes language as the bind across different modalities, because the language modality is well explored and contains rich semantics. As a result, **all modalities are mapped to a shared feature space**, implementing multimodal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with language-centered alignment data pairs. We thus propose **VIDAL-10M, with 10 million data pairs covering Video, Infrared, Depth, Audio and their corresponding Language.** In VIDAL-10M, all videos come from short-video platforms with **complete semantics** rather than truncated segments from long videos, and the video, depth, infrared, and audio modalities are all aligned to their textual descriptions.
+
+ We have **open-sourced the VIDAL-10M dataset**, which greatly expands the data beyond visual modalities. The following figure shows the architecture of LanguageBind. LanguageBind can be easily extended to segmentation and detection tasks, and potentially to unlimited modalities.
+
+ <p align="center">
+ <img src="assets/languagebind.jpg" width=100%>
+ </p>
+
+ <br>
+
+
+ # News
+ * **2023.10.02:** Released the code, including training & validation scripts and checkpoints.
+ <br></br>
+ # Online Demo
+ Coming soon...
+
+ # Models and Results
+ ## Model Zoo
+ We list the parameters and pretrained checkpoints of LanguageBind below. Note that LanguageBind can be disassembled into different branches to handle different tasks.
+ The cache comes from OpenCLIP, which we downloaded from Hugging Face. Note that the cache for pretrained weights is essentially the Image-Language weights plus a few additional HF config files.
+ We additionally trained a Video-Language model with the LanguageBind method, which is stronger than training on the CLIP4Clip framework.
+ <table border="1" width="100%">
+ <tr align="center">
+ <th>Model</th><th>Ckpt</th><th>Params</th><th>Modality Hidden size</th><th>Modality Layers</th><th>Language Hidden size</th><th>Language Layers</th>
+ </tr>
+ <tr align="center">
+ <td>Video-Language</td><td>TODO</td><td>330M</td><td>1024</td><td>24</td><td>768</td><td>12</td>
+ </tr>
+ <tr align="center">
+ <td>Audio-Language</td><td><a href="https://pan.baidu.com/s/1PFN8aGlnzsOkGjVk6Mzlfg?pwd=sisz">BaiDu</a></td><td>330M</td><td>1024</td><td>24</td><td>768</td><td>12</td>
+ </tr>
+ <tr align="center">
+ <td>Depth-Language</td><td><a href="https://pan.baidu.com/s/1YWlaxqTRhpGvXqCyBbmhyg?pwd=olom">BaiDu</a></td><td>330M</td><td>1024</td><td>24</td><td>768</td><td>12</td>
+ </tr>
+ <tr align="center">
+ <td>Thermal(Infrared)-Language</td><td><a href="https://pan.baidu.com/s/1luUyyKxhadKKc1nk1wizWg?pwd=raf5">BaiDu</a></td><td>330M</td><td>1024</td><td>24</td><td>768</td><td>12</td>
+ </tr>
+ <tr align="center">
+ <td>Image-Language</td><td><a href="https://pan.baidu.com/s/1VBE4OjecMTeIzU08axfFHA?pwd=7j0m">BaiDu</a></td><td>330M</td><td>1024</td><td>24</td><td>768</td><td>12</td>
+ </tr>
+ <tr align="center">
+ <td>Cache for pretrained weight</td><td><a href="https://pan.baidu.com/s/1Tytx5MDSo96rwUmQZVY1Ww?pwd=c7r0">BaiDu</a></td><td>330M</td><td>1024</td><td>24</td><td>768</td><td>12</td>
+ </tr>
+
+ </table>
+ <br>
+
+ ## Results
+ Zero-shot video-text retrieval performance on the MSR-VTT and MSVD datasets. We focus on reporting the parameters of the vision encoder. Our experiments are based on 3 million video-text pairs from VIDAL-10M, and we train on the CLIP4Clip framework.
+ <p align="center">
+ <img src="assets/res1.jpg" width=100%>
+ </p>
+ Infrared-Language, Depth-Language, and Audio-Language zero-shot classification. We report the top-1 classification accuracy for all datasets.
+ <p align="center">
+ <img src="assets/res2.jpg" width=100%>
+ </p>
+
+
+ <br></br>
+
+ # Requirements and Installation
+ * Python >= 3.8
+ * PyTorch >= 1.13.0
+ * CUDA Version >= 10.2 (11.6 is recommended)
+ * Install required packages:
+ ```bash
+ git clone https://github.com/PKU-YuanGroup/LanguageBind
+ cd LanguageBind
+ pip install -r requirements.txt
+ ```
+
+ <br></br>
+
+ # VIDAL-10M
+ The dataset will be released after publication...
+
+ <br></br>
+
+ # Training & Inference
+ Run scripts and details coming soon...
+
+ <br></br>
+
+ # Downstream datasets
+ Coming soon...
+
+ <br></br>
+
+ # Acknowledgement
+ * [OpenCLIP](https://github.com/mlfoundations/open_clip): an open-source pretraining framework.
+
+ <br></br>
+
+
+ # Citation
+
+ If you find our paper and code useful in your research, please consider giving a star :star: and a citation :pencil: :)
+
+ <br></br>
+
+ ```BibTeX
+
+ ```
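For the Model Zoo above, this commit does not yet include a loading snippet; the sketch below is a hypothetical illustration only, assuming the released checkpoints are plain PyTorch state dicts (the local path and the "state_dict" key are assumptions, not part of the repo).

```python
# Hypothetical sketch: inspect a downloaded LanguageBind branch checkpoint.
import torch

ckpt_path = "checkpoints/audio_language.pt"        # assumed local path after download
state = torch.load(ckpt_path, map_location="cpu")

# Some training frameworks nest weights under a "state_dict" key; fall back to the object itself.
state_dict = state.get("state_dict", state) if isinstance(state, dict) else state

# Print a few parameter names and shapes to confirm which branch the file contains.
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))

# model.load_state_dict(state_dict, strict=False)   # once the matching model has been built
```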
a_cls/__pycache__/precision.cpython-38.pyc ADDED
Binary file (582 Bytes).
 
a_cls/__pycache__/stats.cpython-38.pyc ADDED
Binary file (1.45 kB).
 
a_cls/__pycache__/zero_shot.cpython-38.pyc ADDED
Binary file (6.38 kB).
 
a_cls/__pycache__/zero_shot_classifier.cpython-38.pyc ADDED
Binary file (4.25 kB).
 
a_cls/__pycache__/zero_shot_metadata.cpython-38.pyc ADDED
Binary file (16.7 kB).
 
a_cls/__pycache__/zeroshot_cls.cpython-38.pyc ADDED
Binary file (1.44 kB).
 
a_cls/class_labels_indices.csv ADDED
@@ -0,0 +1,528 @@
1
+ index,mid,display_name
2
+ 0,/m/09x0r,"Speech"
3
+ 1,/m/05zppz,"Male speech, man speaking"
4
+ 2,/m/02zsn,"Female speech, woman speaking"
5
+ 3,/m/0ytgt,"Child speech, kid speaking"
6
+ 4,/m/01h8n0,"Conversation"
7
+ 5,/m/02qldy,"Narration, monologue"
8
+ 6,/m/0261r1,"Babbling"
9
+ 7,/m/0brhx,"Speech synthesizer"
10
+ 8,/m/07p6fty,"Shout"
11
+ 9,/m/07q4ntr,"Bellow"
12
+ 10,/m/07rwj3x,"Whoop"
13
+ 11,/m/07sr1lc,"Yell"
14
+ 12,/m/04gy_2,"Battle cry"
15
+ 13,/t/dd00135,"Children shouting"
16
+ 14,/m/03qc9zr,"Screaming"
17
+ 15,/m/02rtxlg,"Whispering"
18
+ 16,/m/01j3sz,"Laughter"
19
+ 17,/t/dd00001,"Baby laughter"
20
+ 18,/m/07r660_,"Giggle"
21
+ 19,/m/07s04w4,"Snicker"
22
+ 20,/m/07sq110,"Belly laugh"
23
+ 21,/m/07rgt08,"Chuckle, chortle"
24
+ 22,/m/0463cq4,"Crying, sobbing"
25
+ 23,/t/dd00002,"Baby cry, infant cry"
26
+ 24,/m/07qz6j3,"Whimper"
27
+ 25,/m/07qw_06,"Wail, moan"
28
+ 26,/m/07plz5l,"Sigh"
29
+ 27,/m/015lz1,"Singing"
30
+ 28,/m/0l14jd,"Choir"
31
+ 29,/m/01swy6,"Yodeling"
32
+ 30,/m/02bk07,"Chant"
33
+ 31,/m/01c194,"Mantra"
34
+ 32,/t/dd00003,"Male singing"
35
+ 33,/t/dd00004,"Female singing"
36
+ 34,/t/dd00005,"Child singing"
37
+ 35,/t/dd00006,"Synthetic singing"
38
+ 36,/m/06bxc,"Rapping"
39
+ 37,/m/02fxyj,"Humming"
40
+ 38,/m/07s2xch,"Groan"
41
+ 39,/m/07r4k75,"Grunt"
42
+ 40,/m/01w250,"Whistling"
43
+ 41,/m/0lyf6,"Breathing"
44
+ 42,/m/07mzm6,"Wheeze"
45
+ 43,/m/01d3sd,"Snoring"
46
+ 44,/m/07s0dtb,"Gasp"
47
+ 45,/m/07pyy8b,"Pant"
48
+ 46,/m/07q0yl5,"Snort"
49
+ 47,/m/01b_21,"Cough"
50
+ 48,/m/0dl9sf8,"Throat clearing"
51
+ 49,/m/01hsr_,"Sneeze"
52
+ 50,/m/07ppn3j,"Sniff"
53
+ 51,/m/06h7j,"Run"
54
+ 52,/m/07qv_x_,"Shuffle"
55
+ 53,/m/07pbtc8,"Walk, footsteps"
56
+ 54,/m/03cczk,"Chewing, mastication"
57
+ 55,/m/07pdhp0,"Biting"
58
+ 56,/m/0939n_,"Gargling"
59
+ 57,/m/01g90h,"Stomach rumble"
60
+ 58,/m/03q5_w,"Burping, eructation"
61
+ 59,/m/02p3nc,"Hiccup"
62
+ 60,/m/02_nn,"Fart"
63
+ 61,/m/0k65p,"Hands"
64
+ 62,/m/025_jnm,"Finger snapping"
65
+ 63,/m/0l15bq,"Clapping"
66
+ 64,/m/01jg02,"Heart sounds, heartbeat"
67
+ 65,/m/01jg1z,"Heart murmur"
68
+ 66,/m/053hz1,"Cheering"
69
+ 67,/m/028ght,"Applause"
70
+ 68,/m/07rkbfh,"Chatter"
71
+ 69,/m/03qtwd,"Crowd"
72
+ 70,/m/07qfr4h,"Hubbub, speech noise, speech babble"
73
+ 71,/t/dd00013,"Children playing"
74
+ 72,/m/0jbk,"Animal"
75
+ 73,/m/068hy,"Domestic animals, pets"
76
+ 74,/m/0bt9lr,"Dog"
77
+ 75,/m/05tny_,"Bark"
78
+ 76,/m/07r_k2n,"Yip"
79
+ 77,/m/07qf0zm,"Howl"
80
+ 78,/m/07rc7d9,"Bow-wow"
81
+ 79,/m/0ghcn6,"Growling"
82
+ 80,/t/dd00136,"Whimper (dog)"
83
+ 81,/m/01yrx,"Cat"
84
+ 82,/m/02yds9,"Purr"
85
+ 83,/m/07qrkrw,"Meow"
86
+ 84,/m/07rjwbb,"Hiss"
87
+ 85,/m/07r81j2,"Caterwaul"
88
+ 86,/m/0ch8v,"Livestock, farm animals, working animals"
89
+ 87,/m/03k3r,"Horse"
90
+ 88,/m/07rv9rh,"Clip-clop"
91
+ 89,/m/07q5rw0,"Neigh, whinny"
92
+ 90,/m/01xq0k1,"Cattle, bovinae"
93
+ 91,/m/07rpkh9,"Moo"
94
+ 92,/m/0239kh,"Cowbell"
95
+ 93,/m/068zj,"Pig"
96
+ 94,/t/dd00018,"Oink"
97
+ 95,/m/03fwl,"Goat"
98
+ 96,/m/07q0h5t,"Bleat"
99
+ 97,/m/07bgp,"Sheep"
100
+ 98,/m/025rv6n,"Fowl"
101
+ 99,/m/09b5t,"Chicken, rooster"
102
+ 100,/m/07st89h,"Cluck"
103
+ 101,/m/07qn5dc,"Crowing, cock-a-doodle-doo"
104
+ 102,/m/01rd7k,"Turkey"
105
+ 103,/m/07svc2k,"Gobble"
106
+ 104,/m/09ddx,"Duck"
107
+ 105,/m/07qdb04,"Quack"
108
+ 106,/m/0dbvp,"Goose"
109
+ 107,/m/07qwf61,"Honk"
110
+ 108,/m/01280g,"Wild animals"
111
+ 109,/m/0cdnk,"Roaring cats (lions, tigers)"
112
+ 110,/m/04cvmfc,"Roar"
113
+ 111,/m/015p6,"Bird"
114
+ 112,/m/020bb7,"Bird vocalization, bird call, bird song"
115
+ 113,/m/07pggtn,"Chirp, tweet"
116
+ 114,/m/07sx8x_,"Squawk"
117
+ 115,/m/0h0rv,"Pigeon, dove"
118
+ 116,/m/07r_25d,"Coo"
119
+ 117,/m/04s8yn,"Crow"
120
+ 118,/m/07r5c2p,"Caw"
121
+ 119,/m/09d5_,"Owl"
122
+ 120,/m/07r_80w,"Hoot"
123
+ 121,/m/05_wcq,"Bird flight, flapping wings"
124
+ 122,/m/01z5f,"Canidae, dogs, wolves"
125
+ 123,/m/06hps,"Rodents, rats, mice"
126
+ 124,/m/04rmv,"Mouse"
127
+ 125,/m/07r4gkf,"Patter"
128
+ 126,/m/03vt0,"Insect"
129
+ 127,/m/09xqv,"Cricket"
130
+ 128,/m/09f96,"Mosquito"
131
+ 129,/m/0h2mp,"Fly, housefly"
132
+ 130,/m/07pjwq1,"Buzz"
133
+ 131,/m/01h3n,"Bee, wasp, etc."
134
+ 132,/m/09ld4,"Frog"
135
+ 133,/m/07st88b,"Croak"
136
+ 134,/m/078jl,"Snake"
137
+ 135,/m/07qn4z3,"Rattle"
138
+ 136,/m/032n05,"Whale vocalization"
139
+ 137,/m/04rlf,"Music"
140
+ 138,/m/04szw,"Musical instrument"
141
+ 139,/m/0fx80y,"Plucked string instrument"
142
+ 140,/m/0342h,"Guitar"
143
+ 141,/m/02sgy,"Electric guitar"
144
+ 142,/m/018vs,"Bass guitar"
145
+ 143,/m/042v_gx,"Acoustic guitar"
146
+ 144,/m/06w87,"Steel guitar, slide guitar"
147
+ 145,/m/01glhc,"Tapping (guitar technique)"
148
+ 146,/m/07s0s5r,"Strum"
149
+ 147,/m/018j2,"Banjo"
150
+ 148,/m/0jtg0,"Sitar"
151
+ 149,/m/04rzd,"Mandolin"
152
+ 150,/m/01bns_,"Zither"
153
+ 151,/m/07xzm,"Ukulele"
154
+ 152,/m/05148p4,"Keyboard (musical)"
155
+ 153,/m/05r5c,"Piano"
156
+ 154,/m/01s0ps,"Electric piano"
157
+ 155,/m/013y1f,"Organ"
158
+ 156,/m/03xq_f,"Electronic organ"
159
+ 157,/m/03gvt,"Hammond organ"
160
+ 158,/m/0l14qv,"Synthesizer"
161
+ 159,/m/01v1d8,"Sampler"
162
+ 160,/m/03q5t,"Harpsichord"
163
+ 161,/m/0l14md,"Percussion"
164
+ 162,/m/02hnl,"Drum kit"
165
+ 163,/m/0cfdd,"Drum machine"
166
+ 164,/m/026t6,"Drum"
167
+ 165,/m/06rvn,"Snare drum"
168
+ 166,/m/03t3fj,"Rimshot"
169
+ 167,/m/02k_mr,"Drum roll"
170
+ 168,/m/0bm02,"Bass drum"
171
+ 169,/m/011k_j,"Timpani"
172
+ 170,/m/01p970,"Tabla"
173
+ 171,/m/01qbl,"Cymbal"
174
+ 172,/m/03qtq,"Hi-hat"
175
+ 173,/m/01sm1g,"Wood block"
176
+ 174,/m/07brj,"Tambourine"
177
+ 175,/m/05r5wn,"Rattle (instrument)"
178
+ 176,/m/0xzly,"Maraca"
179
+ 177,/m/0mbct,"Gong"
180
+ 178,/m/016622,"Tubular bells"
181
+ 179,/m/0j45pbj,"Mallet percussion"
182
+ 180,/m/0dwsp,"Marimba, xylophone"
183
+ 181,/m/0dwtp,"Glockenspiel"
184
+ 182,/m/0dwt5,"Vibraphone"
185
+ 183,/m/0l156b,"Steelpan"
186
+ 184,/m/05pd6,"Orchestra"
187
+ 185,/m/01kcd,"Brass instrument"
188
+ 186,/m/0319l,"French horn"
189
+ 187,/m/07gql,"Trumpet"
190
+ 188,/m/07c6l,"Trombone"
191
+ 189,/m/0l14_3,"Bowed string instrument"
192
+ 190,/m/02qmj0d,"String section"
193
+ 191,/m/07y_7,"Violin, fiddle"
194
+ 192,/m/0d8_n,"Pizzicato"
195
+ 193,/m/01xqw,"Cello"
196
+ 194,/m/02fsn,"Double bass"
197
+ 195,/m/085jw,"Wind instrument, woodwind instrument"
198
+ 196,/m/0l14j_,"Flute"
199
+ 197,/m/06ncr,"Saxophone"
200
+ 198,/m/01wy6,"Clarinet"
201
+ 199,/m/03m5k,"Harp"
202
+ 200,/m/0395lw,"Bell"
203
+ 201,/m/03w41f,"Church bell"
204
+ 202,/m/027m70_,"Jingle bell"
205
+ 203,/m/0gy1t2s,"Bicycle bell"
206
+ 204,/m/07n_g,"Tuning fork"
207
+ 205,/m/0f8s22,"Chime"
208
+ 206,/m/026fgl,"Wind chime"
209
+ 207,/m/0150b9,"Change ringing (campanology)"
210
+ 208,/m/03qjg,"Harmonica"
211
+ 209,/m/0mkg,"Accordion"
212
+ 210,/m/0192l,"Bagpipes"
213
+ 211,/m/02bxd,"Didgeridoo"
214
+ 212,/m/0l14l2,"Shofar"
215
+ 213,/m/07kc_,"Theremin"
216
+ 214,/m/0l14t7,"Singing bowl"
217
+ 215,/m/01hgjl,"Scratching (performance technique)"
218
+ 216,/m/064t9,"Pop music"
219
+ 217,/m/0glt670,"Hip hop music"
220
+ 218,/m/02cz_7,"Beatboxing"
221
+ 219,/m/06by7,"Rock music"
222
+ 220,/m/03lty,"Heavy metal"
223
+ 221,/m/05r6t,"Punk rock"
224
+ 222,/m/0dls3,"Grunge"
225
+ 223,/m/0dl5d,"Progressive rock"
226
+ 224,/m/07sbbz2,"Rock and roll"
227
+ 225,/m/05w3f,"Psychedelic rock"
228
+ 226,/m/06j6l,"Rhythm and blues"
229
+ 227,/m/0gywn,"Soul music"
230
+ 228,/m/06cqb,"Reggae"
231
+ 229,/m/01lyv,"Country"
232
+ 230,/m/015y_n,"Swing music"
233
+ 231,/m/0gg8l,"Bluegrass"
234
+ 232,/m/02x8m,"Funk"
235
+ 233,/m/02w4v,"Folk music"
236
+ 234,/m/06j64v,"Middle Eastern music"
237
+ 235,/m/03_d0,"Jazz"
238
+ 236,/m/026z9,"Disco"
239
+ 237,/m/0ggq0m,"Classical music"
240
+ 238,/m/05lls,"Opera"
241
+ 239,/m/02lkt,"Electronic music"
242
+ 240,/m/03mb9,"House music"
243
+ 241,/m/07gxw,"Techno"
244
+ 242,/m/07s72n,"Dubstep"
245
+ 243,/m/0283d,"Drum and bass"
246
+ 244,/m/0m0jc,"Electronica"
247
+ 245,/m/08cyft,"Electronic dance music"
248
+ 246,/m/0fd3y,"Ambient music"
249
+ 247,/m/07lnk,"Trance music"
250
+ 248,/m/0g293,"Music of Latin America"
251
+ 249,/m/0ln16,"Salsa music"
252
+ 250,/m/0326g,"Flamenco"
253
+ 251,/m/0155w,"Blues"
254
+ 252,/m/05fw6t,"Music for children"
255
+ 253,/m/02v2lh,"New-age music"
256
+ 254,/m/0y4f8,"Vocal music"
257
+ 255,/m/0z9c,"A capella"
258
+ 256,/m/0164x2,"Music of Africa"
259
+ 257,/m/0145m,"Afrobeat"
260
+ 258,/m/02mscn,"Christian music"
261
+ 259,/m/016cjb,"Gospel music"
262
+ 260,/m/028sqc,"Music of Asia"
263
+ 261,/m/015vgc,"Carnatic music"
264
+ 262,/m/0dq0md,"Music of Bollywood"
265
+ 263,/m/06rqw,"Ska"
266
+ 264,/m/02p0sh1,"Traditional music"
267
+ 265,/m/05rwpb,"Independent music"
268
+ 266,/m/074ft,"Song"
269
+ 267,/m/025td0t,"Background music"
270
+ 268,/m/02cjck,"Theme music"
271
+ 269,/m/03r5q_,"Jingle (music)"
272
+ 270,/m/0l14gg,"Soundtrack music"
273
+ 271,/m/07pkxdp,"Lullaby"
274
+ 272,/m/01z7dr,"Video game music"
275
+ 273,/m/0140xf,"Christmas music"
276
+ 274,/m/0ggx5q,"Dance music"
277
+ 275,/m/04wptg,"Wedding music"
278
+ 276,/t/dd00031,"Happy music"
279
+ 277,/t/dd00032,"Funny music"
280
+ 278,/t/dd00033,"Sad music"
281
+ 279,/t/dd00034,"Tender music"
282
+ 280,/t/dd00035,"Exciting music"
283
+ 281,/t/dd00036,"Angry music"
284
+ 282,/t/dd00037,"Scary music"
285
+ 283,/m/03m9d0z,"Wind"
286
+ 284,/m/09t49,"Rustling leaves"
287
+ 285,/t/dd00092,"Wind noise (microphone)"
288
+ 286,/m/0jb2l,"Thunderstorm"
289
+ 287,/m/0ngt1,"Thunder"
290
+ 288,/m/0838f,"Water"
291
+ 289,/m/06mb1,"Rain"
292
+ 290,/m/07r10fb,"Raindrop"
293
+ 291,/t/dd00038,"Rain on surface"
294
+ 292,/m/0j6m2,"Stream"
295
+ 293,/m/0j2kx,"Waterfall"
296
+ 294,/m/05kq4,"Ocean"
297
+ 295,/m/034srq,"Waves, surf"
298
+ 296,/m/06wzb,"Steam"
299
+ 297,/m/07swgks,"Gurgling"
300
+ 298,/m/02_41,"Fire"
301
+ 299,/m/07pzfmf,"Crackle"
302
+ 300,/m/07yv9,"Vehicle"
303
+ 301,/m/019jd,"Boat, Water vehicle"
304
+ 302,/m/0hsrw,"Sailboat, sailing ship"
305
+ 303,/m/056ks2,"Rowboat, canoe, kayak"
306
+ 304,/m/02rlv9,"Motorboat, speedboat"
307
+ 305,/m/06q74,"Ship"
308
+ 306,/m/012f08,"Motor vehicle (road)"
309
+ 307,/m/0k4j,"Car"
310
+ 308,/m/0912c9,"Vehicle horn, car horn, honking"
311
+ 309,/m/07qv_d5,"Toot"
312
+ 310,/m/02mfyn,"Car alarm"
313
+ 311,/m/04gxbd,"Power windows, electric windows"
314
+ 312,/m/07rknqz,"Skidding"
315
+ 313,/m/0h9mv,"Tire squeal"
316
+ 314,/t/dd00134,"Car passing by"
317
+ 315,/m/0ltv,"Race car, auto racing"
318
+ 316,/m/07r04,"Truck"
319
+ 317,/m/0gvgw0,"Air brake"
320
+ 318,/m/05x_td,"Air horn, truck horn"
321
+ 319,/m/02rhddq,"Reversing beeps"
322
+ 320,/m/03cl9h,"Ice cream truck, ice cream van"
323
+ 321,/m/01bjv,"Bus"
324
+ 322,/m/03j1ly,"Emergency vehicle"
325
+ 323,/m/04qvtq,"Police car (siren)"
326
+ 324,/m/012n7d,"Ambulance (siren)"
327
+ 325,/m/012ndj,"Fire engine, fire truck (siren)"
328
+ 326,/m/04_sv,"Motorcycle"
329
+ 327,/m/0btp2,"Traffic noise, roadway noise"
330
+ 328,/m/06d_3,"Rail transport"
331
+ 329,/m/07jdr,"Train"
332
+ 330,/m/04zmvq,"Train whistle"
333
+ 331,/m/0284vy3,"Train horn"
334
+ 332,/m/01g50p,"Railroad car, train wagon"
335
+ 333,/t/dd00048,"Train wheels squealing"
336
+ 334,/m/0195fx,"Subway, metro, underground"
337
+ 335,/m/0k5j,"Aircraft"
338
+ 336,/m/014yck,"Aircraft engine"
339
+ 337,/m/04229,"Jet engine"
340
+ 338,/m/02l6bg,"Propeller, airscrew"
341
+ 339,/m/09ct_,"Helicopter"
342
+ 340,/m/0cmf2,"Fixed-wing aircraft, airplane"
343
+ 341,/m/0199g,"Bicycle"
344
+ 342,/m/06_fw,"Skateboard"
345
+ 343,/m/02mk9,"Engine"
346
+ 344,/t/dd00065,"Light engine (high frequency)"
347
+ 345,/m/08j51y,"Dental drill, dentist's drill"
348
+ 346,/m/01yg9g,"Lawn mower"
349
+ 347,/m/01j4z9,"Chainsaw"
350
+ 348,/t/dd00066,"Medium engine (mid frequency)"
351
+ 349,/t/dd00067,"Heavy engine (low frequency)"
352
+ 350,/m/01h82_,"Engine knocking"
353
+ 351,/t/dd00130,"Engine starting"
354
+ 352,/m/07pb8fc,"Idling"
355
+ 353,/m/07q2z82,"Accelerating, revving, vroom"
356
+ 354,/m/02dgv,"Door"
357
+ 355,/m/03wwcy,"Doorbell"
358
+ 356,/m/07r67yg,"Ding-dong"
359
+ 357,/m/02y_763,"Sliding door"
360
+ 358,/m/07rjzl8,"Slam"
361
+ 359,/m/07r4wb8,"Knock"
362
+ 360,/m/07qcpgn,"Tap"
363
+ 361,/m/07q6cd_,"Squeak"
364
+ 362,/m/0642b4,"Cupboard open or close"
365
+ 363,/m/0fqfqc,"Drawer open or close"
366
+ 364,/m/04brg2,"Dishes, pots, and pans"
367
+ 365,/m/023pjk,"Cutlery, silverware"
368
+ 366,/m/07pn_8q,"Chopping (food)"
369
+ 367,/m/0dxrf,"Frying (food)"
370
+ 368,/m/0fx9l,"Microwave oven"
371
+ 369,/m/02pjr4,"Blender"
372
+ 370,/m/02jz0l,"Water tap, faucet"
373
+ 371,/m/0130jx,"Sink (filling or washing)"
374
+ 372,/m/03dnzn,"Bathtub (filling or washing)"
375
+ 373,/m/03wvsk,"Hair dryer"
376
+ 374,/m/01jt3m,"Toilet flush"
377
+ 375,/m/012xff,"Toothbrush"
378
+ 376,/m/04fgwm,"Electric toothbrush"
379
+ 377,/m/0d31p,"Vacuum cleaner"
380
+ 378,/m/01s0vc,"Zipper (clothing)"
381
+ 379,/m/03v3yw,"Keys jangling"
382
+ 380,/m/0242l,"Coin (dropping)"
383
+ 381,/m/01lsmm,"Scissors"
384
+ 382,/m/02g901,"Electric shaver, electric razor"
385
+ 383,/m/05rj2,"Shuffling cards"
386
+ 384,/m/0316dw,"Typing"
387
+ 385,/m/0c2wf,"Typewriter"
388
+ 386,/m/01m2v,"Computer keyboard"
389
+ 387,/m/081rb,"Writing"
390
+ 388,/m/07pp_mv,"Alarm"
391
+ 389,/m/07cx4,"Telephone"
392
+ 390,/m/07pp8cl,"Telephone bell ringing"
393
+ 391,/m/01hnzm,"Ringtone"
394
+ 392,/m/02c8p,"Telephone dialing, DTMF"
395
+ 393,/m/015jpf,"Dial tone"
396
+ 394,/m/01z47d,"Busy signal"
397
+ 395,/m/046dlr,"Alarm clock"
398
+ 396,/m/03kmc9,"Siren"
399
+ 397,/m/0dgbq,"Civil defense siren"
400
+ 398,/m/030rvx,"Buzzer"
401
+ 399,/m/01y3hg,"Smoke detector, smoke alarm"
402
+ 400,/m/0c3f7m,"Fire alarm"
403
+ 401,/m/04fq5q,"Foghorn"
404
+ 402,/m/0l156k,"Whistle"
405
+ 403,/m/06hck5,"Steam whistle"
406
+ 404,/t/dd00077,"Mechanisms"
407
+ 405,/m/02bm9n,"Ratchet, pawl"
408
+ 406,/m/01x3z,"Clock"
409
+ 407,/m/07qjznt,"Tick"
410
+ 408,/m/07qjznl,"Tick-tock"
411
+ 409,/m/0l7xg,"Gears"
412
+ 410,/m/05zc1,"Pulleys"
413
+ 411,/m/0llzx,"Sewing machine"
414
+ 412,/m/02x984l,"Mechanical fan"
415
+ 413,/m/025wky1,"Air conditioning"
416
+ 414,/m/024dl,"Cash register"
417
+ 415,/m/01m4t,"Printer"
418
+ 416,/m/0dv5r,"Camera"
419
+ 417,/m/07bjf,"Single-lens reflex camera"
420
+ 418,/m/07k1x,"Tools"
421
+ 419,/m/03l9g,"Hammer"
422
+ 420,/m/03p19w,"Jackhammer"
423
+ 421,/m/01b82r,"Sawing"
424
+ 422,/m/02p01q,"Filing (rasp)"
425
+ 423,/m/023vsd,"Sanding"
426
+ 424,/m/0_ksk,"Power tool"
427
+ 425,/m/01d380,"Drill"
428
+ 426,/m/014zdl,"Explosion"
429
+ 427,/m/032s66,"Gunshot, gunfire"
430
+ 428,/m/04zjc,"Machine gun"
431
+ 429,/m/02z32qm,"Fusillade"
432
+ 430,/m/0_1c,"Artillery fire"
433
+ 431,/m/073cg4,"Cap gun"
434
+ 432,/m/0g6b5,"Fireworks"
435
+ 433,/g/122z_qxw,"Firecracker"
436
+ 434,/m/07qsvvw,"Burst, pop"
437
+ 435,/m/07pxg6y,"Eruption"
438
+ 436,/m/07qqyl4,"Boom"
439
+ 437,/m/083vt,"Wood"
440
+ 438,/m/07pczhz,"Chop"
441
+ 439,/m/07pl1bw,"Splinter"
442
+ 440,/m/07qs1cx,"Crack"
443
+ 441,/m/039jq,"Glass"
444
+ 442,/m/07q7njn,"Chink, clink"
445
+ 443,/m/07rn7sz,"Shatter"
446
+ 444,/m/04k94,"Liquid"
447
+ 445,/m/07rrlb6,"Splash, splatter"
448
+ 446,/m/07p6mqd,"Slosh"
449
+ 447,/m/07qlwh6,"Squish"
450
+ 448,/m/07r5v4s,"Drip"
451
+ 449,/m/07prgkl,"Pour"
452
+ 450,/m/07pqc89,"Trickle, dribble"
453
+ 451,/t/dd00088,"Gush"
454
+ 452,/m/07p7b8y,"Fill (with liquid)"
455
+ 453,/m/07qlf79,"Spray"
456
+ 454,/m/07ptzwd,"Pump (liquid)"
457
+ 455,/m/07ptfmf,"Stir"
458
+ 456,/m/0dv3j,"Boiling"
459
+ 457,/m/0790c,"Sonar"
460
+ 458,/m/0dl83,"Arrow"
461
+ 459,/m/07rqsjt,"Whoosh, swoosh, swish"
462
+ 460,/m/07qnq_y,"Thump, thud"
463
+ 461,/m/07rrh0c,"Thunk"
464
+ 462,/m/0b_fwt,"Electronic tuner"
465
+ 463,/m/02rr_,"Effects unit"
466
+ 464,/m/07m2kt,"Chorus effect"
467
+ 465,/m/018w8,"Basketball bounce"
468
+ 466,/m/07pws3f,"Bang"
469
+ 467,/m/07ryjzk,"Slap, smack"
470
+ 468,/m/07rdhzs,"Whack, thwack"
471
+ 469,/m/07pjjrj,"Smash, crash"
472
+ 470,/m/07pc8lb,"Breaking"
473
+ 471,/m/07pqn27,"Bouncing"
474
+ 472,/m/07rbp7_,"Whip"
475
+ 473,/m/07pyf11,"Flap"
476
+ 474,/m/07qb_dv,"Scratch"
477
+ 475,/m/07qv4k0,"Scrape"
478
+ 476,/m/07pdjhy,"Rub"
479
+ 477,/m/07s8j8t,"Roll"
480
+ 478,/m/07plct2,"Crushing"
481
+ 479,/t/dd00112,"Crumpling, crinkling"
482
+ 480,/m/07qcx4z,"Tearing"
483
+ 481,/m/02fs_r,"Beep, bleep"
484
+ 482,/m/07qwdck,"Ping"
485
+ 483,/m/07phxs1,"Ding"
486
+ 484,/m/07rv4dm,"Clang"
487
+ 485,/m/07s02z0,"Squeal"
488
+ 486,/m/07qh7jl,"Creak"
489
+ 487,/m/07qwyj0,"Rustle"
490
+ 488,/m/07s34ls,"Whir"
491
+ 489,/m/07qmpdm,"Clatter"
492
+ 490,/m/07p9k1k,"Sizzle"
493
+ 491,/m/07qc9xj,"Clicking"
494
+ 492,/m/07rwm0c,"Clickety-clack"
495
+ 493,/m/07phhsh,"Rumble"
496
+ 494,/m/07qyrcz,"Plop"
497
+ 495,/m/07qfgpx,"Jingle, tinkle"
498
+ 496,/m/07rcgpl,"Hum"
499
+ 497,/m/07p78v5,"Zing"
500
+ 498,/t/dd00121,"Boing"
501
+ 499,/m/07s12q4,"Crunch"
502
+ 500,/m/028v0c,"Silence"
503
+ 501,/m/01v_m0,"Sine wave"
504
+ 502,/m/0b9m1,"Harmonic"
505
+ 503,/m/0hdsk,"Chirp tone"
506
+ 504,/m/0c1dj,"Sound effect"
507
+ 505,/m/07pt_g0,"Pulse"
508
+ 506,/t/dd00125,"Inside, small room"
509
+ 507,/t/dd00126,"Inside, large room or hall"
510
+ 508,/t/dd00127,"Inside, public space"
511
+ 509,/t/dd00128,"Outside, urban or manmade"
512
+ 510,/t/dd00129,"Outside, rural or natural"
513
+ 511,/m/01b9nn,"Reverberation"
514
+ 512,/m/01jnbd,"Echo"
515
+ 513,/m/096m7z,"Noise"
516
+ 514,/m/06_y0by,"Environmental noise"
517
+ 515,/m/07rgkc5,"Static"
518
+ 516,/m/06xkwv,"Mains hum"
519
+ 517,/m/0g12c5,"Distortion"
520
+ 518,/m/08p9q4,"Sidetone"
521
+ 519,/m/07szfh9,"Cacophony"
522
+ 520,/m/0chx_,"White noise"
523
+ 521,/m/0cj0r,"Pink noise"
524
+ 522,/m/07p_0gm,"Throbbing"
525
+ 523,/m/01jwx6,"Vibration"
526
+ 524,/m/07c52,"Television"
527
+ 525,/m/06bz3,"Radio"
528
+ 526,/m/07hvw1,"Field recording"
a_cls/dataloader.py ADDED
@@ -0,0 +1,90 @@
1
+ # -*- coding: utf-8 -*-
2
+ # @Time : 6/19/21 12:23 AM
3
+ # @Author : Yuan Gong
4
+ # @Affiliation : Massachusetts Institute of Technology
5
+ # @Email : yuangong@mit.edu
6
+ # @File : dataloader.py
7
+
8
+ # modified from:
9
+ # Author: David Harwath
10
+ # with some functions borrowed from https://github.com/SeanNaren/deepspeech.pytorch
11
+
12
+ import csv
13
+ import json
14
+ import logging
15
+
16
+ import torchaudio
17
+ import numpy as np
18
+ import torch
19
+ import torch.nn.functional
20
+ from torch.utils.data import Dataset
21
+ import random
22
+
23
+ def make_index_dict(label_csv):
24
+ index_lookup = {}
25
+ with open(label_csv, 'r') as f:
26
+ csv_reader = csv.DictReader(f)
27
+ line_count = 0
28
+ for row in csv_reader:
29
+ index_lookup[row['mid']] = row['index']
30
+ line_count += 1
31
+ return index_lookup
32
+
33
+ def make_name_dict(label_csv):
34
+ name_lookup = {}
35
+ with open(label_csv, 'r') as f:
36
+ csv_reader = csv.DictReader(f)
37
+ line_count = 0
38
+ for row in csv_reader:
39
+ name_lookup[row['index']] = row['display_name']
40
+ line_count += 1
41
+ return name_lookup
42
+
43
+ def lookup_list(index_list, label_csv):
44
+ label_list = []
45
+ table = make_name_dict(label_csv)
46
+ for item in index_list:
47
+ label_list.append(table[item])
48
+ return label_list
49
+
50
+ def preemphasis(signal,coeff=0.97):
51
+ """perform preemphasis on the input signal.
52
+
53
+ :param signal: The signal to filter.
54
+ :param coeff: The preemphasis coefficient. 0 is none, default 0.97.
55
+ :returns: the filtered signal.
56
+ """
57
+ return np.append(signal[0],signal[1:]-coeff*signal[:-1])
58
+
59
+ class AudiosetDataset(Dataset):
60
+ def __init__(self, dataset_json_file, audio_conf, label_csv=None):
61
+ """
62
+ Dataset that manages audio recordings
63
+ :param audio_conf: Dictionary containing the audio loading and preprocessing settings
64
+ :param dataset_json_file
65
+ """
66
+ self.datapath = dataset_json_file
67
+ with open(dataset_json_file, 'r') as fp:
68
+ data_json = json.load(fp)
69
+ self.data = data_json['data']
70
+ self.index_dict = make_index_dict(label_csv)
71
+ self.label_num = len(self.index_dict)
72
+
73
+ def __getitem__(self, index):
74
+ datum = self.data[index]
75
+ label_indices = np.zeros(self.label_num)
76
+ try:
77
+ fbank, mix_lambda = self._wav2fbank(datum['wav'])
78
+ except Exception as e:
79
+ logging.warning(f"Error at {datum['wav']} with \"{e}\"")
80
+ return self.__getitem__(random.randint(0, self.__len__()-1))
81
+ for label_str in datum['labels'].split(','):
82
+ label_indices[int(self.index_dict[label_str])] = 1.0
83
+
84
+ label_indices = torch.FloatTensor(label_indices)
85
+
86
+
87
+ return fbank, label_indices
88
+
89
+ def __len__(self):
90
+ return len(self.data)
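As a quick reference for the helpers above, the hedged sketch below builds the mid-to-index lookup from the class_labels_indices.csv added in this commit and produces the same multi-hot label vector as AudiosetDataset.__getitem__ (it assumes the repo root is on PYTHONPATH and the usual dependencies such as torchaudio are installed).

```python
# Sketch: exercising make_index_dict / make_name_dict from a_cls/dataloader.py.
import numpy as np
from a_cls.dataloader import make_index_dict, make_name_dict

label_csv = "a_cls/class_labels_indices.csv"   # shipped in this commit
index_dict = make_index_dict(label_csv)        # mid -> index (string)
name_dict = make_name_dict(label_csv)          # index (string) -> display_name

labels = "/m/09x0r,/m/04rlf"                   # illustrative AudioSet mids: Speech, Music
multi_hot = np.zeros(len(index_dict))
for mid in labels.split(','):
    multi_hot[int(index_dict[mid])] = 1.0      # same logic as AudiosetDataset.__getitem__

print([name_dict[str(i)] for i in np.nonzero(multi_hot)[0]])
```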
a_cls/datasets.py ADDED
@@ -0,0 +1,93 @@
1
+ import os.path
2
+
3
+ import torch
4
+
5
+ from data.build_datasets import DataInfo
6
+ from data.process_audio import get_audio_transform, torchaudio_loader
7
+ from torchvision import datasets
8
+
9
+ # -*- coding: utf-8 -*-
10
+ # @Time : 6/19/21 12:23 AM
11
+ # @Author : Yuan Gong
12
+ # @Affiliation : Massachusetts Institute of Technology
13
+ # @Email : yuangong@mit.edu
14
+ # @File : dataloader.py
15
+
16
+ # modified from:
17
+ # Author: David Harwath
18
+ # with some functions borrowed from https://github.com/SeanNaren/deepspeech.pytorch
19
+
20
+ import csv
21
+ import json
22
+ import logging
23
+
24
+ import torchaudio
25
+ import numpy as np
26
+ import torch
27
+ import torch.nn.functional
28
+ from torch.utils.data import Dataset
29
+ import random
30
+
31
+
32
+ def make_index_dict(label_csv):
33
+ index_lookup = {}
34
+ with open(label_csv, 'r') as f:
35
+ csv_reader = csv.DictReader(f)
36
+ line_count = 0
37
+ for row in csv_reader:
38
+ index_lookup[row['mid']] = row['index']
39
+ line_count += 1
40
+ return index_lookup
41
+
42
+
43
+ class AudiosetDataset(Dataset):
44
+ def __init__(self, args, transform, loader):
45
+ self.audio_root = '/apdcephfs_cq3/share_1311970/downstream_datasets/Audio/audioset/eval_segments'
46
+ dataset_json_file = '/apdcephfs_cq3/share_1311970/downstream_datasets/Audio/audioset/filter_eval.json'
47
+ label_csv = '/apdcephfs_cq3/share_1311970/downstream_datasets/Audio/audioset/class_labels_indices.csv'
48
+ with open(dataset_json_file, 'r') as fp:
49
+ data_json = json.load(fp)
50
+ self.data = data_json['data']
51
+ self.index_dict = make_index_dict(label_csv)
52
+ self.label_num = len(self.index_dict)
53
+
54
+ self.args = args
55
+ self.transform = transform
56
+ self.loader = loader
57
+
58
+ def __getitem__(self, index):
59
+ datum = self.data[index]
60
+ label_indices = np.zeros(self.label_num)
61
+ for label_str in datum['labels'].split(','):
62
+ label_indices[int(self.index_dict[label_str])] = 1.0
63
+ label_indices = torch.FloatTensor(label_indices)
64
+
65
+ audio = self.loader(os.path.join(self.audio_root, datum['wav']))
66
+ audio_data = self.transform(audio)
67
+ return audio_data, label_indices
68
+
69
+ def __len__(self):
70
+ return len(self.data)
71
+
72
+
73
+
74
+ def is_valid_file(path):
75
+ return True
76
+
77
+ def get_audio_dataset(args):
78
+ data_path = args.audio_data_path
79
+ transform = get_audio_transform(args)
80
+
81
+ if args.val_a_cls_data.lower() == 'audioset':
82
+ dataset = AudiosetDataset(args, transform=transform, loader=torchaudio_loader)
83
+ else:
84
+ dataset = datasets.ImageFolder(data_path, transform=transform, loader=torchaudio_loader, is_valid_file=is_valid_file)
85
+
86
+ dataloader = torch.utils.data.DataLoader(
87
+ dataset,
88
+ batch_size=args.batch_size,
89
+ num_workers=args.workers,
90
+ sampler=None,
91
+ )
92
+
93
+ return DataInfo(dataloader=dataloader, sampler=None)
a_cls/filter_eval_audio.py ADDED
@@ -0,0 +1,21 @@
+ import json
+ import os.path
+ from tqdm import tqdm
+
+ with open(r"G:\audioset\audioset\zip_audios\16k\eval.json", 'r') as f:
+     data = json.load(f)['data']
+
+ new_data = []
+ total = 0
+ success = 0
+ for i in tqdm(data):
+     total += 1
+     video_id = os.path.basename(i['wav'])
+     new_video_id = 'Y' + video_id
+     i['wav'] = new_video_id
+     if os.path.exists(f"G:/audioset/audioset/zip_audios/eval_segments/{i['wav']}") and not video_id.startswith('mW3S0u8bj58'):
+         new_data.append(i)
+         success += 1
+ print(total, success, total - success)
+ with open(r"G:\audioset\audioset\zip_audios\16k\filter_eval.json", 'w') as f:
+     json.dump({'data': new_data}, f, indent=2)
a_cls/precision.py ADDED
@@ -0,0 +1,12 @@
+ import torch
+ from contextlib import suppress
+
+
+ def get_autocast(precision):
+     if precision == 'amp':
+         return torch.cuda.amp.autocast
+     elif precision == 'amp_bfloat16' or precision == 'amp_bf16':
+         # amp_bfloat16 is more stable than amp float16 for clip training
+         return lambda: torch.cuda.amp.autocast(dtype=torch.bfloat16)
+     else:
+         return suppress
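For context (this mirrors how zero_shot.py uses it), get_autocast returns a context-manager factory, so a call site looks like the short sketch below; the import path assumes the repo root is on PYTHONPATH.

```python
# Sketch: get_autocast('amp') yields torch.cuda.amp.autocast; anything else falls back to a no-op.
import torch
from a_cls.precision import get_autocast

autocast = get_autocast('amp')          # or 'amp_bf16', or e.g. 'fp32' for the suppress no-op
model = torch.nn.Linear(8, 2)
x = torch.randn(4, 8)

with autocast():                        # on a CUDA machine this runs the forward in mixed precision
    logits = model(x)
print(logits.dtype)
```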
a_cls/stats.py ADDED
@@ -0,0 +1,57 @@
1
+ import numpy as np
2
+ from scipy import stats
3
+ from sklearn import metrics
4
+ import torch
5
+
6
+ def d_prime(auc):
7
+ standard_normal = stats.norm()
8
+ d_prime = standard_normal.ppf(auc) * np.sqrt(2.0)
9
+ return d_prime
10
+
11
+ def calculate_stats(output, target):
12
+ """Calculate statistics including mAP, AUC, etc.
13
+
14
+ Args:
15
+ output: 2d array, (samples_num, classes_num)
16
+ target: 2d array, (samples_num, classes_num)
17
+
18
+ Returns:
19
+ stats: list of statistic of each class.
20
+ """
21
+
22
+ classes_num = target.shape[-1]
23
+ stats = []
24
+
25
+ # Accuracy, only used for single-label classification such as esc-50, not for multiple label one such as AudioSet
26
+ acc = metrics.accuracy_score(np.argmax(target, 1), np.argmax(output, 1))
27
+
28
+ # Class-wise statistics
29
+ for k in range(classes_num):
30
+
31
+ # Average precision
32
+ avg_precision = metrics.average_precision_score(
33
+ target[:, k], output[:, k], average=None)
34
+
35
+ # AUC
36
+ auc = metrics.roc_auc_score(target[:, k], output[:, k], average=None)
37
+
38
+ # Precisions, recalls
39
+ (precisions, recalls, thresholds) = metrics.precision_recall_curve(
40
+ target[:, k], output[:, k])
41
+
42
+ # FPR, TPR
43
+ (fpr, tpr, thresholds) = metrics.roc_curve(target[:, k], output[:, k])
44
+
45
+ save_every_steps = 1000 # Sample statistics to reduce size
46
+ dict = {'precisions': precisions[0::save_every_steps],
47
+ 'recalls': recalls[0::save_every_steps],
48
+ 'AP': avg_precision,
49
+ 'fpr': fpr[0::save_every_steps],
50
+ 'fnr': 1. - tpr[0::save_every_steps],
51
+ 'auc': auc,
52
+ # note acc is not class-wise, this is just to keep consistent with other metrics
53
+ 'acc': acc
54
+ }
55
+ stats.append(dict)
56
+
57
+ return stats
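To make the expected shapes concrete, here is a self-contained hedged sketch that feeds random multi-label data through calculate_stats and d_prime, in the same way zero_shot.py aggregates mAP and mAUC (the sizes are arbitrary).

```python
# Sketch: calculate_stats takes (samples, classes) score and multi-hot target arrays.
import numpy as np
from a_cls.stats import calculate_stats, d_prime

rng = np.random.default_rng(0)
scores = rng.random((100, 10))                             # model scores / logits
targets = np.zeros((100, 10))
targets[np.arange(100), rng.integers(0, 10, 100)] = 1.0    # one positive label per sample

stats = calculate_stats(scores, targets)                   # one dict of metrics per class
mAP = np.mean([s['AP'] for s in stats])
mAUC = np.mean([s['auc'] for s in stats])
print(f"mAP={mAP:.3f} mAUC={mAUC:.3f} d_prime={d_prime(mAUC):.3f}")
```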
a_cls/util.py ADDED
@@ -0,0 +1,306 @@
1
+ import math
2
+ import pickle
3
+ import numpy as np
4
+ import torch
5
+ import torch.nn as nn
6
+ import random
7
+ from collections import namedtuple
8
+
9
+ def calc_recalls(S):
10
+ """
11
+ Computes recall at 1, 5, and 10 given a similarity matrix S.
12
+ By convention, rows of S are assumed to correspond to images and columns are captions.
13
+ """
14
+ assert(S.dim() == 2)
15
+ assert(S.size(0) == S.size(1))
16
+ if isinstance(S, torch.autograd.Variable):
17
+ S = S.data
18
+ n = S.size(0)
19
+ A2I_scores, A2I_ind = S.topk(10, 0)
20
+ I2A_scores, I2A_ind = S.topk(10, 1)
21
+ A_r1 = AverageMeter()
22
+ A_r5 = AverageMeter()
23
+ A_r10 = AverageMeter()
24
+ I_r1 = AverageMeter()
25
+ I_r5 = AverageMeter()
26
+ I_r10 = AverageMeter()
27
+ for i in range(n):
28
+ A_foundind = -1
29
+ I_foundind = -1
30
+ for ind in range(10):
31
+ if A2I_ind[ind, i] == i:
32
+ I_foundind = ind
33
+ if I2A_ind[i, ind] == i:
34
+ A_foundind = ind
35
+ # do r1s
36
+ if A_foundind == 0:
37
+ A_r1.update(1)
38
+ else:
39
+ A_r1.update(0)
40
+ if I_foundind == 0:
41
+ I_r1.update(1)
42
+ else:
43
+ I_r1.update(0)
44
+ # do r5s
45
+ if A_foundind >= 0 and A_foundind < 5:
46
+ A_r5.update(1)
47
+ else:
48
+ A_r5.update(0)
49
+ if I_foundind >= 0 and I_foundind < 5:
50
+ I_r5.update(1)
51
+ else:
52
+ I_r5.update(0)
53
+ # do r10s
54
+ if A_foundind >= 0 and A_foundind < 10:
55
+ A_r10.update(1)
56
+ else:
57
+ A_r10.update(0)
58
+ if I_foundind >= 0 and I_foundind < 10:
59
+ I_r10.update(1)
60
+ else:
61
+ I_r10.update(0)
62
+
63
+ recalls = {'A_r1':A_r1.avg, 'A_r5':A_r5.avg, 'A_r10':A_r10.avg,
64
+ 'I_r1':I_r1.avg, 'I_r5':I_r5.avg, 'I_r10':I_r10.avg}
65
+ #'A_meanR':A_meanR.avg, 'I_meanR':I_meanR.avg}
66
+
67
+ return recalls
68
+
69
+ def computeMatchmap(I, A):
70
+ assert(I.dim() == 3)
71
+ assert(A.dim() == 2)
72
+ D = I.size(0)
73
+ H = I.size(1)
74
+ W = I.size(2)
75
+ T = A.size(1)
76
+ Ir = I.view(D, -1).t()
77
+ matchmap = torch.mm(Ir, A)
78
+ matchmap = matchmap.view(H, W, T)
79
+ return matchmap
80
+
81
+ def matchmapSim(M, simtype):
82
+ assert(M.dim() == 3)
83
+ if simtype == 'SISA':
84
+ return M.mean()
85
+ elif simtype == 'MISA':
86
+ M_maxH, _ = M.max(0)
87
+ M_maxHW, _ = M_maxH.max(0)
88
+ return M_maxHW.mean()
89
+ elif simtype == 'SIMA':
90
+ M_maxT, _ = M.max(2)
91
+ return M_maxT.mean()
92
+ else:
93
+ raise ValueError
94
+
95
+ def sampled_margin_rank_loss(image_outputs, audio_outputs, nframes, margin=1., simtype='MISA'):
96
+ """
97
+ Computes the triplet margin ranking loss for each anchor image/caption pair
98
+ The impostor image/caption is randomly sampled from the minibatch
99
+ """
100
+ assert(image_outputs.dim() == 4)
101
+ assert(audio_outputs.dim() == 3)
102
+ n = image_outputs.size(0)
103
+ loss = torch.zeros(1, device=image_outputs.device, requires_grad=True)
104
+ for i in range(n):
105
+ I_imp_ind = i
106
+ A_imp_ind = i
107
+ while I_imp_ind == i:
108
+ I_imp_ind = np.random.randint(0, n)
109
+ while A_imp_ind == i:
110
+ A_imp_ind = np.random.randint(0, n)
111
+ nF = nframes[i]
112
+ nFimp = nframes[A_imp_ind]
113
+ anchorsim = matchmapSim(computeMatchmap(image_outputs[i], audio_outputs[i][:, 0:nF]), simtype)
114
+ Iimpsim = matchmapSim(computeMatchmap(image_outputs[I_imp_ind], audio_outputs[i][:, 0:nF]), simtype)
115
+ Aimpsim = matchmapSim(computeMatchmap(image_outputs[i], audio_outputs[A_imp_ind][:, 0:nFimp]), simtype)
116
+ A2I_simdif = margin + Iimpsim - anchorsim
117
+ if (A2I_simdif.data > 0).all():
118
+ loss = loss + A2I_simdif
119
+ I2A_simdif = margin + Aimpsim - anchorsim
120
+ if (I2A_simdif.data > 0).all():
121
+ loss = loss + I2A_simdif
122
+ loss = loss / n
123
+ return loss
124
+
125
+ def compute_matchmap_similarity_matrix(image_outputs, audio_outputs, nframes, simtype='MISA'):
126
+ """
127
+ Assumes image_outputs is a (batchsize, embedding_dim, rows, height) tensor
128
+ Assumes audio_outputs is a (batchsize, embedding_dim, 1, time) tensor
129
+ Returns similarity matrix S where images are rows and audios are along the columns
130
+ """
131
+ assert(image_outputs.dim() == 4)
132
+ assert(audio_outputs.dim() == 3)
133
+ n = image_outputs.size(0)
134
+ S = torch.zeros(n, n, device=image_outputs.device)
135
+ for image_idx in range(n):
136
+ for audio_idx in range(n):
137
+ nF = max(1, nframes[audio_idx])
138
+ S[image_idx, audio_idx] = matchmapSim(computeMatchmap(image_outputs[image_idx], audio_outputs[audio_idx][:, 0:nF]), simtype)
139
+ return S
140
+
141
+ def compute_pooldot_similarity_matrix(image_outputs, audio_outputs, nframes):
142
+ """
143
+ Assumes image_outputs is a (batchsize, embedding_dim, rows, height) tensor
144
+ Assumes audio_outputs is a (batchsize, embedding_dim, 1, time) tensor
145
+ Returns similarity matrix S where images are rows and audios are along the columns
146
+ S[i][j] is computed as the dot product between the meanpooled embeddings of
147
+ the ith image output and jth audio output
148
+ """
149
+ assert(image_outputs.dim() == 4)
150
+ assert(audio_outputs.dim() == 4)
151
+ n = image_outputs.size(0)
152
+ imagePoolfunc = nn.AdaptiveAvgPool2d((1, 1))
153
+ pooled_image_outputs = imagePoolfunc(image_outputs).squeeze(3).squeeze(2)
154
+ audioPoolfunc = nn.AdaptiveAvgPool2d((1, 1))
155
+ pooled_audio_outputs_list = []
156
+ for idx in range(n):
157
+ nF = max(1, nframes[idx])
158
+ pooled_audio_outputs_list.append(audioPoolfunc(audio_outputs[idx][:, :, 0:nF]).unsqueeze(0))
159
+ pooled_audio_outputs = torch.cat(pooled_audio_outputs_list).squeeze(3).squeeze(2)
160
+ S = torch.mm(pooled_image_outputs, pooled_audio_outputs.t())
161
+ return S
162
+
163
+ def one_imposter_index(i, N):
164
+ imp_ind = random.randint(0, N - 2)
165
+ if imp_ind == i:
166
+ imp_ind = N - 1
167
+ return imp_ind
168
+
169
+ def basic_get_imposter_indices(N):
170
+ imposter_idc = []
171
+ for i in range(N):
172
+ # Select an imposter index for example i:
173
+ imp_ind = one_imposter_index(i, N)
174
+ imposter_idc.append(imp_ind)
175
+ return imposter_idc
176
+
177
+ def semihardneg_triplet_loss_from_S(S, margin):
178
+ """
179
+ Input: Similarity matrix S as an autograd.Variable
180
+ Output: The one-way triplet loss from rows of S to columns of S. Impostors are taken
181
+ to be the most similar point to the anchor that is still less similar to the anchor
182
+ than the positive example.
183
+ You would need to run this function twice, once with S and once with S.t(),
184
+ in order to compute the triplet loss in both directions.
185
+ """
186
+ assert(S.dim() == 2)
187
+ assert(S.size(0) == S.size(1))
188
+ N = S.size(0)
189
+ loss = torch.autograd.Variable(torch.zeros(1).type(S.data.type()), requires_grad=True)
190
+ # Imposter - ground truth
191
+ Sdiff = S - torch.diag(S).view(-1, 1)
192
+ eps = 1e-12
193
+ # All examples less similar than ground truth
194
+ mask = (Sdiff < -eps).type(torch.LongTensor)
195
+ maskf = mask.type_as(S)
196
+ # Mask out all examples >= gt with minimum similarity
197
+ Sp = maskf * Sdiff + (1 - maskf) * torch.min(Sdiff).detach()
198
+ # Find the index maximum similar of the remaining
199
+ _, idc = Sp.max(dim=1)
200
+ idc = idc.data.cpu()
201
+ # Vector mask: 1 iff there exists an example < gt
202
+ has_neg = (mask.sum(dim=1) > 0).data.type(torch.LongTensor)
203
+ # Random imposter indices
204
+ random_imp_ind = torch.LongTensor(basic_get_imposter_indices(N))
205
+ # Use hardneg if there exists an example < gt, otherwise use random imposter
206
+ imp_idc = has_neg * idc + (1 - has_neg) * random_imp_ind
207
+ # This could probably be vectorized too, but I haven't.
208
+ for i, imp in enumerate(imp_idc):
209
+ local_loss = Sdiff[i, imp] + margin
210
+ if (local_loss.data > 0).all():
211
+ loss = loss + local_loss
212
+ loss = loss / N
213
+ return loss
214
+
215
+ def sampled_triplet_loss_from_S(S, margin):
216
+ """
217
+ Input: Similarity matrix S as an autograd.Variable
218
+ Output: The one-way triplet loss from rows of S to columns of S. Imposters are
219
+ randomly sampled from the columns of S.
220
+ You would need to run this function twice, once with S and once with S.t(),
221
+ in order to compute the triplet loss in both directions.
222
+ """
223
+ assert(S.dim() == 2)
224
+ assert(S.size(0) == S.size(1))
225
+ N = S.size(0)
226
+ loss = torch.autograd.Variable(torch.zeros(1).type(S.data.type()), requires_grad=True)
227
+ # Imposter - ground truth
228
+ Sdiff = S - torch.diag(S).view(-1, 1)
229
+ imp_ind = torch.LongTensor(basic_get_imposter_indices(N))
230
+ # This could probably be vectorized too, but I haven't.
231
+ for i, imp in enumerate(imp_ind):
232
+ local_loss = Sdiff[i, imp] + margin
233
+ if (local_loss.data > 0).all():
234
+ loss = loss + local_loss
235
+ loss = loss / N
236
+ return loss
237
+
238
+ class AverageMeter(object):
239
+ """Computes and stores the average and current value"""
240
+ def __init__(self):
241
+ self.reset()
242
+
243
+ def reset(self):
244
+ self.val = 0
245
+ self.avg = 0
246
+ self.sum = 0
247
+ self.count = 0
248
+
249
+ def update(self, val, n=1):
250
+ self.val = val
251
+ self.sum += val * n
252
+ self.count += n
253
+ self.avg = self.sum / self.count
254
+
255
+ def adjust_learning_rate(base_lr, lr_decay, optimizer, epoch):
256
+ """Sets the learning rate to the initial LR decayed by 10 every lr_decay epochs"""
257
+ lr = base_lr * (0.1 ** (epoch // lr_decay))
258
+ print('now learning rate changed to {:f}'.format(lr))
259
+ for param_group in optimizer.param_groups:
260
+ param_group['lr'] = lr
261
+
262
+ def adjust_learning_rate2(base_lr, lr_decay, optimizer, epoch):
263
+ """Sets the learning rate to the initial LR decayed by 10 every lr_decay epochs"""
264
+ for param_group in optimizer.param_groups:
265
+ cur_lr = param_group['lr']
266
+ print('current learning rate is {:f}'.format(cur_lr))
267
+ lr = cur_lr * 0.1
268
+ print('now learning rate changed to {:f}'.format(lr))
269
+ for param_group in optimizer.param_groups:
270
+ param_group['lr'] = lr
271
+
272
+
273
+ def load_progress(prog_pkl, quiet=False):
274
+ """
275
+ load progress pkl file
276
+ Args:
277
+ prog_pkl(str): path to progress pkl file
278
+ Return:
279
+ progress(list):
280
+ epoch(int):
281
+ global_step(int):
282
+ best_epoch(int):
283
+ best_avg_r10(float):
284
+ """
285
+ def _print(msg):
286
+ if not quiet:
287
+ print(msg)
288
+
289
+ with open(prog_pkl, "rb") as f:
290
+ prog = pickle.load(f)
291
+ epoch, global_step, best_epoch, best_avg_r10, _ = prog[-1]
292
+
293
+ _print("\nPrevious Progress:")
294
+ msg = "[%5s %7s %5s %7s %6s]" % ("epoch", "step", "best_epoch", "best_avg_r10", "time")
295
+ _print(msg)
296
+ return prog, epoch, global_step, best_epoch, best_avg_r10
297
+
298
+ def count_parameters(model):
299
+ return sum([p.numel() for p in model.parameters() if p.requires_grad])
300
+
301
+ PrenetConfig = namedtuple(
302
+ 'PrenetConfig', ['input_size', 'hidden_size', 'num_layers', 'dropout'])
303
+
304
+ RNNConfig = namedtuple(
305
+ 'RNNConfig',
306
+ ['input_size', 'hidden_size', 'num_layers', 'dropout', 'residual'])
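A small hedged sanity check for the retrieval utilities above: on a similarity matrix whose diagonal dominates, calc_recalls should report recall@1 of 1.0 in both directions (the matrix here is synthetic, and the import path assumes the repo root is on PYTHONPATH).

```python
# Sketch: calc_recalls on a synthetic similarity matrix with the matching pair on the diagonal.
import torch
from a_cls.util import calc_recalls, AverageMeter

n = 32
S = 0.1 * torch.randn(n, n) + torch.eye(n)   # diagonal entries are the largest with high probability
print(calc_recalls(S))                       # expect A_r1 and I_r1 close to 1.0

meter = AverageMeter()
for v in (1, 0, 1, 1):
    meter.update(v)
print(meter.avg)                             # running average, here 0.75
```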
a_cls/zero_shot.py ADDED
@@ -0,0 +1,234 @@
1
+ import logging
2
+ import os
3
+
4
+ import numpy as np
5
+ import torch
6
+ import torch.nn.functional as F
7
+ from torch import nn
8
+ from tqdm import tqdm
9
+
10
+ from open_clip import get_input_dtype, get_tokenizer
11
+ from open_clip.factory import HF_HUB_PREFIX
12
+ from .precision import get_autocast
13
+ from .stats import calculate_stats, d_prime
14
+ from .zero_shot_classifier import build_zero_shot_classifier
15
+ from .zero_shot_metadata import CLASSNAMES, OPENAI_IMAGENET_TEMPLATES
16
+
17
+
18
+ def accuracy(output, target, topk=(1,)):
19
+ pred = output.topk(max(topk), 1, True, True)[1].t()
20
+ correct = pred.eq(target.view(1, -1).expand_as(pred))
21
+ return [float(correct[:k].reshape(-1).float().sum(0, keepdim=True).cpu().numpy()) for k in topk]
22
+
23
+
24
+ def run(model, classifier, dataloader, args):
25
+ autocast = get_autocast(args.precision)
26
+ input_dtype = get_input_dtype(args.precision)
27
+
28
+ with torch.no_grad():
29
+ top1, top5, n = 0., 0., 0.
30
+ for images, target in tqdm(dataloader, unit_scale=args.batch_size):
31
+ images = images.to(device=args.device, dtype=input_dtype)
32
+ images = images.unsqueeze(2)
33
+ target = target.to(args.device)
34
+
35
+ with autocast():
36
+ # predict
37
+ output = model(image=images)
38
+ image_features = output['image_features'] if isinstance(output, dict) else output[0]
39
+ logits = 100. * image_features @ classifier
40
+
41
+ # measure accuracy
42
+ acc1, acc5 = accuracy(logits, target, topk=(1, 5))
43
+ top1 += acc1
44
+ top5 += acc5
45
+ n += images.size(0)
46
+
47
+ top1 = (top1 / n)
48
+ top5 = (top5 / n)
49
+ return top1, top5
50
+
51
+
52
+ def validate(audio_model, classifier, val_loader, args, epoch):
53
+ epoch = epoch - 1 ########################
54
+ # switch to evaluate mode
55
+ audio_model.eval()
56
+ autocast = get_autocast(args.precision)
57
+ input_dtype = get_input_dtype(args.precision)
58
+ A_predictions = []
59
+ A_targets = []
60
+ A_loss = []
61
+ with torch.no_grad():
62
+ for i, (audio_input, labels) in enumerate(tqdm(val_loader)):
63
+ audio_input = audio_input.to(device=args.device, dtype=input_dtype)
64
+
65
+ # compute output
66
+ with autocast():
67
+ # predict
68
+ output = audio_model(image=audio_input)
69
+ image_features = output['image_features'] if isinstance(output, dict) else output[0]
70
+ logits = 100. * image_features @ classifier
71
+ audio_output = logits
72
+
73
+ # audio_output = torch.sigmoid(audio_output)
74
+ predictions = audio_output.to('cpu').detach()
75
+
76
+ A_predictions.append(predictions)
77
+ A_targets.append(labels)
78
+
79
+ # compute the loss
80
+ labels = labels.to(args.device)
81
+ loss = nn.CrossEntropyLoss()(audio_output, torch.argmax(labels.long(), dim=1))
82
+ A_loss.append(loss.to('cpu').detach())
83
+
84
+ audio_output = torch.cat(A_predictions)
85
+ target = torch.cat(A_targets)
86
+ loss = np.mean(A_loss)
87
+ stats = calculate_stats(audio_output, target)
88
+
89
+ # save the prediction here
90
+ args.a_cls_output_dir = os.path.join(args.log_base_path, f'a_cls/{args.val_a_cls_data.lower()}')
91
+ os.makedirs(args.a_cls_output_dir, exist_ok=True)
92
+ if os.path.exists(args.a_cls_output_dir + '/predictions') == False:
93
+ os.mkdir(args.a_cls_output_dir + '/predictions')
94
+ np.savetxt(args.a_cls_output_dir + '/predictions/target.csv', target, delimiter=',')
95
+ np.savetxt(args.a_cls_output_dir + '/predictions/predictions_' + str(epoch) + '.csv', audio_output,
96
+ delimiter=',')
97
+
98
+ valid_loss = loss
99
+ main_metrics = 'mAP'
100
+ metrics = {}
101
+
102
+ if args.do_train:
103
+ # ensemble results
104
+ cum_stats = validate_ensemble(args, epoch)
105
+ cum_mAP = np.mean([stat['AP'] for stat in cum_stats])
106
+ cum_mAUC = np.mean([stat['auc'] for stat in cum_stats])
107
+ cum_acc = cum_stats[0]['acc']
108
+
109
+ mAP = np.mean([stat['AP'] for stat in stats])
110
+ mAUC = np.mean([stat['auc'] for stat in stats])
111
+ acc = stats[0]['acc']
112
+
113
+ middle_ps = [stat['precisions'][int(len(stat['precisions']) / 2)] for stat in stats]
114
+ middle_rs = [stat['recalls'][int(len(stat['recalls']) / 2)] for stat in stats]
115
+ average_precision = np.mean(middle_ps)
116
+ average_recall = np.mean(middle_rs)
117
+
118
+ if main_metrics == 'mAP':
119
+ logging.info("mAP: {:.6f}".format(mAP))
120
+ else:
121
+ logging.info("acc: {:.6f}".format(acc))
122
+ logging.info("AUC: {:.6f}".format(mAUC))
123
+ logging.info("Avg Precision: {:.6f}".format(average_precision))
124
+ logging.info("Avg Recall: {:.6f}".format(average_recall))
125
+ logging.info("d_prime: {:.6f}".format(d_prime(mAUC)))
126
+ logging.info("valid_loss: {:.6f}".format(valid_loss))
127
+
128
+ if args.do_train:
129
+ logging.info("cum_mAP: {:.6f}".format(cum_mAP))
130
+ logging.info("cum_mAUC: {:.6f}".format(cum_mAUC))
131
+
132
+ if main_metrics == 'mAP':
133
+ metrics['mAP'] = float(mAP)
134
+ else:
135
+ metrics['acc'] = float(acc)
136
+
137
+ metrics['mAUC'] = float(mAUC)
138
+ metrics['average_precision'] = float(average_precision)
139
+ metrics['average_recall'] = float(average_recall)
140
+ metrics['d_prime_mAUC'] = float(d_prime(mAUC))
141
+ metrics['valid_loss'] = float(valid_loss)
142
+
143
+ if args.do_train:
144
+ metrics['cum_mAP'] = float(cum_mAP)
145
+ metrics['cum_mAUC'] = float(cum_mAUC)
146
+
147
+ return metrics
148
+
149
+
150
+ def validate_ensemble(args, epoch):
151
+ exp_dir = args.a_cls_output_dir
152
+ target = np.loadtxt(exp_dir + '/predictions/target.csv', delimiter=',')
153
+ if epoch == 0:
154
+ cum_predictions = np.loadtxt(exp_dir + '/predictions/predictions_0.csv', delimiter=',')
155
+ else:
156
+ cum_predictions = np.loadtxt(exp_dir + '/predictions/cum_predictions.csv', delimiter=',') * (epoch - 1)
157
+ predictions = np.loadtxt(exp_dir + '/predictions/predictions_' + str(epoch) + '.csv', delimiter=',')
158
+ cum_predictions = cum_predictions + predictions
159
+ # remove the prediction file to save storage space
160
+ os.remove(exp_dir + '/predictions/predictions_' + str(epoch - 1) + '.csv')
161
+
162
+ cum_predictions = cum_predictions / (epoch + 1)
163
+ np.savetxt(exp_dir + '/predictions/cum_predictions.csv', cum_predictions, delimiter=',')
164
+
165
+ stats = calculate_stats(cum_predictions, target)
166
+ return stats
167
+
168
+
169
+
170
+
171
+
172
+
173
+
174
+
175
+
176
+ def zero_shot_eval(model, data, epoch, args):
177
+ temp_val_a_cls_data = args.val_a_cls_data
178
+ args.val_a_cls_data = list(data.keys())
179
+ assert len(args.val_a_cls_data) == 1
180
+ args.val_a_cls_data = args.val_a_cls_data[0]
181
+
182
+ if args.val_a_cls_data not in data:
183
+ return {}
184
+ if args.zeroshot_frequency == 0:
185
+ return {}
186
+ if (epoch % args.zeroshot_frequency) != 0 and epoch != args.epochs:
187
+ return {}
188
+ if args.distributed and not args.horovod:
189
+ model = model.module
190
+
191
+ logging.info(f'Starting zero-shot {args.val_a_cls_data.upper()}.')
192
+
193
+ logging.info('Building zero-shot classifier')
194
+ autocast = get_autocast(args.precision)
195
+ with autocast():
196
+ tokenizer = get_tokenizer(HF_HUB_PREFIX+args.model, cache_dir=args.cache_dir)
197
+ # tokenizer = get_tokenizer("ViT-L-14")
198
+ classifier = build_zero_shot_classifier(
199
+ model,
200
+ tokenizer=tokenizer,
201
+ classnames=CLASSNAMES[args.val_a_cls_data],
202
+ templates=OPENAI_IMAGENET_TEMPLATES,
203
+ num_classes_per_batch=10,
204
+ device=args.device,
205
+ use_tqdm=True,
206
+ )
207
+
208
+ logging.info('Using classifier')
209
+ results = {}
210
+ if args.val_a_cls_data.lower() == 'audioset':
211
+ if args.val_a_cls_data in data:
212
+ stats = validate(model, classifier, data[args.val_a_cls_data].dataloader, args, epoch)
213
+ results.update(stats)
214
+ else:
215
+ if args.val_a_cls_data in data:
216
+ top1, top5 = run(model, classifier, data[args.val_a_cls_data].dataloader, args)
217
+ results[f'{args.val_a_cls_data}-zeroshot-val-top1'] = top1
218
+ results[f'{args.val_a_cls_data}-zeroshot-val-top5'] = top5
219
+
220
+ logging.info(f'Finished zero-shot {args.val_a_cls_data.upper()}.')
221
+
222
+ args.val_a_cls_data = temp_val_a_cls_data
223
+ return results
224
+
225
+
226
+
227
+
228
+
229
+
230
+
231
+
232
+
233
+
234
+
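`validate_ensemble` above keeps only a running average on disk: `cum_predictions.csv` stores the mean of all per-epoch prediction matrices seen so far, so each new epoch needs a single multiply-add before re-normalising. A minimal in-memory sketch of the same update rule, using toy arrays instead of the CSV files:

```python
import numpy as np

# Toy per-epoch prediction matrices: 3 epochs, 4 samples, 2 classes.
per_epoch_preds = [np.random.rand(4, 2) for _ in range(3)]

cum = None
for epoch, preds in enumerate(per_epoch_preds):
    if epoch == 0:
        cum = preds.copy()                          # first epoch: the running mean is the prediction itself
    else:
        cum = (cum * epoch + preds) / (epoch + 1)   # fold the new epoch into the running mean
    # cum now equals the plain average over epochs 0..epoch
    assert np.allclose(cum, np.mean(per_epoch_preds[:epoch + 1], axis=0))
```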
a_cls/zero_shot_classifier.py ADDED
@@ -0,0 +1,111 @@
1
+ from functools import partial
2
+ from itertools import islice
3
+ from typing import Callable, List, Optional, Sequence, Union
4
+
5
+ import torch
6
+ import torch.nn.functional as F
7
+
8
+
9
+ def batched(iterable, n):
10
+ """Batch data into lists of length *n*. The last batch may be shorter.
11
+ NOTE based on more-itertools impl, to be replaced by python 3.12 itertools.batched impl
12
+ """
13
+ it = iter(iterable)
14
+ while True:
15
+ batch = list(islice(it, n))
16
+ if not batch:
17
+ break
18
+ yield batch
19
+
20
+
21
+ def build_zero_shot_classifier(
22
+ model,
23
+ tokenizer,
24
+ classnames: Sequence[str],
25
+ templates: Sequence[Union[Callable, str]],
26
+ num_classes_per_batch: Optional[int] = 10,
27
+ device: Union[str, torch.device] = 'cpu',
28
+ use_tqdm: bool = False,
29
+ ):
30
+ """ Build zero-shot classifier weights by iterating over class names in batches
31
+ Args:
32
+ model: CLIP model instance
33
+ tokenizer: CLIP tokenizer instance
34
+ classnames: A sequence of class (label) names
35
+ templates: A sequence of callables or format() friendly strings to produce templates per class name
36
+ num_classes_per_batch: The number of classes to batch together in each forward, all if None
37
+ device: Device to use.
38
+ use_tqdm: Enable TQDM progress bar.
39
+ """
40
+ assert isinstance(templates, Sequence) and len(templates) > 0
41
+ assert isinstance(classnames, Sequence) and len(classnames) > 0
42
+ use_format = isinstance(templates[0], str)
43
+ num_templates = len(templates)
44
+ num_classes = len(classnames)
45
+ if use_tqdm:
46
+ import tqdm
47
+ num_iter = 1 if num_classes_per_batch is None else ((num_classes - 1) // num_classes_per_batch + 1)
48
+ iter_wrap = partial(tqdm.tqdm, total=num_iter, unit_scale=num_classes_per_batch)
49
+ else:
50
+ iter_wrap = iter
51
+
52
+ def _process_batch(batch_classnames):
53
+ num_batch_classes = len(batch_classnames)
54
+ texts = [template.format(c) if use_format else template(c) for c in batch_classnames for template in templates]
55
+ input_ids, attention_mask = tokenizer(texts)
56
+ input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
57
+ class_embeddings = F.normalize(model.encode_text(input_ids, attention_mask), dim=-1)
58
+ class_embeddings = class_embeddings.reshape(num_batch_classes, num_templates, -1).mean(dim=1)
59
+ class_embeddings = class_embeddings / class_embeddings.norm(dim=1, keepdim=True)
60
+ class_embeddings = class_embeddings.T
61
+ return class_embeddings
62
+
63
+ with torch.no_grad():
64
+ if num_classes_per_batch:
65
+ batched_embeds = [_process_batch(batch) for batch in iter_wrap(batched(classnames, num_classes_per_batch))]
66
+ zeroshot_weights = torch.cat(batched_embeds, dim=1)
67
+ else:
68
+ zeroshot_weights = _process_batch(classnames)
69
+ return zeroshot_weights
70
+
71
+
72
+ def build_zero_shot_classifier_legacy(
73
+ model,
74
+ tokenizer,
75
+ classnames: Sequence[str],
76
+ templates: Sequence[Union[Callable, str]],
77
+ device: Union[str, torch.device] = 'cpu',
78
+ use_tqdm: bool = False,
79
+ ):
80
+ """ Build zero-shot classifier weights by iterating over class names 1 by 1
81
+ Args:
82
+ model: CLIP model instance
83
+ tokenizer: CLIP tokenizer instance
84
+ classnames: A sequence of class (label) names
85
+ templates: A sequence of callables or format() friendly strings to produce templates per class name
86
+ device: Device to use.
87
+ use_tqdm: Enable TQDM progress bar.
88
+ """
89
+ assert isinstance(templates, Sequence) and len(templates) > 0
90
+ assert isinstance(classnames, Sequence) and len(classnames) > 0
91
+ if use_tqdm:
92
+ import tqdm
93
+ iter_wrap = tqdm.tqdm
94
+ else:
95
+ iter_wrap = iter
96
+
97
+ use_format = isinstance(templates[0], str)
98
+
99
+ with torch.no_grad():
100
+ zeroshot_weights = []
101
+ for classname in iter_wrap(classnames):
102
+ texts = [template.format(classname) if use_format else template(classname) for template in templates]
103
+ texts = tokenizer(texts).to(device) # tokenize
104
+ class_embeddings = model.encode_text(texts)
105
+ class_embedding = F.normalize(class_embeddings, dim=-1).mean(dim=0)
106
+ class_embedding /= class_embedding.norm()
107
+ zeroshot_weights.append(class_embedding)
108
+ zeroshot_weights = torch.stack(zeroshot_weights, dim=1).to(device)
109
+
110
+ return zeroshot_weights
111
+
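`build_zero_shot_classifier` only assumes a tokenizer that returns `(input_ids, attention_mask)` and a model exposing `encode_text(input_ids, attention_mask)`; it returns a weight matrix of shape `(embed_dim, num_classes)` that is later applied as `image_features @ classifier`. A shape-checking sketch with dummy stand-ins (not the real LanguageBind model or tokenizer), assuming the function above is importable:

```python
import torch


class DummyTextEncoder(torch.nn.Module):
    """Stand-in with the call signature the classifier builder expects."""
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.emb = torch.nn.EmbeddingBag(vocab_size, dim)

    def encode_text(self, input_ids, attention_mask):
        return self.emb(input_ids)  # (batch, dim)


def dummy_tokenizer(texts, length=8):
    # Fake tokenizer: map characters to ids, full attention mask.
    ids = torch.zeros(len(texts), length, dtype=torch.long)
    for i, text in enumerate(texts):
        for j, ch in enumerate(text[:length]):
            ids[i, j] = ord(ch) % 100
    return ids, torch.ones_like(ids)


classifier = build_zero_shot_classifier(
    DummyTextEncoder(),
    tokenizer=dummy_tokenizer,
    classnames=['dog barking', 'rain'],
    templates=(lambda c: f'a sound of a {c}.',),
    num_classes_per_batch=None,
    device='cpu',
)
print(classifier.shape)  # torch.Size([32, 2]), i.e. embed_dim x num_classes
```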
a_cls/zero_shot_metadata.py ADDED
@@ -0,0 +1,183 @@
1
+ import os
2
+
3
+ import pandas as pd
4
+
5
+ # OPENAI_IMAGENET_TEMPLATES = (
6
+ # lambda c: f'This is a sound of {c}.',
7
+ # )
8
+ OPENAI_IMAGENET_TEMPLATES = (
9
+ lambda c: f'a bad sound of a {c}.',
10
+ lambda c: f'a sound of many {c}.',
11
+ lambda c: f'a sculpture of a {c}.',
12
+ lambda c: f'a sound of the hard to see {c}.',
13
+ lambda c: f'a low resolution sound of the {c}.',
14
+ lambda c: f'a rendering of a {c}.',
15
+ lambda c: f'graffiti of a {c}.',
16
+ lambda c: f'a bad sound of the {c}.',
17
+ lambda c: f'a cropped sound of the {c}.',
18
+ lambda c: f'a tattoo of a {c}.',
19
+ lambda c: f'the embroidered {c}.',
20
+ lambda c: f'a sound of a hard to see {c}.',
21
+ lambda c: f'a bright sound of a {c}.',
22
+ lambda c: f'a sound of a clean {c}.',
23
+ lambda c: f'a sound of a dirty {c}.',
24
+ lambda c: f'a dark sound of the {c}.',
25
+ lambda c: f'a drawing of a {c}.',
26
+ lambda c: f'a sound of my {c}.',
27
+ lambda c: f'the plastic {c}.',
28
+ lambda c: f'a sound of the cool {c}.',
29
+ lambda c: f'a close-up sound of a {c}.',
30
+ lambda c: f'a black and white sound of the {c}.',
31
+ lambda c: f'a painting of the {c}.',
32
+ lambda c: f'a painting of a {c}.',
33
+ lambda c: f'a pixelated sound of the {c}.',
34
+ lambda c: f'a sculpture of the {c}.',
35
+ lambda c: f'a bright sound of the {c}.',
36
+ lambda c: f'a cropped sound of a {c}.',
37
+ lambda c: f'a plastic {c}.',
38
+ lambda c: f'a sound of the dirty {c}.',
39
+ lambda c: f'a jpeg corrupted sound of a {c}.',
40
+ lambda c: f'a blurry sound of the {c}.',
41
+ lambda c: f'a sound of the {c}.',
42
+ lambda c: f'a good sound of the {c}.',
43
+ lambda c: f'a rendering of the {c}.',
44
+ lambda c: f'a {c} in a video game.',
45
+ lambda c: f'a sound of one {c}.',
46
+ lambda c: f'a doodle of a {c}.',
47
+ lambda c: f'a close-up sound of the {c}.',
48
+ lambda c: f'a sound of a {c}.',
49
+ lambda c: f'the origami {c}.',
50
+ lambda c: f'the {c} in a video game.',
51
+ lambda c: f'a sketch of a {c}.',
52
+ lambda c: f'a doodle of the {c}.',
53
+ lambda c: f'a origami {c}.',
54
+ lambda c: f'a low resolution sound of a {c}.',
55
+ lambda c: f'the toy {c}.',
56
+ lambda c: f'a rendition of the {c}.',
57
+ lambda c: f'a sound of the clean {c}.',
58
+ lambda c: f'a sound of a large {c}.',
59
+ lambda c: f'a rendition of a {c}.',
60
+ lambda c: f'a sound of a nice {c}.',
61
+ lambda c: f'a sound of a weird {c}.',
62
+ lambda c: f'a blurry sound of a {c}.',
63
+ lambda c: f'a cartoon {c}.',
64
+ lambda c: f'art of a {c}.',
65
+ lambda c: f'a sketch of the {c}.',
66
+ lambda c: f'a embroidered {c}.',
67
+ lambda c: f'a pixelated sound of a {c}.',
68
+ lambda c: f'itap of the {c}.',
69
+ lambda c: f'a jpeg corrupted sound of the {c}.',
70
+ lambda c: f'a good sound of a {c}.',
71
+ lambda c: f'a plushie {c}.',
72
+ lambda c: f'a sound of the nice {c}.',
73
+ lambda c: f'a sound of the small {c}.',
74
+ lambda c: f'a sound of the weird {c}.',
75
+ lambda c: f'the cartoon {c}.',
76
+ lambda c: f'art of the {c}.',
77
+ lambda c: f'a drawing of the {c}.',
78
+ lambda c: f'a sound of the large {c}.',
79
+ lambda c: f'a black and white sound of a {c}.',
80
+ lambda c: f'the plushie {c}.',
81
+ lambda c: f'a dark sound of a {c}.',
82
+ lambda c: f'itap of a {c}.',
83
+ lambda c: f'graffiti of the {c}.',
84
+ lambda c: f'a toy {c}.',
85
+ lambda c: f'itap of my {c}.',
86
+ lambda c: f'a sound of a cool {c}.',
87
+ lambda c: f'a sound of a small {c}.',
88
+ lambda c: f'a tattoo of the {c}.',
89
+ )
90
+
91
+ # a much smaller subset of above prompts
92
+ # from https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb
93
+ SIMPLE_IMAGENET_TEMPLATES = (
94
+ lambda c: f'itap of a {c}.',
95
+ lambda c: f'a bad sound of the {c}.',
96
+ lambda c: f'a origami {c}.',
97
+ lambda c: f'a sound of the large {c}.',
98
+ lambda c: f'a {c} in a video game.',
99
+ lambda c: f'art of the {c}.',
100
+ lambda c: f'a sound of the small {c}.',
101
+ )
102
+
103
+
104
+ PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), "class_labels_indices.csv")
105
+
106
+
107
+ CLASSNAMES = {
108
+ 'Audioset': tuple(pd.read_csv(PATH).values[:, 2]),
109
+ 'ESC50': (
110
+ 'airplane', 'breathing', 'brushing teeth', 'can opening', 'car horn', 'cat', 'chainsaw', 'chirping birds',
111
+ 'church bells', 'clapping', 'clock alarm', 'clock tick', 'coughing', 'cow', 'crackling fire', 'crickets',
112
+ 'crow', 'crying baby', 'dog', 'door wood creaks', 'door wood knock', 'drinking sipping', 'engine', 'fireworks',
113
+ 'footsteps', 'frog', 'glass breaking', 'hand saw', 'helicopter', 'hen', 'insects', 'keyboard typing',
114
+ 'laughing', 'mouse click', 'pig', 'pouring water', 'rain', 'rooster', 'sea waves', 'sheep', 'siren',
115
+ 'sneezing', 'snoring', 'thunderstorm', 'toilet flush', 'train', 'vacuum cleaner', 'washing machine',
116
+ 'water drops', 'wind'
117
+ ),
118
+ 'VGGSound': (
119
+ 'air conditioning noise', 'air horn', 'airplane', 'airplane flyby', 'alarm clock ringing',
120
+ 'alligators, crocodiles hissing', 'ambulance siren', 'arc welding', 'baby babbling', 'baby crying',
121
+ 'baby laughter', 'baltimore oriole calling', 'barn swallow calling', 'basketball bounce',
122
+ 'bathroom ventilation fan running', 'beat boxing', 'bee, wasp, etc. buzzing', 'bird chirping, tweeting',
123
+ 'bird squawking', 'bird wings flapping', 'black capped chickadee calling', 'blowtorch igniting',
124
+ 'bouncing on trampoline', 'bowling impact', 'bull bellowing', 'canary calling', 'cap gun shooting',
125
+ 'car engine idling', 'car engine knocking', 'car engine starting', 'car passing by', 'cat caterwauling',
126
+ 'cat growling', 'cat hissing', 'cat meowing', 'cat purring', 'cattle mooing', 'cattle, bovinae cowbell',
127
+ 'cell phone buzzing', 'chainsawing trees', 'cheetah chirrup', 'chicken clucking', 'chicken crowing',
128
+ 'child singing', 'child speech, kid speaking', 'children shouting', 'chimpanzee pant-hooting',
129
+ 'chinchilla barking', 'chipmunk chirping', 'chopping food', 'chopping wood', 'church bell ringing',
130
+ 'civil defense siren', 'cow lowing', 'coyote howling', 'cricket chirping', 'crow cawing', 'cuckoo bird calling',
131
+ 'cupboard opening or closing', 'cutting hair with electric trimmers', 'dinosaurs bellowing', 'disc scratching',
132
+ 'dog barking', 'dog baying', 'dog bow-wow', 'dog growling', 'dog howling', 'dog whimpering',
133
+ 'donkey, ass braying', 'door slamming', 'driving buses', 'driving motorcycle', 'driving snowmobile',
134
+ 'duck quacking', 'eagle screaming', 'eating with cutlery', 'electric grinder grinding',
135
+ 'electric shaver, electric razor shaving', 'elephant trumpeting', 'eletric blender running', 'elk bugling',
136
+ 'engine accelerating, revving, vroom', 'female singing', 'female speech, woman speaking', 'ferret dooking',
137
+ 'fire crackling', 'fire truck siren', 'fireworks banging', 'firing cannon', 'firing muskets',
138
+ 'fly, housefly buzzing', 'foghorn', 'footsteps on snow', 'forging swords', 'fox barking', 'francolin calling',
139
+ 'frog croaking', 'gibbon howling', 'goat bleating', 'golf driving', 'goose honking', 'hail',
140
+ 'hair dryer drying', 'hammering nails', 'heart sounds, heartbeat', 'hedge trimmer running', 'helicopter',
141
+ 'horse clip-clop', 'horse neighing', 'ice cracking', 'ice cream truck, ice cream van', 'lathe spinning',
142
+ 'lawn mowing', 'lighting firecrackers', 'lions growling', 'lions roaring', 'lip smacking',
143
+ 'machine gun shooting', 'magpie calling', 'male singing', 'male speech, man speaking', 'metronome',
144
+ 'missile launch', 'mosquito buzzing', 'motorboat, speedboat acceleration', 'mouse clicking', 'mouse pattering',
145
+ 'mouse squeaking', 'mynah bird singing', 'ocean burbling', 'opening or closing car doors',
146
+ 'opening or closing car electric windows', 'opening or closing drawers', 'orchestra', 'otter growling',
147
+ 'owl hooting', 'parrot talking', 'penguins braying', 'people babbling', 'people battle cry',
148
+ 'people belly laughing', 'people booing', 'people burping', 'people cheering', 'people clapping',
149
+ 'people coughing', 'people crowd', 'people eating', 'people eating apple', 'people eating crisps',
150
+ 'people eating noodle', 'people farting', 'people finger snapping', 'people gargling', 'people giggling',
151
+ 'people hiccup', 'people humming', 'people marching', 'people nose blowing', 'people running',
152
+ 'people screaming', 'people shuffling', 'people slapping', 'people slurping', 'people sneezing',
153
+ 'people sniggering', 'people sobbing', 'people whispering', 'people whistling', 'pheasant crowing',
154
+ 'pig oinking', 'pigeon, dove cooing', 'planing timber', 'plastic bottle crushing', 'playing accordion',
155
+ 'playing acoustic guitar', 'playing badminton', 'playing bagpipes', 'playing banjo', 'playing bass drum',
156
+ 'playing bass guitar', 'playing bassoon', 'playing bongo', 'playing bugle', 'playing castanets',
157
+ 'playing cello', 'playing clarinet', 'playing congas', 'playing cornet', 'playing cymbal', 'playing darts',
158
+ 'playing didgeridoo', 'playing djembe', 'playing double bass', 'playing drum kit', 'playing electric guitar',
159
+ 'playing electronic organ', 'playing erhu', 'playing flute', 'playing french horn', 'playing glockenspiel',
160
+ 'playing gong', 'playing guiro', 'playing hammond organ', 'playing harmonica', 'playing harp',
161
+ 'playing harpsichord', 'playing hockey', 'playing lacrosse', 'playing mandolin', 'playing marimba, xylophone',
162
+ 'playing oboe', 'playing piano', 'playing saxophone', 'playing shofar', 'playing sitar', 'playing snare drum',
163
+ 'playing squash', 'playing steel guitar, slide guitar', 'playing steelpan', 'playing synthesizer',
164
+ 'playing tabla', 'playing table tennis', 'playing tambourine', 'playing tennis', 'playing theremin',
165
+ 'playing timbales', 'playing timpani', 'playing trombone', 'playing trumpet', 'playing tuning fork',
166
+ 'playing tympani', 'playing ukulele', 'playing vibraphone', 'playing violin, fiddle', 'playing volleyball',
167
+ 'playing washboard', 'playing zither', 'police car (siren)', 'police radio chatter', 'popping popcorn',
168
+ 'printer printing', 'pumping water', 'race car, auto racing', 'railroad car, train wagon', 'raining', 'rapping',
169
+ 'reversing beeps', 'ripping paper', 'roller coaster running', 'rope skipping', 'rowboat, canoe, kayak rowing',
170
+ 'running electric fan', 'sailing', 'scuba diving', 'sea lion barking', 'sea waves', 'sharpen knife',
171
+ 'sheep bleating', 'shot football', 'singing bowl', 'singing choir', 'skateboarding', 'skidding', 'skiing',
172
+ 'sliding door', 'sloshing water', 'slot machine', 'smoke detector beeping', 'snake hissing', 'snake rattling',
173
+ 'splashing water', 'spraying water', 'squishing water', 'stream burbling', 'strike lighter', 'striking bowling',
174
+ 'striking pool', 'subway, metro, underground', 'swimming', 'tap dancing', 'tapping guitar',
175
+ 'telephone bell ringing', 'thunder', 'toilet flushing', 'tornado roaring', 'tractor digging', 'train horning',
176
+ 'train wheels squealing', 'train whistling', 'turkey gobbling', 'typing on computer keyboard',
177
+ 'typing on typewriter', 'underwater bubbling', 'using sewing machines', 'vacuum cleaner cleaning floors',
178
+ 'vehicle horn, car horn, honking', 'volcano explosion', 'warbler chirping', 'waterfall burbling',
179
+ 'whale calling', 'wind chime', 'wind noise', 'wind rustling leaves', 'wood thrush calling',
180
+ 'woodpecker pecking tree', 'writing on blackboard with chalk', 'yodelling', 'zebra braying'
181
+ )
182
+
183
+ }
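`CLASSNAMES['Audioset']` is built at import time from the third column (`display_name`) of the bundled `class_labels_indices.csv`. A quick sanity check of the label list, assuming the CSV shipped in this directory and running from the repository root:

```python
import pandas as pd

df = pd.read_csv('a_cls/class_labels_indices.csv')   # columns: index, mid, display_name
print(len(df), df.values[:, 2][:3].tolist())          # 527 AudioSet classes, first three display names
```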
a_cls/zeroshot_cls.py ADDED
@@ -0,0 +1,46 @@
1
+
2
+ import json
3
+ import logging
4
+ import os
5
+ from training.distributed import is_master
6
+ from .zero_shot import zero_shot_eval
7
+
8
+ try:
9
+ import wandb
10
+ except ImportError:
11
+ wandb = None
12
+
13
+
14
+
15
+ def evaluate_a_cls(model, data, epoch, args, tb_writer=None):
16
+ metrics = {}
17
+ if not is_master(args):
18
+ return metrics
19
+ model.eval()
20
+
21
+ zero_shot_metrics = zero_shot_eval(model, data, epoch, args)
22
+ metrics.update(zero_shot_metrics)
23
+
24
+ if not metrics:
25
+ return metrics
26
+
27
+ logging.info(
28
+ f"Eval Epoch: {epoch} "
29
+ + "\t".join([f"{k}: {round(v, 4):.4f}" for k, v in metrics.items()])
30
+ )
31
+ if args.save_logs:
32
+ for name, val in metrics.items():
33
+ if tb_writer is not None:
34
+ tb_writer.add_scalar(f"val/a_cls/{args.val_a_cls_data[0].lower()}/{name}", val, epoch)
35
+ args.a_cls_output_dir = os.path.join(args.log_base_path, f'a_cls/{args.val_a_cls_data[0].lower()}')
36
+ os.makedirs(args.a_cls_output_dir, exist_ok=True)
37
+ with open(os.path.join(args.a_cls_output_dir, "results.jsonl"), "a+") as f:
38
+ f.write(json.dumps(metrics))
39
+ f.write("\n")
40
+
41
+ if args.wandb:
42
+ assert wandb is not None, 'Please install wandb.'
43
+ for name, val in metrics.items():
44
+ wandb.log({f"val/{name}": val, 'epoch': epoch})
45
+
46
+ return metrics
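`evaluate_a_cls` appends one JSON object per evaluation to `results.jsonl` under the per-dataset output directory, so the metric history can be reloaded line by line. For example, with a hypothetical log path:

```python
import json

with open('logs/a_cls/esc50/results.jsonl') as f:   # hypothetical path under args.log_base_path
    history = [json.loads(line) for line in f]
print(history[-1])  # most recent metrics dict, e.g. {'ESC50-zeroshot-val-top1': ...}
```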
app.py ADDED
@@ -0,0 +1,327 @@
1
+ import os
2
+ import gradio as gr
3
+ import argparse
4
+ import numpy as np
5
+ import torch
6
+ from torch import nn
7
+
8
+ from data.process_image import load_and_transform_image, get_image_transform
9
+ from main import SET_GLOBAL_VALUE
10
+ from model.build_model import create_vat_model
11
+ from data.process_audio import load_and_transform_audio, get_audio_transform
12
+ from data.process_video import load_and_transform_video, get_video_transform
13
+ from data.process_depth import load_and_transform_depth, get_depth_transform
14
+ from data.process_thermal import load_and_transform_thermal, get_thermal_transform
15
+ from data.process_text import load_and_transform_text
16
+ from open_clip import get_tokenizer
17
+ from open_clip.factory import HF_HUB_PREFIX
18
+
19
+ os.system("wget -P model_zoo https://huggingface.co/lb203/LanguageBind/resolve/main/vl.pt")
20
+ os.system("wget -P model_zoo https://huggingface.co/lb203/LanguageBind/resolve/main/al.pt")
21
+ os.system("wget -P model_zoo https://huggingface.co/lb203/LanguageBind/resolve/main/il.pt")
22
+ os.system("wget -P model_zoo https://huggingface.co/lb203/LanguageBind/resolve/main/dl.pt")
23
+ os.system("wget -P model_zoo https://huggingface.co/lb203/LanguageBind/resolve/main/tl.pt")
24
+
25
+ class LanguageBind(nn.Module):
26
+ def __init__(self, args):
27
+ super(LanguageBind, self).__init__()
28
+ temp_clip_type = args.clip_type
29
+ self.modality_encoder = {}
30
+ self.modality_proj = {}
31
+ self.modality_scale = {}
32
+ for c in temp_clip_type:
33
+ args.clip_type = c
34
+ if c == 'il':
35
+ args.convert_to_lora = False
36
+ model = create_vat_model(args)
37
+ args.convert_to_lora = True
38
+ elif c == 'vl':
39
+ args.lora_r = 64
40
+ args.add_time_attn = True
41
+ model = create_vat_model(args)
42
+ args.add_time_attn = False
43
+ args.lora_r = 2
44
+ elif c == 'al':
45
+ args.lora_r = 8
46
+ model = create_vat_model(args)
47
+ args.lora_r = 2
48
+ else:
49
+ model = create_vat_model(args)
50
+ state_dict = torch.load(f'model_zoo/{c}.pt', map_location='cpu')
51
+ if state_dict.get('state_dict', None) is not None:
52
+ state_dict = state_dict['state_dict']
53
+ if next(iter(state_dict.items()))[0].startswith('module'):
54
+ state_dict = {k[7:]: v for k, v in state_dict.items()}
55
+ msg = model.load_state_dict(state_dict, strict=False)
56
+ print(f'load {c}, {msg}')
57
+ if c == 'vl':
58
+ self.modality_encoder['video'] = model.vision_model
59
+ self.modality_proj['video'] = model.visual_projection
60
+ self.modality_scale['video'] = model.logit_scale
61
+ elif c == 'al':
62
+ self.modality_encoder['audio'] = model.vision_model
63
+ self.modality_proj['audio'] = model.visual_projection
64
+ self.modality_scale['audio'] = model.logit_scale
65
+ elif c == 'dl':
66
+ self.modality_encoder['depth'] = model.vision_model
67
+ self.modality_proj['depth'] = model.visual_projection
68
+ self.modality_scale['depth'] = model.logit_scale
69
+ elif c == 'tl':
70
+ self.modality_encoder['thermal'] = model.vision_model
71
+ self.modality_proj['thermal'] = model.visual_projection
72
+ self.modality_scale['thermal'] = model.logit_scale
73
+ elif c == 'il':
74
+ self.modality_encoder['image'] = model.vision_model
75
+ self.modality_proj['image'] = model.visual_projection
76
+ self.modality_scale['image'] = model.logit_scale
77
+ else:
78
+ raise NameError(f'No clip_type of {c}')
79
+ self.modality_encoder['language'] = model.text_model
80
+ self.modality_proj['language'] = model.text_projection
81
+
82
+ self.modality_encoder = nn.ModuleDict(self.modality_encoder)
83
+ self.modality_proj = nn.ModuleDict(self.modality_proj)
84
+
85
+ def forward(self, inputs):
86
+ outputs = {}
87
+ for key, value in inputs.items():
88
+ value = self.modality_encoder[key](**value)[1]
89
+ value = self.modality_proj[key](value)
90
+ value = value / value.norm(p=2, dim=-1, keepdim=True)
91
+ # if key != 'language':
92
+ # value = value * self.modality_scale[key].exp()
93
+ outputs[key] = value
94
+ return outputs
95
+
96
+
97
+
98
+
99
+ MODEL_DICT = {"ViT-L-14": "laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K",
100
+ "ViT-H-14": "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"}
101
+ CHECKPOINT_DICT = {"ViT-L-14": "models--laion--CLIP-ViT-L-14-DataComp.XL-s13B-b90K/snapshots/84c9828e63dc9a9351d1fe637c346d4c1c4db341/pytorch_model.bin",
102
+ "ViT-H-14": "models--laion--CLIP-ViT-H-14-laion2B-s32B-b79K/snapshots/94a64189c3535c1cb44acfcccd7b0908c1c8eb23/pytorch_model.bin"}
103
+ parser = argparse.ArgumentParser()
104
+ args = parser.parse_args()
105
+ args.pretrained = False
106
+ args.model = MODEL_DICT["ViT-L-14"]
107
+ args.cache_dir = './cache_dir'  # local cache directory for tokenizer/config downloads
108
+ args.video_decode_backend = 'decord'
109
+ # args.device = 'cpu'
110
+ args.device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
111
+ device = torch.device(args.device)
112
+ args.precision = None
113
+ args.init_temp = 0
114
+ args.force_patch_dropout = 0.0
115
+ args.add_time_attn = False
116
+ args.convert_to_lora = True
117
+ args.lora_r = 2
118
+ args.lora_alpha = 16
119
+ args.lora_dropout = 0.0 # 0.1?
120
+ args.num_frames = 8
121
+ args.clip_type = 'vl'
122
+ args.num_mel_bins = 1008
123
+ args.target_length = 112
124
+ args.audio_sample_rate = 16000
125
+ args.audio_mean = 4.5689974
126
+ args.audio_std = -4.2677393
127
+ args.max_depth = 10
128
+ args.image_size = 224
129
+ args.rank = 0
130
+ SET_GLOBAL_VALUE('PATCH_DROPOUT', args.force_patch_dropout)
131
+ SET_GLOBAL_VALUE('NUM_FRAMES', args.num_frames)
132
+ args.clip_type = ['il', 'vl', 'al', 'dl', 'tl']
133
+ model = LanguageBind(args).to(device)
134
+ model.eval()
135
+
136
+ modality_transform = {
137
+ 'language': get_tokenizer(HF_HUB_PREFIX + args.model, cache_dir=args.cache_dir),
138
+ 'video': get_video_transform(args),
139
+ 'audio': get_audio_transform(args),
140
+ 'depth': get_depth_transform(args),
141
+ 'thermal': get_thermal_transform(args),
142
+ 'image': get_image_transform(args),
143
+ }
144
+
145
+
146
+ def stack_dict(x, device):
147
+ out_dict = {}
148
+ keys = list(x[0].keys())
149
+ for key in keys:
150
+ out_dict[key] = torch.stack([i[key] for i in x]).to(device)
151
+ return out_dict
152
+
153
+ def image_to_language(image, language):
154
+ inputs = {}
155
+ inputs['image'] = stack_dict([load_and_transform_image(image, modality_transform['image'])], device)
156
+ inputs['language'] = stack_dict([load_and_transform_text(language, modality_transform['language'])], device)
157
+ with torch.no_grad():
158
+ embeddings = model(inputs)
159
+ return (embeddings['image'] @ embeddings['language'].T).item()
160
+
161
+ def video_to_language(video, language):
162
+ inputs = {}
163
+ inputs['video'] = stack_dict([load_and_transform_video(video, modality_transform['video'])], device)
164
+ inputs['language'] = stack_dict([load_and_transform_text(language, modality_transform['language'])], device)
165
+ with torch.no_grad():
166
+ embeddings = model(inputs)
167
+ return (embeddings['video'] @ embeddings['language'].T).item()
168
+
169
+ def audio_to_language(audio, language):
170
+ inputs = {}
171
+ inputs['audio'] = stack_dict([load_and_transform_audio(audio, modality_transform['audio'])], device)
172
+ inputs['language'] = stack_dict([load_and_transform_text(language, modality_transform['language'])], device)
173
+ with torch.no_grad():
174
+ embeddings = model(inputs)
175
+ return (embeddings['audio'] @ embeddings['language'].T).item()
176
+
177
+ def depth_to_language(depth, language):
178
+ inputs = {}
179
+ inputs['depth'] = stack_dict([load_and_transform_depth(depth, modality_transform['depth'])], device)
180
+ inputs['language'] = stack_dict([load_and_transform_text(language, modality_transform['language'])], device)
181
+ with torch.no_grad():
182
+ embeddings = model(inputs)
183
+ return (embeddings['depth'] @ embeddings['language'].T).item()
184
+
185
+ def thermal_to_language(thermal, language):
186
+ inputs = {}
187
+ inputs['thermal'] = stack_dict([load_and_transform_thermal(thermal, modality_transform['thermal'])], device)
188
+ inputs['language'] = stack_dict([load_and_transform_text(language, modality_transform['language'])], device)
189
+ with torch.no_grad():
190
+ embeddings = model(inputs)
191
+ return (embeddings['thermal'] @ embeddings['language'].T).item()
192
+
193
+ code_highlight_css = (
194
+ """
195
+ #chatbot .hll { background-color: #ffffcc }
196
+ #chatbot .c { color: #408080; font-style: italic }
197
+ #chatbot .err { border: 1px solid #FF0000 }
198
+ #chatbot .k { color: #008000; font-weight: bold }
199
+ #chatbot .o { color: #666666 }
200
+ #chatbot .ch { color: #408080; font-style: italic }
201
+ #chatbot .cm { color: #408080; font-style: italic }
202
+ #chatbot .cp { color: #BC7A00 }
203
+ #chatbot .cpf { color: #408080; font-style: italic }
204
+ #chatbot .c1 { color: #408080; font-style: italic }
205
+ #chatbot .cs { color: #408080; font-style: italic }
206
+ #chatbot .gd { color: #A00000 }
207
+ #chatbot .ge { font-style: italic }
208
+ #chatbot .gr { color: #FF0000 }
209
+ #chatbot .gh { color: #000080; font-weight: bold }
210
+ #chatbot .gi { color: #00A000 }
211
+ #chatbot .go { color: #888888 }
212
+ #chatbot .gp { color: #000080; font-weight: bold }
213
+ #chatbot .gs { font-weight: bold }
214
+ #chatbot .gu { color: #800080; font-weight: bold }
215
+ #chatbot .gt { color: #0044DD }
216
+ #chatbot .kc { color: #008000; font-weight: bold }
217
+ #chatbot .kd { color: #008000; font-weight: bold }
218
+ #chatbot .kn { color: #008000; font-weight: bold }
219
+ #chatbot .kp { color: #008000 }
220
+ #chatbot .kr { color: #008000; font-weight: bold }
221
+ #chatbot .kt { color: #B00040 }
222
+ #chatbot .m { color: #666666 }
223
+ #chatbot .s { color: #BA2121 }
224
+ #chatbot .na { color: #7D9029 }
225
+ #chatbot .nb { color: #008000 }
226
+ #chatbot .nc { color: #0000FF; font-weight: bold }
227
+ #chatbot .no { color: #880000 }
228
+ #chatbot .nd { color: #AA22FF }
229
+ #chatbot .ni { color: #999999; font-weight: bold }
230
+ #chatbot .ne { color: #D2413A; font-weight: bold }
231
+ #chatbot .nf { color: #0000FF }
232
+ #chatbot .nl { color: #A0A000 }
233
+ #chatbot .nn { color: #0000FF; font-weight: bold }
234
+ #chatbot .nt { color: #008000; font-weight: bold }
235
+ #chatbot .nv { color: #19177C }
236
+ #chatbot .ow { color: #AA22FF; font-weight: bold }
237
+ #chatbot .w { color: #bbbbbb }
238
+ #chatbot .mb { color: #666666 }
239
+ #chatbot .mf { color: #666666 }
240
+ #chatbot .mh { color: #666666 }
241
+ #chatbot .mi { color: #666666 }
242
+ #chatbot .mo { color: #666666 }
243
+ #chatbot .sa { color: #BA2121 }
244
+ #chatbot .sb { color: #BA2121 }
245
+ #chatbot .sc { color: #BA2121 }
246
+ #chatbot .dl { color: #BA2121 }
247
+ #chatbot .sd { color: #BA2121; font-style: italic }
248
+ #chatbot .s2 { color: #BA2121 }
249
+ #chatbot .se { color: #BB6622; font-weight: bold }
250
+ #chatbot .sh { color: #BA2121 }
251
+ #chatbot .si { color: #BB6688; font-weight: bold }
252
+ #chatbot .sx { color: #008000 }
253
+ #chatbot .sr { color: #BB6688 }
254
+ #chatbot .s1 { color: #BA2121 }
255
+ #chatbot .ss { color: #19177C }
256
+ #chatbot .bp { color: #008000 }
257
+ #chatbot .fm { color: #0000FF }
258
+ #chatbot .vc { color: #19177C }
259
+ #chatbot .vg { color: #19177C }
260
+ #chatbot .vi { color: #19177C }
261
+ #chatbot .vm { color: #19177C }
262
+ #chatbot .il { color: #666666 }
263
+ """)
264
+ #.highlight { background: #f8f8f8; }
265
+
266
+ title_markdown = ("""
267
+ <h1 align="center"><a href="https://github.com/PKU-YuanGroup/LanguageBind"><img src="https://z1.ax1x.com/2023/10/04/pPOBSL6.png" alt="LanguageBind🚀" border="0" style="margin: 0 auto; height: 200px;" /></a> </h1>
268
+
269
+ <h2 align="center"> LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment </h2>
270
+
271
+ <h5 align="center"> If you like our project, please give us a star ✨ on Github for latest update. </h2>
272
+
273
+ <div align="center">
274
+ <div style="display:flex; gap: 0.25rem;" align="center">
275
+ <a href='https://github.com/PKU-YuanGroup/LanguageBind'><img src='https://img.shields.io/badge/Github-Code-blue'></a>
276
+ <a href="https://arxiv.org/pdf/2310.01852.pdf"><img src="https://img.shields.io/badge/Arxiv-2310.01852-red"></a>
277
+ <a href='https://github.com/PKU-YuanGroup/LanguageBind/stargazers'><img src='https://img.shields.io/github/stars/PKU-YuanGroup/LanguageBind.svg?style=social'></a>
278
+ </div>
279
+ </div>
280
+ """)
281
+ css = code_highlight_css + """
282
+ pre {
283
+ white-space: pre-wrap; /* Since CSS 2.1 */
284
+ white-space: -moz-pre-wrap; /* Mozilla, since 1999 */
285
+ white-space: -pre-wrap; /* Opera 4-6 */
286
+ white-space: -o-pre-wrap; /* Opera 7 */
287
+ word-wrap: break-word; /* Internet Explorer 5.5+ */
288
+ }
289
+ """
290
+
291
+ with gr.Blocks(title="LanguageBind🚀", css=css) as demo:
292
+ gr.Markdown(title_markdown)
293
+ with gr.Row():
294
+ with gr.Column():
295
+ image = gr.Image(type="filepath", height=224, width=224, label='Image Input')
296
+ language_i = gr.Textbox(lines=2, label='Text Input')
297
+ out_i = gr.Textbox(label='Similarity of Image to Text')
298
+ b_i = gr.Button("Calculate similarity of Image to Text")
299
+ with gr.Column():
300
+ video = gr.Video(type="filepath", height=224, width=224, label='Video Input')
301
+ language_v = gr.Textbox(lines=2, label='Text Input')
302
+ out_v = gr.Textbox(label='Similarity of Video to Text')
303
+ b_v = gr.Button("Calculate similarity of Video to Text")
304
+ with gr.Column():
305
+ audio = gr.Audio(type="filepath", label='Audio Input')
306
+ language_a = gr.Textbox(lines=2, label='Text Input')
307
+ out_a = gr.Textbox(label='Similarity of Audio to Text')
308
+ b_a = gr.Button("Calculate similarity of Audio to Text")
309
+ with gr.Row():
310
+ with gr.Column():
311
+ depth = gr.Image(type="filepath", height=224, width=224, label='Depth Input: a 16-bit .png with values in 0-10000 (depth in metres multiplied by 1000, i.e. 0-10 m)')
312
+ language_d = gr.Textbox(lines=2, label='Text Input')
313
+ out_d = gr.Textbox(label='Similarity of Depth to Text')
314
+ b_d = gr.Button("Calculate similarity of Depth to Text")
315
+ with gr.Column():
316
+ thermal = gr.Image(type="filepath", height=224, width=224, label='Thermal Input')
317
+ language_t = gr.Textbox(lines=2, label='Text Input')
318
+ out_t = gr.Textbox(label='Similarity of Thermal to Text')
319
+ b_t = gr.Button("Calculate similarity of Thermal to Text")
320
+
321
+ b_i.click(image_to_language, inputs=[image, language_i], outputs=out_i)
322
+ b_a.click(audio_to_language, inputs=[audio, language_a], outputs=out_a)
323
+ b_v.click(video_to_language, inputs=[video, language_v], outputs=out_v)
324
+ b_d.click(depth_to_language, inputs=[depth, language_d], outputs=out_d)
325
+ b_t.click(thermal_to_language, inputs=[thermal, language_t], outputs=out_t)
326
+
327
+ demo.launch()
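Outside the Gradio UI, the same helpers can be called directly; because both embeddings are L2-normalised (and the logit scale is commented out), each function returns a plain cosine similarity. A hypothetical ranking of several candidate captions against one audio clip (the file path is a placeholder):

```python
import torch

candidate_texts = ['a dog barking', 'rain falling on a roof', 'an orchestra playing']
scores = torch.tensor([audio_to_language('example.wav', t) for t in candidate_texts])  # placeholder clip
probs = scores.softmax(dim=0)  # relative match of each caption to the clip
for text, p in zip(candidate_texts, probs):
    print(f'{text}: {p:.3f}')
```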
assets/languagebind.jpg ADDED
assets/logo.png ADDED
assets/res1.jpg ADDED
assets/res2.jpg ADDED
d_cls/__pycache__/precision.cpython-38.pyc ADDED
Binary file (582 Bytes). View file
 
d_cls/__pycache__/zero_shot.cpython-38.pyc ADDED
Binary file (2.81 kB). View file
 
d_cls/__pycache__/zero_shot_classifier.cpython-38.pyc ADDED
Binary file (4.25 kB). View file
 
d_cls/__pycache__/zero_shot_metadata.cpython-38.pyc ADDED
Binary file (10.9 kB). View file
 
d_cls/__pycache__/zeroshot_cls.cpython-38.pyc ADDED
Binary file (1.44 kB). View file
 
d_cls/cp_zero_shot_metadata.py ADDED
@@ -0,0 +1,117 @@
1
+ import os
2
+
3
+ import pandas as pd
4
+
5
+ OPENAI_IMAGENET_TEMPLATES = (
6
+ lambda c: f'a bad photo of a {c}.',
7
+ lambda c: f'a photo of many {c}.',
8
+ lambda c: f'a sculpture of a {c}.',
9
+ lambda c: f'a photo of the hard to see {c}.',
10
+ lambda c: f'a low resolution photo of the {c}.',
11
+ lambda c: f'a rendering of a {c}.',
12
+ lambda c: f'graffiti of a {c}.',
13
+ lambda c: f'a bad photo of the {c}.',
14
+ lambda c: f'a cropped photo of the {c}.',
15
+ lambda c: f'a tattoo of a {c}.',
16
+ lambda c: f'the embroidered {c}.',
17
+ lambda c: f'a photo of a hard to see {c}.',
18
+ lambda c: f'a bright photo of a {c}.',
19
+ lambda c: f'a photo of a clean {c}.',
20
+ lambda c: f'a photo of a dirty {c}.',
21
+ lambda c: f'a dark photo of the {c}.',
22
+ lambda c: f'a drawing of a {c}.',
23
+ lambda c: f'a photo of my {c}.',
24
+ lambda c: f'the plastic {c}.',
25
+ lambda c: f'a photo of the cool {c}.',
26
+ lambda c: f'a close-up photo of a {c}.',
27
+ lambda c: f'a black and white photo of the {c}.',
28
+ lambda c: f'a painting of the {c}.',
29
+ lambda c: f'a painting of a {c}.',
30
+ lambda c: f'a pixelated photo of the {c}.',
31
+ lambda c: f'a sculpture of the {c}.',
32
+ lambda c: f'a bright photo of the {c}.',
33
+ lambda c: f'a cropped photo of a {c}.',
34
+ lambda c: f'a plastic {c}.',
35
+ lambda c: f'a photo of the dirty {c}.',
36
+ lambda c: f'a jpeg corrupted photo of a {c}.',
37
+ lambda c: f'a blurry photo of the {c}.',
38
+ lambda c: f'a photo of the {c}.',
39
+ lambda c: f'a good photo of the {c}.',
40
+ lambda c: f'a rendering of the {c}.',
41
+ lambda c: f'a {c} in a video game.',
42
+ lambda c: f'a photo of one {c}.',
43
+ lambda c: f'a doodle of a {c}.',
44
+ lambda c: f'a close-up photo of the {c}.',
45
+ lambda c: f'a photo of a {c}.',
46
+ lambda c: f'the origami {c}.',
47
+ lambda c: f'the {c} in a video game.',
48
+ lambda c: f'a sketch of a {c}.',
49
+ lambda c: f'a doodle of the {c}.',
50
+ lambda c: f'a origami {c}.',
51
+ lambda c: f'a low resolution photo of a {c}.',
52
+ lambda c: f'the toy {c}.',
53
+ lambda c: f'a rendition of the {c}.',
54
+ lambda c: f'a photo of the clean {c}.',
55
+ lambda c: f'a photo of a large {c}.',
56
+ lambda c: f'a rendition of a {c}.',
57
+ lambda c: f'a photo of a nice {c}.',
58
+ lambda c: f'a photo of a weird {c}.',
59
+ lambda c: f'a blurry photo of a {c}.',
60
+ lambda c: f'a cartoon {c}.',
61
+ lambda c: f'art of a {c}.',
62
+ lambda c: f'a sketch of the {c}.',
63
+ lambda c: f'a embroidered {c}.',
64
+ lambda c: f'a pixelated photo of a {c}.',
65
+ lambda c: f'itap of the {c}.',
66
+ lambda c: f'a jpeg corrupted photo of the {c}.',
67
+ lambda c: f'a good photo of a {c}.',
68
+ lambda c: f'a plushie {c}.',
69
+ lambda c: f'a photo of the nice {c}.',
70
+ lambda c: f'a photo of the small {c}.',
71
+ lambda c: f'a photo of the weird {c}.',
72
+ lambda c: f'the cartoon {c}.',
73
+ lambda c: f'art of the {c}.',
74
+ lambda c: f'a drawing of the {c}.',
75
+ lambda c: f'a photo of the large {c}.',
76
+ lambda c: f'a black and white photo of a {c}.',
77
+ lambda c: f'the plushie {c}.',
78
+ lambda c: f'a dark photo of a {c}.',
79
+ lambda c: f'itap of a {c}.',
80
+ lambda c: f'graffiti of the {c}.',
81
+ lambda c: f'a toy {c}.',
82
+ lambda c: f'itap of my {c}.',
83
+ lambda c: f'a photo of a cool {c}.',
84
+ lambda c: f'a photo of a small {c}.',
85
+ lambda c: f'a tattoo of the {c}.',
86
+ )
87
+
88
+
89
+ # a much smaller subset of above prompts
90
+ # from https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb
91
+ SIMPLE_IMAGENET_TEMPLATES = (
92
+ lambda c: f'itap of a {c}.',
93
+ lambda c: f'a bad photo of the {c}.',
94
+ lambda c: f'a origami {c}.',
95
+ lambda c: f'a photo of the large {c}.',
96
+ lambda c: f'a {c} in a video game.',
97
+ lambda c: f'art of the {c}.',
98
+ lambda c: f'a photo of the small {c}.',
99
+ )
100
+
101
+
102
+ IMAGENET_CLASSNAMES = (
103
+
104
+ )
105
+
106
+
107
+ CLASSNAMES = {
108
+ 'NYUV2': (
109
+ "bathroom", "bedroom", "bookstore", "classroom", "dining room",
110
+ "home office", "kitchen", "living room", "office", "others"
111
+ ),
112
+ 'SUNRGBD': (
113
+ "bathroom", "bedroom", "classroom", "computer room", "conference room", "corridor", "dining area",
114
+ "dining room", "discussion area", "furniture store", "home office", "kitchen", "lab", "lecture theatre",
115
+ "library", "living room", "office", "rest space", "study space"
116
+ ),
117
+ }
d_cls/datasets.py ADDED
@@ -0,0 +1,20 @@
1
+ import cv2
2
+ import torch
3
+
4
+ from data.build_datasets import DataInfo
5
+ from data.process_depth import get_depth_transform, opencv_loader
6
+ from torchvision import datasets
7
+
8
+ def get_depth_dataset(args):
9
+ data_path = args.depth_data_path
10
+ transform = get_depth_transform(args)
11
+ dataset = datasets.ImageFolder(data_path, transform=transform, loader=opencv_loader)
12
+
13
+ dataloader = torch.utils.data.DataLoader(
14
+ dataset,
15
+ batch_size=args.batch_size,
16
+ num_workers=args.workers,
17
+ sampler=None,
18
+ )
19
+
20
+ return DataInfo(dataloader=dataloader, sampler=None)
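`get_depth_dataset` is a thin wrapper around `torchvision.datasets.ImageFolder`, so `args.depth_data_path` must contain one sub-directory per class with the 16-bit depth PNGs inside; labels follow the sorted folder names. A small check that the folder names line up with the zero-shot class list (the root path is a placeholder):

```python
import os

root = '/path/to/nyu_depth_v2_val'  # placeholder for args.depth_data_path
folders = sorted(d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d)))
print(folders)  # should match the strings in CLASSNAMES['NYUV2'] so label indices agree
```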
d_cls/precision.py ADDED
@@ -0,0 +1,12 @@
1
+ import torch
2
+ from contextlib import suppress
3
+
4
+
5
+ def get_autocast(precision):
6
+ if precision == 'amp':
7
+ return torch.cuda.amp.autocast
8
+ elif precision == 'amp_bfloat16' or precision == 'amp_bf16':
9
+ # amp_bfloat16 is more stable than amp float16 for clip training
10
+ return lambda: torch.cuda.amp.autocast(dtype=torch.bfloat16)
11
+ else:
12
+ return suppress
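`get_autocast` returns a context-manager factory, so callers can always write `with autocast():`; for precision values other than the `amp*` strings it falls back to `contextlib.suppress`, which is a no-op context. A quick sketch, assuming the function above is in scope:

```python
import torch

autocast = get_autocast('amp') if torch.cuda.is_available() else get_autocast(None)
with autocast():  # CUDA autocast when available, otherwise a no-op
    y = torch.ones(2, 2) @ torch.ones(2, 2)
print(y.dtype)
```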
d_cls/zero_shot.py ADDED
@@ -0,0 +1,90 @@
1
+ import logging
2
+
3
+ import torch
4
+ import torch.nn.functional as F
5
+ from tqdm import tqdm
6
+
7
+ from open_clip import get_input_dtype, get_tokenizer
8
+ from open_clip.factory import HF_HUB_PREFIX
9
+ from .precision import get_autocast
10
+ from .zero_shot_classifier import build_zero_shot_classifier
11
+ from .zero_shot_metadata import CLASSNAMES, OPENAI_IMAGENET_TEMPLATES
12
+
13
+
14
+ def accuracy(output, target, topk=(1,)):
15
+ pred = output.topk(max(topk), 1, True, True)[1].t()
16
+ correct = pred.eq(target.view(1, -1).expand_as(pred))
17
+ return [float(correct[:k].reshape(-1).float().sum(0, keepdim=True).cpu().numpy()) for k in topk]
18
+
19
+
20
+ def run(model, classifier, dataloader, args):
21
+ autocast = get_autocast(args.precision)
22
+ input_dtype = get_input_dtype(args.precision)
23
+
24
+ with torch.no_grad():
25
+ top1, top5, n = 0., 0., 0.
26
+ for images, target in tqdm(dataloader, unit_scale=args.batch_size):
27
+ images = images.to(device=args.device, dtype=input_dtype)
28
+ images = images.unsqueeze(2)
29
+ target = target.to(args.device)
30
+
31
+ with autocast():
32
+ # predict
33
+ output = model(image=images)
34
+ image_features = output['image_features'] if isinstance(output, dict) else output[0]
35
+ logits = 100. * image_features @ classifier
36
+
37
+ # measure accuracy
38
+ acc1, acc5 = accuracy(logits, target, topk=(1, 5))
39
+ top1 += acc1
40
+ top5 += acc5
41
+ n += images.size(0)
42
+
43
+ top1 = (top1 / n)
44
+ top5 = (top5 / n)
45
+ return top1, top5
46
+
47
+
48
+ def zero_shot_eval(model, data, epoch, args):
49
+ temp_val_d_cls_data = args.val_d_cls_data
50
+ args.val_d_cls_data = list(data.keys())
51
+ assert len(args.val_d_cls_data) == 1
52
+ args.val_d_cls_data = args.val_d_cls_data[0]
53
+
54
+ if args.val_d_cls_data not in data:
55
+ return {}
56
+ if args.zeroshot_frequency == 0:
57
+ return {}
58
+ if (epoch % args.zeroshot_frequency) != 0 and epoch != args.epochs:
59
+ return {}
60
+ if args.distributed and not args.horovod:
61
+ model = model.module
62
+
63
+ logging.info(f'Starting zero-shot {args.val_d_cls_data.upper()}.')
64
+
65
+ logging.info('Building zero-shot classifier')
66
+ autocast = get_autocast(args.precision)
67
+ with autocast():
68
+ tokenizer = get_tokenizer(HF_HUB_PREFIX+args.model, cache_dir=args.cache_dir)
69
+ # tokenizer = get_tokenizer("ViT-L-14")
70
+ classifier = build_zero_shot_classifier(
71
+ model,
72
+ tokenizer=tokenizer,
73
+ classnames=CLASSNAMES[args.val_d_cls_data],
74
+ templates=OPENAI_IMAGENET_TEMPLATES,
75
+ num_classes_per_batch=10,
76
+ device=args.device,
77
+ use_tqdm=True,
78
+ )
79
+
80
+ logging.info('Using classifier')
81
+ results = {}
82
+ if args.val_d_cls_data in data:
83
+ top1, top5 = run(model, classifier, data[args.val_d_cls_data].dataloader, args)
84
+ results[f'{args.val_d_cls_data}-zeroshot-val-top1'] = top1
85
+ results[f'{args.val_d_cls_data}-zeroshot-val-top5'] = top5
86
+
87
+ logging.info(f'Finished zero-shot {args.val_d_cls_data.upper()}.')
88
+
89
+ args.val_d_cls_data = temp_val_d_cls_data
90
+ return results
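Note that `accuracy` returns raw correct counts rather than rates, which is why `run` divides the accumulated totals by `n` at the end. A tiny self-contained check, assuming the helper above is in scope:

```python
import torch

logits = torch.tensor([[0.10, 0.70, 0.20],    # sample 0: true class 1 is ranked first
                       [0.80, 0.05, 0.15]])   # sample 1: true class 2 is only in the top-2
target = torch.tensor([1, 2])

top1, top2 = accuracy(logits, target, topk=(1, 2))
print(top1, top2)  # 1.0, 2.0 -> divide by n=2 to get 50% top-1 and 100% top-2
```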
d_cls/zero_shot_classifier.py ADDED
@@ -0,0 +1,111 @@
1
+ from functools import partial
2
+ from itertools import islice
3
+ from typing import Callable, List, Optional, Sequence, Union
4
+
5
+ import torch
6
+ import torch.nn.functional as F
7
+
8
+
9
+ def batched(iterable, n):
10
+ """Batch data into lists of length *n*. The last batch may be shorter.
11
+ NOTE based on more-itertools impl, to be replaced by python 3.12 itertools.batched impl
12
+ """
13
+ it = iter(iterable)
14
+ while True:
15
+ batch = list(islice(it, n))
16
+ if not batch:
17
+ break
18
+ yield batch
19
+
20
+
21
+ def build_zero_shot_classifier(
22
+ model,
23
+ tokenizer,
24
+ classnames: Sequence[str],
25
+ templates: Sequence[Union[Callable, str]],
26
+ num_classes_per_batch: Optional[int] = 10,
27
+ device: Union[str, torch.device] = 'cpu',
28
+ use_tqdm: bool = False,
29
+ ):
30
+ """ Build zero-shot classifier weights by iterating over class names in batches
31
+ Args:
32
+ model: CLIP model instance
33
+ tokenizer: CLIP tokenizer instance
34
+ classnames: A sequence of class (label) names
35
+ templates: A sequence of callables or format() friendly strings to produce templates per class name
36
+ num_classes_per_batch: The number of classes to batch together in each forward, all if None
37
+ device: Device to use.
38
+ use_tqdm: Enable TQDM progress bar.
39
+ """
40
+ assert isinstance(templates, Sequence) and len(templates) > 0
41
+ assert isinstance(classnames, Sequence) and len(classnames) > 0
42
+ use_format = isinstance(templates[0], str)
43
+ num_templates = len(templates)
44
+ num_classes = len(classnames)
45
+ if use_tqdm:
46
+ import tqdm
47
+ num_iter = 1 if num_classes_per_batch is None else ((num_classes - 1) // num_classes_per_batch + 1)
48
+ iter_wrap = partial(tqdm.tqdm, total=num_iter, unit_scale=num_classes_per_batch)
49
+ else:
50
+ iter_wrap = iter
51
+
52
+ def _process_batch(batch_classnames):
53
+ num_batch_classes = len(batch_classnames)
54
+ texts = [template.format(c) if use_format else template(c) for c in batch_classnames for template in templates]
55
+ input_ids, attention_mask = tokenizer(texts)
56
+ input_ids, attention_mask = input_ids.to(device), attention_mask.to(device)
57
+ class_embeddings = F.normalize(model.encode_text(input_ids, attention_mask), dim=-1)
58
+ class_embeddings = class_embeddings.reshape(num_batch_classes, num_templates, -1).mean(dim=1)
59
+ class_embeddings = class_embeddings / class_embeddings.norm(dim=1, keepdim=True)
60
+ class_embeddings = class_embeddings.T
61
+ return class_embeddings
62
+
63
+ with torch.no_grad():
64
+ if num_classes_per_batch:
65
+ batched_embeds = [_process_batch(batch) for batch in iter_wrap(batched(classnames, num_classes_per_batch))]
66
+ zeroshot_weights = torch.cat(batched_embeds, dim=1)
67
+ else:
68
+ zeroshot_weights = _process_batch(classnames)
69
+ return zeroshot_weights
70
+
71
+
72
+ def build_zero_shot_classifier_legacy(
73
+ model,
74
+ tokenizer,
75
+ classnames: Sequence[str],
76
+ templates: Sequence[Union[Callable, str]],
77
+ device: Union[str, torch.device] = 'cpu',
78
+ use_tqdm: bool = False,
79
+ ):
80
+ """ Build zero-shot classifier weights by iterating over class names 1 by 1
81
+ Args:
82
+ model: CLIP model instance
83
+ tokenizer: CLIP tokenizer instance
84
+ classnames: A sequence of class (label) names
85
+ templates: A sequence of callables or format() friendly strings to produce templates per class name
86
+ device: Device to use.
87
+ use_tqdm: Enable TQDM progress bar.
88
+ """
89
+ assert isinstance(templates, Sequence) and len(templates) > 0
90
+ assert isinstance(classnames, Sequence) and len(classnames) > 0
91
+ if use_tqdm:
92
+ import tqdm
93
+ iter_wrap = tqdm.tqdm
94
+ else:
95
+ iter_wrap = iter
96
+
97
+ use_format = isinstance(templates[0], str)
98
+
99
+ with torch.no_grad():
100
+ zeroshot_weights = []
101
+ for classname in iter_wrap(classnames):
102
+ texts = [template.format(classname) if use_format else template(classname) for template in templates]
103
+ texts = tokenizer(texts).to(device) # tokenize
104
+ class_embeddings = model.encode_text(texts)
105
+ class_embedding = F.normalize(class_embeddings, dim=-1).mean(dim=0)
106
+ class_embedding /= class_embedding.norm()
107
+ zeroshot_weights.append(class_embedding)
108
+ zeroshot_weights = torch.stack(zeroshot_weights, dim=1).to(device)
109
+
110
+ return zeroshot_weights
111
+
d_cls/zero_shot_metadata.py ADDED
@@ -0,0 +1,117 @@
1
+ import os
2
+
3
+ import pandas as pd
4
+
5
+ OPENAI_IMAGENET_TEMPLATES = (
6
+ lambda c: f'a bad depth photo of a {c}.',
7
+ lambda c: f'a depth photo of many {c}.',
8
+ lambda c: f'a sculpture of a {c}.',
9
+ lambda c: f'a depth photo of the hard to see {c}.',
10
+ lambda c: f'a low resolution depth photo of the {c}.',
11
+ lambda c: f'a rendering of a {c}.',
12
+ lambda c: f'graffiti of a {c}.',
13
+ lambda c: f'a bad depth photo of the {c}.',
14
+ lambda c: f'a cropped depth photo of the {c}.',
15
+ lambda c: f'a tattoo of a {c}.',
16
+ lambda c: f'the embroidered {c}.',
17
+ lambda c: f'a depth photo of a hard to see {c}.',
18
+ lambda c: f'a bright depth photo of a {c}.',
19
+ lambda c: f'a depth photo of a clean {c}.',
20
+ lambda c: f'a depth photo of a dirty {c}.',
21
+ lambda c: f'a dark depth photo of the {c}.',
22
+ lambda c: f'a drawing of a {c}.',
23
+ lambda c: f'a depth photo of my {c}.',
24
+ lambda c: f'the plastic {c}.',
25
+ lambda c: f'a depth photo of the cool {c}.',
26
+ lambda c: f'a close-up depth photo of a {c}.',
27
+ lambda c: f'a black and white depth photo of the {c}.',
28
+ lambda c: f'a painting of the {c}.',
29
+ lambda c: f'a painting of a {c}.',
30
+ lambda c: f'a pixelated depth photo of the {c}.',
31
+ lambda c: f'a sculpture of the {c}.',
32
+ lambda c: f'a bright depth photo of the {c}.',
33
+ lambda c: f'a cropped depth photo of a {c}.',
34
+ lambda c: f'a plastic {c}.',
35
+ lambda c: f'a depth photo of the dirty {c}.',
36
+ lambda c: f'a jpeg corrupted depth photo of a {c}.',
37
+ lambda c: f'a blurry depth photo of the {c}.',
38
+ lambda c: f'a depth photo of the {c}.',
39
+ lambda c: f'a good depth photo of the {c}.',
40
+ lambda c: f'a rendering of the {c}.',
41
+ lambda c: f'a {c} in a video game.',
42
+ lambda c: f'a depth photo of one {c}.',
43
+ lambda c: f'a doodle of a {c}.',
44
+ lambda c: f'a close-up depth photo of the {c}.',
45
+ lambda c: f'a depth photo of a {c}.',
46
+ lambda c: f'the origami {c}.',
47
+ lambda c: f'the {c} in a video game.',
48
+ lambda c: f'a sketch of a {c}.',
49
+ lambda c: f'a doodle of the {c}.',
50
+ lambda c: f'a origami {c}.',
51
+ lambda c: f'a low resolution depth photo of a {c}.',
52
+ lambda c: f'the toy {c}.',
53
+ lambda c: f'a rendition of the {c}.',
54
+ lambda c: f'a depth photo of the clean {c}.',
55
+ lambda c: f'a depth photo of a large {c}.',
56
+ lambda c: f'a rendition of a {c}.',
57
+ lambda c: f'a depth photo of a nice {c}.',
58
+ lambda c: f'a depth photo of a weird {c}.',
59
+ lambda c: f'a blurry depth photo of a {c}.',
60
+ lambda c: f'a cartoon {c}.',
61
+ lambda c: f'art of a {c}.',
62
+ lambda c: f'a sketch of the {c}.',
63
+ lambda c: f'a embroidered {c}.',
64
+ lambda c: f'a pixelated depth photo of a {c}.',
65
+ lambda c: f'itap of the {c}.',
66
+ lambda c: f'a jpeg corrupted depth photo of the {c}.',
67
+ lambda c: f'a good depth photo of a {c}.',
68
+ lambda c: f'a plushie {c}.',
69
+ lambda c: f'a depth photo of the nice {c}.',
70
+ lambda c: f'a depth photo of the small {c}.',
71
+ lambda c: f'a depth photo of the weird {c}.',
72
+ lambda c: f'the cartoon {c}.',
73
+ lambda c: f'art of the {c}.',
74
+ lambda c: f'a drawing of the {c}.',
75
+ lambda c: f'a depth photo of the large {c}.',
76
+ lambda c: f'a black and white depth photo of a {c}.',
77
+ lambda c: f'the plushie {c}.',
78
+ lambda c: f'a dark depth photo of a {c}.',
79
+ lambda c: f'itap of a {c}.',
80
+ lambda c: f'graffiti of the {c}.',
81
+ lambda c: f'a toy {c}.',
82
+ lambda c: f'itap of my {c}.',
83
+ lambda c: f'a depth photo of a cool {c}.',
84
+ lambda c: f'a depth photo of a small {c}.',
85
+ lambda c: f'a tattoo of the {c}.',
86
+ )
87
+
88
+
89
+ # a much smaller subset of above prompts
90
+ # from https://github.com/openai/CLIP/blob/main/notebooks/Prompt_Engineering_for_ImageNet.ipynb
91
+ SIMPLE_IMAGENET_TEMPLATES = (
92
+ lambda c: f'itap of a {c}.',
93
+ lambda c: f'a bad depth photo of the {c}.',
94
+ lambda c: f'a origami {c}.',
95
+ lambda c: f'a depth photo of the large {c}.',
96
+ lambda c: f'a {c} in a video game.',
97
+ lambda c: f'art of the {c}.',
98
+ lambda c: f'a depth photo of the small {c}.',
99
+ )
100
+
101
+
102
+ IMAGENET_CLASSNAMES = (
103
+
104
+ )
105
+
106
+
107
+ CLASSNAMES = {
108
+ 'NYUV2': (
109
+ "bathroom", "bedroom", "bookstore", "classroom", "dining room",
110
+ "home office", "kitchen", "living room", "office", "others"
111
+ ),
112
+ 'SUNRGBD': (
113
+ "bathroom", "bedroom", "classroom", "computer room", "conference room", "corridor", "dining area",
114
+ "dining room", "discussion area", "furniture store", "home office", "kitchen", "lab", "lecture theatre",
115
+ "library", "living room", "office", "rest space", "study space"
116
+ ),
117
+ }
d_cls/zeroshot_cls.py ADDED
@@ -0,0 +1,47 @@
1
+
2
+ import json
3
+ import logging
4
+ import os
5
+ from training.distributed import is_master
6
+ from .zero_shot import zero_shot_eval
7
+
8
+ try:
9
+ import wandb
10
+ except ImportError:
11
+ wandb = None
12
+
13
+
14
+
15
+ def evaluate_d_cls(model, data, epoch, args, tb_writer=None):
16
+ metrics = {}
17
+ if not is_master(args):
18
+ return metrics
19
+ model.eval()
20
+
21
+ zero_shot_metrics = zero_shot_eval(model, data, epoch, args)
22
+ metrics.update(zero_shot_metrics)
23
+
24
+ if not metrics:
25
+ return metrics
26
+
27
+ logging.info(
28
+ f"Eval Epoch: {epoch} "
29
+ + "\t".join([f"{k}: {round(v, 4):.4f}" for k, v in metrics.items()])
30
+ )
31
+
32
+ if args.save_logs:
33
+ for name, val in metrics.items():
34
+ if tb_writer is not None:
35
+ tb_writer.add_scalar(f"val/d_cls/{args.val_d_cls_data[0].lower()}/{name}", val, epoch)
36
+ args.d_cls_output_dir = os.path.join(args.log_base_path, f'd_cls/{args.val_d_cls_data[0].lower()}')
37
+ os.makedirs(args.d_cls_output_dir, exist_ok=True)
38
+ with open(os.path.join(args.d_cls_output_dir, "results.jsonl"), "a+") as f:
39
+ f.write(json.dumps(metrics))
40
+ f.write("\n")
41
+
42
+ if args.wandb:
43
+ assert wandb is not None, 'Please install wandb.'
44
+ for name, val in metrics.items():
45
+ wandb.log({f"val/{name}": val, 'epoch': epoch})
46
+
47
+ return metrics
data/__pycache__/base_datasets.cpython-38.pyc ADDED
Binary file (5.5 kB). View file
 
data/__pycache__/build_datasets.cpython-38.pyc ADDED
Binary file (5.33 kB). View file
 
data/__pycache__/new_loadvat.cpython-38.pyc ADDED
Binary file (13.7 kB). View file
 
data/__pycache__/process_audio.cpython-38.pyc ADDED
Binary file (3.7 kB). View file
 
data/__pycache__/process_depth.cpython-38.pyc ADDED
Binary file (1.89 kB). View file
 
data/__pycache__/process_image.cpython-38.pyc ADDED
Binary file (813 Bytes). View file
 
data/__pycache__/process_text.cpython-38.pyc ADDED
Binary file (7.77 kB). View file
 
data/__pycache__/process_thermal.cpython-38.pyc ADDED
Binary file (914 Bytes). View file
 
data/__pycache__/process_video.cpython-38.pyc ADDED
Binary file (4.2 kB). View file
 
data/base_datasets.py ADDED
@@ -0,0 +1,159 @@
1
+ import contextlib
2
+ import io
3
+ import json
4
+ import logging
5
+ import os.path
6
+ import random
7
+ import re
8
+ import time
9
+
10
+ import pandas as pd
11
+
12
+ from open_clip import get_tokenizer
13
+ from open_clip.factory import HF_HUB_PREFIX
14
+ from .process_video import load_and_transform_video, get_video_transform
15
+ from .process_audio import load_and_transform_audio, get_audio_transform
16
+ from .process_text import load_and_transform_text
17
+ from .process_depth import load_and_transform_depth, get_depth_transform
18
+ from .process_thermal import load_and_transform_thermal, get_thermal_transform
19
+
20
+ import argparse
21
+ from os.path import join as opj
22
+ from torch.utils.data import Dataset, DataLoader
23
+ from tqdm import tqdm
24
+
25
+
26
+
27
+
28
+ class VAT_dataset(Dataset):
29
+ def __init__(self, args):
30
+ super().__init__()
31
+ self.video_decode_backend = args.video_decode_backend
32
+ self.num_frames = args.num_frames
33
+ self.text_type = args.text_type
34
+ self.chatgpt = self.text_type == 'polish_mplug'
35
+ self.title = self.text_type == 'raw'
36
+ self.data_root = '/apdcephfs_cq3/share_1311970/A_Youtube'
37
+ with open(args.train_data, 'r') as f:
38
+ self.id2title_folder_caps = json.load(f)
39
+ self.ids = list(self.id2title_folder_caps.keys())[:args.train_num_samples]
40
+
41
+ self.clip_type = args.clip_type
42
+
43
+ self.num_mel_bins = args.num_mel_bins
44
+ self.target_length = args.target_length
45
+ self.audio_sample_rate = args.audio_sample_rate
46
+ self.audio_mean = args.audio_mean
47
+ self.audio_std = args.audio_std
48
+
49
+ # self.audio_error_file = open('./audio_error_id.txt', 'w')
50
+
51
+ self.tokenizer = get_tokenizer(HF_HUB_PREFIX + args.model, cache_dir=args.cache_dir)
52
+ self.video_transform = get_video_transform(args)
53
+ self.audio_transform = get_audio_transform(args)
54
+ self.depth_transform = get_depth_transform(args)
55
+ self.thermal_transform = get_thermal_transform(args)
56
+
57
+ def __len__(self):
58
+ return len(self.ids)
59
+ # return self.id2title_folder_caps.shape[0]
60
+
61
+ def __getitem__(self, idx):
62
+ id = self.ids[idx]
63
+ folder = self.id2title_folder_caps[id]['folder']
64
+ try:
65
+ text_output = self.get_text(id)
66
+ input_ids, attention_mask = text_output['input_ids'], text_output['attention_mask']
67
+ if self.clip_type == 'vl':
68
+ matched_modality = self.get_video(id, folder)
69
+ elif self.clip_type == 'al':
70
+ matched_modality = self.get_audio(id, folder)
71
+ elif self.clip_type == 'dl':
72
+ matched_modality = self.get_depth(id, folder)
73
+ elif self.clip_type == 'tl':
74
+ matched_modality = self.get_thermal(id, folder)
75
+ return matched_modality['pixel_values'], input_ids, attention_mask
76
+ except Exception as error_msg:
77
+ logging.info(f"Failed at {id} with \"{error_msg}\"")
78
+ return self.__getitem__(random.randint(0, self.__len__()-1))
79
+
80
+
81
+ def get_video(self, id, folder):
82
+ video_path = opj(self.data_root, folder, f'{id}.mp4')
83
+ video = load_and_transform_video(video_path, self.video_transform,
84
+ video_decode_backend=self.video_decode_backend, num_frames=self.num_frames)
85
+ return video
86
+
87
+ def get_audio(self, id, folder):
88
+ '''
89
+ audio_path = opj(self.data_root, folder, f'{id}.mp3')
90
+ if os.path.exists(audio_path):
91
+ pass
92
+ else:
93
+ audio_path = audio_path[:-4] + '.m4a'
94
+ if os.path.exists(audio_path):
95
+ pass
96
+ else:
97
+ audio_path = audio_path[:-4] + '.wav'
98
+ if not os.path.exists(audio_path):
99
+ # self.audio_error_file.write(audio_path[:-4] + '\n')
100
+ raise FileNotFoundError(f'Audio file not found at \'{audio_path[:-4]}\' with any of .mp3, .m4a, .wav')
101
+ # AudioSegment.from_file(audio_path).export(audio_path[:-4] + '.mp3', format='mp3')
102
+ # audio_path = opj(self.data_root, folder, f'{id}.mp3')
103
+ audio = load_and_transform_audio(audio_path, self.audio_transform)
104
+ '''
105
+
106
+ audio_path = opj(self.data_root, folder+'_ffmpeg_mp3', f'{id}.mp3')
107
+ audio = load_and_transform_audio(audio_path, self.audio_transform)
108
+
109
+
110
+ return audio
111
+
112
+ def get_text(self, id):
113
+ text = self.id2title_folder_caps[id][self.text_type]
114
+ text_output = load_and_transform_text(text, self.tokenizer, title=self.title)
115
+ return text_output
116
+
117
+ def get_depth(self, id, folder):
118
+ depth_folder = opj(self.data_root, folder, f'{id}_depth_f8glpn_folder')
119
+ # random_id = random.randint(0, 7)
120
+ random_id = 3
121
+ depth_path = os.path.join(depth_folder, f'{random_id}.png')
122
+ depth = load_and_transform_depth(depth_path, self.depth_transform)
123
+ return depth
124
+
125
+ def get_thermal(self, id, folder):
126
+ thermal_folder = opj(self.data_root, folder, f'{id}_thermal_f8_folder')
127
+ # random_id = random.randint(0, 7)
128
+ random_id = 3
129
+ thermal_path = os.path.join(thermal_folder, f'{random_id}.jpg')
130
+ thermal = load_and_transform_thermal(thermal_path, self.thermal_transform)
131
+ return thermal
132
+
133
+
134
+
135
+
136
+
137
+ if __name__ == '__main__':
138
+ parser = argparse.ArgumentParser('Pre-training', add_help=False)
139
+ parser.add_argument('--num_frames', default=8, type=int, help='')
140
+ parser.add_argument('--workers', default=10, type=int, help='')
141
+ args = parser.parse_args()
142
+
143
+ args.cache_dir = r'D:\Omni-modal-hf'
144
+ args.num_frames = 8
145
+ args.clip_type = 'vl'
146
+ args.num_mel_bins = 128
147
+ args.target_length = 1024
148
+ args.audio_sample_rate = 16000
149
+ args.audio_mean = 1
150
+ args.audio_std = 1
151
+ args.rank = 0
152
+ args.batch_size = 16
153
+
154
+ train_dataset = VAT_dataset(args)
155
+ load = DataLoader(train_dataset, batch_size=args.batch_size, num_workers=args.workers)
156
+
157
+ for samples in tqdm((load)):
158
+ matched_modality, input_ids, attention_mask = samples
159
+ # print(video.shape, text.shape)
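`VAT_dataset.__getitem__` retries a random index whenever a sample fails to decode, so a single corrupt file does not abort a training epoch. A stripped-down sketch of that retry pattern (the `RetryOnErrorDataset` class is illustrative, not part of the repo):

```python
import logging
import random
from torch.utils.data import Dataset

class RetryOnErrorDataset(Dataset):
    """Falls back to a random other index when loading a sample fails."""
    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        try:
            item = self.items[idx]
            if item is None:  # stand-in for a corrupt or missing file
                raise ValueError('corrupt sample')
            return item
        except Exception as error_msg:
            logging.info(f'Failed at {idx} with "{error_msg}"')
            return self.__getitem__(random.randint(0, len(self) - 1))

ds = RetryOnErrorDataset([0, None, 2, 3])
print([ds[i] for i in range(len(ds))])  # index 1 resolves to another valid sample
```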
data/bpe_simple_vocab_16e6.txt.gz ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:924691ac288e54409236115652ad4aa250f48203de50a9e4722a6ecd48d6804a
3
+ size 1356917
data/build_datasets.py ADDED
@@ -0,0 +1,174 @@
1
+ import os
2
+ import time
3
+ from dataclasses import dataclass
4
+ from multiprocessing import Value
5
+
6
+ import torch
7
+ from torch.utils.data import DataLoader
8
+ from torch.utils.data.distributed import DistributedSampler
9
+
10
+ from data.base_datasets import VAT_dataset
11
+ from data.new_loadvat import get_wds_dataset
12
+ from open_clip import get_tokenizer
13
+ from open_clip.factory import HF_HUB_PREFIX
14
+
15
+
16
+ class SharedEpoch:
17
+ def __init__(self, epoch: int = 0):
18
+ self.shared_epoch = Value('i', epoch)
19
+
20
+ def set_value(self, epoch):
21
+ self.shared_epoch.value = epoch
22
+
23
+ def get_value(self):
24
+ return self.shared_epoch.value
25
+
26
+ @dataclass
27
+ class DataInfo:
28
+ dataloader: DataLoader
29
+ sampler: DistributedSampler = None
30
+ shared_epoch: SharedEpoch = None
31
+
32
+ def set_epoch(self, epoch):
33
+ if self.shared_epoch is not None:
34
+ self.shared_epoch.set_value(epoch)
35
+ if self.sampler is not None and isinstance(self.sampler, DistributedSampler):
36
+ self.sampler.set_epoch(epoch)
37
+
38
+ def get_VAT_dataset(args):
39
+ dataset = VAT_dataset(args)
40
+ num_samples = len(dataset)
41
+ sampler = DistributedSampler(dataset) if args.distributed else None
42
+ shuffle = sampler is None
43
+
44
+ dataloader = DataLoader(
45
+ dataset,
46
+ batch_size=args.batch_size,
47
+ # prefetch_factor=2,
48
+ # persistent_workers=True,
49
+ shuffle=shuffle,
50
+ num_workers=args.workers,
51
+ pin_memory=True,
52
+ sampler=sampler,
53
+ drop_last=True,
54
+ )
55
+ dataloader.num_samples = num_samples
56
+ dataloader.num_batches = len(dataloader)
57
+
58
+ return DataInfo(dataloader, sampler)
59
+
60
+ def get_data(args, epoch=0):
61
+ data = {}
62
+
63
+ if args.do_train:
64
+ if args.train_data.endswith(".json"):
65
+ data[f"{args.clip_type}_pt"] = get_VAT_dataset(args)
66
+ elif args.train_data.endswith(".tar"):
67
+ data[f"{args.clip_type}_pt"] = get_wds_dataset(args, is_train=True, epoch=epoch)
68
+ else:
69
+ raise NameError
70
+
71
+ if args.do_eval:
72
+ temp_batch_size = args.batch_size
73
+ args.batch_size = 8 if args.val_vl_ret_data else 16
74
+ data_root = "/apdcephfs_cq3/share_1311970/downstream_datasets/VideoTextRetrieval/vtRetdata"
75
+ if args.val_vl_ret_data:
76
+ data["vl_ret"] = []
77
+ for val_vl_ret_data in args.val_vl_ret_data:
78
+ if val_vl_ret_data == "msrvtt":
79
+ args.train_csv = os.path.join(f'{data_root}/MSRVTT/MSRVTT_train.9k.csv')
80
+ args.val_csv = os.path.join(f'{data_root}/MSRVTT/MSRVTT_JSFUSION_test.csv')
81
+ args.data_path = os.path.join(f'{data_root}/MSRVTT/MSRVTT_data.json')
82
+ args.features_path = os.path.join(f'{data_root}/MSRVTT/MSRVTT_Videos')
83
+ elif val_vl_ret_data == "msvd":
84
+ args.data_path = os.path.join(f'{data_root}/MSVD')
85
+ args.features_path = os.path.join(f'{data_root}/MSVD/MSVD_Videos')
86
+ elif val_vl_ret_data == "activity":
87
+ args.data_path = os.path.join(f'{data_root}/ActivityNet')
88
+ args.features_path = os.path.join(f'{data_root}/ActivityNet/Videos/Activity_Videos')
89
+ elif val_vl_ret_data == "didemo":
90
+ args.data_path = os.path.join(f'{data_root}/Didemo')
91
+ args.features_path = os.path.join(f'{data_root}/Didemo/videos')
92
+ else:
93
+ raise NameError
94
+
95
+ args.batch_size_val = args.batch_size if args.batch_size_val == 0 else args.batch_size_val
96
+ args.max_frames = args.num_frames
97
+ args.num_thread_reader = args.workers
98
+ args.slice_framepos = 2 # "0: cut from head frames; 1: cut from tail frames; 2: extract frames uniformly."
99
+
100
+ from vl_ret.data_dataloaders import DATALOADER_DICT
101
+
102
+ tokenizer = get_tokenizer(HF_HUB_PREFIX + args.model, cache_dir=args.cache_dir)
103
+ test_dataloader, test_length = None, 0
104
+ if DATALOADER_DICT[val_vl_ret_data]["test"] is not None:
105
+ test_dataloader, test_length = DATALOADER_DICT[val_vl_ret_data]["test"](args, tokenizer)
106
+
107
+ if DATALOADER_DICT[val_vl_ret_data]["val"] is not None:
108
+ val_dataloader, val_length = DATALOADER_DICT[val_vl_ret_data]["val"](args, tokenizer, subset="val")
109
+ else:
110
+ val_dataloader, val_length = test_dataloader, test_length
111
+ ## report validation results if the ["test"] split is None
112
+ if test_dataloader is None:
113
+ test_dataloader, test_length = val_dataloader, val_length
114
+
115
+ data["vl_ret"].append({val_vl_ret_data: test_dataloader})
116
+
117
+ if args.val_v_cls_data:
118
+ from v_cls import get_video_cls_dataloader
119
+ args.data_set = args.val_v_cls_data
120
+ args.num_workers = args.workers
121
+ args.num_sample = 1 # no repeat
122
+ data["v_cls"] = get_video_cls_dataloader(args)
123
+
124
+
125
+ if args.val_a_cls_data:
126
+ data["a_cls"] = []
127
+ data_root = "/apdcephfs_cq3/share_1311970/downstream_datasets/Audio"
128
+ temp_val_a_cls_data = args.val_a_cls_data
129
+ for val_a_cls_data in temp_val_a_cls_data:
130
+ from a_cls.datasets import get_audio_dataset
131
+ args.val_a_cls_data = val_a_cls_data
132
+ args.audio_data_path = os.path.join(data_root, f'{val_a_cls_data.lower()}/test')
133
+ data['a_cls'].append({val_a_cls_data: get_audio_dataset(args)})
134
+ args.val_a_cls_data = temp_val_a_cls_data
135
+
136
+ if args.imagenet_val is not None:
137
+ from i_cls.datasets import get_imagenet
138
+ data['i_cls'] = {}
139
+ data['i_cls']["imagenet-val"] = get_imagenet(args, "val")
140
+ if args.imagenet_v2 is not None:
141
+ from i_cls.datasets import get_imagenet
142
+ if data.get('i_cls', None) is None:
143
+ data['i_cls'] = {}
144
+ data['i_cls']["imagenet-v2"] = get_imagenet(args, "v2")
145
+
146
+ if args.val_d_cls_data:
147
+ data["d_cls"] = []
148
+ data_root = "/apdcephfs_cq3/share_1311970/downstream_datasets/Depth"
149
+ temp_val_d_cls_data = args.val_d_cls_data
150
+ for val_d_cls_data in temp_val_d_cls_data:
151
+ from d_cls.datasets import get_depth_dataset
152
+ args.val_d_cls_data = val_d_cls_data
153
+ args.depth_data_path = os.path.join(data_root, f'{val_d_cls_data.lower()}/data/val')
154
+ data['d_cls'].append({val_d_cls_data: get_depth_dataset(args)})
155
+ args.val_d_cls_data = temp_val_d_cls_data
156
+
157
+
158
+ if args.val_t_cls_data:
159
+ data["t_cls"] = []
160
+ data_root = "/apdcephfs_cq3/share_1311970/downstream_datasets/Thermal"
161
+ temp_val_t_cls_data = args.val_t_cls_data
162
+ for val_t_cls_data in temp_val_t_cls_data:
163
+ from t_cls.datasets import get_thermal_dataset
164
+ args.val_t_cls_data = val_t_cls_data
165
+ args.thermal_data_path = os.path.join(data_root, f'{val_t_cls_data.lower()}/val')
166
+ data['t_cls'].append({val_t_cls_data: get_thermal_dataset(args)})
167
+ args.val_t_cls_data = temp_val_t_cls_data
168
+
169
+ args.batch_size = temp_batch_size
170
+
171
+ return data
172
+
173
+
174
+
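`get_data` dispatches on the training index's extension: a `.json` id-to-annotation file goes through `VAT_dataset`, while `.tar` shards go through the webdataset loader, and the evaluation loaders are then attached per modality flag. A minimal sketch of just the training dispatch (stub strings stand in for the real loader objects):

```python
def route_train_loader(train_data: str, clip_type: str) -> str:
    # mirrors the branch at the top of get_data(); returns which loader is used
    if train_data.endswith('.json'):
        return f'{clip_type}_pt -> get_VAT_dataset (json index)'
    elif train_data.endswith('.tar'):
        return f'{clip_type}_pt -> get_wds_dataset (webdataset shards)'
    raise NameError(train_data)

print(route_train_loader('annotations.json', 'vl'))
print(route_train_loader('shards-{00000..03020}.tar', 'al'))
```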
data/new_loadvat.py ADDED
@@ -0,0 +1,498 @@
1
+ import ast
2
+ import io
3
+ import json
4
+ import logging
5
+ import math
6
+ import os
7
+ import random
8
+ import sys
9
+ import braceexpand
10
+ from dataclasses import dataclass
11
+ from multiprocessing import Value
12
+
13
+ import numpy.lib.format
14
+ import numpy as np
15
+ import pandas as pd
16
+ import torch
17
+ import torchvision.datasets as datasets
18
+ import webdataset as wds
19
+ from PIL import Image
20
+ from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler, IterableDataset, get_worker_info
21
+ from torch.utils.data.distributed import DistributedSampler
22
+ from torchvision.transforms import ToTensor
23
+ from tqdm import tqdm
24
+ from webdataset.filters import _shuffle
25
+ from webdataset.tariterators import base_plus_ext, url_opener, tar_file_expander, valid_sample
26
+
27
+ from open_clip import get_tokenizer
28
+ from open_clip.factory import HF_HUB_PREFIX
29
+ from training.params import parse_args
30
+ from data.process_text import load_and_transform_text
31
+ from data.process_video import get_video_transform
32
+ from data.process_audio import get_audio_transform
33
+ from data.process_depth import get_depth_transform
34
+ from data.process_thermal import get_thermal_transform
35
+ import pdb
36
+ try:
37
+ import horovod.torch as hvd
38
+ except ImportError:
39
+ hvd = None
40
+
41
+
42
+
43
+ class SharedEpoch:
44
+ def __init__(self, epoch: int = 0):
45
+ self.shared_epoch = Value('i', epoch)
46
+
47
+ def set_value(self, epoch):
48
+ self.shared_epoch.value = epoch
49
+
50
+ def get_value(self):
51
+ return self.shared_epoch.value
52
+
53
+
54
+ @dataclass
55
+ class DataInfo:
56
+ dataloader: DataLoader
57
+ sampler: DistributedSampler = None
58
+ shared_epoch: SharedEpoch = None
59
+
60
+ def set_epoch(self, epoch):
61
+ if self.shared_epoch is not None:
62
+ self.shared_epoch.set_value(epoch)
63
+ if self.sampler is not None and isinstance(self.sampler, DistributedSampler):
64
+ self.sampler.set_epoch(epoch)
65
+
66
+
67
+ def expand_urls(urls, weights=None):
68
+ if weights is None:
69
+ expanded_urls = wds.shardlists.expand_urls(urls)
70
+ return expanded_urls, None
71
+ if isinstance(urls, str):
72
+ urllist = urls.split("::")
73
+ weights = weights.split('::')
74
+ assert len(weights) == len(urllist), \
75
+ f"Expected the number of data components ({len(urllist)}) and weights({len(weights)}) to match."
76
+ weights = [float(weight) for weight in weights]
77
+ all_urls, all_weights = [], []
78
+ for url, weight in zip(urllist, weights):
79
+ expanded_url = list(braceexpand.braceexpand(url))
80
+ expanded_weights = [weight for _ in expanded_url]
81
+ all_urls.extend(expanded_url)
82
+ all_weights.extend(expanded_weights)
83
+ return all_urls, all_weights
84
+ else:
85
+ all_urls = list(urls)
86
+ return all_urls, weights
87
+
88
+
89
+ def get_dataset_size(shards):
90
+ shards_list, _ = expand_urls(shards)
91
+ dir_path = os.path.dirname(shards_list[0])
92
+ sizes_filename = os.path.join(dir_path, 'sizes.json')
93
+ len_filename = os.path.join(dir_path, '__len__')
94
+ if os.path.exists(sizes_filename):
95
+ sizes = json.load(open(sizes_filename, 'r'))
96
+ total_size = sum([int(sizes[os.path.basename(shard)]) for shard in shards_list])
97
+ elif os.path.exists(len_filename):
98
+ # FIXME this used to be eval(open(...)) but that seemed rather unsafe
99
+ total_size = ast.literal_eval(open(len_filename, 'r').read())
100
+ else:
101
+ total_size = None # num samples undefined
102
+ # some common dataset sizes (at time of authors last download)
103
+ # CC3M (train): 2905954
104
+ # CC12M: 10968539
105
+ # LAION-400M: 407332084
106
+ # LAION-2B (english): 2170337258
107
+ num_shards = len(shards_list)
108
+ return total_size, num_shards
109
+
110
+
111
+
112
+ def count_samples(dataloader):
113
+ os.environ["WDS_EPOCH"] = "0"
114
+ n_elements, n_batches = 0, 0
115
+ for images, texts in dataloader:
116
+ n_batches += 1
117
+ n_elements += len(images)
118
+ assert len(images) == len(texts)
119
+ return n_elements, n_batches
120
+
121
+
122
+ def filter_no_caption_or_no_image(sample):
123
+ has_caption = ('raw.txt' in sample and 'mplug.txt' in sample and 'polish_mplug.txt' in sample and 'ofa3.txt' in sample)
124
+ has_image = ('frm7.jpg' in sample and 'tml0.jpg' in sample and 'dep0.npy' in sample)
125
+ return has_caption and has_image
126
+
127
+
128
+ def log_and_continue(exn):
129
+ """Call in an exception handler to ignore any exception, issue a warning, and continue."""
130
+ logging.warning(f'Handling webdataset error ({repr(exn)}). Ignoring.')
131
+ return True
132
+
133
+
134
+ def group_by_keys_nothrow(data, keys=base_plus_ext, lcase=True, suffixes=None, handler=None):
135
+ """Return function over iterator that groups key, value pairs into samples.
136
+
137
+ :param keys: function that splits the key into key and extension (base_plus_ext)
138
+ :param lcase: convert suffixes to lower case (Default value = True)
139
+ """
140
+ current_sample = None
141
+ for filesample in data:
142
+ assert isinstance(filesample, dict)
143
+ fname, value = filesample["fname"], filesample["data"]
144
+ prefix, suffix = keys(fname)
145
+ if prefix is None:
146
+ continue
147
+ if lcase:
148
+ suffix = suffix.lower()
149
+ # FIXME webdataset version throws if suffix in current_sample, but we have a potential for
150
+ # this happening in the current LAION400m dataset if a tar ends with same prefix as the next
151
+ # begins, rare, but can happen since prefix aren't unique across tar files in that dataset
152
+ if current_sample is None or prefix != current_sample["__key__"] or suffix in current_sample:
153
+ if valid_sample(current_sample):
154
+ yield current_sample
155
+ current_sample = dict(__key__=prefix, __url__=filesample["__url__"])
156
+ if suffixes is None or suffix in suffixes:
157
+ current_sample[suffix] = value
158
+ if valid_sample(current_sample):
159
+ yield current_sample
160
+
161
+
162
+ def tarfile_to_samples_nothrow(src, handler=log_and_continue):
163
+ # NOTE this is a re-impl of the webdataset impl with group_by_keys that doesn't throw
164
+ streams = url_opener(src, handler=handler)
165
+ files = tar_file_expander(streams, handler=handler)
166
+ samples = group_by_keys_nothrow(files, handler=handler)
167
+ return samples
168
+
169
+
170
+ def pytorch_worker_seed(increment=0):
171
+ """get dataloader worker seed from pytorch"""
172
+ worker_info = get_worker_info()
173
+ if worker_info is not None:
174
+ # favour using the seed already created for pytorch dataloader workers if it exists
175
+ seed = worker_info.seed
176
+ if increment:
177
+ # space out seed increments so they can't overlap across workers in different iterations
178
+ seed += increment * max(1, worker_info.num_workers)
179
+ return seed
180
+ # fallback to wds rank based seed
181
+ return wds.utils.pytorch_worker_seed()
182
+
183
+
184
+ _SHARD_SHUFFLE_SIZE = 200
185
+ _SHARD_SHUFFLE_INITIAL = 50
186
+ _SAMPLE_SHUFFLE_SIZE = 500
187
+ _SAMPLE_SHUFFLE_INITIAL = 100
188
+
189
+
190
+ class detshuffle2(wds.PipelineStage):
191
+ def __init__(
192
+ self,
193
+ bufsize=1000,
194
+ initial=100,
195
+ seed=0,
196
+ epoch=-1,
197
+ ):
198
+ self.bufsize = bufsize
199
+ self.initial = initial
200
+ self.seed = seed
201
+ self.epoch = epoch
202
+
203
+ def run(self, src):
204
+ if isinstance(self.epoch, SharedEpoch):
205
+ epoch = self.epoch.get_value()
206
+ else:
207
+ # NOTE: this epoch tracking is problematic in a multiprocess (dataloader workers or train)
208
+ # situation as different workers may wrap at different times (or not at all).
209
+ self.epoch += 1
210
+ epoch = self.epoch
211
+ rng = random.Random()
212
+ if self.seed < 0:
213
+ # If seed is negative, we use the worker's seed, this will be different across all nodes/workers
214
+ seed = pytorch_worker_seed(epoch)
215
+ else:
216
+ # This seed should be deterministic AND the same across all nodes/workers in each epoch
217
+ seed = self.seed + epoch
218
+ rng.seed(seed)
219
+ return _shuffle(src, self.bufsize, self.initial, rng)
220
+
221
+
222
+ class ResampledShards2(IterableDataset):
223
+ """An iterable dataset yielding a list of urls."""
224
+
225
+ def __init__(
226
+ self,
227
+ urls,
228
+ weights=None,
229
+ nshards=sys.maxsize,
230
+ worker_seed=None,
231
+ deterministic=False,
232
+ epoch=-1,
233
+ ):
234
+ """Sample shards from the shard list with replacement.
235
+
236
+ :param urls: a list of URLs as a Python list or brace notation string
237
+ """
238
+ super().__init__()
239
+ urls, weights = expand_urls(urls, weights)
240
+ self.urls = urls
241
+ self.weights = weights
242
+ if self.weights is not None:
243
+ assert len(self.urls) == len(self.weights), \
244
+ f"Number of urls {len(self.urls)} and weights {len(self.weights)} should match."
245
+ assert isinstance(self.urls[0], str)
246
+ self.nshards = nshards
247
+ self.rng = random.Random()
248
+ self.worker_seed = worker_seed
249
+ self.deterministic = deterministic
250
+ self.epoch = epoch
251
+
252
+ def __iter__(self):
253
+ """Return an iterator over the shards."""
254
+ if isinstance(self.epoch, SharedEpoch):
255
+ epoch = self.epoch.get_value()
256
+ else:
257
+ # NOTE: this epoch tracking is problematic in a multiprocess (dataloader workers or train)
258
+ # situation as different workers may wrap at different times (or not at all).
259
+ self.epoch += 1
260
+ epoch = self.epoch
261
+ if self.deterministic:
262
+ # reset seed w/ epoch if deterministic
263
+ if self.worker_seed is None:
264
+ # pytorch worker seed should be deterministic due to being init by arg.seed + rank + worker id
265
+ seed = pytorch_worker_seed(epoch)
266
+ else:
267
+ seed = self.worker_seed() + epoch
268
+ self.rng.seed(seed)
269
+ for _ in range(self.nshards):
270
+ if self.weights is None:
271
+ yield dict(url=self.rng.choice(self.urls))
272
+ else:
273
+ yield dict(url=self.rng.choices(self.urls, weights=self.weights, k=1)[0])
274
+
275
+
276
+ class Decode:
277
+ def __init__(self, args=None):
278
+ self.num_frames = args.num_frames
279
+ self.text_type = args.text_type
280
+ self.chatgpt = self.text_type == 'polish_mplug'
281
+ self.title = self.text_type == 'raw'
282
+ self.clip_type = args.clip_type
283
+ self.tokenizer = get_tokenizer(HF_HUB_PREFIX + args.model, cache_dir=args.cache_dir)
284
+ self.video_transform = get_video_transform(args)
285
+ self.audio_transform = get_audio_transform(args)
286
+ self.depth_transform = get_depth_transform(args)
287
+ self.thermal_transform = get_thermal_transform(args)
288
+
289
+
290
+ def __call__(self, sample):
291
+ input_ids, attention_mask = self.get_text(sample[f"{self.text_type}.txt"], chatgpt=self.chatgpt, title=self.title)
292
+ if self.clip_type == 'vl':
293
+ matched_modality = self.get_video([sample[f"frm{i}.jpg"] for i in range(self.num_frames)])
294
+ elif self.clip_type == 'al':
295
+ matched_modality = self.get_audio()
296
+ elif self.clip_type == 'dl':
297
+ matched_modality = self.get_depth(sample[f"dep0.npy"])
298
+ elif self.clip_type == 'tl':
299
+ matched_modality = self.get_thermal(sample[f"tml0.jpg"])
300
+ # matched_modality = self.get_thermal(sample[f"tml{random.randint(0, 7)}.jpg"])
301
+ else:
302
+ raise ValueError
303
+ return matched_modality, input_ids, attention_mask
304
+
305
+
306
+ def get_video(self, frames):
307
+ video_data = []
308
+ for frame in frames:
309
+ with io.BytesIO(frame) as stream:
310
+ img = Image.open(stream)
311
+ img.load()
312
+ assert min(img.size) == 256
313
+ result = ToTensor()(img)
314
+ video_data.append(result)
315
+ video_data = torch.stack(video_data, dim=1)
316
+ # video_data torch.Size([3, 8, 455, 256])
317
+ # video_outputs torch.Size([3, 8, 224, 224])
318
+ video_outputs = self.video_transform(video_data)
319
+ return video_outputs
320
+
321
+
322
+ def get_text(self, text, chatgpt=True, title=False):
323
+ text = text.decode("utf-8")
324
+ if chatgpt:
325
+ assert text.startswith('In the video, ')
326
+ text = text[14:]
327
+ tokens = load_and_transform_text(text, self.tokenizer, title=title)
328
+ return tokens['input_ids'], tokens['attention_mask']
329
+
330
+ def get_audio(self):
331
+ raise NotImplementedError
332
+
333
+ def get_depth(self, depth):
334
+ stream = io.BytesIO(depth)
335
+ img = numpy.lib.format.read_array(stream)
336
+ depth = self.depth_transform(img)
337
+ return depth
338
+
339
+ def get_thermal(self, thermal):
340
+ with io.BytesIO(thermal) as stream:
341
+ img = Image.open(stream)
342
+ img.load()
343
+ thermal = self.thermal_transform(img)
344
+ return thermal
345
+
346
+ def get_wds_dataset(args, is_train, epoch=0, floor=False):
347
+ input_shards = args.train_data if is_train else args.val_data
348
+ assert input_shards is not None
349
+ resampled = getattr(args, 'dataset_resampled', False) and is_train
350
+
351
+ num_shards = None
352
+ if is_train:
353
+ if args.train_num_samples is not None:
354
+ num_samples = args.train_num_samples
355
+ else:
356
+ num_samples, num_shards = get_dataset_size(input_shards)
357
+ if not num_samples:
358
+ raise RuntimeError(
359
+ 'Currently, the number of dataset samples must be specified for the training dataset. '
360
+ 'Please specify it via `--train-num-samples` if no dataset length info is present.')
361
+ else:
362
+ # Eval will just exhaust the iterator if the size is not specified.
363
+ num_samples = args.val_num_samples or 0
364
+
365
+ shared_epoch = SharedEpoch(epoch=epoch) # create a shared epoch store to sync epoch to dataloader worker proc
366
+
367
+ if resampled:
368
+ pipeline = [ResampledShards2(
369
+ input_shards,
370
+ weights=args.train_data_upsampling_factors,
371
+ deterministic=True,
372
+ epoch=shared_epoch,
373
+ )]
374
+ else:
375
+ assert args.train_data_upsampling_factors is None, \
376
+ "--train_data_upsampling_factors is only supported when sampling with replacement (with --dataset-resampled)."
377
+ pipeline = [wds.SimpleShardList(input_shards)]
378
+
379
+ # at this point we have an iterator over all the shards
380
+ if is_train:
381
+ if not resampled:
382
+ pipeline.extend([
383
+ detshuffle2(
384
+ bufsize=_SHARD_SHUFFLE_SIZE,
385
+ initial=_SHARD_SHUFFLE_INITIAL,
386
+ seed=args.seed,
387
+ epoch=shared_epoch,
388
+ ),
389
+ wds.split_by_node,
390
+ wds.split_by_worker,
391
+ ])
392
+ pipeline.extend([
393
+ # at this point, we have an iterator over the shards assigned to each worker at each node
394
+ tarfile_to_samples_nothrow, # wds.tarfile_to_samples(handler=log_and_continue),
395
+ wds.shuffle(
396
+ bufsize=_SAMPLE_SHUFFLE_SIZE,
397
+ initial=_SAMPLE_SHUFFLE_INITIAL,
398
+ ),
399
+ ])
400
+ else:
401
+ pipeline.extend([
402
+ wds.split_by_worker,
403
+ # at this point, we have an iterator over the shards assigned to each worker
404
+ wds.tarfile_to_samples(handler=log_and_continue),
405
+ ])
406
+ pipeline.extend([
407
+ wds.select(filter_no_caption_or_no_image),
408
+ # wds.decode("pilrgb", handler=log_and_continue),
409
+ # wds.rename(image="jpg;png;jpeg;webp", text="txt"),
410
+ # wds.map_dict(image=preprocess_img, text=lambda text: tokenizer(text)[0]),
411
+ # wds.to_tuple("image", "text"),
412
+ wds.map(Decode(args), handler=log_and_continue),
413
+ wds.batched(args.batch_size, partial=not is_train)
414
+ ])
415
+
416
+ dataset = wds.DataPipeline(*pipeline)
417
+
418
+ if is_train:
419
+ if not resampled:
420
+ num_shards = num_shards or len(expand_urls(input_shards)[0])
421
+ assert num_shards >= args.workers * args.world_size, 'number of shards must be >= total workers'
422
+ # roll over and repeat a few samples to get same number of full batches on each node
423
+ round_fn = math.floor if floor else math.ceil
424
+ global_batch_size = args.batch_size * args.world_size
425
+ num_batches = round_fn(num_samples / global_batch_size)
426
+ num_workers = max(1, args.workers)
427
+ num_worker_batches = round_fn(num_batches / num_workers) # per dataloader worker
428
+ num_batches = num_worker_batches * num_workers
429
+ num_samples = num_batches * global_batch_size
430
+ dataset = dataset.with_epoch(num_worker_batches) # each worker is iterating over this
431
+ else:
432
+ # last batches are partial, eval is done on single (master) node
433
+ num_batches = math.ceil(num_samples / args.batch_size)
434
+
435
+ dataloader = wds.WebLoader(
436
+ dataset,
437
+ batch_size=None,
438
+ shuffle=False,
439
+ num_workers=args.workers,
440
+ persistent_workers=args.workers > 0,
441
+ )
442
+
443
+ # FIXME not clear which approach is better, with_epoch before vs after dataloader?
444
+ # hoping to resolve via https://github.com/webdataset/webdataset/issues/169
445
+ # if is_train:
446
+ # # roll over and repeat a few samples to get same number of full batches on each node
447
+ # global_batch_size = args.batch_size * args.world_size
448
+ # num_batches = math.ceil(num_samples / global_batch_size)
449
+ # num_workers = max(1, args.workers)
450
+ # num_batches = math.ceil(num_batches / num_workers) * num_workers
451
+ # num_samples = num_batches * global_batch_size
452
+ # dataloader = dataloader.with_epoch(num_batches)
453
+ # else:
454
+ # # last batches are partial, eval is done on single (master) node
455
+ # num_batches = math.ceil(num_samples / args.batch_size)
456
+
457
+ # add meta-data to dataloader instance for convenience
458
+ dataloader.num_batches = num_batches
459
+ dataloader.num_samples = num_samples
460
+
461
+ return DataInfo(dataloader=dataloader, shared_epoch=shared_epoch)
462
+
463
+
464
+
465
+ def get_data(args, epoch=0):
466
+ data = {}
467
+
468
+ data["train"] = get_wds_dataset(args, is_train=True, epoch=epoch)
469
+
470
+ return data
471
+
472
+
473
+ if __name__ == '__main__':
474
+ args = parse_args(sys.argv[1:])
475
+ args.workers = 10
476
+ args.batch_size = 16
477
+ args.world_size = 1
478
+ args.num_frames = 8
479
+ args.clip_type = 'vl'
480
+ args.model = "laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"
481
+ args.train_data = '/apdcephfs_cq3/share_1311970/lb/vat2webdata/check_8frm_title_ofa_polishmplug_1tml_1dep/{00000..03020}.tar'
482
+ args.train_num_samples = 10_000
483
+ args.dataset_type = 'webdataset'
484
+
485
+
486
+
487
+ data = get_data(args, epoch=0)
488
+
489
+ data['train'].set_epoch(0) # set epoch in process safe manner via sampler or shared_epoch
490
+ dataloader = data['train'].dataloader
491
+ num_batches_per_epoch = dataloader.num_batches // args.accum_freq
492
+ print(num_batches_per_epoch)
493
+
494
+
495
+ for i, batch in enumerate(tqdm(dataloader)):
496
+ images, input_ids, attention_mask = batch
497
+ # print(images.shape, input_ids.shape, attention_mask.shape)
498
+ # break
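`expand_urls` accepts `'::'`-separated shard specs with matching `'::'`-separated weights and brace-expands each component before the weight is broadcast over the expanded shard list. A small usage sketch of that expansion (the shard names are made up):

```python
import braceexpand

# two shard groups, weighted 1.0 and 2.0, exactly as expand_urls() would split them
urls = 'data-a-{00000..00002}.tar::data-b-{00000..00001}.tar'
weights = '1.0::2.0'

all_urls, all_weights = [], []
for url, w in zip(urls.split('::'), weights.split('::')):
    expanded = list(braceexpand.braceexpand(url))
    all_urls.extend(expanded)
    all_weights.extend([float(w)] * len(expanded))

print(all_urls)
# ['data-a-00000.tar', 'data-a-00001.tar', 'data-a-00002.tar', 'data-b-00000.tar', 'data-b-00001.tar']
print(all_weights)  # [1.0, 1.0, 1.0, 2.0, 2.0]
```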
data/process_audio.py ADDED
@@ -0,0 +1,131 @@
1
+ import logging
2
+
3
+ import numpy as np
4
+ import torch
5
+ import torchaudio
6
+ from torchvision.transforms import transforms
7
+ from torch.nn import functional as F
8
+
9
+ torchaudio.set_audio_backend("soundfile")
10
+
11
+ def torchaudio_loader(path):
12
+ return torchaudio.load(path)
13
+
14
+ def int16_to_float32_torch(x):
15
+ return (x / 32767.0).type(torch.float32)
16
+
17
+ def float32_to_int16_torch(x):
18
+ x = torch.clamp(x, min=-1., max=1.)
19
+ return (x * 32767.).type(torch.int16)
20
+
21
+ DEFAULT_AUDIO_FRAME_SHIFT_MS = 10
22
+
23
+ class AudioTransform:
24
+ def __init__(self, args):
25
+ self.sample_rate = args.audio_sample_rate
26
+ self.num_mel_bins = args.num_mel_bins
27
+ self.target_length = args.target_length
28
+ self.audio_mean = args.audio_mean
29
+ self.audio_std = args.audio_std
30
+ # mean=-4.2677393
31
+ # std=4.5689974
32
+ self.norm = transforms.Normalize(mean=self.audio_mean, std=self.audio_std)
33
+
34
+ def __call__(self, audio_data_and_origin_sr):
35
+ audio_data, origin_sr = audio_data_and_origin_sr
36
+ if self.sample_rate != origin_sr:
37
+ # print(audio_data.shape, origin_sr)
38
+ audio_data = torchaudio.functional.resample(audio_data, orig_freq=origin_sr, new_freq=self.sample_rate)
39
+ waveform_melspec = self.waveform2melspec(audio_data[0])
40
+ return self.norm(waveform_melspec)
41
+
42
+ def waveform2melspec(self, audio_data):
43
+ max_len = self.target_length * self.sample_rate // 100
44
+ if audio_data.shape[-1] > max_len:
45
+ mel = self.get_mel(audio_data)
46
+ # split to three parts
47
+ chunk_frames = self.target_length
48
+ total_frames = mel.shape[0]
49
+ ranges = np.array_split(list(range(0, total_frames - chunk_frames + 1)), 3)
50
+ # print('total_frames-chunk_frames:', total_frames-chunk_frames,
51
+ # 'len(audio_data):', len(audio_data),
52
+ # 'chunk_frames:', chunk_frames,
53
+ # 'total_frames:', total_frames)
54
+ if len(ranges[1]) == 0: # if the audio is too short, we just use the first chunk
55
+ ranges[1] = [0]
56
+ if len(ranges[2]) == 0: # if the audio is too short, we just use the first chunk
57
+ ranges[2] = [0]
58
+ # randomly choose index for each part
59
+ idx_front = np.random.choice(ranges[0])
60
+ idx_middle = np.random.choice(ranges[1])
61
+ idx_back = np.random.choice(ranges[2])
62
+ # select mel
63
+ mel_chunk_front = mel[idx_front:idx_front + chunk_frames, :]
64
+ mel_chunk_middle = mel[idx_middle:idx_middle + chunk_frames, :]
65
+ mel_chunk_back = mel[idx_back:idx_back + chunk_frames, :]
66
+ # stack
67
+ mel_fusion = torch.stack([mel_chunk_front, mel_chunk_middle, mel_chunk_back], dim=0)
68
+ elif audio_data.shape[-1] < max_len: # padding if too short
69
+ n_repeat = int(max_len / len(audio_data))
70
+ audio_data = audio_data.repeat(n_repeat)
71
+ audio_data = F.pad(
72
+ audio_data,
73
+ (0, max_len - len(audio_data)),
74
+ mode="constant",
75
+ value=0,
76
+ )
77
+ mel = self.get_mel(audio_data)
78
+ mel_fusion = torch.stack([mel, mel, mel], dim=0)
79
+ else: # if equal
80
+ mel = self.get_mel(audio_data)
81
+ mel_fusion = torch.stack([mel, mel, mel], dim=0)
82
+
83
+ # twice check
84
+ p = self.target_length - mel_fusion.shape[1]
85
+
86
+ if abs(p) / self.target_length > 0.2:
87
+ logging.warning(
88
+ "Large gap between audio n_frames(%d) and "
89
+ "target_length (%d). Is the audio_target_length "
90
+ "setting correct?",
91
+ mel_fusion.shape[1],
92
+ self.target_length,
93
+ )
94
+
95
+ # cut and pad
96
+ if p > 0:
97
+ m = torch.nn.ZeroPad2d((0, 0, 0, p))
98
+ mel_fusion = m(mel_fusion)
99
+ elif p < 0:
100
+ mel_fusion = mel_fusion[:, 0: self.target_length, :]
101
+
102
+ mel_fusion = mel_fusion.transpose(1, 2) # [3, target_length, mel_bins] -> [3, mel_bins, target_length]
103
+ return mel_fusion
104
+
105
+ def get_mel(self, audio_data):
106
+ # mel shape: (n_mels, T)
107
+ audio_data -= audio_data.mean()
108
+ mel = torchaudio.compliance.kaldi.fbank(
109
+ audio_data.unsqueeze(0),
110
+ htk_compat=True,
111
+ sample_frequency=self.sample_rate,
112
+ use_energy=False,
113
+ window_type="hanning",
114
+ num_mel_bins=self.num_mel_bins,
115
+ dither=0.0,
116
+ frame_length=25,
117
+ frame_shift=DEFAULT_AUDIO_FRAME_SHIFT_MS,
118
+ )
119
+ return mel # (T, n_mels)
120
+
121
+ def get_audio_transform(args):
122
+ return AudioTransform(args)
123
+
124
+ def load_and_transform_audio(
125
+ audio_path,
126
+ transform,
127
+ ):
128
+ waveform_and_sr = torchaudio_loader(audio_path)
129
+ audio_outputs = transform(waveform_and_sr)
130
+
131
+ return {'pixel_values': audio_outputs}
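`AudioTransform` always emits a fixed `[3, num_mel_bins, target_length]` tensor: clips longer than `max_len` contribute three mel chunks sampled from the front, middle, and back, while shorter clips are tiled, zero-padded, and the same mel is stacked three times. A quick sketch of the length bookkeeping, assuming the defaults used elsewhere in this repo (16 kHz, 1024 target frames, 10 ms frame shift):

```python
sample_rate = 16000   # args.audio_sample_rate
target_length = 1024  # mel frames kept per chunk
# waveform2melspec: raw samples that yield roughly target_length fbank frames
max_len = target_length * sample_rate // 100
print(max_len, 'samples =', max_len / sample_rate, 'seconds')  # 163840 samples = 10.24 seconds
# clips longer than ~10.24 s  -> three different chunks (front / middle / back)
# clips shorter than ~10.24 s -> repeat + zero-pad, one mel repeated three times
# model input shape: [3, num_mel_bins, target_length]
```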
data/process_depth.py ADDED
@@ -0,0 +1,55 @@
1
+ import PIL
2
+ import cv2
3
+ import numpy as np
4
+ import torch
5
+ from PIL import Image
6
+ from torch import nn
7
+ from torchvision import transforms
8
+ from open_clip.constants import OPENAI_DATASET_MEAN, OPENAI_DATASET_STD
9
+
10
+
11
+ def opencv_loader(path):
12
+ return cv2.imread(path, cv2.IMREAD_UNCHANGED).astype('float32')
13
+
14
+
15
+ class DepthNorm(nn.Module):
16
+ def __init__(
17
+ self,
18
+ max_depth=0,
19
+ min_depth=0.01,
20
+ ):
21
+ super().__init__()
22
+ self.max_depth = max_depth
23
+ self.min_depth = min_depth
24
+ self.scale = 1000.0 # nyuv2 abs.depth
25
+
26
+ def forward(self, image):
27
+ # image = np.array(image)
28
+ depth_img = image / self.scale # (H, W) in meters
29
+ depth_img = depth_img.clip(min=self.min_depth)
30
+ if self.max_depth != 0:
31
+ depth_img = depth_img.clip(max=self.max_depth)
32
+ depth_img /= self.max_depth # 0-1
33
+ else:
34
+ depth_img /= depth_img.max()
35
+ depth_img = torch.from_numpy(depth_img).unsqueeze(0).repeat(3, 1, 1) # assume image
36
+ return depth_img.to(torch.get_default_dtype())
37
+
38
+ def get_depth_transform(args):
39
+ transform = transforms.Compose(
40
+ [
41
+ DepthNorm(max_depth=args.max_depth),
42
+ transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
43
+ transforms.CenterCrop(224),
44
+ transforms.Normalize(OPENAI_DATASET_MEAN, OPENAI_DATASET_STD), # assume image
45
+ # transforms.Normalize((0.5, ), (0.5, )) # 0-1 to norm distribution
46
+ # transforms.Normalize((0.0418, ), (0.0295, )) # sun rgb-d imagebind
47
+ # transforms.Normalize((0.02, ), (0.00295, )) # nyuv2
48
+ ]
49
+ )
50
+ return transform
51
+
52
+ def load_and_transform_depth(depth_path, transform):
53
+ depth = opencv_loader(depth_path)
54
+ depth_outputs = transform(depth)
55
+ return {'pixel_values': depth_outputs}
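A minimal usage sketch for the depth pipeline above. The file name is hypothetical and `max_depth=10` is only an illustrative choice (0 would normalise each image by its own maximum, as the `DepthNorm` branch shows):

```python
import argparse
from data.process_depth import get_depth_transform, load_and_transform_depth

args = argparse.Namespace(max_depth=10)  # metres; 0 -> per-image max normalisation
transform = get_depth_transform(args)

# expects a 16-bit depth map in millimetres (divided by scale=1000.0 inside DepthNorm)
sample = load_and_transform_depth('example_depth.png', transform)
print(sample['pixel_values'].shape)  # torch.Size([3, 224, 224])
```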