CondadosAI committed · verified
Commit e5fae3c · 1 Parent(s): eb404b5

docs: acaua mirror model card with upstream provenance

Files changed (1): README.md added (+87 −0)

---
license: mit
library_name: transformers
pipeline_tag: video-classification
tags:
- video-classification
- zero-shot
- vision
- acaua
datasets:
- kinetics-400
base_model: microsoft/xclip-base-patch32
---

# X-CLIP (base, patch 32) — acaua mirror

An MIT-licensed mirror of Microsoft's X-CLIP weights, hosted under `CondadosAI/` for use with the [acaua](https://github.com/CondadosAI/acaua) computer vision library.

This is a **safetensors-only mirror** of the upstream Microsoft weights at the pinned commit shown below. The legacy `pytorch_model.bin` (pickle format) that upstream ships alongside `model.safetensors` has been deliberately removed as security hygiene: pickle loads can execute arbitrary code, and `transformers` automatically prefers safetensors when both are present, so removing the pickle file has no functional impact on downstream users.
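
Loading therefore works exactly as usual. If you additionally want to refuse pickle fallbacks for any repo you load, `from_pretrained` accepts a `use_safetensors` flag; a minimal sketch:

```python
from transformers import XCLIPModel

# Refuse any pickle fallback: with use_safetensors=True, loading fails
# outright if a repo offers only pytorch_model.bin.
model = XCLIPModel.from_pretrained(
    "CondadosAI/xclip_base_patch32",
    use_safetensors=True,
)
```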

X-CLIP is a **zero-shot video classification** model: you provide a list of candidate text labels at inference time and the model ranks them by similarity to the video clip. It is not a closed-set softmax classifier and is not registered under `AutoModelForVideoClassification`.

## Provenance

| Field | Value |
|---|---|
| Upstream repo | [`microsoft/xclip-base-patch32`](https://huggingface.co/microsoft/xclip-base-patch32) |
| Upstream commit SHA | `a2e27a78a2b5d802e894b8a1ef14f3a8ce490963` |
| Upstream commit date | 2024-02-04 |
| Declared license | MIT |
| Paper | Ni et al., *"Expanding Language-Image Pretrained Models for General Video Recognition"*, ECCV 2022, arXiv:[2208.02816](https://arxiv.org/abs/2208.02816) |
| Official code | [`microsoft/VideoX`](https://github.com/microsoft/VideoX) (MIT) |
| Mirrored on | 2026-04-23 |
| Mirrored by | [CondadosAI/acaua](https://github.com/CondadosAI/acaua) |
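
To audit this mirror against upstream, the pinned snapshot can be fetched directly; a minimal sketch using the standard `revision` argument of `from_pretrained`:

```python
from transformers import XCLIPModel

# Fetch the exact upstream snapshot this mirror was taken from,
# pinned to the commit SHA listed in the table above.
upstream = XCLIPModel.from_pretrained(
    "microsoft/xclip-base-patch32",
    revision="a2e27a78a2b5d802e894b8a1ef14f3a8ce490963",
)
```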

## Usage via acaua

```python
import acaua

model = acaua.Model.from_pretrained(
    "CondadosAI/xclip_base_patch32",
    allow_non_apache=True,  # weights are MIT, not Apache-2.0
)
result = model.predict(
    "dance.mp4",
    labels=["dancing", "cooking", "running", "sleeping", "walking"],
    top_k=3,
)
for label, score in zip(result.labels, result.scores.tolist()):
    print(f"{label}: {score:.3f}")
```

## Usage via 🤗 Transformers

This mirror is drop-in compatible with the upstream repo.

```python
from transformers import XCLIPModel, XCLIPProcessor

processor = XCLIPProcessor.from_pretrained("CondadosAI/xclip_base_patch32")
model = XCLIPModel.from_pretrained("CondadosAI/xclip_base_patch32")
```
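
End-to-end zero-shot inference follows the upstream pattern: encode the candidate labels and the clip together, then read per-label similarity scores. A minimal sketch, reusing the `processor` and `model` above and substituting random frames for a real decoded clip:

```python
import numpy as np
import torch

# Stand-in clip: 8 RGB frames as HxWx3 uint8 arrays. In practice,
# decode and sample real frames (see "Expected input" below).
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))
labels = ["dancing", "cooking", "running"]

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video holds one video-text similarity score per label.
probs = outputs.logits_per_video.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```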

## Expected input

- **Frames:** 8 uniformly sampled frames per clip (`vision_config.num_frames=8`); see the sampling sketch after this list.
- **Resolution:** 224 × 224 after resize and center-crop.
- **Normalization:** ImageNet mean/std (handled by `XCLIPProcessor`).
- **Text prompts:** supplied at inference time — any natural-language strings.
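
For reference, a minimal uniform-sampling sketch (OpenCV is just one choice of decoder; decord or PyAV work equally well):

```python
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 8) -> list:
    """Uniformly sample num_frames RGB frames across a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num=num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            raise IOError(f"failed to read frame {idx} of {path}")
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes BGR
    cap.release()
    return frames
```

`XCLIPProcessor` takes it from there: resize, center-crop, and ImageNet normalization.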

## License and attribution

Redistributed under MIT, consistent with the upstream declaration. See [`NOTICE`](./NOTICE) for required attribution.

## Citation

```bibtex
@inproceedings{ni2022expanding,
  title={Expanding language-image pretrained models for general video recognition},
  author={Ni, Bolin and Peng, Houwen and Chen, Minghao and Zhang, Songyang and Meng, Gaofeng and Fu, Jianlong and Xiang, Shiming and Ling, Haibin},
  booktitle={European Conference on Computer Vision (ECCV)},
  pages={1--18},
  year={2022},
  publisher={Springer}
}
```