---
license: mit
library_name: transformers
pipeline_tag: video-classification
tags:
- video-classification
- zero-shot
- vision
- acaua
datasets:
- kinetics-400
base_model: microsoft/xclip-base-patch32
---

# X-CLIP (base, patch 32) — acaua mirror

MIT-licensed mirror hosted under `CondadosAI/` for use with the [acaua](https://github.com/CondadosAI/acaua) computer vision library.

This is a **safetensors-only mirror** of the upstream Microsoft weights at the pinned commit shown below. The legacy `pytorch_model.bin` (pickle format) that upstream ships alongside `model.safetensors` has been deliberately removed for security hygiene — pickle loads can execute arbitrary code, and `transformers` auto-prefers safetensors when both are present, so removing it has zero functional impact on downstream users.
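The mirror ships only `model.safetensors`, but if you want loading to fail outright rather than ever fall back to a pickle checkpoint, `transformers` accepts an explicit flag; a minimal sketch:

```python
from transformers import XCLIPModel

# use_safetensors=True refuses to deserialize pickle-based *.bin weights.
model = XCLIPModel.from_pretrained("CondadosAI/xclip_base_patch32", use_safetensors=True)
```
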
X-CLIP is a **zero-shot video classification** model: you provide a list of candidate text labels at inference time and the model ranks them by similarity to the video clip. It is not a closed-set softmax classifier, and it does not appear in `AutoModelForVideoClassification`.

## Provenance

| Field | Value |
|---|---|
| Upstream repo | [`microsoft/xclip-base-patch32`](https://huggingface.co/microsoft/xclip-base-patch32) |
| Upstream commit SHA | `a2e27a78a2b5d802e894b8a1ef14f3a8ce490963` |
| Upstream commit date | 2024-02-04 |
| Declared license | MIT |
| Paper | Ni et al., *"Expanding Language-Image Pretrained Models for General Video Recognition"*, ECCV 2022, arXiv:[2208.02816](https://arxiv.org/abs/2208.02816) |
| Official code | [`microsoft/VideoX`](https://github.com/microsoft/VideoX) (MIT) |
| Mirrored on | 2026-04-23 |
| Mirrored by | [CondadosAI/acaua](https://github.com/CondadosAI/acaua) |

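To audit this mirror against upstream, you can fetch the pinned revision with `huggingface_hub` and diff the shared files. A minimal sketch (variable names are illustrative):

```python
from huggingface_hub import snapshot_download

# Download the upstream repo at the exact commit recorded in the table above.
upstream_path = snapshot_download(
    repo_id="microsoft/xclip-base-patch32",
    revision="a2e27a78a2b5d802e894b8a1ef14f3a8ce490963",
)
print(upstream_path)  # compare model.safetensors / config.json against this mirror
```
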
## Usage via acaua

```python
import acaua

model = acaua.Model.from_pretrained(
    "CondadosAI/xclip_base_patch32",
    allow_non_apache=True,  # weights are MIT, not Apache-2.0
)
result = model.predict(
    "dance.mp4",
    labels=["dancing", "cooking", "running", "sleeping", "walking"],
    top_k=3,
)
for label, score in zip(result.labels, result.scores.tolist()):
    print(f"{label}: {score:.3f}")
```

## Usage via 🤗 Transformers

This mirror is drop-in compatible with the upstream repo.

```python
from transformers import XCLIPModel, XCLIPProcessor

processor = XCLIPProcessor.from_pretrained("CondadosAI/xclip_base_patch32")
model = XCLIPModel.from_pretrained("CondadosAI/xclip_base_patch32")
```

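From there, zero-shot scoring follows the standard X-CLIP pattern: encode the candidate labels and the clip together, then softmax the video-to-text logits. A minimal sketch (the random array is only a stand-in for 8 real decoded RGB frames):

```python
import numpy as np
import torch

labels = ["dancing", "cooking", "running", "sleeping", "walking"]

# Stand-in for a decoded clip: 8 RGB frames of shape (height, width, 3).
video = list(np.random.randint(0, 256, size=(8, 224, 224, 3), dtype=np.uint8))

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video has one row per clip and one column per candidate label.
probs = outputs.logits_per_video.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```
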
## Expected input

- **Frames:** 8 uniformly-sampled frames per clip (`vision_config.num_frames=8`); see the sampling sketch after this list.
- **Resolution:** 224 × 224 after resize + center-crop.
- **Normalization:** ImageNet mean/std (handled by `XCLIPProcessor`).
- **Text prompts:** supplied at inference time — any natural-language strings.

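A minimal sketch of the uniform sampling step (the helper name is illustrative; any evenly spaced index scheme works):

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_frames: int = 8) -> np.ndarray:
    """Return `num_frames` indices spread evenly across a clip of `total_frames` frames."""
    return np.linspace(0, total_frames - 1, num=num_frames).round().astype(int)

print(sample_frame_indices(300))  # -> [  0  43  85 128 171 214 256 299]
```
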
## License and attribution

Redistributed under MIT, consistent with the upstream declaration. See [`NOTICE`](./NOTICE) for required attribution.

## Citation

```bibtex
@inproceedings{ni2022expanding,
  title={Expanding language-image pretrained models for general video recognition},
  author={Ni, Bolin and Peng, Houwen and Chen, Minghao and Zhang, Songyang and Meng, Gaofeng and Fu, Jianlong and Xiang, Shiming and Ling, Haibin},
  booktitle={European Conference on Computer Vision (ECCV)},
  pages={1--18},
  year={2022},
  publisher={Springer}
}
```