BridgeCLIP (English) — CLIP fine-tuning for bridge inspection image classification and retrieval
A fine-tuned OpenCLIP ViT-B/32 model trained on bridge inspection images from the Japanese national road facility inspection database (xROAD) paired with English captions (machine-translated from the original Japanese findings). It performs 4-category classification and image⇔text retrieval in a single shared embedding space.
- Code repository: t-seino-ml/BridgeCLIP-english
- Japanese-caption sibling: t-seino-ml/BridgeCLIP-japanese
Model overview
| Item | Value |
|---|---|
| Base model | OpenCLIP ViT-B/32 (laion2b_s34b_b79k) |
| Image encoder | Vision Transformer (ViT-B/32) |
| Text encoder | Transformer (GPT-2 based) |
| Projection dim | 512 |
| Training data | 130,930 pairs (bridge inspection images × English inspection findings) |
| Batch size | 128 |
| Learning rate | 1e-4 (warmup 1000, weight decay 0.1, AMP) |
| Epochs | 10 (best = epoch 5, val_loss = 2.3827) |
| Loss | Contrastive loss (CLIP loss) |
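For reference, the CLIP-style contrastive loss named in the table can be sketched in plain Python. This is a minimal illustration of the standard symmetric cross-entropy over an image-text similarity matrix, not the training code used for this model. The function name `clip_contrastive_loss` and the `logit_scale` default (CLIP's usual initial value, 1/0.07) are assumptions made for the example.

```python
import math


def clip_contrastive_loss(sim, logit_scale=1 / 0.07):
    """Symmetric cross-entropy over a square similarity matrix.

    sim[i][j] is the cosine similarity between image i and text j;
    the pair (i, i) is treated as the only positive for row i.
    """
    n = len(sim)

    def mean_cross_entropy(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [logit_scale * s for s in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
            total += log_sum_exp - logits[i]
        return total / n

    sim_t = [list(col) for col in zip(*sim)]  # text -> image direction
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(sim_t))


# Illustrative values only: a well-aligned batch versus an uninformative one.
aligned = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(clip_contrastive_loss(aligned), clip_contrastive_loss(uniform))
```

When every similarity in the batch is identical, the loss equals the natural log of the batch size, which gives a rough chance-level reference point for a given batch size.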
Usage
Installation
```bash
pip install open_clip_torch torch huggingface_hub
```
Loading the model
```python
from huggingface_hub import hf_hub_download
import open_clip
import torch
from PIL import Image

# Build the base architecture; the fine-tuned weights are loaded below
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Download fine-tuned checkpoint from Hugging Face Hub
ckpt_path = hf_hub_download(
    repo_id="t-seino-ml/BridgeCLIP-english",
    filename="epoch_5.pt",
)
# weights_only=False unpickles arbitrary objects; only load checkpoints you trust
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
state_dict = ckpt.get("state_dict", ckpt)
# Strip the "module." prefix left by distributed training wrappers
state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=False)
model = model.eval()

# Encode image
image = preprocess(Image.open("bridge_image.jpg")).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Encode text
texts = tokenizer([
    "Cracking is observed in the main girder.",
    "Corrosion is observed in steel members.",
])
with torch.no_grad():
    text_features = model.encode_text(texts)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity (both feature sets are L2-normalized)
similarity = image_features @ text_features.T
print(similarity)
```
Caption template
Training captions follow the structure:
<damage> is observed in <location>. The soundness rating is <Ⅰ/Ⅱ/Ⅲ/Ⅳ>. The component-wise measure classification is <A/B/C1/C2/E1/E2/M/S1/S2>.
Notes: the soundness rating uses full-width (single-character Unicode) Roman numerals (Ⅰ/Ⅱ/Ⅲ/Ⅳ) rather than the ASCII letters I and V, inherited from the source Japanese data. When multiple damages or locations are present, they are joined by " and " and the verb becomes plural ("are observed"). When the source record lacks a soundness rating or a measure classification, the corresponding sentence is omitted.
Examples:
- "Cracking is observed in the side wall. The soundness rating is Ⅰ. The component-wise measure classification is B."
- "Corrosion and deterioration of anti-corrosion function are observed in the main girder."
- "Expansion gap defect is observed in the main girder. The component-wise measure classification is S2."
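Query texts tend to work best when they mirror the training captions, so a small helper that assembles text in the same shape can be handy. The sketch below is one possible reading of the template and notes above. `build_caption` is a hypothetical helper, not part of the repository. It assumes that location strings already carry their article, and that the verb becomes plural when either the damages or the locations list has more than one entry.

```python
def build_caption(damages, locations, soundness=None, measure=None):
    """Assemble a caption in the shape of the training template (illustrative)."""
    subject = " and ".join(damages)
    place = " and ".join(locations)
    # Assumption: several damages OR several locations trigger the plural verb.
    verb = "are" if len(damages) > 1 or len(locations) > 1 else "is"
    first = f"{subject} {verb} observed in {place}."
    parts = [first[0].upper() + first[1:]]
    # Sentences for missing fields are omitted, as described in the notes.
    if soundness is not None:
        parts.append(f"The soundness rating is {soundness}.")
    if measure is not None:
        parts.append(f"The component-wise measure classification is {measure}.")
    return " ".join(parts)


print(build_caption(["cracking"], ["the side wall"], soundness="Ⅰ", measure="B"))
print(build_caption(["expansion gap defect"], ["the main girder"], measure="S2"))
```

Note that the soundness value is passed as the single-character numeral (for example `"Ⅰ"`), matching the captions seen during training.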
k-NN classification
See the GitHub repo for the full classification pipeline. A representative command:
```bash
LABEL_LANG=en CUDA_VISIBLE_DEVICES=0 python -m classification.models.clip_finetuned_knn \
    --train_csv classification/results/unified_train_user_en.csv \
    --val_csv classification/results/unified_val_user_en.csv \
    --ckpt_dir ./checkpoints \
    --out classification/results_en/clip_finetuned_knn_preds.csv \
    --k 10
```
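To make the idea behind the command concrete, here is a minimal sketch of single-label k-NN classification over embeddings. It assumes cosine similarity and an unweighted majority vote. The repository's actual pipeline (vote weighting, multi-label handling, tie-breaking) may differ, and `knn_predict` is a name chosen only for this example.

```python
import math
from collections import Counter


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def knn_predict(query, db_embeddings, db_labels, k=10):
    """Return the majority label among the k database items closest to the query."""
    q = l2_normalize(query)
    scored = []
    for emb, label in zip(db_embeddings, db_labels):
        e = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(q, e))
        scored.append((cosine, label))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "embeddings"; real image features from this model have 512 dimensions.
database = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
labels = ["cracking", "cracking", "corrosion", "corrosion", "corrosion"]
print(knn_predict([1.0, 0.05], database, labels, k=3))
```

In practice the database embeddings would come from `model.encode_image` applied to the training images, and the similarity search would be done with batched tensor operations rather than Python loops.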
Performance (val set = 2,679)
Classification — Macro-F1
| Category | CLIP-FT + k-NN (proposed) | CLIP-FT + linear classifier | Best supervised (ViT weighted finetune) |
|---|---|---|---|
| Soundness rating | 0.4339 | 0.4020 | 0.5577 |
| Measure | 0.2533 | 0.2356 | 0.3538 |
| Damage type | 0.5467 | 0.5027 | 0.5394 |
| Damage location | 0.5461 | 0.5175 | 0.4439 |
| Mean | 0.4450 | 0.4144 | 0.4737 |
The proposed CLIP-FT + k-NN ranks first on damage location (0.5461), ahead of all supervised baselines. The linear-classifier variant shows a +25.2% relative gain over the same head on the base CLIP backbone.
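Macro-F1 averages the per-class F1 scores with equal weight, so rare classes count as much as frequent ones. The following sketch shows the single-label case in plain Python. It is an illustration of the metric, not the evaluation script behind the table, and the toy labels are invented for the example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: accuracy is 3/5, but the missed rare class pulls macro-F1 down.
y_true = ["I", "I", "I", "II", "III"]
y_pred = ["I", "I", "II", "II", "I"]
print(macro_f1(y_true, y_pred))
```

Because every class contributes equally, a model that ignores a rare class is penalized even when overall accuracy looks reasonable.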
Retrieval (Recall@k)
| Direction | Base CLIP | CLIP-FT (epoch 5) |
|---|---|---|
| Image→Text R@1 | 0.0007 | 0.0474 |
| Image→Text R@10 | 0.0198 | 0.2706 |
| Text→Image R@1 | 0.0019 | 0.0646 |
| Text→Image R@10 | 0.0123 | 0.2811 |
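Recall@k in the table above can be read as the fraction of queries whose paired item appears among the top k results. The sketch below assumes a square similarity matrix in which query `i` is paired with gallery item `i` and only that item counts as correct. How the authors treat images that share an identical caption is not stated here, so this is an illustration rather than their evaluation code.

```python
def recall_at_k(similarity, k):
    """similarity[i][j]: score of query i against gallery item j; item i is the match."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy image-to-text similarities for three pairs.
sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.5],
    [0.1, 0.7, 0.6],
]
for k in (1, 2, 3):
    print(k, recall_at_k(sim, k))
```

The other retrieval direction uses the transposed matrix, for example `recall_at_k([list(col) for col in zip(*sim)], k)`.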
Attribute-based retrieval (Text → Image, AttrMatch@1)
A hit is counted if the top-ranked gallery image shares at least one label with the query text in the given category.
| Category | Base CLIP | CLIP-FT | Relative gain |
|---|---|---|---|
| Soundness | 0.5266 | 0.6990 | ×1.33 |
| Measure | 0.4345 | 0.6451 | ×1.48 |
| Damage type | 0.2407 | 0.6471 | ×2.69 |
| Damage location | 0.2188 | 0.6113 | ×2.79 |
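The AttrMatch@1 criterion can be sketched as follows: take the top-ranked gallery image for each text query and check whether the two label sets overlap. The function and variable names are hypothetical, and the label sets stand in for one category at a time (for example, damage type).

```python
def attr_match_at_1(similarity, query_labels, gallery_labels):
    """Fraction of queries whose top-1 gallery item shares a label with the query."""
    hits = 0
    for i, row in enumerate(similarity):
        top = max(range(len(row)), key=lambda j: row[j])
        if set(query_labels[i]) & set(gallery_labels[top]):
            hits += 1
    return hits / len(similarity)


# Toy text-to-image similarities: three queries against two gallery images.
sim = [[0.2, 0.9], [0.6, 0.4], [0.1, 0.8]]
query_labels = [{"corrosion"}, {"cracking", "leakage"}, {"cracking"}]
gallery_labels = [{"cracking"}, {"corrosion", "peeling"}]
print(attr_match_at_1(sim, query_labels, gallery_labels))
```

This criterion is looser than exact-pair Recall@1, since any image with an overlapping label counts as a hit.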
Data-scaling result
A sweep over 13 training-subset sizes (10k → 131k pairs) shows monotonic improvement, with gains leveling off around 120k. Selected subsets:
| Subset | Best epoch | val_loss | I2T R@10 | T2I R@10 |
|---|---|---|---|---|
| 10k | 3 | 3.0089 | 0.1389 | 0.1489 |
| 50k | 5 | 2.6969 | 0.1978 | 0.2243 |
| 90k | 5 | 2.5216 | 0.2419 | 0.2639 |
| 120k | 5 | 2.4633 | 0.2542 | 0.2773 |
| Clean (~131k) | 5 | 2.3827 | 0.2706 | 0.2811 |
Training data
Image–text pairs from the Japanese national road facility inspection database (xROAD); inspection findings were translated to English with GPT-4o (14,949 unique sentences, cached translation dictionary).
| Split | Samples |
|---|---|
| Train | 130,930 |
| Validation | 2,679 |
| k-NN database (all 4 categories valid) | 90,987 |
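Because the findings reduce to a limited set of unique sentences, each one only needs to be translated once. The sketch below illustrates the cached-dictionary idea. `translate_fn` is a hypothetical stand-in for whatever call produces the translation (the authors used GPT-4o), and nothing here reflects their actual script.

```python
def translate_with_cache(sentences, translate_fn, cache=None):
    """Translate each sentence, calling translate_fn only for unseen ones."""
    cache = {} if cache is None else cache
    translated = []
    for sentence in sentences:
        if sentence not in cache:
            cache[sentence] = translate_fn(sentence)
        translated.append(cache[sentence])
    return translated, cache


# Stub translator for demonstration; a real one would call a translation model.
calls = []

def fake_translate(sentence):
    calls.append(sentence)
    return sentence.upper()

out, cache = translate_with_cache(["a", "b", "a"], fake_translate)
print(out, len(calls))
```

Persisting the cache (for example as JSON) would let repeated dataset builds reuse the same translations, which also keeps captions consistent across runs.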
Classification categories
| Category | Classes | Type |
|---|---|---|
| Soundness rating | 4 (I / II / III / IV) | Single-label |
| Measure | 9 (A / B / C1 / C2 / E1 / E2 / M / S1 / S2) | Single-label |
| Damage type | 15 | Multi-label |
| Damage location | 20 | Multi-label |
File layout
```
.
├── README.md     # This file
├── epoch_5.pt    # Fine-tuned checkpoint (~1.7 GB)
└── config.json   # Model configuration
```
Citation
Paper in preparation.
License
MIT License. The dataset itself is not redistributed (see the GitHub repo for instructions on rebuilding it from the source database).