BridgeCLIP (English) — CLIP fine-tuning for bridge inspection image classification and retrieval

A fine-tuned OpenCLIP ViT-B/32 model trained on bridge inspection images from the Japanese national road facility inspection database (xROAD), paired with English captions machine-translated from the original Japanese findings. It supports classification across four categories (soundness rating, measure, damage type, damage location) and image⇔text retrieval in a single shared embedding space.

Code repository: t-seino-ml/BridgeCLIP-english
Japanese-caption sibling: t-seino-ml/BridgeCLIP-japanese

Model overview

Item	Value
Base model	OpenCLIP ViT-B/32 (laion2b_s34b_b79k)
Image encoder	Vision Transformer (ViT-B/32)
Text encoder	Transformer (GPT-2 based)
Projection dim	512
Training data	130,930 pairs (bridge inspection images × English inspection findings)
Batch size	128
Learning rate	1e-4 (warmup 1000, weight decay 0.1, AMP)
Epochs	10 (best = epoch 5, val_loss = 2.3827)
Loss	Contrastive loss (CLIP loss)
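
The contrastive (CLIP) loss listed above is commonly implemented as a symmetric cross-entropy over the batch's image-text similarity matrix. The following is a plain-Python sketch on a toy batch, not the training code from the repository; the function name `clip_loss` and the `logit_scale` value are assumptions made for illustration.

```python
import math

def clip_loss(image_feats, text_feats, logit_scale=100.0):
    """Symmetric cross-entropy over a batch of L2-normalized embeddings.

    Row i of image_feats and row i of text_feats form a matching pair.
    logit_scale is an illustrative temperature, not a value from this model card.
    """
    n = len(image_feats)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        for img in image_feats
    ]

    def cross_entropy(rows):
        # Mean negative log-softmax of the diagonal (matching) entry
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    image_to_text = cross_entropy(logits)
    text_to_image = cross_entropy([list(col) for col in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)

# Toy batch: matched pairs are perfectly aligned, so the loss is close to zero
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = clip_loss(aligned, aligned)
# Each image is paired with the wrong caption, so the loss is large
loss_swapped = clip_loss(aligned, aligned[::-1])
```

The loss falls as matching pairs become more similar than non-matching pairs in both directions, which is what pulls images and captions into one embedding space.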

Usage

Installation

pip install open_clip_torch torch huggingface_hub

Loading the model

from huggingface_hub import hf_hub_download
import open_clip
import torch
from PIL import Image

# Base model architecture
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Download fine-tuned checkpoint from Hugging Face Hub
ckpt_path = hf_hub_download(
    repo_id="t-seino-ml/BridgeCLIP-english",
    filename="epoch_5.pt",
)
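# weights_only=False unpickles arbitrary objects; only load checkpoints you trust
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
state_dict = ckpt.get("state_dict", ckpt)
# Strip the leading "module." that (Distributed)DataParallel adds (Python 3.9+)
state_dict = {k.removeprefix("module."): v for k, v in state_dict.items()}
# strict=False ignores mismatched keys, so report them instead of failing silently
load_result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", load_result.missing_keys)
print("unexpected keys:", load_result.unexpected_keys)
model = model.eval()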

# Encode image
image = preprocess(Image.open("bridge_image.jpg")).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Encode text
texts = tokenizer([
    "Cracking is observed in the main girder.",
    "Corrosion is observed in steel members.",
])
with torch.no_grad():
    text_features = model.encode_text(texts)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity
similarity = image_features @ text_features.T
print(similarity)

Caption template

Training captions follow the structure:

<damage> is observed in <location>. The soundness rating is <Ⅰ/Ⅱ/Ⅲ/Ⅳ>. The component-wise measure classification is <A/B/C1/C2/E1/E2/M/S1/S2>.

Notes: the soundness rating uses full-width Roman numerals (Ⅰ/Ⅱ/Ⅲ/Ⅳ), inherited from the source Japanese data. When multiple damages or locations are present they are joined by " and " with a plural verb ("are observed"). When the source record does not contain a rating, that sentence is simply omitted.

Examples:

"Cracking is observed in the side wall. The soundness rating is Ⅰ. The component-wise measure classification is B."
"Corrosion and deterioration of anti-corrosion function are observed in the main girder."
"Expansion gap defect is observed in the main girder. The component-wise measure classification is S2."

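To write query captions that match the training distribution, the template above can be assembled programmatically. This is a minimal sketch: `build_caption` and its parameters are hypothetical names, and it assumes the verb becomes plural only when more than one damage is listed and that location strings already carry their article ("the main girder").

```python
def build_caption(damages, locations, soundness=None, measure=None):
    """Assemble a caption following the template described above."""
    subject = " and ".join(damages)
    subject = subject[0].upper() + subject[1:]
    # Assumption: the verb is plural when more than one damage is listed
    verb = "are" if len(damages) > 1 else "is"
    sentences = [f"{subject} {verb} observed in {' and '.join(locations)}."]
    # Sentences for missing fields are omitted, as in the training captions
    if soundness is not None:
        sentences.append(f"The soundness rating is {soundness}.")
    if measure is not None:
        sentences.append(f"The component-wise measure classification is {measure}.")
    return " ".join(sentences)

caption_full = build_caption(["cracking"], ["the side wall"], soundness="Ⅰ", measure="B")
caption_multi = build_caption(
    ["corrosion", "deterioration of anti-corrosion function"], ["the main girder"]
)
```

With these inputs the function reproduces the first two example captions above. Note that the soundness value must be passed as the full-width numeral (Ⅰ), not ASCII "I".
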
k-NN classification

See the GitHub repo for the full classification pipeline:

t-seino-ml/BridgeCLIP-english

Representative command:

LABEL_LANG=en CUDA_VISIBLE_DEVICES=0 python -m classification.models.clip_finetuned_knn \
  --train_csv classification/results/unified_train_user_en.csv \
  --val_csv   classification/results/unified_val_user_en.csv \
  --ckpt_dir  ./checkpoints \
  --out       classification/results_en/clip_finetuned_knn_preds.csv \
  --k 10
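
For readers who want the idea without running the pipeline, k-NN classification over embeddings can be sketched as follows. This is an illustration only, assuming cosine similarity and an unweighted majority vote for a single-label category; the repository's implementation may differ (for example in vote weighting or multi-label handling). The names `cosine` and `knn_predict` and the toy data are hypothetical.

```python
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def knn_predict(query, database, k=10):
    """Majority vote over the k database entries most similar to the query.

    database is a list of (embedding, label) pairs.
    """
    ranked = sorted(database, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D embeddings standing in for 512-D image features
toy_db = [
    ([1.0, 0.1], "B"),
    ([0.9, 0.2], "B"),
    ([0.8, 0.3], "B"),
    ([0.1, 1.0], "C1"),
    ([0.2, 0.9], "C1"),
]
pred_b = knn_predict([1.0, 0.0], toy_db, k=3)
pred_c1 = knn_predict([0.0, 1.0], toy_db, k=3)
```

In the toy data the first query lands among the "B" examples and the second among the "C1" examples, so the votes go to those labels respectively.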

Performance (validation set, 2,679 samples)

Classification — Macro-F1

Category	CLIP-FT + k-NN (proposed)	CLIP-FT + linear classifier	Best supervised (ViT weighted finetune)
Soundness rating	0.4339	0.4020	0.5577
Measure	0.2533	0.2356	0.3538
Damage type	0.5467	0.5027	0.5394
Damage location	0.5461	0.5175	0.4439
Mean	0.4450	0.4144	0.4737

The proposed CLIP-FT + k-NN ranks first on damage location (0.5461), ahead of all supervised baselines. The linear-classifier variant shows a +25.2% relative gain over the same head on the base CLIP backbone.
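
Macro-F1 averages per-class F1 scores with equal weight, so rare classes count as much as frequent ones. The sketch below shows the single-label case in plain Python; it is not the evaluation script from the repository, and `macro_f1` and the toy labels are illustrative.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# A classifier that always predicts the majority class
y_true = ["A", "A", "A", "A", "B", "C1"]
y_pred = ["A", "A", "A", "A", "A", "A"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_score_macro = macro_f1(y_true, y_pred)
```

Here accuracy is 4/6 while macro-F1 is only 0.8/3, because the two minority classes each score zero. This is why macro-F1 values can look low next to accuracy on imbalanced label sets.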

Retrieval (Recall@k)

Direction	Base CLIP	CLIP-FT (epoch 5)
Image→Text R@1	0.0007	0.0474
Image→Text R@10	0.0198	0.2706
Text→Image R@1	0.0019	0.0646
Text→Image R@10	0.0123	0.2811
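
Recall@k is the fraction of queries whose ground-truth match appears among the k highest-scoring candidates. A minimal sketch, assuming query i's correct match is gallery item i (the exact matching rule used for the numbers above is not stated here); `recall_at_k` and the toy matrix are illustrative.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item (same index) is in the top k.

    similarity[i][j] is the score between query i and gallery item j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(similarity)

toy_sim = [
    [0.9, 0.1, 0.3],  # query 0: correct item ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct item ranked 2nd
    [0.7, 0.6, 0.1],  # query 2: correct item ranked 3rd
]
r_at_1 = recall_at_k(toy_sim, 1)
r_at_2 = recall_at_k(toy_sim, 2)
```

For Text→Image retrieval the same function applies to the transposed similarity matrix.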

Attribute-based retrieval (Text → Image, AttrMatch@1)

A hit is counted if the top-1 retrieved gallery image shares at least one label with the query text in the given category.

Category	Base CLIP	CLIP-FT	Relative gain
Soundness	0.5266	0.6990	×1.33
Measure	0.4345	0.6451	×1.48
Damage type	0.2407	0.6471	×2.69
Damage location	0.2188	0.6113	×2.79
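
The hit rule and the "Relative gain" column can be illustrated in a few lines. This is a sketch under the stated rule (at least one shared label counts as a hit); `attr_match_at_1` and the label names in the toy data are hypothetical.

```python
def attr_match_at_1(query_labels, top1_labels):
    """Share of queries whose top-1 retrieved image has a label in common."""
    hits = sum(bool(set(q) & set(g)) for q, g in zip(query_labels, top1_labels))
    return hits / len(query_labels)

# One label set per query, and one per top-1 retrieved image (illustrative labels)
queries = [{"cracking"}, {"corrosion", "water leakage"}, {"spalling"}]
retrieved = [{"cracking", "corrosion"}, {"water leakage"}, {"corrosion"}]
attr_score = attr_match_at_1(queries, retrieved)  # 2 of 3 queries hit

# Relative gain = CLIP-FT score divided by Base CLIP score (damage type row)
gain_damage_type = 0.6471 / 0.2407
```

Because a single shared label is enough for a hit, this metric is more lenient than exact-pair Recall@1, which helps explain the much higher absolute values.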

Data-scaling result

A 13-subset sweep (10k → 131k) shows monotonic improvement that saturates around 120k; five of the subsets are listed below:

Subset	Best epoch	val_loss	I2T R@10	T2I R@10
10k	3	3.0089	0.1389	0.1489
50k	5	2.6969	0.1978	0.2243
90k	5	2.5216	0.2419	0.2639
120k	5	2.4633	0.2542	0.2773
Clean (~131k)	5	2.3827	0.2706	0.2811

Training data

Image–text pairs from the Japanese national road facility inspection database (xROAD). The 14,949 unique inspection-finding sentences were translated into English with GPT-4o and stored in a cached translation dictionary.

Split	Samples
Train	130,930
Validation	2,679
k-NN database (all 4 categories valid)	90,987
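
Translating each unique sentence once and reusing the result is a simple dictionary-cache pattern. The sketch below is a generic illustration of that pattern, not the repository's translation script; `translate_unique` is a hypothetical helper and `fake_translate` is a stand-in for a real translation call.

```python
def translate_unique(sentences, translate_fn, cache=None):
    """Translate each distinct sentence once and reuse the cached result."""
    cache = {} if cache is None else cache
    for s in sentences:
        if s not in cache:
            cache[s] = translate_fn(s)
    return [cache[s] for s in sentences], cache

calls = []

def fake_translate(text):
    # Stand-in for a real translation call; records how often it is invoked
    calls.append(text)
    return f"EN({text})"

findings = ["finding_a", "finding_b", "finding_a", "finding_a"]
translated, cache = translate_unique(findings, fake_translate)
```

Four input findings trigger only two translation calls here, and repeated findings always receive an identical English caption.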

Classification categories

Category	Classes	Type
Soundness rating	4 (Ⅰ / Ⅱ / Ⅲ / Ⅳ)	Single-label
Measure	9 (A / B / C1 / C2 / E1 / E2 / M / S1 / S2)	Single-label
Damage type	15	Multi-label
Damage location	20	Multi-label

File layout

.
├── README.md          # This file
├── epoch_5.pt         # Fine-tuned checkpoint (~1.7 GB)
└── config.json        # Model configuration

Citation

Paper in preparation.

License

MIT License. The dataset itself is not redistributed (see the GitHub repo for instructions on rebuilding it from the source database).
