BridgeCLIP (English) — CLIP fine-tuning for bridge inspection image classification and retrieval

A fine-tuned OpenCLIP ViT-B/32 model trained on bridge inspection images from the Japanese national road facility inspection database (xROAD) paired with English captions (machine-translated from the original Japanese findings). It performs 4-category classification and image⇔text retrieval in a single shared embedding space.

Model overview

| Item | Value |
|---|---|
| Base model | OpenCLIP ViT-B/32 (laion2b_s34b_b79k) |
| Image encoder | Vision Transformer (ViT-B/32) |
| Text encoder | Transformer (GPT-2 based) |
| Projection dim | 512 |
| Training data | 130,930 pairs (bridge inspection images × English inspection findings) |
| Batch size | 128 |
| Learning rate | 1e-4 (warmup 1000, weight decay 0.1, AMP) |
| Epochs | 10 (best = epoch 5, val_loss = 2.3827) |
| Loss | Contrastive loss (CLIP loss) |

Usage

Installation

pip install open_clip_torch torch huggingface_hub

Loading the model

from huggingface_hub import hf_hub_download
import open_clip
import torch
from PIL import Image

# Base model architecture
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Download fine-tuned checkpoint from Hugging Face Hub
ckpt_path = hf_hub_download(
    repo_id="t-seino-ml/BridgeCLIP-english",
    filename="epoch_5.pt",
)
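# weights_only=False performs full unpickling, which can execute arbitrary code;
# only load checkpoints from a source you trust.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
state_dict = ckpt.get("state_dict", ckpt)
# Strip the "module." prefix that DataParallel/DistributedDataParallel wrappers add, if present
state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}
# strict=False ignores mismatched keys silently, so report them explicitly
result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
model = model.eval()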

# Encode image
image = preprocess(Image.open("bridge_image.jpg")).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Encode text
texts = tokenizer([
    "Cracking is observed in the main girder.",
    "Corrosion is observed in steel members.",
])
with torch.no_grad():
    text_features = model.encode_text(texts)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarity
similarity = image_features @ text_features.T
print(similarity)
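
Because images and text share one embedding space, retrieval reduces to ranking gallery embeddings by cosine similarity against a query embedding. The following is a minimal sketch, not code from the repository: the function name `top_k_matches` is hypothetical, and it assumes the features have already been L2-normalized as above and converted to NumPy arrays (for example with `.cpu().numpy()`).

```python
import numpy as np

def top_k_matches(query_feature, gallery_features, k=5):
    """Return (index, score) pairs for the k most similar gallery items.

    Assumes query_feature has shape (D,) and gallery_features has shape (N, D),
    both L2-normalized, so the dot product equals cosine similarity.
    """
    scores = gallery_features @ query_feature
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Toy example with 3-dimensional unit vectors standing in for 512-d embeddings
gallery = np.eye(3)
query = np.array([0.6, 0.8, 0.0])
matches = top_k_matches(query, gallery, k=2)
print(matches)
```

The same function serves both directions: pass a text embedding as the query and image embeddings as the gallery for Text→Image, or the reverse for Image→Text.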

Caption template

Training captions follow the structure:

<damage> is observed in <location>. The soundness rating is <Ⅰ/Ⅱ/Ⅲ/Ⅳ>. The component-wise measure classification is <A/B/C1/C2/E1/E2/M/S1/S2>.

Notes: the soundness rating uses full-width Roman numerals (Ⅰ/Ⅱ/Ⅲ/Ⅳ), inherited from the source Japanese data. Multiple damages or locations are joined by " and ", and the verb becomes plural ("are observed"). When the source record lacks a soundness rating or a measure classification, the corresponding sentence is omitted.

Examples:

  • "Cracking is observed in the side wall. The soundness rating is Ⅰ. The component-wise measure classification is B."
  • "Corrosion and deterioration of anti-corrosion function are observed in the main girder."
  • "Expansion gap defect is observed in the main girder. The component-wise measure classification is S2."
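
Query texts tend to work best when they follow the same template as the training captions. The sketch below assembles a caption from structured fields according to the rules above. It is illustrative only: `build_caption` is a hypothetical helper, not part of the repository, and it assumes that the verb agrees with the number of damages and that location strings already carry their article ("the main girder").

```python
def build_caption(damages, locations, soundness=None, measure=None):
    """Assemble a caption in the training template (hypothetical helper).

    damages, locations: lists of strings, joined by " and ".
    soundness: full-width Roman numeral such as "Ⅰ", or None to omit the sentence.
    measure: measure class such as "B" or "S2", or None to omit the sentence.
    """
    subject = " and ".join(damages)
    place = " and ".join(locations)
    # Assumption: plural verb only when more than one damage is listed
    verb = "are" if len(damages) > 1 else "is"
    first = f"{subject} {verb} observed in {place}."
    sentences = [first[0].upper() + first[1:]]  # capitalize the sentence start
    if soundness:
        sentences.append(f"The soundness rating is {soundness}.")
    if measure:
        sentences.append(f"The component-wise measure classification is {measure}.")
    return " ".join(sentences)

print(build_caption(["cracking"], ["the side wall"], soundness="Ⅰ", measure="B"))
```

The resulting string can be passed to `tokenizer([...])` in the same way as the hand-written captions in the usage example.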

k-NN classification

See the GitHub repo for the full classification pipeline:

t-seino-ml/BridgeCLIP-english

Representative command:

LABEL_LANG=en CUDA_VISIBLE_DEVICES=0 python -m classification.models.clip_finetuned_knn \
  --train_csv classification/results/unified_train_user_en.csv \
  --val_csv   classification/results/unified_val_user_en.csv \
  --ckpt_dir  ./checkpoints \
  --out       classification/results_en/clip_finetuned_knn_preds.csv \
  --k 10
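
For orientation, k-NN classification over embeddings can be sketched as follows. This is a simplified illustration and may differ from the repository's pipeline: it assumes cosine similarity, an unweighted majority vote, and a single-label category, and the name `knn_predict` is hypothetical. Multi-label categories (damage type, damage location) would need a different aggregation rule.

```python
from collections import Counter

import numpy as np

def knn_predict(query_emb, db_emb, db_labels, k=10):
    """Predict one label per query by majority vote among the k nearest
    database embeddings under cosine similarity (assumed, not confirmed)."""
    q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
    d = db_emb / np.linalg.norm(db_emb, axis=-1, keepdims=True)
    sims = q @ d.T                                # (num_queries, num_db)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]     # k most similar per query
    preds = []
    for row in nn_idx:
        votes = Counter(db_labels[j] for j in row)
        preds.append(votes.most_common(1)[0][0])
    return preds

# Toy 2-d database: three items per measure class
db = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
               [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
labels = ["B", "B", "B", "C1", "C1", "C1"]
queries = np.array([[1.0, 0.05], [0.05, 1.0]])
predictions = knn_predict(queries, db, labels, k=3)
print(predictions)
```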

Performance (validation set: 2,679 samples)

Classification — Macro-F1

| Category | CLIP-FT + k-NN (proposed) | CLIP-FT + linear classifier | Best supervised (ViT weighted finetune) |
|---|---|---|---|
| Soundness rating | 0.4339 | 0.4020 | 0.5577 |
| Measure | 0.2533 | 0.2356 | 0.3538 |
| Damage type | 0.5467 | 0.5027 | 0.5394 |
| Damage location | 0.5461 | 0.5175 | 0.4439 |
| Mean | 0.4450 | 0.4144 | 0.4737 |

The proposed CLIP-FT + k-NN ranks first on damage location (0.5461), ahead of every supervised baseline. The linear-classifier variant shows a +25.2% relative gain over the same head on the base CLIP backbone.
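
Macro-F1 gives every class equal weight regardless of how many samples it has, which matters for rare classes such as high soundness ratings. The sketch below computes it for a single-label category. It is a reference illustration, not the repository's evaluation code: `macro_f1` is a hypothetical name, and scoring a class with no true or predicted samples as 0 is an assumed convention.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores for single-label predictions."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true or predicted samples scores 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["I", "I", "II", "II", "III"]
y_pred = ["I", "II", "II", "II", "III"]
score = macro_f1(y_true, y_pred, classes=["I", "II", "III"])
print(round(score, 4))
```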

Retrieval (Recall@k)

| Direction | Base CLIP | CLIP-FT (epoch 5) |
|---|---|---|
| Image→Text R@1 | 0.0007 | 0.0474 |
| Image→Text R@10 | 0.0198 | 0.2706 |
| Text→Image R@1 | 0.0019 | 0.0646 |
| Text→Image R@10 | 0.0123 | 0.2811 |
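
Recall@k is the fraction of queries whose ground-truth match appears among the k highest-scoring gallery items. A minimal sketch follows, assuming a square similarity matrix in which query i is paired with gallery item i. The repository's evaluation may handle duplicate captions or ties differently, and `recall_at_k` is a hypothetical name.

```python
import numpy as np

def recall_at_k(similarity, k):
    """similarity[i, j] is the score between query i and gallery item j.
    Assumes the correct match for query i is gallery item i."""
    true_scores = np.diag(similarity)[:, None]
    # Rank of the true item = number of gallery items scoring strictly higher
    ranks = (similarity > true_scores).sum(axis=1)
    return float((ranks < k).mean())

sim = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.3, 0.5],
                [0.1, 0.2, 0.7]])
print(recall_at_k(sim, 1), recall_at_k(sim, 3))
```

For Image→Text, rows are image queries and columns are captions; transposing the matrix gives Text→Image.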

Attribute-based retrieval (Text → Image, AttrMatch@1)

A hit is counted if the top-1 retrieved gallery image shares at least one label with the query text in the given category.

| Category | Base CLIP | CLIP-FT | Relative gain |
|---|---|---|---|
| Soundness | 0.5266 | 0.6990 | ×1.33 |
| Measure | 0.4345 | 0.6451 | ×1.48 |
| Damage type | 0.2407 | 0.6471 | ×2.69 |
| Damage location | 0.2188 | 0.6113 | ×2.79 |
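
The AttrMatch@1 criterion can be sketched as below. This is an illustration under stated assumptions rather than the repository's implementation: `attr_match_at_1` is a hypothetical name, labels are represented as Python sets for one category at a time, and the top-1 image is taken as the argmax of each row of a text-by-image similarity matrix.

```python
import numpy as np

def attr_match_at_1(similarity, query_labels, gallery_labels):
    """Fraction of text queries whose top-1 image shares at least one label.

    similarity: array of shape (num_queries, num_gallery).
    query_labels, gallery_labels: lists of label sets for one category.
    """
    top1 = similarity.argmax(axis=1)
    hits = [bool(set(query_labels[i]) & set(gallery_labels[j]))
            for i, j in enumerate(top1)]
    return sum(hits) / len(hits)

sim = np.array([[0.2, 0.9],
                [0.8, 0.1],
                [0.3, 0.4]])
query_labels = [{"corrosion"}, {"cracking", "leakage"}, {"leakage"}]
gallery_labels = [{"cracking"}, {"corrosion", "peeling"}]
rate = attr_match_at_1(sim, query_labels, gallery_labels)
print(rate)
```

Because any shared label counts as a hit, this metric is more lenient than exact-pair Recall@1, which is consistent with the higher absolute values in the table.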

Data-scaling result

A sweep over 13 training subsets (10k → 131k pairs) shows monotonic improvement, with gains flattening around 120k. Five of the 13 subsets are listed below:

| Subset | Best epoch | val_loss | I2T R@10 | T2I R@10 |
|---|---|---|---|---|
| 10k | 3 | 3.0089 | 0.1389 | 0.1489 |
| 50k | 5 | 2.6969 | 0.1978 | 0.2243 |
| 90k | 5 | 2.5216 | 0.2419 | 0.2639 |
| 120k | 5 | 2.4633 | 0.2542 | 0.2773 |
| Clean (~131k) | 5 | 2.3827 | 0.2706 | 0.2811 |

Training data

Image–text pairs from the Japanese national road facility inspection database (xROAD); inspection findings were translated to English with GPT-4o (14,949 unique sentences, cached translation dictionary).

| Split | Samples |
|---|---|
| Train | 130,930 |
| Validation | 2,679 |
| k-NN database (all 4 categories valid) | 90,987 |

Classification categories

| Category | Classes | Type |
|---|---|---|
| Soundness rating | 4 (I / II / III / IV) | Single-label |
| Measure | 9 (A / B / C1 / C2 / E1 / E2 / M / S1 / S2) | Single-label |
| Damage type | 15 | Multi-label |
| Damage location | 20 | Multi-label |

File layout

.
├── README.md          # This file
├── epoch_5.pt         # Fine-tuned checkpoint (~1.7 GB)
└── config.json        # Model configuration

Citation

Paper in preparation.

License

MIT License. The dataset itself is not redistributed (see the GitHub repo for instructions on rebuilding it from the source database).
