GoodBaiBai88
/

M3D-CLIP

Image Feature Extraction

feature-extraction

3D medical CLIP

Image-text retrieval

Model card Files Files and versions Community

M3D-CLIP / README.md

GoodBaiBai88's picture

Update README.md

9903146 verified 4 months ago

|

No virus

2.45 kB

	---
	license: apache-2.0
	metrics:
	- accuracy
	pipeline_tag: image-feature-extraction
	tags:
	- 3D medical CLIP
	- Image-text retrieval
	---

	M3D-CLIP is one of the works in the [M3D](https://github.com/BAAI-DCAI/M3D) series.
	It is a 3D medical CLIP model that aligns vision and language through contrastive loss on the [M3D-Cap](https://huggingface.co/datasets/GoodBaiBai88/M3D-Cap) dataset.
	The vision encoder uses 3D ViT with 32\256\256 image size and 4\16\16 patch size.
	The language encoder utilizes a pre-trained BERT as initialization.
	The uses of M3D-CLIP:
	1. 3D medical image and text retrieval task.
	2. Aligned and powerful image and text features for downstream tasks.
	3. Text-aligned visual encoders are excellent pre-trained models for visual and multi-modal tasks.


	![comparison](M3D_CLIP_table.png)
	![comparison](itr_result.png)

	# Quickstart

	```python
	device = torch.device("cuda") # or cpu

	tokenizer = AutoTokenizer.from_pretrained(
	"GoodBaiBai88/M3D-CLIP",
	model_max_length=512,
	padding_side="right",
	use_fast=False
	)
	model = AutoModel.from_pretrained(
	"GoodBaiBai88/M3D-CLIP",
	trust_remote_code=True
	)
	model = model.to(device=device)

	# Prepare your 3D medical image:
	# 1. The image shape needs to be processed as 132256*256, considering resize and other methods.
	# 2. The image needs to be normalized to 0-1, considering Min-Max Normalization.
	# 3. The image format needs to be converted to .npy
	# 4. Although we did not train on 2D images, in theory, the 2D image can be interpolated to the shape of 132256*256 for input.

	image_path = ""
	input_txt = ""

	text_tensor = tokenizer(input_txt, max_length=512, truncation=True, padding="max_length", return_tensors="pt")
	input_id = text_tensor["input_ids"].to(device=device)
	attention_mask = text_tensor["attention_mask"].to(device=device)
	image = np.load(image_path).to(device=device)

	with torch.inference_mode():
	image_features = model.encode_image(image)[:, 0]
	text_features = model.encode_text(input_id, attention_mask)[:, 0]
	```

	# Citation

	If you feel helpful from our work, please consider citing the following work:

	```BibTeX
	@misc{bai2024m3d,
	title={M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models},
	author={Fan Bai and Yuxin Du and Tiejun Huang and Max Q. -H. Meng and Bo Zhao},
	year={2024},
	eprint={2404.00578},
	archivePrefix={arXiv},
	primaryClass={cs.CV}
	}
	```