Embedding multimodal data for similarity search using 🤗 Transformers, 🤗 Datasets and FAISS

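Embeddings are compressed representations of semantically meaningful information. They can be used for similarity search, zero-shot classification, or training a new model. Use cases for similarity search include finding similar products in e-commerce and content search on social media. This notebook walks through using 🤗 Transformers, 🤗 Datasets and FAISS to create embeddings from a feature extraction model and index them for later similarity search. Let's install the necessary libraries.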

!pip install -q datasets faiss-gpu transformers sentencepiece

In this tutorial, we will use the CLIP model to extract features. CLIP is a groundbreaking model that trains a text encoder and an image encoder jointly to connect the two modalities.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer
import faiss
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModel.from_pretrained("openai/clip-vit-base-patch16").to(device)
processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch16")

Load the dataset. To keep this example lightweight, we will use a small captioning dataset, jmhessel/newyorker_caption_contest.

from datasets import load_dataset

ds = load_dataset("jmhessel/newyorker_caption_contest", "explanation")

Let's look at an example.

ds["train"][0]["image"]

ds["train"][0]["image_description"]

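We don't need to write any function to embed examples or build an index; the FAISS integration in 🤗 Datasets abstracts these steps away. As shown below, we can use the dataset's map method to create a new column holding the embedding of each example. Let's start with embeddings for the text features, computed from the image_description column.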

dataset = ds["train"]
ds_with_embeddings = dataset.map(
    lambda example: {
        "embeddings": model.get_text_features(
            **tokenizer([example["image_description"]], truncation=True, return_tensors="pt").to(device)
        )[0]
        .detach()
        .cpu()
        .numpy()
    }
)

We can get the image embeddings in the same way.

ds_with_embeddings = ds_with_embeddings.map(
    lambda example: {
        "image_embeddings": model.get_image_features(**processor([example["image"]], return_tensors="pt").to(device))[
            0
        ]
        .detach()
        .cpu()
        .numpy()
    }
)
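
The two map calls above run one forward pass per example. As a hedged sketch that is not part of the original notebook, the same column could be filled in batches through `map(batched=True)`. The helper name `make_batch_embedder` is hypothetical, and `fake_encode` is a stand-in for CLIP so that the snippet runs without downloading a model.

```python
import numpy as np


def make_batch_embedder(encode_fn, in_column, out_column):
    """Build a function suitable for `Dataset.map(..., batched=True)`.

    `encode_fn` is assumed to take a list of inputs and return a 2-D array
    with one embedding row per input.
    """

    def embed_batch(batch):
        embeddings = np.asarray(encode_fn(batch[in_column]), dtype=np.float32)
        # With batched=True, map expects one output entry per input example.
        assert embeddings.shape[0] == len(batch[in_column])
        return {out_column: embeddings}

    return embed_batch


def fake_encode(texts):
    # Stand-in for CLIP: maps each text to a 4-dimensional vector.
    return [[len(t), t.count(" "), 1.0, 0.0] for t in texts]


embed_fn = make_batch_embedder(fake_encode, "image_description", "embeddings")
out = embed_fn({"image_description": ["a snowy day", "two cats"]})
print(out["embeddings"].shape)  # (2, 4)
```

With the real model, `encode_fn` could wrap `tokenizer(texts, padding=True, truncation=True, return_tensors="pt")` followed by `model.get_text_features(...)`, and the call would look like `dataset.map(embed_fn, batched=True, batch_size=32)`. The batch size of 32 is an arbitrary choice.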

Now we add an index for each column.

# Build a FAISS index for the text embeddings.
ds_with_embeddings.add_faiss_index(column="embeddings")

# Build a FAISS index for the image embeddings.
ds_with_embeddings.add_faiss_index(column="image_embeddings")
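
To make concrete what such an index does at query time, here is a NumPy-only sketch of exhaustive nearest-neighbour search. It assumes that add_faiss_index with default arguments builds a flat (brute-force) index using L2 distance, in which case lower scores mean closer matches. The function `nearest_l2` and the toy vectors are illustrative and belong to neither library.

```python
import numpy as np


def nearest_l2(vectors, query, k=1):
    """Return (squared L2 distances, row indices) of the k closest rows."""
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    # Compare the query against every stored vector (no approximation).
    dists = ((vectors - query) ** 2).sum(axis=1)
    order = np.argsort(dists)[:k]
    return dists[order], order


toy_index = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
scores, ids = nearest_l2(toy_index, [0.9, 1.2], k=2)
print(ids)  # [1 0]
print(scores)
```

A flat index scans every vector, which is fine for a dataset of this size; larger collections usually call for an approximate index type.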

Querying the data with text prompts

We can now query the dataset with text or an image to retrieve similar items.

prmt = "a snowy day"
prmt_embedding = (
    model.get_text_features(**tokenizer([prmt], return_tensors="pt", truncation=True).to(device))[0]
    .detach()
    .cpu()
    .numpy()
)
scores, retrieved_examples = ds_with_embeddings.get_nearest_examples("embeddings", prmt_embedding, k=1)

def downscale_images(image):
    # Resize to a fixed width of 200 px while preserving the aspect ratio.
    width = 200
    ratio = width / float(image.size[0])
    height = int(float(image.size[1]) * ratio)
    img = image.resize((width, height), Image.Resampling.LANCZOS)
    return img


images = [downscale_images(image) for image in retrieved_examples["image"]]
# Inspect the retrieved text and image.
print(retrieved_examples["image_description"])
display(images[0])

['A man is in the snow. A boy with a huge snow shovel is there too. They are outside a house.']
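
CLIP embeddings are commonly compared by cosine similarity. If that ranking is preferred, one option (an assumption, not something this notebook does) is to L2-normalize the vectors before indexing and querying. For unit vectors, squared L2 distance equals 2 - 2·cos, so an L2 index then ranks results the same way cosine similarity would. The helper `l2_normalize` and the toy data are illustrative.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit length."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


docs = l2_normalize([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
query = l2_normalize([[6.0, 8.0]])[0]

cosine = docs @ query
sq_l2 = ((docs - query) ** 2).sum(axis=1)

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(sq_l2, 2 - 2 * cosine, atol=1e-5))  # True
print(int(np.argmax(cosine)), int(np.argmin(sq_l2)))  # 0 0
```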

Querying the data with image prompts

Image similarity inference works the same way: you just call get_image_features.

import requests

# image of a beaver
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/beaver.png"
image = Image.open(requests.get(url, stream=True).raw)
display(downscale_images(image))

Let's search for images similar to this beaver image.

img_embedding = (
    model.get_image_features(**processor([image], return_tensors="pt").to(device))[0]
    .detach()
    .cpu()
    .numpy()
)
scores, retrieved_examples = ds_with_embeddings.get_nearest_examples("image_embeddings", img_embedding, k=1)

Display the image most similar to the beaver image.

images = [downscale_images(image) for image in retrieved_examples["image"]]
# Inspect the retrieved text and image.
print(retrieved_examples["image_description"])
display(images[0])

['Salmon swim upstream but they see a grizzly bear and are in shock. The bear has a smug look on his face when he sees the salmon.']

Saving, pushing and loading the embeddings

We can save the FAISS indexes built for the embeddings with save_faiss_index.
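
Depending on the library versions in use, the target folder may need to exist before the index files are written. This is an assumption worth checking rather than a documented requirement. A small sketch that creates the folder up front, using the same relative paths as the save calls in this section:

```python
from pathlib import Path

index_dir = Path("embeddings")
index_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

# Paths matching the file names used when saving the two indexes.
text_index_path = index_dir / "embeddings.faiss"
image_index_path = index_dir / "image_embeddings.faiss"
print(text_index_path.as_posix())  # embeddings/embeddings.faiss
```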

ds_with_embeddings.save_faiss_index("embeddings", "embeddings/embeddings.faiss")

ds_with_embeddings.save_faiss_index("image_embeddings", "embeddings/image_embeddings.faiss")

It's good practice to store embeddings in a dataset repository. We will therefore log in to the Hugging Face Hub, create a dataset repository, and upload the embedding indexes there. Later, the indexes can be retrieved with snapshot_download.

from huggingface_hub import HfApi, notebook_login, snapshot_download

notebook_login()

hf_id = "your-hub-username"  # replace with your Hugging Face Hub username

api = HfApi()
api.create_repo(f"{hf_id}/faiss_embeddings", repo_type="dataset")
api.upload_folder(
    folder_path="./embeddings",
    repo_id=f"{hf_id}/faiss_embeddings",
    repo_type="dataset",
)

snapshot_download(repo_id=f"{hf_id}/faiss_embeddings", repo_type="dataset", local_dir="downloaded_embeddings")

We can use load_faiss_index to load an index into a dataset that has no embeddings.

ds = ds["train"]
ds.load_faiss_index("embeddings", "./downloaded_embeddings/embeddings.faiss")
# Run inference again.
prmt = "people under the rain"

prmt_embedding = (
    model.get_text_features(**tokenizer([prmt], return_tensors="pt", truncation=True).to(device))[0]
    .detach()
    .cpu()
    .numpy()
)

scores, retrieved_examples = ds.get_nearest_examples("embeddings", prmt_embedding, k=1)

display(retrieved_examples["image"][0])
