metadata
title: NetsPresso_QA
app_file: run_ralm_netspresso_doc.py
sdk: gradio
sdk_version: 3.41.2

Text retrieval inference (indexing, search)

Installation

  1. Download the repository
git clone https://github.com/nota-github/np_app_text_retrieval_inference
  2. Build and run the Docker image that defines the model environment
cd np_app_text_retrieval_inference
docker build --cache-from notadockerhub/np_app_text_retrieval_inference:latest -t notadockerhub/np_app_text_retrieval_inference:latest -f ./Dockerfile .
docker run --name {container_name} --shm-size=8g -it --gpus '"device=0"' -v {your_code_dir}:/root/np_app_text_retrieval_inference -v /{your_data_dir}:/workspace/datasets notadockerhub/np_app_text_retrieval_inference:latest
  • During retrieval, the GPU is used only for BERT-based query encoding. That step is a small share of the total time, so running on CPU alone makes little difference in speed.

  • When indexing your own documents, they are encoded once with BERT; for this step a GPU saves considerable time compared with a CPU.

    • The current implementation supports a single GPU only. To use multiple GPUs, create individual processes and run them in parallel.
  • Most of the code is based on pyserini.
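
The multi-GPU note above leaves the splitting of work to the user. A minimal sketch of one way to do it, assuming the collection is a JSONL file and that each shard is later passed as --corpus to its own encoding process (for example, one process per GPU selected through CUDA_VISIBLE_DEVICES). The helper name shard_collection and the shard file names are hypothetical and not part of this repository:

```python
import json
from pathlib import Path


def shard_collection(collection_path, out_dir, num_shards):
    """Split a JSONL collection round-robin into num_shards files.

    Hypothetical helper: each resulting shard can be encoded by a
    separate process, and the per-shard outputs merged afterwards.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = [out_dir / f"collection.shard{i}.jsonl" for i in range(num_shards)]
    writers = [p.open("w", encoding="utf-8") for p in shard_paths]
    try:
        count = 0
        with open(collection_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)  # fails early on malformed records
                writers[count % num_shards].write(
                    json.dumps(record, ensure_ascii=False) + "\n"
                )
                count += 1
    finally:
        for w in writers:
            w.close()
    return shard_paths
```

Round-robin assignment keeps the shards within one record of each other in size, so the processes should finish at roughly the same time.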

Dataset

datasets
    |-- dataset_name
    |   |-- collection.jsonl
    |   |-- queries.tsv
    |   |-- qrels.txt (optional, for quantitative evaluation)
  • collection.jsonl: each line is {"id": "PASSAGE_ID", "contents": "CONTENTS"}.
  • queries.tsv: each line is QUERY_ID\tCONTENTS.
  • qrels.txt: each line is QUERY_ID QUERY_TYPE PASSAGE_ID RELEVANCE_SCORE.
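
As an illustration of the three file formats above, the following sketch writes a dataset directory from in-memory data. The function name write_dataset is hypothetical, and the "Q0" value for the QUERY_TYPE column is an assumption taken from the run file example later in this document:

```python
import json
from pathlib import Path


def write_dataset(root, passages, queries, qrels=None):
    """Write collection.jsonl, queries.tsv and (optionally) qrels.txt.

    passages: iterable of (passage_id, contents)
    queries:  iterable of (query_id, text)
    qrels:    iterable of (query_id, passage_id, relevance), or None
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    with (root / "collection.jsonl").open("w", encoding="utf-8") as f:
        for pid, contents in passages:
            f.write(json.dumps({"id": pid, "contents": contents}, ensure_ascii=False) + "\n")
    with (root / "queries.tsv").open("w", encoding="utf-8") as f:
        for qid, text in queries:
            # Tabs or newlines inside a query would break the TSV layout.
            clean = " ".join(text.split())
            f.write(f"{qid}\t{clean}\n")
    if qrels is not None:
        with (root / "qrels.txt").open("w", encoding="utf-8") as f:
            for qid, pid, rel in qrels:
                # "Q0" fills the QUERY_TYPE column (assumed placeholder value).
                f.write(f"{qid} Q0 {pid} {rel}\n")
```

Passing ensure_ascii=False keeps non-ASCII passages (Korean, for instance) readable in the JSONL file instead of escaping them.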

Recommended retriever

  • sparse model: BM25
  • dense model
    • multi-lingual: mDPR, mContriever
    • multi-vector: ColBERT
  • hybrid model: sparse (first-pass) + dense (reranking)
  • The baseline model for encoding multilingual text is castorini/mdpr-question-nq.
  • Pre-trained models for individual languages can be found on the HuggingFace model hub.

Sample dataset

  • mrtydi-korean
    • A benchmark dataset for multilingual retrieval covering 11 languages
    • For Korean, it provides 1,496,126 passages and 421 test queries
    • Multi-field input with title and text can be used (in the general case, only text is available)
  • The raw data and the prebuilt indexes can be downloaded from the data hub.
    • @data_hub:/ssd2/np_app/Dataset_Hub/Semantic_Search/{corpus,indexes}

Procedure

1. Indexing

  • For fast retrieval, the index over the passages in the collection is computed in advance.

  • A prebuilt index can be used instead of running this step.

  • dense model

python -m pyserini.encode \
  input   --corpus /path/to/dataset/collection.jsonl  \
          --fields text \
  output  --embeddings indexes/dataset_name/dense \
          --to-faiss \
  encoder --encoder huggingface_model_name_or_checkpoint_path \
          --fields text \
          --max-length $MAX_LENGTH \
          --batch $BATCH_SIZE \
          --fp16
  • huggingface_model_name_or_checkpoint_path: a model name from the HuggingFace model hub, or a checkpoint path

    • e.g., for mrtydi, use castorini/mdpr-passage-nq (for query encoding at retrieval time: castorini/mdpr-question-nq)
    • With tied (vs. split) encoders, the passage and query encoders are the same (vs. different)
  • sparse model

python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input datasets/dataset_name/collection.jsonl \
  --index /path/to/indexing/sparse \
  --fields text \
  --generator DefaultLuceneDocumentGenerator \
  --language $LANG_CODE \
  --threads $NUM_THREADS \
  --storePositions --storeDocvectors --storeRaw
  • Language codes follow ISO 639-1 (e.g., en, ko, ja, zh)

  • To use multiple fields, separate the fields inside the "contents" text of the collection with \n and pass the field names to --fields (i.e., --fields title text).

    • mrtydi uses '\n\n' as the delimiter
{"id": "5#1", "contents": "지미 카터\n\n지미 카터는 조지아주 섬터 카운티 플레인스 마을에서 태어났다. 조지아 공과대학교를 졸업하였다. 그 후 해군에 들어가 전함·원자력·잠수함의 승무원으로 일하였다. 1953년 미국 해군 대위로 예편하였고 이후 땅콩·면화 등을 가꿔 많은 돈을 벌었다. 그의 별명이 \"땅콩 농부\" (Peanut Farmer)로 알려졌다."}
  • MAX_LENGTH: maximum length of the positional embedding, i.e. the longest input the encoder accepts (e.g., BERT: 512, DPR: 256)

  • Output (dir: /path/to/indexing)

    • docid: sets of passage id
    • index: concatenation of (compressed) index vectors, binary file
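
The multi-field layout described in the indexing notes (title and text joined inside "contents") can be produced with a few lines of Python. This is a sketch only: make_record and split_fields are hypothetical helpers, and the '\n\n' delimiter is the one the notes attribute to mrtydi.

```python
import json

FIELD_DELIMITER = "\n\n"  # delimiter used by mrtydi, per the notes above


def make_record(passage_id, title, text, delimiter=FIELD_DELIMITER):
    """Build one collection.jsonl record with title and text in "contents"."""
    return {"id": passage_id, "contents": f"{title}{delimiter}{text}"}


def split_fields(record, field_names=("title", "text"), delimiter=FIELD_DELIMITER):
    """Recover the named fields from "contents".

    Only the first len(field_names) - 1 delimiters are split on, so a
    delimiter occurring later inside the text is left untouched.
    """
    parts = record["contents"].split(delimiter, len(field_names) - 1)
    return dict(zip(field_names, parts))


record = make_record("5#1", "지미 카터", "지미 카터는 조지아주 섬터 카운티 플레인스 마을에서 태어났다.")
print(json.dumps(record, ensure_ascii=False))  # one line of collection.jsonl
print(split_fields(record)["title"])
```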

2. Search

  • Rank the passages of the indexed collection for a given query

online

  • with sparse indexing
export QUERY="최초로 전기 자동차를 개발한 기업은 어디인가?"  # "Which company was the first to develop an electric car?"
python search_online.py --index_type sparse --index /path/to/indexing/sparse --query "$QUERY" --lang_abbr $LANG_CODE
Example output

 1 1830196#0       21.52590
{
  "id" : "1830196#0",
  "contents" : "창안 자동차(, )는 중화인민공화국의 자동차 제조 기업이다. 본사는 충칭 시에 있다. 디이 자동차, 둥펑 자동차, 상하이 자동차, 체리 자동차와 함께 중화인민공화국의 5대 자동차 제조 기업으로 여겨진다. 중화인민공화국의 자동차 제조 및 판매, 자동차 엔진 제품 제조 업체이다. 1862년 상하이 시에서 이홍장에 의해 설립되었으며 1950년대 말에 지프를 최초로 생산하면서 자동차 제조 기업이 되었다. 1996년 10월 31일 법인설립되었고 대표자는 장 바오린이다. 1984년에는 일본의 자동차 제조 기업인 스즈키와 제휴 관계를 수립했고 2001년에는 포드 모터 컴퍼니를 합병하면서 창안 포드 자동차(長安福特汽車)가 설립되었다. 2009년에는 하페이 자동차(哈飛汽車), 창허 자동차(昌河汽車)를 합병했다. 충칭 자동차 생산의 태반은 창안자동차가 담당하고 있다. 창안은 1959년 이후 차를 만들어온 국유기업으로 2차대전의 미군용 지프를 본떠 만든 군용트럭이 시발점이었다. 오늘날 라인업은 전기차 하나를 비롯한 17개 모델로 확대됐다. 7개 조립공장과 1개 엔진공장을 통해 한해 약 100만 대를 만든다. 여기에다가 창안은 포드, 푸조와 스즈키와도 합작하고 있어 한해 생산량은 300만 대에 이른다. 창안자동차는 글로벌연구개발시스템을 가동중에 있다. 현재 충칭, 베이징, 허베이, 허페이, 이탈리아 토리노, 일본 요코하마, 영국 버밍엄, 미국 디트로이트 등지에 연구개발센터를 설립하였다. 우리나라 한온시스템은 독일 폴크스바겐, 중국 창안자동차 등에 친환경차용 전동식 컴프레셔를 납품하고 있다."
}
 2 128660#8        19.02320
{
  "id" : "128660#8",
  "contents" : "1990๋…„๋Œ€์— ๋“ค์–ด์„  ์งํ›„ ๊ฐ€์†”๋ฆฐ์ž๋™์ฐจ์— ์˜ํ•œ ํ™˜๊ฒฝ์˜ค์—ผ๋ฌธ์ œ๊ฐ€ ๋Œ€๋‘๋˜์—ˆ๋‹ค. 1996๋…„ ์ œ๋„ˆ๋Ÿด ๋ชจํ„ฐ์Šค(GM)์‚ฌ๋Š” ์–‘์‚ฐ ์ „๊ธฐ์ฐจ 1ํ˜ธ๋กœ ๋ณผ ์ˆ˜ ์žˆ๋Š” 'EV1' ์ „๊ธฐ์ž๋™์ฐจ๋ฅผ ๊ฐœ๋ฐœํ•œ๋‹ค. ์ด ์ „๊ธฐ์ž๋™์ฐจ๋Š” ๋ฏธ๊ตญ ์บ˜๋ฆฌํฌ๋‹ˆ์•„ ์ง€์—ญ์—์„œ ์ž„๋Œ€ํ˜•์‹์œผ๋กœ ๋ณด๊ธ‰๋œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ GM์‚ฌ๋Š” ์ˆ˜์š”๊ฐ€ ํฌ์ง€ ์•Š์•„ ์ˆ˜์ต์„ฑ์ด ๋‚ฎ๋‹ค๋Š” ์ด์œ ๋กœ 1๋…„๋งŒ์— ์ „๊ธฐ์ž๋™์ฐจ 'EV1'์˜ ์กฐ๋ฆฝ๋ผ์ธ์„ ํ์‡„ํ•œ๋‹ค."
}
 3 320611#0        18.99790
{
  "id" : "320611#0",
  "contents" : "기아 그랜토(Kia Granto) 또는 아시아 그랜토(Asia Granto)는 1995년에 아시아자동차가 생산한 대형 트럭이다. 기아차가 일본 히노 자동차와 기술 제휴해서 히노 프로피아의 차체로 개발한 대형 트럭이다. 기존의 AM 트럭의 후속 차종으로 개발한 트럭으로, 아시아자동차가 창사 30주년을 기념해서 개발한 트럭이다. 선택 사양으로 ABS 브레이크, 속도 제한 장치, 브레이크 라이닝 간극 자동 조정 장치, 오토 그리스, 튜브형 브레이크 파이프, 전기식 변속기 전환 장치 등을 탑재하였다. 1997년에 대한민국산 트럭 최초로 U자형 적재함을 탑재하였으며, 최고 출력 430마력의 FY(8×4) 23톤 덤프 트럭을 출시하였다. 1999년에 아시아자동차가 기아자동차에게 흡수 합병되었으며, 이후 기아자동차에서 생산하다가 2000년 8월에 배기 가스 규제를 충족시키지 못하여 후속 차종 없이 단종되면서, 기아자동차는 대형 트럭 사업을 스카니아 코리아에 양도함에 따라 대형 트럭의 시장에서 완전히 철수하였다."
}
 4 1226703#1       18.78540
{
  "id" : "1226703#1",
  "contents" : "1845년에 회사를 창립 했으며 독일의 전지형 기중기 생산하는 기업 중 가장 오래되었다. 1868년에 말이 끄는 소방차를 개발했으며 1890년에 최초로 증기 소방 차량을 생산했다. 1914년에 최초로 트럭과 특수 차량을 제작했다. 1918년에 안스바흐 자동차 공장과 뉘르베르크 자동차 공장을 합병했다. 1937년에 3축 트럭을 생산 했으며 1943년에 제2차 세계대전으로 기존 공장이 파괴되면서 새로운 공장을 건설했다. 1956년에 군사 목적을 위해 대형 트력과 장비를 개발했다. 1960년대에 최초로 기중기를 제작하기 시작 했으며 1970년대부터 1980년대까지 개발했다. 1985년에 최대 50톤 용량의 가진 전지형 기중기를 개발했다. 1990년 일본의 기중기 회사였던 타다노에 인수 되었다. 1991년에 일본 수출을 위해 전지형 기중기를 생산했다. 1995년에 회사 창립 150주년이 되었다. 2004년에 최초로 험지형 기중기를 제작한데 이어 2009년에 트럭 기중기를 제작했다. 2013년에 공장을 확장 및 이전하면서 현재에 이르고 있다."
}
 5 1045157#14      18.30410
{
  "id" : "1045157#14",
  "contents" : "2010๋…„ 3์›” ์„ธ๊ณ„์ตœ์ดˆ์˜ 2000cc๊ธ‰ ์ž๋™์ฐจ๋ฅผ ์œ„ํ•œ 15Kw๊ธ‰ BLDC๋ฐœ์ „๊ธฐ ๊ฐœ๋ฐœ, ์ „๊ธฐ์ž๋™์ฐจ์˜ ์ฃผํ–‰๊ฑฐ ๋ฆฌ ์ œํ•œ ๊ทน๋ณต ์„ธ๊ณ„์ตœ์ดˆ์˜ ๋™๊ธ‰ ๋‚ด์—ฐ์ด๋ฅœ์ฐจ์˜ ์„ฑ๋Šฅ์„ ๋Šฅ๊ฐ€ํ•˜๋Š” ์ „๊ธฐ์Šค์ฟ ํ„ฐ ํž๋ฆฌ์Šค ๋ชจ๋ธ์ถœ์‹œ ๋ฐ ์‹ ์ฐจ๋ฐœํ‘œํšŒ EV์ „์‹œ์žฅ ์˜คํ”ˆ"
}
 6 128661#7        17.92510
{
  "id" : "128661#7",
  "contents" : "1991년 11월 21일 현대자동차는 한국내에서는 최초의 전기자동차를 독자개발했다고 발표했다."
}
 7 1312657#1       17.78780
{
  "id" : "1312657#1",
  "contents" : "1939년에 이탈리아 나폴리 출신인 빈센조 앙헬레스 게르바지오()와 타예레스 나폴리()에 의해 설립했다. 제2차 세계대전 당시 스페인에서 트럭을 생산하기 위해 차체 및 용접을 했으나, 이후 샤시에 특장 트럭 캡 디자인을 개발했다. 1958년에 최초로 공장이 이전되면서 버스를 생산하기 시작했다. 1960년에 세계 최초로 2층 버스를 생산했다. 1962년에 생산 공장이 재이전 되면서 팩토리아스 나폴리스 SA()에 인수되었다. 이 회사는 상용차를 생산한 업체로 주로 버스와 트럭을 생산했다. 1966년에 바헤이로스 디젤 SA()에 매각했다. 1969년에 다시 크라이슬러에 마각이 되었지만 버스 제조 부문의 경우 1971년에 벨기에의 자동차 제조 기업인 반호르에 매각되었다. 1983년에 반호르가 최대 주주가 되었고 인수 최기에 반호르의 브랜드로 차량 생산을 했지만 이후 이스파노 카로세라 SAL()로 사명이 변경되었다. 1997년에 이탈리아의 자동차 제조 기업인 피닌파라나()와 제휴를 맺고 시내버스 모델인 아비토와 고속버스 모델인 디보를 개발하기 시작했다. 2000년 9월에 모로코의 수도 카사블랑카에 공장을 설립했다. 2005년에 인도의 자동차 제조 기업인 타타자동차가 21%의 지분을 획득한데 이어 2009년에 지분 79%를 인수하면서 자회사가 되었다. 2010년에 현재의 사명으로 변경이 되었다. 2013년 9월에 타타자동차는 사라고사 공장 폐쇄를 발표했다. 매출 하락과 미래 전망이 불투명으로 폐쇄 결정을 내렸다."
}
 8 128660#63       17.71300
{
  "id" : "128660#63",
  "contents" : "후지중공업과 마츠비시 자동차는 2005년 8월에 전기자동차의 개발 계획을 발표하였다. 이 2개 회사가 거의 중지 상태였던 전기자동차의 개발을 재개하고 있다. 2008년에 들어 닛산-르노 연합이 전기자동차로 본격 참여 방침을 표명하였고, 도요타도 2010년대 초반에 전기자동차를 출시하기로 발표하는 등 전기 자동차가 활성화 조짐을 보이고 있다."
}
 9 126891#2        17.63640
{
  "id" : "126891#2",
  "contents" : "2007년, 스웨덴의 대표 자동차 메이커인 볼보는 세계 최초로 에탄올 자동차를 제작해서 자동차 경주에 참가했다. 스웨덴에서는 가솔린 자동차의 도시내 사용을 줄이고, 시민들이 자전거로 생활할 수 있게끔 유도하고 있다. 또한 볼보에서 친환경 자동차를 적극적으로 개발하게 하고, 시민들에게는 친환경 자동차 구입비에 150만 원의 보조금을 지급하며, 연료비는 가솔린의 70% 가격에 주유할 수 있게 하는 등 적극적인 탈석유 정책을 시행하고 있다."
}
10 128660#3        17.29680
{
  "id" : "128660#3",
  "contents" : "전기자동차는 디젤 엔진, 가솔린 엔진을 사용하는 오토사이클(정적사이클)방식의 자동차보다 먼저 고안 되었다. 1830년부터 1840년 사이에 영국 스코틀랜드의 사업가 앤더슨이 전기자동차의 시초라고 할 수 있는 세계 최초의 원유전기마차를 발명한다. 1835년에 네덜란드 크리스토퍼 베커는 작은 크기의 전기자동차를 만든다."
}
  • with dense indexing
python search_online.py --index_type dense --index /path/to/indexing/dense --query "$QUERY" --encoder huggingface_model_name_or_checkpoint_path --device $DEVICE
  • DEVICE: 'cpu' or 'cuda:$GPU_ID'

    • Search currently supports a single GPU only. To use multiple GPUs, create individual processes and run them in parallel.
  • with hybrid (first-pass: sparse, reranking: dense) indexing

python search_online.py --index_type hybrid --index /path/to/indexing/sparse,/path/to/indexing/dense --query "$QUERY" --encoder huggingface_model_name_or_checkpoint_path --device $DEVICE --alpha $ALPHA_MULTIPLIED_ON_SPARSE_SCORE --normalization --lang_abbr $LANG_CODE
  • ALPHA_MULTIPLIED_ON_SPARSE_SCORE is tuned by line search over the interval (0, 2); the default is 0.5.
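
To make the role of alpha concrete, here is a minimal sketch of weighted score fusion: the sparse score is multiplied by alpha and added to the dense score, optionally after normalization. The min-max normalization and the treatment of documents missing from one list are assumptions made for illustration; the exact formulas in search_online.py and pyserini may differ.

```python
def min_max(scores):
    """Scale a {doc_id: score} dict to [0, 1]; constant scores map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(sparse, dense, alpha=0.5, normalize=True, k=10):
    """Rank documents by alpha * sparse_score + dense_score."""
    if normalize:
        sparse, dense = min_max(sparse), min_max(dense)
    # Assumption: a document missing from one list gets that list's minimum.
    sparse_floor = min(sparse.values())
    dense_floor = min(dense.values())
    fused = {
        doc: alpha * sparse.get(doc, sparse_floor) + dense.get(doc, dense_floor)
        for doc in set(sparse) | set(dense)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy scores (made up for illustration): BM25-like and dot-product-like.
sparse_hits = {"a": 21.5, "b": 19.0, "c": 17.3}
dense_hits = {"b": 2.13, "c": 2.11, "d": 2.05}
for doc_id, score in fuse(sparse_hits, dense_hits, alpha=0.5):
    print(doc_id, round(score, 4))
```

A line search over alpha would call fuse with several candidate values and keep the one that scores best on queries with known relevance judgments.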

batch

  • with dense indexing
python -m pyserini.search.faiss \
    --encoder huggingface_model_name_or_checkpoint_path \
    --index /path/to/indexing_dense \
    --topics datasets/dataset_name/queries.tsv \
    --output /path/to/runfile --batch $BATCH_SIZE --threads $NUM_THREADS \
    --hits $TOPK --remove-query --device $DEVICE
  • The default values used for BATCH_SIZE and NUM_THREADS are 64 and 16.

  • with sparse indexing

python -m pyserini.search.lucene --bm25 \
    --topics datasets/dataset_name/queries.tsv \
    --index /path/to/indexing_sparse \
    --hits $TOPK \
    --language $LANG_CODE \
    --output /path/to/runfile
  • hybrid model
python -m pyserini.search.hybrid \
dense  --index /path/to/indexing_dense \
        --encoder huggingface_model_name_or_checkpoint_path \
        --device $DEVICE \
sparse --index /path/to/indexing_sparse \
fusion --alpha $ALPHA_MULTIPLIED_ON_SPARSE_SCORE \
run    --topics datasets/dataset_name/queries.jsonl \
    --output /path/to/runfile \
    --threads $NUM_THREADS \
    --batch-size $BATCH_SIZE \
    --hits $TOPK

python -m pyserini.search.hybrid \
dense  --index /path/to/indexing/dense \
        --encoder huggingface_model_name_or_checkpoint_path \
        --device $DEVICE \
sparse --index /path/to/indexing/sparse \
fusion --alpha $ALPHA_MULTIPLIED_ON_SPARSE_SCORE \
run --topics datasets/dataset_name/queries.tsv \
    --output runs/hybrid.run \
    --threads $NUM_THREADS \
    --batch-size $BATCH_SIZE \
    --hits 1000
  • Output (dir: /path/to/runfile)
    • format: qid q_type pid topK score retrieval_type
    • example:
    46 Q0 271267 1 2.134944 Faiss
    46 Q0 63734 2 2.118700 Faiss
    46 Q0 174045 3 2.110519 Faiss
    ...
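
A run file in the six-column format above can be read back with a short parser. This is a sketch under the assumption that columns are whitespace-separated and that the topK column holds the rank; parse_run_line and load_run are hypothetical names.

```python
from collections import defaultdict


def parse_run_line(line):
    """Parse one line: qid q_type pid topK(rank) score retrieval_type."""
    qid, q_type, pid, rank, score, tag = line.split()
    return {
        "qid": qid,
        "q_type": q_type,
        "pid": pid,
        "rank": int(rank),
        "score": float(score),
        "retrieval_type": tag,
    }


def load_run(lines):
    """Group hits by query id and sort each group by rank."""
    run = defaultdict(list)
    for line in lines:
        if line.strip():
            hit = parse_run_line(line)
            run[hit["qid"]].append(hit)
    for hits in run.values():
        hits.sort(key=lambda h: h["rank"])
    return dict(run)


example = [
    "46 Q0 271267 1 2.134944 Faiss",
    "46 Q0 63734 2 2.118700 Faiss",
    "46 Q0 174045 3 2.110519 Faiss",
]
run = load_run(example)
print([hit["pid"] for hit in run["46"]])
```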
    

3. Evaluation (optional)

  • The qrels file is the ground truth for quantitative evaluation; each line has the form qid q_type pid relevance_score.
  • The runfile is the output of a batch search; each line has the form qid q_type pid topK score retrieval_type.
  • The script below compares the qrels file with the runfile and computes metrics such as nDCG@10, MRR@100, and Recall@100.
python -m pyserini.eval.trec_eval -c -mndcg_cut.10 -mrecip_rank -mrecall.100 /path/to/qrels /path/to/runfile

recip_rank            	all	0.3628
recall_100            	all	0.7158
ndcg_cut_10           	all	0.3805
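
For readers who want to see what two of these numbers mean, the sketch below computes reciprocal rank and recall at a cutoff from in-memory qrels and run data. It is an illustration, not a replacement for trec_eval: nDCG is omitted, the function names are hypothetical, and the averaging rule (every judged query with at least one relevant passage counts, and queries missing from the run contribute 0) is an assumption meant to resemble the -c flag.

```python
def reciprocal_rank(ranked_pids, relevant):
    """1 / rank of the first relevant passage, or 0.0 if none is retrieved."""
    for rank, pid in enumerate(ranked_pids, start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_pids, relevant, k):
    """Fraction of the relevant passages found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_pids[:k]) & relevant) / len(relevant)


def evaluate(qrels, run, k=100):
    """qrels: {qid: {pid: relevance}}; run: {qid: [pid, ...] in rank order}."""
    rr_values, recall_values = [], []
    for qid, judged in qrels.items():
        relevant = {pid for pid, rel in judged.items() if rel > 0}
        if not relevant:
            continue  # queries without relevant passages are skipped
        ranked = run.get(qid, [])  # a missing query scores 0 on both metrics
        rr_values.append(reciprocal_rank(ranked, relevant))
        recall_values.append(recall_at_k(ranked, relevant, k))
    n = len(rr_values)
    if n == 0:
        return {"recip_rank": 0.0, f"recall_{k}": 0.0}
    return {"recip_rank": sum(rr_values) / n, f"recall_{k}": sum(recall_values) / n}


# Toy data (made up for illustration).
toy_qrels = {"q1": {"p1": 1, "p2": 1}, "q2": {"p9": 1}}
toy_run = {"q1": ["p3", "p1", "p2"], "q2": ["p7"]}
print(evaluate(toy_qrels, toy_run, k=100))
```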