kr-article-summarizer / article.md
Young Ho Shin

Technical Description

Text summarization is one of the more interesting tasks in natural language processing, and it is quite useful in everyday life. Even so, I could not easily find a project that summarizes Korean documents, so I made a simple attempt at one here.

The basic idea is extractive summarization: select the N most important sentences in a document and assemble them into a summary. Techniques differ in the criterion they use to select sentences. The commonly used TextRank algorithm is a graph-based technique that finds the most central sentences through the relationships between sentences.
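To make the graph-based idea concrete, here is a minimal sketch of a TextRank-style ranker. It is not this project's code: the function names (`tokenize`, `similarity`, `textrank`, `summarize`) are hypothetical, the word-overlap similarity is only one possible edge weight, and the whitespace-style tokenizer is an assumption that a real Korean pipeline would likely replace with a morphological analyzer.

```python
import math
import re


def tokenize(sentence):
    # Assumption: simple lowercase word tokens stand in for real Korean tokenization.
    return set(re.findall(r"\w+", sentence.lower()))


def similarity(a, b):
    # Word overlap normalized by the log of each sentence's size.
    ta, tb = tokenize(a), tokenize(b)
    if len(ta) < 2 or len(tb) < 2:
        return 0.0  # avoids log(1) + log(1) == 0 in the denominator
    return len(ta & tb) / (math.log(len(ta)) + math.log(len(tb)))


def textrank(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    # Weighted, undirected sentence graph stored as an adjacency matrix.
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to the edge weight.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores


def summarize(sentences, n_sentences=2):
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(top[:n_sentences])]
```

A sentence that shares no words with the rest of the document receives no incoming score, so it ends up ranked last and is left out of a short summary.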

This project represents the meaning of each sentence as a vector through sentence embedding, uses clustering to group sentences with similar meanings, and then selects one key sentence from each cluster to build the summary.
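The cluster-then-select step can be sketched as follows. This is an illustration under assumptions, not the project's implementation: it uses a small hand-written k-means with deterministic initialization, toy 2-D vectors in place of real sentence embeddings, and the rule "pick the sentence closest to each centroid" as one plausible way to choose a key sentence per cluster. All function names are hypothetical.

```python
import math


def distance(a, b):
    # Euclidean distance between two vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    # Component-wise mean of a non-empty list of vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, k, iterations=20):
    # Assumption: deterministic init from evenly spaced input points,
    # chosen here only so the example is reproducible.
    step = len(vectors) / k
    centroids = [list(vectors[int(i * step)]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for idx, vec in enumerate(vectors):
            nearest = min(range(k), key=lambda c: distance(vec, centroids[c]))
            clusters[nearest].append(idx)
        # Empty clusters keep their previous centroid.
        centroids = [mean([vectors[i] for i in members]) if members else centroids[c]
                     for c, members in enumerate(clusters)]
    return centroids, clusters


def pick_representatives(vectors, k):
    # One sentence index per cluster: the member closest to the centroid.
    centroids, clusters = kmeans(vectors, k)
    chosen = []
    for centroid, members in zip(centroids, clusters):
        if members:
            chosen.append(min(members, key=lambda i: distance(vectors[i], centroid)))
    return sorted(chosen)  # keep document order


# Toy "embeddings": two clearly separated groups of three sentences each.
embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [5.05, 5.05],
]
print(pick_representatives(embeddings, k=2))
```

With k set to the desired summary length, the summary is the sentences at the returned indices, read in document order.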

To compute the sentence embeddings, I first tried the Sentence-BERT model, but the existing model is based on BERT, so its Korean summarization performance was poor. The project therefore uses a Sentence-BERT model backed by KoBERT, which is trained for Korean embeddings.

The Python Summarizer package is used to compute the sentence embeddings and generate the summary, and the spaCy library handles pre- and post-processing such as sentence boundary detection. When a Naver News article link is given as input, BeautifulSoup first extracts the article body from the page, and the summary is then generated from it.
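The preprocessing described above (pull the article body out of the page, then split it into sentences) can be sketched with the standard library alone. This sketch deliberately swaps in `html.parser` for BeautifulSoup and a naive punctuation rule for spaCy's sentence boundary detection, so it only illustrates the shape of the step. The element id `article-body` is a hypothetical placeholder, not Naver's actual markup, and the class and function names are made up for this example.

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ArticleBodyParser(HTMLParser):
    """Collects the text inside the element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # nesting depth inside the target element; 0 means outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_article_text(html, target_id="article-body"):
    # target_id is a hypothetical id; a real page needs its own selector.
    parser = ArticleBodyParser(target_id)
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def split_sentences(text):
    # Naive stand-in for sentence boundary detection:
    # break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


page = """
<html><body>
<div id="header">Site menu</div>
<div id="article-body"><p>First sentence. Second one.</p><br><p>Third one!</p></div>
<div id="footer">Copyright</div>
</body></html>
"""
print(split_sentences(extract_article_text(page)))
```

The resulting sentence list is what would then be passed on to the embedding and clustering stage.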

The project still has many points to improve. There was no satisfactory Korean summarization dataset, so fine-tuning was not possible. Sometimes the most important sentence of an article is missing from the summary, and sometimes a sentence is included that is hard to understand without its context. The language model is quite large, so performance on CPU is also unsatisfactory, and experimenting with a smaller model would be worthwhile. Adding web scraping code that extracts the body text from sites other than Naver News would also make the tool more convenient to use.