Young Ho Shin committed on
Commit 59e5930 • 1 Parent(s): 23cd59d

Add article

Files changed (3)
  1. app.py +2 -0
  2. article.md +28 -0
  3. description.md +0 -29
app.py CHANGED
@@ -106,6 +106,8 @@ default_text = """
  title = "AI 문서 요약\nKorean text summarization"
  with open('description.md',mode='r') as file:
      description = file.read()
+ with open('article.md',mode='r') as file:
+     article = file.read()


  demo = gr.Interface(
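The hunk above reads `article.md` the same way `description.md` is already read. A minimal sketch of a shared helper follows. The name `read_markdown` is hypothetical and not part of the commit, and UTF-8 is an assumption: the diff does not state an encoding, but the files contain Korean text, so relying on the platform default may fail on some systems.

```python
from pathlib import Path


def read_markdown(path: str) -> str:
    """Return the contents of a Markdown file as text.

    Hypothetical helper, not part of the commit. UTF-8 is assumed;
    passing the encoding explicitly avoids depending on the platform default.
    """
    return Path(path).read_text(encoding="utf-8")
```

The two strings would then be handed to `gr.Interface` through its `description` and `article` parameters; the remaining arguments of that call are not shown in the diff.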
article.md ADDED
@@ -0,0 +1,28 @@
+ ### Technical description
+
+ Text summarization is one of the more interesting tasks in natural language processing,
+ and it is quite useful in everyday life, yet
+ I could not easily find a project that summarizes Korean documents, so I made a simple attempt here.
+
+ The basic idea is extractive summarization: select the N most important sentences in a document to produce a summary.
+ Different techniques exist depending on the criterion used to select the sentences.
+ The commonly used TextRank algorithm is a graph-based technique that finds the most central sentences through the relationships between sentences.
+
+ This project represents the meaning of each sentence as a vector through sentence embedding,
+ uses clustering to group sentences with similar meanings,
+ and selects one key sentence from each cluster to build the summary.
+
+ For the sentence embeddings I first tried [Sentence-BERT](https://www.sbert.net/), but the original model is based on BERT,
+ so its Korean summarization performance was poor.
+ I therefore use a Sentence-BERT model backed by [KoBERT](https://github.com/SKTBrain/KoBERT), which is trained for Korean embeddings.
+
+ The Python [Summarizer](https://github.com/dmmiller612/bert-extractive-summarizer) package is used to compute the sentence embeddings and generate the summary,
+ and the [spaCy](https://spacy.io/) library handles pre- and post-processing such as sentence boundary detection.
+ When a Naver News article link is entered, [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) first extracts the article body from the page, and then the summary is generated.
+
+ The project still has much room for improvement.
+ I could not find a satisfactory Korean document summarization dataset, so fine-tuning was not possible,
+ and sometimes the most important sentence of a text is missing from the summary, or a sentence that is hard to understand without context is included.
+ The language model is quite large, so performance on CPU is also unsatisfactory;
+ it would be worth experimenting with a smaller model.
+ Adding web scraping code that extracts article text from sites other than Naver News would also make the demo easier to use.
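The approach described in `article.md` (embed each sentence, cluster the embeddings, keep one sentence per cluster) can be sketched roughly as below. This is not the Summarizer package's code: it is a pure-Python illustration that assumes k-means clustering and picks the sentence nearest each centroid, with toy vectors standing in for Sentence-BERT embeddings. The names `kmeans` and `select_sentences` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points: Sequence[Vector], k: int, iters: int = 20) -> List[List[float]]:
    """Plain k-means with a deterministic start (k points spread through the input)."""
    centroids = [list(points[(i * len(points)) // k]) for i in range(k)]
    for _ in range(iters):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid when a cluster ends up empty.
        updated = [_mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def select_sentences(sentences: Sequence[str],
                     embeddings: Sequence[Vector],
                     k: int) -> List[str]:
    """Return one representative sentence per cluster, in document order."""
    if not sentences or k < 1:
        return []
    k = min(k, len(sentences))
    chosen = set()
    for centroid in kmeans(embeddings, k):
        # The representative is the sentence whose embedding is closest to the centroid.
        idx = min(range(len(sentences)),
                  key=lambda i: math.dist(embeddings[i], centroid))
        chosen.add(idx)
    return [sentences[i] for i in sorted(chosen)]
```

Calling `select_sentences(sentences, embeddings, k)` with one embedding per sentence returns at most `k` sentences. Restoring document order at the end keeps the extracted summary readable, although it does not address the missing-context problem the article mentions.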
description.md CHANGED
@@ -7,32 +7,3 @@
  - At the bottom of the screen, select the desired "options", then either type the text to summarize directly or enter the URL of a Naver News article.

  - Press the 'Submit' button and the summary is generated automatically in the right-hand panel.
-
- ### Technical description
-
- Text summarization is one of the more interesting tasks in natural language processing,
- and it is quite useful in everyday life, yet
- I could not easily find a project that summarizes Korean documents, so I made a simple attempt here.
-
- The basic idea is extractive summarization: select the N most important sentences in a document to produce a summary.
- Different techniques exist depending on the criterion used to select the sentences.
- The commonly used TextRank algorithm is a graph-based technique that finds the most central sentences through the relationships between sentences.
-
- This project represents the meaning of each sentence as a vector through sentence embedding,
- uses clustering to group sentences with similar meanings,
- and selects one key sentence from each cluster to build the summary.
-
- For the sentence embeddings I first tried [Sentence-BERT](https://www.sbert.net/), but the original model is based on BERT,
- so its Korean summarization performance was poor.
- I therefore use a Sentence-BERT model backed by [KoBERT](https://github.com/SKTBrain/KoBERT), which is trained for Korean embeddings.
-
- The Python [Summarizer](https://github.com/dmmiller612/bert-extractive-summarizer) package is used to compute the sentence embeddings and generate the summary,
- and the [spaCy](https://spacy.io/) library handles pre- and post-processing such as sentence boundary detection.
- When a Naver News article link is entered, [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) first extracts the article body from the page, and then the summary is generated.
-
- The project still has much room for improvement.
- I could not find a satisfactory Korean document summarization dataset, so fine-tuning was not possible,
- and sometimes the most important sentence of a text is missing from the summary, or a sentence that is hard to understand without context is included.
- The language model is quite large, so performance on CPU is also unsatisfactory;
- it would be worth experimenting with a smaller model.
- Adding web scraping code that extracts article text from sites other than Naver News would also make the demo easier to use.