"""
How RecursiveCharacterTextSplitter works, step by step.
Shows how a long document is split into small chunks.
"""
print("=" * 80)
print("How RecursiveCharacterTextSplitter works")
print("=" * 80)
# ============================================================================
# Part 1: Why split text at all?
# ============================================================================
print("\n" + "=" * 80)
print("Part 1: Why split text at all?")
print("=" * 80)
print("""
Problem: source documents are usually long
─────────────────────────────────────────────
A web article:          5,000 characters
A technical document:  10,000 characters
One chapter of a book: 20,000 characters

Embedding a whole document as a single vector causes problems:
✗ Low information density (too much irrelevant content per vector)
✗ Imprecise retrieval (a match cannot be traced to a specific paragraph)
✗ Model length limits are exceeded (BERT accepts at most 512 tokens)

Solution: text splitting
✓ Cut the long document into small chunks
✓ Embed and index each chunk independently
✓ Return only the most relevant chunks at retrieval time
""")
# ============================================================================
# Part 2: Your project configuration
# ============================================================================
print("\n" + "=" * 80)
print("Part 2: Your project configuration")
print("=" * 80)
print("""
In config.py:
─────────────────────────────────────────────
CHUNK_SIZE = 250      # at most 250 tokens per chunk
CHUNK_OVERLAP = 0     # no overlap between chunks

In document_processor.py:
─────────────────────────────────────────────
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250,    # up to 250 tokens per chunk
    chunk_overlap=0    # no overlap
)
""")
# ============================================================================
# Part 3: The core mechanism of RecursiveCharacterTextSplitter
# ============================================================================
print("\n" + "=" * 80)
print("Part 3: The core mechanism of RecursiveCharacterTextSplitter")
print("=" * 80)
print("""
What "recursive" means here
─────────────────────────────────────────────
The text is not cut blindly every N characters. The splitter tries a list of
separators in priority order and falls back to a finer separator only for
pieces that are still too large.

Separator priority (highest first):
─────────────────────────────────────────────
1. "\\n\\n"  blank line (paragraph break)   ← tried first
2. "\\n"    single newline (line break)
3. " "     space (word boundary)
4. ""      individual characters           ← last resort

Workflow:
─────────────────────────────────────────────
Step 1: split on "\\n\\n"
   ↓
   Is every piece <= 250 tokens?
   ↓ No (some piece is too large)
Step 2: split the oversized pieces on "\\n"
   ↓
   Is every piece <= 250 tokens?
   ↓ No (some piece is still too large)
Step 3: split the oversized pieces on " "
   ↓
   Is every piece <= 250 tokens?
   ↓ No (some piece is still too large)
Step 4: fall back to a character-level split
   ↓
   Every piece is now <= 250 tokens ✓

Adjacent small pieces are then merged back together as long as the merged
chunk stays within the limit.
""")
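
# ----------------------------------------------------------------------------
# Illustrative sketch of the recursive strategy described in Part 3, using
# only the standard library. This is NOT LangChain's implementation. Assumed
# for illustration: the names sketch_recursive_split and SKETCH_SEPARATORS are
# hypothetical, length is measured in characters instead of tokens,
# separators are dropped rather than kept, and small pieces are not merged
# back together afterwards.
# ----------------------------------------------------------------------------
SKETCH_SEPARATORS = ("\n\n", "\n", " ", "")


def sketch_recursive_split(text, limit, separators=SKETCH_SEPARATORS):
    """Return pieces of `text`, each at most `limit` characters long."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if len(text) <= limit:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every `limit` characters.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    pieces = []
    for part in text.split(separators[0]):
        # A part that is still too long is retried with the finer separators.
        pieces.extend(sketch_recursive_split(part, limit, separators[1:]))
    return pieces


print("\nSketch: recursive split with a 4-character limit")
print(sketch_recursive_split("aaa bbb\n\ncccccccccc", 4))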
# ============================================================================
# Part 4: Worked example - simulating the split by hand
# ============================================================================
print("\n" + "=" * 80)
print("Part 4: Worked example - simulating the split by hand")
print("=" * 80)

# Sample document: a title and three paragraphs separated by blank lines
document = """Introduction to Artificial Intelligence

Artificial intelligence (AI) is a branch of computer science. It aims to build systems that can perform tasks that normally require human intelligence.

Machine learning is a subfield of artificial intelligence. It enables computers to learn from data and improve their performance. Deep learning is one approach to machine learning that uses multi-layer neural networks.

Natural language processing (NLP) is another important area of AI. It deals with the interaction between computers and human language."""

print("\nOriginal document:")
print("─" * 80)
print(document)
print("─" * 80)
print(f"Document length: {len(document)} characters")


# Stand-in for the token counting that the real splitter delegates to tiktoken
def count_tokens(text):
    """Rough token estimate (the real pipeline counts with tiktoken)."""
    # Rule of thumb for English: about 4 characters per token.
    # Chinese text is denser: often more than 1 token per character.
    return max(1, len(text) // 4)


print(f"\nEstimated size: {count_tokens(document)} tokens")

# Step 1: split on blank lines
print("\n" + "─" * 80)
print("Step 1: split on '\\n\\n' (paragraphs)")
print("─" * 80)
paragraphs = document.split("\n\n")
print(f"\nSplit into {len(paragraphs)} paragraphs:\n")
for i, para in enumerate(paragraphs, 1):
    token_count = count_tokens(para)
    status = "✓" if token_count <= 250 else "✗ over the limit"
    print(f"Paragraph {i}: {token_count} tokens {status}")
    print(f"  Content: {para[:60]}...")
    print()

# Step 2: suppose one paragraph is over the limit. Every paragraph above fits
# in 250 tokens, so a smaller demo limit is used to make this branch run.
DEMO_LIMIT = 30
large_para = (
    "Machine learning is a subfield of artificial intelligence. "
    "It enables computers to learn from data and improve their performance. "
    "Deep learning is one approach to machine learning that uses multi-layer neural networks."
)
if count_tokens(large_para) > DEMO_LIMIT:
    print("─" * 80)
    print(f"Step 2: paragraph exceeds {DEMO_LIMIT} tokens, split it further")
    print("─" * 80)
    # Simplification: sentences are split on '. ' for readability. The real
    # splitter would try '\n' next and then ' ', not sentence punctuation.
    sentences = [s.strip() for s in large_para.split(". ") if s.strip()]
    print(f"\nSplit into {len(sentences)} sentences:\n")
    for i, sent in enumerate(sentences, 1):
        token_count = count_tokens(sent)
        status = "✓" if token_count <= DEMO_LIMIT else "✗"
        print(f"Sentence {i}: {token_count} tokens {status}")
        print(f"  Content: {sent}")
        print()
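
# ----------------------------------------------------------------------------
# Illustrative sketch of the merge step that follows splitting: small pieces
# are greedily packed into chunks that stay within the limit. Assumptions:
# sketch_merge_pieces is a hypothetical name, length is counted in characters,
# pieces are rejoined with a single space, and a piece that is already longer
# than the limit is kept as its own oversized chunk.
# ----------------------------------------------------------------------------
def sketch_merge_pieces(pieces, limit, joiner=" "):
    """Greedily pack `pieces` into chunks of at most `limit` characters."""
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if not current or len(candidate) <= limit:
            # Either the chunk is empty or the piece still fits: keep packing.
            current = candidate
        else:
            # The piece does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print("\nSketch: merging small pieces with a 5-character limit")
print(sketch_merge_pieces(["aa", "bb", "cc", "dddd"], 5))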
# ============================================================================
# Part 5: What chunk_overlap does
# ============================================================================
print("\n" + "=" * 80)
print("Part 5: What chunk_overlap does")
print("=" * 80)
print("""
Your project setting: CHUNK_OVERLAP = 0 (no overlap)
─────────────────────────────────────────────
Splitting without overlap:
┌──────────┐┌──────────┐┌──────────┐
│ Chunk 1  ││ Chunk 2  ││ Chunk 3  │
│ 250 tok  ││ 250 tok  ││ 250 tok  │
└──────────┘└──────────┘└──────────┘
            ↑
  a boundary may cut through a thought

Splitting with overlap (CHUNK_OVERLAP = 50):
┌──────────┐
│ Chunk 1  │
│ 250 tok  │
└──────────┘
        ┌──────────┐
        │ Chunk 2  │
        │ 250 tok  │
        └──────────┘
                ┌──────────┐
                │ Chunk 3  │
                │ 250 tok  │
                └──────────┘

Advantages:
✓ Preserves context across chunk boundaries
✓ Makes it less likely that key information is cut in half
✓ Can improve retrieval accuracy (a rough estimate is +5-10%; measure on your own data)

Drawbacks:
✗ Storage grows by roughly 20-30% at this setting (about overlap / (chunk_size - overlap))
✗ Retrieved chunks may repeat content

Why is it set to 0 in your project?
─────────────────────────────────────────────
✓ Saves storage
✓ Avoids duplicated content
✓ CrossEncoder reranking usually compensates for the missing overlap

Suggested settings:
- CHUNK_OVERLAP = 0:   quick prototypes, limited storage
- CHUNK_OVERLAP = 50:  production default ⭐
- CHUNK_OVERLAP = 100: high-precision or critical applications (medical, legal, etc.)
""")
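
# ----------------------------------------------------------------------------
# Illustrative sketch of how overlapping chunks are laid out over a token
# sequence. Assumptions: sketch_overlap_windows is a hypothetical name and the
# chunks are plain fixed-size windows. The real splitter cuts at separators,
# so its chunks are usually somewhat shorter and its overlaps less regular.
# ----------------------------------------------------------------------------
def sketch_overlap_windows(total_tokens, chunk_size, chunk_overlap):
    """Return (start, end) token offsets of consecutive overlapping windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    windows = []
    start = 0
    while start < total_tokens:
        end = min(start + chunk_size, total_tokens)
        windows.append((start, end))
        if end == total_tokens:
            break
        start += step
    return windows


print("\nSketch: windows over 750 tokens, chunk_size=250")
print("overlap=0 :", sketch_overlap_windows(750, 250, 0))
print("overlap=50:", sketch_overlap_windows(750, 250, 50))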
# ============================================================================
# Part 6: What from_tiktoken_encoder does
# ============================================================================
print("\n" + "=" * 80)
print("Part 6: What is special about from_tiktoken_encoder")
print("=" * 80)
print("""
Your code:
─────────────────────────────────────────────
RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250,
    chunk_overlap=0
)

Without tiktoken:
─────────────────────────────────────────────
RecursiveCharacterTextSplitter(
    chunk_size=250,    # ← counted in characters, not tokens!
    chunk_overlap=0
)

The difference:
─────────────────────────────────────────────
Default mode: chunk_size is measured in characters
├─ chunk_size=250 → up to 250 characters
├─ Chinese: about 250 characters ≈ 375 tokens (over a 250-token budget)
└─ English: about 50 words ≈ 60-70 tokens (far below the budget)

tiktoken mode: chunk_size is measured in tokens ⭐
├─ chunk_size=250 → up to 250 tokens
├─ Chinese: about 166 characters ≈ 250 tokens ✓
└─ English: about 190 words ≈ 250 tokens ✓

What is tiktoken?
─────────────────────────────────────────────
OpenAI's BPE tokenizer library. It matches the tokenization of OpenAI GPT
models. BERT-style models use their own WordPiece tokenizers, so tiktoken
counts are only an approximation for them.

Advantages:
✓ Chunk size is controlled in tokens rather than characters
✓ Keeps chunks close to the embedding model's token limit
  (exact only when the embedding model shares the tokenizer)
✓ Measures Chinese and English text on the same scale

Using tiktoken to measure chunk length is a sound choice for your project.
""")
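
# ----------------------------------------------------------------------------
# Illustrative sketch of the character-versus-token distinction. Assumptions:
# sketch_token_len is a hypothetical helper, the "cl100k_base" encoding is
# chosen only for illustration (the project may use a different one), and
# tiktoken is treated as optional. If it is missing or its vocabulary cannot
# be loaded, a rough 4-characters-per-token estimate is used, which is only
# reasonable for English text.
# ----------------------------------------------------------------------------
def sketch_token_len(text):
    """Return a token count for `text`, exact with tiktoken, estimated without."""
    if not text:
        return 0
    try:
        import tiktoken

        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except Exception:
        # Fallback estimate, not a real tokenizer.
        return max(1, len(text) // 4)


sketch_sample = "Machine learning is a subfield of artificial intelligence."
print("\nSketch: the same text measured two ways")
print(f"characters: {len(sketch_sample)}")
print(f"tokens:     {sketch_token_len(sketch_sample)}")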
# ============================================================================
# Part 7: The full splitting pipeline, visualized
# ============================================================================
print("\n" + "=" * 80)
print("Part 7: The full splitting pipeline, visualized")
print("=" * 80)
print("""
Source document (5,000 tokens)
┌─────────────────────────────────────────────┐
│ Chapter 1: Introduction to AI               │
│                                             │
│ Artificial intelligence (AI) is a branch... │
│                                             │
│ Chapter 2: Machine Learning                 │
│                                             │
│ Machine learning is a subfield of AI...     │
│                                             │
│ Chapter 3: Deep Learning                    │
│ ...                                         │
└─────────────────────────────────────────────┘
   ↓
RecursiveCharacterTextSplitter
(chunk_size=250, overlap=0)
   ↓
─────────────────────────────────────────────
Chunk 1 (up to 250 tokens)
┌─────────────────────────────────────────────┐
│ Chapter 1: Introduction to AI               │
│                                             │
│ Artificial intelligence (AI) is a branch... │
└─────────────────────────────────────────────┘
   ↓ stored in the vector database
Chunk 2 (up to 250 tokens)
┌─────────────────────────────────────────────┐
│ AI spans several subfields, such as...      │
└─────────────────────────────────────────────┘
   ↓ stored in the vector database
Chunk 3 (up to 250 tokens)
┌─────────────────────────────────────────────┐
│ Chapter 2: Machine Learning                 │
│                                             │
│ Machine learning is a subfield of AI. It... │
└─────────────────────────────────────────────┘
   ↓ stored in the vector database
...splitting continues, producing about 20 chunks
─────────────────────────────────────────────
At query time:
User question: "What is machine learning?"
   ↓
Vector search returns the top 20 chunks
   ↓
├─ Chunk 3 (similarity: 0.92) ← most relevant
├─ Chunk 4 (similarity: 0.88)
├─ Chunk 1 (similarity: 0.75)
└─ ...
   ↓
CrossEncoder reranking → top 5
   ↓
The most relevant passages are passed to the LLM to generate the answer
""")
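
# ----------------------------------------------------------------------------
# Illustrative sketch of the two-stage retrieval shown in the diagram: a cheap
# first-stage score selects candidates, then a more expensive score reorders
# them. Assumptions: sketch_retrieve_then_rerank and both scoring callables
# are hypothetical placeholders. In the project the first stage would be
# vector similarity and the second a CrossEncoder; neither is implemented here.
# ----------------------------------------------------------------------------
import heapq


def sketch_retrieve_then_rerank(chunks, first_stage_score, rerank_score,
                                k_first=20, k_final=5):
    """Keep the k_first best chunks by one score, then the k_final best by another."""
    candidates = heapq.nlargest(k_first, chunks, key=first_stage_score)
    # The reranker only sees the candidates, so a chunk dropped by the first
    # stage cannot be recovered here.
    return heapq.nlargest(k_final, candidates, key=rerank_score)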
# ============================================================================
# Part 8: Tuning the key parameters
# ============================================================================
print("\n" + "=" * 80)
print("Part 8: Tuning the key parameters")
print("=" * 80)
print("""
Suggested parameter ranges:
─────────────────────────────────────────────
CHUNK_SIZE (tokens per chunk):
├─ 100-200:  short documents, precise retrieval
├─ 250-500:  general purpose ⭐ (your project)
└─ 500-1000: long documents that need more context

CHUNK_OVERLAP (tokens shared by neighbouring chunks):
├─ 0:    quick prototypes, limited storage (your project)
├─ 50:   recommended for production ⭐
├─ 100:  high-precision requirements
└─ 150+: critical applications

Your current configuration:
─────────────────────────────────────────────
CHUNK_SIZE = 250     ✓ moderate, suits most scenarios
CHUNK_OVERLAP = 0    ⚠️ consider raising to 50-100

Suggested tuning:
─────────────────────────────────────────────
CHUNK_SIZE = 400      # more context per chunk
CHUNK_OVERLAP = 100   # overlap to preserve continuity

Rationale:
✓ 400 tokens is usually enough to hold a complete paragraph
✓ A 100-token overlap makes it less likely that key information is cut at a boundary
✓ Combined with CrossEncoder reranking, accuracy may improve by roughly 8-12%
  (an estimate, not a measured result; validate on your own data)
""")
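
# ----------------------------------------------------------------------------
# Illustrative arithmetic for comparing configurations: how many chunks a
# document yields and how many tokens end up stored once overlap is counted.
# Assumptions: sketch_chunk_stats is a hypothetical name and chunks are
# treated as fixed-size windows, so the numbers are a lower bound on the chunk
# count (cutting at separators produces shorter chunks and therefore more).
# ----------------------------------------------------------------------------
def sketch_chunk_stats(total_tokens, chunk_size, chunk_overlap):
    """Return (chunk_count, stored_tokens) for fixed-size overlapping windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if total_tokens <= 0:
        return 0, 0
    if total_tokens <= chunk_size:
        return 1, total_tokens
    step = chunk_size - chunk_overlap
    # Integer ceiling of (total_tokens - chunk_size) / step, plus the first window.
    count = 1 + -(-(total_tokens - chunk_size) // step)
    # Every window after the first repeats `chunk_overlap` tokens.
    stored = total_tokens + chunk_overlap * (count - 1)
    return count, stored


print("\nSketch: chunk count and stored tokens for a 5,000-token document")
for sketch_size, sketch_overlap in [(250, 0), (250, 50), (400, 100)]:
    sketch_count, sketch_stored = sketch_chunk_stats(5000, sketch_size, sketch_overlap)
    sketch_extra = 100 * (sketch_stored - 5000) / 5000
    print(f"chunk_size={sketch_size}, overlap={sketch_overlap}: "
          f"{sketch_count} chunks, {sketch_stored} tokens stored (+{sketch_extra:.0f}%)")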
# ============================================================================
# Part 9: Summary
# ============================================================================
print("\n" + "=" * 80)
print("Part 9: Key takeaways")
print("=" * 80)
print("""
How RecursiveCharacterTextSplitter works:
─────────────────────────────────────────────
1️⃣ Recursive splitting
   └─ Tries separators in priority order: \\n\\n → \\n → space → character
2️⃣ Boundary-aware cuts
   └─ Prefers paragraph and line boundaries, which tends to keep related text together
3️⃣ Size control
   └─ from_tiktoken_encoder measures chunks in tokens and keeps each at or under 250
4️⃣ Optional overlap
   └─ CHUNK_OVERLAP carries context across chunk boundaries
5️⃣ Your project's pipeline
   Source documents
      ↓ RecursiveCharacterTextSplitter
   Chunks of up to 250 tokens
      ↓ HuggingFace Embeddings
   Vector database
      ↓ Vector search (top 20)
   Candidate chunks
      ↓ CrossEncoder reranking
   Final top 5 chunks
      ↓
   Passed to the LLM to generate the answer

Key advantages:
✓ Boundary-aware splitting that keeps related text together
✓ Predictable chunk sizes
✓ Handles mixed Chinese and English text
✓ Works well with vector retrieval

This is what lets your project retrieve relevant passages and answer questions accurately.
─────────────────────────────────────────────
""")
print("\n" + "=" * 80)
print("Done. You should now have a working picture of how text splitting works.")
print("=" * 80)
print()