---
license: mit
datasets:
- unicamp-dl/mmarco
language:
- de
---


# ColBERTv2-mmarco-de-0.1

This is a German ColBERT implementation based on [colbert-ir/colbertv2.0](https://huggingface.co/colbert-ir/colbertv2.0).

- Base Model: [dbmdz/bert-base-german-cased](https://huggingface.co/dbmdz/bert-base-german-cased)
- Training Data: [unicamp-dl/mmarco](https://huggingface.co/datasets/unicamp-dl/mmarco) (German split), a random sample of 10 million triplets
- Framework used for training: [RAGatouille](https://github.com/bclavie/RAGatouille). Thanks a ton, [@bclavie](https://huggingface.co/bclavie)!


As I'm limited on GPU resources, training did not run all the way through; "only" 10 checkpoints were trained.
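
# Usage

Here is a rough usage sketch with [RAGatouille](https://github.com/bclavie/RAGatouille). The repo id below is a placeholder, so swap in the actual Hugging Face repo id of this model (or a local checkpoint path); the documents and the query are just illustrative.

```python
from ragatouille import RAGPretrainedModel

# Placeholder repo id: replace with the actual repo id of this model, or a local checkpoint path
RAG = RAGPretrainedModel.from_pretrained("<namespace>/ColBERTv2-mmarco-de-0.1")

# Build a small index over a German document collection
RAG.index(
    collection=[
        "Berlin ist die Hauptstadt der Bundesrepublik Deutschland.",
        "Der Rhein ist einer der längsten Flüsse Europas.",
    ],
    index_name="demo-de",
)

# Retrieve the top passages for a German query
results = RAG.search(query="Was ist die Hauptstadt von Deutschland?", k=2)
print(results)
```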

# Code
My code is probably a mess, but YOLO!


## Data prep
```python
from datasets import load_dataset
from ragatouille import RAGTrainer
from tqdm.auto import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed

SAMPLE_SIZE = -1  # -1 = use the full (shuffled) dataset


def int_to_string(number):
    """Turn a sample size into a short label for the model name (e.g. 10000 -> "10K")."""
    if number < 0:
        return "full"
    elif number < 1000:
        return str(number)
    elif number < 1000000:
        return f"{number // 1000}K"
    else:
        return f"{number // 1000000}M"


def process_chunk(chunk):
    """Convert a dataset slice (dict of columns) into [query, positive, negative] triplets."""
    return [list(item) for item in zip(chunk["query"], chunk["positive"], chunk["negative"])]


def chunked_iterable(iterable, chunk_size):
    """Yield successive chunks from iterable."""
    for i in range(0, len(iterable), chunk_size):
        yield iterable[i:i + chunk_size]


def process_dataset_concurrently(dataset, chunksize=1000):
    with ThreadPoolExecutor() as executor:
        # Wrap the chunk iterator with tqdm for real-time progress updates
        wrapped_dataset = tqdm(chunked_iterable(dataset, chunksize),
                               total=(len(dataset) + chunksize - 1) // chunksize)
        # Submit each chunk to the executor and collect triplets as the futures complete
        futures = [executor.submit(process_chunk, chunk) for chunk in wrapped_dataset]
        results = []
        for future in as_completed(futures):
            results.extend(future.result())
        return results


dataset = load_dataset('unicamp-dl/mmarco', 'german', trust_remote_code=True)

# Shuffle the dataset (seeded for reproducibility)
shuffled_dataset = dataset['train'].shuffle(seed=42)

# Optionally subsample; SAMPLE_SIZE <= 0 keeps everything
if SAMPLE_SIZE > 0:
    sampled_dataset = shuffled_dataset.select(range(SAMPLE_SIZE))
else:
    sampled_dataset = shuffled_dataset

triplets = process_dataset_concurrently(sampled_dataset, chunksize=10000)

trainer = RAGTrainer(
    model_name=f"ColBERT-mmacro-de-{int_to_string(SAMPLE_SIZE)}",
    pretrained_model_name="dbmdz/bert-base-german-cased",
    language_code="de",
)
trainer.prepare_training_data(raw_data=triplets, mine_hard_negatives=False)
```
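
`prepare_training_data` consumes the raw `[query, positive, negative]` triplets built by `process_chunk` above and writes the processed training files to disk; the training script below points `trainer.data_dir` at that prepared data. For reference, a single raw triplet looks like this (values are made up, not taken from mMARCO):

```python
# One raw training example: [query, positive passage, negative passage]
example_triplet = [
    "Wie hoch ist der Eiffelturm?",                       # query
    "Der Eiffelturm ist rund 330 Meter hoch.",            # relevant (positive) passage
    "Der Kölner Dom steht am linken Rheinufer in Köln.",  # unrelated (negative) passage
]
```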


## Training

```python
from pathlib import Path

from ragatouille import RAGTrainer


def int_to_string(number):
    """Turn a sample size into a short label for the model name (e.g. 1000000 -> "1M")."""
    if number < 1000:
        return str(number)
    elif number < 1000000:
        return f"{number // 1000}K"
    else:
        return f"{number // 1000000}M"


SAMPLE_SIZE = 1000000

trainer = RAGTrainer(
    model_name=f"ColBERT-mmacro-de-{int_to_string(SAMPLE_SIZE)}",
    pretrained_model_name="dbmdz/bert-base-german-cased",
    language_code="de",
)

# Point the trainer at the data prepared in the previous step
trainer.data_dir = Path("/kaggle/input/mmarco-de-10m")

trainer.train(
    batch_size=32,
    nbits=4,                # How many bits the trained model will use when compressing indexes
    maxsteps=500000,        # Maximum steps (hard stop)
    use_ib_negatives=True,  # Use in-batch negatives to calculate loss
    dim=128,                # Dimensions per embedding; 128 is the default and works well
    learning_rate=5e-6,     # Small values ([3e-6, 3e-5]) work best for BERT-like base models; 5e-6 is often the sweet spot
    doc_maxlen=256,         # Maximum document length; smaller chunks (128-256) work very well with ColBERT
    use_relu=False,         # Disable ReLU -- doesn't improve performance
    warmup_steps="auto",    # Defaults to 10% of the total steps
)
```
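
Once training has produced a checkpoint, it can be loaded like any other ColBERT model. The path below is only illustrative, since RAGatouille writes checkpoints into its own run directory; adjust it to wherever your run actually stored them.

```python
from ragatouille import RAGPretrainedModel

# Illustrative path: point this at the checkpoint directory your RAGatouille run produced
checkpoint_path = ".ragatouille/colbert/none/.../checkpoints/colbert"

RAG = RAGPretrainedModel.from_pretrained(checkpoint_path)
```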