fschlatt committed on
Commit
22f52a4
1 Parent(s): 650ead0

update readme

Files changed (4)
  1. README.md +28 -0
  2. configs/fine-tune.yaml +66 -0
  3. configs/index.yaml +21 -0
  4. configs/search.yaml +24 -0
README.md CHANGED
@@ -1,3 +1,31 @@
  ---
  license: apache-2.0
  ---
+
+ # Lightning IR SPLADE
+
+ This model is a SPLADE[^1] model fine-tuned using [Lightning IR](https://github.com/webis-de/lightning-ir).
+
+ See the [Lightning IR Model Zoo](https://webis-de.github.io/lightning-ir/models.html) for a comparison with other models.
+
+ ## Reproduction
+
+ To reproduce the model training, install Lightning IR and run the following command using the [fine-tune.yaml](./configs/fine-tune.yaml) configuration file:
+
+ ```bash
+ lightning-ir fit --config fine-tune.yaml
+ ```
+
+ To index the MS MARCO passages, run the following command using the [index.yaml](./configs/index.yaml) configuration file:
+
+ ```bash
+ lightning-ir index --config index.yaml
+ ```
+
+ After indexing, to evaluate the model on TREC Deep Learning 2019 and 2020, run the following command using the [search.yaml](./configs/search.yaml) configuration file:
+
+ ```bash
+ lightning-ir search --config search.yaml
+ ```
+
+ [^1]: Formal et al., [SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://dl.acm.org/doi/abs/10.1145/3404835.3463098)
configs/fine-tune.yaml ADDED
@@ -0,0 +1,66 @@
+ # lightning.pytorch==2.3.3
+ seed_everything: 0
+ trainer:
+   precision: bf16-mixed
+   max_steps: 50000
+ data:
+   class_path: lightning_ir.LightningIRDataModule
+   init_args:
+     num_workers: 1
+     train_batch_size: 64
+     shuffle_train: true
+     train_dataset:
+       class_path: lightning_ir.RunDataset
+       init_args:
+         run_path_or_id: msmarco-passage/train/rank-distillm/set-encoder
+         depth: 100
+         sample_size: 8
+         sampling_strategy: log_random
+         targets: score
+         normalize_targets: false
+ model:
+   class_path: lightning_ir.BiEncoderModule
+   init_args:
+     model_name_or_path: bert-base-uncased
+     config:
+       class_path: lightning_ir.SpladeConfig
+       init_args:
+         query_pooling_strategy: max
+         doc_pooling_strategy: max
+         projection: mlm
+         sparsification: relu_log
+         embedding_dim: 30522
+         similarity_function: dot
+         query_expansion: false
+         attend_to_query_expanded_tokens: false
+         query_mask_scoring_tokens: null
+         query_aggregation_function: sum
+         doc_expansion: false
+         attend_to_doc_expanded_tokens: false
+         doc_mask_scoring_tokens: null
+         normalize: false
+         add_marker_tokens: false
+         query_length: 32
+         doc_length: 256
+     loss_functions:
+     - - class_path: lightning_ir.SupervisedMarginMSE
+       - 0.05
+     - class_path: lightning_ir.KLDivergence
+     - class_path: lightning_ir.FLOPSRegularization
+       init_args:
+         query_weight: 0.01
+         doc_weight: 0.02
+     - class_path: lightning_ir.InBatchCrossEntropy
+       init_args:
+         pos_sampling_technique: first
+         neg_sampling_technique: first
+         max_num_neg_samples: 8
+ optimizer:
+   class_path: torch.optim.AdamW
+   init_args:
+     lr: 2.0e-05
+ lr_scheduler:
+   class_path: lightning_ir.ConstantLRSchedulerWithLinearWarmup
+   init_args:
+     num_warmup_steps: 5000
+     num_delay_steps: 0
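
The `SpladeConfig` settings in fine-tune.yaml (`projection: mlm`, `sparsification: relu_log`, max pooling for queries and documents, `similarity_function: dot`) match the usual SPLADE formulation: per-token vocabulary logits are passed through log(1 + ReLU(x)), max-pooled over the sequence, and the resulting query and document vectors are compared by dot product. The following is a minimal pure-Python sketch of that computation on toy logits over a 4-term vocabulary. The function names and the toy values are hypothetical, and this is an illustration of the formula, not Lightning IR's implementation.

```python
import math


def relu_log(x: float) -> float:
    """SPLADE-style sparsification: log(1 + ReLU(x))."""
    return math.log1p(max(x, 0.0))


def splade_vector(token_logits: list[list[float]]) -> list[float]:
    """Max-pool the sparsified logits over the token axis.

    token_logits[i][j] is the (toy) MLM logit of vocabulary term j at token i.
    """
    vocab_size = len(token_logits[0])
    return [
        max(relu_log(row[j]) for row in token_logits)
        for j in range(vocab_size)
    ]


def dot(u: list[float], v: list[float]) -> float:
    """Dot-product similarity between two vocabulary-sized vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy logits (hypothetical values): 2 query tokens, 3 document tokens.
query_logits = [[2.0, -1.0, 0.0, -3.0], [0.5, -2.0, -0.5, -1.0]]
doc_logits = [
    [1.0, -0.5, 3.0, -2.0],
    [-1.0, -1.0, 0.5, -4.0],
    [0.0, -3.0, 1.0, -1.0],
]

query_vec = splade_vector(query_logits)
doc_vec = splade_vector(doc_logits)
score = dot(query_vec, doc_vec)
```

Negative logits map to exactly zero, which is what makes the vectors sparse. In this toy example only term 0 is active in both vectors, so it alone contributes to the score.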
configs/index.yaml ADDED
@@ -0,0 +1,21 @@
+ trainer:
+   logger: false
+   callbacks:
+   - class_path: lightning_ir.IndexCallback
+     init_args:
+       index_dir: ./index
+       index_config:
+         class_path: lightning_ir.SparseIndexConfig
+ model:
+   class_path: lightning_ir.BiEncoderModule
+   init_args:
+     model_name_or_path: webis/bert-bi-encoder
+ data:
+   class_path: lightning_ir.LightningIRDataModule
+   init_args:
+     num_workers: 1
+     inference_batch_size: 256
+     inference_datasets:
+     - class_path: DocDataset
+       init_args:
+         doc_dataset: msmarco-passage
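
index.yaml selects a sparse index (`SparseIndexConfig`). How Lightning IR stores that index is not shown in this commit. As a conceptual illustration only, sparse vectors are commonly served from an inverted index that maps each active vocabulary term to the documents in which it has non-zero weight, so that a query only touches the postings of its own active terms. The sketch below assumes dict-based sparse vectors; all names and weights are hypothetical.

```python
import heapq
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term id to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term_id, weight in vec.items():
            if weight > 0.0:  # only non-zero weights are stored
                index[term_id].append((doc_id, weight))
    return index


def search(index, query_vec, k=10):
    """Accumulate dot-product scores over the query's active terms, return top k."""
    scores = defaultdict(float)
    for term_id, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term_id, []):
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy sparse document vectors: {term_id: weight} (hypothetical values).
docs = {
    "d1": {0: 0.7, 2: 1.4},
    "d2": {1: 0.9},
    "d3": {0: 0.2, 1: 0.3},
}
index = build_inverted_index(docs)
hits = search(index, {0: 1.0, 1: 0.5}, k=2)
```

Here `hits` holds the two highest-scoring documents for a query that activates terms 0 and 1. Term 2 never appears in the query, so its postings list is never read.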
configs/search.yaml ADDED
@@ -0,0 +1,24 @@
+
+ trainer:
+   logger: false
+   callbacks:
+   - class_path: SearchCallback
+     init_args:
+       index_dir: ./index
+       search_config:
+         class_path: SparseSearchConfig
+         init_args:
+           k: 10
+ model:
+   class_path: lightning_ir.BiEncoderModule
+   init_args:
+     model_name_or_path: webis/bert-bi-encoder
+     evaluation_metrics:
+     - nDCG@10
+ data:
+   class_path: lightning_ir.LightningIRDataModule
+   init_args:
+     inference_datasets:
+     - class_path: QueryDataset
+       init_args:
+         doc_dataset: msmarco-passage/trec-dl-2019/judged
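
search.yaml retrieves `k: 10` documents per query and reports `nDCG@10`. For reference, a minimal sketch of that metric follows, assuming linear gain (the relevance grade itself) and a log2 rank discount. The helper names are hypothetical, and the evaluation code Lightning IR actually calls may differ in details such as tie handling.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering.

    ranked_relevances: grades of the retrieved documents, in ranked order.
    all_relevances: grades of all judged documents for the query.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# A ranking that puts the only relevant document (grade 3) at rank 2.
example = ndcg_at_k([0, 3], [3, 0], k=10)
```

A ranking that places documents in descending order of relevance grade scores 1.0. Moving the single relevant document from rank 1 to rank 2, as in `example`, scales the score by 1 / log2(3).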