Files changed (1)
  1. project.yml +63 -94
project.yml CHANGED
@@ -7,7 +7,7 @@ tags:
7
  - machine learning
8
  - natural language processing
9
  - huggingface
10
- ---
11
  vars:
12
  lang: "en"
13
  train: corpus/train.spacy
@@ -21,18 +21,21 @@ vars:
21
  ner_manual_labels: ecfr_manual_ner
22
  senter_labels: ecfr_labeled_sents
23
  ner_labeled_dataset: ecfr_labeled_ner
24
- assets:
25
- ner_labels: assets/ecfr_ner_labels.jsonl
26
- senter_labels: assets/ecfr_senter_labels.jsonl
27
- ner_patterns: assets/patterns.jsonl
28
- corpus_labels: corpus/labels
29
- data_files: data
30
- trained_model: my_trained_model
31
- trained_model_textcat: my_trained_model/textcat_multilabel
32
- output_models: output
33
- python_code: python_Code
34
 
35
- directories: ["corpus/labels", "data", "my_trained_model/textcat_multilabel", "my_trained_model/vocab", "output/experiment1/model-best/textcat_multilabel", "output/experiment1/model-best/vocab", "output/experiment1/model-last/textcat_multilabel", "output/experiment1/model-last/vocab", "output/experiment3/model-best/textcat_multilabel", "output/experiment3/model-best/vocab", "output/experiment3/model-last/textcat_multilabel", "output/experiment3/model-last/vocab", "python_Code"]
36
 
37
  assets:
38
  - dest: "corpus/labels/ner.json"
@@ -207,15 +210,11 @@ commands:
207
  Explanation:
208
  - The script `firstStep-format.py` reads data from the file specified in the `dataset_file` variable (`data/train200.jsonl` by default).
209
  - It extracts text and labels from each JSON object in the dataset file.
210
- - If both text and at least one label are available, it writes a new JSON object to the output file specified in the `output_file` variable (`data/firstStep_file.jsonl` by default) with the extracted text and label.
211
- - If either text or label is missing in a JSON object, a warning message is printed.
212
- - Upon completion, the script prints a message confirming the processing and the path to the output file.
213
- script:
214
- - "python3 python_Code/firstStep-format.py"
215
 
216
  - name: "train-text-classification-model"
217
  help: |
218
- Train the text classification model for the second step of the project using the `secondStep-score.py` script. This script loads a blank English spaCy model and adds a text classification pipeline to it. It then trains the model using the processed data from the first step.
219
 
220
  Usage:
221
  ```
@@ -223,18 +222,13 @@ commands:
223
  ```
224
 
225
  Explanation:
226
- - The script `secondStep-score.py` loads a blank English spaCy model and adds a text classification pipeline to it.
227
- - It reads processed data from the file specified in the `processed_data_file` variable (`data/firstStep_file.jsonl` by default).
228
- - The processed data is converted to spaCy format for training the model.
229
- - The model is trained using the converted data for a specified number of iterations (`n_iter`).
230
- - Losses are printed for each iteration during training.
231
- - Upon completion, the trained model is saved to the specified output directory (`./my_trained_model` by default).
232
- script:
233
- - "python3 python_Code/secondStep-score.py"
234
 
235
  - name: "classify-unlabeled-data"
236
  help: |
237
- Classify the unlabeled data for the third step of the project using the `thirdStep-label.py` script. This script loads the trained spaCy model from the previous step and classifies each record in the unlabeled dataset.
238
 
239
  Usage:
240
  ```
@@ -242,17 +236,13 @@ commands:
242
  ```
243
 
244
  Explanation:
245
- - The script `thirdStep-label.py` loads the trained spaCy model from the specified model directory (`./my_trained_model` by default).
246
- - It reads the unlabeled data from the file specified in the `unlabeled_data_file` variable (`data/train.jsonl` by default).
247
- - Each record in the unlabeled data is classified using the loaded model.
248
- - The predicted labels for each record are extracted and stored along with the text.
249
- - The classified data is optionally saved to a file specified in the `output_file` variable (`data/thirdStep_file.jsonl` by default).
250
- script:
251
- - "python3 python_Code/thirdStep-label.py"
252
 
253
  - name: "format-labeled-data"
254
  help: |
255
- Format the labeled data for the final step of the project using the `finalStep-formatLabel.py` script. This script processes the classified data from the third step and transforms it into a specific format, considering a threshold for label acceptance.
256
 
257
  Usage:
258
  ```
@@ -260,23 +250,25 @@ commands:
260
  ```
261
 
262
  Explanation:
263
- - The script `finalStep-formatLabel.py` reads classified data from the file specified in the `input_file` variable (`data/thirdStep_file.jsonl` by default).
264
- - For each record, it determines accepted categories based on a specified threshold.
265
- - It constructs an output record containing the text, predicted labels, accepted categories, answer (accept/reject), and options with meta information.
266
- - The transformed data is written to the file specified in the `output_file` variable (`data/train4465.jsonl` by default).
267
- script:
268
- - "python3 python_Code/finalStep-formatLabel.py"
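The threshold step that the removed explanation describes for `finalStep-formatLabel.py` can be sketched roughly as follows. This is not the script itself, only an illustration under assumptions: the input record is assumed to carry a `cats` mapping of label to score, the output keys (`accept`, `answer`, `options`, `meta`) are modelled on a choice-style annotation record, and the default threshold of 0.5 is hypothetical.

```python
def format_record(record, threshold=0.5):
    """Build a choice-style record from one classified record.

    Assumptions (not taken from the project): `record["cats"]` maps each
    label to a score, and 0.5 is only a placeholder threshold.
    """
    scores = record["cats"]
    # A label is accepted when its score reaches the threshold.
    accepted = [label for label, score in scores.items() if score >= threshold]
    return {
        "text": record["text"],
        "cats": scores,
        "accept": accepted,
        # The record is rejected when no label passes the threshold.
        "answer": "accept" if accepted else "reject",
        # One option per label, with the score kept as meta information.
        "options": [
            {"id": label, "text": label, "meta": f"{score:.2f}"}
            for label, score in scores.items()
        ],
    }
```

Keeping the per-label score in `meta` lets a reviewer see how close a rejected label was to the cut-off.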
269
-
270
  - name: "setup-environment"
271
  help: |
272
- Set up the Python virtual environment.
273
- script:
274
- - "python3 -m virtualenv venv"
275
- - "source venv/bin/activate"
276
 
277
  - name: "review-evaluation-data"
278
  help: |
279
- Review the evaluation data in Prodigy and automatically accept annotations.
280
 
281
  Usage:
282
  ```
@@ -284,15 +276,13 @@ commands:
284
  ```
285
 
286
  Explanation:
287
- - The command reviews the evaluation data in Prodigy.
288
- - It automatically accepts annotations made during the review process.
289
- - Only sessions allowed by the environment variable PRODIGY_ALLOWED_SESSIONS are permitted to review data. In this case, the session 'reviwer' is allowed.
290
- script:
291
- - "PRODIGY_ALLOWED_SESSIONS=reviwer python3 -m prodigy review project3eval-review project3eval --auto-accept"
292
 
293
  - name: "export-reviewed-evaluation-data"
294
  help: |
295
- Export the reviewed evaluation data from Prodigy to a JSONL file named 'goldenEval.jsonl'.
296
 
297
  Usage:
298
  ```
@@ -300,16 +290,12 @@ commands:
300
  ```
301
 
302
  Explanation:
303
- - The command exports the reviewed evaluation data from Prodigy to a JSONL file.
304
- - The data is exported from the Prodigy database associated with the project named 'project3eval-review'.
305
- - The exported data is saved to the file 'goldenEval.jsonl'.
306
- - This command helps in preserving the reviewed annotations for further analysis or processing.
307
- script:
308
- - "prodigy db-out project3eval-review > goldenEval.jsonl"
309
 
310
  - name: "import-training-data"
311
  help: |
312
- Import the training data into Prodigy from a JSONL file named 'train200.jsonl'.
313
 
314
  Usage:
315
  ```
@@ -317,15 +303,11 @@ commands:
317
  ```
318
 
319
  Explanation:
320
- - The command imports the training data into Prodigy from the specified JSONL file.
321
- - The data is imported into the Prodigy database associated with the project named 'prodigy3train'.
322
- - This command prepares the training data for annotation and model training in Prodigy.
323
- script:
324
- - "prodigy db-in prodigy3train train200.jsonl"
325
 
326
  - name: "import-golden-evaluation-data"
327
  help: |
328
- Import the golden evaluation data into Prodigy from a JSONL file named 'goldeneval.jsonl'.
329
 
330
  Usage:
331
  ```
@@ -333,15 +315,11 @@ commands:
333
  ```
334
 
335
  Explanation:
336
- - The command imports the golden evaluation data into Prodigy from the specified JSONL file.
337
- - The data is imported into the Prodigy database associated with the project named 'golden3'.
338
- - This command prepares the golden evaluation data for further analysis and model evaluation in Prodigy.
339
- script:
340
- - "prodigy db-in golden3 goldeneval.jsonl"
341
 
342
  - name: "train-model-experiment1"
343
  help: |
344
- Train a text classification model using Prodigy with the 'prodigy3train' dataset and evaluating on 'golden3'.
345
 
346
  Usage:
347
  ```
@@ -349,15 +327,13 @@ commands:
349
  ```
350
 
351
  Explanation:
352
- - The command trains a text classification model using Prodigy.
353
- - It uses the 'prodigy3train' dataset for training and evaluates the model on the 'golden3' dataset.
354
- - The trained model is saved to the './output/experiment1' directory.
355
- script:
356
- - "python3 -m prodigy train --textcat-multilabel prodigy3train,eval:golden3 ./output/experiment1"
357
 
358
  - name: "download-model"
359
  help: |
360
- Download the English language model 'en_core_web_lg' from spaCy.
361
 
362
  Usage:
363
  ```
@@ -365,14 +341,12 @@ commands:
365
  ```
366
 
367
  Explanation:
368
- - The command downloads the English language model 'en_core_web_lg' from spaCy.
369
- - This model is used as the base model for further data processing and training in the project.
370
- script:
371
- - "python3 -m spacy download en_core_web_lg"
372
 
373
  - name: "convert-data-to-spacy-format"
374
  help: |
375
- Convert the annotated data from Prodigy to spaCy format using the 'prodigy3train' and 'golden3' datasets.
376
 
377
  Usage:
378
  ```
@@ -380,15 +354,12 @@ commands:
380
  ```
381
 
382
  Explanation:
383
- - The command converts the annotated data from Prodigy to spaCy format.
384
- - It uses the 'prodigy3train' and 'golden3' datasets for conversion.
385
- - The converted data is saved to the './corpus' directory with the base model 'en_core_web_lg'.
386
- script:
387
- - "python3 -m prodigy data-to-spacy --textcat-multilabel prodigy3train,eval:golden3 ./corpus --base-model en_core_web_lg"
388
 
389
  - name: "train-custom-model"
390
  help: |
391
- Train a custom text classification model using spaCy with the converted data in spaCy format.
392
 
393
  Usage:
394
  ```
@@ -396,8 +367,6 @@ commands:
396
  ```
397
 
398
  Explanation:
399
- - The command trains a custom text classification model using spaCy.
400
- - It uses the converted data in spaCy format located in the './corpus' directory.
401
- - The model is trained using the configuration defined in 'corpus/config.cfg'.
402
- script:
403
- - "python -m spacy train corpus/config.cfg --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"
 
7
  - machine learning
8
  - natural language processing
9
  - huggingface
10
+
11
  vars:
12
  lang: "en"
13
  train: corpus/train.spacy
 
21
  ner_manual_labels: ecfr_manual_ner
22
  senter_labels: ecfr_labeled_sents
23
  ner_labeled_dataset: ecfr_labeled_ner
24
 
25
+ directories:
26
+ - corpus/labels
27
+ - data
28
+ - my_trained_model/textcat_multilabel
29
+ - my_trained_model/vocab
30
+ - output/experiment1/model-best/textcat_multilabel
31
+ - output/experiment1/model-best/vocab
32
+ - output/experiment1/model-last/textcat_multilabel
33
+ - output/experiment1/model-last/vocab
34
+ - output/experiment3/model-best/textcat_multilabel
35
+ - output/experiment3/model-best/vocab
36
+ - output/experiment3/model-last/textcat_multilabel
37
+ - output/experiment3/model-last/vocab
38
+ - python_Code
39
 
40
  assets:
41
  - dest: "corpus/labels/ner.json"
 
210
  Explanation:
211
  - The script `firstStep-format.py` reads data from the file specified in the `dataset_file` variable (`data/train200.jsonl` by default).
212
  - It extracts text and labels from each JSON object in the dataset file.
213
+ - If both text and at least one label are available, it writes a new JSON object to the output file specified in the `output_file` variable (`data/firstStep_file.jsonl` by default) with the extracted text and labels.
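The filtering behaviour described in this explanation (read each JSON object, keep it only when it has text and at least one label) might look like the sketch below. It is a minimal illustration, not the contents of `firstStep-format.py`: the function name and the `text` and `accept` keys are assumptions, since the diff does not show the record layout.

```python
import json


def format_records(dataset_file, output_file, text_key="text", label_key="accept"):
    """Copy records that have text and at least one label.

    `text_key` and `label_key` are assumed field names; adjust them to the
    actual dataset. Returns the number of records written.
    """
    written = 0
    with open(dataset_file, encoding="utf-8") as src, \
            open(output_file, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # ignore blank lines in the JSONL file
            record = json.loads(line)
            text = record.get(text_key)
            labels = record.get(label_key)
            # Keep the record only if both text and a non-empty label list exist.
            if text and labels:
                dst.write(json.dumps({text_key: text, label_key: labels}) + "\n")
                written += 1
    return written
```

With the defaults named in the explanation, a call would be `format_records("data/train200.jsonl", "data/firstStep_file.jsonl")`.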
 
 
 
 
214
 
215
  - name: "train-text-classification-model"
216
  help: |
217
+ Train a text classification model using spaCy.
218
 
219
  Usage:
220
  ```
 
222
  ```
223
 
224
  Explanation:
225
+ - This command trains a text classification model using the spaCy library based on the configuration provided in the `textcat_multilabel.cfg` file.
226
+ - The model is trained on the data specified in the `train` and `dev` variables (`corpus/train.spacy` and `corpus/dev.spacy` by default).
227
+ - The trained model is saved to the directory specified in the `output_model_dir` variable (`my_trained_model/textcat_multilabel/model` by default).
 
 
 
 
 
228
 
229
  - name: "classify-unlabeled-data"
230
  help: |
231
+ Classify unlabeled data using a trained text classification model.
232
 
233
  Usage:
234
  ```
 
236
  ```
237
 
238
  Explanation:
239
+ - This command loads the trained text classification model from the directory specified in the `model_dir` variable (`my_trained_model/textcat_multilabel/model` by default).
240
+ - It classifies unlabeled data from the file specified in the `unlabeled_data_file` variable (`data/thirdStep_file.jsonl` by default).
241
+ - The classified data is saved to the file specified in the `classified_data_file` variable (`data/classified_data.jsonl` by default).
 
 
 
 
242
 
243
  - name: "format-labeled-data"
244
  help: |
245
+ Execute the Python script `finalStep-formatLabel.py`, which performs the final formatting of labeled data for the last step of the project. This script converts labeled data from the JSONL format used by Prodigy to the JSONL format used by spaCy.
246
 
247
  Usage:
248
  ```
 
250
  ```
251
 
252
  Explanation:
253
+ - The script `finalStep-formatLabel.py` reads labeled data from the file specified in the `labeled_data_file` variable (`data/thirdStep_file.jsonl` by default).
254
+ - It converts the labeled data from Prodigy's JSONL format to spaCy's JSONL format.
255
+ - The converted data is saved to the file specified in the `formatted_data_file` variable (`data/fourthStep_file.jsonl` by default).
256
+
 
 
 
257
  - name: "setup-environment"
258
  help: |
259
+ Set up the Python environment for the project using pip and the provided requirements.txt file.
 
 
 
260
 
261
+ Usage:
262
+ ```
263
+ spacy project run setup-environment
264
+ ```
265
+
266
+ Explanation:
267
+ - This command installs the required Python packages listed in the `requirements.txt` file using pip.
268
+
269
  - name: "review-evaluation-data"
270
  help: |
271
+ Review the evaluation data using Prodigy.
272
 
273
  Usage:
274
  ```
 
276
  ```
277
 
278
  Explanation:
279
+ - This command launches Prodigy to review the evaluation data.
280
+ - Prodigy loads the evaluation data from the file specified in the `eval_data_file` variable (`data/eval.jsonl` by default).
281
+ - You can review the data and annotate it as needed using Prodigy's user interface.
 
 
282
 
283
  - name: "export-reviewed-evaluation-data"
284
  help: |
285
+ Export the reviewed evaluation data from Prodigy.
286
 
287
  Usage:
288
  ```
 
290
  ```
291
 
292
  Explanation:
293
+ - This command exports the reviewed evaluation data from Prodigy to a JSONL file.
294
+ - Prodigy exports the reviewed data to the file specified in the `exported_eval_data_file` variable (`data/goldenEval.jsonl` by default).
 
 
 
 
295
 
296
  - name: "import-training-data"
297
  help: |
298
+ Import training data into Prodigy.
299
 
300
  Usage:
301
  ```
 
303
  ```
304
 
305
  Explanation:
306
+ - This command imports training data into Prodigy from the file specified in the `training_data_file` variable (`data/fourthStep_file.jsonl` by default).
 
 
 
 
307
 
308
  - name: "import-golden-evaluation-data"
309
  help: |
310
+ Import golden evaluation data into Prodigy.
311
 
312
  Usage:
313
  ```
 
315
  ```
316
 
317
  Explanation:
318
+ - This command imports golden evaluation data into Prodigy from the file specified in the `golden_evaluation_data_file` variable (`data/goldenEval.jsonl` by default).
 
 
 
 
319
 
320
  - name: "train-model-experiment1"
321
  help: |
322
+ Train a text classification model with different configurations for experiment 1.
323
 
324
  Usage:
325
  ```
 
327
  ```
328
 
329
  Explanation:
330
+ - This command trains a text classification model using different configurations specified in the `experiment1_configs` list in the `config.cfg` file.
331
+ - The model is trained on the data specified in the `train` and `dev` variables (`corpus/train.spacy` and `corpus/dev.spacy` by default).
332
+ - The trained models are saved to the directories specified in the `output_model_dir` variable (`output/experiment1/model-last/textcat_multilabel/model` and `output/experiment1/model-best/textcat_multilabel/model` by default).
 
 
333
 
334
  - name: "download-model"
335
  help: |
336
+ Download a trained text classification model.
337
 
338
  Usage:
339
  ```
 
341
  ```
342
 
343
  Explanation:
344
+ - This command downloads a trained text classification model from the URL specified in the `model_url` variable (`https://example.com/model.tar.gz` by default).
345
+ - The downloaded model is saved to the directory specified in the `output_model_dir` variable (`models` by default).
 
 
346
 
347
  - name: "convert-data-to-spacy-format"
348
  help: |
349
+ Convert data to spaCy's JSONL format.
350
 
351
  Usage:
352
  ```
 
354
  ```
355
 
356
  Explanation:
357
+ - This command converts data from Prodigy's JSONL format to spaCy's JSONL format.
358
+ - It reads data from the file specified in the `prodigy_data_file` variable (`data/ner_dataset.jsonl` by default) and writes the converted data to the file specified in the `spacy_data_file` variable (`data/ner_dataset_spacy.jsonl` by default).
 
 
 
359
 
360
  - name: "train-custom-model"
361
  help: |
362
+ Train a custom NER model using spaCy.
363
 
364
  Usage:
365
  ```
 
367
  ```
368
 
369
  Explanation:
370
+ - This command trains a custom NER model using spaCy based on the configuration provided in the `config.cfg` file.
371
+ - The model is trained on the data specified in the `train` and `dev` variables (`corpus/train.spacy` and `corpus/dev.spacy` by default).
372
+ - The trained model is saved to the directory specified in the `output_model_dir` variable (`my_trained_model` by default).