Elron committed on
Commit
100c2eb
1 Parent(s): 0a1b314

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,50 +1,75 @@
1
- ---
2
- title: Metric
3
- datasets:
4
- - none
5
- tags:
6
- - evaluate
7
- - metric
8
- description: "TODO: add a description here"
9
- sdk: gradio
10
- sdk_version: 3.19.1
11
- app_file: app.py
12
- pinned: false
13
- ---
14
 
15
- # Metric Card for Metric
 
 
 
 
 
 
 
16
 
17
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
18
 
19
- ## Metric Description
20
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
21
 
22
- ## How to Use
23
- *Give general statement of how to use the metric*
24
 
25
- *Provide simplest possible example for using the metric*
 
 
 
 
 
 
 
26
 
27
- ### Inputs
28
- *List all input arguments in the format below*
29
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
30
 
31
- ### Output Values
32
 
33
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
34
 
35
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
 
 
 
 
36
 
37
- #### Values from Popular Papers
38
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
39
 
40
- ### Examples
41
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
 
 
 
 
 
 
42
 
43
- ## Limitations and Bias
44
- *Note any known limitations or biases that the metric has, with links and references if possible.*
45
 
46
- ## Citation
47
- *Cite the source where this metric was introduced.*
48
 
49
- ## Further References
50
- *Add any useful further references.*
 
1
+ <div align="center">
2
+ <img src="./assets/banner.png" alt="Image Description" width="100%" />
3
+ </div>
 
 
 
 
 
 
 
 
 
 
4
 
5
+ [![Button](https://img.shields.io/badge/Video-pink?style=for-the-badge)](https://unitxt.readthedocs.io/)
6
+ [![Button](https://img.shields.io/badge/Demo-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/docs/demo.html)
7
+ [![Button](https://img.shields.io/badge/Tutorial-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/docs/adding_dataset.html)
8
+ [![Button](https://img.shields.io/badge/Paper-pink?style=for-the-badge)](https://arxiv.org/abs/2401.14019)
9
+ [![Button](https://img.shields.io/badge/Documentation-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/modules.html)
10
+ [![Button](https://img.shields.io/badge/Catalog-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/catalog/catalog.__dir__.html)
11
+ [![Button](https://img.shields.io/badge/Contributors-pink?style=for-the-badge)](https://github.com/IBM/unitxt/blob/main/CONTRIBUTING.md)
12
+ [![Button](https://img.shields.io/badge/PyPi-pink?style=for-the-badge)](https://pypi.org/project/unitxt/)
13
 
 
14
 
15
+ In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution.
 
16
 
17
+ Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively.
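For orientation, here is a minimal, illustrative sketch of the intended workflow; the card and template names are examples from the catalog, and the exact API surface may differ between versions:

```python
# Illustrative sketch only: the recipe string, card, and template names are examples.
from unitxt import evaluate, load_dataset

dataset = load_dataset(
    "card=cards.wnli,template=templates.classification.multi_class.relation.default"
)
test_set = list(dataset["test"])

predictions = ["entailment" for _ in test_set]  # placeholder model outputs
results = evaluate(predictions=predictions, data=test_set)
```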
 
18
 
19
+ #
20
+ [![version](https://img.shields.io/pypi/v/unitxt)](https://pypi.org/project/unitxt/)
21
+ ![license](https://img.shields.io/github/license/ibm/unitxt)
22
+ ![python](https://img.shields.io/badge/python-3.8%20|%203.9-blue)
23
+ ![tests](https://img.shields.io/github/actions/workflow/status/ibm/unitxt/library_tests.yml?branch=main&label=tests)
24
+ [![codecov](https://codecov.io/gh/IBM/unitxt/branch/main/graph/badge.svg?token=mlrWq9cwz3)](https://codecov.io/gh/IBM/unitxt)
25
+ ![Read the Docs](https://img.shields.io/readthedocs/unitxt)
26
+ [![downloads](https://static.pepy.tech/personalized-badge/unitxt?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/unitxt)
27
 
28
+ #
 
 
29
 
30
+ https://github.com/IBM/unitxt/assets/23455264/baef9131-39d4-4164-90b2-05da52919fdf
31
 
32
+ ### 🦄 Currently on Unitxt Catalog
33
 
34
+ ![NLP Tasks](https://img.shields.io/badge/NLP_tasks-40-blue)
35
+ ![Dataset Cards](https://img.shields.io/badge/Dataset_Cards-457-blue)
36
+ ![Templates](https://img.shields.io/badge/Templates-229-blue)
37
+ ![Formats](https://img.shields.io/badge/Formats-18-blue)
38
+ ![Metrics](https://img.shields.io/badge/Metrics-98-blue)
39
 
40
+ ### 🦄 Run Unitxt Exploration Dashboard
 
41
 
42
+ To launch the Unitxt graphical user interface, first install Unitxt with the UI requirements:
43
+ ```
44
+ pip install unitxt[ui]
45
+ ```
46
+ Then launch the UI by running:
47
+ ```
48
+ unitxt-explore
49
+ ```
50
 
51
+ # 🦄 Contributors
 
52
 
53
+ Please install Unitxt from source by running:
54
+ ```
55
+ git clone git@github.com:IBM/unitxt.git
56
+ cd unitxt
57
+ pip install -e ".[dev]"
58
+ pre-commit install
59
+ ```
60
+
61
+ # 🦄 Citation
62
+
63
+ If you use Unitxt in your research, please cite our paper:
64
+
65
+ ```
66
+ @misc{unitxt,
67
+ title={Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI},
68
+ author={Elron Bandel and Yotam Perlitz and Elad Venezian and Roni Friedman-Melamed and Ofir Arviv and Matan Orbach and Shachar Don-Yehyia and Dafna Sheinwald and Ariel Gera and Leshem Choshen and Michal Shmueli-Scheuer and Yoav Katz},
69
+ year={2024},
70
+ eprint={2401.14019},
71
+ archivePrefix={arXiv},
72
+ primaryClass={cs.CL}
73
+ }
74
+ ```
75
 
 
 
blocks.py CHANGED
@@ -22,8 +22,11 @@ from .splitters import RandomSampler, SliceSplit, SplitRandomMix, SpreadSplit
22
  from .stream import MultiStream
23
  from .struct_data_operators import (
24
  ListToKeyValPairs,
 
25
  SerializeKeyValPairs,
 
26
  SerializeTableAsIndexedRowMajor,
 
27
  SerializeTableAsMarkdown,
28
  SerializeTableRowAsList,
29
  SerializeTableRowAsText,
 
22
  from .stream import MultiStream
23
  from .struct_data_operators import (
24
  ListToKeyValPairs,
25
+ MapHTMLTableToJSON,
26
  SerializeKeyValPairs,
27
+ SerializeTableAsDFLoader,
28
  SerializeTableAsIndexedRowMajor,
29
+ SerializeTableAsJson,
30
  SerializeTableAsMarkdown,
31
  SerializeTableRowAsList,
32
  SerializeTableRowAsText,
dataset.py CHANGED
@@ -10,7 +10,6 @@ from .catalog import __file__ as _
10
  from .collections import __file__ as _
11
  from .collections_operators import __file__ as _
12
  from .dataclass import __file__ as _
13
- from .dataset_utils import __file__ as _
14
  from .dataset_utils import get_dataset_artifact
15
  from .deprecation_utils import __file__ as _
16
  from .dialog_operators import __file__ as _
@@ -20,13 +19,11 @@ from .file_utils import __file__ as _
20
  from .formats import __file__ as _
21
  from .fusion import __file__ as _
22
  from .generator_utils import __file__ as _
23
- from .hf_utils import __file__ as _
24
  from .hf_utils import verify_versions_compatibility
25
  from .inference import __file__ as _
26
  from .instructions import __file__ as _
27
  from .llm_as_judge import __file__ as _
28
  from .loaders import __file__ as _
29
- from .logging_utils import __file__ as _
30
  from .logging_utils import get_logger
31
  from .metric import __file__ as _
32
  from .metric_utils import __file__ as _
@@ -40,13 +37,13 @@ from .random_utils import __file__ as _
40
  from .recipe import __file__ as _
41
  from .register import __file__ as _
42
  from .schema import __file__ as _
43
- from .settings_utils import __file__ as _
44
  from .settings_utils import get_constants
45
  from .span_lableing_operators import __file__ as _
46
  from .split_utils import __file__ as _
47
  from .splitters import __file__ as _
48
  from .standard import __file__ as _
49
  from .stream import __file__ as _
 
50
  from .string_operators import __file__ as _
51
  from .struct_data_operators import __file__ as _
52
  from .system_prompts import __file__ as _
@@ -54,7 +51,6 @@ from .task import __file__ as _
54
  from .templates import __file__ as _
55
  from .text_utils import __file__ as _
56
  from .type_utils import __file__ as _
57
- from .utils import __file__ as _
58
  from .utils import is_package_installed
59
  from .validate import __file__ as _
60
  from .version import __file__ as _
@@ -75,8 +71,9 @@ class Dataset(datasets.GeneratorBasedBuilder):
75
  if is_package_installed("unitxt"):
76
  verify_versions_compatibility("dataset", self.VERSION)
77
 
78
- from unitxt.dataset_utils import \
79
- get_dataset_artifact as get_dataset_artifact_installed
 
80
 
81
  logger.info("Loading with installed unitxt library...")
82
  dataset = get_dataset_artifact_installed(self.config.name)
 
10
  from .collections import __file__ as _
11
  from .collections_operators import __file__ as _
12
  from .dataclass import __file__ as _
 
13
  from .dataset_utils import get_dataset_artifact
14
  from .deprecation_utils import __file__ as _
15
  from .dialog_operators import __file__ as _
 
19
  from .formats import __file__ as _
20
  from .fusion import __file__ as _
21
  from .generator_utils import __file__ as _
 
22
  from .hf_utils import verify_versions_compatibility
23
  from .inference import __file__ as _
24
  from .instructions import __file__ as _
25
  from .llm_as_judge import __file__ as _
26
  from .loaders import __file__ as _
 
27
  from .logging_utils import get_logger
28
  from .metric import __file__ as _
29
  from .metric_utils import __file__ as _
 
37
  from .recipe import __file__ as _
38
  from .register import __file__ as _
39
  from .schema import __file__ as _
 
40
  from .settings_utils import get_constants
41
  from .span_lableing_operators import __file__ as _
42
  from .split_utils import __file__ as _
43
  from .splitters import __file__ as _
44
  from .standard import __file__ as _
45
  from .stream import __file__ as _
46
+ from .stream_operators import __file__ as _
47
  from .string_operators import __file__ as _
48
  from .struct_data_operators import __file__ as _
49
  from .system_prompts import __file__ as _
 
51
  from .templates import __file__ as _
52
  from .text_utils import __file__ as _
53
  from .type_utils import __file__ as _
 
54
  from .utils import is_package_installed
55
  from .validate import __file__ as _
56
  from .version import __file__ as _
 
71
  if is_package_installed("unitxt"):
72
  verify_versions_compatibility("dataset", self.VERSION)
73
 
74
+ from unitxt.dataset_utils import (
75
+ get_dataset_artifact as get_dataset_artifact_installed,
76
+ )
77
 
78
  logger.info("Loading with installed unitxt library...")
79
  dataset = get_dataset_artifact_installed(self.config.name)
fusion.py CHANGED
@@ -4,7 +4,7 @@ from typing import Dict, Generator, List, Optional, Union
4
  from .dataclass import NonPositionalField
5
  from .operator import SourceOperator
6
  from .random_utils import new_random_generator
7
- from .stream import GeneratorStream, MultiStream
8
  from .type_utils import isoftype
9
 
10
 
@@ -49,7 +49,7 @@ class BaseFusion(SourceOperator):
49
  ) -> MultiStream:
50
  result = {}
51
  for split in self.splits():
52
- result[split] = GeneratorStream(
53
  self.fusion_generator, gen_kwargs={"split": split}
54
  )
55
  return MultiStream(result)
 
4
  from .dataclass import NonPositionalField
5
  from .operator import SourceOperator
6
  from .random_utils import new_random_generator
7
+ from .stream import DynamicStream, MultiStream
8
  from .type_utils import isoftype
9
 
10
 
 
49
  ) -> MultiStream:
50
  result = {}
51
  for split in self.splits():
52
+ result[split] = DynamicStream(
53
  self.fusion_generator, gen_kwargs={"split": split}
54
  )
55
  return MultiStream(result)
inference.py CHANGED
@@ -121,8 +121,7 @@ class IbmGenAiInferenceEngine(InferenceEngine, PackageRequirementsMixin):
121
  f"Error while trying to run IbmGenAiInferenceEngine."
122
  f" Please set the environment param '{api_key_env_var_name}'."
123
  )
124
- api_endpoint = os.environ.get("GENAI_KEY")
125
- credentials = Credentials(api_key=api_key, api_endpoint=api_endpoint)
126
  self.client = Client(credentials=credentials)
127
 
128
  def _infer(self, dataset):
@@ -141,13 +140,14 @@ class IbmGenAiInferenceEngine(InferenceEngine, PackageRequirementsMixin):
141
  decoding_method=self.parameters.decoding_method,
142
  )
143
 
144
- return list(
145
- self.client.text.generation.create(
 
146
  model_id=self.model_name,
147
  inputs=[instance["source"] for instance in dataset],
148
  parameters=genai_params,
149
  )
150
- )
151
 
152
 
153
  class OpenAiInferenceEngineParams(Artifact):
 
121
  f"Error while trying to run IbmGenAiInferenceEngine."
122
  f" Please set the environment param '{api_key_env_var_name}'."
123
  )
124
+ credentials = Credentials(api_key=api_key)
 
125
  self.client = Client(credentials=credentials)
126
 
127
  def _infer(self, dataset):
 
140
  decoding_method=self.parameters.decoding_method,
141
  )
142
 
143
+ return [
144
+ response.results[0].generated_text
145
+ for response in self.client.text.generation.create(
146
  model_id=self.model_name,
147
  inputs=[instance["source"] for instance in dataset],
148
  parameters=genai_params,
149
  )
150
+ ]
151
 
152
 
153
  class OpenAiInferenceEngineParams(Artifact):
llm_as_judge.py CHANGED
@@ -135,4 +135,8 @@ class LLMAsJudge(BulkInstanceMetric):
135
  dataset = produce(instances, recipe)
136
  verdicts = self.inference_model.infer(dataset)
137
  meta_scores = evaluate(predictions=verdicts, data=dataset)
138
- return [{self.main_score: instance["prediction"]} for instance in meta_scores]
 
 
 
 
 
135
  dataset = produce(instances, recipe)
136
  verdicts = self.inference_model.infer(dataset)
137
  meta_scores = evaluate(predictions=verdicts, data=dataset)
138
+ return [
139
+ {self.main_score: instance["prediction"], "judge_raw_output": verdict}
140
+ for instance in meta_scores
141
+ for verdict in verdicts
142
+ ]
loaders.py CHANGED
@@ -30,6 +30,7 @@ Available Loaders Overview:
30
 
31
  ------------------------
32
  """
 
33
  import itertools
34
  import os
35
  import tempfile
@@ -41,6 +42,7 @@ from typing import Any, Dict, List, Mapping, Optional, Sequence, Union
41
 
42
  import pandas as pd
43
  from datasets import load_dataset as hf_load_dataset
 
44
  from tqdm import tqdm
45
 
46
  from .dataclass import InternalField, OptionalField
@@ -49,7 +51,7 @@ from .logging_utils import get_logger
49
  from .operator import SourceOperator
50
  from .operators import AddFields
51
  from .settings_utils import get_settings
52
- from .stream import GeneratorStream, MultiStream
53
 
54
  logger = get_logger()
55
  settings = get_settings()
@@ -259,7 +261,7 @@ class LoadHF(Loader):
259
  self.log_limited_loading()
260
  return MultiStream(
261
  {
262
- name: GeneratorStream(
263
  generator=self.split_limited_load, gen_kwargs={"split_name": name}
264
  )
265
  for name in self._cache.keys()
@@ -349,7 +351,7 @@ class LoadCSV(Loader):
349
  if self.streaming:
350
  return MultiStream(
351
  {
352
- name: GeneratorStream(
353
  generator=self.stream_csv, gen_kwargs={"file": file}
354
  )
355
  for name, file in self.files.items()
@@ -358,9 +360,7 @@ class LoadCSV(Loader):
358
 
359
  return MultiStream(
360
  {
361
- name: GeneratorStream(
362
- generator=self.load_csv, gen_kwargs={"file": file}
363
- )
364
  for name, file in self.files.items()
365
  }
366
  )
@@ -385,7 +385,6 @@ class LoadFromSklearn(Loader):
385
 
386
  dataset_name: str
387
  splits: List[str] = ["train", "test"]
388
- data_classification_policy = ["public"]
389
 
390
  _requirements_list: List[str] = ["sklearn", "pandas"]
391
 
@@ -683,8 +682,10 @@ class LoadFromDictionary(Loader):
683
  .. code-block:: python
684
 
685
  data = {
686
- "train": {"input": "SomeInput1", "output": "SomeResult1"},
687
- "test": {"input": "SomeInput2", "output": "SomeResult2"},
 
 
688
  }
689
  loader = LoadFromDictionary(data=data)
690
  """
@@ -794,18 +795,79 @@ class LoadFromHFSpace(LoadHF):
794
  else:
795
  data_files = self.data_files
796
 
 
797
  for files in data_files:
798
  if isinstance(files, str):
799
  files = [files]
800
- # All files - within the same space - are downloaded into the same base directory:
801
  paths = [self._download_file_from_space(file) for file in files]
802
- dir_path = paths[0].replace(files[0], "")
803
 
804
- return dir_path
805
 
806
  def load_data(self):
807
  self.sef_default_data_classification(
808
  ["public"], "when loading from Huggingface spaces"
809
  )
 
810
  self.path = self._download_data()
811
  return super().load_data()
 
30
 
31
  ------------------------
32
  """
33
+ import fnmatch
34
  import itertools
35
  import os
36
  import tempfile
 
42
 
43
  import pandas as pd
44
  from datasets import load_dataset as hf_load_dataset
45
+ from huggingface_hub import HfApi
46
  from tqdm import tqdm
47
 
48
  from .dataclass import InternalField, OptionalField
 
51
  from .operator import SourceOperator
52
  from .operators import AddFields
53
  from .settings_utils import get_settings
54
+ from .stream import DynamicStream, MultiStream
55
 
56
  logger = get_logger()
57
  settings = get_settings()
 
261
  self.log_limited_loading()
262
  return MultiStream(
263
  {
264
+ name: DynamicStream(
265
  generator=self.split_limited_load, gen_kwargs={"split_name": name}
266
  )
267
  for name in self._cache.keys()
 
351
  if self.streaming:
352
  return MultiStream(
353
  {
354
+ name: DynamicStream(
355
  generator=self.stream_csv, gen_kwargs={"file": file}
356
  )
357
  for name, file in self.files.items()
 
360
 
361
  return MultiStream(
362
  {
363
+ name: DynamicStream(generator=self.load_csv, gen_kwargs={"file": file})
 
 
364
  for name, file in self.files.items()
365
  }
366
  )
 
385
 
386
  dataset_name: str
387
  splits: List[str] = ["train", "test"]
 
388
 
389
  _requirements_list: List[str] = ["sklearn", "pandas"]
390
 
 
682
  .. code-block:: python
683
 
684
  data = {
685
+ "train": [{"input": "SomeInput1", "output": "SomeResult1"},
686
+ {"input": "SomeInput2", "output": "SomeResult2"}],
687
+ "test": [{"input": "SomeInput3", "output": "SomeResult3"},
688
+ {"input": "SomeInput4", "output": "SomeResult4"}]
689
  }
690
  loader = LoadFromDictionary(data=data)
691
  """
 
795
  else:
796
  data_files = self.data_files
797
 
798
+ dir_paths_list = []
799
  for files in data_files:
800
  if isinstance(files, str):
801
  files = [files]
802
+
803
  paths = [self._download_file_from_space(file) for file in files]
804
+ dir_paths = [
805
+ path.replace(file_url, "") for path, file_url in zip(paths, files)
806
+ ]
807
+ dir_paths_list.extend(dir_paths)
808
+
809
+ # All files - within the same space - are downloaded into the same base directory:
810
+ assert len(set(dir_paths_list)) == 1
811
+
812
+ return f"{dir_paths_list.pop()}"
813
+
814
+ @staticmethod
815
+ def _is_wildcard(path: str) -> bool:
816
+ wildcard_characters = ["*", "?", "[", "]"]
817
+ return any(char in path for char in wildcard_characters)
818
+
819
+ def _get_file_list_from_wildcard_path(
820
+ self, pattern: str, repo_files: List
821
+ ) -> List[str]:
822
+ if self._is_wildcard(pattern):
823
+ return fnmatch.filter(repo_files, pattern)
824
+ return [pattern]
825
+
826
+ def _map_wildcard_path_to_full_paths(self):
827
+ api = HfApi()
828
+ repo_files = api.list_repo_files(self.space_name, repo_type="space")
829
+ if isinstance(self.data_files, str):
830
+ self.data_files = self._get_file_list_from_wildcard_path(
831
+ self.data_files, repo_files
832
+ )
833
+ elif isinstance(self.data_files, Mapping):
834
+ new_mapping = {}
835
+ for k, v in self.data_files.items():
836
+ if isinstance(v, list):
837
+ assert all(isinstance(s, str) for s in v)
838
+ new_mapping[k] = [
839
+ file
840
+ for p in v
841
+ for file in self._get_file_list_from_wildcard_path(
842
+ p, repo_files
843
+ )
844
+ ]
845
+ elif isinstance(v, str):
846
+ new_mapping[k] = self._get_file_list_from_wildcard_path(
847
+ v, repo_files
848
+ )
849
+ else:
850
+ raise NotImplementedError(
851
+ f"Loader does not support input 'data_files' of type Mapping[{type(v)}]"
852
+ )
853
 
854
+ self.data_files = new_mapping
855
+ elif isinstance(self.data_files, list):
856
+ assert all(isinstance(s, str) for s in self.data_files)
857
+ self.data_files = [
858
+ file
859
+ for p in self.data_files
860
+ for file in self._get_file_list_from_wildcard_path(p, repo_files)
861
+ ]
862
+ else:
863
+ raise NotImplementedError(
864
+ f"Loader does not support input 'data_files' of type {type(self.data_files)}"
865
+ )
866
 
867
  def load_data(self):
868
  self.sef_default_data_classification(
869
  ["public"], "when loading from Huggingface spaces"
870
  )
871
+ self._map_wildcard_path_to_full_paths()
872
  self.path = self._download_data()
873
  return super().load_data()
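To illustrate the wildcard handling added to LoadFromHFSpace above, here is a rough standalone sketch; the repo file names and patterns are hypothetical:

```python
# Hypothetical space file listing; mirrors _get_file_list_from_wildcard_path's logic.
import fnmatch

repo_files = ["data/train_00.csv", "data/train_01.csv", "data/test.csv"]
data_files = {"train": "data/train_*.csv", "test": "data/test.csv"}

def expand(pattern):
    # Only patterns containing wildcard characters are expanded against the repo listing.
    if any(char in pattern for char in "*?[]"):
        return fnmatch.filter(repo_files, pattern)
    return [pattern]

print({split: expand(pattern) for split, pattern in data_files.items()})
# {'train': ['data/train_00.csv', 'data/train_01.csv'], 'test': ['data/test.csv']}
```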
metric.py CHANGED
@@ -19,16 +19,13 @@ from .file_utils import __file__ as _
19
  from .formats import __file__ as _
20
  from .fusion import __file__ as _
21
  from .generator_utils import __file__ as _
22
- from .hf_utils import __file__ as _
23
  from .hf_utils import verify_versions_compatibility
24
  from .inference import __file__ as _
25
  from .instructions import __file__ as _
26
  from .llm_as_judge import __file__ as _
27
  from .loaders import __file__ as _
28
  from .logging_utils import __file__ as _
29
- from .metric_utils import UNITXT_METRIC_SCHEMA
30
- from .metric_utils import __file__ as _
31
- from .metric_utils import _compute
32
  from .metrics import __file__ as _
33
  from .normalizers import __file__ as _
34
  from .operator import __file__ as _
@@ -39,13 +36,13 @@ from .random_utils import __file__ as _
39
  from .recipe import __file__ as _
40
  from .register import __file__ as _
41
  from .schema import __file__ as _
42
- from .settings_utils import __file__ as _
43
  from .settings_utils import get_constants
44
  from .span_lableing_operators import __file__ as _
45
  from .split_utils import __file__ as _
46
  from .splitters import __file__ as _
47
  from .standard import __file__ as _
48
  from .stream import __file__ as _
 
49
  from .string_operators import __file__ as _
50
  from .struct_data_operators import __file__ as _
51
  from .system_prompts import __file__ as _
@@ -53,7 +50,6 @@ from .task import __file__ as _
53
  from .templates import __file__ as _
54
  from .text_utils import __file__ as _
55
  from .type_utils import __file__ as _
56
- from .utils import __file__ as _
57
  from .utils import is_package_installed
58
  from .validate import __file__ as _
59
  from .version import __file__ as _
 
19
  from .formats import __file__ as _
20
  from .fusion import __file__ as _
21
  from .generator_utils import __file__ as _
 
22
  from .hf_utils import verify_versions_compatibility
23
  from .inference import __file__ as _
24
  from .instructions import __file__ as _
25
  from .llm_as_judge import __file__ as _
26
  from .loaders import __file__ as _
27
  from .logging_utils import __file__ as _
28
+ from .metric_utils import UNITXT_METRIC_SCHEMA, _compute
 
 
29
  from .metrics import __file__ as _
30
  from .normalizers import __file__ as _
31
  from .operator import __file__ as _
 
36
  from .recipe import __file__ as _
37
  from .register import __file__ as _
38
  from .schema import __file__ as _
 
39
  from .settings_utils import get_constants
40
  from .span_lableing_operators import __file__ as _
41
  from .split_utils import __file__ as _
42
  from .splitters import __file__ as _
43
  from .standard import __file__ as _
44
  from .stream import __file__ as _
45
+ from .stream_operators import __file__ as _
46
  from .string_operators import __file__ as _
47
  from .struct_data_operators import __file__ as _
48
  from .system_prompts import __file__ as _
 
50
  from .templates import __file__ as _
51
  from .text_utils import __file__ as _
52
  from .type_utils import __file__ as _
 
53
  from .utils import is_package_installed
54
  from .validate import __file__ as _
55
  from .version import __file__ as _
metric_utils.py CHANGED
@@ -15,7 +15,7 @@ from .operator import (
15
  from .operators import (
16
  ApplyMetric,
17
  ApplyOperatorsField,
18
- CopyFields,
19
  FlattenInstances,
20
  MergeStreams,
21
  SplitByNestedGroup,
@@ -23,7 +23,7 @@ from .operators import (
23
  from .register import _reset_env_local_catalogs, register_all_artifacts
24
  from .schema import UNITXT_DATASET_SCHEMA
25
  from .settings_utils import get_settings
26
- from .stream import GeneratorStream, MultiStream
27
  from .struct_data_operators import LoadJson
28
 
29
 
@@ -109,7 +109,7 @@ class MultiStreamScoreMean(MultiStreamOperator):
109
 
110
  return MultiStream(
111
  {
112
- stream_name: GeneratorStream(
113
  never_peek_twice_generator,
114
  gen_kwargs={
115
  "stream_name": stream_name,
@@ -132,7 +132,7 @@ class FromPredictionsAndOriginalData(StreamInitializerOperator):
132
  ) -> MultiStream:
133
  return MultiStream(
134
  {
135
- split_name: GeneratorStream(
136
  self.zip,
137
  gen_kwargs={"predictions": predictions, "references": references},
138
  )
@@ -155,10 +155,9 @@ class MetricRecipe(SequentialOperatorInitializer):
155
  self.steps = [
156
  FromPredictionsAndOriginalData(),
157
  LoadJson(field="task_data"),
158
- CopyFields(
159
- field_to_field={
160
- "source": "task_data/source",
161
- }
162
  ),
163
  ApplyOperatorsField(
164
  operators_field="postprocessors",
 
15
  from .operators import (
16
  ApplyMetric,
17
  ApplyOperatorsField,
18
+ Copy,
19
  FlattenInstances,
20
  MergeStreams,
21
  SplitByNestedGroup,
 
23
  from .register import _reset_env_local_catalogs, register_all_artifacts
24
  from .schema import UNITXT_DATASET_SCHEMA
25
  from .settings_utils import get_settings
26
+ from .stream import DynamicStream, MultiStream
27
  from .struct_data_operators import LoadJson
28
 
29
 
 
109
 
110
  return MultiStream(
111
  {
112
+ stream_name: DynamicStream(
113
  never_peek_twice_generator,
114
  gen_kwargs={
115
  "stream_name": stream_name,
 
132
  ) -> MultiStream:
133
  return MultiStream(
134
  {
135
+ split_name: DynamicStream(
136
  self.zip,
137
  gen_kwargs={"predictions": predictions, "references": references},
138
  )
 
155
  self.steps = [
156
  FromPredictionsAndOriginalData(),
157
  LoadJson(field="task_data"),
158
+ Copy(
159
+ field="source",
160
+ to_field="task_data/source",
 
161
  ),
162
  ApplyOperatorsField(
163
  operators_field="postprocessors",
operator.py CHANGED
@@ -4,7 +4,7 @@ from typing import Any, Dict, Generator, List, Optional, Union
4
 
5
  from .artifact import Artifact
6
  from .dataclass import InternalField, NonPositionalField
7
- from .stream import GeneratorStream, MultiStream, Stream
8
  from .utils import is_module_available
9
 
10
 
@@ -170,7 +170,7 @@ def instance_generator(instance):
170
 
171
 
172
  def stream_single(instance: Dict[str, Any]) -> Stream:
173
- return GeneratorStream(
174
  generator=instance_generator, gen_kwargs={"instance": instance}
175
  )
176
 
@@ -244,7 +244,7 @@ class StreamOperator(MultiStreamOperator):
244
  def _process_single_stream(
245
  self, stream: Stream, stream_name: Optional[str] = None
246
  ) -> Stream:
247
- return GeneratorStream(
248
  self._process_stream,
249
  gen_kwargs={"stream": stream, "stream_name": stream_name},
250
  )
@@ -401,7 +401,7 @@ class InstanceOperatorValidator(InstanceOperator):
401
  try:
402
  first_instance = next(iterator)
403
  except StopIteration as e:
404
- raise StopIteration(f"Stream '{stream_name}' is empty") from e
405
  result = self._process_instance(first_instance, stream_name)
406
  self.validate(result)
407
  yield result
@@ -439,7 +439,7 @@ class InstanceOperatorWithMultiStreamAccess(StreamingOperator):
439
  result = {}
440
 
441
  for stream_name, stream in multi_stream.items():
442
- stream = GeneratorStream(
443
  self.generator,
444
  gen_kwargs={"stream": stream, "multi_stream": multi_stream},
445
  )
 
4
 
5
  from .artifact import Artifact
6
  from .dataclass import InternalField, NonPositionalField
7
+ from .stream import DynamicStream, EmptyStreamError, MultiStream, Stream
8
  from .utils import is_module_available
9
 
10
 
 
170
 
171
 
172
  def stream_single(instance: Dict[str, Any]) -> Stream:
173
+ return DynamicStream(
174
  generator=instance_generator, gen_kwargs={"instance": instance}
175
  )
176
 
 
244
  def _process_single_stream(
245
  self, stream: Stream, stream_name: Optional[str] = None
246
  ) -> Stream:
247
+ return DynamicStream(
248
  self._process_stream,
249
  gen_kwargs={"stream": stream, "stream_name": stream_name},
250
  )
 
401
  try:
402
  first_instance = next(iterator)
403
  except StopIteration as e:
404
+ raise EmptyStreamError(f"Stream '{stream_name}' is empty") from e
405
  result = self._process_instance(first_instance, stream_name)
406
  self.validate(result)
407
  yield result
 
439
  result = {}
440
 
441
  for stream_name, stream in multi_stream.items():
442
+ stream = DynamicStream(
443
  self.generator,
444
  gen_kwargs={"stream": stream, "multi_stream": multi_stream},
445
  )
operators.py CHANGED
@@ -76,7 +76,7 @@ from .operator import (
76
  )
77
  from .random_utils import new_random_generator
78
  from .settings_utils import get_settings
79
- from .stream import GeneratorStream, Stream
80
  from .text_utils import nested_tuple_to_string
81
  from .type_utils import isoftype
82
  from .utils import flatten_dict
@@ -282,6 +282,24 @@ class RemoveFields(InstanceOperator):
282
  return instance
283
 
284
285
  class InstanceFieldOperator(InstanceOperator):
286
  """A general stream instance operator that processes the values of a field (or multiple ones).
287
 
@@ -1007,7 +1025,7 @@ class Perturb(FieldOperator):
1007
  return value
1008
 
1009
 
1010
- class CopyFields(FieldOperator):
1011
  """Copies values from specified fields to specified fields.
1012
 
1013
  Args (of parent class):
@@ -1015,13 +1033,13 @@ class CopyFields(FieldOperator):
1015
 
1016
  Examples:
1017
  An input instance {"a": 2, "b": 3}, when processed by
1018
- CopyField(field_to_field={"a": "b"}
1019
  would yield {"a": 2, "b": 2}, and when processed by
1020
- CopyField(field_to_field={"a": "c"} would yield
1021
  {"a": 2, "b": 3, "c": 2}
1022
 
1023
  with field names containing / , we can also copy inside the field:
1024
- CopyFields(field_to_field={"a/0": "a"})
1025
  would process instance {"a": [1, 3]} into {"a": 1}
1026
 
1027
 
@@ -1031,6 +1049,10 @@ class CopyFields(FieldOperator):
1031
  return copy.deepcopy(value)
1032
 
1033
 
 
 
 
 
1034
  class GetItemByIndex(FieldOperator):
1035
  """Get from the item list by the index in the field."""
1036
 
@@ -1299,6 +1321,52 @@ class FilterByCondition(StreamOperator):
1299
  return True
1300
 
1301
1302
  class ComputeExpressionMixin(Artifact):
1303
  """Computes an expression expressed over fields of an instance.
1304
 
@@ -1774,7 +1842,7 @@ class MergeStreams(MultiStreamOperator):
1774
  def process(self, multi_stream: MultiStream) -> MultiStream:
1775
  return MultiStream(
1776
  {
1777
- self.new_stream_name: GeneratorStream(
1778
  self.merge, gen_kwargs={"multi_stream": multi_stream}
1779
  )
1780
  }
 
76
  )
77
  from .random_utils import new_random_generator
78
  from .settings_utils import get_settings
79
+ from .stream import DynamicStream, Stream
80
  from .text_utils import nested_tuple_to_string
81
  from .type_utils import isoftype
82
  from .utils import flatten_dict
 
282
  return instance
283
 
284
 
285
+ class SelectFields(InstanceOperator):
286
+ """Keep only specified fields from each instance in a stream.
287
+
288
+ Args:
289
+ fields (List[str]): The fields to keep from each instance.
290
+ """
291
+
292
+ fields: List[str]
293
+
294
+ def process(
295
+ self, instance: Dict[str, Any], stream_name: Optional[str] = None
296
+ ) -> Dict[str, Any]:
297
+ new_instance = {}
298
+ for selected_field in self.fields:
299
+ new_instance[selected_field] = instance[selected_field]
300
+ return new_instance
301
+
302
+
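As a quick illustration of the new SelectFields operator's effect on a single instance (plain dictionaries, outside any unitxt pipeline):

```python
# Plain-dict sketch of what SelectFields keeps from one instance.
fields = ["question", "answer"]
instance = {"question": "What is 2+2?", "answer": "4", "metadata": "dropped"}
selected = {name: instance[name] for name in fields}
print(selected)  # {'question': 'What is 2+2?', 'answer': '4'}
```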
303
  class InstanceFieldOperator(InstanceOperator):
304
  """A general stream instance operator that processes the values of a field (or multiple ones).
305
 
 
1025
  return value
1026
 
1027
 
1028
+ class Copy(FieldOperator):
1029
  """Copies values from specified fields to specified fields.
1030
 
1031
  Args (of parent class):
 
1033
 
1034
  Examples:
1035
  An input instance {"a": 2, "b": 3}, when processed by
1036
+ Copy(field_to_field={"a": "b"}
1037
  would yield {"a": 2, "b": 2}, and when processed by
1038
+ Copy(field_to_field={"a": "c"} would yield
1039
  {"a": 2, "b": 3, "c": 2}
1040
 
1041
  with field names containing / , we can also copy inside the field:
1042
+ Copy(field="a/0",to_field="a")
1043
  would process instance {"a": [1, 3]} into {"a": 1}
1044
 
1045
 
 
1049
  return copy.deepcopy(value)
1050
 
1051
 
1052
+ class CopyFields(Copy):
1053
+ pass
1054
+
1055
+
1056
  class GetItemByIndex(FieldOperator):
1057
  """Get from the item list by the index in the field."""
1058
 
 
1321
  return True
1322
 
1323
 
1324
+ class FilterByConditionBasedOnFields(FilterByCondition):
1325
+ """Filters a stream based on a condition between 2 fields values.
1326
+
1327
+ Raises an error if either of the required field names is missing from the input instance.
1328
+
1329
+ Args:
1330
+ values (Dict[str, str]): The field names that the filter operation is based on.
1331
+ condition: the name of the desired condition operator between the specified fields' values. Supported conditions are ("gt", "ge", "lt", "le", "ne", "eq", "in", "not in").
1332
+ error_on_filtered_all (bool, optional): If True, raises an error if all instances are filtered out. Defaults to True.
1333
+
1334
+ Examples:
1335
+ FilterByCondition(values = {"a":"b}, condition = "gt") will yield only instances where field "a" contains a value greater then the value in field "b".
1336
+ FilterByCondition(values = {"a":"b}, condition = "le") will yield only instances where "a"<="b"
1337
+ """
1338
+
1339
+ def _is_required(self, instance: dict) -> bool:
1340
+ for key, value in self.values.items():
1341
+ try:
1342
+ instance_key = dict_get(instance, key)
1343
+ except ValueError as ve:
1344
+ raise ValueError(
1345
+ f"Required filter field ('{key}') in FilterByCondition is not found in {instance}"
1346
+ ) from ve
1347
+ try:
1348
+ instance_value = dict_get(instance, value)
1349
+ except ValueError as ve:
1350
+ raise ValueError(
1351
+ f"Required filter field ('{value}') in FilterByCondition is not found in {instance}"
1352
+ ) from ve
1353
+ if self.condition == "in":
1354
+ if instance_key not in instance_value:
1355
+ return False
1356
+ elif self.condition == "not in":
1357
+ if instance_key in instance_value:
1358
+ return False
1359
+ else:
1360
+ func = self.condition_to_func[self.condition]
1361
+ if func is None:
1362
+ raise ValueError(
1363
+ f"Function not defined for condition '{self.condition}'"
1364
+ )
1365
+ if not func(instance_key, instance_value):
1366
+ return False
1367
+ return True
1368
+
1369
+
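Concretely, the first docstring example above behaves roughly like this (plain dictionaries, no stream machinery):

```python
# Sketch of FilterByConditionBasedOnFields(values={"a": "b"}, condition="gt"):
# keep only instances where the value of field "a" is greater than the value of field "b".
instances = [{"a": 3, "b": 1}, {"a": 1, "b": 5}]
kept = [instance for instance in instances if instance["a"] > instance["b"]]
print(kept)  # [{'a': 3, 'b': 1}]
```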
1370
  class ComputeExpressionMixin(Artifact):
1371
  """Computes an expression expressed over fields of an instance.
1372
 
 
1842
  def process(self, multi_stream: MultiStream) -> MultiStream:
1843
  return MultiStream(
1844
  {
1845
+ self.new_stream_name: DynamicStream(
1846
  self.merge, gen_kwargs={"multi_stream": multi_stream}
1847
  )
1848
  }
processors.py CHANGED
@@ -245,3 +245,11 @@ class LiteralEval(FieldOperator):
245
  if text is None or text == "":
246
  return text
247
  return ast.literal_eval(text.strip())
 
 
 
 
 
 
 
 
 
245
  if text is None or text == "":
246
  return text
247
  return ast.literal_eval(text.strip())
248
+
249
+
250
+ class ExtractSafeUnsafeJudgment(FieldOperator):
251
+ def process_value(self, text: Any) -> Any:
252
+ first_line = str(text).strip().split("\n")[0].lower()
253
+ if first_line == "safe":
254
+ return 1.0
255
+ return 0.0
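The new post-processor simply maps a judge's verdict to a numeric score; in plain Python its behavior is roughly:

```python
# Equivalent standalone function: "safe" on the first line -> 1.0, anything else -> 0.0.
def extract_safe_unsafe(text):
    first_line = str(text).strip().split("\n")[0].lower()
    return 1.0 if first_line == "safe" else 0.0

print(extract_safe_unsafe("Safe\nNo policy was violated."))  # 1.0
print(extract_safe_unsafe("unsafe\nO1: violence"))           # 0.0
```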
settings_utils.py CHANGED
@@ -127,6 +127,7 @@ if Settings.is_uninitilized():
127
  settings.artifactories = None
128
  settings.default_recipe = "standard_recipe"
129
  settings.default_verbosity = "info"
 
130
  settings.remote_metrics = []
131
  settings.test_card_disable = (bool, False)
132
  settings.test_metric_disable = (bool, False)
 
127
  settings.artifactories = None
128
  settings.default_recipe = "standard_recipe"
129
  settings.default_verbosity = "info"
130
+ settings.use_eager_execution = False
131
  settings.remote_metrics = []
132
  settings.test_card_disable = (bool, False)
133
  settings.test_metric_disable = (bool, False)
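A small sketch of toggling the new flag in code; this assumes the usual pattern in this codebase of mutating the shared settings object returned by get_settings() (whether a matching UNITXT_* environment variable is also honored is not shown in this diff):

```python
# Assumption: settings are read and mutated through get_settings(), as done elsewhere here.
from unitxt.settings_utils import get_settings

settings = get_settings()
settings.use_eager_execution = True  # default is False: streams stay lazy (generator-based)
```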
split_utils.py CHANGED
@@ -5,7 +5,7 @@ from typing import Dict
5
  from .generator_utils import ReusableGenerator
6
  from .logging_utils import get_logger
7
  from .random_utils import new_random_generator
8
- from .stream import Stream
9
 
10
  logger = get_logger()
11
 
@@ -140,7 +140,9 @@ def slice_streams(input_streams, mapping):
140
  def generator(new_stream, sources):
141
  for old_stream, slices in sources.items():
142
  if old_stream not in input_streams:
143
- raise ValueError(f"'{old_stream}' is not available in input stream")
 
 
144
  old_stream_content = input_streams[old_stream]
145
  for start, end in slices:
146
  yield from slice_stream(old_stream_content, start, end)
 
5
  from .generator_utils import ReusableGenerator
6
  from .logging_utils import get_logger
7
  from .random_utils import new_random_generator
8
+ from .stream import MissingStreamError, Stream
9
 
10
  logger = get_logger()
11
 
 
140
  def generator(new_stream, sources):
141
  for old_stream, slices in sources.items():
142
  if old_stream not in input_streams:
143
+ raise MissingStreamError(
144
+ f"'{old_stream}' is not available in input streams, but need to slice there from"
145
+ )
146
  old_stream_content = input_streams[old_stream]
147
  for start, end in slices:
148
  yield from slice_stream(old_stream_content, start, end)
splitters.py CHANGED
@@ -1,5 +1,6 @@
1
  import itertools
2
  from abc import abstractmethod
 
3
  from random import Random
4
  from typing import Dict, List
5
 
@@ -13,7 +14,7 @@ from .split_utils import (
13
  rename_split,
14
  slice_streams,
15
  )
16
- from .stream import MultiStream
17
 
18
 
19
  class Splitter(MultiStreamOperator):
@@ -138,8 +139,13 @@ class Sampler(Artifact):
138
  ) -> List[Dict[str, object]]:
139
  if "inputs" not in instance:
140
  raise ValueError(f"'inputs' field is missing from '{instance}'.")
141
-
142
- return list(filter(lambda x: x["inputs"] != instance["inputs"], instances_pool))
 
 
 
 
 
143
 
144
 
145
  class RandomSampler(Sampler):
@@ -282,16 +288,20 @@ class SpreadSplit(InstanceOperatorWithMultiStreamAccess):
282
  ) -> Dict[str, object]:
283
  try:
284
  if self.local_cache is None:
285
- self.local_cache = list(multi_stream[self.source_stream])
286
 
287
  source_stream = self.local_cache
288
  source_stream = self.sampler.filter_source_by_instance(
289
  source_stream, instance
290
  )
 
 
 
 
291
  sampled_instances = self.sampler.sample(source_stream)
292
  instance[self.target_field] = sampled_instances
293
  return instance
294
- except Exception as e:
295
- raise Exception(
296
  f"Unable to fetch instances from '{self.source_stream}' to '{self.target_field}', due to {e.__class__.__name__}: {e}"
297
  ) from e
 
1
  import itertools
2
  from abc import abstractmethod
3
+ from copy import deepcopy
4
  from random import Random
5
  from typing import Dict, List
6
 
 
14
  rename_split,
15
  slice_streams,
16
  )
17
+ from .stream import EmptyStreamError, FaultyStreamError, MultiStream
18
 
19
 
20
  class Splitter(MultiStreamOperator):
 
139
  ) -> List[Dict[str, object]]:
140
  if "inputs" not in instance:
141
  raise ValueError(f"'inputs' field is missing from '{instance}'.")
142
+ # l = list(filter(lambda x: x["inputs"] != instance["inputs"], instances_pool))
143
+ try:
144
+ return [
145
+ item for item in instances_pool if item["inputs"] != instance["inputs"]
146
+ ]
147
+ except Exception as e:
148
+ raise e
149
 
150
 
151
  class RandomSampler(Sampler):
 
288
  ) -> Dict[str, object]:
289
  try:
290
  if self.local_cache is None:
291
+ self.local_cache = deepcopy(list(multi_stream[self.source_stream]))
292
 
293
  source_stream = self.local_cache
294
  source_stream = self.sampler.filter_source_by_instance(
295
  source_stream, instance
296
  )
297
+ if len(source_stream) < self.sampler.sample_size:
298
+ raise ValueError(
299
+ f"Size of population to sample from: {len(source_stream)} is smaller than the needed sample_size: {self.sampler.sample_size}."
300
+ )
301
  sampled_instances = self.sampler.sample(source_stream)
302
  instance[self.target_field] = sampled_instances
303
  return instance
304
+ except FaultyStreamError as e:
305
+ raise EmptyStreamError(
306
  f"Unable to fetch instances from '{self.source_stream}' to '{self.target_field}', due to {e.__class__.__name__}: {e}"
307
  ) from e
stream.py CHANGED
@@ -1,11 +1,19 @@
1
  import tempfile
 
 
2
  from abc import abstractmethod
3
- from typing import Any, Callable, Dict, Iterable, List
 
4
 
5
  from datasets import Dataset, DatasetDict, IterableDataset, IterableDatasetDict
6
 
7
  from .dataclass import Dataclass, OptionalField
8
  from .generator_utils import CopyingReusableGenerator, ReusableGenerator
 
 
 
 
 
9
 
10
 
11
  class Stream(Dataclass):
@@ -21,22 +29,32 @@ class Stream(Dataclass):
21
  def take(self, n):
22
  pass
23
 
 
 
 
 
24
 
25
  class ListStream(Stream):
26
  instances_list: List[Dict[str, Any]]
 
27
 
28
  def __iter__(self):
 
 
29
  return iter(self.instances_list)
30
 
31
  def peek(self):
32
- return next(iter(self.instances_list))
33
 
34
- def take(self, n):
35
  for i, instance in enumerate(self.instances_list):
36
  if i >= n:
37
  break
38
  yield instance
39
 
 
 
 
40
 
41
  class GeneratorStream(Stream):
42
  """A class for handling streaming data in a customizable way.
@@ -88,6 +106,79 @@ class GeneratorStream(Stream):
88
  break
89
  yield instance
90
91
 
92
  class MultiStream(dict):
93
  """A class for handling multiple streams of data in a dictionary-like format.
@@ -112,7 +203,7 @@ class MultiStream(dict):
112
  isinstance(key, str), "MultiStream keys must be strings"
113
  super().__init__(data)
114
 
115
- def get_generator(self, key):
116
  """Gets a generator for a specified key.
117
 
118
  Args:
@@ -129,7 +220,7 @@ class MultiStream(dict):
129
 
130
  def set_copying(self, copying: bool):
131
  for stream in self.values():
132
- stream.copying = copying
133
 
134
  def to_dataset(self, disable_cache=True, cache_dir=None) -> DatasetDict:
135
  with tempfile.TemporaryDirectory() as dir_to_be_deleted:
@@ -178,7 +269,7 @@ class MultiStream(dict):
178
  assert all(isinstance(v, ReusableGenerator) for v in generators.values())
179
  return cls(
180
  {
181
- key: GeneratorStream(
182
  generator.generator,
183
  gen_kwargs=generator.gen_kwargs,
184
  caching=caching,
@@ -204,7 +295,7 @@ class MultiStream(dict):
204
  """
205
  return cls(
206
  {
207
- key: GeneratorStream(
208
  iterable.__iter__,
209
  caching=caching,
210
  copying=copying,
 
1
  import tempfile
2
+ import traceback
3
+ import warnings
4
  from abc import abstractmethod
5
+ from copy import deepcopy
6
+ from typing import Any, Callable, Dict, Generator, Iterable, List
7
 
8
  from datasets import Dataset, DatasetDict, IterableDataset, IterableDatasetDict
9
 
10
  from .dataclass import Dataclass, OptionalField
11
  from .generator_utils import CopyingReusableGenerator, ReusableGenerator
12
+ from .logging_utils import get_logger
13
+ from .settings_utils import get_settings
14
+
15
+ settings = get_settings()
16
+ logger = get_logger()
17
 
18
 
19
  class Stream(Dataclass):
 
29
  def take(self, n):
30
  pass
31
 
32
+ @abstractmethod
33
+ def set_copying(self, copying: bool):
34
+ pass
35
+
36
 
37
  class ListStream(Stream):
38
  instances_list: List[Dict[str, Any]]
39
+ copying: bool = False
40
 
41
  def __iter__(self):
42
+ if self.copying:
43
+ return iter(deepcopy(self.instances_list))
44
  return iter(self.instances_list)
45
 
46
  def peek(self):
47
+ return next(iter(self))
48
 
49
+ def take(self, n) -> Generator:
50
  for i, instance in enumerate(self.instances_list):
51
  if i >= n:
52
  break
53
  yield instance
54
 
55
+ def set_copying(self, copying: bool):
56
+ self.copying = copying
57
+
58
 
59
  class GeneratorStream(Stream):
60
  """A class for handling streaming data in a customizable way.
 
106
  break
107
  yield instance
108
 
109
+ def set_copying(self, copying: bool):
110
+ self.copying = copying
111
+
112
+
113
+ class FaultyStreamError(Exception):
114
+ """Base class for all stream-related exceptions."""
115
+
116
+ pass
117
+
118
+
119
+ class MissingStreamError(FaultyStreamError):
120
+ """Raised when a required stream is missing."""
121
+
122
+ pass
123
+
124
+
125
+ class EmptyStreamError(FaultyStreamError):
126
+ """Raised when a stream is unexpectedly empty."""
127
+
128
+ pass
129
+
130
+
131
+ def eager_failed():
132
+ traceback.print_exc()
133
+ warnings.warn(
134
+ "The eager execution has failed due to the error above.", stacklevel=2
135
+ )
136
+
137
+
138
+ class DynamicStream(Stream):
139
+ generator: Callable
140
+ gen_kwargs: Dict[str, Any] = OptionalField(default_factory=dict)
141
+ caching: bool = False
142
+ copying: bool = False
143
+
144
+ def __post_init__(self):
145
+ self.stream = None
146
+ if settings.use_eager_execution:
147
+ try:
148
+ instances_list = []
149
+ for instance in self.generator(**self.gen_kwargs):
150
+ instances_list.append(instance)
151
+ self.stream = ListStream(
152
+ instances_list=instances_list, copying=self.copying
153
+ )
154
+ except FaultyStreamError:
155
+ eager_failed()
156
+ except RuntimeError as e:
157
+ if isinstance(e.__cause__, FaultyStreamError):
158
+ eager_failed()
159
+ else:
160
+ raise e
161
+
162
+ if self.stream is None:
163
+ self.stream = GeneratorStream(
164
+ generator=self.generator,
165
+ gen_kwargs=self.gen_kwargs,
166
+ caching=self.caching,
167
+ copying=self.copying,
168
+ )
169
+
170
+ def __iter__(self):
171
+ return self.stream.__iter__()
172
+
173
+ def peek(self):
174
+ return self.stream.peek()
175
+
176
+ def take(self, n):
177
+ return self.stream.take(n)
178
+
179
+ def set_copying(self, copying: bool):
180
+ self.stream.set_copying(copying)
181
+
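For intuition, a rough usage sketch of DynamicStream wrapping a generator, with the default (lazy) settings:

```python
# Sketch only: DynamicStream delegates iteration/peek/take to its underlying stream.
from unitxt.stream import DynamicStream

def my_generator(n):
    for i in range(n):
        yield {"i": i}

stream = DynamicStream(generator=my_generator, gen_kwargs={"n": 3})
print(stream.peek())         # {'i': 0}
print(list(stream.take(2)))  # [{'i': 0}, {'i': 1}]
```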
182
 
183
  class MultiStream(dict):
184
  """A class for handling multiple streams of data in a dictionary-like format.
 
203
  isinstance(key, str), "MultiStream keys must be strings"
204
  super().__init__(data)
205
 
206
+ def get_generator(self, key) -> Generator:
207
  """Gets a generator for a specified key.
208
 
209
  Args:
 
220
 
221
  def set_copying(self, copying: bool):
222
  for stream in self.values():
223
+ stream.set_copying(copying)
224
 
225
  def to_dataset(self, disable_cache=True, cache_dir=None) -> DatasetDict:
226
  with tempfile.TemporaryDirectory() as dir_to_be_deleted:
 
269
  assert all(isinstance(v, ReusableGenerator) for v in generators.values())
270
  return cls(
271
  {
272
+ key: DynamicStream(
273
  generator.generator,
274
  gen_kwargs=generator.gen_kwargs,
275
  caching=caching,
 
295
  """
296
  return cls(
297
  {
298
+ key: DynamicStream(
299
  iterable.__iter__,
300
  caching=caching,
301
  copying=copying,
stream_operators.py ADDED
@@ -0,0 +1,126 @@
1
+ """This section describes unitxt operators.
2
+
3
+ Operators: Building Blocks of Unitxt Processing Pipelines
4
+ ==============================================================
5
+
6
+ Within the Unitxt framework, operators serve as the foundational elements used to assemble processing pipelines.
7
+ Each operator is designed to perform specific manipulations on dictionary structures within a stream.
8
+ These operators are callable entities that receive a MultiStream as input.
9
+ The output is a MultiStream, augmented with the operator's manipulations, which are then systematically applied to each instance in the stream when pulled.
10
+
11
+ Creating Custom Operators
12
+ -------------------------------
13
+ To enhance the functionality of Unitxt, users are encouraged to develop custom operators.
14
+ This can be achieved by inheriting from any of the existing operators listed below or from one of the fundamental :class:`base operators<unitxt.operator>`.
15
+ The primary task in any operator development is to implement the `process` function, which defines the unique manipulations the operator will perform.
16
+
17
+ General or Specialized Operators
18
+ --------------------------------
19
+ Some operators are specialized for specific tasks, such as:
20
+
21
+ - :class:`loaders<unitxt.loaders>` for loading data.
22
+ - :class:`splitters<unitxt.splitters>` for fixing data splits.
23
+ - :class:`struct_data_operators<unitxt.struct_data_operators>` for structured data operators.
24
+
25
+ Other specialized operators are used by Unitxt internally:
26
+
27
+ - :class:`templates<unitxt.templates>` for verbalizing data examples.
28
+ - :class:`formats<unitxt.formats>` for preparing data for models.
29
+
30
+ The rest of this section is dedicated to operators that operate on streams.
31
+
32
+ """
33
+
34
+ from typing import (
35
+ List,
36
+ Literal,
37
+ Optional,
38
+ )
39
+
40
+ import pandas as pd
41
+
42
+ from .operator import (
43
+ MultiStream,
44
+ MultiStreamOperator,
45
+ )
46
+ from .settings_utils import get_settings
47
+ from .stream import ListStream
48
+
49
+ settings = get_settings()
50
+
51
+
52
+ class JoinStreams(MultiStreamOperator):
53
+ """Join multiple streams into a single stream.
54
+
55
+ Args:
56
+ left_stream (str): The stream that will be considered the "left" in the join operations.
57
+ right_stream (str): The stream that will be considered the "right" in the join operations.
58
+ how (Literal["left", "right", "inner", "outer", "cross"]): The type of join to be performed.
59
+ on (Optional[List[str]]): Column names to join on. These must be found in both streams.
60
+ left_on (Optional[List[str]]): Column names to join on in the left stream.
61
+ right_on (Optional[List[str]]): Column names to join on in the right stream.
62
+ new_stream_name (str): The name of the new stream resulting from the merge.
63
+
64
+ Examples:
65
+ JoinStreams(left_stream = "questions", right_stream = "answers", how="inner", on="question_id", new_stream_name="question_with_answers" ) Join the 'question' and 'answer' stream based on the 'question_id' field using inner join, resulting with a new stream named "question_with_answers".
66
+ JoinStreams(left_stream = "questions", right_stream = "answers", how="inner", on_left="question_id", on_right="question" new_stream_name="question_with_answers" ) Join the 'question' and 'answer' stream based on the 'question_id' field in the left stream and the 'question' field in the right stream, using inner join, resulting with a new stream named "question_with_answers". This is suitable when the fields have different labels across the streams.
67
+ """
68
+
69
+ left_stream: str
70
+ right_stream: str
71
+ how: Literal["left", "right", "inner", "outer", "cross"]
72
+ on: Optional[List[str]] = None
73
+ left_on: Optional[List[str]] = None
74
+ right_on: Optional[List[str]] = None
75
+ new_stream_name: str
76
+
77
+ def merge(self, multi_stream) -> List:
78
+ assert self.right_stream in multi_stream and self.left_stream in multi_stream
79
+ stream_dict = dict(multi_stream.items())
80
+ left_stream = list(stream_dict[self.left_stream])
81
+ right_stream = list(stream_dict[self.right_stream])
82
+ left_stream_df = pd.DataFrame(left_stream)
83
+ right_stream_df = pd.DataFrame(right_stream)
84
+
85
+ # Remove common col we don't join on, so we don't have unexpected column (standard behavior is to add a suffix)
86
+ common_cols = set(left_stream_df.columns).intersection(
87
+ set(right_stream_df.columns)
88
+ )
89
+ on = self.on if self.on is not None else []
90
+ left_on = self.left_on if self.left_on is not None else []
91
+ right_on = self.right_on if self.right_on is not None else []
92
+ on_cols = set(on + left_on + right_on)
93
+ col_to_remove = list(common_cols - on_cols)
94
+ left_stream_df = left_stream_df.drop(columns=col_to_remove, errors="ignore")
95
+ right_stream_df = right_stream_df.drop(columns=col_to_remove, errors="ignore")
96
+
97
+ merged_df = pd.merge(
98
+ left_stream_df,
99
+ right_stream_df,
100
+ how=self.how,
101
+ on=self.on,
102
+ left_on=self.left_on,
103
+ right_on=self.right_on,
104
+ )
105
+ return merged_df.to_dict(orient="records")
106
+
107
+ def process(self, multi_stream: MultiStream) -> MultiStream:
108
+ merged_records = self.merge(multi_stream)
109
+ multi_stream[self.new_stream_name] = ListStream(instances_list=merged_records)
110
+ return multi_stream
111
+
112
+
113
+ class DeleteSplits(MultiStreamOperator):
114
+ """Operator which delete splits in stream.
115
+
116
+ Attributes:
117
+ splits (List[str]): The splits to delete from the stream.
118
+ """
119
+
120
+ splits: List[str]
121
+
122
+ def process(self, multi_stream: MultiStream) -> MultiStream:
123
+ generators = {
124
+ key: val for key, val in multi_stream.items() if key not in self.splits
125
+ }
126
+ return MultiStream(generators)
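For reference, JoinStreams above is essentially a thin wrapper around pandas.merge; the following standalone sketch shows the same join on illustrative records:

```python
# Illustrative records only; this mirrors the pd.merge call inside JoinStreams.merge.
import pandas as pd

questions = pd.DataFrame([{"question_id": 1, "question": "What is 2+2?"}])
answers = pd.DataFrame([{"question_id": 1, "answer": "4"}])

merged = pd.merge(questions, answers, how="inner", on="question_id")
print(merged.to_dict(orient="records"))
# [{'question_id': 1, 'question': 'What is 2+2?', 'answer': '4'}]
```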
struct_data_operators.py CHANGED
@@ -566,3 +566,42 @@ class LoadJson(FieldOperator):
566
  class DumpJson(FieldOperator):
567
  def process_value(self, value: str) -> str:
568
  return json.dumps(value)
566
  class DumpJson(FieldOperator):
567
  def process_value(self, value: str) -> str:
568
  return json.dumps(value)
569
+
570
+
571
+ class MapHTMLTableToJSON(FieldOperator):
572
+ """Converts HTML table format to the basic one (JSON).
573
+
574
+ JSON format
575
+ {
576
+ "header": ["col1", "col2"],
577
+ "rows": [["row11", "row12"], ["row21", "row22"], ["row31", "row32"]]
578
+ }
579
+ """
580
+
581
+ _requirements_list = ["bs4"]
582
+
583
+ def process_value(self, table: Any) -> Any:
584
+ return self.truncate_table_rows(table_content=table)
585
+
586
+ def truncate_table_rows(self, table_content: str) -> Dict:
587
+ from bs4 import BeautifulSoup
588
+
589
+ soup = BeautifulSoup(table_content, "html.parser")
590
+
591
+ # Extract header
592
+ header = []
593
+ header_cells = soup.find("thead").find_all("th")
594
+ for cell in header_cells:
595
+ header.append(cell.get_text())
596
+
597
+ # Extract rows
598
+ rows = []
599
+ for row in soup.find("tbody").find_all("tr"):
600
+ row_data = []
601
+ for cell in row.find_all("td"):
602
+ row_data.append(cell.get_text())
603
+ rows.append(row_data)
604
+
605
+ # return dictionary
606
+
607
+ return {"header": header, "rows": rows}
text_utils.py CHANGED
@@ -135,8 +135,8 @@ def is_made_of_sub_strings(string, sub_strings):
135
  return bool(re.match(pattern, string))
136
 
137
 
138
- # Giveמ all the lines of a file, e.g. all the lines of prepare/cards/cohere_for_ai.py,
139
- # and an object name, e.g. TaskCard,
140
  # return the ordinal number of the line that starts that object, in our example: the
141
  # line number of the following line (notice that the line where TaskCard is imported
142
  # is not supposed to return):
@@ -145,10 +145,12 @@ def is_made_of_sub_strings(string, sub_strings):
145
  # the matching close:
146
  # )
147
  # This util depends on ruff to ensure this setting of the card file: that a close of one
148
- # tag and the open of the next tag, do not sit in same line, both tags being
149
- # major level within TaskCard
 
 
150
  # flake8: noqa: B007
151
- def lines_defining_obj(
152
  all_lines: List[str], obj_name: str, start_search_at_line: int = 0
153
  ) -> Tuple[int, int]:
154
  for starting_line in range(start_search_at_line, len(all_lines)):
@@ -160,11 +162,28 @@ def lines_defining_obj(
160
  return (-1, -1)
161
  num_of_opens = 0
162
  num_of_closes = 0
163
- for ending_line in range(starting_line, len(all_lines)):
 
 
164
  num_of_opens += len(re.findall(r"[({[]", all_lines[ending_line]))
165
  num_of_closes += len(re.findall(r"[)}\]]", all_lines[ending_line]))
166
  if num_of_closes == num_of_opens:
167
  break
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
168
 
169
  if num_of_closes != num_of_opens:
170
  raise ValueError(
 
135
  return bool(re.match(pattern, string))
136
 
137
 
138
+ # Given all the lines of a card preparer file, e.g. all the lines of prepare/cards/cohere_for_ai.py,
139
+ # and an object name, e.g. TaskCard(,
140
  # return the ordinal number of the line that starts that object, in our example: the
141
  # line number of the following line (notice that the line where TaskCard is imported
142
  # is not supposed to return):
 
145
  # the matching close:
146
  # )
147
  # This util depends on ruff to ensure this setting of the card file: that a close of one
148
+ # tag and the open of the next tag do not sit in the same line, when both tags are
149
+ # at major level within TaskCard.
150
+ # It also prepares for the case that the __description__ tag does not contain balanced
151
+ # parentheses, since it is often cut in the middle (with "... see more at").
152
  # flake8: noqa: B007
153
+ def lines_defining_obj_in_card(
154
  all_lines: List[str], obj_name: str, start_search_at_line: int = 0
155
  ) -> Tuple[int, int]:
156
  for starting_line in range(start_search_at_line, len(all_lines)):
 
162
  return (-1, -1)
163
  num_of_opens = 0
164
  num_of_closes = 0
165
+ ending_line = starting_line - 1
166
+ while ending_line < len(all_lines):
167
+ ending_line += 1
168
  num_of_opens += len(re.findall(r"[({[]", all_lines[ending_line]))
169
  num_of_closes += len(re.findall(r"[)}\]]", all_lines[ending_line]))
170
  if num_of_closes == num_of_opens:
171
  break
172
+ if "__description__" in all_lines[ending_line]:
173
+ # can not trust parentheses inside description.
174
+ # trust the indentation enforced by ruff, and the way we build __description__:
175
+ # a line consisting of only __description__=(
176
+ # followed by one or more lines of text, can not trust opens and closes
177
+ # in them, followed by a line consisting of only: ),
178
+ # where the ) is indented with the beginning of __description__
179
+ tag_indentation = all_lines[ending_line].index("__description__")
180
+ last_line_to_start_with = (" " * tag_indentation) + ")"
181
+ while not all_lines[ending_line].startswith(last_line_to_start_with):
182
+ ending_line += 1
183
+ if "__description__" in obj_name:
184
+ return (starting_line, ending_line)
185
+ num_of_closes += 1 # for this last line of desc
186
+ # continue to the line following the end of description
187
 
188
  if num_of_closes != num_of_opens:
189
  raise ValueError(