Elron committed on
Commit
100c2eb
1 Parent(s): 0a1b314

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,50 +1,75 @@
1
- ---
2
- title: Metric
3
- datasets:
4
- - none
5
- tags:
6
- - evaluate
7
- - metric
8
- description: "TODO: add a description here"
9
- sdk: gradio
10
- sdk_version: 3.19.1
11
- app_file: app.py
12
- pinned: false
13
- ---
14
 
15
- # Metric Card for Metric
 
 
 
 
 
 
 
16
 
17
- ***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
18
 
19
- ## Metric Description
20
- *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
21
 
22
- ## How to Use
23
- *Give general statement of how to use the metric*
24
 
25
- *Provide simplest possible example for using the metric*
 
 
 
 
 
 
 
26
 
27
- ### Inputs
28
- *List all input arguments in the format below*
29
- - **input_field** *(type): Definition of input, with explanation if necessary. State any default value(s).*
30
 
31
- ### Output Values
32
 
33
- *Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}*
34
 
35
- *State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."*
 
 
 
 
36
 
37
- #### Values from Popular Papers
38
- *Give examples, preferrably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
39
 
40
- ### Examples
41
- *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
 
 
 
 
 
 
42
 
43
- ## Limitations and Bias
44
- *Note any known limitations or biases that the metric has, with links and references if possible.*
45
 
46
- ## Citation
47
- *Cite the source where this metric was introduced.*
48
 
49
- ## Further References
50
- *Add any useful further references.*
 
1
+ <div align="center">
2
+ <img src="./assets/banner.png" alt="Image Description" width="100%" />
3
+ </div>
 
 
 
 
 
 
 
 
 
 
4
 
5
+ [![Button](https://img.shields.io/badge/Video-pink?style=for-the-badge)](https://unitxt.readthedocs.io/)
6
+ [![Button](https://img.shields.io/badge/Demo-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/docs/demo.html)
7
+ [![Button](https://img.shields.io/badge/Tutorial-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/docs/adding_dataset.html)
8
+ [![Button](https://img.shields.io/badge/Paper-pink?style=for-the-badge)](https://arxiv.org/abs/2401.14019)
9
+ [![Button](https://img.shields.io/badge/Documentation-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/modules.html)
10
+ [![Button](https://img.shields.io/badge/Catalog-pink?style=for-the-badge)](https://unitxt.readthedocs.io/en/latest/catalog/catalog.__dir__.html)
11
+ [![Button](https://img.shields.io/badge/Contributors-pink?style=for-the-badge)](https://github.com/IBM/unitxt/blob/main/CONTRIBUTING.md)
12
+ [![Button](https://img.shields.io/badge/PyPi-pink?style=for-the-badge)](https://pypi.org/project/unitxt/)
13
 
 
14
 
15
+ In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution.
 
16
 
17
+ Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively.
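For orientation, here is a minimal, illustrative sketch of the intended workflow; the card and template names are examples from the catalog, and the exact API surface may differ between versions:

```python
# Illustrative sketch only: the recipe string, card, and template names are examples.
from unitxt import evaluate, load_dataset

dataset = load_dataset(
    "card=cards.wnli,template=templates.classification.multi_class.relation.default"
)
test_set = list(dataset["test"])

predictions = ["entailment" for _ in test_set]  # placeholder model outputs
results = evaluate(predictions=predictions, data=test_set)
```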
 
18
 
19
+ #
20
+ [![version](https://img.shields.io/pypi/v/unitxt)](https://pypi.org/project/unitxt/)
21
+ ![license](https://img.shields.io/github/license/ibm/unitxt)
22
+ ![python](https://img.shields.io/badge/python-3.8%20|%203.9-blue)
23
+ ![tests](https://img.shields.io/github/actions/workflow/status/ibm/unitxt/library_tests.yml?branch=main&label=tests)
24
+ [![codecov](https://codecov.io/gh/IBM/unitxt/branch/main/graph/badge.svg?token=mlrWq9cwz3)](https://codecov.io/gh/IBM/unitxt)
25
+ ![Read the Docs](https://img.shields.io/readthedocs/unitxt)
26
+ [![downloads](https://static.pepy.tech/personalized-badge/unitxt?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/unitxt)
27
 
28
+ #
 
 
29
 
30
+ https://github.com/IBM/unitxt/assets/23455264/baef9131-39d4-4164-90b2-05da52919fdf
31
 
32
+ ### 🦄 Currently on Unitxt Catalog
33
 
34
+ ![NLP Tasks](https://img.shields.io/badge/NLP_tasks-40-blue)
35
+ ![Dataset Cards](https://img.shields.io/badge/Dataset_Cards-457-blue)
36
+ ![Templates](https://img.shields.io/badge/Templates-229-blue)
37
+ ![Formats](https://img.shields.io/badge/Formats-18-blue)
38
+ ![Metrics](https://img.shields.io/badge/Metrics-98-blue)
39
 
40
+ ### 🦄 Run Unitxt Exploration Dashboard
 
41
 
42
+ To launch the Unitxt graphical user interface, first install Unitxt with the UI requirements:
43
+ ```
44
+ pip install unitxt[ui]
45
+ ```
46
+ Then launch the UI by running:
47
+ ```
48
+ unitxt-explore
49
+ ```
50
 
51
+ # 🦄 Contributors
 
52
 
53
+ Please install Unitxt from source by running:
54
+ ```
55
+ git clone git@github.com:IBM/unitxt.git
56
+ cd unitxt
57
+ pip install -e ".[dev]"
58
+ pre-commit install
59
+ ```
60
+
61
+ # 🦄 Citation
62
+
63
+ If you use Unitxt in your research, please cite our paper:
64
+
65
+ ```
66
+ @misc{unitxt,
67
+ title={Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI},
68
+ author={Elron Bandel and Yotam Perlitz and Elad Venezian and Roni Friedman-Melamed and Ofir Arviv and Matan Orbach and Shachar Don-Yehyia and Dafna Sheinwald and Ariel Gera and Leshem Choshen and Michal Shmueli-Scheuer and Yoav Katz},
69
+ year={2024},
70
+ eprint={2401.14019},
71
+ archivePrefix={arXiv},
72
+ primaryClass={cs.CL}
73
+ }
74
+ ```
75
 
 
 
blocks.py CHANGED
@@ -22,8 +22,11 @@ from .splitters import RandomSampler, SliceSplit, SplitRandomMix, SpreadSplit
22
  from .stream import MultiStream
23
  from .struct_data_operators import (
24
  ListToKeyValPairs,
 
25
  SerializeKeyValPairs,
 
26
  SerializeTableAsIndexedRowMajor,
 
27
  SerializeTableAsMarkdown,
28
  SerializeTableRowAsList,
29
  SerializeTableRowAsText,
 
22
  from .stream import MultiStream
23
  from .struct_data_operators import (
24
  ListToKeyValPairs,
25
+ MapHTMLTableToJSON,
26
  SerializeKeyValPairs,
27
+ SerializeTableAsDFLoader,
28
  SerializeTableAsIndexedRowMajor,
29
+ SerializeTableAsJson,
30
  SerializeTableAsMarkdown,
31
  SerializeTableRowAsList,
32
  SerializeTableRowAsText,
dataset.py CHANGED
@@ -10,7 +10,6 @@ from .catalog import __file__ as _
10
  from .collections import __file__ as _
11
  from .collections_operators import __file__ as _
12
  from .dataclass import __file__ as _
13
- from .dataset_utils import __file__ as _
14
  from .dataset_utils import get_dataset_artifact
15
  from .deprecation_utils import __file__ as _
16
  from .dialog_operators import __file__ as _
@@ -20,13 +19,11 @@ from .file_utils import __file__ as _
20
  from .formats import __file__ as _
21
  from .fusion import __file__ as _
22
  from .generator_utils import __file__ as _
23
- from .hf_utils import __file__ as _
24
  from .hf_utils import verify_versions_compatibility
25
  from .inference import __file__ as _
26
  from .instructions import __file__ as _
27
  from .llm_as_judge import __file__ as _
28
  from .loaders import __file__ as _
29
- from .logging_utils import __file__ as _
30
  from .logging_utils import get_logger
31
  from .metric import __file__ as _
32
  from .metric_utils import __file__ as _
@@ -40,13 +37,13 @@ from .random_utils import __file__ as _
40
  from .recipe import __file__ as _
41
  from .register import __file__ as _
42
  from .schema import __file__ as _
43
- from .settings_utils import __file__ as _
44
  from .settings_utils import get_constants
45
  from .span_lableing_operators import __file__ as _
46
  from .split_utils import __file__ as _
47
  from .splitters import __file__ as _
48
  from .standard import __file__ as _
49
  from .stream import __file__ as _
 
50
  from .string_operators import __file__ as _
51
  from .struct_data_operators import __file__ as _
52
  from .system_prompts import __file__ as _
@@ -54,7 +51,6 @@ from .task import __file__ as _
54
  from .templates import __file__ as _
55
  from .text_utils import __file__ as _
56
  from .type_utils import __file__ as _
57
- from .utils import __file__ as _
58
  from .utils import is_package_installed
59
  from .validate import __file__ as _
60
  from .version import __file__ as _
@@ -75,8 +71,9 @@ class Dataset(datasets.GeneratorBasedBuilder):
75
  if is_package_installed("unitxt"):
76
  verify_versions_compatibility("dataset", self.VERSION)
77
 
78
- from unitxt.dataset_utils import \
79
- get_dataset_artifact as get_dataset_artifact_installed
 
80
 
81
  logger.info("Loading with installed unitxt library...")
82
  dataset = get_dataset_artifact_installed(self.config.name)
 
10
  from .collections import __file__ as _
11
  from .collections_operators import __file__ as _
12
  from .dataclass import __file__ as _
 
13
  from .dataset_utils import get_dataset_artifact
14
  from .deprecation_utils import __file__ as _
15
  from .dialog_operators import __file__ as _
 
19
  from .formats import __file__ as _
20
  from .fusion import __file__ as _
21
  from .generator_utils import __file__ as _
 
22
  from .hf_utils import verify_versions_compatibility
23
  from .inference import __file__ as _
24
  from .instructions import __file__ as _
25
  from .llm_as_judge import __file__ as _
26
  from .loaders import __file__ as _
 
27
  from .logging_utils import get_logger
28
  from .metric import __file__ as _
29
  from .metric_utils import __file__ as _
 
37
  from .recipe import __file__ as _
38
  from .register import __file__ as _
39
  from .schema import __file__ as _
 
40
  from .settings_utils import get_constants
41
  from .span_lableing_operators import __file__ as _
42
  from .split_utils import __file__ as _
43
  from .splitters import __file__ as _
44
  from .standard import __file__ as _
45
  from .stream import __file__ as _
46
+ from .stream_operators import __file__ as _
47
  from .string_operators import __file__ as _
48
  from .struct_data_operators import __file__ as _
49
  from .system_prompts import __file__ as _
 
51
  from .templates import __file__ as _
52
  from .text_utils import __file__ as _
53
  from .type_utils import __file__ as _
 
54
  from .utils import is_package_installed
55
  from .validate import __file__ as _
56
  from .version import __file__ as _
 
71
  if is_package_installed("unitxt"):
72
  verify_versions_compatibility("dataset", self.VERSION)
73
 
74
+ from unitxt.dataset_utils import (
75
+ get_dataset_artifact as get_dataset_artifact_installed,
76
+ )
77
 
78
  logger.info("Loading with installed unitxt library...")
79
  dataset = get_dataset_artifact_installed(self.config.name)
fusion.py CHANGED
@@ -4,7 +4,7 @@ from typing import Dict, Generator, List, Optional, Union
4
  from .dataclass import NonPositionalField
5
  from .operator import SourceOperator
6
  from .random_utils import new_random_generator
7
- from .stream import GeneratorStream, MultiStream
8
  from .type_utils import isoftype
9
 
10
 
@@ -49,7 +49,7 @@ class BaseFusion(SourceOperator):
49
  ) -> MultiStream:
50
  result = {}
51
  for split in self.splits():
52
- result[split] = GeneratorStream(
53
  self.fusion_generator, gen_kwargs={"split": split}
54
  )
55
  return MultiStream(result)
 
4
  from .dataclass import NonPositionalField
5
  from .operator import SourceOperator
6
  from .random_utils import new_random_generator
7
+ from .stream import DynamicStream, MultiStream
8
  from .type_utils import isoftype
9
 
10
 
 
49
  ) -> MultiStream:
50
  result = {}
51
  for split in self.splits():
52
+ result[split] = DynamicStream(
53
  self.fusion_generator, gen_kwargs={"split": split}
54
  )
55
  return MultiStream(result)
inference.py CHANGED
@@ -121,8 +121,7 @@ class IbmGenAiInferenceEngine(InferenceEngine, PackageRequirementsMixin):
121
  f"Error while trying to run IbmGenAiInferenceEngine."
122
  f" Please set the environment param '{api_key_env_var_name}'."
123
  )
124
- api_endpoint = os.environ.get("GENAI_KEY")
125
- credentials = Credentials(api_key=api_key, api_endpoint=api_endpoint)
126
  self.client = Client(credentials=credentials)
127
 
128
  def _infer(self, dataset):
@@ -141,13 +140,14 @@ class IbmGenAiInferenceEngine(InferenceEngine, PackageRequirementsMixin):
141
  decoding_method=self.parameters.decoding_method,
142
  )
143
 
144
- return list(
145
- self.client.text.generation.create(
 
146
  model_id=self.model_name,
147
  inputs=[instance["source"] for instance in dataset],
148
  parameters=genai_params,
149
  )
150
- )
151
 
152
 
153
  class OpenAiInferenceEngineParams(Artifact):
 
121
  f"Error while trying to run IbmGenAiInferenceEngine."
122
  f" Please set the environment param '{api_key_env_var_name}'."
123
  )
124
+ credentials = Credentials(api_key=api_key)
 
125
  self.client = Client(credentials=credentials)
126
 
127
  def _infer(self, dataset):
 
140
  decoding_method=self.parameters.decoding_method,
141
  )
142
 
143
+ return [
144
+ response.results[0].generated_text
145
+ for response in self.client.text.generation.create(
146
  model_id=self.model_name,
147
  inputs=[instance["source"] for instance in dataset],
148
  parameters=genai_params,
149
  )
150
+ ]
151
 
152
 
153
  class OpenAiInferenceEngineParams(Artifact):
llm_as_judge.py CHANGED
@@ -135,4 +135,8 @@ class LLMAsJudge(BulkInstanceMetric):
135
  dataset = produce(instances, recipe)
136
  verdicts = self.inference_model.infer(dataset)
137
  meta_scores = evaluate(predictions=verdicts, data=dataset)
138
- return [{self.main_score: instance["prediction"]} for instance in meta_scores]
 
 
 
 
 
135
  dataset = produce(instances, recipe)
136
  verdicts = self.inference_model.infer(dataset)
137
  meta_scores = evaluate(predictions=verdicts, data=dataset)
138
+ return [
139
+ {self.main_score: instance["prediction"], "judge_raw_output": verdict}
140
+ for instance in meta_scores
141
+ for verdict in verdicts
142
+ ]
loaders.py CHANGED
@@ -30,6 +30,7 @@ Available Loaders Overview:
30
 
31
  ------------------------
32
  """
 
33
  import itertools
34
  import os
35
  import tempfile
@@ -41,6 +42,7 @@ from typing import Any, Dict, List, Mapping, Optional, Sequence, Union
41
 
42
  import pandas as pd
43
  from datasets import load_dataset as hf_load_dataset
 
44
  from tqdm import tqdm
45
 
46
  from .dataclass import InternalField, OptionalField
@@ -49,7 +51,7 @@ from .logging_utils import get_logger
49
  from .operator import SourceOperator
50
  from .operators import AddFields
51
  from .settings_utils import get_settings
52
- from .stream import GeneratorStream, MultiStream
53
 
54
  logger = get_logger()
55
  settings = get_settings()
@@ -259,7 +261,7 @@ class LoadHF(Loader):
259
  self.log_limited_loading()
260
  return MultiStream(
261
  {
262
- name: GeneratorStream(
263
  generator=self.split_limited_load, gen_kwargs={"split_name": name}
264
  )
265
  for name in self._cache.keys()
@@ -349,7 +351,7 @@ class LoadCSV(Loader):
349
  if self.streaming:
350
  return MultiStream(
351
  {
352
- name: GeneratorStream(
353
  generator=self.stream_csv, gen_kwargs={"file": file}
354
  )
355
  for name, file in self.files.items()
@@ -358,9 +360,7 @@ class LoadCSV(Loader):
358
 
359
  return MultiStream(
360
  {
361
- name: GeneratorStream(
362
- generator=self.load_csv, gen_kwargs={"file": file}
363
- )
364
  for name, file in self.files.items()
365
  }
366
  )
@@ -385,7 +385,6 @@ class LoadFromSklearn(Loader):
385
 
386
  dataset_name: str
387
  splits: List[str] = ["train", "test"]
388
- data_classification_policy = ["public"]
389
 
390
  _requirements_list: List[str] = ["sklearn", "pandas"]
391
 
@@ -683,8 +682,10 @@ class LoadFromDictionary(Loader):
683
  .. code-block:: python
684
 
685
  data = {
686
- "train": {"input": "SomeInput1", "output": "SomeResult1"},
687
- "test": {"input": "SomeInput2", "output": "SomeResult2"},
 
 
688
  }
689
  loader = LoadFromDictionary(data=data)
690
  """
@@ -794,18 +795,79 @@ class LoadFromHFSpace(LoadHF):
794
  else:
795
  data_files = self.data_files
796
 
 
797
  for files in data_files:
798
  if isinstance(files, str):
799
  files = [files]
800
- # All files - within the same space - are downloaded into the same base directory:
801
  paths = [self._download_file_from_space(file) for file in files]
802
- dir_path = paths[0].replace(files[0], "")
803
 
804
- return dir_path
805
 
806
  def load_data(self):
807
  self.sef_default_data_classification(
808
  ["public"], "when loading from Huggingface spaces"
809
  )
 
810
  self.path = self._download_data()
811
  return super().load_data()
 
30
 
31
  ------------------------
32
  """
33
+ import fnmatch
34
  import itertools
35
  import os
36
  import tempfile
 
42
 
43
  import pandas as pd
44
  from datasets import load_dataset as hf_load_dataset
45
+ from huggingface_hub import HfApi
46
  from tqdm import tqdm
47
 
48
  from .dataclass import InternalField, OptionalField
 
51
  from .operator import SourceOperator
52
  from .operators import AddFields
53
  from .settings_utils import get_settings
54
+ from .stream import DynamicStream, MultiStream
55
 
56
  logger = get_logger()
57
  settings = get_settings()
 
261
  self.log_limited_loading()
262
  return MultiStream(
263
  {
264
+ name: DynamicStream(
265
  generator=self.split_limited_load, gen_kwargs={"split_name": name}
266
  )
267
  for name in self._cache.keys()
 
351
  if self.streaming:
352
  return MultiStream(
353
  {
354
+ name: DynamicStream(
355
  generator=self.stream_csv, gen_kwargs={"file": file}
356
  )
357
  for name, file in self.files.items()
 
360
 
361
  return MultiStream(
362
  {
363
+ name: DynamicStream(generator=self.load_csv, gen_kwargs={"file": file})
 
 
364
  for name, file in self.files.items()
365
  }
366
  )
 
385
 
386
  dataset_name: str
387
  splits: List[str] = ["train", "test"]
 
388
 
389
  _requirements_list: List[str] = ["sklearn", "pandas"]
390
 
 
682
  .. code-block:: python
683
 
684
  data = {
685
+ "train": [{"input": "SomeInput1", "output": "SomeResult1"},
686
+ {"input": "SomeInput2", "output": "SomeResult2"}],
687
+ "test": [{"input": "SomeInput3", "output": "SomeResult3"},
688
+ {"input": "SomeInput4", "output": "SomeResult4"}]
689
  }
690
  loader = LoadFromDictionary(data=data)
691
  """
 
795
  else:
796
  data_files = self.data_files
797
 
798
+ dir_paths_list = []
799
  for files in data_files:
800
  if isinstance(files, str):
801
  files = [files]
802
+
803
  paths = [self._download_file_from_space(file) for file in files]
804
+ dir_paths = [
805
+ path.replace(file_url, "") for path, file_url in zip(paths, files)
806
+ ]
807
+ dir_paths_list.extend(dir_paths)
808
+
809
+ # All files - within the same space - are downloaded into the same base directory:
810
+ assert len(set(dir_paths_list)) == 1
811
+
812
+ return f"{dir_paths_list.pop()}"
813
+
814
+ @staticmethod
815
+ def _is_wildcard(path: str) -> bool:
816
+ wildcard_characters = ["*", "?", "[", "]"]
817
+ return any(char in path for char in wildcard_characters)
818
+
819
+ def _get_file_list_from_wildcard_path(
820
+ self, pattern: str, repo_files: List
821
+ ) -> List[str]:
822
+ if self._is_wildcard(pattern):
823
+ return fnmatch.filter(repo_files, pattern)
824
+ return [pattern]
825
+
826
+ def _map_wildcard_path_to_full_paths(self):
827
+ api = HfApi()
828
+ repo_files = api.list_repo_files(self.space_name, repo_type="space")
829
+ if isinstance(self.data_files, str):
830
+ self.data_files = self._get_file_list_from_wildcard_path(
831
+ self.data_files, repo_files
832
+ )
833
+ elif isinstance(self.data_files, Mapping):
834
+ new_mapping = {}
835
+ for k, v in self.data_files.items():
836
+ if isinstance(v, list):
837
+ assert all(isinstance(s, str) for s in v)
838
+ new_mapping[k] = [
839
+ file
840
+ for p in v
841
+ for file in self._get_file_list_from_wildcard_path(
842
+ p, repo_files
843
+ )
844
+ ]
845
+ elif isinstance(v, str):
846
+ new_mapping[k] = self._get_file_list_from_wildcard_path(
847
+ v, repo_files
848
+ )
849
+ else:
850
+ raise NotImplementedError(
851
+ f"Loader does not support input 'data_files' of type Mapping[{type(v)}]"
852
+ )
853
 
854
+ self.data_files = new_mapping
855
+ elif isinstance(self.data_files, list):
856
+ assert all(isinstance(s, str) for s in self.data_files)
857
+ self.data_files = [
858
+ file
859
+ for p in self.data_files
860
+ for file in self._get_file_list_from_wildcard_path(p, repo_files)
861
+ ]
862
+ else:
863
+ raise NotImplementedError(
864
+ f"Loader does not support input 'data_files' of type {type(self.data_files)}"
865
+ )
866
 
867
  def load_data(self):
868
  self.sef_default_data_classification(
869
  ["public"], "when loading from Huggingface spaces"
870
  )
871
+ self._map_wildcard_path_to_full_paths()
872
  self.path = self._download_data()
873
  return super().load_data()
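To illustrate the wildcard handling added to LoadFromHFSpace above, here is a rough standalone sketch; the repo file names and patterns are hypothetical:

```python
# Hypothetical space file listing; mirrors _get_file_list_from_wildcard_path's logic.
import fnmatch

repo_files = ["data/train_00.csv", "data/train_01.csv", "data/test.csv"]
data_files = {"train": "data/train_*.csv", "test": "data/test.csv"}

def expand(pattern):
    # Only patterns containing wildcard characters are expanded against the repo listing.
    if any(char in pattern for char in "*?[]"):
        return fnmatch.filter(repo_files, pattern)
    return [pattern]

print({split: expand(pattern) for split, pattern in data_files.items()})
# {'train': ['data/train_00.csv', 'data/train_01.csv'], 'test': ['data/test.csv']}
```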
metric.py CHANGED
@@ -19,16 +19,13 @@ from .file_utils import __file__ as _
19
  from .formats import __file__ as _
20
  from .fusion import __file__ as _
21
  from .generator_utils import __file__ as _
22
- from .hf_utils import __file__ as _
23
  from .hf_utils import verify_versions_compatibility
24
  from .inference import __file__ as _
25
  from .instructions import __file__ as _
26
  from .llm_as_judge import __file__ as _
27
  from .loaders import __file__ as _
28
  from .logging_utils import __file__ as _
29
- from .metric_utils import UNITXT_METRIC_SCHEMA
30
- from .metric_utils import __file__ as _
31
- from .metric_utils import _compute
32
  from .metrics import __file__ as _
33
  from .normalizers import __file__ as _
34
  from .operator import __file__ as _
@@ -39,13 +36,13 @@ from .random_utils import __file__ as _
39
  from .recipe import __file__ as _
40
  from .register import __file__ as _
41
  from .schema import __file__ as _
42
- from .settings_utils import __file__ as _
43
  from .settings_utils import get_constants
44
  from .span_lableing_operators import __file__ as _
45
  from .split_utils import __file__ as _
46
  from .splitters import __file__ as _
47
  from .standard import __file__ as _
48
  from .stream import __file__ as _
 
49
  from .string_operators import __file__ as _
50
  from .struct_data_operators import __file__ as _
51
  from .system_prompts import __file__ as _
@@ -53,7 +50,6 @@ from .task import __file__ as _
53
  from .templates import __file__ as _
54
  from .text_utils import __file__ as _
55
  from .type_utils import __file__ as _
56
- from .utils import __file__ as _
57
  from .utils import is_package_installed
58
  from .validate import __file__ as _
59
  from .version import __file__ as _
 
19
  from .formats import __file__ as _
20
  from .fusion import __file__ as _
21
  from .generator_utils import __file__ as _
 
22
  from .hf_utils import verify_versions_compatibility
23
  from .inference import __file__ as _
24
  from .instructions import __file__ as _
25
  from .llm_as_judge import __file__ as _
26
  from .loaders import __file__ as _
27
  from .logging_utils import __file__ as _
28
+ from .metric_utils import UNITXT_METRIC_SCHEMA, _compute
 
 
29
  from .metrics import __file__ as _
30
  from .normalizers import __file__ as _
31
  from .operator import __file__ as _
 
36
  from .recipe import __file__ as _
37
  from .register import __file__ as _
38
  from .schema import __file__ as _
 
39
  from .settings_utils import get_constants
40
  from .span_lableing_operators import __file__ as _
41
  from .split_utils import __file__ as _
42
  from .splitters import __file__ as _
43
  from .standard import __file__ as _
44
  from .stream import __file__ as _
45
+ from .stream_operators import __file__ as _
46
  from .string_operators import __file__ as _
47
  from .struct_data_operators import __file__ as _
48
  from .system_prompts import __file__ as _
 
50
  from .templates import __file__ as _
51
  from .text_utils import __file__ as _
52
  from .type_utils import __file__ as _
 
53
  from .utils import is_package_installed
54
  from .validate import __file__ as _
55
  from .version import __file__ as _
metric_utils.py CHANGED
@@ -15,7 +15,7 @@ from .operator import (
15
  from .operators import (
16
  ApplyMetric,
17
  ApplyOperatorsField,
18
- CopyFields,
19
  FlattenInstances,
20
  MergeStreams,
21
  SplitByNestedGroup,
@@ -23,7 +23,7 @@ from .operators import (
23
  from .register import _reset_env_local_catalogs, register_all_artifacts
24
  from .schema import UNITXT_DATASET_SCHEMA
25
  from .settings_utils import get_settings
26
- from .stream import GeneratorStream, MultiStream
27
  from .struct_data_operators import LoadJson
28
 
29
 
@@ -109,7 +109,7 @@ class MultiStreamScoreMean(MultiStreamOperator):
109
 
110
  return MultiStream(
111
  {
112
- stream_name: GeneratorStream(
113
  never_peek_twice_generator,
114
  gen_kwargs={
115
  "stream_name": stream_name,
@@ -132,7 +132,7 @@ class FromPredictionsAndOriginalData(StreamInitializerOperator):
132
  ) -> MultiStream:
133
  return MultiStream(
134
  {
135
- split_name: GeneratorStream(
136
  self.zip,
137
  gen_kwargs={"predictions": predictions, "references": references},
138
  )
@@ -155,10 +155,9 @@ class MetricRecipe(SequentialOperatorInitializer):
155
  self.steps = [
156
  FromPredictionsAndOriginalData(),
157
  LoadJson(field="task_data"),
158
- CopyFields(
159
- field_to_field={
160
- "source": "task_data/source",
161
- }
162
  ),
163
  ApplyOperatorsField(
164
  operators_field="postprocessors",
 
15
  from .operators import (
16
  ApplyMetric,
17
  ApplyOperatorsField,
18
+ Copy,
19
  FlattenInstances,
20
  MergeStreams,
21
  SplitByNestedGroup,
 
23
  from .register import _reset_env_local_catalogs, register_all_artifacts
24
  from .schema import UNITXT_DATASET_SCHEMA
25
  from .settings_utils import get_settings
26
+ from .stream import DynamicStream, MultiStream
27
  from .struct_data_operators import LoadJson
28
 
29
 
 
109
 
110
  return MultiStream(
111
  {
112
+ stream_name: DynamicStream(
113
  never_peek_twice_generator,
114
  gen_kwargs={
115
  "stream_name": stream_name,
 
132
  ) -> MultiStream:
133
  return MultiStream(
134
  {
135
+ split_name: DynamicStream(
136
  self.zip,
137
  gen_kwargs={"predictions": predictions, "references": references},
138
  )
 
155
  self.steps = [
156
  FromPredictionsAndOriginalData(),
157
  LoadJson(field="task_data"),
158
+ Copy(
159
+ field="source",
160
+ to_field="task_data/source",
 
161
  ),
162
  ApplyOperatorsField(
163
  operators_field="postprocessors",
operator.py CHANGED
@@ -4,7 +4,7 @@ from typing import Any, Dict, Generator, List, Optional, Union
4
 
5
  from .artifact import Artifact
6
  from .dataclass import InternalField, NonPositionalField
7
- from .stream import GeneratorStream, MultiStream, Stream
8
  from .utils import is_module_available
9
 
10
 
@@ -170,7 +170,7 @@ def instance_generator(instance):
170
 
171
 
172
  def stream_single(instance: Dict[str, Any]) -> Stream:
173
- return GeneratorStream(
174
  generator=instance_generator, gen_kwargs={"instance": instance}
175
  )
176
 
@@ -244,7 +244,7 @@ class StreamOperator(MultiStreamOperator):
244
  def _process_single_stream(
245
  self, stream: Stream, stream_name: Optional[str] = None
246
  ) -> Stream:
247
- return GeneratorStream(
248
  self._process_stream,
249
  gen_kwargs={"stream": stream, "stream_name": stream_name},
250
  )
@@ -401,7 +401,7 @@ class InstanceOperatorValidator(InstanceOperator):
401
  try:
402
  first_instance = next(iterator)
403
  except StopIteration as e:
404
- raise StopIteration(f"Stream '{stream_name}' is empty") from e
405
  result = self._process_instance(first_instance, stream_name)
406
  self.validate(result)
407
  yield result
@@ -439,7 +439,7 @@ class InstanceOperatorWithMultiStreamAccess(StreamingOperator):
439
  result = {}
440
 
441
  for stream_name, stream in multi_stream.items():
442
- stream = GeneratorStream(
443
  self.generator,
444
  gen_kwargs={"stream": stream, "multi_stream": multi_stream},
445
  )
 
4
 
5
  from .artifact import Artifact
6
  from .dataclass import InternalField, NonPositionalField
7
+ from .stream import DynamicStream, EmptyStreamError, MultiStream, Stream
8
  from .utils import is_module_available
9
 
10
 
 
170
 
171
 
172
  def stream_single(instance: Dict[str, Any]) -> Stream:
173
+ return DynamicStream(
174
  generator=instance_generator, gen_kwargs={"instance": instance}
175
  )
176
 
 
244
  def _process_single_stream(
245
  self, stream: Stream, stream_name: Optional[str] = None
246
  ) -> Stream:
247
+ return DynamicStream(
248
  self._process_stream,
249
  gen_kwargs={"stream": stream, "stream_name": stream_name},
250
  )
 
401
  try:
402
  first_instance = next(iterator)
403
  except StopIteration as e:
404
+ raise EmptyStreamError(f"Stream '{stream_name}' is empty") from e
405
  result = self._process_instance(first_instance, stream_name)
406
  self.validate(result)
407
  yield result
 
439
  result = {}
440
 
441
  for stream_name, stream in multi_stream.items():
442
+ stream = DynamicStream(
443
  self.generator,
444
  gen_kwargs={"stream": stream, "multi_stream": multi_stream},
445
  )
operators.py CHANGED
@@ -76,7 +76,7 @@ from .operator import (
76
  )
77
  from .random_utils import new_random_generator
78
  from .settings_utils import get_settings
79
- from .stream import GeneratorStream, Stream
80
  from .text_utils import nested_tuple_to_string
81
  from .type_utils import isoftype
82
  from .utils import flatten_dict
@@ -282,6 +282,24 @@ class RemoveFields(InstanceOperator):
282
  return instance
283
 
284
285
  class InstanceFieldOperator(InstanceOperator):
286
  """A general stream instance operator that processes the values of a field (or multiple ones).
287
 
@@ -1007,7 +1025,7 @@ class Perturb(FieldOperator):
1007
  return value
1008
 
1009
 
1010
- class CopyFields(FieldOperator):
1011
  """Copies values from specified fields to specified fields.
1012
 
1013
  Args (of parent class):
@@ -1015,13 +1033,13 @@ class CopyFields(FieldOperator):
1015
 
1016
  Examples:
1017
  An input instance {"a": 2, "b": 3}, when processed by
1018
- CopyField(field_to_field={"a": "b"}
1019
  would yield {"a": 2, "b": 2}, and when processed by
1020
- CopyField(field_to_field={"a": "c"} would yield
1021
  {"a": 2, "b": 3, "c": 2}
1022
 
1023
  with field names containing / , we can also copy inside the field:
1024
- CopyFields(field_to_field={"a/0": "a"})
1025
  would process instance {"a": [1, 3]} into {"a": 1}
1026
 
1027
 
@@ -1031,6 +1049,10 @@ class CopyFields(FieldOperator):
1031
  return copy.deepcopy(value)
1032
 
1033
 
 
 
 
 
1034
  class GetItemByIndex(FieldOperator):
1035
  """Get from the item list by the index in the field."""
1036
 
@@ -1299,6 +1321,52 @@ class FilterByCondition(StreamOperator):
1299
  return True
1300
 
1301
1302
  class ComputeExpressionMixin(Artifact):
1303
  """Computes an expression expressed over fields of an instance.
1304
 
@@ -1774,7 +1842,7 @@ class MergeStreams(MultiStreamOperator):
1774
  def process(self, multi_stream: MultiStream) -> MultiStream:
1775
  return MultiStream(
1776
  {
1777
- self.new_stream_name: GeneratorStream(
1778
  self.merge, gen_kwargs={"multi_stream": multi_stream}
1779
  )
1780
  }
 
76
  )
77
  from .random_utils import new_random_generator
78
  from .settings_utils import get_settings
79
+ from .stream import DynamicStream, Stream
80
  from .text_utils import nested_tuple_to_string
81
  from .type_utils import isoftype
82
  from .utils import flatten_dict
 
282
  return instance
283
 
284
 
285
+ class SelectFields(InstanceOperator):
286
+ """Keep only specified fields from each instance in a stream.
287
+
288
+ Args:
289
+ fields (List[str]): The fields to keep from each instance.
290
+ """
291
+
292
+ fields: List[str]
293
+
294
+ def process(
295
+ self, instance: Dict[str, Any], stream_name: Optional[str] = None
296
+ ) -> Dict[str, Any]:
297
+ new_instance = {}
298
+ for selected_field in self.fields:
299
+ new_instance[selected_field] = instance[selected_field]
300
+ return new_instance
301
+
302
+
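As a quick illustration of the new SelectFields operator's effect on a single instance (plain dictionaries, outside any unitxt pipeline):

```python
# Plain-dict sketch of what SelectFields keeps from one instance.
fields = ["question", "answer"]
instance = {"question": "What is 2+2?", "answer": "4", "metadata": "dropped"}
selected = {name: instance[name] for name in fields}
print(selected)  # {'question': 'What is 2+2?', 'answer': '4'}
```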
303
  class InstanceFieldOperator(InstanceOperator):
304
  """A general stream instance operator that processes the values of a field (or multiple ones).
305
 
 
1025
  return value
1026
 
1027
 
1028
+ class Copy(FieldOperator):
1029
  """Copies values from specified fields to specified fields.
1030
 
1031
  Args (of parent class):
 
1033
 
1034
  Examples:
1035
  An input instance {"a": 2, "b": 3}, when processed by
1036
+ Copy(field_to_field={"a": "b"}
1037
  would yield {"a": 2, "b": 2}, and when processed by
1038
+ Copy(field_to_field={"a": "c"} would yield
1039
  {"a": 2, "b": 3, "c": 2}
1040
 
1041
  with field names containing / , we can also copy inside the field:
1042
+ Copy(field="a/0",to_field="a")
1043
  would process instance {"a": [1, 3]} into {"a": 1}
1044
 
1045
 
 
1049
  return copy.deepcopy(value)
1050
 
1051
 
1052
+ class CopyFields(Copy):
1053
+ pass
1054
+
1055
+
1056
  class GetItemByIndex(FieldOperator):
1057
  """Get from the item list by the index in the field."""
1058
 
 
1321
  return True
1322
 
1323
 
1324
+ class FilterByConditionBasedOnFields(FilterByCondition):
1325
+ """Filters a stream based on a condition between 2 fields values.
1326
+
1327
+ Raises an error if either of the required field names is missing from the input instance.
1328
+
1329
+ Args:
1330
+ values (Dict[str, str]): The field names that the filter operation is based on.
1331
+ condition: the name of the desired condition operator between the specified fields' values. Supported conditions are ("gt", "ge", "lt", "le", "ne", "eq", "in", "not in").
1332
+ error_on_filtered_all (bool, optional): If True, raises an error if all instances are filtered out. Defaults to True.
1333
+
1334
+ Examples:
1335
+ FilterByCondition(values = {"a":"b}, condition = "gt") will yield only instances where field "a" contains a value greater then the value in field "b".
1336
+ FilterByCondition(values = {"a":"b}, condition = "le") will yield only instances where "a"<="b"
1337
+ """
1338
+
1339
+ def _is_required(self, instance: dict) -> bool:
1340
+ for key, value in self.values.items():
1341
+ try:
1342
+ instance_key = dict_get(instance, key)
1343
+ except ValueError as ve:
1344
+ raise ValueError(
1345
+ f"Required filter field ('{key}') in FilterByCondition is not found in {instance}"
1346
+ ) from ve
1347
+ try:
1348
+ instance_value = dict_get(instance, value)
1349
+ except ValueError as ve:
1350
+ raise ValueError(
1351
+ f"Required filter field ('{value}') in FilterByCondition is not found in {instance}"
1352
+ ) from ve
1353
+ if self.condition == "in":
1354
+ if instance_key not in instance_value:
1355
+ return False
1356
+ elif self.condition == "not in":
1357
+ if instance_key in instance_value:
1358
+ return False
1359
+ else:
1360
+ func = self.condition_to_func[self.condition]
1361
+ if func is None:
1362
+ raise ValueError(
1363
+ f"Function not defined for condition '{self.condition}'"
1364
+ )
1365
+ if not func(instance_key, instance_value):
1366
+ return False
1367
+ return True
1368
+
1369
+
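Concretely, the first docstring example above behaves roughly like this (plain dictionaries, no stream machinery):

```python
# Sketch of FilterByConditionBasedOnFields(values={"a": "b"}, condition="gt"):
# keep only instances where the value of field "a" is greater than the value of field "b".
instances = [{"a": 3, "b": 1}, {"a": 1, "b": 5}]
kept = [instance for instance in instances if instance["a"] > instance["b"]]
print(kept)  # [{'a': 3, 'b': 1}]
```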
1370
  class ComputeExpressionMixin(Artifact):
1371
  """Computes an expression expressed over fields of an instance.
1372
 
 
1842
  def process(self, multi_stream: MultiStream) -> MultiStream:
1843
  return MultiStream(
1844
  {
1845
+ self.new_stream_name: DynamicStream(
1846
  self.merge, gen_kwargs={"multi_stream": multi_stream}
1847
  )
1848
  }
processors.py CHANGED
@@ -245,3 +245,11 @@ class LiteralEval(FieldOperator):
245
  if text is None or text == "":
246
  return text
247
  return ast.literal_eval(text.strip())
 
 
 
 
 
 
 
 
 
245
  if text is None or text == "":
246
  return text
247
  return ast.literal_eval(text.strip())
248
+
249
+
250
+ class ExtractSafeUnsafeJudgment(FieldOperator):
251
+ def process_value(self, text: Any) -> Any:
252
+ first_line = str(text).strip().split("\n")[0].lower()
253
+ if first_line == "safe":
254
+ return 1.0
255
+ return 0.0
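The new post-processor simply maps a judge's verdict to a numeric score; in plain Python its behavior is roughly:

```python
# Equivalent standalone function: "safe" on the first line -> 1.0, anything else -> 0.0.
def extract_safe_unsafe(text):
    first_line = str(text).strip().split("\n")[0].lower()
    return 1.0 if first_line == "safe" else 0.0

print(extract_safe_unsafe("Safe\nNo policy was violated."))  # 1.0
print(extract_safe_unsafe("unsafe\nO1: violence"))           # 0.0
```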
settings_utils.py CHANGED
@@ -127,6 +127,7 @@ if Settings.is_uninitilized():
127
  settings.artifactories = None
128
  settings.default_recipe = "standard_recipe"
129
  settings.default_verbosity = "info"
 
130
  settings.remote_metrics = []
131
  settings.test_card_disable = (bool, False)
132
  settings.test_metric_disable = (bool, False)
 
127
  settings.artifactories = None
128
  settings.default_recipe = "standard_recipe"
129
  settings.default_verbosity = "info"
130
+ settings.use_eager_execution = False
131
  settings.remote_metrics = []
132
  settings.test_card_disable = (bool, False)
133
  settings.test_metric_disable = (bool, False)
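A small sketch of toggling the new flag in code; this assumes the usual pattern in this codebase of mutating the shared settings object returned by get_settings() (whether a matching UNITXT_* environment variable is also honored is not shown in this diff):

```python
# Assumption: settings are read and mutated through get_settings(), as done elsewhere here.
from unitxt.settings_utils import get_settings

settings = get_settings()
settings.use_eager_execution = True  # default is False: streams stay lazy (generator-based)
```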
split_utils.py CHANGED
@@ -5,7 +5,7 @@ from typing import Dict
5
  from .generator_utils import ReusableGenerator
6
  from .logging_utils import get_logger
7
  from .random_utils import new_random_generator
8
- from .stream import Stream
9
 
10
  logger = get_logger()
11
 
@@ -140,7 +140,9 @@ def slice_streams(input_streams, mapping):
140
  def generator(new_stream, sources):
141
  for old_stream, slices in sources.items():
142
  if old_stream not in input_streams:
143
- raise ValueError(f"'{old_stream}' is not available in input stream")
 
 
144
  old_stream_content = input_streams[old_stream]
145
  for start, end in slices:
146
  yield from slice_stream(old_stream_content, start, end)
 
5
  from .generator_utils import ReusableGenerator
6
  from .logging_utils import get_logger
7
  from .random_utils import new_random_generator
8
+ from .stream import MissingStreamError, Stream
9
 
10
  logger = get_logger()
11
 
 
140
  def generator(new_stream, sources):
141
  for old_stream, slices in sources.items():
142
  if old_stream not in input_streams:
143
+ raise MissingStreamError(
144
+ f"'{old_stream}' is not available in input streams, but need to slice there from"
145
+ )
146
  old_stream_content = input_streams[old_stream]
147
  for start, end in slices:
148
  yield from slice_stream(old_stream_content, start, end)
splitters.py CHANGED
@@ -1,5 +1,6 @@
1
  import itertools
2
  from abc import abstractmethod
 
3
  from random import Random
4
  from typing import Dict, List
5
 
@@ -13,7 +14,7 @@ from .split_utils import (
13
  rename_split,
14
  slice_streams,
15
  )
16
- from .stream import MultiStream
17
 
18
 
19
  class Splitter(MultiStreamOperator):
@@ -138,8 +139,13 @@ class Sampler(Artifact):
138
  ) -> List[Dict[str, object]]:
139
  if "inputs" not in instance:
140
  raise ValueError(f"'inputs' field is missing from '{instance}'.")
141
-
142
- return list(filter(lambda x: x["inputs"] != instance["inputs"], instances_pool))
 
 
 
 
 
143
 
144
 
145
  class RandomSampler(Sampler):
@@ -282,16 +288,20 @@ class SpreadSplit(InstanceOperatorWithMultiStreamAccess):
282
  ) -> Dict[str, object]:
283
  try:
284
  if self.local_cache is None:
285
- self.local_cache = list(multi_stream[self.source_stream])
286
 
287
  source_stream = self.local_cache
288
  source_stream = self.sampler.filter_source_by_instance(
289
  source_stream, instance
290
  )
 
 
 
 
291
  sampled_instances = self.sampler.sample(source_stream)
292
  instance[self.target_field] = sampled_instances
293
  return instance
294
- except Exception as e:
295
- raise Exception(
296
  f"Unable to fetch instances from '{self.source_stream}' to '{self.target_field}', due to {e.__class__.__name__}: {e}"
297
  ) from e
 
1
  import itertools
2
  from abc import abstractmethod
3
+ from copy import deepcopy
4
  from random import Random
5
  from typing import Dict, List
6
 
 
14
  rename_split,
15
  slice_streams,
16
  )
17
+ from .stream import EmptyStreamError, FaultyStreamError, MultiStream
18
 
19
 
20
  class Splitter(MultiStreamOperator):
 
139
  ) -> List[Dict[str, object]]:
140
  if "inputs" not in instance:
141
  raise ValueError(f"'inputs' field is missing from '{instance}'.")
142
+ # l = list(filter(lambda x: x["inputs"] != instance["inputs"], instances_pool))
143
+ try:
144
+ return [
145
+ item for item in instances_pool if item["inputs"] != instance["inputs"]
146
+ ]
147
+ except Exception as e:
148
+ raise e
149
 
150
 
151
  class RandomSampler(Sampler):
 
288
  ) -> Dict[str, object]:
289
  try:
290
  if self.local_cache is None:
291
+ self.local_cache = deepcopy(list(multi_stream[self.source_stream]))
292
 
293
  source_stream = self.local_cache
294
  source_stream = self.sampler.filter_source_by_instance(
295
  source_stream, instance
296
  )
297
+ if len(source_stream) < self.sampler.sample_size:
298
+ raise ValueError(
299
+ f"Size of population to sample from: {len(source_stream)} is smaller than the needed sample_size: {self.sampler.sample_size}."
300
+ )
301
  sampled_instances = self.sampler.sample(source_stream)
302
  instance[self.target_field] = sampled_instances
303
  return instance
304
+ except FaultyStreamError as e:
305
+ raise EmptyStreamError(
306
  f"Unable to fetch instances from '{self.source_stream}' to '{self.target_field}', due to {e.__class__.__name__}: {e}"
307
  ) from e
stream.py CHANGED
@@ -1,11 +1,19 @@
1
  import tempfile
 
 
2
  from abc import abstractmethod
3
- from typing import Any, Callable, Dict, Iterable, List
 
4
 
5
  from datasets import Dataset, DatasetDict, IterableDataset, IterableDatasetDict
6
 
7
  from .dataclass import Dataclass, OptionalField
8
  from .generator_utils import CopyingReusableGenerator, ReusableGenerator
 
 
 
 
 
9
 
10
 
11
  class Stream(Dataclass):
@@ -21,22 +29,32 @@ class Stream(Dataclass):
21
  def take(self, n):
22
  pass
23
 
 
 
 
 
24
 
25
  class ListStream(Stream):
26
  instances_list: List[Dict[str, Any]]
 
27
 
28
  def __iter__(self):
 
 
29
  return iter(self.instances_list)
30
 
31
  def peek(self):
32
- return next(iter(self.instances_list))
33
 
34
- def take(self, n):
35
  for i, instance in enumerate(self.instances_list):
36
  if i >= n:
37
  break
38
  yield instance
39
 
 
 
 
40
 
41
  class GeneratorStream(Stream):
42
  """A class for handling streaming data in a customizable way.
@@ -88,6 +106,79 @@ class GeneratorStream(Stream):
88
  break
89
  yield instance
90
91
 
92
  class MultiStream(dict):
93
  """A class for handling multiple streams of data in a dictionary-like format.
@@ -112,7 +203,7 @@ class MultiStream(dict):
112
  isinstance(key, str), "MultiStream keys must be strings"
113
  super().__init__(data)
114
 
115
- def get_generator(self, key):
116
  """Gets a generator for a specified key.
117
 
118
  Args:
@@ -129,7 +220,7 @@ class MultiStream(dict):
129
 
130
  def set_copying(self, copying: bool):
131
  for stream in self.values():
132
- stream.copying = copying
133
 
134
  def to_dataset(self, disable_cache=True, cache_dir=None) -> DatasetDict:
135
  with tempfile.TemporaryDirectory() as dir_to_be_deleted:
@@ -178,7 +269,7 @@ class MultiStream(dict):
178
  assert all(isinstance(v, ReusableGenerator) for v in generators.values())
179
  return cls(
180
  {
181
- key: GeneratorStream(
182
  generator.generator,
183
  gen_kwargs=generator.gen_kwargs,
184
  caching=caching,
@@ -204,7 +295,7 @@ class MultiStream(dict):
204
  """
205
  return cls(
206
  {
207
- key: GeneratorStream(
208
  iterable.__iter__,
209
  caching=caching,
210
  copying=copying,
 
1
  import tempfile
2
+ import traceback
3
+ import warnings
4
  from abc import abstractmethod
5
+ from copy import deepcopy
6
+ from typing import Any, Callable, Dict, Generator, Iterable, List
7
 
8
  from datasets import Dataset, DatasetDict, IterableDataset, IterableDatasetDict
9
 
10
  from .dataclass import Dataclass, OptionalField
11
  from .generator_utils import CopyingReusableGenerator, ReusableGenerator
12
+ from .logging_utils import get_logger
13
+ from .settings_utils import get_settings
14
+
15
+ settings = get_settings()
16
+ logger = get_logger()
17
 
18
 
19
  class Stream(Dataclass):
 
29
  def take(self, n):
30
  pass
31
 
32
+ @abstractmethod
33
+ def set_copying(self, copying: bool):
34
+ pass
35
+
36
 
37
  class ListStream(Stream):
38
  instances_list: List[Dict[str, Any]]
39
+ copying: bool = False
40
 
41
  def __iter__(self):
42
+ if self.copying:
43
+ return iter(deepcopy(self.instances_list))
44
  return iter(self.instances_list)
45
 
46
  def peek(self):
47
+ return next(iter(self))
48
 
49
+ def take(self, n) -> Generator:
50
  for i, instance in enumerate(self.instances_list):
51
  if i >= n:
52
  break
53
  yield instance
54
 
55
+ def set_copying(self, copying: bool):
56
+ self.copying = copying
57
+
58
 
59
  class GeneratorStream(Stream):
60
  """A class for handling streaming data in a customizable way.
 
106
  break
107
  yield instance
108
 
109
+ def set_copying(self, copying: bool):
110
+ self.copying = copying
111
+
112
+
113
+ class FaultyStreamError(Exception):
114
+ """Base class for all stream-related exceptions."""
115
+
116
+ pass
117
+
118
+
119
+ class MissingStreamError(FaultyStreamError):
120
+ """Raised when a required stream is missing."""
121
+
122
+ pass
123
+
124
+
125
+ class EmptyStreamError(FaultyStreamError):
126
+ """Raised when a stream is unexpectedly empty."""
127
+
128
+ pass
129
+
130
+
131
+ def eager_failed():
132
+ traceback.print_exc()
133
+ warnings.warn(
134
+ "The eager execution has failed due to the error above.", stacklevel=2
135
+ )
136
+
137
+
138
+ class DynamicStream(Stream):
139
+ generator: Callable
140
+ gen_kwargs: Dict[str, Any] = OptionalField(default_factory=dict)
141
+ caching: bool = False
142
+ copying: bool = False
143
+
144
+ def __post_init__(self):
145
+ self.stream = None
146
+ if settings.use_eager_execution:
147
+ try:
148
+ instances_list = []
149
+ for instance in self.generator(**self.gen_kwargs):
150
+ instances_list.append(instance)
151
+ self.stream = ListStream(
152
+ instances_list=instances_list, copying=self.copying
153
+ )
154
+ except FaultyStreamError:
155
+ eager_failed()
156
+ except RuntimeError as e:
157
+ if isinstance(e.__cause__, FaultyStreamError):
158
+ eager_failed()
159
+ else:
160
+ raise e
161
+
162
+ if self.stream is None:
163
+ self.stream = GeneratorStream(
164
+ generator=self.generator,
165
+ gen_kwargs=self.gen_kwargs,
166
+ caching=self.caching,
167
+ copying=self.copying,
168
+ )
169
+
170
+ def __iter__(self):
171
+ return self.stream.__iter__()
172
+
173
+ def peek(self):
174
+ return self.stream.peek()
175
+
176
+ def take(self, n):
177
+ return self.stream.take(n)
178
+
179
+ def set_copying(self, copying: bool):
180
+ self.stream.set_copying(copying)
181
+
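For intuition, a rough usage sketch of DynamicStream wrapping a generator, with the default (lazy) settings:

```python
# Sketch only: DynamicStream delegates iteration/peek/take to its underlying stream.
from unitxt.stream import DynamicStream

def my_generator(n):
    for i in range(n):
        yield {"i": i}

stream = DynamicStream(generator=my_generator, gen_kwargs={"n": 3})
print(stream.peek())         # {'i': 0}
print(list(stream.take(2)))  # [{'i': 0}, {'i': 1}]
```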
182
 
183
  class MultiStream(dict):
184
  """A class for handling multiple streams of data in a dictionary-like format.
 
203
  isinstance(key, str), "MultiStream keys must be strings"
204
  super().__init__(data)
205
 
206
+ def get_generator(self, key) -> Generator:
207
  """Gets a generator for a specified key.
208
 
209
  Args:
 
220
 
221
  def set_copying(self, copying: bool):
222
  for stream in self.values():
223
+ stream.set_copying(copying)
224
 
225
  def to_dataset(self, disable_cache=True, cache_dir=None) -> DatasetDict:
226
  with tempfile.TemporaryDirectory() as dir_to_be_deleted:
 
269
  assert all(isinstance(v, ReusableGenerator) for v in generators.values())
270
  return cls(
271
  {
272
+ key: DynamicStream(
273
  generator.generator,
274
  gen_kwargs=generator.gen_kwargs,
275
  caching=caching,
 
295
  """
296
  return cls(
297
  {
298
+ key: DynamicStream(
299
  iterable.__iter__,
300
  caching=caching,
301
  copying=copying,
stream_operators.py ADDED
@@ -0,0 +1,126 @@
1
+ """This section describes unitxt operators.
2
+
3
+ Operators: Building Blocks of Unitxt Processing Pipelines
4
+ ==============================================================
5
+
6
+ Within the Unitxt framework, operators serve as the foundational elements used to assemble processing pipelines.
7
+ Each operator is designed to perform specific manipulations on dictionary structures within a stream.
8
+ These operators are callable entities that receive a MultiStream as input.
9
+ The output is a MultiStream, augmented with the operator's manipulations, which are then systematically applied to each instance in the stream when pulled.
10
+
11
+ Creating Custom Operators
12
+ -------------------------------
13
+ To enhance the functionality of Unitxt, users are encouraged to develop custom operators.
14
+ This can be achieved by inheriting from any of the existing operators listed below or from one of the fundamental :class:`base operators<unitxt.operator>`.
15
+ The primary task in any operator development is to implement the `process` function, which defines the unique manipulations the operator will perform.
16
+
17
+ General or Specialized Operators
18
+ --------------------------------
19
+ Some operators are specialized for specific tasks, such as:
20
+
21
+ - :class:`loaders<unitxt.loaders>` for loading data.
22
+ - :class:`splitters<unitxt.splitters>` for fixing data splits.
23
+ - :class:`struct_data_operators<unitxt.struct_data_operators>` for structured data operators.
24
+
25
+ Other specialized operators are used by Unitxt internally:
26
+
27
+ - :class:`templates<unitxt.templates>` for verbalizing data examples.
28
+ - :class:`formats<unitxt.formats>` for preparing data for models.
29
+
30
+ The rest of this section is dedicated to operators that operate on streams.
31
+
32
+ """
33
+
34
+ from typing import (
35
+ List,
36
+ Literal,
37
+ Optional,
38
+ )
39
+
40
+ import pandas as pd
41
+
42
+ from .operator import (
43
+ MultiStream,
44
+ MultiStreamOperator,
45
+ )
46
+ from .settings_utils import get_settings
47
+ from .stream import ListStream
48
+
49
+ settings = get_settings()
50
+
51
+
52
+ class JoinStreams(MultiStreamOperator):
53
+ """Join multiple streams into a single stream.
54
+
55
+ Args:
56
+ left_stream (str): The stream that will be considered the "left" in the join operations.
57
+ right_stream (str): The stream that will be considered the "right" in the join operations.
58
+ how (Literal["left", "right", "inner", "outer", "cross"]): The type of join to be performed.
59
+ on (Optional[List[str]]): Column names to join on. These must be found in both streams.
60
+ left_on (Optional[List[str]]): Column names to join on in the left stream.
61
+ right_on (Optional[List[str]]): Column names to join on in the right stream.
62
+ new_stream_name (str): The name of the new stream resulting from the merge.
63
+
64
+ Examples:
65
+ JoinStreams(left_stream = "questions", right_stream = "answers", how="inner", on="question_id", new_stream_name="question_with_answers" ) Join the 'question' and 'answer' stream based on the 'question_id' field using inner join, resulting with a new stream named "question_with_answers".
66
+ JoinStreams(left_stream = "questions", right_stream = "answers", how="inner", on_left="question_id", on_right="question" new_stream_name="question_with_answers" ) Join the 'question' and 'answer' stream based on the 'question_id' field in the left stream and the 'question' field in the right stream, using inner join, resulting with a new stream named "question_with_answers". This is suitable when the fields have different labels across the streams.
67
+ """
68
+
69
+ left_stream: str
70
+ right_stream: str
71
+ how: Literal["left", "right", "inner", "outer", "cross"]
72
+ on: Optional[List[str]] = None
73
+ left_on: Optional[List[str]] = None
74
+ right_on: Optional[List[str]] = None
75
+ new_stream_name: str
76
+
77
+ def merge(self, multi_stream) -> List:
78
+ assert self.right_stream in multi_stream and self.left_stream in multi_stream
79
+ stream_dict = dict(multi_stream.items())
80
+ left_stream = list(stream_dict[self.left_stream])
81
+ right_stream = list(stream_dict[self.right_stream])
82
+ left_stream_df = pd.DataFrame(left_stream)
83
+ right_stream_df = pd.DataFrame(right_stream)
84
+
85
+ # Remove common col we don't join on, so we don't have unexpected column (standard behavior is to add a suffix)
86
+ common_cols = set(left_stream_df.columns).intersection(
87
+ set(right_stream_df.columns)
88
+ )
89
+ on = self.on if self.on is not None else []
90
+ left_on = self.left_on if self.left_on is not None else []
91
+ right_on = self.right_on if self.right_on is not None else []
92
+ on_cols = set(on + left_on + right_on)
93
+ col_to_remove = list(common_cols - on_cols)
94
+ left_stream_df = left_stream_df.drop(columns=col_to_remove, errors="ignore")
95
+ right_stream_df = right_stream_df.drop(columns=col_to_remove, errors="ignore")
96
+
97
+ merged_df = pd.merge(
98
+ left_stream_df,
99
+ right_stream_df,
100
+ how=self.how,
101
+ on=self.on,
102
+ left_on=self.left_on,
103
+ right_on=self.right_on,
104
+ )
105
+ return merged_df.to_dict(orient="records")
106
+
107
+ def process(self, multi_stream: MultiStream) -> MultiStream:
108
+ merged_records = self.merge(multi_stream)
109
+ multi_stream[self.new_stream_name] = ListStream(instances_list=merged_records)
110
+ return multi_stream
111
+
112
+
113
+ class DeleteSplits(MultiStreamOperator):
114
+ """Operator which delete splits in stream.
115
+
116
+ Attributes:
117
+ splits (List[str]): The splits to delete from the stream.
118
+ """
119
+
120
+ splits: List[str]
121
+
122
+ def process(self, multi_stream: MultiStream) -> MultiStream:
123
+ generators = {
124
+ key: val for key, val in multi_stream.items() if key not in self.splits
125
+ }
126
+ return MultiStream(generators)
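For reference, JoinStreams above is essentially a thin wrapper around pandas.merge; the following standalone sketch shows the same join on illustrative records:

```python
# Illustrative records only; this mirrors the pd.merge call inside JoinStreams.merge.
import pandas as pd

questions = pd.DataFrame([{"question_id": 1, "question": "What is 2+2?"}])
answers = pd.DataFrame([{"question_id": 1, "answer": "4"}])

merged = pd.merge(questions, answers, how="inner", on="question_id")
print(merged.to_dict(orient="records"))
# [{'question_id': 1, 'question': 'What is 2+2?', 'answer': '4'}]
```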
struct_data_operators.py CHANGED
@@ -566,3 +566,42 @@ class LoadJson(FieldOperator):
566
  class DumpJson(FieldOperator):
567
  def process_value(self, value: str) -> str:
568
  return json.dumps(value)
566
  class DumpJson(FieldOperator):
567
  def process_value(self, value: str) -> str:
568
  return json.dumps(value)
569
+
570
+
571
+ class MapHTMLTableToJSON(FieldOperator):
572
+ """Converts HTML table format to the basic one (JSON).
573
+
574
+ JSON format
575
+ {
576
+ "header": ["col1", "col2"],
577
+ "rows": [["row11", "row12"], ["row21", "row22"], ["row31", "row32"]]
578
+ }
579
+ """
580
+
581
+ _requirements_list = ["bs4"]
582
+
583
+ def process_value(self, table: Any) -> Any:
584
+ return self.truncate_table_rows(table_content=table)
585
+
586
+ def truncate_table_rows(self, table_content: str) -> Dict:
587
+ from bs4 import BeautifulSoup
588
+
589
+ soup = BeautifulSoup(table_content, "html.parser")
590
+
591
+ # Extract header
592
+ header = []
593
+ header_cells = soup.find("thead").find_all("th")
594
+ for cell in header_cells:
595
+ header.append(cell.get_text())
596
+
597
+ # Extract rows
598
+ rows = []
599
+ for row in soup.find("tbody").find_all("tr"):
600
+ row_data = []
601
+ for cell in row.find_all("td"):
602
+ row_data.append(cell.get_text())
603
+ rows.append(row_data)
604
+
605
+ # return dictionary
606
+
607
+ return {"header": header, "rows": rows}
text_utils.py CHANGED
@@ -135,8 +135,8 @@ def is_made_of_sub_strings(string, sub_strings):
135
  return bool(re.match(pattern, string))
136
 
137
 
138
- # Giveמ all the lines of a file, e.g. all the lines of prepare/cards/cohere_for_ai.py,
139
- # and an object name, e.g. TaskCard,
140
  # return the ordinal number of the line that starts that object, in our example: the
141
  # line number of the following line (notice that the line where TaskCard is imported
142
  # is not supposed to return):
@@ -145,10 +145,12 @@ def is_made_of_sub_strings(string, sub_strings):
145
  # the matching close:
146
  # )
147
  # This util depends on ruff to ensure this setting of the card file: that a close of one
148
- # tag and the open of the next tag, do not sit in same line, both tags being
149
- # major level within TaskCard
 
 
150
  # flake8: noqa: B007
151
- def lines_defining_obj(
152
  all_lines: List[str], obj_name: str, start_search_at_line: int = 0
153
  ) -> Tuple[int, int]:
154
  for starting_line in range(start_search_at_line, len(all_lines)):
@@ -160,11 +162,28 @@ def lines_defining_obj(
160
  return (-1, -1)
161
  num_of_opens = 0
162
  num_of_closes = 0
163
- for ending_line in range(starting_line, len(all_lines)):
 
 
164
  num_of_opens += len(re.findall(r"[({[]", all_lines[ending_line]))
165
  num_of_closes += len(re.findall(r"[)}\]]", all_lines[ending_line]))
166
  if num_of_closes == num_of_opens:
167
  break
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
168
 
169
  if num_of_closes != num_of_opens:
170
  raise ValueError(
 
135
  return bool(re.match(pattern, string))
136
 
137
 
138
+ # Given all the lines of a card preparer file, e.g. all the lines of prepare/cards/cohere_for_ai.py,
139
+ # and an object name, e.g. TaskCard(,
140
  # return the ordinal number of the line that starts that object, in our example: the
141
  # line number of the following line (notice that the line where TaskCard is imported
142
  # is not supposed to return):
 
145
  # the matching close:
146
  # )
147
  # This util depends on ruff to ensure this setting of the card file: that a close of one
148
+ # tag and the open of the next tag do not sit in the same line, when both tags are
149
+ # at major level within TaskCard.
150
+ # It also prepares for the case that the __description__ tag does not contain balanced
151
+ # parentheses, since it is often cut in the middle (with "... see more at").
152
  # flake8: noqa: B007
153
+ def lines_defining_obj_in_card(
154
  all_lines: List[str], obj_name: str, start_search_at_line: int = 0
155
  ) -> Tuple[int, int]:
156
  for starting_line in range(start_search_at_line, len(all_lines)):
 
162
  return (-1, -1)
163
  num_of_opens = 0
164
  num_of_closes = 0
165
+ ending_line = starting_line - 1
166
+ while ending_line < len(all_lines):
167
+ ending_line += 1
168
  num_of_opens += len(re.findall(r"[({[]", all_lines[ending_line]))
169
  num_of_closes += len(re.findall(r"[)}\]]", all_lines[ending_line]))
170
  if num_of_closes == num_of_opens:
171
  break
172
+ if "__description__" in all_lines[ending_line]:
173
+ # can not trust parentheses inside description.
174
+ # trust the indentation enforced by ruff, and the way we build __description__:
175
+ # a line consisting of only __description__=(
176
+ # followed by one or more lines of text, can not trust opens and closes
177
+ # in them, followed by a line consisting of only: ),
178
+ # where the ) is indented with the beginning of __description__
179
+ tag_indentation = all_lines[ending_line].index("__description__")
180
+ last_line_to_start_with = (" " * tag_indentation) + ")"
181
+ while not all_lines[ending_line].startswith(last_line_to_start_with):
182
+ ending_line += 1
183
+ if "__description__" in obj_name:
184
+ return (starting_line, ending_line)
185
+ num_of_closes += 1 # for this last line of desc
186
+ # continue to the line following the end of description
187
 
188
  if num_of_closes != num_of_opens:
189
  raise ValueError(