Spaces: edugp/perplexity-lenses
edugp committed on Nov 9, 2021

Commit d131aa3 • 1 Parent(s): 6d1a001

Add tests and fix sentence sampling: after splitting documents into sentences, cap the sample at the minimum of the sample size and the total number of sentences, rather than the total number of original documents

Files changed (4)

README.md CHANGED

@@ -20,7 +20,7 @@ The app is hosted [here](https://huggingface.co/spaces/edugp/perplexity-lenses).
 python -m streamlit run app.py
 ```
-# CLI
+# CLI:
 The CLI with no arguments defaults to running mc4 in Spanish.
 For full usage:
 ```
@@ -40,3 +40,7 @@ python cli.py \
     --model-name distiluse-base-multilingual-cased-v1 \
     --output-file perplexity.html
 ```
+# Tests:
+```
+python -m unittest discover -s ./tests/ -p "test_*.py"
+```

perplexity_lenses/data.py CHANGED

@@ -40,4 +40,4 @@ def hub_dataset_to_dataframe(
 def documents_df_to_sentences_df(df: pd.DataFrame, text_column: str, sample: int, seed: int = 0):
     df_sentences = pd.DataFrame({text_column: np.array(df[text_column].map(lambda x: x.split("\n")).values.tolist()).flatten()})
-    return df_sentences.sample(min(sample, df.shape[0]), random_state=seed)
+    return df_sentences.sample(min(sample, df_sentences.shape[0]), random_state=seed)
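The effect of the change to `documents_df_to_sentences_df` can be illustrated with a minimal, standard-library-only sketch. This is not the repository's code: `split_and_sample` and its `cap_by_sentences` flag are hypothetical names introduced here, and plain lists with `random.Random` stand in for the DataFrame and `DataFrame.sample`. It only shows how capping by document count undersamples compared with capping by sentence count.

```python
import random


def split_and_sample(documents, sample, seed=0, cap_by_sentences=True):
    """Split documents on newlines and sample up to `sample` sentences.

    Hypothetical stand-in for documents_df_to_sentences_df; plain lists and
    random.Random replace the pandas DataFrame and DataFrame.sample.
    """
    sentences = [s for doc in documents for s in doc.split("\n")]
    # Old behavior capped by len(documents); the fix caps by len(sentences).
    limit = len(sentences) if cap_by_sentences else len(documents)
    return random.Random(seed).sample(sentences, min(sample, limit))


docs = ["foo\nbar"]  # one document, two sentences
before = split_and_sample(docs, 100, cap_by_sentences=False)  # capped at 1
after = split_and_sample(docs, 100, cap_by_sentences=True)    # capped at 2
```

With one document containing two sentences and a sample size of 100, the old cap yields a single sentence while the new cap yields both, which matches the case exercised by the added test.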

tests/__init__.py ADDED

File without changes

tests/test_data.py ADDED

+import unittest
+import pandas as pd
+from perplexity_lenses.data import documents_df_to_sentences_df
+class TestData(unittest.TestCase):
+    def test_documents_df_to_sentences_df(self):
+        input_df = pd.DataFrame({"text": ["foo\nbar"]})
+        expected_output_df = pd.DataFrame({"text": ["foo", "bar"]})
+        output_df = documents_df_to_sentences_df(input_df, "text", 100)
+        pd.testing.assert_frame_equal(output_df, expected_output_df, check_like=True, check_exact=True)