Update Space (evaluate main: 828c6327)
Browse files- README.md +57 -12
- app.py +6 -0
- requirements.txt +3 -0
- word_length.py +78 -0
README.md
CHANGED
@@ -1,12 +1,57 @@
|
|
1 |
-
|
2 |
-
|
3 |
-
|
4 |
-
|
5 |
-
|
6 |
-
|
7 |
-
|
8 |
-
|
9 |
-
|
10 |
-
|
11 |
-
|
12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# Measurement Card for Word Length
|
2 |
+
|
3 |
+
|
4 |
+
## Metric Description
|
5 |
+
|
6 |
+
The `word_length` measurement returns the word count of the input string, based on tokenization using [NLTK word_tokenize](https://www.nltk.org/api/nltk.tokenize.html).
|
7 |
+
|
8 |
+
## How to Use
|
9 |
+
|
10 |
+
This measurement requires a list of strings as input:
|
11 |
+
|
12 |
+
```python
|
13 |
+
>>> data = ["hello world"]
|
14 |
+
>>> wordlength = evaluate.load("word_length", type="measurement")
|
15 |
+
>>> results = wordlength.compute(data=data)
|
16 |
+
```
|
17 |
+
|
18 |
+
### Inputs
|
19 |
+
- **data** (list of `str`): The input list of strings for which the word length is calculated.
|
20 |
+
- **tokenizer** (`Callable`) : approach used for tokenizing `data` (optional). The default tokenizer is [NLTK's `word_tokenize`](https://www.nltk.org/api/nltk.tokenize.html). This can be replaced by any function that takes a string as input and returns a list of tokens as output.
|
21 |
+
|
22 |
+
### Output Values
|
23 |
+
- **average_word_length**(`float`): the average number of words in the input string(s).
|
24 |
+
|
25 |
+
Output Example(s):
|
26 |
+
|
27 |
+
```python
|
28 |
+
{"average_word_length": 245}
|
29 |
+
```
|
30 |
+
|
31 |
+
This measurement outputs a dictionary containing the average number of words across the input string(s) (`average_word_length`).
|
32 |
+
|
33 |
+
### Examples
|
34 |
+
|
35 |
+
Example for a single string
|
36 |
+
|
37 |
+
```python
|
38 |
+
>>> data = ["hello sun and goodbye moon"]
|
39 |
+
>>> wordlength = evaluate.load("word_length", type="measurement")
|
40 |
+
>>> results = wordlength.compute(data=data)
|
41 |
+
>>> print(results)
|
42 |
+
{'average_word_length': 5}
|
43 |
+
```
|
44 |
+
|
45 |
+
Example for multiple strings
|
46 |
+
```python
|
47 |
+
>>> data = ["hello sun and goodbye moon", "foo bar foo bar"]
|
48 |
+
>>> wordlength = evaluate.load("word_length", type="measurement")
|
49 |
+
>>> results = wordlength.compute(data=data)
|
50 |
+
{'average_word_length': 4.5}
|
51 |
+
```
|
52 |
+
|
53 |
+
## Citation(s)
|
54 |
+
|
55 |
+
|
56 |
+
## Further References
|
57 |
+
- [NLTK's `word_tokenize`](https://www.nltk.org/api/nltk.tokenize.html)
|
app.py
ADDED
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
import evaluate
from evaluate.utils import launch_gradio_widget


# Space entry point: load the local `word_length` measurement module and
# expose it through evaluate's standard Gradio widget so the Space serves
# an interactive demo of the metric.
module = evaluate.load("word_length", type="measurement")
launch_gradio_widget(module)
requirements.txt
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
git+https://github.com/huggingface/evaluate.git@main
|
2 |
+
datasets~=2.0
|
3 |
+
nltk~=3.7
|
word_length.py
ADDED
@@ -0,0 +1,78 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# Copyright 2022 The HuggingFace Team. All rights reserved.
|
2 |
+
#
|
3 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
4 |
+
# you may not use this file except in compliance with the License.
|
5 |
+
# You may obtain a copy of the License at
|
6 |
+
#
|
7 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
8 |
+
#
|
9 |
+
# Unless required by applicable law or agreed to in writing, software
|
10 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
11 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
12 |
+
# See the License for the specific language governing permissions and
|
13 |
+
# limitations under the License.
|
14 |
+
|
15 |
+
from nltk import word_tokenize
|
16 |
+
import evaluate
|
17 |
+
import datasets
|
18 |
+
from statistics import mean
|
19 |
+
|
20 |
+
|
21 |
+
# Short description shown on the module page; also prepended to the
# WordLength class docstring via the add_start_docstrings decorator below.
_DESCRIPTION = """
Returns the average length (in terms of the number of words) of the input data.
"""

# Usage documentation (arguments, return value, doctest-style example);
# surfaced by evaluate as the module's `inputs_description`.
_KWARGS_DESCRIPTION = """
Args:
    `data`: a list of `str` for which the word length is calculated.
    `tokenizer` (`Callable`) : the approach used for tokenizing `data` (optional).
        The default tokenizer is `word_tokenize` from NLTK: https://www.nltk.org/api/nltk.tokenize.html
        This can be replaced by any function that takes a string as input and returns a list of tokens as output.

Returns:
    `average_word_length` (`float`) : the average number of words in the input list of strings.

Examples:
    >>> data = ["hello world"]
    >>> wordlength = evaluate.load("word_length", type="measurement")
    >>> results = wordlength.compute(data=data)
    >>> print(results)
    {'average_word_length': 2}
"""

# TODO: Add BibTeX citation
# Placeholder citation from the module template — replace before release.
_CITATION = """\
@InProceedings{huggingface:module,
title = {A great new module},
authors={huggingface, Inc.},
year={2020}
}
"""
|
51 |
+
|
52 |
+
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class WordLength(evaluate.EvaluationModule):
    """Measurement returning the average number of words (tokens) in the input string(s)."""

    def _info(self):
        # Declares module metadata and the expected input schema: a single
        # string column named "data".
        return evaluate.EvaluationModuleInfo(
            type="measurement",
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "data": datasets.Value("string"),
                }
            ),
        )

    def _download_and_prepare(self, dl_manager):
        # The default tokenizer (nltk.word_tokenize) needs the "punkt"
        # tokenizer models available locally at runtime.
        import nltk

        nltk.download("punkt")

    def _compute(self, data, tokenizer=word_tokenize):
        """Return the average word length of the input data.

        Args:
            data: list of `str` to measure.
            tokenizer: callable mapping a string to a list of tokens
                (defaults to NLTK's `word_tokenize`).

        Returns:
            dict with key `average_word_length` (`float`): the mean number
            of tokens per input string.

        Raises:
            ValueError: if `data` is empty — the average is undefined.
        """
        if not data:
            # statistics.mean([]) would raise an opaque StatisticsError;
            # fail early with a clear message instead.
            raise ValueError("`data` must contain at least one string.")
        lengths = [len(tokenizer(d)) for d in data]
        average_length = mean(lengths)
        return {"average_word_length": average_length}