metadata

title: Word Count
emoji: 🤗
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
  - evaluate
  - measurement
description: >-
  Returns the total number of words, and the number of unique words in the input
  data.

Measurement Card for Word Count

Measurement Description

The word_count measurement returns the total number of words and the number of unique words in the input string(s), using scikit-learn's CountVectorizer.
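To make the two counts concrete, here is a minimal standard-library sketch that approximates what CountVectorizer does with its default settings (lowercasing, and keeping only tokens of two or more word characters). The helper name `approximate_word_count` is hypothetical and is not part of the measurement; the real implementation may differ in detail.

```python
import re
from collections import Counter

# Approximates CountVectorizer's defaults: lowercase the text and keep
# tokens of two or more word characters (single characters are dropped).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def approximate_word_count(data):
    """Hypothetical helper: count total and unique tokens across all strings."""
    counts = Counter()
    for text in data:
        counts.update(TOKEN_PATTERN.findall(text.lower()))
    return {
        "total_word_count": sum(counts.values()),  # every token occurrence
        "unique_words": len(counts),               # distinct tokens only
    }


print(approximate_word_count(["hello world and hello moon"]))
# {'total_word_count': 5, 'unique_words': 4}
```

Because one-character tokens such as "a" or "I" are dropped under these defaults, the totals can be lower than a naive whitespace split would give.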

How to Use

This measurement requires a list of strings as input:

>>> import evaluate
>>> data = ["hello world and hello moon"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)

Inputs

data (list of str): The input list of strings for which the word count is calculated.
max_vocab (int): (optional) the number of top words to consider; can be specified if the dataset is too large.
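The effect of capping the vocabulary can be sketched as follows. This assumes that max_vocab keeps only the most frequent words, in the way CountVectorizer's `max_features` option does; that mapping is an assumption, not something this card states. The helper `capped_word_count` is hypothetical and uses only the standard library.

```python
import re
from collections import Counter


def capped_word_count(data, max_vocab=None):
    """Hypothetical helper: count words, keeping only the max_vocab most frequent."""
    counts = Counter()
    for text in data:
        # Same tokenization assumption as CountVectorizer's defaults:
        # lowercase, tokens of two or more word characters.
        counts.update(re.findall(r"\b\w\w+\b", text.lower()))
    # most_common(None) returns every word, so the cap is optional.
    kept = counts.most_common(max_vocab)
    return {
        "total_word_count": sum(count for _, count in kept),
        "unique_words": len(kept),
    }


data = ["hello sun and goodbye moon", "foo bar foo bar"]
print(capped_word_count(data))               # all 7 distinct words are kept
print(capped_word_count(data, max_vocab=2))  # only "foo" and "bar" survive
```

Under this assumption, a cap reduces both values: words outside the kept vocabulary no longer contribute to the total.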

Output Values

total_word_count (int): the total number of words in the input string(s).
unique_words (int): the number of unique words in the input string(s).

Output Example(s):

{'total_word_count': 5, 'unique_words': 4}


### Examples

Example for a single string:

```python
>>> import evaluate
>>> data = ["hello sun and goodbye moon"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)
>>> print(results)
{'total_word_count': 5, 'unique_words': 5}
```

Example for multiple strings:

```python
>>> data = ["hello sun and goodbye moon", "foo bar foo bar"]
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=data)
>>> print(results)
{'total_word_count': 9, 'unique_words': 7}
```

Example for a dataset from 🤗 Datasets:

```python
>>> import datasets
>>> imdb = datasets.load_dataset('imdb', split='train')
>>> wordcount = evaluate.load("word_count")
>>> results = wordcount.compute(data=imdb['text'])
>>> print(results)
{'total_word_count': 5678573, 'unique_words': 74849}
```

Citation(s)

Further References

Sklearn CountVectorizer