--- title: Word Count emoji: 🤗 colorFrom: green colorTo: purple sdk: gradio sdk_version: 3.0.2 app_file: app.py pinned: false tags: - evaluate - measurement description: >- Returns the total number of words, and the number of unique words in the input data. --- # Measurement Card for Word Count ## Measurement Description The `word_count` measurement returns the total number of word count of the input string, using the sklearn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) ## How to Use This measurement requires a list of strings as input: ```python >>> data = ["hello world and hello moon"] >>> wordcount= evaluate.load("word_count") >>> results = wordcount.compute(data=data) ``` ### Inputs - **data** (list of `str`): The input list of strings for which the word length is calculated. - **max_vocab** (`int`): (optional) the top number of words to consider (can be specified if dataset is too large) ### Output Values - **total_word_count** (`int`): the total number of words in the input string(s). - **unique_words** (`int`): the number of unique words in the input string(s). Output Example(s): ```python {'total_word_count': 5, 'unique_words': 4} ### Examples Example for a single string ```python >>> data = ["hello sun and goodbye moon"] >>> wordcount = evaluate.load("word_count") >>> results = wordcount.compute(data=data) >>> print(results) {'total_word_count': 5, 'unique_words': 5} ``` Example for a multiple strings ```python >>> data = ["hello sun and goodbye moon", "foo bar foo bar"] >>> wordcount = evaluate.load("word_count") >>> results = wordcount.compute(data=data) >>> print(results) {'total_word_count': 9, 'unique_words': 7} ``` Example for a dataset from 🤗 Datasets: ```python >>> imdb = datasets.load_dataset('imdb', split = 'train') >>> wordcount = evaluate.load("word_count") >>> results = wordcount.compute(data=imdb['text']) >>> print(results) {'total_word_count': 5678573, 'unique_words': 74849} ``` ## Citation(s) ## Further References - [Sklearn `CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)