Descriptions
dominguesm committed
Commit 30ae784 • 1 Parent(s): d54fea0
- descriptions/description_category_selection.md +1 -0
- descriptions/description_parameter_grid.md +1 -0
- descriptions/description_part1.md +1 -0
- descriptions/description_part2.md +1 -0
- descriptions/parameter_grid/alpha.md +1 -0
- descriptions/parameter_grid/max_df.md +1 -0
- descriptions/parameter_grid/min_df.md +1 -0
- descriptions/parameter_grid/ngram_range.md +1 -0
- descriptions/parameter_grid/norm.md +1 -0
descriptions/description_category_selection.md
ADDED
@@ -0,0 +1 @@
+The task of text classification is easier when there is little overlap between the characteristic terms of different topics, because the terms unique to each topic provide a solid basis for assigning a document to its category. Conversely, terms shared across topics make those topics harder to tell apart. Careful selection of characteristic terms for each topic is therefore crucial to accurate text classification.
descriptions/description_parameter_grid.md
ADDED
@@ -0,0 +1 @@
+We define a grid of hyperparameters to be explored by [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html). Using a [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) instead would explore all possible combinations on the grid, which can be costly to compute, whereas the `n_iter` parameter of `RandomizedSearchCV` controls the number of different random combinations that are evaluated. Notice that setting `n_iter` larger than the number of possible combinations in the grid would lead to repeating already-explored combinations. We search for the best parameter combination for both the feature extraction (`vect__`) and the classifier (`clf__`).
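To make the cost difference concrete, here is a minimal standard-library sketch of the two search strategies. It is not scikit-learn's implementation: the grid values below are hypothetical, and the sketch simply caps the random draw at the grid size, which is a simplifying assumption.

```python
import itertools
import random

# Hypothetical grid for illustration only; these values are assumptions,
# not necessarily the ones used in the example this text describes.
parameter_grid = {
    "vect__max_df": (0.2, 0.4, 0.6, 0.8, 1.0),
    "vect__min_df": (1, 3, 5, 10),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "vect__norm": ("l1", "l2"),
    "clf__alpha": (1e-3, 1e-2, 1e-1, 1.0),
}

names = list(parameter_grid)

# Exhaustive search: every combination on the grid (a Cartesian product).
all_combinations = [
    dict(zip(names, values))
    for values in itertools.product(*parameter_grid.values())
]
grid_size = len(all_combinations)

# Randomized search: evaluate only n_iter distinct combinations.
n_iter = 40
rng = random.Random(0)  # fixed seed so the draw is reproducible
sampled = rng.sample(all_combinations, k=min(n_iter, grid_size))

print(f"grid size: {grid_size}, combinations evaluated: {len(sampled)}")
```

With this grid the exhaustive search would fit 5 × 4 × 2 × 2 × 4 = 320 candidates per cross-validation fold, while the randomized search fits only 40.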
descriptions/description_part1.md
ADDED
@@ -0,0 +1 @@
+The dataset used in this example is [The 20 newsgroups text dataset](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset), which will be automatically downloaded, cached, and reused for the document classification example.
descriptions/description_part2.md
ADDED
@@ -0,0 +1 @@
+In this example, we tune the hyperparameters of a particular classifier using [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV). For a demo of the performance of some other classifiers, see the [Classification of text documents using sparse features](https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py) notebook.
descriptions/parameter_grid/alpha.md
ADDED
@@ -0,0 +1 @@
+The value of "alpha" adds a constant amount to the occurrence counters of features, ensuring that even unobserved feature values have a non-zero probability. Smaller values of "alpha" result in weaker smoothing, while larger values increase the level of smoothing. The default value is 1.0, which applies Laplace smoothing, but it can be adjusted based on the model's requirements.
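The effect of additive smoothing can be sketched in a few lines of plain Python. The helper `smoothed_probabilities` and the toy vocabulary are hypothetical and only illustrate the arithmetic; they are not the classifier's actual code.

```python
from collections import Counter


def smoothed_probabilities(tokens, vocabulary, alpha=1.0):
    """Estimate P(term | class) with additive smoothing.

    Each term's count is increased by `alpha`, so a term never seen in the
    class still receives a non-zero probability.
    """
    counts = Counter(tokens)
    total = sum(counts[term] for term in vocabulary) + alpha * len(vocabulary)
    return {term: (counts[term] + alpha) / total for term in vocabulary}


vocabulary = ["ball", "game", "team", "vote"]
class_tokens = ["ball", "game", "game", "team"]  # "vote" is never observed

for alpha in (0.01, 1.0, 10.0):
    probs = smoothed_probabilities(class_tokens, vocabulary, alpha)
    print(f"alpha={alpha}: P(game)={probs['game']:.3f}, P(vote)={probs['vote']:.3f}")
```

As "alpha" grows, the estimates move toward a uniform distribution over the vocabulary; as it shrinks, they approach the raw relative frequencies.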
descriptions/parameter_grid/max_df.md
ADDED
@@ -0,0 +1 @@
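+The "max_df" parameter of TfidfVectorizer in scikit-learn sets an upper limit on document frequency, that is, on the number of documents a term appears in (not on how often it occurs within a single document). Terms whose document frequency is higher than the specified value are ignored when the vocabulary is built. A float in the range [0.0, 1.0] is interpreted as a proportion of documents, and an integer as an absolute document count.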
descriptions/parameter_grid/min_df.md
ADDED
@@ -0,0 +1 @@
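+The "min_df" parameter of TfidfVectorizer in scikit-learn sets a lower limit on document frequency, that is, on the number of documents a term appears in (not on how often it occurs within a single document). Terms whose document frequency is lower than the specified value are ignored when the vocabulary is built. A float in the range [0.0, 1.0] is interpreted as a proportion of documents, and an integer as an absolute document count.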
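The following standard-library sketch illustrates how document-frequency thresholds such as "min_df" and "max_df" prune a vocabulary. The helpers `document_frequency` and `filter_vocabulary` and the toy corpus are hypothetical; they mimic the idea and are not TfidfVectorizer's implementation.

```python
def document_frequency(tokenized_docs):
    """Count, for each term, the number of documents that contain it."""
    df = {}
    for doc in tokenized_docs:
        for term in set(doc):  # each term counts once per document
            df[term] = df.get(term, 0) + 1
    return df


def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Floats are read as a proportion of documents, ints as absolute counts.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    df = document_frequency(tokenized_docs)
    return sorted(term for term, count in df.items() if low <= count <= high)


docs = [
    "the game was the best game".split(),
    "the team won".split(),
    "the vote was close".split(),
    "the team lost the vote".split(),
]

# Keep terms found in at least 2 documents and in at most 75% of them.
kept = filter_vocabulary(docs, min_df=2, max_df=0.75)
print(kept)
```

Note that "game" occurs twice in the first document but in only one document overall, so it is dropped by `min_df=2`, while "the" appears in every document and is dropped by `max_df=0.75`.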
descriptions/parameter_grid/ngram_range.md
ADDED
@@ -0,0 +1 @@
+The "ngram_range" parameter of TfidfVectorizer in scikit-learn is used to specify the range of n-grams (contiguous sequences of n words) to consider during the vectorization process. It defines the lower and upper bounds for the n-gram sizes that will be included in the feature representation.
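As a rough illustration of what the lower and upper bounds mean, the sketch below extracts word n-grams from a token list. The function `word_ngrams` is a hypothetical helper, not part of scikit-learn.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sequence.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "text classification is fun".split()

unigrams = word_ngrams(tokens, (1, 1))         # single words only
uni_and_bigrams = word_ngrams(tokens, (1, 2))  # single words plus word pairs

print(unigrams)
print(uni_and_bigrams)
```

A range of (1, 1) keeps only single words, while (1, 2) adds every pair of adjacent words, which enlarges the feature space.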
descriptions/parameter_grid/norm.md
ADDED
@@ -0,0 +1 @@
+The "norm" parameter of TfidfVectorizer in scikit-learn specifies the normalization applied to the resulting TF-IDF vectors. Each document vector can be scaled to unit L2 norm ("l2", the default), scaled to unit L1 norm ("l1"), or left unnormalized (None).
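The difference between the normalization options can be sketched on a single toy vector. The `normalize` helper below is hypothetical and operates on a plain list, whereas the vectorizer works on sparse matrix rows.

```python
import math


def normalize(vector, norm="l2"):
    """Scale a vector to unit L1 or L2 norm, or return it unchanged for None."""
    if norm is None:
        return list(vector)
    if norm == "l1":
        scale = sum(abs(x) for x in vector)
    elif norm == "l2":
        scale = math.sqrt(sum(x * x for x in vector))
    else:
        raise ValueError(f"unsupported norm: {norm!r}")
    if scale == 0:
        return list(vector)  # an all-zero vector cannot be rescaled
    return [x / scale for x in vector]


weights = [3.0, 4.0, 0.0]  # toy TF-IDF weights for one document

l1 = normalize(weights, "l1")   # absolute values sum to 1
l2 = normalize(weights, "l2")   # squared values sum to 1
raw = normalize(weights, None)  # left as-is

print(l1, l2, raw)
```

With L2 normalization, the dot product of two document vectors equals their cosine similarity, which is one common reason to prefer it.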