Update README.md
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@ base_model:
 
 **Council Topics Classifier** is an ensemble machine learning system specialized in **multi-label topic classification** for Portuguese municipal council meeting minutes subjects. The model combines Gradient Boosting with Active Learning and BERTimbau embeddings to identify multiple simultaneous topics within municipal discussion subjects, making it particularly effective for categorizing complex governmental content.
 
-🚀 **Try out the model:** [
+🚀 **Try out the model:** [Demo Council Topics Classifier PT](https://huggingface.co/spaces/anonymous12321/Council_Topics_Classifier_PT)
 
 ## Key Features
 
@@ -100,18 +100,18 @@ bert_model = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased").
 
 # Preprocess text
 text = "A Câmara Municipal aprovou o orçamento de 2024..."
-# (apply smart_preprocess function - see
+# (apply smart_preprocess function - see demo source code)
 
 # Extract features
 tfidf_features = tfidf.transform([text])
-# (extract BERT embeddings - see
+# (extract BERT embeddings - see demo source code)
 
 # Combine features and predict
 X_combined = np.hstack([tfidf_features.toarray(), bert_embeddings])
 
 # Get ensemble predictions
 logistic_proba = logistic_model.predict_proba(X_combined)
-# (apply GB models and adaptive weighting - see
+# (apply GB models and adaptive weighting - see demo source code)
 
 # Apply optimal thresholds
 predictions = (ensemble_proba >= optimal_thresholds).astype(int)
@@ -125,7 +125,7 @@ print(f"Predicted Topics: {predicted_labels}")
 
 The model was trained on a curated dataset of Portuguese municipal council meeting minutes:
 
-- **Documents**: 2,500+ meeting minutes subjects
+- **Documents**: 2,500+ meeting minutes discussion subjects
 - **Time Period**: 2021-2024
 - **Source**: Portuguese municipalities (anonymized)
 - **Labels**: 22 topic categories
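The usage snippet in this change defers the "apply GB models and adaptive weighting" step to the demo source code, so `ensemble_proba` is never defined in the README itself. The sketch below is one possible way to fill that gap, not the demo's actual implementation. It assumes that both models return probability matrices of shape `(n_samples, n_labels)` and that the adaptive weighting reduces to a per-label weight on the Gradient Boosting output. The function `combine_and_threshold`, the names `gb_proba` and `gb_weights`, and all numeric values are hypothetical.

```python
import numpy as np


def combine_and_threshold(logistic_proba, gb_proba, gb_weights, thresholds):
    """Blend two probability matrices per label, then binarize.

    Assumption: a simple per-label convex combination stands in for the
    "adaptive weighting" mentioned in the README; the demo may differ.
    """
    logistic_proba = np.asarray(logistic_proba, dtype=float)
    gb_proba = np.asarray(gb_proba, dtype=float)
    gb_weights = np.asarray(gb_weights, dtype=float)  # weight of the GB model per label, in [0, 1]
    thresholds = np.asarray(thresholds, dtype=float)  # decision threshold per label

    # Broadcasting applies each label's weight across all samples.
    ensemble_proba = gb_weights * gb_proba + (1.0 - gb_weights) * logistic_proba

    # Same thresholding expression as in the README snippet.
    predictions = (ensemble_proba >= thresholds).astype(int)
    return ensemble_proba, predictions


# Toy values for one sample and three labels (illustrative only).
logistic_proba = np.array([[0.9, 0.2, 0.4]])
gb_proba = np.array([[0.7, 0.1, 0.8]])
gb_weights = np.array([0.5, 0.5, 0.75])
optimal_thresholds = np.array([0.5, 0.5, 0.6])

ensemble_proba, predictions = combine_and_threshold(
    logistic_proba, gb_proba, gb_weights, optimal_thresholds
)
print(predictions)
```

With these toy inputs the first and third labels pass their thresholds and the second does not. Per-label thresholds matter in multi-label settings because rare topics often need a lower cut-off than frequent ones.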