Venkat.R committed • Commit 4bef7aa
Parent(s): 54eddc8

tech: first publish changes

1. updated .gitignore
2. updated README
3. added config.json for the Inference API

- .gitignore +3 -1
- README.md +59 -1
- config.json +23 -0
.gitignore CHANGED
@@ -1,4 +1,6 @@
 transactify_venv
 tokenizer.joblib
 label_encoder.joblib
-transactify.h5
+transactify.h5
+venv
+.venv
README.md CHANGED
@@ -2,4 +2,62 @@
 license: mit
 language:
 - en
----
+---
+
+## What is Transactify?
+Transactify is an LSTM-based model designed to predict the category of online payment transactions from their descriptions.
+By analyzing textual inputs like "Live concert stream on YouTube" or "Coffee at Starbucks," it classifies transactions into categories such as "Movies & Entertainment" or "Food & Dining."
+This model helps users track and organize their spending across various sectors, providing better financial insights and budgeting.
+Transactify is trained on real-world transaction data for improved accuracy and generalization.
+
+## Table of contents
+## 1. Data Collection
+The dataset consists of **5,000 transaction records** generated using ChatGPT, each containing a transaction description and its corresponding category.
+
+Example entries include:
+- "Live concert stream on YouTube" (Movies & Entertainment)
+- "Coffee at Starbucks" (Food & Dining)
+
+These records cover various spending categories such as **Lifestyle**, **Movies & Entertainment**, **Food & Dining**, and others.
+
+---
+
+## 2. Data Preprocessing
+The preprocessing step involves several natural language processing (NLP) tasks to clean and prepare the text data for model training. These include:
+
+- Lowercasing all text.
+- Removing digits and punctuation using regular expressions (regex).
+- Tokenizing the cleaned text to convert it into a sequence of tokens.
+- Applying `texts_to_sequences` to transform the tokenized words into numerical sequences.
+- Using `pad_sequences` to ensure all sequences have the same length for input into the LSTM model.
+- Label encoding the target categories to convert them into numerical labels.
+
+After preprocessing, the data is split into training and testing sets to build and validate the model.
+
+---
+
+## 3. Model Building
+- **Embedding Layer**: Converts tokenized transaction descriptions into dense vectors, capturing word semantics and relationships.
+
+- **LSTM Layer**: Learns sequential patterns from the embedded text, helping the model understand the context and relationships between words over time.
+
+- **Dropout Layer**: Introduces regularization by randomly turning off neurons during training, reducing overfitting and improving the model's generalization.
+
+- **Dense Layer with Softmax Activation**: Outputs a probability distribution across categories, allowing the model to predict the correct category for each transaction description.
+
+### Model Compilation
+- Compiled with the Adam optimizer for efficient learning.
+- Sparse categorical cross-entropy loss for multi-class classification.
+- Accuracy as the evaluation metric.
+
+### Model Training
+The model is trained for **50 epochs** with a batch size of **8**, using a validation set to monitor performance and adjust during training.
+
+### Saving the Model and Preprocessing Objects
+- The trained model is saved as `transactify.h5` for future use.
+- The tokenizer and label encoder used during preprocessing are saved using joblib as `tokenizer.joblib` and `label_encoder.joblib`, respectively, ensuring they can be reused for consistent tokenization and label encoding when making predictions on new data.
+
+---
+
+## 4. Prediction
+Once trained
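The README's Prediction section is cut off at this commit, so for context here is a minimal inference sketch implied by the saved artifacts (`transactify.h5`, `tokenizer.joblib`, `label_encoder.joblib`), assuming TensorFlow/Keras 2.x and a scikit-learn label encoder; the helper name `predict_category` and the exact cleanup regex are illustrative assumptions, not code from this repository.

```python
# Inference sketch (assumed, not the repository's actual code): load the
# saved model and preprocessing objects, then mirror the README's steps
# (lowercase, strip digits/punctuation, tokenize, pad to a fixed length).
import re

import joblib
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

model = load_model("transactify.h5")
tokenizer = joblib.load("tokenizer.joblib")
label_encoder = joblib.load("label_encoder.joblib")

MAX_SEQUENCE_LENGTH = 150  # from config.json

def predict_category(description: str) -> str:
    # Lowercase and remove digits and punctuation, as in the README.
    text = re.sub(r"\d+|[^\w\s]", "", description.lower())
    # Convert to a padded integer sequence the LSTM can consume.
    seq = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH)
    probs = model.predict(padded, verbose=0)
    # Map the argmax class index back to its category name.
    return label_encoder.inverse_transform([np.argmax(probs, axis=1)[0]])[0]

print(predict_category("Coffee at Starbucks"))  # e.g. "Food & Dining"
```

Reusing the exact tokenizer and label encoder from training is what keeps token IDs and category indices consistent at inference time, which is why both are shipped alongside the model.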
config.json ADDED
@@ -0,0 +1,23 @@
+{
+  "model_type": "lstm",
+  "vocab_size": 10000,
+  "embedding_dim": 128,
+  "hidden_size": 64,
+  "num_layers": 2,
+  "dropout_rate": 0.2,
+  "max_sequence_length": 150,
+  "batch_size": 8,
+  "epochs": 50,
+  "loss_function": "sparse_categorical_crossentropy",
+  "optimizer": "adam",
+  "metrics": [
+    "accuracy"
+  ],
+  "train_data_size": 5000,
+  "categories": [
+    "Lifestyle",
+    "Movies & Entertainment",
+    "Food & Dining",
+    "Others"
+  ]
+}
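The model-construction code itself is not part of this commit, so the following is only a sketch of how the architecture described in the README (Embedding, LSTM, Dropout, Dense softmax) could be wired from these hyperparameters, assuming TensorFlow/Keras 2.x:

```python
# Sketch (assumed): build the README's architecture from config.json.
import json

from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding
from tensorflow.keras.models import Sequential

with open("config.json") as f:
    cfg = json.load(f)

layers = [
    # Dense word vectors for the tokenized descriptions.
    Embedding(input_dim=cfg["vocab_size"],
              output_dim=cfg["embedding_dim"],
              input_length=cfg["max_sequence_length"]),
]
for i in range(cfg["num_layers"]):
    # All but the last LSTM return full sequences so the layers can stack.
    layers.append(LSTM(cfg["hidden_size"],
                       return_sequences=i < cfg["num_layers"] - 1))
layers.append(Dropout(cfg["dropout_rate"]))
layers.append(Dense(len(cfg["categories"]), activation="softmax"))

model = Sequential(layers)
model.compile(optimizer=cfg["optimizer"],
              loss=cfg["loss_function"],
              metrics=cfg["metrics"])
model.summary()

# Training as described in the README would then use the config values:
# model.fit(X_train, y_train, epochs=cfg["epochs"],
#           batch_size=cfg["batch_size"], validation_data=(X_val, y_val))
```

The `epochs` and `batch_size` entries match the training setup stated in the README (50 epochs, batch size 8), so the config doubles as a record of the training hyperparameters.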