Tech: Add Config yml

#14
by ai-venkat-r - opened
Files changed (3)
  1. .gitignore +3 -1
  2. README.md +59 -1
  3. config.json +23 -0
.gitignore CHANGED
@@ -1,4 +1,6 @@
1
  transactify_venv
2
  tokenizer.joblib
3
  label_encoder.joblib
4
- transactify.h5
1
  transactify_venv
2
  tokenizer.joblib
3
  label_encoder.joblib
4
+ transactify.h5
5
+ venv
6
+ .venv
README.md CHANGED
@@ -2,4 +2,62 @@
2
  license: mit
3
  language:
4
  - en
5
- ---
2
  license: mit
3
  language:
4
  - en
5
+ ---
6
+
7
+ ## What is Transactify?
8
+ Transactify is an LSTM-based model designed to predict the category of online payment transactions from their descriptions.
9
+ By analyzing textual inputs like "Live concert stream on YouTube" or "Coffee at Starbucks," it classifies transactions into categories such as "Movies & Entertainment" or "Food & Dining."
10
+ This model helps users track and organize their spending across various sectors, providing better financial insights and budgeting.
11
+ Transactify is trained on 5,000 ChatGPT-generated transaction records (see Data Collection).
12
+
13
+ ## Table of contents
14
+ ## 1. Data Collection
15
+ The dataset consists of **5,000 transaction records** generated using ChatGPT, each containing a transaction description and its corresponding category.
16
+
17
+ Example entries include:
18
+ - "Live concert stream on YouTube" (Movies & Entertainment)
19
+ - "Coffee at Starbucks" (Food & Dining)
20
+
21
+ These records cover various spending categories such as **Lifestyle**, **Movies & Entertainment**, **Food & Dining**, and others.
22
+
23
+ ---
24
+
25
+ ## 2. Data Preprocessing
26
+ The preprocessing step involves several natural language processing (NLP) tasks to clean and prepare the text data for model training. These include:
27
+
28
+ - Lowercasing all text.
29
+ - Removing digits and punctuation using regular expressions (regex).
30
+ - Tokenizing the cleaned text to convert it into a sequence of tokens.
31
+ - Applying `texts_to_sequences` to map each token to its integer index, producing one numerical sequence per description.
32
+ - Using `pad_sequences` to ensure all sequences have the same length for input into the LSTM model.
33
+ - Label encoding the target categories to convert them into numerical labels.
34
+
35
+ After preprocessing, the data is split into training and testing sets to build and validate the model.
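
The preprocessing steps listed above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: the helper names (`clean`, `build_word_index`, `encode_labels`) are hypothetical, and the `texts_to_sequences` and `pad_sequences` functions here are simplified stand-ins for the library utilities of the same name. The alphabetical tie-break in the word index, the alphabetical label order, and the short `maxlen=6` are assumptions made so the example is easy to follow.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase, then replace digits and punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_word_index(texts):
    """Map each word to an integer id, most frequent first; id 0 is kept for padding."""
    counts = Counter(word for t in texts for word in t.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically (assumption)
    return {word: i for i, word in enumerate(ordered, start=1)}


def texts_to_sequences(texts, word_index):
    """Stand-in for the tokenizer method: replace each known word with its id."""
    return [[word_index[w] for w in t.split() if w in word_index] for t in texts]


def pad_sequences(seqs, maxlen):
    """Stand-in that left-pads with 0 and keeps the last `maxlen` ids (maxlen >= 1)."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in seqs]


def encode_labels(labels):
    """Map category names to integer ids, using alphabetical class order (assumption)."""
    classes = sorted(set(labels))
    lookup = {name: i for i, name in enumerate(classes)}
    return [lookup[name] for name in labels], classes


descriptions = ["Live concert stream on YouTube", "Coffee at Starbucks"]
labels = ["Movies & Entertainment", "Food & Dining"]

cleaned = [clean(d) for d in descriptions]
word_index = build_word_index(cleaned)
padded = pad_sequences(texts_to_sequences(cleaned, word_index), maxlen=6)
y, classes = encode_labels(labels)
```

After this runs, `padded` holds one fixed-length row of integer ids per description and `y` holds one integer label per row, which is the input shape the LSTM and the sparse loss expect.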
36
+
37
+ ---
38
+
39
+ ## 3. Model Building
40
+ - **Embedding Layer**: Converts tokenized transaction descriptions into dense vectors, capturing word semantics and relationships.
41
+
42
+ - **LSTM Layer**: Learns sequential patterns from the embedded text, helping the model understand the context and relationships between words over time.
43
+
44
+ - **Dropout Layer**: Introduces regularization by randomly turning off neurons during training, reducing overfitting and improving the model's generalization.
45
+
46
+ - **Dense Layer with Softmax Activation**: Outputs a probability distribution across categories, allowing the model to predict the correct category for each transaction description.
47
+
48
+ ### Model Compilation
49
+ - Compiled with the Adam optimizer for efficient learning.
50
+ - Sparse categorical cross-entropy loss for multi-class classification, which accepts the integer labels produced by label encoding directly.
51
+ - Accuracy as the evaluation metric.
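
To make the softmax output and the sparse categorical cross-entropy loss concrete, the following sketch computes both for a single example in plain Python. It is only an illustration of the arithmetic: the logits are hypothetical values, and in the actual model the framework computes these quantities over whole batches.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, label):
    """Negative log-probability of the true class, given as an integer id."""
    return -math.log(probs[label])


logits = [2.0, 0.5, 0.1, -1.0]  # hypothetical scores for four categories
probs = softmax(logits)

predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
loss_if_true_is_0 = sparse_categorical_crossentropy(probs, 0)
loss_if_true_is_3 = sparse_categorical_crossentropy(probs, 3)
```

The loss is small when the true class receives a high probability and grows as that probability shrinks, so `loss_if_true_is_0` is lower than `loss_if_true_is_3` here.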
52
+
53
+ ### Model Training
54
+ The model is trained for **50 epochs** with a batch size of **8**, using a validation set to monitor performance during training.
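
As a rough sense of scale for these settings, the number of weight updates can be worked out from the record count, batch size, and epoch count. The split ratio is not stated in the README, so the 80% training share below is purely an assumption.

```python
import math

total_records = 5000   # from the README
batch_size = 8         # from the README
epochs = 50            # from the README
train_percent = 80     # assumption: the train/test ratio is not stated

train_records = total_records * train_percent // 100
steps_per_epoch = math.ceil(train_records / batch_size)
total_updates = steps_per_epoch * epochs
```

Under that assumed split, each epoch runs 500 batches, for 25,000 updates over the full run.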
55
+
56
+ ### Saving the Model and Preprocessing Objects
57
+ - The trained model is saved as `transactify.h5` for future use.
58
+ - The tokenizer and label encoder used during preprocessing are saved using joblib as `tokenizer.joblib` and `label_encoder.joblib`, respectively, ensuring they can be reused for consistent tokenization and label encoding when making predictions on new data.
59
+
60
+ ---
61
+
62
+ ## 4. Prediction
63
+ Once trained
config.json ADDED
@@ -0,0 +1,23 @@
1
+ {
2
+ "model_type": "lstm",
3
+ "vocab_size": 10000,
4
+ "embedding_dim": 128,
5
+ "hidden_size": 64,
6
+ "num_layers": 2,
7
+ "dropout_rate": 0.2,
8
+ "max_sequence_length": 150,
9
+ "batch_size": 8,
10
+ "epochs": 50,
11
+ "loss_function": "sparse_categorical_crossentropy",
12
+ "optimizer": "adam",
13
+ "metrics": [
14
+ "accuracy"
15
+ ],
16
+ "train_data_size": 5000,
17
+ "categories": [
18
+ "Lifestyle",
19
+ "Movies & Entertainment",
20
+ "Food & Dining",
21
+ "Others"
22
+ ]
23
+ }
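
One way a training or inference script might consume this file is sketched below. The class `TransactifyConfig` and the function `load_config` are hypothetical names, not part of this PR, and the validation rules are assumptions about what a sensible configuration looks like. The JSON is embedded as a string only so the sketch is self-contained; a real script would read `config.json` from disk.

```python
import json
from dataclasses import dataclass

# Copy of the config.json added in this PR.
CONFIG_TEXT = """
{
  "model_type": "lstm",
  "vocab_size": 10000,
  "embedding_dim": 128,
  "hidden_size": 64,
  "num_layers": 2,
  "dropout_rate": 0.2,
  "max_sequence_length": 150,
  "batch_size": 8,
  "epochs": 50,
  "loss_function": "sparse_categorical_crossentropy",
  "optimizer": "adam",
  "metrics": ["accuracy"],
  "train_data_size": 5000,
  "categories": ["Lifestyle", "Movies & Entertainment", "Food & Dining", "Others"]
}
"""

POSITIVE_FIELDS = (
    "vocab_size", "embedding_dim", "hidden_size", "num_layers",
    "max_sequence_length", "batch_size", "epochs",
)


@dataclass(frozen=True)
class TransactifyConfig:  # hypothetical name
    vocab_size: int
    embedding_dim: int
    hidden_size: int
    num_layers: int
    dropout_rate: float
    max_sequence_length: int
    batch_size: int
    epochs: int
    categories: tuple

    @property
    def num_classes(self):
        """Number of units the softmax output layer needs."""
        return len(self.categories)


def load_config(text):
    """Parse the JSON text and reject values that cannot describe a trainable model."""
    raw = json.loads(text)
    cfg = TransactifyConfig(
        dropout_rate=raw["dropout_rate"],
        categories=tuple(raw["categories"]),
        **{name: raw[name] for name in POSITIVE_FIELDS},
    )
    for name in POSITIVE_FIELDS:
        if getattr(cfg, name) <= 0:
            raise ValueError(f"{name} must be positive")
    if not 0.0 <= cfg.dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")
    if not cfg.categories or len(set(cfg.categories)) != len(cfg.categories):
        raise ValueError("categories must be non-empty and unique")
    return cfg


config = load_config(CONFIG_TEXT)
```

Deriving `num_classes` from the `categories` list keeps the output layer size in step with the label set, so adding a category to the file does not require a second edit elsewhere.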