# SWIFT MT564 Documentation Assistant

**Version:** 1.0.0  
**Date:** May 14, 2025  
**Author:** Replit AI  

## Table of Contents

1. [Introduction](#introduction)
2. [Project Overview](#project-overview)
3. [System Architecture](#system-architecture)
4. [Installation & Setup](#installation--setup)
5. [Component Details](#component-details)
   - [Data Collection](#data-collection)
   - [Model Training](#model-training)
   - [Web Interface](#web-interface)
   - [Hugging Face Integration](#hugging-face-integration)
6. [Usage Guide](#usage-guide)
7. [Troubleshooting](#troubleshooting)
8. [References](#references)

## Introduction

The SWIFT MT564 Documentation Assistant is a specialized AI system designed to help financial professionals understand and work with the SWIFT MT564 message format (Corporate Action Notification). It combines web scraping, natural language processing, and a conversational interface into an intelligent assistant for interpreting MT564 documentation.

## Project Overview

This project creates a complete pipeline that:

1. Scrapes SWIFT MT564 documentation from official sources
2. Processes this information into a structured format
3. Fine-tunes a TinyLlama language model on this specialized data
4. Provides a user interface for asking questions about MT564
5. Enables deployment to Hugging Face for easy sharing and use

The system is designed to be modular, allowing for future expansion to other SWIFT message types or financial documentation.

## System Architecture

The system consists of several key components:

```
SWIFT-MT564-Assistant/
├── scrapers/                 # Web scraping components
│   ├── iso20022_scraper.py   # Scraper for ISO20022 website
│   ├── pdf_parser.py         # PDF extraction utilities
│   └── data_processor.py     # Converts raw data to training format
│
├── model/                    # ML model components
│   ├── download_tinyllama.py # Script to download TinyLlama model
│   ├── upload_to_huggingface.py # Script to upload model to Hugging Face
│   ├── tinyllama_trainer.py  # Fine-tuning implementation
│   └── evaluator.py          # Tests model performance
│
├── webapp/                   # Web application
│   ├── app.py                # Flask application
│   ├── templates/            # HTML templates
│   │   ├── index.html        # Main page
│   │   └── result.html       # Results display
│   └── static/               # CSS, JS, and other static files
│
├── data/                     # Data storage
│   ├── raw/                  # Raw scraped data
│   ├── processed/            # Processed training data
│   └── uploaded/             # User-uploaded PDFs
│
├── train_mt564_model.py      # Script to train the model
├── prepare_mt564_data.py     # Script to prepare training data
├── dependencies.txt          # Project dependencies
├── setup.py                  # Setup and utility script
└── README.md                 # Project documentation
```

## Installation & Setup

### System Requirements

- Python 3.8 or higher
- At least 4GB RAM (8GB+ recommended)
- At least 10GB free disk space
- CUDA-compatible GPU recommended for training (but not required)
- Internet connection for downloading models and data
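As a convenience, the requirements above can be checked programmatically before installation. The sketch below is a hypothetical helper (not part of the project's scripts); it uses only the standard library, and reports GPU status only if `torch` happens to be installed already:

```python
import sys
import shutil


def check_environment(min_python=(3, 8), min_disk_gb=10):
    """Return human-readable warnings for any unmet requirement."""
    warnings = []
    if sys.version_info < min_python:
        warnings.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version.split()[0]}"
        )
    free_gb = shutil.disk_usage(".").free / 1e9
    if free_gb < min_disk_gb:
        warnings.append(f"only {free_gb:.1f} GB free; {min_disk_gb} GB recommended")
    try:
        import torch  # optional: only report GPU status if torch is installed
        if not torch.cuda.is_available():
            warnings.append("no CUDA GPU detected; training will run on CPU")
    except ImportError:
        warnings.append("torch not installed yet; see Installation below")
    return warnings


for w in check_environment():
    print("WARNING:", w)
```

An empty result means the machine meets the minimums; any warning is advisory, since CPU-only training is supported (just slow).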

### Local Installation

1. **Clone or download the project**:
   - Download the zip file from Replit
   - Extract to a folder on your local machine

2. **Set up a virtual environment**:
   ```bash
   # Create a virtual environment
   python -m venv venv
   
   # Activate the environment
   # On Windows:
   venv\Scripts\activate
   # On macOS/Linux:
   source venv/bin/activate
   ```

3. **Install dependencies**:
   ```bash
   # Install core dependencies
   pip install torch transformers datasets huggingface_hub accelerate
   pip install requests beautifulsoup4 trafilatura flask
   pip install PyPDF2 tqdm nltk rouge
   
   # Or use the dependencies.txt file
   pip install -r dependencies.txt
   ```

4. **Run the setup script for guidance**:
   ```bash
   python setup.py --mode guide
   ```

### Environment Variables

The following environment variables are used:

- `HUGGING_FACE_TOKEN`: Your Hugging Face API token (for uploading models)
- `FLASK_APP`: Set to "webapp/app.py" for running the web interface
- `FLASK_ENV`: Set to "development" for debugging or "production" for deployment
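Scripts that need the token should fail fast with a clear message rather than erroring deep inside an upload. A minimal sketch of that pattern (the helper name is illustrative, not an existing project function):

```python
import os


def get_hf_token():
    """Fetch the Hugging Face token, failing fast with a clear message."""
    token = os.environ.get("HUGGING_FACE_TOKEN")
    if not token:
        raise RuntimeError(
            "HUGGING_FACE_TOKEN is not set; export it before uploading models "
            "(see Hugging Face Integration)"
        )
    return token
```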

## Component Details

### Data Collection

The data collection process involves scraping SWIFT MT564 documentation from official sources:

1. **ISO20022 Website Scraping**:
   ```bash
   python scrapers/iso20022_scraper.py --output_dir ./data/raw
   ```
   
   This scrapes the ISO20022 website's MT564 documentation and saves it in structured JSON format.

2. **Data Processing**:
   ```bash
   python prepare_mt564_data.py --input_file ./data/raw/mt564_documentation.json --output_file ./data/processed/mt564_training_data.json
   ```

   This converts the raw data into instruction-response pairs suitable for training.
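To illustrate the shape of that conversion, here is a minimal sketch. The input keys (`field`, `name`, `description`) are an assumed schema for the scraper's JSON output, not the verified one; adapt them to what `iso20022_scraper.py` actually emits:

```python
import json


def to_training_pairs(raw_docs):
    """Turn scraped field documentation into instruction-response pairs.

    raw_docs: list of dicts with assumed keys 'field', 'name', 'description'.
    """
    pairs = []
    for doc in raw_docs:
        pairs.append({
            "instruction": (
                f"What is field {doc['field']} ({doc['name']}) "
                "in an MT564 message?"
            ),
            "response": doc["description"],
        })
    return pairs


raw = [{
    "field": "23G",
    "name": "Function of the Message",
    "description": "Specifies the function of the message, e.g. NEWM or CANC.",
}]
print(json.dumps(to_training_pairs(raw), indent=2))
```

Each pair becomes one supervised example during fine-tuning, so the quality of the scraped descriptions directly bounds the quality of the model's answers.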

### Model Training

The model training process involves:

1. **Downloading the base model**:
   ```bash
   python model/download_tinyllama.py --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 --output_dir ./data/models
   ```

2. **Fine-tuning on MT564 data**:
   ```bash
   python train_mt564_model.py --model_name ./data/models/TinyLlama-1.1B-Chat-v1.0 --training_data ./data/processed/mt564_training_data.json --output_dir ./mt564_tinyllama_model
   ```

   Training parameters can be adjusted as needed:
   - `--epochs`: Number of training epochs (default: 3)
   - `--batch_size`: Batch size (default: 2)
   - `--learning_rate`: Learning rate (default: 2e-5)

3. **Evaluating the model**:
   The training script runs validation during fine-tuning; for further evaluation on held-out test data, use `model/evaluator.py`.
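The three tunable flags above can be wired up with a small argparse parser whose defaults mirror the documented values. This is an illustrative sketch, not the actual contents of `train_mt564_model.py`:

```python
import argparse


def build_parser():
    """CLI flags mirroring the documented training defaults."""
    p = argparse.ArgumentParser(description="Fine-tune TinyLlama on MT564 data")
    p.add_argument("--epochs", type=int, default=3,
                   help="number of training epochs")
    p.add_argument("--batch_size", type=int, default=2,
                   help="per-device batch size")
    p.add_argument("--learning_rate", type=float, default=2e-5,
                   help="initial learning rate")
    return p


args = build_parser().parse_args([])  # no overrides -> documented defaults
print(args.epochs, args.batch_size, args.learning_rate)
```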

### Web Interface

The web interface provides a user-friendly way to interact with the model:

1. **Starting the web server**:
   ```bash
   python webapp/app.py
   ```

2. **Using the interface**:
   - Open a browser and navigate to `http://localhost:5000`
   - Upload SWIFT MT564 documentation PDFs
   - Ask questions about the message format
   - View AI-generated responses
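Behind the interface, each user question must be wrapped in the chat template the model was trained on. The Zephyr-style template below is an assumption based on the TinyLlama-Chat model family; once the tokenizer is loaded, prefer `tokenizer.apply_chat_template()` so the template always matches the checkpoint:

```python
def build_prompt(question,
                 system="You answer questions about SWIFT MT564 messages."):
    """Assemble an assumed Zephyr-style chat prompt for TinyLlama-1.1B-Chat."""
    return (
        f"<|system|>\n{system}</s>\n"
        f"<|user|>\n{question}</s>\n"
        f"<|assistant|>\n"
    )


print(build_prompt("What does field 23G contain?"))
```

The trailing `<|assistant|>` marker leaves the model positioned to generate its answer as the next tokens.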

### Hugging Face Integration

The project includes tools for seamless integration with Hugging Face:

1. **Uploading your model**:
   ```bash
   # Set your Hugging Face API token
   export HUGGING_FACE_TOKEN=your_token_here
   
   # Upload the model
   python model/upload_to_huggingface.py --model_dir ./mt564_tinyllama_model --repo_name your-username/mt564-tinyllama
   ```

2. **Creating a Hugging Face Space**:
   - Go to huggingface.co and click "New Space"
   - Choose Gradio or Streamlit template
   - Link to your uploaded model
   - Use the sample code provided in the setup guide

## Usage Guide

### Common Workflows

#### Complete Pipeline

1. Scrape data → 2. Process data → 3. Download model → 4. Train model → 5. Upload to Hugging Face

```bash
# 1. Scrape data
python scrapers/iso20022_scraper.py --output_dir ./data/raw

# 2. Process data
python prepare_mt564_data.py --input_file ./data/raw/mt564_documentation.json --output_file ./data/processed/mt564_training_data.json

# 3. Download model
python model/download_tinyllama.py --output_dir ./data/models

# 4. Train model
python train_mt564_model.py --training_data ./data/processed/mt564_training_data.json --output_dir ./mt564_tinyllama_model

# 5. Upload to Hugging Face
export HUGGING_FACE_TOKEN=your_token_here
python model/upload_to_huggingface.py --model_dir ./mt564_tinyllama_model --repo_name your-username/mt564-tinyllama
```
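The five shell commands above can also be driven from a single Python script, which is convenient when the pipeline is rerun often. This hypothetical orchestrator (not shipped with the project) reuses the default paths shown above and stops at the first failing step:

```python
import subprocess

# Commands mirror the documented defaults; edit the repo_name before uploading.
STEPS = [
    ["python", "scrapers/iso20022_scraper.py", "--output_dir", "./data/raw"],
    ["python", "prepare_mt564_data.py",
     "--input_file", "./data/raw/mt564_documentation.json",
     "--output_file", "./data/processed/mt564_training_data.json"],
    ["python", "model/download_tinyllama.py", "--output_dir", "./data/models"],
    ["python", "train_mt564_model.py",
     "--training_data", "./data/processed/mt564_training_data.json",
     "--output_dir", "./mt564_tinyllama_model"],
    ["python", "model/upload_to_huggingface.py",
     "--model_dir", "./mt564_tinyllama_model",
     "--repo_name", "your-username/mt564-tinyllama"],
]


def run_pipeline(steps=STEPS, dry_run=False):
    """Run each step in order; check=True aborts on the first failure."""
    for cmd in steps:
        print(">>", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_pipeline(dry_run=True)
```

`dry_run=True` only prints the commands, which is a cheap way to verify paths before committing to a long training run.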

#### Using Pre-trained Model

If you already have a trained model, you can skip steps 1-4 and just run the web interface:

```bash
# Start the web interface
python webapp/app.py
```

## Troubleshooting

### Common Issues

1. **Out of memory during training**:
   - Reduce batch size: `--batch_size 1`
   - Increase gradient accumulation: `--gradient_accumulation_steps 8`
   - Use CPU only if necessary: `--device cpu`

2. **Installation errors**:
   - Make sure you're using Python 3.8+
   - Try installing dependencies one by one
   - Check for package conflicts

3. **Hugging Face upload issues**:
   - Verify your HUGGING_FACE_TOKEN is set correctly
   - Make sure you have write access to the repository
   - Check for repository naming conflicts
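The low-memory advice in issue 1 works because gradient accumulation sums gradients over several small forward/backward passes before each optimizer step, so the effective batch size is preserved while peak memory drops. A quick sanity check:

```python
def effective_batch(per_device_batch, grad_accum_steps):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps


# --batch_size 1 with --gradient_accumulation_steps 8 still yields an
# effective batch of 8 examples per optimizer step:
print(effective_batch(1, 8))
```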

### Getting Help

If you encounter issues:
1. Check the error messages for specific details
2. Consult the Hugging Face documentation for model/API issues
3. Review the TinyLlama documentation for model-specific questions

## References

- [SWIFT MT564 Documentation](https://www.iso20022.org/15022/uhb/finmt564.htm)
- [TinyLlama Project](https://github.com/jzhang38/TinyLlama)
- [Hugging Face Documentation](https://huggingface.co/docs)
- [Transformers Library](https://huggingface.co/docs/transformers/index)
- [Flask Web Framework](https://flask.palletsprojects.com/)

---

## License

This project is available under the Apache 2.0 License.

## Acknowledgements

This project utilizes several open-source libraries and resources, including TinyLlama, Hugging Face Transformers, and Flask.