samwaugh committed on
Commit feffffb · 1 Parent(s): 3e73f62

Update README

Files changed (1)
  1. README.md +88 -46
README.md CHANGED
@@ -23,39 +23,44 @@ datasets:
  ## What ArteFact Does

  - **Upload or select artwork images** and find scholarly passages that describe similar visual elements
- - **Search by region** - crop specific areas of paintings to find text about those visual details
  - **Filter results** by art historical topics or specific creators
  - **Access scholarly sources** with full citations, DOI links, and BibTeX references
  - **Generate heatmaps** showing which image regions contribute to text similarity using Grad-ECLIP

  ## 🏗️ Architecture Overview

  ### **Backend: Flask API with ML Pipeline**
  - **Flask server** (`backend/runner/app.py`) serving the SPA from `frontend/`
- - **ML Models**: CLIP base + PaintingCLIP LoRA fine-tune
- - **Inference Engine**: Region-aware analysis with 7×7 grid overlay
- - **Background Processing**: Thread-based task queue for ML inference

  ### **Frontend: Interactive Web Application**
  - **Single-page application** with responsive Bootstrap design
- - **Image Tools**: Upload, crop, edit, and analyze specific regions
  - **Grid Analysis**: Click-to-analyze 7×7 grid cells for spatial understanding
- - **Academic Integration**: Full citation management and source verification

  ### **Data Architecture: Distributed Hugging Face Datasets**
  - **`artefact-embeddings`**: Pre-computed sentence embeddings (12.8GB total)
- - `clip_embeddings.safetensors` (6.39GB) - CLIP model embeddings
- - `paintingclip_embeddings.safetensors` (6.39GB) - PaintingCLIP embeddings
- - `*_sentence_ids.json` (71.7MB each) - Sentence ID mappings
  - **`artefact-json`**: Metadata and structured data
- - `sentences.json` - 3.1M sentence metadata
- - `works.json` - 7,200 work records
- - `creators.json` - Artist/creator mappings
- - `topics.json` - Topic classifications
  - `topic_names.json` - Human-readable topic names
- - **`artefact-markdown`**: Source documents and images (planned)
  - 7,200 work directories with markdown files and associated images
- - Organized by work ID for efficient retrieval
  - **Local Models**: PaintingCLIP LoRA weights in `data/models/PaintingCLIP/`

  ## 🚀 Getting Started
@@ -63,7 +68,7 @@ datasets:
  ### **Prerequisites**
  - Python 3.9+
  - Docker (for containerized deployment)
- - Access to Hugging Face datasets

  ### **Local Development**
  ```bash
@@ -101,79 +106,115 @@ git push hf main:main
  - `DATA_ROOT`: Data directory path (default: `/data` for HF Spaces)
  - `PORT`: Server port (set by Hugging Face Spaces)
  - `MAX_WORKERS`: Thread pool size for ML inference (default: 2)

  ### **Data Sources**
  The application automatically connects to distributed Hugging Face datasets:
- - **Embeddings**: `samwaugh/artefact-embeddings` for fast similarity search
  - **Metadata**: `samwaugh/artefact-json` for sentence, work, and topic information
  - **Documents**: `samwaugh/artefact-markdown` for source documents and context
- - **Models**: Local `data/models/` directory for ML model weights

  ## 📊 Data Processing Pipeline

  ### **ArtContext Research Pipeline**
  ArteFact processes a massive corpus of art historical texts:

- - **Scale**: 3.1 million sentences from scholarly articles
  - **Processing**: Executed on Durham University's Bede HPC cluster
  - **GPU**: NVIDIA H100 with 32GB memory
- - **Processing Time**: ~12 minutes for full corpus
  - **Output**: Structured embeddings and metadata for real-time analysis

  ### **Data Organization**
  ```
  data/
  ├── models/
- │   └── PaintingCLIP/ # LoRA fine-tuned weights
- └── marker_output/ # Document analysis outputs

  # Data hosted on Hugging Face Hub:
- # - samwaugh/artefact-embeddings: 12.8GB embeddings
- # - samwaugh/artefact-json: Metadata files
- # - samwaugh/artefact-markdown: Source documents
  ```

  ## 🧠 AI Models & Features

  ### **Core Models**
  - **CLIP**: OpenAI's CLIP-ViT-B/32 for general image-text understanding
- - **PaintingCLIP**: Fine-tuned version specialized for art historical content
  - **Model Switching**: Users can choose between models for different analysis types

  ### **Advanced AI Features**
- - **Region-Aware Analysis**: 7×7 grid overlay for spatial understanding
- - **Grad-ECLIP Heatmaps**: Visual explanations of AI decision-making
- - **Smart Filtering**: Topic and creator-based result filtering
- - **Patch-Level Attention**: ViT patch embeddings for detailed analysis

  ## 🎨 User Interface Features

  ### **Image Analysis Tools**
- - **Drag & Drop Upload**: Easy image input with preview
- - **Interactive Grid**: Click-to-analyze specific image regions
- - **Crop & Edit**: Built-in image manipulation tools
- - **Image History**: Track and compare different analyses

  ### **Academic Integration**
- - **Citation Management**: One-click BibTeX copying
- - **Source Verification**: Direct links to scholarly articles
  - **Context Preservation**: Full paragraph context for matched sentences
- - **Work Exploration**: Browse related images and metadata

  ## 🔬 Research & Development

  ### **Technical Innovations**
- - **Efficient Embedding Storage**: Safetensors format for fast loading
- - **Memory-Optimized Inference**: Caching and batch processing
  - **Real-Time Analysis**: Sub-second response times for similarity search
- - **Scalable Architecture**: Designed for production deployment
- - **Distributed Data**: Hugging Face datasets for scalable data management

  ### **Academic Applications**
- - **Art Historical Research**: Discover connections across large corpora
  - **Digital Humanities**: Computational analysis of visual-textual relationships
- - **Educational Tools**: Interactive learning for art history students
  - **Scholarly Discovery**: AI-powered literature review and citation analysis

  ## 🤝 Contributing

@@ -185,9 +226,10 @@ data/
  5. Submit a pull request

  ### **Data Contributions**
- - **Embeddings**: Process new art historical texts
  - **Models**: Improve fine-tuning and model performance
  - **Documentation**: Enhance user guides and API documentation

  ## 📄 License & Acknowledgments

@@ -210,8 +252,8 @@ This work made use of the facilities of the N8 Centre of Excellence in Computati
  - **Research Paper**: [Download PDF](paper/waugh2025artcontext.pdf)
  - **Embeddings Dataset**: [artefact-embeddings on HF](https://huggingface.co/datasets/samwaugh/artefact-embeddings)
  - **JSON Dataset**: [artefact-json on HF](https://huggingface.co/datasets/samwaugh/artefact-json)
- - **Markdown Dataset**: [artefact-markdown on HF](https://huggingface.co/datasets/samwaugh/artefact-markdown) (planned)

  ---

- *ArteFact represents a significant contribution to computational art history, making large-scale scholarly resources accessible through AI-powered visual analysis while maintaining academic rigor and providing transparent explanations of AI decision-making. The application now leverages Hugging Face's distributed data infrastructure for scalable and collaborative research.*
 
  ## What ArteFact Does

  - **Upload or select artwork images** and find scholarly passages that describe similar visual elements
+ - **Search by region** - crop specific areas of paintings to find text about those visual details
  - **Filter results** by art historical topics or specific creators
+ - **Switch AI models** between CLIP and PaintingCLIP for different analysis approaches
  - **Access scholarly sources** with full citations, DOI links, and BibTeX references
  - **Generate heatmaps** showing which image regions contribute to text similarity using Grad-ECLIP
+ - **Interactive grid analysis** - click on 7×7 grid cells to analyze specific image regions

  ## 🏗️ Architecture Overview

  ### **Backend: Flask API with ML Pipeline**
  - **Flask server** (`backend/runner/app.py`) serving the SPA from `frontend/`
+ - **ML Models**: CLIP base model + PaintingCLIP LoRA fine-tuned adapter
+ - **Inference Engine**: Region-aware analysis with 7×7 grid overlay and patch-level attention
+ - **Background Processing**: Thread-based task queue for ML inference with progress tracking
+ - **Caching System**: Intelligent caching of model components and embeddings for performance

  ### **Frontend: Interactive Web Application**
  - **Single-page application** with responsive Bootstrap design
+ - **Image Tools**: Upload, crop, edit, undo, and analyze specific regions
  - **Grid Analysis**: Click-to-analyze 7×7 grid cells for spatial understanding
+ - **Model Selection**: Dropdown to switch between CLIP and PaintingCLIP models
+ - **Academic Integration**: Full citation management, source verification, and BibTeX export
+ - **Real-time Feedback**: Progress indicators and status updates during processing

  ### **Data Architecture: Distributed Hugging Face Datasets**
  - **`artefact-embeddings`**: Pre-computed sentence embeddings (12.8GB total)
+ - `clip_embeddings.safetensors` (6.39GB) - CLIP model embeddings for 3.1M sentences
+ - `paintingclip_embeddings.safetensors` (6.39GB) - PaintingCLIP embeddings for 3.1M sentences
+ - `clip_sentence_ids.json` & `paintingclip_sentence_ids.json` - Sentence ID mappings
  - **`artefact-json`**: Metadata and structured data
+ - `sentences.json` - 3.1M sentence metadata with work associations
+ - `works.json` - 7,200 work records with DOI and citation information
+ - `creators.json` - Artist/creator mappings for filtering
+ - `topics.json` - Topic classifications for content filtering
  - `topic_names.json` - Human-readable topic names
+ - **`artefact-markdown`**: Source documents and images (239,996 files)
  - 7,200 work directories with markdown files and associated images
+ - Organized by work ID for efficient retrieval and context display
  - **Local Models**: PaintingCLIP LoRA weights in `data/models/PaintingCLIP/`
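
The similarity search over pre-computed sentence embeddings described in the Data Architecture list can be sketched as a cosine-similarity ranking. This is a pure-Python stand-in for illustration only: the real application loads safetensors files and presumably uses tensor operations, and the `top_k` helper, the toy vectors, and the sentence IDs below are all hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(
    query: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (sentence_id, cosine similarity) pairs closest to the query."""
    q = normalize(query)
    scored = (
        (sentence_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for sentence_id, vec in embeddings.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy data: hypothetical sentence IDs mapped to 2-D embeddings.
toy_embeddings = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [1.0, 1.0]}
results = top_k([1.0, 0.1], toy_embeddings, k=2)
```

With 3.1M sentences a production system would batch this as a single matrix product rather than loop in Python; the ranking logic is the same.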

  ## 🚀 Getting Started
 
  ### **Prerequisites**
  - Python 3.9+
  - Docker (for containerized deployment)
+ - Access to Hugging Face datasets (public access)

  ### **Local Development**
  ```bash
 
  - `DATA_ROOT`: Data directory path (default: `/data` for HF Spaces)
  - `PORT`: Server port (set by Hugging Face Spaces)
  - `MAX_WORKERS`: Thread pool size for ML inference (default: 2)
+ - `ARTEFACT_JSON_DATASET`: HF dataset name for JSON metadata (default: `samwaugh/artefact-json`)
+ - `ARTEFACT_EMBEDDINGS_DATASET`: HF dataset name for embeddings (default: `samwaugh/artefact-embeddings`)
+ - `ARTEFACT_MARKDOWN_DATASET`: HF dataset name for markdown files (default: `samwaugh/artefact-markdown`)
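
A minimal sketch of how the configuration variables listed above might be read at startup. The `Settings` dataclass and `load_settings` helper are hypothetical names, not taken from the ArteFact codebase; only the variable names and their defaults come from the README. The README gives no default for `PORT`, so the `7860` fallback is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    data_root: str
    port: int
    max_workers: int
    json_dataset: str
    embeddings_dataset: str
    markdown_dataset: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        data_root=env.get("DATA_ROOT", "/data"),
        port=int(env.get("PORT", "7860")),  # assumed fallback, not documented
        max_workers=int(env.get("MAX_WORKERS", "2")),
        json_dataset=env.get("ARTEFACT_JSON_DATASET", "samwaugh/artefact-json"),
        embeddings_dataset=env.get(
            "ARTEFACT_EMBEDDINGS_DATASET", "samwaugh/artefact-embeddings"
        ),
        markdown_dataset=env.get(
            "ARTEFACT_MARKDOWN_DATASET", "samwaugh/artefact-markdown"
        ),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test, for example `load_settings({"MAX_WORKERS": "4"})`.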

  ### **Data Sources**
  The application automatically connects to distributed Hugging Face datasets:
+ - **Embeddings**: `samwaugh/artefact-embeddings` for fast similarity search using safetensors format
  - **Metadata**: `samwaugh/artefact-json` for sentence, work, and topic information
  - **Documents**: `samwaugh/artefact-markdown` for source documents and context
+ - **Models**: Local `data/models/` directory for ML model weights with fallback to base CLIP

  ## 📊 Data Processing Pipeline

  ### **ArtContext Research Pipeline**
  ArteFact processes a massive corpus of art historical texts:

+ - **Scale**: 3.1 million sentences from scholarly articles across 7,200 works
  - **Processing**: Executed on Durham University's Bede HPC cluster
  - **GPU**: NVIDIA H100 with 32GB memory
+ - **Processing Time**: ~12 minutes for full corpus embedding generation
  - **Output**: Structured embeddings and metadata for real-time analysis

  ### **Data Organization**
  ```
  data/
  ├── models/
+ │   └── PaintingClip/ # LoRA fine-tuned weights
+ │       ├── adapter_config.json
+ │       ├── adapter_model.safetensors
+ │       └── README.md
+ └── artifacts/ # Uploaded images
+ └── outputs/ # Inference results

  # Data hosted on Hugging Face Hub:
+ # - samwaugh/artefact-embeddings: 12.8GB embeddings in safetensors format
+ # - samwaugh/artefact-json: 5 JSON metadata files
+ # - samwaugh/artefact-markdown: 239,996 files across 7,200 work directories
  ```
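
The "fallback to base CLIP" behaviour mentioned alongside this directory layout could be decided by a simple file check. The sketch below is an assumption about how such a check might look, not the project's actual logic; `choose_model` and the returned labels are hypothetical, while the two file names come from the tree above.

```python
from pathlib import Path

# Files the tree above lists inside the LoRA adapter directory.
REQUIRED_ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")


def choose_model(adapter_dir: Path) -> str:
    """Return "paintingclip" when every adapter file exists, otherwise "clip"."""
    if all((adapter_dir / name).is_file() for name in REQUIRED_ADAPTER_FILES):
        return "paintingclip"
    return "clip"
```

A caller would pass the adapter directory under `DATA_ROOT` and load the LoRA weights only when the function returns `"paintingclip"`.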

  ## 🧠 AI Models & Features

  ### **Core Models**
  - **CLIP**: OpenAI's CLIP-ViT-B/32 for general image-text understanding
+ - **PaintingCLIP**: Fine-tuned version specialized for art historical content using LoRA adapters
  - **Model Switching**: Users can choose between models for different analysis types
+ - **Fallback System**: Graceful degradation to base CLIP if LoRA adapter is unavailable

  ### **Advanced AI Features**
+ - **Region-Aware Analysis**: 7×7 grid overlay for spatial understanding of image regions
+ - **Grad-ECLIP Heatmaps**: Visual explanations of AI decision-making with attention visualization
+ - **Smart Filtering**: Topic and creator-based result filtering with real-time updates
+ - **Patch-Level Attention**: ViT patch embeddings for detailed analysis of image components
+ - **Batch Processing**: Efficient processing of large embedding datasets with memory optimization
+ - **Direct File Loading**: Fast loading of consolidated safetensors files for optimal performance
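
To make the 7×7 grid overlay concrete, here is a small sketch of the coordinate arithmetic that maps between grid cells and pixel regions. It is an illustration under assumptions: the function names are hypothetical, and the project may compute cell bounds differently (for example on a resized 224×224 input).

```python
from typing import Tuple

GRID_SIZE = 7  # the README describes a 7x7 overlay


def _ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def cell_to_box(
    row: int, col: int, width: int, height: int, grid: int = GRID_SIZE
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) pixel bounds of a grid cell (right/bottom exclusive)."""
    if not (0 <= row < grid and 0 <= col < grid):
        raise ValueError("cell index outside the grid")
    left = _ceil_div(col * width, grid)
    right = _ceil_div((col + 1) * width, grid)
    top = _ceil_div(row * height, grid)
    bottom = _ceil_div((row + 1) * height, grid)
    return left, top, right, bottom


def point_to_cell(
    x: int, y: int, width: int, height: int, grid: int = GRID_SIZE
) -> Tuple[int, int]:
    """Map a click at pixel (x, y) to the (row, col) of the cell containing it."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("point outside the image")
    return y * grid // height, x * grid // width
```

The ceiling division in `cell_to_box` keeps the two functions consistent when the image size is not a multiple of 7: every pixel inside a cell's box maps back to that cell.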

  ## 🎨 User Interface Features

  ### **Image Analysis Tools**
+ - **Drag & Drop Upload**: Easy image input with preview and validation
+ - **Interactive Grid**: Click-to-analyze specific image regions with visual feedback
+ - **Crop & Edit**: Built-in image manipulation tools with undo functionality
+ - **Image History**: Track and compare different analyses with thumbnail navigation
+ - **Example Gallery**: Pre-loaded historical artworks for quick testing

  ### **Academic Integration**
+ - **Citation Management**: One-click BibTeX copying with formatted output
+ - **Source Verification**: Direct links to scholarly articles and DOI resolution
  - **Context Preservation**: Full paragraph context for matched sentences
+ - **Work Exploration**: Browse related images and metadata from the same scholarly work
+ - **Modal Documentation**: Detailed work information with embedded PDF previews
+
+ ### **User Experience**
+ - **Real-time Progress**: Loading indicators and status updates during processing
+ - **Responsive Design**: Mobile-friendly interface with Bootstrap components
+ - **Error Handling**: Graceful error messages and recovery options
+ - **Performance Optimization**: Caching and efficient data loading for fast responses

  ## 🔬 Research & Development

  ### **Technical Innovations**
+ - **Efficient Embedding Storage**: Safetensors format for fast loading and memory efficiency
+ - **Memory-Optimized Inference**: Intelligent caching and batch processing
  - **Real-Time Analysis**: Sub-second response times for similarity search
+ - **Scalable Architecture**: Designed for production deployment with distributed data
+ - **Distributed Data**: Hugging Face datasets for scalable data management and collaboration
+ - **Robust Error Handling**: Fallback mechanisms and graceful degradation

  ### **Academic Applications**
+ - **Art Historical Research**: Discover connections across large scholarly corpora
  - **Digital Humanities**: Computational analysis of visual-textual relationships
+ - **Educational Tools**: Interactive learning for art history students and researchers
  - **Scholarly Discovery**: AI-powered literature review and citation analysis
+ - **Cross-Reference Analysis**: Find related works and themes across different time periods
+
+ ## 🛠️ Technical Implementation
+
+ ### **Backend Architecture**
+ - **Flask Application**: RESTful API with async task processing
+ - **Thread Pool**: Background processing for ML inference tasks
+ - **Caching Layer**: Intelligent caching of model components and embeddings
+ - **Error Recovery**: Robust error handling with user-friendly messages
+ - **Data Validation**: Input validation and sanitization for security
+
+ ### **Frontend Architecture**
+ - **Single Page Application**: jQuery-based interactive interface
+ - **Bootstrap UI**: Responsive design with modern components
+ - **Real-time Updates**: WebSocket-like polling for task status
+ - **State Management**: Client-side state for user interactions and preferences
+ - **Accessibility**: Keyboard navigation and screen reader support
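
The thread pool and status polling named in the two lists above fit a common pattern: the server submits inference work to a pool, hands back a task ID, and the client polls for the result. The sketch below shows that pattern with the standard library only. `TaskQueue`, its method names, and the status dictionary shape are hypothetical and are not claimed to match the Flask endpoints in `backend/runner/app.py`.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict


class TaskQueue:
    """Run jobs on a thread pool and expose their state for polling."""

    def __init__(self, max_workers: int = 2) -> None:
        # max_workers mirrors the documented MAX_WORKERS default of 2.
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._tasks: Dict[str, Future] = {}

    def submit(self, fn: Callable[..., Any], *args: Any) -> str:
        """Schedule fn(*args) in the background and return an opaque task ID."""
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = self._pool.submit(fn, *args)
        return task_id

    def status(self, task_id: str) -> Dict[str, Any]:
        """Report the task state; this is what a polling endpoint could return."""
        future = self._tasks.get(task_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "pending"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": str(error)}
        return {"state": "done", "result": future.result()}

    def shutdown(self) -> None:
        """Wait for running tasks to finish and release the worker threads."""
        self._pool.shutdown(wait=True)
```

A route handler would call `submit` when an image is uploaded and return the task ID; the frontend would then request `status` on a timer until the state is `done` or `failed`.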

  ## 🤝 Contributing

 
  5. Submit a pull request

  ### **Data Contributions**
+ - **Embeddings**: Process new art historical texts and generate embeddings
  - **Models**: Improve fine-tuning and model performance
  - **Documentation**: Enhance user guides and API documentation
+ - **Testing**: Add test cases for new features and edge cases

  ## 📄 License & Acknowledgments

 
  - **Research Paper**: [Download PDF](paper/waugh2025artcontext.pdf)
  - **Embeddings Dataset**: [artefact-embeddings on HF](https://huggingface.co/datasets/samwaugh/artefact-embeddings)
  - **JSON Dataset**: [artefact-json on HF](https://huggingface.co/datasets/samwaugh/artefact-json)
+ - **Markdown Dataset**: [artefact-markdown on HF](https://huggingface.co/datasets/samwaugh/artefact-markdown)

  ---

+ *ArteFact represents a significant contribution to computational art history, making large-scale scholarly resources accessible through AI-powered visual analysis while maintaining academic rigor and providing transparent explanations of AI decision-making. The application leverages Hugging Face's distributed data infrastructure for scalable and collaborative research, enabling researchers worldwide to explore the connections between visual art and textual scholarship.*