---
license: mit
base_model:
- mistralai/Pixtral-12B-2409
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- lora
datasets:
- Multimodal-Fatima/FGVC_Aircraft_train
- takara-ai/FloodNet_2021-Track_2_Dataset_HF
---
# pixtral_aerial_VQA_adapter

## Model Details

- **Type**: LoRA Adapter
- **Total Parameters**: 6,225,920
- **Memory Usage**: 23.75 MB
- **Precisions**: torch.float32
- **Layer Types**:
  - lora_A: 40
  - lora_B: 40
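
The reported memory usage follows directly from the parameter count and precision; a quick sanity check (the numbers below are taken from the list above):

```python
# Verify the reported adapter size: 6,225,920 parameters stored
# in torch.float32 (4 bytes each) should come to 23.75 MB.
total_params = 6_225_920
bytes_per_param = 4  # torch.float32
size_mb = total_params * bytes_per_param / 2**20
print(f"{size_mb:.2f} MB")  # 23.75 MB
```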

## Intended Use

- **Primary intended uses**: Processing aerial footage of construction sites for structural and construction surveying.
- It can also be applied to other detailed VQA tasks on aerial footage.

## Training Data

- **Datasets**:
  1. FloodNet Track 2 dataset
  2. Subset of FGVC Aircraft dataset
  3. Custom dataset of 10 image-caption pairs created using Pixtral

## Training Procedure

- **Training method**: LoRA (Low-Rank Adaptation)
- **Base model**: Ertugrul/Pixtral-12B-Captioner-Relaxed
- **Training hardware**: Nebius-hosted NVIDIA H100 machine

## Citation

```bibtex
@misc{rahnemoonfar2020floodnet,
  title={FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding},
  author={Maryam Rahnemoonfar and Tashnim Chowdhury and Argho Sarkar and Debvrat Varshney and Masoud Yari and Robin Murphy},
  year={2020},
  eprint={2012.02951},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  doi={10.48550/arXiv.2012.02951}
}
```