edugp committed
Commit 5019883
1 Parent(s): 2b6deb0

Update README and add a training doc

Files changed (2):
  1. README.md +25 -7
  2. training.md +7 -0
README.md CHANGED
@@ -1,7 +1,25 @@
- # Download datasets:
- * Download and decompress tsv file from here: https://github.com/google-research-datasets/wit/blob/main/DATA.md
- * Use `prepare_wit.py` to download images from Wikipedia as annotated on each TSV file.
- * Use `scale_converter.py` to remove corrupt images and resize suitable images to 224x224
- * Use `join_datasets_custom_split.py` to group all JSONs from different subsets of the dataset together
- * Use `discard_incorrect_files.py` to filter out images that we were not able to convert.
- * Finally, use `run-clip.sh` to train.
+ ---
+ language: es
+ license: cc-by-4.0
+ tags:
+ - spanish
+ - roberta
+ - vit
+ ---
+ # CLIP-Spanish
+ CLIP-Spanish is a CLIP-like model for Spanish, implemented in [Flax](https://github.com/google/flax). It pairs a RoBERTa-base text encoder with a ViT-B/32 image encoder; training scripts are included (see training.md, and the loading sketch after this diff).
+ This is part of the [Flax/Jax Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organised by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google.
+ ## Spanish WIT
+ We used a subset of 141,230 Spanish captions from the [WIT dataset](https://github.com/google-research-datasets/wit) for training.
+
+ ## Team members
+ - Eduardo González Ponferrada ([edugp](https://huggingface.co/edugp))
+ - Manu Romero ([mrm8488](https://huggingface.co/mrm8488))
+ - María Grandury ([mariagrandury](https://huggingface.co/mariagrandury))
+ ## Useful links
+ - [Community Week timeline](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104#summary-timeline-calendar-6)
+ - [Community Week README](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md)
+ - [Community Week thread](https://discuss.huggingface.co/t/bertin-pretrain-roberta-large-from-scratch-in-spanish/7125)
+ - [Community Week channel](https://discord.com/channels/858019234139602994/859113060068229190)
+ - [Hybrid CLIP example scripts](https://github.com/huggingface/transformers/tree/master/examples/research_projects/jax-projects/hybrid_clip)
+ - [Model Repository](https://huggingface.co/flax-community/bertin-roberta-large-spanish/)
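For context on the RoBERTa-base + ViT-B/32 pairing described above, here is a minimal loading sketch assuming the `FlaxHybridCLIP` class from the hybrid CLIP example scripts linked in the README; the two checkpoint names are illustrative assumptions, not confirmed training settings:

```python
# A minimal sketch: FlaxHybridCLIP ships with the hybrid_clip example scripts
# (see the "Hybrid CLIP example scripts" link above); run this from that directory.
from modeling_hybrid_clip import FlaxHybridCLIP

# Pair a Spanish RoBERTa text encoder with a ViT-B/32 image encoder.
# Both checkpoint names below are assumptions for illustration.
model = FlaxHybridCLIP.from_text_vision_pretrained(
    "flax-community/bertin-roberta-large-spanish",  # assumed text encoder
    "openai/clip-vit-base-patch32",                 # ViT-B/32 image encoder
)
```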
training.md ADDED
@@ -0,0 +1,7 @@
+ # Training
+ * Download the TSV files from here: https://github.com/google-research-datasets/wit/blob/main/DATA.md
+ * Use `prepare_wit.py` to download images from Wikipedia as annotated in each TSV file.
+ * Use `scale_converter.py` to remove corrupt images and resize suitable images to 224x224 (a rough sketch of this step follows the list).
+ * Use `join_datasets_custom_split.py` to group all JSONs from different subsets of the dataset together.
+ * Use `discard_incorrect_files.py` to filter out images that we were not able to convert.
+ * Finally, use `run-clip.sh` to train.
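As a rough illustration of the `scale_converter.py` step above (a hedged sketch, not the script's actual contents), this drops unreadable files and resizes the rest to 224x224; the directory names are hypothetical:

```python
# Hypothetical sketch of the resize/validation step; the real logic lives in
# scale_converter.py and may differ. Directory names here are assumptions.
from pathlib import Path

from PIL import Image

def convert_image(src: Path, dst: Path) -> bool:
    """Return True if the image opened cleanly and was resized to 224x224."""
    try:
        with Image.open(src) as img:
            img.convert("RGB").resize((224, 224)).save(dst, format="JPEG")
        return True
    except OSError:  # corrupt, truncated, or unsupported file
        return False

out_dir = Path("resized")
out_dir.mkdir(exist_ok=True)
# Keep only the images that converted cleanly.
kept = [
    src for src in Path("wit_images").iterdir()
    if convert_image(src, out_dir / (src.stem + ".jpg"))
]
print(f"Kept {len(kept)} images")
```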