ericonaldo committed on
Commit
1606ca5
1 Parent(s): 00342af

Upload README.md with huggingface_hub

Files changed (1): README.md +77 -0
README.md ADDED
@@ -0,0 +1,77 @@
---
license: apache-2.0
---
# RoboVLMs model card

## Introduction

This repo contains pre-trained models built with **[RoboVLMs](https://github.com/Robot-VLAs/RoboVLMs)**, a unified framework for easily building VLAs (vision-language-action models) from VLMs.

We open-source three pre-trained model checkpoints and their configs (see the download sketch after this list):

- `kosmos_ph_calvin_abcd`: RoboKosMos (KosMos + policy head) trained on the CALVIN dataset (split ABCD).
- `kosmos_ph_calvin_abc`: RoboKosMos (KosMos + policy head) trained on the CALVIN dataset (split ABC).
- `kosmos_ph_oxe-pretrain`: RoboKosMos (KosMos + policy head) trained on the OXE-magic-soup dataset.

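The snippet below is a minimal, hypothetical sketch of pulling one of these checkpoints and its config from the Hub with `huggingface_hub`; the `repo_id` and file paths are placeholders and are not specified by this README.

```python
# Hypothetical download sketch -- repo_id and filenames are placeholders,
# not taken from this README. Adjust them to match the actual repository layout.
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="<user>/<repo>",                          # placeholder repo id
    filename="configs/kosmos_ph_calvin_abcd.json",    # config used in the Usage example below
)
ckpt_path = hf_hub_download(
    repo_id="<user>/<repo>",                          # placeholder repo id
    filename="checkpoints/kosmos_ph_calvin_abcd.pt",  # checkpoint used in the Usage example below
)
```
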
## Usage

The model can be used to predict actions from vision and language inputs. RoboVLMs supports several VLA structures, multi-view inputs, and various backbones. Taking `kosmos_ph_calvin_abcd` as an example:

```python
import torch
import json
import functools
from PIL import Image
from robovlms.train.base_trainer import BaseTrainer
from robovlms.data.data_utils import preprocess_image
from robovlms.data.data_utils import get_text_function
from robovlms.data.data_utils import unnoramalize_action

# Load the config and point it to the pre-trained checkpoint
configs = json.load(open('configs/kosmos_ph_calvin_abcd.json', 'r'))
pretrained_path = 'checkpoints/kosmos_ph_calvin_abcd.pt'
configs['model_load_path'] = pretrained_path

model = BaseTrainer.from_checkpoint(configs)

# Pre-processing functions for images and language instructions
image_fn = functools.partial(
    preprocess_image,
    image_processor=model.model.image_processor,
    model_type=configs["model"],
)
text_fn = get_text_function(model.model.tokenizer, configs["model"])

prompt = "Task: pickup the bottle on the table"
text_tensor, attention_mask = text_fn([prompt])

MAX_STEPS = 100  # rollout horizon; set as appropriate for your task

for step in range(MAX_STEPS):
    input_dict = {}

    # get_from_side_camera is a placeholder for your own camera interface
    image: Image.Image = get_from_side_camera(...)
    image = image_fn([image]).unsqueeze(0)

    input_dict["rgb"] = image
    input_dict["text"] = text_tensor
    input_dict["text_mask"] = attention_mask

    # if a wrist camera is available (optional multi-view input)
    wrist_image: Image.Image = get_from_wrist_camera(...)
    wrist_image = image_fn([wrist_image]).unsqueeze(0)
    input_dict["hand_rgb"] = wrist_image

    action = model.inference_step(input_dict)["action"]

    # un-normalize / reproject the action if necessary
    if isinstance(action, tuple):
        action = (
            unnoramalize_action(
                action[0], configs["norm_min"], configs["norm_max"]
            ),
            action[1],
        )
    else:
        action = unnoramalize_action(
            action, configs["norm_min"], configs["norm_max"]
        )
```
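
For reference, the `norm_min` / `norm_max` un-normalization above typically maps actions predicted in a normalized range such as [-1, 1] back to the original action range. The function below is a generic, self-contained sketch of that mapping, not RoboVLMs' exact implementation.

```python
import numpy as np

def unnormalize_minmax(action, norm_min, norm_max):
    """Generic min-max un-normalization sketch (illustrative only):
    maps actions in [-1, 1] back to the range [norm_min, norm_max]."""
    action = np.asarray(action, dtype=np.float32)
    norm_min = np.asarray(norm_min, dtype=np.float32)
    norm_max = np.asarray(norm_max, dtype=np.float32)
    return 0.5 * (action + 1.0) * (norm_max - norm_min) + norm_min
```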

## Evaluation