
VHASR: A Multimodal Speech Recognition System With Vision Hotwords

This repository provides VHASR models trained on Flickr8k, ADE20k, COCO, and OpenImages.

Our paper is available at https://arxiv.org/abs/2410.00822.

Our code, along with details on training and testing, is available at https://github.com/193746/VHASR/tree/main.

Inference

If you are interested in our work, you can train your own model on large-scale data and run inference with the command below. Note that the CLIP config files must be placed in '{model_file}/clip_config', as in the four pretrained models we provide.

```shell
cd VHASR
CUDA_VISIBLE_DEVICES=1 python src/infer.py \
    --model_name "{path_to_model_folder}" \
    --speech_path "{path_to_speech}" \
    --image_path "{path_to_image}" \
    --merge_method 3
```
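When running inference over many speech/image pairs, it can be convenient to assemble the command programmatically. A minimal sketch, assuming only the CLI flags shown above (`build_infer_command` is a hypothetical helper, not part of the repository):

```python
import shlex

def build_infer_command(model_name, speech_path, image_path,
                        merge_method=3, gpu=0):
    """Assemble the VHASR inference command line.

    The flags mirror those accepted by src/infer.py in the command
    above; the GPU is selected via the CUDA_VISIBLE_DEVICES variable.
    """
    cmd = [
        "python", "src/infer.py",
        "--model_name", model_name,
        "--speech_path", speech_path,
        "--image_path", image_path,
        "--merge_method", str(merge_method),
    ]
    env = {"CUDA_VISIBLE_DEVICES": str(gpu)}
    return cmd, env

# Example: build the command for one speech/image pair.
cmd, env = build_infer_command("my_model", "sample.wav", "sample.jpg", gpu=1)
print(shlex.join(cmd))
```

The resulting `cmd` list can be passed to `subprocess.run(cmd, env={**os.environ, **env}, cwd="VHASR")` to launch inference for each pair.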

Citation

```bibtex
@misc{hu2024vhasrmultimodalspeechrecognition,
      title={VHASR: A Multimodal Speech Recognition System With Vision Hotwords},
      author={Jiliang Hu and Zuchao Li and Ping Wang and Haojun Ai and Lefei Zhang and Hai Zhao},
      year={2024},
      eprint={2410.00822},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2410.00822},
}
```

License: cc-by-nc-4.0
