--- license: other license_name: health-ai-developer-foundations license_link: https://developers.google.com/health-ai-developer-foundations/terms language: - en tags: - medical - medical-embeddings - audio - health-acoustic extra_gated_heading: Access HeAR on Hugging Face extra_gated_prompt: >- To access HeAR on Hugging Face, you're required to review and agree to [Health AI Developer Foundation's terms of use](https://developers.google.com/health-ai-developer-foundations/terms). To do this, please ensure you're logged in to Hugging Face and click below. Requests are processed immediately. extra_gated_button_content: Acknowledge license --- # HeAR model card **Model documentation:** [HeAR](https://developers.google.com/health-ai-developer-foundations/hear) **Resources**: * Model on Google Cloud Model Garden: [HeAR](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/hear) * Model on Hugging Face: [google/hear](https://huggingface.co/google/hear) * GitHub repository (supporting code, Colab notebooks, discussions, and issues): [HeAR](https://github.com/google-health/hear) * Quick start notebook: [notebooks/quick\_start](https://github.com/google-health/hear/blob/master/notebooks/quick_start_with_hugging_face.ipynb) * Support: See [Contact](https://developers.google.com/health-ai-developer-foundations/hear/get-started.md#contact). Terms of use: [Health AI Developer Foundations terms of use](https://developers.google.com/health-ai-developer-foundations/terms) **Author**: Google ## Model information This section describes the HeAR model and how to use it. ### Description Health-related acoustic cues, originating from the respiratory system's airflow, including sounds like coughs and breathing patterns can be harnessed for health monitoring purposes. Such health sounds can also be collected via ambient sensing technologies on ubiquitous devices such as mobile phones, which may augment screening capabilities and inform clinical decision making. Health acoustics, specifically non-semantic respiratory sounds, also have potential as biomarkers to detect and monitor various health conditions, for example, identifying disease status from cough sounds, or measuring lung function using exhalation sounds made during spirometry. Health Acoustic Representations, or HeAR, is a health acoustic foundation model that is pre trained to efficiently represent these non-semantic respiratory sounds to accelerate research and development of AI models that use these inputs to make predictions. HeAR is trained unsupervised on a large and diverse unlabelled corpus, which may generalize better than non-pretrained models to unseen distributions and new tasks. Key Features * Generates health-optimized embeddings for biological sounds such as coughs and breathes * Versatility: Exhibits strong performance across diverse health acoustic tasks. * Data Efficiency: Demonstrates high performance even with limited labeled training data for downstream tasks. * Microphone robustness: Downstream models trained using HeAR generalize well to sounds recorded from unseen devices. Potential Applications HeAR can be a useful tool for AI research geared towards discovery of novel acoustic biomarkers in the following areas: * Aid screening & monitoring for respiratory diseases like COVID-19, tuberculosis, and COPD from cough and breath sounds. * Low-resource settings: Can potentially augment healthcare services in settings with limited resources by offering accessible screening and monitoring tools. ### How to use Below are some example code snippets to help you quickly get started running the model locally. If you want to use the model to run inference on a large amount of audio, we recommend that you create a production version using [the Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/hear). ```python import numpy as np from huggingface_hub.utils import HfFolder from huggingface_hub import notebook_login, from_pretrained_keras, notebook_login if HfFolder.get_token() is None: notebook_login() # Load the model from Hugging Face model = from_pretrained_keras("google/hear",) serving_signature = model.signatures['serving_default'] # Generate 4 Examples of two-second random audio clips raw_audio_batch = np.random.normal(size=(4, 32000)) # Perform Inference to obtain HeAR embeddings # There are 4 embeddings each with length 512 corresponding to the 4 inputs embedding_batch = serving_signature(x=raw_audio_batch)['output_0'].numpy() ``` ### Examples See the following Colab notebooks for examples of how to use HeAR: * To give the model a quick try, running it locally with weights from Hugging Face, see [Quick start notebook in Colab](https://colab.research.google.com/github/google-health/hear/blob/master/notebooks/quick_start_with_hugging_face.ipynb). * For an example of how to use the model to train a linear classifier, see [Linear classifier notebook in Colab](https://colab.research.google.com/github/google-health/hear/blob/master/notebooks/train_data_efficient_classifier.ipynb). ### Model architecture overview HeAR is a [Masked Auto Encoder](https://arxiv.org/abs/2111.06377), a [transformer-based](https://arxiv.org/abs/1706.03762) neural network. * It was trained using masked auto-encoding on a large corpus of health-related sounds, with a self-supervised learning objective on a massive dataset (\~174k hours) of two-second audio clips. At training time, it tries to reconstruct masked spectrogram patches from the visible patches. * After it is trained, its encoder can generate low-dimensional representations of two-second audio clips, optimized for capturing and containing the most salient parts of health-related information from sounds like coughs and breathes. * These representations, or embeddings, can be used as inputs to other models trained for a variety of supervised tasks related to health. * The HeAR model was developed based on a [ViT-L architecture](https://arxiv.org/abs/2010.11929) * Instead of relying on CNNs, a pure transformer applied directly to sequences of image patches is the idea behind the model architecture, and it resulted in good performance in image classification tasks. This approach of using the Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. * The training process for HeAR comprised of three main components * A data curation step (including a health acoustic event detector); * A general purpose training step to develop an audio encoder (embedding model), and * A task-specific evaluation step that adopts the trained embedding model for various downstream tasks. * The system is designed to encode two-second long audio clips and generate audio embeddings for use in downstream tasks. ### Technical Specifications * Model type: [ViT (vision transformer)](https://arxiv.org/abs/2010.11929) * Key publication: [https://arxiv.org/abs/2403.02522](https://arxiv.org/abs/2403.02522) * Model created: 2023-12-04 * Model Version: 1.0.0 ### Performance & Validation HeAR's performance has been validated via linear probing the frozen embeddings on a benchmark of 33 health acoustic tasks across 6 datasets. HeAR is benchmarked on a diverse set of health acoustic tasks spanning 13 health acoustic event detection tasks, 14 cough inference tasks, and 6 spirometry inference tasks, across 6 datasets, and it demonstrated that simple linear classifiers trained on top of our representations can perform as good or better than many similar leading models. ### Key performance metrics * HeAR achieved high performance on **diverse health-relevant tasks**: inference of medical conditions (TB, COVID) and medically-relevant quantities (lung function, smoking status) from recordings of coughs or exhalations, including a task on predicting chest X-ray findings (pleural effusion, opacities etc.). * HeAR had **superior device generalizability** compared to other models (MRR=0.745 versus second-best being CLAP with MRR=0.497), which is crucially important for real-world applications. * HeAR is more **data efficient** than baseline models, sometimes reaching the same level of performance when trained on as little as 6.25% of the amount of training data. ### Inputs and outputs **Input:** Two-second long 16 kHz mono audio clip. Inputs can be batched so you can pass in n=10 as (10,32k) or n=1 as (1,32k) **Output:** Embedding vector of floating point values in (n, 512) for n two-second clips in the vector, or an embedding of length 512 for each two-second input clip. ### Dataset details ### Training dataset For training, a dataset of YT-NS (YouTube Non-Semantic) was curated, and it consisted of two-second long audio clips extracted from three billion public non-copyrighted YouTube videos using a health acoustic event detector, totalling 313.3 million two-second clips or roughly 174k hours of audio. We chose a two-second window since most events we cared about were shorter than that. The HeAR audio encoder is trained solely on this dataset. ### Evaluation dataset Six datasets were used for evaluation: * [FSD50K](https://zenodo.org/records/4060432) * [Flusense](https://github.com/Forsad/FluSense-data) * [CoughVID](https://zenodo.org/records/4048312) * [Coswara](https://zenodo.org/records/7188627) * [CIDRZ](https://www.kaggle.com/datasets/googlehealthai/google-health-ai) * [SpiroSmart](https://dl.acm.org/doi/10.1145/2370216.2370261) ## License The use of the HeAR is governed by the [Health AI Developer Foundations terms of use](https://developers.google.com/health-ai-developer-foundations/terms). ### Implementation information Details about the model internals. ### Software Training was done using [JAX](https://github.com/jax-ml/jax) JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. ## Use and limitations ### Intended use * Research and development of health-related acoustic biomarkers. * Exploration of novel applications in disease detection and health monitoring. ### Benefits HeAR embeddings can be used for efficient training of AI models for health acoustics tasks with significantly less data and compute than training neural networks initialised randomly or from checkpoints trained on generic datasets. This allows quick prototyping to see if health acoustics signals can be used by themselves or combined with other signals to make predictions of interest. ### Limitations * Limited Sequence Length: Primarily trained on 2-second audio clips. * Model Size: Current model size is too large for on-device deployment. * Bias Considerations: Potential for biases based on demographics and recording device quality, necessitating further investigation and mitigation strategies. * HeAR was trained using two-second audio clips of health-related sounds from a public non-copyrighted subset of Youtube. These clips come from a variety of sources but may be noisy or low-quality. * The model is only used to generate embeddings of the user-owned dataset. It does not generate any predictions or diagnosis on its own. * As with any research, developers should ensure that any downstream application is validated to understand performance using data that is appropriately representative of the intended use setting for the specific application (e.g., age, sex, gender, recording device, background noise, etc.).