Indic Conformer 600M Multilingual CoreML RNNT

CoreML conversion of AI4Bharat's indic-conformer-600m-multilingual ASR model for local macOS inference experiments.

This is a converted runtime package derived from AI4Bharat's IndicConformer-600M-Multi, a multilingual Conformer-based hybrid CTC + RNNT ASR model. Please refer to the upstream model card for the original model details, supported IN-22 language coverage, authorship, training context, and usage notes.

This repository contains the CoreML runtime components currently used by the Muesli Indic ASR experiment:

  • INT8 encoder: coreml/encoder/indic_conformer_encoder_int8.mlpackage
  • RNNT prediction network: coreml/rnnt/indic_conformer_rnnt_decoder_reconstructed.mlpackage
  • RNNT joint network components:
    • indic_conformer_joint_enc.mlpackage
    • indic_conformer_joint_pred.mlpackage
    • indic_conformer_joint_pre_net.mlpackage
    • language-specific indic_conformer_joint_post_net_<lang>.mlpackage
  • Metadata:
    • metadata/vocab.json
    • metadata/language_masks.json
    • metadata/config.json
    • metadata/preprocessor.ts
    • metadata/preprocessor_constants.bin

Supported Language Heads

The uploaded runtime set includes language-specific RNNT post-net heads for:

  • Hindi (hi)
  • Bengali (bn)
  • Marathi (mr)
  • Telugu (te)
  • Tamil (ta)
  • Malayalam (ml)
  • Kannada (kn)

The runtime expects the caller to select the language explicitly. This package does not provide automatic language detection.
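As a minimal sketch of explicit language selection, the helper below maps a language code from the list above to its post-net package name. The function name `post_net_path` and the `root` directory argument are hypothetical; this card gives the post-net file name pattern but not the directory it lives in, so the caller supplies it.

```python
from pathlib import Path

# Language codes with a post-net head in this upload (from the list above).
SUPPORTED_LANGS = ("hi", "bn", "mr", "te", "ta", "ml", "kn")


def post_net_path(lang: str, root: Path) -> Path:
    """Return the post-net package path for an explicitly chosen language.

    `root` is an assumption: the directory holding the joint components.
    """
    if lang not in SUPPORTED_LANGS:
        raise ValueError(
            f"unsupported language {lang!r}; expected one of {SUPPORTED_LANGS}"
        )
    return root / f"indic_conformer_joint_post_net_{lang}.mlpackage"
```

Rejecting unknown codes early keeps a missing language head from surfacing later as a model-loading failure.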

Input Shapes

Encoder:

  • audio_signal: Float32 [1, 80, 1024]
  • length: Int32 [1]

RNNT decoder:

  • targets: Int32 [1, 1]
  • target_length: Int32 [1]
  • states_1: Float32 [2, 1, 640]
  • cell_state_in: Float32 [2, 1, 640]
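The decoder shapes above can be illustrated with a sketch that builds the inputs for a first decoding step. It assumes the first call feeds the start token (id 5632, from the decoding notes) together with zero-initialised recurrent states; zero initialisation is a common convention for prediction networks but is not confirmed by this card. Nested Python lists stand in for the Int32 and Float32 arrays, and the function name is hypothetical.

```python
START_ID = 5632                 # start token id from the decoding notes
STATE_SHAPE = (2, 1, 640)       # shape of states_1 and cell_state_in


def zeros(shape):
    """Build an independent nested list of 0.0 values with the given shape."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]


def initial_decoder_inputs():
    """Inputs for the first prediction-network call (assumed convention)."""
    return {
        "targets": [[START_ID]],               # Int32 [1, 1]
        "target_length": [1],                  # Int32 [1]
        "states_1": zeros(STATE_SHAPE),        # Float32 [2, 1, 640]
        "cell_state_in": zeros(STATE_SHAPE),   # Float32 [2, 1, 640]
    }
```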

The encoder input is fixed at 1024 mel frames. Process long-form audio by splitting it into fixed-size windows, decoding each window, and stitching the decoded text together.
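One way to window the features is sketched below. It assumes non-overlapping windows, padding of the final window with a constant value, and that the encoder's `length` input carries the number of valid frames; none of these choices is confirmed by this card, and the Muesli runtime may window differently. The function name `chunk_mel` and the `pad_value` default are hypothetical.

```python
from typing import List, Tuple

ENCODER_FRAMES = 1024  # fixed encoder window, from audio_signal [1, 80, 1024]


def chunk_mel(
    mel: List[List[float]],
    frames: int = ENCODER_FRAMES,
    pad_value: float = 0.0,  # assumption: the real frontend may pad differently
) -> List[Tuple[List[List[float]], int]]:
    """Split a [n_mels][total_frames] feature matrix into fixed windows.

    Returns (window, valid_length) pairs; each window is [n_mels][frames].
    """
    total = len(mel[0]) if mel else 0
    chunks = []
    for start in range(0, total, frames):
        valid = min(frames, total - start)
        window = [
            row[start:start + valid] + [pad_value] * (frames - valid)
            for row in mel
        ]
        chunks.append((window, valid))
    return chunks
```

Each window would then be encoded and decoded independently, with the per-window transcripts joined in order. Words that straddle a window boundary can be split by this simple scheme, so overlapping windows may be worth trying.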

Preprocessing Notes

metadata/preprocessor.ts is the upstream TorchScript audio preprocessor from the AI4Bharat export.

metadata/preprocessor_constants.bin is a compact extraction of the TorchScript preprocessing constants used by the Muesli Swift runtime. It contains the STFT window, mel filterbank, pre-emphasis coefficient, and log/normalization guard values needed to reproduce the upstream log-mel frontend more closely in native Swift.

Muesli treats preprocessor_constants.bin as required metadata for this CoreML runtime package.

Decoding Notes

The Muesli experiment uses greedy RNNT decoding with:

  • blank token id: 256
  • start token id: 5632
  • max symbols per frame: 10

Most shipped language vocabularies do not include punctuation tokens, so punctuation restoration should be handled as a downstream post-processing step.
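The decoding parameters above fit into a standard greedy RNNT loop, sketched below. The `predict` and `joint` callables are hypothetical stand-ins for the CoreML prediction network and the chained joint components; their signatures are assumptions made for illustration. The sketch also leaves out how emitted ids map back into `metadata/vocab.json` or into the prediction network's input ids, which this card does not describe.

```python
from typing import Any, Callable, List, Sequence, Tuple

BLANK_ID = 256
START_ID = 5632
MAX_SYMBOLS_PER_FRAME = 10


def greedy_rnnt_decode(
    enc_frames: Sequence[Any],
    predict: Callable[[int, Any], Tuple[Any, Any]],   # (token, state) -> (pred_out, state)
    joint: Callable[[Any, Any], Sequence[float]],     # (enc_frame, pred_out) -> logits
    initial_state: Any,
) -> List[int]:
    """Greedy RNNT decoding: emit the argmax token until blank, then advance."""
    tokens: List[int] = []
    # Prime the prediction network with the start token.
    pred_out, state = predict(START_ID, initial_state)
    for frame in enc_frames:
        # Cap emissions per frame so a non-blank loop cannot run forever.
        for _ in range(MAX_SYMBOLS_PER_FRAME):
            logits = joint(frame, pred_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == BLANK_ID:
                break  # blank: move on to the next encoder frame
            tokens.append(best)
            # Non-blank: update the prediction network with the new token.
            pred_out, state = predict(best, state)
    return tokens
```

The prediction network state is only updated on non-blank emissions, so a blank leaves `pred_out` unchanged for the next frame.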

Excluded Artifacts

This upload intentionally excludes locally compiled .mlmodelc directories. Those are machine-local CoreML compilation outputs and should be regenerated by macOS on the target machine.

The larger FP16 encoder and CTC decoder artifacts from earlier conversion experiments are also excluded from this runtime-focused upload.

Attribution

Original model: ai4bharat/indic-conformer-600m-multilingual

Original model name: IndicConformer-600M-Multi

Original authors/organization: AI4Bharat

Original architecture: multilingual Conformer-based hybrid CTC + RNNT ASR model.

This repository only packages converted CoreML artifacts for local macOS inference. For original training details, intended use, supported languages, contact information, and license terms, see the upstream AI4Bharat model card.
