---
license: apache-2.0
---

# InkSight Small-p

From [InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write](https://github.com/google-research/inksight)
| Model Architecture | A multimodal sequence-to-sequence Transformer model with the mT5 encoder-decoder architecture. It takes text tokens and dense ViT image embeddings as encoder inputs and autoregressively predicts discrete text and ink tokens with the decoder (a conceptual sketch of this input layout follows the table). |
| --- | --- |
| Input(s) | A pair of an image and text. |
| Output(s) | Generated digital ink and text (see the ink detokenization sketch after the table). |
| Usage | Application: The model is a research prototype; this public version is released and available to the public. Known Caveats: None. |
| System Type | System Description: This is a standalone model. Upstream Dependencies: None. Downstream Dependencies: None. |
| Implementation Frameworks | Hardware & Software: Hardware: TPU v5e. Software: T5X, JAX/Flax, Flaxformer. Compute Requirements: We train all of our models for 340k steps with batch size 512. With frozen ViT encoders, training Small-i takes ~33h on 64 TPU v5e chips and training Large-i takes ~105h on 64 TPU v5e chips. |
| Data Overview | Training Datasets: The ViT encoder of Small-p is pretrained on ImageNet-21k; the mT5 encoder and decoder are initialized from scratch. The entire model is trained on the mixture of publicly available datasets described in the next section. |
| Evaluation Results | Evaluation Methods: Human evaluation (reported in Section 4.5.1 of the paper) and automated evaluations (reported in Section 4.5.2 of the paper). |
| Model Usage & Limitations | Sensitive Use: The model is capable of converting images to digital ink. It should not be used for privacy-intruding purposes, e.g., forging handwriting. Known Limitations: Reported in Appendix I of the paper. Ethical Considerations & Potential Societal Consequences: Reported in Sections 6.1 and 6.2 of the paper. |
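To make the architecture row concrete, the sketch below shows the general multimodal pattern in plain NumPy: ViT patch embeddings and embedded prompt tokens are concatenated into a single encoder input sequence, and the decoder then emits one stream of ids drawn from a joint text-plus-ink vocabulary. Every name, dimension, and vocabulary split here is an illustrative assumption, not the released model's API; the actual interface is documented in the project repository.

```python
import numpy as np

# Conceptual illustration only: dimensions, vocabulary split, and helper names
# are assumptions for exposition, not the released model's configuration.
D_MODEL = 512        # shared embedding width
N_PATCHES = 196      # e.g., 14x14 patches from a 224x224 image
TEXT_LEN = 16        # length of the tokenized text prompt
TEXT_VOCAB = 250     # hypothetical: ids below this are text tokens

rng = np.random.default_rng(0)

# 1) Dense patch embeddings from the (frozen) ViT image encoder -- stand-ins here.
image_embeddings = rng.normal(size=(N_PATCHES, D_MODEL))

# 2) Embedded text tokens for the prompt (e.g., the task instruction).
text_embeddings = rng.normal(size=(TEXT_LEN, D_MODEL))

# 3) The mT5 encoder consumes one concatenated multimodal sequence.
encoder_inputs = np.concatenate([image_embeddings, text_embeddings], axis=0)
assert encoder_inputs.shape == (N_PATCHES + TEXT_LEN, D_MODEL)

# 4) The decoder emits ids from a joint vocabulary; each id is either a text
#    token or a discrete ink token, so one output stream carries both modalities.
def token_modality(token_id: int) -> str:
    return "text" if token_id < TEXT_VOCAB else "ink"

print(token_modality(42), token_modality(3000))  # text ink
```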
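The "generated digital ink" output is a stream of discrete ink tokens that must be detokenized back into strokes. One plausible layout, shown below purely for illustration, encodes each pen position as a quantized cell on a fixed grid and uses a special token to mark pen lifts; the model's actual ink codebook is described in the paper.

```python
from typing import List, Tuple

# Hypothetical token layout for illustration: ids 0..GRID*GRID-1 encode a
# quantized (x, y) cell on a GRID x GRID canvas; PEN_UP ends the current stroke.
GRID = 64
PEN_UP = GRID * GRID  # assumed special token, not the model's actual id

def detokenize_ink(tokens: List[int]) -> List[List[Tuple[float, float]]]:
    """Turn a flat stream of discrete ink tokens into polyline strokes."""
    strokes, current = [], []
    for t in tokens:
        if t == PEN_UP:  # stroke boundary: close the current polyline
            if current:
                strokes.append(current)
            current = []
        else:            # coordinate token: unpack the grid cell
            x, y = t % GRID, t // GRID
            current.append((x / (GRID - 1), y / (GRID - 1)))  # normalize to [0, 1]
    if current:
        strokes.append(current)
    return strokes

# Example: two strokes separated by a pen lift.
print(detokenize_ink([0, 65, 130, PEN_UP, 4032, 4095]))
```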