InternLM-XComposer

InternLM-XComposer is a vision-language large model (VLLM) based on InternLM for advanced text-image comprehension and composition. InternLM-XComposer has several appealing properties:

  • Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. The interleaved text-image composition is implemented in the following steps:

    1. Text Generation: It crafts long-form text based on human-provided instructions.
    2. Image Spotting and Captioning: It pinpoints optimal locations for image placement and furnishes image descriptions.
    3. Image Retrieval and Selection: It selects image candidates and identifies the image that optimally complements the content.
  • Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on extensive multi-modal multilingual concepts with carefully crafted strategies, resulting in a deep understanding of visual content.

  • Strong performance: It consistently achieves state-of-the-art results across various benchmarks for vision-language large models, including MME Benchmark (English), MMBench (English), Seed-Bench (English), CCBench (Chinese), and MMBench-CN (Chinese).
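
The three composition steps listed above can be illustrated with a toy pipeline. This is a minimal sketch, not InternLM-XComposer's implementation: every name in it (`generate_text`, `spot_and_caption`, `retrieve_and_select`, `compose`, `ImageSlot`) is hypothetical, text generation is a fixed stub where the model would call the LLM, and simple word overlap stands in for whatever retrieval and selection the model actually performs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ImageSlot:
    after_paragraph: int  # index of the paragraph the image should follow
    caption: str          # description used as the retrieval query


def generate_text(instruction: str) -> list[str]:
    """Step 1 (stub): a real system would prompt the LLM with the instruction."""
    return [
        f"Introduction to {instruction}.",
        f"A closer look at {instruction} in daily life.",
    ]


def spot_and_caption(paragraphs: list[str]) -> list[ImageSlot]:
    """Step 2 (toy rule): one slot after each paragraph, captioned by its text."""
    return [ImageSlot(i, text) for i, text in enumerate(paragraphs)]


def _tokens(text: str) -> set[str]:
    # Lowercased words with trailing punctuation stripped.
    return {word.strip(".,").lower() for word in text.split()}


def retrieve_and_select(slot: ImageSlot, library: dict[str, str], top_k: int = 2) -> str:
    """Step 3 (toy rule): rank images by word overlap with the caption,
    keep the top_k candidates, and select the best-ranked one."""
    query = _tokens(slot.caption)
    ranked = sorted(
        library,
        key=lambda name: len(query & _tokens(library[name])),
        reverse=True,
    )
    candidates = ranked[:top_k]
    return candidates[0]


def compose(instruction: str, library: dict[str, str]) -> list[str]:
    """Run the three steps and interleave image placeholders with the text."""
    paragraphs = generate_text(instruction)
    slots = {slot.after_paragraph: slot for slot in spot_and_caption(paragraphs)}
    article: list[str] = []
    for index, paragraph in enumerate(paragraphs):
        article.append(paragraph)
        if index in slots:
            image = retrieve_and_select(slots[index], library)
            article.append(f"[image: {image}]")
    return article


if __name__ == "__main__":
    # Hypothetical image library: file name -> stored description.
    demo_library = {
        "tea.jpg": "green tea in daily life",
        "car.jpg": "a red sports car",
    }
    for line in compose("green tea", demo_library):
        print(line)
```

Keeping the three steps as separate functions mirrors the list above: each stub could be replaced by a model call without changing how the article is assembled.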

We release the InternLM-XComposer series in two versions:

  • InternLM-XComposer-VL: The pretrained VLLM with InternLM as the initialization of the LLM, achieving strong performance on various multimodal benchmarks, e.g., MME Benchmark, MMBench, Seed-Bench, CCBench, and MMBench-CN.
  • InternLM-XComposer: The finetuned VLLM for Interleaved Text-Image Composition and for use as an LLM-based AI assistant.