Update README.md
README.md CHANGED
```diff
@@ -5,6 +5,9 @@ base_model: TokenOCR
 base_model_relation: finetune
 ---
 
+# A Token-level Text Image Foundation Model for Document Understanding
+
+
 [\[π GitHub\]](https://github.com/Token-family/TokenOCR)    [\[π Paper\]]() [\[π Blog\]]()    [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL)    [\[π Quick Start\]](#quick-start)  
 
 <div align="center">
@@ -22,7 +25,8 @@ we seamlessly replace previous VFMs with TokenOCR to construct a document-level
 
 # Token Family
 
-## TokenIT
+<!-- ## TokenIT -->
+<h2 style="color: #4CAF50;">TokenIT</h2>
 
 In the following picture, we provide an overview of the self-constructed token-level **TokenIT** dataset, comprising 20 million images and 1.8 billion
 text-mask pairs. 
@@ -50,7 +54,9 @@ The comparisons with other visual foundation models:
 | **TokenOCR**           | **token-level** | **TokenIT**  | **20M**    | **1.8B**   |
 
 
-## TokenOCR
+<!-- ## TokenOCR
+ -->
+<h2 style="color: #4CAF50;">TokenOCR</h2>
 
 ### Model Architecture
 
@@ -136,7 +142,8 @@ Please refer to our technical report for more details.
 
 <!-- 
 -->
-## TokenVL
+<!-- ## TokenVL -->
+<h2 style="color: #4CAF50;">TokenVL</h2>
 
 we employ the TokenOCR as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding. 
 Following the previous training paradigm, TokenVL also includes two stages: 
```
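For context, the `base_model_relation: finetune` line in the first hunk is Hugging Face model-card YAML front matter: it tells the Hub that this repo is a fine-tune of the `base_model` named in the hunk header. A minimal card header consistent with this diff would look like the sketch below (any fields beyond the two shown in the diff, such as `license` or `pipeline_tag`, are illustrative placeholders, not taken from this commit):

```yaml
---
# Hub metadata: this repo derives from the TokenOCR base model
base_model: TokenOCR
base_model_relation: finetune
# illustrative extras a model card commonly carries (not part of this diff)
license: mit
pipeline_tag: image-text-to-text
---
```

The Hub uses `base_model`/`base_model_relation` to link the fine-tuned repo back to its base model on the model page; the rest of the README (everything below the closing `---`) renders as ordinary markdown, which is why the diff can swap `## TokenIT`-style headings for inline `<h2 style="color: #4CAF50;">` tags to color the section titles.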