Does bge-m3 understand HTML tags <h1>, <h2>, <code> etc.?

#9
by DracoDev - opened

I want to embed HTML pages as vectors using colbertv2.0, but I am not sure which input format is best. In particular, I am concerned with preserving the logical structure of the documents. For example, let's consider a simple document:

        <html>
          <header>
            <h1> Title: How To Prepare HTML Text for colbert-ir/colbertv2.0 Embedding Model </h1>
            <p> Do you have html files but you're not sure how to prepare them for colbertv2.0 models? You have come to the right place. </p>
          </header>
          <body>
            <h2> section title: First Steps In Preparing Your HTML Data!
              <p> The first steps are ... </p>
              <p> second steps are .. here are some examples: </p>
              <code> ... </code>
              <p> more info .. </p>
              <h3> subsection title: What About Chunking Size
                <p> more paragraphs here etc. </p>
              </h3>
            </h2>
            <h2> section title 2: how do you encode the data into colbertv2.0
              <p> initial steps are ... </p>
              <p> now for some code </p>
              <code> print("colbertv2.0") </code>
              <p> in summary </p>
            </h2>
          </body>
        </html>

What is the most understandable format for this HTML to be transformed into? Can we leave pseudo-HTML tags like the ones above, or should we use some kind of text-only indicators, like
           ** Title: How to Prepare HTML Text for Colbertv2.0 Embedding Model **

                     *** section title: How to prepare your html data for Colbert 2.0 model ***
                           The first steps are ...

                           second steps are .. here are some examples:  <link> to another document that is also ColbertV2.0 embedded
                           ```
                           # hello world
                           import os
                           print("hello world")
                           ```
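
To make the text-only option concrete, here is a minimal sketch of such a conversion using only Python's standard-library `html.parser`. The marker style (`**`, `***`, `****`) simply follows the example above, and the names `MarkedTextConverter` and `html_to_marked_text` are hypothetical. It only illustrates the format being asked about; whether the model handles this better than raw tags is still the open question.

```python
from html.parser import HTMLParser

# Assumed marker per heading level, modelled on the text-only example above.
# This is not a format the model is known to prefer.
HEADING_MARKS = {"h1": "**", "h2": "***", "h3": "****"}
TRACKED = set(HEADING_MARKS) | {"p", "code"}
FENCE = "`" * 3  # code-fence marker, built this way to keep this listing readable


class MarkedTextConverter(HTMLParser):
    """Collect text from HTML, marking headings and fencing <code> content."""

    def __init__(self):
        super().__init__()
        self.lines = []      # finished output blocks
        self.current = None  # most recently opened tracked tag
        self.buffer = []     # text seen since the last tracked tag boundary

    def _flush(self):
        raw = "".join(self.buffer)
        self.buffer = []
        if not raw.strip():
            return
        if self.current == "code":
            # Keep code text as-is apart from surrounding whitespace.
            self.lines.append(f"{FENCE}\n{raw.strip()}\n{FENCE}")
            return
        text = " ".join(raw.split())  # collapse runs of whitespace
        mark = HEADING_MARKS.get(self.current)
        self.lines.append(f"{mark} {text} {mark}" if mark else text)

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED:
            # Flushing here also handles headings that wrap their section,
            # as in the pseudo-HTML above.
            self._flush()
            self.current = tag

    def handle_endtag(self, tag):
        if tag in TRACKED:
            self._flush()
            self.current = None

    def handle_data(self, data):
        self.buffer.append(data)


def html_to_marked_text(html: str) -> str:
    parser = MarkedTextConverter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text
    return "\n\n".join(parser.lines)


sample = """
<h1> Title: Preparing HTML </h1>
<h2> section title: First Steps
  <p> The first steps are ... </p>
  <code> print("colbertv2.0") </code>
</h2>
"""
print(html_to_marked_text(sample))
```

Tags outside the tracked set (for example `<html>` or `<body>`) are ignored and any text inside them is kept as plain lines.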

colbert-ir/colbertv2.0 is an amazing model, and I would like to know how we should prepare and transform our HTML files in a way that preserves the semantic meaning and logical positioning implied by the original HTML format, including the encoding of links. What is the best practice? What is the ideal format? And to what extent are HTML tags understood?

If it does not understand HTML, does it understand the concept of links to other documents? And if so, how should the links be represented in plain text?

To summarize: irrespective of the HTML representation, we have at least the following link scenarios:

  1. A link pointing to an external site unrelated to ColbertV2.0, for example a link to https://www.tiktok.com
  2. A link to another document that is (or will be) encoded in ColbertV2.0 but is not part of the logical progression of the document. For example, a reference to find out more about a term that appears in the document.
  3. A link to another section within the same document. This might be a menu that links to different parts of the document and scrolls up or down to the part to read.
  4. A link to another page that is still a logical progression of the same document, i.e. a link at the bottom to Next Step, or Step 2, Step 3, Step N etc.

How can these links be represented optimally in plain text (pending training of ColbertV2.0 on HTML syntax)? Every document type, including PDFs, has these same linking concepts, so I think a general textual representation is needed.
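
As a starting point for discussion, here is a rough sketch of one possible textual representation for the four scenarios above. Everything in it is an assumption: the function name `render_link`, the bracketed marker wording, the example paths, the use of a `rel="next"`/`rel="prev"` attribute to detect "logical progression" links, and the `embedded_urls` set standing in for whatever index records which documents are already encoded. I have not tested whether the model makes any use of such markers.

```python
from urllib.parse import urlparse


def render_link(text, href, embedded_urls, rel=None):
    """Render one link as a plain-text marker for one of the four scenarios.

    embedded_urls: set of hrefs known to be (or about to be) encoded.
    rel: the link's rel attribute, if any (assumed signal for scenario 4).
    """
    if href.startswith("#"):
        # Scenario 3: link to another section of the same document.
        return f"[section link: {text} -> {href}]"
    if rel in ("next", "prev"):
        # Scenario 4: link that continues the same logical document.
        return f"[continues: {text} -> {href}]"
    if href in embedded_urls:
        # Scenario 2: link to another encoded document, outside the progression.
        return f"[related document: {text} -> {href}]"
    # Scenario 1: external, unrelated site; keep only the host name.
    host = urlparse(href).netloc or href
    return f"[external link: {text} ({host})]"


# Hypothetical paths, for illustration only.
embedded = {"/docs/late-interaction", "/docs/step-2"}

print(render_link("TikTok", "https://www.tiktok.com", embedded))
print(render_link("late interaction", "/docs/late-interaction", embedded))
print(render_link("Chunking", "#chunking", embedded))
print(render_link("Next Step", "/docs/step-2", embedded, rel="next"))
```

The `rel` check comes before the `embedded_urls` check because a "Next Step" page will usually be encoded too, and scenario 4 is the more specific case.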
