Does bge-m3 understand HTML tags like <h1>, <h2>, <code>, etc.?

#56
by DracoDev - opened

I want to embed HTML pages as vectors using bge-m3, but I am not sure about the best format to use. Is there documentation on this question? In particular I am concerned with preserving the logical structure of the documents. For example, let's consider a simple document:

        <html>
          <header>
            <h1>Title: How To Prepare HTML Text for bge-m3 Embedding Model</h1>
            <p>Do you have HTML files but you're not sure how to prepare them for bge-m3 models? You have come to the right place.</p>
          </header>
          <body>
            <h2>Section title: First Steps In Preparing Your HTML Data!</h2>
            <p>The first steps are ...</p>
            <p>The second steps are ... here are some examples:</p>
            <code>print("bge-m3")</code>
            <p>more info ...</p>
            <h3>Subsection title: What About Chunking Size</h3>
            <p>more paragraphs here etc.</p>
            <h2>Section title 2: How do you encode the data into bge-m3</h2>
            <p>The initial steps are ...</p>
            <p>Now for some code:</p>
            <code>print("bge-m3")</code>
            <p>In summary ...</p>
          </body>
        </html>

What is the most understandable format to transform this HTML into? Can we leave pseudo-HTML tags like the above, or should we use some type of text-only indicators like:

           ** Title: How to Prepare HTML Text for BAAI/bge-m3 Embedding Model **

                    *** Section title: How you prepare your HTML data for the bge-m3 model ***
                           The first steps are ...

                           The second steps are ... here are some examples:  <link> to another document that is also bge-m3 embedded
                           ```
                           # hello world
                           import os
                           print("hello world")
                           ```

I would like to hear from the inventors of this great model how we should actually prepare our HTML files in a manner that preserves the semantics visible in the HTML format, and also how to encode links. What is a best practice? What is the ideal format? And to what extent are HTML tags understood?

Beijing Academy of Artificial Intelligence org

Hi @DracoDev, we haven't tried the bge-m3 model on HTML docs, so we also lack experience with it. Considering that the model hasn't been optimized for HTML, the second format might be more suitable.
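To make the second format concrete, here is a minimal sketch of such a conversion using only the standard-library `html.parser`. The marker conventions (`**` for `<h1>`, `***` for `<h2>`/`<h3>`, fences for `<code>`) are the ones from the example above, not an official bge-m3 preprocessing recipe, so treat this as a starting point to validate empirically:

```python
from html.parser import HTMLParser

class HtmlToPlainText(HTMLParser):
    """Hypothetical converter: heading/code tags -> plain-text markers."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []  # tag context, so data handlers know where they are

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "code":
            self.out.append("```\n")  # open a fence for code content

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        if tag == "code":
            self.out.append("\n```\n")  # close the fence

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        top = self.stack[-1] if self.stack else None
        if top == "h1":
            self.out.append(f"** {text} **\n")
        elif top in ("h2", "h3"):
            self.out.append(f"*** {text} ***\n")
        else:
            self.out.append(text + "\n")

    def convert(self, html):
        self.feed(html)
        return "".join(self.out)

doc = "<h1>Title</h1><p>Intro text.</p><code>print('bge-m3')</code>"
print(HtmlToPlainText().convert(doc))
```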

OK, does the model understand links to other documents (I can convert the HTML links to some reference to other document IDs)? Note: some links point to documents that are or will be embedded using bge-m3 and stored in the same vector space, while others point to external documents. Is there a recommended way to represent a link so that bge-m3 understands it? For a document of moderate size, do you recommend embedding the entire document or splitting it into separate sections? If the latter, how do we preserve the relationship to the original document? What sort of metadata needs to be stored with each embedded chunk?
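To make the splitting question concrete, here is one possible (entirely hypothetical) scheme: one chunk per section, each carrying a metadata dict that ties it back to the parent document. The field names (`doc_id`, `section_index`, `parent_title`) are illustrative, not a bge-m3 requirement:

```python
def chunk_document(doc_id, title, sections):
    """Split a document into section chunks with back-references.

    sections: list of (section_title, section_text) pairs.
    """
    chunks = []
    for i, (sec_title, sec_text) in enumerate(sections):
        chunks.append({
            # Text that actually gets embedded; prepending the document
            # title keeps some global context inside each chunk.
            "text": f"{title}\n{sec_title}\n{sec_text}",
            # Metadata stored alongside the vector, not embedded.
            "metadata": {
                "doc_id": doc_id,
                "section_index": i,
                "section_title": sec_title,
                "parent_title": title,
            },
        })
    return chunks

chunks = chunk_document(
    "doc-1",
    "How To Prepare HTML Text for bge-m3",
    [("First Steps", "The first steps are ..."),
     ("Encoding the data", "The initial steps are ...")],
)
for c in chunks:
    print(c["metadata"]["doc_id"], c["metadata"]["section_index"],
          c["metadata"]["section_title"])
```

The metadata never enters the embedding; it lives beside the vector in the store so that retrieved chunks can be traced back to, and reassembled into, the original document.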

Another type of link can of course point to locations within the same document. This is information that I would want to capture during the embedding process, whether via HTML or plain text. What is an optimal textual way to represent a link that points to another part of the same document?

So, to summarize about links, irrespective of the HTML representation we have at least the following scenarios:

1. A link pointing to an external site unrelated to bge-m3, for example a link to https://www.tiktok.com.
2. A link to another document that is (or will be) encoded with bge-m3 but is not part of the logical progression of the document, for example a reference to find out more about a term that appears in the document.
3. A link to another section within the same document. This might be a menu that links to different parts of the document and scrolls up or down to the part of the document to read.
4. A link to another page that is still a logical progression of the same document, i.e. a link at the bottom to Next Step, or Step 2, Step 3, ..., Step N.

How can these links be represented optimally in plain text (pending training of bge-m3 on HTML syntax)? Every document, including PDFs, has these same linking concepts, so I think a general textual representation is needed.
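One possible textual convention, sketched here purely as an assumption to test, is to rewrite each `<a href>` into an inline marker that names which of the scenarios above it belongs to. The marker syntax (`[same-doc link: ...]`, etc.) and the `INTERNAL_DOCS` lookup are my own inventions, not something bge-m3 was trained on:

```python
import re

# IDs of documents embedded in the same bge-m3 vector space (assumption:
# you maintain such a registry; "doc-42" is a made-up example).
INTERNAL_DOCS = {"doc-42"}

def classify(href):
    """Map an href onto the link scenarios discussed above."""
    if href.startswith("#"):
        return "same-doc"          # anchor within the same document
    doc_id = href.rstrip("/").split("/")[-1]
    if doc_id in INTERNAL_DOCS:
        return "corpus-doc"        # another document in the vector space
    return "external"              # unrelated external site

def rewrite_links(html):
    """Replace <a href="...">text</a> with 'text [kind link: href]'."""
    def repl(m):
        href, text = m.group(1), m.group(2)
        return f"{text} [{classify(href)} link: {href}]"
    return re.sub(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', repl, html)

print(rewrite_links(
    'See <a href="#chunking">the chunking section</a> and '
    '<a href="https://example.org/doc-42">doc 42</a>.'))
```

Whether the model actually exploits these markers would have to be checked empirically, e.g. via the sparse-weight inspection suggested below in this thread.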

Just run some tests and look at the sparse vector representation: by analyzing the weights the model applies to text that includes links/tags, you should be able to make some assumptions about how the model processes those elements.

About the metadata question, I suggest you take a look at the LlamaIndex documentation and its hierarchical chunking strategy.

@bobox why the sparse vector representation in particular? Also, how would I see the weights the model applies to the text under test?

@bobox I'm also interested in how well the model embeds code. Would this again be an examination of the weights the model applies in the sparse vector representation?

Given that bge-m3 has an 8192-token context length, there has been quite a bit of research into chunking vs. not chunking and storing the whole document. If memory is left out as a factor, what would be the result of both embedding the entire document as one long input and also chunking sections of the same document?
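One way to frame the comparison is by how each strategy scores at retrieval time: a whole-document embedding gives one similarity per document, while chunking typically retrieves by the best-matching section. The toy 3-dimensional vectors below are stand-ins for real bge-m3 embeddings (which are 1024-dimensional), so only the scoring pattern, not the numbers, is meaningful:

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-ins for bge-m3 embeddings.
query = [1.0, 0.0, 0.2]
whole_doc = [0.4, 0.4, 0.4]   # one vector for the full document
chunks = [[0.9, 0.1, 0.1],    # one vector per section
          [0.1, 0.9, 0.1]]

# Whole-document strategy: a single similarity score.
doc_score = cosine(query, whole_doc)
# Chunked strategy: retrieve by the best-matching section.
chunk_score = max(cosine(query, c) for c in chunks)

print(f"whole-doc: {doc_score:.3f}  best-chunk: {chunk_score:.3f}")
```

In this toy setup the query matches one section strongly, so the best-chunk score exceeds the whole-document score; with a query spread evenly across sections the ordering can reverse, which is why the question deserves empirical testing on real bge-m3 vectors.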
