Could this be used for entity matching?
Hey, cool model!
Would it be possible to use it for entity matching/named entity recognition?
Thanks! I think you would want to do additional training on top of it for NER, using some model that at least computes n-grams. Otherwise, it would just be classifying the vocabulary (32k tokens), and I'd worry that's not expressive enough. I'd bet you could do something simple like running a linear SVC over average-pooled trigrams, and it might do OK. That's probably where I'd start, at least.
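To make the suggestion concrete, here's a minimal sketch of the pooled-trigram idea: build a vector for each token by average-pooling character-trigram embeddings, then fit a linear classifier on top. The embedding table, tokens, and labels below are toy stand-ins (assumptions, not WordLlama's actual API), and a least-squares linear fit stands in for the SVC (in practice you'd swap in something like scikit-learn's `LinearSVC`).

```python
import numpy as np

DIM = 64
_table = {}

def trigrams(token):
    # Character trigrams with boundary markers, e.g. "the" -> "#th", "the", "he#".
    t = f"#{token}#"
    return [t[i:i + 3] for i in range(len(t) - 2)]

def gram_vec(gram):
    # Toy stand-in for a learned trigram embedding table (assumption):
    # a deterministic random vector per trigram, cached on first use.
    if gram not in _table:
        seed = abs(hash(gram)) % (2 ** 32)
        _table[gram] = np.random.default_rng(seed).standard_normal(DIM)
    return _table[gram]

def pool(token):
    # Average-pool the token's character-trigram vectors into one feature vector.
    return np.mean([gram_vec(g) for g in trigrams(token)], axis=0)

# Tiny toy training set: 1 = entity-like token, 0 = other.
tokens = ["London", "Paris", "Berlin", "walked", "quickly", "the"]
y = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
X = np.stack([pool(t) for t in tokens])

# Least-squares linear classifier as a stand-in for a linear SVC.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
preds = (X @ w > 0.5).astype(float)
```

With real n-gram embeddings you'd pool over a token's context window as well, but the feature-building shape stays the same.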
Thank you for this model! Do you have any tutorials on fine-tuning? Are there any other small models you plan on releasing?
Thanks for the interest! I will be working on training tutorials soon. I just released a semantic splitting feature, which took quite a bit of effort, so I should have more time this next week to circle back around and add some training docs and examples.
I may work on some other small model architectures, but probably not within the WordLlama ecosystem. I am keeping pretty tight constraints on that model to keep the platform fast and light on dependencies. I will occasionally try training additional WordLlama models, but generally I expect improvements in that model to come from more careful training data selection than anything else. I have even tried distillation, but found that without adding attention mechanisms, it could not learn the contextual representations of the teacher model and produced worse results.
Fantastic, thank you! Actually, I was referring to making new small models, as I believe in the ethos and design philosophy behind WordLlama. I'm still pretty green to this scene, but I would love to contribute if you are interested. I do have data science skills that may be of use. I look forward to seeing what comes out next!
Sounds good! At the moment, I'm mostly spending my time on developing useful algorithms for the WordLlama platform, so if you have any ideas, I'd love to hear them. I think it's "good enough" for a lot of things, and if I can improve the model over time, that's great too. I'll always take better performance on the model side, but on the platform side, it's useful right now, and I want to prove that out by helping tackle some of the hard preprocessing problems in LLM (or NLP generally) workflows, like semantic splitting.
Additionally, I am going to try to address some gaps on the data side by putting together some datasets to help round out areas of embedding training. I'm working on a "toxic pairs" dataset, which I'll upload here for use with sentence transformers and would like to add into the training data. I also think I need to circle back and do a careful analysis of each dataset I've used, to determine how well the model can learn from it. In some cases, I suspect such a simple model can't learn effectively from a dataset, and it could even be detrimental. I noticed when doing distillation from transformer models that the lack of context presented "unlearnable" representations, which caused distillation to produce slightly worse results. I suspect that datasets which are too NLI-heavy might not be very effective, but I need more time to experiment.
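For anyone curious about the shape such a pairs dataset might take: sentence-transformers training data is commonly distributed as JSONL rows of two texts plus a similarity label. The field names and example texts below are illustrative assumptions, not the actual dataset.

```python
import json

# Hypothetical rows for a "toxic pairs"-style dataset: each row holds two
# sentences and a label (1.0 = similar meaning, 0.0 = dissimilar). These
# specific texts and field names are made up for illustration.
rows = [
    {"sentence1": "You did a great job on this.",
     "sentence2": "This work is excellent.",
     "label": 1.0},
    {"sentence1": "You did a great job on this.",
     "sentence2": "This work is garbage.",
     "label": 0.0},
]

# Serialize to JSONL (one JSON object per line) and round-trip it back.
jsonl = "\n".join(json.dumps(r) for r in rows)
parsed = [json.loads(line) for line in jsonl.splitlines()]
```

Each parsed row can then be fed to a pair-based loss (e.g. a contrastive or cosine-similarity loss) during training.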
Anyway, those are the issues I'm thinking about looking into next. If any of that interests you, just let me know!