# Norwegian LLM and Embedding Models Research
## Open-Source LLMs with Norwegian Language Support
### 1. NorMistral-7b-scratch
- **Description**: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (using six repetitions of open Norwegian texts).
- **Architecture**: Based on Mistral architecture with 7 billion parameters
- **Context Length**: 2k tokens
- **Performance**:
- Perplexity on NCC validation set: 7.43
- Good performance on reading comprehension, sentiment analysis, and machine translation tasks
- **License**: Apache-2.0
- **Hugging Face**: https://huggingface.co/norallm/normistral-7b-scratch
- **Notes**: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo (see the usage sketch below)
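
A minimal usage sketch with the Transformers library, using the `norallm/normistral-7b-scratch` repository from the link above; the generation settings are illustrative, not recommendations:

```python
# Minimal sketch: load NorMistral-7b-scratch and generate Norwegian text.
# Assumes a GPU with enough memory for a 7B model in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-7b-scratch"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
    device_map="auto",           # requires the `accelerate` package
)

prompt = "Norge er et land i"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```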
### 2. Viking 7B
- **Description**: Described by its developers as the first multilingual large language model covering all the Nordic languages, including Norwegian
- **Architecture**: Similar to Llama 2, with flash attention, rotary embeddings, grouped query attention
- **Context Length**: 4k tokens
- **Performance**: Silo AI reports best-in-class performance across the Nordic languages without compromising English performance
- **License**: Apache 2.0
- **Notes**:
- Developed by Silo AI and the University of Turku's research group TurkuNLP
- Also available in larger sizes (13B and 33B parameters)
- Trained on 2 trillion tokens covering Danish, English, Finnish, Icelandic, Norwegian, Swedish, and code in several programming languages (see the loading sketch below)
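
Loading works the same way as for NorMistral above; a sketch via the high-level `pipeline` API, where the repository name `LumiOpen/Viking-7B` is an assumption that should be verified on Hugging Face:

```python
# Sketch: generate with Viking 7B via the Transformers pipeline API.
# The repository name "LumiOpen/Viking-7B" is assumed, not confirmed here.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="LumiOpen/Viking-7B",   # assumed repository name
    torch_dtype=torch.bfloat16,
    device_map="auto",            # requires the `accelerate` package
)
print(generator("Oslo er hovedstaden i", max_new_tokens=40)[0]["generated_text"])
```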
### 3. NorskGPT
- **Description**: A Norwegian large language model family developed for use in Norwegian society
- **Versions**:
- NorskGPT-Mistral: a 7B dense transformer with an 8K context window, based on Mistral 7B
- NorskGPT-LLAMA2: 7B and 13B parameter models with a 4K context length, based on Llama 2
- **License**: CC-BY-NC-SA-4.0 (non-commercial)
- **Website**: https://www.norskgpt.com/norskgpt-llm
## Embedding Models for Norwegian
### 1. NbAiLab/nb-sbert-base
- **Description**: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
- **Architecture**: Based on nb-bert-base
- **Vector Dimensions**: 768
- **Performance**:
- Cosine Similarity: Pearson 0.8275, Spearman 0.8245
- **License**: apache-2.0
- **Hugging Face**: https://huggingface.co/NbAiLab/nb-sbert-base
- **Use Cases**:
- Sentence similarity
- Semantic search
- Few-shot classification (with SetFit)
- Keyword extraction (with KeyBERT)
- Topic modeling (with BERTopic)
- **Notes**: Works well with both Norwegian and English, making it well suited to bilingual applications (see the embedding example below)
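
A short embedding example with the `sentence-transformers` library; the sentence pair is illustrative:

```python
# Sketch: Norwegian/English sentence embeddings with nb-sbert-base.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NbAiLab/nb-sbert-base")

sentences = [
    "Dette er en norsk setning.",   # "This is a Norwegian sentence."
    "This is an English sentence.",
]
embeddings = model.encode(sentences)  # shape: (2, 768)

# The model is bilingual, so cross-lingual similarity is meaningful.
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"cosine similarity: {score.item():.4f}")
```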
### 2. FFI/SimCSE-NB-BERT-large
- **Description**: A Norwegian sentence embedding model trained using the SimCSE methodology (see the pooling sketch below)
- **Hugging Face**: https://huggingface.co/FFI/SimCSE-NB-BERT-large
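
The entry above does not name a high-level wrapper, so a plain Transformers sketch follows. It assumes `[CLS]` pooling, the standard choice for SimCSE models; confirm the pooling strategy against the model card:

```python
# Sketch: sentence embeddings from SimCSE-NB-BERT-large with [CLS] pooling.
# [CLS] pooling is an assumption based on standard SimCSE practice.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "FFI/SimCSE-NB-BERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["Hunden løper i parken.", "En hund springer ute."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # Use the hidden state of the [CLS] token as the sentence embedding.
    embeddings = model(**batch).last_hidden_state[:, 0]
print(embeddings.shape)  # (2, hidden_size)
```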
## Vector Database Options for Hugging Face RAG Integration
### 1. Milvus
- **Integration**: Well-documented integration with Hugging Face for RAG pipelines (see the sketch below)
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus
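
A minimal indexing-and-search sketch with Milvus Lite via `pymilvus`; the collection name, sample documents, and local database file are illustrative, and the dimension 768 matches nb-sbert-base:

```python
# Sketch: index Norwegian embeddings in Milvus Lite and run a search.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("NbAiLab/nb-sbert-base")
client = MilvusClient("rag_demo.db")  # local Milvus Lite file

client.create_collection(collection_name="docs", dimension=768)

docs = ["Oslo er hovedstaden i Norge.", "Bergen ligger på Vestlandet."]
client.insert(
    collection_name="docs",
    data=[{"id": i, "vector": v.tolist(), "text": t}
          for i, (v, t) in enumerate(zip(encoder.encode(docs), docs))],
)

hits = client.search(
    collection_name="docs",
    data=[encoder.encode("Hva er hovedstaden i Norge?").tolist()],
    limit=1,
    output_fields=["text"],
)
print(hits[0][0]["entity"]["text"])
```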
### 2. MongoDB
- **Integration**: Can be used with Hugging Face models for RAG systems (see the sketch below)
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hugging_face_gemma_mongodb
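
A retrieval sketch using MongoDB Atlas Vector Search's `$vectorSearch` aggregation stage; the connection string, database/collection names, and the `vector_index` search index are placeholders, and the index must be created in Atlas beforehand:

```python
# Sketch: retrieval via MongoDB Atlas Vector Search ($vectorSearch stage).
# Connection string and index/field names are illustrative placeholders.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("NbAiLab/nb-sbert-base")
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
collection = client["rag_db"]["docs"]

query_vec = encoder.encode("Hva er hovedstaden i Norge?").tolist()
results = collection.aggregate([
    {"$vectorSearch": {
        "index": "vector_index",   # name of the Atlas vector index
        "path": "embedding",       # document field holding stored vectors
        "queryVector": query_vec,
        "numCandidates": 100,
        "limit": 3,
    }}
])
for doc in results:
    print(doc["text"])
```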
### 3. MyScale
- **Integration**: Supports building RAG applications with Hugging Face embedding models
- **Reference**: https://medium.com/@myscale/building-a-rag-application-in-10-min-with-claude-3-and-hugging-face-10caea4ea293
### 4. FAISS (Facebook AI Similarity Search)
- **Integration**: A lightweight vector similarity search library (rather than a full database server) that pairs well with Hugging Face embeddings
- **Notes**: Can be used with `autofaiss` for quick experimentation (see the sketch below)
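
A minimal FAISS sketch using nb-sbert-base embeddings; the sample documents and query are illustrative:

```python
# Sketch: quick similarity search over Norwegian embeddings with FAISS.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("NbAiLab/nb-sbert-base")
docs = ["Oslo er hovedstaden i Norge.", "Bergen ligger på Vestlandet."]

# Normalize so inner product equals cosine similarity.
vectors = encoder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product search
index.add(vectors)

query = encoder.encode(["norsk hovedstad"], normalize_embeddings=True)
scores, ids = index.search(query, 1)
print(docs[ids[0][0]], scores[0][0])
```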
## Hugging Face RAG Implementation Options
1. **Transformers Library**: Provides access to pre-trained models
2. **Sentence Transformers**: For text embeddings
3. **Datasets**: For managing and processing data
4. **LangChain Integration**: For advanced RAG pipelines
5. **Spaces**: For deploying and sharing the application
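
Putting these pieces together, a minimal end-to-end RAG sketch under the assumptions above: nb-sbert-base for retrieval, FAISS for the index, and NorMistral for generation, with an illustrative prompt template:

```python
# Sketch: minimal RAG pipeline -- embed, retrieve with FAISS, generate.
# Model choices and the prompt template are illustrative assumptions.
import faiss
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Embed the document collection and build the index.
encoder = SentenceTransformer("NbAiLab/nb-sbert-base")
docs = [
    "Oslo er hovedstaden i Norge og landets største by.",
    "Bergen er kjent for Bryggen og de syv fjell.",
]
doc_vecs = encoder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# 2. Retrieve the most relevant document for the question.
question = "Hva er hovedstaden i Norge?"
q_vec = encoder.encode([question], normalize_embeddings=True)
_, ids = index.search(q_vec, 1)
context = docs[ids[0][0]]

# 3. Generate an answer conditioned on the retrieved context.
model_id = "norallm/normistral-7b-scratch"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
prompt = f"Kontekst: {context}\nSpørsmål: {question}\nSvar:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```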