# Norwegian LLM and Embedding Models Research

## Open-Source LLMs with Norwegian Language Support

### 1. NorMistral-7b-scratch
- **Description**: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (using six repetitions of open Norwegian texts).
- **Architecture**: Based on Mistral architecture with 7 billion parameters
- **Context Length**: 2k tokens
- **Performance**: 
  - Perplexity on NCC validation set: 7.43
  - Good performance on reading comprehension, sentiment analysis, and machine translation tasks
- **License**: Apache-2.0
- **Hugging Face**: https://huggingface.co/norallm/normistral-7b-scratch
- **Notes**: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo

### 2. Viking 7B
- **Description**: The first multilingual large language model for all Nordic languages (including Norwegian)
- **Architecture**: Similar to Llama 2, with flash attention, rotary embeddings, grouped query attention
- **Context Length**: 4k tokens
- **Performance**: Reported by its developers as best-in-class across the Nordic languages without compromising English performance
- **License**: Apache 2.0
- **Notes**: 
  - Developed by Silo AI and University of Turku's research group TurkuNLP
  - Also available in larger sizes (13B and 33B parameters)
  - Trained on 2 trillion tokens including Danish, English, Finnish, Icelandic, Norwegian, Swedish and programming languages

### 3. NorskGPT
- **Description**: A family of Norwegian large language models developed for Norwegian-language applications
- **Versions**:
  - NorskGPT-Mistral: 7B dense transformer with 8K context window, based on Mistral 7B
  - NorskGPT-LLAMA2: 7B and 13B parameter models with 4K context length, based on LLaMA 2
- **License**: cc-by-nc-sa-4.0 (non-commercial)
- **Website**: https://www.norskgpt.com/norskgpt-llm

## Embedding Models for Norwegian

### 1. NbAiLab/nb-sbert-base
- **Description**: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
- **Architecture**: Based on nb-bert-base
- **Vector Dimensions**: 768
- **Performance**: 
  - Cosine Similarity: Pearson 0.8275, Spearman 0.8245
- **License**: apache-2.0
- **Hugging Face**: https://huggingface.co/NbAiLab/nb-sbert-base
- **Use Cases**:
  - Sentence similarity
  - Semantic search
  - Few-shot classification (with SetFit)
  - Keyword extraction (with KeyBERT)
  - Topic modeling (with BERTopic)
- **Notes**: Works well with both Norwegian and English, making it ideal for bilingual applications

### 2. FFI/SimCSE-NB-BERT-large
- **Description**: A Norwegian sentence embedding model trained using the SimCSE methodology
- **Hugging Face**: https://huggingface.co/FFI/SimCSE-NB-BERT-large

## Vector Database Options for Hugging Face RAG Integration

### 1. Milvus
- **Integration**: Well-documented integration with Hugging Face for RAG pipelines
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus

### 2. MongoDB
- **Integration**: MongoDB Atlas Vector Search can serve as the vector store for RAG systems built on Hugging Face models
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hugging_face_gemma_mongodb

### 3. MyScale
- **Integration**: Supports building RAG applications with Hugging Face embedding models
- **Reference**: https://medium.com/@myscale/building-a-rag-application-in-10-min-with-claude-3-and-hugging-face-10caea4ea293

### 4. FAISS (Facebook AI Similarity Search)
- **Integration**: Lightweight similarity-search library (not a standalone database) that works well with Hugging Face; the `datasets` library can attach an index directly via `Dataset.add_faiss_index`
- **Notes**: Can be paired with `autofaiss` for quick experimentation

## Hugging Face RAG Implementation Options

1. **Transformers Library**: Provides access to pre-trained models
2. **Sentence Transformers**: For text embeddings
3. **Datasets**: For managing and processing data
4. **LangChain Integration**: For advanced RAG pipelines
5. **Spaces**: For deploying and sharing the application
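The retrieval half of such a pipeline can be sketched as below. The `embed` function is a toy, deterministic stand-in (character-bigram hashing) so the example runs offline; in a real pipeline it would be replaced by an actual model such as NbAiLab/nb-sbert-base via `SentenceTransformer.encode`, and `retrieve` is likewise an illustrative name.

```python
# Toy retrieval step for a RAG pipeline. `embed` is a deterministic
# stand-in for a real embedding model so the example runs offline.
import numpy as np

def embed(texts, dim=64):
    """Hash character bigrams into fixed-size, L2-normalized vectors."""
    out = np.zeros((len(texts), dim), dtype="float32")
    for i, text in enumerate(texts):
        for a, b in zip(text, text[1:]):
            out[i, (ord(a) * 31 + ord(b)) % dim] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-9)

documents = [
    "Oslo er hovedstaden i Norge.",
    "Bergen ligger på Vestlandet.",
    "FAISS indekserer vektorer for likhetssøk.",
]
doc_vectors = embed(documents)

def retrieve(query, k=1):
    """Return the k documents most similar to the query (cosine)."""
    scores = doc_vectors @ embed([query])[0]
    return [documents[i] for i in np.argsort(-scores)[:k]]

# The retrieved passage would then be placed into the LLM prompt.
print(retrieve("hovedstaden i Norge"))
```

In the full pipeline, the retrieved passages are concatenated into the prompt of a generator model such as NorMistral-7b or Viking 7B.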