nielsr HF Staff committed on
Commit
09151f8
·
verified ·
1 Parent(s): b6aa528

Improve model card: add metadata, paper link, and project resources


Hi! I'm Niels from the Hugging Face community team.

This PR improves the model card for DARE by:
- Adding `library_name: sentence-transformers` to the metadata to enable better integration and automated code snippets.
- Ensuring the `pipeline_tag` is set to `feature-extraction`.
- Adding links to the original paper ([2603.04743](https://huggingface.co/papers/2603.04743)), the GitHub repository, and the project page.
- Including the official BibTeX citation for researchers.
- Standardizing the usage examples for better readability.

These changes help make the model more discoverable and easier for the community to use and cite.
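The metadata additions listed above can be sanity-checked locally before pushing. The following is a minimal sketch, not part of this PR: `parse_front_matter`, `missing_keys`, `REQUIRED_KEYS`, and `SAMPLE_README` are hypothetical names, and the parser only handles the flat `key: value` pairs and `- item` lists that appear in this model card, not general YAML.

```python
import re

# Keys this PR adds or keeps in the front matter (assumed checklist, for illustration only).
REQUIRED_KEYS = ("base_model", "license", "library_name", "pipeline_tag")


def parse_front_matter(readme_text):
    """Parse flat `key: value` pairs and `- item` lists from a leading `---` block."""
    match = re.match(r"^---\s*\n(.*?)\n---\s*(?:\n|$)", readme_text, flags=re.DOTALL)
    if match is None:
        return {}
    metadata = {}
    current_key = None
    for raw_line in match.group(1).splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("- "):
            # A list item belongs to the most recent key that had no inline value.
            if isinstance(metadata.get(current_key), list):
                metadata[current_key].append(line[2:].strip())
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        current_key = key.strip()
        value = value.strip()
        metadata[current_key] = value if value else []
    return metadata


def missing_keys(metadata, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the parsed metadata."""
    return [key for key in required if key not in metadata]


# Front matter mirroring the updated README in this PR (tag list shortened).
SAMPLE_README = """---
base_model: sentence-transformers/all-MiniLM-L6-v2
language: en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
---

# DARE
"""

metadata = parse_front_matter(SAMPLE_README)
print("Missing keys:", missing_keys(metadata))
```

For real model cards with nested or quoted YAML, a full YAML parser would be the safer choice; the hand-rolled parser here only keeps the sketch dependency-free.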

Files changed (1)
  1. README.md +33 -14
README.md CHANGED
@@ -1,5 +1,9 @@
1
  ---
 
2
  language: en
 
 
 
3
  tags:
4
  - sentence-transformers
5
  - feature-extraction
@@ -8,26 +12,25 @@ tags:
8
  - tool-use
9
  - llm-agent
10
  - r-language
11
- license: apache-2.0
12
- base_model: sentence-transformers/all-MiniLM-L6-v2
13
  ---
14
 
15
- ![Gemini_Generated_Image_h25dizh25dizh25d (3)](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)
16
 
17
  DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model designed to retrieve statistical and data analysis tools (R functions) conditioned on **both the user query and the data profile**.
18
 
19
  It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.
20
 
 
 
 
 
21
  ## Model Details
22
  - **Architecture:** Bi-encoder (Sentence Transformer)
23
  - **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
24
  - **Task:** Dense Retrieval for Tool-Augmented LLMs
25
- - **Performance**: SoTA on R package retrieval tasks.
26
  - **Domain:** R programming language, Data Science, Statistical Analysis functions
27
 
28
- <!-- ## 💡 Why DARE? (The Input Formatting)
29
- Unlike traditional semantic search models that only take a natural language query, DARE is trained to be **distribution-conditional**. It expects a concatenated input of the user's intent AND the data profile (e.g., high-dimensional, sparse, categorical). -->
30
-
31
  ### Usage (Sentence-Transformers)
32
 
33
  First, install the `sentence-transformers` library:
@@ -35,18 +38,19 @@ First, install the `sentence-transformers` library:
35
  pip install -U sentence-transformers
36
  ```
37
 
38
- ### Usage by our RPKB (Optional and Recommended)
39
- Download the [R Package Knowledge Base(RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB)
40
 
41
  ```python
42
  from huggingface_hub import snapshot_download
43
  import chromadb
 
44
 
45
  # 1. Download the database folder from Hugging Face
46
  db_path = snapshot_download(
47
  repo_id="Stephen-SMJ/RPKB",
48
  repo_type="dataset",
49
- allow_patterns="RPKB/*" # Adjust this if your folder name is different
50
  )
51
 
52
  # 2. Connect to the local ChromaDB instance
@@ -58,10 +62,9 @@ collection = client.get_collection(name="inference")
58
  print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
59
  ```
60
 
61
- ### Then, you can load the DARE model do retrieval:
62
  ```python
63
  from sentence_transformers import SentenceTransformer
64
- from sentence_transformers.util import cos_sim
65
 
66
  # 1. Load the DARE model
67
  model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")
@@ -72,9 +75,9 @@ in the data. Please set the random seed to 123 at the start. I need to filter fo
72
  first value of the estimated scores (est_a) for the very first region identified."
73
 
74
  # 3. Generate embedding
75
- query_embedding = model.encode(user_query).tolist()
76
 
77
- # 4. Search in the database with Hard Filters
78
  results = collection.query(
79
  query_embeddings=[query_embedding],
80
  n_results=3,
@@ -83,4 +86,20 @@ results = collection.query(
83
 
84
  # Display Top-1 Result
85
  print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
86
  ```
 
1
  ---
2
+ base_model: sentence-transformers/all-MiniLM-L6-v2
3
  language: en
4
+ license: apache-2.0
5
+ library_name: sentence-transformers
6
+ pipeline_tag: feature-extraction
7
  tags:
8
  - sentence-transformers
9
  - feature-extraction
 
12
  - tool-use
13
  - llm-agent
14
  - r-language
 
 
15
  ---
16
 
17
+ ![DARE Banner](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)
18
 
19
  DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model designed to retrieve statistical and data analysis tools (R functions) conditioned on **both the user query and the data profile**.
20
 
21
  It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.
22
 
23
+ - **Paper:** [DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval](https://huggingface.co/papers/2603.04743)
24
+ - **Repository:** [GitHub](https://github.com/AMA-CMFAI/DARE)
25
+ - **Project Page:** [DARE Webpage](https://ama-cmfai.github.io/DARE_webpage/)
26
+
27
  ## Model Details
28
  - **Architecture:** Bi-encoder (Sentence Transformer)
29
  - **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
30
  - **Task:** Dense Retrieval for Tool-Augmented LLMs
31
+ - **Performance**: SoTA on R package retrieval tasks (93.47% NDCG@10).
32
  - **Domain:** R programming language, Data Science, Statistical Analysis functions
33
 
 
 
 
34
  ### Usage (Sentence-Transformers)
35
 
36
  First, install the `sentence-transformers` library:
 
38
  pip install -U sentence-transformers
39
  ```
40
 
41
+ ### Usage with RPKB (Recommended)
42
+ Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB) to perform conditional retrieval.
43
 
44
  ```python
45
  from huggingface_hub import snapshot_download
46
  import chromadb
47
+ import os
48
 
49
  # 1. Download the database folder from Hugging Face
50
  db_path = snapshot_download(
51
  repo_id="Stephen-SMJ/RPKB",
52
  repo_type="dataset",
53
+ allow_patterns="RPKB/*"
54
  )
55
 
56
  # 2. Connect to the local ChromaDB instance
 
62
  print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
63
  ```
64
 
65
+ ### Retrieval with DARE
66
  ```python
67
  from sentence_transformers import SentenceTransformer
 
68
 
69
  # 1. Load the DARE model
70
  model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")
 
75
  first value of the estimated scores (est_a) for the very first region identified."
76
 
77
  # 3. Generate embedding
78
+ query_embedding = model.encode(query).tolist()
79
 
80
+ # 4. Search in the database
81
  results = collection.query(
82
  query_embeddings=[query_embedding],
83
  n_results=3,
 
86
 
87
  # Display Top-1 Result
88
  print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
89
+ ```
90
+
91
+ ## Citation
92
+
93
+ If you find DARE, RPKB, or RCodingAgent useful in your research, please cite:
94
+
95
+ ```bibtex
96
+ @article{sun2026dare,
97
+ title={DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval},
98
+ author={Maojun Sun and Yue Wu and Yifei Xie and Ruijian Han and Binyan Jiang and Defeng Sun and Yancheng Yuan and Jian Huang},
99
+ year={2026},
100
+ eprint={2603.04743},
101
+ archivePrefix={arXiv},
102
+ primaryClass={cs.IR},
103
+ url={https://arxiv.org/abs/2603.04743},
104
+ }
105
  ```
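The updated model card reports 93.47% NDCG@10 without defining the metric. As a rough illustration, the sketch below computes NDCG@k for a single ranked list of relevance labels. It assumes linear gain, a log2 rank discount, and that the list covers every candidate for the query; the paper's exact evaluation protocol is not shown in this PR and may differ. The names `dcg_at_k` and `ndcg_at_k` are hypothetical helpers, not functions from the DARE repository.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking.

    `relevances` holds the relevance label of each item in ranked order
    (e.g. 1 for the correct R function, 0 otherwise).
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # No relevant item exists for this query.
    return dcg_at_k(relevances, k) / ideal_dcg


# Toy example: the single correct function is ranked second out of three results.
score = ndcg_at_k([0, 1, 0], k=10)
print(f"NDCG@10 = {score:.4f}")
```

A benchmark-level figure would typically be the mean of this per-query score over all evaluation queries.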