Committed by Canstralian
Commit 0b8afaf
1 Parent(s): 4f584bf
Update README.md
README.md
CHANGED
@@ -8,4 +8,88 @@ pinned: false
license: mit
---

RedPajama Dataset API

A FastAPI-based Application for Exploring the RedPajama-Data-1T Dataset

Overview

This application provides an HTTP API for the RedPajama-Data-1T dataset. Built with FastAPI, it lets users retrieve chunks of data, search by keyword, and view a summary of the dataset's splits. It is intended for researchers and developers who work with large-scale language-model datasets.

Features
1. Retrieve Dataset Chunks
Fetch smaller, manageable subsets of the dataset to explore or preprocess.
2. Search Data
Search for specific keywords in the dataset and retrieve relevant results.
3. Dataset Summary
Get an overview of the dataset’s structure, including available splits.
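
The three features can be pictured as small operations over an iterable of records. The following is a minimal sketch under stated assumptions, not the application's actual code: the function names `take_chunk`, `search_records`, and `summarize_splits` are hypothetical, records are assumed to be dictionaries with a `text` field, and the case-insensitive match is an illustrative choice.

```python
from itertools import islice
from typing import Dict, Iterable, List

Record = Dict[str, str]


def take_chunk(records: Iterable[Record], chunk_size: int = 10) -> List[Record]:
    """Return at most `chunk_size` records without consuming the rest."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return list(islice(records, chunk_size))


def search_records(
    records: Iterable[Record], keyword: str, max_results: int = 10
) -> List[Record]:
    """Return up to `max_results` records whose text contains `keyword`.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    needle = keyword.lower()
    matches = (r for r in records if needle in r.get("text", "").lower())
    return list(islice(matches, max_results))


def summarize_splits(splits: Dict[str, List[Record]]) -> Dict[str, int]:
    """Map each split name to the number of records it holds."""
    return {name: len(rows) for name, rows in splits.items()}


# Hypothetical stand-in data; the real dataset is far too large to inline.
SAMPLE = [
    {"text": "An example sentence."},
    {"text": "Nothing to see here."},
    {"text": "Another EXAMPLE, in capitals."},
]
```

Because both helpers rely on `islice`, they stop reading as soon as enough records are collected, which matters when the source is a streamed dataset rather than an in-memory list.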

Endpoints

| Endpoint | Method | Parameters | Description |
|---|---|---|---|
| / | GET | None | Displays a welcome message. |
| /get_data/ | GET | chunk_size (int, default: 10) | Fetches a subset of the dataset. |
| /search_data/ | GET | keyword (str, required), max_results (int, default: 10) | Searches for entries containing the given keyword. |
| /data_summary/ | GET | None | Displays a summary of the dataset. |
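
To make the parameter rules in the endpoint list concrete, here is a rough sketch of how raw query-string values could be resolved into typed values with defaults. FastAPI normally derives this behavior from each handler's signature, so the `ENDPOINT_PARAMS` mapping and the `resolve_params` helper below are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, Mapping, Tuple

# Sentinel marking a parameter the caller must supply.
REQUIRED = object()

# Parameter specs transcribed from the endpoint list: name -> (type, default).
ENDPOINT_PARAMS: Dict[str, Dict[str, Tuple[type, Any]]] = {
    "/": {},
    "/get_data/": {"chunk_size": (int, 10)},
    "/search_data/": {"keyword": (str, REQUIRED), "max_results": (int, 10)},
    "/data_summary/": {},
}


def resolve_params(endpoint: str, query: Mapping[str, str]) -> Dict[str, Any]:
    """Convert raw query-string values to typed values, filling in defaults."""
    resolved: Dict[str, Any] = {}
    for name, (cast, default) in ENDPOINT_PARAMS[endpoint].items():
        if name in query:
            resolved[name] = cast(query[name])  # e.g. "3" -> 3
        elif default is REQUIRED:
            raise ValueError(f"missing required parameter: {name}")
        else:
            resolved[name] = default
    return resolved
```

For instance, a request to /search_data/ without `keyword` is rejected, while a request to /get_data/ with no query string falls back to `chunk_size=10`.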

Getting Started

Prerequisites
• Python 3.8+
• pip for dependency management

Setup
1. Clone the repository:

git clone https://huggingface.co/spaces/Canstralian/DockerTester
cd DockerTester

2. Install dependencies:

pip install -r requirements.txt

3. Run the application:

uvicorn app:app --host 0.0.0.0 --port 8000

4. Access the API in your browser or using tools like Postman at:

http://127.0.0.1:8000

Example Usage
1. Retrieve a Small Chunk of Data
Fetch 5 examples from the dataset:

curl "http://127.0.0.1:8000/get_data/?chunk_size=5"

2. Search the Dataset
Search for the keyword "example" and return up to 3 results:

curl "http://127.0.0.1:8000/search_data/?keyword=example&max_results=3"

3. View Dataset Summary
Get an overview of available splits:

curl "http://127.0.0.1:8000/data_summary/"
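
The same requests can be issued from Python using only the standard library. This is a sketch rather than an official client: `build_url` and `fetch_json` are hypothetical helper names, and decoding the body as JSON assumes the API returns JSON, which the README does not state explicitly.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # address used in the curl examples


def build_url(path: str, **params) -> str:
    """Join the base URL, an endpoint path, and URL-encoded query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def fetch_json(path: str, **params):
    """GET an endpoint and decode the body (assumes a JSON response)."""
    with urlopen(build_url(path, **params), timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running locally, `fetch_json("/search_data/", keyword="example", max_results=3)` mirrors the second curl command, and `fetch_json("/data_summary/")` mirrors the third.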

Technologies Used
• FastAPI: For building the API.
• Hugging Face Datasets: To access and process the RedPajama-Data-1T dataset.
• Uvicorn: For running the ASGI server.
• Python: Backend language.

Future Enhancements
• Add support for advanced filtering (e.g., by metadata or specific fields).
• Implement user authentication for restricted dataset access.
• Add visualization endpoints for dataset insights.

License

This project is released under the MIT License, matching the license field in this README's front matter. Refer to the LICENSE file for more details.

Feel free to reach out for questions, feature requests, or contributions!