Canstralian committed on
Commit
0b8afaf
1 Parent(s): 4f584bf

Update README.md

  1. README.md +85 -1
README.md CHANGED
@@ -8,4 +8,88 @@ pinned: false
8
  license: mit
9
  ---
10
 
11
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
11
+ RedPajama Dataset API
12
+
13
+ A FastAPI-based Application for Exploring the RedPajama-Data-1T Dataset
14
+
15
+ Overview
16
+
17
+ This application provides an API for exploring the RedPajama-Data-1T dataset. Built with FastAPI, it lets users retrieve data chunks, run keyword searches, and view dataset summaries. It is intended for researchers and developers working with large-scale language model datasets.
18
+
19
+ Features
20
+ 1. Retrieve Dataset Chunks
21
+ Fetch smaller, manageable subsets of the dataset to explore or preprocess.
22
+ 2. Search Data
23
+ Search for specific keywords in the dataset and retrieve relevant results.
24
+ 3. Dataset Summary
25
+ Get an overview of the dataset’s structure, including available splits.
26
+
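The commit does not include the application code, so the following is only a sketch of how the chunk and search features described above could work. It assumes each record is a dictionary with a `text` field and uses a small in-memory list in place of the real dataset; the names `take_chunk`, `search_records`, and `SAMPLE` are hypothetical and do not come from the repository.

```python
from itertools import islice
from typing import Dict, Iterable, List

Record = Dict[str, str]


def take_chunk(records: Iterable[Record], chunk_size: int = 10) -> List[Record]:
    """Return at most chunk_size records without consuming the rest of the stream."""
    if chunk_size < 0:
        raise ValueError("chunk_size must be non-negative")
    return list(islice(records, chunk_size))


def search_records(
    records: Iterable[Record], keyword: str, max_results: int = 10
) -> List[Record]:
    """Return up to max_results records whose 'text' contains keyword (case-insensitive)."""
    needle = keyword.lower()
    # Lazy generator: scanning stops as soon as max_results matches are found.
    matches = (r for r in records if needle in r.get("text", "").lower())
    return list(islice(matches, max_results))


# Tiny stand-in for the dataset (assumption: the real records expose a 'text' field).
SAMPLE: List[Record] = [
    {"text": "An example sentence."},
    {"text": "Nothing to see here."},
    {"text": "Another EXAMPLE entry."},
]
```

Because both helpers rely on `islice`, they read only as many records as they need, which matters when the source is a very large or streamed dataset.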
27
+ Endpoints
28
+
29
+ Endpoint | Method | Parameters | Description
30
+ / | GET | None | Displays a welcome message.
31
+ /get_data/ | GET | chunk_size (int, default: 10) | Fetches a subset of the dataset.
32
+ /search_data/ | GET | keyword (str, required), max_results (int, default: 10) | Searches for entries containing the given keyword.
33
+ /data_summary/ | GET | None | Displays a summary of the dataset.
34
+
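To make the parameter contract in the endpoint table concrete, here is a small sketch that applies the same defaults and the required `keyword` rule to a raw query string. FastAPI normally handles this validation itself; the helper names `parse_get_data_params` and `parse_search_params` are hypothetical and only illustrate the documented defaults.

```python
from typing import Dict, Union
from urllib.parse import parse_qs


def parse_get_data_params(query: str) -> Dict[str, int]:
    """Parse /get_data/ parameters; chunk_size falls back to the documented default of 10."""
    params = parse_qs(query)
    return {"chunk_size": int(params.get("chunk_size", ["10"])[0])}


def parse_search_params(query: str) -> Dict[str, Union[str, int]]:
    """Parse /search_data/ parameters; keyword is required, max_results defaults to 10."""
    params = parse_qs(query)
    if "keyword" not in params:
        raise ValueError("keyword is required")
    return {
        "keyword": params["keyword"][0],
        "max_results": int(params.get("max_results", ["10"])[0]),
    }
```

For example, `parse_search_params("keyword=example")` fills in `max_results` with 10, while a query string without `keyword` raises `ValueError`.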
35
+ Getting Started
36
+
37
+ Prerequisites
38
+    •   Python 3.8+
39
+    •   Pip for dependency management
40
+
41
+ Setup
42
+ 1. Clone the repository:
43
+
44
+ git clone https://huggingface.co/spaces/Canstralian/DockerTester
45
+ cd DockerTester
46
+
47
+
48
+ 2. Install dependencies:
49
+
50
+ pip install -r requirements.txt
51
+
52
+
53
+ 3. Run the application:
54
+
55
+ uvicorn app:app --host 0.0.0.0 --port 8000
56
+
57
+
58
+ 4. Access the API in your browser or using tools like Postman at:
59
+
60
+ http://127.0.0.1:8000
61
+
62
+ Example Usage
63
+ 1. Retrieve a Small Chunk of Data
64
+ Fetch 5 examples from the dataset:
65
+
66
+ curl "http://127.0.0.1:8000/get_data/?chunk_size=5"
67
+
68
+
69
+ 2. Search the Dataset
70
+ Search for the keyword "example" and return up to 3 results:
71
+
72
+ curl "http://127.0.0.1:8000/search_data/?keyword=example&max_results=3"
73
+
74
+
75
+ 3. View Dataset Summary
76
+ Get an overview of available splits:
77
+
78
+ curl "http://127.0.0.1:8000/data_summary/"
79
+
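The three curl calls above can also be issued from Python using only the standard library. This is a sketch: `build_url` and `fetch_json` are hypothetical helper names, and `fetch_json` assumes the API responds with JSON, which the README does not state explicitly.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # local address from the setup steps


def build_url(path: str, **params) -> str:
    """Join the base URL, an endpoint path, and optional query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the body as JSON (assumption: the API returns JSON)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# The same requests as the curl examples, expressed as URLs.
CHUNK_URL = build_url("/get_data/", chunk_size=5)
SEARCH_URL = build_url("/search_data/", keyword="example", max_results=3)
SUMMARY_URL = build_url("/data_summary/")
```

With the server running locally, calling `fetch_json(CHUNK_URL)` would perform the first request. Using `urlencode` also escapes keywords that contain spaces or other reserved characters.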
80
+ Technologies Used
81
+    •   FastAPI: For building the API.
82
+    •   Hugging Face Datasets: To access and process the RedPajama-Data-1T dataset.
83
+    •   Uvicorn: For running the ASGI server.
84
+    •   Python: Backend language.
85
+
86
+ Future Enhancements
87
+    •   Add support for advanced filtering (e.g., by metadata or specific fields).
88
+    •   Implement user authentication for restricted dataset access.
89
+    •   Add visualization endpoints for dataset insights.
90
+
91
+ License
92
+
93
+ This project is released under the MIT License, matching the license: mit field in the README front matter. Refer to the LICENSE file for more details.
94
+
95
+ Feel free to reach out for questions, feature requests, or contributions!