---
title: DockerTester
emoji: 📚
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
license: mit
---

RedPajama Dataset API

A FastAPI-based Application for Exploring the RedPajama-Data-1T Dataset

Overview

This application exposes a small HTTP API for the RedPajama-Data-1T dataset. Built with FastAPI, it lets users retrieve data chunks, search by keyword, and view a summary of the dataset. It is aimed at researchers and developers working with large-scale language model datasets.

Features
	1.	Retrieve Dataset Chunks
Fetch smaller, manageable subsets of the dataset to explore or preprocess.
	2.	Search Data
Search for specific keywords in the dataset and retrieve relevant results.
	3.	Dataset Summary
Get an overview of the dataset’s structure, including available splits.

Endpoints

| Endpoint | Method | Parameters | Description |
| --- | --- | --- | --- |
| / | GET | None | Displays a welcome message. |
| /get_data/ | GET | chunk_size (int, default: 10) | Fetches a subset of the dataset. |
| /search_data/ | GET | keyword (str, required), max_results (int, default: 10) | Searches for entries containing the given keyword. |
| /data_summary/ | GET | None | Displays a summary of the dataset. |
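
The chunking and search behaviour described for /get_data/ and /search_data/ can be sketched in plain Python. This is a minimal illustration, not the code in app.py: the function names, the "text" field, the sample records, and the case-insensitive matching are all assumptions made for the example. Only the parameter names and defaults (chunk_size, keyword, max_results) come from the table above.

```python
from itertools import islice
from typing import Dict, Iterable, List

Record = Dict[str, str]


def take_chunk(records: Iterable[Record], chunk_size: int = 10) -> List[Record]:
    """Return the first `chunk_size` records without consuming the rest."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    return list(islice(records, chunk_size))


def search_records(
    records: Iterable[Record], keyword: str, max_results: int = 10
) -> List[Record]:
    """Return up to `max_results` records whose text contains `keyword`."""
    if not keyword:
        raise ValueError("keyword is required")
    needle = keyword.lower()  # assumption: matching ignores case
    # Lazy generator, so scanning stops once enough matches are found.
    matches = (r for r in records if needle in r.get("text", "").lower())
    return list(islice(matches, max_results))


# Hypothetical stand-in for dataset rows; the real field names may differ.
SAMPLE = [
    {"text": "An example sentence about llamas."},
    {"text": "Nothing relevant here."},
    {"text": "Another EXAMPLE, in upper case."},
]

if __name__ == "__main__":
    print(take_chunk(SAMPLE, chunk_size=2))
    print(search_records(SAMPLE, keyword="example", max_results=3))
```

Using islice over an iterable means the same functions would also work on a streamed dataset, where loading every record up front is not practical.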

Getting Started

Prerequisites
   •   Python 3.8+
   •   pip for dependency management

Setup
	1.	Clone the repository:

git clone https://huggingface.co/spaces/Canstralian/DockerTester
cd DockerTester


	2.	Install dependencies:

pip install -r requirements.txt


	3.	Run the application:

uvicorn app:app --host 0.0.0.0 --port 8000


	4.	Access the API in your browser or using tools like Postman at:

http://127.0.0.1:8000

Example Usage
	1.	Retrieve a Small Chunk of Data
Fetch 5 examples from the dataset:

curl "http://127.0.0.1:8000/get_data/?chunk_size=5"


	2.	Search the Dataset
Search for the keyword "example" and return up to 3 results:

curl "http://127.0.0.1:8000/search_data/?keyword=example&max_results=3"


	3.	View Dataset Summary
Get an overview of available splits:

curl "http://127.0.0.1:8000/data_summary/"
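
The same three requests can be issued from Python using only the standard library. This is a sketch under assumptions: the helper names build_url and get_json are hypothetical, and get_json assumes the endpoints return JSON, which the README does not state explicitly.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"  # address from the setup steps


def build_url(path: str, **params) -> str:
    """Join the base URL, an endpoint path, and optional query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path: str, **params):
    """Send a GET request and decode the body (assumed to be JSON)."""
    with urlopen(build_url(path, **params), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the URLs that correspond to the three curl commands.
    print(build_url("/get_data/", chunk_size=5))
    print(build_url("/search_data/", keyword="example", max_results=3))
    print(build_url("/data_summary/"))
```

With the server running, a call such as get_json("/get_data/", chunk_size=5) would send the first request and return the decoded response.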

Technologies Used
   •   FastAPI: For building the API.
   •   Hugging Face Datasets: To access and process the RedPajama-Data-1T dataset.
   •   Uvicorn: For running the ASGI server.
   •   Python: Backend language.

Future Enhancements
   •   Add support for advanced filtering (e.g., by metadata or specific fields).
   •   Implement user authentication for restricted dataset access.
   •   Add visualization endpoints for dataset insights.

License

This project uses the Apache 2.0 License. Refer to the LICENSE file for more details.

Feel free to reach out for questions, feature requests, or contributions!