---
title: DockerTester
emoji: 📚
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# RedPajama Dataset API

A FastAPI-based Application for Exploring the RedPajama-Data-1T Dataset

## Overview

This application provides an intuitive API to interact with the RedPajama-Data-1T dataset. Built using FastAPI, it allows users to retrieve data chunks, perform searches, and view dataset summaries with ease. Ideal for researchers and developers working on large-scale language model datasets.

## Features

### 1. Retrieve Dataset Chunks

Fetch smaller, manageable subsets of the dataset to explore or preprocess.

### 2. Search Data

Search for specific keywords in the dataset and retrieve relevant results.

### 3. Dataset Summary

Get an overview of the dataset’s structure, including available splits.

## Endpoints

| Endpoint | Method | Parameters | Description |
| --- | --- | --- | --- |
| `/` | GET | None | Displays a welcome message. |
| `/get_data/` | GET | `chunk_size` (int, default: 10) | Fetches a subset of the dataset. |
| `/search_data/` | GET | `keyword` (str, required), `max_results` (int, default: 10) | Searches for entries containing the given keyword. |
| `/data_summary/` | GET | None | Displays a summary of the dataset. |

## Getting Started

### Prerequisites

- Python 3.8+
- Pip for dependency management

### Setup

1. Clone the repository:

       git clone https://huggingface.co/spaces/Canstralian/DockerTester
       cd DockerTester

2. Install dependencies:

       pip install -r requirements.txt

3. Run the application:

       uvicorn app:app --host 0.0.0.0 --port 8000

4. Access the API in your browser or using tools like Postman at: http://127.0.0.1:8000

## Example Usage

### 1. Retrieve a Small Chunk of Data

Fetch 5 examples from the dataset:

    curl "http://127.0.0.1:8000/get_data/?chunk_size=5"

### 2. Search the Dataset

Search for the keyword `example` and return up to 3 results:

    curl "http://127.0.0.1:8000/search_data/?keyword=example&max_results=3"

### 3. View Dataset Summary

Get an overview of available splits:

    curl "http://127.0.0.1:8000/data_summary/"

## Technologies Used

- FastAPI: For building the API.
- Hugging Face Datasets: To access and process the RedPajama-Data-1T dataset.
- Uvicorn: For running the ASGI server.
- Python: Backend language.

## Future Enhancements

- Add support for advanced filtering (e.g., by metadata or specific fields).
- Implement user authentication for restricted dataset access.
- Add visualization endpoints for dataset insights.

## License

This project uses the Apache 2.0 License. Refer to the LICENSE file for more details.

Feel free to reach out for questions, feature requests, or contributions!
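
The curl examples above can also be issued from Python. The following is a minimal sketch using only the standard library, not part of this project's code: the helper names (`build_url`, `call_api`, `get_data`, `search_data`, `data_summary`) are hypothetical, the base URL is the local address from the setup steps, and it assumes the endpoints return JSON bodies, which this README does not confirm.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Local address from the setup steps; change it if the server runs elsewhere.
BASE_URL = "http://127.0.0.1:8000"


def build_url(path, **params):
    """Join the base URL, an endpoint path, and optional query parameters."""
    query = urlencode(params)  # percent-encodes values, e.g. spaces become "+"
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def call_api(path, **params):
    """Send a GET request and decode the body (assumption: the API returns JSON)."""
    with urlopen(build_url(path, **params), timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def get_data(chunk_size=10):
    """Fetch a subset of the dataset via /get_data/."""
    return call_api("/get_data/", chunk_size=chunk_size)


def search_data(keyword, max_results=10):
    """Search for entries containing the keyword via /search_data/."""
    return call_api("/search_data/", keyword=keyword, max_results=max_results)


def data_summary():
    """Fetch the dataset summary via /data_summary/."""
    return call_api("/data_summary/")
```

With the server running, `get_data(chunk_size=5)` and `search_data("example", max_results=3)` send the same requests as the first two curl commands. Building the query string with `urlencode` keeps keywords containing spaces or special characters safe to pass.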