Dataset Resources
Find existing datasets to jumpstart your fine-tuning project
Kaggle Datasets
World's largest data science community
Kaggle hosts 50,000+ public datasets, many of them suitable for LLM fine-tuning. Before creating your own dataset from scratch, check whether the community has already curated training data for your use case.
Popular Dataset Types for Fine-Tuning
- Instruction datasets - Alpaca, ShareGPT format collections
- Code datasets - GitHub code, Stack Overflow Q&A
- Domain-specific - Medical, legal, scientific papers
- Conversational - Chat logs, support tickets
- Multilingual - Non-English training data
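For reference, a single record in the widely used Alpaca instruction format looks like the sketch below. The field names follow the original Alpaca release; the example text itself is invented for illustration.

```python
import json

# One record in Alpaca instruction format: an instruction, optional
# input context, and the target output the model should learn to produce.
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Kaggle hosts tens of thousands of public datasets across domains.",
    "output": "Kaggle offers a large catalog of public datasets.",
}

# Instruction datasets are usually stored as JSONL: one JSON object per line.
line = json.dumps(record)
print(line)
```

ShareGPT-style collections differ mainly in shape: instead of a single instruction/output pair, each record holds a list of alternating conversation turns.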
Using Kaggle Datasets with EdukaAI
Search Kaggle
Look for datasets in JSONL, CSV, or JSON format, and check the license (permissive licenses such as CC0 or MIT are preferred).
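As a quick sanity check before committing to a dataset, you can sniff whether a downloaded file is JSON, JSONL, or CSV. A minimal stdlib sketch (the `sniff_format` helper is illustrative, not a Kaggle or EdukaAI API):

```python
import csv
import json

def sniff_format(path):
    """Guess whether a file is JSON, JSONL, or CSV (illustrative helper)."""
    with open(path, encoding="utf-8") as f:
        first_line = f.readline().strip()
    # A single JSON document parses as a whole; check this first, since a
    # one-line JSONL file would also pass.
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
        return "json"
    except json.JSONDecodeError:
        pass
    # JSONL: each line is its own JSON object.
    try:
        json.loads(first_line)
        return "jsonl"
    except json.JSONDecodeError:
        pass
    # Fall back to the stdlib CSV dialect sniffer.
    try:
        with open(path, encoding="utf-8") as f:
            csv.Sniffer().sniff(f.read(2048))
        return "csv"
    except csv.Error:
        return "unknown"
```

Knowing the format up front saves a failed import later, and JSONL is generally the easiest of the three to stream and curate record by record.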
Download
Download the dataset files to your local machine. Most are between 100 MB and 10 GB.
Import to EdukaAI
Use EdukaAI's import feature to bring the data into your workspace. See import guide →
Review and Curate
Not all Kaggle datasets are high quality. Review samples, rate quality, and curate the best examples for your use case.
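The review step can be partly automated with simple heuristics before manual rating. A sketch assuming Alpaca-style JSONL records; the `curate` helper and its thresholds are illustrative choices, not an EdukaAI feature:

```python
import json

def curate(lines, min_output_chars=20, max_output_chars=4000):
    """Keep only records that pass basic quality heuristics (illustrative)."""
    kept = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed lines
        instruction = rec.get("instruction", "").strip()
        output = rec.get("output", "").strip()
        if not instruction or not output:
            continue  # drop records with empty fields
        if not (min_output_chars <= len(output) <= max_output_chars):
            continue  # drop trivially short or runaway-long outputs
        kept.append(rec)
    return kept
```

Heuristics like these only remove the obvious junk; spot-checking a random sample by hand is still the most reliable way to judge whether a dataset fits your use case.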
⚠️ Quality Varies
Kaggle datasets range from excellent to unusable. Always review samples before training. Look for datasets with:
- High upvotes/downloads
- Clear documentation
- Recent updates
- Permissive licenses (CC0, MIT, Apache)
Other Dataset Sources
HuggingFace Datasets
Curated ML datasets with built-in loaders. Search by task, language, or size.
GitHub Repositories
Many projects share training data. Search for "fine-tuning dataset" or "instruction dataset".
Papers with Code
Find datasets used in research papers. Great for reproducing published results.
Reddit/Discord Communities
r/LocalLLaMA and specialized Discords often share niche datasets.