Dataset Resources
Find existing datasets to jumpstart your fine-tuning project
Kaggle Datasets
World's largest data science community
Kaggle hosts 50,000+ public datasets, many of them suitable for LLM fine-tuning. Before creating your own dataset from scratch, check whether the community has already curated training data for your use case.
Popular Dataset Types for Fine-Tuning
- Instruction datasets - Alpaca, ShareGPT format collections
- Code datasets - GitHub code, Stack Overflow Q&A
- Domain-specific - Medical, legal, scientific papers
- Conversational - Chat logs, support tickets
- Multilingual - Non-English training data
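For reference, a single record in the widely used Alpaca instruction format looks like the sketch below. The field names follow the original Alpaca release; the example text itself is invented for illustration.

```python
import json

# One record in Alpaca instruction format: an instruction, optional
# input context, and the target output the model should learn to produce.
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Kaggle hosts tens of thousands of public datasets across domains.",
    "output": "Kaggle offers a large catalog of public datasets.",
}

# Instruction datasets are usually stored as JSONL: one JSON object per line.
line = json.dumps(record)
print(line)
```

ShareGPT-style collections differ mainly in shape: instead of a single instruction/output pair, each record holds a list of alternating conversation turns.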
Using Kaggle Datasets with EdukaAI
Search Kaggle
Look for datasets in JSONL, CSV, or JSON format, and check the license (permissive licenses such as CC0 or MIT are preferred).
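As a quick sanity check before committing to a dataset, you can sniff whether a downloaded file is JSON, JSONL, or CSV. A minimal stdlib sketch (the `sniff_format` helper is illustrative, not a Kaggle or EdukaAI API):

```python
import csv
import json

def sniff_format(path):
    """Guess whether a file is JSON, JSONL, or CSV (illustrative helper)."""
    with open(path, encoding="utf-8") as f:
        first_line = f.readline().strip()
    # A single JSON document parses as a whole; check this first, since a
    # one-line JSONL file would also pass.
    try:
        with open(path, encoding="utf-8") as f:
            json.load(f)
        return "json"
    except json.JSONDecodeError:
        pass
    # JSONL: each line is its own JSON object.
    try:
        json.loads(first_line)
        return "jsonl"
    except json.JSONDecodeError:
        pass
    # Fall back to the stdlib CSV dialect sniffer.
    try:
        with open(path, encoding="utf-8") as f:
            csv.Sniffer().sniff(f.read(2048))
        return "csv"
    except csv.Error:
        return "unknown"
```

Knowing the format up front saves a failed import later, and JSONL is generally the easiest of the three to stream and curate record by record.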
Download
Download the dataset files to your local machine. Most are between 100 MB and 10 GB.
Import to EdukaAI
Use EdukaAI's import feature to bring the data into your workspace. See import guide →
Review and Curate
Not all Kaggle datasets are high quality. Review samples, rate quality, and curate the best examples for your use case.
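The review step can be partly automated with simple heuristics before manual rating. A sketch assuming Alpaca-style JSONL records; the `curate` helper and its thresholds are illustrative choices, not an EdukaAI feature:

```python
import json

def curate(lines, min_output_chars=20, max_output_chars=4000):
    """Keep only records that pass basic quality heuristics (illustrative)."""
    kept = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed lines
        instruction = rec.get("instruction", "").strip()
        output = rec.get("output", "").strip()
        if not instruction or not output:
            continue  # drop records with empty fields
        if not (min_output_chars <= len(output) <= max_output_chars):
            continue  # drop trivially short or runaway-long outputs
        kept.append(rec)
    return kept
```

Heuristics like these only remove the obvious junk; spot-checking a random sample by hand is still the most reliable way to judge whether a dataset fits your use case.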
⚠️ Quality Varies
Kaggle datasets range from excellent to unusable. Always review samples before training. Look for datasets with:
- High upvotes/downloads
- Clear documentation
- Recent updates
- Permissive licenses (CC0, MIT, Apache)
Other Dataset Sources
HuggingFace Datasets
Curated ML datasets with built-in loaders. Search by task, language, or size.
GitHub Repositories
Many projects share training data. Search for "fine-tuning dataset" or "instruction dataset".
Papers with Code
Find datasets used in research papers. Great for reproducing published results.
Reddit/Discord Communities
r/LocalLLaMA and specialized Discords often share niche datasets.