Dataset Usage Guide
GitCode AI Community provides comprehensive dataset management features to help you easily create, share, and use high-quality datasets. This guide will introduce the main operations related to datasets.
Dataset Creation
Create New Dataset
- Login to your GitCode AI account
- Go to “Datasets” > “Create Dataset”
- Fill in dataset information:
- Dataset name
- Description
- Tags
- License type
- Select dataset type:
- Tabular data
- Image data
- Text data
- Audio data
- Video data
- Upload data files
- Set access permissions
- Click “Create” to complete
[Image: Dataset creation page screenshot]
Dataset Configuration
Create a dataset-config.yaml file to define the dataset structure:
dataset-name: my-awesome-dataset
version: 1.0.0
type: image-classification
format:
  - jpg
  - png
structure:
  train: train/
  validation: val/
  test: test/
labels:
  path: labels.csv
  format: csv
Dataset Search
Basic Search
- Enter keywords in the search box
- Use filter conditions:
- Data type
- Task type
- License
- Data volume
- Update time
Advanced Search
The following search syntax is supported:
- `type:image` - Search by data type
- `size:>1GB` - Search by dataset size
- `license:MIT` - Search by license
- `language:chinese` - Search by dataset language
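These filters can be combined with free-text keywords in a single query string. The helper below is hypothetical, not a documented GitCode API; it only mirrors the syntax shown above:

```python
def build_search_query(keywords="", **filters):
    """Join free-text keywords with key:value filters, e.g. type:image size:>1GB."""
    parts = [keywords] if keywords else []
    parts.extend(f"{key}:{value}" for key, value in filters.items())
    return " ".join(parts)
```

For example, build_search_query("cats", type="image", size=">1GB") yields "cats type:image size:>1GB".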
Dataset Download
Web Interface Download
- Go to dataset details page
- Click “Download” button
- Select download content:
- Complete dataset
- Partial data
- Sample data
Command Line Download
# Install GitCode CLI
pip install gitcode
# Download complete dataset
gitcode download-dataset username/dataset-name
# Download specific version
gitcode download-dataset username/dataset-name --version v1.0.0
# Download partial data
gitcode download-dataset username/dataset-name --split train
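For scripted pipelines, the CLI commands above can be driven from Python. This sketch only assembles and runs the invocations already shown; build_download_cmd is a hypothetical helper, not part of the CLI itself:

```python
import subprocess

def build_download_cmd(dataset, version=None, split=None):
    """Assemble the gitcode download-dataset argument list shown above."""
    cmd = ["gitcode", "download-dataset", dataset]
    if version:
        cmd += ["--version", version]
    if split:
        cmd += ["--split", split]
    return cmd

def download_dataset(dataset, **options):
    # check=True raises CalledProcessError if the CLI exits non-zero
    subprocess.run(build_download_cmd(dataset, **options), check=True)
```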
Dataset Usage
Python Code Examples
from gitcode_hub import load_dataset
# Load dataset
dataset = load_dataset("username/dataset-name")
# Get training set
train_data = dataset["train"]
# Data preprocessing
processed_data = dataset.map(preprocess_function)
# Batch processing
for batch in dataset.iter_batches(batch_size=32):
    process_batch(batch)
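The preprocess_function passed to map above is user-supplied. A hypothetical example for a text dataset (the "text" field name is an assumption, not part of the platform schema):

```python
def preprocess_function(example):
    """Normalize one record: trim whitespace and lowercase the text field."""
    example["text"] = example["text"].strip().lower()
    return example
```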
Dataset Version Control
# Load specific version
dataset_v1 = load_dataset("username/dataset-name", version="1.0.0")
# View version history
dataset.version_history()
# Create new version
dataset.create_version("1.1.0", description="Added new samples")
Best Practices
Data Quality Control
- Perform data cleaning
- Check data integrity
- Validate annotation quality
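The checks above can be partly automated before upload. A minimal sketch, assuming records are dicts carrying a "label" annotation (a hypothetical schema):

```python
def clean_records(records, required_fields=("label",)):
    """Drop exact duplicates and records missing required annotations."""
    seen = set()
    cleaned = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # exact duplicate
        if any(rec.get(field) in (None, "") for field in required_fields):
            continue  # missing or empty annotation
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```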
Dataset Documentation
- Detailed data description
- Data collection methods
- Usage restriction descriptions
- Privacy considerations
Version Management
- Semantic versioning
- Update logs
- Change descriptions
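Semantic version strings like "1.0.0" above can be validated and bumped with a small helper. A generic sketch, not part of the GitCode API:

```python
import re

SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")

def bump_version(version, part="patch"):
    """Increment the major, minor, or patch component of a semantic version."""
    match = SEMVER.match(version)
    if not match:
        raise ValueError(f"not a semantic version: {version}")
    major, minor, patch = map(int, match.groups())
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```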
Data Security
- Data anonymization
- Access permission control
- Compliance checks
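One common anonymization step is replacing direct identifiers with salted hashes before publishing. A minimal standard-library sketch; the field names and salt handling are illustrative only:

```python
import hashlib

def anonymize(value, salt):
    """Map a PII value to a stable, salted SHA-256 token (first 16 hex chars)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]

def anonymize_records(records, pii_fields=("email", "name"), salt="change-me"):
    # the same value and salt always yield the same token, so joins stay possible
    return [
        {k: (anonymize(v, salt) if k in pii_fields else v) for k, v in rec.items()}
        for rec in records
    ]
```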
Common Questions
Q: How do I handle large datasets? A: Use streaming loading or chunked downloads to avoid loading all the data into memory at once.
Q: What data formats are supported? A: Common formats such as CSV, JSON, images, and audio are supported; see the documentation for details.
Q: How do I contribute data? A: You can submit new data through the dataset update features, or create a dataset branch for collaboration.
Q: What are the storage limits for datasets? A: Free accounts can create datasets of up to 10GB; premium accounts have larger storage limits.