Dataset Usage Guide

GitCode AI Community provides comprehensive dataset management features that make it easy to create, share, and use high-quality datasets. This guide walks through the main dataset operations.

Dataset Creation

Create New Dataset

  1. Log in to your GitCode AI account
  2. Go to “Datasets” > “Create Dataset”
  3. Fill in dataset information:
    • Dataset name
    • Description
    • Tags
    • License type
  4. Select dataset type:
    • Tabular data
    • Image data
    • Text data
    • Audio data
    • Video data
  5. Upload data files
  6. Set access permissions
  7. Click “Create” to finish

[Image: Dataset creation page screenshot]

Dataset Configuration

Create a dataset-config.yaml file to define the dataset's structure:

dataset-name: my-awesome-dataset
version: 1.0.0
type: image-classification
format: 
  - jpg
  - png
structure:
  train: train/
  validation: val/
  test: test/
labels:
  path: labels.csv
  format: csv
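
You can sanity-check the config locally before uploading; below is a minimal sketch using PyYAML (pip install pyyaml). The required-key list is an assumption based on the example above, not a published schema:

import yaml

with open("dataset-config.yaml") as f:
    config = yaml.safe_load(f)

# Keys taken from the example above; the platform's actual schema may differ.
for key in ("dataset-name", "version", "type", "structure", "labels"):
    assert key in config, f"missing required key: {key}"

print(config["structure"])  # {'train': 'train/', 'validation': 'val/', 'test': 'test/'}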

Dataset Search

  1. Enter keywords in the search box
  2. Use filter conditions:
    • Data type
    • Task type
    • License
    • Data volume
    • Update time

The search box supports the following syntax:

  • type:image - Search by data type
  • size:>1GB - Search by dataset size
  • license:MIT - Search by license
  • language:chinese - Search by dataset language
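
For example, assuming filters can be combined with spaces (this guide does not state so explicitly), a query such as type:image license:MIT size:>1GB would narrow results to image datasets under the MIT license that are larger than 1 GB.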

Dataset Download

Web Interface Download

  1. Go to dataset details page
  2. Click the “Download” button
  3. Choose what to download:
    • Complete dataset
    • Partial data
    • Sample data

Command Line Download

# Install GitCode CLI
pip install gitcode

# Download complete dataset
gitcode download-dataset username/dataset-name

# Download specific version
gitcode download-dataset username/dataset-name --version v1.0.0

# Download partial data
gitcode download-dataset username/dataset-name --split train

Dataset Usage

Python Code Examples

from gitcode_hub import load_dataset

# Load dataset
dataset = load_dataset("username/dataset-name")

# Get training set
train_data = dataset["train"]

# preprocess_function and process_batch are placeholders for your own
# callables; the minimal stubs below are illustrative.
def preprocess_function(example):
    # Transform a single example, e.g. normalize fields or tokenize text
    return example

def process_batch(batch):
    # Consume one batch, e.g. run inference or write features to disk
    pass

# Data preprocessing
processed_data = dataset.map(preprocess_function)

# Batch processing
for batch in dataset.iter_batches(batch_size=32):
    process_batch(batch)

Dataset Version Control

# Load specific version
dataset_v1 = load_dataset("username/dataset-name", version="1.0.0")

# View version history
dataset.version_history()

# Create new version
dataset.create_version("1.1.0", description="Added new samples")

Best Practices

  1. Data Quality Control

    • Perform data cleaning
    • Check data integrity
    • Validate annotation quality
  2. Dataset Documentation

    • Provide a detailed data description
    • Document data collection methods
    • Describe usage restrictions
    • Note privacy considerations
  3. Version Management

    • Use semantic versioning (see the example after this list)
    • Keep an update log
    • Describe changes between versions
  4. Data Security

    • Anonymize sensitive data
    • Control access permissions
    • Run compliance checks
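
As an illustration of semantic versioning with the create_version API shown earlier (the version numbers and descriptions are invented for this example):

# Patch release: fixes only, no schema change
dataset.create_version("1.0.1", description="Fixed mislabeled samples in the train split")

# Minor release: backward-compatible additions
dataset.create_version("1.1.0", description="Added new validation samples")

# Major release: breaking change to structure or labels
dataset.create_version("2.0.0", description="Reorganized the label schema")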

Common Questions

Q: How do I handle large datasets? A: Use streaming loading or chunked downloads to avoid loading all the data at once.
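
For example, combining the --split flag from the command-line section with the iter_batches API from the usage section keeps memory use bounded (the batch size here is illustrative):

# Download only the train split (see "Command Line Download" above):
#   gitcode download-dataset username/dataset-name --split train

from gitcode_hub import load_dataset

dataset = load_dataset("username/dataset-name")

# Process the data in fixed-size chunks instead of loading it all at once
for batch in dataset.iter_batches(batch_size=1024):
    ...  # handle one chunk, then move on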

Q: What data formats are supported? A: Common formats such as CSV, JSON, images, and audio are supported. See the documentation for details.

Q: How do I contribute data? A: You can submit new data through the dataset update features, or create a dataset branch to collaborate.

Q: What are the storage limits for datasets? A: Free accounts can create datasets up to 10 GB; premium accounts have larger limits.