Dataset Usage Guide

GitCode AI Community provides comprehensive dataset management features that make it easy to create, share, and use high-quality datasets. This guide walks through the main dataset operations.

Dataset Creation

Create New Dataset

  1. Log in to your GitCode AI account
  2. Go to “Datasets” > “Create Dataset”
  3. Fill in dataset information:
    • Dataset name
    • Description
    • Tags
    • License type
  4. Select dataset type:
    • Tabular data
    • Image data
    • Text data
    • Audio data
    • Video data
  5. Upload data files
  6. Set access permissions
  7. Click “Create” to finish

[Image: Dataset creation page screenshot]

Dataset Configuration

Create a dataset-config.yaml file to define the dataset's structure:

dataset-name: my-awesome-dataset
version: 1.0.0
type: image-classification
format: 
  - jpg
  - png
structure:
  train: train/
  validation: val/
  test: test/
labels:
  path: labels.csv
  format: csv
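
You can sanity-check the config locally before uploading; below is a minimal sketch using PyYAML (pip install pyyaml). The required-key list is an assumption based on the example above, not a published schema:

import yaml

with open("dataset-config.yaml") as f:
    config = yaml.safe_load(f)

# Keys taken from the example above; the platform's actual schema may differ.
for key in ("dataset-name", "version", "type", "structure", "labels"):
    assert key in config, f"missing required key: {key}"

print(config["structure"])  # {'train': 'train/', 'validation': 'val/', 'test': 'test/'}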

Dataset Search

  1. Enter keywords in the search box
  2. Use filter conditions:
    • Data type
    • Task type
    • License
    • Data volume
    • Update time

The search box supports the following syntax:

  • type:image - Search by data type
  • size:>1GB - Search by dataset size
  • license:MIT - Search by license
  • language:chinese - Search by dataset language
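
For example, assuming filters can be combined with spaces (this guide does not state so explicitly), a query such as type:image license:MIT size:>1GB would narrow results to image datasets under the MIT license that are larger than 1 GB.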

Dataset Download

Web Interface Download

  1. Go to dataset details page
  2. Click the “Download” button
  3. Choose what to download:
    • Complete dataset
    • Partial data
    • Sample data

Command Line Download

# Install GitCode CLI
pip install gitcode

# Download complete dataset
gitcode download-dataset username/dataset-name

# Download specific version
gitcode download-dataset username/dataset-name --version v1.0.0

# Download partial data
gitcode download-dataset username/dataset-name --split train

Dataset Usage

Python Code Examples

from gitcode_hub import load_dataset

# Load dataset
dataset = load_dataset("username/dataset-name")

# Get training set
train_data = dataset["train"]

# preprocess_function and process_batch are placeholders for your own
# callables; the minimal stubs below are illustrative.
def preprocess_function(example):
    # Transform a single example, e.g. normalize fields or tokenize text
    return example

def process_batch(batch):
    # Consume one batch, e.g. run inference or write features to disk
    pass

# Data preprocessing
processed_data = dataset.map(preprocess_function)

# Batch processing
for batch in dataset.iter_batches(batch_size=32):
    process_batch(batch)

Dataset Version Control

# Load specific version
dataset_v1 = load_dataset("username/dataset-name", version="1.0.0")

# View version history
dataset.version_history()

# Create new version
dataset.create_version("1.1.0", description="Added new samples")

Best Practices

  1. Data Quality Control

    • Perform data cleaning
    • Check data integrity
    • Validate annotation quality
  2. Dataset Documentation

    • Provide a detailed data description
    • Document data collection methods
    • Describe usage restrictions
    • Note privacy considerations
  3. Version Management

    • Use semantic versioning (see the example after this list)
    • Keep an update log
    • Describe changes between versions
  4. Data Security

    • Anonymize sensitive data
    • Control access permissions
    • Run compliance checks
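
As an illustration of semantic versioning with the create_version API shown earlier (the version numbers and descriptions are invented for this example):

# Patch release: fixes only, no schema change
dataset.create_version("1.0.1", description="Fixed mislabeled samples in the train split")

# Minor release: backward-compatible additions
dataset.create_version("1.1.0", description="Added new validation samples")

# Major release: breaking change to structure or labels
dataset.create_version("2.0.0", description="Reorganized the label schema")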

Common Questions

Q: How do I handle large datasets? A: Use streaming loading or chunked downloads to avoid loading all the data at once.
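
For example, combining the --split flag from the command-line section with the iter_batches API from the usage section keeps memory use bounded (the batch size here is illustrative):

# Download only the train split (see "Command Line Download" above):
#   gitcode download-dataset username/dataset-name --split train

from gitcode_hub import load_dataset

dataset = load_dataset("username/dataset-name")

# Process the data in fixed-size chunks instead of loading it all at once
for batch in dataset.iter_batches(batch_size=1024):
    ...  # handle one chunk, then move on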

Q: What data formats are supported? A: Common formats such as CSV, JSON, images, and audio are supported. See the documentation for details.

Q: How do I contribute data? A: You can submit new data through the dataset update features, or create a dataset branch to collaborate.

Q: What are the storage limits for datasets? A: Free accounts can create datasets up to 10 GB; premium accounts have larger limits.