Best Practices Guide

On the GitCode AI community platform, following best practices helps you use the platform's features more effectively, work more efficiently, and avoid common problems. This guide compiles best practices for platform usage, covering models, datasets, Spaces, Notebooks, and more.

Model Management Best Practices

Model Creation and Publishing

1. Model Naming Conventions

  • Use clear, descriptive names
  • Include model type and main functionality
  • Avoid special characters and spaces
  • Examples: bert-chinese-sentiment-analysis, resnet50-image-classification

2. Model Description Writing

````markdown
# Model Name

## Function Description
- Main Purpose: Text sentiment analysis
- Supported Languages: Chinese
- Input Format: Text string
- Output Format: Sentiment labels (Positive/Negative/Neutral)

## Technical Details
- Framework: PyTorch
- Pre-trained Model: BERT-base-chinese
- Model Size: 110M parameters
- Inference Speed: ~100 ms/sentence

## Usage Examples
```python
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model="username/bert-chinese-sentiment")
result = classifier("This product is very good!")
```

## Notes
- Only supports Chinese text
- Recommended input length: no more than 512 characters
- Requires the transformers library
````

3. Model Configuration File Optimization

```yaml
# model-config.yaml
model-name: bert-chinese-sentiment
version: 1.0.0
framework: pytorch
task: text-classification
tags:
  - chinese
  - sentiment-analysis
  - bert
  - nlp

dependencies:
  - torch>=1.9.0
  - transformers>=4.20.0
  - numpy>=1.21.0

model-info:
  architecture: bert-base-chinese
  parameters: 110M
  max-input-length: 512
  
performance:
  inference-time: 100ms
  accuracy: 0.92
  f1-score: 0.91

usage-examples:
  - input: "This product is very good!"
    output: "Positive"
  - input: "Quality is too poor"
    output: "Negative"

Model Version Management

1. Version Number Conventions

  • Use semantic versioning: Major.Minor.Patch
  • Major version: Incompatible API changes
  • Minor version: Backward-compatible new functionality
  • Patch version: Backward-compatible bug fixes

2. Version Update Strategy

```bash
# Patch release (bug fixes)
git tag v1.0.1
git push origin v1.0.1

# Minor release (new, backward-compatible features)
git tag v1.1.0
git push origin v1.1.0

# Major release (breaking changes)
git tag v2.0.0
git push origin v2.0.0
```

3. Changelog Maintenance

```markdown
# Changelog

## [1.1.0] - 2024-01-15
### Added
- Support for batch inference
- Added quantized model version
- Optimized inference speed

### Fixed
- Fixed long text processing issues
- Resolved memory leak problems

### Changed
- Updated dependency library versions
- Optimized model structure

## [1.0.1] - 2024-01-01
### Fixed
- Fixed input validation issues
- Resolved encoding problems
```

Dataset Management Best Practices

Dataset Organization

1. Directory Structure Standards

```
dataset-name/
├── README.md
├── data/
│   ├── train/
│   ├── validation/
│   └── test/
├── metadata.json
├── schema.json
└── examples/
    ├── sample1.jpg
    ├── sample2.jpg
    └── sample3.jpg
```

2. Metadata File Standards

```json
{
  "name": "chinese-text-classification",
  "version": "1.0.0",
  "description": "Chinese Text Classification Dataset",
  "license": "MIT",
  "creator": "username",
  "created_date": "2024-01-01",
  "last_updated": "2024-01-15",
  
  "statistics": {
    "total_samples": 10000,
    "train_samples": 8000,
    "validation_samples": 1000,
    "test_samples": 1000,
    "classes": 5
  },
  
  "format": {
    "input_type": "text",
    "output_type": "label",
    "encoding": "utf-8"
  },
  
  "quality_metrics": {
    "completeness": 0.98,
    "consistency": 0.95,
    "accuracy": 0.92
  }
}
```
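Keeping the statistics in sync with the actual data saves downstream users from surprises. A small consistency check, sketched against the field names above:

```python
# Sketch: load metadata.json and verify that the split sizes add up.
# Field names follow the example above; adjust them to your own metadata schema.
import json

with open("metadata.json", "r", encoding="utf-8") as f:
    metadata = json.load(f)

stats = metadata["statistics"]
split_total = stats["train_samples"] + stats["validation_samples"] + stats["test_samples"]
assert split_total == stats["total_samples"], (
    f"Split sizes ({split_total}) do not match total_samples ({stats['total_samples']})"
)
print(f"{metadata['name']} v{metadata['version']}: {stats['total_samples']} samples, {stats['classes']} classes")
```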

3. Data Quality Assurance

```python
# Data quality check script
import pandas as pd
import numpy as np
from typing import Dict, List

def check_data_quality(data: pd.DataFrame) -> Dict[str, float]:
    """Check data quality"""
    quality_metrics = {}
    
    # Completeness check
    quality_metrics['completeness'] = 1 - data.isnull().sum().sum() / (data.shape[0] * data.shape[1])
    
    # Consistency check
    quality_metrics['consistency'] = check_data_consistency(data)
    
    # Accuracy check
    quality_metrics['accuracy'] = check_data_accuracy(data)
    
    return quality_metrics

def check_data_consistency(data: pd.DataFrame) -> float:
    """Check data consistency"""
    # Implement consistency check logic
    pass

def check_data_accuracy(data: pd.DataFrame) -> float:
    """Check data accuracy"""
    # Implement accuracy check logic
    pass
```
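The consistency and accuracy checks are left as placeholders above because their definitions depend on the dataset. One possible way to fill them in, purely as a sketch:

```python
# Rough sketch of the placeholder checks above -- not an authoritative
# definition of "consistency" or "accuracy"; adapt to your own data.
from typing import List, Optional
import pandas as pd

def check_data_consistency(data: pd.DataFrame) -> float:
    """One possible definition: share of rows that are not exact duplicates."""
    if len(data) == 0:
        return 0.0
    return float(1 - data.duplicated().sum() / len(data))

def check_data_accuracy(data: pd.DataFrame, label_column: str = "label",
                        valid_labels: Optional[List[str]] = None) -> float:
    """One possible definition: share of labels inside the expected label set."""
    if valid_labels is None or label_column not in data.columns:
        return 1.0  # nothing to check against
    return float(data[label_column].isin(valid_labels).mean())
```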

Dataset Documentation

1. README Template

```markdown
# Dataset Name

## Overview
Briefly describe the dataset content, purpose, and characteristics.

## Data Source
Explain data source, collection methods, and time.

## Data Format
Describe data format, structure, and field descriptions in detail.

## Usage Instructions
Provide examples of data loading, preprocessing, and usage.

## Quality Assessment
Explain data quality assessment results and considerations.

## License
Explain data usage license and restrictions.

## Citation
If dataset comes from research papers, provide citation information.

## Contact
Provide contact information for dataset maintainers.
```
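The Usage Instructions section is easiest to keep accurate when it contains a runnable snippet. For example, a sketch that assumes each split directory from the layout above holds a single CSV file named after the split:

```python
# Sketch: load the train/validation/test splits from the directory layout above.
# Assumes one CSV per split (e.g. data/train/train.csv); adjust to your format.
import pandas as pd

splits = {}
for split in ["train", "validation", "test"]:
    splits[split] = pd.read_csv(f"data/{split}/{split}.csv")
    print(f"{split}: {len(splits[split])} samples")
```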

Space Application Best Practices

Application Architecture Design

1. Modular Design

```python
# app.py
from flask import Flask, request, jsonify
from modules.preprocessor import DataPreprocessor
from modules.model import ModelWrapper
from modules.postprocessor import ResultPostprocessor

app = Flask(__name__)

# Initialize components
preprocessor = DataPreprocessor()
model = ModelWrapper()
postprocessor = ResultPostprocessor()

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Data preprocessing
        input_data = request.json
        processed_data = preprocessor.process(input_data)
        
        # Model inference
        raw_result = model.predict(processed_data)
        
        # Result post-processing
        final_result = postprocessor.process(raw_result)
        
        return jsonify({
            'success': True,
            'result': final_result
        })
    except Exception as e:
        return jsonify({
            'success': False,
            'error': str(e)
        }), 400

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```

2. Configuration File Management

```yaml
# app.yaml
name: text-classification-app
version: 1.0.0
description: Text Classification Application

runtime: python3.9
entrypoint: app:app

env_variables:
  MODEL_PATH: /app/models/model.pkl
  MAX_INPUT_LENGTH: 512
  BATCH_SIZE: 32

resources:
  cpu: 2
  memory: 4Gi
  gpu: 1

dependencies:
  - flask==2.3.0
  - torch==2.0.0
  - transformers==4.30.0

health_check:
  path: /health
  interval: 30s
  timeout: 10s
  retries: 3
```
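The health_check section assumes the application answers on the configured path. For the Flask app shown earlier, that endpoint can be as small as:

```python
# Minimal health endpoint matching the health_check path above
# (extends the Flask app from the modular design example).
@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'ok'}), 200
```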

Performance Optimization

1. Caching Strategy

```python
import functools
import redis
import pickle

# Connect to Redis (adjust host/port/db to your deployment)
redis_client = redis.Redis(host='localhost', port=6379, db=0)

# Redis cache decorator
def cache_result(expire_time=3600):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Generate cache key
            cache_key = f"{func.__name__}:{hash(str(args) + str(kwargs))}"
            
            # Try to get from cache
            cached_result = redis_client.get(cache_key)
            if cached_result:
                return pickle.loads(cached_result)
            
            # Execute function and cache result
            result = func(*args, **kwargs)
            redis_client.setex(cache_key, expire_time, pickle.dumps(result))
            
            return result
        return wrapper
    return decorator

@cache_result(expire_time=1800)
def expensive_computation(input_data):
    # Expensive computation logic
    pass
```

2. Asynchronous Processing

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class AsyncModelService:
    def __init__(self):
        self.executor = ThreadPoolExecutor(max_workers=4)
    
    async def batch_predict(self, input_list):
        """Asynchronous batch prediction"""
        tasks = []
        for input_data in input_list:
            task = asyncio.create_task(self.predict_single(input_data))
            tasks.append(task)
        
        results = await asyncio.gather(*tasks)
        return results
    
    async def predict_single(self, input_data):
        """Single prediction task"""
        loop = asyncio.get_event_loop()
        result = await loop.run_in_executor(
            self.executor, 
            self._predict, 
            input_data
        )
        return result
    
    def _predict(self, input_data):
        """Actual prediction logic"""
        # Model inference code
        pass
```
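Calling the service from synchronous code then looks like the sketch below; `_predict` is still a placeholder, and inside an already-running event loop (such as Jupyter) you would `await service.batch_predict(...)` directly instead of using asyncio.run.

```python
# Sketch: drive the async service from a plain script.
service = AsyncModelService()
inputs = ["First text", "Second text", "Third text"]
results = asyncio.run(service.batch_predict(inputs))
print(results)
```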

Notebook Development Best Practices

Code Organization

1. Cell Structure

```python
# 1. Imports and Configuration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')

# 2. Data Loading
def load_dataset():
    """Load dataset"""
    data = pd.read_csv('dataset.csv')
    print(f"Dataset shape: {data.shape}")
    return data

# 3. Data Exploration
def explore_data(data):
    """Data exploration analysis"""
    print("Data basic information:")
    print(data.info())
    
    print("\nData statistical summary:")
    print(data.describe())
    
    print("\nMissing value statistics:")
    print(data.isnull().sum())

# 4. Data Preprocessing
def preprocess_data(data):
    """Data preprocessing"""
    # Handle missing values
    data = data.fillna(data.median(numeric_only=True))
    
    # Feature engineering
    data['feature_new'] = data['feature1'] * data['feature2']
    
    return data

# 5. Model Training
def train_model(X, y):
    """Model training"""
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    return model, X_test, y_test

# 6. Result Evaluation
def evaluate_model(model, X_test, y_test):
    """Model evaluation"""
    from sklearn.metrics import classification_report, confusion_matrix
    
    y_pred = model.predict(X_test)
    
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    
    # Plot confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
    plt.title('Confusion Matrix')
    plt.show()
```
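Keeping each step in its own function makes the final driver cell short and readable. A sketch of that last cell, assuming the target column is named `label`:

```python
# 7. Putting the cells together (sketch; assumes a "label" target column)
data = load_dataset()
explore_data(data)
data = preprocess_data(data)

X = data.drop(columns=["label"])
y = data["label"]

model, X_test, y_test = train_model(X, y)
evaluate_model(model, X_test, y_test)
```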

2. Function and Class Design

```python
class DataProcessor:
    """Data processing class"""
    
    def __init__(self, config):
        self.config = config
        self.data = None
    
    def load(self, file_path):
        """Load data"""
        if file_path.endswith('.csv'):
            self.data = pd.read_csv(file_path)
        elif file_path.endswith('.json'):
            self.data = pd.read_json(file_path)
        else:
            raise ValueError("Unsupported file format")
        
        return self.data
    
    def clean(self):
        """Data cleaning"""
        if self.data is None:
            raise ValueError("Please load data first")
        
        # Remove duplicate rows
        self.data = self.data.drop_duplicates()
        
        # Handle missing values using the configured strategy ('mean' or 'median')
        fill_method = self.config.get('fill_method', 'median')
        if fill_method == 'mean':
            self.data = self.data.fillna(self.data.mean(numeric_only=True))
        else:
            self.data = self.data.fillna(self.data.median(numeric_only=True))
        
        return self.data
    
    def transform(self):
        """Data transformation"""
        if self.data is None:
            raise ValueError("Please load data first")
        
        # Feature engineering
        for feature in self.config.get('features', []):
            if feature['type'] == 'categorical':
                self.data = pd.get_dummies(self.data, columns=[feature['name']])
            elif feature['type'] == 'numerical':
                self.data[feature['name']] = pd.to_numeric(self.data[feature['name']])
        
        return self.data

# Usage example
config = {
    'fill_method': 'mean',
    'features': [
        {'name': 'category', 'type': 'categorical'},
        {'name': 'price', 'type': 'numerical'}
    ]
}

processor = DataProcessor(config)
processor.load('data.csv')
processor.clean()
data = processor.transform()
```

Experiment Management

1. Experiment Tracking

```python
import mlflow
import mlflow.sklearn

# Set up MLflow
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("Text Classification Experiment")

def run_experiment(params):
    """Run experiment"""
    with mlflow.start_run():
        # Log parameters
        mlflow.log_params(params)
        
        # Train model
        model = train_model_with_params(params)
        
        # Evaluate model
        metrics = evaluate_model(model)
        
        # Log metrics
        mlflow.log_metrics(metrics)
        
        # Save model
        mlflow.sklearn.log_model(model, "model")
        
        # Log data version
        mlflow.log_artifact("data.csv")
        
        return model, metrics

# Experiment parameters
experiment_params = [
    {'model': 'random_forest', 'n_estimators': 100},
    {'model': 'random_forest', 'n_estimators': 200},
    {'model': 'xgboost', 'n_estimators': 100}
]

# Run experiments
results = []
for params in experiment_params:
    model, metrics = run_experiment(params)
    results.append({'params': params, 'metrics': metrics})

# Compare results
results_df = pd.DataFrame(results)
print(results_df)
```
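Besides collecting results manually, you can query the tracking store after the loop. A sketch, assuming the metrics dictionary contains an `accuracy` key:

```python
# Sketch: list the runs logged above, best accuracy first.
runs = mlflow.search_runs(order_by=["metrics.accuracy DESC"])
print(runs[["run_id", "params.model", "metrics.accuracy"]].head())
```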

2. Version Control

```python
import os
import pickle
import pandas as pd

# Save checkpoint
# In a notebook, pass the interactive shell: save_checkpoint(get_ipython(), "after-cleaning")
def save_checkpoint(notebook, name, description=""):
    """Save checkpoint"""
    # Keep only picklable user variables (modules, open handles, etc. are skipped)
    variables = {}
    for key, value in notebook.user_ns.items():
        if key.startswith('_'):
            continue
        try:
            pickle.dumps(value)
        except Exception:
            continue
        variables[key] = value

    checkpoint = {
        'name': name,
        'description': description,
        'timestamp': pd.Timestamp.now(),
        'variables': variables
    }
    
    # Save to file
    checkpoint_file = f"checkpoints/{name}.pkl"
    os.makedirs("checkpoints", exist_ok=True)
    
    with open(checkpoint_file, 'wb') as f:
        pickle.dump(checkpoint, f)
    
    print(f"Checkpoint saved: {checkpoint_file}")

# Restore checkpoint
def restore_checkpoint(notebook, name):
    """Restore checkpoint"""
    checkpoint_file = f"checkpoints/{name}.pkl"
    
    if not os.path.exists(checkpoint_file):
        print(f"Checkpoint doesn't exist: {checkpoint_file}")
        return False
    
    with open(checkpoint_file, 'rb') as f:
        checkpoint = pickle.load(f)
    
    # Restore variables
    notebook.user_ns.update(checkpoint['variables'])
    
    print(f"Checkpoint restored: {name}")
    print(f"Description: {checkpoint['description']}")
    print(f"Time: {checkpoint['timestamp']}")
    
    return True
```

Collaborative Development Best Practices

Team Collaboration

1. Code Standards

```python
# Code style guide
"""
Code Style Standards:
1. Use meaningful variable names and function names
2. Add appropriate comments and docstrings
3. Follow PEP 8 code style
4. Use type hints
5. Write unit tests
"""

from typing import List, Dict, Optional, Union
import numpy as np
import pandas as pd

def process_text_data(
    text_list: List[str],
    max_length: int = 512,
    tokenizer: Optional[object] = None
) -> Dict[str, np.ndarray]:
    """
    Process text data
    
    Args:
        text_list: List of texts
        max_length: Maximum length
        tokenizer: Tokenizer object
    
    Returns:
        Dictionary containing processed data
    
    Raises:
        ValueError: When input parameters are invalid
    """
    if not text_list:
        raise ValueError("Text list cannot be empty")
    
    # Processing logic
    processed_data = {}
    
    return processed_data
```
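Point 5 of the style guide calls for unit tests. A minimal pytest sketch for the function above (the module path is a placeholder for your project layout):

```python
# tests/test_process_text_data.py -- minimal pytest example for process_text_data
import pytest
from mymodule import process_text_data  # adjust the import to your project

def test_empty_input_raises():
    with pytest.raises(ValueError):
        process_text_data([])

def test_returns_dict():
    result = process_text_data(["hello world"], max_length=16)
    assert isinstance(result, dict)
```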

2. Version Control Workflow

```bash
# Branch management strategy
# main: Main branch, keep stable
# develop: Development branch, integrate features
# feature/*: Feature branches, develop new features
# hotfix/*: Hotfix branches, fix urgent issues

# Create feature branch
git checkout -b feature/text-classification

# Commit after development completion
git add .
git commit -m "feat: Add text classification feature

- Implement basic text classification model
- Add data preprocessing functionality
- Integrate evaluation metrics"

# Push to remote
git push origin feature/text-classification

# Create merge request
# Create MR/PR on GitLab/GitHub
```

Documentation Management

1. Project Documentation Structure

```
project/
├── README.md
├── docs/
│   ├── api/
│   ├── user-guide/
│   └── development/
├── examples/
├── tests/
└── requirements.txt
```

2. Documentation Writing Standards

```markdown
# Documentation Title

## Overview
Briefly describe the function or module purpose.

## Features
- Feature 1: Description
- Feature 2: Description

## Usage
Provide detailed usage instructions and example code.

## API Reference
Detailed API interface and parameter descriptions.

## Notes
Explain usage considerations and limitations.

## FAQ
List common questions and solutions.

## Changelog
Record version update information.
```

Security Best Practices

Data Security

1. Sensitive Information Protection

```python
# Environment variable management
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Get sensitive configuration
API_KEY = os.getenv('API_KEY')
DATABASE_URL = os.getenv('DATABASE_URL')

# Validate configuration
if not API_KEY:
    raise ValueError("API_KEY environment variable not set")

# .env file (don't commit to version control)
API_KEY=your_api_key_here
DATABASE_URL=your_database_url_here

2. Data Validation

```python
from pydantic import BaseModel, validator
from typing import List

class InputData(BaseModel):
    """Input data model"""
    text: str
    max_length: int = 512
    
    @validator('text')
    def validate_text(cls, v):
        if not v or len(v.strip()) == 0:
            raise ValueError('Text cannot be empty')
        if len(v) > 10000:
            raise ValueError('Text length cannot exceed 10000 characters')
        return v.strip()
    
    @validator('max_length')
    def validate_max_length(cls, v):
        if v <= 0 or v > 10000:
            raise ValueError('Maximum length must be between 1-10000')
        return v

# Usage example
try:
    data = InputData(text="Test text", max_length=100)
    print("Data validation passed")
except ValueError as e:
    print(f"Data validation failed: {e}")

Access Control

1. Permission Management

```python
from functools import wraps
from flask import request, jsonify

def require_auth(f):
    """Authentication decorator"""
    @wraps(f)
    def decorated_function(*args, **kwargs):
        token = request.headers.get('Authorization')
        
        if not token:
            return jsonify({'error': 'Missing authentication token'}), 401
        
        if not validate_token(token):
            return jsonify({'error': 'Invalid authentication token'}), 401
        
        return f(*args, **kwargs)
    return decorated_function

def require_role(role):
    """Role validation decorator"""
    def decorator(f):
        @wraps(f)
        def decorated_function(*args, **kwargs):
            user_role = get_user_role(request.headers.get('Authorization'))
            
            if user_role != role:
                return jsonify({'error': 'Insufficient permissions'}), 403
            
            return f(*args, **kwargs)
        return decorated_function
    return decorator

# Usage example
@app.route('/admin', methods=['GET'])
@require_auth
@require_role('admin')
def admin_panel():
    return jsonify({'message': 'Welcome to admin panel'})
```
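The decorators above assume `validate_token` and `get_user_role` exist. A minimal sketch backed by a static token table; in production you would verify signed tokens (for example JWTs) instead:

```python
# Sketch of the helper functions assumed by the decorators above.
# The token values are placeholders; replace with real token verification.
API_TOKENS = {
    'token-abc123': 'admin',
    'token-def456': 'user',
}

def validate_token(token: str) -> bool:
    """Return True if the token is known."""
    return token in API_TOKENS

def get_user_role(token: str) -> str:
    """Return the role associated with a token, or 'anonymous'."""
    return API_TOKENS.get(token, 'anonymous')
```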

Performance Monitoring Best Practices

Monitoring Metrics

1. System Performance Monitoring

```python
import psutil
import time
from datetime import datetime

class PerformanceMonitor:
    """Performance monitor"""
    
    def __init__(self):
        self.metrics = []
    
    def collect_metrics(self):
        """Collect performance metrics"""
        metrics = {
            'timestamp': datetime.now(),
            'cpu_percent': psutil.cpu_percent(interval=1),
            'memory_percent': psutil.virtual_memory().percent,
            'disk_usage': psutil.disk_usage('/').percent
        }
        
        self.metrics.append(metrics)
        return metrics
    
    def get_summary(self):
        """Get performance summary"""
        if not self.metrics:
            return {}
        
        cpu_values = [m['cpu_percent'] for m in self.metrics]
        memory_values = [m['memory_percent'] for m in self.metrics]
        
        return {
            'cpu_avg': sum(cpu_values) / len(cpu_values),
            'cpu_max': max(cpu_values),
            'memory_avg': sum(memory_values) / len(memory_values),
            'memory_max': max(memory_values)
        }

# Use monitor
monitor = PerformanceMonitor()

# Collect metrics periodically
while True:
    metrics = monitor.collect_metrics()
    print(f"CPU: {metrics['cpu_percent']}%, Memory: {metrics['memory_percent']}%")
    time.sleep(60)  # Collect once per minute
```

2. Application Performance Monitoring

```python
import time
from datetime import datetime
from functools import wraps

def performance_monitor(func):
    """Performance monitoring decorator"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        
        try:
            result = func(*args, **kwargs)
            execution_time = time.time() - start_time
            
            # Log performance metrics
            log_performance(func.__name__, execution_time, 'success')
            
            return result
        except Exception as e:
            execution_time = time.time() - start_time
            
            # Log error and performance metrics
            log_performance(func.__name__, execution_time, 'error', str(e))
            raise
    
    return wrapper

def log_performance(func_name, execution_time, status, error=None):
    """Log performance logs"""
    log_entry = {
        'function': func_name,
        'execution_time': execution_time,
        'status': status,
        'timestamp': datetime.now(),
        'error': error
    }
    
    # Write to log file or database
    print(f"Performance log: {log_entry}")

# Usage example
@performance_monitor
def process_large_dataset(data):
    """Process large dataset"""
    time.sleep(2)  # Simulate processing time
    return len(data)

# Test performance monitoring
result = process_large_dataset(range(1000))
```

Summary

Following these best practices can help you:

  1. Improve Development Efficiency: Through standardized processes and tools
  2. Ensure Code Quality: Through code standards and testing
  3. Enhance Team Collaboration: Through clear documentation and version control
  4. Improve System Performance: Through optimization and monitoring
  5. Ensure Security: Through security measures and permission management

Remember that best practices evolve. As technology advances and your team gains experience, keep updating and refining these practices, and adapt them to the specific needs of your projects.

Most importantly, build these practices into your daily development work until they become habits; that is when they deliver their full benefit.