# Best Practices Guide

On the GitCode AI community platform, following best practices helps you use the platform's features more effectively, work more efficiently, and avoid common problems. This guide collects best practices for platform usage, covering models, datasets, Spaces, Notebooks, and more.
## Model Management Best Practices

### Model Creation and Publishing

**1. Model Naming Conventions**

- Use clear, descriptive names
- Include the model type and main functionality
- Avoid special characters and spaces
- Examples: `bert-chinese-sentiment-analysis`, `resnet50-image-classification`
**2. Model Description Writing**

````markdown
# Model Name

## Function Description
- Main Purpose: Text Sentiment Analysis
- Supported Languages: Chinese
- Input Format: Text String
- Output Format: Sentiment Labels (Positive/Negative/Neutral)

## Technical Details
- Framework: PyTorch
- Pre-trained Model: BERT-base-chinese
- Model Size: 110M parameters
- Inference Speed: ~100 ms per sentence

## Usage Examples
```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="username/bert-chinese-sentiment")
result = classifier("This product is very good!")
```

## Notes
- Only supports Chinese text
- Recommended input length: no more than 512 characters
- Requires the transformers library
````
**3. Model Configuration File Optimization**
```yaml
# model-config.yaml
model-name: bert-chinese-sentiment
version: 1.0.0
framework: pytorch
task: text-classification

tags:
  - chinese
  - sentiment-analysis
  - bert
  - nlp

dependencies:
  - torch>=1.9.0
  - transformers>=4.20.0
  - numpy>=1.21.0

model-info:
  architecture: bert-base-chinese
  parameters: 110M
  max-input-length: 512

performance:
  inference-time: 100ms
  accuracy: 0.92
  f1-score: 0.91

usage-examples:
  - input: "This product is very good!"
    output: "Positive"
  - input: "Quality is too poor"
    output: "Negative"
```
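Keeping the configuration machine-readable pays off when you load and sanity-check it in code. A minimal sketch using PyYAML, assuming the file name and keys from the example above:

```python
# Minimal sketch: load model-config.yaml and check a few required keys.
# Assumes PyYAML is installed; key names follow the example config above.
import yaml

with open("model-config.yaml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

required_keys = ["model-name", "version", "framework", "task"]
missing = [key for key in required_keys if key not in config]
if missing:
    raise ValueError(f"model-config.yaml is missing required keys: {missing}")

print(f"Loaded {config['model-name']} v{config['version']} ({config['framework']})")
```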
### Model Version Management

**1. Version Number Conventions**

- Use semantic versioning: `Major.Minor.Patch`
- Major version: incompatible API changes
- Minor version: backward-compatible new functionality
- Patch version: backward-compatible bug fixes
**2. Version Update Strategy**

```bash
# Patch release (bug fixes)
git tag v1.0.1
git push origin v1.0.1

# Minor release (new features)
git tag v1.1.0
git push origin v1.1.0

# Major release (breaking changes)
git tag v2.0.0
git push origin v2.0.0
```
**3. Changelog Maintenance**

```markdown
# Changelog

## [1.1.0] - 2024-01-15
### Added
- Support for batch inference
- Quantized model version
### Changed
- Optimized inference speed
- Updated dependency library versions
- Optimized model structure
### Fixed
- Fixed long-text processing issues
- Resolved memory leak problems

## [1.0.1] - 2024-01-01
### Fixed
- Fixed input validation issues
- Resolved encoding problems
```
## Dataset Management Best Practices

### Dataset Organization

**1. Directory Structure Standards**

```
dataset-name/
├── README.md
├── data/
│   ├── train/
│   ├── validation/
│   └── test/
├── metadata.json
├── schema.json
└── examples/
    ├── sample1.jpg
    ├── sample2.jpg
    └── sample3.jpg
```
**2. Metadata File Standards**

```json
{
  "name": "chinese-text-classification",
  "version": "1.0.0",
  "description": "Chinese Text Classification Dataset",
  "license": "MIT",
  "creator": "username",
  "created_date": "2024-01-01",
  "last_updated": "2024-01-15",
  "statistics": {
    "total_samples": 10000,
    "train_samples": 8000,
    "validation_samples": 1000,
    "test_samples": 1000,
    "classes": 5
  },
  "format": {
    "input_type": "text",
    "output_type": "label",
    "encoding": "utf-8"
  },
  "quality_metrics": {
    "completeness": 0.98,
    "consistency": 0.95,
    "accuracy": 0.92
  }
}
```
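A quick programmatic check keeps the metadata honest as the dataset evolves. A small sketch that assumes the field names from the example `metadata.json` above:

```python
# Minimal sketch: sanity-check metadata.json against the declared sample counts.
# Field names follow the example metadata above; adapt them to your dataset.
import json

with open("metadata.json", "r", encoding="utf-8") as f:
    metadata = json.load(f)

stats = metadata["statistics"]
split_total = stats["train_samples"] + stats["validation_samples"] + stats["test_samples"]
assert split_total == stats["total_samples"], (
    f"Split counts ({split_total}) do not match total_samples ({stats['total_samples']})"
)
print(f"{metadata['name']} v{metadata['version']}: {stats['total_samples']} samples OK")
```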
**3. Data Quality Assurance**

```python
# Data quality check script
from typing import Dict

import pandas as pd


def check_data_quality(data: pd.DataFrame) -> Dict[str, float]:
    """Check data quality."""
    quality_metrics = {}

    # Completeness: fraction of non-null cells
    quality_metrics['completeness'] = 1 - data.isnull().sum().sum() / (data.shape[0] * data.shape[1])

    # Consistency check
    quality_metrics['consistency'] = check_data_consistency(data)

    # Accuracy check
    quality_metrics['accuracy'] = check_data_accuracy(data)

    return quality_metrics


def check_data_consistency(data: pd.DataFrame) -> float:
    """Check data consistency (placeholder: fraction of non-duplicate rows)."""
    # Replace with project-specific consistency rules (value ranges, cross-field checks, ...)
    return 1 - data.duplicated().sum() / len(data)


def check_data_accuracy(data: pd.DataFrame) -> float:
    """Check data accuracy (placeholder)."""
    # Accuracy usually needs ground truth or manual review; return a reviewed estimate here
    return 1.0
```
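Example usage of the checks above; the consistency and accuracy values depend on the placeholder logic, which you would replace with your own rules:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["good product", "bad quality", None, "good product"],
    "label": ["positive", "negative", "neutral", "positive"],
})
print(check_data_quality(df))
# -> completeness 0.875, consistency 0.75, accuracy 1.0 (with the placeholder checks above)
```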
### Dataset Documentation

**1. README Template**

```markdown
# Dataset Name

## Overview
Briefly describe the dataset's content, purpose, and characteristics.

## Data Source
Explain the data source, collection methods, and collection time.

## Data Format
Describe the data format, structure, and fields in detail.

## Usage Instructions
Provide examples of data loading, preprocessing, and usage.

## Quality Assessment
Summarize data quality assessment results and known caveats.

## License
State the data usage license and restrictions.

## Citation
If the dataset comes from a research paper, provide citation information.

## Contact
Provide contact information for the dataset maintainers.
```
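For the Usage Instructions section, a short runnable snippet is usually enough. A minimal sketch, assuming the dataset ships a CSV under `data/train/` as in the directory layout above (adjust the path and columns to your data):

```python
import pandas as pd

# Load the training split (the path is an assumption based on the directory layout above)
train = pd.read_csv("data/train/train.csv")
print(train.shape)
print(train.head())
```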
## Space Application Best Practices

### Application Architecture Design

**1. Modular Design**

```python
# app.py
from flask import Flask, request, jsonify
from modules.preprocessor import DataPreprocessor
from modules.model import ModelWrapper
from modules.postprocessor import ResultPostprocessor

app = Flask(__name__)

# Initialize components
preprocessor = DataPreprocessor()
model = ModelWrapper()
postprocessor = ResultPostprocessor()


@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Data preprocessing
        input_data = request.json
        processed_data = preprocessor.process(input_data)

        # Model inference
        raw_result = model.predict(processed_data)

        # Result post-processing
        final_result = postprocessor.process(raw_result)

        return jsonify({
            'success': True,
            'result': final_result
        })
    except Exception as e:
        return jsonify({
            'success': False,
            'error': str(e)
        }), 400


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```
**2. Configuration File Management**

```yaml
# app.yaml
name: text-classification-app
version: 1.0.0
description: Text Classification Application
runtime: python3.9
entrypoint: app:app

env_variables:
  MODEL_PATH: /app/models/model.pkl
  MAX_INPUT_LENGTH: 512
  BATCH_SIZE: 32

resources:
  cpu: 2
  memory: 4Gi
  gpu: 1

dependencies:
  - flask==2.3.0
  - torch==2.0.0
  - transformers==4.30.0

health_check:
  path: /health
  interval: 30s
  timeout: 10s
  retries: 3
```
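The `health_check` block assumes the application actually serves a `/health` route. A minimal sketch to add to the `app.py` shown above:

```python
# Health-check endpoint matching health_check.path in app.yaml
@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'ok'}), 200
```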
### Performance Optimization

**1. Caching Strategy**

```python
import functools
import pickle

import redis

# Redis client (adjust host/port/db to your deployment)
redis_client = redis.Redis(host='localhost', port=6379, db=0)


# Redis cache decorator
def cache_result(expire_time=3600):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Generate cache key
            # Note: str() hashing is process-specific; use a stable hash (e.g. hashlib)
            # if the cache must survive restarts.
            cache_key = f"{func.__name__}:{hash(str(args) + str(kwargs))}"

            # Try to get from cache
            cached_result = redis_client.get(cache_key)
            if cached_result is not None:
                return pickle.loads(cached_result)

            # Execute function and cache result
            result = func(*args, **kwargs)
            redis_client.setex(cache_key, expire_time, pickle.dumps(result))
            return result
        return wrapper
    return decorator


@cache_result(expire_time=1800)
def expensive_computation(input_data):
    # Expensive computation logic goes here
    ...
```
**2. Asynchronous Processing**

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


class AsyncModelService:
    def __init__(self):
        self.executor = ThreadPoolExecutor(max_workers=4)

    async def batch_predict(self, input_list):
        """Asynchronous batch prediction"""
        tasks = []
        for input_data in input_list:
            task = asyncio.create_task(self.predict_single(input_data))
            tasks.append(task)

        results = await asyncio.gather(*tasks)
        return results

    async def predict_single(self, input_data):
        """Single prediction task"""
        loop = asyncio.get_event_loop()
        result = await loop.run_in_executor(
            self.executor,
            self._predict,
            input_data
        )
        return result

    def _predict(self, input_data):
        """Actual prediction logic"""
        # Model inference code goes here
        ...
```
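A minimal usage sketch, assuming `_predict` has been filled in with real inference code:

```python
async def main():
    service = AsyncModelService()
    results = await service.batch_predict(["text one", "text two", "text three"])
    print(results)

asyncio.run(main())
```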
## Notebook Development Best Practices

### Code Organization

**1. Cell Structure**

```python
# 1. Imports and Configuration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')


# 2. Data Loading
def load_dataset():
    """Load dataset"""
    data = pd.read_csv('dataset.csv')
    print(f"Dataset shape: {data.shape}")
    return data


# 3. Data Exploration
def explore_data(data):
    """Data exploration analysis"""
    print("Basic information:")
    print(data.info())
    print("\nStatistical summary:")
    print(data.describe())
    print("\nMissing values:")
    print(data.isnull().sum())


# 4. Data Preprocessing
def preprocess_data(data):
    """Data preprocessing"""
    # Handle missing values in numeric columns
    data = data.fillna(data.median(numeric_only=True))
    # Feature engineering
    data['feature_new'] = data['feature1'] * data['feature2']
    return data


# 5. Model Training
def train_model(X, y):
    """Model training"""
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return model, X_test, y_test


# 6. Result Evaluation
def evaluate_model(model, X_test, y_test):
    """Model evaluation"""
    from sklearn.metrics import classification_report, confusion_matrix

    y_pred = model.predict(X_test)
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    # Plot confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d')
    plt.title('Confusion Matrix')
    plt.show()
```
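A short final cell can then run the whole pipeline in order. The `label` column below is a placeholder target; replace it with your dataset's actual target column:

```python
# 7. Run the pipeline end to end
data = load_dataset()
explore_data(data)
data = preprocess_data(data)

X = data.drop(columns=['label'])   # 'label' is a placeholder target column
y = data['label']

model, X_test, y_test = train_model(X, y)
evaluate_model(model, X_test, y_test)
```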
**2. Function and Class Design**

```python
import pandas as pd


class DataProcessor:
    """Data processing class"""

    def __init__(self, config):
        self.config = config
        self.data = None

    def load(self, file_path):
        """Load data"""
        if file_path.endswith('.csv'):
            self.data = pd.read_csv(file_path)
        elif file_path.endswith('.json'):
            self.data = pd.read_json(file_path)
        else:
            raise ValueError("Unsupported file format")
        return self  # return self so calls can be chained

    def clean(self):
        """Data cleaning"""
        if self.data is None:
            raise ValueError("Please load data first")
        # Remove duplicate rows
        self.data = self.data.drop_duplicates()
        # Fill missing numeric values with the configured statistic
        numeric_cols = self.data.select_dtypes(include='number').columns
        if self.config.get('fill_method', 'median') == 'mean':
            fill_values = self.data[numeric_cols].mean()
        else:
            fill_values = self.data[numeric_cols].median()
        self.data[numeric_cols] = self.data[numeric_cols].fillna(fill_values)
        return self

    def transform(self):
        """Data transformation"""
        if self.data is None:
            raise ValueError("Please load data first")
        # Feature engineering
        for feature in self.config.get('features', []):
            if feature['type'] == 'categorical':
                self.data = pd.get_dummies(self.data, columns=[feature['name']])
            elif feature['type'] == 'numerical':
                self.data[feature['name']] = pd.to_numeric(self.data[feature['name']])
        return self.data


# Usage example
config = {
    'fill_method': 'mean',
    'features': [
        {'name': 'category', 'type': 'categorical'},
        {'name': 'price', 'type': 'numerical'}
    ]
}
processor = DataProcessor(config)
data = processor.load('data.csv').clean().transform()
```
### Experiment Management

**1. Experiment Tracking**

```python
import mlflow
import pandas as pd

# Set up MLflow
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("Text Classification Experiment")


def run_experiment(params):
    """Run a single experiment"""
    with mlflow.start_run():
        # Log parameters
        mlflow.log_params(params)

        # Train model (train_model_with_params is your project-specific training helper)
        model = train_model_with_params(params)

        # Evaluate model (should return a dict of metric name -> value)
        metrics = evaluate_model(model)

        # Log metrics
        mlflow.log_metrics(metrics)

        # Save model
        mlflow.sklearn.log_model(model, "model")

        # Log data version
        mlflow.log_artifact("data.csv")

        return model, metrics


# Experiment parameters
experiment_params = [
    {'model': 'random_forest', 'n_estimators': 100},
    {'model': 'random_forest', 'n_estimators': 200},
    {'model': 'xgboost', 'n_estimators': 100}
]

# Run experiments
results = []
for params in experiment_params:
    model, metrics = run_experiment(params)
    results.append({'params': params, 'metrics': metrics})

# Compare results
results_df = pd.DataFrame(results)
print(results_df)
```
**2. Version Control**

```python
import os
import pickle

import pandas as pd


# Save checkpoint
def save_checkpoint(notebook, name, description=""):
    """Save a checkpoint.

    `notebook` is assumed to expose `user_ns` (variables) and `cell_outputs`,
    e.g. a thin wrapper around the running notebook session.
    """
    checkpoint = {
        'name': name,
        'description': description,
        'timestamp': pd.Timestamp.now(),
        # Note: pickling fails if user_ns contains non-picklable objects; filter if needed
        'variables': notebook.user_ns.copy(),
        'outputs': notebook.cell_outputs.copy()
    }

    # Save to file
    checkpoint_file = f"checkpoints/{name}.pkl"
    os.makedirs("checkpoints", exist_ok=True)
    with open(checkpoint_file, 'wb') as f:
        pickle.dump(checkpoint, f)
    print(f"Checkpoint saved: {checkpoint_file}")


# Restore checkpoint
def restore_checkpoint(notebook, name):
    """Restore a checkpoint"""
    checkpoint_file = f"checkpoints/{name}.pkl"
    if not os.path.exists(checkpoint_file):
        print(f"Checkpoint doesn't exist: {checkpoint_file}")
        return False

    with open(checkpoint_file, 'rb') as f:
        checkpoint = pickle.load(f)

    # Restore variables
    notebook.user_ns.update(checkpoint['variables'])
    print(f"Checkpoint restored: {name}")
    print(f"Description: {checkpoint['description']}")
    print(f"Time: {checkpoint['timestamp']}")
    return True
```
## Collaboration Development Best Practices

### Team Collaboration

**1. Code Standards**

```python
# Code style guide
"""
Code Style Standards:
1. Use meaningful variable names and function names
2. Add appropriate comments and docstrings
3. Follow PEP 8 code style
4. Use type hints
5. Write unit tests
"""
from typing import Dict, List, Optional

import numpy as np


def process_text_data(
    text_list: List[str],
    max_length: int = 512,
    tokenizer: Optional[object] = None
) -> Dict[str, np.ndarray]:
    """
    Process text data.

    Args:
        text_list: List of texts
        max_length: Maximum length
        tokenizer: Tokenizer object

    Returns:
        Dictionary containing processed data

    Raises:
        ValueError: When input parameters are invalid
    """
    if not text_list:
        raise ValueError("Text list cannot be empty")

    # Processing logic goes here
    processed_data = {}

    return processed_data
```
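Point 5 above calls for unit tests. A minimal pytest sketch for `process_text_data` (the assertions will tighten once the processing logic is actually implemented):

```python
import pytest


def test_process_text_data_rejects_empty_list():
    with pytest.raises(ValueError):
        process_text_data([])


def test_process_text_data_returns_dict():
    result = process_text_data(["sample text"], max_length=128)
    assert isinstance(result, dict)
```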
**2. Version Control Workflow**

```bash
# Branch management strategy
# main:      main branch, keep stable
# develop:   development branch, integrate features
# feature/*: feature branches, develop new features
# hotfix/*:  hotfix branches, fix urgent issues

# Create feature branch
git checkout -b feature/text-classification

# Commit after development is complete
git add .
git commit -m "feat: Add text classification feature

- Implement basic text classification model
- Add data preprocessing functionality
- Integrate evaluation metrics"

# Push to remote
git push origin feature/text-classification

# Create a merge request (MR/PR) on GitLab/GitHub
```
### Documentation Management

**1. Project Documentation Structure**

```
project/
├── README.md
├── docs/
│   ├── api/
│   ├── user-guide/
│   └── development/
├── examples/
├── tests/
└── requirements.txt
```
**2. Documentation Writing Standards**

```markdown
# Documentation Title

## Overview
Briefly describe the function or module purpose.

## Features
- Feature 1: Description
- Feature 2: Description

## Usage
Provide detailed usage instructions and example code.

## API Reference
Document API interfaces and parameters in detail.

## Notes
Explain usage considerations and limitations.

## FAQ
List common questions and solutions.

## Changelog
Record version update information.
```
## Security Best Practices

### Data Security

**1. Sensitive Information Protection**

```python
# Environment variable management
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Get sensitive configuration
API_KEY = os.getenv('API_KEY')
DATABASE_URL = os.getenv('DATABASE_URL')

# Validate configuration
if not API_KEY:
    raise ValueError("API_KEY environment variable not set")
```

```bash
# .env file (don't commit to version control)
API_KEY=your_api_key_here
DATABASE_URL=your_database_url_here
```
**2. Data Validation**

```python
from pydantic import BaseModel, validator


class InputData(BaseModel):
    """Input data model"""
    text: str
    max_length: int = 512

    @validator('text')
    def validate_text(cls, v):
        if not v or len(v.strip()) == 0:
            raise ValueError('Text cannot be empty')
        if len(v) > 10000:
            raise ValueError('Text length cannot exceed 10000 characters')
        return v.strip()

    @validator('max_length')
    def validate_max_length(cls, v):
        if v <= 0 or v > 10000:
            raise ValueError('Maximum length must be between 1 and 10000')
        return v


# Usage example
try:
    data = InputData(text="Test text", max_length=100)
    print("Data validation passed")
except ValueError as e:
    print(f"Data validation failed: {e}")
```
### Access Control

**1. Permission Management**

```python
from functools import wraps
from flask import request, jsonify


def require_auth(f):
    """Authentication decorator"""
    @wraps(f)
    def decorated_function(*args, **kwargs):
        token = request.headers.get('Authorization')
        if not token:
            return jsonify({'error': 'Missing authentication token'}), 401
        if not validate_token(token):
            return jsonify({'error': 'Invalid authentication token'}), 401
        return f(*args, **kwargs)
    return decorated_function


def require_role(role):
    """Role validation decorator"""
    def decorator(f):
        @wraps(f)
        def decorated_function(*args, **kwargs):
            user_role = get_user_role(request.headers.get('Authorization'))
            if user_role != role:
                return jsonify({'error': 'Insufficient permissions'}), 403
            return f(*args, **kwargs)
        return decorated_function
    return decorator


# Usage example
@app.route('/admin', methods=['GET'])
@require_auth
@require_role('admin')
def admin_panel():
    return jsonify({'message': 'Welcome to the admin panel'})
```
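`validate_token` and `get_user_role` above are placeholders for your own authentication backend. A minimal in-memory sketch, for illustration only (not production-ready):

```python
# Illustration only: a static token table standing in for a real auth backend
TOKENS = {
    "token-admin-123": "admin",
    "token-user-456": "user",
}


def validate_token(token):
    """Return True if the token is known."""
    return token in TOKENS


def get_user_role(token):
    """Return the role associated with a token, or None if unknown."""
    return TOKENS.get(token)
```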
## Performance Monitoring Best Practices

### Monitoring Metrics

**1. System Performance Monitoring**

```python
import time
from datetime import datetime

import psutil


class PerformanceMonitor:
    """Performance monitor"""

    def __init__(self):
        self.metrics = []

    def collect_metrics(self):
        """Collect performance metrics"""
        metrics = {
            'timestamp': datetime.now(),
            'cpu_percent': psutil.cpu_percent(interval=1),
            'memory_percent': psutil.virtual_memory().percent,
            'disk_usage': psutil.disk_usage('/').percent
        }
        self.metrics.append(metrics)
        return metrics

    def get_summary(self):
        """Get performance summary"""
        if not self.metrics:
            return {}

        cpu_values = [m['cpu_percent'] for m in self.metrics]
        memory_values = [m['memory_percent'] for m in self.metrics]

        return {
            'cpu_avg': sum(cpu_values) / len(cpu_values),
            'cpu_max': max(cpu_values),
            'memory_avg': sum(memory_values) / len(memory_values),
            'memory_max': max(memory_values)
        }


# Use the monitor
monitor = PerformanceMonitor()

# Collect metrics periodically
while True:
    metrics = monitor.collect_metrics()
    print(f"CPU: {metrics['cpu_percent']}%, Memory: {metrics['memory_percent']}%")
    time.sleep(60)  # collect once per minute
```
**2. Application Performance Monitoring**

```python
import time
from datetime import datetime
from functools import wraps


def performance_monitor(func):
    """Performance monitoring decorator"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        try:
            result = func(*args, **kwargs)
            execution_time = time.time() - start_time
            # Log performance metrics
            log_performance(func.__name__, execution_time, 'success')
            return result
        except Exception as e:
            execution_time = time.time() - start_time
            # Log error and performance metrics
            log_performance(func.__name__, execution_time, 'error', str(e))
            raise
    return wrapper


def log_performance(func_name, execution_time, status, error=None):
    """Write a performance log entry"""
    log_entry = {
        'function': func_name,
        'execution_time': execution_time,
        'status': status,
        'timestamp': datetime.now(),
        'error': error
    }
    # Write to a log file or database
    print(f"Performance log: {log_entry}")


# Usage example
@performance_monitor
def process_large_dataset(data):
    """Process large dataset"""
    time.sleep(2)  # simulate processing time
    return len(data)


# Test performance monitoring
result = process_large_dataset(range(1000))
```
## Summary

Following these best practices helps you:

- **Improve development efficiency** through standardized processes and tools
- **Ensure code quality** through code standards and testing
- **Enhance team collaboration** through clear documentation and version control
- **Improve system performance** through optimization and monitoring
- **Ensure security** through security measures and permission management

Remember that best practices keep evolving: as technology develops and your team gains experience, revisit and refine them, and adapt them to the specific needs of each project.

Most importantly, build these practices into your daily development work until they become habits; that is when they deliver their full benefit.