Frequently Asked Questions
This section collects common questions and solutions encountered when using RecIS.
Installation and Environment
Q: How can I verify that RecIS was installed successfully?
A: Run the following verification script:
import recis
import torch
print(f"RecIS version: {recis.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
# Test core functionality
from recis.nn import DynamicEmbedding, EmbeddingOption
try:
    emb_opt = EmbeddingOption(embedding_dim=16)
    embedding = DynamicEmbedding(emb_opt)
    ids = torch.LongTensor([1, 2, 3])
    output = embedding(ids)
    print(f"Embedding output shape: {output.shape}")
    print("✅ RecIS installation verification successful!")
except Exception as e:
    print(f"❌ RecIS installation verification failed: {e}")
Data Processing
Q: How do I handle variable-length sequence data?
A: Use RaggedTensor or sequence processing operations:
from recis.ragged.tensor import RaggedTensor
from recis.features.op import SequenceTruncate
# Method 1: Using RaggedTensor
values = torch.LongTensor([1, 2, 3, 4, 5, 6])
offsets = torch.LongTensor([0, 2, 4, 6]) # Three sequences: [1,2], [3,4], [5,6]
ragged_tensor = RaggedTensor(values, offsets)
# Method 2: Using sequence processing operations
from recis.features.feature import Feature
from recis.features.op import SelectField
sequence_feature = Feature("user_history", [
    SelectField("history_ids", dtype=torch.long, from_dict=True),
    SequenceTruncate(
        seq_len=64,
        check_length=True,
        truncate=True,
        truncate_side="left",
        n_dims=2,
    ),
])
Q: How do I customize data preprocessing?
A: Pass a custom function through the transform_fn parameter:
def custom_transform(batch):
    # Custom preprocessing logic
    batch['processed_feature'] = process_feature(batch['raw_feature'])

    # Data type conversion
    for key in ['user_id', 'item_id']:
        if key in batch:
            batch[key] = batch[key].long()

    # Data normalization (small epsilon avoids division by zero for constant columns)
    if 'score' in batch:
        batch['score'] = (batch['score'] - batch['score'].mean()) / (batch['score'].std() + 1e-8)

    return batch
dataset = OdpsDataset(
batch_size=1024,
transform_fn=custom_transform
)
Model Training
Q: What should I do when NaN or Inf values appear during training?
A: Common causes and solutions:
Learning rate too high:
# Reduce the learning rate, e.g. from 0.001 to 0.0001
sparse_optimizer = SparseAdamW(sparse_params, lr=0.0001)
dense_optimizer = AdamW(model.parameters(), lr=0.0001)
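If lowering the learning rate alone is not enough, a short warmup can also prevent early-training NaNs. Below is a minimal sketch using plain PyTorch's LambdaLR applied to the dense_optimizer above; the warmup length is an illustrative choice, not a RecIS default:
from torch.optim.lr_scheduler import LambdaLR

warmup_steps = 1000  # illustrative value
scheduler = LambdaLR(
    dense_optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # linear ramp up to the base lr
)
# Call scheduler.step() once per training step, after optimizer.step()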
Gradient explosion:
# Add gradient clipping after backward propagation
import torch.nn as nn

loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
Numerical instability:
# Check input data for NaN / Inf
def check_tensor(tensor, name):
    if torch.isnan(tensor).any():
        print(f"NaN detected in {name}")
    if torch.isinf(tensor).any():
        print(f"Inf detected in {name}")

# Add checks in the model
def forward(self, batch):
    for key, value in batch.items():
        check_tensor(value, key)
    # ... model computation
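PyTorch's autograd anomaly detection can also point to the exact operation that produced a NaN during the backward pass; it adds noticeable overhead, so enable it only while debugging:
import torch

# Enable only while debugging; every backward pass is then checked for NaN outputs
torch.autograd.set_detect_anomaly(True)

# A backward pass that produces NaNs now raises a RuntimeError naming the
# responsible autograd function, with a traceback to where it was created.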
Q: How do I handle class imbalance?
A: Several solutions:
Weighted loss function:
import torch
import torch.nn as nn

# Weight the positive class by the negative/positive ratio; pos_weight must be a tensor
pos_weight = torch.tensor([negative_samples / positive_samples])
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
Sampling strategies:
from torch.utils.data import WeightedRandomSampler

# Oversample the minority (positive) class
sample_weights = [1.0 if label == 0 else 5.0 for label in labels]
sampler = WeightedRandomSampler(sample_weights, len(sample_weights))
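The sampler then replaces shuffling in the DataLoader. This is a plain-PyTorch sketch; train_dataset is a placeholder for any map-style dataset:
from torch.utils.data import DataLoader

# sampler and shuffle are mutually exclusive; the sampler drives the example order
loader = DataLoader(train_dataset, batch_size=1024, sampler=sampler)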
Evaluation metric adjustment:
# Focus on AUC, F1, precision, and recall instead of accuracy alone
from recis.metrics import AUROC

auc_metric = AUROC(num_thresholds=200)
# Also calculate precision, recall, and F1
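Precision, recall, and F1 can be computed from the confusion counts with plain PyTorch; the helper below is a sketch, and the 0.5 decision threshold is an illustrative assumption:
import torch

def precision_recall_f1(logits, labels, threshold=0.5):
    preds = (torch.sigmoid(logits) > threshold).long()
    tp = ((preds == 1) & (labels == 1)).sum().float()
    fp = ((preds == 1) & (labels == 0)).sum().float()
    fn = ((preds == 0) & (labels == 1)).sum().float()
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return precision, recall, f1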
Distributed Training
Q: How do I configure multi-node, multi-GPU training?
A: Complete distributed training configuration:
Environment variable setup:
# Master node
export MASTER_ADDR="192.168.1.100"
export MASTER_PORT="12355"
export WORLD_SIZE=8
export RANK=0
export LOCAL_RANK=0

# Other nodes: increment RANK sequentially
export RANK=1
Code configuration:
import os

import torch
import torch.distributed as dist

def setup_distributed():
    # Initialize the distributed process group
    dist.init_process_group(backend='nccl')
    # Bind this process to its local GPU
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    torch.cuda.set_device(local_rank)
    return local_rank

# Place the model on the local GPU
local_rank = setup_distributed()
model = model.cuda(local_rank)
Launch script:
# Launch with torchrun: 4 GPUs per node, 2 nodes (run on the master node)
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
    --master_addr="192.168.1.100" --master_port=12355 \
    train.py
# On the second node, run the same command with --node_rank=1
Q: How do I synchronize metrics in distributed training?
A: Use metrics that support distributed synchronization:
import torch.distributed as dist

from recis.metrics import AUROC

# Enable distributed synchronization
auc_metric = AUROC(
    num_thresholds=200,
    dist_sync_on_step=True,  # Synchronize across workers on every step
)

# Or synchronize manually
def sync_tensor(tensor):
    if dist.is_initialized():
        dist.all_reduce(tensor, op=dist.ReduceOp.AVG)
    return tensor
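For example, a per-worker value can be averaged across all ranks before logging it; the loss value and device below are illustrative:
import torch

local_loss = torch.tensor(0.37, device="cuda")   # illustrative per-worker value
global_loss = sync_tensor(local_loss)            # every rank sees the same averaged value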
Performance Issues
Q: How do I speed up slow training?
A: Performance tuning recommendations:
Profile analysis:
import torch

from recis.hooks import ProfilerHook

profiler_hook = ProfilerHook(
    output_dir="./profile_logs",
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./profile_logs'),
)
Compute optimization:
# Enable cuDNN benchmarking
torch.backends.cudnn.benchmark = True

# Use compilation optimization (PyTorch 2.0+)
model = torch.compile(model)
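Mixed-precision training is another common speed-up on recent GPUs. Below is a minimal sketch using plain PyTorch's autocast and GradScaler, assuming the usual model/optimizer/dataloader objects from the surrounding training loop; this is not a RecIS-specific API:
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # run the forward pass in mixed precision
        loss = model(batch)               # assumes the model returns the training loss
    scaler.scale(loss).backward()         # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()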
Troubleshooting
Q: What should I do if I encounter a CUDA-related error?
A: Common CUDA errors and their solutions:
CUDA out of memory:
# Reduce the batch size, e.g. from 1024 to 512
batch_size = 512

# Clear the GPU cache
torch.cuda.empty_cache()
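If the smaller batch hurts model quality, gradient accumulation keeps the effective batch size while lowering peak memory. This is a plain-PyTorch sketch; accum_steps and the model/optimizer/dataloader names are illustrative:
accum_steps = 2  # 2 x 512 matches the original effective batch size of 1024

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(batch) / accum_steps  # scale so accumulated gradients match one large batch
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()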
CUDA device mismatch:
# Ensure all tensors are on the same device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

for key, value in batch.items():
    if torch.is_tensor(value):
        batch[key] = value.to(device)
Q: How do I debug a model that is not converging?
A: Debugging steps:
Check the data:
# Check the label and feature distributions
print("Label distribution:", torch.bincount(labels))
print("Feature statistics:", features.mean(), features.std())
Check the model:
# Check the gradient norms
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm()
        print(f"{name}: grad_norm={grad_norm:.6f}")
Adjust hyperparameters:
# Try different learning rates
learning_rates = [0.1, 0.01, 0.001, 0.0001]

# Try different optimizers
optimizers = [
    torch.optim.Adam(params, lr=0.001),
    torch.optim.AdamW(params, lr=0.001),
    torch.optim.SGD(params, lr=0.01, momentum=0.9),
]
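Before sweeping hyperparameters, a useful sanity check is to overfit a single small batch; if the loss does not drop toward zero, the issue is in the model or data pipeline rather than the hyperparameters. This is a generic plain-PyTorch sketch with illustrative names:
small_batch = next(iter(dataloader))      # one fixed batch

for step in range(200):
    optimizer.zero_grad()
    loss = model(small_batch)             # assumes the model returns the training loss
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: loss={loss.item():.4f}")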
Need Help?
If none of the above solutions resolve your issue:
View Logs: Carefully review the error log and stack trace.
Search Documentation: Search for relevant keywords in the documentation.
View Examples: Refer to similar example code.
Submit Issue: Submit a detailed description of the issue on GitHub.
Community Help: Join the technical discussion group for help.
Question Template
When seeking help, please provide the following information:
**Environment Information**
- RecIS Version:
- PyTorch Version:
- CUDA Version:
- Operating System:
**Problem Description**
- Specific Issue:
- Expected Behavior:
- Actual Behavior:
**Reproduction Steps**
1. Step 1
2. Step 2
3. ...
**Error Message**
```
Full Error Log
```
**Relevant Code**
```python
Minimal Reproducible Code Example
```