Frequently Asked Questions ========================== This section collects common questions and solutions encountered when using RecIS. Installation and Environment ---------------------------- **Q: How to verify if RecIS installation is successful?** A: Run the following verification script: .. code-block:: python import recis import torch print(f"RecIS version: {recis.__version__}") print(f"PyTorch version: {torch.__version__}") print(f"CUDA available: {torch.cuda.is_available()}") # Test core functionality from recis.nn import DynamicEmbedding, EmbeddingOption try: emb_opt = EmbeddingOption(embedding_dim=16) embedding = DynamicEmbedding(emb_opt) ids = torch.LongTensor([1, 2, 3]) output = embedding(ids) print(f"Embedding output shape: {output.shape}") print("✅ RecIS installation verification successful!") except Exception as e: print(f"❌ RecIS installation verification failed: {e}") Data Processing --------------- **Q: How to handle variable-length sequence data?** A: Use RaggedTensor or sequence processing operations: .. code-block:: python from recis.ragged.tensor import RaggedTensor from recis.features.op import SequenceTruncate # Method 1: Using RaggedTensor values = torch.LongTensor([1, 2, 3, 4, 5, 6]) offsets = torch.LongTensor([0, 2, 4, 6]) # Three sequences: [1,2], [3,4], [5,6] ragged_tensor = RaggedTensor(values, offsets) # Method 2: Using sequence processing operations from recis.features.feature import Feature from recis.features.op import SelectField sequence_feature = Feature("user_history", [ SelectField("history_ids", dtype=torch.long, from_dict=True), SequenceTruncate( seq_len=64, check_length=True, truncate=True, truncate_side="left", n_dims=2) ]) **Q: How to customize data preprocessing?** A: Through the transform_fn parameter: .. code-block:: python def custom_transform(batch): # Custom preprocessing logic batch['processed_feature'] = process_feature(batch['raw_feature']) # Data type conversion for key in ['user_id', 'item_id']: if key in batch: batch[key] = batch[key].long() # Data normalization if 'score' in batch: batch['score'] = (batch['score'] - batch['score'].mean()) / batch['score'].std() return batch dataset = OdpsDataset( batch_size=1024, transform_fn=custom_transform ) Model Training -------------- **Q: What to do when NaN or Inf appears during training?** A: Common causes and solutions: 1. **Learning rate too high**: .. code-block:: python # Reduce learning rate sparse_optimizer = SparseAdamW(sparse_params, lr=0.0001) # From 0.001 to 0.0001 dense_optimizer = AdamW(model.parameters(), lr=0.0001) 2. **Gradient explosion**: .. code-block:: python # Add gradient clipping import torch.nn as nn # Add after backward propagation loss.backward() nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) optimizer.step() 3. **Numerical instability**: .. code-block:: python # Check input data def check_tensor(tensor, name): if torch.isnan(tensor).any(): print(f"NaN detected in {name}") if torch.isinf(tensor).any(): print(f"Inf detected in {name}") # Add checks in model def forward(self, batch): for key, value in batch.items(): check_tensor(value, key) # ... model computation **Q: How to handle class imbalance problems?** A: Several solutions: 1. **Weighted loss function**: .. code-block:: python import torch.nn as nn # Calculate class weights pos_weight = (negative_samples / positive_samples) loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight) 2. **Sampling strategies**: .. code-block:: python from torch.utils.data import WeightedRandomSampler # Create sampling weights sample_weights = [1.0 if label == 0 else 5.0 for label in labels] sampler = WeightedRandomSampler(sample_weights, len(sample_weights)) 3. **Evaluation metric adjustment**: .. code-block:: python # Focus on F1, precision, recall instead of just accuracy from recis.metrics import AUROC auc_metric = AUROC(num_thresholds=200) # Also calculate precision, recall, F1 Distributed Training -------------------- **Q: How to configure multi-node multi-GPU training?** A: Complete distributed training configuration: 1. **Environment variable setup**: .. code-block:: bash # Master node export MASTER_ADDR="192.168.1.100" export MASTER_PORT="12355" export WORLD_SIZE=8 export RANK=0 export LOCAL_RANK=0 # Other nodes export RANK=1 # Increment sequentially 2. **Code configuration**: .. code-block:: python import torch.distributed as dist import os def setup_distributed(): # Initialize distributed environment dist.init_process_group(backend='nccl') # Set device local_rank = int(os.environ.get('LOCAL_RANK', 0)) torch.cuda.set_device(local_rank) return local_rank # Wrap model local_rank = setup_distributed() model = model.cuda(local_rank) 3. **Launch script**: .. code-block:: bash # Launch with torchrun torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \ --master_addr="192.168.1.100" --master_port=12355 \ train.py **Q: How to synchronize metrics in distributed training?** A: Use metrics that support distributed synchronization: .. code-block:: python from recis.metrics import AUROC # Enable distributed synchronization auc_metric = AUROC( num_thresholds=200, dist_sync_on_step=True # Sync every step ) # Or manual synchronization def sync_tensor(tensor): if dist.is_initialized(): dist.all_reduce(tensor, op=dist.ReduceOp.AVG) return tensor Performance Issues ------------------ **Q: How to optimize slow training speed?** A: Performance tuning recommendations: 1. **Profile analysis**: .. code-block:: python from recis.hooks import ProfilerHook profiler_hook = ProfilerHook( output_dir="./profile_logs", schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2), on_trace_ready=torch.profiler.tensorboard_trace_handler('./profile_logs') ) 2. **Compute Optimization**: .. code-block:: python # Enable cuDNN benchmarking torch.backends.cudnn.benchmark = True # Use compilation optimization model = torch.compile(model) # PyTorch 2.0+ Troubleshooting --------------- **Q: What should I do if I encounter a CUDA-related error?** A: Common CUDA Errors and Solutions: 1. **CUDA out of memory**: .. code-block:: python # Reduce batch size batch_size = 512 # Reduce from 1024 to 512 # Clear GPU cache torch.cuda.empty_cache() 2. **CUDA device mismatch**: .. code-block:: python # Ensure all tensors are on the same device device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') model = model.to(device) for key, value in batch.items(): if torch.is_tensor(value): batch[key] = value.to(device) **Q: How to debug the problem of model not converging?** A: Debugging steps: 1. **Check the data**: .. code-block:: python # Check the data distribution print("Label distribution:", torch.bincount(labels)) print("Feature statistics:", features.mean(), features.std()) 2. **Check the model**: .. code-block:: python # Check the gradients for name, param in model.named_parameters(): if param.grad is not None: grad_norm = param.grad.norm() print(f"{name}: grad_norm={grad_norm:.6f}") 3. **Adjust hyperparameters**: .. code-block:: python # Try different learning rates learning_rates = [0.1, 0.01, 0.001, 0.0001] # Try different optimizers optimizers = [ torch.optim.Adam(params, lr=0.001), torch.optim.AdamW(params, lr=0.001), torch.optim.SGD(params, lr=0.01, momentum=0.9) ] Need Help ? ----------- If none of the above solutions resolve your issue: 1. **View Logs**: Carefully review the error log and stack trace. 2. **Search Documentation**: Search for relevant keywords in the documentation. 3. **View Examples**: Refer to similar example code. 4. **Submit Issue**: Submit a detailed description of the issue on GitHub. 5. **Community Help**: Join the technical discussion group for help. **Question Template** When seeking help, please provide the following information: .. code-block:: text **Environment Information** - RecIS Version: - PyTorch Version: - CUDA Version: - Operating System: **Problem Description** - Specific Issue: - Expected Behavior: - Actual Behavior: **Reproduction Steps** 1. Step 1 2. Step 2 3. ... **Error Message** ``` Full Error Log ``` **Relevant Code** ```python Minimal Reproducible Code Example ```