Frequently Asked Questions

This section collects common questions about using RecIS, together with their solutions.

Installation and Environment

Q: How to verify if RecIS installation is successful?

A: Run the following verification script:

import recis
import torch

print(f"RecIS version: {recis.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Test core functionality
from recis.nn import DynamicEmbedding, EmbeddingOption

try:
    emb_opt = EmbeddingOption(embedding_dim=16)
    embedding = DynamicEmbedding(emb_opt)

    ids = torch.LongTensor([1, 2, 3])
    output = embedding(ids)
    print(f"Embedding output shape: {output.shape}")
    print("✅ RecIS installation verification successful!")
except Exception as e:
    print(f"❌ RecIS installation verification failed: {e}")

Data Processing

Q: How to handle variable-length sequence data?

A: Use RaggedTensor or sequence processing operations:

import torch
from recis.ragged.tensor import RaggedTensor
from recis.features.op import SequenceTruncate

# Method 1: Using RaggedTensor
values = torch.LongTensor([1, 2, 3, 4, 5, 6])
offsets = torch.LongTensor([0, 2, 4, 6])  # Three sequences: [1,2], [3,4], [5,6]
ragged_tensor = RaggedTensor(values, offsets)

# Method 2: Using sequence processing operations
from recis.features.feature import Feature
from recis.features.op import SelectField

sequence_feature = Feature("user_history", [
    SelectField("history_ids", dtype=torch.long, from_dict=True),
    SequenceTruncate(
        seq_len=64,
        check_length=True,
        truncate=True,
        truncate_side="left",
        n_dims=2)
])
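
The values/offsets layout from Method 1 can be illustrated without RecIS. The sketch below is plain Python, not the RecIS implementation, and `split_by_offsets` is a hypothetical helper introduced only for illustration. It assumes the convention suggested by the comment in Method 1: sequence i occupies `values[offsets[i]:offsets[i + 1]]`.

```python
def split_by_offsets(values, offsets):
    """Return the sequences encoded by a flat values list and its row offsets."""
    # Sequence i occupies values[offsets[i]:offsets[i + 1]].
    return [values[start:end] for start, end in zip(offsets[:-1], offsets[1:])]


values = [1, 2, 3, 4, 5, 6, 7]
offsets = [0, 2, 2, 7]  # three sequences with lengths 2, 0, and 5

sequences = split_by_offsets(values, offsets)
lengths = [len(seq) for seq in sequences]

print(sequences)  # [[1, 2], [], [3, 4, 5, 6, 7]]
print(lengths)    # [2, 0, 5]
```

Under this convention an empty sequence is just two equal consecutive offsets, so variable-length data needs no padding values.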

Q: How to customize data preprocessing?

A: Pass a function through the transform_fn parameter:

def custom_transform(batch):
    # Custom preprocessing logic (process_feature is a placeholder for your own function)
    batch['processed_feature'] = process_feature(batch['raw_feature'])

    # Data type conversion
    for key in ['user_id', 'item_id']:
        if key in batch:
            batch[key] = batch[key].long()

    # Data normalization (the epsilon guards against a zero standard deviation)
    if 'score' in batch:
        score = batch['score'].float()
        batch['score'] = (score - score.mean()) / (score.std() + 1e-8)

    return batch

dataset = OdpsDataset(
    batch_size=1024,
    transform_fn=custom_transform
)

Model Training

Q: What to do when NaN or Inf appears during training?

A: Common causes and solutions:

Learning rate too high:

# Reduce learning rate
sparse_optimizer = SparseAdamW(sparse_params, lr=0.0001)  # From 0.001 to 0.0001
dense_optimizer = AdamW(model.parameters(), lr=0.0001)

Gradient explosion:

# Add gradient clipping
import torch.nn as nn

# Clip gradients after the backward pass and before the optimizer step
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
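
To make the effect of norm-based clipping concrete, here is a minimal plain-Python sketch of the idea behind `clip_grad_norm_`: compute one L2 norm over all gradients together, then scale every gradient by the same factor so the combined norm does not exceed `max_norm`. The helper `clip_by_global_norm` and the toy gradients are hypothetical and serve only as an illustration.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The factor is capped at 1.0, so gradients that are already small stay unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # two toy "parameters"; combined norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))

print(norm_before)  # 5.0
print(norm_after)   # just under 1.0
```

Because every gradient is multiplied by the same factor, clipping changes the step size but not the update direction.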

Numerical instability:

# Check input data
import torch

def check_tensor(tensor, name):
    if torch.isnan(tensor).any():
        print(f"NaN detected in {name}")
    if torch.isinf(tensor).any():
        print(f"Inf detected in {name}")

# Add checks in the model's forward method
def forward(self, batch):
    for key, value in batch.items():
        if torch.is_tensor(value):  # skip non-tensor entries
            check_tensor(value, key)
    # ... model computation

Q: How to handle class imbalance problems?

A: Several solutions:

Weighted loss function:

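import torch
import torch.nn as nn

# Calculate class weights from the sample counts; pos_weight must be a tensor
pos_weight = torch.tensor([negative_samples / positive_samples], dtype=torch.float32)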
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
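
The ratio of negative to positive samples can be derived directly from a list of binary labels. The following is a small plain-Python sketch under that assumption; `compute_pos_weight` is a hypothetical helper, and the resulting number would still need to be wrapped in a tensor before being passed to the loss function.

```python
def compute_pos_weight(labels):
    """Return the negatives-to-positives ratio for a list of 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive samples; pos_weight is undefined")
    return negatives / positives


labels = [1] * 10 + [0] * 90  # 10% positives
pos_weight_value = compute_pos_weight(labels)
print(pos_weight_value)  # 9.0
```

With this weight, each positive sample contributes nine times as much to the loss as a negative one, so both classes carry the same total weight.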

Sampling strategies:

from torch.utils.data import WeightedRandomSampler

# Create sampling weights
sample_weights = [1.0 if label == 0 else 5.0 for label in labels]
sampler = WeightedRandomSampler(sample_weights, len(sample_weights))
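
Instead of hard-coding a weight such as 5.0, the sampling weights can be derived from the class frequencies. The sketch below is plain Python and assumes the labels fit in memory as a list; `inverse_frequency_weights` is a hypothetical helper. Its output has one weight per sample, which is the shape `WeightedRandomSampler` expects.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample the weight 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = [0, 0, 0, 0, 1]
sample_weights = inverse_frequency_weights(labels)
print(sample_weights)  # [0.25, 0.25, 0.25, 0.25, 1.0]
```

Each class then has the same total weight, so both classes are drawn with roughly equal probability when sampling with replacement.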

Evaluation metric adjustment:

# Track AUC, F1, precision, and recall instead of accuracy alone
from recis.metrics import AUROC

auc_metric = AUROC(num_thresholds=200)
# Also calculate precision, recall, F1
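
Precision, recall, and F1 can be computed from predicted scores and a decision threshold. The sketch below uses plain Python rather than a metrics library, since this section does not say which of these metrics `recis.metrics` provides; `precision_recall_f1` and the sample data are hypothetical.

```python
def precision_recall_f1(labels, scores, threshold=0.5):
    """Compute precision, recall, and F1 for binary labels at a score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    # Return 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.2, 0.7, 0.1, 0.3, 0.2, 0.4]
precision, recall, f1 = precision_recall_f1(labels, scores)
print(precision, recall, f1)  # each is 2/3: two true positives, one false positive, one false negative
```

On imbalanced data it is worth repeating this at several thresholds, because the default of 0.5 is rarely the best operating point.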

Distributed Training

Q: How to configure multi-node multi-GPU training?

A: Complete distributed training configuration:

Environment variable setup:

# Master node
export MASTER_ADDR="192.168.1.100"
export MASTER_PORT="12355"
export WORLD_SIZE=8
export RANK=0
export LOCAL_RANK=0

# Other processes (RANK must be unique, from 0 to WORLD_SIZE - 1)
export RANK=1  # Increment sequentially

Code configuration:

import os

import torch
import torch.distributed as dist

def setup_distributed():
    # Initialize distributed environment (reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE)
    dist.init_process_group(backend='nccl')

    # Set device
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    torch.cuda.set_device(local_rank)

    return local_rank

# Move the model to this process's GPU
local_rank = setup_distributed()
model = model.cuda(local_rank)

Launch script:

# Launch with torchrun (run on every node; use --node_rank=1 on the second node)
torchrun --nproc_per_node=4 --nnodes=2 --node_rank=0 \
         --master_addr="192.168.1.100" --master_port=12355 \
         train.py
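
When launched this way, torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process, so they do not have to be exported by hand. The plain-Python sketch below shows how the values relate for the 2-node, 4-GPU launch above. It assumes every node runs the same number of processes, and `global_rank` is a hypothetical helper used only for illustration.

```python
def global_rank(node_rank, local_rank, nproc_per_node):
    """Global rank of a process, assuming every node runs nproc_per_node processes."""
    return node_rank * nproc_per_node + local_rank


nnodes, nproc_per_node = 2, 4
world_size = nnodes * nproc_per_node  # 8, matching WORLD_SIZE in the setup above

layout = {
    (node, local): global_rank(node, local, nproc_per_node)
    for node in range(nnodes)
    for local in range(nproc_per_node)
}

for (node, local), rank in sorted(layout.items()):
    print(f"node_rank={node} local_rank={local} -> rank={rank}")
```

The local rank selects the GPU on a node, while the global rank identifies the process across the whole job.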

Q: How to synchronize metrics in distributed training?

A: Use metrics that support distributed synchronization:

from recis.metrics import AUROC

# Enable distributed synchronization
auc_metric = AUROC(
    num_thresholds=200,
    dist_sync_on_step=True  # Sync every step
)

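# Or manual synchronization
import torch.distributed as dist

def sync_tensor(tensor):
    if dist.is_available() and dist.is_initialized():
        # ReduceOp.AVG is only available with the NCCL backend
        dist.all_reduce(tensor, op=dist.ReduceOp.AVG)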
    return tensor
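
One caveat with averaging across ranks: the mean of per-rank metrics equals the global metric only when every rank has processed the same number of samples. The plain-Python simulation below illustrates this with hypothetical per-rank counts. When the counts differ, a safer pattern is to sum the numerators and denominators across ranks (for example with `ReduceOp.SUM`) and divide afterwards.

```python
def mean_of_rank_means(per_rank):
    """Average the per-rank accuracies, which is what an AVG reduction of the metric does."""
    return sum(correct / total for correct, total in per_rank) / len(per_rank)


def pooled_mean(per_rank):
    """Sum the counts across ranks first, then divide once."""
    total_correct = sum(correct for correct, _ in per_rank)
    total_seen = sum(total for _, total in per_rank)
    return total_correct / total_seen


# (correct predictions, samples seen) on each of two simulated ranks
per_rank = [(8, 10), (45, 90)]

averaged = mean_of_rank_means(per_rank)
pooled = pooled_mean(per_rank)
print(averaged, pooled)  # roughly 0.65 versus 0.53
```

The two results coincide when all ranks see equally sized batches, which is the common case with a fixed batch size and dropped remainders.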

Performance Issues

Q: How to optimize slow training speed?

A: Performance tuning recommendations:

Profile analysis:

import torch
from recis.hooks import ProfilerHook

profiler_hook = ProfilerHook(
    output_dir="./profile_logs",
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./profile_logs')
)

Compute optimization:

# Enable cuDNN benchmarking
torch.backends.cudnn.benchmark = True

# Use compilation optimization
model = torch.compile(model)  # Requires PyTorch 2.0+

Troubleshooting

Q: What should I do if I encounter a CUDA-related error?

A: Common CUDA errors and solutions:

CUDA out of memory:

# Reduce batch size
batch_size = 512  # Reduce from 1024 to 512

# Release cached, unused GPU memory (this does not free tensors that are still referenced)
torch.cuda.empty_cache()

CUDA device mismatch:

# Ensure all tensors are on the same device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

for key, value in batch.items():
    if torch.is_tensor(value):
        batch[key] = value.to(device)

Q: How to debug a model that does not converge?

A: Debugging steps:

Check the data:

# Check the data distribution (bincount needs a 1-D tensor of non-negative integers)
print("Label distribution:", torch.bincount(labels.long().flatten()))
print("Feature statistics:", features.float().mean(), features.float().std())

Check the model:

# Check the gradients
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm()
        print(f"{name}: grad_norm={grad_norm:.6f}")

Adjust hyperparameters:

# Try different learning rates
learning_rates = [0.1, 0.01, 0.001, 0.0001]

# Try different optimizers (materialize the parameters once so every optimizer receives them)
params = list(model.parameters())
optimizers = [
    torch.optim.Adam(params, lr=0.001),
    torch.optim.AdamW(params, lr=0.001),
    torch.optim.SGD(params, lr=0.01, momentum=0.9)
]
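
A learning-rate sweep over the values listed above can be organized as a simple loop that runs a short trial per candidate and keeps the best one. The sketch below is a toy illustration in plain Python: `final_loss` runs gradient descent on a one-dimensional quadratic and is a hypothetical stand-in for a short training run that returns a validation loss.

```python
def final_loss(lr, steps=50, start=0.0, target=3.0):
    """Run gradient descent on (w - target)^2 and return the loss after the last step."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - target)
        w -= lr * grad
    return (w - target) ** 2


learning_rates = [0.1, 0.01, 0.001, 0.0001]
results = {lr: final_loss(lr) for lr in learning_rates}
best_lr = min(results, key=results.get)

for lr, loss in results.items():
    print(f"lr={lr}: final loss={loss:.6f}")
print("best:", best_lr)
```

On a real model the ranking depends on the data and optimizer, so each candidate should be run with the same seed and the same number of steps to keep the comparison fair.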

Need Help?

If none of the above solutions resolve your issue:

View Logs: Carefully review the error log and stack trace.
Search Documentation: Search for relevant keywords in the documentation.
View Examples: Refer to similar example code.
Submit Issue: Submit a detailed description of the issue on GitHub.
Community Help: Join the technical discussion group for help.

Question Template

When seeking help, please provide the following information:

**Environment Information**
- RecIS Version:
- PyTorch Version:
- CUDA Version:
- Operating System:

**Problem Description**
- Specific Issue:
- Expected Behavior:
- Actual Behavior:

**Reproduction Steps**
1. Step 1
2. Step 2
3. ...

**Error Message**
```
Full Error Log
```

**Relevant Code**
```python
Minimal Reproducible Code Example
```