IO Module
RecIS's IO module provides efficient and flexible data loading and preprocessing, supporting multiple data formats and optimized data pipelines for deep learning model training. Datasets built with the IO module are consumed directly, so you get high throughput without wrapping them in a traditional DataLoader.
Core Features
- Data Format Support
ORC Files: Support for Optimized Row Columnar format, suitable for large-scale offline data processing
- High-Performance Data Processing
Multi-threaded parallel reading and data preprocessing
Configurable prefetching and buffering mechanisms
Direct placement of batches on the target device (CPU, GPU, or pinned memory)
- Flexible Feature Configuration
Support for sparse features (variable-length) and dense features (fixed-length)
Hash feature processing with FarmHash and MurmurHash algorithms
RaggedTensor format for variable-length features
- Distributed Training Optimization
Multi-worker data sharding
State saving and recovery mechanisms
Dataset Classes
OrcDataset
- class recis.io.OrcDataset(batch_size, worker_idx=0, worker_num=1, read_threads_num=4, pack_threads_num=None, prefetch=1, is_compressed=False, drop_remainder=False, worker_slice_batch_num=None, shuffle=False, ragged_format=True, transform_fn=None, save_interval=100, dtype=torch.float32, device='cpu')
ORC Dataset for reading Optimized Row Columnar format files.
This class provides functionality to read ORC files efficiently with support for both sparse (variable-length) and dense (fixed-length) features. It extends DatasetBase to provide ORC-specific optimizations including hash feature processing, data sharding, and batch processing.
The OrcDataset supports distributed training by allowing multiple workers to process different shards of the data concurrently. It also provides flexible feature configuration with hash bucketing for categorical features.
Example
Creating and configuring an ORC dataset:
# Initialize dataset
dataset = OrcDataset(
    batch_size=512,
    worker_idx=0,
    worker_num=4,
    shuffle=True,
    ragged_format=True
)

# Add data sources
dataset.add_paths(["/data/train/part1", "/data/train/part2"])

# Configure sparse features with hashing
dataset.varlen_feature("item_id", hash_type="farm", hash_bucket=1000000)
dataset.varlen_feature("category_id", hash_type="murmur", hash_bucket=10000)

# Configure dense features
dataset.fixedlen_feature("price", default_value=0.0)
dataset.fixedlen_feature("rating", default_value=3.0)
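The example stops at configuration. To consume data, the dataset is iterated directly, with no DataLoader wrapper, as noted at the top of this page. A minimal consumption sketch, assuming the configured dataset is directly iterable and yields dict-style batches keyed by feature name (train_step is a hypothetical stand-in for your training code):

for batch in dataset:
    item_ids = batch["item_id"]   # sparse feature, delivered as a RaggedTensor
    prices = batch["price"]       # dense feature
    train_step(item_ids, prices)  # hypothetical training step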
- Members:
__init__, add_path, add_paths, varlen_feature, fixedlen_feature
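Judging by the example above, add_paths registers a list of input locations (and add_path presumably a single one), while varlen_feature and fixedlen_feature declare the sparse and dense columns to read from those files.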
Best Practices
Reasonable Parameter Settings:
dataset = OrcDataset(
    batch_size=1024,
    read_threads_num=2,
    prefetch=1,      # Number of prefetched batches
    device="cuda"    # Place batch results directly on CUDA
)
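Larger prefetch values can hide read latency at the cost of extra memory for buffered batches, and setting device to the training device avoids a separate host-to-device copy step in the training loop.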
Using Data Preprocessing:
def transform_batch(batch):
    # Custom batch processing logic goes here
    return batch

dataset = OrcDataset(
    batch_size=1024,
    transform_fn=transform_batch
)
Distributed Data Reading:
import os

rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

dataset = OrcDataset(
    batch_size=1024,
    worker_idx=rank,
    worker_num=world_size
)
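When the job is launched with torchrun, the RANK and WORLD_SIZE environment variables are set automatically, so the same script works for both single-worker and multi-worker runs; each worker then reads only its own shard of the data.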
Common Questions
Q: How to handle variable-length sequences?
A: Use varlen_feature to define variable-length features; RecIS will automatically process them into RaggedTensor format:
dataset.varlen_feature("sequence_ids")
# Data will be processed as RaggedTensor, containing values and offsets
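A hypothetical illustration of that layout, using one common values/offsets convention (the ids below are made up):

# Three rows of sequence_ids: [5, 7], [9], [2, 4, 8]
# values  = [5, 7, 9, 2, 4, 8]   all ids flattened into a single tensor
# offsets = [0, 2, 3, 6]         row i spans values[offsets[i]:offsets[i+1]]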
Q: How to customize data preprocessing?
A: Pass a custom processing function through the transform_fn parameter:
def custom_transform(batch):
    # Custom processing logic (process_feature here is a placeholder for your own logic)
    batch['processed_feature'] = process_feature(batch['raw_feature'])
    return batch

dataset = OrcDataset(batch_size=1024, transform_fn=custom_transform)
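For a fully self-contained variant, the sketch below log-transforms a dense feature; the feature name "price", the new key "log_price", and the assumption that dense features arrive as torch tensors are illustrative rather than guaranteed by the API:

import torch

def log_transform(batch):
    # "price" is assumed to be a fixed-length feature delivered as a torch tensor
    batch["log_price"] = torch.log1p(batch["price"])
    return batch

dataset = OrcDataset(batch_size=1024, transform_fn=log_transform)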
Q: How to optimize data reading performance?
A: You can optimize in the following ways:
Adjust the read_threads_num and prefetch parameters
Choose a reasonable batch_size
Set device='cuda' to place output batches directly on the GPU
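To check whether a change actually helps, a rough throughput measurement like the sketch below can be used; the 100-batch sample size is arbitrary, and it again assumes the dataset is directly iterable:

import time

start = time.time()
num_batches = 0
for batch in dataset:   # assuming the dataset is directly iterable
    num_batches += 1
    if num_batches == 100:
        break
print(f"{num_batches / (time.time() - start):.1f} batches/sec")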