IO Module
=========

RecIS's IO module provides efficient and flexible data loading and preprocessing, supporting multiple data formats and optimized data pipelines for deep learning model training. With RecIS's IO module, you can achieve better performance without wrapping the dataset in a traditional DataLoader.

Core Features
-------------

**Data Format Support**

- **ORC Files**: Support for the Optimized Row Columnar format, suitable for large-scale offline data processing

**High-Performance Data Processing**

- Multi-threaded parallel reading and data preprocessing
- Configurable prefetching and buffering mechanisms
- Direct data organization on different devices (CPU/GPU/Pin Memory)

**Flexible Feature Configuration**

- Support for sparse features (variable-length) and dense features (fixed-length)
- Hash feature processing with FarmHash and MurmurHash algorithms
- RaggedTensor format for variable-length features

**Distributed Training Optimization**

- Multi-worker data sharding
- State saving and recovery mechanisms

.. currentmodule:: recis.io

Dataset Classes
---------------

OrcDataset
~~~~~~~~~~

.. autoclass:: OrcDataset
   :members: __init__, add_path, add_paths, varlen_feature, fixedlen_feature

Best Practices
--------------

1. **Reasonable Parameter Settings**:

   .. code-block:: python

      from recis.io import OrcDataset

      dataset = OrcDataset(
          batch_size=1024,
          read_threads_num=2,  # Number of reader threads
          prefetch=1,          # Number of prefetch batches
          device="cuda"        # Place batch results directly on the GPU
      )

2. **Using Data Preprocessing**:

   .. code-block:: python

      def transform_batch(batch):
          # Custom batch processing logic goes here;
          # this placeholder returns the batch unchanged
          processed_batch = batch
          return processed_batch

      dataset = OrcDataset(
          batch_size=1024,
          transform_fn=transform_batch
      )

3. **Distributed Data Reading**:

   .. code-block:: python

      import os

      # Rank and world size as set by the distributed launcher
      rank = int(os.environ.get("RANK", 0))
      world_size = int(os.environ.get("WORLD_SIZE", 1))

      dataset = OrcDataset(
          batch_size=1024,
          worker_idx=rank,       # Index of this worker
          worker_num=world_size  # Total number of workers
      )

Common Questions
----------------

**Q: How do I handle variable-length sequences?**

A: Use `varlen_feature` to define variable-length features. RecIS automatically converts them to RaggedTensor format:

.. code-block:: python

   dataset.varlen_feature("sequence_ids")
   # Data will be processed as RaggedTensor, containing values and offsets

**Q: How do I customize data preprocessing?**

A: Pass a custom processing function through the `transform_fn` parameter:

.. code-block:: python

   def custom_transform(batch):
       # Custom processing logic; process_feature is a user-defined function
       batch['processed_feature'] = process_feature(batch['raw_feature'])
       return batch

   dataset = OrcDataset(batch_size=1024, transform_fn=custom_transform)

**Q: How do I optimize data reading performance?**

A: You can tune the following:

1. Adjust the `read_threads_num` and `prefetch` parameters
2. Choose a reasonable `batch_size`
3. Set `device='cuda'` so that output batches are organized directly on the GPU
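
The answer on variable-length sequences says that such features are stored as values plus offsets. The sketch below shows that layout in plain Python so the idea is concrete. It is only an illustration under stated assumptions: it does not use RecIS, the helper names `to_ragged` and `row_at` are hypothetical, and the actual RaggedTensor class in RecIS may store or expose its data differently.

```python
from typing import List, Sequence, Tuple


def to_ragged(rows: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Flatten variable-length rows into (values, offsets).

    Hypothetical helper for illustration only, not part of RecIS.
    offsets has len(rows) + 1 entries; row i occupies
    values[offsets[i]:offsets[i + 1]].
    """
    values: List[int] = []
    offsets: List[int] = [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets


def row_at(values: List[int], offsets: List[int], i: int) -> List[int]:
    """Recover row i from the flattened representation."""
    return values[offsets[i]:offsets[i + 1]]


# Three samples with 3, 0, and 2 ids respectively
batch = [[11, 12, 13], [], [21, 22]]
values, offsets = to_ragged(batch)

print(values)                      # [11, 12, 13, 21, 22]
print(offsets)                     # [0, 3, 3, 5]
print(row_at(values, offsets, 2))  # [21, 22]
```

Storing one flat values array and one offsets array avoids padding every row to the longest sequence in the batch, which is why this kind of layout suits sparse features.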