Feature Generator

RecIS’s Feature Generator module provides configurable feature and embedding settings.

FG

class recis.fg.feature_generator.FG(fg_parser: FGParser, shape_manager: ShapeManager, use_coalesce=True, grad_reduce_by='worker', initializer='uniform', init_kwargs=None, emb_default_class='hash_table', emb_default_device='cuda', emb_default_type=torch.float32)

Feature Generator for managing feature configurations and embeddings.

The FG class serves as the main interface for feature generation in the RecIS system. It manages feature parsing, shape inference, embedding configurations, and provides utilities for building feature pipelines with proper initialization and device management.

Key Features:

Feature configuration parsing and validation
Automatic shape inference for features and blocks
Embedding configuration management with multiple initializers
Support for hash table embeddings; the bucket embedding class is accepted as an option but is not yet implemented
Multi-hash feature support for advanced embedding strategies
Integration with dataset I/O operations

fg_parser

Parser for feature configuration files.

Type:: FGParser

shape_manager

Manager for feature and block shapes.

Type:: ShapeManager

use_coalesce

Whether to use coalesced operations for efficiency.

Type:: bool

grad_reduce_by

Gradient reduction strategy, such as “worker”.

Type:: str

embedding_initializer: Initializer class for embedding parameters.

emb_default_class

Default embedding class (“hash_table” or “bucket_emb”).

Type:: str

emb_default_device

Default device for embeddings (“cpu” or “cuda”).

Type:: str

emb_default_type

Default data type for embeddings.

Type:: torch.dtype

init_kwargs

Keyword arguments for embedding initialization.

Type:: dict

_labels

Dictionary storing label configurations.

Type:: dict

_ids

Set of ID feature names.

Type:: set

__init__(fg_parser: FGParser, shape_manager: ShapeManager, use_coalesce=True, grad_reduce_by='worker', initializer='uniform', init_kwargs=None, emb_default_class='hash_table', emb_default_device='cuda', emb_default_type=torch.float32)

Initialize the Feature Generator.

Parameters:

fg_parser (FGParser) – Parser for feature configuration files.
shape_manager (ShapeManager) – Manager for feature and block shapes.
use_coalesce (bool, optional) – Whether to use coalesced operations. Defaults to True.
grad_reduce_by (str, optional) – Gradient reduction strategy. Defaults to “worker”.
initializer (str, optional) – Embedding initializer type. Must be one of “constant”, “uniform”, “normal”, “xavier_normal”, “xavier_uniform”. Defaults to “uniform”.
init_kwargs (dict, optional) – Custom initialization parameters. If None, uses default parameters for the specified initializer.
emb_default_class (str, optional) – Default embedding class. Must be “hash_table” or “bucket_emb”. Defaults to “hash_table”.
emb_default_device (str, optional) – Default device for embeddings. Must be “cpu” or “cuda”. Defaults to “cuda”.
emb_default_type (torch.dtype, optional) – Default data type for embeddings. Defaults to torch.float32.

Raises:

ValueError – If emb_default_class is not “hash_table” or “bucket_emb”.
ValueError – If emb_default_device is not “cpu” or “cuda”.
NotImplementedError – If bucket embedding is selected (not yet implemented).
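
The three error conditions above can be pictured with a small stand-in. This is a minimal sketch, not RecIS source code: `check_emb_defaults` is a hypothetical helper, and the order in which the real constructor applies the checks is an assumption.

```python
def check_emb_defaults(emb_default_class="hash_table", emb_default_device="cuda"):
    """Hypothetical mirror of the argument checks documented for FG.__init__."""
    if emb_default_class not in ("hash_table", "bucket_emb"):
        raise ValueError(f"unsupported emb_default_class: {emb_default_class!r}")
    if emb_default_device not in ("cpu", "cuda"):
        raise ValueError(f"unsupported emb_default_device: {emb_default_device!r}")
    if emb_default_class == "bucket_emb":
        # Documented as a valid name, but selecting it is not implemented yet.
        raise NotImplementedError("bucket embedding is not yet implemented")
    return emb_default_class, emb_default_device
```

Under these assumptions, `"bucket_emb"` passes the name check and then fails with NotImplementedError, so in practice only `"hash_table"` is usable.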

add_id(id_name)

Add an ID feature name.

Parameters:: id_name (str) – Name of the ID feature.

add_io_features(dataset: DatasetBase)

Add I/O features to a dataset based on parser configurations.

This method configures the dataset with features from the parser’s I/O configurations, adds label features with their dimensions and default values, and adds variable-length ID features.

Parameters:: dataset (DatasetBase) – Dataset to configure with features.

add_label(label_name, dim=1, default_value=0.0)

Add a label configuration.

Parameters:

label_name (str) – Name of the label.
dim (int, optional) – Dimension of the label. Defaults to 1.
default_value (float, optional) – Default value for the label. Defaults to 0.0.
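
The bookkeeping behind add_id and add_label can be sketched with a stand-in class. `LabelRegistry` is hypothetical and is not part of RecIS. The `_labels` dictionary and `_ids` set follow the attribute descriptions above, but the layout of each stored label configuration and the ordering of the returned ID list are assumptions.

```python
class LabelRegistry:
    """Hypothetical stand-in for the label and ID bookkeeping described for FG."""

    def __init__(self):
        self._labels = {}  # label name -> configuration (layout assumed)
        self._ids = set()  # ID feature names

    def add_label(self, label_name, dim=1, default_value=0.0):
        self._labels[label_name] = {"dim": dim, "default_value": default_value}

    def add_id(self, id_name):
        self._ids.add(id_name)

    @property
    def labels(self):
        # Label names in insertion order.
        return list(self._labels)

    @property
    def sample_ids(self):
        # Sorted only to keep this sketch deterministic.
        return sorted(self._ids)


registry = LabelRegistry()
registry.add_label("click")
registry.add_label("watch_time", dim=1, default_value=-1.0)
registry.add_id("user_id")
registry.add_id("user_id")  # a set ignores the repeated name
```

In this sketch, `registry.labels` lists the two label names in the order they were added, and `registry.sample_ids` contains `"user_id"` once.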

property block_shapes

Get block shapes from the shape manager.

Returns:: Dictionary mapping block names to their shapes.
Return type:: dict

property feature_blocks

Get feature blocks from the parser.

Returns:: Dictionary mapping block names to feature lists.
Return type:: dict

property feature_shapes

Get feature shapes from the shape manager.

Returns:: Dictionary mapping feature names to their shapes.
Return type:: dict

get_block_seq_len(block_name)

Get sequence length for a sequence block.

Parameters:: block_name (str) – Name of the sequence block.
Returns:: Sequence length of the block.
Return type:: int

get_emb_confs()

Generate embedding configurations for all features.

This method processes all embedding configurations from the parser and creates EmbeddingOption objects with appropriate settings for device, data type, initializer, and hooks.

Returns:: Dictionary mapping embedding names to EmbeddingOption objects.
Return type:: OrderedDict
Raises:: RuntimeError – If an unsupported transform configuration is encountered.

get_feature_confs()

Generate feature configurations for all features.

This method processes all embedding configurations from the parser and creates Feature objects with appropriate operations based on the transformation types (bucketize, hash, mod, etc.).

Returns:: List of Feature objects with configured operations.
Return type:: list
Raises:: RuntimeError – If an unsupported ID transform type is encountered.
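
For readers unfamiliar with the transformation types named above, the sketch below shows what bucketize, hash, and mod commonly mean for ID features, in plain Python. These helpers are illustrative assumptions. They are not the RecIS operations, and RecIS's hash function and boundary conventions may differ.

```python
import bisect
import hashlib


def bucketize(value, boundaries):
    """Map a numeric value to the index of the bucket it falls into.

    `boundaries` must be sorted. A value equal to a boundary goes to the
    bucket on its right (bisect_right convention, chosen for this sketch).
    """
    return bisect.bisect_right(boundaries, value)


def mod_id(raw_id, num_buckets):
    """Fold an integer ID into the range [0, num_buckets)."""
    return raw_id % num_buckets


def hash_id(raw, num_buckets):
    """Hash a string to a stable ID in [0, num_buckets) (SHA-256, for illustration)."""
    digest = hashlib.sha256(raw.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_buckets


print(bucketize(2.5, [1.0, 2.0, 3.0]))  # 2
print(mod_id(17, 5))                    # 2
```

All three map a raw input to a bounded integer index, which is the form an embedding lookup needs.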

get_shape(name)

Get shape for a feature or block by name.

Parameters:: name (str) – Name of the feature or block.
Returns:: Shape of the specified feature or block.
Return type:: list

is_seq_block(block_name)

Check if a block is a sequence block.

Parameters:: block_name (str) – Name of the block to check.
Returns:: True if the block is a sequence block, False otherwise.
Return type:: bool
Raises:: RuntimeError – If the block name is not found in feature blocks.
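
The contract above separates "unknown block" (an error) from "known, but not a sequence block" (False). A minimal sketch of that contract follows. It uses a free function and made-up block names, and is not the RecIS method itself.

```python
def is_seq_block(block_name, feature_blocks, seq_block_names):
    """Hypothetical mirror of the documented is_seq_block contract."""
    if block_name not in feature_blocks:
        raise RuntimeError(f"block {block_name!r} not found in feature blocks")
    return block_name in seq_block_names


# Made-up configuration for illustration only.
feature_blocks = {"user": ["age", "city"], "clicks_seq": ["item_id", "cate_id"]}
seq_block_names = ["clicks_seq"]

print(is_seq_block("clicks_seq", feature_blocks, seq_block_names))  # True
print(is_seq_block("user", feature_blocks, seq_block_names))        # False
```

Callers that probe for optional blocks should therefore catch RuntimeError rather than rely on a False return.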

property labels

Get list of label names.

Returns:: List of label names.
Return type:: list

property sample_ids

Get list of sample ID feature names.

Returns:: List of ID feature names.
Return type:: list

property seq_block_names

Get sequence block names from the parser.

Returns:: List of sequence block names.
Return type:: list

recis.fg.feature_generator.build_fg(fg_conf_path, mc_conf_path=None, mc_config=None, fg_parser_class=<class 'recis.fg.fg_parser.FGParser'>, mc_parser_class=<class 'recis.fg.mc_parser.MCParser'>, fg_class=<class 'recis.fg.feature_generator.FG'>, shape_manager_class=<class 'recis.fg.shape_manager.ShapeManager'>, uses_columns=None, lower_case=False, with_seq_prefix=False, already_hashed=False, hash_in_io=False, devel_mode=False, **kwargs)

Build a complete Feature Generator with all necessary components.

This factory function creates and initializes all components needed for feature generation: MC parser, FG parser, shape manager, and the main FG instance. It provides a convenient way to set up the entire feature generation pipeline with proper configuration.

Parameters:

fg_conf_path (str) – Path to the feature generation configuration file.
mc_conf_path (str, optional) – Path to the MC configuration file. Either this or mc_config must be provided.
mc_config (dict, optional) – MC configuration dictionary. Either this or mc_conf_path must be provided.
fg_parser_class (type, optional) – FGParser class to use. Defaults to FGParser.
mc_parser_class (type, optional) – MCParser class to use. Defaults to MCParser.
fg_class (type, optional) – FG class to use. Defaults to FG.
shape_manager_class (type, optional) – ShapeManager class to use. Defaults to ShapeManager.
uses_columns (list, optional) – List of column names to use. If None, uses all columns.
lower_case (bool, optional) – Whether to convert configuration keys to lowercase. Defaults to False.
with_seq_prefix (bool, optional) – Whether the feature name already has sequence block name as prefix. Defaults to False.
already_hashed (bool, optional) – Whether features are already hashed. Defaults to False.
hash_in_io (bool, optional) – Whether to perform hashing in I/O layer. Defaults to False.
devel_mode (bool, optional) – Whether to enable development mode. Defaults to False.
**kwargs – Additional keyword arguments passed to the FG constructor.

Returns:

Configured Feature Generator instance ready for use.

Return type:

FG

Example

from recis.fg.feature_generator import build_fg

# Build FG with file paths
fg = build_fg(
    fg_conf_path="features.json",
    mc_conf_path="model_config.json",
    initializer="xavier_uniform",
    emb_default_device="cuda",
)

# Build FG with configuration dictionary
fg = build_fg(
    fg_conf_path="features.json",
    mc_config={"block1": ["feature1", "feature2"]},
    uses_columns=["block1"],
)

FGParser

class recis.fg.fg_parser.FGParser(conf_file_path, mc_parser, already_hashed=False, hash_in_io=False, lower_case=False, devel_mode=False)

Feature Generation configuration parser and processor.

The FGParser class is responsible for parsing feature generation configuration files, processing feature definitions, and creating structured configurations for the feature generation pipeline. It handles both regular and sequence features, applies various transformations, and manages feature filtering based on model configuration.

Key Features:

Parse JSON configuration files for feature definitions
Filter features based on model configuration requirements
Handle sequence features with proper length and structure
Support feature copying and inheritance
Generate I/O and embedding configurations
Validate and transform feature parameters

already_hashed

Whether input features are already hashed.

Type:: bool

hash_in_io

Whether to perform hashing in I/O layer.

Type:: bool

mc_parser: Model configuration parser instance.

devel_mode

Whether development mode is enabled.

Type:: bool

multihash_conf_

Multi-hash configuration dictionary.

Type:: dict

fg_conf

Parsed feature generation configuration.

Type:: list

parsed_conf_

Processed feature configurations.

Type:: list

io_conf_

I/O configuration dictionary.

Type:: dict

emb_conf_

Embedding configuration dictionary.

Type:: dict

__init__(conf_file_path, mc_parser, already_hashed=False, hash_in_io=False, lower_case=False, devel_mode=False)

Initialize the FG Parser.

Parameters:

conf_file_path (str) – Path to the feature generation configuration file.
mc_parser – Model configuration parser instance.
already_hashed (bool, optional) – Whether features are already hashed. Defaults to False.
hash_in_io (bool, optional) – Whether to hash in I/O layer. Defaults to False.
lower_case (bool, optional) – Whether to convert keys to lowercase. Defaults to False.
devel_mode (bool, optional) – Whether to enable development mode. Defaults to False.

property emb_configs

Get embedding configurations for all features.

Returns:: Dictionary mapping feature names to embedding configurations.
Return type:: dict

property feature_blocks

Get feature blocks from the model configuration parser.

Returns:: Dictionary mapping block names to feature lists.
Return type:: dict

get_seq_len(fea_name)

Get sequence length for a sequence feature.

Parameters:: fea_name (str) – Name of the sequence feature.
Returns:: Sequence length of the feature.
Return type:: int
Raises:: RuntimeError – If the feature is not a sequence feature.

property io_configs

Get I/O configurations for all features.

Returns:: Dictionary mapping feature names to I/O configurations.
Return type:: dict

property seq_block_names

Get sequence block names from the model configuration parser.

Returns:: List of sequence block names.
Return type:: list

MCParser

class recis.fg.mc_parser.MCParser(mc_config_path=None, mc_config=None, uses_columns=None, lower_case=False, with_seq_prefix=False)

Model Configuration parser for managing feature blocks and sequences.

The MCParser class is responsible for parsing model configuration files and managing the organization of features into blocks. It handles both regular feature blocks and sequence blocks, providing utilities to check feature availability and manage feature groupings for model training.

Key Features:

Parse JSON model configuration files
Manage feature blocks and sequence blocks
Filter features based on column usage requirements
Provide feature availability checking utilities
Support both file-based and dictionary-based configuration

uses_columns

List of column names to use. If None, uses all columns.

Type:: list

mc_conf

Parsed and formatted model configuration.

Type:: dict

seq_blocks

Dictionary mapping sequence block names to feature names.

Type:: OrderedDict

blocks

Dictionary of all usable feature names.

Type:: OrderedDict

fea_blocks

Dictionary mapping block names to feature lists for concatenation.

Type:: OrderedDict

__init__(mc_config_path=None, mc_config=None, uses_columns=None, lower_case=False, with_seq_prefix=False)

Initialize the MC Parser.

Parameters:

mc_config_path (str, optional) – Path to the model configuration file. Either this or mc_config must be provided.
mc_config (dict, optional) – Model configuration dictionary. Either this or mc_config_path must be provided.
uses_columns (list, optional) – List of column names to use. If None, uses all columns from the configuration.
lower_case (bool, optional) – Whether to convert configuration keys to lowercase. Defaults to False.
with_seq_prefix (bool, optional) – Whether the feature name already has sequence block name as prefix. Defaults to False.

Raises:

AssertionError – If neither mc_config_path nor mc_config is provided.
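
The "path or dictionary" rule and the `uses_columns` filter can be sketched as below. `resolve_mc_config` is a hypothetical helper, not RecIS code. Two behaviors in it are assumptions: the dictionary takes precedence when both sources are given, and `uses_columns` selects top-level keys of the configuration (as the build_fg example suggests).

```python
import json


def resolve_mc_config(mc_config_path=None, mc_config=None, uses_columns=None):
    """Hypothetical sketch of how a model configuration might be resolved."""
    assert mc_config_path is not None or mc_config is not None, (
        "either mc_config_path or mc_config must be provided"
    )
    if mc_config is None:
        # Fall back to the JSON file only when no dictionary was passed (assumed).
        with open(mc_config_path, encoding="utf-8") as f:
            mc_config = json.load(f)
    if uses_columns is None:
        # None means "use everything in the configuration".
        return dict(mc_config)
    # Keep only the requested top-level entries (assumed filtering rule).
    return {name: feats for name, feats in mc_config.items() if name in uses_columns}
```

For example, `resolve_mc_config(mc_config={"block1": ["feature1"], "block2": ["feature2"]}, uses_columns=["block1"])` keeps only the `"block1"` entry.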

property feature_blocks

Get feature blocks dictionary.

Returns:: Dictionary mapping block names to feature lists.
Return type:: OrderedDict

has_fea(fea_name)

Check if a feature is available in the configuration.

Parameters:: fea_name (str) – Name of the feature to check.
Returns:: True if the feature is available, False otherwise.
Return type:: bool

has_seq_fea(seq_block, fea_name)

Check if a sequence feature is available in a sequence block.

Parameters:

seq_block (str) – Name of the sequence block.
fea_name (str) – Name of the feature within the sequence block.

Returns:

True if the sequence feature is available, False otherwise.

Return type:

bool

property seq_block_names

Get sequence block names.

Returns:: Keys of sequence blocks dictionary.
Return type:: dict_keys
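
Because this property returns a keys view rather than a list, the result cannot be indexed and it reflects later changes to the underlying dictionary. The sketch below demonstrates that standard Python behavior with a plain dict standing in for `seq_blocks`; the block and feature names are made up.

```python
# Plain dict standing in for MCParser.seq_blocks (illustrative names).
seq_blocks = {"clicks_seq": ["item_id", "cate_id"]}

names = seq_blocks.keys()  # live keys view
snapshot = list(names)     # copy into a list when indexing is needed

seq_blocks["cart_seq"] = ["item_id"]

print(snapshot)     # ['clicks_seq']
print(list(names))  # ['clicks_seq', 'cart_seq']
```

Wrapping the result in `list(...)` is the usual way to take a stable, indexable copy.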