Feature Generator

RecIS’s Feature Generator module provides configurable feature and embedding settings.

FG

class recis.fg.feature_generator.FG(fg_parser: FGParser, shape_manager: ShapeManager, use_coalesce=True, grad_reduce_by='worker', initializer='uniform', init_kwargs=None, emb_default_class='hash_table', emb_default_device='cuda', emb_default_type=torch.float32)

Feature Generator for managing feature configurations and embeddings.

The FG class serves as the main interface for feature generation in the RecIS system. It manages feature parsing, shape inference, embedding configurations, and provides utilities for building feature pipelines with proper initialization and device management.

Key Features:
  • Feature configuration parsing and validation

  • Automatic shape inference for features and blocks

  • Embedding configuration management with multiple initializers

  • Support for both hash table and bucket embeddings

  • Multi-hash feature support for advanced embedding strategies

  • Integration with dataset I/O operations

fg_parser

Parser for feature configuration files.

Type:

FGParser

shape_manager

Manager for feature and block shapes.

Type:

ShapeManager

use_coalesce

Whether to use coalesced operations for efficiency.

Type:

bool

grad_reduce_by

Gradient reduction strategy, for example “worker”.

Type:

str

embedding_initializer

Initializer class for embedding parameters.

emb_default_class

Default embedding class (“hash_table” or “bucket_emb”).

Type:

str

emb_default_device

Default device for embeddings (“cpu” or “cuda”).

Type:

str

emb_default_type

Default data type for embeddings.

Type:

torch.dtype

init_kwargs

Keyword arguments for embedding initialization.

Type:

dict

_labels

Dictionary storing label configurations.

Type:

dict

_ids

Set of ID feature names.

Type:

set

__init__(fg_parser: FGParser, shape_manager: ShapeManager, use_coalesce=True, grad_reduce_by='worker', initializer='uniform', init_kwargs=None, emb_default_class='hash_table', emb_default_device='cuda', emb_default_type=torch.float32)

Initialize the Feature Generator.

Parameters:
  • fg_parser (FGParser) – Parser for feature configuration files.

  • shape_manager (ShapeManager) – Manager for feature and block shapes.

  • use_coalesce (bool, optional) – Whether to use coalesced operations. Defaults to True.

  • grad_reduce_by (str, optional) – Gradient reduction strategy. Defaults to “worker”.

  • initializer (str, optional) – Embedding initializer type. Must be one of “constant”, “uniform”, “normal”, “xavier_normal”, “xavier_uniform”. Defaults to “uniform”.

  • init_kwargs (dict, optional) – Custom initialization parameters. If None, uses default parameters for the specified initializer.

  • emb_default_class (str, optional) – Default embedding class. Must be “hash_table” or “bucket_emb”. Defaults to “hash_table”.

  • emb_default_device (str, optional) – Default device for embeddings. Must be “cpu” or “cuda”. Defaults to “cuda”.

  • emb_default_type (torch.dtype, optional) – Default data type for embeddings. Defaults to torch.float32.

Raises:
  • ValueError – If emb_default_class is not “hash_table” or “bucket_emb”.

  • ValueError – If emb_default_device is not “cpu” or “cuda”.

  • NotImplementedError – If emb_default_class is “bucket_emb”; bucket embeddings are accepted as a value but not yet implemented.
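
The conditions listed under Raises can be pictured with the minimal sketch below. It is not RecIS source code: the helper name check_emb_defaults is hypothetical, and the order in which the checks run is an assumption.

```python
# Illustrative only: a stand-in for the documented argument checks, not RecIS code.
VALID_CLASSES = ("hash_table", "bucket_emb")
VALID_DEVICES = ("cpu", "cuda")


def check_emb_defaults(emb_default_class="hash_table", emb_default_device="cuda"):
    """Hypothetical helper mirroring the documented Raises conditions."""
    if emb_default_class not in VALID_CLASSES:
        raise ValueError(f"unsupported emb_default_class: {emb_default_class!r}")
    if emb_default_device not in VALID_DEVICES:
        raise ValueError(f"unsupported emb_default_device: {emb_default_device!r}")
    if emb_default_class == "bucket_emb":
        # The name is accepted, but the feature is documented as not yet implemented.
        raise NotImplementedError("bucket embedding is not yet implemented")
    return emb_default_class, emb_default_device
```

With the documented defaults the helper returns the pair unchanged; any other combination raises one of the three exceptions above.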

add_id(id_name)

Add an ID feature name.

Parameters:

id_name (str) – Name of the ID feature.

add_io_features(dataset: DatasetBase)

Add I/O features to a dataset based on parser configurations.

This method configures the dataset with features from the parser’s I/O configurations, adds label features with their dimensions and default values, and adds variable-length ID features.

Parameters:

dataset (DatasetBase) – Dataset to configure with features.

add_label(label_name, dim=1, default_value=0.0)

Add a label configuration.

Parameters:
  • label_name (str) – Name of the label.

  • dim (int, optional) – Dimension of the label. Defaults to 1.

  • default_value (float, optional) – Default value for the label. Defaults to 0.0.
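
As a rough picture of the bookkeeping behind add_label and add_id, the sketch below uses a stand-in class rather than the real FG. Only the _labels dict, the _ids set, and the two method signatures come from this page; the class name LabelRegistry, the per-label dictionary layout, and the sorted ID order are assumptions.

```python
# Illustrative stand-in, not the RecIS FG class.
class LabelRegistry:
    def __init__(self):
        self._labels = {}  # label name -> settings (layout assumed)
        self._ids = set()  # ID feature names

    def add_label(self, label_name, dim=1, default_value=0.0):
        self._labels[label_name] = {"dim": dim, "default_value": default_value}

    def add_id(self, id_name):
        self._ids.add(id_name)

    @property
    def labels(self):
        # Dicts keep insertion order, so labels come back in the order added.
        return list(self._labels.keys())

    @property
    def sample_ids(self):
        # Sets are unordered; sorting gives a stable result (an assumption).
        return sorted(self._ids)


registry = LabelRegistry()
registry.add_label("click")
registry.add_label("dwell_time", dim=1, default_value=-1.0)
registry.add_id("request_id")
registry.add_id("request_id")  # adding the same name twice keeps a single entry
```

Because the IDs live in a set, repeated calls with the same name are harmless.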

property block_shapes

Get block shapes from the shape manager.

Returns:

Dictionary mapping block names to their shapes.

Return type:

dict

property feature_blocks

Get feature blocks from the parser.

Returns:

Dictionary mapping block names to feature lists.

Return type:

dict

property feature_shapes

Get feature shapes from the shape manager.

Returns:

Dictionary mapping feature names to their shapes.

Return type:

dict

get_block_seq_len(block_name)

Get sequence length for a sequence block.

Parameters:

block_name (str) – Name of the sequence block.

Returns:

Sequence length of the block.

Return type:

int

get_emb_confs()

Generate embedding configurations for all features.

This method processes all embedding configurations from the parser and creates EmbeddingOption objects with appropriate settings for device, data type, initializer, and hooks.

Returns:

Dictionary mapping embedding names to EmbeddingOption objects.

Return type:

OrderedDict

Raises:

RuntimeError – If an unsupported transform configuration is encountered.

get_feature_confs()

Generate feature configurations for all features.

This method processes all embedding configurations from the parser and creates Feature objects with appropriate operations based on the transformation types (bucketize, hash, mod, etc.).

Returns:

List of Feature objects with configured operations.

Return type:

list

Raises:

RuntimeError – If an unsupported ID transform type is encountered.
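
The idea of choosing an operation per transformation type, and failing on an unknown type, can be sketched in plain Python as follows. This does not use RecIS Feature objects: the function make_id_transform, the config keys (type, boundaries, bucket_size), and the use of crc32 as the hash are all assumptions made for illustration.

```python
import bisect
import zlib


def make_id_transform(conf):
    """Return a callable for one hypothetical transform config dict."""
    kind = conf["type"]
    if kind == "bucketize":
        boundaries = sorted(conf["boundaries"])
        # Bucket index of a value; a value equal to a boundary goes to the right bucket.
        return lambda value: bisect.bisect_right(boundaries, value)
    if kind == "hash":
        # crc32 stands in for whatever hash function the library actually uses.
        return lambda value: zlib.crc32(str(value).encode("utf-8"))
    if kind == "mod":
        size = conf["bucket_size"]
        return lambda value: value % size
    raise RuntimeError(f"unsupported ID transform type: {kind!r}")


bucketize = make_id_transform({"type": "bucketize", "boundaries": [10, 20, 30]})
mod = make_id_transform({"type": "mod", "bucket_size": 1000})
hashed = make_id_transform({"type": "hash"})
```

Here bucketize(5) falls into bucket 0, bucketize(20) into bucket 2, and mod(123456) yields 456.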

get_shape(name)

Get shape for a feature or block by name.

Parameters:

name (str) – Name of the feature or block.

Returns:

Shape of the specified feature or block.

Return type:

list

is_seq_block(block_name)

Check if a block is a sequence block.

Parameters:

block_name (str) – Name of the block to check.

Returns:

True if the block is a sequence block, False otherwise.

Return type:

bool

Raises:

RuntimeError – If the block name is not found in feature blocks.
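
The documented behaviour (a bool for known blocks, RuntimeError for unknown ones) can be sketched as a free function. The argument layout is assumed here, with feature_blocks as a dict and seq_block_names as a list; the real method reads both from the FG instance.

```python
def is_seq_block(block_name, feature_blocks, seq_block_names):
    """Stand-in for the documented lookup; not the RecIS implementation."""
    if block_name not in feature_blocks:
        raise RuntimeError(f"block {block_name!r} not found in feature blocks")
    return block_name in seq_block_names


# Hypothetical block layout used only for this example.
feature_blocks = {"user": ["age", "gender"], "click_seq": ["item_id", "cate_id"]}
seq_block_names = ["click_seq"]

user_is_seq = is_seq_block("user", feature_blocks, seq_block_names)
click_is_seq = is_seq_block("click_seq", feature_blocks, seq_block_names)
```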

property labels

Get list of label names.

Returns:

List of label names.

Return type:

list

property sample_ids

Get list of sample ID feature names.

Returns:

List of ID feature names.

Return type:

list

property seq_block_names

Get sequence block names from the parser.

Returns:

List of sequence block names.

Return type:

list

recis.fg.feature_generator.build_fg(fg_conf_path, mc_conf_path=None, mc_config=None, fg_parser_class=recis.fg.fg_parser.FGParser, mc_parser_class=recis.fg.mc_parser.MCParser, fg_class=recis.fg.feature_generator.FG, shape_manager_class=recis.fg.shape_manager.ShapeManager, uses_columns=None, lower_case=False, with_seq_prefix=False, already_hashed=False, hash_in_io=False, devel_mode=False, **kwargs)

Build a complete Feature Generator with all necessary components.

This factory function creates and initializes all components needed for feature generation: MC parser, FG parser, shape manager, and the main FG instance. It provides a convenient way to set up the entire feature generation pipeline with proper configuration.

Parameters:
  • fg_conf_path (str) – Path to the feature generation configuration file.

  • mc_conf_path (str, optional) – Path to the MC configuration file. Either this or mc_config must be provided.

  • mc_config (dict, optional) – MC configuration dictionary. Either this or mc_conf_path must be provided.

  • fg_parser_class (type, optional) – FGParser class to use. Defaults to FGParser.

  • mc_parser_class (type, optional) – MCParser class to use. Defaults to MCParser.

  • fg_class (type, optional) – FG class to use. Defaults to FG.

  • shape_manager_class (type, optional) – ShapeManager class to use. Defaults to ShapeManager.

  • uses_columns (list, optional) – List of column names to use. If None, uses all columns.

  • lower_case (bool, optional) – Whether to convert configuration keys to lowercase. Defaults to False.

  • with_seq_prefix (bool, optional) – Whether feature names already carry the sequence block name as a prefix. Defaults to False.

  • already_hashed (bool, optional) – Whether features are already hashed. Defaults to False.

  • hash_in_io (bool, optional) – Whether to perform hashing in the I/O layer. Defaults to False.

  • devel_mode (bool, optional) – Whether to enable development mode. Defaults to False.

  • **kwargs – Additional keyword arguments passed to the FG constructor.

Returns:

Configured Feature Generator instance ready for use.

Return type:

FG

Example

from recis.fg.feature_generator import build_fg

# Build FG with file paths
fg = build_fg(
    fg_conf_path="features.json",
    mc_conf_path="model_config.json",
    initializer="xavier_uniform",
    emb_default_device="cuda",
)

# Build FG with configuration dictionary
fg = build_fg(
    fg_conf_path="features.json",
    mc_config={"block1": ["feature1", "feature2"]},
    uses_columns=["block1"],
)
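
To make the construction order described above concrete (MC parser, then FG parser, then shape manager, then FG), the sketch below wires up stub classes that only record their arguments. The stubs are not the RecIS classes: their names are hypothetical, the parser signatures are copied from this page, and the ShapeManager constructor argument is an assumption because this page does not document it.

```python
# Stub classes: they only record constructor arguments so the wiring is visible.
class StubMCParser:
    def __init__(self, mc_config_path=None, mc_config=None, uses_columns=None,
                 lower_case=False, with_seq_prefix=False):
        assert mc_config_path is not None or mc_config is not None
        self.mc_config = mc_config
        self.uses_columns = uses_columns


class StubFGParser:
    def __init__(self, conf_file_path, mc_parser, already_hashed=False,
                 hash_in_io=False, lower_case=False, devel_mode=False):
        self.conf_file_path = conf_file_path
        self.mc_parser = mc_parser


class StubShapeManager:
    # Constructor arguments are an assumption; they are not documented here.
    def __init__(self, fg_parser):
        self.fg_parser = fg_parser


class StubFG:
    def __init__(self, fg_parser, shape_manager, **kwargs):
        self.fg_parser = fg_parser
        self.shape_manager = shape_manager
        self.options = kwargs


def build_fg_sketch(fg_conf_path, mc_conf_path=None, mc_config=None,
                    uses_columns=None, lower_case=False, with_seq_prefix=False,
                    already_hashed=False, hash_in_io=False, devel_mode=False,
                    **kwargs):
    # 1. Model configuration parser, 2. feature parser, 3. shapes, 4. FG.
    mc_parser = StubMCParser(mc_conf_path, mc_config, uses_columns,
                             lower_case, with_seq_prefix)
    fg_parser = StubFGParser(fg_conf_path, mc_parser, already_hashed,
                             hash_in_io, lower_case, devel_mode)
    shape_manager = StubShapeManager(fg_parser)
    # Remaining keyword arguments are forwarded to the FG constructor.
    return StubFG(fg_parser, shape_manager, **kwargs)


fg = build_fg_sketch("features.json", mc_config={"block1": ["feature1"]},
                     initializer="xavier_uniform")
```

The point of the sketch is the routing: parser flags stay with the parsers, while options such as initializer travel through **kwargs to the FG constructor.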

FGParser

class recis.fg.fg_parser.FGParser(conf_file_path, mc_parser, already_hashed=False, hash_in_io=False, lower_case=False, devel_mode=False)

Feature Generation configuration parser and processor.

The FGParser class is responsible for parsing feature generation configuration files, processing feature definitions, and creating structured configurations for the feature generation pipeline. It handles both regular and sequence features, applies various transformations, and manages feature filtering based on model configuration.

Key Features:
  • Parse JSON configuration files for feature definitions

  • Filter features based on model configuration requirements

  • Handle sequence features with proper length and structure

  • Support feature copying and inheritance

  • Generate I/O and embedding configurations

  • Validate and transform feature parameters

already_hashed

Whether input features are already hashed.

Type:

bool

hash_in_io

Whether to perform hashing in the I/O layer.

Type:

bool

mc_parser

Model configuration parser instance.

devel_mode

Whether development mode is enabled.

Type:

bool

multihash_conf_

Multi-hash configuration dictionary.

Type:

dict

fg_conf

Parsed feature generation configuration.

Type:

list

parsed_conf_

Processed feature configurations.

Type:

list

io_conf_

I/O configuration dictionary.

Type:

dict

emb_conf_

Embedding configuration dictionary.

Type:

dict

__init__(conf_file_path, mc_parser, already_hashed=False, hash_in_io=False, lower_case=False, devel_mode=False)

Initialize the FG Parser.

Parameters:
  • conf_file_path (str) – Path to the feature generation configuration file.

  • mc_parser – Model configuration parser instance.

  • already_hashed (bool, optional) – Whether features are already hashed. Defaults to False.

  • hash_in_io (bool, optional) – Whether to perform hashing in the I/O layer. Defaults to False.

  • lower_case (bool, optional) – Whether to convert keys to lowercase. Defaults to False.

  • devel_mode (bool, optional) – Whether to enable development mode. Defaults to False.

property emb_configs

Get embedding configurations for all features.

Returns:

Dictionary mapping feature names to embedding configurations.

Return type:

dict

property feature_blocks

Get feature blocks from the model configuration parser.

Returns:

Dictionary mapping block names to feature lists.

Return type:

dict

get_seq_len(fea_name)

Get sequence length for a sequence feature.

Parameters:

fea_name (str) – Name of the sequence feature.

Returns:

Sequence length of the feature.

Return type:

int

Raises:

RuntimeError – If the feature is not a sequence feature.

property io_configs

Get I/O configurations for all features.

Returns:

Dictionary mapping feature names to I/O configurations.

Return type:

dict

property seq_block_names

Get sequence block names from the model configuration parser.

Returns:

List of sequence block names.

Return type:

list

MCParser

class recis.fg.mc_parser.MCParser(mc_config_path=None, mc_config=None, uses_columns=None, lower_case=False, with_seq_prefix=False)

Model Configuration parser for managing feature blocks and sequences.

The MCParser class is responsible for parsing model configuration files and managing the organization of features into blocks. It handles both regular feature blocks and sequence blocks, providing utilities to check feature availability and manage feature groupings for model training.

Key Features:
  • Parse JSON model configuration files

  • Manage feature blocks and sequence blocks

  • Filter features based on column usage requirements

  • Provide feature availability checking utilities

  • Support both file-based and dictionary-based configuration

uses_columns

List of column names to use. If None, uses all columns.

Type:

list

mc_conf

Parsed and formatted model configuration.

Type:

dict

seq_blocks

Dictionary mapping sequence block names to feature names.

Type:

OrderedDict

blocks

Dictionary of all usable feature names.

Type:

OrderedDict

fea_blocks

Dictionary mapping block names to feature lists for concatenation.

Type:

OrderedDict

__init__(mc_config_path=None, mc_config=None, uses_columns=None, lower_case=False, with_seq_prefix=False)

Initialize the MC Parser.

Parameters:
  • mc_config_path (str, optional) – Path to the model configuration file. Either this or mc_config must be provided.

  • mc_config (dict, optional) – Model configuration dictionary. Either this or mc_config_path must be provided.

  • uses_columns (list, optional) – List of column names to use. If None, uses all columns from the configuration.

  • lower_case (bool, optional) – Whether to convert configuration keys to lowercase. Defaults to False.

  • with_seq_prefix (bool, optional) – Whether feature names already carry the sequence block name as a prefix. Defaults to False.

Raises:

AssertionError – If neither mc_config_path nor mc_config is provided.
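
The "path or dictionary" contract can be sketched with the standard library alone. The function load_mc_config is hypothetical and is not the MCParser implementation. Two behaviours in it are assumptions: the dictionary wins when both sources are given, and uses_columns filters top-level keys of the configuration.

```python
import json
import os
import tempfile


def load_mc_config(mc_config_path=None, mc_config=None, uses_columns=None):
    """Hypothetical loader mirroring the documented either/or arguments."""
    assert mc_config_path is not None or mc_config is not None, \
        "either mc_config_path or mc_config must be provided"
    if mc_config is None:
        with open(mc_config_path, "r", encoding="utf-8") as f:
            mc_config = json.load(f)
    if uses_columns is None:
        return dict(mc_config)
    # Assumption: uses_columns selects top-level keys of the configuration.
    return {name: feats for name, feats in mc_config.items() if name in uses_columns}


config = {"block1": ["feature1", "feature2"], "block2": ["feature3"]}

# File-based source: write the config to a temporary JSON file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f)
    from_file = load_mc_config(mc_config_path=path)

# Dictionary-based source with a column filter.
from_dict = load_mc_config(mc_config=config, uses_columns=["block1"])
```

Calling load_mc_config() with neither argument trips the assertion, matching the AssertionError documented above.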

property feature_blocks

Get feature blocks dictionary.

Returns:

Dictionary mapping block names to feature lists.

Return type:

OrderedDict

has_fea(fea_name)

Check if a feature is available in the configuration.

Parameters:

fea_name (str) – Name of the feature to check.

Returns:

True if the feature is available, False otherwise.

Return type:

bool

has_seq_fea(seq_block, fea_name)

Check if a sequence feature is available in a sequence block.

Parameters:
  • seq_block (str) – Name of the sequence block.

  • fea_name (str) – Name of the feature within the sequence block.

Returns:

True if the sequence feature is available, False otherwise.

Return type:

bool

property seq_block_names

Get sequence block names.

Returns:

Keys of the sequence blocks dictionary.

Return type:

dict_keys
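
A dict_keys object is a live view rather than a list, which matters to callers that expect list behaviour. The snippet below shows the difference with a plain OrderedDict; the block names are made up for the example, and no RecIS class is involved.

```python
from collections import OrderedDict

# Hypothetical sequence blocks, used only to obtain a dict_keys view.
seq_blocks = OrderedDict([("click_seq", ["item_id"]), ("cart_seq", ["item_id"])])

names = seq_blocks.keys()            # a dict_keys view
first = list(names)[0]               # views are not indexable; convert to a list first
seq_blocks["fav_seq"] = ["item_id"]
live_count = len(names)              # the view reflects the later insertion
```

Wrapping the result in list(...) gives a snapshot that supports indexing and does not change when the underlying dictionary does.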