fg configuration

Describes the feature processing and embedding lookup configuration.

Common standard usage

String type + hash + lookup

{
    "features": [
        {
            "feature_name": "item_cate_id",
            "feature_type": "id_feature",
            "value_type": "String",
            "hash_bucket_size": 1000,
            "embedding_dimension": 32,
            "shared_name": "item_cate_id",
            "gen_key_type": "hash",
            "gen_val_type": "lookup"
        }
    ]
}

Double type + bucketing + lookup

{
    "features": [
        {
            "feature_name": "uvsum",
            "feature_type": "raw_feature",
            "value_type": "Double",
            "embedding_dimension": 8,
            "shared_name": "uvsum",
            "boundaries": "0.0,2.0,3.0,5.0,6.0",
            "gen_key_type": "boundary",
            "gen_val_type": "lookup"
        }
    ]
}

Double type + direct use

{
    "features": [
        {
            "feature_name": "multimodal_correl_score",
            "feature_type": "raw_feature",
            "value_type": "Double",
            "value_dimension": 50,
            "gen_key_type": "idle",
            "gen_val_type": "idle"
        }
    ]
}

String type + MultiHash + lookup (not supported by the main-search RTP)

{
    "features": [
        {
            "feature_name": "usersex_d_multihash",
            "feature_type": "id_feature",
            "value_type": "String",
            "hash_bucket_size": 10,
            "embedding_dimension": 8,
            "shared_name": "usersex_d",
            "compress_strategy": "yx:50,50,50,50:concat:4",
            "gen_key_type": "multihash",
            "gen_val_type": "multihash_lookup"
        }
    ]
}

Sequence features

{
    "features": [
        {
            "feature_name": "cate_seq",
            "sequence_name": "cate_seq",
            "sequence_length": 50,
            "features": [
                {
                    "feature_name": "cate_ids",
                    "feature_type": "id_feature",
                    "value_type": "String",
                    "hash_bucket_size": 1000,
                    "embedding_dimension": 32,
                    "shared_name": "cate_ids",
                    "gen_key_type": "hash",
                    "gen_val_type": "lookup"
                },
                {
                    "feature_name": "cate_rates",
                    "feature_type": "raw_feature",
                    "value_type": "Double",
                    "gen_key_type": "idle",
                    "gen_val_type": "idle"
                }
            ]
        }
    ]
}

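Deriving a feature from an existing feature (feature clone)

Non-sequence features

The raw sample contains only the lp_time feature. The new feature lp_time_raw is copied from it and goes through the Double type + bucketing + lookup flow.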

{
    "features": [
        {
            "feature_name": "lp_time_raw",
            "from_feature": "lp_time",
            "feature_type": "raw_feature",
            "value_type": "Double",
            "embedding_dimension": 4,
            "shared_name": "longterm_time",
            "boundaries": "1.0,8564.0,16590.0",
            "gen_key_type": "boundary",
            "gen_val_type": "lookup"
        }
    ]
}

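Sequence features

  1. Sequence features prefixed with longpay_seq__context are all derived from the sequence features prefixed with longpay_seq.

  2. In particular, the longpay_seq__context_lp_cnt feature is derived from the longpay_seq_lp_cnt feature.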

{
    "features": [
        {
            "feature_name": "longpay_seq",
            "sequence_name": "longpay_seq__context",
            "sequence_length": 200,
            "features": [
                {
                    "feature_name": "lp_cnt",
                    "from_feature": "longpay_seq_lp_cnt",
                    "feature_type": "raw_feature",
                    "value_type": "Double",
                    "embedding_dimension": 4,
                    "boundaries": "1.0,2.0,3.0,4.0",
                    "shared_name": "context_cnt",
                    "gen_key_type": "boundary",
                    "gen_val_type": "lookup"
                }
            ]
        }
    ]
}

Full field reference

Non-sequence features

{
    "features": [
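        {
            "feature_name": "feature name",
            "value_type": "feature value type",
            "feature_type": "feature type",
            "gen_key_type": "how the feature value is processed",
            "gen_val_type": "how the feature value is output",
            "value_dimension": "feature value dimension",
            "from_feature": "which feature in the io input this one is taken from",

            "hash_bucket_size": "hash bucket size",
            "hash_type": "hash method",

            "compress_strategy": "multihash configuration",

            "boundaries": "bucket boundaries",

            "embedding_dimension": "embedding lookup dimension",
            "shared_name": "table name",
            "combiner": "how looked-up embeddings are aggregated",
            "trainable": "whether gradients update the table during training",
            "emb_device": "device the table resides on",
            "emb_type": "table type",
            "admit_hook": "feature admission policy",
            "filter_hook": "feature filtering policy"
        }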
    ]
}

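General configuration

  • feature_name: feature name

  • value_type: feature value type, one of Integer | Double | String

  • feature_type: feature type, one of raw_feature | id_feature. raw_feature means the raw feature values have already been padded, so they can be read via fixedlen_feature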

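  • gen_key_type: how the feature value is processed, one of
    • idle: no processing; the original value is returned. (idle is not yet supported for String features)

    • boundary: bucketing by boundaries

    • hash: hashing

    • multihash: hash first, then apply multihash to the hash value

    • mask: mask according to mask_value (not yet supported)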

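  • gen_val_type: how the feature is output, one of
    • idle: the value is output as-is

    • lookup: the value is output after an embedding table lookup

    • multihash_lookup: the multihash result is used to look up multiple tables; allowed only when gen_key_type is multihash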

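  • (optional) value_dimension: dimension of the feature value, default 1

  • (optional) from_feature: the feature to process is not read from the io result under its own feature_name but is copied from another feature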
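
The constraints in the list above can be checked mechanically before a config is used. The following is a minimal sketch in Python, not part of fg itself: the function name check_feature and the wording of the messages are hypothetical, and only the rules stated above are encoded.

```python
# Hypothetical validator; allowed values are copied from the list above.
VALUE_TYPES = {"Integer", "Double", "String"}
FEATURE_TYPES = {"raw_feature", "id_feature"}
GEN_KEY_TYPES = {"idle", "boundary", "hash", "multihash", "mask"}
GEN_VAL_TYPES = {"idle", "lookup", "multihash_lookup"}


def check_feature(cfg):
    """Return a list of problems found in one non-sequence feature entry."""
    problems = []
    if "feature_name" not in cfg:
        problems.append("feature_name is missing")
    if cfg.get("value_type") not in VALUE_TYPES:
        problems.append("value_type must be Integer, Double or String")
    if cfg.get("feature_type") not in FEATURE_TYPES:
        problems.append("feature_type must be raw_feature or id_feature")

    key_type = cfg.get("gen_key_type")
    val_type = cfg.get("gen_val_type")
    if key_type not in GEN_KEY_TYPES:
        problems.append("unknown gen_key_type")
    if val_type not in GEN_VAL_TYPES:
        problems.append("unknown gen_val_type")

    # Rules stated in the field descriptions.
    if key_type == "mask":
        problems.append("mask is not supported yet")
    if key_type == "idle" and cfg.get("value_type") == "String":
        problems.append("idle is not supported for String features yet")
    if val_type == "multihash_lookup" and key_type != "multihash":
        problems.append("multihash_lookup requires gen_key_type multihash")
    return problems
```

An empty list means the entry passed these checks; it does not guarantee that fg accepts the entry, since fg may apply further rules not described here.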

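Hash configuration (required when gen_key_type is hash)

  • hash_bucket_size: number of buckets for the hash operation; 0 means no bucketing

  • (optional) hash_type: hash method, one of farm | murmur; default farm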
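
As an illustration of what hash_bucket_size controls, here is a minimal Python sketch. It rests on two assumptions that the text above does not confirm: bucketing is taken to be a modulo over the hash value, and MD5 from the standard library stands in for the farm and murmur hashes, so the numbers it produces will not match fg. The function name hash_key is hypothetical.

```python
import hashlib


def hash_key(value, hash_bucket_size):
    """Map a string feature value to an integer key (illustrative only)."""
    # Stand-in hash: MD5, NOT farm or murmur as used by fg.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    raw_hash = int.from_bytes(digest[:8], "little")
    # 0 means no bucketing, so the raw hash value is kept.
    if hash_bucket_size == 0:
        return raw_hash
    # Assumption: bucketing is a modulo over the hash value.
    return raw_hash % hash_bucket_size
```

With hash_bucket_size set to 1000, as in the item_cate_id example, every key falls in the range 0 to 999, which bounds the number of rows the embedding table needs.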

Multihash configuration (required when gen_key_type is multihash)

  • compress_strategy: the multihash configuration

The format is
${prefix}:${bucket1},${bucket2},${bucket3},${bucket4}:${combiner}:${num}
Example
yx:50000,50000,50000,50000:concat:4

Currently combiner supports only concat, and num must be 4.
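
To make the compress_strategy format concrete, the following Python sketch splits a string of that form into its four parts. The names CompressStrategy and parse_compress_strategy are hypothetical, and requiring exactly four bucket sizes is an assumption read off the format line; fg's own parser may behave differently.

```python
from typing import NamedTuple, Tuple


class CompressStrategy(NamedTuple):
    prefix: str
    buckets: Tuple[int, ...]
    combiner: str
    num: int


def parse_compress_strategy(text):
    """Parse '${prefix}:${b1},${b2},${b3},${b4}:${combiner}:${num}'."""
    parts = text.split(":")
    if len(parts) != 4:
        raise ValueError("expected 4 colon-separated parts, got %d" % len(parts))
    prefix, bucket_text, combiner, num_text = parts
    buckets = tuple(int(b) for b in bucket_text.split(","))
    num = int(num_text)
    # Restrictions stated in the text: only concat, and the count is 4.
    if combiner != "concat":
        raise ValueError("only the concat combiner is supported")
    if num != 4:
        raise ValueError("num must be 4")
    # Assumption: the number of bucket sizes equals num.
    if len(buckets) != num:
        raise ValueError("expected %d bucket sizes" % num)
    return CompressStrategy(prefix, buckets, combiner, num)
```

For the value used in the usersex_d_multihash example, parse_compress_strategy("yx:50,50,50,50:concat:4") yields the prefix yx, four bucket sizes of 50, the concat combiner, and num 4.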

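Bucketing configuration (required when gen_key_type is boundary)

  • boundaries: bucket boundaries, given as a comma-separated string such as "0.0,2.0,3.0,5.0,6.0"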
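
A short Python sketch of how a boundaries string could turn a Double value into a bucket index. The edge handling is an assumption: a value equal to a boundary is placed in the bucket to its right, and values below the first or above the last boundary get their own buckets. fg may treat the edges differently. The function names are hypothetical.

```python
from bisect import bisect_right


def parse_boundaries(text):
    """Turn a string such as '0.0,2.0,3.0' into a sorted list of floats."""
    return sorted(float(x) for x in text.split(","))


def bucket_index(value, boundaries):
    """Return the bucket index of value; n boundaries give n + 1 buckets.

    Assumption: bucket i covers [boundaries[i-1], boundaries[i]).
    """
    return bisect_right(boundaries, value)


bounds = parse_boundaries("0.0,2.0,3.0,5.0,6.0")
print(bucket_index(-1.0, bounds))  # below the first boundary
print(bucket_index(2.5, bounds))   # between 2.0 and 3.0
print(bucket_index(6.0, bounds))   # at or above the last boundary
```

Under these assumptions the five boundaries of the uvsum example produce six bucket indices, 0 through 5, and the index then serves as the key for the embedding lookup.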

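Embedding configuration (required when gen_val_type is lookup)

  • embedding_dimension: dimension of the embedding table

  • (optional) shared_name: embedding table name, defaults to feature_name

  • (optional) combiner: how looked-up embeddings are aggregated, one of mean | sum; default mean

  • (optional) trainable: whether gradients update the table during training, default true

  • (optional) emb_type: table type, one of float | bf16 | fp16 | int8. When omitted, the fg-wide configuration applies; the default is float (float32)

  • (optional) emb_device: device the table resides on, one of cuda | cpu. When omitted, the fg-wide configuration applies; the default is cuda

  • (optional) admit_hook: feature admission policy, none by default. Example: {"name": "ReadOnly"}

  • (optional) filter_hook: feature filtering policy, none by default. Example: {"name": "GlobalStepFilter", "params": {"filter_step": 5000}}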
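
The defaults listed above can be applied with a small helper. This is a sketch only: resolve_emb_config is a hypothetical name, and emb_type and emb_device are deliberately left unset when omitted because their effective value comes from the fg-wide configuration, which is not modeled here.

```python
def resolve_emb_config(cfg):
    """Fill in the documented defaults for the embedding fields of one feature."""
    resolved = {
        "embedding_dimension": cfg["embedding_dimension"],
        # shared_name defaults to the feature's own name.
        "shared_name": cfg.get("shared_name", cfg["feature_name"]),
        "combiner": cfg.get("combiner", "mean"),
        "trainable": cfg.get("trainable", True),
    }
    # These fields are copied only when set explicitly; otherwise the
    # fg-wide configuration (or the absence of a hook) applies.
    for key in ("emb_type", "emb_device", "admit_hook", "filter_hook"):
        if key in cfg:
            resolved[key] = cfg[key]
    return resolved
```

Two features whose resolved shared_name is the same refer to the same table, which is how usersex_d_multihash in the earlier example points at the table named usersex_d.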

Sequence features

{
    "features": [
        {
            "feature_name": "pv_item_seq",
            "sequence_length": 200,
            "sequence_name": "pv_item_seq",
            "features": [
                {
                    "xxx": "xxx"
                }
            ]
        }
    ]
}

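Sequence configuration

  • feature_name: sequence prefix name

  • sequence_name: prefix name of the sequence features after processing, usually the same as feature_name

  • sequence_length: maximum sequence length

  • features: configuration of the features under the sequence (note: the actual feature name is sequence prefix + "_" + sub-feature name); each entry follows the non-sequence feature configuration above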
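
The naming rule can be sketched in Python as follows. The interpretation is an assumption inferred from the longpay_seq clone example: the input name uses feature_name as the prefix unless from_feature is given, and the output name uses sequence_name as the prefix. The function name sequence_feature_names is hypothetical.

```python
def sequence_feature_names(seq_cfg):
    """Return (source_name, output_name) pairs for each sub-feature of a sequence."""
    pairs = []
    for sub in seq_cfg["features"]:
        # Input side: an explicit from_feature wins, else prefix + "_" + name.
        default_source = seq_cfg["feature_name"] + "_" + sub["feature_name"]
        source = sub.get("from_feature", default_source)
        # Output side (assumption): sequence_name is the processed prefix.
        output = seq_cfg["sequence_name"] + "_" + sub["feature_name"]
        pairs.append((source, output))
    return pairs


clone_cfg = {
    "feature_name": "longpay_seq",
    "sequence_name": "longpay_seq__context",
    "sequence_length": 200,
    "features": [
        {"feature_name": "lp_cnt", "from_feature": "longpay_seq_lp_cnt"},
    ],
}
print(sequence_feature_names(clone_cfg))
```

For the clone example this pairs longpay_seq_lp_cnt with longpay_seq__context_lp_cnt, matching the description given for that config.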