
How to use the FeatureStore Python SDK

This article describes how to use FeatureStore to integrate feature data, train a model offline, and then deploy it as an online service.

Background information

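FeatureStore is a repository for producing, sharing, and managing machine learning model features. It makes it easy to share features across people and teams, keeps offline and online features consistent, and provides efficient online feature access.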

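FeatureStore suits almost any scenario that needs features; recommendation is a typical one. Once a feature table is registered, FeatureStore builds the online and offline tables automatically and keeps them consistent. Because each feature table is stored only once yet can be shared by many people, resource costs drop. FeatureStore also saves time: operations that used to require complex SQL, such as exporting a training table or importing data into a Hologres table, take a single line of code.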

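FeatureStore currently encapsulates the whole pipeline from features to models. Offline storage supports MaxCompute; online storage supports Hologres, GraphCompute, and TableStore. Developers and algorithm engineers do not need to learn the details of each platform, because every operation is available through the web console or the Python SDK. This improves team efficiency and avoids common pitfalls, such as the offline/online inconsistency often seen in recommendation scenarios.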

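FeatureStore is also deeply integrated with EasyRec, so FG (feature generation) and model training are convenient and efficient, and the result can be deployed directly online. This lets you build a state-of-the-art recommendation system in a short time and achieve good results.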

If you have any questions, search for DingTalk group 32260796 and join it for support.

Prerequisites

PAI is activated and a PAI workspace is created. For details, see the topic on activating PAI and creating a default workspace.

Preparations

Install the FeatureStore Python SDK, which requires Python 3. We recommend running the code in this article in DSW.

! pip install https://feature-store-py.oss-cn-beijing.aliyuncs.com/package/feature_store_py-1.3.1-py3-none-any.whl

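Creating a FeatureStoreClient requires the access_key_id and access_key_secret of your Alibaba Cloud account. We recommend passing them through environment variables to reduce the risk of leaks. After you open the DSW instance, click Terminal at the top to open a terminal, and run the following commands:

For access_key_id, replace YOUR_AccessKey_ID with your AccessKey ID: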

echo "export AccessKeyID='YOUR_AccessKey_ID'" >> ~/.bashrc
source ~/.bashrc

For access_key_secret, replace YOUR_Access_Key_Secret with your AccessKey secret:

echo "export AccessKeySecret='YOUR_Access_Key_Secret'" >> ~/.bashrc
source ~/.bashrc
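If either variable is missing, os.getenv silently returns None and the failure only shows up later when the client is created. The sketch below is one way to fail early. The helper name load_credentials is hypothetical and not part of the SDK; only the variable names AccessKeyID and AccessKeySecret come from the commands above.

```python
import os


def load_credentials(env=None):
    """Return (access_key_id, access_key_secret) from environment variables.

    Hypothetical helper, not part of feature_store_py. `env` defaults to
    os.environ; a plain dict can be passed instead for testing.
    """
    env = os.environ if env is None else env
    required = ("AccessKeyID", "AccessKeySecret")
    # Treat unset and empty values the same way: both are unusable.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["AccessKeyID"], env["AccessKeySecret"]
```

In a notebook you could call access_id, access_ak = load_credentials() before building the client, so that a forgotten export is reported by name.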

Import the required modules.

import unittest
import sys
import os
from os.path import dirname, join, abspath
from feature_store_py.fs_client import FeatureStoreClient, build_feature_store_client
from feature_store_py.fs_project import FeatureStoreProject
from feature_store_py.fs_datasource import UrlDataSource, MaxComputeDataSource, DatahubDataSource, HologresDataSource, SparkDataSource, LabelInput, TrainingSetOutput
from feature_store_py.fs_type import FSTYPE
from feature_store_py.fs_schema import OpenSchema, OpenField
from feature_store_py.fs_feature_view import FeatureView
from feature_store_py.fs_features import FeatureSelector
from feature_store_py.fs_config import EASDeployConfig, LabelInputConfig, PartitionConfig, FeatureViewConfig, TrainSetOutputConfig
import logging
logger = logging.getLogger("foo")
logger.addHandler(logging.StreamHandler(stream=sys.stdout))

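Dataset

This article uses the open-source movie dataset Moviedata-10M as an example. It mainly uses the Movie, User, and Rating data, which correspond to the item table, user table, and label table in a recommendation pipeline.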

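Configure the feature project

You can create multiple projects in FeatureStore; each project is independent. For details, see the topic on configuring a FeatureStore project. The notebook needs the FeatureStore server side to run, so after activating FeatureStore you must configure data sources. For details, see the topic on configuring data sources.

Here, offline_datasource_id is the ID of the offline data source, and online_datasource_id is the ID of the online data source.

The following example uses a project named fs_movie.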

# Read the access_key_id of your Alibaba Cloud account
access_id = os.getenv("AccessKeyID")
# Read the access_key_secret of your Alibaba Cloud account
access_ak = os.getenv("AccessKeySecret")
# Region where FeatureStore is activated; China (Hangzhou) is used here
region = 'cn-hangzhou'
fs = FeatureStoreClient(access_key_id=access_id, access_key_secret=access_ak, region=region)
# Name of your FeatureStore project; fs_movie is used here
cur_project_name = "fs_movie"
project = fs.get_project(cur_project_name)
if project is None:
    raise ValueError("Need to create project : fs_movie")

Run the following code to get the current project and print its summary.

project = fs.get_project(cur_project_name)
project.print_summary()

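Configure feature entities (FeatureEntity)

A feature entity describes a group of related features. Multiple feature views can be associated with one feature entity. Each entity has a JoinId, which is used to join features across feature views. Each feature view has a primary key (index key) for retrieving its feature data, and the name of that key may differ from the name defined by the JoinId.

The following example creates three entities: Movie, User, and Rating.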

cur_entity_name_movie = "movie_data"
join_id = 'movie_id'
entity_movie = project.get_entity(cur_entity_name_movie)
if entity_movie is None:
  entity_movie = project.create_entity(name = cur_entity_name_movie, join_id=join_id)
entity_movie.print_summary()
cur_entity_name_user = "user_data"
join_id = 'user_md5'
entity_user = project.get_entity(cur_entity_name_user)
if entity_user is None:
  entity_user = project.create_entity(name = cur_entity_name_user, join_id=join_id)
entity_user.print_summary()
cur_entity_name_ratings = "rating_data"
join_id = 'rating_id'
entity_ratings = project.get_entity(cur_entity_name_ratings)
if entity_ratings is None:
  entity_ratings = project.create_entity(name = cur_entity_name_ratings, join_id=join_id)
entity_ratings.print_summary()

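Configure feature views (FeatureView)

FeatureStore is a platform dedicated to managing and organizing feature data, and external data enters it through feature views. A feature view defines where the data comes from (DataSource), which preprocessing or transformation to apply (feature engineering/Transformation), the structure of the features (the feature schema, including feature names and types), and where the data is stored (OnlineStore/OfflineStore). It also manages feature metadata such as the primary key, event time, partition key, feature entity, and time to live (ttl). The default ttl of -1 means the data never expires; a positive value means online queries return the latest feature data within the ttl.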
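The ttl rule can be made concrete with a small standalone sketch. It does not call the SDK and is not how the online store is implemented; it only mirrors the stated rule. The function name latest_within_ttl is hypothetical, and treating ttl as a number of seconds is an assumption, since the unit is not stated here.

```python
from datetime import datetime, timedelta


def latest_within_ttl(rows, now, ttl_seconds=-1):
    """Pick the newest value from (event_time, value) pairs, honoring a ttl.

    Assumption: ttl is expressed in seconds. A ttl of -1 keeps every row;
    a positive ttl drops rows older than `now - ttl` before picking.
    """
    if ttl_seconds > 0:
        cutoff = now - timedelta(seconds=ttl_seconds)
        rows = [row for row in rows if row[0] >= cutoff]
    if not rows:
        return None  # nothing is fresh enough to serve
    return max(rows, key=lambda row: row[0])[1]


now = datetime(2022, 8, 31)
history = [
    (datetime(2022, 8, 29), "old"),
    (datetime(2022, 8, 30, 12), "new"),
]
print(latest_within_ttl(history, now))         # permanent: newest row wins
print(latest_within_ttl(history, now, 3600))   # one hour: no row qualifies
```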

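Feature views come in three types:

  • BatchFeatureView: offline features, or T-1 (previous day) features. Offline data is loaded into the OfflineStore of FeatureStore and can be synchronized to the OnlineStore as needed to support real-time queries.

  • StreamFeatureView: real-time features. Data is written directly to the OnlineStore and synchronized to the OfflineStore at the same time.

  • Sequence FeatureView: sequence features. Supports writing sequence features offline, and querying and reading real-time sequence features.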
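The partition values used later in this article, such as '20220830', are yyyymmdd strings. For a daily batch job that loads T-1 features, the partition value can be derived from the run date. The helper below is a hypothetical convenience, not part of the SDK, and assumes one partition per calendar day.

```python
from datetime import date, timedelta


def t_minus_1_partition(run_date):
    """Return the yyyymmdd partition value for the day before run_date."""
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")


# A job running on 2022-08-31 would write the 20220830 partition.
print(t_minus_1_partition(date(2022, 8, 31)))
```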

BatchFeatureView

If the data is in a CSV file that is downloaded from a URL and written to MaxCompute, you must create the schema of the FeatureView manually.

path = 'https://feature-store-test.oss-cn-beijing.aliyuncs.com/dataset/moviedata_all/movies.csv'
delimiter = ','
omit_header = True
ds = UrlDataSource(path, delimiter, omit_header)
print(ds)
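To check a file locally before writing it, the two parsing parameters can be reproduced with the standard csv module. This is only an illustration of what a delimiter and a skipped header row conventionally mean; it is not the SDK's loader, and reading omit_header as "skip the first line" is an assumption. The function name and the sample rows are made up.

```python
import csv
import io


def read_delimited(text, delimiter=",", omit_header=True):
    """Parse delimited text into rows, optionally dropping the header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return rows[1:] if omit_header else rows


# Made-up sample; quoted fields may contain the delimiter itself.
sample = 'movie_id,name\n1,Foo\n2,"Bar, Baz"\n'
for row in read_delimited(sample):
    print(row)
```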

The schema defines the name and type of each field.

movie_schema = OpenSchema(
    OpenField(name='movie_id', type='STRING'),
    OpenField(name='name', type='STRING'),
    OpenField(name='alias', type='STRING'),
    OpenField(name='actores', type='STRING'),
    OpenField(name='cover', type='STRING'),
    OpenField(name='directors', type='STRING'),
    OpenField(name='double_score', type='STRING'),
    OpenField(name='double_votes', type='STRING'),
    OpenField(name='genres', type='STRING'),
    OpenField(name='imdb_id', type='STRING'),
    OpenField(name='languages', type='STRING'),
    OpenField(name='mins', type='STRING'),
    OpenField(name='official_site', type='STRING'),
    OpenField(name='regions', type='STRING'),
    OpenField(name='release_data', type='STRING'),
    OpenField(name='slug', type='STRING'),
    OpenField(name='story', type='STRING'),
    OpenField(name='tags', type='STRING'),
    OpenField(name='year', type='STRING'),
    OpenField(name='actor_ids', type='STRING'),
    OpenField(name='director_ids', type='STRING'),
    OpenField(name='dt', type='STRING')
)
print(movie_schema)

Create the batch_feature_view.

feature_view_movie_name = "feature_view_movie"
batch_feature_view = project.get_feature_view(feature_view_movie_name)
if batch_feature_view is None:
  batch_feature_view = project.create_batch_feature_view(name=feature_view_movie_name, schema=movie_schema, online = True, entity= cur_entity_name_movie, primary_key='movie_id', partitions=['dt'], ttl=-1)
batch_feature_view = project.get_feature_view(feature_view_movie_name)
batch_feature_view.print_summary()

Write the data to the MaxCompute table.

cur_task = batch_feature_view.write_table(ds, partitions={'dt':'20220830'})
cur_task.wait()

View the summary of the current task.

print(cur_task.task_summary)

Synchronize the data to the OnlineStore.

cur_task = batch_feature_view.publish_table({'dt':'20220830'})
cur_task.wait()
print(cur_task.task_summary)

Get the corresponding FeatureView.

batch_feature_view = project.get_feature_view(feature_view_movie_name)

Print the summary of the FeatureView.

batch_feature_view.print_summary()

Import the users table and the ratings table in the same way.

users_path = 'https://feature-store-test.oss-cn-beijing.aliyuncs.com/dataset/moviedata_all/users.csv'
ds = UrlDataSource(users_path, delimiter, omit_header)
print(ds)
user_schema = OpenSchema(
  OpenField(name='user_md5', type='STRING'),
  OpenField(name='user_nickname', type='STRING'),
  OpenField(name='ds', type='STRING')
)
print(user_schema)
feature_view_user_name = "feature_view_users"
batch_feature_view = project.get_feature_view(feature_view_user_name)
if batch_feature_view is None:
  batch_feature_view = project.create_batch_feature_view(name=feature_view_user_name, schema=user_schema, online = True, entity= cur_entity_name_user, primary_key='user_md5',ttl=-1, partitions=['ds'])
write_table_task = batch_feature_view.write_table(ds, {'ds':'20220830'})
write_table_task.wait()
print(write_table_task.task_summary)
cur_task = batch_feature_view.publish_table({'ds':'20220830'})
cur_task.wait()
print(cur_task.task_summary)
batch_feature_view = project.get_feature_view(feature_view_user_name)
batch_feature_view.print_summary()
ratings_path = 'https://feature-store-test.oss-cn-beijing.aliyuncs.com/dataset/moviedata_all/ratings.csv'
ds = UrlDataSource(ratings_path, delimiter, omit_header)
print(ds)
ratings_schema = OpenSchema(
  OpenField(name='rating_id', type='STRING'),
  OpenField(name='user_md5', type='STRING'),
  OpenField(name='movie_id', type='STRING'),
  OpenField(name='rating', type='STRING'),
  OpenField(name='rating_time', type='STRING'),
  OpenField(name='dt', type='STRING')
)
feature_view_rating_name = "feature_view_ratings"
batch_feature_view = project.get_feature_view(feature_view_rating_name)
if batch_feature_view is None:
  batch_feature_view = project.create_batch_feature_view(name=feature_view_rating_name, schema=ratings_schema, online = True, entity= cur_entity_name_ratings, primary_key='rating_id', event_time='rating_time', partitions=['dt'])
cur_task = batch_feature_view.write_table(ds, {'dt':'20220831'})
cur_task.wait()
print(cur_task.task_summary)
batch_feature_view = project.get_feature_view(feature_view_rating_name)
batch_feature_view.print_summary()

Register the label table.

label_table_name = 'fs_movie_feature_view_ratings_offline'
ds = MaxComputeDataSource(data_source_id=project.offline_datasource_id, table=label_table_name)
label_table = project.get_label_table(label_table_name)
if label_table is None:
  label_table = project.create_label_table(datasource=ds, event_time='rating_time')

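Configure the offline data source (OfflineStore)

The offline store is the data warehouse that holds offline feature data. It resides in MaxCompute or in HDFS on DS, and data is written through Spark. From the offline data source you can generate sample data (a TrainingSet) for model training, or batch prediction data for batch inference.

Configure the online data source (OnlineStore)

Online prediction needs low-latency access to feature data, and the online data source stores the online feature data. Hologres, Tablestore, and GraphCompute are currently supported first.

Get online features

Retrieve online features from the perspective of a feature view. Hologres is currently supported first.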

feature_view_movie_name = "feature_view_movie"
batch_feature_view = project.get_feature_view(feature_view_movie_name)
ret_features_1 = batch_feature_view.get_online_features(join_ids={'movie_id':['26357307']}, features=['name', 'actores', 'regions'])
print("ret_features = ", ret_features_1)
feature_view_movie_name = "feature_view_movie"
batch_feature_view = project.get_feature_view(feature_view_movie_name)
ret_features_2 = batch_feature_view.get_online_features(join_ids={'movie_id':['30444960', '3317352']}, features=['name', 'actores', 'regions'])
print("ret_features = ", ret_features_2)

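Configure FeatureSelector

When you retrieve features from the offline or online data source, you must state explicitly which features to get. You can select features from the perspective of a feature view.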

feature_view_name = 'feature_view_movie'
# Select a subset of features (names taken from movie_schema above)
feature_selector = FeatureSelector(feature_view_name, ['name', 'genres'])

# Select all features
feature_selector = FeatureSelector(feature_view_name, '*')

# Aliases are supported ('user1' and f1..f3 are placeholder names)
feature_selector = FeatureSelector(
    feature_view='user1',
    features = ['f1','f2', 'f3'],
    alias={"f1":"f1_1"} # field alias; the output field is named f1_1
)
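The effect of a feature list and an alias map on a single row can be sketched in plain Python. This is not the SDK's implementation, only an illustration of the renaming described in the comment above. The function name select_features and the sample row are hypothetical.

```python
def select_features(row, features, alias=None):
    """Project a row (dict) onto the chosen features, renaming aliased ones.

    `features` is either a list of names or the string '*' for all fields.
    """
    alias = alias or {}
    names = list(row) if features == "*" else features
    # An aliased field appears under its new name; others keep their name.
    return {alias.get(name, name): row[name] for name in names}


row = {"f1": 1, "f2": 2, "f3": 3, "f4": 4}
print(select_features(row, ["f1", "f2", "f3"], alias={"f1": "f1_1"}))
```

With the alias applied, the output contains f1_1, f2, and f3, and f4 is dropped because it was not selected.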

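Configure the sample table (TrainingSet)

To train a model, you first build a sample table, which consists of label data and feature data. When working with FeatureStore, you provide the label data and define the names of the features to retrieve; FeatureStore then joins them on the primary key, using a point-in-time join when an event_time exists.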
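A point-in-time join attaches to each label row the most recent feature row with the same key whose event time is not later than the label's event time, which prevents future information from leaking into training samples. The following is a minimal in-memory sketch of that idea, not the engine FeatureStore uses. The function name, the field names event_time and level, and the sample rows are all hypothetical.

```python
from bisect import bisect_right


def point_in_time_join(labels, features, key, time_field):
    """Join each label with the latest feature row at or before its time."""
    # Group feature rows by key and sort each group by event time.
    by_key = {}
    for feature in features:
        by_key.setdefault(feature[key], []).append(feature)
    for rows in by_key.values():
        rows.sort(key=lambda r: r[time_field])

    joined = []
    for label in labels:
        rows = by_key.get(label[key], [])
        times = [r[time_field] for r in rows]
        # Index of the first feature row strictly later than the label.
        idx = bisect_right(times, label[time_field])
        match = rows[idx - 1] if idx > 0 else {}
        merged = dict(label)
        for name, value in match.items():
            if name not in (key, time_field):
                merged[name] = value
        joined.append(merged)
    return joined


labels = [
    {"user_md5": "u1", "event_time": "2022-08-31", "rating": "5"},
    {"user_md5": "u2", "event_time": "2022-08-31", "rating": "3"},
]
features = [
    {"user_md5": "u1", "event_time": "2022-09-01", "level": "gold"},
    {"user_md5": "u1", "event_time": "2022-08-30", "level": "silver"},
]
result = point_in_time_join(labels, features, "user_md5", "event_time")
```

Here the u1 label picks up the 2022-08-30 feature row rather than the later 2022-09-01 one, and the u2 label keeps only its own fields because no feature row exists for it. ISO-formatted date strings are used so that string order matches time order.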

label_table_name = 'fs_movie_feature_view_ratings_offline'
output_ds = MaxComputeDataSource(data_source_id=project.offline_datasource_id)
train_set_output = TrainingSetOutput(output_ds)
feature_view_movie_name = "feature_view_movie"
feature_movie_selector = FeatureSelector(feature_view_movie_name, ['name', 'actores', 'regions','tags'])
feature_view_user_name = 'feature_view_users'
feature_user_selector = FeatureSelector(feature_view_user_name, ['user_nickname'])
train_set = project.create_training_set(label_table_name=label_table_name, train_set_output= train_set_output, feature_selectors=[feature_movie_selector, feature_user_selector])
print("train_set = ", train_set)

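Train the model (Model)

After a model is trained and deployed as a service, it is used for business prediction. The training samples can be obtained from the train_set above.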

model_name = "fs_rank_v1"
cur_model = project.get_model(model_name)
if cur_model is None:
  cur_model = project.create_model(model_name, train_set)
print("cur_model_train_set_table_name = ", cur_model.train_set_table_name)

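Export the sample table

For actual training, you need to export the sample table. Specify the partition and event_time for the label table and for each feature view.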

label_partitions = PartitionConfig(name = 'dt', value = '20220831')
label_input_config = LabelInputConfig(partition_config=label_partitions, event_time='1999-01-01 00:00:00')

movie_partitions = PartitionConfig(name = 'dt', value = '20220830')
feature_view_movie_config = FeatureViewConfig(name = 'feature_view_movie', partition_config=movie_partitions)

user_partitions = PartitionConfig(name = 'ds', value = '20220830')
feature_view_user_config = FeatureViewConfig(name = 'feature_view_users', partition_config=user_partitions)
feature_view_config_list = [feature_view_movie_config, feature_view_user_config]
train_set_partitions = PartitionConfig(name = 'dt', value = '20220831')
train_set_output_config = TrainSetOutputConfig(partition_config=train_set_partitions)

Export the sample table according to the specified conditions.

task = cur_model.export_train_set(label_input_config, feature_view_config_list, train_set_output_config)
task.wait()
print(task.task_summary)