EasyRec supports two Hive file storage formats: CSV and Parquet. This topic uses an example to describe how to train, evaluate, and run predictions with an EasyRec model on a Data Science cluster based on Hive.
Prerequisites
- A Hadoop cluster has been created. For details, see Create a cluster.
- A DataScience cluster has been created with the EasyRec and TensorFlow services selected. For details, see Create a cluster.
- The dsdemo code has been downloaded: if you have created a DataScience cluster, search for DingTalk group number 32497587 in DingTalk and join the group to obtain the dsdemo code.
Procedure
- Open the security group. Add all public IP addresses of the DataScience cluster to the security group of the Hadoop cluster, opening ports 10000 (HiveServer2) and 9000 (HDFS). For details, see Add security group rules.
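Once the rules are in place, reachability of the two ports can be sanity-checked from the DataScience cluster. A minimal sketch; the address 192.168.1.100 below is a placeholder for your Hadoop header node's IP:

```shell
# check_port HOST PORT: succeeds if a TCP connection can be opened.
check_port() {
  timeout 3 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null
}

# 192.168.1.100 is a placeholder -- substitute your Hadoop header node's IP.
for port in 10000 9000; do
  if check_port 192.168.1.100 "$port"; then
    echo "port $port reachable"
  else
    echo "port $port blocked"
  fi
done
```

If a port reports as blocked, re-check the security group rules before continuing.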
- Modify the files in the ml_on_ds directory.
- Upload the downloaded dsdemo*.zip to the header node of the DataScience cluster.
- Connect to the DataScience cluster over SSH. For details, see Log on to a cluster.
- Decompress dsdemo*.zip.
- Modify the easyrec_model.config file in the ml_on_ds directory. Use ./ml_on_ds/testdata/dssm_hiveinput/dssm_hive_csv_input.config or dssm_hive_parquet_input.config as easyrec_model.config:
yes | cp ./testdata/dssm_hiveinput/dssm_hive_csv_input.config ./easyrec_model.config
A snippet of easyrec_model.config is shown below. Modify host and username according to your cluster information.
hive_train_input {
  host: "*.*.*.*"
  username: "admin"
  port: 10000
}
hive_eval_input {
  host: "*.*.*.*"
  username: "admin"
  port: 10000
}
train_config {
  ...
}
eval_config {
  ...
}
data_config {
  input_type: HiveInput
  eval_batch_size: 1024
  ...
}
Parameter descriptions:
- host: the public IP address of the header node of the Hadoop cluster.
- username: the LDAP username of the Hadoop cluster. Default value: admin.
- input_type: the input type. EasyRec supports two Hive-related input types: HiveInput and HiveParquetInput.
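To avoid editing the file by hand, the host and username fields can be patched with sed. A minimal sketch; set_hive_conn is a hypothetical helper name, and the IP in the commented example is a placeholder:

```shell
# set_hive_conn FILE HOST USER: rewrite the host/username fields in an
# EasyRec config file such as easyrec_model.config.
set_hive_conn() {
  sed -i "s/host: \"[^\"]*\"/host: \"$2\"/g; s/username: \"[^\"]*\"/username: \"$3\"/g" "$1"
}

# Example (placeholder address -- use your Hadoop header node's public IP):
# set_hive_conn easyrec_model.config 192.168.1.100 admin
```

The same helper works for both hive_train_input and hive_eval_input, since it rewrites every matching field in the file.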
- Build the image. Modify the config file in the ml_on_ds directory and set DATABASE, TRAIN_TABLE_NAME, EVAL_TABLE_NAME, PREDICT_TABLE_NAME, PREDICT_OUTPUT_TABLE_NAME, and PARTITION_NAME. Set the container registry address, and make sure the registry is private and is an enterprise-edition instance. After the configuration is complete, run the make build push command to build and push the image.
- Prepare the data.
- Run the following command in the ml_on_ds directory to start a spark-sql session for accessing Hive:
spark-sql
- Run the following statements to prepare the test data:
create database pai_online_projects location 'hdfs://192.168.*.*:9000/pai_online_projects.db';

-- --------------------- CSV data ---------------------
CREATE TABLE pai_online_projects.easyrec_demo_taobao_train_data(
    `clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING,
    `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING,
    `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING,
    `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING,
    `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING,
    `tag_brand_list` STRING, `price` BIGINT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_train_data';

CREATE TABLE pai_online_projects.easyrec_demo_taobao_test_data(
    `clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING,
    `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING,
    `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING,
    `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING,
    `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING,
    `tag_brand_list` STRING, `price` BIGINT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_test_data';

load data local inpath './testdata/taobao/19700101/train/easyrec_demo_taobao_train_data_10000.csv' into table pai_online_projects.easyrec_demo_taobao_train_data partition(dt = 'yyyymmdd');
load data local inpath './testdata/taobao/19700101/test/easyrec_demo_taobao_test_data_1000.csv' into table pai_online_projects.easyrec_demo_taobao_test_data partition(dt = 'yyyymmdd');

-- --------------------- Parquet data ---------------------
CREATE TABLE pai_online_projects.easyrec_demo_taobao_train_data_parquet(
    `clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING,
    `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING,
    `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING,
    `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING,
    `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING,
    `tag_brand_list` STRING, `price` BIGINT)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_train_data_parquet';

CREATE TABLE pai_online_projects.easyrec_demo_taobao_test_data_parquet(
    `clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING,
    `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING,
    `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING,
    `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING,
    `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING,
    `tag_brand_list` STRING, `price` BIGINT)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_test_data_parquet';

insert into pai_online_projects.easyrec_demo_taobao_train_data_parquet partition(dt='yyyymmdd')
select clk, buy, pid, adgroup_id, cate_id, campaign_id, customer, brand, user_id,
       cms_segid, cms_group_id, final_gender_code, age_level, pvalue_level,
       shopping_level, occupation, new_user_class_level, tag_category_list,
       tag_brand_list, price
from pai_online_projects.easyrec_demo_taobao_train_data
where dt = 'yyyymmdd';

insert into pai_online_projects.easyrec_demo_taobao_test_data_parquet partition(dt='yyyymmdd')
select clk, buy, pid, adgroup_id, cate_id, campaign_id, customer, brand, user_id,
       cms_segid, cms_group_id, final_gender_code, age_level, pvalue_level,
       shopping_level, occupation, new_user_class_level, tag_category_list,
       tag_brand_list, price
from pai_online_projects.easyrec_demo_taobao_test_data
where dt = 'yyyymmdd';
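With the demo tables in place, the variables set in the config file during the image-build step can be filled in accordingly. A sketch assuming shell-style KEY=value syntax (the actual file's syntax may differ slightly); PREDICT_OUTPUT_TABLE_NAME is a hypothetical output table name, and the test table is reused here as the prediction input:

```shell
# Sketch values only -- align them with your own database, tables, and partition.
DATABASE=pai_online_projects
TRAIN_TABLE_NAME=easyrec_demo_taobao_train_data
EVAL_TABLE_NAME=easyrec_demo_taobao_test_data
PREDICT_TABLE_NAME=easyrec_demo_taobao_test_data
PREDICT_OUTPUT_TABLE_NAME=easyrec_demo_taobao_predict_result  # hypothetical output table
PARTITION_NAME=yyyymmdd  # replace with the real partition value used at load time
```

Use the Parquet table names instead if input_type is HiveParquetInput.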
- Run the following command to start model training:
kubectl apply -f tfjob_easyrec_training_hive.yaml
- Run the following command to start model evaluation:
kubectl apply -f tfjob_easyrec_eval_hive.yaml
- Run the following command to start model prediction:
kubectl apply -f tfjob_easyrec_predict_hive.yaml
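After each kubectl apply, the submitted TFJob can be followed with kubectl. A minimal sketch; the job name in the commented example and the tf-job-name pod label are assumptions -- read the real job name from the metadata.name field of the YAML you applied:

```shell
# watch_tfjob NAME: show a TFJob's status and recent pod logs.
# The tf-job-name label is an assumption about how the operator labels pods.
watch_tfjob() {
  kubectl get tfjob "$1" -o wide
  kubectl logs -l tf-job-name="$1" --tail=20
}

# Example (hypothetical job name):
# watch_tfjob easyrec-training
```

Training is complete when the TFJob's status shows Succeeded; the evaluation and prediction jobs can be watched the same way.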