EasyRec supports two Hive file storage formats: CSV and Parquet. This topic uses examples to describe how to train, evaluate, and run predictions with an EasyRec model on a Data Science cluster based on Hive.

Prerequisites

  • A Hadoop cluster is created. For more information, see Create a cluster.
  • A DataScience cluster is created with the EasyRec and TensorFlow services selected. For more information, see Create a cluster.
  • The dsdemo code is downloaded: if you have created a DataScience cluster, search for group number 32497587 in DingTalk and join the group to obtain the dsdemo code.

Procedure

  1. Open ports in the security group.
    Add all public IP addresses of the DataScience cluster to the security group of the Hadoop cluster, opening ports 10000 and 9000. For more information, see Add security group rules.
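To confirm the rules took effect, you can run a quick reachability check from a DataScience node. This is a hypothetical sketch, not part of the official steps; HADOOP_MASTER_IP is a placeholder you must replace with the Hadoop header node's public IP address.

```shell
# Hypothetical connectivity check; replace HADOOP_MASTER_IP with the Hadoop
# header node's address. Uses bash's /dev/tcp pseudo-device to probe each port.
HADOOP_MASTER_IP=${HADOOP_MASTER_IP:-127.0.0.1}
RESULT=""
for port in 10000 9000; do
  if timeout 3 bash -c "exec 3<>/dev/tcp/${HADOOP_MASTER_IP}/${port}" 2>/dev/null; then
    RESULT="${RESULT}${port}=open "
  else
    RESULT="${RESULT}${port}=closed "
  fi
done
echo "$RESULT"
```

Port 10000 is HiveServer2 and port 9000 is the HDFS NameNode; both must report open before the later steps can succeed.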
  2. Modify the files in the ml_on_ds directory.
    1. Upload the downloaded dsdemo*.zip to the header node of the DataScience cluster.
    2. Connect to the DataScience cluster over SSH. For more information, see Log on to a cluster.
    3. Decompress dsdemo*.zip.
    4. Modify the easyrec_model.config file in the ml_on_ds directory.
      Use ./ml_on_ds/testdata/dssm_hiveinput/dssm_hive_csv_input.config or dssm_hive_parquet_input.config as easyrec_model.config. For example:
      yes | cp ./testdata/dssm_hiveinput/dssm_hive_csv_input.config ./easyrec_model.config

      The following snippet of easyrec_model.config shows the relevant fields. Modify host and username according to your cluster information:

      hive_train_input {
          host: "*.*.*.*"
          username: "admin"
          port: 10000
      }
      hive_eval_input {
          host: "*.*.*.*"
          username: "admin"
          port: 10000
      }
      train_config {
          ...
      }
      eval_config {
          ...
      }
      data_config {
          input_type: HiveInput
          eval_batch_size: 1024
          ...
      }
      Parameter   Description
      host        Public IP address of the header node of the Hadoop cluster.
      username    LDAP username of the Hadoop cluster. Default value: admin.
      input_type  Input type. EasyRec supports two Hive input types: HiveInput and HiveParquetInput.
  3. Build the image.
    Modify the config file in the ml_on_ds directory: set DATABASE, TRAIN_TABLE_NAME, EVAL_TABLE_NAME, PREDICT_TABLE_NAME, PREDICT_OUTPUT_TABLE_NAME, and PARTITION_NAME, and set the address of your container image registry, making sure the registry is private and of the enterprise edition. After the configuration is complete, run the make build push command to build and push the image.
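The topic does not show the config file itself, so the following is a hypothetical sketch of the entries to set, written as shell-style assignments. The variable names come from this topic; the values are placeholders matching the demo tables created in the data-preparation step, and PREDICT_OUTPUT_TABLE_NAME is an invented example name.

```shell
# Hypothetical config values; replace every value with your own.
DATABASE=pai_online_projects
TRAIN_TABLE_NAME=easyrec_demo_taobao_train_data
EVAL_TABLE_NAME=easyrec_demo_taobao_test_data
PREDICT_TABLE_NAME=easyrec_demo_taobao_test_data
PREDICT_OUTPUT_TABLE_NAME=easyrec_demo_predict_output   # hypothetical name
PARTITION_NAME=yyyymmdd                                 # placeholder partition
echo "${DATABASE}.${TRAIN_TABLE_NAME} partition ${PARTITION_NAME}"
```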
  4. Prepare the data.
    1. Run the following command in the ml_on_ds directory to start the Spark SQL CLI:
      spark-sql
    2. Execute the following statements to prepare the test data:
      create database pai_online_projects location 'hdfs://192.168.*.*:9000/pai_online_projects.db';
      
      --------------------- CSV data ---------------------
      
      CREATE TABLE pai_online_projects.easyrec_demo_taobao_train_data(`clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING, `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING, `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING, `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING, `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING, `tag_brand_list` STRING, `price` BIGINT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      partitioned by(dt string)
      LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_train_data';
      
      
      CREATE TABLE pai_online_projects.easyrec_demo_taobao_test_data(`clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING, `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING, `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING, `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING, `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING, `tag_brand_list` STRING, `price` BIGINT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      partitioned by(dt string)
      LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_test_data';
      
      
      load data local inpath './testdata/taobao/19700101/train/easyrec_demo_taobao_train_data_10000.csv' into table pai_online_projects.easyrec_demo_taobao_train_data partition(dt = 'yyyymmdd');
      
      load data local inpath './testdata/taobao/19700101/test/easyrec_demo_taobao_test_data_1000.csv' into table pai_online_projects.easyrec_demo_taobao_test_data partition(dt = 'yyyymmdd');
      
      --------------------- Parquet data ---------------------
      
      CREATE TABLE pai_online_projects.easyrec_demo_taobao_train_data_parquet(`clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING, `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING, `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING, `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING, `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING, `tag_brand_list` STRING, `price` BIGINT)
      STORED AS PARQUET
      partitioned by(dt string)
      LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_train_data_parquet';
      
      
      CREATE TABLE pai_online_projects.easyrec_demo_taobao_test_data_parquet(`clk` BIGINT, `buy` BIGINT, `pid` STRING, `adgroup_id` STRING, `cate_id` STRING, `campaign_id` STRING, `customer` STRING, `brand` STRING, `user_id` STRING, `cms_segid` STRING, `cms_group_id` STRING, `final_gender_code` STRING, `age_level` STRING, `pvalue_level` STRING, `shopping_level` STRING, `occupation` STRING, `new_user_class_level` STRING, `tag_category_list` STRING, `tag_brand_list` STRING, `price` BIGINT)
      STORED AS PARQUET
      partitioned by(dt string)
      LOCATION 'hdfs://192.168.*.*:9000/pai_online_projects.db/easyrec_demo_taobao_test_data_parquet';
      
      
      insert into pai_online_projects.easyrec_demo_taobao_train_data_parquet partition(dt='yyyymmdd')
      select clk , buy , pid , adgroup_id , cate_id , campaign_id , customer , brand , user_id , cms_segid , cms_group_id , final_gender_code , age_level , pvalue_level , shopping_level , occupation , new_user_class_level , tag_category_list , tag_brand_list , price from pai_online_projects.easyrec_demo_taobao_train_data where dt = 'yyyymmdd';
      
      
      insert into pai_online_projects.easyrec_demo_taobao_test_data_parquet partition(dt='yyyymmdd')
      select clk , buy , pid , adgroup_id , cate_id , campaign_id , customer , brand , user_id , cms_segid , cms_group_id , final_gender_code , age_level , pvalue_level , shopping_level , occupation , new_user_class_level , tag_category_list , tag_brand_list , price from pai_online_projects.easyrec_demo_taobao_test_data where dt = 'yyyymmdd';
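After the load and insert statements finish, a row count per partition confirms the data landed where expected. This is a hypothetical verification step, not part of the official procedure; the loop below only prints the spark-sql commands to run inside the cluster.

```shell
# Hypothetical sanity check: print spark-sql commands that count rows per
# partition in the loaded CSV tables (table names as created above).
CMD=""
for table in easyrec_demo_taobao_train_data easyrec_demo_taobao_test_data; do
  CMD="spark-sql -e \"select dt, count(*) from pai_online_projects.${table} group by dt;\""
  echo "$CMD"
done
```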
                                      
  5. Run the following command to train the model:
     kubectl apply -f tfjob_easyrec_training_hive.yaml
  6. Run the following command to evaluate the model:
     kubectl apply -f tfjob_easyrec_eval_hive.yaml
  7. Run the following command to run prediction with the model:
     kubectl apply -f tfjob_easyrec_predict_hive.yaml
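Each kubectl apply submits a TFJob; progress can then be followed with standard kubectl commands. The commands below are an assumption (the exact TFJob and pod names are defined inside the YAML files in ml_on_ds and may differ in your dsdemo version), so they are printed rather than executed here.

```shell
# Hypothetical monitoring commands; <worker-pod-name> is a placeholder for a
# pod name reported by "kubectl get pods".
for cmd in \
  "kubectl get tfjobs" \
  "kubectl get pods" \
  "kubectl logs -f <worker-pod-name>"; do
  echo "$cmd"
done
```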