ClassiSage: Terraform IaC Automated AWS SageMaker based HDFS Log Classification Model

Barbara Streisand
Published: 2024-10-26 05:04:30

ClassiSage

A machine learning model built with AWS SageMaker and its Python SDK to classify HDFS logs, with the infrastructure setup automated using Terraform.

Link: GitHub
Language: HCL (Terraform), Python

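Contents

  • Overview: Project overview.
  • System Architecture: System architecture diagram.
  • ML Model: Model overview.
  • Getting Started: How to run the project.
  • Console Observations: Changes in instances and infrastructure that can be observed while running the project.
  • Ending and Cleanup: Ensuring no additional charges.
  • Auto-Created Objects: Files and folders created during execution.

  • First, follow the directory structure for a smoother project setup.
  • Take the ClassiSage project repository on GitHub as the main reference for better understanding.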

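Overview

  • The model is built with AWS SageMaker for classifying HDFS logs, along with S3 for storing the dataset, the notebook file (containing the code for the SageMaker instance), and the model output.
  • The infrastructure setup is automated using Terraform, an Infrastructure-as-Code tool created by HashiCorp.
  • The dataset used is HDFS_v1.
  • The project uses the SageMaker Python SDK with the XGBoost version 1.2 model.

System Architecture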

[Figure: system architecture diagram]

ML Model

  • Image URI
  import boto3
  from sagemaker.amazon.amazon_estimator import get_image_uri

  # Looks up the URI of the built-in XGBoost container image for the current region. Specify the repo_version depending on preference.
  container = get_image_uri(boto3.Session().region_name,
                            'xgboost',
                            repo_version='1.0-1')
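
In version 2 of the SageMaker Python SDK, get_image_uri still works but emits a rename warning that points to sagemaker.image_uris.retrieve. A minimal sketch of the equivalent lookup, assuming SDK v2 is installed. The helper name xgboost_image_uri and its retrieve parameter (present only so the function can be exercised without AWS access) are illustrative and not part of the project.

```python
def xgboost_image_uri(region, version="1.0-1", retrieve=None):
    """Return the registry URI of the built-in XGBoost container.

    `retrieve` is a hypothetical injection point for testing; by default
    the real sagemaker.image_uris.retrieve is used.
    """
    if retrieve is None:
        # Imported lazily so this module loads even without the SDK installed.
        from sagemaker import image_uris
        retrieve = image_uris.retrieve
    return retrieve(framework="xgboost", region=region, version=version)
```

In a notebook this could be called as container = xgboost_image_uri(boto3.Session().region_name).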

  • Initializing the hyperparameters and the estimator call to the container
  import sagemaker

  hyperparameters = {
        "max_depth":"5",                ## Maximum depth of a tree. Higher means more complex models but risk of overfitting.
        "eta":"0.2",                    ## Learning rate. Lower values make the learning process slower but more precise.
        "gamma":"4",                    ## Minimum loss reduction required to make a further partition on a leaf node. Controls the model's complexity.
        "min_child_weight":"6",         ## Minimum sum of instance weight (hessian) needed in a child. Higher values help prevent overfitting.
        "subsample":"0.7",              ## Fraction of training data used. Reduces overfitting by sampling part of the data.
        "objective":"binary:logistic",  ## Specifies the learning task and corresponding objective. binary:logistic is for binary classification.
        "num_round":50                  ## Number of boosting rounds, i.e. how many trees are added in sequence.
        }
  # A SageMaker estimator that calls the xgboost-container
  estimator = sagemaker.estimator.Estimator(image_uri=container,                  # Points to the XGBoost container we previously set up. This tells SageMaker which algorithm container to use.
                                          hyperparameters=hyperparameters,      # Passes the defined hyperparameters to the estimator. These are the settings that guide the training process.
                                          role=sagemaker.get_execution_role(),  # Specifies the IAM role that SageMaker assumes during the training job. This role allows access to AWS resources like S3.
                                          train_instance_count=1,               # Sets the number of training instances. Here, it's using a single instance.
                                          train_instance_type='ml.m5.large',    # Specifies the type of instance to use for training. ml.m5.large is a general-purpose instance with a balance of compute, memory, and network resources.
                                          train_volume_size=5, # 5GB            # Sets the size of the storage volume attached to the training instance, in GB. Here, it's 5 GB.
                                          output_path=output_path,              # Defines where the model artifacts and output of the training job will be saved in S3.
                                          train_use_spot_instances=True,        # Utilizes spot instances for training, which can be significantly cheaper than on-demand instances. Spot instances are spare EC2 capacity offered at a lower price.
                                          train_max_run=300,                    # Specifies the maximum runtime for the training job in seconds. Here, it's 300 seconds (5 minutes).
                                          train_max_wait=600)                   # Sets the maximum time to wait for the job to complete, including the time waiting for spot instances, in seconds. Here, it's 600 seconds (10 minutes).
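
The train_* argument names above are the SDK v1 spellings. SDK v2 still accepts them with a rename warning and prefers the shorter names. The sketch below shows that mapping, plus the managed spot training rule that max_wait must not be smaller than max_run. V1_TO_V2 and to_v2_kwargs are illustrative helpers, not part of the project or of the SDK.

```python
# Argument renames listed in the SageMaker Python SDK v2 migration guide.
V1_TO_V2 = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
    "train_volume_size": "volume_size",
    "train_use_spot_instances": "use_spot_instances",
    "train_max_run": "max_run",
    "train_max_wait": "max_wait",
}


def to_v2_kwargs(kwargs):
    """Return a copy of `kwargs` with v1-style names replaced by v2 names."""
    converted = {V1_TO_V2.get(name, name): value for name, value in kwargs.items()}
    # Managed spot training requires max_wait to be at least max_run.
    if converted.get("use_spot_instances"):
        if converted.get("max_wait", 0) < converted.get("max_run", 0):
            raise ValueError("max_wait must be >= max_run for spot training")
    return converted


estimator_kwargs = to_v2_kwargs({
    "train_instance_count": 1,
    "train_instance_type": "ml.m5.large",
    "train_volume_size": 5,
    "train_use_spot_instances": True,
    "train_max_run": 300,
    "train_max_wait": 600,
})
print(estimator_kwargs)
```

The resulting dictionary could be unpacked into the Estimator call alongside image_uri, hyperparameters, role, and output_path.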

  • Training job
  estimator.fit({'train': s3_input_train,'validation': s3_input_test})
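
How s3_input_train and s3_input_test are built is not shown in this section. A hedged sketch, assuming SDK v2's sagemaker.inputs.TrainingInput and CSV files stored under hypothetical train and test prefixes in the bucket. training_channels and its make_input parameter are illustrative names.

```python
def training_channels(bucket_name, prefix, make_input=None):
    """Build the channel dict that is passed to estimator.fit().

    The bucket layout (<prefix>/train and <prefix>/test) is an assumption.
    """
    if make_input is None:
        # Imported lazily; requires the SageMaker Python SDK v2.
        from sagemaker.inputs import TrainingInput
        make_input = TrainingInput
    train_uri = f"s3://{bucket_name}/{prefix}/train"
    test_uri = f"s3://{bucket_name}/{prefix}/test"
    return {
        "train": make_input(s3_data=train_uri, content_type="csv"),
        "validation": make_input(s3_data=test_uri, content_type="csv"),
    }
```

With this helper the training call would read estimator.fit(training_channels(bucket_name, prefix)).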

  • Deployment
  xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
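
With the binary:logistic objective, the deployed endpoint returns one probability per input row. A minimal sketch for turning a CSV response body into floats, assuming the predictor sends CSV (for example via SDK v2's CSVSerializer) and returns the raw response. parse_scores is an illustrative helper, not project code.

```python
def parse_scores(payload):
    """Turn a CSV response body (bytes or str) into a list of floats."""
    text = payload.decode("utf-8") if isinstance(payload, bytes) else payload
    # Accept both comma-separated and newline-separated scores.
    tokens = text.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


print(parse_scores(b"0.02,0.97\n0.51"))  # [0.02, 0.97, 0.51]
```

In the notebook this might be used as scores = parse_scores(xgb_predictor.predict(rows)), where rows is a batch of feature rows.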

  • Validation
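
As a rough illustration of a validation step for a binary classifier, the sketch below derives confusion-matrix counts and a few metrics from true labels and predicted probabilities. It is not the notebook's own code: the function name, the 0.5 threshold, and the choice of metrics are assumptions.

```python
def binary_metrics(labels, scores, threshold=0.5):
    """Compute confusion-matrix counts and common metrics for 0/1 labels."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


metrics = binary_metrics([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
print(metrics)
```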

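Getting Started

  • Clone the repository using Git Bash, download the .zip file, or fork the repository.
  • Go to your AWS Management Console, click your account profile in the top-right corner, and select My Security Credentials from the dropdown.
  • Create an access key: in the Access keys section, click Create New Access Key. A dialog will appear with your Access Key ID and Secret Access Key.
  • Download or copy the keys: (IMPORTANT) download the .csv file or copy the keys to a secure location. This is the only time you can view the secret access key.
  • Open the cloned repository in VS Code.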
  • Under ClassiSage, create a file named terraform.tfvars holding the variable values for your account (the keys created above are needed for this step).
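  • Download and install all the dependencies needed to use Terraform and Python.
  • In the terminal, type/paste terraform init to initialize the backend.

  • Then type/paste terraform plan to view the plan, or simply terraform validate to ensure there are no errors.

  • Finally, in the terminal, type/paste terraform apply --auto-approve

  • This will show two outputs, one as bucket_name and the other as pretrained_ml_instance_name (the third resource is the variable name given to the bucket, since bucket names are global).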
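
The bucket_name output is needed again in later steps. Besides copying it from the terminal, it can be read programmatically with terraform output -json. The sketch below assumes Terraform is on the PATH and that the command is run from the project directory. Both function names are illustrative, and the sample bucket name is a placeholder.

```python
import json
import subprocess


def parse_terraform_outputs(json_text):
    """Map each output name to its value from `terraform output -json`."""
    return {name: entry["value"] for name, entry in json.loads(json_text).items()}


def read_terraform_outputs(project_dir="."):
    """Run `terraform output -json` in project_dir and parse the result."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=project_dir, check=True, capture_output=True, text=True,
    )
    return parse_terraform_outputs(result.stdout)


# Example with a hand-written payload in the shape Terraform emits.
sample = '{"bucket_name": {"sensitive": false, "type": "string", "value": "example-bucket"}}'
print(parse_terraform_outputs(sample)["bucket_name"])  # example-bucket
```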


  • Once the terminal shows that the command has completed, open ClassiSage/ml_ops/function.py and, on line 11 of the file, change the path to the location of your project directory, then save the file.

  • Then, in ClassiSage/ml_ops/data_upload.ipynb, run all code cells up to cell number 25 to upload the dataset to the S3 bucket.
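
The cells of data_upload.ipynb are not reproduced here. As a hedged sketch of what an upload step like this can look like with boto3: upload_files is a hypothetical helper, and storing each file at the bucket root under its own file name is an assumption.

```python
import os


def upload_files(bucket_name, paths, s3_client=None):
    """Upload each local file to the bucket root, keyed by its file name."""
    if s3_client is None:
        # Imported lazily; AWS credentials are needed when this runs.
        import boto3
        s3_client = boto3.client("s3")
    keys = []
    for path in paths:
        key = os.path.basename(path)
        s3_client.upload_file(path, bucket_name, key)  # (Filename, Bucket, Key)
        keys.append(key)
    return keys
```

A call might look like upload_files(bucket_name, ["final_dataset.csv", "pretrained_sm.ipynb"]), where the local paths depend on where the files sit in your checkout.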

  • Output of the code cell execution

  • After the notebook has run, reopen your AWS Management Console.
  • Search for the S3 and SageMaker services to see an instance of each service launched (an S3 bucket and a SageMaker Notebook).

An S3 bucket named 'data-bucket-' with 2 objects uploaded: a dataset and the pretrained_sm.ipynb file containing the model code.


  • Go to the notebook instance in AWS SageMaker, click the created instance, and click Open Jupyter.
  • After that, click New at the top right of the window and select Terminal.
  • This will create a new terminal.

  • In the terminal, run the command that uploads pretrained_sm.ipynb from S3 to the notebook's Jupyter environment, replacing the bucket name with the bucket_name output shown in the VS Code terminal.
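
The same copy can also be done from Python on the notebook instance. This is a sketch under assumptions, not the command from the article: fetch_notebook is a hypothetical helper, the object is assumed to sit at the bucket root, and /home/ec2-user/SageMaker is assumed to be the directory that Jupyter serves on the instance.

```python
import os


def fetch_notebook(bucket_name, key="pretrained_sm.ipynb",
                   dest_dir="/home/ec2-user/SageMaker", s3_client=None):
    """Download one object from the bucket into dest_dir and return its path."""
    if s3_client is None:
        # Imported lazily; on a notebook instance the attached role is used.
        import boto3
        s3_client = boto3.client("s3")
    destination = os.path.join(dest_dir, os.path.basename(key))
    s3_client.download_file(bucket_name, key, destination)  # (Bucket, Key, Filename)
    return destination
```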


  • Go back to the opened Jupyter instance and click the pretrained_sm.ipynb file to open it, assigning it the conda_python3 kernel.
  • Scroll down to the 4th cell and replace the value of the variable bucket_name with the bucket_name output from the VS Code terminal: bucket_name = ""

Output of the code cell execution


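  • At the top of the file, go to the Kernel tab and restart the kernel.
  • Run the notebook up to code cell number 27.
  • You will get the intended result. The data is fetched and, after being adjusted for labels and features, split into train and test sets with a defined output path. A model is then trained using SageMaker's Python SDK, deployed as an endpoint, and validated to give different metrics.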
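
SageMaker's built-in XGBoost expects CSV training data with the label in the first column and no header row. The sketch below shows one way to reorder and split a dataset accordingly using only the standard library. The column name "Label", the split ratio, and the helper name are assumptions and do not reflect the notebook's actual code.

```python
import csv
import io
import random


def split_label_first(csv_text, label="Label", test_fraction=0.3, seed=0):
    """Return (train_rows, test_rows) with the label moved to column 0.

    Rows are lists of strings; the header row is dropped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    features = [name for name in reader.fieldnames if name != label]
    rows = [[record[label]] + [record[name] for name in features]
            for record in reader]
    # Shuffle deterministically so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]
```

Each returned list can be written with csv.writer (without a header) before being uploaded to the train and test locations in S3.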

Console Observation Notes

Execution of the 8th cell

  • An output path is set up in S3 to store the model data.

Execution of the 23rd cell (training job)

  • A training job starts; you can check it under the Training tab.

  • After some time (an estimated 3 minutes), it completes, and the Training tab reflects this.

Execution of the 24th code cell (deployment)

  • An endpoint is deployed; it appears under the Inference tab.

Additional console observations:

  • An endpoint configuration is created under the Inference tab.

  • A model is also created under the Inference tab.


Ending and Cleanup

  • In VS Code, return to data_upload.ipynb and run the last 2 code cells to download the S3 bucket's data to the local system.
  • The folder will be named downloaded_bucket_content.

  • The output cell logs the downloaded files: the original pretrained_sm.ipynb, final_dataset.csv, and a model output folder named 'pretrained-algo' containing the execution data of the SageMaker code file.
  • Finally, open pretrained_sm.ipynb inside the SageMaker instance and run the last 2 code cells. The endpoint and the resources inside the S3 bucket are deleted so that no additional charges are incurred.
  • Delete the endpoint.
  • Clear S3 (needed to destroy the instance).
  • Return to the VS Code terminal for the project and type/paste terraform destroy --auto-approve
  • All the created resource instances will be deleted.
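
The endpoint and bucket clean-up done in the notebook before terraform destroy can be sketched as follows. This is an illustration rather than the notebook's own cells: clean_up is a hypothetical helper that relies on the predictor's delete_endpoint method and on boto3's bucket object collection. Emptying the bucket first matters because Terraform cannot delete a non-empty S3 bucket unless force_destroy is set on it.

```python
def clean_up(predictor, bucket_name, s3_resource=None):
    """Delete the endpoint, then every object in the bucket."""
    # Stops the hosted endpoint so it no longer accrues charges.
    predictor.delete_endpoint()
    if s3_resource is None:
        # Imported lazily; AWS credentials are needed when this runs.
        import boto3
        s3_resource = boto3.resource("s3")
    # Removes all objects; the bucket itself is left for Terraform to destroy.
    s3_resource.Bucket(bucket_name).objects.all().delete()
```

In the notebook this would be called as clean_up(xgb_predictor, bucket_name).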

Auto-Created Objects

ClassiSage/downloaded_bucket_content
ClassiSage/.terraform
ClassiSage/ml_ops/__pycache__
ClassiSage/.terraform.lock.hcl
ClassiSage/terraform.tfstate
ClassiSage/terraform.tfstate.backup

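Note:
If you liked the idea and implementation of this machine learning project, which uses AWS Cloud's S3 and SageMaker for HDFS log classification and Terraform for IaC (infrastructure setup automation), please consider liking this post and starring the project repository on GitHub after checking it out.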

Source: dev.to