Amazon SageMaker HyperPod is purpose-built to accelerate foundation model (FM) training, removing the undifferentiated heavy lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks and months without disruption.
Typically, HyperPod clusters are used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files, run their own jobs, and want to avoid impacting each other’s work. To achieve this multi-user environment, you can take advantage of Linux’s user and group mechanism and statically create multiple users on each instance through lifecycle scripts. The drawback to this approach, however, is that user and group settings are duplicated across multiple instances in the cluster, making it difficult to configure them consistently on all instances, such as when a new team member joins.
To solve this pain point, we can use Lightweight Directory Access Protocol (LDAP) and LDAP over TLS/SSL (LDAPS) to integrate with a directory service such as AWS Directory Service for Microsoft Active Directory. With the directory service, you can centrally maintain users and groups, and their permissions.
In this post, we introduce a solution to integrate HyperPod clusters with AWS Managed Microsoft AD, and explain how to achieve a seamless multi-user login environment with a centrally maintained directory.
Solution overview
The solution uses the following AWS services and resources:
SageMaker HyperPod to create a cluster
AWS Managed Microsoft AD to create a managed directory
Elastic Load Balancing (ELB) to create a Network Load Balancer (NLB) in front of the directory service
AWS Certificate Manager (ACM) to import and maintain an SSL/TLS certificate for LDAP over an SSL/TLS (LDAPS) connection
Amazon Elastic Compute Cloud (Amazon EC2) to create a Windows machine to administer users and groups in the directory
We also use AWS CloudFormation to deploy a stack to create the prerequisites for the HyperPod cluster: VPC, subnets, security group, and Amazon FSx for Lustre volume.
The following diagram illustrates the high-level solution architecture.
In this solution, HyperPod cluster instances use the LDAPS protocol to connect to the AWS Managed Microsoft AD via an NLB. We use TLS termination by installing a certificate on the NLB. To configure LDAPS in HyperPod cluster instances, the lifecycle script installs and configures System Security Services Daemon (SSSD), open source client software for LDAP/LDAPS.
Prerequisites
This post assumes you already know how to create a basic HyperPod cluster without SSSD. For more details on how to create HyperPod clusters, refer to Getting started with SageMaker HyperPod and the HyperPod workshop.
Also, in the setup steps, you will use a Linux machine to generate a self-signed certificate and obtain an obfuscated password for the AD reader user. If you don’t have a Linux machine, you can create an EC2 Linux instance or use AWS CloudShell.
Create a VPC, subnets, and a security group
Follow the instructions in the Own Account section of the HyperPod workshop. You will deploy a CloudFormation stack and create prerequisite resources such as VPC, subnets, security group, and FSx for Lustre volume. You need to create both a primary subnet and backup subnet when deploying the CloudFormation stack, because AWS Managed Microsoft AD requires at least two subnets with different Availability Zones.
In this post, for simplicity, we use the same VPC, subnets, and security group for both the HyperPod cluster and directory service. If you need to use different networks for the cluster and directory service, make sure security groups and route tables are configured so that they can communicate with each other.
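You can check the two-subnet requirement before creating the directory. The following is a minimal sketch, assuming subnet records shaped like the entries returned by the EC2 DescribeSubnets API; the subnet IDs and Availability Zone names are placeholders.

```python
def spans_two_availability_zones(subnets):
    """Return True if the subnets cover at least two distinct Availability Zones."""
    zones = {subnet["AvailabilityZone"] for subnet in subnets}
    return len(zones) >= 2


# Hypothetical subnet records; real ones would come from your own VPC.
subnets = [
    {"SubnetId": "subnet-primary", "AvailabilityZone": "us-west-2a"},
    {"SubnetId": "subnet-backup", "AvailabilityZone": "us-west-2b"},
]

print(spans_two_availability_zones(subnets))
```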
Create AWS Managed Microsoft AD on Directory Service
Complete the following steps to set up your directory:
On the Directory Service console, choose Directories in the navigation pane.
Choose Set up directory.
For Directory type, select AWS Managed Microsoft AD.
Choose Next.
For Edition, select Standard Edition.
For Directory DNS name, enter your preferred directory DNS name (for example, hyperpod.abc123.com).
For Admin password, set a password and save it for later use.
Choose Next.
In the Networking section, specify the VPC and two private subnets you created.
Choose Next.
Review the configuration and pricing, then choose Create directory. The directory creation starts. Wait until the status changes from Creating to Active, which can take 20–30 minutes.
When the status changes to Active, open the detail page of the directory and take note of the DNS addresses for later use.
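If you prefer to script this step, the console choices above map to a small set of request parameters. The following sketch only assembles those parameters and makes no AWS call. It assumes you would pass the result to the Boto3 Directory Service client's create_microsoft_ad operation; the VPC ID, subnet IDs, and password are placeholders.

```python
def build_directory_request(dns_name, admin_password, vpc_id, subnet_ids):
    """Assemble keyword arguments for creating an AWS Managed Microsoft AD directory."""
    if len(subnet_ids) < 2:
        raise ValueError("Provide subnets in two different Availability Zones")
    return {
        "Name": dns_name,
        "Password": admin_password,
        "Edition": "Standard",
        "VpcSettings": {"VpcId": vpc_id, "SubnetIds": list(subnet_ids)},
    }


request = build_directory_request(
    dns_name="hyperpod.abc123.com",
    admin_password="replace-with-your-admin-password",  # placeholder, do not hard-code secrets
    vpc_id="vpc-0123456789abcdef0",                     # hypothetical ID
    subnet_ids=["subnet-primary", "subnet-backup"],     # hypothetical IDs
)

# Assumed usage: boto3.client("ds").create_microsoft_ad(**request)
print(request["VpcSettings"])
```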
Create an NLB in front of Directory Service
To create the NLB, complete the following steps:
On the Amazon EC2 console, choose Target groups in the navigation pane.
Choose Create target group.
Create a target group with the following parameters:
For Choose a target type, select IP addresses.
For Target group name, enter LDAP.
For Protocol: Port, choose TCP and enter 389.
For IP address type, select IPv4.
For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
For Health check protocol, choose TCP.
Choose Next.
In the Register targets section, register the directory service’s DNS addresses as the targets.
For Ports, choose Include as pending below. The addresses are added in the Review targets section with Pending status.
Choose Create target group.
On the Load Balancers console, choose Create load balancer.
Under Network Load Balancer, choose Create.
Configure an NLB with the following parameters:
For Load balancer name, enter a name (for example, nlb-ds).
For Scheme, select Internal.
For IP address type, select IPv4.
For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
Under Mappings, select the two private subnets and their CIDR ranges (which you created with the CloudFormation template).
For Security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
In the Listeners and routing section, specify the following parameters:
For Protocol, choose TCP.
For Port, enter 389.
For Default action, choose the target group named LDAP.
Here, we are adding a listener for LDAP. We will add LDAPS later.
Choose Create load balancer. Wait until the status changes from Provisioning to Active, which can take 3–5 minutes.
When the status changes to Active, open the detail page of the provisioned NLB and take note of the DNS name (xyzxyz.elb.region-name.amazonaws.com) for later use.
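The target group and load balancer settings above can also be expressed as request parameters. The following sketch assembles them without calling AWS. It assumes the Boto3 elbv2 client operations create_target_group, register_targets, and create_load_balancer; the IDs and IP addresses are placeholders.

```python
LDAP_PORT = 389


def build_target_group_request(vpc_id):
    """Parameters for a TCP target group that forwards to IP addresses."""
    return {
        "Name": "LDAP",
        "Protocol": "TCP",
        "Port": LDAP_PORT,
        "VpcId": vpc_id,
        "HealthCheckProtocol": "TCP",
        "TargetType": "ip",
        "IpAddressType": "ipv4",
    }


def build_targets(directory_dns_addresses):
    """Register each directory DNS address as an IP target on the LDAP port."""
    return [{"Id": address, "Port": LDAP_PORT} for address in directory_dns_addresses]


def build_nlb_request(name, subnet_ids, security_group_id):
    """Parameters for an internal, IPv4 Network Load Balancer."""
    return {
        "Name": name,
        "Type": "network",
        "Scheme": "internal",
        "IpAddressType": "ipv4",
        "Subnets": list(subnet_ids),
        "SecurityGroups": [security_group_id],
    }


# Hypothetical values for illustration only.
target_group = build_target_group_request("vpc-0123456789abcdef0")
targets = build_targets(["10.1.10.5", "10.1.20.5"])
nlb = build_nlb_request("nlb-ds", ["subnet-primary", "subnet-backup"], "sg-0123456789abcdef0")

# Assumed usage with elbv2 = boto3.client("elbv2"):
#   elbv2.create_target_group(**target_group)
#   elbv2.register_targets(TargetGroupArn=..., Targets=targets)
#   elbv2.create_load_balancer(**nlb)
print(targets)
```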
Create a self-signed certificate and import it to Certificate Manager
To create a self-signed certificate, complete the following steps:
On your Linux-based environment (local laptop, EC2 Linux instance, or CloudShell), run the following OpenSSL commands to create a self-signed certificate and private key:
$ openssl genrsa 2048 > ldaps.key
$ openssl req -new -key ldaps.key -out ldaps_server.csr
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:US
State or Province Name (full name) [Some-State]:Washington
Locality Name (eg, city) []:Bellevue
Organization Name (eg, company) [Internet Widgits Pty Ltd]:CorpName
Organizational Unit Name (eg, section) []:OrgName
Common Name (e.g., server FQDN or YOUR name) []:nlb-ds-abcd1234.elb.region.amazonaws.com
Email Address []:your@email.address.com
Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
$ openssl x509 -req -sha256 -days 365 -in ldaps_server.csr -signkey ldaps.key -out ldaps.crt
Certificate request self-signature ok
subject=C = US, ST = Washington, L = Bellevue, O = CorpName, OU = OrgName, CN = nlb-ds-abcd1234.elb.region.amazonaws.com, emailAddress = your@email.address.com
$ chmod 600 ldaps.key
On the Certificate Manager console, choose Import.
Enter the certificate body and private key, from the contents of ldaps.crt and ldaps.key respectively.
Choose Next.
Add any optional tags, then choose Next.
Review the configuration and choose Import.
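In the example session above, the certificate's Common Name is set to the DNS name of the NLB. A quick way to confirm that the two match is to compare the subject line printed by OpenSSL with the NLB DNS name you noted earlier. The following is a minimal sketch; the parser is a simplification that assumes no field value contains a comma.

```python
def common_name_from_subject(subject_line):
    """Extract the CN value from an OpenSSL subject line such as 'subject=C = US, CN = host'."""
    prefix = "subject="
    if subject_line.startswith(prefix):
        subject_line = subject_line[len(prefix):]
    for field in subject_line.split(","):
        key, _, value = field.partition("=")
        if key.strip() == "CN":
            return value.strip()
    return None


subject = (
    "subject=C = US, ST = Washington, L = Bellevue, O = CorpName, OU = OrgName, "
    "CN = nlb-ds-abcd1234.elb.region.amazonaws.com, emailAddress = your@email.address.com"
)
nlb_dns_name = "nlb-ds-abcd1234.elb.region.amazonaws.com"  # placeholder from the example session

print(common_name_from_subject(subject) == nlb_dns_name)
```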
Add an LDAPS listener
We added a listener for LDAP already in the NLB. Now we add a listener for LDAPS with the imported certificate. Complete the following steps:
On the Load Balancers console, navigate to the NLB details page.
On the Listeners tab, choose Add listener.
Configure the listener with the following parameters:
For Protocol, choose TLS.
For Port, enter 636.
For Default action, choose LDAP.
For Certificate source, select From ACM.
For Certificate, enter what you imported in ACM.
Choose Add. Now the NLB listens to both LDAP and LDAPS. It is recommended to delete the LDAP listener because it transmits data without encryption, unlike LDAPS.
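The difference between the LDAP and LDAPS listeners comes down to protocol, port, and the attached certificate. The following sketch builds the parameters for either listener without calling AWS. It assumes the Boto3 elbv2 create_listener operation, and the ARNs are shortened placeholders rather than real values.

```python
def build_listener_request(load_balancer_arn, target_group_arn, certificate_arn=None):
    """Build an LDAPS (TLS, port 636) listener if a certificate is given, else LDAP (TCP, port 389)."""
    request = {
        "LoadBalancerArn": load_balancer_arn,
        "Protocol": "TLS" if certificate_arn else "TCP",
        "Port": 636 if certificate_arn else 389,
        # Both listeners forward to the same LDAP target group; the NLB terminates TLS.
        "DefaultActions": [{"Type": "forward", "TargetGroupArn": target_group_arn}],
    }
    if certificate_arn:
        request["Certificates"] = [{"CertificateArn": certificate_arn}]
    return request


# Hypothetical placeholder ARNs.
ldaps_listener = build_listener_request(
    load_balancer_arn="arn:aws:elasticloadbalancing:example:loadbalancer/net/nlb-ds",
    target_group_arn="arn:aws:elasticloadbalancing:example:targetgroup/LDAP",
    certificate_arn="arn:aws:acm:example:certificate/imported-cert",
)

# Assumed usage: boto3.client("elbv2").create_listener(**ldaps_listener)
print(ldaps_listener["Protocol"], ldaps_listener["Port"])
```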
Create an EC2 Windows instance to administer users and groups in the AD
To create and maintain users and groups in the AD, complete the following steps:
On the Amazon EC2 console, choose Instances in the navigation pane.
Choose Launch instances.
For Name, enter a name for your instance.
For Amazon Machine Image, choose Microsoft Windows Server 2022 Base.
For Instance type, choose t2.micro.
In the Network settings section, provide the following parameters:
For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
For Subnet, choose either of two subnets you created with the CloudFormation template.
For Common security groups, choose CfStackName-SecurityGroup-XYZXYZ (which you created with the CloudFormation template).
For Configure storage, set storage to 30 GB gp2.
In the Advanced details section, for Domain join directory, choose the AD you created.
For IAM instance profile, choose an AWS Identity and Access Management (IAM) role with at least the AmazonSSMManagedEC2InstanceDefaultPolicy policy.
Review the summary and choose Launch instance.
Create users and groups in AD using the EC2 Windows instance
With Remote Desktop, connect to the EC2 Windows instance you created in the previous step. Using an RDP client is recommended over using a browser-based Remote Desktop so that you can exchange the contents of the clipboard with your local machine using copy-paste operations. For more details about connecting to EC2 Windows instances, refer to Connect to your Windows instance.
If you are prompted for a login credential, use hyperpod\Admin (where hyperpod is the first part of your directory DNS name) as the user name, and use the admin password you set for the directory service.
When the Windows desktop screen opens, choose Server Manager from the Start menu.
Choose Local Server in the navigation pane, and confirm that the domain is what you specified to the directory service.
On the Manage menu, choose Add Roles and Features.
Choose Next until you are at the Features page.
Expand the feature Remote Server Administration Tools, expand Role Administration Tools, and select AD DS and AD LDS Tools and Active Directory Rights Management Service.
Choose Next and Install. Feature installation starts.
When the installation is complete, choose Close.
Open Active Directory Users and Computers from the Start menu.
Under hyperpod.abc123.com, expand hyperpod.
Choose (right-click) hyperpod, choose New, and choose Organizational Unit.
Create an organizational unit called Groups.
Choose (right-click) Groups, choose New, and choose Group.
Create a group called ClusterAdmin.
Create a second group called ClusterDev.
Choose (right-click) Users, choose New, and choose User.
Create a new user.
Choose (right-click) the user and choose Add to a group.
Add your users to the groups ClusterAdmin or ClusterDev. Users added to the ClusterAdmin group will have sudo privileges on the cluster.
Create a ReadOnly user in AD
Create a user called ReadOnly under Users. The ReadOnly user is used by the cluster to programmatically access users and groups in AD.
Take note of the password for later use.
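The cluster later refers to this user and to the directory by their LDAP distinguished names, which follow from the directory DNS name. The following sketch derives both strings for the example directory hyperpod.abc123.com. It assumes the organizational unit layout used in this post, with Users under an OU named after the first part of the DNS name; adjust it if your directory is organized differently.

```python
def search_base(dns_name):
    """Turn a directory DNS name into an LDAP search base, one dc= component per label."""
    return ",".join(f"dc={label}" for label in dns_name.split("."))


def bind_dn(user_name, dns_name):
    """Build the distinguished name of a user stored under OU=Users,OU=<short name>."""
    labels = dns_name.split(".")
    short_name = labels[0]
    domain_components = ",".join(f"DC={label}" for label in labels)
    return f"CN={user_name},OU=Users,OU={short_name},{domain_components}"


print(search_base("hyperpod.abc123.com"))
# dc=hyperpod,dc=abc123,dc=com
print(bind_dn("ReadOnly", "hyperpod.abc123.com"))
# CN=ReadOnly,OU=Users,OU=hyperpod,DC=hyperpod,DC=abc123,DC=com
```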
(For SSH public key authentication) Add SSH public keys to users
By storing an SSH public key to a user in AD, you can log in without entering a password. You can use an existing key pair, or you can create a new key pair with OpenSSH’s ssh-keygen command. For more information about generating a key pair, refer to Create a key pair for your Amazon EC2 instance.
In Active Directory Users and Computers, on the View menu, enable Advanced Features.
Open the Properties dialog of the user.
On the Attribute Editor tab, choose altSecurityIdentities and choose Edit.
For Value to add, choose Add.
For Values, add an SSH public key.
Choose OK. Confirm that the SSH public key appears as an attribute.
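A malformed key pasted into the attribute is easy to miss, so it can help to sanity-check the public key line first. The following is a minimal sketch, not a full validator: it checks that the line has a key type and a Base64 blob, and that the blob begins with the same key type name, as OpenSSH public keys do. The sample key is synthetic and generated in the code.

```python
import base64
import struct


def looks_like_openssh_public_key(line):
    """Return True if the line has the shape '<type> <base64 blob> [comment]' with a matching blob."""
    parts = line.strip().split()
    if len(parts) < 2:
        return False
    key_type, blob_b64 = parts[0], parts[1]
    try:
        blob = base64.b64decode(blob_b64, validate=True)
    except ValueError:
        return False
    if len(blob) < 4:
        return False
    # The blob starts with a 4-byte big-endian length followed by the key type name.
    (name_length,) = struct.unpack(">I", blob[:4])
    return blob[4:4 + name_length] == key_type.encode()


# Synthetic example: a correctly framed blob with 32 zero bytes standing in for key material.
name = b"ssh-ed25519"
sample_blob = struct.pack(">I", len(name)) + name + struct.pack(">I", 32) + bytes(32)
sample_line = "ssh-ed25519 " + base64.b64encode(sample_blob).decode() + " user1@example"

print(looks_like_openssh_public_key(sample_line))
```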
Get an obfuscated password for the ReadOnly user
To avoid including a plain text password in the SSSD configuration file, you obfuscate the password. For this step, you need a Linux environment (local laptop, EC2 Linux instance, or CloudShell).
Install the sssd-tools package on the Linux machine to install the Python module pysss for obfuscation:
# Ubuntu
$ sudo apt install sssd-tools
# Amazon Linux
$ sudo yum install sssd-tools
Run the following one-line Python script. Input the password of the ReadOnly user. You will get the obfuscated password.
$ python3 -c "import getpass,pysss; print(pysss.password().encrypt(getpass.getpass('AD reader user password: ').strip(), pysss.password().AES_256))"
AD reader user password: (Enter ReadOnly user password)
AAAQACK2….
Create a HyperPod cluster with an SSSD-enabled lifecycle script
Next, you create a HyperPod cluster with LDAPS/Active Directory integration.
Find the configuration file config.py in your lifecycle script directory, open it with your text editor, and edit the properties in the Config class and SssdConfig class:
Set True for enable_sssd to enable setting up SSSD.
The SssdConfig class contains configuration parameters for SSSD.
Make sure you use the obfuscated password for the ldap_default_authtok property, not a plain text password.
# Basic configuration parameters
class Config:
    :
    # Set true if you want to install SSSD for ActiveDirectory/LDAP integration.
    # You need to configure parameters in SssdConfig as well.
    enable_sssd = True

# Configuration parameters for ActiveDirectory/LDAP/SSSD
class SssdConfig:
    # Name of domain. Can be default if you are not sure.
    domain = "default"
    # Comma separated list of LDAP server URIs
    ldap_uri = "ldaps://nlb-ds-xyzxyz.elb.us-west-2.amazonaws.com"
    # The default base DN to use for performing LDAP user operations
    ldap_search_base = "dc=hyperpod,dc=abc123,dc=com"
    # The default bind DN to use for performing LDAP operations
    ldap_default_bind_dn = "CN=ReadOnly,OU=Users,OU=hyperpod,DC=hyperpod,DC=abc123,DC=com"
    # "password" or "obfuscated_password". Obfuscated password is recommended.
    ldap_default_authtok_type = "obfuscated_password"
    # You need to modify this parameter with the obfuscated password, not plain text password
    ldap_default_authtok = "placeholder"
    # SSH authentication method - "password" or "publickey"
    ssh_auth_method = "publickey"
    # Home directory. You can change it to "/home/%u" if your cluster doesn't use FSx volume.
    override_homedir = "/fsx/%u"
    # Group names to accept SSH login
    ssh_allow_groups = {
        "controller" : ["ClusterAdmin", "ubuntu"],
        "compute" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
        "login" : ["ClusterAdmin", "ClusterDev", "ubuntu"],
    }
    # Group names for sudoers
    sudoers_groups = {
        "controller" : ["ClusterAdmin", "ClusterDev"],
        "compute" : ["ClusterAdmin", "ClusterDev"],
        "login" : ["ClusterAdmin", "ClusterDev"],
    }
Copy the certificate file ldaps.crt to the same directory (where config.py exists).
Upload the modified lifecycle script files to your Amazon Simple Storage Service (Amazon S3) bucket, and create a HyperPod cluster with it.
Wait until the status changes to InService.
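To reason about who can log in where, it helps to read ssh_allow_groups as a lookup from node type to permitted groups. The following sketch illustrates that reading with the values from the example configuration. It is an interpretation of the setting's intent, not the lifecycle script's actual implementation, and can_ssh is a hypothetical helper.

```python
# Values copied from the example SssdConfig above.
SSH_ALLOW_GROUPS = {
    "controller": ["ClusterAdmin", "ubuntu"],
    "compute": ["ClusterAdmin", "ClusterDev", "ubuntu"],
    "login": ["ClusterAdmin", "ClusterDev", "ubuntu"],
}


def can_ssh(node_type, user_groups):
    """Return True if any of the user's groups is allowed to SSH into the node type."""
    allowed = set(SSH_ALLOW_GROUPS.get(node_type, []))
    return bool(allowed & set(user_groups))


# A ClusterDev member can reach login and compute nodes but not the controller.
for node_type in ("controller", "compute", "login"):
    print(node_type, can_ssh(node_type, ["ClusterDev"]))
```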
Verification
Let’s verify the solution by logging in to the cluster with SSH. Because the cluster was created in a private subnet, you can’t directly SSH into the cluster from your local environment. You can choose from two options to connect to the cluster.
Option 1: SSH login through AWS Systems Manager
You can use AWS Systems Manager as a proxy for the SSH connection. Add a host entry to the SSH configuration file ~/.ssh/config using the following example. For the HostName field, specify the Systems Manager target name in the format of sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]. For the IdentityFile field, specify the file path to the user’s SSH private key. This field is not required if you chose password authentication.
Host MyCluster-LoginNode
    HostName sagemaker-cluster:abcd1234_LoginGroup-i-01234567890abcdef
    User user1
    IdentityFile ~/keys/my-cluster-ssh-key.pem
    ProxyCommand aws --profile default --region us-west-2 ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p
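The Systems Manager target name used for HostName is a plain string built from three identifiers. The following sketch composes it in the format described above; ssm_target is a hypothetical helper, and the identifiers are the placeholders from the example host entry.

```python
def ssm_target(cluster_id, instance_group_name, instance_id):
    """Compose the target name: sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]."""
    return f"sagemaker-cluster:{cluster_id}_{instance_group_name}-{instance_id}"


print(ssm_target("abcd1234", "LoginGroup", "i-01234567890abcdef"))
# sagemaker-cluster:abcd1234_LoginGroup-i-01234567890abcdef
```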
Run the ssh command using the host name you specified. Confirm you can log in to the instance with the specified user.
$ ssh MyCluster-LoginNode
:
:
____ __ ___ __ __ __ ___ __
/ __/__ ____ ____ / |/ /__ _/ /_____ ____ / // /_ _____ ___ ____/ _ ___ ___/ /
_ / _ `/ _ `/ -_) /|_/ / _ `/ ‘_/ -_) __/ / _ / // / _ / -_) __/ ___/ _ / _ /
/___/_,_/_, /__/_/ /_/_,_/_/_\__/_/ /_//_/_, / .__/__/_/ /_/ ___/_,_/
/___/ /___/_/
You’re on the controller
Instance Type: ml.m5.xlarge
user1@ip-10-1-111-222:~$
At this point, users can still use the Systems Manager default shell session to log in to the cluster as ssm-user with administrative privileges. To block the default Systems Manager shell access and enforce SSH access, you can configure your IAM policy by referring to the following example:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": [
                "arn:aws:sagemaker:us-west-2:123456789012:cluster/abcd1234efgh",
                "arn:aws:ssm:us-west-2:123456789012:document/AWS-StartSSHSession"
            ],
            "Condition": {
                "BoolIfExists": {
                    "ssm:SessionDocumentAccessCheck": "true"
                }
            }
        }
    ]
}
For more details on how to enforce SSH access, refer to Start a session with a document by specifying the session documents in IAM policies.
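If you manage several clusters, you might generate this policy rather than edit it by hand. The following sketch mirrors the example policy above and only fills in the Region, account ID, and cluster ID; build_ssh_only_policy is a hypothetical helper, and the values shown are the placeholders from the example.

```python
import json


def build_ssh_only_policy(region, account_id, cluster_id):
    """Return a policy document that allows sessions only through AWS-StartSSHSession."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ssm:StartSession", "ssm:TerminateSession"],
                "Resource": [
                    f"arn:aws:sagemaker:{region}:{account_id}:cluster/{cluster_id}",
                    f"arn:aws:ssm:{region}:{account_id}:document/AWS-StartSSHSession",
                ],
                "Condition": {
                    "BoolIfExists": {"ssm:SessionDocumentAccessCheck": "true"}
                },
            }
        ],
    }


policy = build_ssh_only_policy("us-west-2", "123456789012", "abcd1234efgh")
print(json.dumps(policy, indent=4))
```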
Option 2: SSH login through bastion host
Another option to access the cluster is to use a bastion host as a proxy. You can use this option when the user doesn’t have permission to use Systems Manager sessions, or to troubleshoot when Systems Manager is not working.
Create a bastion security group that allows inbound SSH access (TCP port 22) from your local environment.
Update the security group for the cluster to allow inbound SSH access from the bastion security group.
Create an EC2 Linux instance.
For Amazon Machine Image, choose Ubuntu Server 20.04 LTS.
For Instance type, choose t3.small.
In the Network settings section, provide the following parameters:
For VPC, choose SageMaker HyperPod VPC (which you created with the CloudFormation template).
For Subnet, choose the public subnet you created with the CloudFormation template.
For Common security groups, choose the bastion security group you created.
For Configure storage, set storage to 8 GB.
Identify the public IP address of the bastion host and the private IP address of the target instance (for example, the login node of the cluster), and add two host entries in the SSH config, by referring to the following example:
Host Bastion
    HostName 11.22.33.44
    User ubuntu
    IdentityFile ~/keys/my-bastion-ssh-key.pem

Host MyCluster-LoginNode-with-Proxy
    HostName 10.1.111.222
    User user1
    IdentityFile ~/keys/my-cluster-ssh-key.pem
    ProxyCommand ssh -q -W %h:%p Bastion
Run the ssh command using the target host name you specified earlier, and confirm you can log in to the instance with the specified user:
$ ssh MyCluster-LoginNode-with-Proxy
:
:
____ __ ___ __ __ __ ___ __
/ __/__ ____ ____ / |/ /__ _/ /_____ ____ / // /_ _____ ___ ____/ _ ___ ___/ /
_ / _ `/ _ `/ -_) /|_/ / _ `/ ‘_/ -_) __/ / _ / // / _ / -_) __/ ___/ _ / _ /
/___/_,_/_, /__/_/ /_/_,_/_/_\__/_/ /_//_/_, / .__/__/_/ /_/ ___/_,_/
/___/ /___/_/
You’re on the controller
Instance Type: ml.m5.xlarge
user1@ip-10-1-111-222:~$
Clean up
Clean up the resources in the following order:
Delete the HyperPod cluster.
Delete the Network Load Balancer.
Delete the load balancing target group.
Delete the certificate imported to Certificate Manager.
Delete the EC2 Windows instance.
Delete the EC2 Linux instance for the bastion host.
Delete the AWS Managed Microsoft AD.
Delete the CloudFormation stack for the VPC, subnets, security group, and FSx for Lustre volume.
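If you script the cleanup, keeping the steps in an ordered list helps preserve the sequence above. The following sketch only records the order and the Boto3 client operation that we assume corresponds to each step; it makes no AWS calls and leaves resource identifiers for you to supply.

```python
# (Boto3 client name, operation name, resource) in the order listed above.
CLEANUP_STEPS = [
    ("sagemaker", "delete_cluster", "HyperPod cluster"),
    ("elbv2", "delete_load_balancer", "Network Load Balancer"),
    ("elbv2", "delete_target_group", "load balancing target group"),
    ("acm", "delete_certificate", "certificate imported to Certificate Manager"),
    ("ec2", "terminate_instances", "EC2 Windows instance"),
    ("ec2", "terminate_instances", "EC2 Linux instance for the bastion host"),
    ("ds", "delete_directory", "AWS Managed Microsoft AD"),
    ("cloudformation", "delete_stack", "CloudFormation stack"),
]

for number, (client_name, operation, resource) in enumerate(CLEANUP_STEPS, start=1):
    print(f"{number}. {resource}: {client_name}.{operation}")
```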
Conclusion
This post provided steps to create a HyperPod cluster integrated with Active Directory. This solution removes the hassle of user maintenance on large-scale clusters and allows you to manage users and groups centrally in one place.
For more information about HyperPod, check out the HyperPod workshop and the SageMaker HyperPod Developer Guide. Leave your feedback on this solution in the comments section.
About the Authors
Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in Cloud side technology. In his free time, he enjoys playing video games, reading books, and writing software.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering experience and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Monidipa Chakraborty currently serves as a Senior Software Development Engineer at Amazon Web Services (AWS), specifically within the SageMaker HyperPod team. She is committed to assisting customers by designing and implementing robust and scalable systems that demonstrate operational excellence. Bringing nearly a decade of software development experience, Monidipa has contributed to various sectors within Amazon, including Video, Retail, Amazon Go, and AWS SageMaker.
Satish Pasumarthi is a Software Developer at Amazon Web Services. With several years of software engineering experience and an ML background, he loves to bridge the gap between ML and systems and is passionate about building systems that make large-scale model training possible. He has worked on projects in a variety of domains, including machine learning frameworks, model benchmarking, and building the HyperPod beta, involving a broad set of AWS services. In his free time, Satish enjoys playing badminton.