...
...
This documentation provides guidance, instructions, and general information about the Research Cluster managed by Research IT Services. Researchers execute computing jobs on the cluster in containerized environments known as Docker "containers", which are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware and well isolated from other users' processes. The Research Cluster uses the Kubernetes container management/orchestration system to route users' containers onto compute nodes, monitor performance, and apply resource limits/quotas as appropriate.
Complex machine learning workflows are supported through terminal/SSH logins and a full Linux/Ubuntu CUDA development suite. Users may install additional library packages (e.g. conda/pip, CRAN) as needed, or can replace the default environment entirely by launching their own custom Docker containers. High-speed cluster-local storage houses workspaces and common training corpora (e.g. CIFAR, ImageNet).
...
Getting Started
There are two ways to access the Research Cluster: via SSH or via the Datahub.
Accessing the Research Cluster via SSH
First, log in via SSH to the "researchcluster-login.ucsd.edu" Linux login node using your UC San Diego Active Directory (AD) username (with '@researchcluster-login.ucsd.edu') and password. After logging in, you will be on a login node for the Research Cluster; do not perform any computation on the login node.
Login step-by-step guidance:
Open a command line interface, known as 'Terminal' on macOS and 'Command Prompt' on Windows.
...
You may be asked a question after entering your username. Select 'yes' to continue connecting.
IMPORTANT: DO NOT RUN JOBS ON THE LOGIN NODE. Jobs must only be run in a launched container. Follow the guidance in the next section (Launching a Container) before running your compute jobs.
Launching a Container
After signing into the login node, you can start a pod/container using either a standard Research Cluster launch script or a customized container launch script.
Once started, containers are accessible in either a Bash shell (command line) or a Jupyter/Python Notebook environment. Users may access their Jupyter notebook by copying the link printed by the launch script and pasting it into the browser address bar. This link will work as long as your container is active and will cease to work once you log out. The Docker container image and CPU/GPU/RAM settings are all configurable; see the "Customization" and "Launch Script Command-line Options" sections below for more details.
Containers terminate automatically when users exit the interactive shell.
...
The standard launch scripts are predefined, meaning they have specific RAM and CPU (and/or GPU) configurations. Other launch scripts are available at /softwareopt/common64launch-sh/dsmlp/bin/ .
Standard images with 'pytorch' include the GNU Screen utility, which may be used to manage multiple terminal sessions in a window-like manner.
Web Interface Tool
The Research Cluster provides a web interface tool, JupyterHub notebooks, as a graphical alternative for users who prefer this type of interface to the command line.
To access the web interface tool, sign in at https://datahub.ucsd.edu/ (or select the login button at the top of the page).
...
Click on the "Log In" button above, or visit https://datahub.ucsd.edu/ and sign in with your UC San Diego Google account and password.
Click the button.
Select a software and hardware configuration via the "Spawner options" page.
Open a blank Python 3 notebook.
When your work is complete, please shut down your Notebook via the Control Panel's "Stop my Server" option.
...
Users can view a container's CPU and memory (RAM) utilization from the Bash command line by using the 'htop' command. To see GPU usage in a container that has a GPU, enter the `/usr/local/nvidia/bin/nvidia-smi` command.
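To record GPU usage from a script instead of watching it interactively, the same `nvidia-smi` binary can be queried in CSV form. The following is a minimal sketch, not part of the cluster tooling: `parse_gpu_line` and `gpu_usage` are hypothetical helper names, and the sketch assumes the container has a GPU, that the binary lives at the path given above, and that the GPU reports numeric values for all three fields.

```python
import subprocess

# Path taken from the text above; adjust it if nvidia-smi lives elsewhere in your image.
NVIDIA_SMI = "/usr/local/nvidia/bin/nvidia-smi"


def parse_gpu_line(line: str) -> dict:
    """Parse one 'utilization, memory used, memory total' CSV row."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}


def gpu_usage() -> list:
    """Return one dict per GPU visible in the container (hypothetical helper)."""
    result = subprocess.run(
        [
            NVIDIA_SMI,
            "--query-gpu=utilization.gpu,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]
```

Calling `gpu_usage()` inside a GPU container returns a list with one entry per assigned GPU; in a CPU-only container the call fails because the binary or driver is not present.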
Modifying Containers
Users can modify their containers to adjust the environment for specific computing needs.
...
Containers may occasionally exit (or unexpectedly terminate) with one of the following error messages:
Note: These errors appear in the status column of the 'kubectl get pods' output.
Data Storage / Datasets
Two types of persistent file storage are available within containers: private home directory storage and shared directory storage.
A private home directory ($HOME) is automatically generated for each cluster user. Each user's private home directory is limited to a 100 GB storage allocation by default.
A shared directory holds group-shared data or common datasets (e.g. CIFAR-10, Tiny ImageNet) distributed for individual access. Shared directory capacity can vary because this storage may be mounted storage.
In specific cases, Research IT may make allowances to temporarily increase storage in a user’s private home directory. These requests may be submitted by emailing rcd-support@ucsd.edu.
Standard Datasets
Contact Research IT to request installation of additional datasets.
The Nielsen subscription datasets are available to authorized users at /uss/dsmlp-a/nielsen-dataset/. All of the datasets have been decompressed into this read-only directory, making it easy for users to read directly from the Nielsen directories with software such as Stata or their own code. To conserve server space, please do not duplicate these large datasets to your home directory, and delete unneeded data from your home directory once you have completed your analyses and have your output files.
File Transfer
Users can use commands such as 'git', 'scp', 'sftp', and 'curl' in the Bash shell (command line interface) to import code or data from external servers, whether on or off campus. Files can also be copied into the cluster from external sources using Globus, SCP/SFTP, or rsync.
See the page on using Globus to transfer data to and from your computer or another Globus collection.
...
On macOS or Linux, 'rsync' can be used from a terminal window to synchronize data sets.
Customizing a Container Environment
Each launch script specifies the default Docker image to use and the number of CPU cores, GPU cards, and GB of RAM assigned to a container. When creating a customized container, it is recommended to use CPU-only containers until your code is fully tested and a test training run has completed successfully. Because the PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU, a successful run in a CPU-only container should also succeed in a container with a GPU.
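The CPU-first workflow described above relies on writing device-agnostic code. The sketch below shows one common PyTorch pattern for this; it is an illustration rather than cluster-specific code. `select_device` is a hypothetical helper, the tiny linear model is a placeholder, and the guarded import is only there so the snippet also runs where PyTorch is not installed.

```python
try:
    import torch  # present in the PyTorch images; may be absent elsewhere
except ImportError:
    torch = None


def select_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = select_device()

if torch is not None:
    # The same code runs unchanged in CPU-only and GPU containers:
    # only the device string differs.
    model = torch.nn.Linear(4, 2).to(device)
    batch = torch.randn(8, 4, device=device)
    output = model(batch)
    print("ran forward pass on", device, "with output shape", tuple(output.shape))
else:
    print("PyTorch not installed; selected device would be", device)
```

Keeping the device choice in one place means a script tested in a CPU-only container needs no edits when it is later launched with a GPU.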
Adjusting Container Environment and CPU/RAM/GPU limits
All running containers in the cluster have a maximum configuration limit of 8 CPU cores, 64 GB of RAM, and 1 GPU. You may run eight 1-core containers, one 8-core container, or any configuration within these bounds. To request increases to these default limits, or other adjustments (including software) to your container environment, email rcd-support@ucsd.edu.
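The arithmetic behind these bounds can be sketched as a quick check. This is an illustration only: it assumes the limit applies to the sum of one user's running containers, as the examples above suggest, and `Resources` and `within_limit` are hypothetical names, not cluster tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu: int      # CPU cores
    ram_gb: int   # GB of RAM
    gpu: int = 0  # GPU cards


# Default limit quoted in the text above.
DEFAULT_LIMIT = Resources(cpu=8, ram_gb=64, gpu=1)


def within_limit(containers, limit=DEFAULT_LIMIT) -> bool:
    """True if the summed requests of all containers stay within the limit."""
    return (
        sum(c.cpu for c in containers) <= limit.cpu
        and sum(c.ram_gb for c in containers) <= limit.ram_gb
        and sum(c.gpu for c in containers) <= limit.gpu
    )


print(within_limit([Resources(1, 8)] * 8))      # eight 1-core containers -> True
print(within_limit([Resources(8, 64, 1)]))      # one 8-core GPU container -> True
print(within_limit([Resources(4, 16, 1)] * 2))  # two GPU containers -> False
```

The last case fails only on the GPU count, which is why a second GPU container requires a limit increase even when CPU and RAM are available.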
Alternate Docker Images
In addition to configuration settings, users can import alternate or custom Docker images. The cluster servers will pull container images from dockerhub.io or elsewhere if requested. You can create or modify these Docker images as needed.
Adjusting Launch Script Environments Command Line Options
Users can override the default environment variables of a launch script using specific command line options.
...
Custom Python Packages (Anaconda/PIP)
Users may install additional Python packages within their containers using the pip tool or the standard Anaconda package management system. Packages should only be installed after launching a container. Because packages are installed into the user's home directory, they remain available in all containers the user launches thereafter.
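To see where home-directory installs end up, Python's standard `site` module reports the per-user package directory. This sketch assumes packages are installed with pip's per-user mode (`pip install --user <package>`), which the text above does not state explicitly; the path printed depends on the Python version in your container.

```python
import site
import sys

# Directory that `pip install --user <package>` installs into.
user_site = site.getusersitepackages()
print("per-user site-packages:", user_site)

# Python imports from this directory only when user site-packages are enabled
# and the directory already exists at interpreter start-up.
print("user site enabled:", site.ENABLE_USER_SITE)
print("already on sys.path:", user_site in sys.path)
```

If the directory was created by the first install during the current session, it is not yet on `sys.path`, which is consistent with needing to restart before new packages are picked up.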
...
Note: You'll need to exit your Pod/container and restart for the change to take effect.
Running Jobs in a Background Container and Long-Running Jobs
To minimize the impact of abandoned/runaway jobs, the cluster allows jobs to run in a background container for up to 12 hours of execution time. To run a job in a background container, specify the "-b" command line option. To support longer run times, the default execution time can be extended upon request to rcd-support@ucsd.edu.
Note to users: Please be considerate and terminate any unused background jobs. GPU cards are limited and assigned to containers on an exclusive basis. A GPU attached to your container is unusable by others, even while it sits idle.
If you are disconnected from your background container, use the 'kubesh <pod-name>' command to connect or reconnect to it.
...
Run-Time Error Messages
You may occasionally receive a CUDA run-time error while running a job in a container. Below are a few of the more commonly encountered errors. These errors can typically be resolved by user adjustments. However, if you encounter a run-time error that requires more assistance to resolve, please contact rcd-support@ucsd.edu.
...
This indicates a hardware error on the assigned GPU and usually requires a reboot of the cluster node to correct. As a temporary workaround, you may explicitly direct your job to another node (see "Adjusting Launch Script Environments Command Line Options" in this user guide).
Please report this type of error directly to rcd-support@ucsd.edu for assistance.
Monitoring Cluster Status
Users can enter the 'cluster-status' command for insight into the number of jobs currently running and the GPU/CPU/RAM allocated. Alternatively, users can refer to the cluster 'Node Status' page for updates on containers (or images).
Cluster Hardware Specifications
The Research Cluster shares hardware infrastructure with the Data Science and Machine Learning Platform (DSMLP). As such, the hardware specifications for the Research Cluster are described in the cluster architecture diagram (shown in reference to the DSMLP).
...
Licensed Software
Installing licensed software is allowed in the Research Cluster; however, the software version must be compatible with installation in a cluster environment. The purchase of licensed software is the responsibility of the user or their sponsoring department. Research IT Services is available to assist with the installation of licensed software. For questions about installing licensed software, please email rcd-support@ucsd.edu.
For users with provisioned Stata licensing, the launch-scipy-ml.sh container is capable of executing Stata. Stata can be installed in your home directory by the Research IT Services team and can be executed using the command '~/stata-se' from within a container.
Acknowledging Research IT Services
Papers, presentations, and other publications featuring research that benefited from the Research Cluster computing resource, services, or support expertise may include the following acknowledgement in the text:
This research was done using the UC San Diego Research Cluster computing resource, supported by Research IT Services and provided by Academic Technology Services / IT Services.