...
...
This documentation provides guidance, instructions, and general information about the Research IT Services managed Research Cluster. Researchers execute computing jobs on the cluster in containerized environments known as Docker “containers”, which are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes. The Research Cluster uses the Kubernetes container management/orchestration system to route users’ containers onto compute nodes, monitor performance, and apply resource limits/quotas as appropriate.
...
Users can view the container CPU, GPU, and memory (RAM) utilization by selecting the ‘Show Usage’ button in the notebook header menu. The usage will display in the top right of the notebook as follows:
Users can view the container CPU and memory (RAM) utilization in the Bash command line interface by using the ‘htop’ command. To see GPU usage, enter the `/usr/local/nvidia/bin/nvidia-smi` command for a container that uses GPU.
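For example, from a terminal session inside a running container:
Code Block
# Show live per-process CPU and memory (RAM) utilization for the container
htop
# Show GPU utilization and GPU memory usage (GPU-enabled containers only)
/usr/local/nvidia/bin/nvidia-smi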
Machine Learning
Complex ML workflows are supported through terminal/SSH logins and a full Linux/Ubuntu CUDA development suite. Users may install additional library packages (e.g. conda/pip, CRAN) as needed, or can opt to replace the default environment entirely by launching their own custom Docker containers.
High-speed cluster-local storage houses workspaces and common training corpora (e.g. CIFAR, ImageNet).
Modifying Containers
Certain modifications can be made to containers, allowing users to adjust their environment to accommodate specific computing needs.
...
The Nielsen subscription datasets are available to authorized users at /uss/dsmlp-a/nielsen-dataset/. All of the datasets have been decompressed into this read-only directory, making it easy to use software (e.g. Stata or your own code) to read directly from the Nielsen directories. In the interest of being mindful of server space, please do not duplicate these large datasets in your home directory, and delete unneeded data from your home directory once you've completed your analyses and have your output files.
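For example, the data can be listed and read in place from the shell (the subdirectory and file names below are placeholders):
Code Block
# List the read-only Nielsen dataset directory
ls /uss/dsmlp-a/nielsen-dataset/
# Preview a file in place rather than copying it to your home directory
# (subdirectory and file names are placeholders)
head /uss/dsmlp-a/nielsen-dataset/some-subdirectory/some-file.tsv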
File Transfer
Standard utilities such as 'git', 'scp', 'sftp', and 'curl' are included in the 'pytorch' container image and may be used from the bash shell (command line interface) to retrieve code or data from both on- and off-campus servers. Files may also be copied into the cluster from external sources using Globus, SCP/SFTP, or RSYNC, as described in the following procedures.
See the page on using Globus to transfer data to and from your computer or another Globus collection.
Data may be copied to/from the cluster using the "SCP" or "SFTP" file transfer protocol from a Mac or Linux terminal window, or on Windows using a freely downloadable utility. We recommend this option for most users. Example using the Mac/Linux 'sftp' |
...
command line program:
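A minimal sketch is shown below; the login-node hostname and file names are placeholders to be replaced with the values for your account:
Code Block
# Open an SFTP session to the cluster login node (hostname is a placeholder)
sftp username@cluster-login-node.ucsd.edu
# At the sftp> prompt: upload a local file into your cluster home directory
put mydata.csv
# Download a result file back to your local computer
get results.txt
# End the session
quit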
On Windows, we recommend the WinSCP
...
utility.
On MacOS or Linux, 'rsync'
...
can be used from a
...
terminal window to synchronize data sets. Example using the Mac/Linux ‘rsync’ command line program:
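A minimal sketch, again with placeholder hostname and paths:
Code Block
# Recursively copy a local directory into your cluster home directory;
# -a preserves permissions/timestamps, -v reports progress (hostname is a placeholder)
rsync -av ./my-dataset/ username@cluster-login-node.ucsd.edu:~/my-dataset/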
...
Customizing a Container Environment
Each launch script specifies the default Docker image to use, along with the number of CPU cores, GPU cards, and GB of RAM assigned to its containers. When creating a customized container, it is recommended to use non-GPU (CPU-only) containers until your code is fully tested and a simple test training run has completed successfully. The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU; as such, a successful run in a CPU-only container should also be successful in a container with a GPU.
...
Adjusting Container Environment and CPU/RAM/GPU Limits
All running containers in the cluster are subject to a combined maximum of 8 CPU cores, 64GB RAM, and 1 GPU. You may run eight 1-core containers, one 8-core container, or any configuration within these bounds. Requests to increase these default limits, as well as requests for other adjustments (including software) to your container environment, may be submitted to rcd-support@ucsd.edu.
Alternate Docker Images
Besides CPU/RAM/GPU settings, users may specify an alternate or custom Docker image. The cluster servers will pull container images from dockerhub.io or elsewhere if requested, and users can create or modify these Docker images as needed.
Adjusting Launch Script Environments Command Line Options
Defaults set within a launch script's environment variables may be overridden using the following command-line options.
...
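As an illustrative sketch only (the launch script name and option letters shown here are hypothetical; consult the options table above for the exact flags your launch script accepts):
Code Block
# Hypothetical example: request 4 CPU cores, 16GB RAM, and 1 GPU,
# and substitute an alternate Docker image (all option letters are illustrative)
launch-pytorch.sh -c 4 -m 16 -g 1 -i dockeruser/custom-image:tag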
Custom Python Packages (Anaconda/PIP)
...
Users may install
...
additional Python packages within their containers using the PIP tool or standard Anaconda package management system
...
Users should install Python packages only after launching a container. Python packages are installed in a user’s home directory, so they will be available in all containers launched thereafter by that user.
For less complex installations, the PIP tool can be used to install Python packages.
...
Please see PIP documentation ‘User Installs’ for detailed guidance.
Anaconda is recommended for installing scientific packages with complex dependencies. Please see Anaconda's Getting Started for a guided introduction.
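For example (the package names below are only illustrative):
Code Block
# Install a package into your home directory with PIP (persists across containers)
pip install --user scikit-image
# Or install packages with Anaconda, which resolves complex binary dependencies
conda install --yes numpy pandas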
Background Execution / Long-Running Jobs
...
Running Jobs in a Background Container and Long-Running Jobs
To minimize the impact of abandoned or runaway jobs, the cluster allows jobs to run in a background container for up to 12 hours of execution time. Users specify that a job should run in a background container by using the "-b" command line option (see example below). To support longer run times, the default execution time limit can be extended upon request to rcd-support@ucsd.edu.
Note to users: Please be considerate and terminate any unused background jobs. GPU cards are limited in number and are assigned to containers on an exclusive basis; while a GPU is attached to your container, it is unusable by others even if it is idle.
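A minimal sketch of launching a background job (the launch script name is a placeholder for whichever launch script you normally run):
Code Block
# Launch a container in background mode using the -b option
launch-pytorch.sh -b
# Later, attach to the running background container by pod name
kubesh <pod-name>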
If your connection to a background container is lost, use the ‘kubesh <pod-name>’ command to reconnect to it.
If you need to terminate a background container,
...
use the ‘kubectl delete pod <pod-name>’ command.
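For example, assuming the standard kubectl client is available in your terminal session:
Code Block
# List your running pods to find the pod name of the background container
kubectl get pods
# Terminate a background container you no longer need
kubectl delete pod <pod-name>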
...
...
Run-Time Error Messages
(59) device-side assert
cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCTensorCopy.c:18
There may be instances where you receive a CUDA run-time error while running a job in a container. Below are a few of the more commonly encountered errors. These errors can typically be resolved by user adjustments; however, if you encounter a run-time error that requires more assistance to resolve, please contact rcd-support@ucsd.edu.
Indicates a run-time error in the CUDA code executing on the GPU
...
and is commonly due to out-of-bounds array access. Consider running in CPU-only mode (remove the .cuda() call) to obtain more specific debugging messages.
(2) out of memory
...
...
...
...
GPU memory has been exhausted. Try reducing your
...
dataset size, or confine your job to 11GB GTX 1080Ti cards rather than 6GB Titan or 8GB K5200 (see “Adjusting Launch Script
...
Environments Command Line Options” in this user guide).
...
...
...
...
...
...
...
This indicates a hardware error on the assigned GPU, and usually requires a reboot of the cluster node to correct. As a temporary workaround, you may explicitly direct your job to another node
...
(see 'Adjusting Launch Script Environments Command
...
Line Options” in this user guide).
Please report this type of error directly to rcd-support@ucsd.edu
...
for assistance.
Monitoring Cluster Status
The ‘cluster-status’ command provides insight into the number of jobs currently running and the GPU/CPU/RAM allocated. Users can also refer to the cluster ‘Node Status’ page for updates on containers (or images).
We plan to deploy more sophisticated monitoring tools over the coming months.
...
Installing TensorBoard
Our current configuration doesn’t permit easy access to TensorBoard via port 6006, but the following shell commands will install a TensorBoard interface accessible within the Jupyter environment:
Code Block
pip install -U --user jupyter-tensorboard
jupyter nbextension enable jupyter_tensorboard/tree --user
You’ll need to exit your Pod/container and restart for the change to take effect.
Usage instructions for ‘jupyter_tensorboard’ are available at:
https://github.com/lspvic/jupyter_tensorboard#usage
Cluster Hardware Specifications
Cluster architecture diagram
Node | CPU Model | #Cores ea. | RAM ea. | #GPU | GPU Model | Family | CUDA Cores | GPU RAM | GFLOPS ea. |
Nodes 1-4 | 2xE5-2630 v4 | 20 | 384GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
Nodes 5-8 | 2xE5-2630 v4 | 20 | 256GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
Node 9 | 2xE5-2650 v2 | 16 | 128GB | 8 | GTX Titan | Kepler | 2688 ea. | 6GB | 4500 |
Node 10 | 2xE5-2670 v3 | 24 | 320GB | 7 | GTX 1070Ti | Pascal | 2432 ea. | 8GB | 7800 |
Nodes 11-12 | 2xXeon Gold 6130 | 32 | 384GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
Nodes 13-15 | 2xE5-2650 v1 | 16 | 320GB | n/a | n/a | n/a | n/a | n/a | n/a |
Nodes 16-18 | 2xAMD 6128 | 24 | 256GB | n/a | n/a | n/a | n/a | n/a | n/a |
Nodes are connected via an Arista 7150 10Gb Ethernet switch.
...