
UCSD Research Cluster: User Guide


  

Overview

UC San Diego’s Research Cluster, a service of Research IT Services, provides researchers in all disciplines and divisions with access to more than 80 modern GPUs on 10 physical hardware nodes housed at the San Diego Supercomputer Center (SDSC). Funding for the cluster was provided by Research IT Services.

To report problems, or to request assistance, please email Research Computing & Data Support: rcd-support@ucsd.edu. 

Jobs on the cluster are executed in the form of Docker “containers,” which are essentially lightweight virtual machines: each is assigned dedicated CPU, RAM, and GPU hardware, and each is well isolated from other users’ processes.

The Kubernetes container management/orchestration system routes users’ containers onto compute nodes, monitors performance, and applies resource limits/quotas as appropriate.  

Please be considerate and terminate idle containers:  while containers share system RAM and CPU resources under the standard Linux/Unix model, the cluster’s 80 GPU cards are assigned to users on an exclusive basis.  When attached to a container they become unusable by others even if completely idle.

Access to the Front-end / Submission Node

Getting Started

To start a Pod (container), first log in via SSH to the "dsmlp-login.ucsd.edu" Linux server using your UC San Diego Active Directory (AD) credentials. This server acts as the front-end/submission node for the cluster; computation is handled elsewhere.

Login steps:

  • Open a command-line interface: Terminal on macOS or Command Prompt on Windows.

  • Enter the command 'ssh yourusername@dsmlp-login.ucsd.edu'.

  • Enter your password. Note: your password will not be displayed as you type it.

The following will display once login is successful:

You are now working in the Login Node. 

DO NOT RUN JOBS IN THE LOGIN NODE. Jobs must only be run in a container. Follow the guidance in the next section (Launching a Container) before running your jobs.

Launching a Container

After signing on to the front-end node, you may start a Pod/container using any of the following launch scripts:

Launch Script          | Description                   | #GPU | #CPU | RAM (GB) | Container Image(s)
launch-scipy-ml.sh     | Python 3, PyTorch, TensorFlow | 0    | 2    | 8        | ucsdets/scipy-ml-notebook:2020.2.9
launch-scipy-ml-gpu.sh | Python 3, PyTorch, TensorFlow | 1    | 4    | 16       | ucsdets/scipy-ml-notebook:2020.2.9
launch-datascience.sh  | Python 3, Datascience, R      | 0    | 2    | 8        | ucsdets/datascience-notebook:2020.2-stable
launch-rstudio.sh      | RStudio                       | 1    | 4    | 16       | ucsdets/datascience-rstudio:latest


Other launch scripts are available at /software/common64/dsmlp/bin/.

Docker container image and CPU/GPU/RAM settings are all configurable; see the “Customization” and "Launch Script Command-line Options" sections below.

We encourage you to use non-GPU (CPU-only) containers until your code is fully tested and a simple training run has completed successfully. (The PyTorch, TensorFlow, and Caffe toolkits can all switch easily between CPU and GPU.)
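For example, once inside a container you can quickly confirm whether PyTorch can see a GPU before committing to a GPU launch (a minimal check using the container's Python interpreter):

$ python -c "import torch; print('CUDA available:', torch.cuda.is_available())"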

Once started, containers can provide Bash (shell/command-line), as well as Jupyter/Python Notebook environments.

Bash Shell / Command Line

The predefined launch scripts initiate an interactive Bash shell similar to ‘ssh’; containers terminate when this interactive shell exits. Our ‘pytorch’ image includes the GNU Screen utility, which may be used to manage multiple terminal sessions in a window-like manner.  
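For instance, a long-running command can be kept alive in a detachable Screen session inside the container (a brief sketch; 'train.py' is a placeholder for your own script):

$ screen -S training          # start a named Screen session
$ python train.py             # run your job inside that session, then press Ctrl-a d to detach
$ screen -r training          # reattach to the session later

Note that Screen sessions live only as long as the container itself; for jobs that must outlive your interactive session, see Background Execution / Long-Running Jobs below.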

Jupyter / Python Notebooks

The default container configuration creates an interactive web-based Jupyter/Python Notebook which may be accessed via a TCP proxy URL output by the launch script. Note that access to the TCP proxy URL requires a UCSD IP address: either on-campus wired/wireless, or VPN. See http://blink.ucsd.edu/go/vpn for instructions on the campus VPN.

Monitoring Resource Usage within Jupyter/Python Notebooks

Users of the stock containers will find CPU/Memory (RAM)/GPU utilization noted at the top of the Jupyter notebook screen.

Monitoring Resource Usage within Containers

Users of the Bash command line can check the CPU and RAM usage of their pod with the 'htop' command. To see GPU usage, run '/usr/local/nvidia/bin/nvidia-smi'.
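For example, from the Bash shell inside a container (the '-l 5' flag simply re-polls every 5 seconds):

$ htop                                     # interactive CPU and RAM view; press q to quit
$ /usr/local/nvidia/bin/nvidia-smi -l 5    # print GPU utilization every 5 seconds; Ctrl-C to stop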

Container Run Time Limits

By default, containers are limited to 6 hours of execution time to minimize the impact of abandoned or runaway jobs. This limit may be increased, up to 12 hours, by setting the "K8S_TIMEOUT_SECONDS" configuration variable before launch. Contact us at rcd-support@ucsd.edu if you require more than 12 hours.

$ export K8S_TIMEOUT_SECONDS=$(( 3600 * 12 ))
$ launch-scipy-ml.sh


Container Termination Messages

Containers may occasionally exit with one of the following error messages, which appear in the STATUS column of 'kubectl get pods':

Status           | Meaning
OOMKilled        | The container's memory (CPU RAM) limit was reached.
DeadlineExceeded | The container's time limit (default 6 hours) was exceeded; see above.
Error            | Unspecified error. Contact rcd-support@ucsd.edu for assistance.


Data Storage / Datasets

Two types of persistent file storage are available within containers: a private home directory ($HOME) for each user, and a shared /datasets directory used to distribute common data (e.g., CIFAR-10, Tiny ImageNet).

Each user's home directory is limited to 100GB.
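To check how much of that quota you are using, standard tools such as 'du' work from the login node or from inside a container, for example:

$ du -sh $HOME               # total space used by your home directory
$ du -sh $HOME/* | sort -h   # per-item breakdown (dot-files excluded), largest entries last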

Standard Datasets

Name                | Path                    | Size  | #Files | Notes
MNIST               | /datasets/MNIST         | 53M   | 4      |
ImageNet Fall 2011  | /datasets/imagenet      | 1300G | 14M    |
ImageNet 32x32 2010 | /datasets/imagenet-ds   | 1800M | 2.6M   | ILSVRC2012, downsampled 32x32/64x64
Tiny-ImageNet       | /datasets/Tiny-ImageNet | 353M  | 120k   |
CIFAR-10            | /datasets/CIFAR-10      | 178M  | 9      |
Caltech256          | /datasets/Caltech256    | 1300M | 30k    |
ShapeNet            | /datasets/ShapeNet      | 204G  | 981k   | ShapeNetCore v1/v2
MJSynth             | /datasets/MJSynth       | 36G   | 8.9M   | Synthetic Word Dataset
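Because these datasets are shared and mounted at the same paths in every container, there is no need to copy them into your home directory; inspect them and link them into a workspace in place. A brief sketch (the '~/myproject' path is only a placeholder):

$ ls /datasets/CIFAR-10                                                   # inspect the dataset's layout
$ mkdir -p ~/myproject && ln -s /datasets/CIFAR-10 ~/myproject/CIFAR-10   # symlink rather than copy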

Contact rcd-support@ucsd.edu to request installation of additional datasets.

Chicago Booth Kilts Center for Marketing: Nielsen Datasets

The Nielsen subscription datasets are available to authorized users at /uss/dsmlp-a/nielsen-dataset/. All of the datasets have been decompressed into this read-only directory, so software (e.g., Stata or your own code) can read directly from the Nielsen directories. To conserve server space, please do not copy these large datasets into your home directory, and delete unneeded data from your home directory once you have completed your analyses and saved your output files.

File Transfer

Standard utilities such as 'git', 'scp', 'sftp', and 'curl' are included in the 'pytorch' container image and may be used to retrieve code or data from on- or off-campus servers.    

Files also may be copied into the cluster from the outside using the following procedures.

Copying Data Into the Cluster: Using Globus

See the page on using Globus to transfer data to and from your computer or another Globus collection.

Copying Data Into the Cluster: SCP/SFTP from your computer

Data may be copied to/from the cluster using the "SCP" or "SFTP" file transfer protocol from a Mac or Linux terminal window, or on Windows using a freely downloadable utility.  We recommend this option for most users.


Example using the Mac/Linux 'sftp' command line program:

slithy:Downloads agt$ sftp <username>@dsmlp-login.ucsd.edu
pod agt-4049 up and running; starting sftp
Connected to ieng6.ucsd.edu
sftp> put 2017-11-29-raspbian-stretch-lite.img
Uploading 2017-11-29-raspbian-stretch-lite.img to /datasets/home/08/108/agt/2017-11-29-raspbian-stretch-lite.img
2017-11-29-raspbian-stretch-lite.img             100% 1772MB  76.6MB/s   00:23    
sftp> quit
sftp complete; deleting pod agt-4049
slithy:Downloads agt$

On Windows, we recommend the WinSCP utility.

  • After installing WinSCP, the tool will open and you will be prompted to enter the following information:

    • Host name: dsmlp-login.ucsd.edu

    • User name: ad_username

    • Password: ad_password

Copying Data Into the Cluster: rsync

'rsync' also may be used from a Mac or Linux terminal window to synchronize data sets:

slithy:ME198 agt$ rsync -avr tub_1_17-11-18 <username>@dsmlp-login.ucsd.edu:
pod agt-9924 up and running; starting rsync
building file list ... done
rsync complete; deleting pod agt-9924
sent 557671 bytes  received 20 bytes  53113.43 bytes/sec
total size is 41144035  speedup is 73.78
slithy:ME198 agt$

Customization of the Container Environment

Each launch script specifies a default Docker image and the number of CPU cores, GPU cards, and GB of RAM assigned to its containers. An example of such a launch configuration is as follows:

K8S_DOCKER_IMAGE="ucsdets/instructional:cse190fa17-latest"
K8S_ENTRYPOINT="/run_jupyter.sh"

K8S_NUM_GPU=1  # max of 1 (contact ETS to raise limit)
K8S_NUM_CPU=4  # max of 8 ("")
K8S_GB_MEM=32  # max of 64 ("")

# Controls whether an interactive Bash shell is started
SPAWN_INTERACTIVE_SHELL=YES

# Sets up proxy URL for Jupyter notebook inside
PROXY_ENABLED=YES
PROXY_PORT=8888


Users may copy an existing launch script into their home directory, then modify that private copy, for example:


$ cp -p `which launch-pytorch.sh` $HOME/my-launch-pytorch.sh
$ nano $HOME/my-launch-pytorch.sh    
$ $HOME/my-launch-pytorch.sh

Adjusting Software Environment, CPU/RAM/GPU limits

The maximum limits (8 CPU cores, 64GB RAM, 1 GPU) apply across all of your running containers: you may run eight 1-core containers, one 8-core container, or anything in between. Please contact rcd-support@ucsd.edu to request an increase to these default limits or to adjust your software environment.
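Because the limits are pooled across your pods, it is worth checking what you already have running before launching another container:

$ kubectl get pods                # list your currently running pods
$ kubectl delete pod <pod-name>   # free the resources held by a pod you no longer need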

Alternate Docker Images

Besides CPU/RAM/GPU settings, you may specify an alternate Docker image: our servers will pull container images from Docker Hub or other registries on request. You may create or modify these Docker images as needed.
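For example, an alternate image can be passed to a launch script with the '-i' option described in the next section (the image name below is only a placeholder for an image you have published):

$ launch-scipy-ml.sh -i mydockerid/my-custom-image:latest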

Launch Script Command Line Options

Defaults set within launch scripts' environment variables may be overridden using the following command-line options:

Option   | Description                                    | Example
-c N     | Adjust # CPU cores                             | -c 8
-g N     | Adjust # GPU cards                             | -g 2
-m N     | Adjust # GB RAM                                | -m 64
-i IMG   | Docker image name                              | -i nvidia/cuda:latest
-e ENTRY | Docker image ENTRYPOINT/CMD                    | -e /run_jupyter.sh
-n N     | Request specific cluster node (1-10)           | -n 7
-v       | Request specific GPU (gtx1080ti, k5200, titan) | -v k5200
-b       | Request background pod                         | (see below)

Example:

[cs190f @ieng6-201]:~:56$  launch-py3torch-gpu.sh -m 64 -v k5200

Custom Python Packages (Anaconda/PIP)

Users may install personal Python packages within their containers using the standard Anaconda package management system; please see Anaconda's Getting Started guide for a 30-minute introduction.  Anaconda is recommended for installing scientific packages with complex dependencies.
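For example, a personal Conda environment can hold your own package set; because it is written under your home directory (typically ~/.conda/envs when the system-wide install is not writable), it persists across containers. A sketch with placeholder environment and package names; you may need to run 'conda init bash' once and restart your shell before 'conda activate' works:

$ conda create -y -n myenv python=3.7 numpy pandas   # create a personal environment under your home directory
$ conda activate myenv                               # switch to it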

For less complex installations, the pip tool can be used to install Python packages. Packages installed with pip's '--user' option go into your home directory, so you get a consistent set of packages across all of your containers.

Example of a user-specific (--user) package installation using 'pip':

agt@agt-10859:~$ pip install --user imutils
Collecting imutils
  Downloading imutils-0.4.5.tar.gz
Building wheels for collected packages: imutils
  Running setup.py bdist_wheel for imutils ... done
  Stored in directory: /tmp/xdg-cache/pip/wheels/ec/e4/a7/17684e97bbe215b7047bb9b80c9eb7d6ac461ef0a91b67ae71
Successfully built imutils
Installing collected packages: imutils
Successfully installed imutils-0.4.5

Background Execution / Long-Running Jobs

To minimize the impact of abandoned or runaway jobs, background execution of containers is permitted for up to 12 hours of execution time via the "-b" command-line option (see the example below). Longer run times can be supported on request; send email to rcd-support@ucsd.edu.

Use the ‘kubesh <pod-name>’ command to connect or reconnect to a background container, and ‘kubectl delete pod <pod-name>’ to terminate it.

Please be considerate and terminate any unused background jobs:  GPU cards are assigned to containers on an exclusive basis, and when attached to a container are unusable by others even if idle.


[amoxley@dsmlp-login]:~:504$ launch-scipy-ml.sh -b
Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 0 GPU units.
   (Adjust command line options, or edit "/software/common64/dsmlp/bin/launch-scipy-ml.sh" to change this configuration.)
pod/amoxley-5497 created
Mon Mar 9 14:04:10 PDT 2020 starting up - pod status: Pending ; containers with incomplete status: [init-support]
Mon Mar 9 14:04:15 PDT 2020 pod is running with IP: 10.43.128.17 on node: its-dsmlp-n25.ucsd.edu
ucsdets/scipy-ml-notebook:2019.4-stable is now active.

Connect to your background pod via: "kubesh amoxley-5497"
Please remember to shut down via: "kubectl delete pod amoxley-5497" ; "kubectl get pods" to list running pods.
You may retrieve output from your pod via: "kubectl logs amoxley-5497".
PODNAME=amoxley-5497
[amoxley@dsmlp-login]:~:505$ kubesh amoxley-5497

amoxley@amoxley-5497:~$ hostname
amoxley-5497
amoxley@amoxley-5497:~$ exit
exit

[amoxley@dsmlp-login]:~:506$ kubectl get pods
NAME           READY   STATUS    RESTARTS   AGE
amoxley-5497   1/1     Running   0          45s

[amoxley@dsmlp-login]:~:507$ kubectl delete pod amoxley-5497
pod "amoxley-5497" deleted
[amoxley@dsmlp-login]:~:508$


Common CUDA Run-Time Error Messages

(59) device-side assert 

cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCTensorCopy.c:18

Indicates a run-time error in the CUDA code executing on the GPU, commonly due to out-of-bounds array access. Consider running in CPU-only mode (remove the .cuda() call) to obtain more specific debugging messages.
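If your script selects its device dynamically (for example via torch.cuda.is_available()), you can force a CPU-only run without editing code by hiding the GPU from the toolkit ('main.py' below is a placeholder for your own script):

$ CUDA_VISIBLE_DEVICES="" python main.py   # no GPUs are visible, so the job falls back to the CPU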

(2) out of memory

 RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503968623488/work/torch/lib/THC/generic/THCStorage.cu:66

GPU memory has been exhausted. Try reducing your batch size, or confine your job to the 11GB GTX 1080Ti cards rather than the 6GB Titan or 8GB K5200 cards (see Launch Script Command Line Options).
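The GPU type can be requested at launch time with the '-v' option (see Launch Script Command Line Options), for example:

$ launch-scipy-ml-gpu.sh -v gtx1080ti   # request an 11GB GTX 1080Ti card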

(30) unknown error

RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/THCGeneral.c:70

This indicates a hardware error on the assigned GPU, and usually requires a reboot of the cluster node to correct. As a temporary workaround, you may explicitly direct your job to another node; see Launch Script Command-line Options. 

Please report these errors to rcd-support@ucsd.edu.

Monitoring Cluster Status

The ‘cluster-status’ command provides insight into the number of jobs currently running and GPU/CPU/RAM allocated.

We plan to deploy more sophisticated monitoring tools over the coming months.


Installing TensorBoard

Our current configuration doesn’t permit easy access to TensorBoard via port 6006, but the following shell commands will install a TensorBoard interface accessible within the Jupyter environment:


pip install -U --user jupyter-tensorboard
jupyter nbextension enable jupyter_tensorboard/tree --user


You’ll need to exit your Pod/container and restart for the change to take effect.

Usage instructions for ‘jupyter_tensorboard’ are available at:

https://github.com/lspvic/jupyter_tensorboard#usage

Hardware Specifications

Cluster architecture diagram

Node        | CPU Model         | #Cores ea. | RAM ea. | #GPU | GPU Model        | Family | CUDA Cores | GPU RAM | GFLOPS
Nodes 1-4   | 2x E5-2630 v4     | 20         | 384 GB  | 8    | GTX 1080Ti       | Pascal | 3584 ea.   | 11 GB   | 10600
Nodes 5-8   | 2x E5-2630 v4     | 20         | 256 GB  | 8    | GTX 1080Ti       | Pascal | 3584 ea.   | 11 GB   | 10600
Node 9      | 2x E5-2650 v2     | 16         | 128 GB  | 8    | GTX Titan (2014) | Kepler | 2688 ea.   | 6 GB    | 4500
Node 10     | 2x E5-2670 v3     | 24         | 320 GB  | 7    | GTX 1070Ti       | Pascal | 2432 ea.   | 8 GB    | 7800
Nodes 11-12 | 2x Xeon Gold 6130 | 32         | 384 GB  | 8    | GTX 1080Ti       | Pascal | 3584 ea.   | 11 GB   | 10600
Nodes 13-15 | 2x E5-2650 v1     | 16         | 320 GB  | n/a  | n/a              | n/a    | n/a        | n/a     | n/a
Nodes 16-18 | 2x AMD 6128       | 24         | 256 GB  | n/a  | n/a              | n/a    | n/a        | n/a     | n/a


Nodes are connected via an Arista 7150 10Gb Ethernet switch.  

Additional nodes may be added to the cluster at peak times.

Example: PyTorch Session with TensorFlow examples


slithy:~ agt$
slithy:~ agt$ ssh cs190f@ieng6.ucsd.edu
Password:
Last login: Thu Oct 12 12:29:30 2017 from slithy.ucsd.edu
============================ NOTICE =================================
Authorized use of this system is limited to password-authenticated
usernames which are issued to individuals and are for the sole use of
the person to whom they are issued.
 
Privacy notice: be aware that computer files, electronic mail and
accounts are not private in an absolute sense.  For a statement of
"ETS (formerly ACMS) Acceptable Use Policies" please see our webpage
at http://acms.ucsd.edu/info/aup.html.
=====================================================================
 

Disk quotas for user cs190f (uid 59457):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
acsnfs4.ucsd.edu:/vol/home/linux/ieng6
                      11928  5204800 5204800                 272        9000        9000      
=============================================================
Check Account Lookup Tool at http://acms.ucsd.edu
=============================================================

[…]

Thu Oct 12, 2017 12:34pm - Prepping cs190f
[cs190f @ieng6-201]:~:56$ launch-pytorch-gpu.sh
Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 1 GPU units.  (Edit /home/linux/ieng6/cs190f/public/bin/launch-pytorch.sh to change this configuration.)
pod "cs190f -4953" created
Thu Oct 12 12:34:41 PDT 2017 starting up - pod status: Pending ;
Thu Oct 12 12:34:47 PDT 2017 pod is running with IP: 10.128.7.99
tensorflow/tensorflow:latest-gpu is now active.

Please connect to: http://ieng6-201.ucsd.edu:4957/?token=669d678bdb00c89df6ab178285a0e8443e676298a02ad66e2438c9851cb544ce

Connected to cs190f-4953; type 'exit' to terminate processes and close Jupyter notebooks.
cs190f@cs190f-4953:~$ ls
TensorFlow-Examples
cs190f@cs190f-4953:~$
cs190f@cs190f-4953:~$ git clone https://github.com/yunjey/pytorch-tutorial.git
Cloning into 'pytorch-tutorial'...
remote: Counting objects: 658, done.
remote: Total 658 (delta 0), reused 0 (delta 0), pack-reused 658
Receiving objects: 100% (658/658), 12.74 MiB | 24.70 MiB/s, done.
Resolving deltas: 100% (350/350), done.
Checking connectivity... done.
cs190f@cs190f-4953:~$ cd pytorch-tutorial/
cs190f@cs190f-4953:~/pytorch-tutorial$ cd tutorials/02-intermediate/bidirectional_recurrent_neural_network/
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ python main-gpu.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Epoch [1/2], Step [100/600], Loss: 0.7028
Epoch [1/2], Step [200/600], Loss: 0.2479
Epoch [1/2], Step [300/600], Loss: 0.2467
Epoch [1/2], Step [400/600], Loss: 0.2652
Epoch [1/2], Step [500/600], Loss: 0.1919
Epoch [1/2], Step [600/600], Loss: 0.0822
Epoch [2/2], Step [100/600], Loss: 0.0980
Epoch [2/2], Step [200/600], Loss: 0.1034
Epoch [2/2], Step [300/600], Loss: 0.0927
Epoch [2/2], Step [400/600], Loss: 0.0869
Epoch [2/2], Step [500/600], Loss: 0.0139
Epoch [2/2], Step [600/600], Loss: 0.0299
Test Accuracy of the model on the 10000 test images: 97 %
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ cd $HOME
cs190f@cs190f-4953:~$ nvidia-smi
Thu Oct 12 13:30:59 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 23%  27C    P0     56W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

cs190f@cs190f-4953:~$ exit

Licensed Software

Stata

If you have been provisioned with Stata licensing, a container started by launch-scipy-ml.sh is capable of executing Stata. Stata will be installed in your home directory and can be executed using the command '~/stata-se' from within a container.
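A typical session, assuming your Stata licensing has already been provisioned, looks roughly like this:

$ launch-scipy-ml.sh   # from the login node: start a container
$ ~/stata-se           # then, at the shell prompt inside the container: start Stata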

Acknowledging Research IT Services

Papers, presentations, and other publications featuring research that benefited from the Research Cluster computing resource, services, or support expertise may include the following acknowledgement in the text:

This research was done using the UC San Diego Research Cluster computing resource, supported by Research IT Services and provided by Academic Technology Services / IT Services.


