This documentation provides guidance, instructions, and general information about the Research Cluster managed by Research IT Services. Researchers execute computing jobs on the cluster in containerized environments known as Docker “containers,” which are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes. The Research Cluster uses the Kubernetes container management/orchestration system to route users’ containers onto compute nodes, monitor performance, and apply resource limits/quotas as appropriate.
When using the Research Cluster, please be considerate and terminate idle containers before closing your command-line session or logging out of the datahub. When a user engages a container, that container becomes unusable by others even if it sits completely idle. While containers share system RAM and CPU resources under the standard Linux/Unix model, the cluster’s 80 GPU cards are assigned to users on an exclusive basis.
Getting Started
There are two ways to access the Research Cluster: via SSH or via the datahub.
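For example, a terminal login to the cluster login node (assuming the dsmlp-login.ucsd.edu hostname used in the file transfer examples later in this document, and your AD credentials) looks like:

# Sign in to the Research Cluster login node with your AD username
ssh <username>@dsmlp-login.ucsd.edu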
Launching a Container
After signing into the login node, you can start a pod/container using a standard Research Cluster launch script or a customized container launch script.
Once started, containers are accessible in either a Bash shell (command line) or a Jupyter/Python notebook environment. Users may access their Jupyter notebook by copying the link printed by the launch script and pasting it into the browser address bar. This link works as long as your container is active and ceases to work once you log out. The Docker container image and CPU/GPU/RAM settings are all configurable - see the “Customization” and "Launch Script Command-line Options" sections below for more details.
Containers terminate automatically when users exit the interactive shell.
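A minimal command-line session, assuming the launch-scipy-ml.sh script shown in later examples, looks like:

# Start a standard container; the script prints a Jupyter notebook link on startup
launch-scipy-ml.sh
# ... run your work in the interactive shell ...
# Exiting the shell terminates the container and releases its CPU/GPU/RAM
exit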
More details and guidance on launching a container are available on the “How To: Launching Containers From the Command Line - Data Science/Machine Learning Platform (DSMLP)” guidance page.
Web Interface Tool
The Research Cluster offers the JupyterHub notebook web interface as an alternative, graphical option for users who prefer it to the command-line interface. To access the web interface, sign in at https://datahub.ucsd.edu (or select the login button at the top of the page).
Modifying Containers
Certain modifications can be made to containers to allow users to adjust their environment to accommodate specific computing needs.
Container Run Time Limits
Container Termination Messages
Data Storage / Datasets
Two types of persistent file storage are available within containers:
A private home directory ($HOME) for each user
A shared directory - for group-shared data or for common datasets (e.g. CIFAR-10, Tiny ImageNet) distributed for individual access
Each user's private home directory is limited to a 100 GB storage allocation by default. Shared directory storage varies, as it may be mounted from external storage systems. In specific cases, Research IT may make allowances to temporarily increase storage in a user’s private home directory; these requests may be submitted by emailing rcd-support@ucsd.edu.
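To check how much of your allocation is in use, standard Linux utilities can be run from inside a container (a quick sketch; exact output depends on the container image and mounted filesystems):

# Report the total size of your home directory contents
du -sh $HOME
# Report free space on the filesystem backing your home directory
df -h $HOME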
File Transfer
Standard utilities such as 'git', 'scp', 'sftp', and 'curl' are included in the 'pytorch' container image and may be used to retrieve code or data from on- or off-campus servers.
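For example, from a container shell (the repository and file URLs below are placeholders, not resources provided by the cluster):

# Clone a code repository into your home directory
git clone https://github.com/<org>/<repo>.git
# Download a dataset archive from an external server
curl -O https://example.com/dataset.tar.gz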
Files may also be copied into the cluster from the outside using the following procedures.
Copying Data Into the Cluster: Using Globus
See the page on using Globus to transfer data to and from your computer or another Globus collection.
Copying Data Into the Cluster: SCP/SFTP from your computer
Data may be copied to/from the cluster using the "SCP" or "SFTP" file transfer protocol from a Mac or Linux terminal window, or on Windows using a freely downloadable utility. We recommend this option for most users.
Example using the Mac/Linux 'sftp' command line program:
slithy:Downloads agt$ sftp <username>@dsmlp-login.ucsd.edu
pod agt-4049 up and running; starting sftp
Connected to ieng6.ucsd.edu
sftp> put 2017-11-29-raspbian-stretch-lite.img
Uploading 2017-11-29-raspbian-stretch-lite.img to /datasets/home/08/108/agt/2017-11-29-raspbian-stretch-lite.img
2017-11-29-raspbian-stretch-lite.img          100% 1772MB  76.6MB/s   00:23
sftp> quit
sftp complete; deleting pod agt-4049
slithy:Downloads agt$
On Windows, we recommend the WinSCP utility.
After installing WinSCP, open the tool and you will be prompted to enter the following information:
Host name: dsmlp-login.ucsd.edu
User name: ad_username
Password: ad_password
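From a Mac or Linux terminal, an equivalent one-off copy can be done with 'scp' (the local file name and destination path here are placeholders):

# Copy a local file into your cluster home directory
scp mydata.csv <username>@dsmlp-login.ucsd.edu:~/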
Copying Data Into the Cluster: rsync
'rsync' also may be used from a Mac or Linux terminal window to synchronize data sets:
slithy:ME198 agt$ rsync -avr tub_1_17-11-18 <username>@dsmlp-login.ucsd.edu:
pod agt-9924 up and running; starting rsync
building file list ... done
rsync complete; deleting pod agt-9924
sent 557671 bytes  received 20 bytes  53113.43 bytes/sec
total size is 41144035  speedup is 73.78
slithy:ME198 agt$
Customization of the Container Environment
Each launch script specifies the default Docker image as well as the number of CPU cores, GPU cards, and GB of RAM assigned to its containers. When creating a customized container, we recommend using non-GPU (CPU-only) containers until your code is fully tested and a simple training run is successful. (The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU.) An example of such a launch configuration is as follows:
K8S_DOCKER_IMAGE="ucsdets/instructional:cse190fa17-latest"
K8S_ENTRYPOINT="/run_jupyter.sh"
K8S_NUM_GPU=1   # max of 1 (contact ETS to raise limit)
K8S_NUM_CPU=4   # max of 8 ("")
K8S_GB_MEM=32   # max of 64 ("")

# Controls whether an interactive Bash shell is started
SPAWN_INTERACTIVE_SHELL=YES

# Sets up proxy URL for Jupyter notebook inside
PROXY_ENABLED=YES
PROXY_PORT=8888
Users may copy an existing launch script into their home directory, then modify that private copy, for example:
$ cp -p `which launch-pytorch.sh` $HOME/my-launch-pytorch.sh
$ nano $HOME/my-launch-pytorch.sh
$ $HOME/my-launch-pytorch.sh
Adjusting Software Environment, CPU/RAM/GPU limits
The maximum limits (8 CPU, 64 GB RAM, 1 GPU) apply to the sum of all of your running containers: you may run eight 1-core containers, one 8-core container, or anything in between. Please contact rcd-support@ucsd.edu to request increases to these default limits or to adjust your software environment.
Alternate Docker Images
Besides CPU/RAM/GPU settings, you may specify an alternate Docker image: our servers will pull container images from dockerhub.io or elsewhere if requested. You may create or modify these Docker images as needed.
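For instance, the '-i' command-line option described in the next section can point a launch script at a different image (the image tag below is only illustrative):

# Launch a container using an alternate Docker image pulled from a public registry
launch-scipy-ml.sh -i ucsdets/scipy-ml-notebook:2019.4-stable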
Launch Script Command Line Options
Defaults set within launch scripts' environment variables may be overridden using the following command-line options:
Option | Description | Example |
-c N | Adjust # CPU cores | -c 8 |
-g N | Adjust # GPU cards | -g 2 |
-m N | Adjust # GB RAM | -m 64 |
-i IMG | Docker image name | -i nvidia/cuda:latest |
-e ENTRY | Docker image ENTRYPOINT/CMD | -e /run_jupyter.sh |
-n N | Request specific cluster node (1-10) | -n 7 |
-v | Request specific GPU (gtx1080ti,k5200,titan) | -v k5200 |
-b | Request background pod | (see below) |
Example:
[cs190f@ieng6-201]:~:56$ launch-py3torch-gpu.sh -m 64 -v k5200
Custom Python Packages (Anaconda/PIP)
Users may install personal Python packages within their containers using the standard Anaconda package management system; see Anaconda's Getting Started guide for a 30-minute introduction. Anaconda is recommended for installing scientific packages with complex dependencies; for less complex installations, the pip tool can be used. User-installed Python packages are placed in your home directory, so you get a consistent set of packages across containers.
Example of a user-specific (--user) package installation using 'pip':
agt@agt-10859:~$ pip install --user imutils
Collecting imutils
  Downloading imutils-0.4.5.tar.gz
Building wheels for collected packages: imutils
  Running setup.py bdist_wheel for imutils ... done
  Stored in directory: /tmp/xdg-cache/pip/wheels/ec/e4/a7/17684e97bbe215b7047bb9b80c9eb7d6ac461ef0a91b67ae71
Successfully built imutils
Installing collected packages: imutils
Successfully installed imutils-0.4.5
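A comparable Anaconda-based approach is to create a personal environment in your home directory (a sketch only; the environment and package names are examples, and some images may require 'conda init' before 'conda activate' works):

# Create a personal conda environment and install a package into it
conda create --name myenv numpy
# Activate the environment for the current shell session
conda activate myenv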
Background Execution / Long-Running Jobs
To minimize the impact of abandoned or runaway jobs, we permit background execution of containers for up to 12 hours of execution time via the "-b" command-line option (see example below). The default execution time can be extended upon request; send an email to rcd-support@ucsd.edu.
Use the ‘kubesh <pod-name>’ command to connect or reconnect to a background container, and ‘kubectl delete pod <pod-name>’ to terminate.
Please be considerate and terminate any unused background jobs: GPU cards are assigned to containers on an exclusive basis, and when attached to a container are unusable by others even if idle.
[amoxley@dsmlp-login]:~:504$ launch-scipy-ml.sh -b
Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 0 GPU units.
  (Adjust command line options, or edit "/software/common64/dsmlp/bin/launch-scipy-ml.sh" to change this configuration.)
pod/amoxley-5497 created
Mon Mar 9 14:04:10 PDT 2020 starting up - pod status: Pending ; containers with incomplete status: [init-support]
Mon Mar 9 14:04:15 PDT 2020 pod is running with IP: 10.43.128.17 on node: its-dsmlp-n25.ucsd.edu
ucsdets/scipy-ml-notebook:2019.4-stable is now active.
Connect to your background pod via: "kubesh amoxley-5497"
Please remember to shut down via: "kubectl delete pod amoxley-5497" ; "kubectl get pods" to list running pods.
You may retrieve output from your pod via: "kubectl logs amoxley-5497".
PODNAME=amoxley-5497
[amoxley@dsmlp-login]:~:505$ kubesh amoxley-5497
amoxley@amoxley-5497:~$ hostname
amoxley-5497
amoxley@amoxley-5497:~$ exit
exit
[amoxley@dsmlp-login]:~:506$ kubectl get pods
NAME           READY   STATUS    RESTARTS   AGE
amoxley-5497   1/1     Running   0          45s
[amoxley@dsmlp-login]:~:507$ kubectl delete pod amoxley-5497
pod "amoxley-5497" deleted
[amoxley@dsmlp-login]:~:508$
Common CUDA Run-Time Error Messages
(59) device-side assert
cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCTensorCopy.c:18
Indicates a run-time error in the CUDA code executing on the GPU, commonly due to out-of-bounds array access. Consider running in CPU-only mode (remove the .cuda() call) to obtain more specific debugging messages.
(2) out of memory
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503968623488/work/torch/lib/THC/generic/THCStorage.cu:66
GPU memory has been exhausted. Try reducing your batch size, or confine your job to the 11GB GTX 1080 Ti cards rather than the 6GB Titan or 8GB K5200 cards (see Launch Script Command-line Options).
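For example, the '-v' launch option documented above can be used to request the larger-memory cards (the script name follows the earlier example):

# Request an 11GB GTX 1080 Ti card for this container
launch-py3torch-gpu.sh -v gtx1080ti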
(30) unknown error
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/THCGeneral.c:70
This indicates a hardware error on the assigned GPU, and usually requires a reboot of the cluster node to correct. As a temporary workaround, you may explicitly direct your job to another node; see Launch Script Command-line Options.
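For example, the '-n' launch option documented above selects a specific node (the node number below is arbitrary):

# Steer the container onto cluster node 7 instead of the node reporting errors
launch-py3torch-gpu.sh -n 7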
Please report these errors to rcd-support@ucsd.edu.
Monitoring Cluster Status
The ‘cluster-status’ command provides insight into the number of jobs currently running and GPU/CPU/RAM allocated.
We plan to deploy more sophisticated monitoring tools over the coming months.
Installing TensorBoard
Our current configuration doesn’t permit easy access to TensorBoard via port 6006, but the following shell commands will install a TensorBoard interface accessible within the Jupyter environment:
pip install -U --user jupyter-tensorboard
jupyter nbextension enable jupyter_tensorboard/tree --user
You’ll need to exit your Pod/container and restart for the change to take effect.
Usage instructions for ‘jupyter_tensorboard’ are available at:
https://github.com/lspvic/jupyter_tensorboard#usage
Hardware Specifications
Cluster architecture diagram
Node | CPU Model | # Cores ea. | RAM ea. | # GPUs | GPU Model | Family | CUDA Cores | GPU RAM | GFLOPS |
Nodes 1-4 | 2x E5-2630 v4 | 20 | 384 GB | 8 | GTX 1080 Ti | Pascal | 3584 ea. | 11 GB | 10600 |
Nodes 5-8 | 2x E5-2630 v4 | 20 | 256 GB | 8 | GTX 1080 Ti | Pascal | 3584 ea. | 11 GB | 10600 |
Node 9 | 2x E5-2650 v2 | 16 | 128 GB | 8 | GTX Titan | Kepler | 2688 ea. | 6 GB | 4500 |
Node 10 | 2x E5-2670 v3 | 24 | 320 GB | 7 | GTX 1070 Ti | Pascal | 2432 ea. | 8 GB | 7800 |
Nodes 11-12 | 2x Xeon Gold 6130 | 32 | 384 GB | 8 | GTX 1080 Ti | Pascal | 3584 ea. | 11 GB | 10600 |
Nodes 13-15 | 2x E5-2650 v1 | 16 | 320 GB | n/a | n/a | n/a | n/a | n/a | n/a |
Nodes 16-18 | 2x AMD 6128 | 24 | 256 GB | n/a | n/a | n/a | n/a | n/a | n/a |
Nodes are connected via an Arista 7150 10Gb Ethernet switch.
Additional nodes can be added to the cluster at peak times.
Example: PyTorch Session with TensorFlow examples
slithy:~ agt$ ssh cs190f@ieng6.ucsd.edu
Password:
Last login: Thu Oct 12 12:29:30 2017 from slithy.ucsd.edu
============================ NOTICE =================================
Authorized use of this system is limited to password-authenticated
usernames which are issued to individuals and are for the sole use of
the person to whom they are issued.

Privacy notice: be aware that computer files, electronic mail and
accounts are not private in an absolute sense. For a statement of
"ETS (formerly ACMS) Acceptable Use Policies" please see our webpage
at http://acms.ucsd.edu/info/aup.html.
=====================================================================
Disk quotas for user cs190f (uid 59457):
     Filesystem   blocks    quota    limit   grace   files   quota   limit   grace
acsnfs4.ucsd.edu:/vol/home/linux/ieng6
                   11928  5204800  5204800            272    9000    9000
=============================================================
Check Account Lookup Tool at http://acms.ucsd.edu
=============================================================
[…]
Thu Oct 12, 2017 12:34pm - Prepping cs190f
[cs190f@ieng6-201]:~:56$ launch-pytorch-gpu.sh
Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 1 GPU units.
  (Edit /home/linux/ieng6/cs190f/public/bin/launch-pytorch.sh to change this configuration.)
pod "cs190f-4953" created
Thu Oct 12 12:34:41 PDT 2017 starting up - pod status: Pending ;
Thu Oct 12 12:34:47 PDT 2017 pod is running with IP: 10.128.7.99
tensorflow/tensorflow:latest-gpu is now active.
Please connect to: http://ieng6-201.ucsd.edu:4957/?token=669d678bdb00c89df6ab178285a0e8443e676298a02ad66e2438c9851cb544ce
Connected to cs190f-4953; type 'exit' to terminate processes and close Jupyter notebooks.
cs190f@cs190f-4953:~$ ls
TensorFlow-Examples
cs190f@cs190f-4953:~$ git clone https://github.com/yunjey/pytorch-tutorial.git
Cloning into 'pytorch-tutorial'...
remote: Counting objects: 658, done.
remote: Total 658 (delta 0), reused 0 (delta 0), pack-reused 658
Receiving objects: 100% (658/658), 12.74 MiB | 24.70 MiB/s, done.
Resolving deltas: 100% (350/350), done.
Checking connectivity... done.
cs190f@cs190f-4953:~$ cd pytorch-tutorial/
cs190f@cs190f-4953:~/pytorch-tutorial$ cd tutorials/02-intermediate/bidirectional_recurrent_neural_network/
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ python main-gpu.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Epoch [1/2], Step [100/600], Loss: 0.7028
Epoch [1/2], Step [200/600], Loss: 0.2479
Epoch [1/2], Step [300/600], Loss: 0.2467
Epoch [1/2], Step [400/600], Loss: 0.2652
Epoch [1/2], Step [500/600], Loss: 0.1919
Epoch [1/2], Step [600/600], Loss: 0.0822
Epoch [2/2], Step [100/600], Loss: 0.0980
Epoch [2/2], Step [200/600], Loss: 0.1034
Epoch [2/2], Step [300/600], Loss: 0.0927
Epoch [2/2], Step [400/600], Loss: 0.0869
Epoch [2/2], Step [500/600], Loss: 0.0139
Epoch [2/2], Step [600/600], Loss: 0.0299
Test Accuracy of the model on the 10000 test images: 97 %
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ cd $HOME
cs190f@cs190f-4953:~$ nvidia-smi
Thu Oct 12 13:30:59 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
|  23%   27C    P0    56W / 250W |     0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
cs190f@cs190f-4953:~$ exit
Licensed Software
Stata
If you have been provisioned with Stata licensing, a container started by launch-scipy-ml.sh is capable of executing Stata. Stata will be installed in your home directory and can be executed using the command '~/stata-se' from within a container.
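A minimal session might look like this (assuming Stata licensing has already been provisioned for your account):

# Start a container, then launch Stata from your home directory
launch-scipy-ml.sh
~/stata-se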
Acknowledging Research IT Services
Papers, presentations, and other publications featuring research that benefited from the Research Cluster computing resource, services, or support expertise may include the following acknowledgement in the text:
This research was done using the UC San Diego Research Cluster computing resource, supported by Research IT Services and provided by Academic Technology Services / IT Services.