
This documentation provides guidance, instructions, and general information about the Research IT Services managed Research Cluster. Researchers execute computing jobs on the cluster in containerized environments known as Docker “containers”, which are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes. The Research Cluster uses the Kubernetes container management/orchestration system to route users’ containers onto compute nodes, monitor performance, and apply resource limits/quotas as appropriate.


Monitoring Resource Usage in a Jupyter Notebook

Users can view their container’s CPU, GPU, and memory (RAM) utilization by selecting the ‘Show Usage’ button in the notebook header menu. The usage displays in the top right of the notebook.

Monitoring Resource Usage in the Command Line Terminal

Users can view container CPU and memory (RAM) utilization from the Bash command line interface using the ‘htop’ command. To see GPU usage in a container with a GPU attached, run the `/usr/local/nvidia/bin/nvidia-smi` command.
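For instance, a minimal sketch from a terminal session inside a GPU-enabled container (the shell prompt is illustrative):

Code Block
agt@agt-10859:~$ htop                               # interactive CPU and RAM monitor; press 'q' to quit
agt@agt-10859:~$ /usr/local/nvidia/bin/nvidia-smi   # GPU utilization, memory use, and active processes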

Machine Learning

Complex ML workflows are supported through terminal/SSH logins and a full Linux/Ubuntu CUDA development suite. Users may install additional library packages (e.g. conda/pip, CRAN) as needed, or can opt to replace the default environment entirely by launching their own custom Docker containers. 

High-speed cluster-local storage houses workspaces and common training corpora (e.g. CIFAR, ImageNet).

Modifying Containers

Certain modifications can be made to containers to allow users to adjust their environment to accommodate specific computing needs.


Chicago Booth Kilts Center for Marketing: Nielsen Datasets

The Nielsen subscription datasets are available to authorized users at /uss/dsmlp-a/nielsen-dataset/. All of the datasets have been decompressed into this read-only directory, making it easy to read directly from the Nielsen directories with software such as Stata or your own code. To conserve server space, please do not duplicate these large datasets into your home directory, and delete unneeded data from your home directory once you've completed your analyses and saved your output files.
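As a quick sketch, the data can be listed and read in place from a container terminal, avoiding any copies into your home directory (the file name placeholder is illustrative):

Code Block
agt@agt-10859:~$ ls /uss/dsmlp-a/nielsen-dataset/           # browse the available datasets
agt@agt-10859:~$ head /uss/dsmlp-a/nielsen-dataset/<file>   # preview a file directly from the read-only directory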

File Transfer

Standard utilities such as 'git', 'scp', 'sftp', and 'curl' are included in the 'pytorch' container image and may be used from the Bash shell (command line interface) to retrieve code or data from on- or off-campus servers; a brief sketch follows. Files may also be copied into the cluster from external sources using Globus, SCP/SFTP, or rsync, as described in the sections below.
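For example, a minimal sketch of retrieving code and data from inside a container (both URLs are hypothetical placeholders):

Code Block
agt@agt-10859:~$ git clone https://github.com/<your-org>/<your-repo>.git   # fetch a code repository
agt@agt-10859:~$ curl -L -O https://example.com/dataset.tar.gz             # download a data file, following redirects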

Copying Data Into the Cluster: Using Globus

See the page on using Globus to transfer data to and from your computer or another Globus collection.

Copying Data Into the Cluster: SCP/SFTP from Your Computer

Data may be copied to/from the cluster using the "SCP" or "SFTP" file transfer protocol from a Mac or Linux terminal window, or on Windows using a freely downloadable utility.  We recommend this option for most users.

Example using the Mac/Linux 'sftp' command line program:

Code Block
slithy:Downloads agt$ sftp <username>@dsmlp-login.ucsd.edu
pod agt-4049 up and running; starting sftp
Connected to ieng6.ucsd.edu 
sftp> put 2017-11-29-raspbian-stretch-lite.img
Uploading 2017-11-29-raspbian-stretch-lite.img to /datasets/home/08/108/agt/2017-11-29-raspbian-stretch-lite.img
2017-11-29-raspbian-stretch-lite.img             100% 1772MB  76.6MB/s   00:23    
sftp> quit
sftp complete; deleting pod agt-4049
slithy:Downloads agt$

On Windows, we recommend the WinSCP utility.

  • After installing WinSCP, the tool will open and you will be prompted to enter your connection information (the cluster hostname and your login credentials).

Copying Data Into the Cluster: rsync

On MacOS or Linux, 'rsync' can be used from a terminal window to synchronize data sets.

Example using the Mac/Linux ‘rsync’ command line program:

Code Block
slithy:ME198 agt$ rsync -avr tub_1_17-11-18 <username>@dsmlp-login.ucsd.edu:
pod agt-9924 up and running; starting rsync
building file list ... done
rsync complete; deleting pod agt-9924
sent 557671 bytes  received 20 bytes  53113.43 bytes/sec
total size is 41144035  speedup is 73.78
slithy:ME198 agt$


Customizing a Container Environment

Each launch script specifies the default Docker image to use, along with the number of CPU cores, GPU cards, and GB of RAM assigned to a container. When creating a customized container, we recommend using a non-GPU (CPU-only) container until your code is fully tested and a simple test training run has completed successfully. The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU, so a successful run in a CPU-only container should also succeed in a container with a GPU.
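As a hedged sketch, a CPU-only test can be arranged by requesting zero GPUs at launch (assuming the '-g' option described later in this guide accepts 0; the script name matches the examples in this guide):

Code Block
# request 0 GPU cards so the test run exercises CPU-only code paths
[cs190f @ieng6-201]:~:54$ launch-py3torch-gpu.sh -g 0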

An example launch configuration is as follows:

Code Block
K8S_DOCKER_IMAGE="ucsdets/instructional:cse190fa17-latest"
K8S_ENTRYPOINT="/run_jupyter.sh"

K8S_NUM_GPU=1  # max of 1 (contact ETS to raise limit)
K8S_NUM_CPU=4  # max of 8 ("")
K8S_GB_MEM=32  # max of 64 ("")

# Controls whether an interactive Bash shell is started
SPAWN_INTERACTIVE_SHELL=YES

# Sets up proxy URL for Jupyter notebook inside
PROXY_ENABLED=YES
PROXY_PORT=8888
Users may copy an existing launch script into their home directory, then modify and run that private copy:
Code Block
$ cp -p `which launch-pytorch.sh` $HOME/my-launch-pytorch.sh
$ nano $HOME/my-launch-pytorch.sh    
$ $HOME/my-launch-pytorch.sh

Adjusting Container Environment and CPU/RAM/GPU Limits

All running containers in the cluster have a maximum configuration limit of 8 CPU cores, 64GB RAM, and 1 GPU. These limits apply across all of your running containers: you may run eight 1-core containers, one 8-core container, or any configuration within these bounds. Requests to increase these default limits, or for other adjustments (including software) to your container environment, may be submitted to rcd-support@ucsd.edu.

Alternate Docker Images

In addition to CPU/RAM/GPU settings, users can specify an alternate or custom Docker image. The cluster servers will pull container images from dockerhub.io or elsewhere if requested, and users can create or modify these Docker images as needed.
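For instance, a hedged sketch of selecting an alternate image at launch using the '-i' option from the table below (the image name is only an example):

Code Block
[cs190f @ieng6-201]:~:55$ launch-scipy-ml.sh -i nvidia/cuda:latest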

Adjusting Launch Script Environments via Command Line Options

Defaults set within a launch script's environment variables may be overridden using the following command line options.

Command line options to adjust launch script variables:

Option    Description                                          Example
-c N      Adjust # CPU cores                                   -c 8
-g N      Adjust # GPU cards                                   -g 2
-m N      Adjust # GB RAM                                      -m 64
-i IMG    Docker image name                                    -i nvidia/cuda:latest
-e ENTRY  Docker image ENTRYPOINT/CMD                          -e /run_jupyter.sh
-n N      Request specific cluster node (1-10)                 -n 7
-v GPU    Request specific GPU (gtx1080ti, k5200, titan)       -v k5200
-b        Request background pod                               (see below)


An example launch script adjustment to the RAM (-m) and the GPU (-v):

Code Block
[cs190f @ieng6-201]:~:56$  launch-py3torch-gpu.sh -m 64 -v k5200

Custom Python Packages (Anaconda/PIP)

Users may install additional Python packages within their containers using the PIP tool or the standard Anaconda package management system. Python packages should be installed only after launching a container. Packages are installed into the user’s home directory, so they will be available in all containers the user launches thereafter.

  • For less complex installations, the PIP tool can be used to install Python packages, as shown in the example below.


Example of a user-specific (--user) package installation using 'pip':
Code Block
agt@agt-10859:~$ pip install --user imutils
Collecting imutils
  Downloading imutils-0.4.5.tar.gz
Building wheels for collected packages: imutils
  Running setup.py bdist_wheel for imutils ... done
  Stored in directory: /tmp/xdg-cache/pip/wheels/ec/e4/a7/17684e97bbe215b7047bb9b80c9eb7d6ac461ef0a91b67ae71
Successfully built imutils
Installing collected packages: imutils
Successfully installed imutils-0.4.5
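For Anaconda-based installs, one minimal approach is to create a conda environment under your home directory so that it persists across container launches (the environment path and packages are illustrative):

Code Block
agt@agt-10859:~$ conda create -y --prefix $HOME/envs/myenv python=3.7   # environment stored in $HOME survives container restarts
agt@agt-10859:~$ source activate $HOME/envs/myenv                       # newer conda versions use 'conda activate' instead
agt@agt-10859:~$ conda install -y numpy pandas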


Installing TensorBoard

Our current configuration doesn’t permit easy access to Tensorboard via port 6006, but the following shell commands will install a TensorBoard interface accessible within the Jupyter environment:

Code Block
pip install -U --user jupyter-tensorboard
jupyter nbextension enable jupyter_tensorboard/tree --user

You’ll need to exit your Pod/container and restart it for the change to take effect.

Usage instructions for ‘jupyter_tensorboard’ are available at: https://github.com/lspvic/jupyter_tensorboard#usage

Running Jobs in a Background Container and Long-Running Jobs

To minimize the impact of abandoned or runaway jobs, the cluster allows jobs to run in a background container for up to 12 hours of execution time. Specify that a job should run in a background container using the "-b" command line option (see example below). To support longer run times, the default execution time limit can be extended upon request to rcd-support@ucsd.edu.

Note to users: please be considerate and terminate any unused background jobs. GPU cards are limited and are assigned to containers on an exclusive basis; while attached to a container, a GPU is unusable by others, even if it sits idle.

Reconnecting to a background container:

If you are disconnected from your background container, use the ‘kubesh <pod-name>’ command to connect or reconnect to it.

Terminating a background container:

To terminate a background container, use the ‘kubectl delete pod <pod-name>’ command.


An example of a background container session:
Code Block
[amoxley@dsmlp-login]:~:504$ launch-scipy-ml.sh -b
Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 0 GPU units.
   (Adjust command line options, or edit "/software/common64/dsmlp/bin/launch-scipy-ml.sh" to change this configuration.)
pod/amoxley-5497 created
Mon Mar 9 14:04:10 PDT 2020 starting up - pod status: Pending ; containers with incomplete status: [init-support]
Mon Mar 9 14:04:15 PDT 2020 pod is running with IP: 10.43.128.17 on node: its-dsmlp-n25.ucsd.edu
ucsdets/scipy-ml-notebook:2019.4-stable is now active.

Connect to your background pod via: "kubesh amoxley-5497"
Please remember to shut down via: "kubectl delete pod amoxley-5497" ; "kubectl get pods" to list running pods.
You may retrieve output from your pod via: "kubectl logs amoxley-5497".
PODNAME=amoxley-5497
[amoxley@dsmlp-login]:~:505$ kubesh amoxley-5497

amoxley@amoxley-5497:~$ hostname
amoxley-5497
amoxley@amoxley-5497:~$ exit
exit

[amoxley@dsmlp-login]:~:506$ kubectl get pods
NAME           READY   STATUS    RESTARTS   AGE
amoxley-5497   1/1     Running   0          45s

[amoxley@dsmlp-login]:~:507$ kubectl delete pod amoxley-5497
pod "amoxley-5497" deleted
[amoxley@dsmlp-login]:~:508$


Run-Time Error Messages

There may be instances where you receive a CUDA run-time error while running a job in a container. Below are a few of the more commonly encountered errors. These errors can typically be resolved by user adjustments; however, if you encounter a run-time error that requires more assistance to resolve, please contact rcd-support@ucsd.edu.

(59) device-side assert

Indicates a run-time error in the CUDA code executing on the GPU, commonly due to an out-of-bounds array access. Consider running in CPU-only mode (remove the .cuda() call) to obtain more specific debugging messages.


Code Block
cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCTensorCopy.c:18
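As a hedged debugging aid, setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the assert is reported at the offending call rather than at a later, unrelated line (the script name is hypothetical):

Code Block
agt@agt-10859:~$ CUDA_LAUNCH_BLOCKING=1 python train.py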
(2) out of memory

GPU memory has been exhausted. Try reducing your dataset size, or confine your job to the 11GB GTX 1080Ti cards rather than the 6GB Titan or 8GB K5200 (see “Adjusting Launch Script Environments via Command Line Options” in this user guide).


Code Block
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503968623488/work/torch/lib/THC/generic/THCStorage.cu:66
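A hedged example of confining a job to the 11GB GTX 1080Ti cards using the '-v' option described earlier:

Code Block
[cs190f @ieng6-201]:~:57$ launch-py3torch-gpu.sh -v gtx1080ti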
(30) unknown error

This indicates a hardware error on the assigned GPU, and usually requires a reboot of the cluster node to correct. As a temporary workaround, you may explicitly direct your job to another node (see “Adjusting Launch Script Environments via Command Line Options” in this user guide).

Code Block
RuntimeError: cuda runtime error (30) : unknown error at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/THCGeneral.c:70
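For instance, a hedged sketch of steering the job to a different cluster node with the '-n' option (the node number is illustrative):

Code Block
[cs190f @ieng6-201]:~:58$ launch-py3torch-gpu.sh -n 3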

Please report this type of error directly to rcd-support@ucsd.edu for assistance.

Monitoring Cluster Status


Users can enter the ‘cluster-status’ command for insight into the number of jobs currently running and the GPU/CPU/RAM allocated. Alternatively, users can refer to the cluster ‘Node Status’ page for updates on containers (or images).
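For example, from the login node:

Code Block
[amoxley@dsmlp-login]:~:509$ cluster-status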


Cluster Hardware Specifications

Cluster architecture diagram

Node         CPU Model          #Cores ea.  RAM ea.  #GPU  GPU Model         Family  CUDA Cores  GPU RAM  GFLOPS
Nodes 1-4    2x E5-2630 v4      20          384GB    8     GTX 1080Ti        Pascal  3584 ea.    11GB     10600
Nodes 5-8    2x E5-2630 v4      20          256GB    8     GTX 1080Ti        Pascal  3584 ea.    11GB     10600
Node 9       2x E5-2650 v2      16          128GB    8     GTX Titan (2014)  Kepler  2688 ea.    6GB      4500
Node 10      2x E5-2670 v3      24          320GB    7     GTX 1070Ti        Pascal  2432 ea.    8GB      7800
Nodes 11-12  2x Xeon Gold 6130  32          384GB    8     GTX 1080Ti        Pascal  3584 ea.    11GB     10600
Nodes 13-15  2x E5-2650 v1      16          320GB    n/a   n/a               n/a     n/a         n/a      n/a
Nodes 16-18  2x AMD 6128        24          256GB    n/a   n/a               n/a     n/a         n/a      n/a

Nodes are connected via an Arista 7150 10Gb Ethernet switch.  
