...
...
This documentation provides guidance, instructions, and general information about the Research IT Services managed Research Cluster. Researchers execute computing jobs on the cluster in containerized environments known as Docker “containers”, which are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes. The Research Cluster uses the Kubernetes container management/orchestration system to route users’ containers onto compute nodes, monitor performance, and apply resource limits/quotas as appropriate.
...
Users can view the container CPU, GPU, and memory (RAM) utilization by selecting the ‘Show Usage’ button in the notebook header menu. The usage will display in the top right of the notebook as follows:
Users can view the container CPU and memory (RAM) utilization in the Bash command line interface by using the ‘htop’ command. To see GPU usage, enter the `/usr/local/nvidia/bin/nvidia-smi` command for a container that uses GPU.
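For example, from a terminal session inside a running container:
Code Block
# Show live per-process CPU and memory (RAM) utilization for the container
htop
# Show GPU utilization and GPU memory usage (GPU-enabled containers only)
/usr/local/nvidia/bin/nvidia-smi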
Machine Learning
Complex ML workflows are supported through terminal/SSH logins and a full Linux/Ubuntu CUDA development suite. Users may install additional library packages (e.g. conda/pip, CRAN) as needed, or can opt to replace the default environment entirely by launching their own custom Docker containers.
High-speed cluster-local storage houses workspaces and common training corpora (e.g. CIFAR, ImageNet).
Modifying Containers
Certain modifications can be made to containers, allowing users to adjust their environment to accommodate specific computing needs.
...
The Nielsen subscription datasets are available to authorized users at /uss/dsmlp-a/nielsen-dataset/. All of the datasets have been decompressed into this read-only directory, making it easy to use software (e.g. Stata or your own code) to read directly from the Nielsen directories. In the interest of being mindful of server space, please do not duplicate these large datasets in your home directory, and delete unneeded data from your home directory once you've completed your analyses and have your output files.
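For example, the data can be listed and read in place from the shell (the subdirectory and file names below are placeholders):
Code Block
# List the read-only Nielsen dataset directory
ls /uss/dsmlp-a/nielsen-dataset/
# Preview a file in place rather than copying it to your home directory
# (subdirectory and file names are placeholders)
head /uss/dsmlp-a/nielsen-dataset/some-subdirectory/some-file.tsv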
File Transfer
Standard utilities such as 'git', 'scp', 'sftp', and 'curl' are included in the 'pytorch' container image and may be used from the bash shell (command line interface) to retrieve code or data from both on- and off-campus servers. Files may also be copied into the cluster from external sources using Globus, SCP/SFTP, or RSYNC, as described in the following procedures.
See the page on using Globus to transfer data to and from your computer or another Globus collection.
Data may be copied to/from the cluster using the "SCP" or "SFTP" file transfer protocol from a Mac or Linux terminal window, or on Windows using a freely downloadable utility. We recommend this option for most users. Example using the Mac/Linux 'sftp' |
...
command line program:
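A minimal sketch is shown below; the login-node hostname and file names are placeholders to be replaced with the values for your account:
Code Block
# Open an SFTP session to the cluster login node (hostname is a placeholder)
sftp username@cluster-login-node.ucsd.edu
# At the sftp> prompt: upload a local file into your cluster home directory
put mydata.csv
# Download a result file back to your local computer
get results.txt
# End the session
quit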
On Windows, we recommend the WinSCP
...
utility.
On MacOS or Linux, 'rsync'
...
can be used from a
...
terminal window to synchronize data sets. Example using the Mac/Linux ‘rsync’ command line program:
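A minimal sketch, again with placeholder hostname and paths:
Code Block
# Recursively copy a local directory into your cluster home directory;
# -a preserves permissions/timestamps, -v reports progress (hostname is a placeholder)
rsync -av ./my-dataset/ username@cluster-login-node.ucsd.edu:~/my-dataset/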
...
Customizing a Container Environment
Each launch script specifies the default Docker image to use, along with the number of CPU cores, GPU cards, and GB of RAM assigned to its containers. When creating a customized container, it is recommended to use non-GPU (CPU-only) containers until your code is fully tested and a simple test training run has completed successfully. The PyTorch, TensorFlow, and Caffe toolkits can easily switch between CPU and GPU; as such, a successful run in a CPU-only container should also be successful in a container with a GPU.
...
Adjusting Container Environment and CPU/RAM/GPU Limits
All running containers in the cluster are subject to a combined maximum of 8 CPU cores, 64GB RAM, and 1 GPU. You may run eight 1-core containers, one 8-core container, or any configuration within these bounds. Requests to increase these default limits, as well as requests for other adjustments (including software) to your container environment, may be submitted to rcd-support@ucsd.edu.
Alternate Docker Images
Besides CPU/RAM/GPU settings, users may specify an alternate or custom Docker image. The cluster servers will pull container images from dockerhub.io or elsewhere if requested, and users can create or modify these Docker images as needed.
Adjusting Launch Script Environments Command Line Options
Defaults set within a launch script's environment variables may be overridden using the following command-line options.
...
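As an illustrative sketch only (the launch script name and option letters shown here are hypothetical; consult the options table above for the exact flags your launch script accepts):
Code Block
# Hypothetical example: request 4 CPU cores, 16GB RAM, and 1 GPU,
# and substitute an alternate Docker image (all option letters are illustrative)
launch-pytorch.sh -c 4 -m 16 -g 1 -i dockeruser/custom-image:tag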
Custom Python Packages (Anaconda/PIP)
...
Users may install
...
additional Python packages within their containers using the PIP tool or standard Anaconda package management system
...
Users should install Python packages only after launching a container. Python packages are installed in a user’s home directory, so they will be available in all containers launched thereafter by that user.
For less complex installations, the PIP tool can be used to install Python packages.
...
Please see PIP documentation ‘User Installs’ for detailed guidance.
Anaconda is recommended for installing scientific packages with complex dependencies. Please see Anaconda's Getting Started for a guided introduction.
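For example (the package names below are only illustrative):
Code Block
# Install a package into your home directory with PIP (persists across containers)
pip install --user scikit-image
# Or install packages with Anaconda, which resolves complex binary dependencies
conda install --yes numpy pandas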
Background Execution / Long-Running Jobs
...
Running Jobs in a Background Container and Long-Running Jobs
To minimize the impact of abandoned or runaway jobs, the cluster allows jobs to run in a background container for up to 12 hours of execution time. Users specify that a job should run in a background container by using the "-b" command line option (see example below). To support longer run times, the default execution time limit can be extended upon request to rcd-support@ucsd.edu.
Note to users: Please be considerate and terminate any unused background jobs. GPU cards are limited in number and are assigned to containers on an exclusive basis; while a GPU is attached to your container, it is unusable by others even if it is idle.
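A minimal sketch of launching a background job (the launch script name is a placeholder for whichever launch script you normally run):
Code Block
# Launch a container in background mode using the -b option
launch-pytorch.sh -b
# Later, attach to the running background container by pod name
kubesh <pod-name>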
If your connection to a background container is lost, use the ‘kubesh <pod-name>’ command to reconnect to it.
If you need to terminate a background container,
...
use the ‘kubectl delete pod <pod-name>’ command.
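For example, assuming the standard kubectl client is available in your terminal session:
Code Block
# List your running pods to find the pod name of the background container
kubectl get pods
# Terminate a background container you no longer need
kubectl delete pod <pod-name>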
...
...
Run-Time Error Messages
(59) device-side assert
cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCTensorCopy.c:18
There may be instances where you receive a CUDA run-time error while running a job in a container. Below are a few of the more commonly encountered errors. These errors can typically be resolved by user adjustments; however, if you encounter a run-time error that requires more assistance to resolve, please contact rcd-support@ucsd.edu.
Indicates a run-time error in the CUDA code executing on the GPU
...
and is commonly due to out-of-bounds array access. Consider running in CPU-only mode (remove the .cuda() call) to obtain more specific debugging messages.
(2) out of memory
...
...
...
...
GPU memory has been exhausted. Try reducing your
...
dataset size, or confine your job to 11GB GTX 1080Ti cards rather than 6GB Titan or 8GB K5200 (see “Adjusting Launch Script
...
Environments Command Line Options” in this user guide).
...
...
...
...
...
...
...
This indicates a hardware error on the assigned GPU, and usually requires a reboot of the cluster node to correct. As a temporary workaround, you may explicitly direct your job to another node
...
(see 'Adjusting Launch Script Environments Command
...
Line Options” in this user guide).
Please report this type of error directly to rcd-support@ucsd.edu
...
for assistance.
Monitoring Cluster Status
The ‘cluster-status’ command provides insight into the number of jobs currently running and the GPU/CPU/RAM allocated. Users can also refer to the cluster ‘Node Status’ page for updates on containers (or images).
We plan to deploy more sophisticated monitoring tools over the coming months.
...
Installing TensorBoard
Our current configuration doesn’t permit easy access to TensorBoard via port 6006, but the following shell commands will install a TensorBoard interface accessible within the Jupyter environment:
Code Block
pip install -U --user jupyter-tensorboard
jupyter nbextension enable jupyter_tensorboard/tree --user
You’ll need to exit your Pod/container and restart for the change to take effect.
Usage instructions for ‘jupyter_tensorboard’ are available at:
https://github.com/lspvic/jupyter_tensorboard#usage
Cluster Hardware Specifications
Cluster architecture diagram
Node | CPU Model | #Cores ea. | RAM ea. | #GPU | GPU Model | Family | CUDA Cores | GPU RAM | GFLOPS ea. |
Nodes 1-4 | 2xE5-2630 v4 | 20 | 384GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
Nodes 5-8 | 2xE5-2630 v4 | 20 | 256GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
Node 9 | 2xE5-2650 v2 | 16 | 128GB | 8 | GTX Titan | Kepler | 2688 ea. | 6GB | 4500 |
Node 10 | 2xE5-2670 v3 | 24 | 320GB | 7 | GTX 1070Ti | Pascal | 2432 ea. | 8GB | 7800 |
Nodes 11-12 | 2xXeon Gold 6130 | 32 | 384GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
Nodes 13-15 | 2xE5-2650 v1 | 16 | 320GB | n/a | n/a | n/a | n/a | n/a | n/a |
Nodes 16-18 | 2xAMD 6128 | 24 | 256GB | n/a | n/a | n/a | n/a | n/a | n/a |
Nodes are connected via an Arista 7150 10Gb Ethernet switch.
...