...

This documentation includes guidance, instructions, and general information about the Research IT Services-managed Research Cluster. Researchers execute computing jobs on the cluster in containerized environments known as Docker “containers”, which are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes. The Research Cluster uses the Kubernetes container management/orchestration system to route users’ containers onto compute nodes, monitor performance, and apply resource limits/quotas as appropriate.

...

Expand
titleExplore More: Programming Training/Educational Resources

  • for upcoming workshop opportunities

Monitoring Container Resource Usage

Cluster users can monitor the resource usage of a launched container both from the command-line terminal and in the web interface tool. Monitoring resource usage lets users stay aware of their job's resource limits and identify possible bottlenecks during particular stages of job execution.
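From a terminal session inside a running container, the same information can be checked with standard Linux utilities; the commands below are a minimal sketch using tools commonly present in the cluster images.

Expand
titleExample: Monitoring resource usage from the command line
Code Block
# CPU load and per-process usage (press 'q' to quit)
top
# Memory (RAM) usage in human-readable units
free -h
# GPU utilization and GPU memory, if a GPU is attached to the container
nvidia-smi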

Expand
titleMonitoring Resource Usage in a Jupyter Notebook

Users can view the container's CPU, GPU, and memory (RAM) utilization by selecting the ‘Show Usage’ buttons in the header menu. Usage is displayed in the top right of the notebook as follows:

...

Two types of persistent file storage are available within containers: private home directory storage and shared directory storage.

  • A private home directory ($HOME) is automatically generated for each cluster user. Each user's private home directory is limited to a 100GB storage allocation by default (a usage-check example follows this list).

  • A shared directory, for group-shared data or for distributing common datasets (e.g., CIFAR-10, Tiny ImageNet) for individual access.

...

  • Shared directory storage allocations can vary, as this storage may be mounted from external storage systems.
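To check current usage against these allocations from within a container, standard disk utilities suffice; a minimal sketch (the cluster's exact quota-reporting mechanism may differ):

Code Block
# Total size of the contents of your home directory
du -sh $HOME
# Space available on the filesystem backing your home directory
df -h $HOME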

In specific cases, Research IT may make allowances to temporarily increase storage in a user’s private home directory. These requests may be submitted by emailing rcd-support@ucsd.edu.

...

Adjusting Container Environment and CPU/RAM/GPU Limits

All running containers in the cluster have a default maximum configuration of 8 CPU cores, 64GB RAM, and 1 GPU. You may run eight 1-core containers, one 8-core container, or any configuration within these bounds. Requests to increase these default limits, as well as requests for other adjustments (including software) to your container environment, may be submitted to rcd-support@ucsd.edu.
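Resource configurations below these limits are typically chosen at launch time. As an illustrative sketch only (the -c/-m/-g flag names here are assumptions; consult your launch script's built-in help for the exact options on this cluster):

Code Block
# Hypothetical invocation requesting 8 CPU cores, 64GB RAM, and 1 GPU
launch-scipy-ml.sh -c 8 -m 64 -g 1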

...

In addition to configuration settings, users can import alternate or custom Docker images. The cluster servers will pull container images from dockerhub.io or elsewhere if requested. You can create or modify these Docker images as needed.
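For example, a custom image can be built and published with the standard Docker CLI so the cluster can pull it; in this sketch, myuser/my-research-image is a placeholder image name:

Code Block
# Build an image from a Dockerfile in the current directory
docker build -t myuser/my-research-image:v1 .
# Publish it to a registry the cluster can reach (e.g., Docker Hub)
docker push myuser/my-research-image:v1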

Adjusting Launch Script Command Line Options

...

Expand
titleExample of a user specific (--user) package installation using 'pip':
Code Block
agt@agt-10859:~$ pip install --user imutils
Collecting imutils
  Downloading imutils-0.4.5.tar.gz
Building wheels for collected packages: imutils
  Running setup.py bdist_wheel for imutils ... done
  Stored in directory: /tmp/xdg-cache/pip/wheels/ec/e4/a7/17684e97bbe215b7047bb9b80c9eb7d6ac461ef0a91b67ae71
Successfully built imutils
Installing collected packages: imutils
Successfully installed imutils-0.4.5
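A quick way to confirm a --user installation succeeded is to import the package from the same session; --user packages land under $HOME/.local, which persists because it resides in your home directory.

Code Block
# Verify the package imports and show where it was installed
python -c "import imutils; print(imutils.__file__)"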

Installing TensorBoard

The current cluster configuration does not permit easy access to TensorBoard via port 6006; however, the following shell commands can install a TensorBoard interface accessible within the Jupyter environment.

Expand
titlePIP command to install TensorBoard:
Code Block
pip install -U --user jupyter-tensorboard
jupyter nbextension enable jupyter_tensorboard/tree --user

Note: You’ll need to exit your Pod/container and restart for the change to take effect.

Running Jobs in a Background Container and Long-Running Jobs

...
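While the cluster's background-container mode is the supported mechanism for long jobs, a process can also be detached from an interactive shell so it survives the terminal closing; a generic sketch, where train.py is a placeholder for your own program:

Code Block
# Run a long job detached from the terminal, logging output to a file
nohup python train.py > train.log 2>&1 &
# Follow the log to check progress
tail -f train.log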

Expand
titleAn example of a 'cluster-status' command output:

Cluster Hardware Specifications

The Research Cluster shares hardware infrastructure with the Data Science and Machine Learning Platform (DSMLP). As such, the hardware specifications for the Research Cluster are described in the cluster architecture diagram

...

...

(as displayed in reference to the DSMLP).

Expand
titleAdditional Node specifications:

Nodes are connected via an Arista 7150 10Gb Ethernet switch. Additional nodes can be added to the cluster at peak times.

| Node | CPU Model | #Cores ea. | RAM ea. | #GPUs | GPU Model | Family | CUDA Cores | GPU RAM | GFLOPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nodes 1-4 | 2x E5-2630 v4 | 20 | 384GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
| Nodes 5-8 | 2x E5-2630 v4 | 20 | 256GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
| Node 9 | 2x E5-2650 v2 | 16 | 128GB | 8 | GTX Titan (2014) | Kepler | 2688 ea. | 6GB | 4500 |
| Node 10 | 2x E5-2670 v3 | 24 | 320GB | 7 | GTX 1070Ti | Pascal | 2432 ea. | 8GB | 7800 |
| Nodes 11-12 | 2x Xeon Gold 6130 | 32 | 384GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
| Nodes 13-15 | 2x E5-2650 v1 | 16 | 320GB | n/a | n/a | n/a | n/a | n/a | n/a |
| Nodes 16-18 | 2x AMD 6128 | 24 | 256GB | n/a | n/a | n/a | n/a | n/a | n/a |


...

Expand
titleExample: PyTorch Session with TensorFlow examples

...

Code Block
slithy:~ agt$
slithy:~ agt$ ssh cs190f@ieng6.ucsd.edu
Password:
Last login: Thu Oct 12 12:29:30 2017 from slithy.ucsd.edu
============================ NOTICE =================================
Authorized use of this system is limited to password-authenticated
usernames which are issued to individuals and are for the sole use of
the person to whom they are issued.
 
Privacy notice: be aware that computer files, electronic mail and
accounts are not private in an absolute sense.  For a statement of
"ETS (formerly ACMS) Acceptable Use Policies" please see our webpage
at http://acms.ucsd.edu/info/aup.html.
=====================================================================
 

Disk quotas for user cs190f (uid 59457):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
acsnfs4.ucsd.edu:/vol/home/linux/ieng6
                      11928  5204800 5204800                 272        9000        9000      
=============================================================
Check Account Lookup Tool at http://acms.ucsd.edu
=============================================================

[…]

Thu Oct 12, 2017 12:34pm - Prepping cs190f
[cs190f@ieng6-201]:~:56$ launch-pytorch-gpu.sh
Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 1 GPU units.  (Edit /home/linux/ieng6/cs190f/public/bin/launch-pytorch.sh to change this configuration.)
pod "cs190f-4953" created
Thu Oct 12 12:34:41 PDT 2017 starting up - pod status: Pending ;
Thu Oct 12 12:34:47 PDT 2017 pod is running with IP: 10.128.7.99
tensorflow/tensorflow:latest-gpu is now active.

Please connect to: http://ieng6-201.ucsd.edu:4957/?token=669d678bdb00c89df6ab178285a0e8443e676298a02ad66e2438c9851cb544ce

Connected to cs190f-4953; type 'exit' to terminate processes and close Jupyter notebooks.
cs190f@cs190f-4953:~$ ls
TensorFlow-Examples
cs190f@cs190f-4953:~$
cs190f@cs190f-4953:~$ git clone https://github.com/yunjey/pytorch-tutorial.git
Cloning into 'pytorch-tutorial'...
remote: Counting objects: 658, done.
remote: Total 658 (delta 0), reused 0 (delta 0), pack-reused 658
Receiving objects: 100% (658/658), 12.74 MiB | 24.70 MiB/s, done.
Resolving deltas: 100% (350/350), done.
Checking connectivity... done.
cs190f@cs190f-4953:~$ cd pytorch-tutorial/
cs190f@cs190f-4953:~/pytorch-tutorial$ cd tutorials/02-intermediate/bidirectional_recurrent_neural_network/
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ python main-gpu.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Epoch [1/2], Step [100/600], Loss: 0.7028
Epoch [1/2], Step [200/600], Loss: 0.2479
Epoch [1/2], Step [300/600], Loss: 0.2467
Epoch [1/2], Step [400/600], Loss: 0.2652
Epoch [1/2], Step [500/600], Loss: 0.1919
Epoch [1/2], Step [600/600], Loss: 0.0822
Epoch [2/2], Step [100/600], Loss: 0.0980
Epoch [2/2], Step [200/600], Loss: 0.1034
Epoch [2/2], Step [300/600], Loss: 0.0927
Epoch [2/2], Step [400/600], Loss: 0.0869
Epoch [2/2], Step [500/600], Loss: 0.0139
Epoch [2/2], Step [600/600], Loss: 0.0299
Test Accuracy of the model on the 10000 test images: 97 %
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ cd $HOME
cs190f@cs190f-4953:~$ nvidia-smi
Thu Oct 12 13:30:59 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 23%  27C    P0     56W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
cs190f@cs190f-4953:~$ exit


Licensed Software

Installing licensed software is allowed on the Research Cluster; however, the software version must be compatible with installation in a cluster environment. The purchase of licensed software is the responsibility of the user or their sponsoring department. Research IT Services is available to assist with the installation of licensed software. For questions about installing licensed software, please email rcd-support@ucsd.edu.

Expand
titleStata

For users with provisioned Stata licensing, the launch-scipy-ml.sh container is capable of executing Stata. Stata can be installed in your home directory by the Research IT Services team and can be executed using the command '~/stata-se' from within a container.
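For long analyses, Stata can also be run in batch mode using its standard -b flag; analysis.do below is a placeholder do-file:

Code Block
# Run a do-file non-interactively; output is written to analysis.log
~/stata-se -b do analysis.do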

Acknowledging Research IT Services

...