...

This documentation includes guidance, instructions, and general information about the Research IT Services-managed Research Cluster. Researchers execute computing jobs on the cluster in containerized environments known as Docker “containers”, which are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware, and each well isolated from other users’ processes. The Research Cluster uses the Kubernetes container management/orchestration system to route users’ containers onto compute nodes, monitor performance, and apply resource limits/quotas as appropriate.

...

Expand
titleExplore More: Programming Training/Educational Resources

  • for upcoming workshop opportunities

Monitoring Container Resource Usage

Cluster users can monitor the resource usage of a launched container both from the command-line terminal and in the web interface tool. Monitoring resource usage lets users stay aware of their job's resource limits and identify possible bottlenecks during particular stages of job execution.
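From a terminal session inside a running container, the same information can be checked with standard Linux utilities; the commands below are a minimal sketch using tools commonly present in the cluster images.

Expand
titleExample: Monitoring resource usage from the command line
Code Block
# CPU load and per-process usage (press 'q' to quit)
top
# Memory (RAM) usage in human-readable units
free -h
# GPU utilization and GPU memory, if a GPU is attached to the container
nvidia-smi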

Expand
titleMonitoring Resource Usage in a Jupyter Notebook

Users can view the container's CPU, GPU, and memory (RAM) utilization by selecting the ‘Show Usage’ buttons in the header menu. Usage is displayed in the top right of the notebook as follows:

...

Two types of persistent file storage are available within containers: private home directory storage and shared directory storage.

  • A private home directory ($HOME) is automatically generated for each cluster user. Each user's private home directory is limited to a 100GB storage allocation by default (a usage-check example follows this list).

  • A shared directory, for group-shared data or for distributing common datasets (e.g., CIFAR-10, Tiny ImageNet) for individual access.

...

  • Shared directory storage allocations can vary, as this storage may be mounted from external storage systems.
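To check current usage against these allocations from within a container, standard disk utilities suffice; a minimal sketch (the cluster's exact quota-reporting mechanism may differ):

Code Block
# Total size of the contents of your home directory
du -sh $HOME
# Space available on the filesystem backing your home directory
df -h $HOME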

In specific cases, Research IT may make allowances to temporarily increase storage in a user’s private home directory. These requests may be submitted by emailing rcd-support@ucsd.edu.

...

Adjusting Container Environment and CPU/RAM/GPU Limits

All running containers in the cluster have a default maximum configuration of 8 CPU cores, 64GB RAM, and 1 GPU. You may run eight 1-core containers, one 8-core container, or any configuration within these bounds. Requests to increase these default limits, as well as requests for other adjustments (including software) to your container environment, may be submitted to rcd-support@ucsd.edu.
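Resource configurations below these limits are typically chosen at launch time. As an illustrative sketch only (the -c/-m/-g flag names here are assumptions; consult your launch script's built-in help for the exact options on this cluster):

Code Block
# Hypothetical invocation requesting 8 CPU cores, 64GB RAM, and 1 GPU
launch-scipy-ml.sh -c 8 -m 64 -g 1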

...

In addition to configuration settings, users can import alternate or custom Docker images. The cluster servers will pull container images from dockerhub.io or elsewhere if requested. You can create or modify these Docker images as needed.
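For example, a custom image can be built and published with the standard Docker CLI so the cluster can pull it; in this sketch, myuser/my-research-image is a placeholder image name:

Code Block
# Build an image from a Dockerfile in the current directory
docker build -t myuser/my-research-image:v1 .
# Publish it to a registry the cluster can reach (e.g., Docker Hub)
docker push myuser/my-research-image:v1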

Adjusting Launch Script Command Line Options

...

Expand
titleExample of a user specific (--user) package installation using 'pip':
Code Block
agt@agt-10859:~$ pip install --user imutils
Collecting imutils
  Downloading imutils-0.4.5.tar.gz
Building wheels for collected packages: imutils
  Running setup.py bdist_wheel for imutils ... done
  Stored in directory: /tmp/xdg-cache/pip/wheels/ec/e4/a7/17684e97bbe215b7047bb9b80c9eb7d6ac461ef0a91b67ae71
Successfully built imutils
Installing collected packages: imutils
Successfully installed imutils-0.4.5
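A quick way to confirm a --user installation succeeded is to import the package from the same session; --user packages land under $HOME/.local, which persists because it resides in your home directory.

Code Block
# Verify the package imports and show where it was installed
python -c "import imutils; print(imutils.__file__)"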

Installing TensorBoard

The current cluster configuration does not permit easy access to TensorBoard via port 6006; however, the following shell commands can install a TensorBoard interface accessible within the Jupyter environment.

Expand
titlePIP command to install TensorBoard:
Code Block
pip install -U --user jupyter-tensorboard
jupyter nbextension enable jupyter_tensorboard/tree --user

Note: You’ll need to exit your Pod/container and restart for the change to take effect.

Running Jobs in a Background Container and Long-Running Jobs

...
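While the cluster's background-container mode is the supported mechanism for long jobs, a process can also be detached from an interactive shell so it survives the terminal closing; a generic sketch, where train.py is a placeholder for your own program:

Code Block
# Run a long job detached from the terminal, logging output to a file
nohup python train.py > train.log 2>&1 &
# Follow the log to check progress
tail -f train.log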

Expand
titleAn example of a 'cluster-status' command output:

Cluster Hardware Specifications

The Research Cluster shares hardware infrastructure with the Data Science and Machine Learning Platform (DSMLP). As such, the hardware specifications for the Research Cluster are described in the cluster architecture diagram

...

...

(as displayed in reference to the DSMLP).

Expand
titleAdditional Node specifications:

Nodes are connected via an Arista 7150 10Gb Ethernet switch. Additional nodes can be added to the cluster at peak times.

| Node | CPU Model | #Cores ea. | RAM ea. | #GPUs | GPU Model | Family | CUDA Cores | GPU RAM | GFLOPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nodes 1-4 | 2x E5-2630 v4 | 20 | 384GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
| Nodes 5-8 | 2x E5-2630 v4 | 20 | 256GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
| Node 9 | 2x E5-2650 v2 | 16 | 128GB | 8 | GTX Titan (2014) | Kepler | 2688 ea. | 6GB | 4500 |
| Node 10 | 2x E5-2670 v3 | 24 | 320GB | 7 | GTX 1070Ti | Pascal | 2432 ea. | 8GB | 7800 |
| Nodes 11-12 | 2x Xeon Gold 6130 | 32 | 384GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600 |
| Nodes 13-15 | 2x E5-2650 v1 | 16 | 320GB | n/a | n/a | n/a | n/a | n/a | n/a |
| Nodes 16-18 | 2x AMD 6128 | 24 | 256GB | n/a | n/a | n/a | n/a | n/a | n/a |


...

Expand
titleExample: PyTorch Session with TensorFlow examples

...

Code Block
slithy:~ agt$
slithy:~ agt$ ssh cs190f@ieng6.ucsd.edu
Password:
Last login: Thu Oct 12 12:29:30 2017 from slithy.ucsd.edu
============================ NOTICE =================================
Authorized use of this system is limited to password-authenticated
usernames which are issued to individuals and are for the sole use of
the person to whom they are issued.
 
Privacy notice: be aware that computer files, electronic mail and
accounts are not private in an absolute sense.  For a statement of
"ETS (formerly ACMS) Acceptable Use Policies" please see our webpage
at http://acms.ucsd.edu/info/aup.html.
=====================================================================
 

Disk quotas for user cs190f (uid 59457):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
acsnfs4.ucsd.edu:/vol/home/linux/ieng6
                      11928  5204800 5204800                 272        9000        9000      
=============================================================
Check Account Lookup Tool at http://acms.ucsd.edu
=============================================================

[…]

Thu Oct 12, 2017 12:34pm - Prepping cs190f
[cs190f@ieng6-201]:~:56$ launch-pytorch-gpu.sh
Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 1 GPU units.  (Edit /home/linux/ieng6/cs190f/public/bin/launch-pytorch.sh to change this configuration.)
pod "cs190f-4953" created
Thu Oct 12 12:34:41 PDT 2017 starting up - pod status: Pending ;
Thu Oct 12 12:34:47 PDT 2017 pod is running with IP: 10.128.7.99
tensorflow/tensorflow:latest-gpu is now active.

Please connect to: http://ieng6-201.ucsd.edu:4957/?token=669d678bdb00c89df6ab178285a0e8443e676298a02ad66e2438c9851cb544ce

Connected to cs190f-4953; type 'exit' to terminate processes and close Jupyter notebooks.
cs190f@cs190f-4953:~$ ls
TensorFlow-Examples
cs190f@cs190f-4953:~$
cs190f@cs190f-4953:~$ git clone https://github.com/yunjey/pytorch-tutorial.git
Cloning into 'pytorch-tutorial'...
remote: Counting objects: 658, done.
remote: Total 658 (delta 0), reused 0 (delta 0), pack-reused 658
Receiving objects: 100% (658/658), 12.74 MiB | 24.70 MiB/s, done.
Resolving deltas: 100% (350/350), done.
Checking connectivity... done.
cs190f@cs190f-4953:~$ cd pytorch-tutorial/
cs190f@cs190f-4953:~/pytorch-tutorial$ cd tutorials/02-intermediate/bidirectional_recurrent_neural_network/
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ python main-gpu.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Epoch [1/2], Step [100/600], Loss: 0.7028
Epoch [1/2], Step [200/600], Loss: 0.2479
Epoch [1/2], Step [300/600], Loss: 0.2467
Epoch [1/2], Step [400/600], Loss: 0.2652
Epoch [1/2], Step [500/600], Loss: 0.1919
Epoch [1/2], Step [600/600], Loss: 0.0822
Epoch [2/2], Step [100/600], Loss: 0.0980
Epoch [2/2], Step [200/600], Loss: 0.1034
Epoch [2/2], Step [300/600], Loss: 0.0927
Epoch [2/2], Step [400/600], Loss: 0.0869
Epoch [2/2], Step [500/600], Loss: 0.0139
Epoch [2/2], Step [600/600], Loss: 0.0299
Test Accuracy of the model on the 10000 test images: 97 %
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ cd $HOME
cs190f@cs190f-4953:~$ nvidia-smi
Thu Oct 12 13:30:59 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 23%  27C    P0     56W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
cs190f@cs190f-4953:~$ exit


Licensed Software

Installing licensed software is allowed on the Research Cluster; however, the software version must be compatible with installation in a cluster environment. The purchase of licensed software is the responsibility of the user or their sponsoring department. Research IT Services is available to assist with the installation of licensed software. For questions about installing licensed software, please email rcd-support@ucsd.edu.

Expand
titleStata

For users with provisioned Stata licensing, the launch-scipy-ml.sh container is capable of executing Stata. Stata can be installed in your home directory by the Research IT Services team and can be executed using the command '~/stata-se' from within a container.
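For long analyses, Stata can also be run in batch mode using its standard -b flag; analysis.do below is a placeholder do-file:

Code Block
# Run a do-file non-interactively; output is written to analysis.log
~/stata-se -b do analysis.do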

Acknowledging Research IT Services

...