...
...
This documentation includes guidance, instructions, and general information about the Research IT Services-managed Research Cluster. Researchers execute computing jobs on the cluster in containerized environments known as Docker “containers”, which are essentially lightweight virtual machines, each assigned dedicated CPU, RAM, and GPU hardware and each well isolated from other users’ processes. The Research Cluster uses the Kubernetes container management/orchestration system to route users’ containers onto compute nodes, monitor performance, and apply resource limits/quotas as appropriate.
...
Explore More: Programming Training/Educational Resources
Monitoring Container Resource Usage
Cluster users can monitor the resource usage of a launched container both from the command line terminal and in the web interface tool. Monitoring resource usage allows users to be aware of their job limitations, as well as to identify possible bottlenecks during certain stages of job execution.
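For example, from a terminal inside a running container you can check CPU, memory, and GPU utilization with standard Linux and NVIDIA tools (a minimal sketch; nvidia-smi is available only in GPU-enabled containers):
# CPU and memory usage of processes in the container (press 'q' to quit)
top
# Memory available to the container
free -h
# GPU utilization and GPU memory usage
nvidia-smi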
Monitoring Resource Usage in a Jupyter Notebook:
Users can view the container CPU, GPU, and memory (RAM) utilization by selecting the ‘Show Usage’ buttons in the notebook header menu. The usage will display in the top right of the notebook as follows:
...
Two types of persistent file storage are available within containers: private home directory storage and shared directory storage.
A private home directory ($HOME) is automatically generated for each cluster user. Each user's private home directory is limited to a 100GB storage allocation by default.
A shared directory is available for group-shared data or for distributing common datasets (e.g. CIFAR-10, Tiny ImageNet) for individual access.
...
Shared directory storage allocations can vary, as this storage may be mounted from an external storage system.
In specific cases, Research IT may make allowances to temporarily increase storage in a user’s private home directory. These requests may be submitted by emailing rcd-support@ucsd.edu.
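To check how much of your allocation is in use, standard Linux tools work from a container terminal (a minimal sketch; exact mount points may differ in your environment):
# Total size of your private home directory
du -sh $HOME
# Free space on the filesystem backing your home directory
df -h $HOME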
...
Adjusting Container Environment and CPU/RAM/GPU Limits
All running containers in the cluster have a maximum configuration limit of 8 CPU cores, 64GB RAM, and 1 GPU. You may run eight 1-core containers, one 8-core container, or any configuration within these bounds. Requests to increase these default limits, as well as requests for other adjustments (including software) to your container environment, may be submitted to rcd-support@ucsd.edu.
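As an illustration only, launch scripts typically accept command-line options for the requested resources; the flag names below are assumptions, so consult your launch script's usage/help output for the exact syntax:
# Hypothetical example: request 8 CPU cores, 64GB RAM, and 1 GPU
# (the -c/-m/-g flags are assumptions; check your script's help output)
launch-scipy-ml.sh -c 8 -m 64 -g 1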
...
In addition to configuration settings, users can import alternate or custom Docker images. The cluster servers will pull container images from dockerhub.io or elsewhere if requested. You can create or modify these Docker images as needed.
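For example, a custom image can be built with standard Docker tooling and pushed to a registry the cluster can reach (a minimal sketch; the repository and tag names are placeholders):
# Build an image from the Dockerfile in the current directory
docker build -t myregistryuser/my-research-image:latest .
# Push the image so the cluster can pull it on request
docker push myregistryuser/my-research-image:latest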
Adjusting Launch Script Environments: Command Line Options
...
Example of a user-specific (--user) package installation using 'pip':
agt@agt-10859:~$ pip install --user imutils
Collecting imutils
Downloading imutils-0.4.5.tar.gz
Building wheels for collected packages: imutils
Running setup.py bdist_wheel for imutils ... done
Stored in directory: /tmp/xdg-cache/pip/wheels/ec/e4/a7/17684e97bbe215b7047bb9b80c9eb7d6ac461ef0a91b67ae71
Successfully built imutils
Installing collected packages: imutils
Successfully installed imutils-0.4.5
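Because '--user' installs packages into your home directory (typically under ~/.local), they persist across container restarts. You can confirm the installation afterwards:
# Show metadata for the newly installed package
pip show imutils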
Installing TensorBoard
The current cluster configuration does not permit easy access to TensorBoard via port 6006; however, the following shell commands can install a TensorBoard interface accessible within the Jupyter environment.
PIP command to install TensorBoard:
pip install -U --user jupyter-tensorboard
jupyter nbextension enable jupyter_tensorboard/tree --user
Note: You’ll need to exit your Pod/container and restart it for the change to take effect.
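After restarting, you can confirm from a container terminal that the notebook extension was registered (a quick check using Jupyter's standard extension listing):
# List installed/enabled notebook extensions; jupyter_tensorboard should appear as enabled
jupyter nbextension list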
Running Jobs in a Background Container and Long-Running Jobs
...
An example of a 'cluster-status' command output:
Cluster Hardware Specifications
The Research Cluster shares hardware infrastructure with the Data Science and Machine Learning Platform (DSMLP). As such, the hardware specifications for the Research Cluster are described in the Cluster architecture diagram
...
...
(as displayed in reference to the DSMLP).
Additional Node specifications:

Node | CPU Model | #Cores ea. | RAM ea. | #GPU | GPU Model | Family | CUDA Cores | GPU RAM | GFLOPS
Nodes 1-4 | 2xE5-2630 v4 | 20 | 384GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600
Nodes 5-8 | 2xE5-2630 v4 | 20 | 256GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600
Node 9 | 2xE5-2650 v2 | 16 | 128GB | 8 | GTX Titan (2014) | Kepler | 2688 ea. | 6GB | 4500
Node 10 | 2xE5-2670 v3 | 24 | 320GB | 7 | GTX 1070Ti | Pascal | 2432 ea. | 8GB | 7800
Nodes 11-12 | 2xXeon Gold 6130 | 32 | 384GB | 8 | GTX 1080Ti | Pascal | 3584 ea. | 11GB | 10600
Nodes 13-15 | 2xE5-2650 v1 | 16 | 320GB | n/a | n/a | n/a | n/a | n/a | n/a
Nodes 16-18 | 2xAMD 6128 | 24 | 256GB | n/a | n/a | n/a | n/a | n/a | n/a

Nodes are connected via an Arista 7150 10Gb Ethernet switch. Additional nodes can be added to the cluster at peak times.
...
Example: PyTorch Session with TensorFlow examples
...
slithy:~ agt$
slithy:~ agt$ ssh cs190f@ieng6.ucsd.edu
Password:
Last login: Thu Oct 12 12:29:30 2017 from slithy.ucsd.edu
============================ NOTICE =================================
Authorized use of this system is limited to password-authenticated
usernames which are issued to individuals and are for the sole use of
the person to whom they are issued.
Privacy notice: be aware that computer files, electronic mail and
accounts are not private in an absolute sense. For a statement of
"ETS (formerly ACMS) Acceptable Use Policies" please see our webpage
at http://acms.ucsd.edu/info/aup.html.
=====================================================================
Disk quotas for user cs190f (uid 59457):
Filesystem blocks quota limit grace files quota limit grace
acsnfs4.ucsd.edu:/vol/home/linux/ieng6
11928 5204800 5204800 272 9000 9000
=============================================================
Check Account Lookup Tool at http://acms.ucsd.edu
=============================================================
[…]
Thu Oct 12, 2017 12:34pm - Prepping cs190f
[cs190f@ieng6-201]:~:56$ launch-pytorch-gpu.sh
Attempting to create job ('pod') with 2 CPU cores, 8 GB RAM, and 1 GPU units. (Edit /home/linux/ieng6/cs190f/public/bin/launch-pytorch.sh to change this configuration.)
pod "cs190f -4953" created
Thu Oct 12 12:34:41 PDT 2017 starting up - pod status: Pending ;
Thu Oct 12 12:34:47 PDT 2017 pod is running with IP: 10.128.7.99
tensorflow/tensorflow:latest-gpu is now active.
Please connect to: http://ieng6-201.ucsd.edu:4957/?token=669d678bdb00c89df6ab178285a0e8443e676298a02ad66e2438c9851cb544ce
Connected to cs190f-4953; type 'exit' to terminate processes and close Jupyter notebooks.
cs190f@cs190f-4953:~$ ls
TensorFlow-Examples
cs190f@cs190f-4953:~$
cs190f@cs190f-4953:~$ git clone https://github.com/yunjey/pytorch-tutorial.git
Cloning into 'pytorch-tutorial'...
remote: Counting objects: 658, done.
remote: Total 658 (delta 0), reused 0 (delta 0), pack-reused 658
Receiving objects: 100% (658/658), 12.74 MiB | 24.70 MiB/s, done.
Resolving deltas: 100% (350/350), done.
Checking connectivity... done.
cs190f@cs190f-4953:~$ cd pytorch-tutorial/
cs190f@cs190f-4953:~/pytorch-tutorial$ cd tutorials/02-intermediate/bidirectional_recurrent_neural_network/
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ python main-gpu.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Epoch [1/2], Step [100/600], Loss: 0.7028
Epoch [1/2], Step [200/600], Loss: 0.2479
Epoch [1/2], Step [300/600], Loss: 0.2467
Epoch [1/2], Step [400/600], Loss: 0.2652
Epoch [1/2], Step [500/600], Loss: 0.1919
Epoch [1/2], Step [600/600], Loss: 0.0822
Epoch [2/2], Step [100/600], Loss: 0.0980
Epoch [2/2], Step [200/600], Loss: 0.1034
Epoch [2/2], Step [300/600], Loss: 0.0927
Epoch [2/2], Step [400/600], Loss: 0.0869
Epoch [2/2], Step [500/600], Loss: 0.0139
Epoch [2/2], Step [600/600], Loss: 0.0299
Test Accuracy of the model on the 10000 test images: 97 %
cs190f@cs190f-4953:~/pytorch-tutorial/tutorials/02-intermediate/bidirectional_recurrent_neural_network$ cd $HOME
cs190f@cs190f-4953:~$ nvidia-smi
Thu Oct 12 13:30:59 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 23% 27C P0 56W / 250W | 0MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
cs190f@cs190f-4953:~$ exit
Licensed Software
Installing licensed software is allowed on the Research Cluster; however, the software version must be compatible with installation in a cluster environment. The purchase of licensed software is the responsibility of the user or their sponsoring department. Research IT Services is available to assist with the installation of licensed software. For questions about installing licensed software, please email rcd-support@ucsd.edu.
Stata
For users with provisioned Stata licensing, the launch-scipy-ml.sh container is capable of executing Stata. Stata can be installed in your home directory by the Research IT Services team and can be executed using the command '~/stata-se' from within a container.
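As an illustration, Stata's standard batch mode can be used for non-interactive runs from a container terminal (the do-file name is a placeholder):
# Run a do-file in batch mode; output is written to analysis.log
~/stata-se -b do analysis.do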
Acknowledging Research IT Services
...