Sample Python Jobs

Instead of using the system-wide python, it is recommended that you create your own local python conda environment. With this setup, you can install any specific packages or versions of python that you need, and have full administrative control over your python installations. A nice cheat sheet of conda commands can be found here.

Creating a Python Virtual Environment

Begin by logging in to rcfcluster as described in the beginning of this guide.

To use conda, first load the corresponding miniconda module:

module load miniconda

Initialize your shell to use conda:

conda init bash

This step modifies your ~/.bashrc file, and only needs to be done once. Then, activate the base environment:

source ~/.bashrc

This will be reflected in your shell prompt:

(base) user@rcfcluster:~$

To create a new environment in your home directory, type the following command (where “py37” is simply whatever you would like to name the environment, and “python=3.7” specifies the version of python you’d like to install):

conda create --name py37 python=3.7

You will only need to complete this step once. By default, the environment will be installed in your ~/.conda/envs folder, which is accessible to all nodes in the cluster.
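To confirm that the environment was created, you can list all of your conda environments with:

conda env list

The new environment should appear alongside the base environment.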

Note: although you will only need to initialize your shell once, the step source ~/.bashrc must be completed each session.

Activating / Using the Python Environment

Before you can activate your conda environment, you must make sure the miniconda module is loaded and that you have sourced your .bashrc file. These two commands need to be run once per login:

module load miniconda
source ~/.bashrc

Then, to enter the virtual environment, use the following command (substituting py37 with whatever you chose to name your environment in the previous section):

conda activate py37

If the environment has been activated successfully, you should now see your command line prompt prefaced with (py37), indicating that you have entered the environment.
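For example:

(py37) user@rcfcluster:~$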

To see the list of packages and versions installed in your active environment, enter:

conda list

You can then install any additional needed packages, such as “numpy”, with:

conda install numpy

You can continue to install packages at any time. Just note that you must always run the conda install command from within the virtual environment.
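For example, conda also lets you request a specific version of a package or install several packages in one command (the package names and version number below are just illustrations):

conda install numpy=1.21
conda install scipy pandas matplotlib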

To exit the virtual environment, type the following at the command line (you can also close the environment simply by logging off the cluster):
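conda deactivate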

Note: during any of these steps, you might come across the following errors:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run

$ conda init <SHELL_NAME>

Currently supported shells are:
  - bash
  - fish
  - tcsh
  - xonsh
  - zsh
  - powershell

See 'conda init --help' for more information and options.

IMPORTANT: You may need to close and restart your shell after running 'conda init'.

This error is thrown if you forget to source ~/.bashrc. There is no need to run conda init bash again; simply running source ~/.bashrc will fix the issue.

Another common error is:

conda: command not found

in which case you probably forgot to load the miniconda module! Enter module load miniconda and then try activating the environment again.

Submitting a Python Script to Demonstrate Parallel Processing (within a single node)

Below is an example of a python script (gridsearch.py) that can be run on the cluster:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import time

time_start = time.time()

# Read in data. (Since this is an example, we will instead create random data):
n, m = 1000, 10
train_x = np.random.rand(n,m)
train_y = np.ones((n,))
train_y[0:n//2] = 0
np.random.shuffle(train_y)

print('Train X shape:', train_x.shape)
print('Train y shape:', train_y.shape)

# Do cross validation for a SVM model:
# Choose the parameter space to explore:
kernels = ['rbf', 'poly', 'linear']
Cs = np.arange(0.001, 10, 0.5)
gammas = [0.5, 0.75, 1.0, 1.25, 1.5]

# Run gridsearch - Note the n_jobs=-1 option allows sklearn gridsearch
# to use as many processors in parallel as are available
gridsearch = GridSearchCV(SVC(degree=3), cv=5,
                          param_grid={"kernel": kernels,
                                      "C": Cs,
                                      "gamma": gammas},
                          scoring='accuracy', refit=True, n_jobs=-1)
gridsearch.fit(train_x, train_y)

# Print output (will be saved into the output file specified in your sbatch script)
print("GridSearchCV Out of Sample Error (accuracy) for each model:")
for mean, params in zip(gridsearch.cv_results_['mean_test_score'], gridsearch.cv_results_['params']):
        print("%0.6f %r" % (mean, params))
print()
print("Best model found during grid search:", gridsearch.best_estimator_, "(Accuracy =",gridsearch.best_score_,")")

# Print run-time info:
time_end = time.time()
print('Computation time: '+str(round(time_end-time_start,2))+' seconds.')

To run this script, submit either an interactive job or an sbatch script to Slurm. Be sure to activate your virtual environment before executing the python command. Interactive job example:

srun --time=0:10:00 --mem-per-cpu=4G --cpus-per-task=4 --pty bash
module load miniconda
conda activate py37
python ~/path/to/gridsearch.py

Sbatch job example:

sbatch run_gridsearch.sh

where the file run_gridsearch.sh reads:

#!/bin/bash

#SBATCH --time=0:10:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=4

# Activate conda environment:
module load miniconda

eval "$(conda shell.bash hook)"

conda activate py37

# Run the script (edit the path below to the location of the gridsearch.py script):
python /path/to/gridsearch.py

where the line eval "$(conda shell.bash hook)" initializes the shell to use conda. Forgetting to add this line will lead to the same CommandNotFoundError mentioned above, asking you to run conda init bash.

Utilizing Batch Scripts to Submit Multiple Jobs in Parallel

While the above example demonstrates how to launch a single python job on a compute node, you will receive the most benefit by running multiple jobs in parallel. We can adapt the above example so that we are submitting multiple batch scripts, each exploring a different parameter space.

EXAMPLE SCRIPT COMING SOON!
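Until that example is published, here is a minimal sketch of one possible approach (the file names submit_all.sh, run_gridsearch_kernel.sh, and gridsearch_kernel.py are hypothetical, introduced only for illustration). The idea is to make a copy of run_gridsearch.sh whose last line is python /path/to/gridsearch_kernel.py $1, where gridsearch_kernel.py is a copy of gridsearch.py that sets kernels = [sys.argv[1]], and then submit one job per kernel:

#!/bin/bash

# submit_all.sh (hypothetical): submit one sbatch job per SVM kernel,
# so that each job explores a different slice of the parameter space.
for kernel in rbf poly linear
do
    sbatch run_gridsearch_kernel.sh $kernel
done

Slurm then schedules the three jobs independently, and each job can run in parallel with the others as resources become available.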

More Examples

To take advantage of parallel computing (within a single compute node), you can use the “Parallel” function from the python “joblib” package. An example of such a python script (example.py) is included below:

from joblib import Parallel, delayed
import numpy as np
import time
import sys

# inputs
number_of_simulations = 10 #100000
time_length = 1000
num_cores = int(sys.argv[1]) # Must input number of cores

# function to produce a single random walk of time length T
def random_noise(T):
        data = []
        for t in range(T):
                data.append(np.random.normal(0,1))
        return np.array(data)

# Parallel compute each random walk
time_start = time.time()
output = Parallel(n_jobs=num_cores)(delayed(random_noise)(time_length) for n in range(number_of_simulations))
output_size = np.array(output).shape  # shape is (number_of_simulations, time_length)
time_end = time.time()

# Print information
print('Your data size is '+str(output_size)+' with '+str(output_size[0])+' random noise time-series with time length '+str(output_size[1])+'.')
print('Computation time: '+str(round(time_end-time_start,2))+' seconds using '+str(num_cores)+' cores.')

Credit to University of California Merced Research Computing Facility for this sample script, which is available at: http://hpcwiki.ucmerced.edu/knowledgebase/writing-slurm-job-scripts/
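To submit example.py as a batch job, you can use an sbatch script very similar to run_gridsearch.sh above, passing the number of allocated cores as the script's command-line argument. Below is a minimal sketch (the file name run_example.sh is just an illustration, and it assumes numpy and joblib are installed in your environment; $SLURM_CPUS_PER_TASK is an environment variable that Slurm sets to match --cpus-per-task):

#!/bin/bash

#SBATCH --time=0:10:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=4

# Activate conda environment:
module load miniconda
eval "$(conda shell.bash hook)"
conda activate py37

# Run the script, passing the number of allocated cores as the first argument
# (edit the path below to the location of the example.py script):
python /path/to/example.py $SLURM_CPUS_PER_TASK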

The python example above utilizes the Parallel function from joblib. However, this parallelization is limited to a single compute node and doesn't take advantage of the full capabilities of the cluster. To write jobs that can run across multiple nodes, we have to use something called the Message Passing Interface (MPI). Documentation on this is still in the works, and will hopefully be available soon!