.. highlight:: rst

Sample Python Jobs
++++++++++++++++++

Instead of using the system-wide python, it is recommended that you create your own local python conda environment. With this setup, you can install any specific packages or versions of python that you need, and have full administrative control over your python installations. A nice cheat sheet of conda commands can be found `here `_.

Creating a Python Virtual Environment
*************************************

Begin by logging in to ``rcfcluster`` as described in the :ref:`beginning of this guide `.

To use conda, first load the corresponding ``miniconda`` module:

::

    module load miniconda

Initialize your shell to use conda:

::

    conda init bash

This step modifies your ``~/.bashrc`` file, and only needs to be done once. Then, activate the base environment:

::

    source ~/.bashrc

This will be reflected in your shell prompt:

::

    (base) user@rcfcluster:~$

To create a new environment in your home directory, type the following command (where "py37" is simply whatever you would like to name the environment, and "python=3.7" specifies the version of python you'd like to install):

::

    conda create --name py37 python=3.7

You will only need to complete this step once. By default, the environment will be installed in your ``~/.conda/envs`` folder, which is accessible to all nodes in the cluster.

**Note**: although you will only need to initialize your shell once, the step ``source ~/.bashrc`` must be completed each session.

.. _activating-using-the-python-environment:

Activating / Using the Python Environment
*********************************************

Before you can activate your conda environment, you must make sure the miniconda module is loaded and that you have sourced your ``.bashrc`` file. These two commands need to be run once per login:

::

    module load miniconda
    source ~/.bashrc

Then, to enter the virtual environment, use the following command (substituting ``py37`` with whatever you chose to name your environment in the previous section):

::

    conda activate py37

If the environment has been activated successfully, you should now see your command line prompt prefaced with ``(py37)``, indicating that you have entered the environment.

To see the list of packages and versions installed in your active environment, enter:

::

    conda list

You can then install any additional needed packages, such as "numpy", with:

::

    conda install numpy

You can continue to install packages at any time. Just note that you must always run the ``conda install`` command from within the virtual environment.

To exit the virtual environment, type ``conda deactivate`` at the command line. You can also close the environment simply by logging off the cluster.

**Note**: during any of these steps, you might come across the following error:

::

    CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
    To initialize your shell, run

        $ conda init <SHELL_NAME>

    Currently supported shells are:
      - bash
      - fish
      - tcsh
      - xonsh
      - zsh
      - powershell

    See 'conda init --help' for more information and options.

    IMPORTANT: You may need to close and restart your shell after running 'conda init'.

which is thrown in case you forget to source ``~/.bashrc``. There is no need to run ``conda init bash`` again; simply running ``source ~/.bashrc`` will fix the issue.

Another common error is:

::

    conda: command not found

in which case you probably forgot to load the conda module! Enter ``module load miniconda`` and then try again to activate the environment.
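Once the environment activates without errors, a quick way to confirm that you are using the intended interpreter and packages is a short check script such as the one below (a minimal sketch; the file name ``check_env.py`` is just a placeholder, and the numpy check assumes you installed numpy as shown above):

::

    # check_env.py -- sanity check of the active conda environment
    # (illustrative sketch; assumes numpy was installed with "conda install numpy")
    import sys
    import numpy as np

    # The interpreter path should point into ~/.conda/envs/py37, not the system python
    print("Interpreter:", sys.executable)

    # Versions available inside the environment
    print("Python version:", sys.version.split()[0])
    print("numpy version:", np.__version__)

Running ``python check_env.py`` from within the activated environment should report an interpreter located under ``~/.conda/envs`` rather than the system-wide python.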
.. _gridsearch-label:

Submitting a Python Script to Demonstrate Parallel Processing (within a single node)
**************************************************************************************

Below is an example of a python script (``gridsearch.py``) that can be run on the cluster:

::

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    import time

    time_start = time.time()

    # Read in data. (Since this is an example, we will instead create random data):
    n, m = 1000, 10
    train_x = np.random.rand(n,m)
    train_y = np.ones((n,))
    train_y[0:n//2] = 0
    np.random.shuffle(train_y)
    print('Train=', train_x.shape)
    print('Train=', train_y.shape)

    # Do cross validation for a SVM model:
    # Choose the parameter space to explore:
    kernels = ['rbf', 'poly', 'linear']
    Cs = np.arange(0.001, 10, 0.5)
    gammas = [0.5, 0.75, 1.0, 1.25, 1.5]

    # Run gridsearch - Note the n_jobs=-1 option allows sklearn gridsearch
    # to use as many processors in parallel as are available
    gridsearch = GridSearchCV(SVC(degree=3), cv=5, \
                              param_grid={"kernel": kernels, \
                                          "C": Cs, \
                                          "gamma": gammas}, \
                              scoring='accuracy', refit=True, n_jobs=-1)
    gridsearch.fit(train_x, train_y)

    # Print output (will be saved into the output file specified in your sbatch script)
    print("GridSearchCV Out of Sample Error (accuracy) for each model:")
    for mean, params in zip(gridsearch.cv_results_['mean_test_score'], gridsearch.cv_results_['params']):
        print("%0.6f %r" % (mean, params))
    print()
    print("Best model found during grid search:", gridsearch.best_estimator_, "(Accuracy =", gridsearch.best_score_, ")")

    # Print run-time info:
    time_end = time.time()
    print('Computation time: '+str(round(time_end-time_start,2))+' seconds.')

To run this script, submit either an interactive job or an sbatch script to slurm. Be sure to activate your virtual environment before executing the python command.

Interactive job example:

::

    srun --time=0:10:00 --mem-per-cpu=4G --cpus-per-task=4 --pty bash
    module load miniconda
    conda activate py37
    python ~/path/to/gridsearch.py

Sbatch job example:

::

    sbatch run_gridsearch.sh

where the file ``run_gridsearch.sh`` reads:

::

    #!/bin/bash
    #SBATCH --time=0:10:00
    #SBATCH --mem-per-cpu=4G
    #SBATCH --cpus-per-task=4

    # Activate conda environment:
    module load miniconda
    eval "$(conda shell.bash hook)"
    conda activate py37

    # Run the script (edit the path below to the location of the gridsearch.py script):
    python /path/to/gridsearch.py

where the line ``eval "$(conda shell.bash hook)"`` initializes the shell to use conda. Forgetting to add this line will lead to the same ``CommandNotFoundError`` mentioned above, asking you to run ``conda init bash``.

Utilizing Batch Scripts to Submit Multiple Jobs in Parallel
************************************************************

While the above example demonstrates how to launch a single python job on a compute node, you will receive the most benefit by running multiple jobs in parallel. We can adapt the above example so that we are submitting multiple batch scripts, each exploring a different part of the parameter space.

**EXAMPLE SCRIPT COMING SOON!**
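In the meantime, the sketch below illustrates one possible way to split the work (the file name ``gridsearch_one_kernel.py`` and the argument-passing convention are placeholders, not the official example). The script takes a single kernel name as a command-line argument and searches only over ``C`` and ``gamma`` for that kernel, so each sbatch job explores a different slice of the parameter space:

::

    # gridsearch_one_kernel.py -- hypothetical variant of gridsearch.py above.
    # Each slurm job runs this script with a different kernel, e.g.:
    #     python gridsearch_one_kernel.py rbf
    import sys
    import time

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # The kernel to explore is passed on the command line by the batch script
    kernel = sys.argv[1]

    time_start = time.time()

    # Random example data, as in gridsearch.py above
    n, m = 1000, 10
    train_x = np.random.rand(n, m)
    train_y = np.ones((n,))
    train_y[0:n//2] = 0
    np.random.shuffle(train_y)

    # Search only over C and gamma; the kernel is fixed for this job
    Cs = np.arange(0.001, 10, 0.5)
    gammas = [0.5, 0.75, 1.0, 1.25, 1.5]

    gridsearch = GridSearchCV(SVC(kernel=kernel, degree=3), cv=5,
                              param_grid={"C": Cs, "gamma": gammas},
                              scoring='accuracy', refit=True, n_jobs=-1)
    gridsearch.fit(train_x, train_y)

    print("Results for kernel =", kernel)
    print("Best model:", gridsearch.best_estimator_, "(Accuracy =", gridsearch.best_score_, ")")

    time_end = time.time()
    print('Computation time: '+str(round(time_end-time_start, 2))+' seconds.')

You could then submit one job per kernel, for example ``sbatch run_gridsearch.sh rbf``, ``sbatch run_gridsearch.sh poly``, and ``sbatch run_gridsearch.sh linear``, changing the last line of ``run_gridsearch.sh`` to ``python /path/to/gridsearch_one_kernel.py $1`` (sbatch passes any arguments after the script name through to the batch script). Each job writes its own slurm output file, and you can compare the best scores across kernels once all jobs finish.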
More Examples
*************

To take advantage of parallel computing (within a single compute node), you can use the "Parallel" function from the python "joblib" package. An example of such a python script (``example.py``) is included below:

::

    from joblib import Parallel, delayed
    import numpy as np
    import time
    import sys

    # inputs
    number_of_simulations = 10 #100000
    time_length = 1000
    num_cores = int(sys.argv[1]) # Must input number of cores

    # function to produce a single random noise time-series of length T
    def random_noise(T):
        data = []
        for t in range(T):
            data.append(np.random.normal(0,1))
        return np.array(data)

    # Parallel compute each random noise time-series
    time_start = time.time()
    output = Parallel(n_jobs=num_cores)(delayed(random_noise)(time_length) for n in range(number_of_simulations))
    output_size = np.matrix(output).shape
    time_end = time.time()

    # Print information
    print('Your data size is '+str(output_size)+' with '+str(output_size[0])+' random noise time-series with time length '+str(output_size[1])+'.')
    print('Computation time: '+str(round(time_end-time_start,2))+' seconds using '+str(num_cores)+' cores.')

Credit to the University of California Merced Research Computing Facility for this sample script, which is available at: http://hpcwiki.ucmerced.edu/knowledgebase/writing-slurm-job-scripts/

The python example above utilizes the Parallel function from joblib. However, this parallelization is limited to a single computer and does not take advantage of the full capabilities of the cluster. To write jobs that can be run on multiple nodes, we have to use something called the Message Passing Interface (MPI). Documentation on this is still in the works, and will hopefully be available soon!
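Finally, a practical note about running ``example.py``: it expects the number of cores as its first command-line argument (for example, ``python example.py 4`` inside a job submitted with ``--cpus-per-task=4``). If you would rather not pass this by hand, one option is to read the core count from the environment variable that slurm sets for the job. A minimal sketch (it assumes the job was submitted with ``--cpus-per-task``; the fallback of 1 core is only a default for runs outside slurm):

::

    import os

    # SLURM_CPUS_PER_TASK is set by slurm when the job requests --cpus-per-task;
    # fall back to a single core if the variable is not present (e.g. outside slurm).
    num_cores = int(os.environ.get('SLURM_CPUS_PER_TASK', 1))
    print('Using', num_cores, 'cores.')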