Guide

This page contains a description of the cluster and a section on good practices.

Please also read this presentation to learn what the cluster is, how to use it, and how to use its batch system.

Cluster Machines:

  • lphelc1a and lphelc1b

    The lphelc1a and lphelc1b machines are our two interactive nodes.
    These two interactive nodes are for development and testing purposes.
    Please run long or CPU-intensive jobs via the batch system.

  • lphelcsrv2

    The lphelcsrv2 is our head node.
    It hosts the RAID6 home disk and manages the batch system.
    There is rarely any need for a user to log into this machine.

  • Nodes

    The cluster has 20 identical worker nodes that accept jobs through our batch system, plus two additional nodes for testing.

    Specifications:

    • 16 x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
    • CentOS Linux 7 (Core)
    • 32 GB RAM
    • 600 GB Scratch Disk
    • Automatically mounted /home and /panfs directories
  • Test nodes

    The two test nodes are node21 and node22. Please use these nodes for submission of ganga jobs (LHCb) and for interactive work whenever possible.

Cluster Usage: Good Practices

Avoid Filling the Batch Queue with Long Jobs

The batch fairshare is configured such that the resources available are allocated to each user equally. This does not mean that you can only use 1/Nth of the available resources. If there are only a few people running jobs, then you will be able to use a larger share of the resources. This can result in one person being able to use all the resources at once.

Imagine that, seconds before you submit your jobs, someone else saturates the queue with their jobs so that no resources are left. Your jobs will then not be able to run until theirs have finished. So if you are planning to submit a lot of jobs, please consider excluding a couple of nodes so that they remain available to others. Also see the next point.
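
A minimal sketch of how to do this with sbatch's --exclude option (the node names and the script name below are placeholders; substitute whichever nodes you want to leave free):

sbatch --exclude=node01,node02 myJobScript.sh

The same option can also be placed inside the job script itself as an #SBATCH --exclude line.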

Submit Smaller Jobs if Possible

The way we generally run our jobs is like this:

  • Test job on interactive node for a small sample.
  • Job works and is ready for running on the batch system.
  • Submit many jobs at once, enough to generate your MC or completely run over your dataset.
  • Get back output and analyse results.

We don’t really want to have to limit the number of jobs people can submit at once. But problems can occur if people take all resources for a long time.

If you have a very large number of jobs or a large dataset to analyse, it is a good idea to make each job run for a few hours at most. This allows for better usage of the resources and better sharing of priorities. It may mean that you have to submit more jobs, with each job running over fewer files or generating fewer events, but it also means that you will not block other people’s jobs for too long!
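
As a rough sketch of how to split a large dataset into shorter jobs (filelist.txt and runJob.sh are hypothetical names; adjust the chunk size to whatever gives a runtime of a few hours):

#!/bin/bash
# split the full list of input files into chunks of 10 files each
split -l 10 filelist.txt chunk_

# submit one short batch job per chunk
for chunk in chunk_*; do
    sbatch runJob.sh "$chunk"
done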

Run Test Jobs Beforehand

The batch system only knows what resources your jobs will need if you tell it.
This is achieved via the sbatch flags or in the jobs script itself.
Examples of how to set your required resources can be found here.
Try to be as accurate as possible, since this allows the throughput of the system to be optimised.
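
As an illustration (the job name, values and executable are placeholders, not recommendations for this cluster), resource requests can be placed at the top of the job script as #SBATCH directives:

#!/bin/bash
#SBATCH --job-name=myAnalysis       # a name for the job
#SBATCH --time=02:00:00             # requested walltime (hh:mm:ss)
#SBATCH --mem=2G                    # requested memory
#SBATCH --cpus-per-task=1           # number of CPU cores
#SBATCH --output=myAnalysis_%j.log  # log file; %j is the job ID

./myAnalysis

The same options can equally be given on the sbatch command line.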

One of the main resources you will need to calculate is the walltime, i.e. the time a job takes to execute in the real world.
You can use the “time” command to do this:

time <testJobScript>

The output should look something like this:

real    0m3.235s
user    0m0.101s
sys     0m1.078s

Record the “real” value, then run the test job again with more events or more files.
After running a few test jobs you should have an idea about how the execution time of your job scales with the number of events or number of files.
Use this information to calculate how long a typical job should take, then add 10% just to be safe.
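
For example, with purely illustrative numbers: if a test over 1,000 events takes 3 minutes of real time and a test over 2,000 events takes 6 minutes, the runtime scales roughly linearly with the number of events. A production job over 50,000 events should then take about 150 minutes; adding 10% gives roughly 165 minutes, so you would request a walltime of 02:45:00.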

Run Everything Local to a Node

By “local” we mean that you should minimise network traffic between machines. The /home disk is hosted on the lphelcsrv2 machine, so constantly reading and writing files in your home area is discouraged. The preferred method is to use the /scratch directory of the node you are running on.

Note that Slurm expects a shared filesystem, so I/O-intensive jobs should write their output data to the panfs filesystem.

In order to do this effectively, you will need to create a unique directory for each job you submit. Do not worry – you can easily do this from within the job script itself with the following few lines in bash:

#!/bin/bash
# use the Slurm job ID to make the directory name unique for each job
MYID=$SLURM_JOB_ID
WORKDIR=/scratch/$USER/$MYID
mkdir -p $WORKDIR
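
At the end of the job, copy anything you want to keep back to shared storage and remove the scratch directory. A possible continuation of the same script, where the output file name and destination are purely placeholders:

cd $WORKDIR
# ... run your job here, writing its output to $WORKDIR ...

# copy the output you want to keep back to shared storage
cp output.root /panfs/$USER/

# clean up the scratch area
cd /
rm -rf $WORKDIR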