Questions tagged [slurm]

27 questions
13
votes
2 answers

How can I find out how long my slurm job took to execute?

One way to see how long my Slurm job has been running is to use squeue --job. But how do I find out how long the job took once it has completed?
demongolem
  • 567
  • 4
  • 10
  • 25
5
votes
1 answer

remove slurm sacct command double entries: "extern"

Running jobs show two entries, one of which has an .extern suffix. Completed (or failed) jobs also have a third entry: .batch. Is there a way to remove (or hide) these from the sacct output? What are these entries?
DilithiumMatrix
  • 549
  • 1
  • 4
  • 15
4
votes
1 answer

Slurm initialization fails in a Raspberry Pi cluster with Raspbian 9.4

I am trying to set up Slurm in a Raspberry Pi cluster with Raspbian 9.4. I am able to start slurmctld, but when I try to launch slurmd I get the following output: pi@node1:~ $ slurmd -Dvvvc slurmd: debug: Log file re-opened slurmd: error: Domain…
Bub Espinja
  • 151
  • 5
3
votes
2 answers

How to cancel a job that is in the completing (CG) state?

I submitted some jobs using sbatch and later canceled some of them using scancel. However, they are stuck in state CG and I cannot remove them from my list. Is there any way to get rid of those CG jobs? Sadly, I'm not the administrator of…
Iago Carvalho
  • 131
  • 1
  • 3
2
votes
1 answer

Slurm on AWS returns slurmstepd: error: execve(): : No such file or directory

I have installed a Burstable and Event-driven HPC Cluster on AWS Using Slurm according to this tutorial. With this installation I can burst instances and run jobs in the Slurm environment on EC2. After running: #!/bin/bash #SBATCH --nodes=2 #SBATCH…
Serialchiller
  • 41
  • 1
  • 3
2
votes
1 answer

How to use Slurm to request only one core instead of a node or socket?

I wrote Perl scripts to analyze my simulation data. This is not a concurrent program. The cluster has eight nodes. Each node has 2 sockets with 10 cores. I want to submit my job using Slurm and only request one core to…
Leon
  • 121
  • 1
  • 6
2
votes
0 answers

How to use SLURM's --dependency=expand: correctly

I have 1 slurm job unfinished out of 5 that's been running 19 hours and I'm concerned that it will hit walltime before it finishes. I'm not the admin and it's the weekend, so I would like to try using this feature I discovered recently shown in…
hepcat72
  • 155
  • 7
1
vote
0 answers

How to make a host file in SLURM with $SLURM_JOB_NODELIST

I have access to an HPC cluster with 40 cores on each node. I have a batch file to run a total of 35 codes, which are in separate folders. Each is an OpenMP code that requires 4 cores. How do I allocate resources such that each code gets 4…
1
vote
1 answer

slurmd: Invalid job credential

I'm having some problems with a test configuration of Slurm on my laptop. I'm trying to run four slurmd instances on one machine, which is also the machine slurmctld runs on. I have a local munged running as user munge. slurmd and slurmctld…
lukas
  • 11
  • 2
1
vote
0 answers

Slurm - GPU enforcement with cgroups

I am running Slurm 19.05 on a single machine (Ubuntu 18.04) for scheduling GPU tasks. However, I am having trouble setting up GPU enforcement with cgroups. If I set ConstrainDevices=yes in my cgroup.conf file, TensorFlow is not able to access my…
Jonas
  • 11
  • 1
1
vote
1 answer

Ubuntu 18.10 and modify installed package - OpenMPI

I've installed openmpi-bin (OpenMPI 3.1) on Ubuntu 18.10. I also run Slurm on the same machine and would like to recompile or reconfigure my OpenMPI installation to support Slurm. If one installs OpenMPI from source, there is a setting…
Paer
  • 21
  • 3
1
vote
0 answers

job state=failed reason=nonzero exit code, SLURM

I'm new to Slurm and have been trying to run a simple job. I'm running Slurm on top of a VM. Here's my…
Ash Bougui
  • 11
  • 3
1
vote
0 answers

Ansys parallel job on Slurm cluster stuck without error or exit message

I am working on a Slurm cluster, running Ansys (V18.2) jobs in parallel. Large jobs (meaning large solver files) often get stuck without an error or exit message; the jobs keep running until the timeout is reached. Due to large job size,…
Anatol
  • 11
  • 1
1
vote
1 answer

Ansys Remote Solver with SLURM cluster

I am trying to connect Ansys running on CentOS 7 to our HPC cluster, which uses SLURM as a scheduler. I have looked into all the configuration files I could think of. I even wrote my own custom hps_commands_SLURM.xml file. I get the…
Shahan M
  • 121
  • 7
1
vote
1 answer

SLURM configuration: cons_res with CR_Core either cannot allocate resource or jobs end up in CG status

I am new to SLURM. I am trying to configure Slurm on a new cluster. I have 4 nodes, each with 14 cores. I want to share nodes so that every core can run jobs independently (i.e., node01 can have 14 independent serial jobs going on at the same…
Somesh
  • 11
  • 2