One of my 5 Slurm jobs is still unfinished. It has been running for 19 hours, and I'm concerned that it will hit its walltime limit before it finishes. I'm not the admin and it's the weekend, so I would like to try a feature I discovered recently, shown in this example:

$ salloc -N4 -C knl,snc4,flat --dependency=expand:$SLURM_JOB_ID bash
salloc: Granted job allocation 65543

However, when I try this, I get an error:

$ salloc --qos=1wk --dependency=expand:14602965
salloc: error: Job submit/allocate failed: Job dependency problem

What am I doing wrong?

UPDATE:

I was able to get the command to execute successfully when I tried changing only the walltime:

$ salloc --job-name freebayes.commands3-extend -t 7-00:00:00 --mem 387000 --dependency=expand:14602965
salloc: Granted job allocation 14604022

One thing I noticed however is that salloc is a running process in my current shell:

$ ps
  PID TTY          TIME CMD
43140 pts/1    00:00:00 tcsh
43284 pts/1    00:00:00 salloc
43286 pts/1    00:00:00 tcsh
43321 pts/1    00:00:00 ps

So I assumed I needed to run it with nohup (or inside screen/tmux) so that I could log out. I ran scancel on the new job, killed the salloc process, and redid it with nohup. However, without the ability to change the QOS, I anticipate that my job will still get killed. I had also tried with both -t and --qos, but got the same error. My suspicion is that, because I did not explicitly supply --qos when I submitted the original job, I can't use --dependency=expand to modify it. The original job used the default QOS ("1day").

My supplemental question is: do I need to use screen/tmux/nohup when I try to modify the job?

Also, is there anything in this squeue output that tells me whether or not the job will actually be extended?

   JOBID PARTITION MIN_MEMOR         TIME CPUS     PRIORITY          START_TIME  QOS   TIME_LIMIT NAME
14602965      main    387000     20:05:37    3 0.0000038153 2018-11-02T13:36:30 1day   1-00:00:00 freebayes.commands3
14604022      main    387000         2:53    3 0.0000018135 2018-11-03T09:39:14 1day      3:57:00 freebayes.commands3-extend
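
For reference, the remaining walltime can be worked out from the TIME and TIME_LIMIT columns above. The following is only a minimal bash sketch: the helper name to_seconds is hypothetical (it is not a Slurm command), and it assumes the durations appear as [D-]HH:MM:SS or MM:SS, as they do in this output.

```shell
#!/usr/bin/env bash
# to_seconds is a hypothetical helper, not part of Slurm.
# It converts a squeue-style duration ([D-]HH:MM:SS or MM:SS) to seconds.
to_seconds() {
    local t=$1 days=0 h=0 m=0 s=0
    # Split off an optional leading "D-" day count.
    if [[ $t == *-* ]]; then
        days=${t%%-*}
        t=${t#*-}
    fi
    # Split the remainder on ":" into its fields.
    local IFS=:
    local parts=($t)
    case ${#parts[@]} in
        3) h=${parts[0]} m=${parts[1]} s=${parts[2]} ;;
        2) m=${parts[0]} s=${parts[1]} ;;
        *) echo "unsupported duration: $1" >&2; return 1 ;;
    esac
    # The 10# prefix forces base 10 so that "08" and "09" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

limit=$(to_seconds "1-00:00:00")   # TIME_LIMIT of job 14602965
used=$(to_seconds "20:05:37")      # TIME of job 14602965
left=$(( limit - used ))
printf 'remaining: %d s (%02d:%02d:%02d)\n' \
    "$left" $(( left / 3600 )) $(( left % 3600 / 60 )) $(( left % 60 ))
# prints: remaining: 14063 s (03:54:23)
```

Applying the same arithmetic to job 14604022 (TIME_LIMIT 3:57:00, TIME 2:53) gives about 03:54:07 left, which is within seconds of the original job's remaining time. That may mean the new allocation was capped at the original job's remaining walltime, but I can't confirm that from this output alone.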
hepcat72