One of my 5 Slurm jobs is still unfinished after running for 19 hours, and I'm concerned that it will hit its walltime limit before it finishes. I'm not the admin and it's the weekend, so I would like to try a feature I discovered recently, shown in this example:
$ salloc -N4 -C knl,snc4,flat --dependency=expand:$SLURM_JOB_ID bash
salloc: Granted job allocation 65543
However, when I try this, I get an error:
$ salloc --qos=1wk --dependency=expand:14602965
salloc: error: Job submit/allocate failed: Job dependency problem
What am I doing wrong?
UPDATE:
The command did execute successfully when I changed only the walltime (-t) and did not pass --qos:
$ salloc --job-name freebayes.commands3-extend -t 7-00:00:00 --mem 387000 --dependency=expand:14602965
salloc: Granted job allocation 14604022
One thing I noticed, however, is that salloc stays running as a process in my current shell:
$ ps
PID    TTY    TIME      CMD
43140  pts/1  00:00:00  tcsh
43284  pts/1  00:00:00  salloc
43286  pts/1  00:00:00  tcsh
43321  pts/1  00:00:00  ps
So I assumed I needed to run it under nohup (or inside screen/tmux) so that I could log out. I scancelled the job, killed the salloc process, and redid it with nohup. However, without the ability to change the QOS, I anticipate that my job will still get killed when it reaches its time limit. When I supplied both -t and --qos, I got the same "Job dependency problem" error as before. My suspicion is that because I did not explicitly supply --qos when I submitted the original job (it got the default QOS, "1day"), I can't use --dependency=expand to modify it.
My supplemental question is: do I need to use screen/tmux/nohup when I try to modify the job this way?
Also, is there any information in this squeue output that tells me whether or not the job will actually be extended?
JOBID     PARTITION  MIN_MEMOR  TIME      CPUS  PRIORITY      START_TIME           QOS   TIME_LIMIT  NAME
14602965  main       387000     20:05:37  3     0.0000038153  2018-11-02T13:36:30  1day  1-00:00:00  freebayes.commands3
14604022  main       387000     2:53      3     0.0000018135  2018-11-03T09:39:14  1day  3:57:00     freebayes.commands3-extend
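To make the last question concrete, here is a rough sketch that turns the TIME and TIME_LIMIT columns above into seconds remaining for each job. It is written for POSIX sh with awk (not tcsh, so it would need to be run via sh), the function names to_seconds and remaining are just names I made up, and it assumes squeue prints durations as [D-][HH:]MM:SS, which matches the rows above but may not cover every format.

```shell
#!/bin/sh
# Hypothetical helper: convert a squeue-style duration ([D-][HH:]MM:SS) to seconds.
# Assumes the format seen in the squeue output above; other formats are not handled.
to_seconds() {
    echo "$1" | awk '{
        days = 0; rest = $0
        # Peel off the optional "D-" day prefix.
        if (index(rest, "-") > 0) { split(rest, d, "-"); days = d[1]; rest = d[2] }
        # Accumulate the remaining colon-separated fields in base 60.
        n = split(rest, f, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + f[i]
        print days * 86400 + secs
    }'
}

# Hypothetical helper: seconds left = TIME_LIMIT - TIME.
# usage: remaining <TIME> <TIME_LIMIT>
remaining() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Values copied from the two squeue rows above.
echo "14602965: $(remaining 20:05:37 1-00:00:00) s left"   # prints "14602965: 14063 s left"
echo "14604022: $(remaining 2:53 3:57:00) s left"          # prints "14604022: 14047 s left"
```

If this arithmetic is right, the two jobs are due to reach their limits within about 16 seconds of each other (START_TIME plus TIME_LIMIT is roughly 13:36 on 2018-11-03 for both). That would suggest the 7-00:00:00 I requested for the second allocation was cut down to the original job's remaining time rather than extending it, though I can't confirm that is what Slurm actually did.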