
I submitted some jobs using sbatch and later canceled some of them with scancel. However, they are stuck in state CG and I cannot remove them from my list.

Is there any way to get rid of those CG jobs? Sadly, I'm not the administrator of the cluster, nor do I have the root password.

Iago Carvalho

2 Answers


Killing the slurmstepd process on the first node that your job occupies should work. This process runs under your user, so in principle killing it shouldn't require special privileges.

Be careful not to kill the slurmstepd of another of your jobs that may be running on the same node. You can probably tell them apart by their start times.
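A minimal sketch of the steps above. The node hostname (node001) is hypothetical; substitute the first node listed for your job in squeue:

```shell
# Log in to the job's first allocated node (hypothetical hostname):
#
#   ssh node001
#
# On that node, list your own slurmstepd processes with their start times,
# so that steps belonging to different jobs can be told apart:
ps -u "$USER" -o pid=,lstart=,comm= | awk '$NF == "slurmstepd"'
# Pick the PID whose start time matches the stuck job, then:
#
#   kill <PID>        # try SIGTERM first
#   kill -9 <PID>     # only if the process survives SIGTERM
```

The ps/awk line is safe to run even when no slurmstepd is present; it simply prints nothing.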

mikeraf
  • But I can't find the slurmstepd process. And what do you mean by "on the first node that your job occupies"? The submitting node, or the node that the sub-job is on? – dgg32 Nov 12 '19 at 16:02
  • I had jobs stuck in CG state after the head node and a compute node failed to communicate. I could not find a slurmstepd process on my compute node. On the node, issuing squeue gave the following error: slurm_load_jobs error: Unable to contact slurm controller (connect failure). – Kemin Zhou Jun 23 '22 at 20:29

Iago, thank you for creating this thread. I have seen the same issue and resolved it as follows: requeue, then release, then scancel.

[test@test02-scheduler ~]$ scontrol release 9
Job has already finished for job 9
slurm_suspend error: Job has already finished
[test@test02-scheduler ~]$ scontrol requeue 9
[test@test02-scheduler ~]$ scontrol release 9
[test@test02-scheduler ~]$
[test@test02-scheduler ~]$ squeue --long
Sun Feb 06 00:17:57 2022
         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
             9       hpc sleep.sh      test COMPLETI       0:00      5:00      1 test02-hpc-pg0-[1-3,5,9]
[test@test02-scheduler ~]$ squeue -s
     STEPID     NAME PARTITION     USER      TIME NODELIST
    9.batch    batch       hpc      test   1:22:24 test02-hpc-pg0-1
[test@test02-scheduler ~]$ scancel 9
[test@test02-scheduler ~]$ squeue -s
     STEPID     NAME PARTITION     USER      TIME NODELIST
    9.batch    batch       hpc      test   1:22:30 test02-hpc-pg0-1
[test@test02-scheduler ~]$ squeue -s
     STEPID     NAME PARTITION     USER      TIME NODELIST
    9.batch    batch       hpc      test   1:22:32 test02-hpc-pg0-1
[test@test02-scheduler ~]$ squeue --long
Sun Feb 06 00:18:12 2022
         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
             9       hpc sleep.sh      test COMPLETI       0:21      5:00      1 test02-hpc-pg0-[1-3,5,9]
[test@test02-scheduler ~]$
[test@test02-scheduler ~]$ squeue --long
Sun Feb 06 00:21:04 2022
         JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
[test@test02-scheduler ~]$
Sun Feb  6 00:22:32 UTC 2022