I am working on a Slurm cluster, running Ansys (V18.2) jobs in parallel. Large jobs (meaning large solver files) often get stuck without any error or exit message; the jobs simply keep running until the time limit is reached. Because of the job size, the solver files are not kept in-core (in RAM) or on the node's scratch SSD, but are written to the cluster's /data storage. There I can clearly see when a job is stuck, because the "date modified" timestamps of the solver files stop changing.
Typical error messages I have seen in the past were "node fail" or undefined exit messages that I attributed to running out of memory, but those are not occurring now.
Strangely, when I rerun the same job, the hang may occur at a different point in the solution, or (if I am lucky) not at all.
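Since the stall is only visible as a frozen "date modified" timestamp, I currently check this by hand. Below is a minimal watchdog sketch of that check (my own helper, not part of Ansys or Slurm; the file pattern and the 30 min threshold are placeholders):

```python
# Treat a run as stalled when none of its solver files on /data has been
# modified for a while. Pattern and threshold are placeholders.
import glob
import os
import time

STALL_LIMIT_S = 30 * 60                       # assumed threshold: 30 min without writes
SOLVER_FILE_GLOB = "/data/myproject/run01/*"  # placeholder path to the run's solver files

def newest_mtime(pattern):
    """Most recent modification time among the files matching the pattern."""
    files = glob.glob(pattern)
    return max((os.path.getmtime(f) for f in files), default=0.0)

def is_stalled(pattern, limit_s=STALL_LIMIT_S):
    """True if matching files exist but none has been written within limit_s seconds."""
    last_write = newest_mtime(pattern)
    return last_write > 0.0 and (time.time() - last_write) > limit_s

if __name__ == "__main__":
    if is_stalled(SOLVER_FILE_GLOB):
        print(f"Run looks stalled: no solver file writes for {STALL_LIMIT_S // 60} min")
```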
What I tried so far:
- Reducing the number of requested CPUs somehow increases the likelihood that the job completes, but because of the maximum job time I need the parallelization
- Switching MPI types (Intel MPI, Platform MPI), with no difference
- Using a dedicated storage partition (no significant difference)
- Requesting in-core vs. out-of-core mode (the solver always switches to out-of-core anyway)
I would be glad for any advice on how to avoid the pointless computational effort of rerunning jobs several times, which also costs our project a lot of time.
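To at least cut the wasted wall time when a hang is detected, I am considering automating the cancel-and-resubmit step with a reduced core count, roughly along these lines (job ID, script path and core count are placeholders; I assume the batch script requests cores via "#SBATCH --ntasks", which the sbatch command line overrides):

```python
# Cancel the stalled Slurm job, then resubmit its batch script with fewer cores.
# The --ntasks option on the command line takes precedence over the script's directive.
import subprocess

def cancel_and_resubmit(job_id, job_script, reduced_cores):
    """Cancel a stalled Slurm job and resubmit the same batch script with fewer cores."""
    subprocess.run(["scancel", str(job_id)], check=True)
    subprocess.run(["sbatch", f"--ntasks={reduced_cores}", job_script], check=True)

# Example call with placeholder values:
# cancel_and_resubmit(123456, "/home/me/run_ansys.sbatch", reduced_cores=8)
```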
PS: Much smaller jobs (e.g. with 3 times fewer DOFs) never show this problem, and for them I can use the full number of cores per node, which is also the maximum allowed by my Ansys licence (16 cores).