Discussion
Christian Meesters
@rupdecat@fediscience.org · 2 weeks ago

@jannem

- Arbitrarily limiting the number of characters you can pass to `--wrap`.
- Overloading `sbatch` in various ways.
- Deviating from the documented GRES behaviour (read: every task/CPU combination regarding GPU reservation); see the sketch at the end of this post.

There is more. But I will eventually make a presentation about it.

Also frequent, albeit not related to code changes: setting up several separate physical clusters instead of one cluster with partitions for the different workloads.
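To make the `--wrap` and GRES points concrete, here is a minimal sketch of the kind of submissions they concern, assuming a stock Slurm installation with `sbatch` on the PATH; the command, script name, and resource values are placeholders, not our site's configuration:

```python
import subprocess

# Submit a one-line command via --wrap; a site-patched Slurm may cap the
# length of the string that --wrap accepts (the command below is a placeholder).
wrap_cmd = "srun hostname"
subprocess.run(
    ["sbatch", "--job-name=wrap-demo", f"--wrap={wrap_cmd}"],
    check=True,
)

# GRES/GPU reservation: the documented flags combine tasks, CPUs and GPUs.
# Whether e.g. 2 tasks with 2 GPUs yields the expected per-task binding is
# exactly the behaviour a site-local patch can change.
subprocess.run(
    [
        "sbatch",
        "--ntasks=2",
        "--cpus-per-task=4",
        "--gres=gpu:2",          # alternatively: --gpus-per-task=1
        "demo_job.sh",           # hypothetical batch script
    ],
    check=True,
)
```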

Alan Sill
@AlanSill@mast.hpc.social replied · 2 weeks ago

@rupdecat @jannem The #HPC community lost a lot when Slurm failed to implement the DRMAA standard. The API and the code that underpins it are messy as a result. But it was popular and free, so it gained a strong foothold. (Grid Engine was the definitive implementation, and for a while all schedulers supported DRMAA and were interoperable for codes that used it.)
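For readers who have not used DRMAA: the point of the standard was a scheduler-neutral submission API. A minimal sketch using the `drmaa` Python binding, assuming a scheduler whose DRMAA C library is installed and discoverable; the command is a placeholder:

```python
import drmaa  # the drmaa-python binding; needs the scheduler's libdrmaa

# The same code runs unchanged against Grid Engine, PBS Pro, or any other
# scheduler with a DRMAA library -- that portability is what the standard bought.
with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/bin/hostname"   # placeholder command
    jt.args = []
    job_id = session.runJob(jt)
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print(f"job {job_id} exited with status {info.exitStatus}")
    session.deleteJobTemplate(jt)
```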

There is support for multiple clusters within a given Slurm instance, but the clusters have to be configured for these features in advance.
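From the submit side, that multi-cluster support looks roughly like this, assuming the clusters have been registered in a shared slurmdbd; the cluster and script names are placeholders:

```python
import subprocess

# With multi-cluster support configured, --clusters/-M routes a submission to
# a named cluster; listing several lets Slurm pick the one expected to start
# the job earliest. Names below are placeholders.
subprocess.run(
    ["sbatch", "--clusters=gpu_cluster", "train.sh"],
    check=True,
)
subprocess.run(
    ["squeue", "--clusters=gpu_cluster,cpu_cluster"],  # query both clusters
    check=True,
)
```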

Christian Meesters
@rupdecat@fediscience.org replied · 2 weeks ago

@AlanSill

Indeed.

Even though I started to favour SLURM over LSF (and HTCondor), the transition to it (a few years ago) was a nightmare. And losing all compatibility between the systems still has so many repercussions ... too many for a single thread.

We had to remove SLURM's PBS compatibility layer because users held on to it, even though it was never feature complete, nor really usable.

@jannem
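A toy illustration (not Slurm's actual contributed Torque wrappers) of why such a compatibility layer is hard to make feature complete: every PBS option needs an explicit, often lossy, translation to an `sbatch` option, and anything unmapped silently falls through:

```python
import subprocess

# Hypothetical, heavily reduced qsub-to-sbatch option map for illustration only.
PBS_TO_SBATCH = {
    "-N": "--job-name",
    "-q": "--partition",
    "-o": "--output",
    "-e": "--error",
}

def qsub(argv):
    """Translate a small subset of qsub options and hand off to sbatch."""
    out = ["sbatch"]
    it = iter(argv)
    for opt in it:
        if opt in PBS_TO_SBATCH:
            out.append(f"{PBS_TO_SBATCH[opt]}={next(it)}")
        else:
            out.append(opt)  # unmapped options pass through -- or break the job
    return subprocess.run(out, check=True)
```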

Janne Moren
@jannem@fosstodon.org replied · 2 weeks ago

@rupdecat
Can't comment on the other ones without real examples, but the last one is unavoidable, I think. Different clusters are different, and it may not be feasible to have a single Slurm instance covering them all.

Especially when, as in our case, the use cases (GPU vs. CPU) and the upgrade cadences are completely separate. You can still automate running jobs on both, just not through Slurm alone.
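One way such automation can look: a small dispatcher that picks the right login node per job and submits over SSH, with each cluster keeping its own Slurm version and configuration. Hostnames, script names, and the GPU heuristic here are hypothetical:

```python
import subprocess

# Hypothetical login nodes for two independently managed Slurm instances.
LOGIN_NODES = {
    "gpu": "gpu-cluster.example.org",
    "cpu": "cpu-cluster.example.org",
}

def submit(script_path: str, needs_gpu: bool) -> None:
    """Submit a batch script to whichever cluster matches the job type."""
    host = LOGIN_NODES["gpu" if needs_gpu else "cpu"]
    # Only the submission step is shared; the script must exist on the
    # remote side, and each cluster schedules it with its own Slurm.
    subprocess.run(["ssh", host, "sbatch", script_path], check=True)

submit("train_model.sh", needs_gpu=True)   # placeholder script names
submit("preprocess.sh", needs_gpu=False)
```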
