Product Launch

anonymous

4 points

18 comments

Postedabout 2 months agoActiveabout 1 month ago

Show HN: Virtual SLURM HPC cluster in a Docker Compose

github.com

HPCDockerSLURMopen source

Discussion (18 comments)

Showing 18 comments

ZeroCool2u

about 2 months ago

2 replies

Interesting, I've been dealing with replacing a few on-prem HPC clusters lately. One of the things we've been looking at is OpenOnDemand. How does this compare to that? Is this primarily targeted at cluster development or can I really just make an arbitrarily large production HPC cluster with it?

formerly_proven

about 2 months ago

1 reply

ondemand is "just" a web frontend for using a traditional HPC cluster, which of course means its architecture is deeply cursed: https://osc.github.io/ood-documentation/latest/architecture....

linksnapzz

about 2 months ago

1 reply

Yeah, OOD is a giant RoR webapp; you need to be running it on a node that can submit to your cluster.

brightbeige

about 2 months ago

RoR = Ruby on Rails

mbreese

about 2 months ago

Don’t you still need the HPC cluster with OpenOnDemand? I thought it was a web interface to use HPC resources.

robot-wrangler

about 2 months ago

1 reply

Thanks for this! I went looking for something similar a while back and found nothing much.

Even surprisingly popular distributed-systems stuff is always really bad about "follow this 10 step copy/paste to deploy to EKS" but that's obnoxious. In the first place, people want to see something basically working on small scale first to check if it's abandonware. But even after that.. local prototyping without first setting up multiple repositories, then shipping multiple modified container images, and already having CI/CD for all of the above is really nice to have.

throw0101c

about 2 months ago

> I'm guessing that the alternative to this tidy modern repository is a gigantic broken pile of ansible/chef/puppet that hasn't been touched in 10 years.

Not quite sure how well you looked, but there are a bunch of deployment systems for HPC, Ansible or otherwise:

* https://old.reddit.com/r/HPC/comments/1p4a3fq/what_imaging_s...

* My comment listing a bunch: https://news.ycombinator.com/item?id=46037792

janeway

about 2 months ago

1 reply

Cool!

I have worked 100% in 3 comparable systems over the past 10 years. Can you access with ssh?

I find it super fluid to work on the HPC directly to develop methods for huge datasets by using vim to code and tmux for sessions. I focus on printing detailed log files constantly with lots of debugs and an automated monitoring script to print those logs in realtime; a mixture of .out .err and log.txt.

ciclotrone

about 1 month ago

You can access via SSH either with password or with keys.

Our reference cluster has long queuing times during busy hours and requires 2FA for access, so we had extra incentives to have a self-contained solution to run on our development machines.

IshKebab

about 2 months ago

2 replies

I wish there was a sane modern alternative to SLURM.

Futile hope though. My company is still using SGE.

siliconpotato

about 2 months ago

1 reply

Slurm is the modern alternative. We are using SGE too and slurm feels like the future.

IshKebab

about 1 month ago

1 reply

Yeah unfortunately it still sucks. Actually to be fair it's probably fine for its intended use case: researchers interactively running one-off batch jobs on a university HPC machine.

But I work in silicon and every company I've worked in uses SGE/SLURM for automated testing. SLURM absolutely sucks for that. They really want you to submit jobs as bash scripts, they can't handle a large number of jobs without using janky array jobs, submitting a job and waiting for it to finish is kind of janky. Getting the output anywhere except a file is difficult. Nesting jobs is super awkward and buggy. All the command line tools feel like they're from the 80s - by default the column widths are like 5 characters (not an exaggeration).

We even had an issue that SLURM uses 4 ports per job for the duration of the job, so you can't actually run more than a few thousand jobs simultaneously because the controller runs out of TCP ports!

I don't think it would actually be that hard to write a modern replacement. The difficult bit is dealing with cgroups. I won't hold my breath for anyone in the silicon industry to write it though. Hardware engineers can't write software for shit.

siliconpotato

about 1 month ago

1 reply

> We even had an issue that SLURM uses 4 ports per job for the duration of the job, so you can't actually run more than a few thousand jobs simultaneously because the controller runs out of TCP ports!

That sounds concerning. Do you have a link to a bug report for this please? Is the tcp port problem on the compute node side or the controller side?

IshKebab

about 1 month ago

The controller side. I don't think it is a bug; that's just how they designed it.

They want you to use array jobs for large jobs, or submit jobs in a fire-and-forget way.

linksnapzz

about 1 month ago

You can pay for LSF; which is older than SLURM, but IMHO more reliable under load....

throw0101c

about 2 months ago

The Digital Research Alliance of Canada (formerly Compute Canada) has Terrafrom recipes that can talk to various cloud APIs that do something similar:

* https://github.com/ComputeCanada/magic_castle

They link to various other projects that do cloud-y-HPC:

* AWS ParallelCluster [AWS]

* Cluster in the cloud [AWS, GCP, Oracle]

* Elasticluster [AWS, GCP, OpenStack]

* Google Cluster Toolkit [GCP]

* illume-v2 [OpenStack]

* NVIDIA DeepOps [Ansible playbooks only]

* StackHPC Ansible Role OpenHPC [Ansible Role for OpenStack]

igleria

about 2 months ago

I wish I had this for my master's thesis! it was a puny 64 core node, but nevertheless...