# How to Test Slurm Locally with slurm-docker-cluster
This guide explains how to spin up a small Slurm cluster locally with
[giovtorres/slurm-docker-cluster](https://github.com/giovtorres/slurm-docker-cluster)
and use it to smoke-test the Slurm support in qcsc-prefect.
The intended use case is:
- one Docker host
- one Slurm controller container
- multiple Slurm compute-node containers (`c1`, `c2`, `c3`, ...)
- testing `sbatch`, `sacct`, `scancel`, and `srun` behavior locally
> [!IMPORTANT]
> This setup is "single Docker host, multiple Slurm nodes". It is very useful for local development, but it is not the same as a real multi-machine cluster.
## What this guide does
This guide covers:
- starting a local Slurm cluster
- copying `qcsc-prefect` into the Slurm controller container
- installing the local Python packages there
- running a small `run_slurm_job()` smoke test
- verifying that a multi-node job actually ran on multiple worker containers
## Prerequisites
- Docker
- Docker Compose (`docker compose`)
- `make`
- a local checkout of this repository
The upstream project's README covers its quick start and scaling options in more detail.
## Step 1. Start a local Slurm cluster
Clone the upstream repository:
```bash
git clone https://github.com/giovtorres/slurm-docker-cluster.git
cd slurm-docker-cluster
cp .env.example .env
```
Set the CPU worker count to 3 so that multi-node jobs have multiple targets:
```bash
perl -0pi -e 's/^CPU_WORKER_COUNT=.*/CPU_WORKER_COUNT=3/m' .env
```
Pull the prebuilt image and tag it with the version expected by the compose file:
```bash
docker pull giovtorres/slurm-docker-cluster:latest
docker tag giovtorres/slurm-docker-cluster:latest slurm-docker-cluster:25.11.4
```
Start the cluster:
```bash
make up
make status
```
Open a shell in the controller:
```bash
make shell
```
Inside the controller, confirm that Slurm sees the worker nodes:
```bash
sinfo
```
Example output:
```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00      3   idle c[1-3]
```
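If you want to check the worker count programmatically rather than by eye, a small sketch like the following can count idle nodes from `sinfo` output. The `-h -o "%D %t"` flags are standard `sinfo` options (no header, "node count" plus "state" per line); the parsing below assumes exactly that two-column format:

```python
def idle_node_count(sinfo_output: str) -> int:
    """Sum the node counts of all lines whose state is 'idle'.

    Expects the output of: sinfo -h -o "%D %t"
    (one line per partition/state group, e.g. "3 idle").
    """
    total = 0
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == "idle":
            total += int(parts[0])
    return total

# Matches the example sinfo output above: 3 idle nodes.
sample = "3 idle\n"
print(idle_node_count(sample))  # 3
```

Inside the controller you would pipe real output in, e.g. `sinfo -h -o "%D %t" | python3 check_idle.py` with the script reading from stdin.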
## Step 2. Copy qcsc-prefect into the controller
From your host machine, copy this repository into the slurmctld container:
```bash
docker cp /Users/hitomi/Project/qcsc-prefect/. slurmctld:/data/qcsc-prefect
```
If your local checkout lives elsewhere, replace the source path accordingly.
## Step 3. Install qcsc-prefect in the controller
Open a shell in the controller if you are not already inside it:
```bash
docker exec -it slurmctld bash
```
Move to the copied repository:
```bash
cd /data/qcsc-prefect
```
The upstream image includes `python3.12`, but `pip` may not be initialized yet.
Enable it, create a virtual environment, and install the local packages:
```bash
python3 -m ensurepip --upgrade
python3 -m venv .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install \
  -e packages/qcsc-prefect-core \
  -e packages/qcsc-prefect-adapters \
  -e packages/qcsc-prefect-blocks \
  -e packages/qcsc-prefect-executor \
  pytest
```
## Step 4. Run a Slurm smoke test
Run the following script inside the controller:
```bash
cd /data/qcsc-prefect
. .venv/bin/activate

python3 - <<'PY'
import asyncio
from pathlib import Path

from qcsc_prefect_adapters.slurm.builder import SlurmJobRequest
from qcsc_prefect_core.models.execution_profile import ExecutionProfile
from qcsc_prefect_executor.slurm import run as run_mod


class Logger:
    def info(self, msg):
        print(msg)

    def error(self, msg):
        print(msg)


async def fake_create_table_artifact(*, table, key):
    print("artifact key:", key)
    print(table)


# Patch out the Prefect-only dependencies so the executor runs standalone.
run_mod.get_run_logger = lambda: Logger()
run_mod.create_table_artifact = fake_create_table_artifact

work_dir = Path("/data/qcsc-prefect/.slurm-test")
work_dir.mkdir(parents=True, exist_ok=True)

exe = work_dir / "hello.sh"
exe.write_text("#!/bin/sh\necho slurm-integration-ok\nhostname\n")
exe.chmod(0o755)

profile = ExecutionProfile(
    command_key="slurm-integration",
    num_nodes=2,
    mpiprocs=1,
    launcher="srun",
    walltime="00:05:00",
)

req = SlurmJobRequest(
    partition="normal",
    account=None,
    executable=str(exe),
)

result = asyncio.run(
    run_mod.run_slurm_job(
        work_dir=work_dir,
        script_filename="integration_job.slurm",
        exec_profile=profile,
        req=req,
        watch_poll_interval=2.0,
        timeout_seconds=120,
        metrics_artifact_key="slurm-integration-metrics",
    )
)
print(result)
PY
```
What this test checks:
- `sbatch` can submit a job
- `sacct` can observe the final job state
- `srun` is used as the in-job launcher
- stdout and stderr files are collected
- the job can run across multiple compute-node containers
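For orientation, the batch script generated for this profile would plausibly look something like the sketch below. This is an illustrative reconstruction only, not the actual output of the `SlurmJobRequest` builder; the exact directives and the script-composition API depend on the real implementation:

```python
def sketch_sbatch_script(num_nodes: int, walltime: str,
                         partition: str, executable: str) -> str:
    """Illustrative only: a minimal sbatch script for an srun-launched job.

    The real builder in qcsc-prefect-adapters may emit different
    directives; this just shows the general shape.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={num_nodes}",
        f"#SBATCH --time={walltime}",
        "#SBATCH --output=output.out",
        "#SBATCH --error=output.err",
        "",
        # One task per node, so hostname prints one line per worker.
        f"srun --ntasks={num_nodes} {executable}",
    ]
    return "\n".join(lines)

print(sketch_sbatch_script(2, "00:05:00", "normal",
                           "/data/qcsc-prefect/.slurm-test/hello.sh"))
```

With `--nodes=2` and one task per node, `srun` launches the executable on two different workers, which is what makes the two-hostname check in Step 5 meaningful.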
## Step 5. Verify the output
Inspect the generated files:
```bash
cd /data/qcsc-prefect/.slurm-test
ls -l
cat output.out
cat output.err
```
The expected `output.out` content includes:

- `slurm-integration-ok`
- two hostnames, such as `c1` and `c2`
If you requested `num_nodes=2`, seeing two different worker names is the simplest
proof that the job really executed across multiple Slurm nodes.
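The same check can be automated. A minimal sketch that asserts the marker is present and that at least two distinct hostnames appear (it assumes the layout shown above: marker lines plus one hostname per task, nothing else in the file):

```python
def check_multi_node_output(text: str,
                            marker: str = "slurm-integration-ok") -> bool:
    """True if the marker is present and the remaining non-empty lines
    contain at least two distinct hostnames."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    hostnames = {ln for ln in lines if ln != marker}
    return marker in lines and len(hostnames) >= 2

# Inside the controller, read the real file instead:
#   text = open("/data/qcsc-prefect/.slurm-test/output.out").read()
sample = "slurm-integration-ok\nc1\nslurm-integration-ok\nc2\n"
print(check_multi_node_output(sample))  # True
```

A single-node run (both hostname lines identical) makes the check return `False`, which is exactly the failure you want to catch here.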
You can also inspect Slurm directly:
```bash
squeue
sacct
```
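`sacct` output is easier to inspect mechanically in parsable form: `sacct --parsable2 --format=JobID,JobName,State,NodeList` emits pipe-delimited rows with a header line (these are standard `sacct` flags). A small parser sketch for that format:

```python
def parse_sacct(output: str) -> list[dict]:
    """Parse `sacct --parsable2` output: pipe-delimited rows,
    first row is the header."""
    lines = output.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, row.split("|"))) for row in lines[1:]]

# Sample row shaped like a completed two-node job (values illustrative).
sample = (
    "JobID|JobName|State|NodeList\n"
    "1|integration_job|COMPLETED|c[1-2]\n"
)
rows = parse_sacct(sample)
print(rows[0]["State"], rows[0]["NodeList"])  # COMPLETED c[1-2]
```

Checking that `State` is `COMPLETED` and `NodeList` spans multiple workers gives a second, accounting-side confirmation of the multi-node run.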
## Step 6. Scale the worker count if needed
From the host machine:
```bash
cd slurm-docker-cluster
make scale-cpu-workers N=4
make status
```
Then rerun the smoke test with a larger `num_nodes` value if needed.
## Optional: rerun local unit tests inside the controller
If you want to run the current local Slurm unit tests inside the controller:
```bash
cd /data/qcsc-prefect
. .venv/bin/activate
pytest \
  packages/qcsc-prefect-adapters/tests/test_slurm_builder.py \
  packages/qcsc-prefect-adapters/tests/test_slurm_runtime.py \
  packages/qcsc-prefect-executor/tests/test_run_slurm_job_local.py
```
These tests are still mocked tests. They are useful for basic regression checking, but they do not replace the real Slurm smoke test above.
## Future improvement
The current repository does not yet have a real Slurm integration test file equivalent to:
- `packages/qcsc-prefect-executor/tests/test_run_miyabi_job_miyabi_integration.py`
- `packages/qcsc-prefect-executor/tests/test_run_fugaku_job_fugaku_integration.py`
If you want repeatable CI-style testing later, the next step is to add:
`packages/qcsc-prefect-executor/tests/test_run_slurm_job_slurm_integration.py`
and run it inside the slurmctld container or a dedicated Slurm-enabled test environment.