Skip to main content

Distributed Batch Compute

One logical job, N indexed shards. A Distributed Run fans your container out across the platform — each shard self-selects its partition — and rolls every shard's status back up into one aggregate. The single-shard Run is just the N=1 case.

When to reach for a Distributed Run

A single Run is one shard. When your workload is embarrassingly parallel — walk-forward training folds, a cross-sectional book over thousands of symbols, ETL partitions, Monte-Carlo batches, render/encode fan-out — set parallelism and the platform owns the fan-out: sharding, gang or opportunistic admission, per-shard status, and the concurrency budget. You no longer hand-roll Promise.all orchestration in your own app.

Indexed Shards

Each shard gets a stable SYLPHX_SHARD_INDEX (0..N-1) + SYLPHX_SHARD_COUNT — self-select your partition, no coordinator needed

Gang Scheduling

All shards start together or none do — a shard never waits on absent peers (all-or-nothing admission)

Aggregate + Drill-down

One rolled-up status (succeeded / partial / failed) plus per-shard exit codes and logs

Shared Result Volume

Mount one ReadWriteMany volume into every shard at a well-known path — the merge convention for per-shard outputs

Same Security Envelope

Distribution changes shard count, not isolation — sandboxing, image scanning, and scoped secrets are per-shard

runDistributed (replaces client-side Promise.all)

The old pattern fanned out N independent RunsClient.run() calls with Promise.all and merged the results in your app. runDistributed moves all of that into the platform: one call, one handle, one aggregate result. Each shard reads SYLPHX_SHARD_INDEX to pick its slice of the work.

train-distributed.ts
import { RunsClient, createClient } from '@sylphx/sdk'

const config = createClient(process.env.SYLPHX_URL!)

const folds = [0, 1, 2, 3, 4, 5, 6, 7, 8]

// One distributed job — the platform fans it out to 9 indexed shards.
const run = await RunsClient.runDistributed(config, {
  image: 'ghcr.io/acme/trainer:sha-abc123',
  // The container reads SYLPHX_SHARD_INDEX / SYLPHX_SHARD_COUNT to pick its fold.
  command: ['python', 'train.py'],
  parallelism: folds.length,
  machine: 'large',
  // Optional: use low-priority spare capacity and tolerate preemptions.
  gangScheduling: false,
  backoffLimitPerIndex: 12,
  // A ReadWriteMany volume mounted into every shard is the result-merge convention.
  volumeMounts: [{ volumeId: process.env.CACHE_VOLUME_ID!, mountPath: '/cache' }],
  timeoutSeconds: 7200,
})

// .wait() resolves once the aggregate reaches a terminal state
// (succeeded | failed | partial | cancelled | timed_out).
const status = await run.wait()

const shards = await run.shards()
const failures = shards.filter((s) => s.status === 'failed')
if (failures.length > 0) {
  console.error(`${failures.length}/${shards.length} shards failed`)
  // Drill into one shard's logs:
  const { stdout, stderr } = await run.logs({ shard: failures[0].index })
  console.error(stderr)
}
console.log('Aggregate status:', status.status, '— succeeded:', status.succeeded)

Shard self-selection

Inside your container, partition the work deterministically from the injected env vars — no coordinator process required for the embarrassingly-parallel case:
train.py
import os

shard = int(os.environ["SYLPHX_SHARD_INDEX"])   # 0 .. N-1
count = int(os.environ["SYLPHX_SHARD_COUNT"])   # N

symbols = load_universe()
my_symbols = symbols[shard::count]              # every Nth symbol, offset by shard
train_and_write(my_symbols, out=f"/cache/scores-{shard}.parquet")

CLI

The same primitive from the terminal. --parallelism turns a Run into a Distributed Run; --follow streams the aggregate + per-shard status table to completion.

terminal
# Fan out to 9 shards on large machines, follow the per-shard status table.
sylphx run \
  --image ghcr.io/acme/trainer:sha-abc123 \
  --parallelism 9 \
  --opportunistic \
  --backoff-limit-per-index 12 \
  --machine large \
  --timeout 7200 \
  --follow \
  -- python train.py

# Inspect a previously started run (aggregate + per-shard status).
sylphx runs get wrk_abc123

# Stream one shard's logs.
sylphx runs logs wrk_abc123 --shard 3

Configuration Reference

PropertyTypeDescription
parallelismnumberShard count, 1..256. Defaults to 1 (a single-shard Run). > 1 fans out to N indexed shards.
completionMode'Indexed' | 'NonIndexed'Defaults to Indexed when parallelism > 1, giving each shard a stable SYLPHX_SHARD_INDEX.
gangSchedulingbooleanDefaults to true when parallelism > 1 — all shards are admitted together or none, so a shard never waits on absent peers.
backoffLimitPerIndexnumberOptional 0..50 retry budget for opportunistic distributed runs only. Requires parallelism > 1 and gangScheduling false; omitted uses the platform default of 3.
machine'nano' | 'micro' | 'small' | 'standard' | 'large' | 'xlarge' | '2xlarge' | 'mem-2xlarge'Per-shard machine size — every shard runs on the same tier. Defaults to standard.
volumeMountsRunVolumeMount[]Mounted into every shard. A ReadWriteMany volume is the cross-shard result-merge convention.
timeoutSecondsnumberPer-shard hard deadline. Default 3600 (1h), max 86400 (24h).

Machine Tiers

Every shard runs on the machine size you pick. Larger tiers cost proportionally more per compute-second (see Billing). One shard OOMs above its RAM ceiling — shard wide rather than reaching for the largest single machine.

PropertyTypeDescription
nano0.1 vCPU · 256 MiB0.125× standard rate
micro0.25 vCPU · 512 MiB0.25× standard rate
small0.5 vCPU · 1 GiB0.5× standard rate
standard1 vCPU · 2 GiB1× — the reference tier
large2 vCPU · 4 GiB2× standard rate
xlarge4 vCPU · 8 GiB4× standard rate
2xlarge8 vCPU · 16 GiB8× standard rate
mem-2xlarge4 vCPU · 32 GiBMemory-optimized — 4× standard rate

GPU

GPU tiers are reserved in the API but admission-blocked until a dedicated GPU node pool lands. The validated batch workloads here are CPU tree models — GPU is not required.

Quotas

Distributed Runs are governed by a two-dimensional, plan-tier budget — a single big job is never throttled by a global run count, and no tenant can monopolize the pool:

PropertyTypeDescription
Per-job parallelismPlan ceilingThe maximum shards one job may request. Free is single-shard only (1); Pro / Team / Enterprise unlock the full 256-shard contract maximum.
Org concurrent shardsPlan budgetThe maximum shards running at once across ALL your Runs. A 100-shard job consumes 100 of this budget; a single-shard run consumes 1.

429 when a budget is exceeded

Requesting more shards than your plan allows (per-job) or starting a run that would push your org over its concurrent-shard budget returns 429 Too Many Requests with an actionable message. Reduce parallelism, wait for running shards to finish, or upgrade your plan.

Billing Model

Distributed Runs meter under the same compute-second model as single-shard Runs — there is no unmetered capability. Usage is the sum across shards, weighted by machine size:

billing
compute_seconds = wall_clock_seconds × shard_count × machine_units

# machine_units normalizes a tier's vCPU to the standard tier (1 vCPU = 1.0):
#   standard = 1.0   large = 2.0   xlarge = 4.0   small = 0.5   nano = 0.125

Cost example

9 large shards × 600s each = 9 × 600 × 2.0 = 10,800 standard compute-seconds. A single-shard standard run for 600s is 600 compute-seconds — identical to a classic Run. Free-tier credits apply automatically before usage-based billing. A run that never starts (or is cancelled before its shards start) is billed nothing; a run cancelled mid-flight bills only the partial window across its shards.