
MLO: Getting Started with the EPFL RCP Cluster

This repository contains the basic steps to start running scripts and notebooks on the EPFL RCP cluster. We provide scripts that make your life easier by automating most of the boilerplate. The setup is loosely based on infrastructure from TML/CLAIRE and earlier scripts by Atli.

Overview

The RCP cluster provides:

  • GPUs: A100 (40GB/80GB), H100 (80GB), H200 (141GB), V100
  • Stack: Docker (containers), Kubernetes (orchestration), run:ai (scheduler)

Getting Help

Tip

If you have little prior experience with ML workflows, the setup below may seem daunting at first. You can copy‑paste the commands in order; the scripts are designed to hide most of the complexity. The only requirement is that you have a basic understanding of how to use a terminal and git.

Caution

Using the cluster incurs costs. Please be mindful of the resources you use, and do not forget to stop your jobs when you are not using them!



Quick Start

Tip

TL;DR – After completing the setup, interaction with the cluster looks like this:

# Start an interactive job with 1 GPU
python csub.py -n sandbox

# Connect to your job
runai exec sandbox -it -- zsh

# Run your code
cd /mloscratch/homes/<your_username>
python main.py

# Or start a training job in one command
python csub.py -n experiment --train --command "cd /mloscratch/homes/<your_username>/<your_code>; python main.py"

Setup Guide

Important

Network requirement: You must be on the EPFL WiFi or connected to the VPN. The cluster is not accessible otherwise.

1. Pre-setup (Access & Repository)

1. Request cluster access

Ask Jennifer or Martin to add you to the runai-mlo group: https://groups.epfl.ch/

2. Prepare your code repository

While waiting for access, create a GitHub repository for your code. This is best practice regardless of our cluster setup.

3. Set up experiment tracking (optional)

  • Weights & Biases: Create an account at wandb.ai and get your API key
  • Hugging Face: Create an account at huggingface.co and get your token (if using their models)

2. Setup Tools on Your Machine

Important

Platform note: The setup below was tested on macOS with Apple Silicon. For other systems, adapt the commands accordingly.

  • Linux: Replace darwin/arm64 with linux/amd64 in URLs
  • Windows: Use WSL (Windows Subsystem for Linux)

Install kubectl

Download and install kubectl v1.30.11 (matching the cluster version):

# macOS with Apple Silicon
curl -LO "https://dl.k8s.io/release/v1.30.11/bin/darwin/arm64/kubectl"

# Linux (AMD64)
# curl -LO "https://dl.k8s.io/release/v1.30.11/bin/linux/amd64/kubectl"

# Install
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
sudo chown root: /usr/local/bin/kubectl

See https://kubernetes.io/docs/tasks/tools/install-kubectl/ for other platforms.
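
To confirm the installation, print the client version:

kubectl version --client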

Setup kubeconfig

Download the kube config file to ~/.kube/config:

curl -o ~/.kube/config https://raw.githubusercontent.com/epfml/getting-started/main/kubeconfig.yaml
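
Note that the command above overwrites any existing kubeconfig, and on a fresh machine the ~/.kube directory may not exist yet. Before downloading, you may want to run (a minimal sketch):

mkdir -p ~/.kube                                                # ensure the directory exists
[ -f ~/.kube/config ] && cp ~/.kube/config ~/.kube/config.bak   # back up an existing config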

Install run:ai CLI

Download and install the run:ai CLI:

# macOS with Apple Silicon
wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/darwin

# Linux (replace 'darwin' with 'linux')
# wget --content-disposition https://rcp-caas-prod.rcp.epfl.ch/cli/linux

# Install
chmod +x ./runai
sudo mv ./runai /usr/local/bin/runai
sudo chown root: /usr/local/bin/runai
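
As a quick sanity check, the CLI should report its version (assuming this build provides a version subcommand):

runai version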

3. Login to the Cluster

Login to run:ai

runai login

Verify access

# List available projects
runai list projects

# Set your default project
runai config project mlo-$GASPAR_USERNAME

Verify Kubernetes connection

kubectl get nodes

You should see the RCP cluster nodes listed.

4. Configure Your .env File

This setup keeps all personal configuration and secrets in a local .env file (never committed to git).

Clone and create .env

git clone https://github.com/epfml/getting-started.git
cd getting-started
cp user.env.example .env
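
Because .env will hold secrets, consider restricting its permissions (optional; the repository is assumed to git-ignore .env already):

chmod 600 .env   # readable and writable only by you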

Fill in required fields

Open .env in an editor and configure:

| Variable | Description | Example |
| --- | --- | --- |
| LDAP_USERNAME | Your EPFL/Gaspar username | jdoe |
| LDAP_UID | Your numeric LDAP user ID | 123456 |
| LDAP_GROUPNAME | For MLO | MLO-unit |
| LDAP_GID | For MLO: 83070 | 83070 |
| RUNAI_PROJECT | Your project | mlo-<username> |
| K8S_NAMESPACE | Your namespace | runai-mlo-<username> |
| RUNAI_IMAGE | Docker image | ic-registry.epfl.ch/mlo/mlo-base:uv1 |
| RUNAI_SECRET_NAME | Secret name | runai-mlo-<username>-env |
| WORKING_DIR | Working directory | /mloscratch/homes/<username> |

Find your LDAP UID

To ensure correct file permissions:

# SSH into HaaS machine (use your Gaspar password)
ssh <your_gaspar_username>@haas001.rcp.epfl.ch

# Get your UID
id

Copy the number after uid= (e.g., uid=123456) into LDAP_UID in your .env file.
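
Alternatively, id -u prints only the numeric UID:

id -u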

Optional: Add secrets and tokens

Optionally configure in .env (see the example snippet after this list):

  • WANDB_API_KEY – Weights & Biases API key
  • HF_TOKEN – Hugging Face token
  • GIT_USER_NAME / GIT_USER_EMAIL – Git identity for commits
  • GitHub SSH keys (auto-loaded from ~/.ssh/github if empty):
    • GITHUB_SSH_KEY_PATH / GITHUB_SSH_PUBLIC_KEY_PATH (to override default paths)
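
For illustration, the optional entries might look like this in .env (all values are placeholders):

WANDB_API_KEY=<your_wandb_key>
HF_TOKEN=<your_hf_token>
GIT_USER_NAME="Jane Doe"
GIT_USER_EMAIL=jane.doe@epfl.ch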

Sync your secret

The secret is automatically synced when starting a job. To manually sync:

python csub.py --sync-secret-only
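
To verify that the secret was created in your namespace (substitute the names from your .env):

kubectl get secret runai-mlo-<username>-env -n runai-mlo-<username>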

5. Start Your First Job

Start an interactive pod

python csub.py -n sandbox

Wait for the pod to start

This can take a few minutes. Monitor the status:

# List all jobs
runai list

# Check specific job status
runai describe job sandbox

Connect to your pod

Once the status shows Running:

runai exec sandbox -it -- zsh

You should now be inside a terminal on the cluster! 🎉

6. Clone and Run Your Code

Clone your repository

Inside the pod, clone your code into your scratch home folder:

cd /mloscratch/homes/<your_username>
git clone https://github.com/<your_username>/<your_repo>.git
cd <your_repo>
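
If you configured the GitHub SSH keys in .env, cloning over SSH should also work:

git clone git@github.com:<your_username>/<your_repo>.git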

Set up your Python environment

The default image includes uv as the recommended package manager (pip also works):

# Create and activate virtual environment
uv venv .venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

Run your code

python main.py

If you configured WANDB_API_KEY or HF_TOKEN in .env, authentication should work automatically.
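
To double-check from inside the pod, print the variables (this only confirms they were injected into the environment):

printenv WANDB_API_KEY HF_TOKEN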


Using VS Code

For remote development on the cluster:

  1. Install extensions: the VS Code Kubernetes extension (the attach flow below also relies on the Dev Containers extension)

  2. Attach to your pod

    • Navigate to: Kubernetes → rcp-cluster → Workloads → Pods
    • Right-click your pod → Attach Visual Studio Code
    • Open /mloscratch/homes/<your_username> in the remote session

For detailed instructions, see the Managing Workflows guide (docs/managing_workflows.md).


Recommended Workflow

Tip

Development cycle:

  1. Develop code locally or on the cluster (using VS Code)
  2. Push changes to GitHub
  3. Run experiments on the cluster via runai exec sandbox -it -- zsh
  4. Keep code and experiments organized and reproducible

Important

Critical reminders:

  • Pods can be killed anytime – Implement checkpointing and recovery (see the sketch after this list)
  • Store files on scratch – Everything in ~/ is lost when pods restart
  • Use /mloscratch/homes/<username> – Shell config and VS Code settings persist here
  • Delete failed jobs – Run runai delete job <name> before restarting
  • Background jobs – Use training mode: python csub.py -n exp --train --command "..."
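
For the checkpointing reminder above, a minimal sketch of a resilient launch; the --resume flag is hypothetical and assumes your training script continues from the latest checkpoint:

# Re-submit after a pod was killed; the hypothetical --resume flag picks up the latest checkpoint
python csub.py -n experiment --train --command "cd /mloscratch/homes/<username>/<your_code>; python main.py --resume"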

Caution

Using the cluster incurs costs. Always stop your jobs when not in use!

For detailed workflow guidance, see the Managing Workflows guide (docs/managing_workflows.md).


csub.py Usage and Arguments

The csub.py script is a thin wrapper around the run:ai CLI that simplifies job submission by:

  • Reading configuration and secrets from .env
  • Syncing Kubernetes secrets automatically
  • Constructing and executing runai submit commands

Basic Usage

python csub.py -n <job_name> -g <num_gpus> -t <time> --command "<cmd>" [--train]

Common Examples

# CPU-only pod for development
python csub.py -n dev-cpu

# Interactive development pod with 1 GPU
python csub.py -n dev-gpu -g 1

# Training job with 4 A100 GPUs
python csub.py -n experiment --train -g 4 --command "cd /mloscratch/homes/user/code; python train.py"

# Use specific GPU type
python csub.py -n my-job -g 2 --node-type h100 --train --command "..."

# Dry run (see command without executing)
python csub.py -n test --dry --command "..."

Available Arguments

| Argument | Description | Default |
| --- | --- | --- |
| -n, --name | Job name | Auto-generated (username + timestamp) |
| -g, --gpus | Number of GPUs | 0 (CPU-only) |
| -t, --time | Maximum runtime (e.g., 12h, 2d6h30m) | 12h |
| -c, --command | Command to run | sleep <duration> |
| --train | Submit as a training workload (non-interactive) | Interactive mode |
| -i, --image | Docker image | From RUNAI_IMAGE in .env |
| --node-type | GPU type: v100, h100, h200, default, a100-40g | default (A100) |
| --cpus | Number of CPUs | Platform default |
| --memory | CPU memory request | Platform default |
| -p, --port | Expose a container port (for Jupyter, etc.) | None |
| --large-shm | Request a larger /dev/shm | False |
| --host-ipc | Share the host IPC namespace | False |
| --backofflimit | Retries before a training job is marked failed | 0 |
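
Combining the flags above, an illustrative way to expose a port for Jupyter (the job name is a placeholder):

# Interactive pod with 1 GPU and container port 8888 exposed
python csub.py -n notebook -g 1 -p 8888

See the Port Forwarding section below for reaching the port from your machine.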

Secret Management

| Argument | Description |
| --- | --- |
| --sync-secret-only | Only sync .env to the Kubernetes secret; don't submit a job |
| --skip-secret-sync | Don't sync the secret before submission |
| --secret-name | Override RUNAI_SECRET_NAME from .env |
| --env-file | Path to the .env file |

Advanced Options

| Argument | Description |
| --- | --- |
| --uid | Override LDAP_UID from .env |
| --gid | Override LDAP_GID from .env |
| --pvc | Override SCRATCH_PVC from .env |
| --dry | Print the command without executing it |

After Submission

After submitting, csub.py prints useful follow-up commands:

runai describe job <name>      # Check job status
runai logs <name>              # View logs
runai exec <name> -it -- zsh   # Connect to pod
runai delete job <name>        # Delete job

Run python csub.py -h for the complete help text.


Advanced Topics

Managing Workflows

For detailed guides on day-to-day operations, see the Managing Workflows guide (docs/managing_workflows.md).

Alternative Workflows

If you prefer to use the run:ai CLI directly instead of csub.py, see docs/runai_cli.md.

Creating Custom Docker Images

If you need custom dependencies:

  1. Get registry access

  2. Install Docker

    brew install --cask docker # macOS

    If you get "Cannot connect to the Docker daemon", launch the Docker Desktop app first.

  3. Login to registry

    docker login ic-registry.epfl.ch # Use GASPAR credentials
  4. Modify and publish

    • Edit docker/Dockerfile as needed
    • Use docker/publish.sh to build and push
    • Important: Rename your image (e.g., mlo/<your-username>:tag) to avoid overwriting the default

Example workflow:

docker build . -t <your-tag>
docker tag <your-tag> ic-registry.epfl.ch/mlo/<your-tag>
docker push ic-registry.epfl.ch/mlo/<your-tag>

See also Matteo's custom Docker example.

Port Forwarding

To access services running in your pod (e.g., Jupyter):

kubectl get pods
kubectl port-forward <pod_name> 8888:8888

Then access at http://localhost:8888
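
If kubectl cannot find your pod, point it at your namespace explicitly (K8S_NAMESPACE from the .env table above):

kubectl get pods -n runai-mlo-<username>
kubectl port-forward -n runai-mlo-<username> <pod_name> 8888:8888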

Distributed Training

For multi-node training across several compute nodes, see the detailed guide in docs/multinode.md.


Reference

File Overview

├── csub.py                       # Job submission wrapper (wraps runai submit)
├── utils.py                      # Python helpers for csub.py
├── user.env.example              # Template for .env (copy and configure)
├── docker/
│   ├── Dockerfile                # uv-enabled base image (RCP template)
│   ├── entrypoint.sh             # Runtime bootstrap script
│   └── publish.sh                # Build and push Docker images
├── kubeconfig.yaml               # Kubeconfig template for ~/.kube/config
└── docs/
    ├── faq.md                    # Frequently asked questions
    ├── managing_workflows.md     # Day-to-day operations guide
    ├── README.md                 # Architecture deep dive
    ├── runai_cli.md              # Alternative run:ai CLI workflows
    ├── multinode.md              # Multi-node/distributed training
    └── how_to_use_k8s_secret.md  # Kubernetes secrets reference

Deep Dive: How This Setup Works

For technical details about the Docker image, entrypoint script, environment variables, and secret management:

Read the architecture explainer: docs/README.md

Topics covered:

  • Runtime environment and entrypoint
  • Permissions model and shared caches
  • uv-based Python workflow
  • Images and publishing
  • Secrets, SSH, and Kubernetes integration


MLO Cluster Repositories (OUTDATED)

These repositories contain shared tooling and infrastructure built by previous PhD students. Contact Martin for editor access. They are outdated and no longer maintained.
