Operating Principles

  • Design for reproducibility first (versioned data, images, configs)
  • Prefer batch over bespoke: route repeatable runs through AWS Batch or SLURM rather than one-off scripts
  • Keep observability baked in: logs, metrics, cost awareness
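
  As a concrete (illustrative) take on the observability point, a minimal sketch of
  structured JSON logging for jobs; the field names (job_id, stage, instance_type)
  are assumptions, not a fixed schema:

    # Minimal sketch: structured JSON logs so every job emits machine-readable
    # run metadata. Field names are illustrative, not a fixed schema.
    import json
    import logging
    import sys
    import time

    class JsonFormatter(logging.Formatter):
        def format(self, record: logging.LogRecord) -> str:
            payload = {
                "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
                "level": record.levelname,
                "msg": record.getMessage(),
                # extra fields attached via logger.info(..., extra={...})
                **{k: v for k, v in record.__dict__.items()
                   if k in ("job_id", "stage", "instance_type")},
            }
            return json.dumps(payload)

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("pipeline")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    log.info("stage complete", extra={"job_id": "job-001", "stage": "preprocess",
                                      "instance_type": "c6i.4xlarge"})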

AWS Architecture

  • Compute: AWS Batch vs. EC2 vs. SageMaker, chosen per workload profile
  • Storage: S3 data lake with clear prefixes for raw/processed/models
  • Security: Scoped IAM roles, least privilege for jobs, VPC endpoints
  • Packaging: Docker images with pinned dependencies; ECR-backed
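
  One way the Security and Storage bullets combine in practice is a per-job IAM
  policy scoped to exactly the prefixes a stage reads and writes. A minimal
  sketch, assuming a single data-lake bucket; bucket and prefix names are
  placeholders:

    # Sketch of a least-privilege job policy scoped to the prefixes a single
    # stage actually touches. Bucket and prefix names are placeholders.
    import json

    BUCKET = "my-research-data-lake"   # assumption: single data-lake bucket

    def job_policy(read_prefix: str, write_prefix: str) -> str:
        """Policy document allowing reads from one prefix and writes to another."""
        doc = {
            "Version": "2012-10-17",
            "Statement": [
                {"Sid": "ReadInputs", "Effect": "Allow",
                 "Action": ["s3:GetObject"],
                 "Resource": [f"arn:aws:s3:::{BUCKET}/{read_prefix}*"]},
                {"Sid": "WriteOutputs", "Effect": "Allow",
                 "Action": ["s3:PutObject"],
                 "Resource": [f"arn:aws:s3:::{BUCKET}/{write_prefix}*"]},
                {"Sid": "ListScopedPrefixes", "Effect": "Allow",
                 "Action": ["s3:ListBucket"],
                 "Resource": [f"arn:aws:s3:::{BUCKET}"],
                 "Condition": {"StringLike": {"s3:prefix": [f"{read_prefix}*",
                                                            f"{write_prefix}*"]}}},
            ],
        }
        return json.dumps(doc, indent=2)

    print(job_policy("raw/sensor-a/", "processed/sensor-a/v1/"))

  Attaching the generated document to the job role keeps each Batch job limited
  to its own inputs and outputs.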

HPC / Batch Pipelines

  • Job graphs defined as DAGs; explicit dependencies and retries
  • Containerized workers; CPU/GPU profiles declared per stage
  • Checkpointed training; artifact promotion through environments
  • SLURM/Bash templates mirrored in AWS Batch job definitions
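
  A minimal sketch of a job graph with explicit dependencies and retries,
  expressed against the AWS Batch submit_job API via boto3; the queue and
  job-definition names are invented for illustration:

    # Sketch of a two-stage DAG on AWS Batch with explicit dependencies and
    # retries. Queue and job-definition names are placeholders.
    import boto3

    batch = boto3.client("batch")

    def submit(name, definition, command, depends_on=None):
        """Submit one node of the DAG; retries are declared on the job itself."""
        resp = batch.submit_job(
            jobName=name,
            jobQueue="research-spot-queue",        # placeholder queue name
            jobDefinition=definition,              # ECR-backed, pinned image
            containerOverrides={"command": command},
            dependsOn=[{"jobId": j} for j in (depends_on or [])],
            retryStrategy={"attempts": 3},         # retry transient/spot failures
        )
        return resp["jobId"]

    preprocess = submit("preprocess", "preprocess-cpu:4",
                        ["python", "preprocess.py", "--shard", "all"])
    train = submit("train", "train-gpu:7",
                   ["python", "train.py", "--resume-from-checkpoint"],
                   depends_on=[preprocess])

  The same stage structure mirrors onto SLURM with job arrays and
  --dependency=afterok, which keeps the two sets of templates in sync.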

Data Layout Conventions

  • /raw/{source}/{date}/...
  • /processed/{dataset}/{version}/...
  • /models/{project}/{version}/...
  • /figures/{project}/{paper}/...
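
  A small helper can keep jobs from hard-coding these prefixes. A sketch, with
  illustrative arguments standing in for the elided path segments:

    # Helpers that build keys following the conventions above. The leading "/"
    # from the convention is dropped because S3 object keys do not use it.
    # Filenames and example values are illustrative only.
    from datetime import date

    def raw_key(source: str, day: date, filename: str) -> str:
        return f"raw/{source}/{day.isoformat()}/{filename}"

    def processed_key(dataset: str, version: str, filename: str) -> str:
        return f"processed/{dataset}/{version}/{filename}"

    def model_key(project: str, version: str, filename: str) -> str:
        return f"models/{project}/{version}/{filename}"

    def figure_key(project: str, paper: str, filename: str) -> str:
        return f"figures/{project}/{paper}/{filename}"

    print(processed_key("sensor-a", "v3", "features.parquet"))
    # -> processed/sensor-a/v3/features.parquet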

Cost & Scaling Tradeoffs

  • Batch for bursty workloads; Spot Instances where interruption is tolerable
  • Cache intermediate artifacts to avoid recomputation
  • Small, immutable images to reduce cold-start penalties
  • Monitor egress and storage-class transitions (S3 Standard-IA / Glacier where access patterns allow)
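
  For the caching bullet, one common pattern is to key each intermediate
  artifact by a hash of its inputs and config and skip the stage when the
  object already exists. A sketch; the bucket name and hashing scheme are
  assumptions:

    # Skip recomputation when a content-addressed artifact already exists in S3.
    import hashlib
    import json
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "my-research-data-lake"   # placeholder

    def artifact_key(stage: str, inputs: list[str], config: dict) -> str:
        digest = hashlib.sha256(
            json.dumps({"inputs": inputs, "config": config},
                       sort_keys=True).encode()).hexdigest()[:16]
        return f"processed/{stage}/{digest}/output.parquet"

    def cached(key: str) -> bool:
        try:
            s3.head_object(Bucket=BUCKET, Key=key)
            return True
        except ClientError as err:
            if err.response["Error"]["Code"] == "404":
                return False
            raise

    key = artifact_key("features", ["raw/sensor-a/2024-01-01/dump.csv"], {"window": 30})
    if not cached(key):
        pass  # run the stage, then upload the result to `key`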

Reproducibility Strategy

  • Pin: containers, data versions, seeds, configs
  • Log: environment, git SHA, hyperparameters, metrics
  • Validate: smoke tests on small shards before full runs
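
  A minimal sketch of the pin/log steps: seed the RNG, capture the git SHA,
  interpreter, and installed packages, and write a run manifest alongside the
  outputs (assumes git and pip are available in the job image); the manifest
  fields are illustrative:

    # Write a run manifest capturing code version, environment, seed, and config.
    import json
    import platform
    import random
    import subprocess
    import sys

    def run_manifest(config: dict, seed: int = 1234) -> dict:
        random.seed(seed)                  # also seed numpy/torch if used
        return {
            "git_sha": subprocess.check_output(
                ["git", "rev-parse", "HEAD"], text=True).strip(),
            "python": sys.version.split()[0],
            "platform": platform.platform(),
            "seed": seed,
            "config": config,
            "packages": subprocess.check_output(
                [sys.executable, "-m", "pip", "freeze"], text=True).splitlines(),
        }

    if __name__ == "__main__":
        manifest = run_manifest({"lr": 3e-4, "epochs": 10})
        with open("run_manifest.json", "w") as fh:
            json.dump(manifest, fh, indent=2)

  The manifest then rides along with checkpoints and figures, so any artifact
  can be traced back to the code, data version, and configuration that produced it.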