Operating Principles
- Design for reproducibility first (versioned data, images, configs); see the run-config sketch after this list
- Prefer batch over bespoke: AWS Batch / SLURM for repeatable runs
- Keep observability baked in: logs, metrics, cost awareness
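As a concrete illustration of the first bullet, here is a minimal Python sketch of a pinned run config that travels with a job's outputs. The field names (data_version, image_tag, config_path, seed) and the file name run_config.json are assumptions for illustration, not an established schema.

```python
# Minimal sketch of a pinned run config: every field that affects results is
# recorded explicitly and written next to the run's outputs.
# Field names and values below are illustrative placeholders.
import dataclasses
import json


@dataclasses.dataclass(frozen=True)
class RunConfig:
    data_version: str   # e.g. a versioned prefix under processed/
    image_tag: str      # ECR image tag or digest the job runs in
    config_path: str    # experiment config committed to git
    seed: int           # RNG seed for the run


if __name__ == "__main__":
    cfg = RunConfig(
        data_version="processed/corpus/v3",
        image_tag="123456789012.dkr.ecr.us-east-1.amazonaws.com/train:v1.4.2",
        config_path="configs/baseline.yaml",
        seed=1234,
    )
    # Persist the pinned config so it can be archived alongside artifacts.
    with open("run_config.json", "w") as f:
        json.dump(dataclasses.asdict(cfg), f, indent=2)
```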
AWS Architecture
- Compute: AWS Batch vs. EC2 vs. SageMaker, chosen per workload profile
- Storage: S3 data lake with clear prefixes for raw/processed/models
- Security: Scoped IAM roles, least privilege for jobs, VPC endpoints
- Packaging: Docker images with pinned dependencies; ECR-backed
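A hedged boto3 sketch of how these pieces can fit together: a pinned ECR image, a least-privilege job role, and declared resources in an AWS Batch job definition. The job name, account ID, ARNs, and command are placeholders.

```python
# Sketch: register an AWS Batch job definition that pins an ECR image and a
# narrowly scoped IAM job role. All identifiers are placeholders.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

resp = batch.register_job_definition(
    jobDefinitionName="train-model",  # hypothetical name
    type="container",
    containerProperties={
        # Pin by immutable tag or digest so reruns use the exact same image.
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/train:v1.4.2",
        "jobRoleArn": "arn:aws:iam::123456789012:role/train-job-role",  # least privilege
        "command": ["python", "train.py", "--config", "configs/baseline.yaml"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32768"},  # MiB
            {"type": "GPU", "value": "1"},
        ],
    },
    retryStrategy={"attempts": 2},
)
print(resp["jobDefinitionArn"])
```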
HPC / Batch Pipelines
- Job graphs defined as DAGs; explicit dependencies and retries (sketched after this list)
- Containerized workers; CPU/GPU profiles declared per stage
- Checkpointed training; artifact promotion through environments
- SLURM/Bash templates mirrored in AWS Batch job definitions
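A small boto3 sketch of the DAG idea on AWS Batch: a preprocessing job followed by a training job that declares its dependency and retry policy explicitly. The queue and job definition names are assumptions.

```python
# Sketch: a two-stage job graph (preprocess -> train) on AWS Batch with the
# dependency and retries stated explicitly. Queues and definitions are placeholders.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

preprocess = batch.submit_job(
    jobName="preprocess-shards",
    jobQueue="cpu-queue",            # hypothetical CPU queue
    jobDefinition="preprocess",      # hypothetical job definition
    retryStrategy={"attempts": 3},
)

train = batch.submit_job(
    jobName="train-model",
    jobQueue="gpu-queue",            # hypothetical GPU queue
    jobDefinition="train-model",
    dependsOn=[{"jobId": preprocess["jobId"]}],  # runs only after preprocess succeeds
    retryStrategy={"attempts": 2},
    containerOverrides={"environment": [{"name": "DATA_VERSION", "value": "v3"}]},
)
print(train["jobId"])
```

The same dependency structure maps onto SLURM with `sbatch --dependency=afterok:<jobid>`, which is what keeps the Batch and SLURM templates mirrored.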
Data Layout Conventions
/raw/{source}/{date}/...
/processed/{dataset}/{version}/...
/models/{project}/{version}/...
/figures/{project}/{paper}/...
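A minimal Python sketch of helpers that generate keys matching this layout, assuming a single bucket; the bucket and example names are placeholders.

```python
# Sketch: helpers that build S3 key prefixes following the layout above so every
# job reads and writes predictable locations. Bucket and names are placeholders.
import datetime


def raw_prefix(source: str, date: datetime.date) -> str:
    """Raw, unmodified inputs, partitioned by source and ingest date."""
    return f"raw/{source}/{date.isoformat()}/"


def processed_prefix(dataset: str, version: str) -> str:
    """Processed datasets; bump the version rather than overwrite."""
    return f"processed/{dataset}/{version}/"


def model_prefix(project: str, version: str) -> str:
    """Trained model artifacts, versioned per project."""
    return f"models/{project}/{version}/"


print(raw_prefix("sensor-feed", datetime.date(2024, 5, 1)))  # raw/sensor-feed/2024-05-01/
```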
Cost & Scaling Tradeoffs
- Batch for bursty workloads; Spot capacity where interruptions are tolerable
- Cache intermediate artifacts to avoid recomputation (see the cache-check sketch after this list)
- Small, immutable images to reduce cold-start penalties
- Monitor egress and storage class transitions (S3 IA/Glacier where safe)
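One way the caching bullet can look in practice: a boto3 sketch that derives a deterministic key from a step's parameters and checks S3 before recomputing. The bucket, key layout, and step names are assumptions.

```python
# Sketch: skip recomputation when the intermediate artifact already exists in S3.
# The cache key is derived from whatever determines the artifact's contents.
import hashlib
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-research-bucket"  # hypothetical bucket


def cache_key(step: str, params: dict) -> str:
    """Deterministic key: same step + same parameters -> same artifact location."""
    digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()[:16]
    return f"cache/{step}/{digest}/artifact.parquet"


def artifact_exists(key: str) -> bool:
    """Cheap existence check with a HEAD request; nothing is downloaded."""
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False


key = cache_key("tokenize", {"dataset": "corpus-v2", "vocab_size": 32000})
if not artifact_exists(key):
    pass  # recompute the artifact and upload it to BUCKET/key
```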
Reproducibility Strategy
- Pin: containers, data versions, seeds, configs
- Log: environment, git SHA, hyperparameters, metrics (capture sketch below)
- Validate: smoke tests on small shards before full runs
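A hedged Python sketch of the logging bullet: capture the git SHA, Python environment, and hyperparameters into one record before the run starts. It assumes the job runs inside a git checkout with pip available; the file name and fields are illustrative.

```python
# Sketch: snapshot everything needed to reproduce a run into a single record.
# File name and fields are illustrative.
import json
import platform
import subprocess
import sys


def capture_run_metadata(hyperparams: dict, path: str = "run_metadata.json") -> dict:
    """Record git SHA, Python environment, and hyperparameters before training starts."""
    git_sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    packages = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    record = {
        "git_sha": git_sha,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
        "hyperparams": hyperparams,
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record


capture_run_metadata({"lr": 3e-4, "batch_size": 256, "seed": 1234})
```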