Operating Principles
- Design for reproducibility first (versioned data, images, configs); see the run-config sketch after this list
- Prefer batch over bespoke: AWS Batch / SLURM for repeatable runs
- Keep observability baked in: logs, metrics, cost awareness
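As a concrete illustration of the first bullet, here is a minimal Python sketch of a pinned run config that travels with a job's outputs. The field names (data_version, image_tag, config_path, seed) and the file name run_config.json are assumptions for illustration, not an established schema.

```python
# Minimal sketch of a pinned run config: every field that affects results is
# recorded explicitly and written next to the run's outputs.
# Field names and values below are illustrative placeholders.
import dataclasses
import json


@dataclasses.dataclass(frozen=True)
class RunConfig:
    data_version: str   # e.g. a versioned prefix under processed/
    image_tag: str      # ECR image tag or digest the job runs in
    config_path: str    # experiment config committed to git
    seed: int           # RNG seed for the run


if __name__ == "__main__":
    cfg = RunConfig(
        data_version="processed/corpus/v3",
        image_tag="123456789012.dkr.ecr.us-east-1.amazonaws.com/train:v1.4.2",
        config_path="configs/baseline.yaml",
        seed=1234,
    )
    # Persist the pinned config so it can be archived alongside artifacts.
    with open("run_config.json", "w") as f:
        json.dump(dataclasses.asdict(cfg), f, indent=2)
```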
AWS Architecture
- Compute: AWS Batch vs. EC2 vs. SageMaker, chosen per workload profile
- Storage: S3 data lake with clear prefixes for raw/processed/models
- Security: Scoped IAM roles, least privilege for jobs, VPC endpoints
- Packaging: Docker images with pinned dependencies; ECR-backed
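A hedged boto3 sketch of how these pieces can fit together: a pinned ECR image, a least-privilege job role, and declared resources in an AWS Batch job definition. The job name, account ID, ARNs, and command are placeholders.

```python
# Sketch: register an AWS Batch job definition that pins an ECR image and a
# narrowly scoped IAM job role. All identifiers are placeholders.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

resp = batch.register_job_definition(
    jobDefinitionName="train-model",  # hypothetical name
    type="container",
    containerProperties={
        # Pin by immutable tag or digest so reruns use the exact same image.
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/train:v1.4.2",
        "jobRoleArn": "arn:aws:iam::123456789012:role/train-job-role",  # least privilege
        "command": ["python", "train.py", "--config", "configs/baseline.yaml"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32768"},  # MiB
            {"type": "GPU", "value": "1"},
        ],
    },
    retryStrategy={"attempts": 2},
)
print(resp["jobDefinitionArn"])
```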
HPC / Batch Pipelines
- Job graphs defined as DAGs; explicit dependencies and retries (sketched after this list)
- Containerized workers; CPU/GPU profiles declared per stage
- Checkpointed training; artifact promotion through environments
- SLURM/Bash templates mirrored in AWS Batch job definitions
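A small boto3 sketch of the DAG idea on AWS Batch: a preprocessing job followed by a training job that declares its dependency and retry policy explicitly. The queue and job definition names are assumptions.

```python
# Sketch: a two-stage job graph (preprocess -> train) on AWS Batch with the
# dependency and retries stated explicitly. Queues and definitions are placeholders.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

preprocess = batch.submit_job(
    jobName="preprocess-shards",
    jobQueue="cpu-queue",            # hypothetical CPU queue
    jobDefinition="preprocess",      # hypothetical job definition
    retryStrategy={"attempts": 3},
)

train = batch.submit_job(
    jobName="train-model",
    jobQueue="gpu-queue",            # hypothetical GPU queue
    jobDefinition="train-model",
    dependsOn=[{"jobId": preprocess["jobId"]}],  # runs only after preprocess succeeds
    retryStrategy={"attempts": 2},
    containerOverrides={"environment": [{"name": "DATA_VERSION", "value": "v3"}]},
)
print(train["jobId"])
```

The same dependency structure maps onto SLURM with `sbatch --dependency=afterok:<jobid>`, which is what keeps the Batch and SLURM templates mirrored.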
Data Layout Conventions
/raw/{source}/{date}/...
/processed/{dataset}/{version}/...
/models/{project}/{version}/...
/figures/{project}/{paper}/...
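A minimal Python sketch of helpers that generate keys matching this layout, assuming a single bucket; the bucket and example names are placeholders.

```python
# Sketch: helpers that build S3 key prefixes following the layout above so every
# job reads and writes predictable locations. Bucket and names are placeholders.
import datetime


def raw_prefix(source: str, date: datetime.date) -> str:
    """Raw, unmodified inputs, partitioned by source and ingest date."""
    return f"raw/{source}/{date.isoformat()}/"


def processed_prefix(dataset: str, version: str) -> str:
    """Processed datasets; bump the version rather than overwrite."""
    return f"processed/{dataset}/{version}/"


def model_prefix(project: str, version: str) -> str:
    """Trained model artifacts, versioned per project."""
    return f"models/{project}/{version}/"


print(raw_prefix("sensor-feed", datetime.date(2024, 5, 1)))  # raw/sensor-feed/2024-05-01/
```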
Cost & Scaling Tradeoffs
- Batch for bursty workloads; Spot capacity where interruptions are tolerable
- Cache intermediate artifacts to avoid recomputation (see the cache-check sketch after this list)
- Small, immutable images to reduce cold-start penalties
- Monitor egress and storage class transitions (S3 IA/Glacier where safe)
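One way the caching bullet can look in practice: a boto3 sketch that derives a deterministic key from a step's parameters and checks S3 before recomputing. The bucket, key layout, and step names are assumptions.

```python
# Sketch: skip recomputation when the intermediate artifact already exists in S3.
# The cache key is derived from whatever determines the artifact's contents.
import hashlib
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-research-bucket"  # hypothetical bucket


def cache_key(step: str, params: dict) -> str:
    """Deterministic key: same step + same parameters -> same artifact location."""
    digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()[:16]
    return f"cache/{step}/{digest}/artifact.parquet"


def artifact_exists(key: str) -> bool:
    """Cheap existence check with a HEAD request; nothing is downloaded."""
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False


key = cache_key("tokenize", {"dataset": "corpus-v2", "vocab_size": 32000})
if not artifact_exists(key):
    pass  # recompute the artifact and upload it to BUCKET/key
```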
Reproducibility Strategy
- Pin: containers, data versions, seeds, configs
- Log: environment, git SHA, hyperparameters, metrics (capture sketch below)
- Validate: smoke tests on small shards before full runs
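A hedged Python sketch of the logging bullet: capture the git SHA, Python environment, and hyperparameters into one record before the run starts. It assumes the job runs inside a git checkout with pip available; the file name and fields are illustrative.

```python
# Sketch: snapshot everything needed to reproduce a run into a single record.
# File name and fields are illustrative.
import json
import platform
import subprocess
import sys


def capture_run_metadata(hyperparams: dict, path: str = "run_metadata.json") -> dict:
    """Record git SHA, Python environment, and hyperparameters before training starts."""
    git_sha = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()
    packages = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"], capture_output=True, text=True, check=True
    ).stdout.splitlines()
    record = {
        "git_sha": git_sha,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
        "hyperparams": hyperparams,
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record


capture_run_metadata({"lr": 3e-4, "batch_size": 256, "seed": 1234})
```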