skypilot-users

How do I get started with managed spot jobs in SkyPilot?

I'm looking to use managed spot jobs with SkyPilot and want to understand the steps needed to set up and launch a job efficiently, especially considering the potential for preemptions with spot instances. Could you provide a detailed guide on how to prepare for and execute a managed spot job using SkyPilot?


Zongheng Yang (skypilot)

Asked on Mar 25, 2024

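To get started with managed spot jobs in SkyPilot, follow these steps:

  1. Prepare your task YAML: Create a task YAML file that describes your job, and verify that it runs correctly with sky launch before submitting it as a managed spot job.
  2. Implement checkpointing (optional but recommended): Have your code periodically save its state to storage that outlives the instance, such as a mounted cloud bucket, and reload the latest checkpoint on startup. A job that is recovered after a preemption can then resume instead of starting over.
  3. Launch the job: Submit the same YAML with sky spot launch.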
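The checkpointing step can be sketched in plain Python as follows. This is a minimal illustration, not SkyPilot's API or the way run_qa.py does it: the names save_checkpoint, load_checkpoint, train, and the state.json layout are hypothetical, and the "training" is a stand-in loop. It assumes the checkpoint directory lives on storage that survives a preemption.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, state):
    """Write the training state as JSON via a temp file plus rename."""
    os.makedirs(ckpt_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace is atomic on a local POSIX filesystem, so an interruption
    # mid-write does not corrupt state.json. Assumption: a bucket mount may
    # not give the same guarantee.
    os.replace(tmp_path, os.path.join(ckpt_dir, "state.json"))


def load_checkpoint(ckpt_dir):
    """Return the last saved state, or a fresh state if none exists."""
    path = os.path.join(ckpt_dir, "state.json")
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(ckpt_dir, total_steps, save_every=10, stop_at=None):
    """Resume from the last checkpoint and run up to total_steps.

    stop_at simulates a preemption: the function returns before running
    that step, without saving the in-memory state.
    """
    state = load_checkpoint(ckpt_dir)
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return state
        # Stand-in for one real training step.
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % save_every == 0:
            save_checkpoint(ckpt_dir, state)
    save_checkpoint(ckpt_dir, state)
    return state
```

Calling train(ckpt_dir, total_steps=50, stop_at=25) and then train(ckpt_dir, total_steps=50) mimics a preemption followed by a recovery: the second call picks up from the last saved step rather than from zero, losing only the work done since the most recent save.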

Here's an example of a task YAML for a BERT fine-tuning task and the command to launch it as a managed spot job:

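# bert_qa.yaml
name: bert_qa

resources:
  accelerators: V100:1

# Assumes a local clone of the Hugging Face transformers repo at ~/transformers;
# it is synced to the remote instance as the working directory.
workdir: ~/transformers

setup: |
  pip install -e .
  cd examples/pytorch/question-answering/
  pip install -r requirements.txt
  pip install wandb

run: |
  cd ./examples/pytorch/question-answering/
  # --report_to wandb expects a Weights & Biases API key (WANDB_API_KEY)
  # to be available on the instance.
  python run_qa.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 12 \
    --learning_rate 3e-5 \
    --num_train_epochs 50 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --report_to wandb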

Launch the job with:

$ sky spot launch -n bert-qa bert_qa.yaml

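SkyPilot launches and monitors the spot job for you. If the spot instance is preempted, SkyPilot automatically provisions a new one and restarts the job, which is why checkpointing matters for long runs. You can check job status with sky spot queue and stream logs with sky spot logs. In newer SkyPilot releases, these commands are available under sky jobs (for example, sky jobs launch). For more details, see the Managed Spot Jobs section in the SkyPilot documentation.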
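If you want to submit the job from a script rather than typing the command, a thin wrapper around the CLI is one option. The sketch below only assembles and runs the same sky spot launch command shown above. The helper names spot_launch_cmd and spot_launch are hypothetical, and it assumes the sky CLI is installed and on your PATH. The --env and --yes flags are used here to pass WANDB_API_KEY and skip the confirmation prompt.

```python
import shlex
import subprocess


def spot_launch_cmd(yaml_path, name, env=None, yes=True):
    """Build the argument list for `sky spot launch` (hypothetical helper)."""
    cmd = ["sky", "spot", "launch", "-n", name]
    # Each entry becomes `--env KEY=VALUE`, e.g. for WANDB_API_KEY.
    for key, value in (env or {}).items():
        cmd += ["--env", f"{key}={value}"]
    if yes:
        cmd.append("--yes")  # skip the interactive confirmation prompt
    cmd.append(yaml_path)
    return cmd


def spot_launch(yaml_path, name, env=None, yes=True):
    """Run the command; raises CalledProcessError if the CLI exits non-zero."""
    cmd = spot_launch_cmd(yaml_path, name, env=env, yes=yes)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

For example, spot_launch("bert_qa.yaml", "bert-qa", env={"WANDB_API_KEY": "<your key>"}) submits the task above with the key passed through to the job.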
