How do I get started with managed spot jobs in SkyPilot?
I'm looking to use managed spot jobs with SkyPilot and want to understand the steps needed to set up and launch a job efficiently, especially considering the potential for preemptions with spot instances. Could you provide a detailed guide on how to prepare for and execute a managed spot job using SkyPilot?
Zongheng Yang (skypilot)
Asked on Mar 25, 2024
To get started with managed spot jobs using SkyPilot, you'll need to follow these steps:
- Prepare Your Task YAML: Create a task YAML file that describes your job and test it with sky launch.
- Implement Checkpointing: Optionally, implement checkpointing in your code to save the state of your job periodically.
Here's an example of a task YAML for a BERT fine-tuning task and the command to launch it as a managed spot job:
# bert_qa.yaml
name: bert_qa
resources:
  accelerators: V100:1
workdir: ~/transformers
setup: |
  pip install -e .
  cd examples/pytorch/question-answering/
  pip install -r requirements.txt
  pip install wandb
run: |
  cd ./examples/pytorch/question-answering/
  python run_qa.py \
    --model_name_or_path bert-base-uncased \
    --dataset_name squad \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 12 \
    --learning_rate 3e-5 \
    --num_train_epochs 50 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --report_to wandb
Launch the job with:
$ sky spot launch -n bert-qa bert_qa.yaml
SkyPilot will manage the launching and monitoring of your spot job, including handling preemptions. For more details, check the Managed Spot Jobs section in the SkyPilot documentation.
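Since a preempted job is restarted on a fresh instance, the checkpointing step is what lets it continue instead of starting from scratch. Here is a minimal sketch of the save-and-resume pattern using only the Python standard library. It is not SkyPilot API and not what run_qa.py does internally. The function names, the JSON file layout, and the toy "loss" update are all hypothetical placeholders. It also assumes the checkpoint directory lives on storage that survives preemption (for example a cloud bucket mounted into the job), which you would need to configure separately.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write the state for `step` into ckpt_dir and return the file path."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    # Zero-padded step numbers keep lexicographic order equal to numeric order.
    final_path = ckpt_dir / f"step_{step:08d}.json"
    # Write to a temp file, then rename, so an interruption mid-write does not
    # leave a truncated file under the final name. (The rename is atomic on
    # local POSIX filesystems; a mounted bucket may behave differently.)
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_name, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir: Path):
    """Return (step, state) from the newest checkpoint, or (0, None) if none."""
    if not ckpt_dir.is_dir():
        return 0, None
    candidates = sorted(ckpt_dir.glob("step_*.json"))
    if not candidates:
        return 0, None
    with candidates[-1].open() as f:
        payload = json.load(f)
    return payload["step"], payload["state"]


def train(ckpt_dir: Path, total_steps: int, save_every: int) -> dict:
    """Toy training loop that resumes from the latest checkpoint if present."""
    step, state = load_latest_checkpoint(ckpt_dir)
    if state is None:
        state = {"loss": 1.0}  # fresh start
    while step < total_steps:
        step += 1
        state["loss"] *= 0.99  # placeholder for a real training step
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return {"step": step, **state}
```

Calling train(Path("./checkpoints"), total_steps=100, save_every=10) a second time after an interruption picks up from the last saved step rather than step 0. In a real job, the same idea applies with your framework's own checkpoint format in place of the JSON file.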