How can I address connection timeout errors and key errors affecting load balancer status in SkyPilot?
I'm encountering connection timeout errors and key errors when running sky serve controller on Azure, which seem to be affecting the load balancer status of available nodes. Despite sky serve status showing everything as ready, some requests fail when redirected to one of the replicas. I suspect the errors are preventing the load balancer from updating its status, keeping faulty nodes active. Here's the error output from sky serve logs --controller <name>:
Error when probing replica 3 with url 20.171.124.9:8080: requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='20.171.124.9', port=8080): Max retries exceeded with url: /v1/models (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x14dff00ea0b0>, 'Connection to 20.171.124.9 timed out. (connect timeout=15)')).
E 02-08 21:57:24 replica_managers.py:997] Error in replica prober: KeyError: ''
E 02-08 21:57:24 replica_managers.py:1000] Traceback: Traceback (most recent call last):
...
KeyError: ''
How can I resolve this issue and ensure the load balancer accurately reflects the status of the nodes?
Joe Valle
Asked on Feb 08, 2024
The issue may be related to Azure returning an empty state for a VM, which would be consistent with the KeyError: '' in the log and is plausible since you mentioned using spot replicas. If the replicas were decommissioned, as the Azure UI indicated, and you have deleted the resource group containing the decommissioned VM, SkyPilot should start provisioning new replicas and update the load balancer accordingly.
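To illustrate how an empty VM state can surface as KeyError: '', here is a minimal Python sketch. It is not SkyPilot's actual code: the mapping, the enum, and both function names are hypothetical and exist only to show the failure mode and one defensive alternative.

```python
from enum import Enum
from typing import Optional


class ReplicaHealth(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    UNKNOWN = "unknown"


# Hypothetical mapping from provider-reported VM states to a health value.
# The keys are illustrative only, not SkyPilot's real table.
VM_STATE_TO_HEALTH = {
    "running": ReplicaHealth.RUNNING,
    "deallocated": ReplicaHealth.STOPPED,
}


def strict_lookup(vm_state: str) -> ReplicaHealth:
    # A plain dict lookup raises KeyError: '' if the provider returns "".
    return VM_STATE_TO_HEALTH[vm_state]


def tolerant_lookup(vm_state: Optional[str]) -> ReplicaHealth:
    # Treat a missing, empty, or unrecognized state as UNKNOWN instead of
    # raising, so one odd VM does not abort the whole probing pass.
    key = (vm_state or "").strip().lower()
    return VM_STATE_TO_HEALTH.get(key, ReplicaHealth.UNKNOWN)
```

If the prober loop stops at the first exception, a single VM with an empty state could keep the remaining replicas from being re-evaluated, which would match the symptom of stale "ready" statuses.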
As for your request for command-line options to remove replicas, the idea of adding --soft-remove and --destructive-delete flags to the sky serve down command is a good one. The SkyPilot team has taken note of it and created an issue to implement terminating a single replica.
Removing a replica from the load balancer without deleting it would be useful when you want to perform maintenance on the node or restart services. However, this could incur additional costs if SkyPilot spins up a new replica to maintain the desired count. A potential solution would be to adjust the replica count automatically when a node is removed from the load balancer.
Regarding scaling down replicas during light traffic, you can use the sky serve update command with a reduced replica count. To automate this, you could use sed to edit the replica count in the service YAML and a cron job to run the update on a schedule. While SkyPilot does not currently have an autoscaling feature like Kubernetes, it's something that could be considered for the roadmap.
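The sed-and-cron idea above can also be sketched in Python. This is only a rough outline under stated assumptions: the service YAML is assumed to contain a single `replicas: N` line, sky serve update is assumed to take the service name followed by the YAML path, and the helper names set_replica_count and build_update_command are hypothetical.

```python
import re
from typing import List


def set_replica_count(yaml_text: str, count: int) -> str:
    """Rewrite the first `replicas: N` line, keeping indentation and comments."""
    if count < 0:
        raise ValueError("replica count must be non-negative")
    pattern = re.compile(r"^([ \t]*replicas:[ \t]*)\d+(.*)$", re.MULTILINE)
    new_text, replaced = pattern.subn(
        lambda m: f"{m.group(1)}{count}{m.group(2)}", yaml_text, count=1
    )
    if replaced == 0:
        raise ValueError("no `replicas: N` line found in the service YAML")
    return new_text


def build_update_command(service_name: str, yaml_path: str) -> List[str]:
    # Assumed CLI shape: sky serve update <service_name> <service_yaml>.
    # The list can be passed to subprocess.run(..., check=True).
    return ["sky", "serve", "update", service_name, yaml_path]
```

A cron entry could then run a small script built on these helpers with a low count in the evening and a higher count in the morning. Note that the update command may ask for confirmation when run interactively, so check its options before scheduling it unattended.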
In summary, the SkyPilot team is aware of the issue and is working on improvements, including the ability to terminate individual replicas. In the meantime, you can manually adjust replica counts and remove resource groups as needed.
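Before removing anything manually, it may help to confirm which replicas are actually unreachable. The sketch below probes each replica endpoint directly, using the /v1/models path and the 15-second timeout that appear in the controller log. The function names are hypothetical, plain HTTP is assumed, and this is not how the SkyPilot controller itself probes replicas.

```python
import http.client
import urllib.error
import urllib.request
from typing import Dict, Iterable


def probe_replica(endpoint: str, path: str = "/v1/models",
                  timeout: float = 15.0) -> bool:
    """Return True if http://<endpoint><path> answers with a 2xx status."""
    url = f"http://{endpoint}{path}"
    try:
        # `timeout` bounds the blocking socket operations, including connect.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def probe_all(endpoints: Iterable[str], path: str = "/v1/models",
              timeout: float = 15.0) -> Dict[str, bool]:
    # Map each "host:port" endpoint to whether its probe succeeded.
    return {ep: probe_replica(ep, path, timeout) for ep in endpoints}
```

For example, probe_all(["20.171.124.9:8080"]) would report False for the replica from the log if it is still timing out, which tells you which VM or resource group to inspect in the Azure UI.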