Wednesday, April 10, 2019

Who is Stealing Your CPU Cycles?

Recently we noticed that one of our virtual servers on Linode was performing worse than we expected. On further investigation, we found an abnormally high steal (st) time CPU percentage, ranging as high as 40%. Some samples from the top command are below.

top - 11:46:24 up 4 days, 23:07, 1 user, load average: 2.70, 1.51, 0.98
Tasks: 114 total, 3 running, 66 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.9 us, 5.0 sy, 0.0 ni, 36.6 id, 17.8 wa, 0.0 hi, 0.6 si, 32.3 st
KiB Mem : 4039472 total, 117044 free, 1708724 used, 2213704 buff/cache
KiB Swap: 524284 total, 523760 free, 524 used. 2074000 avail Mem

top - 11:46:36 up 4 days, 23:07, 1 user, load average: 2.36, 1.47, 0.97
Tasks: 114 total, 1 running, 67 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.5 us, 3.2 sy, 0.0 ni, 33.4 id, 19.6 wa, 0.0 hi, 0.4 si, 34.9 st
KiB Mem : 4039472 total, 115956 free, 1708536 used, 2214980 buff/cache
KiB Swap: 524284 total, 523760 free, 524 used. 2074200 avail Mem

top - 11:46:39 up 4 days, 23:07, 1 user, load average: 2.49, 1.52, 0.99
Tasks: 114 total, 1 running, 67 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.1 us, 4.4 sy, 0.0 ni, 25.9 id, 25.1 wa, 0.0 hi, 0.7 si, 35.8 st
KiB Mem : 4039472 total, 115944 free, 1708304 used, 2215224 buff/cache
KiB Swap: 524284 total, 523760 free, 524 used. 2074428 avail Mem

top - 11:46:49 up 4 days, 23:08, 1 user, load average: 2.75, 1.60, 1.02
Tasks: 114 total, 2 running, 67 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.2 us, 5.2 sy, 0.0 ni, 26.5 id, 20.5 wa, 0.0 hi, 0.6 si, 39.9 st
KiB Mem : 4039472 total, 115008 free, 1708524 used, 2215940 buff/cache
KiB Swap: 524284 total, 523760 free, 524 used. 2074200 avail Mem


We logged a ticket with Linode support. They migrated this virtual server to another physical server, and server performance was as expected after that.

So, what is the steal time CPU metric?
Steal time is the percentage of time that a virtual CPU of your virtual server spends waiting for a real CPU on the physical server while the hypervisor is busy serving some other virtual server. Virtualization doesn't divide CPU between virtual servers as strictly as it divides memory or some other resources. So, possibly, another virtual server on the same physical server is consuming more CPU cycles than its share, and your virtual server is not getting enough of its own share.
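
To make the metric concrete, here is a minimal Python sketch of how a steal percentage can be derived on Linux. It assumes the aggregate "cpu" line of /proc/stat, whose first eight counters are user, nice, system, idle, iowait, irq, softirq and steal, and it compares two samples taken some time apart. The function names (parse_cpu_line, steal_percent, sample_steal) are our own illustrative choices, not part of any tool, and the result only approximates what top reports.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Turn a 'cpu ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    if len(parts) < 1 + len(FIELDS):
        raise ValueError("kernel does not report a steal counter")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def steal_percent(before, after):
    """Share of CPU ticks between two samples that were stolen."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    return 100.0 * deltas["steal"] / total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (Linux only)."""
    with open(path) as stat_file:
        for line in stat_file:
            if line.startswith("cpu "):
                return line
    raise RuntimeError("no aggregate cpu line found in " + path)


def sample_steal(interval=1.0):
    """Take two samples 'interval' seconds apart and return the steal %."""
    before = parse_cpu_line(read_cpu_line())
    time.sleep(interval)
    after = parse_cpu_line(read_cpu_line())
    return steal_percent(before, after)
```

Calling sample_steal(5.0) on the affected server would give a number comparable to the st column in the top output above. A value that stays high across many samples is a signal worth raising with your hosting provider.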

We checked whether any of our other virtual servers had the same issue, and we found one more. Linode support migrated that server to another physical server as well. The following graph shows the server's performance before and after migration to the new physical server.


Performance improvement after CPU steal is fixed