Top-Down performance analysis methodology.

This article was originally posted here.

This post aims to help people that want to better understand performance bottlenecks in their application. There are many existing methodolgies to do performance anlysis,
but not so many of them are robust and formal. When I was just starting with performance work I usually just profiled the app and tried to grasp through the hotspots of the
benchmark hoping to find something there. This often lead to random experiments with unrolling, vectorization, inlining, you name it. I’m not saying it’s always a loosing
strategy. Sometimes you can be lucky to get big performance boost from random experiments. But usually you need to have very good intuition and luck :).

In this post I show more formal way to do performance analysis. It’s called Top-down Microarchitecture Analysis Method (TMAM)
(Intel® 64 and IA-32 Architectures Optimization Reference Manual, Appendix B.1). In this metodology we try to detect what was stalling our execution starting from the
high-level components (like Front End, Back End, Retiring, Branch predictor) and narrowing down the source of performance inefficiencies.

It’s an iterative process with 2 steps:

  • Identify the type of the performance problem.
  • Locate the exact place in the code using PEBS (precise event).

After fixing performance issue you repeat the process again.

If it doesn’t make sense to you yet, don’t worry, it’ll become clear with the example..

15 Jan 2024 @ 22:59 PM | 0 Comment(s) | Posted by Denis Bakhvalov

What is the magic number behind PELT?

Dietmar from Arm is working on dynamic PELT halflife [lkml](https://lore.kernel.org/lkml/20220829055450.1703092-1-dietmar.eggemann@arm.com/). This is a feature to adjust the PELT halflife dynamically so as to fit different use case. One motivation to use customized PELT halflife is to facilitate frequency ramp up on Android. Literally shorter halflife brings higher crest (before sleeping) and lower trough values (at wakeup). This could result in higher frequency in sched_util. For a typical periodic task of the display pipeline with a runtime/period of 8ms/16ms the peak amplitude is at ~40 for 32ms, at ~80 for 16ms and at ~160 for 8ms. Dietmar points out, the util_avg peak is calculated by: ``` at wakeup : util_avg = (1-y^A)/(1-y^B)*1024*y^(B-A) before sleeping : util_avg = (1-y^A)/(1-y^B)*1024 ``` with common ratio y using different halflife values: ``` y with 32ms: 0.5^(1/32) = 0.978572 y with 16ms: 0.5^(1/16) = 0.957603 y with 8ms: 0.5^(1/8) = 0.917004 ``` and runtime A = 8ms & period B = 16ms (i.e. a 50% task = 512 utilization) ``` 32ms 16ms 8ms at wakeup : 468 424 341 before sleeping : 556 600 683 ------------------ peak amplitude = before sleeping - 512 44 88 171 ``` Let's have a look at how this is calculated: For a task runs A ms every B ms: ``` time --> now ^ | ...|--------|--A ms--| ...|--------B ms-----| ``` before sleep(at lastest time) ``` lastest B ms B ms util_sum = (y^0+y^1+...+y^(A-1) + y^B+y^(B+1)+...y^(B+A-1) +... = ((1-y^A)/(1-y) + y^B(1-y^A)/(1-y) +... = (1-y^A)*(1+y^B+y^2B+...y^nB)/(1-y) = (1-y^A)*(1-y^(n+1)B) / ((1-y^B)*(1-y)) ``` and the util_avg is calculated by: ``` util_avg = util_sum / (y^0+y^1+...y^n) = util_sum * (1 - y)/ (1-y^n) ``` When n is approaching max: ``` = (1-y^A)*1024 / (1-y^B) [1] ``` For a task running 20ms every 200ms, it is 364.8.deng Another scenario is: ``` now ^ | ...|--A ms--|--------| ...|--------B ms-----| ``` That is, [1] has slept for (B-A)ms, thus decay accordingly to: which is y^(B-A) multiply [1]: ``` util_avg = y^(B-A)*(1-y^A)*1024 / (1-y^B) [2] ``` The calculation code: ``` #include #include #include #define HALFLIFE 32 int main(int argc, char *argv[]) { double y, r, p, u1, u2; int run_ms, period_ms; if (argc != 3) { printf("please provide 2 parameters, run_ms and period_ms\n"); return -1; } run_ms = atoi(argv[1]); period_ms = atoi(argv[2]); /* y^HALFLIFE = 0.5 , y^a=b => y=b^(1/b) */ y = pow(0.5, 1/(double)HALFLIFE); r = pow(y, run_ms); p = pow(y, period_ms); u1 = ( 1 - r) * 1024 / ( 1 -p); printf("util_avg big is %f\n", u1); u2 = pow(y, period_ms-run_ms) * u1; printf("util_avg small is %f\n", u2); return 0; } 18 Jan 2024 @ 22:59 PM | 0 Comment(s) | Posted by Linux
CSS & HTML by MLP Design
Licensed under Creative Commons Attribution-NonCommercial 3.0.