185 lines
5.5 KiB
Typst
185 lines
5.5 KiB
Typst
#import "/doc/metadata.typ": *
|
|
|
|
= Optimization
|
|
|
|
In this laboratory, the usage of `perf` as tool is experimented.
|
|
|
|
|
|
|
|
```
|
|
Performance counter stats for './ex1':
|
|
|
|
40609.10 msec task-clock # 1.000 CPUs utilized
|
|
22 context-switches # 0.542 /sec
|
|
0 cpu-migrations # 0.000 /sec
|
|
48867 page-faults # 1.203 K/sec
|
|
33136692484 cycles # 0.816 GHz
|
|
1671194529 instructions # 0.05 insn per cycle
|
|
269592231 branches # 6.639 M/sec
|
|
1013366 branch-misses # 0.38% of all branches
|
|
|
|
40.618926728 seconds time elapsed
|
|
|
|
39.901620000 seconds user
|
|
0.296158000 seconds sys
|
|
|
|
```
|
|
This program has done 22 context-switches and has 40.6s elapsed.
|
|
|
|
#task([
|
|
Measure the performance of the ex1
|
|
],[
|
|
```
|
|
Performance counter stats for './ex1':
|
|
|
|
40609.10 msec task-clock # 1.000 CPUs utilized
|
|
22 context-switches # 0.542 /sec
|
|
0 cpu-migrations # 0.000 /sec
|
|
48867 page-faults # 1.203 K/sec
|
|
33136692484 cycles # 0.816 GHz
|
|
1671194529 instructions # 0.05 insn per cycle
|
|
269592231 branches # 6.639 M/sec
|
|
1013366 branch-misses # 0.38% of all branches
|
|
|
|
40.618926728 seconds time elapsed
|
|
|
|
39.901620000 seconds user
|
|
0.296158000 seconds sys
|
|
|
|
```
|
|
This program has done 22 context-switches and has 40.6s elapsed.
|
|
])
|
|
|
|
#task([
|
|
Which error is in the program of ex1 ?
|
|
],[
|
|
The program has 2 loops to go trhough the array. But, there is another loops which encapsulate the 2 others. It involves that the whole array is iterated through 10 times for an addition operation. That's the problem. This can be solve by removing the extren loop and putting a addition of 10:
|
|
|
|
```c
|
|
int i, j;
|
|
for (i = 0; i < SIZE; i++)
|
|
{
|
|
for (j = 0; j < SIZE; j++)
|
|
{
|
|
array[j][i]+= 10;
|
|
}
|
|
}
|
|
```
|
|
|
|
With these modifications the performance must be a multiple of 10.
|
|
|
|
```
|
|
Performance counter stats for './optimized':
|
|
|
|
4759.07 msec task-clock # 0.998 CPUs utilized
|
|
20 context-switches # 4.203 /sec
|
|
0 cpu-migrations # 0.000 /sec
|
|
48866 page-faults # 10.268 K/sec
|
|
3883198165 cycles # 0.816 GHz
|
|
282691820 instructions # 0.07 insn per cycle
|
|
40234737 branches # 8.454 M/sec
|
|
653642 branch-misses # 1.62% of all branches
|
|
|
|
4.768030627 seconds time elapsed
|
|
|
|
4.385881000 seconds user
|
|
0.320226000 seconds sys
|
|
```
|
|
|
|
This can be observe by doing the same as before with `perf`. Before the time elapsed was around 40s and now about 4.7s. The same observation can be done with the cache missing:
|
|
- optimzed : 42103472
|
|
- basic : 406627550
|
|
|
|
|
|
])
|
|
|
|
|
|
#task([
|
|
Show l1 cache missing for ex1 :
|
|
],[
|
|
#table(
|
|
columns: (1.5fr, 1fr),
|
|
stroke: none,
|
|
[
|
|
Not optimized
|
|
```
|
|
407036282 L1-dcache-load-misses
|
|
|
|
39.868545227 seconds time elapsed
|
|
39.115950000 seconds user
|
|
0.347522000 seconds sys
|
|
|
|
```
|
|
],[
|
|
Optimzed
|
|
```
|
|
42027157 L1-dcache-load-misses
|
|
|
|
4.132272210 seconds time elapsed
|
|
3.778635000 seconds user
|
|
0.296472000 seconds sys
|
|
```
|
|
]
|
|
)
|
|
There still is a 10 factor as before between the L1 cache misses.
|
|
])
|
|
|
|
|
|
#task([Event analysed with `perf`:],[
|
|
|
|
- *Instructions*: It indicates the number of cpu instruction done during the program is running.
|
|
- *Cache-missing*: This happens when the data used is not currently store in the cache. The ask is passed to the next memory : RAM.
|
|
- *Branch-misses*: It happens when there is conditional branch. The CPU tries to predict the next instruction and misses.
|
|
- *L1-dcache-load-misses*: It happens when the data is not store in the cache L1. It has the next memory technology, here cache L2.
|
|
- *Cpu-migrations*: It indictes the number of times the program has changed of CPU thread.
|
|
- *Context-switches*: The program is sharing the resource with others. Sometimes, it less the cpu core to another. This involves a context-switching. It has to change some register like the PC.
|
|
|
|
])
|
|
|
|
|
|
#task([Timing performance of `perf`], [
|
|
There is some executions of the optimized program:
|
|
|
|
#figure(table(
|
|
columns: (1fr, 1fr),
|
|
// stroke: none,
|
|
[*Without `perf`*], [*With `perf`*],
|
|
[
|
|
```
|
|
real 0m 4.44s
|
|
user 0m 3.83s
|
|
sys 0m 0.29s
|
|
|
|
```
|
|
],
|
|
[
|
|
```
|
|
real 0m 4.38s
|
|
user 0m 4.05s
|
|
sys 0m 0.27s
|
|
|
|
```
|
|
],[
|
|
```
|
|
real 0m 4.75s
|
|
user 0m 4.09s
|
|
sys 0m 0.34s
|
|
|
|
```
|
|
],[
|
|
```
|
|
real 0m 4.75s
|
|
user 0m 4.09s
|
|
sys 0m 0.34s
|
|
|
|
```
|
|
],
|
|
),
|
|
caption:[Impact of the tool `perf`]
|
|
)<impact-perf>
|
|
|
|
In @impact-perf, the tool does not significantly affect program execution. It is certainly due to the CPU allocations.
|
|
|
|
])
|
|
|