doc: proofread
@@ -40,9 +40,8 @@ static void catch_signal(int signal) {
|
|||||||
printf("SIGABRT received\n");
|
printf("SIGABRT received\n");
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
}
|
}
|
||||||
|
```
|
||||||
|
|
||||||
#pagebreak()
|
#pagebreak()
|
||||||
|
|
||||||
|
|||||||
@@ -2,13 +2,13 @@
|
|||||||
|
|
||||||
= Linux System Optimisation
|
= Linux System Optimisation
|
||||||
|
|
||||||
In this laboratory, the usage of `perf` as tool is experimented.
|
In this laboratory, the usage of `#gls("perf", long: false)` as a performance analysis tool is explored.
|
||||||
|
|
||||||
|
|
||||||
== Exercise 1
|
== Exercise 1
|
||||||
|
|
||||||
#task([
|
#task([
|
||||||
Measure the performance of the ex1
|
Measure the performance of `ex1`
|
||||||
],[
|
],[
|
||||||
```
|
```
|
||||||
Performance counter stats for './ex1':
|
Performance counter stats for './ex1':
|
||||||
@@ -28,15 +28,15 @@ Performance counter stats for './ex1':
|
|||||||
0.296158000 seconds sys
|
0.296158000 seconds sys
|
||||||
|
|
||||||
```
|
```
|
||||||
This program has done 22 context-switches and has 40.6s elapsed.
|
This program performs 22 context switches and takes 40.6 seconds to run.
|
||||||
])
|
])
|
||||||
|
|
||||||
#task([
|
#task([
|
||||||
Which error is in the program of ex1 ?
|
What error is present in the `ex1` program?
|
||||||
],[
|
],[
|
||||||
The error lies in how the array memory is accessed. In C, 2D arrays are stored in a "row-major" order, meaning elements of the same row are contiguous in memory. However, the original code accesses the array using `array[j][i]` within the loops, where `j` (the row) is the inner loop.
|
The error lies in how the array memory is accessed. In C, 2D arrays are stored in "row-major" order, meaning elements of the same row are contiguous in memory. However, the original code accesses the array using `array[j][i]` within the loops, where the row index `j` is in the inner loop.
|
||||||
|
|
||||||
This causes the program to jump across memory addresses non-sequentially, triggering a cache miss almost every time. This can be solved by simply swapping the indices to `array[i][j]` (or swapping the loops) to process memory sequentially:
|
This causes the program to jump across memory addresses non-sequentially, triggering a cache miss almost every time. This can be solved by simply swapping the indices to `array[i][j]` (or swapping the loop order) to process memory sequentially:
|
||||||
|
|
||||||
|
|
||||||
```c
|
```c
|
||||||
@@ -50,7 +50,7 @@ This causes the program to jump across memory addresses non-sequentially, trigge
|
|||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
With these modifications the performance must be a multiple of 10.
|
With these modifications, the performance is improved by a factor of nearly 80.
|
||||||
|
|
||||||
```
|
```
|
||||||
Performance counter stats for './optimized':
|
Performance counter stats for './optimized':
|
||||||
@@ -71,16 +71,16 @@ With these modifications the performance must be a multiple of 10.
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
This can be observe by doing the same as before with `perf`. Before the time elapsed was around 40s and now about 0.5s. The same observation can be done with the cache missing:
|
This can be observed by running the same performance analysis with `#gls("perf", long: false)`. The elapsed time drops from around 40 seconds to approximately 0.5 seconds. A similar improvement can be observed in the cache misses:
|
||||||
- optimzed : 753502
|
- optimized : 753,502
|
||||||
- basic : 406627550
|
- basic : 406,627,550
|
||||||
|
|
||||||
|
|
||||||
])
|
])
|
||||||
|
|
||||||
|
|
||||||
#task([
|
#task([
|
||||||
Show l1 cache missing for ex1 :
|
Show `#gls("l1", long: false)` cache misses for `ex1`:
|
||||||
],[
|
],[
|
||||||
#table(
|
#table(
|
||||||
columns: (1.5fr, 1fr),
|
columns: (1.5fr, 1fr),
|
||||||
@@ -96,7 +96,7 @@ This can be observe by doing the same as before with `perf`. Before the time ela
|
|||||||
|
|
||||||
```
|
```
|
||||||
],[
|
],[
|
||||||
Optimzed
|
Optimized
|
||||||
```
|
```
|
||||||
42027157 L1-dcache-load-misses
|
42027157 L1-dcache-load-misses
|
||||||
|
|
||||||
@@ -106,29 +106,29 @@ This can be observe by doing the same as before with `perf`. Before the time ela
|
|||||||
```
|
```
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
There still is a 10 factor as before between the L1 cache misses.
|
There is still an approximate 10-fold difference between the two configurations' `#gls("l1", long: false)` cache misses.
|
||||||
])
|
])
|
||||||
|
|
||||||
|
|
||||||
#task([Event analysed with `perf`:],[
|
#task([Events analysed with `#gls("perf", long: false)`:],[
|
||||||
|
|
||||||
- *Instructions*: It indicates the number of cpu instruction done during the program is running.
|
- *Instructions*: Indicates the total number of `#gls("cpu", long: false)` instructions executed while the program is running.
|
||||||
- *Cache-missing*: This happens when the data used is not currently store in the cache. The ask is passed to the next memory : RAM.
|
- *Cache-misses*: This occurs when the required data is not currently stored in the cache hierarchy, forcing the processor to fetch it from slower main memory (`#gls("ram", long: false)`).
|
||||||
- *Branch-misses*: It happens when there is conditional branch. The CPU tries to predict the next instruction and misses.
|
- *Branch-misses*: Occurs during conditional branching when the `#gls("cpu", long: false)`'s branch predictor incorrectly guesses the next instruction path, resulting in pipeline flushes.
|
||||||
- *L1-dcache-load-misses*: It happens when the data is not store in the cache L1. It has the next memory technology, here cache L2.
|
- *L1-dcache-load-misses*: Occurs when the requested data is not present in the Level 1 Data Cache (`#gls("l1", long: false)` dcache), requiring a lookup in the next cache level (`#gls("l2", long: false)` cache).
|
||||||
- *Cpu-migrations*: It indicates the number of times the program has changed of CPU core.
|
- *CPU-migrations*: Indicates the number of times the operating system scheduler moved the program threads from one `#gls("cpu", long: false)` core to another.
|
||||||
- *Context-switches*: The program is sharing the resource with others. Sometimes, it less the cpu core to another. This involves a context-switching. It has to change some register like the PC.
|
- *Context-switches*: Occurs when the process relinquishes the `#gls("cpu", long: false)` core to allow other processes to run. This context switch requires saving and restoring processor registers, including the `#gls("pc", long: false)`.
|
||||||
|
|
||||||
])
|
])
|
||||||
|
|
||||||
|
|
||||||
#task([Timing performance of `perf`], [
|
#task([Timing performance of `#gls("perf", long: false)`], [
|
||||||
There is some executions of the optimized program:
|
Below are several execution times for the optimized program:
|
||||||
|
|
||||||
#figure(table(
|
#figure(table(
|
||||||
columns: (1fr, 1fr),
|
columns: (1fr, 1fr),
|
||||||
// stroke: none,
|
// stroke: none,
|
||||||
[*Without `perf`*], [*With `perf`*],
|
[*Without `#gls("perf", long: false)`*], [*With `#gls("perf", long: false)`*],
|
||||||
[
|
[
|
||||||
```
|
```
|
||||||
real 0m 4.44s
|
real 0m 4.44s
|
||||||
@@ -160,22 +160,22 @@ sys 0m 0.34s
|
|||||||
```
|
```
|
||||||
],
|
],
|
||||||
),
|
),
|
||||||
caption:[Impact of the tool `perf`]
|
caption:[Impact of the `#gls("perf", long: false)` tool]
|
||||||
)<impact-perf>
|
)<impact-perf>
|
||||||
|
|
||||||
In @impact-perf, the tool does not significantly affect program execution. It is certainly due to the CPU allocations.
|
As seen in @impact-perf, running the program with `#gls("perf", long: false)` does not introduce a significant performance overhead, which can be attributed to stable `#gls("cpu", long: false)` core scheduling and allocation.
|
||||||
|
|
||||||
])
|
])
|
||||||
|
|
||||||
== Exercise 2
|
== Exercise 2
|
||||||
|
|
||||||
The program fills an array of random between 0 and 512. Then it iterates 10'000 times over all the array to make a sum of all number generated equal or bigger than 256.
|
The program fills an array with random numbers between 0 and 512. Then, it iterates 10,000 times over the entire array to sum all elements that are greater than or equal to 256.
|
||||||
|
|
||||||
|
|
||||||
#figure(
|
#figure(
|
||||||
table(
|
table(
|
||||||
columns: (1fr),
|
columns: (1fr),
|
||||||
[Withtout Optimization],
|
[Without Optimisation],
|
||||||
[
|
[
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -193,7 +193,7 @@ The program fills an array of random between 0 and 512. Then it iterates 10'000
|
|||||||
26.117025000 seconds user
|
26.117025000 seconds user
|
||||||
0.003961000 seconds sys
|
0.003961000 seconds sys
|
||||||
```
|
```
|
||||||
], [With "sort" optimization],[
|
], [With "sort" optimisation],[
|
||||||
```
|
```
|
||||||
23430.74 msec task-clock
|
23430.74 msec task-clock
|
||||||
17 context-switches # 0.726 /sec
|
17 context-switches # 0.726 /sec
|
||||||
@@ -211,18 +211,18 @@ The program fills an array of random between 0 and 512. Then it iterates 10'000
|
|||||||
```
|
```
|
||||||
]
|
]
|
||||||
),
|
),
|
||||||
caption:[Ex02 timing optimization]
|
caption:[Ex02 timing optimisation]
|
||||||
)<sort-optimization>
|
)<sort-optimization>
|
||||||
|
|
||||||
In @sort-optimization, there is a gain of around 3s due to a massive decrease in branch misses, dropping from 33.17% to 0.08%.
|
In @sort-optimization, there is a gain of around 3 seconds due to a massive decrease in branch misses, dropping from 33.17% to 0.08%.
|
||||||
|
|
||||||
This is explained by the CPU's Branch Predictor. Inside the loop, the program checks if the value is `>= 256`. When the array is filled with random numbers, the CPU cannot predict the outcome of this condition, resulting in frequent pipeline flushes. However, when the array is sorted, the condition is always false for the first half of the array, and always true for the second half. The CPU easily predicts this pattern, avoiding branch misses and executing much faster.
|
This is explained by the `#gls("cpu", long: false)`'s branch predictor. Inside the loop, the program checks if the value is `>= 256`. When the array is filled with random numbers, the processor cannot predict the outcome of this condition, resulting in frequent pipeline flushes. However, when the array is sorted, the condition is always false for the first half of the array, and always true for the second half. The `#gls("cpu", long: false)` easily predicts this pattern, avoiding branch misses and executing much faster.
|
||||||
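The effect described above can be reproduced with a minimal sketch. The helper names `sum_at_least`, `make_random_data` and `sorted_copy` are hypothetical and are not taken from the lab sources; the value range $[0, 512)$ is also an assumption. Both arrays hold the same values, so the sums are identical and only the predictability of the `if` changes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Sum every element that is >= threshold. The `if` below is the conditional
// branch whose outcome the CPU's branch predictor has to guess.
std::int64_t sum_at_least(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int v : data) {
        if (v >= threshold) {
            sum += v;
        }
    }
    return sum;
}

// Hypothetical helper: fill a vector with pseudo-random values in [0, 512).
std::vector<int> make_random_data(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 511);
    std::vector<int> data(n);
    for (int& v : data) {
        v = dist(rng);
    }
    return data;
}

// Hypothetical helper: return a sorted copy, so that the comparison is false
// for the first run of elements and true for the remainder.
std::vector<int> sorted_copy(std::vector<int> data) {
    std::sort(data.begin(), data.end());
    return data;
}
```

Calling `sum_at_least` repeatedly on the random data and on its sorted copy, each under `perf stat`, should expose the difference in branch misses, provided the compiler keeps the branch rather than replacing it with branchless code.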
|
|
||||||
The same test was done with the `-01` compiler flag and there is almost no difference between the two scipts. The optimzed is around 4.12s and the basic is around 4.6s. The difference of 0.6 sec can be explained with the sort algorithm used in the optimized script, because this is the only difference.
|
The same test was performed with the `-O1` optimisation flag, and there is almost no difference between the two scripts. The optimized version takes around 4.12s and the basic version around 4.6s. The difference of roughly 0.5 seconds can be explained by the sorting algorithm itself in the optimized version, as sorting is the only added operation.
|
||||||
|
|
||||||
|
|
||||||
== Exercise 3
|
== Exercise 3
|
||||||
By analyzing the call graph with `perf report`, we can trace the indirect calls to `std::operator==<char>` back to our application. The bottleneck originates in the `HostCounter::isNewHost` function, specifically during the `std::find` operation on a `std::vector`:
|
By analysing the call graph with `#gls("perf", long: false) report`, we can trace the indirect calls to `std::operator==<char>` back to our application. The bottleneck originates in the `HostCounter::isNewHost` function, specifically during the `std::find` operation on a `std::vector`:
|
||||||
|
|
||||||
```c
|
```c
|
||||||
bool HostCounter::isNewHost(std::string hostname)
|
bool HostCounter::isNewHost(std::string hostname)
|
||||||
@@ -234,7 +234,7 @@ bool HostCounter::isNewHost(std::string hostname)
|
|||||||
Searching through an unsorted vector requires a linear comparison of strings ($O(N)$ complexity), which is highly inefficient. As shown below, processing just a sample of the logs takes over 2 minutes:
|
Searching through an unsorted vector requires a linear comparison of strings ($O(N)$ complexity), which is highly inefficient. As shown below, processing just a sample of the logs takes over 2 minutes:
|
||||||
|
|
||||||
```
|
```
|
||||||
# time ./read-apache-logs access_log_NASA_Jul95_samples
|
|> time ./read-apache-logs access_log_NASA_Jul95_samples
|
||||||
Processing log file access_log_NASA_Jul95_samples
|
Processing log file access_log_NASA_Jul95_samples
|
||||||
Found 14867 unique Hosts/IPs
|
Found 14867 unique Hosts/IPs
|
||||||
real 2m 15.58s
|
real 2m 15.58s
|
||||||
@@ -242,17 +242,17 @@ user 2m 14.68s
|
|||||||
sys 0m 0.12s
|
sys 0m 0.12s
|
||||||
|
|
||||||
```
|
```
|
||||||
To fix this, the data structure needs to be changed from std::vector to std::set. A set uses a tree-based or hash-based structure, reducing the search complexity to $O(log N)$ or $O(1)$.
|
To fix this, the data structure must be changed from `std::vector` to `std::set`. A set uses a tree-based structure, reducing the search complexity to $O(log N)$ (a hash-based `std::unordered_set` would give $O(1)$ on average).
|
||||||
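A minimal sketch of the set-based lookup is given here. The class `HostSet` and its member `hosts_` are hypothetical stand-ins, since the real `HostCounter` members are not shown; the sketch also assumes that checking and recording a hostname can be merged into a single call.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-in for the lab's HostCounter; only the lookup is sketched.
class HostSet {
public:
    // Returns true if the hostname was not seen before, and records it.
    bool isNewHost(const std::string& hostname) {
        // std::set::insert returns a pair {iterator, bool}; the bool is true
        // only when the element was actually inserted. The lookup and the
        // insertion share a single O(log N) tree traversal.
        return hosts_.insert(hostname).second;
    }

    // Number of distinct hostnames recorded so far.
    std::size_t uniqueHosts() const { return hosts_.size(); }

private:
    std::set<std::string> hosts_;
};
```

Passing the hostname by `const` reference also avoids copying the string on every call, which the by-value signature would otherwise require.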
|
|
||||||
|
|
||||||
#figure(
|
#figure(
|
||||||
image("command-after-optimization.png"),
|
image("command-after-optimization.png"),
|
||||||
caption:[ `perf` report after migrating to `std::set`]
|
caption:[ `#gls("perf", long: false)` report after migrating to `std::set`]
|
||||||
)<command-opti>
|
)<command-opti>
|
||||||
|
|
||||||
After applying the changes, the perf report in @command-opti shows a much healthier execution profile. The execution time drops drastically, creating a massive performance gap compared to the initial vector implementation:
|
After applying these changes, the `#gls("perf", long: false)` report in @command-opti shows a much healthier execution profile. The execution time drops drastically, creating a massive performance gap compared to the initial `std::vector` implementation:
|
||||||
```
|
```
|
||||||
# time ./read-apache-logs access_log_NASA_Jul95_samples
|
|> time ./read-apache-logs access_log_NASA_Jul95_samples
|
||||||
Processing log file access_log_NASA_Jul95_samples
|
Processing log file access_log_NASA_Jul95_samples
|
||||||
Found 14867 unique Hosts/IPs
|
Found 14867 unique Hosts/IPs
|
||||||
real 0m 1.55s
|
real 0m 1.55s
|
||||||
@@ -262,10 +262,9 @@ sys 0m 0.10s
|
|||||||
|
|
||||||
Even when processing the entire log file containing roughly 2 million entries, the optimized program finishes in under 15 seconds:
|
Even when processing the entire log file containing roughly 2 million entries, the optimized program finishes in under 15 seconds:
|
||||||
```
|
```
|
||||||
# time ./read-apache-logs access_log_NASA_Jul95
|
|> time ./read-apache-logs access_log_NASA_Jul95
|
||||||
askljdalksjda
|
|
||||||
Processing log file access_log_NASA_Jul95
|
Processing log file access_log_NASA_Jul95
|
||||||
Found 81983 unique Hosts/IPs
|
Found 81983 unique Hosts/IPs
|
||||||
real 0m 14.76s
|
real 0m 14.76s
|
||||||
user 0m 13.90s
|
user 0m 13.90s
|
||||||
sys 0m 0.68s
|
sys 0m 0.68s
|
||||||
@@ -273,12 +272,12 @@ sys 0m 0.68s
|
|||||||
|
|
||||||
|
|
||||||
#task([Measure interruption latency and jitter], [
|
#task([Measure interruption latency and jitter], [
|
||||||
The hardware approach is chosen with an oscilloscope and a square-wave generator.
|
To measure latency and jitter, a hardware-based approach using an oscilloscope and a square-wave generator was implemented.
|
||||||
First, the generator toggles a processor pin to trigger the interrupt routine. Then, another pin creates a pulse as a response, which is measured by the oscilloscope. The latency is the delay between the generator's rising edge and the response pulse. The jitter is the variation of this latency over multiple measurements.
|
First, the generator toggles a processor pin to trigger the interrupt routine. Then, another pin creates a pulse as a response, which is measured by the oscilloscope. The latency is the delay between the generator's rising edge and the response pulse. The jitter is the variation of this latency over multiple measurements.
|
||||||
|
|
||||||
To differentiate between Kernel Space and User Space:
|
To differentiate between Kernel Space and User Space:
|
||||||
- *Kernel Space*: The response pin is toggled directly inside the kernel's Interrupt Service Routine (IRQ handler / driver).
|
- *Kernel Space*: The response pin is toggled directly inside the kernel's Interrupt Service Routine (`#gls("irq", long: false)` handler / driver).
|
||||||
- *User Space*: The response pin is toggled by a user application that wakes up (using `epoll()`) after the kernel has handled the interrupt.
|
- *User Space*: The response pin is toggled by a user application that wakes up (using `#gls("epoll", long: false)()`) after the kernel has handled the interrupt.
|
||||||
|
|
||||||
The difference between these two latency measurements represents the context-switch overhead from kernel mode to user mode.
|
The difference between these two latency measurements represents the context-switch overhead from kernel mode to user mode.
|
||||||
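Once a series of latencies has been read from the oscilloscope, the figures can be summarised as sketched here. The names `LatencyStats` and `summarize` are hypothetical, and jitter is assumed to be the peak-to-peak variation (maximum minus minimum); other definitions, such as the standard deviation, are equally possible.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical summary of a latency measurement series, in microseconds.
struct LatencyStats {
    double min_us;
    double max_us;
    double mean_us;
    double jitter_us;  // assumed definition: peak-to-peak (max - min)
};

// Compute min, max, mean and peak-to-peak jitter of the measured latencies.
// An empty series yields all-zero statistics.
LatencyStats summarize(const std::vector<double>& latencies_us) {
    if (latencies_us.empty()) {
        return LatencyStats{0.0, 0.0, 0.0, 0.0};
    }
    const auto [min_it, max_it] =
        std::minmax_element(latencies_us.begin(), latencies_us.end());
    const double total =
        std::accumulate(latencies_us.begin(), latencies_us.end(), 0.0);
    const double mean = total / static_cast<double>(latencies_us.size());
    return LatencyStats{*min_it, *max_it, mean, *max_it - *min_it};
}
```

Running `summarize` once on the kernel-space series and once on the user-space series gives two mean latencies whose difference estimates the overhead discussed above.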
])
|
])
|
||||||
|
|||||||
@@ -41,6 +41,34 @@
|
|||||||
description: "The primary component of a computer that performs most of the processing inside the computer, executing instructions of computer programs.",
|
description: "The primary component of a computer that performs most of the processing inside the computer, executing instructions of computer programs.",
|
||||||
group: "Hardware"
|
group: "Hardware"
|
||||||
),
|
),
|
||||||
|
(
|
||||||
|
key: "l1",
|
||||||
|
short: "L1",
|
||||||
|
long: "Level 1 Cache",
|
||||||
|
description: "The primary cache of a CPU, typically built directly into the processor chip, representing the fastest but smallest cache level closest to the execution units.",
|
||||||
|
group: "Hardware"
|
||||||
|
),
|
||||||
|
(
|
||||||
|
key: "l2",
|
||||||
|
short: "L2",
|
||||||
|
long: "Level 2 Cache",
|
||||||
|
description: "A secondary cache that is larger but slightly slower than the L1 cache, serving to catch cache misses from the L1 cache before querying system memory.",
|
||||||
|
group: "Hardware"
|
||||||
|
),
|
||||||
|
(
|
||||||
|
key: "ram",
|
||||||
|
short: "RAM",
|
||||||
|
long: "Random-Access Memory",
|
||||||
|
description: "A form of volatile computer memory that can be read and changed in any order, used to store working data and machine code currently in use.",
|
||||||
|
group: "Hardware"
|
||||||
|
),
|
||||||
|
(
|
||||||
|
key: "pc",
|
||||||
|
short: "PC",
|
||||||
|
long: "Program Counter",
|
||||||
|
description: "A processor register that indicates where the computer is in its program sequence, holding the address of the next instruction to be executed.",
|
||||||
|
group: "Hardware"
|
||||||
|
),
|
||||||
(
|
(
|
||||||
key: "led",
|
key: "led",
|
||||||
short: "LED",
|
short: "LED",
|
||||||
@@ -61,7 +89,15 @@
|
|||||||
short: "PID",
|
short: "PID",
|
||||||
plural: "PIDs",
|
plural: "PIDs",
|
||||||
long: "Process Identifier",
|
long: "Process Identifier",
|
||||||
description: "A unique number assigned by the operating system kernel to identify an active process.",
|
description: "A unique numerical identifier assigned by the operating system kernel to each active process, used for managing, scheduling, and tracking processes.",
|
||||||
|
group: "Operating System"
|
||||||
|
),
|
||||||
|
(
|
||||||
|
key: "irq",
|
||||||
|
short: "IRQ",
|
||||||
|
plural: "IRQs",
|
||||||
|
long: "Interrupt Request",
|
||||||
|
description: "A signal sent to the processor that temporarily suspends the current program execution to allow an Interrupt Service Routine (ISR) to run in response to a hardware event.",
|
||||||
group: "Operating System"
|
group: "Operating System"
|
||||||
),
|
),
|
||||||
(
|
(
|
||||||
@@ -92,6 +128,14 @@
|
|||||||
description: "The communication between an information processing system (such as a computer) and the outside world.",
|
description: "The communication between an information processing system (such as a computer) and the outside world.",
|
||||||
group: "Computer Science"
|
group: "Computer Science"
|
||||||
),
|
),
|
||||||
|
(
|
||||||
|
key: "ip",
|
||||||
|
short: "IP",
|
||||||
|
plural: "IPs",
|
||||||
|
long: "Internet Protocol",
|
||||||
|
description: "The principal communications protocol in the Internet protocol suite for relaying datagrams across network boundaries.",
|
||||||
|
group: "Computer Science"
|
||||||
|
),
|
||||||
(
|
(
|
||||||
key: "oom",
|
key: "oom",
|
||||||
short: "OOM",
|
short: "OOM",
|
||||||
@@ -113,6 +157,13 @@
|
|||||||
description: "A standard protocol and utility for system message logging in UNIX and Linux systems, allowing applications to log messages to files, consoles, or remote syslog daemons.",
|
description: "A standard protocol and utility for system message logging in UNIX and Linux systems, allowing applications to log messages to files, consoles, or remote syslog daemons.",
|
||||||
group: "Operating System"
|
group: "Operating System"
|
||||||
),
|
),
|
||||||
|
(
|
||||||
|
key: "perf",
|
||||||
|
short: "perf",
|
||||||
|
long: "Performance Events for Linux",
|
||||||
|
description: "A powerful performance monitoring and analysis tool in Linux, capable of profiling hardware performance counters, tracepoints, software performance counters, and dynamic probes.",
|
||||||
|
group: "Operating System"
|
||||||
|
),
|
||||||
(
|
(
|
||||||
key: "epoll",
|
key: "epoll",
|
||||||
short: "epoll",
|
short: "epoll",
|
||||||
|
|||||||