Perforator

Perforator is a tool for recording performance metrics over subregions of a program (e.g., functions) using the Linux "perf" interface. The perf tool provided by the Linux kernel only supports collecting statistics over the complete lifetime of a program, which is often inconvenient when a program includes setup and cleanup that should not be profiled along with the benchmark. Perforator is not as comprehensive as perf but it allows you to collect statistics for individual functions or address ranges.

Perforator only supports Linux AMD64. The target ELF binary may be generated from any language. For function lookup, make sure the binary is not stripped (it must contain a symbol table), and for additional information (source code regions, inlined function lookup), the binary must include DWARF information. Perforator supports position-independent binaries.

Perforator is primarily intended to be used as a CLI tool, but includes a library for more general user-code tracing called utrace, a library for reading ELF/DWARF information from executables, and a library for tracing perf events in processes.

Installation

There are three ways to install Perforator.

1. Download the prebuilt binary from the releases page.
2. Install from source:

git clone https://github.com/zyedidia/perforator
cd perforator
make build # or make install to install to $GOBIN

3. Install with go get (version info will be missing):

go get github.com/zyedidia/perforator/cmd/perforator

Usage

First make sure that you have the appropriate permissions to record the events you are interested in (this may require running Perforator with sudo or modifying /proc/sys/kernel/perf_event_paranoid -- see this post). If Perforator still can't find any events, double check that your system supports the perf_event_open system call (try installing the perf tool from the Linux kernel).
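As a rough illustration of the permission check described above, the following C sketch reads the current perf_event_paranoid setting. The helper name read_paranoid_level is hypothetical and is not part of Perforator; the path is passed as an argument so the function can be tried on any file. Lower values are generally more permissive, but the exact meaning of each level depends on your kernel, so consult its documentation.

```c
#include <stdio.h>

/* Hypothetical helper (not part of Perforator): reads a single integer
 * setting from a file such as /proc/sys/kernel/perf_event_paranoid.
 * Returns 0 on success and stores the value in *level; returns -1 if the
 * file cannot be opened or does not start with an integer. */
int read_paranoid_level(const char *path, int *level) {
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    int parsed = (fscanf(f, "%d", level) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}
```

Calling read_paranoid_level("/proc/sys/kernel/perf_event_paranoid", &level) before launching a benchmark gives an early hint about whether sudo or a settings change is likely to be needed.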

Example

Suppose we have a C function that sums an array, and we want to benchmark it on a large array of numbers. We could write a small benchmark program like so:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>

#define SIZE 10000000

uint64_t sum(uint32_t* numbers) {
    uint64_t sum = 0;
    for (int i = 0; i < SIZE; i++) {
        sum += numbers[i];
    }
    return sum;
}

int main() {
    srand(time(NULL));
    uint32_t* numbers = malloc(SIZE * sizeof(uint32_t));
    for (int i = 0; i < SIZE; i++) {
        numbers[i] = rand();
    }

    uint64_t result = sum(numbers);
    printf("%lu\n", result);
    return 0;
}

If we want to count cache misses, branch mispredictions, and so on, perf is not suitable: running perf stat on this program profiles the creation of the array in addition to the sum. With Perforator, we can measure just the sum.

Profiling functions

First, compile the benchmark:

$ gcc -g -O2 -o bench bench.c

Now we can measure with Perforator:

$ perforator -r sum ./bench
+---------------------+-------------+
| Event               | Count (sum) |
+---------------------+-------------+
| instructions        | 50000004    |
| branch-instructions | 10000002    |
| branch-misses       | 10          |
| cache-references    | 1246340     |
| cache-misses        | 14984       |
| time-elapsed        | 4.144814ms  |
+---------------------+-------------+
10736533065142551

Results are printed immediately when the profiled function returns.

Note: in this case we compiled with -g to include DWARF debugging information. This was necessary because GCC will inline the call to sum, so Perforator needs to be able to read the DWARF information to determine where it was inlined to. If you compile without -g make sure the target function is not being inlined (either you know it is not inlined, or you mark it with the noinline attribute).
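For the case where you compile without -g, here is a minimal sketch of marking the target function with the noinline attribute (GCC and Clang syntax). It is a variant of the sum function above, not the benchmark's exact code: the length parameter n is an addition made here so the function can be called on arrays of any size.

```c
#include <stddef.h>
#include <stdint.h>

/* noinline asks GCC/Clang to keep sum as a real function with its own
 * symbol-table entry, so it can be found by name without DWARF inline
 * information. The parameter n is an assumption of this sketch; the
 * benchmark above uses the SIZE macro instead. */
__attribute__((noinline))
uint64_t sum(const uint32_t *numbers, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += numbers[i];
    }
    return total;
}
```

With the function kept out of line, a region such as -r sum should resolve from the symbol table alone, provided the binary is not stripped.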

Fun fact: clang does a better job optimizing this code than gcc. I tried running this example with clang instead and found it only had 1,250,000 branch instructions (roughly 8x fewer than gcc!). The reason: vector instructions.

Events

By default, Perforator measures some basic events: instructions executed, cache references, cache misses, branches, and branch misses. You can specify events yourself with the -e flag:

$ perforator -e l1d-read-accesses,l1d-read-misses -r sum ./bench
+-------------------+-------------+
| Event             | Count (sum) |
+-------------------+-------------+
| l1d-read-accesses | 10010311    |
| l1d-read-misses   | 625399      |
| time-elapsed      | 4.501523ms  |
+-------------------+-------------+
10736888439771461

To view available events, use the --list flag:

$ perforator --list hardware # List hardware events
$ perforator --list software # List software events
$ perforator --list cache    # List cache events
$ perforator --list trace    # List kernel trace events

Detailed documentation for each event is available in the manual page for Perforator. See the perforator.1 manual included with the prebuilt binary. The man directory in the source code contains the Markdown source, which can be compiled using Pandoc (via make perforator.1).

Source code regions

In addition to profiling functions, you may profile regions specified by source code ranges if your binary has DWARF debugging information.

$ perforator -r bench.c:18-bench.c:23 ./bench
+---------------------+-------------------------------+
| Event               | Count (bench.c:18-bench.c:23) |
+---------------------+-------------------------------+
| instructions        | 668794280                     |
| branch-instructions | 169061639                     |
| branch-misses       | 335360                        |
| cache-references    | 945581                        |
| cache-misses        | 3569                          |
| time-elapsed        | 78.433272ms                   |
+---------------------+-------------------------------+
10737167007294257

Only certain line numbers are available for breakpoints. The range is exclusive on the upper bound, meaning that in the example above bench.c:23 is not included in profiling.

You may also directly specify addresses as decimal or hexadecimal numbers. This is useful if you don't have DWARF information but you know the addresses you want to profile (for example, by inspecting the disassembly via objdump).

Multiple regions

You can also profile multiple regions at once:

$ perforator -r bench.c:18-bench.c:23 -r sum -r main ./bench
+---------------------+-------------------------------+
| Event               | Count (bench.c:18-bench.c:23) |
+---------------------+-------------------------------+
| instructions        | 697120715                     |
| branch-instructions | 162949718                     |
| branch-misses       | 302849                        |
| cache-references    | 823087                        |
| cache-misses        | 3645                          |
| time-elapsed        | 78.832332ms                   |
+---------------------+-------------------------------+
+---------------------+-------------+
| Event               | Count (sum) |
+---------------------+-------------+
| instructions        | 49802557    |
| branch-instructions | 10000002    |
| branch-misses       | 9           |
| cache-references    | 1246639     |
| cache-misses        | 14382       |
| time-elapsed        | 4.235705ms  |
+---------------------+-------------+
10739785644063349
+---------------------+--------------+
| Event               | Count (main) |
+---------------------+--------------+
| instructions        | 675150939    |
| branch-instructions | 184259174    |
| branch-misses       | 386503       |
| cache-references    | 1128637      |
| cache-misses        | 8368         |
| time-elapsed        | 83.132829ms  |
+---------------------+--------------+

In this case, it may be useful to use the --summary option, which will aggregate all results into a table that is printed when tracing stops.

$ perforator --summary -r bench.c:18-bench.c:23 -r sum -r main ./bench
10732787118410148
+-----------------------+--------------+---------------------+---------------+------------------+--------------+--------------+
| region                | instructions | branch-instructions | branch-misses | cache-references | cache-misses | time-elapsed |
+-----------------------+--------------+---------------------+---------------+------------------+--------------+--------------+
| bench.c:18-bench.c:23 | 718946520    | 172546336           | 326000        | 833098           | 3616         | 81.798381ms  |
| main                  | 678365328    | 174259806           | 363737        | 1115394          | 4403         | 86.321344ms  |
| sum                   | 43719896     | 10000002            | 9             | 1248069          | 16931        | 4.453342ms   |
+-----------------------+--------------+---------------------+---------------+------------------+--------------+--------------+

You can use the --sort-key and --reverse-sort options to modify which columns are sorted and how. In addition, you can use the --csv option to write the output table in CSV form.

Note: to an astute observer, the results from the above table don't look very accurate. In particular the totals for the main function seem questionable. This is due to event multiplexing (explained more below), and for best results you should not profile multiple regions simultaneously. In the table above, you can see that it's likely that profiling for main was disabled while sum was running.

Groups

The CPU has a fixed number of performance counters. If you try recording more events than there are counters, "multiplexing" will be performed to estimate the totals for all the events. For example, if we record 6 events on the sum benchmark, the instruction count becomes less stable. This is because the number of events now exceeds the number of hardware registers for counting, and multiplexing occurs. To ensure that certain events are always counted together, you can put them all in a group with the -g option. The -g option has the same syntax as the -e option, but may be specified multiple times (for multiple groups).
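To make the idea of a multiplexing estimate concrete, the sketch below shows the scaling conventionally applied to perf counters: the raw count is multiplied by the ratio of the time the event was enabled to the time it was actually running on a hardware counter. This is an assumption about the general technique, not a description of Perforator's internals, and the function name scale_count is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: estimates the full count for an event that was only
 * scheduled on a hardware counter for part of the time it was enabled.
 *   raw          - count observed while the event was actually running
 *   time_enabled - total time the event was enabled
 *   time_running - time the event actually occupied a counter
 * Returns 0 when the event never ran, since no estimate is possible. */
uint64_t scale_count(uint64_t raw, uint64_t time_enabled, uint64_t time_running) {
    if (time_running == 0) {
        return 0;
    }
    if (time_running >= time_enabled) {
        return raw; /* no multiplexing occurred; the raw count is exact */
    }
    /* long double keeps the intermediate product from overflowing;
     * adding 0.5 rounds to the nearest integer. */
    long double scaled = (long double)raw * (long double)time_enabled
                         / (long double)time_running;
    return (uint64_t)(scaled + 0.5L);
}
```

For example, a counter that observed 1000 events while running for half of the enabled time is estimated at 2000. The estimate assumes events occur at a uniform rate, which is why short or bursty regions give less stable results under multiplexing.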

Notes and caveats

Tip: enable verbose mode with the -V flag when you are not seeing the expected result.
Many CPUs expose additional/non-standardized raw perf events. Perforator does not currently support those events.
Perforator has only limited support for multithreaded programs. It can profile a program with multiple threads only if each profiled region is run by a single thread for the whole lifetime of the program, and the beginning and end of a region are run by the same thread. This means that if you are benchmarking Go, you should call runtime.LockOSThread in your benchmark to prevent a goroutine from migrating between threads while profiling.
A region is either active or inactive; it cannot be active multiple times at once. This means that for recursive functions only the first (outermost) invocation of the function is tracked.
Be careful of multiplexing, which occurs when you are trying to record more events than there are hardware counter registers. In particular, if you profile a function inside of another function being profiled, this will likely result in multiplexing and possibly incorrect counts. Perforator will automatically attempt to scale counts when multiplexing occurs. To see if this has happened, use the -V flag, which will print information when multiplexing is detected.
Be careful if your target functions are being inlined. Perforator will automatically attempt to read DWARF information to determine the inline sites for target functions but it's a good idea to double check if you are seeing weird results. Use the -V flag to see where Perforator thinks the inline site is.

How it works

Perforator uses ptrace to trace the target program and enable profiling for certain parts of the target program. Perforator places the 0xCC "interrupt" instruction at the beginning of the profiled function which allows it to regain control when the function is executed. At that point, Perforator will place the original code back (whatever was initially overwritten by the interrupt byte), determine the return address by reading the top of the stack, and place an interrupt byte at that address. Then Perforator will enable profiling and resume the target process. When the next interrupt happens, the target will have reached the return address and Perforator can stop profiling, remove the interrupt, and place a new interrupt back at the start of the function.

Analyzing a hand-written assembly file

Note: this is not an issue regarding the software, rather information on how to use it with hand-written assembly files. That way others may more easily find the information.

When a compiler creates an ELF binary for Linux, it records information about each symbol (functions and so on), such as its name, size, and type. Hand-written assembly usually lacks this information, which means perforator will be unable to properly identify the symbols.

Let's say we have an assembly file poly_Rq_mul.s containing the following code:

.global poly_Rq_mul
.global _poly_Rq_mul
poly_Rq_mul:
_poly_Rq_mul:
...
mov %r8, %rsp
pop %r12
ret

We want to do two things: (a) tell the assembler that poly_Rq_mul is a function, and (b) tell it the size of poly_Rq_mul.

First, add the following directives after the .global definitions:

.global poly_Rq_mul
.global _poly_Rq_mul
+#ifdef __linux__
+.type poly_Rq_mul, @function
+.type _poly_Rq_mul, @function
+#endif
poly_Rq_mul:
_poly_Rq_mul:

Then, after the end of the function, add these directives to specify its size:

mov %r8, %rsp
pop %r12
ret
+#ifdef __linux__
+.poly_Rq_mul_end:
+.size poly_Rq_mul, .-poly_Rq_mul
+.size _poly_Rq_mul, .-_poly_Rq_mul
+#endif

Lastly, rename the file so that its extension is .S instead of .s, as GCC (and I assume other compilers) will then invoke the preprocessor to handle the #ifdef. That way, the same assembly file may be compiled on Linux, macOS and more. If you are certain that the code will only be run on Linux, simply omit the #ifdef __linux__ and #endif lines and keep the file's extension as it is.

Add flag to not sort output

I've added a flag, --no-sort, which will disable the sorting of the summary table.

I opted for the inverse (a flag to disable something) as it's non-breaking.

Add a flag to ignore missing regions

I've added a flag, --ignore-missing-regions, which will allow perforator to continue running even though it's unable to find a region.

The change is non-breaking.

The use case is comparing a non-optimized reference implementation with an optimized (-O3 or the like) implementation. In the latter case, GCC and friends may happily inline functions that are only used a single time. This results in the binary (symbol table) not containing the symbol at all, since it's no longer a function per se.

With this change, one may use --ignore-missing-regions to run both binaries with the same perforator command, making it more useful in scripts where it may be unknown if the target binary lacks a specific region.

Thank you

This is no issue per se, but I'd just like to sincerely thank you for providing this tool. As I'm sure you've experienced first hand, using perf_event_open and the related tooling can be a big hassle. I've spent countless hours implementing perf_event_open in C for micro-benchmarks and as many hours reading up on the perf tool. The custom code I wrote worked great, but it was cumbersome to manually instrument the source code. The perf tool is great, but it lacks one important feature: the ability to run perf stat for a specific function that may be somewhere along the lifecycle of the program. Sure, perf record provides similar functionality, but it is not necessarily easy to convert the measurements to actual values. Your tool solves all of this, and for that I'm truly grateful.

Where is trace repo

Hello, @zyedidia. I would like to try the perforator tool. go get depends on the utrace repo, but utrace seems to be absent. Is this a necessary dependency, or is there an alternative?

[Feature Request] Print ratios

First, thanks for this program, it's helping me a lot with profiling my program!

Would it be possible to show ratios such as branch-misses / branches or cache-misses / cache-references (something similar to how perf stat prints its results)?

Owner

Zachary Yedidia
