Benchmarking with CMake, VSC, and nanobench
Years ago I was in the habit of going to the Borders in Palo Alto to thumb through the programming guides and eye the one or two shelves of C++ books. Most were of the “Teach Yourself C++ in 21 Days” type, but once in a while you’d find a real gem like Scott Meyers’s “Effective C++”. I remember buying that book and completely changing the way I programmed after reading it.
The Borders is gone, replaced by HanaHaus, a co-working space, but my desire to learn about the latest in programming and C++ has not diminished. Fortunately, there is a huge number of online resources available to C++ programmers these days. Some of these I knew only by name a couple of years ago, but the past couple of years prompted me to learn more about them, and along the way I found some really great ones.
I wanted to have a solid hobby project where I could put some energy into trying out some of these resources I’ve been checking out. I recently purchased an M1 MacBook Air. I’ve always been curious about DSP, so I thought trying out some benchmarking tools could be a great way to learn something about this new machine, and also to learn something about what’s new in C++ out there.
I’ve been hearing a lot of great stuff about CMake. Here are some great resources to know about:
You can find the source and example files on my GitHub.
Let’s start with a super simple CMakeLists.txt:
➜ cat CMakeLists.txt
cmake_minimum_required(VERSION 3.23)
set (CMAKE_CXX_STANDARD 20)
project(
    fft_benchmark
    VERSION 1.0
    LANGUAGES CXX)
add_executable(fft_benchmark fft_benchmark.cpp)
➜ cat fft_benchmark.cpp
#include <iostream>
int main() {
    std::cout << "hello world\n";
}
I’m going to target benchmarking FFTW. First I’ll update my CMake file to find and link the library:
➜ cat CMakeLists.txt
...
find_package(PkgConfig REQUIRED)
pkg_search_module(FFTW REQUIRED fftw3 IMPORTED_TARGET)
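# PkgConfig::FFTW is an imported target, so linking it also adds its include directories.
# This line must come after add_executable(fft_benchmark ...).
target_link_libraries(fft_benchmark PRIVATE PkgConfig::FFTW)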
And to use it I’ll create a little struct to wrap the library:
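struct FFT {
    using complex = std::complex<double>;
    FFT(int log2n);                                     // plan an N = 2^log2n point transform
    void fill_input(std::vector<complex> const &input); // copy samples into the input buffer
    std::vector<complex> fetch_output() const;          // copy the result out
    void execute();                                     // run the planned transform
    int N;                                              // number of points, 2^log2n (int assumed here)
    fftw_plan p{};
};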
And I’ll show it in action with a cosine:
int main() {
    auto fft = FFT(4);
    auto in = std::vector<FFT::complex>(fft.N);
    for (auto i = 0; i < fft.N; ++i) {
        in[i] = std::cos(i * 2 * std::numbers::pi / fft.N);
    }
    std::cout << "Input: ";
    std::copy(in.begin(), in.end(), std::ostream_iterator<FFT::complex>(std::cout, ", "));
    std::cout << "\n";
    fft.fill_input(in);
    fft.execute();
    auto out = fft.fetch_output();
    std::cout << "\nOutput: ";
    std::copy(out.begin(), out.end(), std::ostream_iterator<FFT::complex>(std::cout, ", "));
    std::cout << "\n";
}
Which gives me these results:
Input: (1,0), (0.92388,0), (0.707107,0), (0.382683,0), (6.12323e-17,0), (-0.382683,0), (-0.707107,0), (-0.92388,0), (-1,0), (-0.92388,0), (-0.707107,0), (-0.382683,0), (-1.83697e-16,0), (0.382683,0), (0.707107,0), (0.92388,0),
Output: (-8.99621e-16,0), (8,-1.08026e-15), (4.39601e-17,4.71028e-16), (-9.05066e-16,-6.90891e-17), (9.95799e-17,-8.88178e-16), (3.66477e-16,6.90891e-17), (2.00969e-16,4.71028e-16), (0,4.59171e-17), (2.10602e-16,0), (0,-3.76406e-17), (2.00969e-16,-4.71028e-16), (3.58201e-16,-6.90891e-17), (9.95799e-17,8.88178e-16), (-9.13342e-16,6.90891e-17), (4.39601e-17,-4.71028e-16), (8,1.07198e-15),
or when plotted:
Ok, time to add some benchmarking. I was intrigued by this benchmarking software I ran across called nanobench. So let’s integrate it into our test:
➜ cat CMakeLists.txt
...
include(FetchContent)
FetchContent_Declare(
    nanobench
    GIT_REPOSITORY https://github.com/martinus/nanobench.git
    GIT_TAG v4.1.0
    GIT_SHALLOW TRUE)
FetchContent_MakeAvailable(nanobench)
...
target_link_libraries(fft_benchmark PRIVATE nanobench)
And now we use it in our code:
for (auto log2n : { 7, 8, 9, 10, 11, 12 }) {
    auto fft = FFT(log2n);
    auto in = std::vector<FFT::complex>(fft.N);
    for (auto i = 0; i < fft.N; ++i) {
        in[i] = std::cos(i * 2 * std::numbers::pi / fft.N);
    }
    fft.fill_input(in); // load the cosine once, outside the timed lambda
    ankerl::nanobench::Bench().run(std::string("fftw ") + std::to_string(fft.N), [&] {
        fft.execute();
    });
}
Which gives us these results when built in Release mode:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 238.29 | 4,196,629.09 | 3.7% | 0.00 | `fftw 128`
| 536.39 | 1,864,312.35 | 0.2% | 0.00 | `fftw 256`
| 1,226.35 | 815,427.00 | 0.8% | 0.00 | `fftw 512`
| 2,815.12 | 355,224.01 | 0.2% | 0.00 | `fftw 1024`
| 6,104.25 | 163,820.29 | 0.1% | 0.00 | `fftw 2048`
| 14,041.67 | 71,216.62 | 1.9% | 0.00 | `fftw 4096`
Nice! Very easy to get nanobench integrated with this.
Now let’s try benchmarking Accelerate, Apple’s DSP library. First we add to CMake:
➜ cat CMakeLists.txt
...
find_library(ACCELERATE_LIB Accelerate)
target_link_libraries(fft_benchmark PRIVATE nanobench ${ACCELERATE_LIB})
Then we create a struct, vDSP_FFT, to manage the Accelerate FFT, and use it in the benchmarks:
for (auto log2n : { 7, 8, 9, 10, 11, 12 }) {
    auto fft = vDSP_FFT(log2n);
    auto in = std::vector<vDSP_FFT::complex>(fft.N);
    for (auto i = 0; i < fft.N; ++i) {
        in[i].real = std::cos(i * 2 * std::numbers::pi / fft.N);
    }
    ankerl::nanobench::Bench().run(std::string("accelerate ") + std::to_string(fft.N), [&] {
        fft.execute();
    });
}
And finally we get our results:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 245.92 | 4,066,403.96 | 0.0% | 0.00 | `fftw 128`
| 574.13 | 1,741,766.20 | 0.2% | 0.00 | `fftw 256`
| 1,302.71 | 767,632.98 | 0.1% | 0.00 | `fftw 512`
| 3,014.93 | 331,682.82 | 0.1% | 0.00 | `fftw 1024`
| 6,541.71 | 152,865.13 | 0.1% | 0.00 | `fftw 2048`
| 14,763.67 | 67,733.85 | 0.1% | 0.00 | `fftw 4096`
| 199.10 | 5,022,696.70 | 0.1% | 0.00 | `accelerate 128`
| 474.46 | 2,107,648.73 | 0.1% | 0.00 | `accelerate 256`
| 1,248.76 | 800,791.37 | 0.1% | 0.00 | `accelerate 512`
| 2,803.94 | 356,640.86 | 0.1% | 0.00 | `accelerate 1024`
| 5,786.50 | 172,816.04 | 0.1% | 0.00 | `accelerate 2048`
| 12,906.25 | 77,481.84 | 0.1% | 0.00 | `accelerate 4096`
So there you have it. With a couple of simple external tools like CMake and nanobench, I was able to do some quick benchmarking of FFTW vs. Accelerate. On this M1, Accelerate came out ahead at every size tested, by a factor of roughly 1.04 to 1.24.