`sqrt` function. Let's see if I can reproduce the steps to derive this.
- Table of contents
- Setting up the scene.
- Finding a lower bound to the norm.
- Finding an upper bound to the norm.
- Choosing the best approximation for the norm.
- Conclusion

Calculating the norm of a vector \((x,y)\), or a complex number \(x+iy\) means calculating \(\sqrt{x^2+y^2}\). Without loss of generality, we can set \(\sqrt{x^2+y^2}=1\). If we draw this, we get the following.

Now, the issue with the norm is that the \(\sqrt{}\) operation is expensive to compute. That's why we would like another way to approximate the norm. A first idea is to look at the other norms available; indeed, what we have called "norm" so far is actually the 2-norm, also named the *Euclidean norm*. Let's have a look at two other norms: the infinity norm and the Manhattan norm.

The infinity norm is:

\[ \lVert(x,y)\rVert_\infty = \max(|x|,|y|) \]

The Manhattan norm is:

\[ \lVert(x,y)\rVert_1 = |x|+|y| \]

Now we see the infinity norm is indeed a lower bound for the 2-norm, even if it is a rough one. The Manhattan norm, however, is too high. But that is not an issue: we can simply scale it down so that it always stays below the 2-norm. The scaling factor is chosen such that the yellow curve is tangent to the circle. For that, we need it to be equal to \(\cos\frac{\pi}{4}=\frac{1}{\sqrt{2}}\).

We have a lower bound! By choosing, between the yellow and green curves, whichever is closest to the circle, we get an octagon that is very close to the circle. This octagon, which encloses the circle, is described by a function \(f\) such that:

\[ f(x,y) = \max\left(\max(|x|,|y|), \frac{1}{\sqrt{2}}(|x|+|y|)\right) \]

Note that this is different from Paul's article. You **do** need to take the maximum value of the two norms to select the points that are closest to the center. Generally speaking, for two norms, if one's value is higher than the other's, then the former's \(\text{norm}(x,y)=1\) curve will be drawn closer to the origin.

To plot this function, note that the Manhattan and infinity norm isolines cross when \(|y|=1\) and \(|x| = \sqrt{2}-1\), or \(|x|=1\) and \(|y| = \sqrt{2}-1\).

The first idea you can get from the lower bound we found is to scale it up so that the octagon corners touch the circle.

To do so, we need to find the 2-norm of one of the corners and multiply \(f\) by it.

Let's take the one at \(x=1\), \(y=\sqrt{2}-1\). We have:

\[ \begin{align} \sqrt{x^2+y^2} &= \sqrt{1 + \left(\sqrt{2}-1\right)^2}\\ &= \sqrt{1 + 2 - 2\sqrt{2} + 1}\\ &= \sqrt{4 - 2\sqrt{2}} \end{align} \]

Thus, the upper bound for the 2-norm with the octagon method is \(\sqrt{4 - 2\sqrt{2}}f(x,y)\):

\[ f(x,y) \leq \sqrt{x^2+y^2} \leq \sqrt{4 - 2\sqrt{2}}f(x,y) \]

Now, we could stick to Paul Hsieh's choice of taking the middle between the lower and the upper bounds, and it would probably be fine. But come on, let's see if it is the *best* choice.

Formally, the problem is to find a number \(a\in[0,1]\) such that \(g\), defined as follows, is as close as possible to the 2-norm.

\[ \begin{align} g(x,y,a) &= (1-a)f(x,y)+a\sqrt{4 - 2\sqrt{2}}f(x,y)\\ &= \left((1-a) + a\sqrt{4 - 2\sqrt{2}}\right)f(x,y) \end{align} \]

Let's plot this function for various values of \(a\). To make things easier, I will "unroll" the circle, and plot the norms against \(\theta\), the angle between our vector and the \(x\) axis.

As expected, we can continuously vary our approximation between the upper and lower bounds. Notice that these functions are periodic and even. We can thus focus on the first half period to minimize the error: it spans from the \(x\) axis to the first octagon vertex, circling anti-clockwise.

To get the best approximation, we will minimize the squared error. That is:

\[ \begin{align} e(a) &= \int_0^{\arctan\left(\sqrt{2}-1\right)}(g(x,y,a)-1)^2\,\text{d}\theta \end{align} \]

Thankfully, the expressions of \(f(x,y)\), and thus of \(g(x,y,a)\), simplify a lot on the given interval. You can see on the schematic above that on this interval we have \(f(x,y)=\max(|x|,|y|)=|x|=x=\cos\theta\). We can thus rewrite \(e(a)\) as follows.

\[ \begin{align} e(a) &= \int_0^{\arctan\left(\sqrt{2}-1\right)}(g(x,y,a)-1)^2\,\text{d}\theta\\ &= \int_0^{\arctan\left(\sqrt{2}-1\right)}\left(\left(1-a + a\sqrt{4-2\sqrt{2}}\right)\cos\theta-1\right)^2\,\text{d}\theta\\ &= \int_0^{\arctan\left(\sqrt{2}-1\right)}\left(h(a)\cos\theta-1\right)^2\,\text{d}\theta \end{align} \]

Where \(h(a)=1-a + a\sqrt{4-2\sqrt{2}}\) and \(\arctan\left(\sqrt{2}-1\right)=\frac{\pi}{8}\).

As we can see from these plots, there is a minimal error, and though 0.5 is a reasonable choice for \(a\), we can do slightly better around 0.3.

We can explicitly calculate \(e(a)\). Let \(h(a)=1+a(A-1)\), with \(A=\sqrt{4-2\sqrt{2}}\). We have

\[ \begin{align} e(a) &= \int_0^{\pi/8}(h(a)\cos\theta-1)^2\,\text{d}\theta\\ &= h^2(a)\int_0^{\pi/8}\cos^2\theta\,\text{d}\theta-2h(a)\int_0^{\pi/8}\cos\theta\,\text{d}\theta + \frac{\pi}{8}\\ &= h^2(a)B-2h(a)\sin\frac{\pi}{8} + \frac{\pi}{8} \end{align}\]

Where \(B=\frac{\pi}{16}+\frac{1}{4\sqrt2}\). Thus, we look for the position of the minimum, that is, where \(e'(a)=0\).

\[ \begin{align} 0 &= 2B(A-1)(1+a(A-1))-2(A-1)\sin\frac{\pi}{8}\\ 0 &= B(1+a(A-1)) - \frac{A}{2\sqrt2}\\ a &= \left(\frac{A}{2B\sqrt2}-1\right)\times\frac{1}{A-1}\\ a &\approx 0.311 \end{align} \]

Where we used \(\sin\frac{\pi}{8}=\frac{\sqrt{2-\sqrt{2}}}{2}=\frac{A}{2\sqrt{2}}\). Not that far from 0.3!
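We can double-check this closed form numerically. A quick standalone computation (the function name is mine):

```c
#include <math.h>

/* Minimizer of e(a) = h(a)^2 B - 2 h(a) sin(pi/8) + pi/8,
   with h(a) = 1 + a (A - 1). */
double optimal_a(void) {
    double A = sqrt(4.0 - 2.0 * sqrt(2.0));                 /* corner 2-norm */
    double B = acos(-1.0) / 16.0 + 1.0 / (4.0 * sqrt(2.0));
    return (A / (2.0 * B * sqrt(2.0)) - 1.0) / (A - 1.0);
}
```

which indeed evaluates to roughly 0.311.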

The maximum deviation from the true norm is then \(\max_\theta{|h(a)\cos\theta-1|}\). Looking for that maximum is the same as looking for the maximum of \(\left(h(a)\cos\theta-1\right)^2\). Long story short, the maxima can only occur on the boundaries of the allowed domain for \(\theta\), that is \(\theta=0\) or \(\theta=\pi/8\), meaning

\[ \max_\theta{|h(a)\cos\theta-1|} = \max\left(h(a)-1, \left|h(a)\frac{\sqrt{2+\sqrt{2}}}{2}-1\right|\right) \]

With our choice for \(a\), we get \(h(a)\approx 1.026\), so the maximum deviation is about 0.053. That is, we have at most a 5.3% deviation from the 2-norm!

That was a fun Sunday project! Originally this was intended to be included in a longer blog-post that is yet to be finished, but I figured it was interesting enough to have its own post. The take-home message being, you can approximate the Euclidean norm of a vector with:

\[ \begin{align} \text{norm}(x,y) &= \frac{\sqrt{2-\sqrt{2}}}{\frac{\pi}{8}+\frac{1}{2\sqrt{2}}}\max\left(\max(|x|,|y|), \frac{1}{\sqrt{2}}(|x|+|y|)\right)\\ &\approx 1.026\max\left(\max(|x|,|y|), \frac{1}{\sqrt{2}}(|x|+|y|)\right) \end{align} \]

You'll get at most a 5.3% error. This is a bit different from what's proposed in Paul Hsieh's blog-post. Unless I made a mistake, there might be a typo on his blog!

If you are interested in playing with the code used to generate the figures in this article, have a look at the companion notebook!

As always, if you have any questions, or want to add something to this post, you can leave me a comment or ping me on Twitter or Mastodon.

There is a companion GitHub repository where you can retrieve all the code presented in this article.

[1] | Yes, I went through my Firefox history database to find this date. |

- Table of contents
- Why reinvent the wheel?
- Because I did not know how to implement the FFT.
- Because I thought it was possible to do better.
- In-place or out-of-place algorithm?
- Trigonometry can be *blazingly fast*.

- Interlude: some tooling for debugging.
- Using `arduino-cli` to upload your code.
- Don't bother with communication protocols over Serial.
- Fast, accurate FFT, and other floating-point trickeries.
- A first dummy implementation of the FFT.
- Forbidden occult arts are fun.
- Approximate floating-point FFT.

- How fixed-point arithmetic came to the rescue.
- Fixed-point multiplication.
- Controlled result growth.
- Trigonometry is demanding.
- Saturating additions. (a.k.a. "Trigonometry is demanding" returns.)
- Calculating modules with a chainsaw.
- 16 bits fixed-point FFT.
- 8 bits fixed-point FFT.
- Implementing fixed-point FFT for longer inputs

- Benchmarking all these solutions.
- Closing thoughts.

As I said in the introduction, I explicitly researched an implementation of the FFT because I did not want to implement my own. So what changed my mind?

Let's start with the obvious: abhilash_patel's instructable is a **great** instructable. It is part of a series of instructables on implementing the FFT on Arduino, and it is his fastest accurate implementation. It does a great job of explaining the big ideas behind the algorithm, with not only appropriate but also good-looking illustrations. That is why I decided to read his code: to be certain I understood it properly.

And that is the exact moment I entered an infinite spiral. Not because the code was bad, even though it could use some indenting, but because I did not understand how it achieved its purpose. To my own disappointment, I realized that maybe I did not know how to implement an FFT. Sure, I had my share of lectures on the Fourier transform, and on the fast Fourier transform, but the lecturers only showed us that the FFT is an algorithm with a very nice complexity through its recursive definition. What I was looking at did not even remotely resemble what I expected to see.

So I did what seemed the most sensible thing to me at the time: I spent nights reading Wikipedia pages and obscure articles on 2000s-looking websites to understand how the FFT is *actually* implemented.

About one month later, on May 23rd, I started writing a tutorial on zestedesavoir.com: "Jouons à implémenter une transformée de Fourier rapide !" ("Let's play at implementing a fast Fourier transform!"), a sloppy translation of which is also available on my blog. My goal was to write down what I had learned throughout the month, and it helped me clarify the math behind the implementation. Today, I use it as a reference when I have doubts about the implementation.

With this newly acquired knowledge on FFT implementations, I was ready to have another look at @abhilash_patel's code.

As I said, I was now capable of understanding the code provided by @abhilash_patel. And there I found two low-hanging fruits:

- The program was weirdly mixing the in-place and out-of-place algorithms,
- The trigonometry computation was inefficient.

Let me state more clearly what I mean here.

The FFT can either be implemented *in-place* or *out-of-place*. Implementing *out-of-place* of course allows you to keep the input data unchanged by the computation. However, the *in-place* algorithm offers several key advantages, the first, obvious, one being that it only requires the amount of space needed to store the input array.

This might not be obvious, but it also works for real-valued signals. Indeed, one might think that if you have an array of, say, `float` values representing such a signal, its FFT would require twice the amount of space, since the Fourier transform is complex-valued. The trick here is to use a key property of the Fourier transform: for a real-valued signal, it is conjugate-symmetric, so knowing the positive-frequencies part is enough. You can see the full explanation in my blog post on implementing the FFT in Julia.

This would help me get an FFT implementation that can run on more than 256 data points on my Arduino Uno, which the original instructable implementation cannot.^{[2]}

[2] | Even though the code used for the benchmark cannot. This is not due to a memory-size issue, but to the variable types I used for my buffers (`uint8_t`). I think you can see this would be easily fixed to run the FFT on bigger samples, and since I was mostly interested in time benchmarks, I allowed myself that shortcut. |

I believe this is where the biggest improvement in benchmark time originates. Step 2 of the original instructable details how to use a kind of look-up table to compute the trigonometry functions very quickly. This is an efficient method if you have to implement a standalone fast cosine or sine function. However, using such a method for the FFT means forgetting a very interesting property of the algorithm: the angles for which trigonometry calculations are required do not appear at random **at all**. In fact, at each recursion step of the algorithm, they increase by a constant amount, and always start from the same angle: 0.

This arithmetic progression of the angle allows using a simple yet efficient formula for calculating the next sine and cosine:

\[\begin{aligned}\cos(\theta + \delta) &= \cos\theta - [\alpha \cos\theta + \beta\sin\theta]\\\sin(\theta + \delta) &= \sin\theta - [\alpha\sin\theta - \beta\cos\theta]\end{aligned}\]

With \(\alpha = 2\sin^2\left(\frac{\delta}{2}\right),\;\beta=\sin\delta\).

I have included the derivation of these formulas in the relevant section of my tutorial.

As I said, this is most likely the biggest source of improvement in execution time, as the trigonometry computation time becomes negligible using this trick.

I am a big fan of the Julia programming language. It is my main programming tool at work, and I also use it for my hobbies. However, I believe the tips given in this section are easily transportable to other programming languages.

The main idea here is that when you start working with arrays of data, good old `Serial.println` is not usable anymore. Because you cannot simply evaluate the correctness of your results at a glance, you want to use higher-level tools, such as statistical analysis or plotting libraries. And since you are also likely to want to upload your code to the Arduino often, it is convenient to be able to upload it programmatically.

This machinery allows testing all the different implementations in a reproducible way. All the examples given in this article are calculated on the following input signal.

Using `arduino-cli` to upload your code

At the time I started this project, the new Arduino IDE wasn't available yet. If you have ever used the `1.x` versions of the IDE, then you know why one would like to avoid it. Thankfully, there is a command-line utility that allows uploading code from your terminal: `arduino-cli`. If you take a look at the GitHub repository, you'll notice a Julia script whose purpose is to upload code to the Arduino and retrieve the results of computations and benchmarks. The upload part is simply a system call to `arduino-cli`.

```
function upload_code(directory)
    build = joinpath(workdir, directory, "build")
    ino = joinpath(workdir, directory, directory * ".ino")
    build_command = `arduino-cli compile -b arduino:avr:uno -p $portname --build-path "$build" -u -v "$ino"`
    run(pipeline(build_command, stdout="log_arduino-cli.txt", stderr="log_arduino-cli.txt"))
end
```

Don't bother with communication protocols over Serial

At first, I was tempted to use some fancy communication protocol for the serial link. This is not useful in our case: you can simply reset the Arduino programmatically to ensure the synchronization of the computer and the development board, and then exchange raw binary data.

Resetting is done using the DTR pin of the port. In Julia, you can do it like this using the `LibSerialPort.jl` library:

```
function reset_arduino()
    LibSerialPort.open(portname, baudrate) do sp
        @info "Resetting Arduino"
        # Reset the Arduino by toggling the DTR line
        set_flow_control(sp, dtr=SP_DTR_ON)
        sleep(0.1)
        set_flow_control(sp, dtr=SP_DTR_OFF)
        sp_flush(sp, SP_BUF_INPUT)
        sp_flush(sp, SP_BUF_OUTPUT)
    end
end
```

Because your computer can now reset the Arduino at will, you can easily ensure the synchronization of your board. That means the benchmark script knows when to read data from the Arduino.

Then, the Arduino would send data to the computer like this:

`Serial.write((byte*)data, sizeof(fixed_t)*N);`

This way, the array `data` is sent directly through the serial link as a stream of raw bytes. We don't bother with any form of encoding.

On the computer side, you can easily read the incoming data:

```
data = zeros(retrieve_datatype, n_read)
read!(sp, data)
```

Where `sp` is an object created by `LibSerialPort.jl` when opening a port.

You can then happily analyze your data: it's `DataFrames.jl` and `Makie.jl` time!

My first approach was to re-use as much as I could of the code I wrote for my FFT tutorial in Julia. That's why I started working with floating-point arithmetic. This was also convenient because it kept away some issues, like overflowing numbers, that I had to address once I started working with fixed-point arithmetic.

As I said, my first implementation was a simple, stupid translation of one of the codes presented in my Julia tutorial. I did not even bother with writing optimized trigonometry functions, I just wanted something that worked as a basis for other implementations. The code is fairly simple and can be viewed here.

As expected, this gives almost error-free results.

Now let's move on to more interesting stuff. The first obvious improvement over the base implementation is fast trigonometry, and that's what yields the biggest gain in terms of speed. Then, I decided to mess around with IEEE-754 to write my own approximate routines for float multiplication, halving, and modulus calculation. The idea is always the same: treat the IEEE-754 representation of a floating-point number as its logarithm. This does give weird-looking implementations, though. I have written several posts on Zeste de Savoir explaining how all of these work. They are in French, but I trust you can make DeepL run!

"Approximer rapidement le carrΓ© d'un nombre flottant" explains how to square a number using its floating-point representation.

"IEEE 754: Quand votre code prend la float" explains how the IEEE-754 representation of a number looks alike it's logarithm.

"Multiplications avec Arduino: jetons-nous Γ la float" explains how the approximate multiplication of two floating-point numbers can be efficiently calculated.

Without further delay, here is a sneak preview of the result I got with the approximate floating-point FFT. For a full benchmark, you will have to wait for the end of this article! The code is available here.

Rather than endlessly optimizing the floating-point implementation, I decided to change my approach. The main motivation being: **floats are actually overkill for our purpose**. Indeed, they can represent numbers with a good relative precision over enormous ranges. However, when calculating FFTs, the range covered by the output variables can indeed vary, but not that much. And most importantly, it varies **predictably**. This means a **fixed-point** representation can be used. Also, because of their amazing properties, floats take a lot of space in the limited RAM available on a microcontroller. And finally, I want to be able to run FFTs on signals read from the Arduino's ADC. If my program can deal with `int`-like data types, then it'll spare me the trouble of converting from integers to floating-point.

I first played with the idea of implementing a fixed-point FFT when I realized the AVR instruction set gives us the `fmul` instruction, dedicated to multiplying fixed-point numbers. This means we can use it for a speed-efficient implementation of the multiplication, one that should even beat the custom `float` one.

I wrote a blog-post on Zeste-de-Savoir (in French) on implementing the fixed-point multiplication. It is based on the implementation proposed in the AVR instruction set manual.

```
/* Signed fractional multiply of two 16-bit numbers with 32-bit result. */
fixed_t fixed_mul(fixed_t a, fixed_t b) {
fixed_t result;
asm (
// We need a register that's always zero
"clr r2" "\n\t"
"fmuls %B[a],%B[b]" "\n\t" // Multiply the MSBs
"movw %A[result],__tmp_reg__" "\n\t" // Save the result
"mov __tmp_reg__,%B[a]" "\n\t"
"eor __tmp_reg__,%B[b]" "\n\t"
"eor __tmp_reg__,%B[result]" "\n\t"
"fmul %A[a],%A[b]" "\n\t" // Multiply the LSBs
"adc %A[result],r2" "\n\t" // Do not forget the carry
"movw r18,__tmp_reg__" "\n\t" // The result of the LSBs multipliplication is stored in temporary registers
"fmulsu %B[a],%A[b]" "\n\t" // First crossed product
// This will be reported onto the MSBs of the temporary registers and the LSBs
// of the result registers. So the carry goes to the result's MSB.
"sbc %B[result],r2" "\n\t"
// Now we sum the cross product
"add r19,__tmp_reg__" "\n\t"
"adc %A[result],__zero_reg__" "\n\t"
"adc %B[result],r2" "\n\t"
"fmulsu %B[b],%A[a]" "\n\t" // Second cross product, same as first.
"sbc %B[result],r2" "\n\t"
"add r19,__tmp_reg__" "\n\t"
"adc %A[result],__zero_reg__" "\n\t"
"adc %B[result],r2" "\n\t"
"clr __zero_reg__" "\n\t"
:
[result]"+r"(result):
[a]"a"(a),[b]"a"(b):
"r2","r18","r19"
);
return result;
}
```

Obviously, you can also create the same function for 8-bits fixed-point arithmetic.

```
fixed8_t fixed_mul_8_8(fixed8_t a, fixed8_t b) {
fixed8_t result;
asm (
"fmuls %[a],%[b]" "\n\t"
"mov %[result],__zero_reg__" "\n\t"
"clr __zero_reg__" "\n\t"
:
[result]"+r"(result):
[a]"a"(a),[b]"a"(b)
);
return result;
}
```

As you can see, this requires writing some assembly code because the `fmul` instruction is not directly accessible from C. However, even though it is fairly simple, this limits the implementation to AVR platforms. You might still get reasonably efficient code by implementing everything in pure C, which would also extend the implementation to other platforms.
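For reference, here is what a portable C sketch of the 16-bit routine could look like (my own type alias; it truncates where the `fmul`-based version keeps an extra rounding bit):

```c
#include <stdint.h>

typedef int16_t fixed_t;   /* Q15: 1 sign bit, 15 fractional bits */

/* Q15 x Q15 -> Q15 multiplication, truncating the low bits.
   Assumes arithmetic right shift of negative values, as on GCC/AVR. */
fixed_t fixed_mul_c(fixed_t a, fixed_t b) {
    return (fixed_t)(((int32_t)a * (int32_t)b) >> 15);
}
```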

As I said before, the FFT grows predictably. First, we can see that the final Fourier transform is bounded. Recall that the FFT is an algorithm to compute the Discrete Fourier Transform (DFT), which is written:

\[\begin{aligned} X[k] &= \sum_{n=0}^{N-1}x[n]e^{-2i\pi nk/N} \end{aligned}\]

Where \(X\) is the discrete Fourier transform of the input signal \(x\) of size \(N\). From that, we have:

\[\begin{aligned} |X[k]| &\leq \left|\sum_{n=0}^{N-1}x[n]e^{-2i\pi nk/N}\right|\\ &\leq \sum_{n=0}^{N-1}\left|x[n]e^{-2i\pi nk/N}\right| \\ &\leq \sum_{n=0}^{N-1}\left|x[n]\right|\\ &\leq N\times\max_n|x[n]| \end{aligned}\]

In our case, because we use the `Q0f7` fixed-point format, the input signal \(x\) is in the range \([-1,1]\). That means the components of the DFT are within the range \([-N,N]\). Note that these bounds are attained for some signals, *e.g.* a constant input.

With that, we know how to scale the result of the FFT so that it can be stored. But what about the intermediary steps? How do we ensure that the intermediary values stay within range? You may recall this kind of "butterfly" diagram from the blog post explaining the FFT:

This diagram also shows that each step of the algorithm actually performs some FFTs on input signals of smaller sizes. That means our bounding rule applies to intermediary signals, provided we plug the right input size into the formula! Notice how, at step \(i\) (starting from 0), the corresponding sub-FFTs have a size of \(2^{i}\). That basically means that if we scale the signal down by a factor of two between steps, it stays bounded in \([-1,1]\) at every step!

Note that this does not mean we get the optimal scale for every input signal. For example, signals that are poorly periodic have a lot of low-magnitude Fourier coefficients, and do not take full advantage of the range offered by our representation. I did some tests scaling the array only when it was needed, and did not notice many changes in execution time, so that's something you might want to explore if your project requires it.

If all you have is a hammer, everything looks like a nail.

Once I had fixed-point arithmetic working, I started wanting to use it everywhere. But I quickly encountered an issue: trigonometry stopped working.

The reason is simple: 8 bits of precision are not enough for trigonometry calculations as we approach small angles. The key point here is that the precision needed for fixed-point calculation of trigonometry functions depends on the size of the input array. Recall from the section "Trigonometry can be *blazingly fast*" that we need to precompute values for \(\alpha\) and \(\beta\), where

\[\alpha = 2\sin^2\left(\frac{\delta}{2}\right),\quad\beta=\sin\delta\]

And \(\delta\) is the angle increment by which we want to increase the angle of the complex number we are summing with in the FFT. This angle depends on \(N\), the total length of the input array, and is equal to \(\frac{2\pi}{N}\). That means we need to be able to represent at least \(2\sin^2\frac{\pi}{N}\) for trigonometry to work. For \(N=256\), this is approximately \(0.000301\). Unfortunately, the smallest positive number one can represent using the `Q0f7` fixed-point representation, that is, with 7 bits in the fractional part, is \(2^{-7}=0.0078125\). That is why, even for the 8-bit fixed-point FFT, trigonometry calculations are performed using 16-bit fixed-point arithmetic.

This limit on trigonometry also explains why the code presented here is not usable as-is for very long arrays. Indeed, while 512-point arrays can still be handled using 16-bit trigonometry, the theoretical limit for an Arduino Uno would be 1024-point arrays (because the RAM is 2048 bytes, and we need some space for temporary variables), and that would require 32-bit trigonometry, which I did not implement.

One other issue with trigonometry I did not see coming is its sensitivity to overflow. Since there is basically no protection against it, overflowing a fixed-point representation of a number flips the sign. In the case of trigonometry this is especially annoying, because that means we add a \(\pi\) phase error for even the slightest error when values are close to one. And trust me, it took me some time to understand where the error was coming from.

To mitigate this, I had to implement my own addition, which saturates to one instead of flipping the sign when an overflow happens. The trick here is to use the status register (`SREG`) of the microcontroller to detect the overflow. Again, this requires doing the addition in assembly, as the check needs to happen right after the addition is performed, and there is no way to tell what the compiler might do between the addition and the actual check.

Checking the overflow is done using the `brvc` instruction (*Branch if Overflow Cleared*), and the function for 16-bit saturating addition goes like this:

```
/* Fixed point addition with saturation to ±1. */
fixed_t fixed_add_saturate(fixed_t a, fixed_t b) {
fixed_t result;
asm (
"movw %A[result], %A[a]" "\n\t"
"add %A[result],%A[b]" "\n\t"
"adc %B[result],%B[b]" "\n\t"
"brvc fixed_add_saturate_goodbye" "\n\t"
"subi %B[result], 0" "\n\t"
"brmi fixed_add_saturate_plus_one" "\n\t"
"fixed_add_saturate_minus_one:" "\n\t"
"ldi %B[result],0x80" "\n\t"
"ldi %A[result],0x00" "\n\t"
"jmp fixed_add_saturate_goodbye" "\n\t"
"fixed_add_saturate_plus_one:" "\n\t"
"ldi %B[result],0x7f" "\n\t"
"ldi %A[result],0xff" "\n\t"
"fixed_add_saturate_goodbye:" "\n\t"
:
[result]"+d"(result):
[a]"r"(a),[b]"r"(b)
);
return result;
}
```

One might be tempted to use this routine for every single addition performed in the program. This is actually useless, since additions in the actual FFT algorithm will not overflow thanks to scaling, if they are done in a sensible order (check the code if you want to see how!).
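For non-AVR targets, the same saturating behavior can be sketched in portable C (widen, clamp, narrow; the compiler output will be longer than the hand-written assembly, but the semantics match):

```c
#include <stdint.h>

typedef int16_t fixed_t;   /* Q15 */

/* Add two Q15 numbers, clamping to [-1, 1 - 2^-15] instead of wrapping. */
fixed_t fixed_add_saturate_c(fixed_t a, fixed_t b) {
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;   /* 0x7fff, just below +1 */
    if (sum < INT16_MIN) return INT16_MIN;   /* 0x8000, exactly -1    */
    return (fixed_t)sum;
}
```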

After a lot of wandering on the Internet, I ended up using Paul Hsieh's technique for computing approximate modules of vectors. However, while writing this article I discovered some mistakes and things that could be improved in his article, so I ended up writing a dedicated article on this, showing how you can minimize the mean square error and get at most a 5.3% error.

The main idea is that you can approximate the unit circle using a set of well-chosen octagons. That reminds me of what a rough cylinder carved using a chainsaw might look like, hence the name of this section.

Enough small talk, time for some action! You can find the code for the 16-bit fixed-point FFT here. The benchmark is available at the end of this article, but in the meantime, here is the error comparison against the reference implementation.

And now the fastest FFT on Arduino that I implemented, the 8-bits fixed-point FFT! As for previous implementations, you can find the code here. Below is a comparison of the calculated module of the FFT against a reference implementation.

The Arduino Uno has 2048 bytes of RAM. Because this implementation of the FFT needs an input array whose length is a power of two, and because you need some space for variables,^{[3]} the limit would be a 1024 bytes-long FFT. But the code presented here would have to be modified a bit (not that much). From where I am standing, I see two major issues:

- As discussed previously, trigonometry would need 32-bit arithmetic. That means you would need to implement the multiplication and the saturating addition for those numbers.
- The buffers are single bytes right now, so you would need to upgrade them to 16-bit buffers.

Once those two issues, and the inevitable hundreds of other issues I did not think of, are addressed, I don't see why one could not perform an FFT on 1024 bytes-long input arrays.

[3] | Although I am sure a very determined person would be able to fit all the temporary variables in registers and calculate a 2048 bytes-long FFT. Do it, I vouch for you, you beautiful nerd! |

I won't go into the details of how I do the benchmarks here; it's basically just using the Arduino `micros()` function. I present only two benchmarks: how much time is required to run the FFT, and how "bad" the result is, measured with the mean squared error. Now, this is not the perfect way to measure the error made by the algorithm, so I do encourage you to have a look at the different comparison plots above. You will also notice that `ApproxFFT` seems to perform poorly in terms of error for small-sized input arrays. This is because it does not compute the result for frequency 0, so the error is probably over-estimated. Overall, I think it is safe to say that `ApproxFFT` and `Fixed16FFT` introduce the same amount of error in the calculation. Notice how `ExactFFT` is *literally* billions of times more precise than the other FFT algorithms. For the 8-bit algorithms, the quantization mean squared error is \(\frac{1}{3}\text{LSB}^2\approx2\times10^{-5}\), which means there are still sources of error introduced in the algorithm other than simple quantization. The same goes for `ApproxFFT` and `Fixed16FFT`, where the quantization error is approximately \(3\times10^{-10}\).

Execution time is where my implementations truly shine. Indeed, you can see that for a 256 bytes-long input array, `Fixed8FFT` only needs about 12 ms to compute the FFT, when it takes 52 ms for `ApproxFFT` to do the same. And if you need the same level of precision as what `ApproxFFT` offers, you can use `Fixed16FFT`, which only needs about 30 ms to perform the computation. It's worth noticing that `FloatFFT` is not far behind, with only 67 ms needed to compute the 256 bytes FFT. Of course, `ExactFFT` takes much longer.

It has been a fun journey! I had a lot of fun and "aha!" moments when debugging all these implementations. As I wrote before, there are ways to improve them, either by making `Fixed8FFT` able to handle longer input arrays, or by writing a custom-made addition for floating-point numbers to speed up `FloatFFT`. I don't know if I will do it in the near future, as this whole project was just intended to be a small side project, which ended up bigger than expected.

As always, feel free to contact me if you need any further details on this. You can reach me on Mastodon, or on GitHub, or even through the comment section below! In the meantime, have fun with your projects. :)

Someone asked me how to make a honeycomb grid in @FreeCADNews. Here's how I do it, and, bonus, it's parametric!

Let's start with a simple plate with four holes. I give a name to each dimension in the sketcher so that I can re-use them later.

Then I create a new body and start sketching on the `XY` plane. For this example I wanted to constrain the hexagon side, so a bit of trigonometry is needed to get the width of each hexagon. I also decided here that the separation between hexagons would be about 2 mm.
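Not something FreeCAD needs, but to make the trigonometry concrete, here is a tiny sketch (in Julia; the function name and the 10 mm value are mine, not from the model) of how the flat-to-flat width of a regular hexagon follows from its side length:

```julia
# For a regular hexagon of side s, the width across flats is 2·s·cos(π/6) = √3·s.
hex_width(s) = sqrt(3) * s

hex_width(10.0)  # ≈ 17.32 (mm) for a 10 mm side
```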

The two construction lines will serve as directions along which we repeat the hexagon. Notice how I also link the pad length of the new solid with the plate pad length. Then we head to the `Create MultiTransform` tool in Part Design, and start a first `LinearPattern`. We need it a bit longer than the width of the plate since we will duplicate the hexagons sideways. Any "big" number will do, but a bit of trigonometry gives me the exact length.

Then using another `LinearPattern` I can complete the line of hexagons. Since our pattern is symmetric I could also have used a symmetry tool. As before I use one of the construction lines for the direction of the pattern.

Now I do the other direction! Using another `LinearPattern`, the second construction line, and a bit of trigonometry (again).

The number of occurrences is given by `Length / <<Sketch001>>.hexagon_sep`. FreeCAD will round that to the nearest integer; if you're not happy with that, you can mess around with `ceil` and `floor`. Then, once again, I can complete the pattern.
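As a side note, the rounding behaviour of that expression can be sketched outside FreeCAD; the numbers below are hypothetical, the real values come from the named sketch dimensions:

```julia
# Hypothetical dimensions (in the model these are the named sketch dimensions).
plate_length = 120.0  # mm
hexagon_sep  = 14.0   # mm

# FreeCAD rounds the expression to the nearest integer; floor/ceil force a side.
round(Int, plate_length / hexagon_sep)  # 9
floor(Int, plate_length / hexagon_sep)  # 8
```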

Let's create another body using the sketcher. It will represent the area where I want the honeycomb pattern to be present. I can re-use the dimensions I set for the base plate using their name.

One body remaining! We want some of the hexagons to be full. So let's create a body representing these. It re-uses the dimensions of the first hexagon.

Now I want to repeat the body a certain number of times to fill some of the hexagons. Once again MultiTransform is our friend.

Notice that I used the dimensions from the honeycomb pattern to match the correct positions of the hexagons. Also, everything being parametric, I can simply change the number of hexagons by setting the `Occurrences` parameter of `LinearPattern004`. At this stage, I have four bodies. I named them `main_plate`, `hexagons`, `allowed_cut_zone` and `text_zone`. Let's combine them cleverly using boolean operations!

First, let's remove the text zone from the allowed cut zone, using `PartDesign`'s boolean operation.

Then I can create the cut zone, which is the intersection between the allowed cut zone and the hexagons.

Finally, I can do the cutting, by taking the difference between the base plate and the cut zone.

I just need to add some text using the Draft workbench... whoops, the text zone is a bit too big. Good thing that our model is parametric, so we can easily change its size.

And there you have it!

If you want to mess around with the model, it is available here.

Have fun!

This tutorial is intended for people who have already encountered the Fourier transform, but who have not yet implemented it. It is largely based on the third edition of Numerical Recipes^{[1]}, which I encourage you to consult: it is a gold mine.

[1] William H. Press, Saul A. Teukolsky, William T. Vetterling, & Brian P. Flannery (2007). *Numerical Recipes: The Art of Scientific Computing* (3rd ed.). Cambridge University Press.

- Table of contents
- Some reminders on the discrete Fourier transform
- The Fourier transform
- From the Fourier transform to the discrete Fourier transform
- Calculating the discrete Fourier transform
- Why a fast Fourier transform algorithm?

- Implementing the FFT
- My first FFT
- Analysis of the first implementation
- Calculate the reverse permutation of the bits
- My second FFT
- The special case of a real signal
- Property 1: Compute the Fourier transform of two real functions at the same time
- Property 2: Compute the Fourier transform of a single function
- Calculation in place

- An FFT for the reals
- Optimization of trigonometric functions

The discrete Fourier transform is a transformation that follows from the Fourier transform and is, as its name indicates, adapted for discrete signals. In this first part I propose to discover how to build the discrete Fourier transform and then understand why the fast Fourier transform is useful.

This tutorial is not intended to present the Fourier transform. However, there are several definitions of the Fourier transform and even within a single domain, several are sometimes used. We will use the following: for a function \(f\), its Fourier transform \(\hat{f}\) is defined by:

\[ \hat{f}(\nu) = \int_{-\infty}^{+\infty}f(x)e^{-2i\pi\nu x}\text{d}x \]As defined in the previous section, the Fourier transform of a signal is a continuous function of the variable \(\nu\). However, to represent any signal, we can only use a finite number of values. To do this we proceed in four steps:

1. We **sample** (or discretize) the signal to analyze. This means that instead of working on the function that associates the value of the signal with the variable \(x\), we will work on a discrete series of values of the signal. In the case of the FFT, we sample with a constant step. For example, if we look at a temporal signal like the value of a voltage read on a voltmeter, we could record the value at each *tick* of a watch.
2. We **window** the discretized signal. This means that we keep only a finite number of points of the signal.
3. We sample the Fourier transform of the signal to obtain the discrete Fourier transform.
4. We window the discrete Fourier transform for storage.

I suggest reasoning on a toy signal which will have the shape of a Gaussian. This makes the reasoning a little simpler because the Fourier transform of a real Gaussian is also a real Gaussian^{[2]}, which simplifies the graphical representations.

More formally, we have:

\[ f(x) = e^{-x^2},\;\hat{f}(\nu)=\sqrt{\pi}e^{-(\pi\nu)^2} \]Let's first look at the sampling. Mathematically, we can represent the process by the multiplication of the signal \(f\) by a Dirac comb of period \(T\), \(Ш_T\). The Dirac comb is defined as follows:

\[ Ш_T(x) = \sum_{k=-\infty}^{+\infty}\delta(x-kT) \]with \(\delta\) the Dirac distribution. Here is the plot that we can obtain if we represent \(f\) and \(g=Ш_T\times f\) together:

The Fourier transform of the new function \(g\) is written^{[3]}:

\[ \hat{g}(\nu) = \sum_{k=-\infty}^{+\infty}f(kT)e^{-2i\pi\nu kT} \]

If we put \(f[k]=f(kT)\) the sampled signal and \(\nu_{\text{ech}} = \frac{1}{T}\) the sampling frequency, we have:

\[ \hat{g}(\nu) = \sum_{k=-\infty}^{+\infty}f[k]e^{-2i\pi k\frac{\nu}{\nu_{\text{ech}}}} \]If we plot the Fourier transform of the starting signal, \(\hat{f}\), and that of the sampled signal, \(\hat{g}\), we obtain the following plot:

We can then look at the windowing process. There are several methods that each have their advantages, but we will focus here only on the rectangular window. The principle is simple: we only look at the values of \(f\) for \(x\) between \(-x_0\) and \(+x_0\). This means that we multiply the function \(f\) by a gate function \(\Pi_{x_0}\) which verifies:

\[ \Pi_{x_0}(x) = \begin{cases} 1 & \text{if}\; x\in[-x_0,x_0] \\ 0 & \text{otherwise} \end{cases} \]Graphically, here is how we could represent \(h\) and \(f\) together.

Concretely, this is equivalent to limiting the sum of the Dirac comb to a finite number of terms. We can then write the Fourier transform of \(h=\Pi_{x_0} \times Ш_T \times f\):

\[ \hat{h}(\nu) = \sum_{k=-k_0}^{+k_0}f[k]e^{-2i\pi k\frac{\nu}{\nu_{\text{ech}}}} \]We can now proceed to the last step: sampling the Fourier transform. Indeed, we can only store a finite number of values on our computer and, as defined, the function \(\hat{h}\) is continuous. We already know that it is periodic, with period \(\nu_{\text{ech}}\), so we can store only the values between \(0\) and \(\nu_{\text{ech}}\). We still have to sample it, and in particular to find the adequate sampling step. It is clear that we want the sampling to be as "fine" as possible, in order not to miss any detail of the Fourier transform! For this we can take inspiration from what happened when we sampled \(f\): its Fourier transform became periodic, with period \(\nu_{\text{ech}}\). Now the inverse Fourier transform (the operation that allows us to recover the signal from its Fourier transform) has properties similar to those of the Fourier transform. This means that if we sample \(\hat{h}\) with a sampling step \(\nu_s\), then its inverse Fourier transform becomes periodic with period \(1/\nu_s\). This puts an upper limit on the values that \(\nu_s\) can take! Indeed, if the inverse transform has a period smaller than the width of the window (\(1/\nu_s < 2x_0\)), then the reconstructed signal taken between \(-x_0\) and \(x_0\) will not correspond to the initial signal \(f\)!

So we choose \(\nu_s = \frac{1}{2x_0}\) to discretize \(\hat{h}\). We use the same process of multiplication by a Dirac comb to discretize. In this way we obtain the Fourier transform of a new function \(l\) :

\[ \begin{aligned} \hat{l}(\nu) = \sum_{n=-\infty}^{+\infty} \delta(\nu-n\nu_s) \sum_{k=-k_0}^{+k_0}f[k]e^{-2i\pi k\frac{n\nu_s}{\nu_{\text{ech}}}} \end{aligned} \]This notation is a bit complicated, and we can be more interested in \(\hat{l}[n]=\hat{l}(n\nu_s)\) :

\[ \begin{aligned} \hat{l}[n] = \hat{l}(n\nu_s) &= \sum_{k=-k_0}^{+k_0}f[k]e^{-2i\pi k\frac{n\nu_s}{\nu_{\text{ech}}}}\\ &= \sum_{k=0}^{N-1}f[k]e^{-2i\pi k\frac{n}{N}} \end{aligned} \]To get the last line, I re-indexed \(f[k]\) to start at 0, noting \(N\) the number of samples. I then assumed that the window size corresponded to an integer number of samples, i.e. that \(2x_0 = N\times T\), which can be rewritten as \(N\times \nu_s = \nu_{\text{ech}}\). This expression is the **discrete Fourier transform** of the signal.

This problem is solved quite simply by windowing the discrete Fourier transform. Since the transform has been periodized by the sampling of the starting signal, it is enough to store one period of the transform to store all the information it contains. The choice generally made is to keep all the points between \(0\) and \(\nu_{\text{ech}}\). This allows us to use only positive \(n\), and one can easily reconstruct the plot of the transform if needed by swapping the first and second halves of the computed transform. In practice (for the implementation), the discrete Fourier transform is thus given by:

\[ \boxed{ \forall n=0...(N-1),\; \hat{f}[n] = \sum_{k=0}^{N-1}f[k]e^{-2i\pi k\frac{n}{N}} } \]To conclude on our example function, we obtain the following plot:

So we have at our disposal the expression of the discrete Fourier transform of a signal \(f\):

\[ \hat{f}[n] = \sum_{k=0}^{N-1}f[k]e^{-2i\pi k\frac{n}{N}} \]This is the expression of a matrix product, which would look like this:

\[ \hat{f} = \mathbf{M} \cdot f \]with

\[ \mathbf{M} = \begin{pmatrix} 1 & 1 & 1 & \dots & 1 \\ 1 & e^{-2i\pi \times 1 \times 1 / N} & e^{-2i\pi \times 2 \times 1 / N} & \dots & e^{-2i\pi \times (N-1)\times 1/N} \\ 1 & e^{-2i\pi \times 1 \times 2 / N} & e^{-2i\pi \times 2 \times 2 / N} & \ddots & \vdots\\ \vdots & \vdots & \ddots & \ddots & e^{-2i\pi \times (N-2)\times (N-1) / N}\\ 1 & e^{-2i\pi \times (N-1) \times 1/N} & \dots & e^{-2i\pi \times (N-1) \times (N-2) / N} & e^{-2i\pi \times (N-1)\times (N-1) / N} \end{pmatrix} \]Those in the know will notice that this is a Vandermonde matrix on the roots of unity.

So this calculation can be implemented relatively easily!

```
function naive_dft(x)
    N = length(x)
    k = reshape(0:(N-1), 1, :)
    n = 0:(N-1)
    M = @. exp(-2im * π * k * n / N)  # DFT matrix Mₙₖ = e^{-2iπnk/N}
    M * x
end
```

And to check that it does indeed give the right result, it is enough to compare it with a reference implementation:

`using FFTW`

```
a = rand(1024)
b = fft(a)
c = naive_dft(a)
b ≈ c
```

The last block evaluates to `true`, which confirms that we are not totally off the mark!

However, is this code effective? We can check by comparing the memory footprint and execution speed.

`using BenchmarkTools`

`@benchmark fft(a) setup=(a = rand(1024))`

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.876 μs … 12.990 ms   GC (min … max): 0.00% … 42.84%
 Time  (median):     19.108 μs               GC (median):    0.00%
 Time  (mean ± σ):   25.991 μs ± 169.304 μs  GC (mean ± σ):  3.95% ± 0.61%

 Memory estimate: 33.97 KiB, allocs estimate: 27.
```

`@benchmark naive_dft(a) setup=(a = rand(1024))`

```
BenchmarkTools.Trial: 105 samples with 1 evaluation.
 Range (min … max):  42.351 ms … 61.851 ms   GC (min … max): 0.00% … 2.90%
 Time  (median):     46.299 ms               GC (median):    0.00%
 Time  (mean ± σ):   48.020 ms ± 4.551 ms    GC (mean ± σ):  0.50% ± 1.62%

 Memory estimate: 16.03 MiB, allocs estimate: 4.
```

So our implementation is *really* slow (about 2,500 times, comparing the medians above) and has a very high memory footprint (about 500 times) compared to the reference implementation! To improve this, we will implement the fast Fourier transform.

Before getting our hands dirty again, let's first ask the question: is it really necessary to try to improve this algorithm?

Before answering directly, let us look at some applications of the Fourier transform and the discrete Fourier transform.

The Fourier transform has first of all a lot of theoretical applications, whether it is to solve differential equations, in signal processing or in quantum physics. It also has practical applications in optics and in spectroscopy.

The discrete Fourier transform also has many applications, in signal analysis, for data compression, multiplication of polynomials or the computation of convolution products.

Our naive implementation of the discrete Fourier transform has a time and memory complexity in \(\mathcal{O}(N^2)\), with \(N\) the size of the input sample; this is due to the storage of the matrix and to the computation time of the matrix product. Concretely, if one wished to analyze a 3-second sound signal sampled at 44 kHz, with data stored on single-precision floats (4 bytes), approximately \(2\times(44000\times3)^2\times 4\approx1.4\times10^{11}\) bytes of memory would be needed (a complex number is stored on 2 floats). We can also estimate the time necessary to make this calculation. The median time for 1024 points was 38.367 ms. For our 3-second signal, it would take about \(38.367\times\left(\frac{44000\times3}{1024}\right)^2\approx 637\;537\) milliseconds, that is, more than 10 minutes!
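These back-of-the-envelope figures can be reproduced in a few lines of Julia (a sketch; the constants simply restate the numbers above):

```julia
N = 44_000 * 3                  # number of samples in the 3-second signal
mem_bytes = 2 * N^2 * 4         # complex matrix: 2 single-precision floats per entry
t_ms = 38.367 * (N / 1024)^2    # quadratic extrapolation of the 1024-point median

(mem_bytes, t_ms / 60_000)      # ≈ 1.4e11 bytes, and more than 10 minutes
```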

One can easily understand the interest of reducing the complexity of the calculation. In particular, the fast Fourier transform algorithm (used by the reference implementation) has a complexity in \(\mathcal{O}(N\log N)\). According to our *benchmark*, the algorithm processes a 1024-point input in 23.785 µs. It should therefore process the sound signal in about \(23.785\times\frac{44000\times3\times\log(44000\times3)}{1024\times\log1024}\approx 5\;215\) microseconds, that is to say, about 120,000 times faster than our algorithm. We can really say that the *fast* in *Fast Fourier Transform* is well deserved!

[2] Gaussians are said to be eigenfunctions of the Fourier transform.

[3] It should be justified here that we can swap the sum and integral signs.

We saw how the discrete Fourier transform is constructed, and then we implemented it naively. While this implementation is relatively simple to write (especially with a language like Julia that facilitates matrix manipulations), we also saw its limitations in terms of execution time and memory footprint.

It's time to move on to the FFT itself!

In this part we will implement the FFT by starting with a simple approach, and then making it more complex as we go along to try to calculate the Fourier transform of a real signal in the most efficient way possible. To compare the performances of our implementations, we will continue to compare with the FFTW implementation.

We have previously found the expression of the discrete Fourier transform :

\[ \hat{f}[n] = \sum_{k=0}^{N-1}f[k]e^{-2i\pi k\frac{n}{N}} \]The trick at the heart of the FFT algorithm is to notice that if we try to cut this sum in two, separating the even and odd terms, we get (assuming \(N\) is even), for \(n < N/2\):

\[ \begin{aligned} \hat{f}[n] &= \sum_{k=0}^{N-1}f[k]e^{-2i\pi k\frac{n}{N}}\\ &= \sum_{m=0}^{N/2-1}f[2m]e^{-2i\pi 2m\frac{n}{N}} + \sum_{m=0}^{N/2-1}f[2m+1]e^{-2i\pi (2m+1)\frac{n}{N}}\\ &= \sum_{m=0}^{N/2-1}f[2m]e^{-2i\pi m\frac{n}{N/2}} + e^{-2i\pi n/N}\sum_{m=0}^{N/2-1}f[2m+1]e^{-2i\pi m\frac{n}{N/2}}\\ &= \hat{f}^\text{even}[n] + e^{-2i\pi n/N}\hat{f}^\text{odd}[n] \end{aligned} \]where \(\hat{f}^\text{even}\) and \(\hat{f}^\text{odd}\) are the Fourier transforms of the sequence of even terms of \(f\) and of the sequence of odd terms of \(f\). We can therefore compute the first half of the Fourier transform of \(f\) by computing the Fourier transforms of these two sequences of length \(N/2\) and recombining them. Similarly, if we compute \(\hat{f}[n+N/2]\) we have:

\[ \begin{aligned} \hat{f}[n+N/2] &= \sum_{m=0}^{N/2-1}f[2m]e^{-2i\pi m\frac{n+N/2}{N/2}} + e^{-2i\pi(n+N/2)/N}\sum_{m=0}^{N/2-1}f[2m+1]e^{-2i\pi m\frac{n+N/2}{N/2}}\\ &= \sum_{m=0}^{N/2-1}f[2m]e^{-2i\pi m\frac{n}{N/2}} - e^{-2i\pi n/N}\sum_{m=0}^{N/2-1}f[2m+1]e^{-2i\pi m\frac{n}{N/2}}\\ &= \hat{f}^\text{even}[n] - e^{-2i\pi n/N}\hat{f}^\text{odd}[n] \end{aligned} \]This means that by computing two Fourier transforms of length \(N/2\), we are able to compute two elements of a Fourier transform of length \(N\). Assuming for simplicity that \(N\) is a power of two^{[4]}, this naturally suggests a recursive implementation of the FFT. According to the master theorem, this algorithm has complexity \(\mathcal{O}(N\log_2 N)\), which is much better than the \(\mathcal{O}(N^2)\) of the first naive algorithm we implemented.

```
function my_fft(x)
    # Stop condition: the FT of an array of size 1 is this same array.
    if length(x) <= 1
        x
    else
        N = length(x)
        # Xᵒ contains the FT of the odd terms and Xᵉ that of the even terms.
        # The subtlety being that Julia's arrays start at 1 and not 0.
        Xᵒ = my_fft(x[2:2:end])
        Xᵉ = my_fft(x[1:2:end])
        factors = @. exp(-2im * π * (0:(N/2 - 1)) / N)
        [Xᵉ .+ factors .* Xᵒ; Xᵉ .- factors .* Xᵒ]
    end
end
```

We can check as before that this code gives the right result, then compare its runtime qualities with the reference implementation.

`@benchmark fft(a) setup=(a = rand(1024))`

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.308 μs … 9.319 ms    GC (min … max): 0.00% … 45.83%
 Time  (median):     18.813 μs               GC (median):    0.00%
 Time  (mean ± σ):   22.727 μs ± 93.183 μs   GC (mean ± σ):  1.88% ± 0.46%

 Memory estimate: 33.97 KiB, allocs estimate: 27.
```

`@benchmark my_fft(a) setup=(a = rand(1024))`

```
BenchmarkTools.Trial: 1174 samples with 1 evaluation.
 Range (min … max):  3.511 ms … 23.508 ms   GC (min … max): 0.00% … 77.03%
 Time  (median):     3.905 ms               GC (median):    0.00%
 Time  (mean ± σ):   4.249 ms ± 1.681 ms    GC (mean ± σ):  2.94% ± 6.60%

 Memory estimate: 1.09 MiB, allocs estimate: 14322.
```

We can see that we have improved the execution time (by a factor of about 12) and the memory footprint of the algorithm (by a factor of about 15), though still without getting close to the reference implementation.

Let's go back to the previous code:

```
function my_fft(x)
    # Stop condition: the FT of an array of size 1 is this same array.
    if length(x) <= 1
        x
    else
        N = length(x)
        # Xᵒ contains the FT of the odd terms and Xᵉ that of the even terms.
        # The subtlety being that Julia's arrays start at 1 and not 0.
        Xᵒ = my_fft(x[2:2:end])
        Xᵉ = my_fft(x[1:2:end])
        factors = @. exp(-2im * π * (0:(N/2 - 1)) / N)
        [Xᵉ .+ factors .* Xᵒ; Xᵉ .- factors .* Xᵒ]
    end
end
```

And let's try to keep track of the memory allocations. For simplicity, we can assume that we are working on an array of 4 elements, `[f[0], f[1], f[2], f[3]]`. The first call to `my_fft` keeps the initial array in memory, then launches the FFT on two sub-arrays of size 2, `[f[0], f[2]]` and `[f[1], f[3]]`; the recursive calls then keep in memory, before recombining them, the arrays `[f[0]]` and `[f[2]]`, then `[f[1]]` and `[f[3]]`. At most, we have \(\log_2(N)\) arrays allocated, with sizes divided by two each time. Not only do these arrays take up memory, but we also waste time allocating them!

However, if we observe the definition of the recurrence we use, at each step \(i\) (i.e. for each array size, \(N/2^i\)), the sum of the intermediate array sizes is always \(N\). In other words, this suggests that we could avoid all these array allocations and work in a single array the whole time, provided that we perform all the combinations of arrays of the same size at the same step.

Schematically we can represent the FFT process for an array with 8 elements as follows:

How to read this diagram? Each column corresponds to a depth of the recursion of our first FFT. The leftmost column corresponds to the deepest recursion: we have cut the input array enough to arrive at subarrays of size 1. These 8 subarrays are symbolized by 8 different geometrical shapes. We then go to the next level of the recursion. Each pair of subarrays of size 1 must be combined to create a subarray of size 2, which will be stored in the same memory cells as the two subarrays of size 1. For example, we combine the subarray that contains \(f[0]\) and the subarray that contains \(f[4]\), using the formula demonstrated earlier, to form the array \([f[0] + f[4], f[0] - f[4]]\), and store the two values at positions 0 and 4. The colors of the arrows allow us to distinguish those bearing a coefficient (which correspond to the treatment we give to the subarray \(\hat{f}^{\text{odd}}\) in the formulas of the previous section). After having constructed the 4 subarrays of size 2, we can proceed to a new step of the recursion to compute two subarrays of size 4. Finally, the last step of the recursion combines the two subarrays of size 4 to compute the array of size 8 which contains the Fourier transform.

Based on this scheme we can think of having a function whose main loop would calculate successively each column to arrive at the final result. In this way, all the calculations are performed on the same array and the number of allocations is minimized! There is however a problem: we see that the \(\hat{f}[k]\) do not seem to be ordered at the end of the process.

In reality, these \(\hat{f}[k]\) are ordered via a reverse bit permutation. This means that if we write the indices \(k\) in binary, then reverse this writing (the MSB becoming the LSB^{[5]}), we obtain the index at which \(\hat{f}[k]\) is found after the FFT algorithm. The permutation process is described by the following table in the case of a calculation on 8 elements.

| \(k\) | Binary representation of \(k\) | Reversed binary representation | Index of \(\hat{f}[k]\) |
|---|---|---|---|
| 0 | 000 | 000 | 0 |
| 1 | 001 | 100 | 4 |
| 2 | 010 | 010 | 2 |
| 3 | 011 | 110 | 6 |
| 4 | 100 | 001 | 1 |
| 5 | 101 | 101 | 5 |
| 6 | 110 | 011 | 3 |
| 7 | 111 | 111 | 7 |
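The reversed indices in this table can also be computed generically with a small loop; a sketch (the helper name is mine, not from the article):

```julia
# Reverse the p-bit binary representation of k by peeling its bits one at a time.
function bit_reverse_generic(k::Integer, p::Integer)
    r = 0
    for _ in 1:p
        r = (r << 1) | (k & 1)  # append the current lowest bit of k to r
        k >>= 1
    end
    r
end

[bit_reverse_generic(k, 3) for k in 0:7]  # [0, 4, 2, 6, 1, 5, 3, 7], as in the table
```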

If we know how to calculate the reverse permutation of the bits, we can simply reorder the array at the end of the process to obtain the right result. However, before jumping on the implementation, it is interesting to look at what happens if instead we reorder the input array *via* this permutation.

We can see that by proceeding in this way we have a simple ordering of the subarrays. Since we will have to perform a permutation of the array in any case, it is advantageous to do it before the computation of the FFT.

We must therefore begin by being able to compute the permutation. It is possible to perform the permutation in place once we know which elements to exchange. Several methods exist to perform the permutation, and a search in Google Scholar will give you an overview of the wealth of approaches.

We can use a little trick here: since we are dealing only with arrays whose size is a power of 2, we can write the size \(N\) as \(N=2^p\). This means that the indices can be stored on \(p\) bits. We can then simply compute the permuted index *via* binary operations. For example, if \(p=10\), the index \(797\) would be represented as `1100011101`.

We can separate the inversion process into several steps. First we exchange the 5 most significant bits and the 5 least significant bits. Then, in each of the half-words, we exchange the two most significant bits and the two least significant bits (the central bits do not change). Finally, in each of the two-bit words that we have just exchanged, we exchange the most significant bit and the least significant bit.

An example of implementation would be the following:

```
bit_reverse(::Val{10}, num) = begin
    # Swap the 5 most significant bits with the 5 least significant bits.
    num = ((num&0x3e0)>>5)|((num&0x01f)<<5)
    # In each 5-bit half, swap the two MSBs with the two LSBs (central bits 2 and 7 fixed).
    num = ((num&0x318)>>3)|(num&0x084)|((num&0x063)<<3)
    # In each 2-bit pair, swap the two bits.
    ((num&0x252)>>1)|(num&0x084)|((num&0x129)<<1)
end
```

An equivalent algorithm can be applied for all values of \(p\); you just have to be careful not to move the central bits when a half-word has an odd number of bits. Below is an example for several word lengths.

Then we can do the permutation itself. The algorithm is relatively simple: iterate over the array, compute the reversed index of the current index, and perform the exchange. The only subtlety is that the exchange must be performed only once per index of the array, so we only perform it if the current index is lower than the reversed index.

```
function reverse_bit_order!(X, order)
    N = length(X)
    for i in 0:(N-1)
        j = bit_reverse(order, i)
        if i < j
            X[i+1], X[j+1] = X[j+1], X[i+1]
        end
    end
    X
end
```
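To see the permutation act end to end, here is a self-contained sketch: it adds a 3-bit `bit_reverse` method in the same style (this `Val{3}` method is mine, not from the article) and repeats `reverse_bit_order!` so the block runs on its own:

```julia
# 3-bit variant in the style of the Val{10} method: swap bits 0 and 2, keep bit 1.
bit_reverse(::Val{3}, num) = ((num&0x4)>>2) | (num&0x2) | ((num&0x1)<<2)

# The in-place permutation from the article, repeated for self-containment.
function reverse_bit_order!(X, order)
    N = length(X)
    for i in 0:(N-1)
        j = bit_reverse(order, i)
        if i < j
            X[i+1], X[j+1] = X[j+1], X[i+1]
        end
    end
    X
end

reverse_bit_order!(collect(0:7), Val(3))  # [0, 4, 2, 6, 1, 5, 3, 7]
```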

We are now sufficiently equipped to start a second implementation of the FFT. The first step will be to compute the reverse bit permutation. Then we will be able to compute the Fourier transform following the scheme shown previously. To do this we will store the size \(n_1\) of the sub-arrays and the number of cells \(n_2\) in the global array that separate two elements of the same index in the sub-arrays. The implementation can be done as follows:

```
function my_fft_2(x)
    N = length(x)
    order = Int(log2(N))
    @inbounds reverse_bit_order!(x, Val(order))
    n₁ = 0
    n₂ = 1
    for i = 1:order  # i is the index of the column we are in.
        n₁ = n₂      # n₁ = 2^(i-1)
        n₂ *= 2      # n₂ = 2^i
        step_angle = -2π/n₂
        angle = 0.0
        for j = 1:n₁  # j is the index in Xᵉ and Xᵒ
            factors = exp(im*angle)  # factors = exp(-2im*π*(j-1)/n₂)
            angle += step_angle      # angle = -2π*j/n₂, ready for the next iteration
            # We combine the element j of each group of subarrays
            for k = j:n₂:N
                @inbounds x[k], x[k+n₁] = x[k] + factors * x[k+n₁], x[k] - factors * x[k+n₁]
            end
        end
    end
    x
end
```

We can again measure the performance of this implementation. To keep the comparison fair, the `fft!` function should be used instead of `fft`, as it works in place.

`@benchmark fft!(a) setup=(a = rand(1024) |> complex)`

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.362 μs … 16.111 ms   GC (min … max): 0.00% … 33.10%
 Time  (median):     19.093 μs               GC (median):    0.00%
 Time  (mean ± σ):   24.535 μs ± 161.059 μs  GC (mean ± σ):  2.17% ± 0.33%

 Memory estimate: 1.72 KiB, allocs estimate: 25.
```

`@benchmark my_fft_2(a) setup=(a = rand(1024) .|> complex)`

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  55.132 μs … 350.764 μs  GC (min … max): 0.00% … 0.00%
 Time  (median):     55.773 μs               GC (median):    0.00%
 Time  (mean ± σ):   63.343 μs ± 15.417 μs   GC (mean ± σ):  0.00% ± 0.00%

 Memory estimate: 0 bytes, allocs estimate: 0.
```

We have significantly improved our execution time and memory footprint. We can see that there are zero bytes allocated (this means that the compiler does not need to store the few intermediate variables in RAM), and that the execution time is close to that of the reference implementation.

So far we have reasoned on complex signals, which use two floats for storage. However, in many situations we work with real-valued signals. Now, in the case of a real signal, we know that \(\hat{f}\) verifies \(\hat{f}(-\nu) = \overline{\hat{f}(\nu)}\). This means that half of the values we calculate are redundant. Although we calculate the Fourier transform of a real signal, the result is complex-valued. In order to save storage space, we can think of using this redundant half of the array to store complex values. For this, two properties will help us.

If we have two real signals \(f\) and \(g\), we can define the complex signal \(h=f+ig\). We then have:

\[ \hat{h}[k] = \sum_{n=0}^{N-1}(f[n]+ig[n])e^{-2i\pi kn/N} \]We can notice that

\[ \begin{aligned} \overline{\hat{h}[N-k]} &= \overline{\sum_{n=0}^{N-1}(f[n]+ig[n])e^{-2i\pi (N-k)n/N}}\\ &=\sum_{n=0}^{N-1}(f[n]-ig[n])e^{-2i\pi kn/N} \end{aligned} \]Combining the two we have

\[ \begin{aligned} \hat{f}[k] &= \frac{1}{2}(\hat{h}[k] + \overline{\hat{h}[N-k]})\\ \hat{g}[k] &= -\frac{i}{2}(\hat{h}[k] - \overline{\hat{h}[N-k]})\\ \end{aligned} \]The idea is to use the previous property by using the signal of the even and the odd elements. In other words for \(k=0...N/2-1\) we have \(h[k]=f[2k]+if[2k+1]\).

Then we have:

\[ \begin{aligned} \hat{f}^{\text{even}}[k] &= \sum_{n=0}^{N/2-1}f[2n]e^{-2i\pi kn/(N/2)}\\ \hat{f}^{\text{odd}}[k] &= \sum_{n=0}^{N/2-1}f[2n+1]e^{-2i\pi kn/(N/2)} \end{aligned} \]We can recombine the two partial transforms. For \(k=0...N/2-1\) :

\[ \begin{aligned} \hat{f}[k] &= \hat{f}^{\text{even}}[k] + e^{-2i\pi k/N}\hat{f}^{\text{odd}}[k]\\ \hat{f}[k+N/2] &= \hat{f}^{\text{even}}[k] - e^{-2i\pi k/N}\hat{f}^{\text{odd}}[k] \end{aligned} \]Using the first property, we then have:

\[ \begin{aligned} \hat{f}[k] &= \frac{1}{2}(\hat{h}[k] + \overline{\hat{h}[N/2-k]}) - \frac{i}{2}(\hat{h}[k] - \overline{\hat{h}[N/2-k]})e^{-2i\pi k/N} \\ \hat{f}[k+N/2] &= \frac{1}{2}(\hat{h}[k] + \overline{\hat{h}[N/2-k]}) + \frac{i}{2}(\hat{h}[k] - \overline{\hat{h}[N/2-k]})e^{-2i\pi k/N} \end{aligned} \]The array \(h\), which is presented previously, is complex-valued. However the input signal is real-valued and twice as long. The trick is to use two cells of the initial array to store a complex element of \(h\). It is useful to do the calculations with complex numbers before starting to write code. For the core of the FFT, if we note \(x_i\) the array at step \(i\) of the main loop, we have:

\[ \begin{aligned} \text{Re}(x_{i+1}[k]) &= \text{Re}(x_{i}[k]) + \text{Re}(e^{-2i\pi j/n_2})\text{Re}(x_i[k+n_1]) - \text{Im}(e^{-2i\pi j/n_2})\text{Im}(x_i[k+n_1])\\ \text{Im}(x_{i+1}[k]) &= \text{Im}(x_{i}[k]) + \text{Re}(e^{-2i\pi j/n_2})\text{Im}(x_i[k+n_1]) + \text{Im}(e^{-2i\pi j/n_2})\text{Re}(x_i[k+n_1])\\\\ \text{Re}(x_{i+1}[k+n_1]) &= \text{Re}(x_{i}[k]) - \text{Re}(e^{-2i\pi j/n_2})\text{Re}(x_i[k+n_1]) + \text{Im}(e^{-2i\pi j/n_2})\text{Im}(x_i[k+n_1])\\ \text{Im}(x_{i+1}[k+n_1]) &= \text{Im}(x_{i}[k]) - \text{Re}(e^{-2i\pi j/n_2})\text{Im}(x_i[k+n_1]) - \text{Im}(e^{-2i\pi j/n_2})\text{Re}(x_i[k+n_1])\\ \end{aligned} \]With the organization we chose, we can replace \(\text{Re}(x[k])\) with \(x[2k]\) and \(\text{Im}(x[k])\) with \(x[2k+1]\). We also note that we can replace \(\text{Re}(x[k+n_1])\) with \(x[2(k+n_1)]\), or even better, \(x[2k+n_2]\).

The last step is the recombination of \(h\) to find the final result. The formula in property 2 is rewritten after an unpleasant but uncomplicated calculation:

\[ \begin{aligned} \text{Re}(\hat{x}[k]) &= 1/2 \times \Big(\text{Re}(h[k]) + \text{Re}(h[N/2-k]) + \text{Im}(h[k])\text{Re}(e^{-2i\pi k/N}) + \text{Re}(h[k])\text{Im}(e^{-2i\pi k/N})... \\&...+ \text{Im}(h[N/2-k])\text{Re}(e^{-2i\pi k/N}) - \text{Re}(h[N/2-k])\text{Im}(e^{-2i\pi k/N})\Big)\\ \text{Im}(\hat{x}[k]) &= 1/2 \times \Big(\text{Im}(h[k]) - \text{Im}(h[N/2-k]) - \text{Re}(h[k])\text{Re}(e^{-2i\pi k/N}) + \text{Im}(h[k])\text{Im}(e^{-2i\pi k/N})...\\&... + \text{Re}(h[N/2-k])\text{Re}(e^{-2i\pi k/N}) + \text{Im}(h[N/2-k])\text{Im}(e^{-2i\pi k/N})\Big) \end{aligned} \]There is a particular case where this formula does not work: when \(k=0\), the index \(N/2-k\) falls outside the array \(h\), which contains only \(N/2\) elements. However, we can use the symmetry of the Fourier transform to see that \(h[N/2]=h[0]\). The case \(k=0\) then simplifies enormously:

\[ \begin{aligned} \text{Re}(\hat{x}[0]) &= \text{Re}(h[0]) + \text{Im}(h[0])\\ \text{Im}(\hat{x}[0]) &= 0 \end{aligned} \]To perform the calculation in place, it is useful to be able to calculate \(\hat{x}[N/2-k]\) at the same time that we calculate \(\hat{x}[k]\). Reusing the previous results and the fact that \(e^{-2i\pi(N/2-k)/N}=-e^{2i\pi k/N}\), we find:

\[ \begin{aligned} \text{Re}(\hat{x}[N/2-k]) &= 1/2 \times \Big(\text{Re}(h[N/2-k]) + \text{Re}(h[k]) - \text{Im}(h[N/2-k])\text{Re}(e^{-2i\pi k/N})...\\&... + \text{Re}(h[N/2-k])\text{Im}(e^{-2i\pi k/N}) - \text{Im}(h[k])\text{Re}(e^{-2i\pi k/N}) - \text{Re}(h[k])\text{Im}(e^{-2i\pi k/N})\Big)\\\text{Im}(\hat{x}[N/2-k]) &= 1/2 \times \Big(\text{Im}(h[N/2-k]) - \text{Im}(h[k]) + \text{Re}(h[N/2-k])\text{Re}(e^{-2i\pi k/N})...\\&... + \text{Im}(h[N/2-k])\text{Im}(e^{-2i\pi k/N}) - \text{Re}(h[k])\text{Re}(e^{-2i\pi k/N}) + \text{Im}(h[k])\text{Im}(e^{-2i\pi k/N})\Big) \end{aligned} \]After this little unpleasant moment, we are ready to implement a new version of the FFT!

Since the actual computation of the FFT is done on an array that is half the size of the input array, we need a function computing the bit-reversed index on 9 bits, so we can keep testing on 1024 points.

```
bit_reverse(::Val{9}, num) = begin
    num = ((num&0x1e0)>>5)|(num&0x010)|((num&0x00f)<<5)
    num = ((num&0x18c)>>2)|(num&0x010)|((num&0x063)<<2)
    ((num&0x14a)>>1)|(num&0x010)|((num&0x0a5)<<1)
end
```
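Bit reversal is an involution, which gives an easy sanity check for the method just defined (a quick sketch):

```julia
# Reversing the 9 bits twice must return the original index.
all(bit_reverse(Val(9), bit_reverse(Val(9), n)) == n for n in 0:511)  # true
# Spot checks on simple patterns:
bit_reverse(Val(9), 0b000000001) == 0b100000000  # true
bit_reverse(Val(9), 0b101000000) == 0b000000101  # true
```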

To take into account the specificities of the representation of the complexes we use, we implement a new version of `reverse_bit_order`.

```
function reverse_bit_order_double!(x, order)
    N = length(x)
    for i in 0:(N÷2-1)
        j = bit_reverse(order, i)
        if i < j
            # swap real parts
            x[2*i+1], x[2*j+1] = x[2*j+1], x[2*i+1]
            # swap imaginary parts
            x[2*i+2], x[2*j+2] = x[2*j+2], x[2*i+2]
        end
    end
    x
end
```
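Since the underlying index permutation is an involution, applying this function twice must give back the original array, which makes for a convenient smoke test (using the 512-complex-samples case, for which the `bit_reverse(::Val{9}, …)` method above is defined):

```julia
x = rand(1024)                          # 512 complex samples stored as re/im pairs
y = copy(x)
reverse_bit_order_double!(y, Val(9))    # permute…
reverse_bit_order_double!(y, Val(9))    # …and permute back
y == x  # true
```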

This leads to the new FFT implementation.

```
function my_fft_3(x)
    N = length(x) ÷ 2
    order = Int(log2(N))
    @inbounds reverse_bit_order_double!(x, Val(order))

    n₁ = 0
    n₂ = 1
    for i = 1:order  # i is the index of the column we are in.
        n₁ = n₂      # n₁ = 2ⁱ⁻¹
        n₂ *= 2      # n₂ = 2ⁱ
        step_angle = -2π/n₂
        angle = 0
        for j = 1:n₁  # j is the index in Xᵉ and Xᵒ
            re_factor = cos(angle)
            im_factor = sin(angle)
            angle += step_angle  # angle = -2π*j/n₂
            # We combine element j from each group of subarrays.
            @inbounds for k = j:n₂:N
                re_x₁ = x[2*k-1]
                im_x₁ = x[2*k]
                re_x₂ = x[2*(k+n₁)-1]
                im_x₂ = x[2*(k+n₁)]
                x[2*k-1]      = re_x₁ + re_factor*re_x₂ - im_factor*im_x₂
                x[2*k]        = im_x₁ + im_factor*re_x₂ + re_factor*im_x₂
                x[2*(k+n₁)-1] = re_x₁ - re_factor*re_x₂ + im_factor*im_x₂
                x[2*(k+n₁)]   = im_x₁ - im_factor*re_x₂ - re_factor*im_x₂
            end
        end
    end
    # We build the final version of the transform.
    # N is half the size of x.
    # Special case n = 0:
    x[1] = x[1] + x[2]
    x[2] = 0
    step_angle = -π/N
    angle = step_angle
    @inbounds for n = 1:(N÷2)
        re_factor = cos(angle)
        im_factor = sin(angle)
        re_h = x[2*n+1]
        im_h = x[2*n+2]
        re_h_sym = x[2*(N-n)+1]
        im_h_sym = x[2*(N-n)+2]
        x[2*n+1]     = 1/2*(re_h + re_h_sym + im_h*re_factor + re_h*im_factor + im_h_sym*re_factor - re_h_sym*im_factor)
        x[2*n+2]     = 1/2*(im_h - im_h_sym - re_h*re_factor + im_h*im_factor + re_h_sym*re_factor + im_h_sym*im_factor)
        x[2*(N-n)+1] = 1/2*(re_h_sym + re_h - im_h_sym*re_factor + re_h_sym*im_factor - im_h*re_factor - re_h*im_factor)
        x[2*(N-n)+2] = 1/2*(im_h_sym - im_h + re_h_sym*re_factor + im_h_sym*im_factor - re_h*re_factor + im_h*im_factor)
        angle += step_angle
    end
    x
end
```

We can now check the performance of the new implementation:

`@benchmark fft!(x) setup=(x = rand(1024) .|> complex)`

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.630 μs … 247.123 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.206 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.346 μs ±   9.044 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  17.6 μs         Histogram: log(frequency) by time         45.7 μs <

 Memory estimate: 1.72 KiB, allocs estimate: 25.
```

`@benchmark my_fft_3(x) setup=(x = rand(1024))`

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  28.782 μs … 100.122 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.276 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   33.700 μs ±   9.410 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  28.8 μs         Histogram: log(frequency) by time         59.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

This is a very good result!

If we analyze the execution of `my_fft_3` using Julia's *profiler*, we can see that most of the time is spent computing trigonometric functions and creating the `StepRange` objects used in `for` loops. The second problem can easily be circumvented by using `while` loops. For the first one, in *Numerical Recipes* (section 5.4, "*Recurrence Relations and Clenshaw's Recurrence Formula*", page 219 of the third edition) we can read:

If your program's running time is dominated by evaluating trigonometric functions, you are probably doing something wrong. Trig functions whose arguments form a linear sequence \(\theta = \theta_0 + n\delta,\ n=0,1,2,...\), are efficiently calculated by the recurrence

\[\begin{aligned}\cos(\theta + \delta) &= \cos\theta - [\alpha \cos\theta + \beta\sin\theta]\\\sin(\theta + \delta) &= \sin\theta - [\alpha\sin\theta - \beta\cos\theta]\end{aligned}\]

where \(\alpha\) and \(\beta\) are the precomputed coefficients \(\alpha = 2\sin^2\left(\frac{\delta}{2}\right),\;\beta=\sin\delta\).

This relation is also interesting in terms of numerical stability. We can directly implement a final version of our FFT using these relations.
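We can convince ourselves numerically before relying on the recurrence (a quick sketch; `recur_trig` is a throwaway helper, not part of the FFT code):

```julia
# Generate cos/sin of θ = n·δ by the recurrence instead of calling cos/sin each step.
function recur_trig(δ, n)
    α = 2sin(δ/2)^2      # α = 1 - cos(δ)
    β = sin(δ)
    c, s = 1.0, 0.0      # cos(0), sin(0)
    for _ in 1:n
        # cos(θ+δ) = cosθ - (α·cosθ + β·sinθ);  sin(θ+δ) = sinθ - (α·sinθ - β·cosθ)
        c, s = c - (α*c + β*s), s - (α*s - β*c)
    end
    (c, s)
end

c, s = recur_trig(-2π/1024, 100)
c ≈ cos(-200π/1024) && s ≈ sin(-200π/1024)  # true
```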

```
function my_fft_4(x)
    N = length(x) ÷ 2
    order = Int(log2(N))
    @inbounds reverse_bit_order_double!(x, Val(order))

    n₁ = 0
    n₂ = 1
    i = 1
    while i <= order  # i is the index of the column we are in.
        n₁ = n₂       # n₁ = 2ⁱ⁻¹
        n₂ *= 2       # n₂ = 2ⁱ
        step_angle = -2π/n₂
        α = 2sin(step_angle/2)^2
        β = sin(step_angle)
        cj = 1
        sj = 0
        j = 1
        while j <= n₁  # j is the index in Xᵉ and Xᵒ
            # We combine element j from each group of subarrays.
            k = j
            @inbounds while k <= N
                re_x₁ = x[2*k-1]
                im_x₁ = x[2*k]
                re_x₂ = x[2*(k+n₁)-1]
                im_x₂ = x[2*(k+n₁)]
                x[2*k-1]      = re_x₁ + cj*re_x₂ - sj*im_x₂
                x[2*k]        = im_x₁ + sj*re_x₂ + cj*im_x₂
                x[2*(k+n₁)-1] = re_x₁ - cj*re_x₂ + sj*im_x₂
                x[2*(k+n₁)]   = im_x₁ - sj*re_x₂ - cj*im_x₂
                k += n₂
            end
            # We compute the next cosine and sine.
            cj, sj = cj - (α*cj + β*sj), sj - (α*sj - β*cj)
            j += 1
        end
        i += 1
    end
    # We build the final version of the transform.
    # N is half the size of x.
    # Special case n = 0:
    x[1] = x[1] + x[2]
    x[2] = 0
    step_angle = -π/N
    α = 2sin(step_angle/2)^2
    β = sin(step_angle)
    cj = 1
    sj = 0
    j = 1
    @inbounds while j <= (N÷2)
        # We compute the cosine and sine before the main calculation here to
        # compensate for the first step of the loop, which was skipped.
        cj, sj = cj - (α*cj + β*sj), sj - (α*sj - β*cj)
        re_h = x[2*j+1]
        im_h = x[2*j+2]
        re_h_sym = x[2*(N-j)+1]
        im_h_sym = x[2*(N-j)+2]
        x[2*j+1]     = 1/2*(re_h + re_h_sym + im_h*cj + re_h*sj + im_h_sym*cj - re_h_sym*sj)
        x[2*j+2]     = 1/2*(im_h - im_h_sym - re_h*cj + im_h*sj + re_h_sym*cj + im_h_sym*sj)
        x[2*(N-j)+1] = 1/2*(re_h_sym + re_h - im_h_sym*cj + re_h_sym*sj - im_h*cj - re_h*sj)
        x[2*(N-j)+2] = 1/2*(im_h_sym - im_h + re_h_sym*cj + im_h_sym*sj - re_h*cj + im_h*sj)
        j += 1
    end
    x
end
```

We can check that we always get the right result:

```
a = rand(1024)
b = fft(a)
c = my_fft_4(a)
real.(b[1:end÷2]) ≈ c[1:2:end] && imag.(b[1:end÷2]) ≈ c[2:2:end]
```

`true`

In terms of performance, we finally managed to outperform the reference implementation!

`@benchmark fft!(x) setup=(x = rand(1024) .|> complex)`

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.393 μs …  17.563 ms  ┊ GC (min … max): 0.00% … 31.36%
 Time  (median):     19.227 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.551 μs ± 175.678 μs  ┊ GC (mean ± σ):  2.24% ± 0.31%

  [histogram omitted]
  17.4 μs         Histogram: log(frequency) by time         40.9 μs <

 Memory estimate: 1.72 KiB, allocs estimate: 25.
```

`@benchmark my_fft_4(x) setup=(x = rand(1024))`

```
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.775 μs … 52.808 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.984 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.386 μs ±  1.605 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  12.8 μs         Histogram: log(frequency) by time         20.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

[4] | In practice we can always reduce to this case by padding with zeros. |

[5] | MSB and LSB are the acronyms of Most Significant Bit and Least Significant Bit. In a number represented on \(n\) bits, the MSB is the bit that carries the information on the highest power of 2 (\(2^{n-1}\)) while the LSB carries the information on the lowest power of 2 (\(2^0\)). Concretely the MSB is the leftmost bit of the binary representation of a number, while the LSB is the rightmost. |

If we compare the different implementations proposed in this tutorial as well as the two reference implementations, and then plot the median values of execution time, memory footprint and number of allocations, we obtain the following plot:

I added the function `FFTW.rfft`, which is supposed to be optimized for real-valued input. We can see that in reality, unless you work on very large arrays, it does not bring much performance benefit.

We can see that the last versions of the algorithm are very good in terms of number of allocations and memory footprint. In terms of execution time, the reference implementation ends up being faster on very large arrays.

How can we explain these differences, especially between our latest implementation and the one in FFTW? Some elements of an answer:

- FFTW solves a much larger problem. Our implementation is "naive", for example in the sense that it only works on input arrays whose size is a power of two, and even then, only on those for which we have taken the trouble to implement a method of the `bit_reverse` function. The reverse bit permutation problem is more complicated to solve in the general case. Moreover, FFTW performs well on many types of architectures, offers discrete Fourier transforms in multiple dimensions, etc. If you are interested in the subject, I recommend this article^{[6]}, which presents the internal workings of FFTW.
- The representation of the complex numbers plays in our favor. It spares our implementation any conversion; this shows in particular in the test code, where we take care of recovering the real and imaginary parts of the transform: `real.(b[1:end÷2]) ≈ c[1:2:end] && imag.(b[1:end÷2]) ≈ c[2:2:end]`.
- Our algorithm was not designed with numerical stability in mind. This is an aspect that could still be improved. Also, we did not test it on anything other than noise. However, the following block presents some tests that suggest that it "behaves well" for some test functions.

These simplifications and special cases allow our implementation to gain a lot in speed. This makes the implementation of FFTW all the more remarkable, as it still performs very well!

[6] | Frigo, Matteo & Johnson, S. G. (2005). The Design and Implementation of FFTW3. Proceedings of the IEEE, 93, 216-231. 10.1109/JPROC.2004.840301. |

At the end of this tutorial I hope to have helped you to understand the mechanisms that make the FFT computation work, and to have shown how to implement it efficiently, modulo some simplifications. Personally, writing this tutorial has allowed me to realize the great qualities of FFTW, the reference implementation, that I use every day in my work!

This should allow you to understand that for some use cases, it can be interesting to implement and optimize your own FFT. An application that has been little discussed in this tutorial is the calculation of convolution products. An efficient method when convolving signals of comparable length is to do so by multiplying the two Fourier transforms and then taking the inverse Fourier transform. In this case, since the multiplication is done term by term, it is not necessary that the Fourier transform is ordered. One could therefore imagine a special implementation that would skip the reverse bit permutation part.
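As an illustration, circular convolution through the transform can be sketched as follows (using FFTW's `fft`/`ifft` for brevity; the hypothetical unordered variant described above would simply skip the permutation step inside both transforms):

```julia
using FFTW

# Convolution theorem: pointwise product in the frequency domain
# corresponds to circular convolution in the time domain.
circular_conv(a, b) = real.(ifft(fft(a) .* fft(b)))

N = 256
a, b = rand(N), rand(N)
# Direct O(N²) circular convolution, for comparison.
direct = [sum(a[m+1] * b[mod(n - m, N) + 1] for m in 0:N-1) for n in 0:N-1]
circular_conv(a, b) ≈ direct  # true
```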

Another improvement that could be made concerns the calculation of the inverse Fourier transform. It is a very similar calculation (only the multiplicative coefficients change), and can be a good exercise to experiment with the codes given in this tutorial.
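If you want to avoid writing a second kernel entirely, one classical shortcut reuses the forward transform: conjugate the input, apply the forward FFT, conjugate again and divide by \(N\). A sketch with FFTW's `fft` standing in for our forward transform:

```julia
using FFTW

# Inverse DFT from the forward one: ifft(X) == conj(fft(conj(X))) / N
my_ifft(X) = conj.(fft(conj.(X))) ./ length(X)

X = fft(rand(ComplexF64, 1024))
my_ifft(X) ≈ ifft(X)  # true
```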

Finally, I want to thank @Gawaboumga, @Næ, @zeqL and @luxera for their feedback on the beta of this tutorial, and @Gabbro for the validation on zestedesavoir.com!
