
The basic power of two factorization that describes the FFT in the first place shows how to use two N/2 FFTs (+ pre or post processing) to do one N fft. By doing this recursively you can see how large FFTs can be broken into small ones

True, The butterfly is just fft of 2 samples (N = 2) the rest of the design of any N fft is built on this unit.
Any modifications such as using 64 N for 1024 N etc. is playing around with specific cases but any advantage is implementation dependent between given small fft and the larger fft built around it.

kaz, even doing a butterfly is implementation dependent, i think, because how you gotta handle the indexing of the two in-place bins and then doing the add/subtract math and the scaling with a complex number.
but, i think how the stitching is done is by "manually" implementing the butterflies that include 64-point FFT as a single super-butterfly.
considering the illustration that i stole from https://www.slideshare.net/op205/fast-fourier-transform-presentation , the rectangular 2-point DFT is the same as one of the circular nodes. if you pick the four x0, x4, x2, x6 and go to A0, A1, A2, A3 (two passes) that would be a 4-point DFT node. put it in a box with four in and four out. stitch four of those together and you have a 16-point DFT. at some point you are stitching together 64-point nodes (that would have 6 passes of 2-point DFTs which are otherwise known as "butterflies").

I was at some point excited about doing Altera (Intel) 2k fft from their 2x1k fft as described here(figure 2):
https://www.researchgate.net/publication/279753021_Implementing_super-efficient_FFTs_in_Altera_FPGAs
However relative to a 2k fft with serial stream input and serial stream output, what the authors have missed is that to parallise inputs as two substreams for 1k case (or alternatively get one output stream from two output streams) extra memory is needed. Fortunately in my design there were already memory blocks and so I could do parallel substreams at no extra cost and took advantage. Such advantage is purely because how the 2k core was designed in the first place.


Do you mean how to perfor a 1024 point FFT with a 64 points core?? This topic is shown in R. Lyons Book in the tips and tricks chapter.