EFCL
CHANNELIZER
v0.3, January 2017
KEYWORDS
Fast Channelizer, Filterbank, Speech Recognition Front End,
Software Defined Radio, Polyphase Filter, Frequency Multiplexing,
Audio Denoising, High Performance Computing, HPC, SDR, FFT, FFTW,
SIMD, AVX, SSE, NEON.
DOWNLOAD
10.01.2017: EFCL-v0.3 minor compatibility bug fix, documentation update
31.12.2016: EFCL-v0.2 2 bug fixes
30.12.2016: EFCL-v0.1 initial release
ACKNOWLEDGMENTS
This work has been supported by the LINDAT/CLARIN project of the
Ministry of Education, Youth and Sports of the Czech Republic
(project LM2015071).
WHAT IT IS
This is a library and a stand-alone program that splits an input signal
stream into a number of equally spaced channels. The samples are complex
valued single precision floats. If Fs is the sampling frequency, the
channels' center frequencies will be 0, Fs/C, 2*Fs/C, ... (C-1)*Fs/C,
where C is the number of channels. That is, the zeroth channel spans the
interval (-Fs/2C, Fs/2C), the first channel (Fs/2C, 3Fs/2C), and the last
one ((C-1.5)*Fs/C, (C-0.5)*Fs/C), which --- due to spectrum
periodicity --- can be regarded as channel "-1" spanning from -3Fs/2C
to -Fs/2C. The output of the channelizer can optionally be decimated by
an integer factor D (typically D=C).
The library is designed for high throughput, using on-the-fly compiled
assembly codelets that get compiled and tuned for the particular machine
and the C and D parameters when the program starts. The compiled codelet
is then dynamically linked with the main program so that the channelizer
can use it. The assembly relies heavily on SIMD instructions, the flavor
of which (SSE, AVX or NEON) is detected by the makefile when the library
gets compiled. So, if one wants to distribute platform independent
binaries, it is necessary to compile libefcl.so for the different SIMD
flavors and distribute several different libefcl.so files from which the
installer (or the main program) would select the proper one (in fact this
would be more complicated, as there are also tuning constants and options
which are CPU model specific).
For CPUs that lack SIMD instructions, there is still a generic C version
of the code, which is about 3 times slower.
The library can also take advantage of multi-core CPUs to further
increase its speed by running the computation in multiple threads. The
number of threads is configurable at runtime, and the library adaptively
balances the load (even turning itself off on cores overloaded by other
tasks) so as to achieve maximal throughput.
WHAT IT ISN'T
To select a channel out of a signal one needs a filter. The efcl
channelizer expects to be given its impulse response in a file containing
binary float numbers. The response itself may or may not be symmetric ---
it does not matter, as the filtering is fully general. If you need to use
a minimum phase filter for some reason, that is possible. It is enough to
specify this filter for channel 0, i.e. the low pass filter --- the
filters for the other channels are derived from it automatically by efcl.
Although there is a utility "./simple_filter" which produces an example
filter (a sigma factor weighted truncated sin(x)/x), it was meant only
for debugging and performance testing, not for meaningful signal
processing.
Designing a proper filter is much more involved. The crucial thing to
consider is that the stop band of such a filter must fall off as 1/f, or
better as 1/f^2, so that the stop band sums to a small constant when it
is folded by aliasing (which happens due to decimation). Otherwise the
noise and spectral leakage will be much higher than expected.
Designing such a filter is, however, beyond the scope of this library.
Use GnuRadio or write your own design tool (a weighted Remez algorithm
might be the method of choice). Keep in mind that the filter length is
zero padded up to the nearest higher multiple of C (the number of
channels). This new length is displayed as L in efcl's verbose mode. Note
that those zeroes are processed as if they were general numbers, thereby
taking up processing time. It is therefore advisable to use only lengths
divisible by C when designing the filter, so as to make the additional
taps count.
HOW FAST IS IT
The speed generally depends on the number of channels C, the filter
length L and the decimation factor D. To get an idea, I did some testing
using C=72, D=50 and L=1152.
The performance is measured in complex megasamples per second (MS/s)
emitted by the program. Note that for C!=D the output speed differs from
the input one. Moreover, the performance in MS/s is almost insensitive to
the value of D, so it is better to specify the speed in output MS/s than
in input MS/s (which can be computed from the output speed by multiplying
it by C/D).
The plots show how the performance depends on the buffer size (which also
affects the latency of the filterbank --- the less the better). Note that
it is a fairly flat function, so not much is lost when the buffer size is
not optimal. Anyway, for a known machine the makefile sets it optimally,
and the user can easily measure it for his machine and then recompile
with his optimal size. In any case, the buffer size can always be
overridden from the command line using the -B= option.
The following plots show the single core performance of various CPUs in
MS/s. On each CPU the program was compiled with SIMD enabled, i.e. Intel
and AMD used the SSEx86 implementation, whereas the ARM used the NEON
implementation.
Note that this is a dry performance only: no real I/O was performed and
the program just worked from memory to memory. In reality, I/O may become
a non-negligible CPU load due to increased cache pressure. Nevertheless,
at least 50% of what was measured should be attainable in a practical
setup.
Note that there is a certain minimal size of the buffer (namely L+D),
below which the algorithm cannot work and which defines the minimum
latency. If less than this is specified, it is automatically replaced by
L+D. This explains the flat region at the beginning of all the plots. The
noise that can be seen there is caused by natural variations in speed. In
fact these are slightly greater than what can be seen in the picture,
because each point was measured 10 times and the best speed was plotted.
This was to hide sudden drops in performance which I attribute to
interference with system processes (as the testing time for each buffer
size was as short as possible, even a small amount of time stolen by the
system (or another task) shows up as a large drop in performance).
Note that the noise increases to the right, because the data no longer
fit into the cache.
The next plots show the same thing in a relative way, counting how many
complex samples (on average) leave the channelizer per CPU clock. This
way is best for comparing CPU architectures of different clock speeds. It
also shows that in the filter setup used for the test it takes about 100
CPU cycles to produce a single pair of output floats.
Especially note that the per-clock performance of the ARM A53 is not at
all bad, considering it is an in-order machine competing against
out-of-order Intel and AMD. In fact, the assembly part ran as fast as the
respective one on Intel; only the FFTW library seemed to be somewhat
lazier on ARM (which might change by recompiling it with specific
optimization for the A53 turned on, but I have not tried that).
Also note that the A53 is a 64 bit machine, but here it was run in
AArch32 mode, in which it emulates the older 32 bit ARM instruction set.
If programmed in native instructions it would surely be faster (its NEON
unit has twice as many registers then, for example). Unfortunately, the
NEON64 implementation is not ready yet.
The A53 CPU (as used in the Raspberry Pi 3 Model B) is in fact a
quad-core CPU, and the next picture shows its performance when more than
one thread is running. The number of worker threads used is denoted by
the t=<number>_ prefix there. Note that it does not make much sense to
run more than 3 computation threads, as there must always be a manager
thread that supplies the data, which also consumes some CPU power.
Finally, the last plots show the absolute and relative performance of
modern multi-core CPUs where 3 cores were utilized for computation:
Note that using more cores with the minimal buffer size brings only a
small improvement. But when the buffer grows above a certain threshold,
the performance of the multicore run rapidly increases (this happens when
the buffers are big enough that all three threads can be given their own
chunk, taken from the current buffer).
The next plot shows the same thing per CPU clock.
Notice that although the Phenom running on 3 cores was more powerful than
a single Core i7 core in MS/s, it would actually be slower if it ran at
the same clock speed. The IPC performance of the new CPUs has increased
dramatically: it now takes only 10 CPU clocks to produce a single sample
of the result, not 100 as in the case of the AMD Athlon XP and its
competitors shown above.
DEPENDENCIES
Only the Linux operating system is currently supported. The library was
tested to work in Cygwin too, but only in its generic C/C++
implementation.
To compile the efcl library one needs the following software packages.

PACKAGE                  VERSIONS KNOWN TO WORK WITH (*)
gcc/g++                  3.3.6 4.7.3 4.9.2 5.4.0
C library (getconf)      2.2.5 2.19 2.23
pthread library          0.9 2.19
FFTW library             3.3.3 3.3.4
nasm                     2.11.06 2.10.07 2.11.05
ld                       2.14 2.23.2 2.25

extra programs needed for NEON_IMPLEMENTATION (on ARM CPU):
as                       2.25
grep                     2.20
dash                     ? (does anyone know how to determine it?)

(*) There might be other versions that work too; the listed ones were
actually tested. Note that nasm and ld are also required at runtime if
the library has been configured to use the SIMD code. Likewise, grep, as
and dash are needed at runtime for ARM/NEON.
QUICK TEST
Run
make bench
to benchmark the performance of your machine and to test that the
compilation works.
HOW TO COMPILE AND INSTALL
To get help on what the makefile provides, run
make
To install the library, just type
make install
and it will compile and install:
efcl and gosese binary    to $DESTDIR/usr/local/bin
libefcl.a                 to $DESTDIR/usr/local/lib
efcl.h header file        to $DESTDIR/usr/local/include
If you want a different configuration than the one discovered by the
SIMD_probe script, type:
make clean
SIMD=<implementation_name> make install
where <implementation_name> is one of SSEx86, SSE, AVX, AVX512, NEON,
NEON64, C1, C2, C3, A1, A2, A4. Note that the A1, A2, A4 implementations
only compile with an older gcc like gcc-3.3.6, but this is OK as they
were only a prototype of the SSEx86 implementation, which is now faster
than any of the Ax implementations.
PROBLEMS
On the latest Intel processors (Skylake, Kaby Lake), the TSC counters of
individual cores can somehow become out of sync. This causes the load
balancer to mispredict the optimal workload for those cores, which
effectively degrades the overall performance, as some cores have to wait
for the others to finish their jobs. Consequently, the suspend/resume
mechanism detects this and usually suspends all the threads but two.
These two still perform worse than they would on a system with
synchronous TSC counters.
The problem is already known to the Linux developers, as can be seen
here:
http://lkml.org/lkml/2016/12/16/215
So there is some hope they will eventually fix it. Until then, users of
the most recent Intel CPUs will have to get by with the single threaded
performance, which is unaffected by the bug.
FURTHER READING
See the pgmdoc.txt, usrdoc.txt and math.pdf files distributed with the
source code for more detailed information. The easiest way to understand
the internal working of the code is to start with efcl0.cc, which is the
algorithm in its pure form, without the complications stemming from
multithreading and the use of assembly language.
ORIGIN
This code was created at the Faculty of Mathematics and Physics of
Charles University, in the Institute of Formal and Applied Linguistics. A
student of mine wanted to implement a channelizer in his bachelor thesis
[1] as described in the book [2]. He wanted to be faster than a similar
channelizer in GnuRadio, in which he succeeded, outperforming it by a
factor of 3. Nevertheless, he did not have time to try all my advice in
his implementation. To fulfil my tutor role I decided to implement it
from scratch to test whether my ideas were right, hoping that my student
might appreciate some of them at last.
The resulting program is, however, quite far from the neat style of
polished software engineers. It is more akin to the wild style of "real
programmers" [3], though it is not spaghetti code. It is nevertheless
quite complex, and the complexity is not hidden in
functions/objects/whatever, partly because of performance and partly
because efcl can be configured in so many ways. The user can turn on or
off the different ways by which it achieves its high speed in order to
study their impact. The price for this was a heavy use of conditional
compilation that somewhat obfuscates the otherwise clean basic structure.
Anyway, when efcl is used as a library, all this mess is hidden inside,
exposing only the interface, which is pretty straightforward.
AUTHOR
David Klusacek
HOW TO CITE
The permanent link to this work is http://hdl.handle.net/11234/1-2380.
REFERENCES
[1] Jan Hrach: Frequency Spectrum Monitoring System, Bachelor Thesis,
MFF UK, 2016.
[2] Frederic J. Harris: Multirate Signal Processing for Communication
Systems, Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004,
ISBN 0131465112.
[3] Ed Nather: The Story of Mel, USENET, May 21, 1983.