EFCL
CHANNELIZER
v0.3, January 2017
KEYWORDS
Fast Channelizer, Filterbank, Speech Recognition Front End,
Software Defined Radio, Polyphase Filter, Frequency Multiplexing,
Audio Denoising, High Performance Computing, HPC, SDR, FFT, FFTW,
SIMD, AVX, SSE, NEON.
DOWNLOAD
10.01.2017: EFCL-v0.3 minor compatibility bug fix, documentation update
31.12.2016: EFCL-v0.2 2 bug fixes
30.12.2016: EFCL-v0.1 initial release
ACKNOWLEDGMENTS
This work has been supported by the LINDAT/CLARIN project of the
Ministry of Education, Youth and Sports of the Czech Republic
(project LM2015071).
WHAT IT IS
This is a library and a stand-alone program that splits an input signal
stream into a number of equally spaced channels. The samples are complex
valued single precision floats. If Fs is the sampling frequency, the
channels' center frequencies will be 0, Fs/C, 2*Fs/C, ... (C-1)*Fs/C,
where C is the number of channels. That is, the zeroth channel spans the
interval (-Fs/2C, Fs/2C), the first channel (Fs/2C, 3Fs/2C), and the last
one ((C-1.5)*Fs/C, (C-0.5)*Fs/C), which --- due to spectrum
periodicity --- can be regarded as channel "-1" spanning from -3Fs/2C
to -Fs/2C. The output of the channelizer can optionally be decimated by
an integer factor D (typically D=C).
The library is designed for high throughput, using on-the-fly compiled
assembly codelets that get compiled and tuned for the particular machine
and the C and D parameters when the program starts. The compiled codelet
is then dynamically linked with the main program so that the channelizer
can use it. The assembly relies heavily on SIMD instructions, the flavor
of which (SSE, AVX or NEON) is detected by the makefile when the library
gets compiled. So, if one wants to distribute platform independent
binaries, it is necessary to compile libefcl.so for the different SIMD
flavors and distribute several different libefcl.so files from which the
installer (or the main program) would select the proper one (in fact this
would be more complicated, as there are also tuning constants and options
which are CPU model specific).
For CPUs that lack SIMD instructions, there is still a generic C version
of the code, which is about 3 times slower.
The library can also take advantage of multi-core CPUs to further
increase its speed by running the computation in multiple threads. The
number of threads is configurable at runtime, and the library adaptively
balances the load (even turning itself off on cores overloaded by other
tasks) so as to achieve maximal throughput.
WHAT IT ISN'T
To select a channel out of a signal one needs a filter. The efcl
channelizer expects to be given its impulse response in a file containing
binary float numbers. The response itself may or may not be symmetric ---
it does not matter, as the filtering is fully general. If you need to use
a minimum phase filter for some reason, that is possible. It is enough to
specify this filter for channel 0, i.e. the low pass filter --- the
filters for the other channels are derived from it automatically by efcl.
Although there is a utility "./simple_filter" which produces an example
filter (a sigma factor weighted truncated sin(x)/x), it was meant only
for debugging and performance testing, not for meaningful signal
processing.
Designing a proper filter is much more involved. The crucial thing to
consider is that the stop band of such a filter must fall off as 1/f, or
better as 1/f^2, so that the stop band sums to a small constant when it
is folded by aliasing (which happens due to decimation). Otherwise the
noise and spectral leakage will be much higher than expected.
Designing such a filter is, however, beyond the scope of this library.
Use GnuRadio or write your own design tool (a weighted Remez algorithm
might be the method of choice). Keep in mind that the filter length is
zero padded up to the nearest higher multiple of C (the number of
channels). This new length is displayed as L in efcl's verbose mode. Note
that those zeroes are processed as if they were general numbers, thereby
taking up processing time. It is therefore advisable to use only lengths
divisible by C when designing the filter, so as to make the additional
taps count.
HOW FAST IS IT
The speed generally depends on the number of channels C, the filter
length L and the decimation factor D. To get an idea, I did some testing
using C=72, D=50 and L=1152.
The performance is measured in complex megasamples per second (MS/s)
emitted by the program. Note that for C!=D the output speed differs from
the input one. Moreover, the performance in MS/s is almost insensitive to
the value of D, so it is better to specify the speed in output MS/s than
in input MS/s (which can be computed from the output speed by multiplying
it by C/D).
The plots show how the performance depends on the buffer size (which also
affects the latency of the filterbank --- the less the better). Note that
it is a fairly flat function, so not much is lost when the buffer size is
not optimal. Anyway, for a known machine the makefile sets it optimally,
and the user can easily measure it for his machine and then recompile
with his optimal size. In any case, the buffer size can always be
overridden from the command line using the -B= option.
The following plots show the single core performance of various CPUs in
MS/s. On each CPU the program was compiled with SIMD enabled, i.e. Intel
and AMD used the SSEx86 implementation, whereas the ARM used the NEON
implementation.
Note that this is a dry performance only: no real I/O was performed and
the program just worked from memory to memory. In reality, I/O may become
a non-negligible CPU load due to increased cache pressure. Nevertheless,
at least 50% of what was measured should be attainable in a practical
setup.
Note that there is a certain minimal size of the buffer (namely L+D),
below which the algorithm cannot work and which defines the minimum
latency. If less than this is specified, it is automatically replaced by
L+D. This explains the flat region at the beginning of all the plots. The
noise that can be seen there is caused by natural variations in speed. In
fact these are slightly greater than what can be seen in the picture,
because each point was measured 10 times and the best speed was plotted.
This was to hide sudden drops in performance which I attribute to
interference with system processes (as the testing time for each buffer
size was as short as possible, even a small amount of time stolen by the
system (or another task) shows up as a large drop in performance).
Note that the noise increases to the right, because the data no longer
fit into the cache.
The next plots show the same thing in a relative way, counting how many
complex samples (on average) leave the channelizer per CPU clock. This
way is best for comparing CPU architectures of different clock speeds. It
also shows that in the filter setup used for the test it takes about 100
CPU cycles to produce a single pair of output floats.
Especially note that the per-clock performance of the ARM A53 is not at
all bad, considering it is an in-order machine competing against
out-of-order Intel and AMD. In fact, the assembly part ran as fast as the
respective one on Intel; only the FFTW library seemed to be somewhat
lazier on ARM (which might change by recompiling it with specific
optimization for the A53 turned on, but I have not tried that).
Also note that the A53 is a 64 bit machine, but here it was run in
AArch32 mode, in which it emulates the older 32 bit ARM instruction set.
If programmed in native instructions it would surely be faster (its NEON
unit has twice as many registers then, for example). Unfortunately, the
NEON64 implementation is not ready yet.
The A53 CPU (as used in the Raspberry Pi 3 Model B) is in fact a
quad-core CPU, and the next picture shows its performance when more than
one thread is running. The number of worker threads used is denoted by
the t=<number>_ prefix there. Note that it does not make much sense to
run more than 3 computation threads, as there must always be a manager
thread that supplies the data, which also consumes some CPU power.
Finally, the last plots show the absolute and relative performance of
modern multi-core CPUs where 3 cores were utilized for computation:
Note that using more cores with the minimal buffer size brings only a
small improvement. But when the buffer grows above a certain threshold,
the performance of the multicore run rapidly increases (this happens when
the buffers are big enough that all three threads can be given their own
chunk, taken from the current buffer).
The next plot shows the same thing per CPU clock.
Notice that although the Phenom running on 3 cores was more powerful than
a single Core i7 core in MS/s, it would actually be slower if it ran at
the same clock speed. The IPC performance of the new CPUs has increased
dramatically: it now takes only 10 CPU clocks to produce a single sample
of the result, not 100 as in the case of the AMD Athlon XP and its
competitors shown above.
DEPENDENCIES
Only the Linux operating system is currently supported. The library was
tested to work in Cygwin too, but only in its generic C/C++
implementation.
To compile the efcl library one needs the following software packages.

PACKAGE                  VERSIONS KNOWN TO WORK WITH (*)
gcc/g++                  3.3.6 4.7.3 4.9.2 5.4.0
C library (getconf)      2.2.5 2.19 2.23
pthread library          0.9 2.19
FFTW library             3.3.3 3.3.4
nasm                     2.11.06 2.10.07 2.11.05
ld                       2.14 2.23.2 2.25

extra programs needed for NEON_IMPLEMENTATION (on ARM CPU):
as                       2.25
grep                     2.20
dash                     ? (does anyone know how to determine it?)

(*) There might be other versions that work too; the listed ones were
actually tested. Note that nasm and ld are also required at runtime if
the library has been configured to use the SIMD code. Likewise, grep, as
and dash are needed at runtime for ARM/NEON.
QUICK TEST
Run
make bench
to benchmark the performance of your machine and to test that the
compilation works.
HOW TO COMPILE AND INSTALL
To get help on what the makefile provides, run
make
To install the library, just type
make install
and it will compile and install:
efcl and gosese binary    to $DESTDIR/usr/local/bin
libefcl.a                 to $DESTDIR/usr/local/lib
efcl.h header file        to $DESTDIR/usr/local/include
If you want a different configuration than the one discovered by the
SIMD_probe script, type:
make clean
SIMD=<implementation_name> make install
where <implementation_name> is one of SSEx86, SSE, AVX, AVX512, NEON,
NEON64, C1, C2, C3, A1, A2, A4. Note that the A1, A2, A4 implementations
only compile with an older gcc like gcc-3.3.6, but this is OK as they
were only a prototype of the SSEx86 implementation, which is now faster
than any of the Ax implementations.
PROBLEMS
On the latest Intel processors (Skylake, Kaby Lake), the TSC counters of
individual cores can somehow become out of sync. This causes the load
balancer to mispredict the optimal workload for those cores, which
effectively degrades the overall performance, as some cores have to wait
for the others to finish their jobs. Consequently, the suspend/resume
mechanism detects this and usually suspends all the threads but two.
These two still perform worse than they would on a system with
synchronous TSC counters.
The problem is already known to the Linux developers, as can be seen
here:
http://lkml.org/lkml/2016/12/16/215
So there is some hope they will eventually fix it. Until then, users of
the most recent Intel CPUs will have to get by with the single threaded
performance, which is unaffected by the bug.
FURTHER READING
See the pgmdoc.txt, usrdoc.txt and math.pdf files distributed with the
source code for more detailed information. The easiest way to understand
the internal working of the code is to start with efcl0.cc, which is the
algorithm in its pure form, without the complications stemming from
multithreading and the use of assembly language.
ORIGIN
This code was created at the Faculty of Mathematics and Physics of
Charles University, in the Institute of Formal and Applied Linguistics. A
student of mine wanted to implement a channelizer in his bachelor thesis
[1] as described in the book [2]. He wanted to be faster than a similar
channelizer in GnuRadio, in which he succeeded, outperforming it by a
factor of 3. Nevertheless, he did not have time to try all my advice in
his implementation. To fulfil my tutor role I decided to implement it
from scratch to test whether my ideas were right, hoping that my student
might appreciate some of them at last.
The resulting program is, however, quite far from the neat style of
polished software engineers. It is more akin to the wild style of "real
programmers" [3], though it is not spaghetti code. It is nevertheless
quite complex, and the complexity is not hidden in
functions/objects/whatever, partly because of performance and partly
because efcl can be configured in so many ways. The user can turn on or
off the different ways by which it achieves its high speed in order to
study their impact. The price for this was a heavy use of conditional
compilation that somewhat obfuscates the otherwise clean basic structure.
Anyway, when efcl is used as a library, all this mess is hidden inside,
exposing only the interface, which is pretty straightforward.
AUTHOR
David Klusacek
HOW TO CITE
The permanent link to this work is http://hdl.handle.net/11234/1-2380.
REFERENCES
[1] Jan Hrach: Frequency Spectrum Monitoring System, Bachelor Thesis,
MFF UK, 2016.
[2] Frederic J. Harris: Multirate Signal Processing for Communication
Systems, Prentice Hall PTR, Upper Saddle River, NJ, USA, 2004,
ISBN 0131465112.
[3] Ed Nather: The Story of Mel, USENET, May 21, 1983.