DSP algorithms

IM3000 Speed and Efficiency for Digital Signal Processing

The results below come from benchmarks performed by and for a customer as part of a prestudy for an ASIC development. The customer defined the benchmark programs and, except for the FFT, wrote them in C and then optimized them by rewriting them in assembly code for three different processors. Imsys optimized the same benchmark programs for the IM3000 using microcode, which is not possible on the other processors.

Benchmark overview

1.  Array copy – One-time copy of 1024 short (16-bit) values from one buffer to another.

2.  Vector product – Five times dot product of two 1024-value short arrays.

3.  Product of conjugate – Five times cosine computation of two 512-value short arrays.

4.  Atan2 computation – One time product of conjugate over 512 complex (2 × 16-bit) values.

5.  Cosine computation – Five times by 512 samples.

6.  Cosine-Sine computation – 512 complex numbers.

7.  FFT computation – 1024 points.
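As an illustration of the kind of kernel being measured, benchmark 2 could be written in C roughly as follows. This is a minimal sketch, not the customer's code: the function name `dot_product_s16` and the 64-bit accumulator are assumptions made here for clarity.

```c
#include <stddef.h>
#include <stdint.h>

/* Dot product of two arrays of 16-bit values.
 * The 64-bit accumulator is an assumption chosen to rule out overflow;
 * the actual benchmark code may have used a different width. */
int64_t dot_product_s16(const int16_t *a, const int16_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Each 16 x 16 bit product fits in 32 bits. */
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```

In the benchmark as described, such a routine would be called five times on two 1024-value arrays.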

FFT implementations

No C code existed for the FFT benchmark, so Imsys developed its own reference model in C. It operates in place, i.e. the result replaces the input data in memory. The assembly code implementations for STM32 and dsPIC are in-place and out-of-place, respectively. The customer provided a result only for dsPIC, presumably the faster of the two.

Microcode

Microcode optimization for IM3000 means that critical parts of the algorithm have been transformed into special opcodes, which are executed by microcode in the writable part of the control store of the Imsys processor. In the case of the Array copy benchmark, this had already been done, i.e. a suitable opcode already existed in the standard assembly instruction repertoire.

FFT computation on IM3000

Microcode was developed for three operations:

The first instruction takes x, y on the stack, and replaces them with x+y, x-y. The second is a variant that produces -i (x-y) instead of x-y.
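The effect of these two instructions can be modelled in C as shown below. This is only a sketch of the arithmetic: the type `cplx` and the function names are hypothetical, operands are passed by pointer rather than on a stack, and the word widths of the actual microcode are not stated in the source.

```c
#include <stdint.h>

/* Hypothetical complex integer type used only for this illustration. */
typedef struct { int32_t re, im; } cplx;

/* Replace (x, y) with (x + y, x - y). */
void butterfly(cplx *x, cplx *y)
{
    cplx s = { x->re + y->re, x->im + y->im };
    cplx d = { x->re - y->re, x->im - y->im };
    *x = s;
    *y = d;
}

/* Variant: replace (x, y) with (x + y, -i (x - y)).
 * With x - y = a + ib, the product -i (a + ib) equals b - ia. */
void butterfly_neg_i(cplx *x, cplx *y)
{
    cplx s = { x->re + y->re, x->im + y->im };
    cplx d = { x->im - y->im, -(x->re - y->re) };
    *x = s;
    *y = d;
}
```

The second variant folds a multiplication by -i into the subtraction, so no multiplier is needed for it.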

The third instruction is complex multiplication, where one factor is viewed as a fixed-point number, with the range -0x8000 to 0x7FFF representing the interval from -1.0 to just under +1.0. It takes two complex numbers x, y on the stack, and replaces them with the complex number:

  ((Re(x) * Re(y) - Im(x) * Im(y)) >> 15) + i * ((Re(x) * Im(y) + Im(x) * Re(y)) >> 15)
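The same arithmetic can be sketched in C as follows. The type `cplx_q15` and the function `cmul_q15` are hypothetical names, and the 64-bit intermediates are an assumption made to keep the sketch free of overflow; the internal precision of the microcoded instruction is not stated in the source.

```c
#include <stdint.h>

/* Hypothetical complex type; one factor holds values in the
 * fixed-point range -0x8000..0x7FFF, read as -1.0 to just under +1.0. */
typedef struct { int32_t re, im; } cplx_q15;

/* Complex multiply with a 15-bit right shift of each component.
 * Note: right-shifting a negative value is implementation-defined in C;
 * common compilers perform an arithmetic shift, which is assumed here. */
cplx_q15 cmul_q15(cplx_q15 x, cplx_q15 y)
{
    int64_t re = (int64_t)x.re * y.re - (int64_t)x.im * y.im;
    int64_t im = (int64_t)x.re * y.im + (int64_t)x.im * y.re;
    cplx_q15 r = { (int32_t)(re >> 15), (int32_t)(im >> 15) };
    return r;
}
```

For example, multiplying 0x4000 (0.5) by 0x4000 + i 0x2000 (0.5 + 0.25i) yields 0x2000 + i 0x1000, i.e. 0.25 + 0.125i in the same fixed-point scale.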

Results of speed measurements

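Execution time (µs)

| Function | dsPIC | STM32 | PIC32 | IM3000 |
|---|---|---|---|---|
| Array copy | 26 | n/a | n/a | 33 |
| Vector dot product | 132 | 260 | 1336 | 431 |
| Product of conjugate | 921 | n/a | n/a | 651 |
| Atan2 computation | 422 | n/a | n/a | 293 |
| Cosine computation | 488 | 557 | 925 | 359 |
| Cosine-Sine computation | 206 | n/a | n/a | 161 |
| FFT (1024 points) | 2780 | n/a | n/a | 2040 |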

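Results for the two 32-bit RISC processors were available for only two benchmarks. On the cosine computation the IM3000 was faster than both (359 µs against 557 µs and 925 µs). On the vector dot product it was about three times faster than PIC32 (431 µs against 1336 µs) but slower than STM32 (260 µs). Compared with the dsPIC digital signal processor, the IM3000 was faster on five benchmarks and slower on two (array copy and vector dot product).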

Energy consumption

The following results were measured by the customer:

| Processor | Power consumption (mW) |
|---|---|
| dsPIC | 295 |
| STM32 (ARM Cortex-M3) | 118 |
| PIC32 (MIPS) | 128 |
| Imsys IM3000 | 40 |

When these values are multiplied by the execution times for the respective benchmarks, the following results are obtained for the energy consumed by each benchmark execution:

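Energy per benchmark (µWs)

| Function | dsPIC | STM32 | PIC32 | IM3000 |
|---|---|---|---|---|
| Array copy | 8 | n/a | n/a | 1 |
| Vector dot product | 39 | 31 | 171 | 17 |
| Product of conjugate | 272 | n/a | n/a | 26 |
| Atan2 computation | 124 | n/a | n/a | 12 |
| Cosine computation | 144 | 66 | 118 | 14 |
| Cosine-Sine computation | 61 | n/a | n/a | 6 |
| FFT (1024 points) | 820 | n/a | n/a | 86 |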

As can be seen here, the Imsys processor consumes much less energy when executing the benchmarks, in several cases an order of magnitude less.
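The energy figures follow from multiplying each processor's power consumption by the execution time of the benchmark. A minimal sketch of that calculation in C is given below; the function name `energy_uws` is a hypothetical choice for this illustration.

```c
/* Energy in µWs (microjoules) from power in mW and time in µs.
 * mW * µs gives nWs, so the product is divided by 1000. */
double energy_uws(double power_mw, double time_us)
{
    return power_mw * time_us / 1000.0;
}
```

For the vector dot product, 295 mW × 132 µs gives about 38.9 µWs for dsPIC and 40 mW × 431 µs gives about 17.2 µWs for the IM3000, which round to the tabulated 39 and 17.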