Multi-Processor System-on-Chip 2. Liliana Andrade

Читать онлайн.
Название Multi-Processor System-on-Chip 2
Автор произведения Liliana Andrade
Жанр Зарубежная компьютерная литература
Серия
Издательство Зарубежная компьютерная литература
Год выпуска 0
isbn 9781119818403



Скачать книгу

in parallel. Such data dependencies exist also in iterative decoding algorithms between the various iterations. Hence, the bottleneck for high throughput is the part of an algorithm that cannot be parallelized. This part strongly limits the overall throughput that is also known as Amdahl’s law (Amdahl 2013). Other important features for efficient high-throughput implementations are locality and regularity to minimize interconnect. Interconnect can largely contribute to area, delay and energy consumption. Let us assume an FEC IP block of size 10 mm2 is a feasible size. A signal has to travel at least 7 mm if it has to be transmitted from one corner to the other in this IP block. It will take three clock cycles in a 14 nm technology, assuming a frequency of 1 GHz. This interconnect delay largely decreases the throughput and increases power. Thus, data transfers can be as important as calculations. Although channel decoding algorithms for advanced codes such as Turbo- and LDPC codes are mainly data-flow dominated, they imply irregularity and restricted locality, since efficient channel coding for these coding schemes is grounded on some randomness: that is the interleaver for Turbo codes and the Tanner graph for LDPC codes, respectively. Any regularity and locality in these structures improves the implementation efficiency but has negative impact on the communications performance. Hence, there is a fundamental discrepancy between information theory and efficient implementations for high throughput that demands regularity and locality. In Table 2.1, important implementation properties for the different code classes are summarized.

Code Dec. Algorithms Parallel vs. Serial Locality Compute Kernels Transfers vs. Compute
Turbo MAP Serial/iterative Low (interleaver) Add-Compare-Select Compute dominated
LDPC Belief propagation Parallel/iterative Low (Tanner graph) Min-Sum/add Transfer dominated
Polar Successive cancelation/list Serial High Min-Sum/add/sorting Balanced

      where ω is a normalized value between 0 and 1, that represents the timing overhead due to, for example, data distribution, interconnect, memory access conflicts, etc. The maximum clock frequency f is determined by the critical path in the compute kernels of the corresponding decoding algorithms and/or delay due to interconnect. Moreover, f is typically upper limited to 1 GHz due to power and design methodology constraints. The overhead ω increases with increasing N and P and is larger for decoding algorithms that have limited locality and are data-transfer dominated. The impact of ω on the throughput can be considered as an effective reduction of the clock frequency f or decrease in P, if additional clock cycles are mandatory, for example, due to memory conflicts, that cannot be hidden. If we are targeting 1 Tbit/s throughput with a frequency limit of 1 GHz, 1,000 bits have to be decoded in one single clock cycle that requires extreme parallelism. To achieve the highest throughput, P has to be maximized, and ω minimized. In the following sections, we will discuss the throughput maximization for the different coding schemes.

      2.3.1. Turbo decoder