
from limited flexibility in terms of block sizes and varying number of iterations, as well as from a large area and high power consumption. Hence, further research is mandatory.

      Figure 2.2. Left: 306 Gbit/s turbo decoder. Middle: 288 Gbit/s LDPC decoder. Right: 764 Gbit/s polar decoder

      2.3.2. LDPC decoder

      LDPC decoding is based on an iterative message exchange between variable and check nodes on the Tanner graph that is represented by the parity check matrix H. Unlike Turbo decoding, this belief propagation (BP) algorithm has inherent parallelism, since all check nodes (and, likewise, all variable nodes) can be processed independently of each other. The decoder throughput is therefore mainly limited by the iterative data exchange between check and variable nodes: the result of each check node or variable node calculation has to be spread via the edges of the Tanner graph to all connected nodes. To provide good communications performance, the Tanner graph must have very limited locality, which challenges an efficient implementation of these data transfers and largely impacts ω. Hence, in contrast to Turbo decoding, the BP algorithm is dominated by data transfers rather than by computation.

      Let us consider an LDPC block code with a parity check matrix H that has #edges(H) 1-entries (the number of 1's in H equals the number of edges in the Tanner graph). Furthermore, let #proc_edges(A) denote the number of edges that a decoder architecture A can process in one clock cycle. The corresponding parallelism is P = #proc_edges(A)/#edges(H). Based on P, we can distinguish the following decoder architectures:

      – Partially parallel architectures: only a subset of the edges and nodes is processed in parallel, i.e. P < 1. These architectures are very common for large block sizes based on quasi-cyclic (QC) block codes; an example is the DVB-S2 standard.

      – Fully parallel architectures at iteration level: all edges are processed in parallel, i.e. P = 1. All check nodes and variable nodes are instantiated as hardware units and the corresponding edges are hardwired. Because of the low locality of the Tanner graph, routing congestion (i.e. minimizing ω) is a big challenge in these architectures.

      – Unrolled fully parallel architectures: these architectures are similar to the fully parallel architectures but, in addition, the I iterations are unrolled and pipelined, i.e. P = I. In every clock cycle, a new block is processed. Only this architectural approach is feasible for achieving a throughput towards 1 Tbit/s (Schläfer et al. 2013; Ghanaatian et al. 2018).

      Figure 2.2 (middle) shows the layout of a (672,420) LDPC decoder using the unrolled fully parallel architecture approach. The decoder is implemented in the same technology and under the same PVT assumptions as the Turbo decoder, runs at 400 MHz and achieves a (coded) throughput of 288 Gbit/s. The total area is 3.31 mm2. Different colors represent the individual iterations (10 in total).
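      The relation between #edges(H), #proc_edges(A) and the resulting architecture class can be made concrete with a few lines of code. The following Python sketch is purely illustrative and not part of the chapter: the toy parity check matrix, the helper names and the chosen parameters are placeholders, and the fall-back branch simply flags a parallelism that matches none of the three classes above.

      import numpy as np

      def num_edges(H):
          # #edges(H): number of 1-entries in H, i.e. edges in the Tanner graph
          return int(np.count_nonzero(H))

      def parallelism(H, proc_edges_per_cycle):
          # P = #proc_edges(A) / #edges(H) for a decoder architecture A
          return proc_edges_per_cycle / num_edges(H)

      def classify(P, iterations):
          # Map the parallelism P onto the architecture classes discussed above
          if P < 1.0:
              return "partially parallel"
          if P == 1.0:
              return "fully parallel at iteration level"
          if P == iterations:
              return "unrolled fully parallel"
          return "outside the three classes above"

      # Toy 3 x 6 parity check matrix with 12 edges (placeholder, not a real code)
      H = np.array([[1, 1, 1, 0, 1, 0],
                    [1, 0, 1, 1, 0, 1],
                    [0, 1, 0, 1, 1, 1]])

      for proc_edges in (4, 12, 120):   # edges processed per clock cycle
          P = parallelism(H, proc_edges)
          print(proc_edges, P, classify(P, iterations=10))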

      2.3.3. Polar decoder

      We have shown that throughputs towards 1 Tbit/s are feasible for all three code classes by appropriate “unrolling”, i.e. by using heavy pipelining and spatial parallelism. However, this architectural approach is limited to small block sizes and small numbers of iterations (Turbo and LDPC codes), which negatively impacts the communications performance. Moreover, although pipelining largely increases throughput and locality, it also increases the latency. All architectures suffer from limited flexibility in terms of block sizes (all three codes), varying numbers of iterations (Turbo and LDPC codes) and code rate (LDPC and Polar codes). In summary, the biggest challenge for very high-throughput decoder architectures lies in improving the communications performance under the aforementioned implementation constraints, while providing block size, code rate and algorithmic flexibility. As discussed in the introduction, microelectronic progress will largely contribute to improved area efficiency, but much less to increased performance and reduced power density. Thus, further research is mandatory to keep pace with the increasing requirements on communication systems in terms of throughput, latency, power/energy efficiency, flexibility, cost and communications performance.
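      As a first-order illustration of the throughput/latency trade-off of an unrolled, fully pipelined decoder, the short Python sketch below evaluates the simple relations throughput ≈ N·f_clk (one block of N coded bits leaves the pipeline per clock cycle) and latency ≈ n_stages/f_clk. It is a back-of-the-envelope sketch only; the block length, clock frequency and number of pipeline stages are assumed placeholder values and do not describe the decoders of Figure 2.2.

      def unrolled_decoder_estimates(block_len_bits, f_clk_hz, pipeline_stages):
          # Unrolled, fully pipelined decoder: one block leaves the pipeline per
          # clock cycle, so the coded throughput grows with N and f_clk ...
          throughput_bps = block_len_bits * f_clk_hz
          # ... while every block still needs pipeline_stages cycles to traverse
          # the decoder, which sets the decoding latency
          latency_s = pipeline_stages / f_clk_hz
          return throughput_bps, latency_s

      # Placeholder parameters: N = 1024 coded bits, 500 MHz clock, 60 stages
      thr, lat = unrolled_decoder_estimates(1024, 500e6, 60)
      print(f"throughput = {thr / 1e9:.1f} Gbit/s, latency = {lat * 1e9:.0f} ns")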

      We gratefully acknowledge financial support by the EU (project-ID: 760150-EPIC).

      Abbas, S.M., Fan, Y., Chen, J., and Tsui, C.Y. (2017). High-throughput and