# EFFICIENT BIT-LEVEL DESIGN OF AN ON-BOARD DIGITAL TV DEMULTIPLEXER

S. Calvo, J. Sala, A. Pagès, G. Vàzquez

Dept. of Signal Theory and Communications, Universitat Politècnica de Catalunya,

c/ Jordi Girona 1-3, Mòdul D5-114b,

08034 Barcelona, SPAIN

Tel: +34-3-4015894; Fax: +34-3-4016447

e-mail: {sergio,alvarez}@gps.tsc.upc.es

# ABSTRACT

A bit-level description of the signal processing stage of an on-board integrated VLSI multi-carrier demodulator is presented in this paper, along with a description of the optimization procedure that has been developed for the signal processing functions<sup>1</sup>. The demultiplexer is capable of handling a varying number of carriers in a 36 MHz bandwidth on the satellite up-link. Its architecture has been optimized at bit-level in a way dependent on the known input signal statistics and carrier distributions allowed by the frequency plan.

## 1 INTRODUCTION

Space digital video broadcasting systems are evolving toward the DVB (Digital Video Broadcasting) standard based on MPEG-2. An increasing amount of processing is being moved to the space segment, so the forthcoming satellite generation will have to carry complex regenerative payloads. This paper describes the architecture developed for an on-board (O/B) multi-carrier demultiplexer ASIC prototype that provides digital television and multimedia services within the framework of the HISPANET network project. HISPANET aims to broadcast digital multiprogramme television to Spanish-speaking communities in Europe and America. The basic concept is to give individual broadcasters and service providers access through specific transponders carried by the HISPASAT satellite. A single carrier conveying all programmes received on the individual up-links (multi-frequency TDMA) is transmitted on the down-link. Therefore, demultiplexing and demodulation (the latter not considered in this paper) must be carried out on board.

The design of digital on-board systems, and specifically of the filtering stages of the digital demultiplexer, has to keep power consumption, gate count and implementation losses to a minimum while maintaining acceptable system performance. A special criterion that takes into account the structure of the (interfering) adjacent carriers has been developed [4] to derive suitable decimation filters for the demultiplexing function. The criterion jointly optimizes the filter response in the pass-, transition- and stop-bands for a given number of coefficients, as the complexity of filtering is exponential in the filter length. In this follow-up paper, we consider the bit-level design of the architecture described therein. A system overview is presented in Section 2. Section 3 presents the approach followed in bit-level design for VLSI integration. Results are shown in Section 4 and conclusions are drawn in Section 5.

## 2 SYSTEM DESCRIPTION

The digital on-board demultiplexer has to deliver any allowed combination of carriers (see Figure 1) at the following signaling rates: $R_s$, $2R_s$, $3R_s$ and $4R_s$, with $R_s$ the lowest signaling rate. Each carrier is QPSK modulated with a square-root raised-cosine pulse (roll-off 0.35). Two frequency plans have been tailored to facilitate the demultiplexing scheme, with a separation of $1.5 R_s$ between adjacent carriers. In the final architecture, the two frequency plans depicted in Figure 1 are processed by two independent demultiplexers that can be internally configured to deal with either of them. The overall bandwidth (36 MHz) can contain up to 18 small carriers at the $R_s$ signaling rate. The sampling scheme is IF sampling at $f_s = 36R_s$ (44.64 MHz). Both frequency plans have been devised to contain the four signaling rates mentioned above under two constraints: (a) only very simple frequency-shifting operations should be required, and (b) the output sampling rate of each carrier, in samples per symbol, should be the same for all rates. These two constraints have led to the construction of two frequency plans and to the design of the demodulators at 3 samples per symbol.

The inner architecture of the demultiplexer consists of intercommunicating polyphase processors arranged in a tree. In principle, it would have been possible to define a common architecture capable of processing both frequency plans: note in Figures 2 and 3 that it would only be necessary to introduce the input signal at either a decimation-by-two or a decimation-by-three block. Nevertheless, the impact of this approach at bit level is considerable, as each processing block must be dimensioned to handle complex signals at a sufficient rate. In the end, each tree's hardware was implemented separately. Hardware optimization is then more straightforward and can be handled more effectively using the techniques described in Section 3.

<sup>1</sup>This research work has been partially supported by the National Research Plan of Spain (CYCIT): TIC95-1022-C05-01 and TIC96-0500-C10-01 and the Catalonian Regional Government (CIRIT): 1996SGR-00096.
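As a quick consistency check of the figures quoted in this section, the following Python sketch derives $R_s$ from $f_s = 36R_s$ and follows the sampling rate through each decimation tree. The stage orders (2, 2, 3) and (3, 2, 2) are read off the tree names. The helper names are illustrative, and the rule that a carrier is extracted wherever the sampling rate equals three times an allowed signaling rate is an assumption made for this sketch, not a statement from the original design.

```python
F_S_MHZ = 44.64                 # IF sampling frequency quoted in the text
FS_OVER_RS = 36                 # f_s = 36 R_s
SAMPLES_PER_SYMBOL = 3          # demodulator input rate quoted in the text
ALLOWED_RATES = (1, 2, 3, 4)    # admissible carrier rates in units of R_s

# Decimation factors per stage, read off the tree names (assumed order).
TREES = {"2:2:3": (2, 2, 3), "3:2:2": (3, 2, 2)}


def lowest_signaling_rate_mhz():
    """R_s in Mbaud, obtained from f_s = 36 R_s."""
    return F_S_MHZ / FS_OVER_RS


def stage_rates(decimations):
    """Per stage: (sampling rate in units of R_s, carrier rate served or None)."""
    rate = FS_OVER_RS
    result = []
    for d in decimations:
        rate //= d
        carrier, remainder = divmod(rate, SAMPLES_PER_SYMBOL)
        served = carrier if remainder == 0 and carrier in ALLOWED_RATES else None
        result.append((rate, served))
    return result


if __name__ == "__main__":
    print(f"R_s = {lowest_signaling_rate_mhz():.2f} Mbaud")
    for name, decimations in TREES.items():
        print(name, stage_rates(decimations))
```

Under these assumptions, tree 2:2:3 serves the $3R_s$ and $R_s$ carriers and tree 3:2:2 serves the $4R_s$, $2R_s$ and $R_s$ carriers, each at 3 samples per symbol, which agrees with the carrier sets listed for the two frequency plans in the caption of Figure 1.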



Figure 1: Frequency Plan No. 1 and No. 2: (above) 2.1 Mbps and 6.3 Mbps carriers (below) 2.1 Mbps, 4.2 Mbps and 8.4 Mbps carriers.

Demultiplexing is carried out according to the schemes shown in Figures 2 and 3:



Figure 2: Filtering and decimation structure for Tree no. 1, or 2:2:3. Each block is a polyphase processor consisting of a decimated filter bank and an IDFT operation. Bit-level optimization is carried out separately at both blocks.

The architecture for the 3:2:2 configuration is displayed in Figure 3. Each block is implemented with a polyphase processor that demultiplexes 6 or 4 carriers, with a decimation ratio of 3 or 2, respectively. Note that each processor works twice as fast as a conventional polyphase processor, because the decimation ratio is only half the number of carriers. This only affects the IDFT part of the polyphase processor, as the same outputs of the filter bank may be used twice (at different input positions to the IDFT) to evaluate odd output samples. Only some of the outputs of each processor contain useful data (depending on the frequency plan), so inhibited outputs are not synthesized in the final hardware.

Figure 3: Filtering and decimation structure for Tree no. 2, or 3:2:2.
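To make the "decimated filter bank followed by an IDFT" structure concrete, here is a minimal Python sketch of a textbook polyphase channelizer, checked against a direct mix, filter and decimate reference. It is a simplification and not the authors' processor: it is critically sampled (decimation ratio equal to the number of channels M), whereas the processors described above decimate by only half the number of carriers and reuse the filter-bank outputs. The function names and the test signal are illustrative assumptions.

```python
import cmath


def direct_channelizer(x, h, M):
    """Reference: shift channel k to baseband, filter with h, keep every M-th sample."""
    out = []
    for m in range(len(x) // M):
        row = []
        for k in range(M):
            acc = 0j
            for l, coeff in enumerate(h):
                n = m * M - l
                if 0 <= n < len(x):
                    acc += coeff * x[n] * cmath.exp(-2j * cmath.pi * k * n / M)
            row.append(acc)
        out.append(row)
    return out


def polyphase_channelizer(x, h, M):
    """Same outputs from M decimated subfilters h_p(r) = h(rM + p) and an IDFT."""
    subfilters = [h[p::M] for p in range(M)]
    out = []
    for m in range(len(x) // M):
        # Branch p filters the decimated input x((m - r)M - p) with h_p(r).
        u = []
        for p in range(M):
            acc = 0j
            for r, coeff in enumerate(subfilters[p]):
                n = (m - r) * M - p
                if 0 <= n < len(x):
                    acc += coeff * x[n]
            u.append(acc)
        # Unnormalized IDFT across the M branch outputs.
        out.append([
            sum(u[p] * cmath.exp(2j * cmath.pi * k * p / M) for p in range(M))
            for k in range(M)
        ])
    return out


if __name__ == "__main__":
    x = [complex((7 * n) % 5 - 2, (3 * n) % 4 - 1.5) for n in range(40)]
    h = [((5 * n) % 7 - 3) / 7 for n in range(11)]
    a = direct_channelizer(x, h, 4)
    b = polyphase_channelizer(x, h, 4)
    print(max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)))
```

The two routines agree to within floating-point rounding, but the polyphase form runs every multiplication at the decimated rate, which is what makes the structure attractive for gate-count reduction.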



Figure 4: Architecture of the /2 (decimation-by-two) processor filter bank. The output vector $\mathbf{y}$ is fed into an IDFT processor. The four subfilters are the decimated versions of the carrier demultiplexing filter $h(n)$: $h_i(n) = h(4n + i)$. For this particular application, we have $h(4n+1) = \delta(n-L)$ and $h(4n+3) = 0$. The upper/lower branches of the four logic multiplexers are selected according to even/odd IDFT output samples (vector $\mathbf{z}$). The /3 processor is designed in a similar way.

## 3 BIT-LEVEL DESIGN

Bit-level design of signal processing algorithms often requires careful wordlength dimensioning at each node of the architecture. The overall performance of the system, measured by a distortion criterion, depends on two factors: (a) the wordlength assigned to each internal variable and (b) the number of quantization levels at the input. The usual goal is thus to optimize performance for a reasonable complexity. It is useful to consider several optimization levels in the design of a digital architecture: (a) the top level, where algorithmic optimization is carried out (choice of the algorithm); (b) the bit level or register transfer level (RTL), where the optimization procedure described in this paper applies; and (c) the gate level, where the RTL specification is synthesized into an interconnection of gates and registers. Trade-offs must be made during the design phase. It is therefore usual for the system specifications to state the maximum tolerated implementation loss, expressed in dB. We have accordingly chosen the implementation loss as the distortion criterion under which the architecture is optimized. It is defined as:

$$L_I = 10 \log_{10} \frac{(E_b/N_o)_{\rm arch}}{(E_b/N_o)_{\rm true}} \quad \text{(dB)} \tag{1}$$

with $(E_b/N_o)_{\rm arch}$ the ratio of bit energy to noise power spectral density obtained at the output of the architecture finally implemented and $(E_b/N_o)_{\rm true}$ the true ratio of the ideal signal model.
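As a small worked illustration of definition (1), the following Python sketch evaluates the implementation loss from two $E_b/N_o$ values given in linear units. The function names and the numerical values are hypothetical and are not results from this design. Note that, with the ratio written as in (1), an architecture that degrades $E_b/N_o$ yields a negative $L_I$.

```python
import math


def db_to_linear(value_db):
    """Convert a power ratio from dB to linear units."""
    return 10.0 ** (value_db / 10.0)


def implementation_loss_db(ebno_arch, ebno_true):
    """Eq. (1): L_I = 10 log10(arch / true), both ratios in linear units."""
    if ebno_arch <= 0 or ebno_true <= 0:
        raise ValueError("Eb/No values must be positive linear ratios")
    return 10.0 * math.log10(ebno_arch / ebno_true)


if __name__ == "__main__":
    # Hypothetical example: ideal model at 7.0 dB, fixed-point architecture at 6.7 dB.
    loss = implementation_loss_db(db_to_linear(6.7), db_to_linear(7.0))
    print(f"L_I = {loss:.2f} dB")
```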

In the previous section, we described the high-level architecture of the digital TV demultiplexer in terms of a polyphase tree. The next step in the design process is to translate these algorithms into RTL (VHDL) primitives.

The system specifications (the number of carriers to be processed, the four admissible symbol rates for each carrier and the varying input dynamics) must be approached with a suitable strategy at the arithmetic and logic levels. The complexity of the overall system depends, to a large degree, on the number of bits assigned to each node in the architecture. It is therefore important to identify the points in the architecture that are most critical in terms of the implementation loss they introduce.

This objective is usually achieved through repeated simulations using probing sequences previously defined for a number of scenarios (we refer here to the critical cases defined in the system specifications). Tables are produced showing the implementation loss associated with each node as a function of the complexity involved, and a final decision is reached on its proper dimensioning. A close understanding of the behaviour of bit dynamics in terms of the data statistics can provide short-cuts to this procedure. In our setting, data statistics can be intuitively related to the spectrum, or frequency plan, present at the input to each polyphase processor.

The working margin of each processor can be defined as the range of signal power that it can accept from the preceding stage in the architecture. Provided that the input power stays within this range, the processor will work according to specifications. In this particular case, the whole demultiplexing tree is implemented as a cascade of polyphase processors, so careful monitoring of signal dynamics is crucial to guarantee that each polyphase processor is near its optimum working point.

Bit-level dynamics depend heavily on the input data statistics. In particular, cascaded architectures are extremely sensitive to this effect. Bit-level primitives are implemented as fixed-point operations, so that recurrent processing on the input data vector results in signal attenuation along the cascade. This attenuation effect, being detrimental to system performance in terms of the quantization SNR (or implementation loss  $L_I$ ), is heavily dependent on the input data spectrum.

The approach taken for this design is to monitor the data histogram along the cascade as probing sequences are fed through. The attenuation can be eliminated if fixed-point amplification is implemented at key points in the architecture. The choice of the amplification factors is critical, in the sense that only a precise understanding of the signal statistics can provide a suitable value for the whole range of carrier distributions contemplated in the frequency plan. A suitable value must prevent both saturation of the arithmetic and an excessively low signal-to-quantization-noise ratio for the specified carrier powers.
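The trade-off behind the choice of an amplification factor can be sketched numerically. The Python example below is an illustration only: it assumes a Gaussian probing sequence, a single quantizer with range $[-1/2, 1/2)$ and power-of-two gains, none of which are taken from the actual design. All function names are hypothetical. It reports the histogram, the clipping probability and the signal-to-quantization-noise ratio for several candidate gains.

```python
import math
import random


def quantize(x, bits=8):
    """Round x onto a `bits`-bit two's complement grid over [-1/2, 1/2), saturating."""
    levels = 1 << bits
    code = round(x * levels)
    code = max(-(levels // 2), min(levels // 2 - 1, code))
    return code / levels


def histogram(samples, n_bins=16):
    """Count samples in equal bins over [-1/2, 1/2); outliers fall in the end bins."""
    counts = [0] * n_bins
    for s in samples:
        index = int(math.floor((s + 0.5) * n_bins))
        counts[max(0, min(n_bins - 1, index))] += 1
    return counts


def probe(samples, gain, bits=8):
    """Return (clipping probability, signal-to-quantization-noise ratio in dB)."""
    clipped = 0
    signal_power = noise_power = 0.0
    for s in samples:
        y = gain * s
        q = quantize(y, bits)
        clipped += abs(y) >= 0.5
        signal_power += y * y
        noise_power += (q - y) ** 2   # includes both rounding and clipping error
    return clipped / len(samples), 10.0 * math.log10(signal_power / noise_power)


def probing_sequence(n, sigma, seed=0):
    """Assumed stand-in for a probing sequence: zero-mean Gaussian samples."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


if __name__ == "__main__":
    samples = probing_sequence(20000, sigma=0.02)
    print("histogram:", histogram(samples))
    for gain in (1, 2, 4, 8, 16):
        p_clip, sqnr = probe(samples, gain)
        print(f"gain {gain:2d}: P(clip) = {p_clip:.4f}, SQNR = {sqnr:.1f} dB")
```

In this toy setting, small gains leave the signal buried in granular noise, while large gains trade that noise for clipping, so the useful gain lies between the two extremes.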

Therefore, in order to determine the behaviour of a target architecture, it is only necessary to determine the input data distributions (or spectra) that deliver the maximally and minimally attenuated signal power at the output for constant input signal power. It can be shown that the attenuation induced by the architecture depends on the randomness of the input signal spectrum. An intuitive justification is straightforward, since all operations involved in demultiplexing the carrier set are linear: let $\{x_i, i \in \mathcal{I}\}$ be a set of input (correlated and bounded) random variables and let us perform a linear operation $L(\cdot)$ on these variables, $y = L(x_1; \cdots; x_N)$. Then the probability density function of the normalized random variable $Y'$,

$$p_{Y'}(y'), \qquad y' \stackrel{\text{def}}{=} y/\max(y) \tag{2}$$

is flatter the more those input random variables are correlated (this can be justified from the Central Limit Theorem). In other words, absence of correlation at the input can be interpreted as the output taking its maximum values with vanishing probability.
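This argument can be illustrated numerically. The Python sketch below rests on simplifying assumptions that are not from the original: the linear operation is a plain sum of $N = 9$ variables uniform on $(-1, 1)$, "correlated" means fully correlated (identical) inputs, and the normalization uses $\max|y| = N$. It estimates how often the normalized output exceeds half of its maximum value.

```python
import random


def tail_fraction(n_vars, correlated, threshold=0.5, trials=20000, seed=0):
    """Fraction of trials in which |y| / max|y| exceeds `threshold`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if correlated:
            # Fully correlated inputs: every variable takes the same value.
            y = n_vars * rng.uniform(-1.0, 1.0)
        else:
            # Independent inputs: the sum tends to a Gaussian shape.
            y = sum(rng.uniform(-1.0, 1.0) for _ in range(n_vars))
        if abs(y) / n_vars > threshold:   # max|y| equals n_vars for this sum
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("independent inputs:", tail_fraction(9, correlated=False))
    print("correlated inputs: ", tail_fraction(9, correlated=True))
```

With independent inputs the upper half of the output range is almost never visited, whereas with fully correlated inputs it is visited about half of the time, which is the flatter density described above.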

Therefore, wordlength dimensioning must take the data statistics into account to guarantee minimum implementation loss. The most significant bits at several points in the architecture can be dropped, as they will only be active with negligible probability (depending on the data statistics). Thus, the logic needed to evaluate those bits can be omitted in the synthesis process, leading to area and gate-delay reductions in the final implementation. Admittedly, a trade-off has to be established between the clipping probability (the probability that the dropped MSBs would have been active) and the logic complexity. In the final hardware, simple scaling operations with factors > 1 shift bits to the upper positions. Rounding is performed afterwards to limit the wordlength passed on to the next processor.
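The scale, round and limit step can be sketched in integer arithmetic as follows. This is a minimal Python illustration under stated assumptions, not the synthesized logic: the gain is a power of two, the internal word is a two's complement integer of `in_bits` bits, rounding is round-half-up, and out-of-range results saturate. The function name and the wordlengths in the example are hypothetical.

```python
def scale_round_saturate(acc, in_bits, gain_shift, out_bits=8):
    """Multiply by 2**gain_shift, then keep `out_bits` bits with rounding and saturation.

    Scaling by 2**gain_shift and keeping the top `out_bits` of the `in_bits` word
    is equivalent to dropping (in_bits - out_bits - gain_shift) LSBs with rounding
    and discarding `gain_shift` MSBs, which are assumed to be rarely active.
    """
    drop = in_bits - out_bits - gain_shift
    if drop < 0:
        raise ValueError("gain_shift too large for the chosen wordlengths")
    if drop > 0:
        # Add half an output LSB before the arithmetic right shift (round half up).
        acc = (acc + (1 << (drop - 1))) >> drop
    lowest = -(1 << (out_bits - 1))
    highest = (1 << (out_bits - 1)) - 1
    # Saturate instead of wrapping when a discarded MSB would have been active.
    return max(lowest, min(highest, acc))


if __name__ == "__main__":
    for value in (1000, -1000, 10000, -32768):
        print(value, "->", scale_round_saturate(value, in_bits=16, gain_shift=2))
```

A larger `gain_shift` preserves more low-order precision for weak signals but makes saturation more likely for strong ones, which is the trade-off between clipping probability and complexity discussed above.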

In conclusion, the use of histograms and characteristic probing sequences has provided the necessary means to reduce the digital architecture complexity of the demultiplexer to an acceptable level.

| Gates      | Stage 1 | Stage 2 | Stage 3 |
|------------|---------|---------|---------|
| Tree 2:2:3 | 5229    | 12477   | 13698   |
| Tree 3:2:2 | 31185   | 25252   | 27235   |

Table 1: Number of gates used by the integrated circuit for each tree. Trees 2:2:3 and 3:2:2 total 31404 and 83672 gates, respectively. Tree 3:2:2 displays a heavier computational load.

## 4 RESULTS

Histograms and spectra are shown at key points in the architecture. We have considered two different scenarios: (a) a carrier configuration containing all nine 2.1 Mbps carriers and (b) a carrier configuration containing two 8.4 Mbps carriers and one 2.1 Mbps carrier. In this way, we can show the effect of data statistics and of the spectrum at the points of interest in the architecture. In particular, we have chosen the input to each of the polyphase processors. The dynamic range is always specified as (-1/2, 1/2).



Figure 5: Histograms (a) and (b) at the output of the first (1) and second (2) stages. Note that histogram (b) is more spread out than (a), as the number of independent carriers it contains is four times lower than in (a). It is advisable to monitor the dynamics associated with (b) to keep a sufficiently low clipping probability, while (a) will be more prone to granular noise. Note also that in (2) histogram (b) is already departing from the Gaussian shape.



Figure 6: (above) spectrum of the (b) configuration and (below) spectrum of the (a) configuration. The frequency axis is normalized to the sampling frequency of 44.64 MHz.

Extensive simulations have been run to obtain a wordlength assignment for each polyphase processor that meets the specifications. An 8-bit analog-to-digital converter (ADC) is used to IF-sample the input signal. Wordlengths passed between polyphase processors are limited to 8 bits. The complexity of each stage, as evaluated from the VHDL synthesis process, is presented in Table 1. Note that Tree 3:2:2 is more complex due to its more demanding computational requirements. For comparison, see Figures 2 and 3.

## 5 CONCLUSIONS

It has been shown that linear signal processing operations can be efficiently synthesized onto an integrated circuit when the statistical dependence between data is exploited in the design process. The reduction in complexity must be traded off against clipping probability. Saturating arithmetic is therefore necessary to avoid excessive distortion.

## REFERENCES

- [1] J. Sala, A. Pagès, J. Riba, S. Calvo, G. Vázquez, M. A. Rey. "Algorithms Study and Simulation Results". Report AEO-001887, 30/06/97, submitted to the European Space Agency under contract ESA/ESTEC-12092/96/NL/US.
- [2] J. Prat, A. Rodríguez, F. Ortega, M. A. Rey. "Digital Architectural Design". Report AEO-0018487, 30/06/97, submitted to the European Space Agency under contract ESA/ESTEC-12092/96/NL/US.
- [3] F. Ortega, A. Rodríguez et al. "An advanced Multi-Carrier Demodulator for the ESA OBP System". Proceedings of the Fifth ESA International Workshop on Digital Signal Processing Techniques Applied to Space Communications.
- [4] J. Sala, A. Pagès, S. Calvo, J. Prat. "Design and Implementation of a DVB On-Board Multi-Carrier Demodulator". Proceedings of ICASSP'98 (COMM10.8).