Application Specific Instruction Set Processor Design using Processor Designer
Zhu Ziyuan, Tang Shan, Liu Zhiguo
zhuziyuan@ict.ac.cn
Wireless Communication Technology Research Center
Institute of Computing Technology, Chinese Academy of Sciences
Abstract
A well-known challenge in processor design is to obtain the best possible results for a target application domain by balancing flexibility against computational performance. An Application Specific Instruction Set Processor (ASIP) offers a tradeoff between the flexibility of a general-purpose processor and physical characteristics such as computational performance and silicon area. In this paper, we introduce an LTE (3GPP Long Term Evolution) targeted ASIP design flow using Synopsys Processor Designer, covering instruction set and micro architecture design, design space exploration and optimization, and performance profiling. The generation of synthesizable HDL is also discussed, including synthesis and FPGA verification. Finally, by proposing an ASIP targeted at the Fast Fourier Transform (FFT), we explore the design and optimization details of the instruction set and micro architecture with Processor Designer. The experimental results show that the processor computes a 2048-point FFT within 2000 cycles.
Keywords: ASIP, LTE, FFT, Processor Designer, tradeoff between area and efficiency
1. Introduction
Long Term Evolution (LTE) Advanced, introduced by 3GPP, is one of the candidates for next-generation mobile communication. LTE-Advanced is the next major step of LTE, which is based on OFDMA/SC-FDMA, and is intended to fulfill the requirements of IMT-Advanced. Mobile terminals require higher efficiency and speed, which is a great challenge for LTE-Advanced baseband design. Furthermore, next-generation mobile communication systems should support various communication standards so that they can operate seamlessly across heterogeneous networks. The flexibility of the baseband is therefore another key challenge. A software-defined radio (SDR) system is a radio communication system in which components are implemented in software on a personal computer or an embedded computing device. SDR-based communication systems improve the flexibility of baseband processing. Especially on a multi-mode platform, SDR decreases the overall system consumption dramatically.
However, SDR systems require a computational throughput that exceeds the capabilities of modern digital signal processors (DSPs). An alternative implementation style to the DSP is the Application Specific Instruction Set Processor (ASIP). The instruction set of an ASIP is tailored to benefit a specific application. ASIPs offer custom sections for time-critical tasks and retain flexibility through an instruction set. They can be finely tuned to run a small range of applications very efficiently, while keeping the ability to run other tasks through a micro-code program.
Facing the challenges of next-generation wireless communication technology, the Wireless Communication Technology Research Center, a sub-department of the Institute of Computing Technology (ICT), Chinese Academy of Sciences, is developing an LTE baseband system. The system includes a digital signal processing (DSP) subsystem, a protocol stack processor subsystem, an interconnection network, memory and peripheral interfaces. One of the important parts of the baseband system is the baseband signal processing subsystem, which is based on SDR technology and an ASIP implementation.
The rest of the paper is organized as follows. In Section 2, we present an ASIP architecture developed for the LTE (4G) UE (User Equipment) baseband chip, including its general features and functions, together with the design methodology using Synopsys Processor Designer. Section 3 then demonstrates an FFT processor design as a simple example. It details the instruction set and micro architecture design, design space exploration and optimization, performance profiling, HDL generation, synthesis and FPGA verification. The experimental results show that the processor computes a 2048-point FFT within 2000 cycles. Section 4 presents the conclusions.
2. ASIP Design and Modeling Methodology
Current 3G wireless protocols already have large computation and throughput requirements, which nearly exceed the capabilities of modern processors. For 4G, these requirements will undoubtedly increase further and will bring great challenges for digital processor design. These challenges include a heavy-duty data path, the mixing of wide SIMD execution and VLIW, the balance between pipeline control and data throughput, and an efficient multi-ported register file. All of these introduce a very time-consuming and labor-intensive workload. In other words, the task is almost a mission impossible with the traditional Register Transfer Level (RTL) design method. Employing an ESL (electronic system level) design tool, however, frees the designer from the tedious workload and allows more effort to be spent on improving system performance.
Synopsys Processor Designer is an automated ESL environment for the design and optimization of application-specific embedded processors. Processor Designer dramatically accelerates the design of both custom processors and programmable accelerators, especially application-specific instruction set processors (ASIPs). Furthermore, Processor Designer is able to model a wide range of processor architectures, including architectures with DSP-specific and RISC-specific features as well as SIMD and VLIW architectures [1].
In the following, we present an ASIP architecture for the LTE UE baseband chip, including its general features and functions. In this paper, we name this architecture ‘Maotu’, after a traditional Chinese zodiac character. We then present the modeling methodology for Maotu using Synopsys Processor Designer.
2.1 Maotu architecture
Maotu is an ASIP architecture specially designed for baseband processing in 4G wireless communication UE. It is a variable length VLIW machine with a SIMD data path supporting a rich set of vector data processing instructions, which suits the highly parallel data processing in MIMO/OFDM systems. Meanwhile, Maotu has a relatively simple control path, reflecting the simple program flow of the target applications. As part of the baseband processor SoC, Maotu can be used to realize a wide range of signal processing functions, including cell search, OFDM demodulation, channel estimation, MIMO detection, QAM de-mapping, and uplink data generation and modulation, under the control of a host CPU that controls the physical layer procedures.
The main features of Maotu are:
• Control enhanced scalar instructions
○ Integer arithmetic instructions
○ Data transfer instructions
○ Logical instructions
○ Bitwise shift instructions
○ Conditional branch instructions
○ Unconditional jump instructions
○ Zero overhead hardware loop
• VLIW/SIMD mixed architecture
○ Load/store architecture
○ 4-Slot variable length VLIW supported
○ 1024-bit SIMD cluster supported
○ Application-specific SIMD instructions
• Supported datatypes
○ 16, 32 bit integers
○ 16, 32 bit fixed point
○ (16, 16) bit complex fixed point
○ 48 bit multiply and accumulation
• Scalar datapath
○ 32 general purpose registers which are 32-bit wide
○ 32-bit multiply and accumulate unit
• Vector datapath
○ Concurrent operations
■ 32-way 16-bit integer/fixed point multiplication
■ 32-way 16-bit complex fixed point multiplication
■ 32-way 16-bit/32-bit ALU
○ Register files
■ 15 general purpose vector registers, 512-bit wide
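The (16, 16)-bit complex fixed point type and the 48-bit accumulation listed above can be illustrated with a short Python sketch. It is only an illustration: the paper states the bit widths but not the fixed-point format, so the Q1.15 scaling, the rounding step and all function names below are assumptions.

```python
"""Sketch of a (16, 16)-bit complex multiply-accumulate with a wide accumulator.

Assumption: samples use the Q1.15 format (value = integer / 2**15).
The paper gives only the bit widths, not the format or rounding mode.
"""

Q = 15  # assumed number of fractional bits


def to_q15(x: float) -> int:
    """Quantise a float to a saturated 16-bit Q1.15 integer."""
    return max(-32768, min(32767, round(x * (1 << Q))))


def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer into a two's-complement field of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def cmac(acc, a, b, acc_bits: int = 48):
    """Return acc + a * b for complex (re, im) integer pairs.

    The 16 x 16 bit products carry 30 fractional bits; the result is
    kept in an accumulator of acc_bits bits per component.
    """
    ar, ai = a
    br, bi = b
    re = acc[0] + ar * br - ai * bi
    im = acc[1] + ar * bi + ai * br
    return wrap_signed(re, acc_bits), wrap_signed(im, acc_bits)


def acc_to_q15(acc):
    """Round the accumulator back to saturated Q1.15 output samples."""
    def narrow(v: int) -> int:
        v = (v + (1 << (Q - 1))) >> Q  # round to nearest, then drop 15 bits
        return max(-32768, min(32767, v))
    return narrow(acc[0]), narrow(acc[1])
```

For example, multiplying 0.5 + 0.25j by 0.5 - 0.5j through `cmac((0, 0), a, b)` and `acc_to_q15` yields the Q1.15 encoding of 0.375 - 0.125j. The wide accumulator lets many such products be summed before a single rounding step.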
The architecture of Maotu mixes variable length VLIW and SIMD. SIMD instructions are used for known applications, while VLIW instructions are used for new applications. Performance is enhanced by both instruction-level and data-level parallelism. The SIMD instructions and the variable length VLIW feature together ensure code efficiency, and the VLIW instructions provide high flexibility. The VLIW/SIMD mixed architecture is shown in Figure 2-1.

Figure 2-1 The architecture of the Maotu processor
The operations in Maotu are partitioned into three groups: scalar operations, SIMD operations and VLIW operations, as shown in Figure 2-2. The scalar operations are mainly used for control-intensive applications and program flow control. The SIMD operations are special accelerating instructions designed for known applications, e.g. FFT-specific and MIMO-specific instructions, and are used mainly in those applications. The VLIW operations, which can be composed from different instruction combinations, provide high computing power even for unknown applications.

Figure 2-2 Scalar, SIMD and VLIW operations in Maotu
The micro architecture of Maotu, shown in Figure 2-3, consists mainly of four units: instruction handling, scalar data path, vector data path and register file.

Figure 2-3 Maotu micro architecture
Instruction Handling Unit:
There are three functional components in the Instruction Handling Unit: the instruction fetch component, the instruction decode component and the instruction dispatch component. These components access the tightly coupled program memory, decode the VLIW instruction packet and its individual instructions, and finally dispatch them to the corresponding computation units in the scalar data path or the vector data path.
Scalar data path:
The scalar data path is designed for scalar computation, which includes the following units:
• Scalar register file: general purpose register file
• Scalar load/store unit (SLSU): load/store scalar data
• Scalar ALU (SALU): perform scalar arithmetic and logic operation
• Scalar multiply and accumulate unit (SMAC): perform integer scalar multiplication, multiply and accumulate, and division
Vector data path:
The vector data path is designed for vector computation, which includes the following units:
• Vector shuffle unit (VSF): shuffle a vector with a certain pattern
• Intra vector unit (IVU): provide intra-vector operations, e.g. intra addition/subtraction within a vector
• Vector multiply and accumulate unit (VMAC): perform integer/fixed point vector multiplication, multiply and accumulate, division and complex multiplication
• Vector load/store unit (VLSU): load/store a vector
• Vector ALU (VALU): perform vector arithmetic and logic operation
• AGU register file: provide different base address and offset address
• Vector address arithmetic unit (VAAU): compute the effective address based on the AGU registers
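The roles of the vector shuffle unit and the intra vector unit listed above can be sketched in a few lines of Python. This is a behavioral illustration only: the function names, the pattern encoding and the grouping of adjacent lanes are assumptions, since the paper does not specify which shuffle patterns or intra-vector groupings the hardware supports.

```python
def vshuffle(vec, pattern):
    """Return a vector whose lane i holds vec[pattern[i]]."""
    if len(pattern) != len(vec):
        raise ValueError("pattern must have one entry per lane")
    return [vec[p] for p in pattern]


def stride_pattern(lanes, stride):
    """Build a pattern that gathers every stride-th lane together.

    Strided reordering of this kind is typical between FFT stages.
    """
    return [j for r in range(stride) for j in range(r, lanes, stride)]


def intra_add(vec, group):
    """Sum each group of adjacent lanes, returning one value per group."""
    if len(vec) % group:
        raise ValueError("vector length must be a multiple of the group size")
    return [sum(vec[i:i + group]) for i in range(0, len(vec), group)]
```

With 8 lanes, `vshuffle(list(range(8)), stride_pattern(8, 4))` moves lanes 0 and 4 next to each other, then 1 and 5, and so on, while `intra_add(vec, 4)` reduces each half of the vector to a single sum.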
Register File Unit:
Maotu’s register file consists of 31 general purpose scalar registers and 15 general purpose vector registers. A data routing network is also implemented in the register file unit for data exchange within and between the vector register file and the scalar register file.
In summary, Maotu is a processor architecture that supports wide SIMD execution and variable length VLIW. The functional units in Maotu can be reconfigured for different applications and design targets. In Section 3, a Fast Fourier Transform ASIP based on the Maotu architecture is demonstrated as an example. Next, we discuss the benefits of modeling Maotu with Synopsys Processor Designer.
2.2 Design in Processor Designer
Traditional hardware description languages such as VHDL and Verilog benefit from efficient synthesis methods and tools that translate RTL designs into optimized gate-level implementations. However, with the rapid growth of SoC scale, the traditional design methodology is no longer suitable. In contrast, an ESL design environment supports better management of design complexity together with a shorter design cycle. Designing at a higher level of abstraction makes it possible to cope better with system design complexity, to verify earlier in the design process and to increase code reuse, as shown in Figure 2-4.

Figure 2-4 Traditional HDL design vs. ESL design method
Processor Designer provides a highly effective approach for creating complex chips and systems that focuses first and foremost on the higher abstraction level concerns. The key to Processor Designer’s automation is its Language for Instruction Set Architectures (LISA). LISA is a processor description language that incorporates all necessary processor-specific components such as register files, pipelines, pins, memory and caches, and instructions. It enables the efficient creation of a single golden processor specification as the source for the automatic generation of the instruction set simulator (ISS), the complete suite of software development tools (assembler, linker, archiver and C compiler) and synthesizable RTL code. The development tools, together with the extensive profiling capabilities of the debugger, enable rapid analysis and exploration of the application-specific processor’s instruction set architecture to determine the optimal instruction set for the target application domain. Processor Designer enables the designer to optimize the instruction set design, the processor micro architecture and the memory sub-systems, including caches. Processor Designer’s use of a single high-level processor specification ensures the consistency of the ISS, software development tools and RTL implementation, eliminating the verification and debug effort necessitated by multiple, independently created models [1].
A typical design flow in Processor Designer is shown in Figure 2-5. The only labor-intensive work is writing the LISA description. From it, Processor Designer can generate the cycle-accurate simulator for application workload profiling and the synthesizable HDL code for evaluating area, timing and power consumption.

Figure 2-5 Design flow in Processor Designer
In general, Processor Designer dramatically accelerates the design and brings the following benefits:
Rapid Architectural Exploration
Design space exploration for a baseband processing ASIP is difficult because of the complexity of wireless algorithms and the intrinsic complexity of processor design. Changing the instruction set and micro architecture description of a processor by hand is a tedious and error-prone task. For the LTE ASIP in particular, processor design is a joint optimization of the software algorithm and the hardware architecture, and both vary over time. As shown in Figure 2-6, Processor Designer accelerates ASIP architecture exploration by bringing the wireless algorithms, the processor ISA design and the processor architecture into one platform. An architectural exploration cycle can then be finished in a very short time.

Figure 2-6 Maotu Architectural Exploration Flow in Processor Designer
Easily Parameterized
LISA (Language for Instruction Set Architectures) provides a built-in methodology for modeling processor instructions. Every instruction is defined as a function in the C language, so instructions can easily be removed or added for different configurations and applications. Furthermore, Processor Designer provides a convenient method for modeling register files and their read/write ports. Taking advantage of these features, the register file can also be parameterized for different applications.
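The idea of one function per instruction and a parameterized register file can be mimicked in a small Python model. This is not LISA syntax and not the Maotu model; the class, the three example instructions and the 32-bit wrap-around are hypothetical choices made only to show how a configuration can drop instructions or resize the register file without touching the remaining code.

```python
from typing import Callable, Dict, Sequence, Tuple

MASK32 = 0xFFFFFFFF  # scalar registers are modelled as 32-bit values


class Core:
    """Processor state with a register file whose size is a parameter."""

    def __init__(self, num_regs: int = 32) -> None:
        self.regs = [0] * num_regs


# Each instruction is one plain function acting on the processor state.
def op_movi(core: Core, rd: int, imm: int) -> None:
    core.regs[rd] = imm & MASK32


def op_add(core: Core, rd: int, ra: int, rb: int) -> None:
    core.regs[rd] = (core.regs[ra] + core.regs[rb]) & MASK32


def op_mac(core: Core, rd: int, ra: int, rb: int) -> None:
    core.regs[rd] = (core.regs[rd] + core.regs[ra] * core.regs[rb]) & MASK32


FULL_ISA: Dict[str, Callable[..., None]] = {
    "movi": op_movi,
    "add": op_add,
    "mac": op_mac,
}


def build_isa(enabled: Sequence[str]) -> Dict[str, Callable[..., None]]:
    """Select the subset of instructions kept for one configuration."""
    return {name: FULL_ISA[name] for name in enabled}


def run(core: Core, isa: Dict[str, Callable[..., None]],
        program: Sequence[Tuple]) -> None:
    """Execute (mnemonic, operands...) tuples in order."""
    for name, *operands in program:
        if name not in isa:
            raise KeyError(f"instruction '{name}' is not in this configuration")
        isa[name](core, *operands)
```

A configuration such as `build_isa(["movi", "add"])` with `Core(num_regs=8)` then describes a smaller variant of the same model, and a program that uses `mac` is rejected by that variant.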
Accelerate the design process
As an ESL design environment, Processor Designer accelerates the processor design and verification process. First, it provides efficient mechanisms for modeling the instruction set architecture, register files, computational units and memory interfaces. In addition, LISA offers sets of functions for resolving pipeline hazards and accessing memory efficiently. In particular, Processor Designer can generate the cycle-accurate simulator and the HDL modules automatically. These features accelerate the HW/SW co-design process and save the time otherwise spent on HDL coding.
3. An FFT ASIP implementation with Maotu Architecture
The performance of 2048-point FFT processing is a typical benchmark for baseband processor evaluation. In this section, to demonstrate the performance of the Maotu architecture, we implement an FFT ASIP based on it.
The dataflow of the mixed radix-4 and radix-2 2048-point FFT algorithm is shown in Figure 3-1. Because 2048 = 4^5 × 2, the decomposition consists of five radix-4 butterfly stages followed by a final radix-2 butterfly stage. The radix-4 butterfly is a length-4 DFT and the radix-2 butterfly is a length-2 DFT; both are basic computational blocks of the FFT. The pseudo code in Figure 3-2 illustrates the 2048-point FFT implementation in the processor. The algorithm mainly consists of two loop blocks, one for the radix-4 butterflies and another for the radix-2 butterflies.

Figure 3-1 Radix-4 and Radix-2 mixed 2048 point FFT algorithm

Figure 3-2 2048 point FFT implementation pseudo code
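As a software reference for the mixed radix-4 and radix-2 decomposition, the following Python sketch may be helpful. It is not the pseudo code of Figure 3-2: it is a recursive decimation-in-time formulation in floating point, chosen for brevity, in which radix-4 is applied whenever the remaining length is divisible by 4 and radix-2 otherwise. All function names are illustrative.

```python
import cmath


def radix2_butterfly(a, b):
    """Length-2 DFT."""
    return [a + b, a - b]


def radix4_butterfly(a, b, c, d):
    """Length-4 DFT; it only needs additions and multiplications by +/-j."""
    return [
        a + b + c + d,
        a - 1j * b - c + 1j * d,
        a - b + c - d,
        a + 1j * b - c - 1j * d,
    ]


def radix_plan(n):
    """List the radices used for a length-n transform."""
    plan = []
    while n > 1:
        if n % 4 == 0:
            plan.append(4)
            n //= 4
        elif n % 2 == 0:
            plan.append(2)
            n //= 2
        else:
            raise ValueError("length must be a power of two")
    return plan


def fft(x):
    """Mixed radix-4/radix-2 decimation-in-time FFT of a sequence."""
    n = len(x)
    if n == 1:
        return list(x)
    r = 4 if n % 4 == 0 else 2
    if n % r:
        raise ValueError("length must be a power of two")
    sub = [fft(x[q::r]) for q in range(r)]  # r interleaved sub-transforms
    m = n // r
    out = [0j] * n
    for k in range(m):
        # Apply twiddle factors, then one butterfly per output group.
        t = [sub[q][k] * cmath.exp(-2j * cmath.pi * q * k / n) for q in range(r)]
        y = radix4_butterfly(*t) if r == 4 else radix2_butterfly(*t)
        for i in range(r):
            out[k + i * m] = y[i]
    return out
```

`radix_plan(n)` reports which radices the recursion applies for a given length, and `fft` can be checked against a direct DFT on short inputs before it is used as a golden model for a fixed-point implementation.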
To run the program efficiently, the cycle consumption within the loops must be minimized. With Processor Designer, the Maotu architecture can readily be specialized into an FFT ASIP.
Instruction Set Implementation
The instruction set mainly contains the following:
Vector load/store instructions:
For loading and storing data and twiddle factors. Processor Designer provides full-featured tools for modeling memory access behavior.
Shuffle instructions:
For data reordering in the FFT algorithm.
Butterfly instructions:
For computing a butterfly unit, e.g. a radix-2 or radix-4 butterfly.
Address generation instructions:
For updating the addresses used for data store and load.
Loop control instructions:
For program flow control. Processor Designer also provides a convenient method for modeling zero overhead hardware loops.
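The effect of a hardware loop on the fetch sequence can be sketched as follows. This is a simplified Python model, not the Processor Designer mechanism: it assumes one instruction per cycle, a single loop described by start address, end address and iteration count, and an illustrative software-loop overhead of 2 cycles per iteration.

```python
def hardware_loop_trace(program_len, loop_start, loop_end, count):
    """Return the program counters fetched while a hardware loop is active.

    The body [loop_start, loop_end] runs `count` times. The wrap back to
    loop_start is done by the fetch logic, so no branch instruction is
    fetched and no cycle is spent on loop control.
    """
    if count < 1 or not 0 <= loop_start <= loop_end < program_len:
        raise ValueError("invalid loop description")
    pc, remaining, trace = 0, count, []
    while pc < program_len:
        trace.append(pc)
        if pc == loop_end and remaining > 1:
            remaining -= 1
            pc = loop_start
        else:
            pc += 1
    return trace


def software_loop_cycles(body_cycles, count, overhead_cycles=2):
    """Cycles for the same loop when every iteration pays branch overhead.

    overhead_cycles is an assumed cost (counter update plus branch).
    """
    return count * (body_cycles + overhead_cycles)
```

For a 3-instruction body executed 3 times, the hardware loop spends 9 cycles in the body, whereas the software loop under the assumed overhead spends 15. The saving grows with the iteration count, which matters for the tight butterfly loops of an FFT.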
512-bit SIMD/VLIW Implementation
Processor performance is increased by improving both data-level and instruction-level parallelism. The Maotu architecture employs 512-bit wide SIMD execution to improve data-level parallelism. In the FFT processor, a 512-bit vector holds 16 complex points with 16-bit real and imaginary parts, so that 4 radix-4 butterflies or 8 radix-2 butterflies can be computed in parallel. On the other hand, the Maotu architecture supports a 4-slot variable length VLIW mechanism, in which up to 4 instructions can be executed in parallel.
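The lane arithmetic behind these figures can be written out as a short Python sketch. The bit widths come from the text; the grouping of adjacent lanes into butterflies and the helper names are assumptions made for illustration, since the actual lane assignment depends on the shuffle instructions.

```python
VECTOR_BITS = 512                 # SIMD width of the FFT ASIP (from the text)
COMPONENT_BITS = 16               # bits per real or imaginary part (from the text)
POINT_BITS = 2 * COMPONENT_BITS   # one complex point

POINTS_PER_VECTOR = VECTOR_BITS // POINT_BITS   # complex points per vector
RADIX4_PER_VECTOR = POINTS_PER_VECTOR // 4      # radix-4 butterflies per vector
RADIX2_PER_VECTOR = POINTS_PER_VECTOR // 2      # radix-2 butterflies per vector


def radix4_butterfly(a, b, c, d):
    """Length-4 DFT of four complex points."""
    return [
        a + b + c + d,
        a - 1j * b - c + 1j * d,
        a - b + c - d,
        a + 1j * b - c - 1j * d,
    ]


def simd_radix4(vec):
    """Apply one radix-4 butterfly to every group of 4 adjacent lanes.

    Assumption: butterflies operate on adjacent lanes.
    """
    if len(vec) != POINTS_PER_VECTOR:
        raise ValueError("vector must hold exactly one SIMD register of points")
    out = []
    for i in range(0, len(vec), 4):
        out.extend(radix4_butterfly(*vec[i:i + 4]))
    return out


def vector_ops_per_stage(fft_points):
    """Vector butterfly operations needed to process every point once."""
    return fft_points // POINTS_PER_VECTOR
```

Under these widths, `vector_ops_per_stage(2048)` shows how many vector butterfly operations one FFT stage needs if each operation consumes a full vector.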
Therefore, the FFT ASIP is a 512-bit SIMD and VLIW vector processor whose instruction set supports butterfly computation, address generation and data reordering. Figure 3-3 shows the pseudo code of the FFT ASIP. Memory access, address generation and data manipulation can be executed in parallel. All operations are pipelined, so every instruction has a latency of 1 cycle. By taking advantage of zero overhead hardware loops, the 2048-point FFT consumes 1920 cycles.

Figure 3-3 Pseudo code of FFT ASIP
The gate count of the FFT ASIP is about 0.5 million. In a 130 nm CMOS technology, the maximum operating frequency is about 150 MHz.
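The reported cycle count and clock frequency can be combined into an approximate execution time. The Python sketch below does this arithmetic; the 15 kHz subcarrier spacing used for comparison is the standard LTE value and is an assumption brought in from outside this paper, and the 150 MHz clock is the approximate maximum frequency quoted above.

```python
FFT_POINTS = 2048
FFT_CYCLES = 1920                    # reported cycle count
CLOCK_MHZ = 150.0                    # approximate maximum frequency
LTE_SUBCARRIER_SPACING_KHZ = 15.0    # assumption: standard LTE numerology


def fft_latency_us(cycles: int, clock_mhz: float) -> float:
    """Execution time in microseconds (cycles / cycles per microsecond)."""
    return cycles / clock_mhz


latency_us = fft_latency_us(FFT_CYCLES, CLOCK_MHZ)
# Useful OFDM symbol duration is the inverse of the subcarrier spacing.
symbol_us = 1000.0 / LTE_SUBCARRIER_SPACING_KHZ
throughput_msps = FFT_POINTS / latency_us     # complex samples per microsecond
symbol_fraction = latency_us / symbol_us      # share of one symbol period

print(f"FFT latency      : {latency_us:.1f} us")
print(f"OFDM symbol      : {symbol_us:.2f} us")
print(f"Throughput       : {throughput_msps:.0f} Msample/s")
print(f"Share of symbol  : {symbol_fraction:.1%}")
```

Under these assumptions, one 2048-point FFT occupies only a fraction of an OFDM symbol period, which leaves time for other baseband tasks on the same core.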
4. Conclusions and Recommendations
In this paper, we present an application-specific processor architecture tailored to the needs of digital baseband processing. In addition, we implement an FFT ASIP for demonstration; the experimental results show that its data throughput is satisfactory. The ESL design methodology in particular proves efficient and powerful. Figure 4-1 shows the layout of an LTE baseband test chip developed by ICT for physical layer processing. The test chip is a typical SDR baseband processor in which two cores are implemented with the Maotu architecture. It is worth mentioning that all the related RTL code is generated automatically by Processor Designer, and the synthesis result is quite acceptable.

Figure 4-1 LTE baseband test chip layout
5. Acknowledgements
The authors would like to thank Xiaowei Pan of Synopsys China and Falco Munsche of Synopsys Germany for their help during the development of the project.
6. References
[1] http://www.synopsys.com/Systems/BlockDesign/processorDev/Pages/default.aspx



