Integrate Custom Layout with IC Compiler Flow Based on 90nm Process
GU Yanke & FU Min
C2 MicroSystems Inc.
WANG Wei
Beijing University of Technology, VLSI & System Laboratory
ABSTRACT
As design complexity grows in deep-submicron (DSM) processes and performance demands rise, the conventional P&R flow alone can no longer meet the requirements of physical design, and critical paths need special handling. Integrating third-party tools efficiently into the P&R tool is therefore becoming a difficult problem. This paper describes the integration of custom layout with IC Compiler, based on the Vector Unit (VU) of a low-cost media processor chip from C2, which has about 1.6 million logic gates after synthesis. The large data path logic and register files in the VU limit the speed of the DSP and CPU in the processor, so the physical implementation of the VU data path needs special attention to achieve higher density and better performance.
To generate the custom layout of the data path logic and register files in the VU, a third-party tool is used, and a special design flow integrates the VU’s custom layout into the Synopsys IC Compiler P&R flow on a 90nm process. Because of the design complexity, several special methods are adopted to improve timing and congestion. The flow achieves very high utilization and steady timing convergence. The timing analysis and the correlation between ICC and PrimeTime SI are also discussed.
1.0 Introduction
Project JAZZ2 is the next-generation System-on-Chip from C2 Microsystems, and the Media Processor (MP) is its core component. The Media Processor Unit (MPU) is a 32-bit RISC superscalar unit equipped with a 256-bit Vector Unit and a 64KB combined cache/buffer (CCB) unit. The scalar unit can issue up to 4 instructions per cycle. This paper focuses on the physical design of the Vector Unit (VU).
The VU is the component dedicated to vector calculation. It consists mainly of four parts: the vector register file, the vector bypass and byte-select crossbar, the vector multiplier and the vector ALU. It has 2×32 entries of 256-bit register files (divided into 64 entries of 128 bits) and a total of 128 8×8 multipliers. It can perform 8-bit, 16-bit and 32-bit integer multiplies, 32-bit floating-point multiplies and 16-bit complex multiplies. To improve the performance of the VU, reducing the latency of the data path between pipeline stages becomes very important, so custom layout of the data path is applied in our design. Since the custom layout is produced by a third-party tool, the main issues for us are how to integrate the custom placement with ICC (IC Compiler) effectively and how to optimize congestion and timing in ICC.
In this paper we introduce our special design flow for the VU using Synopsys tools. We also share some experience of using IC Compiler (version ICC_vZ_2007.03-SP1) to optimize congestion and timing for the VU physical design.
IC Compiler RP (Relative Placement) can perform the same custom data path placement task as the third-party tool we used for the VU layout design, and it is also introduced in this paper.
Finally, the timing analysis and the timing correlation between ICC and PrimeTime SI are discussed.
2.0 Design Flow
Figure 1 shows the special design flow for the VU physical design. It consists mainly of netlist generation, custom placement of the data path, scan stitching, physical design and timing analysis. The steps in bold text boxes are done by Synopsys tools; the others are done by third-party tools.
Figure 1 – Design Flow for VU
2.1 Netlist Generation
The VU module consists of data path and control logic. Since the placement of the data path needs special attention, the data path netlist and its placement are produced by the third-party tool, while the other logic is synthesized by Design Compiler. Finally, the third-party tool reads all the netlists, hooks them up and generates the data path placement information. The details of custom placement by the third-party tool are introduced in chapter 3.
2.2 Scan Chain Stitch
Generally, DFT Compiler stitches the scan registers according to the hierarchy and the register names. Although IC Compiler can reorder scan cells within one scan chain or between different scan chains, it cannot do exactly what we expect because of the random scan stitching in DFT Compiler. For the VU physical design, since the data path is placed by custom design and the data buses are grouped together, the desired stitching can be specified explicitly in the normal DFT design flow, which simplifies the flow and gives exactly the result we expect. In this VU DFT design, we group the bus registers (which are placed by custom placement) and stitch neighboring bus registers into one scan chain. The registers that are synthesized by Design Compiler and placed automatically by ICC are put together in one scan chain.
Figure 2 – Scan Chain without Group
Figure 2 shows the scan chains as displayed by IC Compiler (each color is a single scan chain) when the chains are stitched automatically by DFT Compiler. The scan cells are chained randomly and the distance between two consecutive scan cells may be very long, which strongly affects congestion and timing in a 90nm process. To solve this problem, the components of each scan chain are specified manually with “set_scan_group” and “set_scan_path” in DFT Compiler, so the scan chains are built exactly as we expect. A sample script for grouping the scan cells and setting the scan paths is shown below:
set_scan_group d0vby -include_elements [list mp_vu_dp/mp_vu_vr/vu_vbyp/vbyp_sd \
    mp_vu_dp/mp_vu_vr/vu_d0pipe]
set_scan_group vfa_va -include_elements [list mp_vu_dp/mp_vu_vfa/vfax32x4 \
    mp_vu_dp/mp_vu_va]
set_scan_group w01 -include_elements [list mp_vu_dp/mp_vu_vm/mul_array0/vm_w0 \
    mp_vu_dp/mp_vu_vm/mul_array0/w0pipe]
……
set_scan_path chain1 -include [list d0vby] -complete true
set_scan_path chain2 -include [list vfa_va] -complete true
set_scan_path chain3 -include [list w01] -complete true
……
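When there are many groups, lines of this form can be produced from a table rather than typed by hand. The following is a minimal Python sketch of that idea, not part of the original flow: the helper names and the `CHAINS` table are hypothetical, and only the three groups quoted above are included (the elided ones are not reconstructed). The emitted commands use exactly the options that appear in the sample script.

```python
# Hypothetical helper: emits DFT Compiler commands in the same form as the
# sample script. Group names and instance paths are the ones quoted in the
# paper; the function names and the data layout are illustrative only.


def scan_group_cmd(group, elements):
    """Build one set_scan_group line for a named group of instances."""
    return "set_scan_group {} -include_elements [list {}]".format(
        group, " ".join(elements))


def scan_path_cmd(chain, groups):
    """Build one set_scan_path line that stitches the groups into a chain."""
    return "set_scan_path {} -include [list {}] -complete true".format(
        chain, " ".join(groups))


# One entry per scan chain: chain name -> (group name, instances in the group).
CHAINS = {
    "chain1": ("d0vby", ["mp_vu_dp/mp_vu_vr/vu_vbyp/vbyp_sd",
                         "mp_vu_dp/mp_vu_vr/vu_d0pipe"]),
    "chain2": ("vfa_va", ["mp_vu_dp/mp_vu_vfa/vfax32x4",
                          "mp_vu_dp/mp_vu_va"]),
    "chain3": ("w01", ["mp_vu_dp/mp_vu_vm/mul_array0/vm_w0",
                       "mp_vu_dp/mp_vu_vm/mul_array0/w0pipe"]),
}


def build_script(chains):
    """Return all group definitions first, then all scan path definitions."""
    lines = [scan_group_cmd(g, elems) for g, elems in chains.values()]
    lines += [scan_path_cmd(c, [g]) for c, (g, _) in chains.items()]
    return lines


if __name__ == "__main__":
    print("\n".join(build_script(CHAINS)))
```

Keeping the chain definitions in one table makes it easier to check that every custom-placed bus register group is assigned to exactly one chain before the script is handed to DFT Compiler.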
Figure 3 shows the final VU scan chains in IC Compiler, with each scan chain in a different color. The custom-placed bus registers are chained together, which reduces the distance between scan cells. As a result, the total wire length is reduced and timing and congestion are dramatically improved.
A more efficient flow is now available: IC Compiler can read in the scan DEF dumped by DFT Compiler and then reorder the scan chains based on the placement result.
Figure 3 – Scan Chain with Group
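The effect of stitching order on scan wire length can be illustrated with a toy model. The Python sketch below is only an illustration under stated assumptions: the register names and coordinates are hypothetical, and the greedy nearest-neighbor ordering is a stand-in for placement-aware reordering, not the algorithm used by IC Compiler or DFT Compiler.

```python
# Toy model only: cell names and coordinates are hypothetical, and the greedy
# ordering below is NOT IC Compiler's scan reordering algorithm.


def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def chain_length(order, pos):
    """Total Manhattan distance between consecutive scan cells."""
    return sum(manhattan(pos[u], pos[v]) for u, v in zip(order, order[1:]))


def reorder_by_placement(order, pos):
    """Greedy nearest-neighbor reorder, keeping the first cell as the start."""
    remaining = list(order[1:])
    result = [order[0]]
    while remaining:
        here = pos[result[-1]]
        nxt = min(remaining, key=lambda c: manhattan(here, pos[c]))
        remaining.remove(nxt)
        result.append(nxt)
    return result


# Eight bus registers placed in one column, 10 units apart.
POS = {"r%d" % i: (0, 10 * i) for i in range(8)}

# A stitching order that ignores placement.
STITCHED = ["r0", "r5", "r2", "r7", "r1", "r6", "r3", "r4"]

REORDERED = reorder_by_placement(STITCHED, POS)

if __name__ == "__main__":
    print("placement-unaware length:", chain_length(STITCHED, POS))
    print("placement-aware length:  ", chain_length(REORDERED, POS))
```

In this toy case the placement-unaware order has a total length of 280 units, while the reordered chain visits the registers from `r0` to `r7` with a length of 70 units, which is the kind of reduction that grouping neighboring bus registers aims for.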
2.3 SDC Generation
Since the top-level netlist is generated by the third-party tool, the SDC file has to be generated by PrimeTime (PT). PT reads all the constraints and the top-level netlist and writes out the SDC file that provides the design constraints for IC Compiler.
2.4 Physical Design by IC Compiler
Figure 4 shows the IC Compiler physical design flow. It consists mainly of five parts: floorplan, place_opt, clock_opt, route_opt and chip finishing. Chapter 4 introduces the details of how to optimize congestion and timing with IC Compiler.
Figure 4 – ICC Physical Design Flow
2.5 Extraction by Star-RCXT & STA by PrimeTime SI
After physical design, the parasitics (resistance, capacitance and inductance) of the fully routed design block must be extracted in order to analyze timing, noise, power and so on. Star-RCXT from Synopsys is used for the VU physical design. Its output, such as SPEF (Standard Parasitic Exchange Format), is used by PrimeTime SI for timing analysis, including SI analysis. The details of the timing analysis are introduced in chapter 5.
3.0 Custom Layout Design
To generate the custom layout of the data path logic and register files in the VU, a third-party tool is used in our design. “dpGen” is used by logic designers to design datapath elements with several built-in functions. Datapath elements can be as simple as a 32-bit 2-to-1 mux or as complex as a multiplier array that supports 8-, 16- and 32-bit multiplication. dpGen uses a generic standard cell library to generate the desired circuit, and it outputs a placement file together with a Verilog gate-level netlist.
3.1 Datapath Elements
Figure 5 shows a simple diagram of the datapath flow in the VU, with the different types of logic in the data path and the control logic surrounding it. The register file “RF” and the pipeline bus registers are highly structured, so they should be grouped together and placed along the data path direction. Some combinational logic is also closely tied to these registers, and placing it manually greatly improves congestion and timing. Therefore all of this logic is generated and placed by the third-party tool, and the remaining logic, shown as clouds in figure 5, is synthesized by Design Compiler. As shown in figure 5, the data path runs from left to right and the control lines run from bottom to top.

Figure 5 – VU Datapath Elements
3.2 Register File Example
The following is an example of a multi-port register file built from standard cells. The core of this register file is already very dense, but gaps were kept intentionally every 16 bits of the RF. The last stage of the address decoders was pre-placed. The decoder and other control logic were kept as RTL blocks and synthesized. For the physical implementation in ICC, after the floorplan was initialized, the placement DEF file was read into ICC with the “read_def” command. Every cell placed by the third-party tool had the “FIXED” property. The majority of cells in this register file were already pre-placed. ICC used the gaps between the cells to place the synthesized control logic (red). In addition, all CTS (yellow) and IPO (green) cells were placed in the available gaps.

(I) pre-placement (II) post ICC placement, CTS and IPO
Figure 6 – VU Register File Example
3.3 Combination Datapath and Control Logic
Some space is reserved in advance for the synthesized logic. As figure 7 shows, ICC places the synthesized logic in the space where we expect it to go. Figure 8 shows a similar example, in which the synthesized logic is placed in a central location among the pre-placed cells. The space between the pre-placed logic can be adjusted slightly (increased or decreased) to achieve the best space utilization according to the cell density reported by ICC.
Figure 7 – Synthesized Logic Is Placed in Empty Space
Figure 8 – Random Logic Filling in the Gaps
3.4 Custom Design with ICC Relative Placement
In addition, IC Compiler has a physical datapath engine that allows the user to specify the relative positioning of groups of cells, just like the third-party tool we used. The initial relative positioning of cells occurs during coarse placement. As shown in figure 9, the flow for using ICC relative placement follows these major steps:
◆ Read the gate-level netlist into IC Compiler by using the “import_design” command.
◆ Define the relative placement constraints:
– Create the relative placement groups by using the “create_rp_group” command.
– Add relative placement objects to the groups by using the “add_to_rp_group” command.
Note: IC Compiler annotates the netlist with the relative placement constraints.
◆ Prevent relative placement cells from being removed during optimization by using the “set_size_only” command.
◆ Read the floorplan information by using the “read_def” command.
◆ Perform placement for the design by using the “place_opt” command.
◆ Analyze the relative placement results.
◆ Perform clock tree synthesis with the relative placement structures fixed in place.
◆ Perform routing with the relative placement structures fixed in place.
Figure 9 – Relative Placement Flow
Figure 10 – Relative Placement Column and Row Positions
A relative placement group is an association of cells, other groups, and keepouts. A group is defined by the number of rows and columns it uses. Figure 10 shows the positions of columns and rows in a relative placement group. You can modify these options later by using the “set_rp_group_options” command.
Once an RP group is created, it can be used within another RP group. This is done via the -hierarchy switch of the “add_to_rp_group” command. Figure 11 illustrates the creation and use of hierarchical RP. Figure 12 shows relative placement in a design containing obstructions, which is common in practice.
Figure 11 – Including Groups in a Hierarchical Group
Figure 12 – Relative Placement in a Design Containing Obstructions
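The row-and-column structure of a relative placement group, and the way one group can sit inside another, can be pictured with a small data model. The Python sketch below is a conceptual illustration only and is not IC Compiler's placement engine: the class and function names, cell names and sizes are hypothetical, and it assumes that each column is as wide as its widest item, each row as tall as its tallest item, and every item sits at the lower-left corner of its slot.

```python
# Conceptual model only, not IC Compiler's relative placement engine.
# All names and sizes are hypothetical.


class Cell:
    """A leaf cell with a width and height in arbitrary units."""

    def __init__(self, name, width, height=1):
        self.name = name
        self.width = width
        self.height = height


class RPGroup:
    """A grid of slots; each slot holds a Cell or another RPGroup."""

    def __init__(self, name, columns, rows):
        self.name = name
        self.columns = columns
        self.rows = rows
        self.slots = {}  # (column, row) -> Cell or RPGroup

    def add(self, item, column, row):
        if not (0 <= column < self.columns and 0 <= row < self.rows):
            raise ValueError("slot (%d, %d) is outside group %s"
                             % (column, row, self.name))
        self.slots[(column, row)] = item


def tracks(group):
    """Column widths and row heights: each track fits its largest item."""
    widths = [0] * group.columns
    heights = [0] * group.rows
    for (col, row), item in group.slots.items():
        w, h = size(item)
        widths[col] = max(widths[col], w)
        heights[row] = max(heights[row], h)
    return widths, heights


def size(item):
    """Return (width, height) of a cell or of a whole group."""
    if isinstance(item, Cell):
        return item.width, item.height
    widths, heights = tracks(item)
    return sum(widths), sum(heights)


def flatten(group, x0=0, y0=0):
    """Map every leaf cell name to the lower-left corner of its slot."""
    widths, heights = tracks(group)
    xs = [x0 + sum(widths[:c]) for c in range(group.columns)]
    ys = [y0 + sum(heights[:r]) for r in range(group.rows)]
    placed = {}
    for (col, row), item in group.slots.items():
        if isinstance(item, Cell):
            placed[item.name] = (xs[col], ys[row])
        else:
            placed.update(flatten(item, xs[col], ys[row]))
    return placed


# Inner group: two bits of a register stage, mux column then flip-flop column.
regs = RPGroup("regs", columns=2, rows=2)
regs.add(Cell("mux0", 3), 0, 0)
regs.add(Cell("mux1", 4), 0, 1)
regs.add(Cell("ff0", 5), 1, 0)
regs.add(Cell("ff1", 5), 1, 1)

# Outer group: the register stage followed by a driver cell.
top = RPGroup("top", columns=2, rows=1)
top.add(regs, 0, 0)
top.add(Cell("drv", 2), 1, 0)

placement = flatten(top)
```

In this model both flip-flops land at x = 4 because the mux column takes the width of its widest cell, and the driver lands at x = 9, after the whole inner group. This mirrors the bit-aligned structure that the custom datapath placement is meant to preserve.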
4.0 Physical Design by IC Compiler
ICC is a powerful tool for automatic layout. Three key commands, place_opt, clock_opt and route_opt, simplify the whole layout flow and deliver high-quality optimization. As their names imply, they are used in the placement, CTS and routing stages respectively. Here we introduce the whole layout flow in detail.
4.1 Floorplan
As we know, the floorplan is a critical step in the whole layout design. ICC provides a command for non-rectangular floorplans, which makes this type of floorplan easy. The command is as follows:
The shape dimensions are as follows:
Figure 13 – Initialize FloorPlan
The ICC command initialize_rectilinear_block defines four shape styles that are often used in layout design: L, T, U and X. You can also define an arbitrary shape yourself. For a normal rectangular shape, use the command initialize_floorplan instead. In our design, the custom layout for the data path is designed separately in another tool and needs to be brought into ICC, so after the floorplan is initialized, the DEF file output by the custom layout tool is read into ICC:
The DEF file mp_vu_cutom.def includes only the following information:
1) The position and orientation of the standard cells in the custom layout
2) The port information: position, direction, metal layer, size and so on
Based on the following considerations, we use the DEF format rather than the TDF format to carry the port information.
1) The TDF format does not support non-rectangular floorplans well.
2) We need DEF anyway to carry the custom layout information, so it is convenient to put the port information in the DEF file as well.
In the next step, the synthesized standard cells are placed and the power nets are routed:
The floorplan is now complete, and the following commands can be used to check the quality of the design:
Read the reports from the above three commands carefully. If anything is in doubt, confirm it with the circuit designer. Our floorplan is shown below:
Figure 14 – VU’s Floorplan in ICC
4.2 Placement
The main placement command in ICC is place_opt. We use the following commands to run placement:
Because of the timing and congestion issues, we choose the options “-effort high”, “-congestion” and “-area_recovery” for place_opt.
The “-effort high” option improves timing quality at the cost of a longer runtime. The “-congestion” and “-area_recovery” options give a better congestion result. Scan reordering is enabled with the “-optimize_dft” option.
If the design still has congestion issues after the above commands, you can use the following command:
The “refine_placement” command can further improve congestion, but it makes the timing result worse. You can then use the following command to improve the timing:
Note that during timing optimization in ICC, RC should be extracted before the optimization so that the timing information is up to date. To update the RC information, the command “extract_rc -estimate” is used before routing and “extract_rc” is used directly after routing. During placement, groups (bounds) and placement blockages are the usual methods, applied according to the needs of the design. ICC provides commands for both.
1) Group: create_bounds
The create_bounds command can generate two types of group: hard and soft. The hard type requires the grouped elements to be placed only inside the group region, while the soft type allows some elements to be placed outside it. By default, other cells cannot be placed in the group region; if you want to allow this, you can choose the “-cycle_color” option.
2) Placement blockage: create_placement_blockage
It also has two types: hard and soft. A hard placement blockage strictly prevents cells from being placed in its region, while the soft type allows some cells to be placed there. After taking the above measures, timing improved by about 250 ps during placement and congestion also improved.
4.3 CTS
The quality of CTS directly affects the final timing results, and ICC is powerful at building clock trees.
We describe CTS in detail below, using our design as the example.
The above commands choose the buffer/inverter cells used to build the clock tree. Refer to the standard cell manual provided by the foundry to choose suitable buffers/inverters.
During CTS, design rules such as max_transition can be specified with the following command:
In our design, we set max_transition for clock signals to 200 ps. For clock signals, we usually use a special routing rule to get better quality. The following commands define the special routing rule for clock routing and make the clock tree use it.
The top metal of our design is METAL7 (METG1). We do not use the metal layers below MET3 during CTS, and we make the spacing between clock signals double the default spacing. To control the CTS result, we often give ICC a target skew value for CTS:
The target skew of our design is 300 ps. To solve the timing issues in our design, we create “useful skew” during CTS to borrow time from non-critical paths, which improves timing quality considerably. To create useful skew, we need to choose suitable pins that have non-critical paths before or after them. We first compile sub clock trees from these pins:
In the above examples, a non-critical path exists before the pin “mp_vu_dp/mp_vu_vfmt/perm_cc/I0_Ctg/CLK” and 500 ps can be borrowed from it. Another non-critical path exists after the pin “mp_vu_dp/mp_vu_vfa/vfax32x4/W0_ctg/CLK” and 460 ps can be borrowed from it. If time is borrowed from the preceding path, the useful skew is negative; otherwise it is positive. So the useful skews of the two pins are -500 ps and 460 ps. After building the clock sub-trees, we report the clock latency from these pins.
From “mp_vu_dp/mp_vu_vfmt/perm_cc/I0_Ctg/CLK”, the latency is 336.72.
From “mp_vu_dp/mp_vu_vfa/vfax32x4/W0_ctg/CLK”, the latency is 353.46.
Before building the whole clock tree, we should set these pins as float pins:
The value of “float_pin_max_delay_rise” equals the latency of the pin minus the useful skew of the pin.
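The rule stated above can be applied to the two pins directly. The short Python sketch below is only a worked calculation, not part of the ICC script: it assumes that the reported latencies are in picoseconds like the useful-skew values (the text does not state their unit), the pin names are written with an ASCII "x", and the dictionary layout and function name are hypothetical.

```python
# Worked calculation of: float pin delay = pin latency - useful skew.
# Assumption: the latencies quoted in the text are in ps, like the skews.

PINS = {
    "mp_vu_dp/mp_vu_vfmt/perm_cc/I0_Ctg/CLK": {
        "latency": 336.72,      # reported latency of the sub-tree
        "useful_skew": -500.0,  # time borrowed from the path before the pin
    },
    "mp_vu_dp/mp_vu_vfa/vfax32x4/W0_ctg/CLK": {
        "latency": 353.46,      # reported latency of the sub-tree
        "useful_skew": 460.0,   # time borrowed from the path after the pin
    },
}


def float_pin_delay(latency, useful_skew):
    """Apply the rule from the text: latency minus useful skew."""
    return round(latency - useful_skew, 2)


DELAYS = {pin: float_pin_delay(v["latency"], v["useful_skew"])
          for pin, v in PINS.items()}

if __name__ == "__main__":
    for pin, delay in DELAYS.items():
        print("%-45s %8.2f" % (pin, delay))
```

Under these assumptions the formula gives 836.72 for the first pin and -106.54 for the second.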
After the above settings are complete, we build the whole clock tree:
There are three clocks in our design, but at the top level these three clocks are the same. So we use the command set_inter_clock_delay_options -balance_group “clk gvrclk pclk” to balance them.
The following is the CTS summary:
Figure 15 – CTS Summary Report
After building the whole clock tree, we need to optimize timing and reorder the scan chains:
We run the optimization twice to get better quality of results. The clock signals are then routed:
4.4 Routing
During routing, the main command is route_opt, which includes global routing, track assignment, detail routing and optimization.
Global route options:
First, we perform initial routing only:
After this, we extract RC information based on the initial routing and then run the actual routing and optimization:
We choose the option “-area_recovery” to solve congestion issues.
If routing DRC violations or timing violations still exist after the above step, you can use the following command for further optimization:
The frequency of our design is 330 MHz. After the above steps, the final timing violation is 50 ps, and the results in PT are OK.
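To put the remaining violation in proportion, it can be compared with the clock period. The Python sketch below is a simple worked calculation using only the two figures quoted above (330 MHz and 50 ps); the function names are hypothetical.

```python
# Worked calculation: clock period at 330 MHz and the share of that period
# taken by the 50 ps violation quoted in the text.

FREQ_MHZ = 330.0
VIOLATION_PS = 50.0


def period_ps(freq_mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1.0e6 / freq_mhz


def fraction_of_period(delay_ps, freq_mhz):
    """Express a delay as a fraction of the clock period."""
    return delay_ps / period_ps(freq_mhz)


if __name__ == "__main__":
    print("period: %.2f ps" % period_ps(FREQ_MHZ))
    print("violation: %.2f%% of the period"
          % (100 * fraction_of_period(VIOLATION_PS, FREQ_MHZ)))
```

At 330 MHz the period is about 3030.30 ps, so a 50 ps violation is about 1.65% of the cycle.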
The following commands insert the filler cells before the GDS data is dumped out:
5.0 Timing Analysis
We use the Verilog netlist and SDC file generated by ICC to do STA in PT-SI. With ICC in Arnoldi mode, the correlation between PT-SI and ICC is good, as shown below:
Figure 16 – Timing Analysis in PT-SI (slack=-37.33ps)
PrimeTime SI (signal integrity) is an optional tool that adds crosstalk analysis capabilities to the PrimeTime static timing analyzer. PrimeTime SI calculates the timing effects of cross-coupled capacitors between nets and includes the resulting delay changes in the PrimeTime analysis reports. It also calculates the logic effects of crosstalk noise and reports conditions that could lead to functional failure. The main settings in PT-SI for “report_timing” are as follows:
Figure 17 – Timing Analysis in ICC (slack=-121.92ps)
The main settings in IC Compiler for “report_timing” are as follows:
6.0 Conclusions and Recommendations
From the VU physical design we conclude that IC Compiler supports the effective integration of custom layout design and makes it possible to achieve higher density and better performance. Thanks to Synopsys’ synthesis and DFT tools, we gained high flexibility and good results in optimizing congestion and timing in the IC Compiler design flow. The consistency of commands across the Synopsys tools also makes the tools easier to learn and saves considerable time in physical design. For better consistency of the VU physical design flow, IC Compiler RP (Relative Placement) will be considered for custom layout design in C2 Microsystems’ next-generation project.
7.0 Acknowledgements
This paper was put together with help from Synopsys application consultants and the backend team of C2 Microsystems. The authors would like to thank the following people for their support and kind advice on the VU physical design and on ICC tool issues.
Alfred Jiang, Rachel Xie and Wendy Gao from C2 MicroSystems and ZhiZhong Wang, Jianjin Hu and Tao Wang from Synopsys AE.