Overcome Physical Implementation Challenges with Combined Multi-Techniques in IC Compiler


Zhen Yang
Zhen.Yang@lsi.com

LSI SHANGHAI

Abstract

This article presents physical design challenges and the approaches we used to solve them in IC Compiler. We first introduce our basic physical implementation flow and then share practical experience in combining multiple techniques to remove two implementation bottlenecks: congestion and critical timing. Design complexity, especially complex dataflow, causes severe congestion hot spots; we resolve them with a “placement blockage array” and “keepout margin”. In addition, the design contains a number of large memories that must run at high speed, and their interface timing paths are the timing bottleneck. By combining “soft bound”, “path group” and “useful skew”, we close the critical timing violations on those memory interfaces.

1. Introduction

Physical designers routinely face the challenge of achieving high speed and low power while keeping area minimal to reduce cost, all under a tight schedule. Our design is the core of a storage product that works at 3.0GHz. It has 1.5 million standard cell instances, 89 SRAMs and 1 analog macro. Its main system clock runs at 750 MHz, and it has more than 45 clock domains. The design is implemented in TSMC 45nm technology and adopts a high-density library to meet its minimum-area and low-power targets. It uses the following low-power methods: power islands, a multi-Vth library and clock gating. In this paper we focus on its physical implementation challenges, namely congestion and the high-speed bottleneck on memories, and present how we overcame them in IC Compiler.
Section 2 briefly explains our basic physical implementation flow and the main structure of the floorplan; understanding the design is important for understanding the difficulties encountered and their solutions. Section 3 describes the congestion issue and our approach using keepout margin and blockage array. Section 4 presents the high-speed timing challenge on large memories and how we combined multiple techniques to overcome it.

2. Flow and Design Overview

The following section highlights the main steps of our implementation flow.

2.1 Basic Flow Overview

1. Synthesis with DC Topographical.
The topographical feature of Design Compiler (DC-T) enables layout-aware RTL synthesis. It performs a coarse placement of the cells and extracts interconnect resistance and capacitance from the virtual layout, which gives better correlation with ICC. After floorplan exploration, ICC dumps a floorplan file for DC-T. The file contains the die size, macro placement, port/pad locations, placement blockages and bounds. The ICC command is:
icc_shell-> write_physical_constraints -output floorplan.tcl
This process generally takes several iterations to converge on the golden netlist and floorplan.
2. place_opt
place_opt is an ICC “mega” command. Because of the complexity and timing challenges of our chip, it takes 24 to 28 hours to complete. The ICC command is:
icc_shell-> place_opt -cong -optimize_dft -effort high -num_cpus 4
3. CTS
Because the clock network of the design is complex, a proper analysis of the clock structure is essential. When building the clock tree, we first synthesize the high-speed clocks and then the generated clock trees.
4. Post-CTS fix
This step fixes hold and setup timing violations after CTS.
5. route_opt
We use the Zroute engine for detail routing; it gives faster runtime and a higher double-via rate.
6. Leakage Power Optimization (LIPO)
The LSI design flow uses an internal leakage power recovery utility run from within PrimeTime. This utility was discussed previously at Boston SNUG 2008 [1] and has been used in the LSI design flow on many production designs. The technique of leakage power recovery run in the signoff environment has achieved significant leakage reduction.
LIPO efficiently performs the leakage swap without introducing timing violations. After running the scripts, leakage power decreased noticeably.

Figure 1 – ICC Basic Flow

2.2 Design Overview

The design has complex internal timing paths and contains a number of large memories, which complicates floorplanning and leads to serious congestion and timing-closure problems. The floorplan is shown in Figure 2.

Figure 2 – Floorplan Overview

3. Congestion Challenge and Approach

The congestion challenge in this design comes mainly from two sources:
1. Near memory channels
2. High pin density cells: “AOI” and “OAI”
Figure 3 shows the congestion map before fixing.

Figure 3 – Original Congestion Map

Hot spot A in Figure 3 is caused by AOI/OAI-type cells. Because these cells have high pin density and the dataflow of the design connects them in complex ways, they cause routing overflow in the local region (Figure 4). Although ICC tries to fix the congestion by lowering local cell density, the issue is not solved completely; the local congestion map is shown in Figure 5. We overcame it by applying a “keepout margin” to those cells during place_opt.
The command is:
icc_shell-> set_keepout_margin $object_list -outer {lx by rx ty}
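
One way to build the object list for the keepout margin is to select cells by library reference name. The fragment below is a minimal sketch for icc_shell, not the script used on this design: the ref_name patterns and the margin values are hypothetical and must be replaced with the actual AOI/OAI cell names of the library and margins tuned to the congestion map.

```tcl
# Sketch for icc_shell. The ref_name patterns and margin values are
# hypothetical placeholders, not taken from the design in this paper.
set hot_cells [get_cells -hierarchical \
    -filter "ref_name =~ *AOI* || ref_name =~ *OAI*"]

if {[sizeof_collection $hot_cells] > 0} {
    # -outer takes the margins in the order {left bottom right top}.
    set_keepout_margin -outer {0.5 0 0.5 0} $hot_cells
}
```

Applying the margin to every AOI/OAI cell on the chip costs area, so in practice the collection may need to be narrowed to the cells inside the hot spot.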

Figure 4 – AOI/OAI Type Cell Connection

Figure 5 – Original Congestion Map around AOI/OAI Cell

After setting the keepout margin on those cells, we rerun placement optimization. The congestion around them improves, as shown in Figure 6.

Figure 6 – Optimized Congestion Map around AOI/OAI Cell

Hot spot B in Figure 3, near the memory channel, is caused by the complex interconnect related to the memories. We overcame it with a blockage array. After place_opt, we review the congestion map and create the blockage array. The guideline is: each array is created following the congestion gradient, and the width and spacing of its blockages are determined by the congestion severity. Once the blockage array is in place, we run:
icc_shell-> refine_placement -num_cpus 4
The congestion improves markedly from Figure 7 to Figure 8.
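
The width-and-space guideline can be sketched as a small generator that emits one blockage command per strip. This is an illustrative simplification with uniform strips, not the script used on this design: the proc name, the window coordinates and the strip geometry are hypothetical, and the option names of create_placement_blockage (-bbox, -type, -name) are an assumption that should be checked against the man page of the ICC release in use.

```tcl
# Emit placement-blockage commands for a row of vertical strips.
# Pure Tcl: it only builds command strings, so the output can be reviewed
# in tclsh before it is sourced in icc_shell.
# Assumption: -bbox/-type/-name must be verified for your ICC version.
proc blockage_array_cmds {llx lly urx ury width space {prefix pb_array}} {
    if {$width <= 0 || $space < 0} {
        error "width must be > 0 and space must be >= 0"
    }
    set cmds {}
    set i 0
    set x $llx
    # Place strips left to right until the next one would cross urx.
    while {$x + $width <= $urx} {
        set x2 [expr {$x + $width}]
        lappend cmds [format \
            "create_placement_blockage -name %s_%d -type hard -bbox {%s %s %s %s}" \
            $prefix $i $x $lly $x2 $ury]
        set x [expr {$x2 + $space}]
        incr i
    }
    return $cmds
}

# Hypothetical hot-spot window and strip geometry.
foreach cmd [blockage_array_cmds 100 200 130 400 5 5] {
    puts $cmd
}
```

A larger width or a smaller space blocks more placement area, so both values are the knobs to tune against the congestion map.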

Figure 7 – Original Congestion Map around Memory

Figure 8 – Optimized Congestion Map around Memory

With keepout margin and blockage array combined, the congestion issue was readily resolved.

4. Timing Challenge and Approaches

We found that in the TSMC 45nm library, standard cell delay decreases considerably, whereas memory delay and setup time improve little. In our design, the timing bottleneck is therefore on the memory timing paths (Figure 9).

Figure 9 – Critical Timing Path on Memory

In Figure 9, the timing paths from “MRF_mem” to “mrf_buf_reg” and from “mrf_buf_reg” to “LE_MEM” are critical. On the other hand, the paths into “MRF_mem” and the paths out of “LE_MEM” are also critical, so useful skew cannot be used to relieve this bottleneck. We mainly adopt soft bound and path group to solve the timing issue.
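
The limit of useful skew here can be made concrete with a simple setup-slack model. As a sketch, assume a sequential element sits between two stages with setup slacks s_1 (incoming path) and s_2 (outgoing path), and that its clock arrival is delayed by δ. These symbols are introduced for illustration only, and hold constraints are ignored.

```latex
\begin{aligned}
s_1' &= s_1 + \delta, \qquad s_2' = s_2 - \delta, \qquad s_1' + s_2' = s_1 + s_2 \\
\max_{\delta}\,\min(s_1', s_2') &= \frac{s_1 + s_2}{2}
  \quad\text{at}\quad \delta = \frac{s_2 - s_1}{2}
\end{aligned}
```

Skew only moves slack from one side of the element to the other; the sum stays constant. When the paths on both sides are already violating, the best achievable worst slack is still negative, which is the situation around the memories and the buffer registers described above.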

4.1 Soft bound

Both critical paths above pass through “mrf_dbuf_regs”, so timing is very sensitive to where those registers are placed. Through iterative experiments we found a suitable location for those 480 flip-flops, shown in red in Figure 10.

Figure 10 – Whole Chip Soft Bound Distribution

We also define extra soft bounds for other critical modules and logic, which proved very helpful in meeting timing across the whole chip. The command to create a bound is:

icc_shell-> create_bounds -name mrf_dbuf_reg -type soft \
[get_cells */mrf2_dbuf_reg_*] -coordinate {{540 2045} {700 2352}}

4.2 Path Group

By default, ICC defines every clock domain as a path group with a critical range of 0. Within each group, ICC stops optimizing when all paths in the group meet timing or when optimization gets stuck on the critical path. If optimization gives up on the critical path, no further optimization is performed on the less critical paths. This leaves the optimization capability of ICC underused and results in more violating paths and a larger total negative slack (TNS). For the critical data-bus paths related to the memories, we define path groups with a critical range, as follows:

group_path -name group_mrf2dbuf -critical_range 0.1 -weight 10 \
-from [get_pins {*/MRF_MEM_*/QB[*]}] \
-to [get_pins */mrf2_dbuf_reg_*/D]
group_path -name group_dbuf2le_mem -critical_range 0.1 -weight 10 \
-from [get_pins */mrf2_dbuf_reg_*/Q] \
-to [get_pins {*/LE_MEM_*/D[*]}]

In our experience, fixing sub-critical paths may help the critical path. Since the critical range is measured relative to the critical path delay, when the critical path improves, the critical range band moves with it. Critical range optimization will not improve a sub-critical path if doing so makes the critical path worse. With a critical range, optimization noticeably reduces TNS. The weight controls the relative priority of a path group during optimization.
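
The effect of the critical range can be illustrated with a small worked example in plain Tcl. This is a sketch of the selection rule only, not of the optimizer inside ICC: the proc names are hypothetical and the slack values (in ns) are invented for illustration.

```tcl
# Which violating paths fall inside the critical range of a path group?
# Slack values are invented for illustration only.
proc paths_in_critical_range {slacks range} {
    # Worst slack in the group: the most negative value.
    set wns [lindex [lsort -real $slacks] 0]
    set picked {}
    foreach s $slacks {
        # Only violating paths within 'range' of the worst path are targeted.
        if {$s < 0 && $s <= $wns + $range} {
            lappend picked $s
        }
    }
    return $picked
}

# Total negative slack: the sum of all negative slacks.
proc total_negative_slack {slacks} {
    set tns 0.0
    foreach s $slacks {
        if {$s < 0} { set tns [expr {$tns + $s}] }
    }
    return $tns
}

set slacks {-0.30 -0.25 -0.22 -0.12 0.05}
puts [paths_in_critical_range $slacks 0.0]
puts [paths_in_critical_range $slacks 0.1]
puts [format %.2f [total_negative_slack $slacks]]
```

With a critical range of 0, only the -0.30 path is targeted, so the other three violators contribute to TNS untouched. With a critical range of 0.1, the paths at -0.30, -0.25 and -0.22 are all targeted, which is why TNS drops even when the worst path itself cannot be improved.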

5. Conclusion

This paper presents a basic flow that achieves timing closure from DC-T through routing. In particular, it demonstrates how to resolve serious congestion with keepout margin and blockage array. It also shares our experience in meeting the timing target with soft bound and path group to remove the bottleneck on high-speed memories.

6. Acknowledgements

The author would like to thank the colleagues in the LSI (Shanghai) Storage Physical Design Team for their kind support, and to give sincere thanks to Jin Xiao from Synopsys.

7. References

[1] Bruce Zahn, “A Utility for Leakage Power Recovery within PrimeTime”, Boston SNUG 2008
[2] Synopsys, “IC Compiler User Guide”, June 2010
[3] Synopsys, “IC Compiler Workshop Student Guide”, June 2009