Improve ARM1176JZ-S Core Performance in a DCT+ICC flow


–Optimize timing and leakage power simultaneously at 65nm

LI Ying, HOU Hua Min
Thomson Broadband R&D (Beijing) Co., Ltd

Abstract

As more complex features are integrated, our digital video SoCs need the highest possible embedded ARM core performance. This paper introduces a Synopsys DCT+ICC flow that we developed to improve ARM1176JZ-S CPU core performance. We use useful skew to raise the clock frequency and, to limit leakage power consumption, we developed a method that places LVT (low-Vth) cells precisely on the critical paths. With the TSMC 65LP library, the performance of the ARM1176JZ-S improved greatly.

1.0 Introduction

THOMSON is a world leader and standards developer in digital video technologies. Because new modes and video formats are introduced frequently, the functions of our new SoCs become more complex and their performance requirements become stricter.

Applying a new process can help ICs achieve higher performance and lower power consumption, but chip cost increases at the same time. We developed a high-accuracy, fast, reusable physical design flow that can be used at project setup to evaluate how far the performance of a core module can reach. ARM cores play important roles in our SoCs. The ARM1176JZ-S processor incorporates an integer core that implements the ARM architecture v6. It supports the ARM and Thumb™ instruction sets and SIMD DSP instructions, which enhance multimedia processing capability. The ARM1176JZ-S implements an eight-stage pipeline and can reach high performance, which helps customers develop many kinds of high-end applications. Because it is a common module, we take the ARM1176JZ-S as the benchmark to evaluate the quality of our flow.

Our flow is based on the Synopsys new-generation design platform “Design Compiler Topographic + IC Compiler”.

Synopsys Design Compiler is the industry’s most comprehensive and production-proven suite of RTL synthesis and test solutions. DC Topographic provides physically aware RTL synthesis and shares topographical technology with the IC Compiler physical implementation solution, which enables designers to accurately predict post-layout timing, power, and area during RTL synthesis.

IC Compiler is a single, convergent, chip-level design tool that enables designers to implement high-performance, complex, and challenging designs. The ICC 2007.12 version also delivers an integrated solution for applying useful skew to optimize design timing.

Synopsys DCT and ICC help us not only implement the design with high performance but also keep the results well correlated with sign-off tools. Under Pilot3.0 control and management, our flow can be completed in a short period.

At 90nm and below, as geometries shrink, transistors move to lower drain-source voltage, lower threshold voltage, thinner oxide, and shorter effective channel length. These changes make transistors susceptible to many modes of leakage current. As Figure 1 shows, low-threshold-voltage devices are fast, but leakage power grows rapidly as the threshold voltage comes down. To control leakage power, we must carefully limit LVT cell utilization in our design. We analyze the timing-critical paths and replace the bottleneck cells with LVT cells of the same logic function to improve timing. This method keeps LVT cell usage to a small percentage.

Figure 1 Leakage current and delay relationship of multi-threshold cells

2.0 Implementation Strategy and Detail Methodology

2.1 Overview

ARM provides a reference implementation methodology for physical design in a Synopsys flow. In Figure 2, the ARM reference flow is shown on the left; it uses Physical Compiler for physically aware synthesis and placement. The flow we developed is shown on the right. We apply DCT and ICC to improve ARM performance. At the synthesis step, both flows are physically driven by reading floorplan information. At the physical implementation steps, however, ICC is a more integrated and enhanced platform on which placement, CTS, routing, and chip finishing can all be implemented. In the ICC flow, we apply useful skew to the design and increase the weight of the critical paths so that the tool puts more effort into optimizing their timing. To raise the clock frequency, we begin to put LVT cells on critical paths at the placement stage. Because accurate net delay information is not yet available at that point, we change only the registers on critical paths to LVT. Clock tree synthesis with useful skew then builds the clock tree by applying the clock latency adjustments we defined. After routing and detailed parasitic extraction, we change exactly the bottleneck cells on critical paths to LVT, which increases the ARM1176JZ-S frequency while minimizing LVT cell usage.

Figure 2 Comparison between ARM reference flow and our flow

2.2 Floorplan

The ARM1176 processor is a tightly tuned design with a large density of paths in the critical range. A good floorplan of the ARM1176 processor is crucial to ensure that performance is not degraded.
The ARM1176 documentation recommends the following rules:
1. Partition the memories so that all related instruction memories are in one corner of the floorplan and all related data memories are in the adjacent corner.
2. Make the standard cell placeable area as close to a 1:1 ratio as possible.
We created a floorplan according to the ARM recommendations, as shown in Figure 3.

Figure 3 ARM1176JZ-S floorplan map

The floorplan in Figure 3 is also used by DCT for physically aware RTL synthesis. To minimize IR drop inside the core, we designed a dense power grid mesh on M4, M5, and M6.

2.3 DCT Synthesis and Placement Optimization

Netlist quality is critical to design implementation, especially timing closure. Physically aware synthesis avoids the timing approximation of wire-load models. We supply our floorplan to DCT as a DEF file. With the command “extract_physical_constraints”, the memory locations, pin locations, and chip size information are read into DCT. Because DCT shares topographical technology with ICC, we can accurately predict post-layout timing.
The DCT synthesis script is:
read_ddc $GEV(src_dir)/ARM1176.unmapped.ddc
current_design $GEV(block)
link
extract_physical_constraints ../inputs/ARM1176JZS.def -output dct_constaints.tcl
tproc_source -file $TEV_CONSTRAINT_FILE
tproc_set_dont_use
compile_ultra
⋯⋯
The post-DFT netlist is supplied to ICC for placement. Our placement script is shown below.
tproc_openMDB -lib $GEV(mw_lib_dst) -design $GEV(block)
tproc_source -file "$GEV(gscript_dir)/pnr/icc_common_settings.tcl"
# set don’t use cells, all LVT cells are included.
tproc_set_dont_use
⋯⋯
remove_ideal_network -all
set_ideal_network [ all_fanout -flat -clock_tree ]
set_max_leakage_power 0
place_opt -area_recovery -congestion -power
⋯⋯
for {set i 1} {$i <= 3} {incr i} {
# group critical path to a new group and increase weight, let tools pay more effort on it.
⋯⋯
psynopt -area_recovery -congestion -power
}
⋯⋯
for {set i 1} {$i <= 3} {incr i} {
#change critical path start point from HVT (high-Vth)/NVT (normal-Vth) to LVT
⋯⋯
psynopt -area_recovery -congestion -power
}
Incremental optimization is a commonly used method to optimize timing. Although the tool automatically puts more effort into critical paths, we still want to highlight the most important paths for ICC. We use the “group_path” command to collect the critical paths and its “-weight” option to raise their weight, so that ICC pays more attention to them during optimization. The following script shows this process.
## Group the critical paths into a new group and increase its weight so that the tool puts more effort into them.

set path_coll [get_timing_paths -slack_lesser_than ${post_opt_slack} \
-max_paths ${total_neg_path} -group ${post_opt_group}]

foreach_in_collection each_path $path_coll {
lappend critical_path [get_object_name [get_attribute $each_path endpoint]]
}

group_path -name post_compile -to $critical_path -weight 20 -critical_range 0.2
psynopt -area_recovery -congestion
group_path -default -to $critical_path -weight 1.0
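
The selection step in the script above can be illustrated outside the tool. The following Python sketch is only an approximation of the idea and does not reproduce ICC behavior: the function name, the path list, and the slack values are hypothetical, and the removal of duplicate endpoints is an added convenience that the Tcl script itself does not perform.

```python
from typing import List, Tuple


def select_critical_endpoints(paths: List[Tuple[str, float]],
                              slack_limit: float,
                              max_paths: int) -> List[str]:
    """Return the endpoints of the worst paths whose slack is below slack_limit."""
    # Keep only paths that violate the slack limit.
    violating = [p for p in paths if p[1] < slack_limit]
    # Worst (most negative) slack first.
    violating.sort(key=lambda p: p[1])
    endpoints: List[str] = []
    for endpoint, _slack in violating[:max_paths]:
        if endpoint not in endpoints:  # drop duplicate endpoints
            endpoints.append(endpoint)
    return endpoints


# Hypothetical (endpoint, slack in ns) pairs, for illustration only.
paths = [("regA/D", -0.30), ("regB/D", 0.05), ("regC/D", -0.12),
         ("regA/D", -0.25), ("regD/D", -0.01)]
critical = select_critical_endpoints(paths, slack_limit=0.0, max_paths=3)
print(critical)
```

The resulting endpoint list plays the role of $critical_path in the script, which is then given a higher weight with group_path.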

2.4 Synopsys Useful Skew Flow and CTS

2.4.1 Useful Skew Rationale

The clock is without doubt the single most critical signal in SoC design. Generally, timing violations are fixed by data path optimization. With useful skew, timing violations can also be fixed by adjusting clock arrival times at registers or latches. The benefit of useful skew is not only higher frequency but also lower noise. In a “zero skew” design, the large current drawn from the power supply when the clock switches introduces simultaneous switching noise, which may inadvertently cause the chip to fail (especially if it contains sensitive analog and RF circuits). Clock skew, by contrast, can be used to achieve wider safety margins, increase clock frequency, speed up design closure, and reduce peak current and simultaneous switching noise.

Figure 4 and Figure 5 give an example of eliminating timing violations with useful skew. In Figure 4, the slacks of data paths C1 and C3 are negative, while C2 has positive slack. By increasing the clock arrival time at R1 and decreasing the clock arrival time at R2, we borrow positive slack from data path C2 and give it to C1 and C3, so that all three timing paths meet timing, as shown in Figure 5.

Figure 4 Useful-skew technology: before useful-skew used

Figure 5 Useful-skew technology: after useful-skew used
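
The slack borrowing described for Figures 4 and 5 can be worked through numerically. The sketch below is a minimal illustration under stated assumptions: it assumes a register chain R0 -> C1 -> R1 -> C2 -> R2 -> C3 -> R3, and the clock period, path delays, and latency adjustments are hypothetical values in picoseconds, not taken from the design.

```python
def setup_slack(period, launch_latency, capture_latency, data_delay):
    """Setup slack of one register-to-register path (setup time ignored)."""
    return (period + capture_latency) - (launch_latency + data_delay)


PERIOD = 2000                                   # hypothetical clock period, ps
DELAYS = {"C1": 2200, "C2": 1500, "C3": 2200}   # hypothetical data path delays, ps
# Each path is (launch register, capture register).
PATHS = {"C1": ("R0", "R1"), "C2": ("R1", "R2"), "C3": ("R2", "R3")}


def all_slacks(latency):
    """Compute the setup slack of every path for a given set of clock latencies."""
    return {name: setup_slack(PERIOD, latency[src], latency[dst], DELAYS[name])
            for name, (src, dst) in PATHS.items()}


zero_skew = {"R0": 0, "R1": 0, "R2": 0, "R3": 0}
# Later clock at R1 and earlier clock at R2, as in the text above.
useful_skew = {"R0": 0, "R1": 200, "R2": -200, "R3": 0}

before = all_slacks(zero_skew)
after = all_slacks(useful_skew)
print(before)
print(after)
```

In this example the total slack over the three paths is the same before and after the adjustment: useful skew redistributes slack from C2 to C1 and C3 rather than creating it.
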
2.4.2 “Two Pass” skew_opt flow and CTS

We considered two ways of applying useful skew to the design.

The first is to apply useful skew before clock tree synthesis. Clock tree synthesis can then achieve larger latency adjustment targets, so the design can use more useful skew, but the clock net parasitics are estimated from virtual routing, which leaves more room for miscorrelation.

The second is to apply useful skew incrementally to fix timing violations at the post-CTS or post-route stage. Timing is most accurate after detail routing, so useful skew applied at that point should be more effective. However, after the clock is routed, clock tree optimization can make only small latency adjustments, such as sizing. Other adjustment methods, such as relocation and delay insertion, are limited on a post-route clock tree.

Synopsys supplies an automatic useful-skew optimization flow through the command skew_opt. When skew_opt is applied, ICC first analyzes the design timing, determines the pins to be optimized, and then provides the optimal solution.

skew_opt uses three main Tcl commands to adjust clock skew:
1> set_clock_latency
This command lets the ICC timer take the latency into account before CTS and measure skew_opt QoR at the placement stage. It is not honored by CTS.
2> set_clock_tree_exceptions
This command is not understood by the timer but is honored by CTS. It guides the latency adjustment at the CTS step.
3> set_inter_clock_delay_options
This command controls interclock delay constraints based on the skew_opt analysis.

Clock tree synthesis balances clock arrival times at every endpoint, including ICG latches and registers. However, the clock arrives at a clock gate earlier than at the launching register. Setup violations at ICG E pins can be fixed by making the clock arrive earlier at the launching registers. Synopsys recommends a “Two-Pass” flow to optimize the timing of such paths; see Figure 6.

Figure 6 The recommended “Two-Pass Flow” for applying Useful skew

Pass 1: Run the useful skew analysis on a post-CTS design and generate a solution file. skew_opt analyzes the propagated delay information when determining the latency adjustments, including accurate analysis of timing at clock gating cells. The analysis also considers the relationship between every nonstop pin and the endpoints in its fanout. skew_opt ensures that the latency at an endpoint is always longer than that at the nonstop pin. Because the clocks are propagated, skew_opt can also take timing derating on the clock into account.

Pass 2: Source the generated useful skew solution into the design before running CTS, so that clock tree synthesis can achieve the desired latency adjustment at each pin. Such a flow achieves good correlation between required and actual latencies.

To reduce wire delay and the crosstalk impact on other nets, we apply a non-default rule to clock nets with double width, double vias, and shielding. The skew_opt “two pass” flow script we used is shown below.

⋯⋯
#PASS 1
psynopt -area_recovery -congestion
save_mw_cel
#Run clock_opt to get updated latencies
clock_opt -update_clock_latency -inter_clock_balance

set write_sdc_output_lumped_net_capacitance false
set write_sdc_output_net_resistance false
set sdc_write_unambiguous_names false
write_sdc -nosplit tmp_sdc_for_skew_top.sdc
sh grep set_clock_latency tmp_sdc_for_skew_top.sdc > scl
sh grep get_clock scl > ucl

#Run skew_opt
skew_opt
close_mw_cel

# PASS 2
#Open post place_opt cell
open_mw_cel $GEV(block)
current_design
tproc_set_dont_use
source -v -e ucl
source -v -e skew_opt.tcl
psynopt -area_recovery -congestion
# run clock_opt
clock_opt -inter_clock_balance
save_mw_cel

By default, skew_opt optimizes for setup time, which may degrade hold time. skew_opt can also minimize the latency adjustment to lower the impact on hold time. When both the -setup and -hold options are specified, skew_opt tracks WNS for both setup and hold times for each start point and endpoint.
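
The setup and hold trade-off behind this behavior can be shown with simple arithmetic. The sketch below is a hypothetical illustration, not the algorithm used by skew_opt: the function names and slack values are assumptions. It relies only on the fact that delaying the capture clock by some amount adds that amount to the setup slack of paths ending at the register and removes it from their hold slack.

```python
def bounded_capture_delay(setup_slack_ps: int, hold_slack_ps: int) -> int:
    """Delay to add to the capture clock, limited by the available hold margin."""
    needed = max(0, -setup_slack_ps)   # delay that would fully fix setup
    allowed = max(0, hold_slack_ps)    # delay that keeps hold slack >= 0
    return min(needed, allowed)


def apply_capture_delay(setup_slack_ps: int, hold_slack_ps: int, delay_ps: int):
    """Return (setup slack, hold slack) after delaying the capture clock."""
    return setup_slack_ps + delay_ps, hold_slack_ps - delay_ps


# Hypothetical endpoint: 150 ps setup violation, 100 ps hold margin.
delay = bounded_capture_delay(-150, 100)
new_setup, new_hold = apply_capture_delay(-150, 100, delay)
print(delay, new_setup, new_hold)
```

With these numbers the hold margin limits the adjustment, so part of the setup violation remains and has to be recovered by data path optimization.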

Not all clock pins can be optimized by useful skew. Before running useful skew, the clock structures and constraints need to be checked. Paths related to IO ports and some nonstop pins defined as generated clocks cannot be optimized. Unconstrained pins, ILM clock pins used in a hierarchical flow, and some level-sensitive latches in MVDD designs cannot be optimized either. skew_opt does not consider clock exceptions during analysis and optimization: if a pin with a preexisting exception is an optimized endpoint, skew_opt overrides the ignored, floating, or stop pin exception with the new floating pin exception. Preexisting nonstop exceptions are not overridden. To improve accuracy, it is recommended to remove any pin_load constraints set on clock ports.

2.5 Routing and LVT cell change method

After clock tree synthesis with useful skew, we move on to the routing step. For routing and post-route optimization, we follow these guidelines:
1> Route the design in timing-driven mode with crosstalk prevention.
2> Run incremental route optimization with crosstalk delta delay turned on.
3> Use several route_opt steps to achieve the best results.

Our route optimization scripts are below:
⋯⋯
set route_opt_xtalk_reduction_setup_threshold 0.05
set route_opt_xtalk_reduction_setup_slack_threshold 0.040
route_opt -effort medium -area_recovery -xtalk_reduction

set route_opt_xtalk_reduction_setup_slack_threshold 0.020
route_opt -incremental -effort medium -only_xtalk_reduction
route_opt -incremental -area_recovery
⋯⋯

After routing, signal nets become real metal nets, so accurate parasitic information can be extracted for them and timing analysis is more accurate. Putting LVT cells on critical paths based on this accurate timing analysis is the best way we have found so far to limit LVT usage.
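
One way to picture how accurate timing limits LVT usage is a greedy selection of the cells that save the most delay. The sketch below is an assumption made for illustration and is not the procedure used in our flow: the function name and the 120 ps slack deficit are hypothetical, while the four cell delays (in picoseconds, before and after the LVT swap) are read from the timing reports in Table 7 and Table 8.

```python
from typing import List, Tuple


def pick_lvt_swaps(cells: List[Tuple[str, int, int]], deficit_ps: int) -> List[str]:
    """Pick cells to swap to LVT, largest delay saving first, until the deficit is covered.

    Each cell is (instance name, current delay in ps, delay in ps as an LVT cell).
    """
    # Python's sort is stable, so cells with equal savings keep their path order.
    ranked = sorted(cells, key=lambda c: c[1] - c[2], reverse=True)
    chosen: List[str] = []
    remaining = deficit_ps
    for name, delay, lvt_delay in ranked:
        if remaining <= 0:
            break
        saving = delay - lvt_delay
        if saving <= 0:
            continue  # swapping this cell would not help
        chosen.append(name)
        remaining -= saving
    return chosen


# Delays from Table 7 (before) and Table 8 (after), listed in path order.
cells = [("U4771", 170, 120), ("U4770", 150, 110),
         ("U7419", 240, 190), ("U6178", 140, 100)]
swaps = pick_lvt_swaps(cells, deficit_ps=120)  # hypothetical deficit
print(swaps)
```

The fewer cells needed to cover the deficit, the smaller the leakage penalty, which is why the swap is driven by post-route timing rather than by estimates.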

When the first round of placement has finished, we analyze the critical paths and change their start-point registers from HVT/NVT to LVT, which benefits timing by shortening the register CP-to-Q delay. Because we do not want the register changes to disturb our clock latency adjustments, we change the registers at placement, before clock tree synthesis. Changing critical path registers to LVT can be run for several rounds together with incremental placement optimization.

After detail routing, we use StarRCXT to extract the real RC parasitics of the design and check timing in PTSI. Based on the PTSI timing report, we use the command “change_link” in ICC to change the bottleneck cells on critical paths from NVT/HVT to LVT, which boosts performance greatly.
The timing reports below illustrate this process. The cell “NR3D4” with instance name “uNoRAM/uMain/uDCache/U7419” contributes 0.24ns of delay; after its link is changed to “NR3D4LVT”, the delay becomes 0.19ns, saving 0.05ns on this cell. The slack of the whole path improves from the original “-0.49ns” to “-0.11ns” after the cells are changed to LVT cells.

Startpoint: uNoRAM/uMain/uLsu/uA1176LsuAssocDriver_assocVA_reg_24_
(rising edge-triggered flip-flop clocked by arm_clk)
Endpoint: uNoRAM/uMain/uDCache/uA1176DCacheTLB_intTlbAbort_reg
(rising edge-triggered flip-flop clocked by arm_clk)
Path Group: arm_clk

Path Type: max

Point Incr Path
————————————————————————–
clock arm_clk (rise edge) 0.00 0.00
clock network delay (propagated) 1.70 1.70
uNoRAM/uMain/uLsu/uA1176LsuAssocDriver_assocVA_reg_24_/CP (SDFQND2LVT) 0.00 1.70 r
uNoRAM/uMain/uLsu/uA1176LsuAssocDriver_assocVA_reg_24_/QN (SDFQND2LVT) 0.20 1.89 f
uNoRAM/uMain/uLsu/U1036/ZN (INVD6) 0.07 * 1.96 r
uNoRAM/uMain/uLsu/DAssocVA[24] (ARM1176JZS_A1176Lsu_1) 0.00 1.96 r
uNoRAM/uMain/uDCache/DAssocVA[24] (ARM1176JZS_A1176DCache_1) 0.00 1.96 r
uNoRAM/uMain/uDCache/U5429/Z (BUFFD6) 0.11 * 2.08 r
uNoRAM/uMain/uDCache/U4771/ZN (XNR2D4) 0.17 * 2.25 r
uNoRAM/uMain/uDCache/U4770/ZN (ND4D4) 0.15 * 2.40 f
uNoRAM/uMain/uDCache/U7419/ZN (NR3D4) 0.24 * 2.64 r
uNoRAM/uMain/uDCache/U6178/ZN (ND2D8) 0.14 * 2.78 f
uNoRAM/uMain/uDCache/ICC_PO_1258/ZN (NR2XD8) 0.09 * 2.87 r
uNoRAM/uMain/uDCache/U971/ZN (INVD8) 0.05 * 2.91 f
uNoRAM/uMain/uDCache/ICC_PCO_272/ZN (INVD4) 0.04 * 2.95 r
uNoRAM/uMain/uDCache/ICC_PCO_273/ZN (INVD6) 0.04 * 2.99 f
uNoRAM/uMain/uDCache/ICC_PO_994/ZN (ND2D4) 0.04 * 3.03 r
uNoRAM/uMain/uDCache/ICC_PCO_349/ZN (NR3D4) 0.05 * 3.08 f
uNoRAM/uMain/uDCache/U22/Z (BUFFD8) 0.09 * 3.17 f
uNoRAM/uMain/uDCache/U19/ZN (CKND2D4) 0.05 * 3.22 r
uNoRAM/uMain/uDCache/U106/ZN (INVD2) 0.05 * 3.27 f
uNoRAM/uMain/uDCache/U5988/ZN (NR2XD3) 0.07 * 3.34 r
uNoRAM/uMain/uDCache/U961/ZN (ND2D1) 0.09 * 3.43 f
uNoRAM/uMain/uDCache/U962/ZN (OAI21D4) 0.09 * 3.52 r
uNoRAM/uMain/uDCache/uA1176DCacheTLB_intTlbAbort_reg/D (SDFCNQD4LVT) 0.00 * 3.52 r

data arrival time 3.52

clock arm_clk (rise edge) 1.88 1.88
clock network delay (propagated) 1.63 3.50
clock uncertainty -0.40 3.10
uNoRAM/uMain/uDCache/uA1176DCacheTLB_intTlbAbort_reg/CP (SDFCNQD4LVT) 0.00 3.10 r
library setup time -0.08 3.03
data required time 3.03
————————————————————————–
data required time 3.03
data arrival time -3.52
————————————————————————–
slack (VIOLATED) -0.49

Table 7 Timing path report before using LVT cells

Startpoint: uNoRAM/uMain/uLsu/uA1176LsuAssocDriver_assocVA_reg_24_
(rising edge-triggered flip-flop clocked by arm_clk)
Endpoint: uNoRAM/uMain/uDCache/uA1176DCacheTLB_intTlbAbort_reg
(rising edge-triggered flip-flop clocked by arm_clk)
Path Group: arm_clk
Path Type: max

Point Incr Path
————————————————————————–
clock arm_clk (rise edge) 0.00 0.00
clock network delay (propagated) 1.70 1.70
uNoRAM/uMain/uLsu/uA1176LsuAssocDriver_assocVA_reg_24_/CP (SDFQND2LVT) 0.00 1.70 r
uNoRAM/uMain/uLsu/uA1176LsuAssocDriver_assocVA_reg_24_/QN (SDFQND2LVT)<- 0.22 1.91 r
uNoRAM/uMain/uLsu/U1036/ZN (INVD6LVT) 0.05 * 1.96 f
uNoRAM/uMain/uLsu/DAssocVA[24] (ARM1176JZS_A1176Lsu_1) 0.00 1.96 f
uNoRAM/uMain/uDCache/DAssocVA[24] (ARM1176JZS_A1176DCache_1) 0.00 1.96 f
uNoRAM/uMain/uDCache/U5429/Z (BUFFD6LVT) 0.09 * 2.05 f
uNoRAM/uMain/uDCache/U4771/ZN (XNR2D4LVT) 0.12 * 2.17 r

uNoRAM/uMain/uDCache/U4770/ZN (ND4D4LVT)<- 0.11 * 2.28 f
uNoRAM/uMain/uDCache/U7419/ZN (NR3D4LVT)<- 0.19 * 2.47 r
uNoRAM/uMain/uDCache/U6178/ZN (ND2D8LVT)<- 0.10 * 2.57 f
uNoRAM/uMain/uDCache/ICC_PO_1258/ZN (NR2XD8LVT) 0.07 * 2.64 r
uNoRAM/uMain/uDCache/U971/ZN (INVD8LVT)<- 0.04 * 2.67 f
uNoRAM/uMain/uDCache/ICC_PCO_272/ZN (INVD4LVT)<- 0.03 * 2.70 r
uNoRAM/uMain/uDCache/ICC_PCO_273/ZN (INVD6LVT)<- 0.03 * 2.73 f
uNoRAM/uMain/uDCache/ICC_PO_994/ZN (ND2D4LVT)<- 0.03 * 2.77 r
uNoRAM/uMain/uDCache/ICC_PCO_349/ZN (NR3D4LVT) 0.04 * 2.80 f
uNoRAM/uMain/uDCache/U22/Z (BUFFD8LVT)<- 0.07 * 2.87 f
uNoRAM/uMain/uDCache/U19/ZN (CKND2D4LVT)<- 0.04 * 2.91 r
uNoRAM/uMain/uDCache/U106/ZN (INVD2LVT)<- 0.04 * 2.95 f
uNoRAM/uMain/uDCache/U5988/ZN (NR2XD3LVT)<- 0.05 * 3.00 r
uNoRAM/uMain/uDCache/U6132/ZN (CKND2D4) 0.05 * 3.05 f
uNoRAM/uMain/uDCache/U960/Z (AN2D4LVT) 0.06 * 3.11 f
uNoRAM/uMain/uDCache/U962/ZN (OAI21D4LVT)<- 0.03 * 3.14 r
uNoRAM/uMain/uDCache/uA1176DCacheTLB_intTlbAbort_reg/D (SDFCNQD4LVT) 0.00 * 3.14 r
data arrival time 3.14

clock arm_clk (rise edge) 1.88 1.88
clock network delay (propagated) 1.63 3.50
clock uncertainty -0.40 3.10
uNoRAM/uMain/uDCache/uA1176DCacheTLB_intTlbAbort_reg/CP (SDFCNQD4LVT) 0.00 3.10 r
library setup time -0.07 3.03
data required time 3.03
————————————————————————–
data required time 3.03
data arrival time -3.14
————————————————————————–
slack (VIOLATED) -0.11

Table 8 Timing path report after changing cells to LVT
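
The slack figures in Table 7 and Table 8 can be checked with a few lines of arithmetic. The sketch below uses only numbers from the two reports, expressed in units of 0.01ns to avoid rounding issues. The frequency estimate is an assumption for illustration: it treats this path as the limiting one and keeps the 0.40ns clock uncertainty in place, so it is not a measured result for the core.

```python
PERIOD = 188           # clock period from the reports, 1.88 ns
REQUIRED = 303         # data required time, 3.03 ns (both reports)
ARRIVAL_BEFORE = 352   # data arrival time in Table 7, 3.52 ns
ARRIVAL_AFTER = 314    # data arrival time in Table 8, 3.14 ns


def slack(required: int, arrival: int) -> int:
    """Setup slack in units of 0.01 ns."""
    return required - arrival


def path_limited_freq_mhz(period: int, slack_value: int) -> float:
    """Frequency at which this path alone would just meet timing."""
    achievable_period = period - slack_value   # negative slack lengthens the period
    return 100000.0 / achievable_period        # 1 / (0.01 ns) equals 100000 MHz


slack_before = slack(REQUIRED, ARRIVAL_BEFORE)
slack_after = slack(REQUIRED, ARRIVAL_AFTER)
gain = slack_after - slack_before
freq_before = round(path_limited_freq_mhz(PERIOD, slack_before), 1)
freq_after = round(path_limited_freq_mhz(PERIOD, slack_after), 1)
print(slack_before, slack_after, gain, freq_before, freq_after)
```

The 0.38ns gained on this path comes entirely from the data path, since the clock network delays and the clock uncertainty are identical in both reports.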

3.0 Summary

As introduced above, we developed a high-accuracy, fast, reusable physical design flow on the Synopsys DCT and ICC platform and improved the performance of our embedded ARM1176JZ-S core with the TSMC 65LP library.

Starting from the physically aware synthesized netlist from DCT, we instruct ICC to optimize the critical paths that we highlight, which improves timing by over 3%. Useful skew, applied through the Synopsys skew_opt flow, saves about 7% on timing. LVT cells offer fast timing at the cost of large leakage power consumption. We use only 0.4% LVT registers at the placement stage, which improves timing by 11%. After routing, based on accurate timing analysis in PTSI, bottleneck cells are changed to LVT, with an LVT occupancy of around 3%, to raise frequency by 7%.

Through this flow, the performance of the ARM1176JZ-S is enhanced while leakage power is kept low by minimizing LVT cell usage. The table below summarizes the timing results at each stage.

Table 9 Timing and LVT usage summary after every step in our flow.

4.0 References

[1] IC Compiler User Guide: Using Design Planning in the IC Compiler Flow, A-2007.12-SP2, March 2008.
[2] Useful skew application note presentation, IC Compiler Version A-2007.12-SP1
[3] ARM1176JZF-S™ and ARM1176JZ-S™ Revision: r0p4 Implementation Guide