Effective HFN Handling: Key to Better Physical SoC Designs


Steven S. Leung
ALi (Shanghai) Corp.
SNUG China · September 2003

Abstract

Increased complexity in SoC design has led to a greater number of high fanout nets (HFNs) of which designers lack the direct knowledge they traditionally have of clocks and reset signals. Effective handling of this class of HFNs is key to achieving a better physical design in the SoC era. This paper presents the results of a study on HFN handling with various flow and tool options provided by Astro. It is found that, given the same floorplan, removing the buffer trees prior to the pre-placement, coupled with specific optimization options, can significantly reduce the total area and power consumption of the standard cells. Improved routability also greatly facilitates the one-pass timing closure flow.

1. Introduction

Although there is no agreement among designers on how high a net’s fanout must be for it to be classified as a high fanout net (HFN), it has long been recognized that certain classes of nets with a large number of sinks, such as clocks and reset signals, are most effectively handled in the physical design phase. Consequently, the concept of “ideal” nets and don’t-care attributes has been implemented in synthesis and STA to let the designer bypass the normal analysis on these nets, while specialized CTS (Clock Tree Synthesis) tools have been developed in the backend to realize the buffering needs.

So far, this handling of HFNs such as clocks and reset signals has been successful. One prerequisite of this success is the designer’s intimate knowledge of these signals, which has not been a problem. As design complexity continues to grow, however, the number of non-traditional HFNs also increases, and designers may not have the same direct knowledge of this class of HFNs that they have of clocks and resets.

While HFN handling is recognized as an important issue in ASIC design, previous efforts have focused mainly on synthesis [Furtner, SNUG Boston, 2001]. The connectivity of the buffer trees inserted during the synthesis stage, however, represents additional constraints on the physical design. Furthermore, the partitioning and grouping of the sinks and buffers during logic synthesis, done without any physical information, can prevent physical design from achieving the optimal placement.

In the new Astro versions 2003.03 and 2003.06, the previously available function “pdsHFNCollapse” has been improved and integrated into the PrePlacement routine. Pre-placement can now remove the buffer trees inserted by synthesis before placement, which promises a better physical design result. Accordingly, this study was launched to investigate the effectiveness of this new feature in physical design.

2. Objectives

The objective of this work is to find out if the new feature of removing buffer trees in Pre-Placement provides any benefits in the physical design of the chip. Furthermore, since buffer trees can be re-inserted at different places in the subsequent design stages, it is also the objective of this study to find out when the buffer trees for the HFNs are best synthesized.

3. Approach

To achieve these objectives, two designs with dramatically different characteristics are selected for various experiments. As the table on the next page shows, the first design is IO limited and is only one tenth the size of the second design. However, because only four metal layers are used, the first design has considerable congestion and is tough to route. Both designs were previously taped out using the existing version of Astro (5.0.2), thus providing a ready reference point for comparison.

With its relatively small size, the first design is used to explore various flow and tool options. The rapid turnaround time allows more experiments to be done in the same amount of time, so that we can become familiar with the new version faster. Based on the experience and insight gained from the experiments on the first design, further experiments are run on the second design to verify what has been learned as well as to probe further.

3.1 Measuring the effectiveness of buffer tree synthesis

Since the key objective of this work is to find out the most effective way to handle HFNs, how to measure the effectiveness of the buffer tree created in various ways becomes essential. Consider a generic buffer tree depicted in the figure below:

In general, using cells of bigger drive strengths can reduce the number of cells and can improve routability. On the other hand, bigger drive cells imply bigger area and power consumption. Moreover, the bigger a buffer cell’s drive is, the bigger the cap loading of its input pin will be. If the root net of the buffer tree is part of a critical path, cells in previous stages may need to scale up their drive strengths as well in order to maintain their target speeds, thus creating a rippling effect up along the critical timing path, which may offset the benefits of using bigger drive cells.

Looking at the figure, it is intuitive to measure the end result of the buffer tree as the total drivers needed to drive that number of sinks. In other words, we can compare one buffer tree implementation with another by comparing their fanout per unit drive, which is calculated from

Fanout/Unit_Drive = Total_Sink_Count / Σ Normalized-Drive#,

where the Normalized-Drive# of a cell is the ratio of its area to that of the X1 buffer, and the sum runs over the driving cells of the buffer tree.
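The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Driver` class, the cell names, and all area values are hypothetical, and whether the cell driving the root net is counted among the drivers is a choice the formula leaves open (here only the buffers in the tree are counted).

```python
from dataclasses import dataclass


@dataclass
class Driver:
    """One buffer or inverter in the tree (hypothetical record layout)."""
    name: str
    area: float  # cell area, in the same unit as x1_area


def fanout_per_unit_drive(sink_count, drivers, x1_area):
    """Total sink count divided by the sum of normalized drives.

    The normalized drive of a cell is its area divided by the X1 buffer area.
    """
    total_drive = sum(d.area / x1_area for d in drivers)
    if total_drive == 0:
        raise ValueError("buffer tree has no drivers")
    return sink_count / total_drive


# Hypothetical tree: X1 buffer area 4.0, one root buffer of area 12.0
# and four leaf buffers of area 6.0, together driving 120 sinks.
tree = [Driver("root_buf", 12.0)] + [Driver(f"leaf_buf{i}", 6.0) for i in range(4)]
fo_dr = fanout_per_unit_drive(120, tree, x1_area=4.0)
print(f"FO/Dr = {fo_dr:.2f}")
```

In this made-up example the normalized drives sum to 3 + 4 × 1.5 = 9, so the tree drives 120 / 9 ≈ 13.3 sinks per unit drive. Replacing the leaf buffers with larger cells would lower the number unless the sink count grew in proportion.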

Note that when a physical design reaches the point of being implementable (meeting both routability and timing requirements), this number measures not merely the buffer tree itself, but the combined effect of library cell design, tree configuration, placement, and optimizations such as the selection of buffers/inverters and the reduction of the sink pins’ cap loading.

A set of scripts has been developed to trace the buffer tree connectivity and calculate this fanout per unit drive number for each tree. Table 2 below shows the buffer tree information of the two test cases in the pre-layout netlists. Note that while the gate count of the second case is about 10 times that of the first, its number of nets with more than 100 sinks is more than 10 times that of case1, which underlines the importance of effective handling of HFNs. The scripts also calculate the total leakage power of the buffer cells as a reference point for predicting the power consumption.
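The scripts themselves are not reproduced in this paper. The following Python sketch only illustrates the kind of traversal involved, under assumed data structures: `loads` maps a net name to the instances it drives, and `cells` maps an instance name to a record with a `kind` ("buf", "inv", or anything else for a real sink), an output net, and a leakage value. All names and numbers are hypothetical, and the netlist is assumed to form a proper tree (no loops through buffers).

```python
def trace_buffer_tree(root_net, loads, cells):
    """Walk from root_net through buffers/inverters down to the real sinks.

    Returns (sinks, tree_cells, depth), where sinks is a list of
    (instance, inverted) pairs, tree_cells lists the buffer/inverter
    instances, and depth is the number of buffer levels.
    """
    sinks, tree_cells, depth = [], [], 0
    stack = [(root_net, 0, False)]  # (net, level, polarity inverted?)
    while stack:
        net, level, inverted = stack.pop()
        for inst in loads.get(net, []):
            cell = cells[inst]
            if cell["kind"] in ("buf", "inv"):
                tree_cells.append(inst)
                depth = max(depth, level + 1)
                flip = cell["kind"] == "inv"
                stack.append((cell["out"], level + 1, inverted != flip))
            else:
                sinks.append((inst, inverted))
    return sinks, tree_cells, depth


# Hypothetical netlist fragment: n0 drives a buffer, an inverter and a flop.
cells = {
    "b0": {"kind": "buf", "out": "n1", "leak": 0.2},
    "i0": {"kind": "inv", "out": "n2", "leak": 0.1},
    "ff0": {"kind": "sink"}, "ff1": {"kind": "sink"},
    "ff2": {"kind": "sink"}, "ff3": {"kind": "sink"},
}
loads = {"n0": ["b0", "i0", "ff0"], "n1": ["ff1", "ff2"], "n2": ["ff3"]}

sinks, tree_cells, depth = trace_buffer_tree("n0", loads, cells)
inverted_sinks = sum(1 for _, inv in sinks if inv)
total_leakage = sum(cells[c]["leak"] for c in tree_cells)
print(len(sinks), sorted(tree_cells), depth, inverted_sinks, total_leakage)
```

Tracking the polarity at each sink makes it possible to count how many sinks actually need an inverted signal, which is the kind of figure quoted for the biggest HFN in Section 5.1.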

To facilitate the analysis of the results, the design statistics of each stage are compiled into an easy-to-use webpage, with links set up so that clues to problems can be traced easily. The figure on the next page shows the result of the scripts.

4. Design Flow Alternatives

In Astro 2003.03 (and 2003.06), the previously available standalone function pdsHFNCollapse has been integrated into the Pre-Placement routine as a default-on option. Figure 3 below shows the GUI of the Pre-Placement with this option.

The fanout value associated with this option is a threshold on the total number of sinks. That is, the Collapse function will remove the buffer tree only if the total number of sinks of the root net is greater than [or equal to?] this threshold value. According to the manual, all buffer trees will be removed when this value is set to 1. Our experiments show, however, that this is not the case. Before the buffer tree removal operation, the tool appears to derive some sort of timing budget for each net, and this timing consideration seems to have prevented the complete removal of the buffer trees, especially when inverters are involved.
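The threshold rule as documented, and the open question of whether the comparison is inclusive, can be made concrete with a small Python sketch. It models only the stated rule, not the tool's observed behavior (which, as noted above, also seems to involve timing budgets); the function name and the net names and sink counts are hypothetical.

```python
def nets_to_collapse(sink_counts, threshold, inclusive=True):
    """Select root nets whose total sink count passes the collapse threshold.

    inclusive=True models "greater than or equal to";
    inclusive=False models strictly "greater than".
    """
    if inclusive:
        return sorted(n for n, c in sink_counts.items() if c >= threshold)
    return sorted(n for n, c in sink_counts.items() if c > threshold)


# Hypothetical root nets and their total sink counts.
counts = {"rst_n": 5200, "scan_en": 100, "sel": 12}

picked_ge = nets_to_collapse(counts, 100, inclusive=True)
picked_gt = nets_to_collapse(counts, 100, inclusive=False)
picked_all = nets_to_collapse(counts, 1, inclusive=True)
print(picked_ge, picked_gt, picked_all)
```

The two readings differ only for nets whose sink count equals the threshold exactly. With a threshold of 1, the inclusive reading selects every net that has at least one sink, which would be consistent with the manual's statement that all buffer trees are removed at that setting.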

In pre-placement, the user can also select whether buffer trees are synthesized for the HFNs after the existing buffer trees are removed and the design is placed. The HFN buffer tree synthesis is carried out within PDS and can be invoked in two other places: IPO (InPlace Optimization) or PPO1 (Post-Placement Optimization-1). In IPO or PPO1, the existing buffer tree (if any) is removed first before it is resynthesized. Figure 4 illustrates the relationship between the Collapse and synthesis options in the context of the PrePlace – InPlace – Post-Place flow.

In PrePlace and PPO1, HFN synthesis is an option that can be set via the GUI. In InPlace, however, HFN synthesis is embedded as part of the optimization suite but can be disabled via the command define astUseFanoutSetup 0. In our experiments, various combinations of option values in each phase as well as different flow paths are tested in order to derive the optimal flow and options.

5. Experiment Results and Analysis

5.1 Case1 Experiments

Table 3 on the next page records a partial list of PrePlacement experiments performed on Astro 2003.03 with design case1. Also listed are the design’s cell count and area after PrePlacement. The pre-placement experiments show that the best results at this stage, measured by cell count, congestion, and timing, are obtained with the option values shown below. The ideal optimization and logic remapping options in this case do not result in significant improvements in either timing or area, but they cause mismatches in the existing formal verification setup and are therefore turned off.

To investigate the effect of synthesizing the HFN buffers at different phases, experiment #13 is rerun with the HFN synthesis option turned off, and the two pre-placement results are run through the placement and PPO1 phases. One parameter affecting the HFN buffer synthesis behavior is the maximum fanout setting. The set_max_fanout command in the original .sdc file has been commented out, so the tool takes the max fanout setting from Timing → Timing Setup → Optimization. The default value of this parameter is 40, but an internal maximum value of 100 is hard-coded in the tool; setting the value above 100 therefore has no effect, as the tool resets it to 100. The key metrics of these two experiment results are captured in Table 4.
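The effect of the default and the hard-coded cap can be illustrated with a short Python sketch. The values 40 and 100 are the ones reported above; the function names and the sink count are hypothetical, and the leaf-buffer estimate is only a simple lower bound that assumes every sink sits behind a last-level buffer, not a model of what the tool actually builds.

```python
DEFAULT_MAX_FANOUT = 40   # default reported for the Timing Setup dialog
HARD_MAX_FANOUT = 100     # internal cap reported for the tool


def effective_max_fanout(requested=None):
    """Return the max fanout actually applied for a requested setting."""
    if requested is None:
        return DEFAULT_MAX_FANOUT
    if requested < 1:
        raise ValueError("max fanout must be at least 1")
    return min(requested, HARD_MAX_FANOUT)


def min_leaf_buffers(sink_count, max_fanout):
    """Lower bound on last-level buffers: ceil(sink_count / max_fanout)."""
    return -(-sink_count // max_fanout)


# Hypothetical HFN with 5200 sinks.
capped = effective_max_fanout(200)                        # request above the cap
leaves_at_cap = min_leaf_buffers(5200, capped)
leaves_at_default = min_leaf_buffers(5200, effective_max_fanout())
print(capped, leaves_at_cap, leaves_at_default)
```

For this made-up net, a request of 200 is capped at 100 and needs at least 52 last-level buffers, against at least 130 with the default of 40. This is why experiment labels such as PPO1(>100) effectively run with a max fanout of 100.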

As can be seen, #13 and #14 have obtained similar results, with #13 having a slightly bigger area but better congestion numbers. Compared with the current flow, the cell count and area of both #13 and #14 have been reduced by more than half. This reduction of buffer tree area has led to a dramatic reduction in congestion, as shown by the congestion maps in Figure 5 for the current flow and PPO1(>100) of #13.

To investigate the effectiveness of the HFN synthesis results, the statistics of various measures of the biggest HFN in this design are tabulated in Table 5.

For this net, only 14 sinks really need inverter outputs, but the initial netlist has more than 93 inverters in this net’s buffer tree, instantiated at 8 levels. The collapse operation is apparently not able to remove the majority of these inverters. As a result, the structure of this part of the buffer tree remains pretty much intact. Also notice that the max fanout as well as the FO/Dr number for the pre-placement in #13 are rather low, indicating some problems with the HFN synthesis process in the pre-placement phase. (Experiments on the 2003.06 version later show that these two numbers have improved to 31 and 7.56, respectively.) Despite this, the fanout per unit drive values in both #13 and #14 are nearly double or more than double that of the current flow. This confirms the effectiveness of the HFN handling approach of removing all buffer trees first, as well as the value of using this number as an indicator of effectiveness in buffer tree synthesis.

5.2 Case2 Experiments

In the case2 experiments, the option values are fixed as shown before, and the focus is more on when to have the HFN synthesized (or resynthesized). Table 6 summarizes the various experiments. Basically, the difference between #1* and #3* is that HFN synthesis in PrePlace is turned off with #1 but on with #3. In addition, an experiment (#2) is conducted to remove the remaining buffers/inverters via ECOs after the tool’s removal of the buffer trees, to see if the final result is any better. Key statistics of the experiment results are compiled in Table 7 below.

From the table, we can draw a number of observations (on the 2003.06 version):

● While the ECO approach removes much more buf/inv, the final result after placement/optimization is not better.

● HFN synthesis is best done in the PrePlacement stage and subsequent re-synthesis does not seem to produce better results.

● Although it can improve the congestion, the InPlacement step can be omitted; in fact, the best result (3-ppo) is obtained from running PPO1 directly from the Pre-Placement results.

● When the timing is met, the FO/Dr value is indicative of the overall results. The higher this value, the smaller the total area consumed.

Comparing the best result (3-ppo) with the current flow, however, the improvement appears to be only marginal. While the cell count and area are smaller, the congestion is about the same, so the overall improvement is far less dramatic than in case1. But is it possible that the die area is not challenging enough for the better tool/flow to distinguish itself from the mediocre one? To answer this question, we raised the core utilization to the point where the overall die size is reduced by nearly 10% and adjusted the floorplan accordingly. Only experiment #3 and the reference flow are run on the new floorplan. In addition, the way to turn off HFN synthesis in InPlace via the variable astUseFanoutSetup was discovered. This is tried as experiment 3a. The results are shown in Table 8 below.

Table 9 gives a more detailed comparison of the design statistics of 3a-ppo vs. the reference flow, while Figure 6 on the next page shows the congestion maps of the reference flow and 3a-ppo. As can be seen, the current flow now indeed has a hard time routing the design, while the #3a flow of timing driven placement and optimization with HFN synthesis turned off achieves a very low overall congestion. This result clearly demonstrates the key advantage of removing the buffer trees before placement: without the artificial connectivity constraints due to the buffers inserted during logic synthesis, the placement tool can produce a better placement. The subsequent buffer insertion, when taking the placement into account, can significantly reduce the routing resource requirement, which can translate to a smaller die in the case of core-limited designs.

6. Conclusions

Experiment results of Astro 2003.03/06 with two design cases have been presented. It is found that the new feature of buffer tree removal in the PrePlace stage provides significant benefits in reducing cell count and buffer-induced congestion. This enables a core-limited design to achieve higher utilization, leading to a smaller die size.

Conceptually, the ideal flow should look like this:

The difficulty here is that once the buffers are removed, the conventional timing driven placement engine may be skewed by the timing problems caused by the unbuffered nets. Our experiments appear to support this conjecture as the non-timing driven placement in the PrePlace currently handles the buffer insertion better than the timing driven counterpart in the InPlace step. At this point, the best flow for Astro 2003.06 derived from this work is summarized in Figure 7 below.

This work has also proposed a model of fanout per unit drive for measuring the effectiveness of buffer tree synthesis. The experiment results have validated the value of this model: the bigger this value is, the better the design results. Since this value is simple to calculate, it can be used both for monitoring results (for example, by incorporating it into the QoR report) and as an optimization objective. Table 10 below shows the FO/Dr number and other relevant statistics for the top 20 HFNs after PrePlace. As can be seen, the FO/Dr values for those nets with a large number of mixed-polarity sinks are rather low, indicating that further improvement is still possible.

As shown previously, the buffer removal process does not handle inverters very well. The table here also shows that currently, the HFN synthesis embedded in PDS uses only buffers and no inverters. Because of the better area and transition characteristics of the inverters in the library we used, it would be very desirable for the tool to also use inverter cells when doing the buffer insertion.

It is noticed that the overall FO/Dr number in the Case2 design (19+) is twice that of Case1 (~9.6). At this point, this number has proven useful for comparing different flows or implementations of the same design, but it is not clear whether the difference between designs is due to the design characteristics or originates from differences in FE synthesis methodology. Further study may shed new light on how the differences in HFN handling in FE impact the physical design, in light of this new feature of integrated automatic buffer tree removal.

Reference

Rick Furtner, “High Fanout Without High Stress: Synthesis and Optimization of High-fanout Nets Using Design Compiler 2000.11,” SNUG Boston, 2001.