Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase delay buffer estimate for the ECO flow #1019

Merged

Conversation

antonblanchard
Copy link
Collaborator

When working out how many delay buffers to insert, the ECO flow uses a
very small hard coded value (0.06ns). At the tt corner, a better estimate
is around 0.5ns. This means we heavily over fix hold violations, and
quite possibly cause setup violations.

Setting a more realistic estimate for the buffer delay should have no
downsides. Since the ECO flow is iterative, any hold path that is not
completely fixed will be fixed in subsequent iterations.

This value (and the name of the buffer to use) should really end up
in the open_pdks config.tcl.

When working out how many delay buffers to insert, the ECO flow uses a
very small hard coded value (0.06ns). At the tt corner, a better estimate
is around 0.5ns. This means we heavily over fix hold violations, and
quite possibly cause setup violations.

Setting a more realistic estimate for the buffer delay should have no
downsides. Since the ECO flow is iterative, any hold path that is not
completely fixed will be fixed in subsequent iterations.

This value (and the name of the buffer to use) should really end up
in the open_pdks config.tcl.
@donn donn requested a review from msaligane March 28, 2022 10:15
@donn donn changed the title ECO: Increase delay buffer estimate Increase delay buffer estimate for the ECO flow Mar 28, 2022
@donn donn merged commit 3026092 into The-OpenROAD-Project:master Mar 28, 2022
@msaligane
Copy link
Contributor

@antonblanchard FO1 delay at TT is ~40ps? adding 500ps per buffer might be too high?
@donghl17 thoughts?

@msaligane
Copy link
Contributor

@donn @antonblanchard The best way to do this I think is to set a variable in a config file that overwrites the default one.

@mithro
Copy link

mithro commented Mar 28, 2022

Why not use the actual delay of the cell being inserted?

@msaligane
Copy link
Contributor

Why not use the actual delay of the cell being inserted?

This would be best, especially since the timing reports are multi-corner. Then the delay could be calculated based on the .lib being used for signoff.

@antonblanchard
Copy link
Collaborator Author

I agree with @mithro and @msaligane, getting the actual delay would be best.

@antonblanchard FO1 delay at TT is ~40ps? adding 500ps per buffer might be too high? @donghl17 thoughts?

This is what I see in one of my taped out designs (I'm just pulling out delays for a few of the ECO inserted hold buffers here):

ss corner:

Fanout     Cap    Slew   Delay    Time   Description
-----------------------------------------------------------------------------
                  0.18    1.26    3.29 v U_HOLD_FIX_BUF_0_108/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.20    1.29    3.31 v U_HOLD_FIX_BUF_0_113/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.14    1.21    3.23 v U_HOLD_FIX_BUF_0_132/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.14    1.21    3.23 v U_HOLD_FIX_BUF_0_103/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.15    1.23    3.24 v U_HOLD_FIX_BUF_0_107/X (sky130_fd_sc_hd__dlygate4sd3_1)

tt corner:

Fanout     Cap    Slew   Delay    Time   Description
-----------------------------------------------------------------------------
                  0.06    0.57    2.57 v U_HOLD_FIX_BUF_0_98/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.08    0.58    2.60 ^ U_HOLD_FIX_BUF_0_24/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.11    0.61    2.64 ^ U_HOLD_FIX_BUF_0_42/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.07    0.59    2.59 v U_HOLD_FIX_BUF_0_86/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.06    0.57    2.58 v U_HOLD_FIX_BUF_0_96/X (sky130_fd_sc_hd__dlygate4sd3_1)

ff corner:

Fanout     Cap    Slew   Delay    Time   Description
-----------------------------------------------------------------------------
                  0.04    0.38    2.38 v U_HOLD_FIX_BUF_0_4/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.05    0.39    2.39 v U_HOLD_FIX_BUF_0_1/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.05    0.40    2.40 ^ U_HOLD_FIX_BUF_0_7/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.04    0.38    2.38 v U_HOLD_FIX_BUF_0_5/X (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.05    0.40    2.41 ^ U_HOLD_FIX_BUF_0_2/X (sky130_fd_sc_hd__dlygate4sd3_1)

@antonblanchard
Copy link
Collaborator Author

I added some code to dump the delay the resizer calculated for sky130_fd_sc_hd__dlygate4sd3_1 and its ~550ps

@msaligane
Copy link
Contributor

@antonblanchard Do you know how those delays values are calculated? the slew values seems to be matching the FO1 delay. Can you paste the timing of path where eco buffers are inserted?

@antonblanchard
Copy link
Collaborator Author

Here's an example from the final STA run:

Startpoint: a[10] (input port clocked by clk)
Endpoint: _626_ (rising edge-triggered flip-flop clocked by clk)
Path Group: clk
Path Type: min
Corner: tt

Fanout     Cap    Slew   Delay    Time   Description
-----------------------------------------------------------------------------
                          0.00    0.00   clock clk (rise edge)
                          0.00    0.00   clock network delay (propagated)
                          2.00    2.00 v input external delay
                  0.01    0.00    2.00 v a[10] (in)
     1    0.00                           a[10] (net)
                  0.01    0.00    2.00 v U_HOLD_FIX_BUF_0_3/A (sky130_fd_sc_hd__dlygate4sd3_1)
                  0.06    0.52    2.52 v U_HOLD_FIX_BUF_0_3/X (sky130_fd_sc_hd__dlygate4sd3_1)
     1    0.00                           net_HOLD_NET_0_3 (net)
                  0.06    0.00    2.52 v input2/A (sky130_fd_sc_hd__buf_2)
                  0.09    0.18    2.70 v input2/X (sky130_fd_sc_hd__buf_2)
     1    0.03                           net2 (net)
                  0.09    0.00    2.70 v _626_/D (sky130_fd_sc_hd__dfxtp_4)
                                  2.70   data arrival time

                          0.00    0.00   clock clk (rise edge)
                          0.00    0.00   clock source latency
                  0.43    0.32    0.32 ^ clk (in)
     1    0.10                           clk (net)
                  0.43    0.00    0.32 ^ clkbuf_0_clk/A (sky130_fd_sc_hd__clkbuf_16)
                  0.07    0.26    0.58 ^ clkbuf_0_clk/X (sky130_fd_sc_hd__clkbuf_16)
     2    0.05                           clknet_0_clk (net)
                  0.07    0.00    0.59 ^ clkbuf_1_1_0_clk/A (sky130_fd_sc_hd__clkbuf_8)
                  0.03    0.13    0.72 ^ clkbuf_1_1_0_clk/X (sky130_fd_sc_hd__clkbuf_8)
     1    0.01                           clknet_1_1_0_clk (net)
                  0.03    0.00    0.72 ^ clkbuf_1_1_1_clk/A (sky130_fd_sc_hd__clkbuf_8)
                  0.09    0.16    0.88 ^ clkbuf_1_1_1_clk/X (sky130_fd_sc_hd__clkbuf_8)
     2    0.04                           clknet_1_1_1_clk (net)
                  0.09    0.00    0.88 ^ clkbuf_2_2_0_clk/A (sky130_fd_sc_hd__clkbuf_8)
                  0.08    0.18    1.06 ^ clkbuf_2_2_0_clk/X (sky130_fd_sc_hd__clkbuf_8)
     2    0.04                           clknet_2_2_0_clk (net)
                  0.08    0.00    1.06 ^ clkbuf_3_4_0_clk/A (sky130_fd_sc_hd__clkbuf_8)
                  0.37    0.39    1.45 ^ clkbuf_3_4_0_clk/X (sky130_fd_sc_hd__clkbuf_8)
    11    0.22                           clknet_3_4_0_clk (net)
                  0.37    0.01    1.46 ^ clkbuf_leaf_83_clk/A (sky130_fd_sc_hd__clkbuf_16)
                  0.05    0.23    1.69 ^ clkbuf_leaf_83_clk/X (sky130_fd_sc_hd__clkbuf_16)
     6    0.02                           clknet_leaf_83_clk (net)
                  0.05    0.00    1.70 ^ _626_/CLK (sky130_fd_sc_hd__dfxtp_4)
                          0.25    1.95   clock uncertainty
                          0.00    1.95   clock reconvergence pessimism
                         -0.06    1.88   library hold time
                                  1.88   data required time
-----------------------------------------------------------------------------
                                  1.88   data required time
                                 -2.70   data arrival time
-----------------------------------------------------------------------------
                                  0.82   slack (MET)

@donghl17
Copy link
Contributor

donghl17 commented Mar 28, 2022

The reason we use 0.06ns for FO1 is that

  1. FO1 delay may be different among corners so it's hard to be given a fixed value.
  2. even if an initial delay is smaller than the actual value, the eco loop may fix it.

I don't quite understand why the resizer calculates the delay so large.

@antonblanchard
Copy link
Collaborator Author

I'm not sure where 60ps is coming from. Based on some Googling (and finding an old post from @RTimothyEdwards), it looks we can use the cell_rise timing table to estimate delay:

                cell_rise ("del_1_7_7") {
                    index_1("0.0100000000, 0.0230506000, 0.0531329000, 0.1224740000, 0.2823110000, 0.6507430000, 1.5000000000");
                    index_2("0.0005000000, 0.0013003900, 0.0033820100, 0.0087958500, 0.0228760000, 0.0594954000, 0.1547340000");
                    values("0.4998114000, 0.5089495000, 0.5285765000, 0.5701394000, 0.6668673000, 0.9095528000, 1.5405252000", \
                        "0.5045179000, 0.5136551000, 0.5331524000, 0.5748917000, 0.6716904000, 0.9148460000, 1.5447300000", \
                        "0.5140869000, 0.5233344000, 0.5430380000, 0.5846870000, 0.6816028000, 0.9245619000, 1.5526628000", \
                        "0.5343100000, 0.5434748000, 0.5629217000, 0.6046893000, 0.7011746000, 0.9444749000, 1.5746626000", \
                        "0.5680223000, 0.5771998000, 0.5968143000, 0.6385586000, 0.7351774000, 0.9786373000, 1.6067756000", \
                        "0.6128227000, 0.6221012000, 0.6416894000, 0.6833976000, 0.7799209000, 1.0234245000, 1.6514372000", \
                        "0.6625758000, 0.6718174000, 0.6914005000, 0.7330232000, 0.8295155000, 1.0724920000, 1.7018963000");
                }

They are all in the 500ps+ range.

@msaligane
Copy link
Contributor

@antonblanchard ~40ps is based on spice simulation

@maliberty
Copy link
Member

Something is wrong if spice gives 40ps and liberty 500ps.

@msaligane
Copy link
Contributor

This is what the values should look like:

cell_rise ("del_1_7_7") {
 index_1("0.01, 0.0230506, 0.0531329, 0.122474, 0.282311, 0.650743, 1.5");
 index_2("0.0005, 0.00126321, 0.00319137, 0.00806272, 0.0203697, 0.0514623, 0.130015");
 values("0.0490569, 0.0558265, 0.0717435, 0.1104504, 0.2075838, 0.4514763, 1.0684844", \
 "0.0536074, 0.0603471, 0.0762774, 0.1149897, 0.2116108, 0.4563259, 1.0785807", \
 "0.0642037, 0.0708758, 0.0866203, 0.1255204, 0.2235220, 0.4693128, 1.0892596", \
 "0.0819207, 0.0888360, 0.1049964, 0.1440040, 0.2421219, 0.4876812, 1.1038942", \
 "0.1041913, 0.1115879, 0.1278155, 0.1672534, 0.2646072, 0.5103125, 1.1274786", \
 "0.1259206, 0.1349041, 0.1524062, 0.1916016, 0.2898473, 0.5344468, 1.1529084", \
 "0.1292959, 0.1413072, 0.1638067, 0.2061120, 0.3024876, 0.5484357, 1.1667728");

@maliberty
Copy link
Member

Where did that table come from?

@antonblanchard
Copy link
Collaborator Author

@msaligane can answer for the spice derived data, but I'm guessing the liberty version comes from https://github.com/google/skywater-pdk-libs-sky130_fd_sc_hd/blob/ac7fb61f06e6470b94e8afdf7c25268f62fbd7b1/cells/dlygate4sd3/sky130_fd_sc_hd__dlygate4sd3_1__tt_025C_1v80.lib.json#L314

@msaligane
Copy link
Contributor

@msaligane can answer for the spice derived data, but I'm guessing the liberty version comes from https://github.com/google/skywater-pdk-libs-sky130_fd_sc_hd/blob/ac7fb61f06e6470b94e8afdf7c25268f62fbd7b1/cells/dlygate4sd3/sky130_fd_sc_hd__dlygate4sd3_1__tt_025C_1v80.lib.json#L314

Correct. @antonblanchard which corner is that table you shared from?

Regarding the FO4 / FO1 delays. Here is some simulation ran by @tdene :
https://docs.google.com/spreadsheets/d/1gMwM1-UAFYXjcYZNyzoIgDlpseJFfcP5M_l5y53X7Js/edit#gid=0

@tdene
Copy link
Contributor

tdene commented Mar 29, 2022

That was inverters.

It's minimal effort to re-format my spice sim and check sky130_fd_sc_hd__dlygate4sd3_1. I'll get out of meetings in about 3.5h and post some results then.

Regarding liberty vs spice

Typically liberty values have to be taken with half a grain of salt, because they depend on the table indices. In this case though, just looking in the middle of the tables and comparing the delay cell at 0.122474 X 0.022876 vs the inv_1 at 0.122474 X 0.0254232 vs the inv_12 at 0.122474 X 0.02662, we see 0.6385586 vs 0.1863159 vs 0.0927495.

I think OSU re-characterized all the SkyWater-provided standard cells. Let me see if anyone can find me that data, and it can be used as another verification point.

@tdene
Copy link
Contributor

tdene commented Mar 29, 2022

I updated the spreadsheet Mehdi linked with a new sheet, representing the delay cells:
https://docs.google.com/spreadsheets/d/1gMwM1-UAFYXjcYZNyzoIgDlpseJFfcP5M_l5y53X7Js

FO4 delay for sky130_fd_sc_hd__dlygate4sd3_1 seems to be around 280ps.

I'm confused about the difference between this reply and this reply.

@maliberty
Copy link
Member

The amounts of the differences here seem much larger than process variation would account for. There seems to be some fundamental mismatch and alarm bells are going off for me.

@tspyrou
Copy link
Collaborator

tspyrou commented Mar 29, 2022

@msaligane The ECO flow should not estimate what is going to happen but instead make the change in the netlist with incremental timing and then measure if the slack is better or worse via opensta. Making this change will require ensuring that incremental global routing happens as well so the parasitics are reasonable. This aligns with @mithro's comment.

@tspyrou
Copy link
Collaborator

tspyrou commented Mar 29, 2022

I think we should have a meeting to discuss this. @tdene @msaligane

@tdene
Copy link
Contributor

tdene commented Mar 29, 2022

@tspyrou I'm free almost any time tomorrow, on the 30th. My 31st is completely full though.

@tspyrou
Copy link
Collaborator

tspyrou commented Mar 29, 2022

@msaligane do you want to propose a time via email?

@msaligane
Copy link
Contributor

FO4 delay for sky130_fd_sc_hd__dlygate4sd3_1 seems to be around 280ps.

@tdene in our case we insert back to back buffers. So the FO1 delay makes more sense. The delay cell here should be 2 inverters in series which might explain the higher delay.

@msaligane The ECO flow should not estimate what is going to happen but instead make the change in the netlist with incremental timing and then measure if the slack is better or worse via opensta. Making this change will require ensuring that incremental global routing happens as well so the parasitics are reasonable. This aligns with @mithro's comment.

Yes, I agree that we should pick the delay value of the inserted buffer from .lib directly. Re: incremental timing, we actually run a timing check at signoff step and calculate the number of buffers to insert based on violating slack. We picked a value based on spice simulation but it is certainly better to use the .lib file instead.
The incremental routing would help a lot to reduce the run time during ECO iterations.

@msaligane do you want to propose a time via email?

Sure, I should be able to meet on 31st PM.

@donghl17
Copy link
Contributor

The ECO flow currently uses the spice FO delay to calculate an estimation of the buffer number. The change is incremental: every time we add buffers to the netlist, we'll do the opensta timing analysis again and then decide how many buffers should be inserted in the next iteration by checking the reports. For each iteration, the ECO flow may adjust the buffer number according to the timing reports so that all the violations can be fixed in the end.

Here is the document about the ECO flow details: https://docs.google.com/document/d/1RYWNeVjQpBkMhN1LGLOyMwwtLzFAENI-veqsfWg-5v8/edit?usp=sharing

@tspyrou
Copy link
Collaborator

tspyrou commented Mar 30, 2022

@donghl17 is there a reason to use the spice FO delay instead of OpenSTA API's? Either should work but then the spice has to correlate to the spice used to build the .lib

@donghl17
Copy link
Contributor

Yes, it could be better to use OpenSTA APIs, but I wasn't aware of the APIs before.

@tdene
Copy link
Contributor

tdene commented Mar 31, 2022

There appears to be a discrepancy between the sky130_fd_sc_hd__dlygate4sd3_1 delay listed by the timing files and the sky130_fd_sc_hd__dlygate4sd3_1 delay obtained via spice simulation, as agreed upon by @msaligane @donghl17 @tspyrou @tdene

This merits further investigation. Raising issue #363 inside skywater-pdk.

@tdene
Copy link
Contributor

tdene commented Mar 31, 2022

Another conclusion that was reached is the following:

  • OpenLane appears to have its own hold-violation fix pass, as seen here
  • Said hold-violation fix pass is hard-coded to use the sky130_fd_sc_hd__dlygate4sd3_1 cell.

During the meeting, a consensus was reached that a change was advisable due to the following reason:

  • The sky130_fd_sc_hd__dlygate4sd3_1 claims it has ~500ps of delay on it.
  • Defaulting to a high-delay cell to fix hold violations is disadvantageous for numerous reasons.

Personally, I see another possible point-of-failure. Though I do not understand the code-base sufficiently to judge whether this may occur, I fear that such hardcoded cells may lead to failure if a cell library other than sky130_fd_sc_hd is used.

The script in question should be changed. One option is to modify it. Another option is to replace it with OpenROAD's hold-violation-fix functionality, if doing so is possible and makes sense here.

Unsure of who to ping: @donn @donghl17 @antonblanchard, @jjcherry56 [most recent committer to OpenROAD repair_timing -hold-related code]

@antonblanchard
Copy link
Collaborator Author

@tdene I notice three liberty cells (sky130_fd_sc_hd__buf_8, sky130_fd_sc_hd__buf_12, sky130_fd_sc_hd__buf_16) are characterised out to 5ns, and the cell_rise table for each looks pretty suspicious:

sky130_fd_sc_hd__buf_16 in particular seems a bit magical, where it has negative cell_rise delays:

                cell_rise ("del_1_7_7") {
                    index_1("0.0100000000, 0.0281727000, 0.0793701000, 0.2236070000, 0.6299610000, 1.7747700000, 5.0000000000");
                    index_2("0.0005000000, 0.0023207900, 0.0107722000, 0.0500000000, 0.2320790000, 1.0772200000, 5.0000000000");
                    values("0.0699465000, 0.0720265000, 0.0806863000, 0.1123300000, 0.2371367000, 0.8045733000, 3.4296477000", \
                        "0.0761547000, 0.0782439000, 0.0868256000, 0.1185513000, 0.2433967000, 0.8112872000, 3.4485761000", \
                        "0.0941300000, 0.0961656000, 0.1046878000, 0.1361711000, 0.2615364000, 0.8345354000, 3.4633343000", \
                        "0.1264587000, 0.1286038000, 0.1373975000, 0.1696380000, 0.2955609000, 0.8647546000, 3.4934911000", \
                        "0.1638667000, 0.1664669000, 0.1767662000, 0.2109544000, 0.3371712000, 0.9042485000, 3.5293068000", \
                        "0.1643597000, 0.1679857000, 0.1822371000, 0.2262780000, 0.3553059000, 0.9229272000, 3.5579750000", \
                        "-0.0061497000, -0.0015439000, 0.0176675000, 0.0804246000, 0.2339132000, 0.7999581000, 3.4244187000");
                }

We currently disable that cell, but the 8 and 12 sized cells are used, and have unrealistic values in the table for transitions > 1.5ns.

@jjcherry56
Copy link

Negative delays are an artifact of how delay is measured and completely normal with super slow input transitions. The gate is an amplifier so the output can cross the logic threshold before the input does.

@antonblanchard
Copy link
Collaborator Author

@jjcherry56 thanks, but does it make sense that all the values of delay are monotonically increasing until the final row (eg 0.0699 - 0.164, -0.006)?

@antonblanchard
Copy link
Collaborator Author

If it is real, I wonder if we should treat max transition errors as errors during openlane sign off (it's a warning right now), because if we are far outside the limit of the lookup table the delay might be wildly different.

@jjcherry56
Copy link

Yes, it makes sense (ie, when looking at delay versus input transition time).
Yes, that is one of the main reason max_transition attributes exist (besides the physical effect of short circuit power).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants