Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Produce $bmux cells in shiftmul transformation #3880

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

povik
Copy link
Member

@povik povik commented Aug 7, 2023

Prompted by the discussion in #3873 I played around with changing the right shiftmul transformation to either (1) produce $bmux cells; (2) produce a number of independent shift cells. For now these are special modes conditioned on a -bmux or -wshift option, and in contrast to the original shiftmul transformation in those modes we allow for what we call the W2>W case over at #3873.

The code here is on top of the other PR, but the last commit is where the focus is.

Cc: @phsauter

@povik
Copy link
Member Author

povik commented Aug 7, 2023

I took on a little exercise in Tcl writing and produced the graph below, comparing the gate cost on one testcase, switching between doing nothing, peepopt -bmux, and peepopt -wshift, then running techmap; opt_expr -mux_undef; abc; for different widths of the part-select. Surprisingly to me, with the ABC magic in loop, the bmux mode ended up having a slightly lower cost at wide part-selects! EDIT: bug in transformation, corrected graph in a comment below. The shift stride (W) was 34, W2 is on the x-axis.

costs

testcase.v

module reg_mux (
	sel,
	out,
	in
);
	parameter [31:0] NoPorts = 32'd13;
	parameter [31:0] SelWidth = 32'd14;

	input wire [3:0] sel;
	output reg [SelWidth - 1:0] out;
	input wire [(NoPorts * 34) - 1:0] in;

	always @(*) begin
		out = 1'sb0;
		out = in[sel * 34+:SelWidth];
	end
endmodule

sweep.tcl

yosys -import
read_verilog testcase.v
design -save ast

set from 5
set to 100

for {set width $from} {$width <= $to} {incr width 1} {
	chparam -set SelWidth $width
	procs
	opt
	wreduce
	design -save evaluated

	#equiv_opt -assert peepopt
	#equiv_opt -assert peepopt -bmux
	#equiv_opt -assert peepopt -wshift

	design -load evaluated
	techmap
	opt_expr -mux_undef
	abc
	set stat_do_nothing [yosys tee -q -s result.json stat -json]

	design -load evaluated
	peepopt -bmux
	techmap
	opt_expr -mux_undef
	abc
	set stat_bmux [yosys tee -q -s result.json stat -json]

	design -load evaluated
	peepopt -wshift
	techmap
	opt_expr -mux_undef
	abc
	set stat_wshift [yosys tee -q -s result.json stat -json]

	proc cell_counts {stat} {
		set mux_gates [dict get $stat modules \\reg_mux num_cells_by_type \$_MUX_]
		set other_gates [dict get $stat modules \\reg_mux num_cells]

		return "$mux_gates\t[expr $other_gates - $mux_gates]\t"
	}

	puts stdout "$width\t[cell_counts $stat_do_nothing][cell_counts $stat_bmux][cell_counts $stat_wshift]"
}

Commands:

./yosys -c sweep.tcl -q > costs.txt
gnuplot -e "set terminal png; set key left top; plot 'costs.txt' using 1:2 title 'untransformed (MUX gates)', 'costs.txt' using 1:3 title 'untransformed (other gates)', 'costs.txt' using 1:4 title 'bmux (MUX gates)', 'costs.txt' using 1:6 title 'wshift (MUX gates)'" > costs.png

@povik
Copy link
Member Author

povik commented Aug 7, 2023

So, I propose, let's just produce bmuxes up to some high W2/W, otherwise do nothing and be done with it.

@povik
Copy link
Member Author

povik commented Aug 7, 2023

Haha, look what happens to the graph just a bit past where it was cropped earlier :D

costs

@povik
Copy link
Member Author

povik commented Aug 7, 2023

Let's make sure that jump isn't a bug in the transformation code...

@povik
Copy link
Member Author

povik commented Aug 7, 2023

It is, it fails equivalence checking around that edge

@phsauter
Copy link
Contributor

phsauter commented Aug 7, 2023

Which edge are you talking about, the one around 105-ish or the one around 170-ish?

@povik
Copy link
Member Author

povik commented Aug 7, 2023

105-ish (FWIW the drop in muxes around 170 is compensated by other gates which is the new yellow line)

@phsauter
Copy link
Contributor

phsauter commented Aug 7, 2023

I am going to assume the jump is at exactly 102, correct?
(3*34)

@povik
Copy link
Member Author

povik commented Aug 7, 2023

Yes, though it starts failing at 69 already, one above 34*2

@povik
Copy link
Member Author

povik commented Aug 7, 2023

Ah I see the bug

@povik
Copy link
Member Author

povik commented Aug 7, 2023

OK, generating a new graph

@povik
Copy link
Member Author

povik commented Aug 7, 2023

This is more like it.

costs

In the bmux case ABC sees it can share the mux gates and makes it equivalent in cost to the parallel shifts.

@phsauter
Copy link
Contributor

phsauter commented Aug 8, 2023

I hogged one of the ETH servers overnight and created a larger sweep, note that the color-scale of the reference (untransformed) is different:
contour_plots
results.csv
This shows combined gate count only, its already complicated enough as is.

Both bmux and wshift look well behaved and pretty much identical.
In the untransformed case you can see pretty well the block widths widths with a compact bit-representation such as 128, 160 and 192, confirming the initial assumption that padding them to 128 can produce superior results.

From a Quality-of-Results perspective I think this PR is ready, the two things left are:

  1. figure out what the default behavior should be (I suggest bmux)
  2. documentation of this transformation

@Ravenslofty
Copy link
Collaborator

I'm going to give this a little bit of a bump; is anything blocking this?

@povik
Copy link
Member Author

povik commented Oct 8, 2023

There are still some changes to be made. In one of the dev JFs we have decided on the following:

  • The default behavior should be the bmux transformation for W2<=W, parallel shifts otherwise.

  • There should be flags to force one or the other.

Personally I also want to add a way to specify a numeric W2/W threshold for deciding between the two transformation options.

@Ravenslofty
Copy link
Collaborator

There should be flags to force one or the other.

This sounds like an excellent opportunity to use the scratchpad.

@povik
Copy link
Member Author

povik commented Oct 8, 2023

There should be flags to force one or the other.

This sounds like an excellent opportunity to use the scratchpad.

Genuinely curious: How is that better than having a command option? With scratchpad it's more cumbersome to invoke. Is the argument along the lines that scratchpad configuration is exposed from synth scripts without any extra work?

@Ravenslofty
Copy link
Collaborator

Genuinely curious: How is that better than having a command option? With scratchpad it's more cumbersome to invoke. Is the argument along the lines that scratchpad configuration is exposed from synth scripts without any extra work?

I am pretty strongly against flag proliferation in synthesis scripts; I've written cookbooks to explain the purpose of the variety of flags in synth passes, most of which the end user will not need (synth_ecp5 -noccu2 for example). This is not nearly general enough to warrant adding extra confusion.

Yes, doing scratchpad -set pass.config N is clumsy if somebody is writing it out all the time, but in practice if somebody needs this, this is likely going to go into a .ys file which is ran with yosys -s.

That way you have the same command for every synthesis flow, instead of needing to file a PR that adds your flag to every synthesis flow.

@phsauter
Copy link
Contributor

@povik could you rebase this PRs branch ontop of your peepopt-work branch?
This PR needs the fixup! peepopt: Add left-shift 'shiftmul' variant commit as well to work properly.

The default of bmux for W2<=W seems intuitive to me.

As for the options:
I think peepopt in general should have options to turn off each pass. These should be flags in my opinion since when people have problems with it, the options will likely be the first place they check. Providing the option to turn-off passes should aid in debugging efforts there.

The additional options like forcing only bmux, only wshift or changing the W2/W threshold are more involved and can be behind another thing, I do not mind that.
However, this still has to be clearly documented in the help message of peepopt itself (including at least a reference to how to use the scratchpad for this).
Otherwise the cookbooks are needed again anyway, just for undocumented options instead.

Drop code that was once used by the 'dffmux' pattern but now is unused
after that pattern has been obsoleted by the 'opt_dff' pass.
No functional change intended.
The `opt_expr` pass running before `peepopt` can interfere with the
detection of a shiftmul pattern due to some of the bottom bits of the
shift amount being replaced with constant zero. Extend the detection to
cover those situations as well.
Add a separate shiftmul pattern to match on left shifts which implement
demuxing. This mirrors the right shift pattern matcher but is probably
best kept separate instead of merging the two into a single matcher.
In any case the diff of the two matchers should be easily readable.
@povik
Copy link
Member Author

povik commented Oct 16, 2023

@phsauter Done! Rebased on the other PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants