Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memory_libmap support for K6N10f #1

Open
rakeshm75 opened this issue Apr 4, 2023 · 54 comments · May be fixed by #2
Open

memory_libmap support for K6N10f #1

rakeshm75 opened this issue Apr 4, 2023 · 54 comments · May be fixed by #2

Comments

@rakeshm75
Copy link

  1. SPRAM inferencing
  2. DPRAM inferencing
  3. Split support for SPRAM (placing 2 18K RAMs in a 36K block)
  4. Asymmetric RAM support for SPRAM
  5. Testcases to verify RAM inferencing, Split support & Asymmetric RAM inferencing
@tpagarani
Copy link

@nakengelhardt let's use this issue for further discussion on memory_libmap support

@nakengelhardt
Copy link
Collaborator

I had a look at the examples in https://github.com/QuickLogic-Corp/yosys-f4pga-plugins/tree/main/ql-qlf-plugin/tests/qlf_k6n10f, especially bram_tdp and bram_tdp_split, and I am confused by your definition of TDP mode. It seems that the input in bram_tdp.v describes a memory with two independent read ports and two independent write ports, so four ports total. But the output of the current synth_quicklogic implementation is mapped to a memory with two read-write ports, and it looks like that will write to the read address if rce and wce are enabled simultaneously. This seems incompatible with the input code to me, am I misunderstanding something?

@rakeshm75
Copy link
Author

The basic TDP36K BRAM IP has only one addr port (no separate read address and write address):
.CLK_A1_i
.ADDR_A1_i
.WEN_A1_i
.BE_A1_i
.REN_A1_i
.WDATA_A1_i
.RDATA_A1_o

BRAM_TDP36K

In the mapping the read address & write address are multiplexed to port address based on read or write:
(in brams_map.v file)
assign PORT_A_ADDR = A1EN ? A1ADDR_TOTAL : (B1EN ? B1ADDR_TOTAL : 15'd0);
assign PORT_B_ADDR = C1EN ? C1ADDR_TOTAL : (D1EN ? D1ADDR_TOTAL : 15'd0);

Yes, you are correct it would write to the read address when rce and wce are enabled simultaneously of the same port (A or B).

@nakengelhardt
Copy link
Collaborator

In that case, we won't be able to reuse the existing ql_bram_split pass, I'll need to write a replacement. Can you confirm the exact limitations on the allowed combinations of setting the 8 mode parameters:

RMODE_A1_i
WMODE_A1_i
RMODE_A2_i
WMODE_A2_i
RMODE_B1_i
WMODE_B1_i
RMODE_B2_i
WMODE_B2_i

Which ones need to always be equal and which are allowed to be set to different values?

@rakeshm75
Copy link
Author

With 36K,
RMODE_A1_i = WMODE_A1_i = RMODE_A2_i = WMODE_A2_i
RMODE_B1_i = WMODE_B1_i = RMODE_B2_i = WMODE_B2_i
With18K ,
RMODE_A1_i = WMODE_A1_i
RMODE_B1_i = WMODE_B1_i
RMODE_A2_i = WMODE_A2_i
RMODE_B2_i = WMODE_B2_i

@nakengelhardt
Copy link
Collaborator

Thanks! In that case, should I also allow asymmetric ports in split mode, or will that be a problem for the downstream flow?

How should I deal with the attributes "rd_data_width" and "wr_data_width"? The names don't really make sense since the widths are per-port and each port can do read and write simultaneously. Are the timing differences only influenced by the port modes? Should it be something like port_a_width/port_b_width for 36k mode and port_a1_width/port_a2_width/port_b1_width/port_b2_width for split mode?

@rakeshm75
Copy link
Author

Yes, we should allow asymmetric ports in split mode as well (18K bram).

For SPRAM, one of the port was tied as write port and other port as read port, that the reason the attributes were named "rd_data_width" & "wr_data_width". But as you said each port can do a read and write, it would be appropriate have the attributes named based on ports, as you mentioned above port_a_width/port_b_width for 36k mode and port_a1_width/ port_a2_width/port_b1_width/port_b2_width for split mode.

Only for the FIFO's there is a restriction that the write port (Push) will always be tied to Port A and the read port (pop) will always be tied to Port B, as the FIFO flags are tied to the Port A read data port. Anyways, inferencing is not supported for FIFO's only FIFO instantiation is supported.

Timing difference is influenced by port widths, split/nonsplit and fifo modes.

@rakeshm75
Copy link
Author

One question, earlier you mentioned that we won't be able to reuse the existing ql_bram_split pass and you would write a replacement, so if a designer writes a dual port ram like the one in bram_tdp and bram_tdp_split (with separate read and write address each port) would it be inferred ?

@nakengelhardt
Copy link
Collaborator

No, I think those descriptions would require a primitive with at least 3 ports (2 write + 1 read) to map the semantics. An additional read port could be created with duplication, but there's no way to create more write ports than the primitive has, and if both are used to write then there is no port left over to read...

@rakeshm75
Copy link
Author

Ok, understood.

@nakengelhardt nakengelhardt linked a pull request May 5, 2023 that will close this issue
@nakengelhardt
Copy link
Collaborator

I've gone through the existing tests and rewritten the modules that infer memory with the patterns that memory_libmap expects. However the manually instantiated modules still use the old code path (bram_map.v + bram_final_map.v) which now has some issues as a) they are not considered as candidates for merging two 18k blocks in split mode, and b) they still use the rd_data_width/wr_data_width attributes vs the port_a_width/port_b_width. How would you like to handle this? Are all of the modules like BRAM2x18_TDP considered user-facing primitives that should continue to be supported?

(FYI I've also had some simulation issues with the included sim tests where sdffsre cells would always output X, so I've had to rewrite the bram_tdp examples in a way that changes the behavior of the module to not infer FF cells, that's why it has those intermediate wires for ce now.)

@rakeshm75
Copy link
Author

With regards to the manually instantiated modules, I will make the changes to use the port_a_width/port_b_width instead of rd_data_width/wr_data_width attributes. Also, I would look into using the libmap_brams_map.v or integrate manually instantiated modules into this file.

The user facing primitives would be the following:

  1. RAM_36K_BLK
  2. RAM_18K_BLK
  3. DPRAM_36K_BLK
  4. DPRAM_18K_BLK
  5. AFIFO_36K_BLK
  6. AFIFO_18K_BLK
  7. SFIFO_36K_BLK
  8. SFIFO_18K_BLK

@rakeshm75
Copy link
Author

I have been running tests with latest PR (#2), few observations:

I see that the Address and the Write data inputs to the RAMs are registered. Why is it registered?

I have attached the design here with the synthesis output files.

spram_32x1024.zip

@nakengelhardt
Copy link
Collaborator

nakengelhardt commented May 10, 2023

If you run memory_libmap with debug it will tell you: - emulate transparency with write port 0
The pattern in this memory is for read-first behavior but the TDP36K block has undefined collision behavior, so it will insert some extra logic to handle this. The pattern for undefined behavior that can map directly would be:

always @(posedge clk) 
begin
	if (wce)
		memory[wa] <= wd;
	if (rce)
		rq <= memory[ra];
	if (wce && wa == ra)
		rq <= 'x;
end

Alternatively you can add the no_rw_check attribute to suppress collision handling:
(* no_rw_check = 1 *) reg [DWIDTH-1:0] memory[0:(2**AWIDTH)-1];

@nakengelhardt
Copy link
Collaborator

Were you able to test the PR sufficiently? Is there something else you need me to do on it, or should I get started on reviewing the other parts of synth_quicklogic with a view to getting it merged upstream?

@rakeshm75
Copy link
Author

Yes, we have tested the PR, you can proceed on reviewing the other parts of synth_quicklogic.

@rakeshm75
Copy link
Author

@nakengelhardt
When I run synthesis on the PICOSOC design, I get the following error:

7.36. Executing MEMORY_LIBMAP pass (mapping memories to cells).
mapping memory top.soc.cpu.cpuregs via $__QLF_TDP36K
Memory top.soc.memory.mem mapping candidates (post-geometry):
Memory top.soc.memory.mem mapping candidates (after post-geometry prune):
ERROR: no valid mapping found for memory top.soc.memory.mem

I have attached the design and the synthesis report.

Can you please help us understand this issue?

Earlier with MEMORY_BRAM pass, the ram inferring was happening without any issue on this design.

picosoc.zip

@nakengelhardt
Copy link
Collaborator

That memory is explicitly designated with (* ram_style = "distributed" *) in the code at picosoc_noflash.v:225. Since k6n10f doesn't have any distributed memory (LUTRAM), this ram style is impossible to implement. If you remove the attribute or change it to a supported ram style, it will happily infer it.

@nakengelhardt
Copy link
Collaborator

I've updated memory_libmap to print a message about forced/disabled RAM styles when such an attribute is encountered, this should make it at least a little easier for the user to understand what is happening: YosysHQ/yosys#3826

@rakeshm75
Copy link
Author

Thank you!

@rakeshm75
Copy link
Author

@nakengelhardt
What information is required to support RAM initialization?
What format should the initialization data be provided by the user?

@nakengelhardt
Copy link
Collaborator

On the user side it would be assignments in initial blocks, either directly with a for loop or from a file with $readmemh functions.

For implementing the support, the information required is how the ram primitive expects the data to be provided. In $mem_v2 the initialization data is saved as one constant the size of the entire array, and this has to be translated to the format that the primitive wants during the techmap step.

@rakeshm75
Copy link
Author

The RAM primitive expects the data in the following format:

Split=1 (18K RAM)

x1:
RAM_Data[17:0] = {2'b00,data15,....data2,data1,data0};
x2:
RAM_Data[17:0] = {2'b00,data7[1:0],....data1[1:0],data0[1:0]};
x4:
RAM_Data[17:0] = {2'b00,data3[3:0],....data1[3:0],data0[3:0]};
x8:
RAM_Data[17:0] = {2'b00,data1[7:0],data0[7:0]};
x9:
RAM_Data[17:0] = {data1[8],data0[8],data1[7:0],data0[7:0]};
x16:
RAM_Data[17:0] = {2'b00,data0[15:0]};
x18:
RAM_Data[17:0] = {data0[17],data0[16],data0[15:0]};

NonSplit=1 (36K RAM)

x1:
RAM_Data1[17:0] = {2'b00,data15,....data2,data1,data0};
RAM_Data2[17:0] = {2'b00,data31,....data18,data17,data16};
x2:
RAM_Data1[17:0] = {2'b00,data7[1:0],....data1[1:0],data0[1:0]};
RAM_Data2[17:0] = {2'b00,data15[1:0],....data9[1:0],data8[1:0]};
x4:
RAM_Data1[17:0] = {2'b00,data3[3:0],data2[3:0],data1[3:0],data0[3:0]};
RAM_Data2[17:0] = {2'b00,data7[3:0],data6[3:0],data5[3:0],data4[3:0]};
x8:
RAM_Data1[17:0] = {2'b00,data1[7:0],data0[7:0]};
RAM_Data2[17:0] = {2'b00,data3[7:0],data2[7:0]};
x9:
RAM_Data1[17:0] = {data1[8],data0[8],data1[7:0],data0[7:0]};
RAM_Data2[17:0] = {data3[8],data2[8],data3[7:0],data2[7:0]};
x16:
RAM_Data1[17:0] = {2'b00,data0[15:0]};
RAM_Data2[17:0] = {2'b00,data1[15:0]};
x18:
RAM_Data1[17:0] = {data0[17],data0[16],data0[15:0]};
RAM_Data2[17:0] = {data1[17],data1[16],data1[15:0]};
x32:
RAM_Data1[17:0] = {2'b00,data0[15:0]};
RAM_Data2[17:0] = {2'b00,data0[31:16]};
x36:
RAM_Data1[17:0] = {data0[17],data0[16],data0[15:0]};
RAM_Data2[17:0] = {data0[35],data0[34],data0[33:18]};

@rakeshm75
Copy link
Author

@nakengelhardt
I have shared the above information regarding the ram primitive data format. Is the above information sufficient or Do you require more information to support ram initialization?

@nakengelhardt
Copy link
Collaborator

@rakeshm75 Thanks for the information! I just got back to the office yesterday, this is my next priority to work on.

In the module definition of TDP36K in brams_sim.v there isn't any parameter RAM_Data. I believe you mentioned that the INIT_XX parameters that are currently there are not the actual parameters expected by the P&R tool. Could you provide an updated simulation model for the TDP36K primitive with the parameters that should be assigned when mapping the memory?

Also, in addition to the order of the bits for one line in the RAM, could you specify the order of the memory lines in the parameter value?

@rakeshm75
Copy link
Author

@nakengelhardt
I have attached the RAM initialization document here. Please review and we could discuss the details.

RAM Initialization V2.0.docx

@povik
Copy link

povik commented Sep 4, 2023

Hi @rakeshm75, I am following up on @nakengelhardt's work on this.

I opened a PR to add the RAM_INIT property per your description to the simulation model. Feel free to check the model for correctness.

I will be working on adding synthesis support for this next.

In the document, you ask

Apart from the parameter in the synthesis output Verilog netlist and the blif output file, can synthesis file output the initialization data as hex file in the above-mentioned data format for each RAM block?

In principle it can. If implemented, it should probably be enabled by passing a separate option to synth_quicklogic. Could you please tell more about the intended use-case of this? Is it perhaps for user inspection, or to be consumed by a loader later?

@rakeshm75
Copy link
Author

Hi @povik

Yes, it would be consumed by the loader later. The purpose of the initialization data as hex file is for the loading/programming the initialization data through the pre-load bus.

Regards,
Rakesh

@povik
Copy link

povik commented Sep 5, 2023

Thanks Rakesh. I opened a new PR #9 which adds synthesis support for initial values on top of the extension of the sim models from PR #8, which I now closed. I will get back to you on the feature of emitting separate RAM data files from synthesis.

@rakeshm75
Copy link
Author

rakeshm75 commented Sep 10, 2023

@nakengelhardt
I have attached a VHDL DPRAM (4096x12) design which is run through Yosys (using Verific) synthesis. The MEMORY_LIBMAP pass does not infer a BRAM. The same design on the MEMORY_BRAM pass infers the BRAM properly.

I have attached the design and synthesis report for both the runs (MEMORY_LIBMAP & MEMORY_BRAM).

@povik
Copy link

povik commented Sep 11, 2023

@rakeshm75
I looked into what happens there. The Verific frontend parses out the design to have read-first behavior in case of an address collision between a read on one of the A/B ports and a write on the other. That collision behavior is specified as undefined for TDP36K in the libmap definition, and since it can't be emulated when there are shared R/W ports, the block memory doesn't get inferred.

You can define TDP36K to have read-first behavior across the dual ports by adding wtrans all old; below the port definition in the library, then it does get inferred, but I don't know if the property is true of the underlying hardware. @nakengelhardt's earlier comments suggest that it's not.

@povik
Copy link

povik commented Sep 11, 2023

Looking into the frontend code, Verific always takes memories to be read-first, no matter the input code.

@nakengelhardt
Copy link
Collaborator

nakengelhardt commented Sep 11, 2023

I'm not well versed in VHDL, but I think this is a collision handling problem. By analogy with the verilog pattern, I've found these options for having it infer the block ram:

  • disable the check via the no_rw_check attribute:
architecture rtl of xip_dpmem is

    type mem_type is array(0 to 2**AWIDTH_G-1) of std_logic_vector(DWIDTH_G-1 downto 0);

    shared variable mem : mem_type;
    attribute no_rw_check : string;
    attribute no_rw_check of mem : variable is "true";
begin
  • explicitly handle the collision:
    -- Port A
    process(clk_i)
    begin
        if rising_edge(clk_i) then
            if wren_a_i = '1' then
                mem(to_integer(unsigned(addr_a_i))) := din_a_i;
            end if;
            
            dout_a_o <= mem(to_integer(unsigned(addr_a_i)));
            if ((wren_b_i = '1') and (addr_a_i = addr_b_i)) then
                dout_a_o <= (others => 'X');
            end if;
            
        end if;
    end process;
     
    -- Port B
    process(clk_i)
    begin
        if rising_edge(clk_i) then
            if wren_b_i = '1' then
                mem(to_integer(unsigned(addr_b_i))) := din_b_i;
            end if;

            dout_b_o <= mem(to_integer(unsigned(addr_b_i)));
            if ((wren_a_i = '1') and (addr_a_i = addr_b_i)) then
                dout_b_o <= (others => 'X');
            end if;
        end if;
    end process;

There's a separate issue with distinguishing between read first and write first patterns with the verific frontend that I need to look into...

@rakeshm75
Copy link
Author

@nakengelhardt
One question here, with Verilog for read-first behavior, Yosys inserts extra logic to handle the undefined collision behavior and infers the BRAM. But here in VHDL, Yosys is not inferring the BRAM at all as default (without attribute), What is the reason?

Yes, with the above attribute (no_rw_check ) or with the code to explicitly handle the collision, Yosys infers the BRAM correctly with VHDL as well.

@nakengelhardt
Copy link
Collaborator

nakengelhardt commented Sep 12, 2023

@rakeshm75 I checked and it doesn't infer the equivalent verilog code either. IIRC the previous example had two different clocks, whereas in this case both ports are using the same clock, so it also needs to make sure that address collisions between addr_a_i and addr_b_i are handled correctly, and this is not possible to emulate if more than one write port exists. For the case where the two ports are in different clock domains, this part of the collision checks is omitted as there's no presumption of consistent order between the operations on the different ports anyway, so there is only the behavior of the read data signal on the writing port to consider.

Regarding the issue of verific only inferring read first ports even for other patterns, we have merged a fix today: YosysHQ/yosys#3927
With this fix, it now also infers correctly the two patterns listed as "not yet supported" in our docs when using verific.

@povik
Copy link

povik commented Sep 12, 2023

To try to give more detail on when emulation is possible: Both the emulation of read-first and write-first collision behavior requires to delay the writes on the underlying hardware primitive by one cycle. In general that's not possible to implement on a shared R/W port since that can't be fitted into the limitation to operate on a single address per cycle.

With two synchronous R/W ports that share a clock domain and require emulation of read-first colllision behavior with respect to each other, the inference code just can't find a way to map it onto TDP36K. The inference explores many options, so if there's e.g. one R/W port and one R port with the read-first requirement, then it finds it can implement that by duplicating the memory, performing writes simultanously on both instances in parallel, and using one instance for one read port and the other for the other port.

@rakeshm75
Copy link
Author

@nakengelhardt , @povik
Got it... thanks for the explanation!

@povik
Copy link

povik commented Sep 14, 2023

I will get back to you on the feature of emitting separate RAM data files from synthesis.

@rakeshm75 I sketched something in PR #10, let me know how well that would work for you.

@rakeshm75
Copy link
Author

@povik @nakengelhardt

I have attached ROM design (VHDL) here. How do we force Yosys to infer BRAM?
I have used the "ram_style" attribute but it does not infer the BRAM.

Uploading rom_des.zip…

@nakengelhardt
Copy link
Collaborator

@rakeshm75 I think you might have submitted the comment before the file upload completed? I am only seeing a link that goes back to this issue...

@rakeshm75
Copy link
Author

@nakengelhardt
Oh sorry, I am uploading the design here.

rom_des.zip

@nakengelhardt
Copy link
Collaborator

Thanks, I have investigated the example. The constant is not its own element in the netlist so we can't see attributes from it, but we don't currently look at attributes from the output register, which we should. I will look into modifying memory_libmap so that this coding style also works.

@rakeshm75
Copy link
Author

rakeshm75 commented Jan 12, 2024

@nakengelhardt @povik
We want to disable True Dual Port RAM inferring for our new device architecture. How do we disable Yosys from inferring tdp ram inferring? Memory libmap should infer only the SDP ram not the TDP ram.

@nakengelhardt
Copy link
Collaborator

The description in the libmap_brams.txt file represents only the TDP mode. Since SDP is just a subset of TDP where some signals are unused, it's not considered a separate mode. If you want to infer SDP only, that requires writing new libmap_brams.txt, and changing ql_bram_merge and libmap_brams_map.v to work on the new port layout.

@tpagarani
Copy link

@nakengelhardt, we are still going to support use of 36k BRAM as two 18K BRAM in split mode. Nothing changes on that front. In fact, the external pin interface of TDP36K remains unchanged. We'll still have 4 ports on 36K BRAM. The only change is that we don't support True Dual Port mode.

@nakengelhardt
Copy link
Collaborator

To confirm, before I start working on modifying ql_bram_merge and libmap_brams_map.v, in SDP mode the restrictions are that one port can be used for reading only and the other for writing only, but otherwise all the configuration options are still the same (each port can pick a different width, and the two ports can be using different clocks)?
So the corresponding libmap description would be:

ram block $__QLF_SDP36K {
	init any;
	byte 9;
	option "SPLIT" 0 {
		abits 15;
		widths 1 2 4 9 18 36 per_port;
	}
	option "SPLIT" 1 {
		abits 14;
		widths 1 2 4 9 18 per_port;
	}
	cost 65;
	port sr "RD" {
		clock posedge;
		clken;
	}
	port sw "WR" {
		clock posedge;
		wrbe_separate;
	}
}

@rakeshm75
Copy link
Author

rakeshm75 commented Jan 17, 2024

@nakengelhardt
Yes, you are correct. Port A is used for write and PORT B is used for read. Yes, each port can pick a different width, and the two ports can be using different clocks.

Here when we want to define ports separately for write and read, the names should be "W" or "WR" and "R" or "RD" correct? we cannot use "A" and "B" ??

I made the changes as follows and also modified the port names in libmap_brams_map.v accordingly (ex... PORT_A_WR_DATA -> PORT_W_WR_DATA, PORT_B_RD_DATA -> PORT_R_RD_DATA ...)

ram block $__QLF_SDP36K {
	init any;
	byte 9;
	option "SPLIT" 0 {
		abits 15;
		widths 1 2 4 9 18 36 per_port;
	}
	option "SPLIT" 1 {
		abits 14;
		widths 1 2 4 9 18 per_port;
	}
	cost 65;
	port sr "R" {
		clock posedge;
		clken;
	}
	port sw "W" {
		clock posedge;
		clken;
		wrbe_separate;
	}
}

@rakeshm75
Copy link
Author

@nakengelhardt
What would be the best way to support both the architectures, one device that supports TDP (TDP & SDP) and other that supports only SDP?

@nakengelhardt
Copy link
Collaborator

@rakeshm75 You can choose the names freely, if you prefer "A" and "B" you can name them that (I was just thinking it would be easier to read the code in the techmap file if the name also implies the capability).

If you want to support two devices with different memory blocks, I would recommend having two libmap files (e.g. bram_tdp.txt and bram_sdp.txt), and then passing one or the other when calling memory_libmap in synth_quicklogic. The techmap file and the ql_bram_merge pass probably don't need to be separated, as they should always transform the input cells into output cells of the same kind.

@nakengelhardt
Copy link
Collaborator

@rakeshm75 I see that you still have clken; on port W. Is there actually a clock enable port separate from the write enable port on the hardware primitive?

@rakeshm75
Copy link
Author

@nakengelhardt
No, there is no clock enable port separate from the write enable port. I will modify it.

@rakeshm75
Copy link
Author

rakeshm75 commented Jan 17, 2024

@nakengelhardt
But I had kept the clken; on port W as well because the libmap_brams_map.v has PORT_W_CLK_EN & PORT_R_CLK_EN (even on the original TDP model, PORT_A_CLK_EN & PORT_B_CLK_EN ) & in the __QLF_SDP36K_MERGED module has PORT_A1_CLK_EN, PORT_B1_CLK_EN, PORT_A2_CLK_EN & PORT_B2_CLK_EN).

Actually on the BRAM primitive there is no write clock enable and no read clock enable. On read side there is Read enable signal (REN).

@nakengelhardt
Copy link
Collaborator

nakengelhardt commented Jan 17, 2024

Yes, I needed to make it a clock enable for the R/W ports because you told me that if WEN is high the primitive will also update the read value, and the only way to represent that in the libmap logic is as clken. But if the ports aren't used simultaneously for read and write then there isn't any need to have a second signal because there is only one behavior (either the port is active or not). You can have rden; instead of clken; on the read port for the SDP, it shouldn't make a difference for a read-only port.

@nakengelhardt
Copy link
Collaborator

Would you want me to prepare to add support for the sdp-only architecture to the upstream synth_quicklogic, or is this still something you are only experimenting with?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants