Using an FPGA to check if proposed sorting networks are compliant #14

Open
jeras opened this issue Oct 26, 2024 · 2 comments
jeras commented Oct 26, 2024

Hi,

I was wondering whether it would make sense to check the sorting network proposals generated by the evolutionary algorithm on an FPGA. I will try to explain what can be done on an FPGA and how long each step would take, so you can estimate whether the process would provide a speedup compared to running it in SW, and what network size would be the tipping point where the FPGA is faster than SW. I could do the estimation myself, but I would need some estimates of how long the SW approach takes to check a proposed network.

  1. The evolutionary algorithm would create a set of proposals. As a rough estimate, about 100 n=64 networks might fit into an FPGA, depending on the FPGA size. Since the logic becomes very simple for ones-and-zeros inputs, I would update the estimate to about 1000 networks fitting into an FPGA.
  2. Use a code generator to write pipelined HDL (Verilog) code for the proposals, and also generate a known-good network for reference. The code could contain swap counters without much overhead (a sketch of the basic building blocks follows this list).
  3. Synthesize, place and route the generated code using the FPGA vendor tool; the result would be a bitstream. For an 80% full FPGA this step might take a few hours.
  4. With a script load the proposals into the FPGA. This step would take a few minutes.
  5. Run the proposals with all input combinations (ones and zeros) generated using a maximal-length LFSR. This step could check between 200M and 500M combinations per second: at 500 MHz it would check an n=32 network in 8.6 seconds, an n=48 network in 6.5 days, ...
  6. The FPGA would return the state over UART after all networks have failed or one has passed. The report would also contain swap counts. This would take a few seconds.
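To make steps 2 and 5 concrete, here is a minimal Verilog sketch of the two building blocks (my own illustration, not tested on hardware; module and signal names are made up). For inputs restricted to ones and zeros, a compare-exchange element degenerates into an AND/OR gate pair, which is also the observation behind the bit-parallel SW checker. The LFSR taps are the commonly tabulated ones for a 32-bit maximal-length sequence (e.g. Xilinx XAPP052).

```verilog
// Compare-exchange cell for zero-one inputs: min() is AND, max() is OR.
module cas_01 (
    input  wire a,
    input  wire b,
    output wire lo,   // min(a, b)
    output wire hi    // max(a, b)
);
    assign lo = a & b;
    assign hi = a | b;
endmodule

// 32-bit maximal-length LFSR (XNOR feedback, taps 32,22,2,1).
// It cycles through 2^32 - 1 states; the all-ones lock-up state is the
// only missing vector and is trivially sorted anyway.
module lfsr32 (
    input  wire        clk,
    input  wire        rst,
    output reg  [31:0] q
);
    always @(posedge clk)
        if (rst) q <= 32'h0;   // all-zeros is a valid state with XNOR feedback
        else     q <= {q[30:0], ~(q[31] ^ q[21] ^ q[1] ^ q[0])};
endmodule
```

Checking an output word for sortedness is also cheap for zero-one vectors: with the ones expected in the low bits of an output word (call it q_out), the expression ((q_out + 1) & q_out) == 0 tests that the set bits are contiguous.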

The largest overhead is the slow conversion from generated HDL code to FPGA bitstream.

Compared to SW, the FPGA can process all input combinations for small networks quickly, so using prefixes might not be needed. This would enable the evolutionary algorithm to cover a larger optimization space.

It might be possible to write efficient generic HDL code for a fixed n (and some max depth), so that new network proposals could be loaded into the FPGA over UART. This approach would allow skipping the slow FPGA synthesis step. I have not thought about such generic code much yet, so I am unsure how efficient it could be in terms of FPGA area and clock speed. It would probably make sense for n<32.
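As a rough sketch of what "generic" could mean (again my own illustration, untested; assuming every wire gets a loadable partner index and a min/max select bit), one runtime-configurable layer might look like this:

```verilog
// One loadable compare-exchange layer for zero-one inputs.
// partner[i] selects the wire that wire i is compared against, and
// is_max[i] selects whether wire i keeps the max (OR) or min (AND).
// A wire whose partner index points at itself passes through unchanged,
// so unused positions need no special encoding.
module cfg_layer #(
    parameter N    = 32,
    parameter LOGN = 5
) (
    input  wire [N-1:0]      in,
    input  wire [N*LOGN-1:0] partner,  // packed partner indexes, loaded over UART
    input  wire [N-1:0]      is_max,   // per-wire min/max select
    output wire [N-1:0]      out
);
    genvar i;
    generate
        for (i = 0; i < N; i = i + 1) begin : g
            wire [LOGN-1:0] p     = partner[i*LOGN +: LOGN];
            wire            other = in[p];                      // N-to-1 mux
            assign out[i] = is_max[i] ? (in[i] | other) : (in[i] & other);
        end
    endgenerate
endmodule
```

The N-to-1 mux per wire per layer is where the area cost would come from, which is why I suspect this only pays off for small n.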

Going even further, for small n, the FPGA could simply go through all possible sorting networks; I am not going to calculate right now how many that would be. As a drawback, the algorithm for generating network proposals would have to be rather simple, as implementing something like an evolutionary algorithm on an FPGA would take time.

bertdobbelaere (Owner) commented

Thanks for sharing your thoughts about this. I think that indeed it makes a lot of sense to try to use FPGA(s) as accelerators.
What would be needed to make them effective is that (1) they can handle parts of the process more efficiently than the CPU can, and (2) the overhead of synthesis and communication is far less than the avoided CPU time.
I like the idea of using an LFSR as a way to efficiently apply the input vectors in a "random" order without needing to store them. However, in its simplest form I see a few issues that would need to be tackled:

  • The vast majority of the networks tried by SorterHunter (typically a few 100K networks/second) are rejected in an early phase. Using the SW algorithm, we apply the input vectors in batches of 64 (bit-parallel) and verify them in maybe a few 100 ns per batch (time linear with network size after the prefix, if any). Typically, fewer than 10% of the networks survive the first round of 64 vectors, successively fewer are rejected in the 2nd, 3rd, ... batches, and few survive the entire set of vectors, but we need to test them all to be sure. As soon as one vector fails to sort, the network is rejected.
    I think that in order to make it worthwhile to offload this task to the FPGA, we should only pass it the more "serious" candidates that survived at least a few rounds; otherwise more time would be spent transferring than checking.

  • It is true that more search space could be covered without prefixes; however, I don't think they can be avoided for the larger networks, as 2^n quickly becomes too large even for an FPGA. A limited (1 layer) prefix may be an option and does not have to be a restriction, because it can be shown that a valid network can always be transformed to place its first layer elements at arbitrary positions.
    E.g. when we have n=36 and assuming 15 elements on the 1st layer, we would only need to test 2^(36-2*15) * 3^15 inputs instead of 2^36 (each sorted output pair of a first layer element can only take 3 of the 4 possible values), still a lot but certainly worth the reduction. Maybe it is possible to have a variant on the LFSR that generates ternary outputs in pseudorandom order, as in the sketch below.
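Such a ternary variant does exist in principle: LFSRs generalize from GF(2) to GF(3), where a primitive polynomial of degree k yields a sequence cycling through all 3^k - 1 nonzero states. Below is a toy degree-2 sketch (my own; the recurrence s[n] = 2*s[n-1] + s[n-2] mod 3 corresponds to x^2 + x + 2, which is primitive over GF(3)). A real design would pick a degree-15 primitive polynomial over GF(3) from published tables and use the whole 15-digit state as the test vector.

```verilog
// Toy LFSR over GF(3): emits ternary digits in pseudorandom order,
// cycling through all 3^2 - 1 = 8 nonzero states of the register pair.
module gf3_lfsr (
    input  wire       clk,
    input  wire       rst,
    output wire [1:0] digit   // ternary digit in {0, 1, 2}
);
    reg [1:0] s1, s0;         // s[n-1] and s[n-2], each a GF(3) digit

    // reduce a value in 0..6 to 0..2
    function [1:0] mod3(input [2:0] x);
        mod3 = (x >= 6) ? 2'd0 : (x >= 3) ? (x - 3'd3) : x[1:0];
    endfunction

    wire [2:0] sum = s1 + s1 + s0;   // 2*s[n-1] + s[n-2], at most 6

    always @(posedge clk)
        if (rst) begin
            s1 <= 2'd1;              // any nonzero state is a valid seed
            s0 <= 2'd0;
        end else begin
            s0 <= s1;
            s1 <= mod3(sum);
        end

    assign digit = s1;
endmodule
```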

The mode of operation would need to be chosen as some tradeoff between versatility and performance.
On the one extreme, an FPGA could be synthesized with the candidate networks already built in.
Personally I'm not convinced that would be practical; loadable networks seem more convenient and also scale better if you have more than one device.
Having the FPGA propose new variants is the other extreme, but that would probably be complex, although not at all impossible. We just need to avoid large memory arrays to store the vectors to be tested.
The design may work on a large number of networks in parallel, in which case the input interface could accept a network (as a list of pairs) and give a ticket number back; the output could be a queue with the ticket numbers of passed and failed networks after processing.
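To make the ticket idea concrete, here is a minimal front-end sketch (illustration only; all port names are assumptions, the checking core behind the chk_* signals is not shown, and a real design would buffer results in a FIFO):

```verilog
// Ticket-issuing front end: the host streams a network as a list of
// wire pairs and gets a ticket number back; completed checks are
// reported as (ticket, pass/fail) results toward the host.
module ticket_frontend #(
    parameter TICKET_W = 8
) (
    input  wire                clk,
    input  wire                rst,
    // network submission: the last pair of a network asserts sub_last
    input  wire                sub_valid,
    input  wire                sub_last,
    output reg  [TICKET_W-1:0] sub_ticket,  // ticket of the network being submitted
    // completion strobe from the checking core (not shown)
    input  wire                chk_done,
    input  wire [TICKET_W-1:0] chk_ticket,
    input  wire                chk_pass,
    // result queue toward the host
    output reg                 res_valid,
    output reg  [TICKET_W-1:0] res_ticket,
    output reg                 res_pass
);
    always @(posedge clk)
        if (rst) begin
            sub_ticket <= {TICKET_W{1'b0}};
            res_valid  <= 1'b0;
        end else begin
            // issue the next ticket once a complete network has been received
            if (sub_valid && sub_last)
                sub_ticket <= sub_ticket + 1'b1;
            // forward one result per completed check
            res_valid  <= chk_done;
            res_ticket <= chk_ticket;
            res_pass   <= chk_pass;
        end
endmodule
```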
Alternatively, the set of input vectors could be split into a large number of subsets and the whole FPGA could work on different subsets in parallel for the same network. There are also quite a number of FPGA development boards that have faster interfaces than UART; PCIe add-on cards seem the most convenient for desktop PCs.
The biggest bummer may be the pricing model of the development tools. Typically free to try out or free until a certain design size, but suddenly costly beyond that.

jeras (Author) commented Oct 29, 2024

Hi,

Based on your description I agree: synthesizing generated networks would not make much sense, except maybe for very large networks which have already passed a few rounds of SW checks. So reconfigurable networks would probably be the way to go. I mentioned UART as the communication protocol since it is the easiest to implement. PCIe, Ethernet, ... require a combination of hardware and software; in general they take more time to implement, maybe too much time for a hobby project.

My answer is a bit short, since I have lost most of my interest. My research was focused on using sorting networks as interconnect inside chips, but it seems they are not a good compromise for modern devices. I was only able to find one mention of a sorting network used in an old ATM network device (probably about 30 years old now).

Even other non-blocking rearrangeable networks for unicast traffic, like the Clos network, seem to be a thing of the past. A book mentioned a SONET chip using them internally, but I could not find the documentation for that chip, only for the next device in the family, which did not seem to use Clos networks internally, though it would still use them to connect multiple devices. Then again, SONET is also a thing of the past.

Modern global networks use Ethernet and some server farms use InfiniBand; those seem to use fat-tree and supercomputer-specific network topologies. In general, it seems simple crossbars have replaced "non-blocking rearrangeable networks for unicast traffic" inside chips; what remains are either plain crossbar switches or other blocking structures which are easy to control (interconnected rings, ...).
