Performance & coding style


#1

Hi,

How do the ReconfigureIO tools treat unrolled loops?

I’m trying to understand if it’s better from a throughput point of view to write a loop like this:

		for i := 0; i < 16; i++ {
			roundOut = round(x[i], y[i], roundOut)
		}

…or manually unroll it like this:

		roundOut = round(x[0], y[0], roundOut)
		roundOut = round(x[1], y[1], roundOut)
		...
		roundOut = round(x[15], y[15], roundOut)

…will the reco tools treat the unrolled code as a 16-stage pipeline?


#2

Hi,

We do not unroll loops automatically, so from a throughput standpoint, manually unrolling is the way to go. The unrolled version you suggest would build a 16-stage pipeline, though you should expect it to use roughly 16x the area.


#3

It looks like this only generates a computational blob with a 16-stage latency, not really a pipeline with a throughput of one result per clock; there isn’t any information on how to stage the x and y vector elements so as to create a real pipeline.

What would you need to express to make this a real pipeline with throughput of 1 result per clock?


#4

I think the simplest answer to your question is to partition the x and y arrays into smaller arrays or individual elements (variables). This is not yet automated on our side.


#5

The new control-flow optimisations in the v0.17.3 release can help here. In most cases, goroutines that process data from an input channel and write the result to an output channel within an infinite loop will transform into a pipeline. The main limitations are that the goroutine must not have any internal state and currently can’t use any control-flow structures within the loop. The input and output channels must also have a length of at least one, to avoid rendezvous synchronisation with the producers and consumers.


#6

If you don’t support control flow, how would you implement packet processing pipelines?


#7

It’s a different level of abstraction from something like packet processing - this particular optimisation is for data-processing pipelines, so you can write code like this and get a pipeline you can stream data through at up to one data item per clock:

func foo(a <-chan int, b <-chan int, sum chan<- int, product chan<- int) {
  for {
    operandA := <-a
    operandB := <-b
    product <- operandA * operandB
    sum <- operandA + operandB
  }
}

The only control flow that could work in that context would be the equivalent of a ternary operator, which Go doesn’t support directly. Inferring ternary operators from if/else is something we’re looking at, but it isn’t currently supported.


#8

Hi Chris,

I think your reply highlights the kind of information we are missing: how to write the code so that it performs to expected levels (in terms of latency, throughput, and clock speed), and how to know what the achieved latency and throughput actually are.

best regards,
Mark