Binary Neural Networks on FPGAs


Binary Neural Networks (BNNs) are gaining attention in the community because they use compact data types for processing, which eliminates redundant computation (and, of course, storage and communication). FPGAs are extremely useful for this purpose, as they are well suited to implementing custom operations. FINN, recently proposed in collaboration with Xilinx Research Labs, is a framework to support BNNs:

Read more:

and please feel free to share your thoughts with us.


This is a pretty solid paper! They summarize the most important work that has been done on BNNs. One thing to note with BNNs is that they are typically trained on a server to get the binary weights and then those weights are loaded into whatever platform (e.g. FPGA) that the network is being run on. So, on-the-fly training isn’t really possible.


Agreed, but when it comes to online training of NNs, FPGAs are widely used to model sparsity (i.e. non-fully-connected NNs) and to leverage compact data types, including log-domain computation (5 bits for activations/weights versus 32-bit floating point!):
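To illustrate the log-domain idea: each weight is stored as a sign plus an integer exponent, so a multiply collapses to a shift in hardware. A rough sketch in Go, where the `logQuantize`/`dequantize` names and the 1-bit-sign/4-bit-exponent split are my own illustrative assumptions (real implementations also need to handle zero and clamp the exponent range, which this sketch skips):

```go
package main

import (
	"fmt"
	"math"
)

// logQuantize rounds |x| to the nearest power of two and keeps the sign,
// so multiplying by the quantized weight reduces to a bit shift in hardware.
// With a 1-bit sign and a 4-bit exponent this fits in 5 bits, versus
// 32-bit floating point. Note: x must be non-zero (log2(0) is -Inf),
// and a real design would clamp exp to the 4-bit range.
func logQuantize(x float64) (sign int, exp int) {
	sign = 1
	if x < 0 {
		sign = -1
		x = -x
	}
	exp = int(math.Round(math.Log2(x)))
	return sign, exp
}

// dequantize reconstructs the approximate real value.
func dequantize(sign, exp int) float64 {
	return float64(sign) * math.Pow(2, float64(exp))
}

func main() {
	s, e := logQuantize(0.30)
	// 0.30 is closest (in log space) to 2^-2 = 0.25
	fmt.Println(s, e, dequantize(s, e))
}
```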


Oh yeah, I completely agree on their usefulness and importance (I’m doing research on an application of BNNs), I just wanted to point out one of their drawbacks! :slight_smile:


Interesting, so how do you train your NNs: locally, or do you use cloud-based infrastructure? (If you don't mind sharing.)


Cloud-based at the moment! Previous research in the area tried to put backpropagation onto hardware for this application, but then the number of clock cycles and the amount of hardware required explode.

However, implementing just enough to train the final layer of weights in a network would probably increase accuracy over time, and would at least enable some transfer learning without too much extra hardware.

It’s an interesting problem!


Hi @peterseo,

I am not familiar with NNs, but I am curious: what is the major source of hardware demand in a training implementation?

This work is based on BinaryNet ([Ref. 5]), which proposes using bit-wise operations instead of arithmetic ones. Would it be feasible to implement if we can decrease the size?

[Ref. 5]

I understand the training will be slow.
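For context on the bit-wise operations BinaryNet relies on: with weights and activations constrained to ±1, a dot product reduces to an XNOR followed by a popcount, with no multipliers at all. A minimal sketch in Go (the `binaryDot` name and the bit-packing convention are my own illustrative choices, not from the paper):

```go
package main

import (
	"fmt"
	"math/bits"
)

// binaryDot computes the dot product of two 64-element {-1, +1} vectors,
// each packed into a uint64 (bit = 1 encodes +1, bit = 0 encodes -1).
// Positions where the signs agree contribute +1, disagreements -1, so:
//   dot = 2 * popcount(XNOR(a, w)) - 64
func binaryDot(a, w uint64) int {
	xnor := ^(a ^ w) // 1 wherever the two signs agree
	return 2*bits.OnesCount64(xnor) - 64
}

func main() {
	all := ^uint64(0)
	// Identical vectors: all 64 positions agree, dot = +64.
	fmt.Println(binaryDot(all, all))
	// Complementary vectors: all positions disagree, dot = -64.
	fmt.Println(binaryDot(all, 0))
}
```

On an FPGA this maps directly onto LUTs and a popcount tree, which is where the resource savings over floating-point MACs come from.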


From my understanding:

If only forward propagation is implemented in hardware, then the weights don't need to be stored in memory, but if training is to happen, they do. That immediately adds at least a clock cycle. And that paper, like the others it references on BNNs, still computes floating-point gradients and then uses some method to translate them into a binary update.
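That float-gradient-then-binarise scheme is commonly done with real-valued "shadow" weights: the forward pass sees the binarised weight, while the floating-point gradient updates the shadow copy. A minimal, hypothetical sketch, where the `sign`/`step` names, the learning rate, and the [-1, 1] clipping range are my own illustrative choices rather than anything from the paper:

```go
package main

import "fmt"

// sign binarises a real-valued shadow weight to ±1 for the forward pass.
func sign(w float64) float64 {
	if w >= 0 {
		return 1
	}
	return -1
}

// step applies one update: the floating-point gradient is applied to the
// real-valued shadow weight, which is then clipped to [-1, 1] so it cannot
// drift arbitrarily far from the binarisation threshold.
func step(shadow, grad, lr float64) float64 {
	shadow -= lr * grad
	if shadow > 1 {
		shadow = 1
	} else if shadow < -1 {
		shadow = -1
	}
	return shadow
}

func main() {
	shadow := -0.1
	fmt.Println(sign(shadow)) // forward pass currently sees -1
	// A floating-point gradient nudges the shadow weight across zero,
	// flipping the binary weight used on the next forward pass.
	shadow = step(shadow, -2.0, 0.1)
	fmt.Println(sign(shadow)) // forward pass now sees +1
}
```

The hardware cost mentioned above follows directly: the shadow weights are an extra full-precision memory, and the update path needs arithmetic the forward-only design avoids.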

For the applications I’m looking at, the neural network is running at the same time as other instructions so an extra ALU would be needed in order to keep everything parallel.


That’s right: in backpropagation the weights are updated iteratively, and that calls for extra hardware (e.g. FIFOs, additional control flow such as state machines, etc.).

So if you need an extra ALU to run in parallel with your instructions, could you have it as an accelerator on the FPGA instead of synthesising the entire network onto the FPGA?


Thank you both for the explanations (cheers). It would be cool if the training modules could fit on FPGA devices.
I have another minor (and probably dumb) question, if you don’t mind. I looked at several implementations/papers that use existing datasets. Because the training process needs to know how accurate the output is, it might demand manual feedback for unclassified inputs. Is it necessary to train on the fly? If so, is user feedback required?


Khanh, I think you’ll find Quora very useful for your training-specific questions;

what we discuss in this thread is FPGA-based acceleration of NNs and how our framework could be useful in that respect.


@Mahdi.Jelodari The neural network is part of a microprocessor architecture, but an accelerator could work.

Could you point me in a direction to learn more about accelerators? Thanks!


@peterseo Sure, have a look at our documentation:
in particular, the histogram example in Tutorial 1 – Learning the workflow;

it describes how you could develop your own accelerator(s) in Go and run them on an FPGA in parallel with your main code running on a CPU.


You might be interested in this GitHub repo,
which appears to have the source code, still TBC though.


Thanks Ed! We’re going to try building a BNN library/example to see how it stretches our compiler, maybe based on (apologies for the paywall…)

Anyone who would like to be involved, drop me a line! @peterseo?


@here you can also follow our #BNN-related conversation at


I’ll take a look at that paper, but from the abstract it’s pretty apparent that the authors’ implementation is on a discrete FPGA part, possibly with networking, rather than one embedded into a multicore CPU. The latter is the target architecture for the technology on offer from unless I have a serious misunderstanding of the objective here.


@edward.hartley I think the authors’ implementation is customised to exploit maximum memory bandwidth for this particular application; an accelerator embedded into a multicore CPU may suffer from the latency associated with the memory organisation. offers an acceleration framework that can benefit from both architectures with respect to the user’s application.