The purpose of this project is to design a scaled-down TPU-style processor for DNNs (Deep Neural Networks), similar to the Google TPU (Tensor Processing Unit).
The SW stack trains the target DNN model for image processing and extracts its parameters; inference then runs on the HW stack, which is designed in Verilog HDL. For inference, Conv (convolution), FC (fully connected), and Pool (max pooling) layers are implemented. The Conv layer uses the im2col method. The PE (Processing Element) and SA (Systolic Array) are designed for an ifmap-stationary dataflow. To scale the design down, only 8-bit weights and 8-bit ifmap data are used, and quantization is applied to achieve this.
- FPGA part : XC7Z020-1CLG400C
- 1 MSPS On-chip ADC : Yes
- Look-up Tables (LUTs) : 53,200
- Flip-flops : 106,400
- Block RAM : 630 KB
- Clock Management : 4
- Available Shield I/O : 40
- Total Pmod Ports : 6
- Fan Connector : Yes
- Zynq Heat Sink : Yes
- HDMI CEC Support : TX and RX ports
- RGB LEDs : 2
- Data type : unsigned int8
- Uses FPGA DSP slices; 10 ns < latency < 20 ns (1 clock = 10 ns), as in the sketch below
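A minimal Verilog sketch of this timing budget; the module and port names are illustrative assumptions, not the project's actual interface.

```verilog
// Illustrative 8x8 unsigned multiply (names assumed). The DSP multiply
// path is quoted above as 10-20 ns, i.e., more than one 10 ns cycle,
// so both inputs and output are registered: the product appears two
// clock edges after the operands, and synthesis maps it onto a DSP48.
module mul_u8 (
    input  wire        clk,
    input  wire [7:0]  a_i,   // unsigned int8 ifmap operand
    input  wire [7:0]  b_i,   // unsigned int8 weight operand
    output reg  [15:0] p_o    // 16-bit product
);
    reg [7:0] a_q, b_q;
    always @(posedge clk) begin
        a_q <= a_i;           // input register stage
        b_q <= b_i;
        p_o <= a_q * b_q;     // registered multiply -> DSP inference
    end
endmodule
```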
- Port : True_dual_port_ram
- Latency : R/W = 1 clk
- Size : 640 kB
- Bandwidth : 512 bits per cycle
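A behavioral sketch of a true-dual-port RAM matching the 1-clk R/W latency above; the WIDTH/DEPTH values and port names are assumptions for illustration, not the exact project configuration.

```verilog
// True-dual-port BRAM sketch (read-first on both ports, 1-clk latency).
module true_dual_port_ram #(
    parameter WIDTH = 512,                // bits transferred per cycle
    parameter DEPTH = 1024,               // rows (illustrative)
    parameter ADDRW = $clog2(DEPTH)
) (
    input  wire             clk,
    // Port A
    input  wire             ena, wea,
    input  wire [ADDRW-1:0] addra,
    input  wire [WIDTH-1:0] dina,
    output reg  [WIDTH-1:0] douta,
    // Port B
    input  wire             enb, web,
    input  wire [ADDRW-1:0] addrb,
    input  wire [WIDTH-1:0] dinb,
    output reg  [WIDTH-1:0] doutb
);
    reg [WIDTH-1:0] mem [0:DEPTH-1];
    always @(posedge clk) if (ena) begin  // port A: R/W in 1 clk
        if (wea) mem[addra] <= dina;
        douta <= mem[addra];
    end
    always @(posedge clk) if (enb) begin  // port B: R/W in 1 clk
        if (web) mem[addrb] <= dinb;
        doutb <= mem[addrb];
    end
endmodule
```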
- For matrix multiplication in the conv layer, the previous layer's output ifmap tensor must first be transformed into a matrix via im2col.
- The im2col transformation uses the AXI interface to transfer data from BRAM to the FPGA processor; the transformation is applied through a Xilinx Vitis interface program (version 2021.2).
- The ifmap and weights for the conv operation are fetched from BRAM by address value; the ifmap is preloaded into the SA (Systolic Array), and the weights are transmitted to the GLB.
- Buffers the weight values received from the Conv Data Mover so they can be transferred to the SA in the order required for ifmap-stationary operation.
- Multiple PEs (Processing Elements) are connected to each other to deliver the ifmap, weights, and partial sums.
- When the weight and partial-sum values are forwarded from one PE to another, the valid signal is transmitted at the same time.
- Multiplication of two square matrices is performed through the PE operations and the forwarding of data and valid (en) signals between these primitives, as in the sketch below.
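A minimal PE sketch consistent with the ifmap-stationary description above; the port names and widths are assumptions, not the project's exact interface.

```verilog
// Ifmap-stationary PE sketch: the ifmap operand is preloaded and held,
// while weight and psum stream through and are forwarded, each paired
// with a valid signal, to the neighboring PE.
module pe #(
    parameter DW = 8,                  // uint8 operands
    parameter PW = 24                  // partial-sum width (assumed)
) (
    input  wire          clk, rst_n,
    input  wire          ifmap_load_i, // preload strobe
    input  wire [DW-1:0] ifmap_i,      // stationary operand
    input  wire          w_valid_i,
    input  wire [DW-1:0] weight_i,     // moving operand
    input  wire [PW-1:0] psum_i,       // partial sum from neighbor
    output reg           w_valid_o,    // valid forwarded with the data
    output reg  [DW-1:0] weight_o,
    output reg           psum_valid_o,
    output reg  [PW-1:0] psum_o
);
    reg [DW-1:0] ifmap_q;              // held for the whole tile
    always @(posedge clk) begin
        if (ifmap_load_i) ifmap_q <= ifmap_i;
        if (w_valid_i) begin
            weight_o <= weight_i;                    // forward weight
            psum_o   <= psum_i + ifmap_q * weight_i; // MAC, forward psum
        end
    end
    always @(posedge clk or negedge rst_n)
        if (!rst_n) begin
            w_valid_o <= 1'b0; psum_valid_o <= 1'b0;
        end else begin
            w_valid_o <= w_valid_i; psum_valid_o <= w_valid_i;
        end
endmodule
```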
- A combination of a FIFO and an adder: the psum values delivered from the SA are written to the FIFO using the psum_valid signal.
- Accumulation is performed by feeding the FIFO's rdata back, adding the incoming psum, and writing the sum back into the FIFO.
- Stores the ofmap values received from the accumulator in a buffer and writes them to the BRAM; the accumulation path is sketched below.
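A sketch of the FIFO-feedback accumulation described above; fifo_sync is a hypothetical synchronous-FIFO module used only for illustration, and first_pass_i is an assumed control signal marking the first partial tile.

```verilog
// Accumulator sketch: each incoming psum is added to the running total
// read back from the FIFO, and the sum is written into the FIFO again.
module psum_accum #(
    parameter PW = 24
) (
    input  wire          clk, rst_n,
    input  wire          psum_valid_i, // write strobe from the SA
    input  wire [PW-1:0] psum_i,
    input  wire          first_pass_i, // assumed: no history to add yet
    output wire [PW-1:0] acc_o
);
    wire [PW-1:0] rdata;
    // On the first pass the FIFO holds no history, so add zero;
    // afterwards rdata is fed back and the sum is re-written.
    wire [PW-1:0] sum = psum_i + (first_pass_i ? {PW{1'b0}} : rdata);
    fifo_sync #(.WIDTH(PW)) u_fifo (   // hypothetical FIFO module
        .clk  (clk),  .rst_n(rst_n),
        .wr_en(psum_valid_i),                 .wdata(sum),
        .rd_en(psum_valid_i & ~first_pass_i), .rdata(rdata)
    );
    assign acc_o = sum;                // total after the final pass
endmodule
```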
- A module that reads and writes data between two BRAMs (BRAM0 & BRAM1).
- State: an IDLE/RUN/DONE FSM; it receives run_i and run_count_i from the controller in the IDLE state and then operates.
- Reads operands from BRAM0 and passes them to the operation core -> the resulting values are saved to BRAM1 (transferred to Max Pooling).
- Receives two operands, multiplies them, and accumulates the product onto the previously accumulated result; this MAC operation causes a timing violation.
- The multiplier retiming problem (clock period < MAC delay) is solved by pipelining with FFs, as in the sketch after this list.
- Operates each time the data mover completes one IDLE-RUN-DONE cycle.
- The 8-bit operation results must be written to BRAM2 whenever the result count is a multiple of 4 (as input for FC2).
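A sketch of the FF-pipelining fix named above, assuming a two-stage split (multiply, then accumulate); module and port names are illustrative.

```verilog
// Pipelined MAC: a flip-flop after the multiplier breaks the long
// multiply-add path into two short stages, one per 10 ns clock.
module fc_mac #(
    parameter DW = 8,                  // uint8 operands
    parameter AW = 24                  // accumulator width (assumed)
) (
    input  wire          clk, rst_n,
    input  wire          valid_i,
    input  wire [DW-1:0] x_i,          // input neuron value
    input  wire [DW-1:0] w_i,          // weight
    output reg  [AW-1:0] acc_o         // running sum
);
    reg [2*DW-1:0] prod_q;             // retiming FF after the multiply
    reg            valid_q;
    always @(posedge clk or negedge rst_n)
        if (!rst_n) begin
            prod_q <= 0; valid_q <= 1'b0; acc_o <= 0;
        end else begin
            prod_q  <= x_i * w_i;        // stage 1: multiply only
            valid_q <= valid_i;
            if (valid_q)
                acc_o <= acc_o + prod_q; // stage 2: accumulate only
        end
endmodule
```

Splitting the multiply and the add across the FF halves the critical path at the cost of one extra cycle of latency.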
BRAM0
- Conv layer output feature-map volume: 7 x 7 x 64
- The 7 x 7 feature maps are stored row by row, one feature-map row per BRAM row
- Read one row and feed its 7 values into Cores 1 ~ 7 (x_i: i-th input neuron)
BRAM1
- Stores 7 weights per row
- Read one row and insert 7 numbers into Core 1 ~ 7
- w_(i,j): weight from the i-th input neuron to the j-th output neuron
- It is assumed that the weight storage can be transferred using the AXI4 protocol
- BRAM capacity requirement: approx. 3.21 MB ((7 x 7 x 64) x 1024 x 8 bit = 3,211,264 bytes)
- It is assumed that address control can also be performed using the AXI4 protocol
BRAM2
- Stores the 1,024 values that form FC1's output and FC2's input neurons
- Fetches the feature map from BRAM0 and performs max_pooling
- Since BRAM0 is dual-port, two rows are fetched at a time
- Performs MP on the imported 2 x 14 operands and writes the 2 x 7 result values to BRAM1
- When reading data from BRAM0, two rows must be fetched
- In the memory interface with BRAM0, the address-issuing port and the port receiving the BRAM output are each split in two
- One address steps through 1, 3, 5, ... and the other through 2, 4, 6, ..., and the data of the corresponding rows is read
- By the MP operation, the number of result rows written back (run_count_i0, run_count_i1) corresponds to half the number of rows read from BRAM0; see the sketch below
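A minimal MP compare-unit sketch matching the 2 x 14 -> 2 x 7 reduction above, taking one adjacent column pair from each of the two fetched rows per cycle; names and widths are assumptions.

```verilog
// MP core sketch: pairwise max within each of the two rows fetched
// through BRAM0's two ports, so 2 x 14 operands reduce to 2 x 7.
module mp_core #(
    parameter DW = 8
) (
    input  wire          clk,
    input  wire          valid_i,
    input  wire [DW-1:0] row0_a_i, row0_b_i, // adjacent pair, port-A row
    input  wire [DW-1:0] row1_a_i, row1_b_i, // adjacent pair, port-B row
    output reg           valid_o,
    output reg  [DW-1:0] row0_max_o,         // pooled value for BRAM1
    output reg  [DW-1:0] row1_max_o
);
    always @(posedge clk) begin
        valid_o    <= valid_i;
        row0_max_o <= (row0_a_i > row0_b_i) ? row0_a_i : row0_b_i;
        row1_max_o <= (row1_a_i > row1_b_i) ? row1_a_i : row1_b_i;
    end
endmodule
```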
- Simulation uses the Verilog testbench and the golden reference below
- Path: /SIM/
- Vivado 2021.2 simulator
- Vitis 2021.2
- Path: /SW/
- How to use
- Update the paths of the ifmap and weight txt files generated with rand() in golden_ref.c
- Update the txt-file open paths in the Verilog tb_GEMM
- Run the Vivado simulation and compare the ofmap generated by the C golden reference against the ofmap generated by the Verilog testbench
- For the block diagram of the conv layer, refer to /DOC/BLOCK_DIAGRAM/IFS_SA.drawio
- The MMU for the Conv operation and the operation core for the FC operation are sized to fit the project's target DNN model.
- Latency and resource usage can be adjusted by changing the module sizes via the parameter values inside the Verilog code.
- For FC and MP, operation latency and bandwidth can be changed by changing the number of cores; only FC7 and MP7 are listed above.
- instruction design
- scalable code
- timing variation check
- 2022/07/18 : 1st