CPLD Complex Programmable Logic Device FPGA Field Programmable Gate Array GAL Generic Array Logic HDL Hardware Description Language IEEE Institute of Electrical & Electronic Engineers IP Intellectual Property ILA Integrated Logic Analyzer ISE Integrated Software Environment ISP In-System Programming JKFF Jack-Kilby Flip Flop JTAG Joint Test Action Group LEC Logic Equivalence Checker LMG Logic Modeling Group LUT Look-Up Table NGC Native Generic Compiler OTP One-Time Programmable PACE Pin-out and Area Constraints Editor PAL Programmable Array Logic PCI Peripheral Component Interconnect PLA Programmable Logic Array TBW Test-Bench Waveform UCF User Constraints File VHDL VHSIC Hardware Description Language VHSIC Very High Speed Integrated Circuit XST Xilinx Synthesis Technology ABBREVIATIONS
SYSTEM DESIGN USING HDL (ECE43) # Digital system design using Verilog, Charles Roth, Lizy Kurian John, Byeong Kil Lee, 1st Edition, 2016, Cengage Learning 1 2.1, 2.2, 2.3 - 2.8, 2.11, 2.13 - 2.15 2 2.9, 2.10, 2.12, 2.16 - 2.19, 8.1, 8.2 3 3.1 - 3.4, 5.1, 5.2.1, 5.3 4 4.1 - 4.5, 4.8, 4.6, 4.7, 4.9, 4.11 5 6.1 - 6.5, 6.7 - 6.12 DESIGNING WITH FPGA
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 4 Example-1: Design of a 4:1 multiplexer using FPGA Configurable Logic Block in FPGA Each CLB in the FPGA contains two 4-variable function generators. It also contains two flip-flops which can be used for latching the function.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 5 As 4:1 mux contains 6 inputs, it is not possible to implement it using 1 CLB in the given FPGA. Therefore, the 4:1 mux can be decomposed into 2:1 mux blocks. Flip-flops are of no use here. M = S1'S0'I0 + S1'S0I1 + S1S0'I2 + S1S0I3 M1 = S0'I0 + S0I1 M2 = S0'I2 + S0I3 M = S1'M1 + S1M2
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 6 Instead of logic equations, modern FPGAs use LUT as a basic building block. For this particular design of 4:1 mux, the contents of LUT4 are as shown: Each LUT4 can implement 1-bit function of 4-input variables. Hence, 16 cells of SRAM are required for the input columns (“don’t care” terms need to be included as logic states in the LUT). 3 LUT4s require 48 SRAM cells. Therefore, this is an expensive implementation.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 7 If the CLB in the FPGA has a provision to combine the outputs of the function generators, then the 4:1 mux can be implemented using a single CLB. (e.g.: XC4000) This method requires 2 LUT4s and 1 LUT3. Hence, the number of SRAM cells required is, 16 + 16 + 8 = 40.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 8 Example-2: Circular Shift Register (Ring Counter) Even though FGs are of no use, they have to be used.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 9 Q1: How many CLBs are required to implement a 3-to-8 decoder? A1: If the decoder is implemented using logic gates, 3 NOT gates and 8 AND gates are required. When CLB is used, as there are 8 outputs, one for each input combination, 8 FGs are required. As each CLB contains 2 sets of (FG4+FF), the number of CLBs required is 4.  If LUT based FPGA is used, for each output, 8 SRAM cells and one 8:1 mux is required. Thus, 8 LUTs are required. But in the CLB, as each FG4 contains 16 SRAM cells (including the “don’t care” term), for the 8 FG4s, 16x8 = 128 SRAM cells are required. 1 0 0 0 0 0 0 0 0 0 0
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 10 Implementing functions using Shannon’s decomposition  Shannon’s expansion theorem helps to decompose a function containing larger number of variables, into a function containing lesser number of variables.  In the example shown, instead of a single 6-variable FG, two 5-variable FGs along with ½ of a 3rd FG are used to realize the function. Thus, Shannon’s expansion theorem helps in the reduction of hardware.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 11 Example-3: Consider Z = abcd’ef’ + a’b’c’def’ + b’cde’f By setting a = 0, Z = b’c’def’ + b’cde’f => Z0 By setting a = 1, Z = bcd’ef’ + b’cde’f => Z1  Therefore, two LUT5s along with either a 2:1 mux or another LUT5, can be utilized for implementing the function.  The number of terms in Z0 or Z1 does not matter, as this is going to be implemented by LUT.  If only LUT4 is available in the CLB, then the function needs to be decomposed further, by using “a” and “b” together.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 12 This method requires seven LUT4s, in general.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 13 Example-4: Consider Z = abcd’ef’+a’b’c’def’+b’cde’f Substituting a = 0 & b = 0, Y0 = c’def’ + cde’f Substituting a = 0 & b = 1, Y1 = 0 Substituting a = 1 & b = 0, Y2 = cde’f Substituting a = 1 & b = 1, Y3 = cd’ef’ As there is a null function, this requires only 5 LUT4s.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 14 Q2: What is the max. no. of LUT5s for realizing a 7-variable function? A2: Four LUT5s are required for implementing Y0, Y1, Y2 and Y3. Another LUT5 is required to implement the three initial terms. The last LUT5 is required to club the last term with the previous output. Thus, the total no. of LUT5s required is 6. However, in the last LUT5, one input remains unused, and it has to be considered as “don’t care”. Example-5: Implement a 7-variable function using 4-input LUTs and 2:1 multiplexers. (7-variable function = Two LUT6 + One 2:1 mux). (6-variable function = Two LUT5 + One 2:1 mux). (5-variable function = Two LUT4 + One 2:1 mux). Substituting accordingly, we obtain, (7-variable function = Eight LUT4 + Seven 2:1 mux).
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 15  If muxes are unavailable in CLB, then more LUTs are needed. Xilinx Spartan FPGA provides mux in addition to LUT4. A logic unit is these FPGAs is called as “slice”. S L I C E
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 16 REALIZATION OF 7-VARIABLE FUNCTION USING 4 SLICES Example-6: Implement the parity function A⊕B⊕C⊕D⊕E using 4- variable Function Generators.  For direct implementation, this 5-variable function requires only one LUT5.  Using Shannon’s expansion, this function can be decomposed into two 4-variable functions, and can be realized using two LUT4s and one 2:1 multiplexer.  If multiplexer is not present in the CLB, then it requires three LUT4s.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 17 CARRY CHAINS IN FPGA  Addition is a very common operation in digital circuits.  As LUT4 is a standard building block in FPGA, two LUT4s are required for sum and carry bits.  Thus, for an n-bit adder, ‘2n’ number of LUT4s are required.  But, if the FPGA can provide dedicated circuitry for generating and propagating carry bit to the next stage, then only ‘n’ number of LUT4s are required for sum bits.  The dedicated carry chain generates the carry bit in parallel.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 18 CASCADE CHAINS IN FPGA  For a function with large number of variables, the FPGAs provide cascade AND and cascade OR chains (for PoS and SoP terms).  Thus, instead of using separate FGs to perform AND or OR functions, the cascade circuitry can be used to create such functions.  Hence, for a 32 variable SoP function, only 8 LUT4s are required. But without the cascade chain, 11 LUT4s are required (8 + 2 + 1).  FPGAs such as Altera Stratix IV provide register chains as well.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 19 Examples of Logic blocks in commercial FPGAs  Kintex uses LUT6 in each slice. CLB contains 4 copies of the slice.  Xilinx Virtex and Spartan FPGAs use LUT4. Each slice contains two FGs, two muxes, two flip-flops, and additional logic. 1. Xilinx Kintex CLB
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 20  Each Stratix IV LM contains two LUT6 and two flip-flops. Each LUT6 has two independent inputs and four shared inputs. In addition, two 1-bit adders are built in, with carry chaining.  Flip-flops with register chaining allows to create shift registers. 2. Altera Stratix IV Logic Module
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 21  Fusion VersaTile consists of muxes and logic gates. Each block has 4 inputs – X1, X2, X3 and XC. The VersaTile block is of significantly finer grain than the LUT4 present in other FPGAs.  Each VersaTile can be configured as: 3-input logic function, or latch with (clear or set), or D flip-flop with (enable, clear or set). 3. Microsemi Fusion VersaTile
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 22 DEDICATED MULTIPLIERS IN FPGA  Multiplication is also a common operation, and for implementing it, several programmable logic blocks are required. In addition, such multiplier will be slower, because of the interconnecting switches.  Hence, some Xilinx and Altera FPGAs contain dedicated 18X18 multipliers. When multiplication of larger numbers are required, several of the built-in multipliers can be put together.  e.g., if A and B are of 32 bits, then they can be represented as: A=(C X 216)+D, B=(E X 216)+F & A X B = (CE X 232)+(DE+CF) X 216 + DF. Thus, 4 multipliers are required to generate the partial products CE, DE, CF & DF, which are later added by means of several adders.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 23 Cost of programmability  The logic block shown, for its configuration, requires totally 46 SRAM cells (276 transistors). There will be additional configuration bits required, for programmable interconnect and for programmable I/O. Thus, the flexibility of programmable points comes with a much higher additional cost of associated memory cells (SRAM/Flash).  e.g.: Xilinx Virtex-II XC2V40 (with 512 LUT4s & 88 I/O pins), needs 3,38,976 configuration bits. Virtex-II XC2V8000 (with 93,184 LUT4s & 1108 I/O pins), needs more than 26 million configuration bits.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 24 FPGAs and One-Hot state assignment  While implementing a state machine, in general, state encoding is performed with ‘n’ bits for 2n states. e.g.: for a machine with 4 states, 2-bit encoding has to be used. Increase in ‘n’ will be requiring more no. of logic blocks.  For faster implementation of the design, it is desirable to reduce the no. of logic blocks and interconnections. Hence, instead of the encoding method, one-hot method can be used, which will reduce the no. of logic blocks.  This method, in turn, will result in the increased no. of flip-flops; but this does not affect the implementation much, as each FPGA logic block contains two flip-flops.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 25  For the state graph shown, the encoding can be 00, 01, 10 and 11.  But with the usage of one-hot method, the state encoding will be 1000, 0100, 0010 and 0001. The states will use one flip-flop each.  The next state equation for the flip-flop Q3 can be written as, Q3 + = X1Q0Q1 ’Q2 ’Q3 ’ + X2Q0 ’Q1Q2 ’Q3 ’ + X3Q0 ’Q1 ’Q2Q3 ’ + X4Q0 ’Q1 ’Q2 ’Q3.  In the one-hot method, this equation will get reduced to, Q3 + = X1Q0 + X2Q1 + X3Q2 + X4Q3. Here, each term in the equation contains exactly one state variable. The output equations are: Z1 = X1Q0 + X3Q2, Z2 = X2Q1 + X4Q3. As terms contain one state variable each, this leads to fewer logic cells.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 26  In electronic designs, a “cell” is defined as the predesigned and precharacterized circuit element.  Thus, a cell contains pretested and prestored instances of circuit diagram, its circuit symbol, and its physical description (layout).
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 27  ASIC contains an exact number of gates that are required for the design. But FPGA contains arrays of gates, or arrays of LUTs. Thus, if a larger design needs to be implemented in FPGA, the ASIC designer needs to have an idea about the design being fit into a given FPGA.  For the designer, the number of gates inside FPGA is not a useful metric, as FPGA is programmable. Hence, a term called “equivalent gate count” is defined, as a count of the circuitry that can fit into a particular FPGA. This type of gate count is extremely difficult to compute, as it depends on the type of circuitry, the type of interconnections, and the available routing resources available in the FPGA. FPGA CAPACITY (Maximum gates versus usable gates)
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 28  One method for computing the equivalent gate count for a CLB is as follows: 2:1 mux = 4 gates, 3-input XOR gate = 6 gates, 4-input XOR gate = 9 gates, Flip-flop = 7 gates, and so on. Thus, the equivalent gate count for a CLB can be obtained. The total gate count can be estimated, by multiplying the equivalent gate count with the number of CLBs in the FPGA. In general, this type of gate count is likely to be higher than the gate count of the practical circuitry that is being realized.  Another method is to use the Benchmark circuits (e.g.: Benchmark suite prepared by PREP [Programmable Electronics Performance company]). For example, if an ASIC contains 2000 gates, and if an FPGA can fit 20 copies of the ASIC, with no routing between the copies, then the maximum gate count of the FPGA can be considered as 40,000.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 29  Synthesis is the process of translation of an abstract high-level design to a detailed circuit description.  The synthesis tool implements the digital system as an interconnection of gates, flip-flops, registers, counters, muxes, adders, and other basic building blocks.  The representation of the design as a logic schematic, together with an associated wirelist, is called as netlist. DESIGN TRANSLATION (SYNTHESIS) results in AND gate. results in AND gate followed by flip-flop.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 30  The synthesis tool performs a line-by- line translation of HDL into hardware.  The synthesis tool selects components that are available in the library.  In general, ‘case’ statement results in muxes, comparison results in adders, shift results in registers, and so on.  For implementation with different technologies, different component libraries can be provided.  The resulting hardware is optimized later on.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 31 Synthesis of a ‘case’ statement module case_eg (a,b); input [1:0] a; output reg [1:0] b; always @(a) begin case (a) 0: b<=1; 1: b<=3; 2: b<=0; 3: b<=1; endcase end endmodule Synthesized circuit before optimization Logic optimization Synthesized circuit after optimization
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 32 Unintentional latch creation Module latch_eg (a,b); input [1:0] a; output reg b; always @(a) begin case (a) 0: b<=1; 1: b<=0; 2: b<=1; endcase end endmodule Initial output of naïve synthesizer Optimized output of naïve synthesizer
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 33 Output of optimizing synthesizer Output of naïve synthesizer Solution to eliminate latch Module latch_eg (a,b); input [1:0] a; output reg b; always @(a) begin case (a) 0: b<=1; 1: b<=0; 2: b<=1; 3: b<=0; endcase end endmodule
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 34 Synthesis of ‘if’ statements if (A == 1’b1) begin nextstate <= 3; Z <= 1; end if (A == 1’b1) begin nextstate <= 3; Z <= 1; end else begin nextstate <= 2; Z <= 0; end Ambiguous code, that results in latch Unambiguous code module if_eg (A,B,C,D,E,Z); input A,B; input [2:0] C,D,E; output reg [2:0] Z; always @(A or B) begin if (A == 1’b1) Z <= C; else if (B == 1’b0) Z <= D; else Z <= E; end endmodule Synthesized output
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 35 Synthesis of arithmetic components module ar_eg (clk,A,B,ge,acc,count); input clk; input [3:0] A,B; inout [3:0] acc,count; output ge; reg [3:0] acc_t, count_t; assign acc = acc_t; assign count = count_t; assign ge = (A >= B); always @(posedge clk) begin acc_t <= acc +B; count_t <= count + 1; end endmodule Synthesized output
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 36 Example-7: What hardware gets resulted for the statement, assign LE = (A <= B); where A and B are 4-bit vectors?  The symbol “<=” is a relational operator over here.  The following statement inside the ‘always’ block, LE <= (A <= B); results in the same hardware. Example-8: What is the optimized hardware for, assign EQ3 = (A == 3); where A is 4-bit vector?  A naïve synthesizer may produce a 4-bit comparator, with ‘A’ and ‘3’ as inputs.  For optimization, the statement can be altered as: assign EQ3 = ~A[3]&~A[2]&A[1]&A[0];
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 37 Area of the silicon chip: Minimum Power consumed: Minimum Speed of operation: Maximum Size of the product: Optimum Weight of the product: Optimum Memory capacity: Maximum Cost of the product: Minimum Delay of operation: Minimum Ideal requirements (Practical tradeoffs) Area, power and delay optimizations Area & delay of a circuit are inversely related (e.g.: serial v/s parallel).  Energy & delay of a circuit are also inversely related (more switching implies increased dynamic power).  Thus, Area-Time (AT) product and Energy-Delay (ED) product are the metrics used, to qualify the circuit. The path with the longest delay in the circuit is called as the “critical path”.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 38 MAPPING, PLACEMENT AND ROUTING  These are the 3 major steps that happen, to transform the design that is in the netlist form, to the appropriate target technology (MPGA, CPLD, FPGA, ASIC).  Mapping is the process of translating the design into the available building blocks in the target technology. [e.g.: LUT with mux (Xilinx), Mux with gates (Microsemi)].  In other words, it is the process of binding the technology- dependent circuits of the target technology to the technology- independent circuits that are in the design.  In case of FPGA, the design has to be mapped into muxes, LUTs etc. In case of ASIC, the design has to be mapped into the standard cells that are available in the library (e.g.: logic gates, muxes, decoders, encoders, comparators, counters etc.)
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 39  Placement is the process of taking the defined logic & I/O blocks from the technology mapper, and assigning them to the physical locations of the target implementation. Routing is the process of interconnecting those blocks and sub-blocks on the target implementation.  “Place & route” are often done along with each other. Two of the popular algorithms used for the same purpose are, ‘Simulated annealing’ and ‘Iterative improvement’.
07/03/2019 Aravinda K., Dept. of E&C, NHCE, Bengaluru 40  In metallurgy, annealing is the process utilized to toughen the metal, by heating it, and then cooling it slowly, in a series of steps. The temperature is kept high in the beginning, and it is reduced gradually in the next steps.  In a similar fashion, for placing & routing, the simulated annealing algorithm takes bigger risks in the beginning, by making random modifications for a feasible solution, and gradually arrives at an optimal solution. In the beginning, just like high temperature, risky moves are performed. In the next steps, as the temperature is reduced, there will be decrease in the probability of occurrence of bad moves.  In contrast, the iterative improvement algorithm accepts only better solutions in each step. Such algorithms are called as ‘greedy’. At the end of simulated annealing, the algorithm has to be greedy, so as to accept only positive moves.
07/03/2019 41Aravinda K., Dept. of E&C, NHCE, Bengaluru A S I C D E S I G N F L O W
07/03/2019 42Aravinda K., Dept. of E&C, NHCE, Bengaluru

System design using HDL - Module 5

  • 2.
    CPLD Complex ProgrammableLogic Device FPGA Field Programmable Gate Array GAL Generic Array Logic HDL Hardware Description Language IEEE Institute of Electrical & Electronic Engineers IP Intellectual Property ILA Integrated Logic Analyzer ISE Integrated Software Environment ISP In-System Programming JKFF Jack-Kilby Flip Flop JTAG Joint Test Action Group LEC Logic Equivalence Checker LMG Logic Modeling Group LUT Look-Up Table NGC Native Generic Compiler OTP One-Time Programmable PACE Pin-out and Area Constraints Editor PAL Programmable Array Logic PCI Peripheral Component Interconnect PLA Programmable Logic Array TBW Test-Bench Waveform UCF User Constraints File VHDL VHSIC Hardware Description Language VHSIC Very High Speed Integrated Circuit XST Xilinx Synthesis Technology ABBREVIATIONS
  • 3.
    SYSTEM DESIGN USINGHDL (ECE43) # Digital system design using Verilog, Charles Roth, Lizy Kurian John, Byeong Kil Lee, 1st Edition, 2016, Cengage Learning 1 2.1, 2.2, 2.3 - 2.8, 2.11, 2.13 - 2.15 2 2.9, 2.10, 2.12, 2.16 - 2.19, 8.1, 8.2 3 3.1 - 3.4, 5.1, 5.2.1, 5.3 4 4.1 - 4.5, 4.8, 4.6, 4.7, 4.9, 4.11 5 6.1 - 6.5, 6.7 - 6.12 DESIGNING WITH FPGA
  • 4.
    07/03/2019 Aravinda K., Dept.of E&C, NHCE, Bengaluru 4 Example-1: Design of a 4:1 multiplexer using FPGA Configurable Logic Block in FPGA Each CLB in the FPGA contains two 4-variable function generators. It also contains two flip-flops which can be used for latching the function.
  • 5.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 5 As 4:1 mux contains 6 inputs, it is not possible to implement it using 1 CLB in the given FPGA. Therefore, the 4:1 mux can be decomposed into 2:1 mux blocks. Flip-flops are of no use here. M = S1'S0'I0 + S1'S0I1 + S1S0'I2 + S1S0I3 M1 = S0'I0 + S0I1 M2 = S0'I2 + S0I3 M = S1'M1 + S1M2
  • 6.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 6 Instead of logic equations, modern FPGAs use LUT as a basic building block. For this particular design of 4:1 mux, the contents of LUT4 are as shown: Each LUT4 can implement 1-bit function of 4-input variables. Hence, 16 cells of SRAM are required for the input columns (“don’t care” terms need to be included as logic states in the LUT). 3 LUT4s require 48 SRAM cells. Therefore, this is an expensive implementation.
  • 7.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 7 If the CLB in the FPGA has a provision to combine the outputs of the function generators, then the 4:1 mux can be implemented using a single CLB. (e.g.: XC4000) This method requires 2 LUT4s and 1 LUT3. Hence, the number of SRAM cells required is, 16 + 16 + 8 = 40.
  • 8.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 8 Example-2: Circular Shift Register (Ring Counter) Even though FGs are of no use, they have to be used.
  • 9.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 9 Q1: How many CLBs are required to implement a 3-to-8 decoder? A1: If the decoder is implemented using logic gates, 3 NOT gates and 8 AND gates are required. When CLB is used, as there are 8 outputs, one for each input combination, 8 FGs are required. As each CLB contains 2 sets of (FG4+FF), the number of CLBs required is 4.  If LUT based FPGA is used, for each output, 8 SRAM cells and one 8:1 mux is required. Thus, 8 LUTs are required. But in the CLB, as each FG4 contains 16 SRAM cells (including the “don’t care” term), for the 8 FG4s, 16x8 = 128 SRAM cells are required. 1 0 0 0 0 0 0 0 0 0 0
  • 10.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 10 Implementing functions using Shannon’s decomposition  Shannon’s expansion theorem helps to decompose a function containing larger number of variables, into a function containing lesser number of variables.  In the example shown, instead of a single 6-variable FG, two 5-variable FGs along with ½ of a 3rd FG are used to realize the function. Thus, Shannon’s expansion theorem helps in the reduction of hardware.
  • 11.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 11 Example-3: Consider Z = abcd’ef’ + a’b’c’def’ + b’cde’f By setting a = 0, Z = b’c’def’ + b’cde’f => Z0 By setting a = 1, Z = bcd’ef’ + b’cde’f => Z1  Therefore, two LUT5s along with either a 2:1 mux or another LUT5, can be utilized for implementing the function.  The number of terms in Z0 or Z1 does not matter, as this is going to be implemented by LUT.  If only LUT4 is available in the CLB, then the function needs to be decomposed further, by using “a” and “b” together.
  • 12.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 12 This method requires seven LUT4s, in general.
  • 13.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 13 Example-4: Consider Z = abcd’ef’+a’b’c’def’+b’cde’f Substituting a = 0 & b = 0, Y0 = c’def’ + cde’f Substituting a = 0 & b = 1, Y1 = 0 Substituting a = 1 & b = 0, Y2 = cde’f Substituting a = 1 & b = 1, Y3 = cd’ef’ As there is a null function, this requires only 5 LUT4s.
  • 14.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 14 Q2: What is the max. no. of LUT5s for realizing a 7-variable function? A2: Four LUT5s are required for implementing Y0, Y1, Y2 and Y3. Another LUT5 is required to implement the three initial terms. The last LUT5 is required to club the last term with the previous output. Thus, the total no. of LUT5s required is 6. However, in the last LUT5, one input remains unused, and it has to be considered as “don’t care”. Example-5: Implement a 7-variable function using 4-input LUTs and 2:1 multiplexers. (7-variable function = Two LUT6 + One 2:1 mux). (6-variable function = Two LUT5 + One 2:1 mux). (5-variable function = Two LUT4 + One 2:1 mux). Substituting accordingly, we obtain, (7-variable function = Eight LUT4 + Seven 2:1 mux).
  • 15.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 15  If muxes are unavailable in CLB, then more LUTs are needed. Xilinx Spartan FPGA provides mux in addition to LUT4. A logic unit is these FPGAs is called as “slice”. S L I C E
  • 16.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 16 REALIZATION OF 7-VARIABLE FUNCTION USING 4 SLICES Example-6: Implement the parity function A⊕B⊕C⊕D⊕E using 4- variable Function Generators.  For direct implementation, this 5-variable function requires only one LUT5.  Using Shannon’s expansion, this function can be decomposed into two 4-variable functions, and can be realized using two LUT4s and one 2:1 multiplexer.  If multiplexer is not present in the CLB, then it requires three LUT4s.
  • 17.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 17 CARRY CHAINS IN FPGA  Addition is a very common operation in digital circuits.  As LUT4 is a standard building block in FPGA, two LUT4s are required for sum and carry bits.  Thus, for an n-bit adder, ‘2n’ number of LUT4s are required.  But, if the FPGA can provide dedicated circuitry for generating and propagating carry bit to the next stage, then only ‘n’ number of LUT4s are required for sum bits.  The dedicated carry chain generates the carry bit in parallel.
  • 18.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 18 CASCADE CHAINS IN FPGA  For a function with large number of variables, the FPGAs provide cascade AND and cascade OR chains (for PoS and SoP terms).  Thus, instead of using separate FGs to perform AND or OR functions, the cascade circuitry can be used to create such functions.  Hence, for a 32 variable SoP function, only 8 LUT4s are required. But without the cascade chain, 11 LUT4s are required (8 + 2 + 1).  FPGAs such as Altera Stratix IV provide register chains as well.
  • 19.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 19 Examples of Logic blocks in commercial FPGAs  Kintex uses LUT6 in each slice. CLB contains 4 copies of the slice.  Xilinx Virtex and Spartan FPGAs use LUT4. Each slice contains two FGs, two muxes, two flip-flops, and additional logic. 1. Xilinx Kintex CLB
  • 20.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 20  Each Stratix IV LM contains two LUT6 and two flip-flops. Each LUT6 has two independent inputs and four shared inputs. In addition, two 1-bit adders are built in, with carry chaining.  Flip-flops with register chaining allows to create shift registers. 2. Altera Stratix IV Logic Module
  • 21.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 21  Fusion VersaTile consists of muxes and logic gates. Each block has 4 inputs – X1, X2, X3 and XC. The VersaTile block is of significantly finer grain than the LUT4 present in other FPGAs.  Each VersaTile can be configured as: 3-input logic function, or latch with (clear or set), or D flip-flop with (enable, clear or set). 3. Microsemi Fusion VersaTile
  • 22.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 22 DEDICATED MULTIPLIERS IN FPGA  Multiplication is also a common operation, and for implementing it, several programmable logic blocks are required. In addition, such multiplier will be slower, because of the interconnecting switches.  Hence, some Xilinx and Altera FPGAs contain dedicated 18X18 multipliers. When multiplication of larger numbers are required, several of the built-in multipliers can be put together.  e.g., if A and B are of 32 bits, then they can be represented as: A=(C X 216)+D, B=(E X 216)+F & A X B = (CE X 232)+(DE+CF) X 216 + DF. Thus, 4 multipliers are required to generate the partial products CE, DE, CF & DF, which are later added by means of several adders.
  • 23.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 23 Cost of programmability  The logic block shown, for its configuration, requires totally 46 SRAM cells (276 transistors). There will be additional configuration bits required, for programmable interconnect and for programmable I/O. Thus, the flexibility of programmable points comes with a much higher additional cost of associated memory cells (SRAM/Flash).  e.g.: Xilinx Virtex-II XC2V40 (with 512 LUT4s & 88 I/O pins), needs 3,38,976 configuration bits. Virtex-II XC2V8000 (with 93,184 LUT4s & 1108 I/O pins), needs more than 26 million configuration bits.
  • 24.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 24 FPGAs and One-Hot state assignment  While implementing a state machine, in general, state encoding is performed with ‘n’ bits for 2n states. e.g.: for a machine with 4 states, 2-bit encoding has to be used. Increase in ‘n’ will be requiring more no. of logic blocks.  For faster implementation of the design, it is desirable to reduce the no. of logic blocks and interconnections. Hence, instead of the encoding method, one-hot method can be used, which will reduce the no. of logic blocks.  This method, in turn, will result in the increased no. of flip-flops; but this does not affect the implementation much, as each FPGA logic block contains two flip-flops.
  • 25.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 25  For the state graph shown, the encoding can be 00, 01, 10 and 11.  But with the usage of one-hot method, the state encoding will be 1000, 0100, 0010 and 0001. The states will use one flip-flop each.  The next state equation for the flip-flop Q3 can be written as, Q3 + = X1Q0Q1 ’Q2 ’Q3 ’ + X2Q0 ’Q1Q2 ’Q3 ’ + X3Q0 ’Q1 ’Q2Q3 ’ + X4Q0 ’Q1 ’Q2 ’Q3.  In the one-hot method, this equation will get reduced to, Q3 + = X1Q0 + X2Q1 + X3Q2 + X4Q3. Here, each term in the equation contains exactly one state variable. The output equations are: Z1 = X1Q0 + X3Q2, Z2 = X2Q1 + X4Q3. As terms contain one state variable each, this leads to fewer logic cells.
  • 26.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 26  In electronic designs, a “cell” is defined as the predesigned and precharacterized circuit element.  Thus, a cell contains pretested and prestored instances of circuit diagram, its circuit symbol, and its physical description (layout).
  • 27.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 27  ASIC contains an exact number of gates that are required for the design. But FPGA contains arrays of gates, or arrays of LUTs. Thus, if a larger design needs to be implemented in FPGA, the ASIC designer needs to have an idea about the design being fit into a given FPGA.  For the designer, the number of gates inside FPGA is not a useful metric, as FPGA is programmable. Hence, a term called “equivalent gate count” is defined, as a count of the circuitry that can fit into a particular FPGA. This type of gate count is extremely difficult to compute, as it depends on the type of circuitry, the type of interconnections, and the available routing resources available in the FPGA. FPGA CAPACITY (Maximum gates versus usable gates)
  • 28.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 28  One method for computing the equivalent gate count for a CLB is as follows: 2:1 mux = 4 gates, 3-input XOR gate = 6 gates, 4-input XOR gate = 9 gates, Flip-flop = 7 gates, and so on. Thus, the equivalent gate count for a CLB can be obtained. The total gate count can be estimated, by multiplying the equivalent gate count with the number of CLBs in the FPGA. In general, this type of gate count is likely to be higher than the gate count of the practical circuitry that is being realized.  Another method is to use the Benchmark circuits (e.g.: Benchmark suite prepared by PREP [Programmable Electronics Performance company]). For example, if an ASIC contains 2000 gates, and if an FPGA can fit 20 copies of the ASIC, with no routing between the copies, then the maximum gate count of the FPGA can be considered as 40,000.
  • 29.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 29  Synthesis is the process of translation of an abstract high-level design to a detailed circuit description.  The synthesis tool implements the digital system as an interconnection of gates, flip-flops, registers, counters, muxes, adders, and other basic building blocks.  The representation of the design as a logic schematic, together with an associated wirelist, is called as netlist. DESIGN TRANSLATION (SYNTHESIS) results in AND gate. results in AND gate followed by flip-flop.
  • 30.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 30  The synthesis tool performs a line-by- line translation of HDL into hardware.  The synthesis tool selects components that are available in the library.  In general, ‘case’ statement results in muxes, comparison results in adders, shift results in registers, and so on.  For implementation with different technologies, different component libraries can be provided.  The resulting hardware is optimized later on.
  • 31.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 31 Synthesis of a ‘case’ statement module case_eg (a,b); input [1:0] a; output reg [1:0] b; always @(a) begin case (a) 0: b<=1; 1: b<=3; 2: b<=0; 3: b<=1; endcase end endmodule Synthesized circuit before optimization Logic optimization Synthesized circuit after optimization
  • 32.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 32 Unintentional latch creation Module latch_eg (a,b); input [1:0] a; output reg b; always @(a) begin case (a) 0: b<=1; 1: b<=0; 2: b<=1; endcase end endmodule Initial output of naïve synthesizer Optimized output of naïve synthesizer
  • 33.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 33 Output of optimizing synthesizer Output of naïve synthesizer Solution to eliminate latch Module latch_eg (a,b); input [1:0] a; output reg b; always @(a) begin case (a) 0: b<=1; 1: b<=0; 2: b<=1; 3: b<=0; endcase end endmodule
  • 34.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 34 Synthesis of ‘if’ statements if (A == 1’b1) begin nextstate <= 3; Z <= 1; end if (A == 1’b1) begin nextstate <= 3; Z <= 1; end else begin nextstate <= 2; Z <= 0; end Ambiguous code, that results in latch Unambiguous code module if_eg (A,B,C,D,E,Z); input A,B; input [2:0] C,D,E; output reg [2:0] Z; always @(A or B) begin if (A == 1’b1) Z <= C; else if (B == 1’b0) Z <= D; else Z <= E; end endmodule Synthesized output
  • 35.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 35 Synthesis of arithmetic components module ar_eg (clk,A,B,ge,acc,count); input clk; input [3:0] A,B; inout [3:0] acc,count; output ge; reg [3:0] acc_t, count_t; assign acc = acc_t; assign count = count_t; assign ge = (A >= B); always @(posedge clk) begin acc_t <= acc +B; count_t <= count + 1; end endmodule Synthesized output
  • 36.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 36 Example-7: What hardware gets resulted for the statement, assign LE = (A <= B); where A and B are 4-bit vectors?  The symbol “<=” is a relational operator over here.  The following statement inside the ‘always’ block, LE <= (A <= B); results in the same hardware. Example-8: What is the optimized hardware for, assign EQ3 = (A == 3); where A is 4-bit vector?  A naïve synthesizer may produce a 4-bit comparator, with ‘A’ and ‘3’ as inputs.  For optimization, the statement can be altered as: assign EQ3 = ~A[3]&~A[2]&A[1]&A[0];
  • 37.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 37 Area of the silicon chip: Minimum Power consumed: Minimum Speed of operation: Maximum Size of the product: Optimum Weight of the product: Optimum Memory capacity: Maximum Cost of the product: Minimum Delay of operation: Minimum Ideal requirements (Practical tradeoffs) Area, power and delay optimizations Area & delay of a circuit are inversely related (e.g.: serial v/s parallel).  Energy & delay of a circuit are also inversely related (more switching implies increased dynamic power).  Thus, Area-Time (AT) product and Energy-Delay (ED) product are the metrics used, to qualify the circuit. The path with the longest delay in the circuit is called as the “critical path”.
  • 38.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 38 MAPPING, PLACEMENT AND ROUTING  These are the 3 major steps that happen, to transform the design that is in the netlist form, to the appropriate target technology (MPGA, CPLD, FPGA, ASIC).  Mapping is the process of translating the design into the available building blocks in the target technology. [e.g.: LUT with mux (Xilinx), Mux with gates (Microsemi)].  In other words, it is the process of binding the technology- dependent circuits of the target technology to the technology- independent circuits that are in the design.  In case of FPGA, the design has to be mapped into muxes, LUTs etc. In case of ASIC, the design has to be mapped into the standard cells that are available in the library (e.g.: logic gates, muxes, decoders, encoders, comparators, counters etc.)
  • 39.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 39  Placement is the process of taking the defined logic & I/O blocks from the technology mapper, and assigning them to the physical locations of the target implementation. Routing is the process of interconnecting those blocks and sub-blocks on the target implementation.  “Place & route” are often done along with each other. Two of the popular algorithms used for the same purpose are, ‘Simulated annealing’ and ‘Iterative improvement’.
  • 40.
    07/03/2019 Aravinda K.,Dept. of E&C, NHCE, Bengaluru 40  In metallurgy, annealing is the process utilized to toughen the metal, by heating it, and then cooling it slowly, in a series of steps. The temperature is kept high in the beginning, and it is reduced gradually in the next steps.  In a similar fashion, for placing & routing, the simulated annealing algorithm takes bigger risks in the beginning, by making random modifications for a feasible solution, and gradually arrives at an optimal solution. In the beginning, just like high temperature, risky moves are performed. In the next steps, as the temperature is reduced, there will be decrease in the probability of occurrence of bad moves.  In contrast, the iterative improvement algorithm accepts only better solutions in each step. Such algorithms are called as ‘greedy’. At the end of simulated annealing, the algorithm has to be greedy, so as to accept only positive moves.
  • 41.
    07/03/2019 41Aravinda K.,Dept. of E&C, NHCE, Bengaluru A S I C D E S I G N F L O W
  • 42.
    07/03/2019 42Aravinda K.,Dept. of E&C, NHCE, Bengaluru