NCL Decoder Design

Theory

The decoder is like a backwards multiplexer: it uses the selector bits to output DATA1 on a single output and DATA0 on the others. This can be used to enable one of many modules, or just to change encoding from binary to one-hot.

Design

If you remember how we made the MUX module, we had a loop that generated a set of selector lines (DATA0 or DATA1 from each selector input) for each case. We will re-use this, but we will generate a TRUE and FALSE signal for each case (DATA0 and DATA1 of the corresponding output). The TRUE gate for each case will be a THNN, and the FALSE gate will be a TH1N. Here is what a Decoder2 module would look like:

DMUX2

The inputs to the TH1N gates (FALSE) are the opposing rails to the THNN gate of the same case.

  • CaseTrue='All bits for this case set'
  • CaseFalse='Any other bit set'

Any NULL input produces at least one NULL output. The THNN gates (TRUE) can’t set because each of them will be missing an input. For any particular incomplete input (one bit missing), two cases are still possible: one with the missing bit set and one with it clear. The FALSE outputs of those two cases will remain off, because the only rails that could set them are the DATA0 and DATA1, respectively, of the missing bit.
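As a rough check of the argument above, here is a small Python sketch of the decoder. It is a behavioural model, not the VHDL module: it only looks at the NULL->DATA direction (no hysteresis), and the names `Pair` and `decode` are made up for illustration. A dual-rail signal is written as a (DATA0, DATA1) tuple.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (DATA0 rail, DATA1 rail); (0, 0) is NULL


def decode(selector: List[Pair]) -> List[Pair]:
    """Combinational view of the decoder, starting from an all-NULL state."""
    outputs = []
    for case in range(2 ** len(selector)):
        # For each selector bit: the rail that matches this case, and the opposing rail.
        matching = [sel[(case >> bit) & 1] for bit, sel in enumerate(selector)]
        opposing = [sel[1 - ((case >> bit) & 1)] for bit, sel in enumerate(selector)]
        case_true = int(all(matching))   # THNN: every matching rail is set
        case_false = int(any(opposing))  # TH1N: any opposing rail is set
        outputs.append((case_false, case_true))
    return outputs


# Selector bit 0 = DATA1, bit 1 = DATA0 selects case 1.
print(decode([(0, 1), (1, 0)]))
# Selector bit 1 is still NULL: cases 1 and 3 stay NULL.
print(decode([(0, 1), (0, 0)]))
```

With bit 1 missing, the two cases that differ only in bit 1 (cases 1 and 3) stay NULL, while the other cases already report DATA0.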

The decoder is actually a fairly simple gate. It is possible to split the DMUX into two parts: DMUX1, and DMUX0. These components would output the DATA1 and DATA0 lines respectively. They are not valid as complete NCL components, but they are useful: You can make a MUX by using a DMUX1 and 2*NumOptions TH22 gates. This might be especially useful when making a MUX that takes in multi-bit options. A single DMUX1 would be used to generate the control signals, and each signal of each bit of each option would be gated with the TH22 gates.

Input Completeness

So, I have been doing more reading, and I found a concept that I think I glossed over up to this point. Input Completeness is the condition that the output should not change until all inputs are available. This must hold for both NULL->DATA and DATA->NULL wavefronts. I don’t actually understand why this is necessary yet.

I had vaguely considered the concept as whether or not internal lines would ever toggle more than once during a single data cycle, but I thought that since all data was expressed by asserted lines, a system couldn’t toggle as long as there were no inverters. Even if there was feedback, none of the gates use complements of inputs, so adding more inputs either sets the gate or leaves it alone. There is no way to clear a set line without clearing an input. I will look into the reasons this condition is necessary at some point.

Quick thing: I will be using the term CSOP a bunch. It means Canonical Sum-of-Product. This is the version of the equation that has all of the truth table rows brought out separately. Even if the function can be optimized to eliminate a variable from a term or two, that would violate the rules of CSOP.

The NULL->DATA Wavefront

If the circuit is initially NULL (inputs, outputs, internals) then the outputs cannot change until all inputs are DATA. The simplest way to do this is to use the CSOP implementation. With CSOP, every input is used in one of the AND-Plane gates (either as DATA0 or DATA1). As such, none of the AND-Plane gates can trigger until all of the inputs have values.

The AND-Plane is the column of THNN gates that all the inputs tie into (all possible combinations of input DATA values).

The DATA->NULL Wavefront

The DATA->NULL transition for any individual gate is held until all its inputs go to NULL. As such, once an output is set, it won’t clear until its inputs clear. Unfortunately that only applies to the inputs involved in setting the output; in CSOP, again, this is all of them. If the output is not constructed with CSOP, then in some cases, some inputs won’t affect the outputs (think the unselected inputs of a MUX).

Solutions

It is not necessary to implement the function with CSOP. You can take the logic function and add (A.0+A.1) to the product terms that are missing A, for example. The function can then be simplified or expanded from there. This is described on page 17 (section 3.1) of:

 Smith, Scott C., and Jia Di. Designing Asynchronous Circuits Using NULL Convention Logic (NCL). San Rafael, Calif.: Morgan & Claypool, 2009. Print. Synthesis Lectures on Digital Circuits and Systems #23.

I haven’t found an openly available source for this. If you are a student, check your university’s library website. If you do find a source, comment it.
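To make the (A.0+A.1) trick concrete, here is a small Python sketch using a two-input AND as the example function. The example function and all names (`and_naive`, `and_complete`, `is_input_complete`) are my own choices for illustration, not taken from the book. It only checks the NULL->DATA direction, treating the gates as plain combinational logic.

```python
from itertools import product

NULL, DATA0, DATA1 = (0, 0), (1, 0), (0, 1)  # (DATA0 rail, DATA1 rail)


def and_naive(a, b):
    """Minimised dual-rail AND: Z.0 = A.0 + B.0, Z.1 = A.1*B.1."""
    return (a[0] | b[0], a[1] & b[1])


def and_complete(a, b):
    """Same function with the missing input added to each Z.0 product term:
    Z.0 = A.0*(B.0+B.1) + B.0*(A.0+A.1)."""
    z0 = (a[0] & (b[0] | b[1])) | (b[0] & (a[0] | a[1]))
    return (z0, a[1] & b[1])


def is_input_complete(fn):
    """True if the output stays NULL whenever any input is still NULL."""
    for a, b in product([NULL, DATA0, DATA1], repeat=2):
        if NULL in (a, b) and fn(a, b) != NULL:
            return False
    return True


print(is_input_complete(and_naive))     # False: A.0 alone already sets Z.0
print(is_input_complete(and_complete))  # True
```

Both versions compute the same values once all inputs are DATA; only the second one waits for both inputs before asserting anything.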

NCL Multiplexer Implementation

Design Recap

MUX4

4-Option Example

for Case in 0 to N-1
  [build CaseBits with DATA0's and DATA1's]
  -- CaseBits is a concatenated signal from the iSelector input
  Selectors(Case) <= THNN(CaseBits)

  GatedCase0 <= TH22(Selectors(Case), iOptions(Case).DATA0)
  GatedCase1 <= TH22(Selectors(Case), iOptions(Case).DATA1)
next Case

output.DATA0 <= TH1N(Gated00, Gated10, Gated20, Gated30, ...)
output.DATA1 <= TH1N(Gated01, Gated11, Gated21, Gated31, ...)

Generic pseudo-VHDL

Implementation

Remember the Full Adder’s un-optimized version? If you look at the implementation, you’ll see a chunk of code at the top that generates a one-hot encoding of all cases. We are going to use that for our internal Selectors signal:

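-- 'case' is a reserved word in VHDL, so the loop index is named iCase.
-- The 3-bit width passed to to_unsigned limits this to 8 options.
cases: for iCase in 0 to NumOptions-1 generate
  bits: for iBit in 0 to NumSelectors-1 generate

    -- Selector bit iBit is 0 in this case: use its DATA0 rail
    Input0Selection: if (to_unsigned(2**iBit, 3) and to_unsigned(iCase, 3)) = 0 generate
      selectorInputs(iCase)(iBit) <= iSelector(iBit).DATA0;
    end generate;

    -- Selector bit iBit is 1 in this case: use its DATA1 rail
    Input1Selection: if (to_unsigned(2**iBit, 3) and to_unsigned(iCase, 3)) > 0 generate
      selectorInputs(iCase)(iBit) <= iSelector(iBit).DATA1;
    end generate;
  end generate;

  -- THNN gate: set only when every selector rail for this case is set
  CaseSelectorGate: THmn
    generic map(M => NumSelectors, N => NumSelectors)
    port map(inputs => selectorInputs(iCase),
             output => Selectors(iCase));

 end generate;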

Next, we need to gate the two lines (DATA0 and DATA1) for each option, which will NULL them if they are not the selected signal:

cases: for iCase in 0 to NumOptions-1 generate
  bits: for iBit in 0 to NumSelectors-1 generate

    Input0Selection: if (to_unsigned(2**iBit, 3) and to_unsigned(iCase, 3)) = 0 generate
      selectorInputs(iCase)(iBit) <= iSelector(iBit).DATA0;
    end generate;

    Input1Selection: if (to_unsigned(2**iBit, 3) and to_unsigned(iCase, 3)) > 0 generate
      selectorInputs(iCase)(iBit) <= iSelector(iBit).DATA1;
    end generate;
  end generate;

  CaseSelectorGate: THmn
    generic map(M => NumSelectors, N => NumSelectors)
    port map(inputs => selectorInputs(iCase),
             output => Selectors(iCase));

  -- TH22: pass this option's DATA0 rail only when the case is selected
  Gated0: THmn
    generic map(M => 2, N => 2)
    port map(inputs(0) => Selectors(iCase),
             inputs(1) => iOptions(iCase).DATA0,
             output => GatedOptions0(iCase));

  -- TH22: pass this option's DATA1 rail only when the case is selected
  Gated1: THmn
    generic map(M => 2, N => 2)
    port map(inputs(0) => Selectors(iCase),
             inputs(1) => iOptions(iCase).DATA1,
             output => GatedOptions1(iCase));

 end generate;

Finally, take all those gated options and OR the signals together, so whichever one is selected will drive the line to a 1 if it is set:

cases: for iCase in 0 to NumOptions-1 generate
  bits: for iBit in 0 to NumSelectors-1 generate

    Input0Selection: if (to_unsigned(2**iBit, 3) and to_unsigned(iCase, 3)) = 0 generate
      selectorInputs(iCase)(iBit) <= iSelector(iBit).DATA0;
    end generate;

    Input1Selection: if (to_unsigned(2**iBit, 3) and to_unsigned(iCase, 3)) > 0 generate
      selectorInputs(iCase)(iBit) <= iSelector(iBit).DATA1;
    end generate;
  end generate;

  CaseSelectorGate: THmn
    generic map(M => NumSelectors, N => NumSelectors)
    port map(inputs => selectorInputs(iCase),
             output => Selectors(iCase));

  Gated0: THmn
    generic map(M => 2, N => 2)
    port map(inputs(0) => Selectors(iCase),
             inputs(1) => iOptions(iCase).DATA0,
             output => GatedOptions0(iCase));

  Gated1: THmn
    generic map(M => 2, N => 2)
    port map(inputs(0) => Selectors(iCase),
             inputs(1) => iOptions(iCase).DATA1,
             output => GatedOptions1(iCase));

 end generate;

-- TH1N gates: combine the gated rails of every option (outside the loop,
-- so the whole vector is mapped rather than a single indexed element)
o0: THmn
  generic map(M => 1, N => NumOptions)
  port map(inputs => GatedOptions0,
           output => output.DATA0);

o1: THmn
  generic map(M => 1, N => NumOptions)
  port map(inputs => GatedOptions1,
           output => output.DATA1);

That’s all the logic then, but we need to add the wrapping structures (entity declaration, architecture declaration, and internal signal declarations). This module will have one generic parameter (NumOptions), and a constant based on it (NumSelectors). The width of the iSelector input will be the log of the number of options:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- to_unsigned
use work.ncl.all;          -- ncl_pair types (clog2 is assumed to live here too)

entity MUX is
  generic(NumOptions : integer := 2);
  port (iSelector : in ncl_pair_vector(0 to clog2(NumOptions)-1);
        iOptions  : in ncl_pair_vector(0 to NumOptions-1);
        output    : out ncl_pair);
end MUX;

architecture structural of MUX is
  constant NumSelectors : integer := clog2(NumOptions);
  signal Selectors : std_logic_vector(0 to NumOptions-1);
  signal GatedOptions0 : std_logic_vector(0 to NumOptions-1);
  signal GatedOptions1 : std_logic_vector(0 to NumOptions-1);

  type SelectorData is array (integer range <>) of std_logic_vector(0 to NumSelectors-1);
  signal selectorInputs : SelectorData(0 to NumOptions-1);
begin
  -- [This part is the same as before]

  cases: for iCase in 0 to NumOptions-1 generate
    bits: for iBit in 0 to NumSelectors-1 generate

      Input0Selection: if (to_unsigned(2**iBit, 3) and to_unsigned(iCase, 3)) = 0 generate
        selectorInputs(iCase)(iBit) <= iSelector(iBit).DATA0;
      end generate;

      Input1Selection: if (to_unsigned(2**iBit, 3) and to_unsigned(iCase, 3)) > 0 generate
        selectorInputs(iCase)(iBit) <= iSelector(iBit).DATA1;
      end generate;
    end generate;

    CaseSelectorGate: THmn
      generic map(M => NumSelectors, N => NumSelectors)
      port map(inputs => selectorInputs(iCase),
               output => Selectors(iCase));

    Gated0: THmn
      generic map(M => 2, N => 2)
      port map(inputs(0) => Selectors(iCase),
               inputs(1) => iOptions(iCase).DATA0,
               output => GatedOptions0(iCase));

    Gated1: THmn
      generic map(M => 2, N => 2)
      port map(inputs(0) => Selectors(iCase),
               inputs(1) => iOptions(iCase).DATA1,
               output => GatedOptions1(iCase));

  end generate;

  o0: THmn
    generic map(M => 1, N => NumOptions)
    port map(inputs => GatedOptions0,
             output => output.DATA0);

  o1: THmn
    generic map(M => 1, N => NumOptions)
    port map(inputs => GatedOptions1,
             output => output.DATA1);

end structural;

Testing

I am testing this module with 2 options for now. In theory it scales, but at some point I should add a 4-option test, and maybe a 5-option test to see how it does with non-power-of-2 values. The test script goes through the input options and tests that they output correctly.

When I first ran this, I had an error where the output indexing was in the wrong order. I had the part of the code near the top messed up to use iSelectors(case).DATA0 instead of DATA1 and vice versa.

Capture

Commit: b35b729

NCL Multiplexer Design

Theory

Multiplexers are components that let you switch between different options for a signal. They take in some number of option values (usually a power of 2) and a selector. Each data value of the selector corresponds to a particular input, which is fed to the output.

output = iOptions(iSelector)

If there are 2 options (the most basic MUX) then the selector is 1 bit. If there are 3 or 4 options, 2 bits are needed, and so on.
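The selector width is just the ceiling of log2 of the number of options. A quick Python sketch of that rule follows; the function name `selector_bits` is made up here.

```python
def selector_bits(num_options: int) -> int:
    """Number of selector bits needed to address num_options inputs."""
    if num_options < 2:
        raise ValueError("a MUX needs at least 2 options")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, using integers only
    return (num_options - 1).bit_length()


for n in (2, 3, 4, 5, 8):
    print(n, "options ->", selector_bits(n), "selector bits")
```

Using integer bit lengths avoids floating-point rounding that `math.log2` could introduce for large values.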

Design

This time we’re going to go about this in a more intuitive, less rigorous manner. Let’s consider each ‘row’ separately: each row corresponds to one input option (both *.0 and *.1). For each of these rows, we’ll generate a gating signal from the iSelector bits. This gating signal will be used by two TH22 gates to clear all but the selected signals.

This is very much like having 2 MUXes, one for the *.1s and one for the *.0s.

We’ll be reusing some code from the FullAdder implementation to get the selectors (one wire per input case); each of these will gate the DATA0 and DATA1 lines of the respective input option. The gated values will then be combined with a TH1n gate. An example 4-option case:

Case0 <= TH22(iSel(0).DATA0, iSel(1).DATA0)
Case1 <= TH22(iSel(0).DATA1, iSel(1).DATA0)
Case2 <= TH22(iSel(0).DATA0, iSel(1).DATA1)
Case3 <= TH22(iSel(0).DATA1, iSel(1).DATA1)

GatedA0 <= TH22(iOptions(0).DATA0, Case0)
GatedA1 <= TH22(iOptions(0).DATA1, Case0)

GatedB0 <= TH22(iOptions(1).DATA0, Case1)
GatedB1 <= TH22(iOptions(1).DATA1, Case1)

GatedC0 <= TH22(iOptions(2).DATA0, Case2)
GatedC1 <= TH22(iOptions(2).DATA1, Case2)

GatedD0 <= TH22(iOptions(3).DATA0, Case3)
GatedD1 <= TH22(iOptions(3).DATA1, Case3)

output.DATA0 <= TH14(GatedA0, GatedB0, GatedC0, GatedD0)
output.DATA1 <= TH14(GatedA1, GatedB1, GatedC1, GatedD1)

mux42.png

That’s a bit repetitive; let’s make it a little more general. The design involves some ‘magic’ parts because they are more of an implementation detail really.

for Case in 0 to N-1
  [build CaseBits with DATA0's and DATA1's]
      -- CaseBits is a concatenated signal from the iSelector input
  Selectors(Case) <= THNN(CaseBits)

  GatedCase0 <= TH22(Selectors(Case), iOptions(Case).DATA0)
  GatedCase1 <= TH22(Selectors(Case), iOptions(Case).DATA1)
next Case

output.DATA0 <= TH1N(Gated00, Gated10, Gated20, Gated30, ...)
output.DATA1 <= TH1N(Gated01, Gated11, Gated21, Gated31, ...)

Each row generates a selector, gates the option values, and passes them to the output. Any un-selected inputs are NULLed out (Gated#0 and Gated#1 both go to 0), leaving only the selected input to pass through the TH1N gates.
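The generic design can be sketched in a few lines of Python. This is a behavioural model that ignores gate hysteresis (so it only describes the NULL->DATA direction), and the function name `mux` and the tuple encoding are assumptions for illustration. A dual-rail signal is a (DATA0, DATA1) tuple, and selector bit 0 is the least significant bit, as in the 4-option example.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (DATA0 rail, DATA1 rail); (0, 0) is NULL


def mux(selector: List[Pair], options: List[Pair]) -> Pair:
    out0 = out1 = 0
    for case, option in enumerate(options):
        # CaseBits: the selector rail that matches this case, one per selector bit
        case_bits = [sel[(case >> bit) & 1] for bit, sel in enumerate(selector)]
        selected = int(all(case_bits))  # THNN
        out0 |= selected & option[0]    # TH22, then OR-ed by the TH1N
        out1 |= selected & option[1]    # TH22, then OR-ed by the TH1N
    return (out0, out1)


options = [(1, 0), (0, 1), (0, 1), (1, 0)]  # values 0, 1, 1, 0
print(mux([(0, 1), (1, 0)], options))       # selector = 1 -> DATA1
print(mux([(0, 0), (1, 0)], options))       # selector bit 0 is NULL -> NULL
```

The output stays NULL until both the selector and the selected option carry DATA; the un-selected options never influence it.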

Using an NCL Register

In this post, I described what an NCL register is. I wanted to get a more practical understanding of what the register does and how different pipeline stages interact. To facilitate this, I put the Full Adder between two registers, with their control signals linked:

pipelinedadder.png

In this setup, both registers start with NULL, requesting DATA.

  1. When DATA is fed to the first register, it immediately passes it on to the Adder and requests NULL.
  2. Once the Adder completes, the second register saves the DATA to the outputs and requests NULL.

The same sequence repeats with the NULL wavefront, then back to DATA, and so on…

We’ve already tested the Adder, but we want to make sure the system works, so we make a separate test for this unit (VHDL source, TCL test script). This test doesn’t actually verify the results of the computation, as we already checked the adder. Essentially, if it runs, the pipelining worked. If it hangs, then something is wrong and wavefronts are not propagating through the circuit.

Pipelined Adder tests

In theory, a loop with 3 registers can be made, but in this case, if the outputs feed back, the result will degrade to 1 eventually, or stay at 0. I may make a 2-bit counter or something in a while.

Commit: 40d96b8

NCL Full Adder Implementation

Design Recap

The circuit from my last post:

FullAdder

oS.1:
TH12(
  TH22(
    TH13(iA.1, iB.1, iC.1), -- 1 <= NumBits
    TH23(iA.0, iB.0, iC.0)), -- NumBits < 2
  TH33(iA.1, iB.1, iC.1)); -- 3 <= NumBits

oS.0:
TH12(
  TH33(iA.0, iB.0, iC.0), -- NumBits < 1
  TH22(
    TH23(iA.1, iB.1, iC.1), -- 2 <= NumBits
    TH13(iA.0, iB.0, iC.0))); -- NumBits < 3

oC.1: TH23(iA.1, iB.1, iC.1) -- 2 <= NumBits

oC.0: TH23(iA.0, iB.0, iC.0) -- NumBits < 2

Optimized

The structural VHDL implementation:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.ncl.all;

entity FullAdder is
  port(iC : in ncl_pair;
       a : in ncl_pair;
       b : in ncl_pair;
       oS : out ncl_pair;
       oC : out ncl_pair);
end FullAdder;

architecture structural of FullAdder is
  type first_layer is array (integer range <>) of std_logic_vector(0 to 2);
  signal first_layer_inputs : first_layer(0 to 7);
  signal intermediate : std_logic_vector(0 to 7);
  signal inputs : ncl_pair_vector(0 to 2);

begin
  inputs(2) <= a;
  inputs(1) <= b;
  inputs(0) <= iC;
  input_layer: for i in 0 to 7 generate
    bits: for ibit in 0 to 2 generate
      Input0Selection: if (to_unsigned(2**iBit, 3) and to_unsigned(i, 3)) = 0 generate
        first_layer_inputs(i)(iBit) <= inputs(iBit).Data0;
      end generate;
      Input1Selection: if (to_unsigned(2**iBit, 3) and to_unsigned(i, 3)) > 0 generate
        first_layer_inputs(i)(iBit) <= inputs(iBit).Data1;
      end generate;
    end generate;
    gate: THmn
            generic map(M => 3, N => 3)
            port map(inputs => first_layer_inputs(i),
                     output => intermediate(i));
  end generate;

  oS0: THmn
         generic map(M => 1, N => 4)
         port map(inputs(0) => intermediate(0),
                  inputs(1) => intermediate(3),
                  inputs(2) => intermediate(5),
                  inputs(3) => intermediate(6),
                  output => oS.DATA0);

  oS1: THmn
         generic map(M => 1, N => 4)
         port map(inputs(0) => intermediate(1),
                  inputs(1) => intermediate(2),
                  inputs(2) => intermediate(4),
                  inputs(3) => intermediate(7),
                  output => oS.DATA1);

  oC0: THmn
         generic map(M => 1, N => 4)
         port map(inputs(0) => intermediate(0),
                  inputs(1) => intermediate(1),
                  inputs(2) => intermediate(2),
                  inputs(3) => intermediate(4),
                  output => oC.DATA0);

  oC1: THmn
         generic map(M => 1, N => 4)
         port map(inputs(0) => intermediate(3),
                  inputs(1) => intermediate(5),
                  inputs(2) => intermediate(6),
                  inputs(3) => intermediate(7),
                  output => oC.DATA1);
end structural;

architecture optimized of FullAdder is
  signal sLT2 : std_logic;
  signal sLT3 : std_logic;
  signal sGE2 : std_logic;
  signal sGE1 : std_logic;
  signal sEQ3 : std_logic;
  signal sEQ2 : std_logic;
  signal sEQ1 : std_logic;
  signal sEQ0 : std_logic;
begin
  LT2: THmn
         generic map(M => 2, N => 3)
         port map(inputs(0) => a.DATA0,
                  inputs(1) => b.DATA0,
                  inputs(2) => iC.DATA0,
                  output => sLT2);

  GE2: THmn
         generic map(M => 2, N => 3)
         port map(inputs(0) => a.DATA1,
                  inputs(1) => b.DATA1,
                  inputs(2) => iC.DATA1,
                  output => sGE2);

  GE1: THmn
         generic map(M => 1, N => 3)
         port map(inputs(0) => a.DATA1,
                  inputs(1) => b.DATA1,
                  inputs(2) => iC.DATA1,
                  output => sGE1);

  EQ1: THmn
         generic map(M => 2, N => 2)
         port map(inputs(0) => sGE1,
                  inputs(1) => sLT2,
                  output => sEQ1);

  EQ3: THmn
         generic map(M => 3, N => 3)
         port map(inputs(0) => a.DATA1,
                  inputs(1) => b.DATA1,
                  inputs(2) => iC.DATA1,
                  output => sEQ3);

  S1: THmn
        generic map(M => 1, N => 2)
        port map(inputs(0) => sEQ1,
                 inputs(1) => sEQ3,
                 output => oS.DATA1);

  EQ0: THmn
         generic map(M => 3, N => 3)
         port map(inputs(0) => a.DATA0,
                  inputs(1) => b.DATA0,
                  inputs(2) => iC.DATA0,
                  output => sEQ0);

  LT3: THmn
         generic map(M => 1, N => 3)
         port map(inputs(0) => a.DATA0,
                  inputs(1) => b.DATA0,
                  inputs(2) => iC.DATA0,
                  output => sLT3);

  EQ2: THmn
         generic map(M => 2, N => 2)
         port map(inputs(0) => sGE2,
                  inputs(1) => sLT3,
                  output => sEQ2);

  S0: THmn
        generic map(M => 1, N => 2)
        port map(inputs(0) => sEQ2,
                 inputs(1) => sEQ0,
                 output => oS.DATA0);

  oC.DATA0 <= sLT2;
  oC.DATA1 <= sGE2;

end optimized;

This VHDL implementation of the un-optimized design uses generate loops to set up the first layer (the first layer uses all combinations of group values). The second layer is set up manually.

The optimized design’s signals are named in terms of relations Greater or Equal to #, Less Than #, and EQual to #. So sEQ1 is asserted when 1 input group is set to 1, and sGE2 is asserted when at least 2 input groups are set to 1.

Testing

The test script runs through all combinations of inputs. I ran the tests with both versions. Here’s the result, no surprises really.

Full Adder Test

Commit: a7a9dba, d645811

NCL Full Adder Design

Theory

Like the Half Adder, a Full Adder counts its inputs. The Full Adder counts three of them, though. This is to account for the carry in from the previous bit.

Truth Table

iA iB iC oS oC
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1
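Since the adder just counts its inputs, the table can be regenerated from that idea. A tiny Python sketch follows; the function name `full_adder` is only for illustration.

```python
from itertools import product


def full_adder(a: int, b: int, c: int):
    """Count the set inputs: the low bit is the sum, the high bit is the carry."""
    count = a + b + c
    return count & 1, count >> 1  # (oS, oC)


print("iA iB iC oS oC")
for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    print(f" {a}  {b}  {c}  {s}  {carry}")
```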

Design

Once more, we’ll start with the truth table, derive sum-of-product equations, and circuit-ize.

oS = (iA'*iB'*iC) + (iA'*iB*iC') + (iA*iB'*iC') + (iA*iB*iC)
oC = (iA'*iB*iC) + (iA*iB'*iC) + (iA*iB*iC') + (iA*iB*iC)

Again, we need to convert these into NCL logic (DATA0 and DATA1).

oS.0 = (iA.0*iB.0*iC.0)+(iA.0*iB.1*iC.1)+(iA.1*iB.0*iC.1)+(iA.1*iB.1*iC.0)
oS.1 = (iA.0*iB.0*iC.1)+(iA.0*iB.1*iC.0)+(iA.1*iB.0*iC.0)+(iA.1*iB.1*iC.1)
oC.0 = (iA.0*iB.0*iC.0)+(iA.0*iB.0*iC.1)+(iA.0*iB.1*iC.0)+(iA.1*iB.0*iC.0)
oC.1 = (iA.0*iB.1*iC.1)+(iA.1*iB.0*iC.1)+(iA.1*iB.1*iC.0)+(iA.1*iB.1*iC.1)

Note that each row of the truth table is used exactly twice, once for each output variable. Since 0’s and 1’s are both represented by a high signal, each output variable {oS, oC} has an assignment for each case. Build the AND-Plane with TH33 gates, and the OR-Plane with TH14 gates (each output rail ORs four product terms):

FullAdder

This design takes up 168 transistors in total. Let’s see if we can make it with fewer.
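To double-check which AND-Plane terms feed which output rail, the grouping can be derived mechanically. In this Python sketch each truth table row is numbered 0 to 7 by its input bits, and the row goes to exactly one rail of oS and one rail of oC. The function name `rail_minterms` is hypothetical.

```python
def rail_minterms():
    """Map each output rail to the truth table rows (0..7) that assert it."""
    rails = {"oS.0": [], "oS.1": [], "oC.0": [], "oC.1": []}
    for row in range(8):
        count = bin(row).count("1")           # number of inputs set in this row
        rails[f"oS.{count & 1}"].append(row)  # sum rail for this row
        rails[f"oC.{count >> 1}"].append(row) # carry rail for this row
    return rails


for rail, rows in rail_minterms().items():
    print(rail, rows)
```

Every row appears once under oS and once under oC, which is the "used exactly twice" property of the SOP design, and each rail collects four rows.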

Optimization

This time, instead of SOP form, I’m going to look at it more intuitively. Since the bits are symmetric (the values of iA can be swapped with iB without any change in the expected output), let’s look at counting them with threshold gates instead of checking individual cases. To check a ‘less than’ relationship for the number of inputs set, just count the number of 0’s.

oS.1: (1 <= NumBits < 2) + (3 <= NumBits) --5 gates
oS.0: (NumBits < 1) + (2 <= NumBits < 3)  --5 gates

oC.1: (2 <= NumBits)                      -- 1 gate (shared)
oC.0: (NumBits < 2)                       -- 1 gate (shared)

Gate version:

oS.1: 
TH12(
     TH22(
          TH13(iA.1, iB.1, iC.1),  -- 1 <= NumBits
          TH23(iA.0, iB.0, iC.0)), -- NumBits < 2
     TH33(iA.1, iB.1, iC.1)); -- 3 <= NumBits

oS.0:
TH12(
     TH33(iA.0, iB.0, iC.0), -- NumBits < 1
     TH22(
          TH23(iA.1, iB.1, iC.1),  -- 2 <= NumBits
          TH13(iA.0, iB.0, iC.0))); -- NumBits < 3

oC.1: TH23(iA.1, iB.1, iC.1) -- 2 <= NumBits
oC.0: TH23(iA.0, iB.0, iC.0) -- NumBits < 2

I’m going to call this ‘functional notation’. It treats each gate as a function with other gates as inputs; common expressions are evaluated only once (duplicate gates are only built once, then shared). This uses the following gates (with their transistor counts):

  • 2 TH12 = 2*6  = 12 transistors
  • 2 TH22 = 2*12 = 24 transistors
  • 2 TH13 = 2*8  = 16 transistors
  • 2 TH23 = 2*18 = 36 transistors
  • 2 TH33 = 2*16 = 32 transistors

Total: 120 transistors (-29% from SOP). The downside is that this is three-layer logic, which generally has a little higher delay for that third layer. On the upside, most of the gates have fewer inputs (2 and 3 inputs instead of 3 and 4 inputs). This reduces the complexity of each gate and may actually reduce the critical path. I won’t do that analysis here.
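The gate version can be checked against the truth table with a short Python sketch. It treats each THmn gate as a plain "at least m inputs set" function, which is how the gates behave when starting from NULL; hysteresis is not modelled. The names `th` and `full_adder_optimized` are illustrative.

```python
from itertools import product


def th(m, *inputs):
    """Threshold gate without hysteresis: 1 when at least m inputs are set."""
    return int(sum(inputs) >= m)


def full_adder_optimized(a, b, c):
    """a, b, c are (DATA0, DATA1) pairs; returns (oS, oC) as pairs."""
    ones = (a[1], b[1], c[1])
    zeros = (a[0], b[0], c[0])
    ge1, ge2, eq3 = th(1, *ones), th(2, *ones), th(3, *ones)
    lt3, lt2, eq0 = th(1, *zeros), th(2, *zeros), th(3, *zeros)
    s1 = th(1, th(2, ge1, lt2), eq3)  # (1 <= NumBits < 2) + (3 <= NumBits)
    s0 = th(1, eq0, th(2, ge2, lt3))  # (NumBits < 1) + (2 <= NumBits < 3)
    return (s0, s1), (lt2, ge2)


def encode(bit):
    return (0, 1) if bit else (1, 0)


for bits in product((0, 1), repeat=3):
    count = sum(bits)
    result = full_adder_optimized(*map(encode, bits))
    assert result == (encode(count & 1), encode(count >> 1))
print("optimized design matches the truth table")
```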

You might be tempted to obtain the *.0 or *.1 signal by inverting the other. You cannot do this in NCL. You must be able to pass on NULL wavefronts, which require both to be 0. This is a downside of NCL: all groups require two (or more) complementary circuits to obtain. This limitation results in increased die area.

See my next post for the implementation.

NCL Register Implementation

See this post for the design of the NCL Register.

Implementation

I implemented this module structurally, with a for...generate (you’ll find that I’m a big fan of generics if you keep up with the blog).

This module assumes that all groups (DATA0, DATA1, … DATAn) are dual rail (capped at DATA1); if I need other encodings, I’ll make them as separate modules later. I added a generic RegisterDelay input so that I can better observe pipelines of components (if the outputs are stable for some time, I can read values off the waveforms more easily).

library ieee;
use ieee.std_logic_1164.all;
use work.ncl.all;

entity RegisterN is
  generic(N : integer := 1;
  RegisterDelay : time := 20 ns);
  port(inputs : in ncl_pair_vector(0 to N-1);
  from_next : in std_logic;
  output : out ncl_pair_vector(0 to N-1);
  to_prev : out std_logic);
end RegisterN;

architecture structural of RegisterN is
  signal outs : std_logic_vector(0 to (2*N)-1);
  signal watcher_out : std_logic := '0';
begin

register_gates: for i in 0 to N-1 generate
  T22_i0 : THmn
             generic map(N => 2, M => 2, Delay => RegisterDelay)
             port map(inputs(0) => inputs(i).DATA0,
                      inputs(1) => from_next,
                      output => outs(2*i));
  output(i).DATA0 <= outs(2*i);

  T22_i1 : THmn
             generic map(N => 2, M => 2, Delay => RegisterDelay)
             port map(inputs(0) => inputs(i).DATA1,
                      inputs(1) => from_next,
                      output => outs(2*i+1));
  output(i).DATA1 <= outs(2*i+1);

end generate register_gates;

  watcher: THmn
             generic map (N => N*2, M => N)
             port map (inputs => outs,
                       output => watcher_out);
  WatcherOutput: to_prev <= NOT watcher_out;
end structural;

Testing

For testing I again used a test script, available on GitHub. It goes through the values and makes sure that data is sent through correctly. It also checks that DATA/NULL wavefronts are delayed correctly by the control signal (handshaking). Note that the to_prev output signal is always in the ‘opposite’ state of the outputs (outputs NULL => to_prev 1 and outputs DATA => to_prev 0), although there is actually a 1-ns delay (default) in the watcher gate (a TH36 in this case).

Capture

Commits: 5d349e7, 4ecfe25, bf16da3

NCL Register Design

We’ve covered some basics on NCL (signals and gates), next I’m looking into registers and structuring a system with multiple components.

Theory

In synchronous logic, designers use flip-flops to store data: they store the current value on every clock edge, moving it to the next stage. In asynchronous logic, there is no clock edge, so saving data requires something else. NCL uses threshold gates as registers, which works because of their hysteresis property. The requirements for the register:

  • Hold on to the value for as long as the next module needs it
  • Send a reset (NULL) signal to the next module on all inputs when it needs it
  • Let the previous register know what it needs (DATA/NULL)

So, there’s handshaking going on here, each register tells the one before it what it needs, and tries to send the next one what it asks for.
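The hysteresis property is what lets a threshold gate act as storage. Here is a minimal Python sketch of a THmn gate with hysteresis; the class name `ThresholdGate` is made up for illustration and is not the VHDL THmn component.

```python
class ThresholdGate:
    """THmn gate: sets when at least m of n inputs are 1, clears only when
    all inputs are 0, and otherwise holds its previous output."""

    def __init__(self, m: int, n: int) -> None:
        if not 1 <= m <= n:
            raise ValueError("threshold must satisfy 1 <= m <= n")
        self.m, self.n, self.output = m, n, 0

    def update(self, inputs) -> int:
        if len(inputs) != self.n:
            raise ValueError("wrong number of inputs")
        asserted = sum(inputs)
        if asserted >= self.m:
            self.output = 1
        elif asserted == 0:
            self.output = 0
        # otherwise: hold the previous value (hysteresis)
        return self.output


gate = ThresholdGate(2, 2)  # a TH22
print(gate.update([1, 0]))  # 0: threshold not reached yet
print(gate.update([1, 1]))  # 1: both inputs set
print(gate.update([1, 0]))  # 1: held, because not all inputs are clear
print(gate.update([0, 0]))  # 0: all inputs clear
```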

Design

How do we send the request to the previous module then? Let’s assume 1 control line, and see if we need something else later. Since we are representing NULL with 0, let’s set a request for NULL to be 0, and a request for DATA to be 1. We want to receive DATA as soon as the module has reset to NULL, and we want NULL as soon as the module is outputting DATA on all groups. Here’s the initial design:

NCL Register

If both A and B have a line set (either 0 or 1) then the ‘watcher’ gate is set. The little circle on the tip is an inverter, it turns the 1 (indicating we have DATA) to a 0 (indicating we want NULL) and vice versa.

There is one more requirement: If the module after us is requesting DATA, we can’t store the NULL wavefront (which would overwrite the DATA values) and vice versa and so need to hold the previous module until we can. This means that the request has to be based on the register’s outputs, not its inputs.

Refresher: A group of NCL lines is the set of lines representing a single entity. Only one can be active at a time, but it is allowable to have none active (NULL).

NCL Register

Here we have a gate saving each bit: If the control input is low, then the gates will reset when the previous module’s outputs clear (next module requesting null). If the control input is high, then the gates will save DATA inputs (next module requesting DATA).

When both groups (A and B) have data, the watcher sees 2 data lines, sets its output, which goes through the inverter and requests NULL (which won’t be saved until the next module requests NULL).

Eventually, the previous module NULLs out and waits for a DATA request. When the next module requests NULL, the register gates flip to NULLs and the watcher outputs a 0, which is inverted to a 1 (request for DATA). The NULL wavefront passes through the module to the next register.

This cycle continues.
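The cycle can be walked through with a small Python model of the register: one TH22 per rail, gated by the request from the next stage, plus a watcher gate and an inverter. This is a behavioural sketch with zero delays, and the names `th22` and `NclRegister` are hypothetical. Inputs are assumed to be valid dual-rail groups written as (DATA0, DATA1) tuples.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (DATA0 rail, DATA1 rail); (0, 0) is NULL


def th22(a: int, b: int, previous: int) -> int:
    """TH22 with hysteresis: set on both inputs, clear on neither, else hold."""
    if a and b:
        return 1
    if not a and not b:
        return 0
    return previous


class NclRegister:
    def __init__(self, groups: int) -> None:
        self.outputs: List[Pair] = [(0, 0)] * groups
        self.watcher = 0  # set once every group holds DATA

    def step(self, inputs: List[Pair], from_next: int):
        self.outputs = [
            (th22(d0, from_next, q0), th22(d1, from_next, q1))
            for (d0, d1), (q0, q1) in zip(inputs, self.outputs)
        ]
        asserted = sum(q0 + q1 for q0, q1 in self.outputs)
        if asserted >= len(self.outputs):  # threshold N out of 2N rails
            self.watcher = 1
        elif asserted == 0:
            self.watcher = 0
        to_prev = 1 - self.watcher  # inverter: 1 requests DATA, 0 requests NULL
        return self.outputs, to_prev


reg = NclRegister(2)
data, null = [(0, 1), (1, 0)], [(0, 0), (0, 0)]
print(reg.step(data, 1))  # next stage wants DATA: latch it, request NULL
print(reg.step(null, 1))  # NULL arrives early: DATA is held
print(reg.step(null, 0))  # next stage wants NULL: clear, request DATA
print(reg.step(data, 0))  # DATA arrives early: NULL is held
```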

Notes

Components can be directly linked without registers, but only one operation can occur between registers at a time. Adding the registers splits up the operation into smaller parts, which can occur in parallel (for different inputs). At the start, the first set of inputs is loaded, and when they move to the second stage, the first stage is NULLed. After that, the first stage receives the second set of inputs while the first set is still running through the third stage. This continues, with all DATA wavefronts separated by NULL wavefronts.

NCL Half Adder Implementation

See this post for the theory and design of the NCL Half Adder.

Design Recap

Here’s the circuit design:

halfadder1-e1499625273754.png

(simple version)

HalfAdder Optimized

(optimized version)

The top two gates are THxor0 gates, the next is a TH34w22, and the last one is a TH22 gate. I will implement this structurally, a fairly straightforward process in this case.

Implementation

library ieee;
use ieee.std_logic_1164.all;
use work.ncl.all;

entity HalfAdder is
 port(a : in ncl_pair;
      b : in ncl_pair;
      s : out ncl_pair;
      c : out ncl_pair);
end HalfAdder;

architecture structural of HalfAdder is
  signal a0b0_ins : std_logic_vector(0 to 1);
  signal a0b0_out : std_logic;
  signal a0b1_ins : std_logic_vector(0 to 1);
  signal a0b1_out : std_logic;
  signal a1b0_ins : std_logic_vector(0 to 1);
  signal a1b0_out : std_logic;
  signal a1b1_ins : std_logic_vector(0 to 1);
  signal a1b1_out : std_logic;

  signal s0_ins : std_logic_vector(0 to 1);
  signal s0_out : std_logic;
  signal s1_ins : std_logic_vector(0 to 1);
  signal s1_out : std_logic;

  signal c0_ins : std_logic_vector(0 to 2);
  signal c0_out : std_logic;
begin
  a0b0_ins(0) <= a.DATA0;
  a0b0_ins(1) <= b.DATA0;
  T21_A0B0 : THmn
               generic map(N => 2, M => 2)
               port map(inputs => a0b0_ins,
                        output => a0b0_out);

  a0b1_ins(0) <= a.DATA0;
  a0b1_ins(1) <= b.DATA1;
  T21_A0B1 : THmn
               generic map(N => 2, M => 2)
               port map(inputs => a0b1_ins,
                        output => a0b1_out);

  a1b0_ins(0) <= a.DATA1;
  a1b0_ins(1) <= b.DATA0;
  T21_A1B0 : THmn
               generic map(N => 2, M => 2)
               port map(inputs => a1b0_ins,
                        output => a1b0_out);

  a1b1_ins(0) <= a.DATA1;
  a1b1_ins(1) <= b.DATA1;
  T21_A1B1 : THmn
               generic map(N => 2, M => 2)
               port map(inputs => a1b1_ins,
                        output => a1b1_out);

  s1_ins(0) <= a0b1_out;
  s1_ins(1) <= a1b0_out;
  T21_S1: THmn
            generic map(N => 2, M => 1)
            port map(inputs => s1_ins,
                     output => s1_out);
  s.DATA1 <= s1_out;

  s0_ins(0) <= a0b0_out;
  s0_ins(1) <= a1b1_out;
  T21_S0: THmn
            generic map(N => 2, M => 1)
            port map(inputs => s0_ins,
                     output => s0_out);
  s.DATA0 <= s0_out;
  c.DATA1 <= a1b1_out;

  c0_ins(0) <= a1b0_out;
  c0_ins(1) <= a0b1_out;
  c0_ins(2) <= a0b0_out;
  T31_C0: THmn
            generic map(N => 3, M => 1)
            port map(inputs => c0_ins,
                     output => c0_out);
  c.DATA0 <= c0_out;
end structural;

architecture optimized of HalfAdder is
begin   
  Sum0: THxor0
          port map(A => A.DATA0,
                   B => B.DATA0,
                   C => A.DATA1,
                   D => B.DATA1,
                   output => s.DATA0);

  Sum1: THxor0
          port map(A => A.DATA1,
                   B => B.DATA0,
                   C => A.DATA0,
                   D => B.DATA1,
                   output => s.DATA1);

  Carry0: THmn
            generic map(N => 6, M => 3)
            port map(inputs(0) => A.DATA0,
                     inputs(1) => A.DATA0,
                     inputs(2) => B.DATA0,
                     inputs(3) => B.DATA0,
                     inputs(4) => A.DATA1,
                     inputs(5) => B.DATA1,
                     output => c.DATA0);

  Carry1: THmn
            generic map(N => 2, M => 2)
            port map(inputs(0) => A.DATA1,
                     inputs(1) => B.DATA1,
                     output => c.DATA1);
end optimized;

I have two implementations here: the basic version and the optimized version. There’s not a lot to explain beyond VHDL syntax. The gates are built as in the diagram, though some orderings might have changed.
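One detail worth spelling out is the Carry0 gate. The optimized architecture builds the TH34w22 from a 6-input THmn with threshold 3, wiring A.DATA0 and B.DATA0 in twice so they count double. The Python sketch below models that as a weighted sum (set condition only, no hysteresis) and checks it against the expected carry; the function names are made up for illustration.

```python
from itertools import product

NULL, DATA0, DATA1 = (0, 0), (1, 0), (0, 1)  # (DATA0 rail, DATA1 rail)


def th34w22(a, b, c, d):
    """Threshold 3 over four inputs, where a and b carry weight 2."""
    return int(2 * a + 2 * b + c + d >= 3)


def carry0(a, b):
    """c.DATA0 of the half adder: inputs ordered as in the Carry0 port map."""
    return th34w22(a[0], b[0], a[1], b[1])


# With both inputs DATA, c.DATA0 is set unless both inputs are 1.
for a, b in product([DATA0, DATA1], repeat=2):
    assert carry0(a, b) == int(not (a == DATA1 and b == DATA1))

# With one input still NULL, the gate cannot reach its threshold.
for x in (DATA0, DATA1):
    assert carry0(x, NULL) == 0 and carry0(NULL, x) == 0
print("Carry0 behaves as expected")
```

A single DATA0 input only contributes weight 2, so the gate waits for the second input even though one 0 already determines the carry.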

Testing

The test script runs through all input values, clearing to NULL in between. The test simulation run:

Capture

If you have any questions, leave a comment below.

Commit: a57125a