Null Convention Logic Signals II

Background

My naming conventions have probably been misleading. I have mostly treated NCL bits as three-state logic (NULL, DATA0, and DATA1) this is a valid way to work with the data, but it is limiting. The better way to think of it is as individual 2-state logic lines (NULL and DATA).

Internally to a component, using single-rail signals is very helpful. Not only are single-rail signals the inputs to all NCL gates, but they can be used as links between components as well. Single-rail signals cannot really be passed through registers because they have to be asserted for the DATA wave to read as complete. This would prevent the signal from conveying meaning, as it would always have the same value during computations.

My Theory

This section is unverified theory. Please share your thoughts below.

I will b using _ to refer to a non-asserted rail (null) and ^ to refer to an asserted rail (data).

Another benefit of this view is that you get more flexibility in state encodings. Standard encodings (that I’ve come across) use mutually exclusive data lines. So a 3-rail signal would have 3 possible states and a 4-rail signal would have 4 states. By thinking of the data more in terms of DATA and NULL, you get to design your own state encodings. For example, with a 4-rail encoding, you could use the following 6 states: {__^^, _^_^, _^^_, ^__^, ^_^_, ^^__}.

State Space Restrictions

A state space is the set of valid states. We will use it here a little more specifically to refer to the valid DATA states. Not all state spaces will actually work for NCL, The completion logic has to be able to correctly identify completed states. For example, if we try to use the following for a 3-rail signal, we have a problem: {__^, _^^, ^^^}. If the result is the last state (^^^) and the data on line 0 arrives first, the completion logic will see it as a complete __^ and pass on the wrong result.

For any state, it has to be able to partially arrive, without looking like another state. This is required for delay insensitivity. A simple way to make this happen is to use M-hot states. By that I mean that each state will have the same number of rails asserted as data (^), and all such states can be used. The number of states in such an assignment for a N-rail signal is M Choose N or C(M, N). This has its maximum value when M=N/2. It also has the benefit that any incomplete state has at most M-1 rails set, which will never match another state.

The logic proof-y version: ∀x,y∈S:(x&y!=x) and (x&y!=y) must hold (& is the bitwise AND operator) where S is the state space.

Using this N/2-hot encoding, the states/rail is

Rails States State Increase Factor
1 1
2 2 2
3 3 1.5
4 6 2
5 10 1.6667
6 20 2

Using this encoding, you get nearly twice as many states for each rail you add. Using mutually-exclusive rails (1-hot) you get 1 extra state per rail. Even if you break it up into 2-rail pairs and use combinations of those values, you only double your state space for every 2 rails you add. In the table above, s you increase the number of rails, the factor approaches 2 for the odd numbers of rails as well.

In special cases, you might want a more complex encoding, some examples of valid state spaces:

  • {___^, ^_^_, _^__}
  • {__^^, ^___, _^__}
  • {^^^_, ^^_^, ^_^^, _^^^}
  • {^^^_, ___^}
  • {__^^, _^_^, _^^_, ^__^, ^_^_, ^^__}

Takeaway

While you can increase the compatibility of your modules by using a standard state space, sometimes you can reduce the number of rails by using more dense encodings. Additionally, signals that exist only between registers (are not passed on) can actually be 1-rail signals, which is something I have a hard time remembering when integrating components. These 1-rail signals can act as enable signals: They can be used to prevent data from propagating, or to allow it to propagate (remember steering).

Register Ring with Multiple Wavefronts

In this post, we built a ring of registers and set a data wavefront loose in it. We added many stages, and watched the single wavefront run through them. In this post, I’ll show what happens with multiple wavefronts.

Theory

First, we should establish how many registers we need for some number of wavefronts. From the Register Ring post, we know that in order for a state (DATA or NULL) to pass on, the next two stages must both be in the opposite state. This prevents the passage from overwriting the next stage’s state completely, as it has already passed on to the next stage.

Slide7

Here, the DATA state (red) can move on to the next stage because the NULL state has been passed up to the 3rd stage. By overwriting the 2nd stage, no state is lost, because stage 2 has already finished with the NULL state.

Slide2

Here, stage 2 cannot pass DATA on to stage 3 because that would remove the NULL state entirely. If there were 4 stages, and the 4th was NULL, then the 3rd stage could accept data, since the NULL state would be preserved.

As you can see, the number of stages in a state is not important, but rather the sequence must be preserved. Let’s assume a really long line of registers for a moment, with stage states: (DATA, NULL, NULL, DATA, ...) All of these NULLs can be assumed to be the same logical state, which we’ll refer to as NULL1, we’ll also call the first DATA state DATA1: ​(DATA1, NULL1, NULL1, DATA0, ...). Take the case where the last DATA wave can’t move because of some really slow stage after it. If we ‘run’ the system for a register delay, the DATA1 state progresses, and overwrites one of the NULL1 states, but there are still some left, so nothing is really lost. When DATA1 moves forward, a NULL state fills its stage; this is a new state, so let’s call it NULL2(NULL2, DATA1, NULL1, DATA0, ...). If we run again, one might expect the DATA1 state to advance again, but taht would overwrite the NULL1 state completely, putting two distinct DATA states right next to each other. Having two adjacent DATA states prevents the components from resetting between them, which only produces correct results if the DATA states are identical. Since there is no guarantee that DATA1 and DATA0 are identical, we have to preserve the NULL between them.

There are states which would be deadlocked, however, as long as the initial conditions are valid (not deadlocked), the handshaking ensures that these states never occur as a propagation of the circuit.

Some examples:

  • Ring: (NULL, DATA, NULL, DATA) (4 states) – Nothing can pass, the system is deadlocked (we can’t actually get here naturally)
  • Ring: (NULL, DATA, NULL, NULL) (2 states) – Only the DATA state can pass, the NULLs are contiguous, so they can be considered to be identical states
  • Ring: (NULL, DATA, DATA, NULL) (2 states) – Both states can pass, the DATA can overwrite the NULL in position 4, and the NULL can overwrite the DATA in position 2
  • Ring: (NULL, DATA, DATA, DATA) (2 states) – Only NULL can pass, if DATA was able to pass, the NULL could be deleted from the system
  • Ring: (NULL, NULL, DATA, NULL) (2 states) – Only DATA can pass, if NULL was able to pass, the DATA could be deleted from the system
  • Rotations of these work out to the same thing since it’s a ring

Note that DATA and NULL are symmetric and the values in the DATA wave don’t matter. You can swap instances of ‘NULL’ and ‘DATA’ and get a valid result.

In short, with an even number of stages, you can only have N/2 states. What about odd numbers?

  • Ring: (NULL, DATA, NULL, NULL, DATA) (4 states) – Only the first DATA can move, as it is the only stage followed by 2 of the opposite state
  • Ring: (NULL, DATA, DATA, NULL, DATA) (4 states) – Only the first NULL can move, as it is the only stage followed by 2 of the opposite state

Remember that rotations still match

  • Ring: (NULL, DATA, NULL, NULL, NULL) (2 states) – Only the first DATA can move, as it is the only stage followed by 2 of the opposite state
  • Ring: (NULL, DATA, DATA, NULL, NULL) (2 states) – Either state can move
    • Ring: (NULL, DATA, DATA, DATA, NULL) (2 states) – Either state can still move
    • Ring: (NULL, NULL, DATA, NULL, NULL) (2 states) – Only DATA can move

In the odd case, we can get up to (N+1)/2 states to fit. One thing to note is that when the pipeline is full (max number of states) only one state can move at a time in the odd case. In the even case, there’s that extra space that doesn’t get us another state, but it does give the states some wiggle room: more than one state can be advancing at a time.

From the above, we get: NumStages=NumStates+Advancement where Advancement is the maximum number of states that can advance at a time. Advancement must be greater than 0, if it equals 0, the pipeline will be locked. If Advancement>=NumStates then there’s no real advantage to adding more stages, unless the stages are not delay-matched.

Note: NumStates is always even in a ring.

Experiment

We’ll start with a full pipeline (Advancement=1) and then try a throughput-optimized design (Advancement=NumStates) where every state can advance simultaneously. By using the same number of states, we can benchmark the throughput and latency of the two.

Lest use NumStates:=4, which corresponds to 2 NULL and 2 DATA states.

Minimized Pipeline

By our formula above, we need 4+1=5 stages. The initial state:

5-stage-pipeline.png

By putting the value of 0 in the first DATA state, and 1 in the second state, we can track when the ring completes a cycle. Throughput is 2/t_complete (two DATA states per trip around the ring) and latency is t_complete.

5-stage-4-state
5-stage ring with 2 DATA states and 2 NULL states
  • Throughput: 1/(400 ns) = 2.5 MHz
  • Latency: 800 ns

This experiment used the static_loop VHDL file, and a specific test script.

Throughput-Optimized Pipeline

This version has 4+4=8 stages to go through, but all the states can move at the same time. Throughput should increase, but latency may increase as well. I’m not going to show the diagram because of size problems, but it’s a lot like the above, just with more stages.

Throughput and latency are calculated from the simulation, just as before. For optimal throughput and startup latency, arrange the states in pairs – 2 DATA, then 2 NULL, then 2 DATA, … – this increases the number of Advancement options at the start to the maximum: NumStates.

8-stage-4-state
An 8-stage ring with 2 DATA states and 2 NULL states
  • Throughput: 1/(160 ns) = 6.25 MHz
  • Latency: 320 ns

This experiment used the static_loop VHDL file, and a specific test script.

Results

It looks like the throughput-optimized pipeline indeed has a higher throughput: By adding 3 states, we were able to more than double the efficiency of the ring, if you look at the simulation waveforms, in the first one only one stage is transitioning at a time. In the second, 4 stages are transitioning at a time. The second (faster) version uses 8/5 the resources though, so the decision on how many stages to use depends on available die space as well. In simulation, we don’t have to worry about this.

Commits: 997b4bb, be37b7f, 20ddd65

Register Ring

I want to start in on some sequential logic. We have a few combinational modules that we can build on already, so the est thing would be to make a sequential-only circuit. Once we’ve clarified how the concept works, we can add in combinational logic in.

By ‘sequential-only’, I am referring to a setup with only registers, it just passes it’s initial input in a loop forever, not changing it. I’m hoping it’ll help me with the concept a bit more, and flush out any issues with the registers.

Here’s a diagram of a three stage loop, just imagine the outputs loop back (drawing it would be messy):

ring.png

Runtime Behavior

Let the initial state of the first stage’s outputs be DATA, with the first stage requesting NULL. The second and third stages are outputting NULL, and requesting DATA. Now let red be DATA, and blue be NULL.Slide1

After the gate delay for the register, the DATA wave is passed, and a request for NULL is sent back.

Slide2Slide3Slide4Slide5Slide6

And, we’re back where we started

Slide7

The ring will continue indefinitely. The VHDL source is available here, though without the test script, all stages remain at NULL, requesting DATA. I simulated this, and t turns out that it’s harder to see the pattern in graph form. To make things easier, I raised the number of stages.

Capture

At 8 stages, I start to be able to see it clearly as distinct wavefronts going through the pipeline. To make it really obvious, ramp it up to 12.

Capture

Not all stages shown.

And finally, if set at three, the pattern is harder to see, but it’s there. In the 3-stage case, the time spent requesting NULL and requesting DATA is the same for each stage.

Capture

If you use 2 stages, the system locks. A slideshow version of the ring pictures from above. As you can see, the transition to NULL only occurs while there are 2 DATA stages, and the transition to DATA only occurs while there are 2 NULL stages. This is so that no wavefront is ever ‘overwritten’. The second instance of that state saves the value.

Handshaking & Pipelining

Generally speaking (in this context), handshaking is the process of electronic systems agreeing on what to do next. In NCL, this takes the form of modules requesting a NULL or DATA signal.

  • The request for NULL indicates that the operation has been performed, and the results saved for the next module to use.
  • The request for DATA indicates that the module has reset, passed the reset signal to the next module, and is ready for more work.

If you’re having trouble, think of it like two people doing laundry, one is operating the washing machine, the other is operating the dryer. Additionally, assume that the operators can’t be sure of how long a load will take in either stage. It’s not a perfect analogy, but it might help. Initially, both are empty:

  1. In the empty state, the washing machine operator knows he or she needs clothes, so he or she finds some and puts them in the wash. The drier is also empty, but it needs to get its clothes from the washer, so it has to wait.
  2. When the washer is finished, the operator takes out the clothes and pus them between the washer and dryer, for the dryer’s operator to use when ready. Since the washer is now ‘reset’, it is ready for clothes again, its operator loads it.
  3. The dryer operator sees the clothes are ready to be dried and puts them in the dryer. Eventually they finish and leave the system.

Now, imagine that after a few loads, the dryer can’t keep up, and the washing machine finishes a load while the dryer is still running. The load goes into the space between the washer and dryer. Since the washer can’t know how close the dryer is to completion, it has to wait for the dryer to finish and take the waiting load before starting another.