## Mapping Square Root to the PSoC Datapath

Last post, I went through a couple of iterations of a square root algorithm before arriving at an efficient method using only shifts and adds. Now, let's figure out how to configure the datapath to execute this algorithm.

The C-style pseudo-code from the previous post is nice, but it will be easier to map if we try to represent it in a state diagram:

Notice that I've reordered a few of the operations. Instead of adding 2*D to q and then shifting the result to the right, I am now shifting q first and then adding D. This saves a shift on one of the two branches.

Now we can start thinking about how the datapath can do this.

Normally the datapath can only track two variables as a part of an iterative algorithm (one in A0, one in A1). With the latest version of the datapath which is what is in PSoC 3 and PSoC 5LP (not available in PSoC 5), there is a new mode called FIFO Dynamic Mode. In the standard mode, the datapath FIFOs are configured statically as either input or output. An input FIFO is written by the CPU or DMA and read by the datapath. An output FIFO is the opposite. With FIFO Dynamic Control, the FIFOs can be individually and dynamically selected to be in either internal or external mode. Internal mode means the datapath has full read and write control over the FIFO. External mode gives the CPU/DMA full control. This means that the CPU can load values into a FIFO in external mode, switch it into internal mode to allow the datapath to perform some computation, and then read the values back out of the same FIFO once the datapath has completed. This also means that the datapath has up to 8 new places to store computed values (two 4-byte FIFOs). Medium-complexity computations that used to eat up CPU cycles can now be offloaded and/or parallelized, speeding up applications and letting the CPU focus on more complex problems. In the square root algorithm we'll just need storage for 3 variables.

It usually makes sense to clearly partition variables between the two sides of the datapath (A0/D0/F0 vs. A1/D1/F1). It is possible to configure F0 to receive writes from A1, and to use one of our dynamic configuration operations to swap A0 and A1, but this adds complexity and cycles. For this algorithm, I'll track D in A0/D0/F0, and I'll track q and M in A1/D1/F1.

Now let's start looking at the states we'll need for our datapath state machine. The first thing we need to set up are the initial conditions. M is initially set to N, the radicand of our square root. Let's assume that value got put into a FIFO (specifically F1) before FIFO control got switched to internal. The other two initial values are static, so we can use D0 to hold 2n - 2, and D1 to hold 0. Then, the first thing we need to do is load these values out of the data register into the accumulators. Luckily, we can independently select the write-source for A0 and A1, so we can load from the data registers in parallel. We can also use this state as our idle state. If we have some "start" signal, we are just fine spinning here until start goes high. That gives us our initial state:

Next, we need to compute D + q. Because we will be comparing this to M (partitioned to the "1" side) it will be easiest to store the result of the addition in A0. In parallel, we can also load M into A1 to prepare for the comparison. We need to keep around the current values of D and q, so load them into their respective FIFOs as well:

To check the condition "D + q <= M" we will use the cl1 condition provided by the datapath (cl means Compare Less Than). The cl0 condition always compares A0 < D0, but the cl1 condition can be configured to compare A1 < A0, which is exactly what we need. If D + r is greater than M (A1 < A0), we take the left branch in our state machine above. If not, we ll go down the right branch. We could perform a NOP in our current state while we wait for the result of the condition. Instead, let's precompute a value that we need for the right branch, the longer of the two execution branches: M = M (D + q):

Let's first consider the right branch. We guessed correctly with our precompute! We are done with M for this iteration, so we can shove it back in the FIFO. The next thing we need to do is compute the new values of D and q, so bring them back into the accumulators:

We need to shift q one bit to the right, and there isn't much else we can do until then, so:

Now we can compute our new value for q by adding D to what we just shifted:

Now start to calculate our new value of D:

We need to shift D one more time. At the same time, we can take a shortcut from the state machine above. D starts at 2n 2 and gets shifted twice every iteration, producing the series 0x40, 0x10, 0x04, 0x01. That means that on the final iteration of our algorithm, when D is about to shift off, the first shift will have already taken care of shifting D to 0. We will shift D one more time for all of the other iterations, but we can check our condition in this state to see if we are ready to stop looping. The z0 condition is hardwired to compare (A0 == 0). If z0 is false, we need to jump back to state #1. We have our registers set up exactly how state #1 expects, so there is no more work to be done there. If z0 is true, we can bail out of our loop and prepare for the result to be read out.

Now we will examine the left branch of the state machine. We guessed wrong on our precompute, so the first thing we need to do is undo that. At the same time, we can bring D back into the accumulator so that it's ready to shift:

We aren't going to change M this iteration, so go ahead and put it back in the FIFO. This will allow us to load q into the accumulator. We don't need to add D to q in this branch, so we can go ahead and shift D before q is loaded:

We have two operations left to perform on this branch for this iteration. Both q and D need to be shifted once more, and then we need to check our condition to see if we stay in the loop. Luckily, we've already defined state #7 to shift D and check the condition. So to finish out the left branch, all we need to do is shift q and jump to state #7:

Now that we've defined all of the states in the computation loop, we have to think about the endgame. If we determine that we are done looping in state #7, we will get to state #11 with our final square root value sitting in A1. Nothing else matters at that point. All we need to do is make sure that F1 contains our final answer and nothing else.

For completeness, I've added an effective NOP state #12 so that we can pulse an eoc signal before going back into the idle state:

We have successfully mapped the algorithm to states and operations that can each be handled by the datapath. In the next post, I'll detail how to get these 13 states all configured in a single datapath.

ALL CONTENT AND MATERIALS ON THIS SITE ARE PROVIDED "AS IS". CYPRESS SEMICONDUCTOR AND ITS RESPECTIVE SUPPLIERS MAKE NO REPRESENTATIONS ABOUT THE SUITABILITY OF THESE MATERIALS FOR ANY PURPOSE AND DISCLAIM ALL WARRANTIES AND CONDITIONS WITH REGARD TO THESE MATERIALS, INCLUDING BUT NOT LIMITED TO, ALL IMPLIED WARRANTIES AND CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT OF ANY THIRD PARTY INTELLECTUAL PROPERTY RIGHT. NO LICENSE, EITHER EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, IS GRANTED BY CYPRESS SEMICONDUCTOR. USE OF THE INFORMATION ON THIS SITE MAY REQUIRE A LICENSE FROM A THIRD PARTY, OR A LICENSE FROM CYPRESS SEMICONDUCTOR.

Content on this site may contain or be subject to specific guidelines or limitations on use. All postings and use of the content on this site are subject to the Terms and Conditions of the site; third parties using this content agree to abide by any limitations or guidelines and to comply with the Terms and Conditions of this site. Cypress Semiconductor and its suppliers reserve the right to make corrections, deletions, modifications, enhancements, improvements and other changes to the content and materials, its products, programs and services at any time or to move or discontinue any content, products, programs, or services without notice.