I have designed three residue division algorithms. Two are exact integer division algorithms, and one is an approximate algorithm. The first integer algorithm is fairly simple. One translates a residue dividend and a residue divisor into floating point, divides the floating point, and achieves a first approximation to the answer. Since the floating point system we are working with has a small mantissa, this floating point division can be done very rapidly. Then the floating point approximation is translated into residue, multiplied by the divisor, and the result subtracted from the dividend to compute an error term. This error term is translated into floating point and divided in floating point by the floating point divisor to produce a floating point correction factor. This procedure is continued until it converges, which it does fairly rapidly since we have a system that can translate small residue numbers more rapidly than large ones, and with every iteration of the process, the numbers to be translated get smaller.
I know that this sounds complicated and it is an iterative process, but simulations that I have written indicate that it can be carried out in about the same time as polynomial division. We should remember that polynomial division is also an iterative process, although people who just use division algorithms and don't design them can come to view them as a monolithic entity provided by the processor chip or resident software.
The second division algorithm is an extension of the first, but is a little bit faster. It operates only on a small subset of the digits of the divisor at a time, and a complete description of the process is beyond the scope of this paper.
It is the third division algorithm that seems to me to hold the most promise for practical computations, because it is substantially faster than polynomial algorithms. Simply put, it is only really a form of scaling. In binary, it is possible to perform scaling using a simple shift register. If we scale the number down, dividing it by two, then we lose information, but we can extend the range of the arithmetic process by keeping track in another register of the amount of scaling we have done. This, of course is the essence of the floating point idea. We can also think of it as simplified form of rational arithmetic.
It is well known in the residue arithmetic community that it is also possible to scale residue numbers. The process is not as simple as the binary process, but then it is only a little more complicated than residue subtraction in terms of the time that it takes to scale one residue digit. I may be the first author to suggest scaling both the numerator and the denominator of a residue fraction. In residue arithmetic as in binary, scaling is a form of division. In residue arithmetic, if it is known that a divisor is evenly divisible by its dividend, then the division can be carried out very rapidly. This is also the basis of a well know algorithm for translating residue numbers into polynomial, and thus at the same time that the residue computer is scaling a rational residue number, it can be "refreshing" the floating point estimates of both the numerator and the denominator to aid in the detection of future overflows and so fourth.
The beauty of this scheme is that the above is not the division algorithm. It is only what happens when there is an arithmetic overflow, and though it is quite a bit simpler than a polynomial division algorithm, it does not have to be done every time the residue rational computer does a divide. The division algorithm itself consist of only two simple operations, invert and multiply, just like in grade school arithmetic. Both of these residue operations can be performed in less time than it takes a polynomial computer to do an add.
It is important to understand here that we have stood the conventional wisdom on its head. The conventional wisdom is that residue divide is far and away more time consuming than polynomial divide and yet here we have an algorithm that does a residue divide faster than a polynomial computer can do a simple add. It seems to me that this has not been realized by the residue arithmetic community, much less the general computer science community. If there is something wrong with my reasoning, I need some more intensive and frank peer review than I have been getting. If my reasoning is correct, then we should be building the residue rational supercomputer.