I've got more or less 37 years of experience in x86 assembly and the C language. The way I see it, C is nothing more than "high level" assembly and, to me, it makes perfect sense to develop assembly routines using C.
Nowadays, good C compilers (GCC, Clang, Intel C++, but not MSVC++!), using optimization options, generate very good code indeed, taking advantage of lots of characteristics of your processor, avoiding branch mispredictions, cache misses and other esoteric things.
Here's an example using the famous itoa() function (which is not part of the ISO 9899 standard), modified to use base 10 only. I'll show you 4 ways to do it.
First, itoa1.c writes each digit to a local buffer and, at the end, copies the string in reverse to the buffer pointed to by p:
// itoa1.c
char *itoa10( char *p, int x )
{
  char buffer[12];
  char *q, *r;
  _Bool negative;

  r = p;              // We'll need this copy later.
  q = buffer;         // Pointer to write individual digits in ASCII.
  negative = x < 0;   // Flag: is x negative?

  // Calculate the absolute value of x.
  // FIXME: There's a problem here (-x overflows when x is INT_MIN).
  if ( negative )
    x = -x;

  // Convert each decimal digit, forward (least significant first).
  do
    *q++ = '0' + x % 10;
  while ( x /= 10 );

  // Put an extra '-' if x is negative.
  if ( negative )
    *q++ = '-';
  *q-- = '\0';

  // Copy the string in reverse to the buffer pointed to by p.
  while ( q >= buffer )
    *p++ = *q--;
  *p = '\0';

  return r;
}
Why 12 chars in the local buffer? Because, with a 32-bit int, INT_MIN is -2147483648 (11 chars), plus the '\0' at the end of the string.
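The 12-byte figure can be checked on a given platform. This is just a minimal sketch, and the helper name itoa10_bufsize() is hypothetical (it isn't part of the routines shown here): it asks snprintf() how many chars INT_MIN needs and adds one for the '\0'.

```c
#include <limits.h>   // INT_MIN
#include <stdio.h>    // snprintf()

// Hypothetical helper: bytes needed to hold any 'int' written in
// base 10, including the sign and the final '\0'.
static int itoa10_bufsize( void )
{
  // With a NULL buffer and size 0, snprintf() writes nothing and
  // returns the length the output would have (without the '\0').
  return snprintf( NULL, 0, "%d", INT_MIN ) + 1;
}
```

With a 32-bit int this gives 12, which matches the size of the local buffer.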
The itoa1.c version has three problems. I left the first one in on purpose, to show that we have to be careful when writing any routine: if x is INT_MIN, there is no way to negate it in 2's complement, because -INT_MIN doesn't fit in an int. This can be fixed by copying x to a wider type (such as long long int) and then negating the copy, if necessary. The second problem is the string reversal, copying the local buffer to the buffer pointed to by the argument. The third problem is the need for a local buffer at all! It is unnecessary, as shown below, in a better implementation:
// itoa2.c
#include <stdlib.h>   // llabs()
#include <string.h>   // memmove()

// p must point to a buffer of, at least, 12 chars.
char *itoa10( char *p, int x )
{
  // Since -INT_MIN cannot be represented in an 'int', we use a
  // wider type to hold the value (long long int has, at least, 64 bits).
  long long int n;
  char *q, *r;
  _Bool negative;

  negative = x < 0;
  n = llabs( x );

  // We'll need r later to calculate the string length.
  // 12 is used here because INT_MIN is "-2147483648"
  // (11 chars), plus the extra '\0'.
  // Here the pointers point 1 char after the end of the
  // buffer.
  q = r = p + 12;
  *--q = '\0';

  // Convert each decimal digit, backwards.
  do
    *--q = '0' + n % 10;
  while ( n /= 10 );

  // Put an extra '-' if x is less than zero.
  if ( negative )
    *--q = '-';

  // q points to the first char we have in the buffer.
  // Copy the converted string to the beginning of the target buffer.
  // memmove() is used because source and destination may overlap;
  // there are, at least, 2 bytes to move (one digit plus '\0').
  memmove( p, q, r - q );

  return p;
}
Here we got rid of the local buffer, using the argument p as the target buffer, and we got rid of the string reversal as well. But we still have one loop and one "movement" of bytes in the target buffer. The final movement is needed because the buffer pointed to by p must begin with the converted string. The clock cycles spent in memmove() depend on the r - q bytes moved.
This seems to be better than the previous one, but we can "improve" it by calculating how many chars will be in the final buffer. We can do this using the base-10 logarithm of the absolute value of x (if x != 0). This should improve the routine, since we get rid of the final copy, but the tests needed by a modified version of ilog10() will waste, more or less, the same number of clock cycles as the final movement. This third routine can look like this:
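One consequence of writing the digits at the end of the target buffer: the caller must always supply 12 bytes, even for a one-digit value. The sketch below repeats the itoa2.c logic in compact form, only so it compiles on its own, and adds a hypothetical caller, itoa10_len(), that isn't part of the original code.

```c
#include <stdlib.h>   // llabs()
#include <string.h>   // memmove(), strlen()

// Same logic as itoa2.c, repeated so this sketch is self-contained.
char *itoa10( char *p, int x )
{
  long long int n = llabs( x );
  char *q, *r;

  q = r = p + 12;       // 1 char after the end of the 12-byte buffer.
  *--q = '\0';
  do
    *--q = '0' + n % 10;
  while ( n /= 10 );
  if ( x < 0 )
    *--q = '-';
  memmove( p, q, r - q );
  return p;
}

// Hypothetical caller: length of the converted string.
// The buffer has 12 bytes even if the result is short, because the
// digits are first written at the end of the buffer.
int itoa10_len( int x )
{
  char buf[12];
  return (int) strlen( itoa10( buf, x ) );
}
```

Passing a smaller buffer (say, char buf[4] for small values) would make the routine write past its end, even though the final string would fit.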
// itoa3.c
#include <stdlib.h>   // llabs()

// Modified log10 for integers.
// Used to get the number of chars in the buffer.
static int ilog10_( unsigned int x )
{
  static const unsigned int v[] =
    { 1000000000U, 100000000U, 10000000U, 1000000U, 100000U, 10000U, 1000U, 100U, 10U };

  /* ilog10_(0) doesn't exist!
     This is checked in the itoa10() routine. */
  //if ( ! x )
  //  return -1;

  for ( int i = 0; i < (int)( sizeof v / sizeof v[0] ); i++ )
    if ( x >= v[i] )
      return 9 - i;

  return 0;
}

char *itoa10( char *p, int x )
{
  // Since -INT_MIN cannot be represented in an 'int', we use a
  // wider type to hold the value (long long int has, at least, 64 bits).
  long long int n;
  char *q;
  _Bool negative;

  negative = x < 0;
  n = llabs( x );

  // We have, at least, 2 chars occupied in the buffer:
  // '0' and '\0'.
  q = p + 2;

  // ilog10_() isn't defined for 0, so the test is necessary.
  if ( x )
    q += ilog10_( (unsigned int) n ) + negative;
  *--q = '\0';

  // Convert each decimal digit, backwards.
  do
    *--q = '0' + n % 10;
  while ( n /= 10 );

  // Put an extra '-' if x is less than zero.
  if ( negative )
    *--q = '-';

  return p;
}
We can tweak ilog10_() to improve the timing if, most of the time, we'll convert small values.
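One possible tweak, as a sketch only: test the thresholds in ascending order, so small values leave the loop after one or two comparisons. The name ilog10_small_first() is hypothetical, and I haven't measured whether this really helps.

```c
// Hypothetical variant of ilog10_(): thresholds in ascending order,
// so small values need fewer comparisons than big ones.
// x == 0 yields 0 here; as before, the caller should test for zero.
static int ilog10_small_first( unsigned int x )
{
  static const unsigned int v[] =
    { 10U, 100U, 1000U, 10000U, 100000U, 1000000U,
      10000000U, 100000000U, 1000000000U };

  for ( unsigned int i = 0; i < sizeof v / sizeof v[0]; i++ )
    if ( x < v[i] )
      return (int) i;

  return 9;   // x >= 1000000000
}
```

The price is that big values now need up to 9 comparisons, so this only pays off if small values really dominate.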
But I think the itoa2.c version is better than itoa3.c in terms of performance (must measure!).
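To actually measure, something like the sketch below could be a starting point. The names itoa10_v2(), itoa10_v3() and time_itoa10() are hypothetical: the first two are compact copies of the itoa2.c and itoa3.c approaches, and the harness uses the standard clock() function, which gives CPU time with coarse resolution, not clock cycles.

```c
#include <stdlib.h>   // llabs()
#include <string.h>   // memmove()
#include <time.h>     // clock(), CLOCKS_PER_SEC

// itoa2.c approach: convert at the end of the buffer, then memmove().
static char *itoa10_v2( char *p, int x )
{
  long long int n = llabs( x );
  char *q, *r;

  q = r = p + 12;
  *--q = '\0';
  do
    *--q = '0' + n % 10;
  while ( n /= 10 );
  if ( x < 0 )
    *--q = '-';
  memmove( p, q, r - q );
  return p;
}

// itoa3.c approach: find the length first, no final copy.
static char *itoa10_v3( char *p, int x )
{
  static const unsigned int v[] =
    { 1000000000U, 100000000U, 10000000U, 1000000U, 100000U,
      10000U, 1000U, 100U, 10U };
  long long int n = llabs( x );
  char *q = p + 2 + ( x < 0 );

  for ( unsigned int i = 0; i < sizeof v / sizeof v[0]; i++ )
    if ( n >= v[i] )
    {
      q += 9 - i;
      break;
    }
  *--q = '\0';
  do
    *--q = '0' + n % 10;
  while ( n /= 10 );
  if ( x < 0 )
    *--q = '-';
  return p;
}

// Hypothetical harness: CPU seconds spent converting 'count'
// consecutive values starting at 'first'. The checksum is there so
// the compiler cannot throw the calls away.
static double time_itoa10( char *(*f)( char *, int ),
                           int first, long count,
                           unsigned int *checksum )
{
  char buf[12];
  int x = first;
  clock_t t0 = clock();

  for ( long i = 0; i < count; i++, x++ )
    *checksum += (unsigned char) f( buf, x )[0];

  return (double)( clock() - t0 ) / CLOCKS_PER_SEC;
}
```

Calling time_itoa10( itoa10_v2, -1000000, 2000000, &sum ) and the same for itoa10_v3 gives two numbers to compare. The result depends on the compiler, the optimization options and the range of values, so take it as a hint only.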