Tutorial: Performance and Time.

Before continuing with the performance tutorial (which I promise will start back up very soon), there is one important topic to discuss: timing routines. In order to understand how changes in code affect performance, you need a metric of some sort (you all know this, you're MacResearchers!). There are a number of methods people use for timing routines, some more portable than others. In this tutorial I'm going to talk about Mach Absolute Time.

Why Mach?

Timing routines vary depending on the time domain you need to measure. In some cases using a function such as clock() or getrusage() and doing some simple math is sufficient (and it's portable). If the timing routine is going to be used outside of your own development setup, then portability is important. Still, I don't use these. Why?

For me, typical issues when tuning code are:

1) I only need the timing routine for immediate testing.
2) I don't like relying on multiple functions to cover different time domains. Their behavior may not be consistent.
3) Sometimes I need a very high resolution timer.

Welcome mach_absolute_time!

mach_absolute_time is a CPU/bus-dependent function that returns a value based on the number of "ticks" since the system started up. It's not well documented. But the lack of concrete documentation shouldn't be a deterrent to using it: its behavior is guaranteed on Mac OS X, AND it covers any time range the system clock covers (from nanoseconds up until the system halts, by differencing two values). Would I use it in production code? Probably not. But for testing, it's a sure thing.

There are a couple of items we need to consider when using mach_absolute_time.

1) How do we get the current Mach Absolute Time?
2) How do we convert it to a meaningful number (since, as I said, it's CPU/bus dependent)?

You can download all of the code for this project here.

Getting the mach_absolute_time

This one is easy:


#include <stdint.h>
#include <mach/mach_time.h>

uint64_t start = mach_absolute_time();
//...the code being timed goes here...
uint64_t stop = mach_absolute_time();

Done. We typically take two values since we want to see the time elapsed.

Converting mach_absolute_time differences to seconds

This one is a bit more complex, because we need to get the system timebase that mach_absolute_time is...well, based on. The code below is derived from some code that Ian Ollmann posted some time back on one of the Apple mailing lists:



#include <stdint.h>
#include <mach/mach_time.h>

//Raw mach_absolute_times going in, difference in seconds out
double subtractTimes( uint64_t endTime, uint64_t startTime )
{
    uint64_t difference = endTime - startTime;
    static double conversion = 0.0;

    if( conversion == 0.0 )
    {
        mach_timebase_info_data_t info;
        kern_return_t err = mach_timebase_info( &info );

        //Convert the timebase into seconds
        if( err == 0 )
            conversion = 1e-9 * (double) info.numer / (double) info.denom;
    }

    return conversion * (double) difference;
}

The most important call here is the call to mach_timebase_info. We pass in a struct that it fills with the timebase values. Once we have the system timebase, we can generate a conversion factor. Normally the conversion is simply the numerator (info.numer) divided by the denominator (info.denom). Here I've also multiplied by 1e-9 to get the result into seconds (something my brain understands a little better). Finally, we take the difference of the two Mach Absolute Times and multiply by our conversion factor to get the true time difference.

Wow. Now you're probably thinking, how is this better than using clock? Seems like a lot of work. Well it is, sort of. That's why it's in a function. We just pass in our values to the function and get the answers back.
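
In its simplest form the call pattern looks like this (a minimal sketch; doSomeWork() is just a placeholder for whatever it is you want to time):

uint64_t start = mach_absolute_time();

doSomeWork();    //the code being timed

uint64_t stop = mach_absolute_time();

double seconds = subtractTimes( stop, start );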

Example

Let's do a quick example. Below is the entire code listing (including the mach function). You can compile it using: gcc mach.c -o mach



#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <mach/mach_time.h>

//Raw mach_absolute_times going in, difference in seconds out
double subtractTimes( uint64_t endTime, uint64_t startTime )
{
    uint64_t difference = endTime - startTime;
    static double conversion = 0.0;

    if( conversion == 0.0 )
    {
        mach_timebase_info_data_t info;
        kern_return_t err = mach_timebase_info( &info );

        //Convert the timebase into seconds
        if( err == 0 )
            conversion = 1e-9 * (double) info.numer / (double) info.denom;
    }

    return conversion * (double) difference;
}

int main() {

    int i, j, count;

    uint64_t start, stop;

    double current = 0.0;
    double answer = 0.0;
    double elapsed = 0.0;

    int dim1 = 256;
    int dim2 = 256;
    int size = 4*dim1*dim2;

    //Allocate some memory and warm it up
    double *array = (double*)malloc(size*sizeof(double));
    for(i=0;i<size;i++) array[i] = (double)i;

    count = 5;
    for(i=0;i<count;i++){
        start = mach_absolute_time();

        //do some work
        for(j=0;j<size;j++){
            answer += sqrt(array[j]);
        }

        stop = mach_absolute_time();
        current = subtractTimes(stop,start);
        printf("Time for iteration: %1.12lf for answer: %1.12lf\n",current, answer);

        elapsed += current;
    }

    printf("\nAverage time in seconds = %1.12lf for answer: %1.12lf\n",elapsed/count,answer);

    free(array);

    return 0;
}

What are we doing here? For this example we have a reasonably sized array of doubles that we populate with some numbers, and we compute the sum of the square roots of those numbers. For testing purposes we iterate over that calculation 5 times. After each iteration we print out the time taken, and when the calculation is done we report the average. On my PowerMac G5 (2.5 GHz) this is what I get.

[bigmac:~/misc] macresearch% gcc mach.c -o mach
[bigmac:~/misc] macresearch% ./mach 
Time for iteration: 0.006717496412 for answer: 89478229.125529855490
Time for iteration: 0.007274204955 for answer: 178956458.251062750816
Time for iteration: 0.006669191332 for answer: 268434687.376589745283
Time for iteration: 0.006953711252 for answer: 357912916.502135872841
Time for iteration: 0.007582157340 for answer: 447391145.627681851387

Average time in seconds = 0.007039352258 for answer: 447391145.627681851387

Note that I turned optimizations off in this case, because compilers have ways of getting around brain-dead loops like this. Again, this is an example. If it were real code, reasonable optimizations would be on.
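
If you do want to leave optimizations on, one common trick is to make the loop's result observable so the compiler can't throw the work away, for example by accumulating into a volatile variable. A minimal sketch of the idea (not something the listing above needs at -O0):

volatile double sink = 0.0;    //stores to a volatile cannot be optimized away

for(j=0;j<size;j++){
    sink += sqrt(array[j]);    //the loop body must now actually execute
}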

Ok... So a couple of points about the example.

First, the dimensions I used were set larger than my caches. I did this on purpose, since typically we need to be aware of behavior in cases where data falls out of cache (as it will here, at least on my system; on a Mac Pro it won't, necessarily). We'll talk more about cache issues in the future. Of course, this is a contrived example, but it's something to always think about.
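
(For reference, the working set here is size = 4 × 256 × 256 = 262,144 doubles, or 2 MB at 8 bytes each; that's bigger than, say, the 512 KB L2 cache on a G5, but small enough to fit in the larger shared L2 of a Mac Pro's Xeons.)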

Second, you'll notice I put in a comment after the memory allocation about warming up the memory. What do I mean?

This is something that typically isn't a concern in the real world, as memory is often already touched by the time you need it. But when putting together smaller test cases to measure function performance it can be a real issue that affects your measurements.

When dynamic memory is allocated, the first time it is accessed the memory manager needs to zero it out (on OS X, regardless of how you allocate dynamically: malloc, calloc, etc., all memory is zeroed before being handed to the user***). The memory is zeroed as a security precaution (you don't want to hand someone memory that contains sensitive information, such as decryption keys, for example).

***I stand corrected on this statement; see Ian Ollmann's note below regarding the zeroing behavior of malloc. As he states, if you always need zeroed memory, use calloc instead of malloc.***

That zeroing process incurs a penalty (and shows up in the system statistics as a zero-fill page fault). So, to make our timings a bit more accurate, we touch that memory once before we start using it, ensuring that the zero-fill penalty isn't paid inside the timed region.
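
If the buffer is large, you don't have to write every element to warm it up; touching one byte per page is enough to trigger the zero-fill faults. A minimal sketch, assuming a 4096-byte page size (getpagesize() will tell you the real value):

//Touch one byte per page so the zero-fill faults happen before the timed region
//(4096 is an assumption; use getpagesize() for the actual page size)
long bytes = size*sizeof(double);
long offset;
for(offset=0;offset<bytes;offset+=4096)
    ((char*)array)[offset] = 0;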

Let's test the effect of that zero-fill penalty. Change the line:

for(i=0;i<size;i++) array[i] = (double)i;

to:

//for(i=0;i<size;i++) array[i] = (double)i;

And rerun the test:

[bigmac:~/misc] macresearch% ./mach
Time for iteration: 0.009478866798 for answer: 0.000000000000
Time for iteration: 0.004756880234 for answer: 0.000000000000
Time for iteration: 0.004927868215 for answer: 0.000000000000
Time for iteration: 0.005227029674 for answer: 0.000000000000
Time for iteration: 0.004891864428 for answer: 0.000000000000

Average time in seconds = 0.005856501870 for answer: 0.000000000000

Note that the first iteration through the loop takes about twice as long as the subsequent ones. Also note that, conveniently, our answers are all zero! Again, the memory has been zeroed out. If we had been handed memory from the heap as-is, we'd have nonsensical numbers for the answers. The fact that the timing of the other iterations went down is due to how the sqrt function is implemented, and is for the moment beyond the scope of this tutorial (we'll get to that kind of stuff later).

One final, but important, point: don't rely on the zeroing behavior for all memory. Indeed, it's entirely possible that memory handed to you from an automatic (stack) allocation will be polluted, as in a case such as:

double array[3][3];

Produces on my system:

-1.99844 -1.29321e-231 -1.99844
-3.30953e-232 -5.31401e+303 0
1.79209e-313 3.3146e-314 0

So be careful!
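
For the curious, output like that can come from a tiny program along these lines (a sketch; reading an uninitialized array is technically undefined behavior, and the values are indeterminate and will differ from run to run):

#include <stdio.h>

int main( void )
{
    //An automatic (stack) array is not zero-initialized; printing it
    //before assignment just shows whatever happened to be on the stack
    double array[3][3];
    int i, j;

    for(i=0;i<3;i++){
        for(j=0;j<3;j++)
            printf("%g ", array[i][j]);
        printf("\n");
    }

    return 0;
}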

Comments

malloc does not always zero fill

> on OS X regardless of how you allocate dynamically: malloc, calloc etc... all memory is zeroed before being handed to the user

Absolutely not true.

If you need zero filled blocks, you should use calloc. The default small allocator for malloc does not zero fill blocks. The large allocator (from which the small allocator originally gets its memory) does, but the small allocator is free to hand back freed small blocks without zeroing them. Note that the position of the cutoff between small and large allocator on MacOS X can and will change. It should not be relied on.

In addition, users can setenv MallocPreScribble to fill blocks with non-zero bytes. man malloc for more info.

For example:

[ian-ollmanns-computer:/tmp] iano% cat main.c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main( void )
{
    char *str = malloc(32);
    strcpy( str, "I am not zero!" );
    free( str );

    str = malloc( 32 );

    printf( "%s\n", str );

    return 0;
}
[ian-ollmanns-computer:/tmp] iano% gcc main.c
[ian-ollmanns-computer:/tmp] iano% ./a.out
I am not zero!

timing methodology

It is my opinion that the most accurate timing routines do the following:

o Test segments of code that take 1-100 microseconds to execute.

o Take the *best* time out of a large number of iterations (>1000)

o Really small functions might be called a large fixed number of times between calls to mach_absolute_time(), such that the overall time for N iterations over the small function is 1-100 microseconds.

o Subtract out timer latency, which should be measured the same way as the function time (write the same timing loop, just don't call the function).

This should be sufficient to gauge performance of core routines to within a fraction of a cycle under optimal conditions. (In my opinion, in the code David shows above, the variable "size" is a couple of orders of magnitude too large.)
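
Put together, a timing harness along those lines might look like the sketch below. It reuses the same timebase conversion as subtractTimes above; the routine under test (tinyFunction), the trial counts, and the volatile sink are just placeholders for illustration:

#include <stdio.h>
#include <stdint.h>
#include <mach/mach_time.h>

//Same timebase conversion as subtractTimes above, applied to a raw tick count
double ticksToSeconds( uint64_t ticks )
{
    static double conversion = 0.0;

    if( conversion == 0.0 )
    {
        mach_timebase_info_data_t info;
        if( mach_timebase_info( &info ) == 0 )
            conversion = 1e-9 * (double) info.numer / (double) info.denom;
    }

    return conversion * (double) ticks;
}

//A stand-in for the routine you actually want to measure
double tinyFunction( double x ) { return x*x + 1.0; }

int main( void )
{
    int trials = 1000;               //many samples, keep only the best
    int inner = 10000;               //enough calls per sample to land in the 1-100 us range
    uint64_t bestWork = UINT64_MAX;
    uint64_t bestEmpty = UINT64_MAX;
    uint64_t start, stop;
    volatile double sink = 0.0;      //keeps the work from being optimized away
    double perCall;
    int t, i;

    for( t = 0; t < trials; t++ )
    {
        //Timer latency: the same loop, just without the call to the function
        //(an aggressive optimizer may remove the empty loop, in which case this
        //measures back-to-back timer calls, which is still a usable baseline)
        start = mach_absolute_time();
        for( i = 0; i < inner; i++ )
            ;
        stop = mach_absolute_time();
        if( stop - start < bestEmpty ) bestEmpty = stop - start;

        //The measured loop
        start = mach_absolute_time();
        for( i = 0; i < inner; i++ )
            sink += tinyFunction( (double) i );
        stop = mach_absolute_time();
        if( stop - start < bestWork ) bestWork = stop - start;
    }

    perCall = ( ticksToSeconds( bestWork ) - ticksToSeconds( bestEmpty ) ) / inner;
    printf( "best time per call: %.3e seconds\n", perCall );

    return 0;
}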

Why 1-100 microseconds? mach_absolute_time() does not take into account time lost due to the scheduler stopping your process so another process can run. (getrusage() will do that, but its timing granularity is too coarse for many purposes.) You will end up randomly adding time taken by completely unrelated things, like Mail.app checking your e-mail, onto your HPC workload timings, which is clearly not what you want. To avoid that, you want to time a complete end-to-end calculation over a small enough time period that you are unlikely to be interrupted by other things like Mail.app, video updates, the audio threads, or some hardware device. One usually tries to keep the measured interval above 1 microsecond so that the precision of the fundamental timer is not an issue.

Why the best time? Sh-t happens. A lot. If you try to look at average time or worst time you're just going to be measuring noise due to scheduler interruption. A lot of people will spout lots of religious this and that about how you should measure total system performance. It's complete bunk. Ignore them. That stuff is only important if you are writing high level code that is calling through 12 levels of frameworks. If you are writing your own bare metal arithmetic for scientific computation, likely the only things you need to worry about are hard CPU performance (which this measures) and VM/memory effects (which this purposely does not measure).

Why? If you actually take the time to look at the complete distribution of times from your runs (over a lot of samples, suitably short so that scheduler interruption is unlikely) what you will see is that the vast majority of times are all very, very similar, near the minimum time. Then there will be a very small fraction of times that are way larger, of the sort that if you were doing empirical work, you'd just toss out with a Q-test. News flash: You *are* doing empirical work, which is why those bogusly large data points should be tossed out. It is very important that you do so, because usually they take so much more time than the other samples that even one or two of the corrupted runs will skew an average to significantly off what the true time should be for that routine.

You can use the minimum time as a high-precision estimate of the speed of the core of your routine. Tune your code against that number (i.e., drive the best time down).

Next, we move on to fixing real-world performance, to satisfy our real-world critics out there. This is when we run the cold-cache, oversized-data-set, etc. cases. Likely very little that you do here to fix these problems will be obviously successful, because mostly what these measurements measure is noise. Did it help? Did it not help? Run it three times. Maybe it did? :-} On older machines (G4) you may find that cache prefetch hints work, giving up to a 30% speed boost. If the disk is a problem, then some sort of asynchronous preloading of data is an excellent idea. As always, high-level algorithmic improvements will help here too.

When you are done with those you are done. Go on to the next routine that Shark flags. Don't waste time worrying about what a cache miss is going to do to your average time. The correct solution is to do your best to organize your code such that you avoid the cache miss, not agonize over the religious purity of average times vs. best times. Shark can sample on cache misses. Those are pretty easy to spot. You only need to worry about the ones that correlate with routines that also use a lot of CPU time.