Rates, normalizing and consolidating

Index

Intro
Transform to a rate
About rate and time
Normalize intervals
Consolidate intervals
Wrapup

Intro

RRDtool stores rates during time intervals. These time intervals are on well defined boundaries in time. However, your input is not always a rate and will most likely not be on such boundaries. This means your input needs to be modified. This page explains how it works.

A couple of different stages should be recognized:

- transforming the input into a rate
- normalizing that rate onto well defined boundaries in time
- consolidating the normalized rates into the archives

This does not get in your way, it is not doing bad things to your data. It is how RRDtool works, by design. If you do not want this behaviour, you should be looking at (for instance) MySQL and a drawing program.

All stages apply to all input, no exceptions. There is no short circuit. After transforming your input into a rate, normalization occurs. After normalization, consolidation occurs. All three stages can be a no-op if you carefully set up your database, but making one stage a no-op does not mean the other stages are skipped.

If you use GAUGE, the input is already a rate, but it is still subject to normalization. If you enter the data exactly at the boundaries normalization is looking for, your input is still subject to consolidation.

Transform to a rate

Everything is processed as a rate. This doesn't mean you cannot work with temperatures; just remember that they, too, are processed as if they were rates.

There are several ways for RRDtool to get a rate from its input:

- GAUGE: the input is already a rate and is used as it is
- COUNTER: the difference between this value and the previous value is divided by the amount of time between the two updates (counter wrap-arounds are detected and corrected)
- DERIVE: like COUNTER, but without wrap detection, so the resulting rate can be negative
- ABSOLUTE: the input itself (not a difference) is divided by the amount of time since the previous update

In each of these four cases, the result is a rate. This rate is valid between the previous call to RRDtool and the current one. RRDtool does not need to know anything about the input anymore; it has a start time, an end time and a rate.

This concludes step 1. The data is now a rate, no matter what data source type you use. From this moment on, RRDtool neither knows nor cares what kind of data source type you used.
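For the curious, here is a minimal Python sketch of how a COUNTER style input becomes a rate. It is not RRDtool's actual code; the function name and the 32-bit wrap value are assumptions made for this example only.

def counter_to_rate(prev_time, prev_value, cur_time, cur_value, wrap=2**32):
    """Return the rate valid between the previous update and this one."""
    delta = cur_value - prev_value
    if delta < 0:                  # the counter wrapped around its maximum
        delta += wrap              # assumes a 32-bit counter
    return delta / (cur_time - prev_time)

# Two readings, 60 seconds apart, 3600 bytes counted in between:
print(counter_to_rate(1000000000, 500, 1000000060, 4100))   # 60.0 bytes per second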

About rate and time

If you transfer something at 60 bytes per second during 1 second, you could transfer the same amount of data at 30 bytes per second during 2 seconds, or at 20 bytes per second during 3 seconds, or at 15 bytes per second during 4 seconds, et cetera.

These numbers are all different, yet they have one thing in common: rate multiplied by time is a constant. In the picture, it is the surface that is important, not the width or the height. This is because we look at the amount of data, not at its rate or its time. Why this is important follows later.
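In code, the invariant looks like this (a trivial Python sketch using the numbers above):

# The same amount of data expressed as different (rate, time) pairs:
# rate multiplied by time stays constant.
for rate, seconds in [(60, 1), (30, 2), (20, 3), (15, 4)]:
    print("%2d bytes/s during %d s = %d bytes" % (rate, seconds, rate * seconds))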

Normalize intervals

The input is now a rate, but it is not yet on well defined boundaries in time. This is where normalization kicks in. Suppose you are looking at a counter every minute. What do you know? You know counter values at MM:SS. You don't know whether the counter incremented at a high rate during a small amount of time (1 second at 60 bytes per second) or at a small rate during a long time (60 seconds at 1 byte per second). Look at the picture above again: each MM:SS will fall somewhere in the white areas.

This means that the rate you think you know isn't the real rate at all! In this example, you only know that you transferred 60 bytes in 60 seconds, somewhere between one MM:SS and the next. The computed rate will be 1 byte per second for each interval of 60 seconds. Let me emphasize that: YOU DO NOT KNOW THE REAL RATE, only an approximation.

Now look at the next image, showing some measured intervals and rates. The samples are taken at 30 seconds past the minute; each colored area represents another measurement. There are four measured intervals. The last one has a rate of zero, but it is known (i.e. the last update occurred at 04:30). One expected update, the one at 02:30, did not happen. RRDtool can cope with this perfectly well if you let it: the update simply happens at 03:30 and is valid from 01:30 onwards. This is governed by the heartbeat setting.

The bottom part of the image is the result after normalization. It shows that each interval uses a bit of each input interval. The first interval is built from the blue interval (which started before 00:00) and the red interval (measured between 00:30 and 01:30). Only the part of the blue area that falls inside the 00:00 to 01:00 interval is used, and only the part of the red area that falls inside that interval is used. The same goes for the other intervals.

Notice that it is the areas that matter here. A well defined part of the blue area is used (in this example exactly half) and a well defined part of the red area is used (ditto). Both represent bytes transferred during an interval. In this example we use half of each interval, so we get half of the amount of bytes transferred. The new interval, the one created in the normalization process, has a surface that is exactly the sum of those two amounts. Its time is known: it is a fixed amount of time, the step size you specified for your database. Its rate is its area divided by this amount of time.

If you think it isn't right to shift data around like this, think again. Look at the red interval. You know something happened between 00:30 and 01:30. You know the amount of data transferred, but you do NOT know WHEN. It could be that all of it was transferred in the first half of that interval; it could also be that all of it was transferred in the last half (in both cases the real rate would be twice as high as you measured!). It is perfectly reasonable to divide the transfer like we did. You still don't know whether it is true or not. In the long run it doesn't make a difference: the data was transferred and we know about it.

The rates are now normalized. It is these rates that RRDtool works with. Notice that the second and fourth normalized rates (the mixture of red and green, and the mixture of green and white) are lower than the green rate. This is important when you look at maximum rates seen. But as both the red and green rates are averages themselves, the mixture is just as valid as its sources.
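If you prefer code over pictures, here is a rough Python sketch of the normalization idea. It is not RRDtool's implementation: it assumes a single data source, a step of 60 seconds, no unknown data and a heartbeat that is never exceeded, and the rates assigned to the colored intervals are made-up example numbers.

STEP = 60  # seconds per primary data point (the database step)

def normalize(updates, step=STEP):
    """updates: list of (start, end, rate) tuples covering contiguous time.
    Returns a dict mapping each step boundary to its normalized rate."""
    areas = {}
    for start, end, rate in updates:
        t = start
        while t < end:
            bucket = t - (t % step)              # a well defined boundary in time
            slice_end = min(end, bucket + step)  # the part that falls inside this bucket
            # amount of data (rate times time) contributed to this bucket
            areas[bucket] = areas.get(bucket, 0.0) + rate * (slice_end - t)
            t = slice_end
    return {bucket: area / step for bucket, area in sorted(areas.items())}

# Samples taken at 30 seconds past the minute; times are in seconds, 0 is "00:00".
updates = [(-30,  30, 1.0),   # blue:  started before 00:00, updated at 00:30
           ( 30,  90, 2.0),   # red:   00:30 .. 01:30
           ( 90, 210, 4.0),   # green: 01:30 .. 03:30 (the 02:30 update was missed)
           (210, 270, 0.0)]   # last interval: rate zero, updated at 04:30
pdps = normalize(updates)
for start in range(0, 240, STEP):
    print("%3d .. %3d seconds: %.1f" % (start, start + STEP, pdps[start]))

The second and fourth printed rates are mixtures and therefore lower than the green rate, just as described above.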

Each normalized rate is valid during a fixed amount of time. Together these are called Primary Data Points (PDPs). Each PDP is valid during one step. RRDtool no longer knows or cares about the input you gave it. This concludes step 2. From now on, RRDtool forgets all original input.

Even if normalization is a no-op (because you made sure your timestamps are already on well defined boundaries), consolidation still applies.

Consolidate intervals

Suppose you are going to present your data as an image. You want to see ten days of data in one graph. If each PDP is one minute wide, you need 10*24*60 PDPs (10 days of 24 hours of 60 minutes). 14400 PDPs is a lot, especially if your image is only going to be 360 pixels wide. The only way to show your data is to take several PDPs together and display them as one pixel column. In this case you need 40 PDPs at a time, for each of the 360 columns, to get a total of ten days. How those 40 PDPs are combined into one is called consolidation, and it can be done in several ways:

- AVERAGE: the average of the PDPs
- MIN: the lowest of the PDPs
- MAX: the highest of the PDPs
- LAST: the last (most recent) of the PDPs

Which function you are going to use depends on your goal. Sometimes you want to see averages, for instance to look at the amount of data transferred. Sometimes you want to see maxima, to spot periods of congestion, et cetera.

Whichever function you use, it is going to take time to compute the results. 40 times 360 is not a lot but consider what's going to happen if you look at larger amounts of time (such as several years). It would mean you have to wait for the image to be generated.

This is also covered by RRDtool, but it requires some planning ahead. In this example, you are going to use 40 PDPs at a time. Other setups would use other amounts of PDPs, but you can know up front what those amounts are going to be. Instead of doing the calculations at graph time, RRDtool can do them at monitoring time. Each time a series of 40 PDPs is known, RRDtool consolidates them and stores the result as a Consolidated Data Point (CDP). It is these CDPs that are stored in the database. Even if no consolidation is required, you are going to "consolidate" one PDP into one CDP.
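A small Python sketch of that idea (illustrative names only, not RRDtool's code): every group of "steps" PDPs is reduced to one CDP by a consolidation function.

def consolidate(pdps, steps, cf):
    """pdps: list of rates, one per step. cf: a function such as the built-in
    max or min, or an average. Returns the list of CDPs."""
    cdps = []
    for i in range(0, len(pdps) - steps + 1, steps):
        cdps.append(cf(pdps[i:i + steps]))
    return cdps

def average(values):
    return sum(values) / len(values)

pdps = [1.5, 3.0, 4.0, 2.0, 0.0, 0.0, 1.0, 2.0]   # eight one-minute PDPs
print(consolidate(pdps, 4, average))   # [2.625, 0.75]  -> an AVERAGE RRA with steps=4
print(consolidate(pdps, 4, max))       # [4.0, 2.0]     -> a MAX RRA with steps=4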

The created CDPs are stored in a Round Robin Archive (RRA). It is perfectly legal, even quite wise, to have multiple RRAs per database. You could create an RRA with one PDP per CDP, an RRA with 4 PDPs per CDP (360 of these would cover one day), an RRA with 40 PDPs per CDP (as discussed above) and so on. This is the steps parameter as discussed in the rrdtool documentation (steps, not step). For each of the RRAs you can determine how many CDPs are stored. This is the rows parameter.
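If you want to check how much time such an RRA spans, it is simply step times steps times rows seconds. A tiny Python sketch (the row counts below are example choices, not prescribed values):

step = 60                                   # seconds per PDP
for steps, rows in [(1, 1440), (4, 360), (40, 360)]:
    hours = step * steps * rows / 3600
    print("steps=%2d rows=%4d -> %6.1f hours" % (steps, rows, hours))
# 1440 rows of single PDPs and 360 rows of 4-PDP CDPs each cover one day;
# 360 rows of 40-PDP CDPs cover the ten days from the example.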

RRDtool doesn't take some random PDPs to generate a CDP. Each interval in RRDtool starts and ends at a whole multiple of a certain amount of time. For CDPs, this amount of time is "step times steps". Times are in seconds since the UNIX epoch. If you need boundaries at midnight local time, don't make the mistake of specifying 86400 seconds per CDP (1440 PDPs in our example). This will most likely not work, unless you live in Iceland or another country that has no time offset from GMT, not even in summer. If you don't understand why this is, don't bother arguing that RRDtool should change.
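A quick Python sketch shows why. CDP boundaries are whole multiples of "step times steps", counted from the UNIX epoch, so an 86400-second CDP starts at midnight UTC, which for most people is not local midnight.

import time

now = time.time()
cdp_width = 86400                       # step (60 s) times steps (1440)
boundary = now - (now % cdp_width)      # start of the current CDP interval
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(boundary)))     # always 00:00:00 UTC
print(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(boundary)))  # usually not local midnight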

Wrapup

This concludes this explanation of how RRDtool stores data in its RRDs. There is of course much more to it. Have a look at the documentation for rrdtool create and notice how it can deal with unknown input (both in the PDP and CDP generating stages). How the stored data is used is not explained here; look at rrdtool graph and/or rrdtool fetch for that. The important thing to remember is that data representation works with CDPs inside RRAs. I hope this document aided your understanding of how these are generated.

Do you like this information? Tell others! Don't you? Tell me!

This page was created by Alex van den Bogaerdt, an independent IT consultant. If you want to provide feedback, or if you want to hire me, please see the contact page.