This Technical White Paper explores a number of business and technical issues that arise in the data warehousing environment when aggregate data is missing or incomplete. The paper looks at some real-world system implementations that illustrate the complexities of dealing with aggregate data when building data warehouses and decision support systems based on operational data. The need to differentiate between null and zero is shown to be critical to the successful interpretation of data. Finally, the paper offers a design solution for differentiating between null and zero values of the measures in multi-dimensional records.
The concept of zero (a symbol and a number) has been known to people for quite some time. The zero symbol is used in positional notation to denote an “empty spot”, and the number zero is used to represent “nothing” in mathematical calculations (as in “1–1=0”). The term “null” has also long been used in physics and mathematics in a way equivalent to the number zero. The idea of a “null value”, however, is a fairly recent invention, used in computer programming and in SQL database management systems (DBMS).
Assigning the NULL constant as the value of a pointer (in programming languages such as C and C++) indicates that the pointer does not point to any value. Similarly, a NULL value in a field of a DBMS table indicates that the value is “not specified”. This can be a result of the value being “unknown”, “missing”, or “not applicable”.
This paper will concentrate on the use of zero and NULL values in SQL databases, and the issues arising from not treating them properly in data warehousing and decision support applications.
On a very basic level, the difference between an “operational” database and a “data warehouse” database is in the way the database is intended to be used. The purpose of the operational database is to quickly store, retrieve, edit, and delete data as required in the daily operation of the business. The purpose of the data warehouse (or its smaller version, called a “data mart”) is to store large amounts of historical information and to support analytical processing of this data. Data warehouses are populated either with raw operational data or with some form of aggregate extracted from the operational data.
Both the operational and data warehouse databases can be implemented using relational database management systems (RDBMS) such as Oracle, SQL Server, DB2, MySQL, etc. These RDBMS provide the database designer with the ability to allow for certain columns in the tables to be “nullable”, i.e., to be left empty when populating the data in the records of the table. This provides an easy way of making certain information in the database “optional”.
Nullable fields are widely used in the operational databases. For example, any system collecting personal information, such as address, would allow for the apartment number to be a nullable field to accommodate people living in houses. In general, nullable fields offer the flexibility required by the operational system to easily capture, update, and delete optional information in a consistent manner, without affecting the integrity of the rest of the data.
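The apartment-number case above can be sketched in a few lines. The schema and values are hypothetical; sqlite3 stands in for whatever RDBMS the operational system actually uses, and SQL NULL surfaces in Python as None:

```python
import sqlite3

# Hypothetical operational table: the apartment number is nullable,
# so customers living in houses can simply omit it.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        id     INTEGER PRIMARY KEY,
        street TEXT NOT NULL,
        apt_no TEXT              -- nullable: optional information
    )
""")
conn.execute("INSERT INTO customer (street, apt_no) VALUES ('12 Oak Ave', NULL)")
conn.execute("INSERT INTO customer (street, apt_no) VALUES ('3 Elm St', '4B')")

# SQL NULL comes back as Python None when fetched.
rows = conn.execute("SELECT street, apt_no FROM customer ORDER BY id").fetchall()
print(rows)  # [('12 Oak Ave', None), ('3 Elm St', '4B')]
```

The NOT NULL constraint on the street keeps the mandatory data intact, while the nullable column absorbs the optional information without any special casing in the application.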
The use of nullable fields in the data warehousing environment is less straightforward. When the data warehouse is used simply to archive operational data, the same nullability principles apply as in the operational database; this is usually not the case, however, when the data warehouse is built using aggregate data. The reason is that when the data in the data warehouse is used for analysis, it is important to assign some meaning to the NULL values in the fields. For example, when a retail store collects customers’ postal code information which is later used for analysis of the customer base, all the transactions where the postal code field is NULL will get recorded in the “Declined to answer” category. It is important to note that, as in this example, in the data warehouse it is usually important to know why the field was populated with a NULL value. In general, the interpretation of NULL values will be different for each nullable field, depending on the nature of the data being captured in the operational system and on the nature of the analytical processing that the data warehouse is designed to support.
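The postal-code example might be handled in the extract with a COALESCE, folding the NULLs into an explicit category so the warehouse carries a meaning rather than a gap (table and category names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, postal_code TEXT)")
conn.executemany("INSERT INTO txn (postal_code) VALUES (?)",
                 [("M5V 2T6",), (None,), ("K1A 0B1",), (None,)])

# During extraction, NULL postal codes become an explicit category,
# so every transaction lands in some reportable bucket.
rows = conn.execute("""
    SELECT COALESCE(postal_code, 'Declined to answer') AS category,
           COUNT(*) AS transactions
    FROM txn
    GROUP BY category
    ORDER BY category
""").fetchall()
print(rows)  # [('Declined to answer', 2), ('K1A 0B1', 1), ('M5V 2T6', 1)]
```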
Since there are many different data warehousing data models, it is not practical to consider every variation. We will, therefore, concentrate on a generic concept of a “dimensional” data warehouse (which can be implemented in a relational database using a star or a snowflake schema). In a dimensional data model the data is represented in a form of “cubes”, each cube containing a number of “dimensions” and “measures”. For example, an airline company data warehouse might contain some cubes representing ticket sales statistics, employee salary statistics, corporate budget and spending statistics, demographic information about the customers, etc. The ticket sales cube may have the dimensions reflecting fare class, airplane type, departure airport, destination airport, and so on. The measures in this cube will reflect number of seats booked, number of cancellations, total sales dollars, refund dollars, etc. In general, cube dimensions represent “properties” of the data, whereas measures represent the actual numeric data, such as number of people, dollars, etc. The use of NULL values in cube dimensions differs from the use of NULL values in measures.
Normally, there will never be a NULL value in a data warehouse cube dimension. This is because the dimensions, representing properties (or “attributes”) of the data, would usually have a specific meaning assigned to the NULL value in the process of extraction from the operational system. In most cases the set of allowed dimension values will be extended with an additional entry to represent an “Unknown” or “Not Applicable” value. This translation provides the ability to properly represent the breakdown of the cube measure values across all the possible values of the specific dimension (or a set of dimensions) in the reports. Having no NULL values in the cube dimensions also provides the ability to use the aggregate functions (such as sum and average) on the measures without worrying about the issues arising in SQL queries when dealing with NULL values (see “Nulls: Nothing to Worry About”, http://www.oracle.com/technology/oramag/oracle/05-jul/o45sql.html, for one example of the issues surrounding NULL values in SQL queries).
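The aggregate-function issue referred to above is easy to reproduce. SQL’s AVG and COUNT(column) silently skip NULL rows, while COUNT(*) does not, so the same table yields two different row counts (a minimal sketch using sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE m (v INTEGER)")
conn.executemany("INSERT INTO m VALUES (?)", [(10,), (None,), (20,)])

# AVG(v) and COUNT(v) ignore the NULL row; COUNT(*) counts it.
# The "average" is therefore 15 over 2 rows, not 10 over 3.
avg, count_v, count_all = conn.execute(
    "SELECT AVG(v), COUNT(v), COUNT(*) FROM m").fetchone()
print(avg, count_v, count_all)  # 15.0 2 3
```

Whether 15 or 10 is the correct average depends entirely on what the NULL means, which is exactly the interpretation problem this paper addresses.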
When dealing with measures in the cube, the data warehouse loading process will usually automatically convert NULL values to zeroes. Alternatively, the data warehouse reporting platform will treat NULL values as zeroes at reporting time. In fact, it was only recently, in SQL Server 2005, that Microsoft Analysis Services allowed measures to be nullable (see http://www.microsoft.com/technet/prodtechnol/sql/2005/dwsqlsy.mspx ). The automatic assumption that a NULL measure represents zero stems from the fact that measures in data warehouse cubes represent numeric data, and for numeric data it is logical to assume that if the count (or sum) of something produced “nothing”, then the amount is equal to zero. The problem with automatically replacing NULL values in cube measures with zero, however, is that it is not always known why the measure has a NULL value. Without this knowledge, treating NULL values in measures as zeroes is dangerous: in many cases it will cause the data warehouse to generate completely inaccurate reports.
In this paper we will demonstrate the types of issues that arise from not exercising caution when dealing with NULL values in measures. We will especially concentrate on a very common situation in which all the measures in a cube have NULL values for a certain combination of dimensions, which usually means the operational system simply does not produce a record for that combination in the data extract.
The dimension most often present in data warehousing cubes reflects time. In the database, the time dimension can be physically defined as a single date field, or as a combination of all or some of the date components (year, month, day, even hour and minute) stored in separate fields. It is also possible for a cube to have a pair of dates representing an “effective start” and “effective end” of the data. Regardless of implementation, we can consider the portion of the cube reflecting the effective lifespan of the data to be a “time dimension”.
The time dimension is important in the discussion of NULL values because, when data is loaded into the data warehouse, there will often be gaps in the time dimension, for reasons illustrated by the scenarios below.
As we have discussed earlier, a NULL value in a measure reflects an instance of “missing data”. Depending on the reason for the absence of the data, the NULL value should or should not be treated as, or replaced with, a zero value. To demonstrate the issue, let us consider an example of a data warehouse being loaded with data from a sales-tracking system. Let us imagine that the data is collected monthly and includes Date (year and month), Sales Person ID, Product ID, Customer Type (e.g., individual, business, government, etc.), Number of Units Sold, and Sales Dollar Amount. Let us now consider a number of scenarios reflecting the data collection process.
When the sales-tracking system is a centralized application, it can be expected that all the sales are recorded in one place. In this case, when the extract is performed, we can expect that all the existing data will be included in the extract for a specific month. The typical extract is done by applying an aggregating query (in this case, a sum) to all the measures (unit counts and dollar amounts) from individual invoices, grouped by all the dimensions (year/month, Sales Person ID, Product ID, and Customer Type). The result of this query will not contain any records for combinations of dimensions that had no invoices associated with them. Since we have assumed that the system is centralized, we have to conclude that for any of the missing combinations of dimensions there were no sales, meaning that zero units were sold for zero dollars. As a result, for this scenario, as far as the business is concerned, the data set in our data warehouse should contain all the combinations of dimensions, filled with zero or more units sold for zero or more dollars.
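The shape of such an extract can be sketched as follows. The invoice rows and identifiers are invented for illustration; the point is that the GROUP BY simply produces no row for a dimension combination with no invoices, such as SP2 selling P1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE invoice (
    ym TEXT, sales_person TEXT, product TEXT, cust_type TEXT,
    units INTEGER, dollars REAL)""")
conn.executemany("INSERT INTO invoice VALUES (?,?,?,?,?,?)", [
    ("2007-01", "SP1", "P1", "individual", 2, 200.0),
    ("2007-01", "SP1", "P1", "individual", 1, 100.0),
    ("2007-01", "SP2", "P2", "business",   5, 900.0),
])

# Typical monthly extract: sum the measures, grouped by all the dimensions.
# Combinations with no invoices yield no row at all in the result.
extract = conn.execute("""
    SELECT ym, sales_person, product, cust_type,
           SUM(units) AS units, SUM(dollars) AS dollars
    FROM invoice
    GROUP BY ym, sales_person, product, cust_type
    ORDER BY sales_person
""").fetchall()
print(extract)
```

In the centralized scenario, every combination absent from this result can safely be loaded into the warehouse as zero units and zero dollars.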
The second scenario we would like to consider is when the data warehouse is being populated with data obtained from different “product divisions”. For example, a telecommunications company may have cable TV, high-speed internet, and wireless divisions, reporting their sales into a single corporate data warehouse. In this case, the overall sales data can be expected to arrive in batches, as each division submits its information. Looking at our set of dimensions, we see that for a specific year/month we may have no data at all associated with a certain set of Product IDs, if the division responsible for those products has not yet reported its data. In this case, for those Product IDs the data is truly missing and should not be replaced with zeroes. However, if any data for a specific Product ID is present, the rest of the dimensions can be assumed to be complete, and, if a record for a specific Sales Person ID or Customer Type is missing, the sales for those combinations can be assumed to be zero.
For the last scenario we can pick either a “distributed sales force” organization (with each sales person submitting their results to the data warehouse independently), or a “market vertical aligned” organization (with the sales organization structure aligned with the customer type). In both of these models, the absence of data in a certain dimension (Sales Person ID in the first case, and Customer Type in the second) would mean that the data has not yet been submitted, or is not available at all. However, if any data is present for a specific value in those dimensions, the data for the rest of the dimensions, where missing, can be assumed to be zero.
It can be easily seen that, in all of the above cases, replacing all the missing data with zeroes would cause improper reporting, skewing numbers such as average calculations. On the other hand, not fully populating the data sets and not replacing NULLs with zeroes where appropriate would also provide an incorrect view of the data, by reporting values as “not available” when they are in fact zero. This would likewise lead to incorrect calculations of the aggregate functions.
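The skew is easy to see with numbers. In this sketch (figures invented), the wireless division from the second scenario has not yet reported, and zero-filling its missing value drags the average down by a third:

```python
# Hypothetical monthly sales by division; the wireless division has not
# yet reported, so its figure is missing (None), not zero.
reported = {"cable_tv": 120_000.0, "internet": 80_000.0, "wireless": None}

# Correct: average only over the divisions that actually reported.
known = [v for v in reported.values() if v is not None]
avg_correct = sum(known) / len(known)             # 100000.0

# Incorrect: blindly replacing the missing value with zero both drags
# the average down and misreports wireless as having sold nothing.
zero_filled = [0.0 if v is None else v for v in reported.values()]
avg_skewed = sum(zero_filled) / len(zero_filled)  # about 66666.67

print(avg_correct, avg_skewed)
```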
As we can see even from the very basic example above, the interpretation of NULL values, or of missing rows in the data set, must be driven by business knowledge about the nature of the data, the nature of the data sources, and the process of data acquisition and loading into the data warehouse. In order to address this need we propose the following model.
For each source of data in the data warehouse we would identify two sets of dimensions. A “primary” set would include all the dimensions that reflect the availability of the data. This set would almost always include time, as well as one or more “business” dimensions reflecting the segmentation of the data batches being loaded into the data warehouse. The absence of a specific combination of primary dimensions would indicate that the data is not available for that combination. A “secondary” set of dimensions would include all the dimensions that are not in the primary set. The absence of a specific combination of secondary dimensions would mean that the values of all the measures for that combination are zero.
It is also important to note that, depending on the way the data warehouse cubes were defined and populated, the measures themselves can belong to a primary or secondary dimension set. This means that if some measures in a record have values and others are NULL, then, depending on the specific measure, the NULL should or should not be converted to a zero. The difference in interpretation of NULL values for measures in the primary and secondary sets is that in the former the data is “not available” (the value should remain NULL), and in the latter “there is nothing to report” (the value should be replaced with zero).
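The proposed interpretation rule can be sketched as a small lookup function. The dimension split here is one illustrative choice for the sales example: (month, product) as the primary set, reflecting the per-division batches, and the sales person as the secondary set:

```python
# Loaded warehouse data, keyed by (month, product, sales_person).
# No rows exist for product P2: its division has not reported yet.
loaded = {
    ("2007-01", "P1", "SP1"): 300.0,
    ("2007-01", "P1", "SP2"): 150.0,
}

def measure(month, product, sales_person):
    """Resolve a measure under the primary/secondary dimension rule."""
    key = (month, product, sales_person)
    if key in loaded:
        return loaded[key]
    # Any data for this primary combination (month, product)? Then the
    # secondary gap means "nothing to report": zero. Otherwise the
    # primary combination itself is absent, so the data is unavailable.
    if any(k[:2] == (month, product) for k in loaded):
        return 0.0
    return None  # NULL: data not (yet) available

print(measure("2007-01", "P1", "SP3"))  # 0.0  -> SP3 simply sold nothing
print(measure("2007-01", "P2", "SP1"))  # None -> P2 batch not loaded yet
```

A reporting layer built on such a rule can then feed zeroes into averages where they are genuinely meant, while leaving truly unavailable data out of the calculation.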
In summary, we can say that the only way the data warehouse can provide accurate reporting when source data sets are incomplete is by defining the proper interpretation for the missing data. Our proposed method for achieving this is to define primary and secondary dimension sets, each with a defined interpretation. However, a number of other methods can be used, as long as they achieve the same end result: accurate reporting on the data warehouse data.
Copyright © 2007 Inmentis Ltd.