Archive for the ‘Deep Compression’ Category
Last week, in a post titled Why DB2 Data Compression is Better than Oracle Database, I focused on the reasons why DB2's compression techniques achieve better compression rates than Oracle Database's. In this post, I'd like to focus on a couple of recent additions to DB2 that further enhance its ability to minimize your storage costs:
- DB2 9.7 adds compression for both user temporary tables and system temporary tables. We have seen temporary tables occupy as much as one third of the data storage in certain environments, so compressing them can have a very significant impact on your data storage costs.
- When it comes to compressing database indexes, DB2 9.7 offers both RID list compression and dynamic-length prefix compression that includes substrings of columns. Oracle Database, on the other hand, offers only fixed-length prefix compression.
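To make the index-compression distinction concrete, here is a toy Python sketch. It is my own illustration, not DB2's or Oracle's actual on-disk index format: it simply contrasts a shared prefix that must stop at a column boundary (fixed-length, whole-column prefix compression) with one that may extend into the middle of a column value (prefix compression that includes substrings of columns).

```python
# Illustrative only -- not the real DB2 or Oracle index storage format.
# Contrast: a prefix that must align with a column boundary vs. a prefix
# that may end mid-column.

from os.path import commonprefix

keys = ["SMITH,JOHNATHAN", "SMITH,JOHNNY", "SMITH,JOHNSON"]

def fixed_column_prefix(keys, sep=","):
    # Whole-column prefix: factor out only complete leading key columns.
    cols = [k.split(sep) for k in keys]
    shared = []
    for parts in zip(*cols):
        if len(set(parts)) == 1:
            shared.append(parts[0])
        else:
            break
    return sep.join(shared)

def string_prefix(keys):
    # Substring prefix: factor out the longest common string prefix,
    # even if it stops in the middle of a column value.
    return commonprefix(keys)

print(fixed_column_prefix(keys))  # 'SMITH' -- stops at the column boundary
print(string_prefix(keys))        # 'SMITH,JOHN' -- extends into the second column
```

The substring prefix is strictly at least as long as the whole-column prefix, which is why compression that can cross column boundaries never does worse and often does better.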
When you combine all of these DB2 advantages, it should be apparent that DB2 will typically do better than Oracle Database at lowering database-related storage costs. In fact, International Technology Group (ITG) came to the following conclusions in its report titled VALUE PROPOSITION FOR IBM DB2 9.7: Cost Savings Potential Compared to Oracle Database 11g:
Oracle 11g compression rates are typically in the 20 to 30 percent range – many users find performance degradation at higher levels to be unacceptable – while early DB2 9.7 users have routinely employed 60 to 80 percent compression rates with negligible effects on processor performance.
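To put those percentages on a common scale, a compression "rate" of r percent means the data shrinks to (1 − r/100) of its original size. The quick conversion below is my own arithmetic, not from the ITG report:

```python
# A compression rate of r percent leaves (1 - r/100) of the original size;
# the equivalent compression factor is the reciprocal of that fraction.

def factor_from_rate(rate_pct):
    return 1 / (1 - rate_pct / 100)

for rate in (20, 30, 60, 80):
    print(f"{rate}% saved -> {factor_from_rate(rate):.2f}x compression factor")
# 20% and 30% correspond to roughly 1.25x-1.43x,
# while 60% and 80% correspond to 2.5x-5x.
```

In other words, the 60 to 80 percent range ITG reports for DB2 9.7 is a 2.5x to 5x reduction in storage, versus roughly 1.25x to 1.4x for the Oracle 11g range.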
However, you should be aware that data compression rates and performance are highly dependent on the nature of your data and the nature of your environment. Your best course of action is to get IBM and Oracle to prove the kinds of storage savings they can achieve for your data in your environment. That’s the only way to cut past the marketing and determine the real impact.
Yesterday, in a blog post titled Data Compression in IBM DB2 and Oracle Database, we saw that IBM DB2 gets better data compression rates than Oracle Database for TPC-H data. You might wonder why DB2 is so much better. The reasons boil down to two basic differences:
- DB2 uses one compression dictionary for the entire database table, whereas Oracle Database uses a separate compression dictionary for each block in the database. A block is a unit of storage ranging in size from 4 KB to 32 KB. Because Oracle Database uses a much smaller scope for its compression dictionaries, its compression rates are naturally going to be lower.
- Both DB2 and Oracle Database can compress repeating patterns that span multiple columns. However, DB2 can compress substrings that span multiple columns, whereas Oracle Database cannot compress substrings. For instance, if we have rows containing the values EATON, CHRISTOPHER and EATON, CHRISTINE, DB2 can compress both of these by putting EATON, CHRIS in the compression dictionary, whereas Oracle Database cannot. Once again, the smaller scope for Oracle Database limits the compression rates it can achieve.
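The effect of dictionary scope can be shown with a toy dictionary coder. This is purely my own illustration, not DB2's or Oracle's actual algorithm: each distinct value is stored once in a dictionary and replaced in the rows by a one-byte code. With a table-wide dictionary the entry is stored once; with per-block dictionaries it is repeated in every block that contains the value.

```python
# Toy dictionary coder -- illustrative only, not the real DB2/Oracle format.
# Compare total bytes with one table-wide dictionary vs. one per block.

def dict_compressed_size(rows, block_size=None):
    """Bytes for dictionary entries plus one code byte per row."""
    blocks = [rows] if block_size is None else [
        rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    total = 0
    for block in blocks:
        dictionary = set(block)                   # distinct values in scope
        total += sum(len(v) for v in dictionary)  # dictionary storage
        total += len(block)                       # one code byte per row
    return total

rows = ["EATON,CHRISTOPHER"] * 100                 # highly repetitive column
print(dict_compressed_size(rows))                  # table-wide: 17 + 100 = 117
print(dict_compressed_size(rows, block_size=10))   # per-block: 10 * (17 + 10) = 270
```

Even in this crude model, repeating the dictionary entry in every block more than doubles the storage needed, which is exactly the overhead the VLDB quote below acknowledges.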
If you are in any doubt as to which vendor's approach gets the better compression rates, consider the following quote from Oracle itself. In a paper presented at the Conference on Very Large Data Bases (VLDB) titled Data Compression in Oracle, two Oracle researchers endorse the approach taken by DB2 as achieving better compression rates:
Due to its global optimality of compression a table-wide dictionary approach can result in high compression factors… Furthermore, a global dictionary may use less disk space compared to a block level approach, which potentially repeats dictionary entries across blocks.
Last week, in my blog post titled Truth in Advertising – Advanced Data Compression in Oracle 11g, I discussed the fact that data compression rates are highly dependent upon 1) the nature of the data and 2) the database environment. I’d like to follow up on that post with a data point that directly compares IBM DB2 and Oracle Database.
Last week, I mentioned that TPC-C data contains random strings. As such, it is not an ideal data set for measuring data compression rates. TPC-H is actually a better data set, as it contains data that is closer in nature to “real world” data. There are issues with TPC-H data as well. Since the data is programmatically generated, most columns have a uniform distribution, which limits the clustering of data that tends to happen in “real world” data sets. Also, there are some “fill columns” with long unique content that is difficult to compress. Nonetheless, the TPC-H data is a better data set for comparing data compression rates than TPC-C.
Luckily for us, Oracle has published its compression results for TPC-H data in a paper presented at the Conference on Very Large Data Bases (VLDB) titled Data Compression in Oracle. In this paper, the authors write:
When loaded with the compression feature, compression for LINEITEM is the highest at a compression factor of about 1.6, while ORDER compresses at a compression factor of about 1.2… LINEITEM shrinks from 79 GB to 49 GB while ORDERS shrinks from 17 GB to 14 GB.
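The compression factors quoted above can be checked directly against the before-and-after sizes the paper gives (a compression factor is simply the ratio of uncompressed to compressed size):

```python
# Verify the quoted compression factors from the before/after sizes in GB.

def compression_factor(before_gb, after_gb):
    return before_gb / after_gb

print(round(compression_factor(79, 49), 2))  # LINEITEM: 1.61, i.e. "about 1.6"
print(round(compression_factor(17, 14), 2))  # ORDERS:   1.21, i.e. "about 1.2"
```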
When we compare these rates to the compression rates that DB2 achieves for TPC-H data, we see that DB2's are significantly higher. And, of course, higher compression rates are better because they mean you need less storage. Storage-related costs are often one of the most expensive components of a system, so lowering them can have a very real impact on IT budgets.
Please keep in mind that these results are only for the TPC-H data, and your data may achieve different compression rates. For instance, Tellabs and SunTrust Bank both report using DB2 to reduce the size of database tables by as much as 83%. The only way to know what kind of compression rates you will experience is to get the database vendors to run their compression tools against your data and show you the results.