What you need to know about Column-Oriented Database Systems
Column-oriented database systems (Column Stores) have attracted a lot of attention in the past few years. Vendors have quoted impressive performance and storage gains over row-oriented database systems. In some cases, vendors have even claimed as much as 1000x performance improvement.
For many years Sybase IQ led the way for Column Stores. But they have since been joined by a long list of vendors, including Infobright, Paraccel, and Vertica. The claims being made by these vendors are attention-grabbing. But they don’t tell the whole story. Before you run out and get yourself a Column Store, you should be aware of the following:
- Performance issues when queries involve several columns.
Column stores can be faster than row-oriented stores when the number of columns involved is small (it depends on a number of factors, including the number of columns, the use of indexing, the specifics of the query, and so on). However, you can run into performance issues with Column Stores as the number of columns increases, due to the re-composition overhead. In fact, the performance degradation can be quite significant. If your queries involve more than a few columns (either columns for retrieving data or columns for query predicates), you need to be aware of this potential issue. If you are evaluating a column Store, make sure to test queries that involve more than a couple of columns.
- Overhead associated with inserting data.
When you create a new data record, you are creating a data row. However, a Column Store does not have a row-orientation. Instead, a Column Store must decompose that data row into the individual column values, and store each of those column values individually. This adds up to a lot more block updates for a Column Store than a row-oriented store. As you can imagine, this is quite a bit of additional work. You should also keep in mind that the values in Column Stores are typically sorted for fast selection and retrieval, which means even more work for data insert and update operations. So what does all this mean? Essentially these limitations make it difficult to have real-time or near real-time data analysis with Column Stores (unless the Column Store vendor uses an approach like Vertica where they have an “update area” in memory that is essentially a row store cache, where real-time inserts are first written, then asynchronously written to disk).
Column Stores are great as analytic data marts where queries do not involve many columns. In such situations, you can enjoy performance gains. However, for more involved usage, you may run into issues. For instance, a Column Store is almost certainly not up to supporting thousands of simultaneous users and mixed query workloads, which are common in Enterprise Data Warehouse (EDW) environments. Sometimes people can get blinded by Column Store success for relatively simple data mart environments. You should be aware that these performance gains do not necessarily translate to larger, more complex environments. In fact, they may not even translate to other simple data marts with different schemas, or where your queries involve more than a couple of columns. The bottom line here is that you need to know both the benefits and the limitations of a Column Store, and make the right decision for your particular situation.