Deploying DB2 and InfoSphere Warehouse on Private Clouds

Cloud computing is certainly a hot topic these days. If an organization is not already using cloud computing, it has plans to do so. The economics, agility, and value offered by cloud computing are just too persuasive for IT organizations to ignore.

Even the high-profile Amazon outage couldn’t slow cloud computing’s relentless march towards mainstream adoption. If anything, that outage helped make cloud computing more robust by highlighting the need for hardened policies and procedures around provisioning in the cloud.

IBM recently announced updates to a set of products that make it easy to deploy DB2 and InfoSphere Warehouse on private clouds:

  • IBM Workload Deployer (previously known as WebSphere CloudBurst), which is a hardware/software appliance that streamlines the deployment and management of software on private clouds.
  • IBM Transactional Database Pattern, which works with the IBM Workload Deployer to generate DB2 instances that are suitable for transactional workloads.
  • IBM Data Mart Pattern, which generates InfoSphere Warehouse instances for data mart workloads.

These patterns do more than just deploy virtual images with pre-configured software. You should instead think of them as mini-applications for configuring and deploying cloud-based database instances. Users specify information about the database, and then the pattern builds and deploys the database instance.

The Transactional Database Pattern is for OLTP deployments. It includes templates for sizing the virtual machine, database backup scheduling, database deployment cloning capabilities, and tooling (including Data Studio). The Data Mart Pattern incorporates the features of the OLTP pattern, together with deep compression and data movement tools. However, it is configured and optimized for data mart workloads in a virtual environment.

Need Help Determining Hadoop Split Sizes? Use Adaptive MapReduce Instead!

IBM is actively working on adaptive features for the Map and Reduce phases of its InfoSphere BigInsights product (which is based on Apache Hadoop). In some cases, this involves applying techniques commonly found in mature data management products, and in some cases it involves developing new techniques. While a number of these adaptive features are still under development, there are some features in the product today. For instance, BigInsights currently includes an Adaptive Mapper capability that allows Mappers to successively process multiple splits for a job, and avoid the start-up costs for subsequent splits.

When a MapReduce job begins, Hadoop divides the data into multiple splits. It then creates Mapper tasks for each split. Hadoop deploys the first wave of Mapper tasks to the available processors. Then, as Mapper tasks complete, Hadoop deploys the next Mapper tasks in the queue to the available processors. However, each Mapper task has a start-up cost, and that start-up cost is repeated each time a Mapper task starts.

With BigInsights, there is not a separate Mapper task for each split. Instead, BigInsights creates Mapper tasks on each available processor, and those Mapper tasks successively process the splits. This means that BigInsights significantly reduces the Mapper start-up cost. You can see the results of a benchmark for a set-similarity join workload in the following chart. In this case, the tasks have a high start-up cost. The AM bar (Adaptive Mapper) in the chart is based on a 32MB split size. You can see that by avoiding the recurring start-up costs, you can significantly improve performance.

Adaptive MapReduce Benchmark: Set-Similarity Join Workload
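To make the start-up cost argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not BigInsights code. The function names, the assumption that every split takes the same time to process, and all the numbers are hypothetical and chosen only for illustration.

```python
import math


def classic_map_time(num_splits, slots, startup_s, work_per_split_s):
    """One Mapper task per split: every split pays the start-up cost.

    Tasks run in waves of `slots` tasks, so the elapsed time is the
    number of waves times (start-up + work) for one task.
    """
    waves = math.ceil(num_splits / slots)
    return waves * (startup_s + work_per_split_s)


def adaptive_map_time(num_splits, slots, startup_s, work_per_split_s):
    """One long-lived Mapper per slot: the start-up cost is paid once,
    and the Mapper then processes its share of the splits in succession.
    """
    splits_per_mapper = math.ceil(num_splits / slots)
    return startup_s + splits_per_mapper * work_per_split_s


if __name__ == "__main__":
    # Hypothetical workload: 640 splits, 64 slots, 10 s start-up, 4 s of work per split.
    print("classic :", classic_map_time(640, 64, 10.0, 4.0), "seconds")
    print("adaptive:", adaptive_map_time(640, 64, 10.0, 4.0), "seconds")
```

In this simplified model, the gap between the two grows with the ratio of start-up cost to per-split work, which matches the intuition that workloads with expensive task start-up benefit the most.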

Of course, if you chose the largest split size (2GB), you would achieve similar results to the Adaptive Mapper. However, you might expose yourself to the imbalanced workloads that sometimes accompany very large splits.
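One way very large splits can lead to imbalance is that there may be too few of them to keep every processor busy until the end of the job. The following Python sketch illustrates that effect with a simple greedy scheduler. The data volume, slot count, and processing rate are hypothetical, start-up costs are ignored, and the model is an illustration rather than a description of how Hadoop or BigInsights schedules tasks.

```python
import heapq


def makespan(split_sizes_mb, slots, mb_per_second):
    """Return the elapsed time when each split is handed to whichever
    slot becomes free first (greedy list scheduling)."""
    finish_times = [0.0] * slots
    heapq.heapify(finish_times)
    for size_mb in split_sizes_mb:
        earliest_free = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest_free + size_mb / mb_per_second)
    return max(finish_times)


if __name__ == "__main__":
    # Hypothetical job: 20 GB of input, 8 slots, 128 MB/s per slot.
    large_splits = [2048] * 10   # ten 2 GB splits
    small_splits = [32] * 640    # the same data as 32 MB splits
    print("2 GB splits :", makespan(large_splits, 8, 128.0), "seconds")
    print("32 MB splits:", makespan(small_splits, 8, 128.0), "seconds")
```

With ten 2GB splits on eight slots, the second wave keeps only two slots busy while the other six sit idle. The 32MB splits spread evenly across all eight slots.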

The following chart shows the results of a benchmark for a join query on TERASORT records. Again the AM bar (Adaptive Mapper) in the chart is based on a 32MB split size.

Adaptive MapReduce Benchmark: TERASORT Join Workload

In this case, the Adaptive Mapper results in a more modest performance improvement, although it is still an improvement. The key benefit of these Adaptive MapReduce features is that they eliminate some of the hassles associated with determining the split sizes, while also improving performance.

As I mentioned earlier in this post, a number of additional Adaptive MapReduce features are currently in development for future versions of BigInsights. I look forward to telling you about them when they are released…

In the meantime, make sure to check out the free online Hadoop courses at Big Data University. I previously blogged about my experiences with these courses in Hadoop Fundamentals Course on BigDataUniversity.com.

Comparing HDFS and GPFS for Hadoop

Here is a chart that compares the performance of Hadoop Distributed File System (HDFS) with General Parallel File System-Shared Nothing Cluster (GPFS-SNC) for certain Hadoop-based workloads (it comes from the Understanding Big Data book). As you can see, GPFS-SNC easily out-performs HDFS. In fact, the book claims that a 10-node GPFS-SNC-based Hadoop cluster can match the performance of a 16-node HDFS-based Hadoop cluster.

Comparing HDFS and GPFS for Hadoop Workloads
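To put the 10-node versus 16-node claim in per-node terms, here is a small Python calculation. It assumes that cluster performance scales linearly with node count, which is a simplification of mine and not something the book states. The function names are hypothetical.

```python
def per_node_speedup(baseline_nodes, comparison_nodes):
    """Relative per-node performance if both clusters deliver the same
    total performance (assumes linear scaling with node count)."""
    return baseline_nodes / comparison_nodes


def node_savings(baseline_nodes, comparison_nodes):
    """Fraction of nodes saved by the smaller cluster."""
    return 1 - comparison_nodes / baseline_nodes


if __name__ == "__main__":
    # 16 HDFS nodes versus 10 GPFS-SNC nodes, as quoted from the book.
    print("per-node speedup:", per_node_speedup(16, 10))
    print("node savings    :", node_savings(16, 10))
```

Under that assumption, each GPFS-SNC node does the work of 1.6 HDFS nodes, and the cluster needs 37.5% fewer nodes.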

GPFS was developed by IBM in the 1990s for high-performance computing applications. It has been used in many of the world’s fastest computers (including Blue Gene and Watson). Recently, IBM extended GPFS to develop GPFS-SNC, which is suitable for Hadoop environments. A key difference between GPFS-SNC and HDFS is that GPFS-SNC is a kernel-level file system, whereas HDFS runs on top of the operating system. This means that GPFS-SNC offers several advantages over HDFS, including:

  • Better performance
  • Storage flexibility
  • Concurrent read/write
  • Improved security

If you are interested in seeing how GPFS-SNC performs in your Hadoop cluster, please contact IBM. Although GPFS-SNC is not in the current release of InfoSphere BigInsights (IBM’s Hadoop-based product), GPFS-SNC is currently available to select clients as a technology preview.

Informix Users are Going to San Diego

It has just been announced that next year’s International Informix Users Group (IIUG) conference will be held in San Diego, California on 22 – 25 April. The IIUG Conference continues to offer incredible value. Sign up soon to get the $695 early bird rate, and if you sign up for free IIUG membership, you even get $100 off that rate. $595 for a conference of this length and quality is amazing value. But you’re going to have to act fast to get this discount rate!

And don’t forget that San Diego is a great city to visit. Not only does it have an ideal year-round climate, but it also has a fantastic array of attractions like the world-famous San Diego Zoo, SeaWorld, LEGOLAND, and the Zoo Safari Park (a personal favorite).

International Informix Users Group (IIUG) Conference

Highlights from the IDUG EMEA Conference

I’m still in the afterglow of the International DB2 User Group (IDUG) conference in Prague, Czech Republic. It was another great conference at a great facility in a great city. The conference organizers should be commended on a truly outstanding event. It’s incredible to think that the conference organizers are user volunteers, and not professional conference planners! I’m already looking forward to the next IDUG EMEA conference in Berlin next year. If you are interested in a more in-depth discussion of the conference, including lessons learned from the technical sessions, Norberto Filho will be appearing on the DB2Night show on Friday 02 December 2011. Even if you were at the conference, there was so much happening there that you are sure to learn something new from Norberto’s experiences.

IBM is Baking NoSQL Capabilities into DB2 and Informix

IBM recently revealed its plan to integrate certain NoSQL capabilities into IBM DB2 and Informix. In particular, it is working to integrate graph store and key:value store capabilities into the flagship IBM database products. IBM is not yet indicating when these new capabilities will be available.

IBM does not plan to integrate all NoSQL technologies into DB2 and Informix. After all, there are many NoSQL technologies, and quite a few of them are clearly not suitable for integration into IBM’s products. The following chart summarizes the NoSQL product landscape. This landscape includes more than 100 products across a number of database categories. IBM is saying that it will integrate certain NoSQL capabilities into its products and work hand-in-hand with other NoSQL technologies.

NoSQL Landscape

Readers of this blog will know that these developments are consistent with my view that certain NoSQL technologies will eventually find themselves integrated into the major relational database products. In much the same way as the major relational database products fended off the challenge of object databases by adding features like stored procedures and user-defined functions, I expect the major relational database products to fend off the NoSQL challenge with similar tactics. And don’t forget that the major relational database products have already integrated XML capabilities, providing XQuery as an alternate query language. It’s not too much of a stretch to imagine how several of these NoSQL capabilities might be supported in an optimized way as part of a relational database product.

I look forward to blogging more about this topic as news about it emerges…

IBM DB2 Analytics Accelerator—Bringing Netezza to the Mainframe

Now that the IBM Information on Demand (IOD) and International DB2 User Group (IDUG) conferences are behind me, I have time to blog about some of the great announcements from those conferences. Probably the announcement that generated the most interest among conference attendees is the new release of the IBM DB2 Analytics Accelerator (IDAA). This product takes advantage of Netezza to accelerate analytics queries on DB2 for z/OS.

Here is how it works: you specify the data whose analysis you want to speed up, and a copy of that data is placed on Netezza (DB2 for z/OS continues to be the system of record for all data). Then, when DB2 for z/OS receives a query, an optimizer determines whether that query should be handled by DB2 for z/OS or by IBM Netezza. Here is a chart from the IDUG conference that summarizes the query execution flow.

IBM DB2 Analytics Accelerator
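The routing idea can be sketched in a few lines of Python. This is purely illustrative and is not the actual DB2 for z/OS optimizer logic. The rule used here (send a query to the accelerator only if every table it touches has been copied there and its estimated cost is high) is my own assumption, as are the `Query` class, the `route` function, and the cost threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    tables: frozenset       # names of the tables the query reads
    estimated_cost: float   # hypothetical optimizer cost estimate


def route(query, accelerated_tables, cost_threshold=1000.0):
    """Return "accelerator" or "db2" for a query.

    Assumed rule: offload only when all referenced tables have a copy
    on the accelerator and the query is expensive enough to benefit.
    """
    all_tables_copied = query.tables <= accelerated_tables
    if all_tables_copied and query.estimated_cost >= cost_threshold:
        return "accelerator"
    return "db2"


if __name__ == "__main__":
    copied = frozenset({"SALES", "STORES"})
    print(route(Query(frozenset({"SALES", "STORES"}), 50000.0), copied))
    print(route(Query(frozenset({"SALES"}), 5.0), copied))
    print(route(Query(frozenset({"SALES", "ORDERS"}), 50000.0), copied))
```

Whatever the real criteria are, the point from the source stands: the application submits the query to DB2 for z/OS, and the routing decision is made for it.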

Conceptually, you could almost think of the IBM DB2 Analytics Accelerator as a mainframe specialty processor for analytics. I know it’s not actually a specialty processor, but it does perform the processing involved with complex analytics queries. It also makes life easier for database administrators who often struggle with long-running complex queries, by providing them with an accelerator that does not require additional tuning. To see how much faster it is, here is another chart from the IDUG conference. It shows the experiences of IBM DB2 Analytics Accelerator Beta program participants.

IBM DB2 Analytics Accelerator Performance

If you run complex analytical queries on DB2 for z/OS, it is almost certainly worth your while to learn more about the IBM DB2 Analytics Accelerator.

What will Happen to "In-Memory" when Storage Class Memory Arrives?

During this week’s keynote address at the International DB2 User Group (IDUG) conference in Prague, Namik Hrle talked about Storage Class Memory. Storage Class Memory is a technology in development that promises the performance of Solid State Drive (SSD) technology at the low cost of Hard Disk Drive (HDD) technology. It also promises compelling breakthroughs in space and power consumption. Storage Class Memory is essentially the marriage of scalable non-volatile memory technology and ultra high-density technology. Here is a table that projects the 2020 characteristics of Storage Class Memory:

Storage Class Memory

This table was actually created in 2008. From what Mr. Hrle says, we are tracking ahead of this schedule and will have these capabilities available sooner than 2020.

The performance limitations of disk-based systems have led to the addition of many database and data warehouse “features” (clever optimizations that address these limitations, and provide acceptable performance). If Storage Class Memory delivers on its random and sequential I/O performance promises, as well as its cost promises, many of these optimizations will become either less important, or perhaps unnecessary. In fact, it makes you wonder if our industry’s current fixation with in-memory capabilities may be short-sighted. Several vendors have in-memory database product visions that will not be realized until the latter half of this decade, which is a similar time frame to the projected availability of low-cost Storage Class Memory. Certainly food for thought…

Comparing "New Big Data" with IMS on the Mainframe

While it does not come up often in today’s data management conversations, the IMS database software is at the heart of many major corporations around the world. For many people, it is the undisputed leader for mission-critical, enterprise transaction and data-serving workloads. IMS users routinely handle peaks of 100 million transactions in a day, and there are quite a few users who report more than 3,000 days without unplanned outages. That’s more than 8 years without an unplanned outage!

IBM recently announced IMS 12, claiming peak performance at a remarkable 66,000 transactions per second. The new release features improved performance and CPU efficiency for most IMS use cases, and a significant improvement in performance for certain use cases. For instance, the Fast Path Secondary Index means that workloads that use this secondary index are 60% faster.

It is interesting to compare the performance of IMS with the headline-grabbing “big data” solutions that are all the rage today. For instance, at the end of August this year, we read the headline “Beyonce Pregnancy News Births New Twitter Record Of 8,868 Tweets Per Second.” I am not saying that IMS can replace the infrastructure of Twitter. Far from it. However, I am saying that, when you consider that IMS can handle 66,000 transactions per second, the relative performance levels of the “new big data” solutions when compared with IMS are food for thought. Especially when you consider the very significant infrastructure in place at Twitter, and the staff needed to manage that infrastructure. And don’t forget that IMS supports these performance levels with full read-write capability, full data integrity, and mainframe-level security.
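For readers who want to check the arithmetic behind the figures quoted in this post, here is a short Python sketch. The input numbers all come from the text above. The helper names are hypothetical, and a year is taken as 365.25 days.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_tps(transactions_per_day):
    """Average transactions per second over a full day."""
    return transactions_per_day / SECONDS_PER_DAY


def days_to_years(days):
    """Convert days to years, using 365.25 days per year."""
    return days / 365.25


def ratio(a, b):
    """How many times larger a is than b."""
    return a / b


if __name__ == "__main__":
    print("100 million/day, average per second:", round(average_tps(100_000_000), 2))
    print("3,000 days in years                :", round(days_to_years(3000), 2))
    print("66,000 tps vs 8,868 tweets/second  :", round(ratio(66_000, 8_868), 2))
```

A day of 100 million transactions averages about 1,157 transactions per second, 3,000 days is about 8.2 years, and the 66,000 transactions per second claimed for IMS 12 is about 7.4 times the Twitter record. A transaction and a tweet are not equivalent units of work, so the ratio is a rough point of comparison only.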

I appreciate that many of today’s Web-scale businesses begin with capital investments that preclude the hardware and software investments required for something like IMS. These new businesses need to be relatively agile, and depend upon the low barrier to entry that x86-based systems and open source/inexpensive software afford. However, I still think it is interesting to put this “new big data” in perspective.

IBM Champions Delivering Sessions at the IOD Conference

IBM has great leaders among its user base. They may be technical leaders, whose technical expertise puts them in an elite group of people. They may be community leaders, who bring users together to help one another. They may be academic leaders, who are molding the next generation of innovators. IBM strives to recognize these leaders in its IBM Champion program.

At the Information On Demand (IOD) Conference later this month, 59 IBM Champions will be delivering an impressive lineup of more than 80 sessions across the Business Leadership, Information Management, Enterprise Content Management, and Business Analytics tracks. To see a list of all the IBM Champion-delivered sessions at the IOD Conference, check out the online Roadmap for IBM Champion Sessions.