Archive for the ‘NoSQL’ Category
This blog post refers to the definition of Big Data commonly in use today. I do not include mainframe-based solutions, which some people might argue also tackle Big Data challenges.
Both IBM and Oracle are going after the Big Data market. However, they are taking different approaches. I'm going to take a few moments to have a brief look at what each company is doing.
First of all, Oracle has introduced an “appliance” for Big Data. IBM has not. I put the word appliance in quotes because I consider this Oracle appliance to be closer in nature to an integrated collection of hardware and software components than to a true appliance designed for ease of operation. But the more important consideration is whether an appliance even makes sense for Big Data. There is a decent examination of this topic in the following blog post from Curt Monash and the accompanying comment stream: Why you would want an appliance — and when you wouldn’t. But, regardless of your position on this subject, the fact remains that Oracle currently proposes an appliance-based approach, while IBM does not.
The other area I will briefly look at is the scope of the respective vendor approaches. In the press release announcing the Oracle Big Data Appliance, Oracle claims that:
Oracle Big Data Appliance is an engineered system optimized for acquiring, organizing, and loading unstructured data into Oracle Database 11g.
IBM takes a very different approach. IBM does not see its Big Data platform as primarily being a feeder for its relational database products. Instead, IBM sees this as being one possible use case. However, the way that customers want to use Big Data technologies extends well beyond that use case. IBM is designing its Big Data platform to cater for a wide variety of solutions, some of which involve relational technology and some of which do not. For instance, the IBM Big Data platform includes:
- BigInsights for Hadoop-based data processing (regardless of the destination of the data)
- Streams for analyzing data in motion (where you don’t necessarily store the data)
- TimeSeries for smart meter and sensor data management
- and more
Today, Forrester published its Wave analysis for enterprise Hadoop solutions. It has detailed coverage of the Hadoop solutions from vendors like IBM, MapR, Cloudera, Hortonworks, and others. If you are considering an enterprise Hadoop solution, such as IBM InfoSphere BigInsights, it will make for very interesting reading. You can download a free copy of the report from The Forrester Wave™: Enterprise Hadoop Solutions, Q1 2012.
IBM is actively working on adaptive features for the Map and Reduce phases of its InfoSphere BigInsights product (which is based on Apache Hadoop). In some cases, this involves applying techniques commonly found in mature data management products, and in some cases it involves developing new techniques. While a number of these adaptive features are still under development, there are some features in the product today. For instance, BigInsights currently includes an Adaptive Mapper capability that allows Mappers to successively process multiple splits for a job, and avoid the start-up costs for subsequent splits.
When a MapReduce job begins, Hadoop divides the data into multiple splits. It then creates a Mapper task for each split. Hadoop deploys the first wave of Mapper tasks to the available processors. Then, as Mapper tasks complete, Hadoop deploys the next Mapper tasks in the queue to the available processors. However, each Mapper task has a start-up cost, and that cost is paid again every time a new Mapper task starts.
With BigInsights, there is not a separate Mapper task for each split. Instead, BigInsights creates Mapper tasks on each available processor, and those Mapper tasks successively process the splits. This means that BigInsights significantly reduces the Mapper start-up cost. You can see the results of a benchmark for a set-similarity join workload in the following chart. In this case, the tasks have a high start-up cost. The AM bar (Adaptive Mapper) in the chart is based on a 32MB split size. You can see that by avoiding the recurring start-up costs, you can significantly improve performance.
Of course, if you chose the largest split size (2GB), you would achieve similar results to the Adaptive Mapper. However, you might then expose yourself to the imbalanced workloads that sometimes accompany very large splits.
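The amortization argument above can be captured in a toy cost model. This is my own illustration, not BigInsights code, and the timings are hypothetical assumptions chosen only to show the shape of the trade-off:

```python
# Toy model: total cost when every split pays a Mapper start-up cost,
# versus when one long-running Mapper per slot is reused across splits.
# All numbers are illustrative assumptions, not benchmark figures.

def standard_mappers(num_splits, startup, work_per_split):
    """One Mapper task per split: the start-up cost is paid num_splits times."""
    return num_splits * (startup + work_per_split)

def adaptive_mappers(num_splits, num_slots, startup, work_per_split):
    """One reusable Mapper per slot: the start-up cost is paid num_slots times."""
    return num_slots * startup + num_splits * work_per_split

if __name__ == "__main__":
    splits, slots = 640, 32      # e.g. ~20GB of data in 32MB splits, 32 map slots
    startup, work = 5.0, 2.0     # hypothetical seconds per task
    print(standard_mappers(splits, startup, work))          # 4480.0 "CPU-seconds"
    print(adaptive_mappers(splits, slots, startup, work))   # 1440.0 "CPU-seconds"
```

The larger the start-up cost relative to the per-split work (as in the set-similarity join benchmark above), the bigger the win from reusing Mappers.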
The following chart shows the results of a benchmark for a join query on TERASORT records. Again, the AM bar (Adaptive Mapper) in the chart is based on a 32MB split size.
In this case, the Adaptive Mapper delivers a more modest, though still real, performance improvement. The key benefit of these Adaptive MapReduce features is that they eliminate some of the hassle associated with choosing split sizes, while also improving performance.
As I mentioned earlier in this post, a number of additional Adaptive MapReduce features are currently in development for future versions of BigInsights. I look forward to telling you about them when they are released…
In the meantime, make sure to check out the free online Hadoop courses at Big Data University. I previously blogged about my experiences with these courses in Hadoop Fundamentals Course on BigDataUniversity.com.
Here is a chart that compares the performance of Hadoop Distributed File System (HDFS) with General Parallel File System-Shared Nothing Cluster (GPFS-SNC) for certain Hadoop-based workloads (it comes from the Understanding Big Data book). As you can see, GPFS-SNC easily out-performs HDFS. In fact, the book claims that a 10-node GPFS-SNC-based Hadoop cluster can match the performance of a 16-node HDFS-based Hadoop cluster.
GPFS was developed by IBM in the 1990s for high-performance computing applications. It has been used in many of the world’s fastest computers (including Blue Gene and Watson). Recently, IBM extended GPFS to develop GPFS-SNC, which is suitable for Hadoop environments. A key difference between GPFS-SNC and HDFS is that GPFS-SNC is a kernel-level file system, whereas HDFS runs on top of the operating system. This means that GPFS-SNC offers several advantages over HDFS, including:
- Better performance
- Storage flexibility
- Concurrent read/write
- Improved security
If you are interested in seeing how GPFS-SNC performs in your Hadoop cluster, please contact IBM. Although GPFS-SNC is not in the current release of InfoSphere BigInsights (IBM’s Hadoop-based product), GPFS-SNC is currently available to select clients as a technology preview.
IBM recently revealed its plan to integrate certain NoSQL capabilities into IBM DB2 and Informix. In particular, it is working to integrate graph store and key:value store capabilities into the flagship IBM database products. IBM is not yet indicating when these new capabilities will be available.
IBM does not plan to integrate all NoSQL technologies into DB2 and Informix. After all, there are many NoSQL technologies, and quite a few of them are clearly not suitable for integration into IBM’s products. The following chart summarizes the NoSQL product landscape. This landscape includes more than 100 products across a number of database categories. IBM is saying that it will integrate certain NoSQL capabilities into its products and work hand-in-hand with other NoSQL technologies.
Readers of this blog will know that these developments are consistent with my view that certain NoSQL technologies will eventually find themselves integrated into the major relational database products. In much the same way as the major relational database products fended off the challenge of object databases by adding features like stored procedures and user-defined functions, I expect them to fend off the NoSQL challenge with similar tactics. And don’t forget that the major relational database products have already integrated XML capabilities, providing XQuery as an alternate query language. It’s not too much of a stretch to imagine how several of these NoSQL capabilities might be supported in an optimized way as part of a relational database product.
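To make the integration idea concrete, here is a minimal sketch, entirely my own illustration and not IBM's design, of how a key:value interface can be layered on top of a relational engine. I use Python's built-in sqlite3 module as a stand-in for the relational product:

```python
# A toy key:value store backed by a relational table.
# Illustrative only: real integrations would optimize storage and indexing.
import sqlite3

class KVStore:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

    def put(self, key, value):
        # Upsert via SQLite's INSERT OR REPLACE; the primary key enforces
        # the one-value-per-key semantics of a key:value store.
        self.conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
        self.conn.commit()

    def get(self, key, default=None):
        row = self.conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else default

store = KVStore()
store.put("user:42", "some value")
print(store.get("user:42"))   # some value
```

The point is that the key:value API can hide the SQL entirely, while the vendor keeps all of the transactional and security machinery of the underlying relational engine.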
I look forward to blogging more about this topic as news about it emerges…
While it does not come up often in today’s data management conversations, the IMS database software is at the heart of many major corporations around the world. For many people, it is the undisputed leader for mission-critical, enterprise transaction and data-serving workloads. IMS users routinely handle peaks of 100 million transactions in a day, and there are quite a few users who report more than 3,000 days without unplanned outages. That’s more than 8 years without an unplanned outage!
IBM recently announced IMS 12, claiming peak performance at a remarkable 66,000 transactions per second. The new release features improved performance and CPU efficiency for most IMS use cases, and a significant improvement in performance for certain use cases. For instance, workloads that use the new Fast Path Secondary Index are 60% faster.
It is interesting to compare the performance of IMS with the headline-grabbing “big data” solutions that are all the rage today. For instance, at the end of August this year, we read how Beyonce Pregnancy News Births New Twitter Record Of 8,868 Tweets Per Second. I am not saying that IMS can replace the infrastructure of Twitter. Far from it. However, I am saying that, when you consider that IMS can handle 66,000 transactions per second, the relative performance levels of the “new big data” solutions when compared with IMS are food for thought. Especially when you consider the very significant infrastructure in place at Twitter, and the staff needed to manage that infrastructure. And don’t forget that IMS supports these performance levels with full read-write capability, full data integrity, and mainframe-level security.
I appreciate that many of today’s Web-scale businesses begin with capital constraints that preclude the hardware and software investments required for something like IMS. These new businesses need to be relatively agile, and depend upon the low barrier of entry that x86-based systems and open source/inexpensive software afford. However, I still think it is interesting to put this “new big data” in perspective.
Last week, I included a demonstration of Using Hadoop to Extract and Analyze Unstructured Information. Now I’d like to share another demo. This demo also shows InfoSphere BigInsights and InfoSphere BigSheets. BigInsights is essentially Apache Hadoop together with extensions for installation, management, security, and integration, while BigSheets is basically an easy-to-use interface for creating and running Map and Reduce jobs.
This demo shows you how to run sentiment analysis on Tweets. Some of the details of creating the specific text analytics are not included, but it is interesting and useful nonetheless. It also shows how you can easily run some cool visualizations on that data. Make sure to keep watching until the end, where David Barnes shows a great visualization on the UK Parliament data.
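If you are wondering what tweet sentiment analysis looks like at the Map and Reduce level, here is a deliberately simple sketch of my own. It is not the extractor used in the demo; it just scores each tweet against small, made-up word lists in the Map phase and tallies the labels in the Reduce phase:

```python
# Toy map-side sentiment scorer plus a reduce-side tally.
# The word lists are illustrative assumptions, not a real sentiment lexicon.
POSITIVE = {"great", "love", "cool", "useful"}
NEGATIVE = {"bad", "hate", "slow", "broken"}

def map_sentiment(tweet):
    """Emit a (sentiment_label, 1) pair for one tweet, Hadoop-streaming style."""
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return (label, 1)

def reduce_counts(pairs):
    """Sum the counts per sentiment label, like a Reduce phase."""
    totals = {}
    for label, n in pairs:
        totals[label] = totals.get(label, 0) + n
    return totals

tweets = ["I love this great demo", "hate how slow this is", "just a tweet"]
print(reduce_counts(map_sentiment(t) for t in tweets))
# {'positive': 1, 'negative': 1, 'neutral': 1}
```

Real text analytics extractors are far more sophisticated, but the overall Map/Reduce shape of the job is the same.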
Don’t forget there is no charge for BigInsights Basic Edition. You can freely download it from InfoSphere BigInsights.
Here’s a nice demo. It shows InfoSphere BigInsights, which is IBM’s Hadoop product. BigInsights is essentially Apache Hadoop together with extensions for installation, management, security, integration, and so on. The demo also shows InfoSphere BigSheets. BigSheets is basically an easy-to-use interface for creating and running Map and Reduce jobs. As you can see from the demo, BigSheets makes it quick and easy to apply text analytics extractors and filters to unstructured or semi-structured data. The demo itself shows how you can quickly analyze several aspects of revenue information pulled from earnings press releases. It even includes a nice round-trip to the annotated source data to see “why” certain conditions occurred.
Don’t forget there is no charge for BigInsights Basic Edition. You can freely download it from InfoSphere BigInsights.
After spending some time reading about Apache Hadoop, I decided it was time to get my hands dirty. So this weekend, I took the Hadoop Fundamentals 1 self-paced course on BigDataUniversity.com. It is a really nice way to play with Hadoop. You have the choice of downloading the software and installing it on your computer, working with a VMware image, or working in the cloud. I chose the option of working in the cloud. Within a few minutes I had an Amazon AWS account, a RightScale account, and the software installed in the cloud. By the way, although the course is FREE, I did incur some cloud-related usage charges. It amounted to approximately $1 in Amazon charges for the time it took me to complete the course.
If you are curious about Hadoop, I’d recommend this course. I’m eagerly anticipating the availability of the follow-on Hadoop course…
The NoSQL movement has garnered a lot of attention recently. It has been built around a number of emerging highly-scalable non-relational data stores. The movement is also providing a new lease of life for smaller non-relational database vendors who have been around for a while.
Last week, I noticed an entire track for XML and XQuery sessions at the recent NoSQLNow Conference in San Jose. If XML databases and XQuery are key constituents of the NoSQL world, does that mean that IBM DB2 and Oracle Database should be included in the NoSQL movement? After all, both IBM DB2 and Oracle Database store XML data and provide XQuery interfaces. Of course, I’m not being serious here. I don’t believe that the bastions of the relational world should be included in the NoSQL community. Are native XML databases, which have been around for a while, really in the spirit of the NoSQL movement? What’s your opinion?
I believe that the boundaries of the NoSQL community are perhaps a bit looser than they should be. Essentially, absolutely everything except relational databases is being grouped under the NoSQL banner. I can understand how this has happened, but does the NoSQL community really want to dilute its message by including all of these technologies, most of which have been around for quite some time with relatively limited traction? In the spirit of what I believe is at the genesis of the current NoSQL movement, I reckon that a NoSQL solution should have the following characteristics:
– Not be based on the relational model
– Have little or no acquisition cost
– Be designed to run on commodity hardware
– Use a distributed architecture
– Support extreme or Web-scale databases
Notice that I don’t include a characteristic based on lack of consistency. I reckon that, over time, consistency will become a characteristic of some NoSQL environments.
By the way, earlier in this blog post I referred to the XML and XQuery capabilities in IBM DB2 and Oracle Database. In case you are curious, there is a significant difference in how DB2 and Oracle Database have incorporated XML capabilities into their respective products, with Oracle essentially leveraging its existing relational infrastructure to provide several ways to store XML data, while IBM built true native XML storage capabilities into its product. In other words, DB2 is indeed a true “native XML store”. In the past, I used to blog about native XML storage over at www.nativeXMLdatabase.com, before handing the reins over to Matthias Nicola. If you want a little more insight on XML support in Oracle Database, check out XML in Oracle 11g and Why Won’t Oracle Publish XML Benchmark Results for TPoX?
Here’s a short video that was recorded at the IDUG conference, where I talk about the characteristics of Big Data solutions, discuss some of the technologies involved, and describe some real world Big Data solutions that IBM has implemented. It’s a high-level introduction, but if you’re not sure what this “Big Data” term refers to, you may find it useful.
In the video, I try to quantify what “big” means today, as well as describing some lessons we have learned while implementing Big Data solutions. Technologies introduced include Map/Reduce systems, systems for analyzing streaming data, Massively Parallel Processing (MPP) data warehouse systems, and in-memory database systems.
Those of you who know me in person will see that I was a little under-the-weather when the video was recorded. You can hear it in my voice, see it in my demeanor, and notice it in my cadence. I hope you can get past this, and find the video useful.
Matt Asay wrote an interesting article for The Register titled SQL Survives Murder Attempt by Mutant Stepchild, where he opines that “NoSQL remains a tiny blip in the overall datastore universe”. And he’s correct. When it comes to the universe of data management deployments, NoSQL usage is a tiny fraction of the overall data management market.
The term NoSQL implies that these emerging data management technologies are fighting the SQL establishment. I would argue that, instead, they are fighting the traditional Relational Database Management System (RDBMS) establishment. The NoSQL movement has evolved out of a loose association of technologies that solve challenges that traditional relational solutions are not designed to solve well. RDBMS software is good at addressing the majority of our data management challenges. However, there are instances where the relational approach simply does not work well. While these situations are a relatively small part of the data management universe, they are nonetheless important. After all, these emerging technologies are meeting a very real market need, and the likelihood is that this market need will grow as the business world shifts towards use cases where these NoSQL solutions shine. So, essentially we have a situation where a bunch of data management technologies are emerging to solve a subset of data management challenges that are not well served by currently available technologies. I expect that some of these NoSQL use cases will evolve into reasonable, if relatively small, segments of the overall data management market.
To further illustrate that the term NoSQL is probably a misnomer, some of these NoSQL technologies have plans to adopt SQL interfaces. How will the NoSQL movement react when some of its products start adopting SQL interfaces? As Alanis Morissette would say, isn’t it ironic!
But anyway, back to the topic at hand. While certain segments of the high tech media are portraying this as a big battle between the incumbent and a challenger, I would instead portray it as the emergence of new technologies to augment the incumbent. The NoSQL solutions are essentially a set of technologies that address use cases that are not well served by existing relational technology. The relational database software market is huge today, and I don’t see this changing in any significant way in the foreseeable future. Despite what some wide-eyed and naive smaller vendors may claim, these emerging technologies are simply not in a position to unseat the incumbent relational database technology wholesale. Instead, they will likely augment relational technology in many IT environments. In some IT environments, where the business is built around NoSQL-friendly use cases, it may actually be the opposite, with relational technologies augmenting the more dominant NoSQL technologies. However, as Matt points out in his article, the fact that SQL-based systems have such a low barrier-to-entry will ensure their long-term dominance. Another significant factor in determining how things will evolve is the huge investment in, and significant maturity of, the ease-of-use, ease-of-maintenance, stability, reliability, and security features that make RDBMS systems enterprise-ready today. And don’t forget that, as emerging technologies play catch-up with this huge investment, the relational vendors will continue to innovate.
In my opinion, the likely outcome here is that there will be a set of separate battles among vendors for each of the individual market segments corresponding to the NoSQL use cases. And the larger vendors will participate in the more lucrative of these market segments, either with organically-developed or acquired products. And, for the most part, the servicing of these use cases will be relatively independent of the larger relational database market. What’s your opinion?
As many of you know, IBM has been making big investments in Big Data. This includes InfoSphere BigInsights (which is based on Apache Hadoop), InfoSphere Streams, IBM Netezza, and more than $14B in analytics-based acquisitions. IBM is now announcing a set of hands-on workshops that will be held around the world to help you get to grips with Big Data. There will be 1,200 of these free workshops held in more than 150 cities in 60 countries in 2011. For more information, see IBM Launches Global Bootcamps to Help Companies Tackle Big Data Challenges.
Yesterday, IBM issued a press release in which it Unveils Software and Services to Help Organizations Make Sense of Their Deluge of Data. There is a lot of information in the press release. Basically, IBM is announcing IBM InfoSphere BigInsights, which is based on Apache Hadoop. So, in other words, IBM is announcing an offering that allows you to work with petabytes of data. At present, IBM InfoSphere BigInsights consists of:
- BigInsights Core, which is software and services for implementing Apache Hadoop
- BigSheets, which acts as an insight engine for information in Hadoop. It’s a Web-based spreadsheet-like infrastructure for Big Data that includes a plug-in framework for analysis and presentation extensions. You can use BigSheets for extracting information, adding annotations, visualizing with pie charts, visualizing with tag clouds, and so on.
- Industry-specific solutions for finance, risk management, and media.
You might be wondering what IBM is bringing to the table here, aside from experience with deploying Hadoop-based solutions. Well, IBM sees its role as making Hadoop enterprise-ready. This includes the kinds of things that IBM is good at, like creating software with robust quality, accessibility, and localization. But it also includes adding key features that allow you to fully leverage the information in a Hadoop environment.
IBM is working to provide integration with DBMS, ETL, and MDM systems. Remember, ideally you want such environments to work with both existing and new data repositories. After all, you don’t want to create yet another silo of information within your organization. It is only with all of their information at their fingertips that organizations can see the full picture and make good business decisions. Which leads me nicely to the other big thing that IBM brings to the table: the ability to add business value to the Hadoop deployment with Cognos, SPSS, and ECM application layers.
You can see more coverage on this topic, including Cloudera’s reaction at IBM punts commercial Hadoop distro.