Introduction
The digital revolution dictates the future of companies: they have to become "smart" businesses. Thanks to Hadoop and other so-called big data technologies, companies can now treat unstructured data as a whole. A few examples: an airline knows when a valuable customer runs into trouble at departure and can try to improve the service on the return flight. Doctors will be able to link previously unrelated kinds of data, such as MRI scans, blood pressure readings, and atrial fibrillation records, to predict the likelihood of a heart attack or stroke. Banks can use unstructured data to segment customers and risks, detect fraudulent transactions, and learn users' behavior in order to make timely offers that match their needs.
Big data is not just about data volume, even though volume is the first thing that comes to mind when big data is mentioned. What matters most is that, regardless of type and source, this data hides important relationships, for example between call-center records, website usage data, and sales figures.
Far more important is the ability to assess data wherever it comes from, be it a stream arriving directly from the Internet, traffic passing through the firewall, sensor readings, or information from public sources, and then to link all of it into a single coherent picture.
Traditional RDBMS, Hadoop and the Enterprise
Large enterprise systems use a typical RDBMS pattern:
- An interactive RDBMS handles requests coming from the website or other client applications.
- The data is then extracted from the relational database and loaded into a data warehouse for further processing and archiving (a sketch of this step follows the list).
- The data is usually de-normalized for OLAP (online analytical processing).
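The second step of this pattern is, in essence, a periodic extract-and-load job. Below is a minimal sketch of how such a job might look in Java over JDBC; the connection URLs, credentials, and the orders and fact_orders tables are hypothetical, and real pipelines usually rely on dedicated ETL tooling rather than hand-written code.

    // Hypothetical nightly ETL step: copy new rows from the interactive
    // RDBMS into a warehouse table. URLs, credentials and table names
    // are illustrative only.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class NightlyLoad {
        public static void main(String[] args) throws Exception {
            try (Connection oltp = DriverManager.getConnection(
                     "jdbc:postgresql://oltp-host/shop", "etl", "secret");
                 Connection dwh = DriverManager.getConnection(
                     "jdbc:postgresql://dwh-host/warehouse", "etl", "secret");
                 Statement read = oltp.createStatement();
                 PreparedStatement write = dwh.prepareStatement(
                     "INSERT INTO fact_orders (order_id, customer_id, amount) VALUES (?, ?, ?)")) {

                // Extract yesterday's orders from the operational database...
                ResultSet rs = read.executeQuery(
                    "SELECT order_id, customer_id, amount FROM orders "
                    + "WHERE created_at >= CURRENT_DATE - 1");

                // ...and load them into the warehouse in one batch.
                while (rs.next()) {
                    write.setLong(1, rs.getLong("order_id"));
                    write.setLong(2, rs.getLong("customer_id"));
                    write.setBigDecimal(3, rs.getBigDecimal("amount"));
                    write.addBatch();
                }
                write.executeBatch();
            }
        }
    }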
Unfortunately, modern RDBMSs cannot accommodate the enormous amount of data created in large companies, so compromises have to be made: the data is only partially copied into the RDBMS, or is purged after a certain time. Such trade-offs are unnecessary if Hadoop is used as an intermediate layer between the online database and the data warehouse:
- Data-processing performance grows in proportion to the amount of storage, whereas in high-end servers adding storage is easy but computing performance stays the same.
- With Hadoop, processing performance can be increased simply by adding new nodes to the data warehouse.
- Hadoop can store and process many petabytes of data.
However, Hadoop has some serious limitations and therefore cannot be used as an operational database:
- Even simple tasks take Hadoop at least a few seconds to complete.
- Data stored in HDFS cannot be modified in place (see the sketch after this list).
- Hadoop does not support transactions.
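To make the second limitation concrete, the sketch below writes a file through the standard Hadoop FileSystem API, assuming a reachable HDFS cluster and an illustrative path. The write path offers only create and append, so changing bytes that have already been written means producing a new file.

    // Writing a data set into HDFS with the Hadoop FileSystem API.
    // The path and contents are illustrative. HDFS files are write-once:
    // after close() the only option is to append new bytes; existing
    // bytes cannot be changed, so "updating" a record means rewriting
    // the whole file (or partition).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnce {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path snapshot = new Path("/staging/orders/2015-01-01.csv");
            try (FSDataOutputStream out = fs.create(snapshot)) {
                out.writeBytes("42,1001,99.90\n");      // sequential writes only
                out.writeBytes("43,1002,12.50\n");
            }
            // There is no seek-and-overwrite call on the write path, which
            // is one reason HDFS cannot serve as an operational,
            // transactional store.
        }
    }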
Hadoop vs. RDBMS
Relational database management systems (RDBMS) have many advantages:
- They are able to handle complex transactions;
- They are able to handle hundreds of thousands of queries per second;
- The results are given in real time;
- They use a simple but effective query language.
However, RDBMSs also have weaknesses:
- A schema must be defined before any data can be loaded (illustrated in the sketch after this list);
- Storage capacity tops out at hundreds of terabytes;
- The amount of data addressed in a single query is limited to tens of terabytes.
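The first weakness, often called schema-on-write, can be shown in a few lines of JDBC. The example below uses an in-memory H2 database (its driver must be on the classpath), and the table and columns are assumptions made purely for illustration.

    // "Schema on write": the table definition must exist before a single
    // row can be inserted. Table and column names are made up.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class SchemaOnWrite {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection("jdbc:h2:mem:demo");
                 Statement st = db.createStatement()) {

                // Without this DDL step the INSERT below would fail:
                // the RDBMS refuses data that has no declared structure.
                st.execute("CREATE TABLE clicks (user_id BIGINT, url VARCHAR(2048), ts TIMESTAMP)");
                st.execute("INSERT INTO clicks VALUES (7, 'https://example.com', CURRENT_TIMESTAMP)");

                // Hadoop, by contrast, accepts raw files as they are and
                // leaves the interpretation of their structure to read time.
            }
        }
    }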
Hadoop vs. Storage Systems
Business data in large companies is often stored on large file servers such as NetApp or EMC appliances, which provide fast random access and can serve a large number of client applications simultaneously. However, when it comes to storing petabytes of data, the price per terabyte can rise sharply. In this case Hadoop is a good alternative to such file storage, provided that random access can be replaced by sequential reading and rewriting of data.
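A minimal sketch of that access pattern, again assuming a reachable HDFS cluster and an illustrative file path: the file is scanned front to back in a single pass instead of being read at random offsets.

    // Sequential scan of an archived file in HDFS. Streaming large blocks
    // front to back is what HDFS is built for; frequent seeks would be costly.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SequentialScan {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path archive = new Path("/archive/transactions/part-00000");

            long records = 0;
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(fs.open(archive)))) {
                for (String line = in.readLine(); line != null; line = in.readLine()) {
                    records++;                           // count rows in one pass
                }
            }
            System.out.println("records scanned: " + records);
        }
    }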
Big Data Use in Banking and Enterprises
To quickly summarize the pros and cons discussed above, RDBMSs will continue to be a