Data storage has come a long, long way. Here's a quick run-through of how we got to where we are today.
Computers have used various forms of external data storage since they were invented: from punched cards and paper tapes, through magnetic tapes, diskettes and optical disks, to today’s modern hard drives and solid-state disks.
It's clear that our data storage needs have changed. And it's not surprising, considering the 2.5 quintillion bytes of data we're creating and processing per day.
Naturally, as the capacity and performance of storage has increased, so have the data processing requirements. Now, systems need to process much more than just data created by a single application. The new challenge is how to create the most efficient and effective data storage.
Evolutionary changes have defined the path data storage has taken, but what might next-generation data architecture look like? And how should we prepare to make the most of these developments in the future?
Let’s make a quick excursion through the data storage evolution.
Data storage: The beginning
Data storage originally focused solely on the needs of a single, isolated application. Since the storage capacity was small, and storage was expensive, the method of storing data needed to be efficient and effective.
Each individual application stored just the necessary input and output data in the format and structure most suited to that particular application’s needs.
For example, a business’s complete monthly wage processing was run centrally by a dedicated mainframe program. Neither the input nor the output data was shared with any other programs or tasks.
The data volume surge
As the performance and storage capacity of computers increased, they became involved in more applications and business processes. Common data was reused across multiple applications, and natural groups of data and applications began to form within organizations.
These groups centered either on common functional agendas or business units in the organization. Even though the data could be shared with multiple applications in the same group, each group still remained isolated from the others. These are referred to as data silos.
Data silos have a negative impact on data integrity, making it virtually impossible to get a ‘single version of the truth’. On top of this, they harm overall corporate productivity, creating repetitive tasks and wasted effort.
There were multiple approaches to integrating and sharing application data beyond silo limits. The natural solution lay in creating a centralized repository combining, integrating and consolidating the data from the multiple silos. Such a central repository can mediate all data for combined views and provide a single version of the truth.
The straightforward approach was to build an overall enterprise-wide repository with strong normalization, in order to avoid data duplication and wasted storage space. This approach is the basis for the top-down design of the data warehouse (DWH), as introduced by Bill Inmon.
However, building such a huge data store is a challenging task. In large corporations, it can be doomed by the high cost involved. Another issue with establishing a data warehouse is the need to validate the data so that it fits all business constraints: discarding invalid data which doesn’t fit the model means losing information which could be useful in the future.
Preserving non-conforming facts with data vaults
Dan Linstedt introduced a major improvement for structured central repositories, known as data vault modelling. It stores a single version of the facts rather than a single version of the truth. This means that even facts which don’t conform to the single version of the truth are preserved, unlike in a data warehouse.
The storage model is based on a hub-and-spoke architecture. The business keys and the associations within the source data are separated into hub tables (holding entity business keys) and spoke, or link, tables (holding the associations among business keys).
These tables contain no descriptive, factual or temporal attributes beyond the business keys, load data source and audit information. The structure of the entity relationships is thus kept separate from their details, which are stored in satellite tables.
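To make the split concrete, here is a minimal Python sketch of the three table types, using hypothetical customer and order entities. In a real data vault these would be relational tables, but the separation of business keys, associations and descriptive details is the same.

```python
from dataclasses import dataclass
from datetime import datetime

# A minimal data vault sketch with hypothetical entity names.
# Hubs hold only business keys plus load metadata; links (spokes) hold only
# the associations between business keys; satellites hold the descriptive
# and temporal details.

@dataclass
class HubCustomer:          # hub: one row per customer business key
    customer_key: str
    load_date: datetime
    record_source: str

@dataclass
class HubOrder:             # hub: one row per order business key
    order_key: str
    load_date: datetime
    record_source: str

@dataclass
class LinkCustomerOrder:    # link/spoke: association between business keys
    customer_key: str
    order_key: str
    load_date: datetime
    record_source: str

@dataclass
class SatCustomerDetails:   # satellite: descriptive and temporal attributes
    customer_key: str
    name: str
    address: str
    valid_from: datetime
    load_date: datetime
    record_source: str
```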
This is a very strong design but even more difficult and expensive to build.
Forward-thinking: Combining data storage types for analysis & reporting
Neither normalized data warehouses nor data vault storages are always a good solution for presenting data for analysis, reporting and BI.
To overcome this problem, the concept of the data mart was introduced. The data mart is focused on a single business process or functional area. It combines data taken either from the data warehouse or data vault storage and presents it in a dimensional model that is efficient for reporting and analytical query purposes (OLAP processing).
Typically, there are multiple data marts within an organization.
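To illustrate why the dimensional model suits reporting, here is a small sketch, assuming pandas and a hypothetical sales star schema, of the kind of aggregation a data mart query typically runs.

```python
import pandas as pd

# Hypothetical star schema: a sales fact table plus a date dimension.
dim_date = pd.DataFrame({
    "date_key": [20190101, 20190102, 20190201],
    "month":    ["2019-01", "2019-01", "2019-02"],
})
fact_sales = pd.DataFrame({
    "date_key": [20190101, 20190102, 20190201],
    "product":  ["A", "B", "A"],
    "amount":   [100.0, 250.0, 75.0],
})

# A typical OLAP-style query: revenue per month, answered by joining the
# fact table to the dimension and aggregating.
report = (
    fact_sales
    .merge(dim_date, on="date_key")
    .groupby("month", as_index=False)["amount"]
    .sum()
)
print(report)
```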
The growth of agile: Centralized data storages
As already mentioned, both normalized data warehouses and data vault storages are hard to build.
As the IT world moved towards agile environments, business users began calling for the immediate availability of integrated data. Ralph Kimball offered an easy answer: bottom-up design.
The idea behind this approach was along the lines of, ‘let’s not bother with the overall normalized enterprise model first and go directly to designing data marts based on the source data!’. This brilliant idea decreased the costs required to build centralized data storages, as the data could be incrementally provided to the business.
But there are some drawbacks, as the incrementally deployed data marts must somehow be conformed.
The solution is based on conformed dimensions and the data warehouse bus matrix approach. The major disadvantage, compared to the older approaches, is that historical data from areas not yet covered by deployed data marts might be lost because of source system retention limits.
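As a rough illustration, here is a toy bus matrix in Python. The business processes and dimension names are hypothetical, but it shows how the dimensions shared across processes are the ones that must be conformed, i.e. defined once and reused with identical keys and attributes.

```python
from collections import Counter

# Toy bus matrix: rows are business processes (future data marts),
# values are the dimensions each process uses.
bus_matrix = {
    "sales":     {"date", "customer", "product"},
    "shipments": {"date", "customer", "warehouse"},
    "billing":   {"date", "customer"},
}

# Dimensions used by more than one process are the conformed dimensions.
usage = Counter(dim for dims in bus_matrix.values() for dim in dims)
conformed = {dim for dim, count in usage.items() if count > 1}
print(conformed)  # {'date', 'customer'}
```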
Data volume surge, agile working and the birth of data lakes
The dawn of big data technologies and virtually unlimited storage enabled the birth of an alternate approach: the data lake.
A data lake stores data in its natural format in a single repository, no matter whether the data is structured (tables, indexed files), semi-structured (XML, JSON, text files) or even unstructured (documents, emails, audio and video clips, photos or document scans).
No source data is lost to validation and cleansing, even though only a small part of the data is usually needed for analytical and reporting purposes. The data can later be analysed directly using various tools (e.g. Apache MapReduce, Pig, Hive, Spark or Flink), or transformed into data marts.
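As a rough sketch of this workflow, assuming PySpark and a hypothetical lake path and schema, raw JSON can be explored directly and a curated slice written out for downstream use.

```python
from pyspark.sql import SparkSession

# A minimal PySpark sketch: reading semi-structured JSON straight from a
# data lake path and exploring it without any upfront schema design.
# The path and column names are hypothetical.
spark = SparkSession.builder.appName("data-lake-exploration").getOrCreate()

# Spark infers the schema from the raw JSON files on read.
events = spark.read.json("s3a://data-lake/raw/events/2019/")

# Ad-hoc analysis directly on the raw data...
events.groupBy("event_type").count().show()

# ...or a curated slice written out for a downstream data mart.
events.filter(events.event_type == "purchase") \
      .select("customer_id", "amount", "event_time") \
      .write.parquet("s3a://data-lake/marts/purchases/")
```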
Nevertheless, adequate data management is necessary to govern a data lake. Even though the data lake approach preserves all data, it lacks structure and a single version of the truth.
NoSQL and the best of both worlds
A step further along the data storage evolution brings us to an approach that can combine the strongest features of all historical approaches while eliminating the downsides. It's named the central discovery storage and employs a modern document NoSQL database.
How does it work?
Data from source systems is collected and, after some optional transformations, loaded into the document database collections for permanent storage. This approach ensures that no historical data is lost, while still allowing some structure to be introduced during the ETL (extract-transform-load) process.
Additional manual and automated ETL jobs can be used for further data transformations within the permanent storage to extend the structure in the existing document collection. These can introduce new attributes and create new document collections incrementally, as all original and derived data is permanently stored.
Finally, additional ETL jobs can transform the data from the permanent storage collections into document collections that serve as data marts for analysis and reporting.
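The blog doesn't name a specific document database, so the following sketch assumes MongoDB via pymongo; the collection, field and source names are hypothetical. It shows the load of raw records with audit metadata, and a later transformation that derives a new collection from the permanent storage.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

# A minimal sketch of the load step, assuming MongoDB as the document store.
# Collection and field names are hypothetical.
client = MongoClient("mongodb://localhost:27017")
db = client["central_discovery_storage"]

def load_source_rows(rows, source_name):
    """ETL load: wrap each source row in a document with audit metadata
    and append it to the permanent storage collection."""
    documents = [
        {
            "payload": row,                        # original record, unchanged
            "record_source": source_name,          # where it came from
            "load_ts": datetime.now(timezone.utc)  # when it was loaded
        }
        for row in rows
    ]
    db["permanent_storage"].insert_many(documents)

# Example: land a CRM row without losing any original attributes.
load_source_rows(
    [{"customer_id": 42, "name": "Acme Ltd", "country": "UK"}],
    source_name="crm",
)

# A later transformation can derive a structured collection incrementally,
# e.g. a per-country summary for a downstream data mart.
pipeline = [
    {"$group": {"_id": "$payload.country", "customers": {"$sum": 1}}},
    {"$out": "mart_customers_by_country"},
]
db["permanent_storage"].aggregate(pipeline)
```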
Compared to historical approaches, this central discovery storage method means that:
- All historical data can be retained without losing any records
- The structure of the stored data can be constructed incrementally as needed
- Data marts for business users can be deployed in an agile way
Finally, thanks to the ETL transformations performed at multiple stages, the overall storage can be as effective and efficient as possible.
So, that's our concise history of data storage. We wonder: where will data storage take us next?
Editor's note: This blog post was updated in 2019 to offer more value. Enjoy!