We all want our databases to work as efficiently as possible. In today's environment, being able to run queries quickly and accurately can be the difference between the best-performing businesses and the laggards. But while there are many tweaks you can make to improve performance, there are also a few steps you can take to help turn your data warehouse from a hatchback to a hypercar.
If you’re running SQL Server databases, there's one thing in particular that can have a huge impact on your performance: adopting columnstore indexing. The feature has been available since SQL Server 2012, but a series of enhancements introduced in the 2016 version has expanded its capabilities for larger data warehouses.
But what is columnstore indexing, and how can you go about deploying it to give you the speed boost you need?
What is columnstore indexing?
In a traditional SQL solution, data is stored in rows, so each record will have a row of its own that may extend over multiple pages. For example, a typical database for employees may have entries like this:
Entry 1: Alice, Smith, Finance, Accountant
Entry 2: Bob, Jones, Sales, Sales Director
Entry 3: Charles, Williams, HR, Recruiter
And so on for the entire database, with each individual's details stored in their own row. In a columnstore index, by contrast, the values of each field are stored together as a column on disk, so the same data as above would look like this:
Column 1 (first name): Alice, Bob, Charles
Column 2 (surname): Smith, Jones, Williams
Column 3 (department): Finance, Sales, HR
Column 4 (job title): Accountant, Sales Director, Recruiter
The key difference is that every field is assigned its own column, which is stored separately. This means that if you run a query that only needs, for example, employees' surnames or job titles, you only have to read the relevant columns rather than pulling every field of every record.
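To make the two layouts concrete, here is a minimal Python sketch. It models only the idea, not how SQL Server stores data on disk, and the field names and helper functions are hypothetical choices made for illustration.

```python
# Hypothetical employee records, taken from the example above.
rows = [
    ("Alice", "Smith", "Finance", "Accountant"),
    ("Bob", "Jones", "Sales", "Sales Director"),
    ("Charles", "Williams", "HR", "Recruiter"),
]
field_names = ("first_name", "surname", "department", "job_title")

# Rowstore layout: one tuple per record, with all of its fields kept together.
row_store = rows

# Columnstore layout: one list per field, holding that field for every record.
column_store = {
    name: [row[i] for row in rows] for i, name in enumerate(field_names)
}


def surnames_from_rows(store):
    # Every whole record is visited, even though only one field is needed.
    return [record[1] for record in store]


def surnames_from_columns(store):
    # Only the single relevant column is read.
    return store["surname"]


print(surnames_from_rows(row_store))
print(surnames_from_columns(column_store))
```

Both functions return the same surnames, but the columnar version never touches first names, departments or job titles.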
Why columnstore indexing could be the answer
Instead of having to read every field for each entry, a query only needs to fetch the specific columns required. This means much less data is sent from the disk, reducing the number of input/output operations needed.
As each column is made up of similar data, it compresses more readily, which shrinks its in-memory footprint. This in turn boosts overall query performance, as the server can perform more queries and data operations in memory.
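One way to see why similar values compress well is run-length encoding, sketched below in Python. This is a simplified illustration under the assumption that a column holds long runs of repeated values. It is not a description of the exact algorithm any particular database engine uses, and the function names and sample data are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# A hypothetical department column for twelve employees.
department_column = ["Finance"] * 4 + ["HR"] * 2 + ["Sales"] * 6

encoded = run_length_encode(department_column)
print(encoded)
print(len(department_column), "values stored as", len(encoded), "pairs")
```

The same trick would gain little on row-oriented data, where a name is followed by a department and then a job title, so adjacent values rarely repeat.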
The upshot of this is that you can see significant performance gains over rowstore solutions. In fact, a columnstore indexing format can achieve up to ten times the query performance of traditional row-oriented storage, as well as up to seven times better data compression over the uncompressed data size.
When is a columnstore solution the best option?
This method of storing data won't be appropriate for every dataset, so it pays to know when to use it. Generally speaking, columnstore indexes are able to provide high performance gains when being used for analytic queries that scan large amounts of highly-structured data, especially on large tables.
Columnstore indexes are well suited for use on data warehousing and analytics workloads, especially on fact tables, because these operations tend to require full table scans rather than index seeks.
On the other hand, rowstore solutions still have their place, particularly for queries that seek into the data to find a specific value or a small range of values.
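The scan-versus-seek trade-off can be sketched in Python as follows. The table, column names and functions are hypothetical, and a dictionary keyed by order ID merely stands in for a rowstore index. Treat it as an illustration of the access patterns rather than a model of a real engine.

```python
# Hypothetical sales data held in both layouts.
order_ids = [101, 102, 103, 104]
customers = ["Acme", "Birch", "Acme", "Cedar"]
amounts = [250.0, 100.0, 75.5, 310.0]

# Rowstore stand-in: whole records, keyed by order ID for direct seeks.
row_store = {
    oid: (oid, cust, amt)
    for oid, cust, amt in zip(order_ids, customers, amounts)
}

# Columnstore stand-in: one list per field.
column_store = {"order_id": order_ids, "customer": customers, "amount": amounts}


def total_sales(columns):
    # Analytic scan: only the "amount" column is read.
    return sum(columns["amount"])


def lookup_order_rowstore(rows, order_id):
    # Seek: a keyed lookup returns the whole record in one step.
    return rows[order_id]


def lookup_order_columnstore(columns, order_id):
    # The position must be found first, then every column is visited
    # to reassemble the record.
    position = columns["order_id"].index(order_id)
    return tuple(
        columns[name][position] for name in ("order_id", "customer", "amount")
    )


print(total_sales(column_store))
print(lookup_order_rowstore(row_store, 103))
print(lookup_order_columnstore(column_store, 103))
```

The aggregate suits the columnar layout, while fetching one complete record is more direct in the row-oriented one.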
What are the limitations of columnstore indexing?
As noted above, columnar indexing offers some powerful advantages, but it has several limitations that mean it won’t be suitable for every database. As well as being harder to manage if you're looking for a deep dive into a specific record, there are a few other scenarios where a more traditional approach may serve you better.
For instance, columnstore indexing becomes much more useful when dealing with very large data sets, i.e. those that contain records numbering in the millions. For databases smaller than this, the speed and compression advantages that columnar tables offer are somewhat negated.
Another issue may be if you're expecting the entries within your database to change frequently. The more volatile your data, the more complex and costly it will be to update a columnstore index. One reason for this is that the more often you have to delete outdated or irrelevant values, the more fragmented your index gets, which also affects its performance. Some data types and formats are also not supported, especially if you're using older versions such as SQL Server 2014 and earlier.
As a result, columnstore indexes are most suitable for very large, fairly static datasets. Attempting to apply this technology to relatively small or highly dynamic databases is likely to cause more problems than it solves, as you have to deal with added complexity without being able to take full advantage of any performance increases.
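The fragmentation cost of frequent deletes can be illustrated with a toy Python model. It assumes deleted values are flagged rather than physically removed until the segment is rebuilt, which is a common approach in columnar designs. The class and method names are hypothetical and do not correspond to any real API.

```python
class ColumnSegment:
    """Toy compressed segment whose stored values are not edited in place."""

    def __init__(self, values):
        self.values = list(values)
        self.deleted = [False] * len(self.values)

    def delete(self, position):
        # Logical delete: the value stays in storage and is only flagged.
        self.deleted[position] = True

    def live_values(self):
        # Values that have not been flagged as deleted.
        return [v for v, gone in zip(self.values, self.deleted) if not gone]

    def fragmentation(self):
        # Fraction of stored slots still occupied by deleted values.
        if not self.values:
            return 0.0
        return sum(self.deleted) / len(self.values)

    def rebuild(self):
        # Rewrite the segment without the flagged values.
        self.values = self.live_values()
        self.deleted = [False] * len(self.values)


segment = ColumnSegment(["Accountant", "Sales Director", "Recruiter", "Analyst"])
segment.delete(1)
segment.delete(3)
print(segment.fragmentation(), len(segment.values))

segment.rebuild()
print(segment.fragmentation(), len(segment.values))
```

Until the rebuild runs, half of the stored slots hold dead values that every scan still has to skip over, which is why volatile data erodes the benefits described earlier.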
The difference between clustered and non-clustered
Columnstore indexes come in both clustered and non-clustered forms. Both store data in the same columnar format, but a non-clustered columnstore index is a secondary index created on a rowstore table, whereas a clustered columnstore index is the primary storage for the entire table.
A non-clustered columnstore index lets users run analysis queries against the index while the original rowstore table remains in place for other operations. In SQL Server 2012 and 2014, creating one made the table read-only; from SQL Server 2016 onwards, it is updateable. A clustered columnstore index has been updateable since SQL Server 2014, allowing users to perform insert, delete and update operations on it. In SQL Server 2014, a clustered columnstore index had to be the only index on the table, but from SQL Server 2016, non-clustered rowstore indexes can be created alongside it.
Understanding when to use columnstore indexing can provide your business with the speeds and efficiency it needs to transform how it interacts with data warehouses, delivering essential information to users at speeds that would not be possible with traditional solutions. If you aren't using them yet, now's the time to change that.
The future of column-oriented databases
Columnar indexing solutions are widely viewed as the future of many databases, especially as the amount of data generated in the modern environment continues to grow. According to figures from IDC, the size of the world's 'datasphere' will grow from 33 zettabytes in 2018 to 175 zettabytes in 2025, so improved ways of storing and processing this data will be vital.
At the same time, demand for big data analytics is also set to grow exponentially, so both the faster performance and greater compression that columnar solutions allow are likely to be crucial in handling this huge volume of information. Business units will expect fast results from their analyses, so columnar tools may well be the only practical solution to give firms the efficiency and accuracy needed.
In particular, recent advances in artificial intelligence stand to benefit from columnar databases, which are well suited to supplying large data sets as sources for advanced machine learning processes.