
The Importance of Retaining Time Series Data

We need to keep all of our data, and we need to keep it in multiple dimensions. These days, that means planning to stream data into a centralized repository, or warehouse. The next step is to maintain that data so that it’s serialized by time. A time series is a set of data points that are indexed and ordered by the time at which they were recorded; for example, snapshots of the tables in a database exported every few hours. It can also be defined as data recorded in sequence that measures the same setting over time, such as when every commit to a field streams the change into a repository.
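
To make the idea of "indexed and ordered by time" concrete, here is a minimal sketch in Python. It assumes changes to a single field arrive in time order, as they would from a commit stream. The names `FieldHistory`, `record`, and `as_of` are made up for illustration and don't come from any particular product.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FieldHistory:
    """Append-only history of one field, ordered by the time of each change."""
    timestamps: list = field(default_factory=list)
    values: list = field(default_factory=list)

    def record(self, when: datetime, value) -> None:
        # Assumption: changes arrive in time order, so appending keeps the list sorted.
        if self.timestamps and when < self.timestamps[-1]:
            raise ValueError("out-of-order timestamp")
        self.timestamps.append(when)
        self.values.append(value)

    def as_of(self, when: datetime):
        """Return the value that was current at `when`, or None if nothing was recorded yet."""
        i = bisect_right(self.timestamps, when)
        return self.values[i - 1] if i else None


def utc(*args):
    """Small helper to build timezone-aware UTC datetimes."""
    return datetime(*args, tzinfo=timezone.utc)


price = FieldHistory()
price.record(utc(2024, 1, 1, 9), 10.00)
price.record(utc(2024, 1, 1, 15), 12.50)
price.record(utc(2024, 1, 2, 9), 11.75)

print(price.as_of(utc(2024, 1, 1, 12)))  # 10.0
```

Because nothing is ever overwritten, you can ask what the field looked like at any point in the past, which is the whole reason to keep the history around.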

The logical extension is that the amount of data grows when we retain time series data, and so do costs. That growth is often the point at which we start using a Time Series Database (TSDB), which gives us the opportunity to find solutions and strategies for tracking the performance of various fields over time. Time Series Databases facilitate usability and scalability, and the insights that can be found in them are critical to engineers, scientists, and software developers.

The time a setting was changed is crucial. Time series data is unique in that it has a natural order in time, which gives us the opportunity to perform time series analysis. It isn’t only a linear log of what happened, though. Time series data is not always about the chronological order of changes; it is often about how the value of events rises or falls over time. In the modern world of the Internet of Things and big data, many applications depend heavily on data that indicates how events change over time. In the measurements of these events, time is not just a metric but also an axis against which values can be plotted to show an increase or decrease. And using that axis, you can often forecast what comes next.

The patterns you can see include trends, instabilities, and cycles such as seasonality. Effectively, time is one type of data, and other data is captured using time as a dimension. This gives us the ability to forecast based on past events and patterns. It allows us to build sales projections, to plan for spikes in utilization, to manage supply chains, and to make predictions about the timing of pretty much any event.
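
As a rough illustration of forecasting from trend and seasonality, here is a sketch of a seasonal-naive forecast with a simple drift term: it repeats the most recent season and shifts it by the average change between the last two seasons. This is a common baseline rather than a recommendation, and the function name and sample numbers are invented for the example.

```python
def seasonal_naive_forecast(history, season_length, steps):
    """Forecast `steps` values by repeating the last season plus an average drift."""
    if len(history) < 2 * season_length:
        raise ValueError("need at least two full seasons of history")
    last = history[-season_length:]
    prev = history[-2 * season_length:-season_length]
    # Trend estimate: mean change between the two most recent seasons.
    drift = sum(l - p for l, p in zip(last, prev)) / season_length
    return [
        last[i % season_length] + drift * (i // season_length + 1)
        for i in range(steps)
    ]


# Two years of made-up quarterly figures: a repeating shape that rises by 2 each year.
quarterly = [10, 20, 30, 40, 12, 22, 32, 42]

print(seasonal_naive_forecast(quarterly, season_length=4, steps=4))
# [14.0, 24.0, 34.0, 44.0]
```

Note that the forecast is only possible because both years were kept. With just the latest values, there would be no season to repeat and no trend to measure.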

There are a number of reasons to keep time series data, even though it often expands the amount of data you have to retain and can sharply increase cost. Predicting the weather and other meteorological patterns is important, and the prediction of natural cycles is part of the original inspiration for computing, going back to the ancient Greeks, who built the first known mechanical computers. Retaining time series data also expands the use cases to include anomaly detection, and it opens up things like noise control, detection of malicious activity, and so much more.
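
Anomaly detection is one of the easier use cases to sketch. The example below flags any reading that sits more than a few standard deviations away from the readings just before it (a rolling z-score). The window size, threshold, and sample readings are assumptions chosen for illustration; real systems usually need tuning and often use more robust methods.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return the indexes of values that deviate sharply from the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mu = mean(recent)
        sigma = stdev(recent)
        if sigma == 0:
            # A perfectly flat window: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
        elif abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up utilization readings with one obvious spike at index 7.
readings = [50, 51, 49, 50, 52, 51, 50, 95, 51, 50]

print(find_anomalies(readings))  # [7]
```

The check only works because there is a history to compare against. A single current value of 95 says nothing on its own.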

But the data comes at a cost. Not only is it more complicated to retain time series data and make it available, but we now need more policies than ever around the retention of data, for GDPR, SOC 2 compliance, privacy, and just plain being responsible with the data. Those policies usually require controls. Those controls usually require programming and/or additional third-party products. And the more sensitive the data and the heavier the regulation, the higher the cost to retain it. But ultimately, the ability to analyze that data over time provides value that outweighs the cost to retain the data, even if we don’t know what the uses will be just yet.
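
The kind of control a retention policy tends to require can be sketched in a few lines. The categories and retention windows below are hypothetical placeholders, not values taken from GDPR, SOC 2, or any other framework; the real numbers have to come from your own documented policies.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows per data category (assumed values for illustration only).
RETENTION = {
    "metrics": timedelta(days=365),
    "personal": timedelta(days=30),
}


def apply_retention(records, now):
    """Split records into (kept, expired) based on each record's category and timestamp."""
    kept, expired = [], []
    for rec in records:
        limit = RETENTION[rec["category"]]
        if now - rec["timestamp"] > limit:
            expired.append(rec)
        else:
            kept.append(rec)
    return kept, expired


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"category": "metrics", "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"category": "personal", "timestamp": datetime(2024, 4, 1, tzinfo=timezone.utc)},
    {"category": "personal", "timestamp": datetime(2024, 5, 20, tzinfo=timezone.utc)},
]

kept, expired = apply_retention(records, now)
print(len(kept), len(expired))  # 2 1
```

Returning the expired records instead of deleting them in place leaves room for an audit step before anything is actually removed.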

Data has now become the largest advantage organizations and businesses have in making sound data-driven decisions and policy plans. And the insights organizations can give their customers are now amongst the most valuable features that customers base decisions on. If a company has the data and can expose it in a way that the controller of the data has documented as acceptable, then there seem to be few limits to what we can do with it. So to summarize what I’ve been rambling on about: keep all your data and make sure you know when each piece of data was captured, either by streaming changes as data is updated or by taking snapshots at the smallest interval you can. You’ll be glad you did when you’re told to do something cool with it and are actually able to do so.