Types of machine learning for records management

Manage the massive scale of unstructured data through the application of modern machine learning

Published:

July 30, 2019

With the advent of big data, it has become more difficult for a human records manager to ensure that all information is classified for retention. Modern machine learning and artificial intelligence (AI) approaches finally offer ways to deal with the massive scale of unstructured data. In this article, we cover the types of machine learning that are useful for records management.

Traditional approaches to records management and classification

The conventional approach is to have users categorize records as they create them. But this approach does not scale with the volume of data now under management. The extra workload disrupts users who are trying to do their real jobs, which leads to lost productivity and inaccurate classifications.

A more recent industry approach is to use a set of hand-coded rules to manage classification inside a records management platform. Records managers create an explicit, hierarchical system of rules that categorizes incoming records based on their metadata. But this system breaks down as organizations manage records from content sources that are not arranged like content management stores, such as email, chats, and IoT sensors.
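To make the limitation concrete, here is a minimal sketch of metadata-based rules in Python. The field names, values, and categories are all hypothetical, and the rule list is a flat simplification of the hierarchical systems described above. It is not how any particular platform implements its rules.

```python
# Hypothetical hand-coded rules: (metadata field, expected value, category).
# All names here are illustrative assumptions, not taken from a real platform.
RULES = [
    ("site", "finance", "Financial Records"),
    ("content_type", "contract", "Legal Agreements"),
    ("site", "hr", "Personnel Records"),
]


def classify_by_metadata(metadata):
    """Return the category of the first matching rule, or 'Unclassified'."""
    for field, expected, category in RULES:
        if metadata.get(field) == expected:
            return category
    return "Unclassified"


# A record from a content management store carries the metadata the rules expect.
cms_doc = {"site": "finance", "content_type": "invoice"}
# A chat message has different metadata, so no rule matches it.
chat_message = {"channel": "general", "sender": "alice"}

print(classify_by_metadata(cms_doc))
print(classify_by_metadata(chat_message))
```

The chat message falls through every rule because its metadata has a different shape, which is the failure mode that motivates looking at the content itself.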

How do we ensure that we can manage the rules themselves as needed? Information categories are likely to shift over time because of legislation or other factors. In these situations, we need to examine the actual content that makes up a record to determine how to classify it.

Business documents contain a wealth of information that can be used to infer the right category for a document. The problem here is that the documents, like the content sources, are varied. They may be spreadsheets, web pages, word processing documents, code files, images, audio, or video files. It is challenging to hand-craft a set of procedural rules that will parse and accurately categorize such varied documents. These are situations where machine learning can help.

What is machine learning?

For records management, we use Machine Learning (ML) to have a computer choose categories for content. ML is a process that uses statistical techniques, or models, to predict the categories of records. ML is how we classify data with Records365.

First, we use Natural Language Processing to turn our documents into machine-readable data that can be used to create ML models. This approach generally means that we identify important words in the body of text and then count how many times these words appear in each document. Next, we feed this information into a machine learning algorithm, which will create a model we can use to categorize incoming data.
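The word-counting step can be sketched with the Python standard library alone. This is a simplified illustration, not the pipeline any specific product uses: the stop-word list, the sample documents, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "to", "is", "in", "on"}


def word_counts(text):
    """Lower-case the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def to_vector(text, vocabulary):
    """Turn a document into a list of counts, one entry per vocabulary word."""
    counts = word_counts(text)
    return [counts[w] for w in vocabulary]


# Hypothetical sample documents.
docs = [
    "Invoice for the payment of the March invoice",
    "Employment contract for a new employee",
]

# The vocabulary is every counted word seen in any document, in a fixed order.
vocabulary = sorted(set().union(*(word_counts(d) for d in docs)))
vectors = [to_vector(d, vocabulary) for d in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a row of numbers with the same length, which is the kind of input a machine learning algorithm can work with.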

There are many different algorithms for performing this kind of analysis. We can generally divide the algorithms into two types of ML processes: supervised and unsupervised learning.

What is supervised learning?

Supervised learning starts with records that we have already categorized. We use these existing records and categories as examples to train an ML model. The examples let the model learn the common characteristics of each kind of record, so that it can identify records of that type.

For the technically inclined, this process involves fitting and tuning different algorithms to the data. We then select the one that looks to be the most accurate, while controlling for bias and variance. These algorithms include logistic regression, discriminant analysis, random forest classifiers, and support vector machines. We can also use neural networks (aka deep learning).
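As a rough illustration of fitting one of these algorithms, the sketch below trains a small logistic regression classifier on word counts using plain Python and stochastic gradient descent. The training records, the two categories, the learning rate, and the number of epochs are all assumptions chosen for the example. A real project would use an established ML library and hold out data to measure accuracy.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical, already-classified sample records (1 = "Finance", 0 = "HR").
training = [
    ("invoice payment due for supplier", 1),
    ("quarterly budget and invoice summary", 1),
    ("payment receipt for purchase order", 1),
    ("employee leave request form", 0),
    ("new employee onboarding checklist", 0),
    ("performance review for employee", 0),
]

vocabulary = sorted({w for text, _ in training for w in tokenize(text)})


def featurize(text):
    """Word counts over the training vocabulary; unseen words are ignored."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]


def predict_proba(weights, bias, features):
    """Logistic (sigmoid) function applied to a weighted sum of the counts."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, learning_rate=0.1):
    """Fit the weights with per-example gradient steps on the log loss."""
    weights = [0.0] * len(vocabulary)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = featurize(text)
            error = predict_proba(weights, bias, x) - label
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


weights, bias = train(training)


def classify(text):
    p = predict_proba(weights, bias, featurize(text))
    return "Finance" if p >= 0.5 else "HR"


print(classify("invoice for late payment"))
print(classify("employee review meeting"))
```

The trained weights are the model: once fitted, they can categorize new records without revisiting the training data.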

Supervised learning is a very intuitive approach to a records management system, especially if the categories and retention plans are well-defined ahead of time. But we also need sample data that has already been classified. Content sources with a well-established and adopted information architecture are a good fit for supervised ML applications.

What is unsupervised learning?

In unsupervised learning, we do not know the categories of the sample documents.

In this paradigm, we tune different algorithms to identify which documents form similar clusters. This approach is useful for analyzing completely unknown data. But it presents difficulties:

We need to work out how many clusters fit the data. Then we need to decide how to map those clusters to retention outcomes.
Many of these algorithms do not produce a model that we can use to make predictions about new documents.
If we receive new content, we need to rerun the algorithm to identify the cluster where it fits. This process is very inefficient, especially as data volumes grow.
This also makes it an unstable basis for classification. Records may move between clusters every time we add data.
There is no guarantee that the clusters created are relevant to records management categories.

Because of these difficulties, clustering is best used as a first step in working out how to manage uncategorized data. For example, we can use it to find some clusters in the data and then formalize them into file plan categories. This scenario would set us up nicely for a supervised learning approach, which would generate a model we can use to categorize new records.
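The clustering step and two of its difficulties can be sketched in plain Python. The example uses single-linkage agglomerative clustering with Jaccard distance on word sets, which is one of many possible choices and is picked here only because it is short. The documents and function names are assumptions, and each document is assumed to contain at least one word.

```python
import re
from itertools import combinations


def word_set(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_distance(a, b):
    """0.0 for identical word sets, 1.0 for sets with no words in common."""
    return 1.0 - len(a & b) / len(a | b)


def linkage(sets, left, right):
    """Single linkage: distance between the closest pair of members."""
    return min(jaccard_distance(sets[i], sets[j]) for i in left for j in right)


def agglomerative_clusters(docs, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    sets = [word_set(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > num_clusters:
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: linkage(sets, pair[0], pair[1]),
        )
        clusters.remove(left)
        clusters.remove(right)
        clusters.append(sorted(left + right))
    return sorted(clusters)


# Hypothetical uncategorized documents, referred to by index below.
docs = [
    "invoice payment supplier",
    "invoice payment overdue",
    "employee leave request",
    "employee leave approval",
    "supplier invoice dispute",
]

# The number of clusters is our choice, and it changes the outcome.
print(agglomerative_clusters(docs, 2))
print(agglomerative_clusters(docs, 3))
```

The output is only groups of document indices. Nothing in it names a file plan category or a retention outcome, so that mapping remains a manual decision, and a new document can only be placed by running the clustering again.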

Machine learning for records: the bottom line

The amount of data under management is growing. But ML offers promising ways to help with the difficult task of classifying records. We don’t have to make users classify every record they generate. Records managers don’t need to craft complicated sets of rules based on record metadata. We can train ML algorithms to do this work for us by examining the actual documents.

There are two types of ML in records management:

Supervised learning, in which we train a model using a set of pre-classified documents.
Unsupervised learning, in which we look for groupings within the data. We then use them to determine how we will partition our records and assign them a retention outcome.

Using ML strategies will allow us to cope with the increasing volume, velocity, and variety of records that we need to manage in the world of big data.