Data Modeling Best Practices and Pitfalls

At first glance, data modeling projects in an industrial ecosystem can seem daunting. There are many factors that manufacturers must consider to ensure their data modeling project is successful. The following guidelines address commonly encountered challenges and provide best practices to overcome these hurdles.

Focus on Requirements of End Systems and Use Cases

A data model is typically a standard representation of an asset, process, product, system, or role. In many cases, users who get stuck in modeling start by creating a logical view of a machine (it has a single press, two motors, a pump, etc.) and then work backward through the plant structure (it’s in Area 2, at the Kentucky plant), building it all out as a hierarchy.

Instead, it’s best to start by asking, “What does the end application need? What is the minimal information needed from production to create that? What is the minimal information users need to see in the dashboard?” If a second application requiring a similar model is introduced, the user can leverage the existing model or create a new model specific to this application.
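
As a rough sketch of working from the end application backward, the example below assumes a hypothetical dashboard that only needs a machine’s ID, state, and part count, and sizes the model to exactly those fields (the names are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical dashboard requirement: it only needs these three fields,
# so the model contains only these three fields.
@dataclass
class MachineStatus:
    machine_id: str
    state: str        # e.g. "RUNNING", "IDLE", "FAULTED"
    part_count: int

status = MachineStatus(machine_id="PRESS-01", state="RUNNING", part_count=1432)
print(json.dumps(asdict(status)))  # payload sized to what the dashboard consumes
```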

Plan on Models Changing

Creating models is similar to defining a class in programming. A programmer designs the class with an end application in mind, knowing that over time the requirements will evolve and mature and changes will be necessary. The best defense against changing models is to model the bare minimum for the end application. It’s easier to add to a model than to remove or change existing attributes.
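
A minimal sketch of the class analogy, using a hypothetical pump model: a new attribute is added with a default value so existing producers and consumers keep working, whereas removing or renaming an existing attribute would have broken them:

```python
from dataclasses import dataclass

# Version 1: the bare minimum the end application asked for.
@dataclass
class PumpModel:
    pump_id: str
    pressure_psi: float

# Version 2: requirements evolved, so a field is ADDED with a default.
# Existing code that builds or reads the model keeps working; removing or
# renaming pressure_psi would have forced changes everywhere instead.
@dataclass
class PumpModelV2:
    pump_id: str
    pressure_psi: float
    temperature_c: float = 0.0  # new attribute, optional for older producers
```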

Minimize Model Hierarchy Whenever Possible

Hierarchy scales complexity. As the model hierarchy grows, it becomes more difficult to manage in the end applications. Many applications don’t support hierarchy, so users will likely be required to flatten data models anyway. A simple example of this is site information. Rather than create a “Site” model with a single name attribute, it’s better to put a “Site” attribute in the machine model. This way, organizations avoid hierarchy, and the machine model is self-contained. If sites have three or four other properties, like address, employee count, etc., then it may be worth creating a hierarchy in which a site contains 1-N machines, but it’s better to avoid this when possible.
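
The contrast might look like the sketch below (attribute names are illustrative): the flat form keeps the machine payload self-contained, while the hierarchical form forces consumers to stitch the site and machine objects back together:

```python
import json

# Flat: the machine model carries the site name directly, so a single
# payload answers the end application's question.
flat_machine = {
    "machine_id": "CNC-07",
    "site": "Kentucky",
    "state": "RUNNING",
}

# Hierarchical: a separate Site object that contains machines. Worth it only
# when a site has several properties of its own; harder for flat consumers.
site_hierarchy = {
    "site": "Kentucky",
    "address": "100 Plant Rd",
    "machines": [
        {"machine_id": "CNC-07", "state": "RUNNING"},
    ],
}

print(json.dumps(flat_machine))
print(json.dumps(site_hierarchy, indent=2))
```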

Include Location/Metadata in the Model Itself

In general, it is best practice to make models as self-contained as possible. Think of it like storing a model in a database: whenever possible, store all the information in a single table. This way, users can query one table/topic/source and get all the information the end application needs. In SQL, splitting a model across tables is simple. As an example, one might create separate tables for site and machine information and then do a table JOIN to get both sets of information. There is no concept of a JOIN in a UNS or MQTT broker, and writing code that listens on multiple topics and pulls the information together is messy and unreliable. It’s best to load models up (using hierarchy or not) with all the information required by the end application.
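
The sketch below illustrates the difference, using SQLite and hypothetical table and topic names: in SQL the split model is reassembled with a JOIN, while for a UNS or MQTT broker the same information is packaged into one self-contained payload per machine:

```python
import json
import sqlite3

# In SQL, splitting the model across tables is fine: a JOIN reassembles it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site (site_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE machine (machine_id TEXT, site_id INTEGER, state TEXT);
    INSERT INTO site VALUES (1, 'Kentucky');
    INSERT INTO machine VALUES ('CNC-07', 1, 'RUNNING');
""")
row = conn.execute("""
    SELECT m.machine_id, s.name, m.state
    FROM machine m JOIN site s ON m.site_id = s.site_id
""").fetchone()

# Over MQTT/UNS there is no JOIN, so publish one self-contained payload per
# machine instead of separate 'site' and 'machine' topics.
payload = {"machine_id": row[0], "site": row[1], "state": row[2]}
print(json.dumps(payload))  # e.g. published to a topic like plant/kentucky/cnc-07 (hypothetical)
```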

Avoid Duplicating Models in Different Systems

Where possible, use structured data coming from the sub-system. If the source system is producing modeled data (e.g., JSON, OPC UA, SQL), try to leverage those underlying models as much as possible. As a last resort, break the model up into its attributes and then reconstruct it.
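
As a simple illustration (the payload is hypothetical), reusing the structured data the source already produces is usually a one-line parse, while breaking it into attributes and reconstructing it duplicates the model definition:

```python
import json

# Hypothetical structured payload already produced by the source system.
source_payload = '{"machine_id": "CNC-07", "temperature_c": 41.2, "state": "RUNNING"}'

# Preferred: parse and reuse the existing structure as-is.
model = json.loads(source_payload)

# Last resort: explode the model into individual attributes and rebuild it
# by hand, duplicating the definition that the source system already owns.
rebuilt = {
    "machine_id": model["machine_id"],
    "temperature_c": model["temperature_c"],
    "state": model["state"],
}
assert rebuilt == model
```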

Keep Data Types Simple and Uniform

Users often need to map very specific data types (e.g., Int8) to very generic ones, like a numeric type in JSON. As a rule, specific type information is lost as data moves up to cloud systems. Plan for this. To mitigate complexity, keep type information simple. When possible, treat everything as ints or floats.
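
A small sketch of that normalization, assuming hypothetical readings tagged with PLC-style types: since JSON only carries a generic number type, the specific widths are dropped up front and everything is published as plain ints or floats:

```python
import json

# Hypothetical raw readings with PLC-style specific types noted alongside.
raw_readings = {
    "speed_rpm": ("Int16", 1450),
    "temperature_c": ("Float32", 41.2),
    "good_parts": ("Int8", 97),
}

# JSON only has a generic number type, so the specific width is lost on the
# way to the cloud anyway. Normalize everything to plain int/float up front.
payload = {name: (float(v) if isinstance(v, float) else int(v))
           for name, (_plc_type, v) in raw_readings.items()}

print(json.dumps(payload))  # {"speed_rpm": 1450, "temperature_c": 41.2, "good_parts": 97}
```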

Limit the Number of Model “Instances”

A plant with 10 of the same machine may have production information from an MES stored in a database (each machine identified by a unique machine ID) and process information from an OPC UA server. Generally, one will create a model for the machine and then create 10 instances of the model, one for each machine. This is required if the data mappings (in this case, OPC UA tag addresses) are unique for each machine. However, if the data is only coming from SQL, and each row represents a single machine, the user doesn’t need 10 instances. They just need a single instance that queries the rows, packages/manipulates the data (maybe changes a column name to something more human readable), and delivers it.
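
A rough sketch of the single-instance approach, using SQLite and hypothetical column names: one pass over the table renames terse columns to readable attributes and emits one payload per machine, with no need for 10 separately configured instances:

```python
import json
import sqlite3

# Hypothetical MES table: one row per machine, each with a unique machine ID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE production (mach_id TEXT, gd_cnt INTEGER, rej_cnt INTEGER);
    INSERT INTO production VALUES ('M-01', 940, 12), ('M-02', 902, 7);
""")

# A single model "instance" iterates the rows, renames terse columns to
# human-readable attributes, and emits one payload per machine.
for mach_id, good, rejects in conn.execute("SELECT * FROM production"):
    payload = {"machine_id": mach_id, "good_count": good, "reject_count": rejects}
    print(json.dumps(payload))
```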

Use a Single Timestamp

Factories are all about time-series data, and every data change of every tag has a timestamp. But modeled data isn’t time-series data. IT/cloud systems want a snapshot of the machine at time X with the values for all of its attributes. Machine learning software requires consistent data that shows the state of the machine at time X, X+100ms, X+200ms, etc. So, specifically for ML and AI applications, users must make sure they’re delivering the entire model with a single timestamp, typically a UTC epoch timestamp.
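
A minimal sketch of such a snapshot (field names are illustrative): the entire model is stamped once with a single UTC epoch timestamp rather than carrying per-attribute timestamps:

```python
import json
import time

# Snapshot of the whole model stamped once, rather than per-attribute
# timestamps. Epoch milliseconds in UTC is a common downstream convention.
snapshot = {
    "machine_id": "PRESS-01",
    "timestamp_ms": int(time.time() * 1000),  # single UTC epoch timestamp
    "state": "RUNNING",
    "pressure_psi": 122.4,
    "part_count": 1433,
}
print(json.dumps(snapshot))
```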

About the author

This article was written by Aron Semle, Chief Technology Officer at HighByte, where he is responsible for research and development, technical evangelism, and supporting customer success. Aron has a bachelor’s degree in Computer Engineering from the University of Maine, Orono, and more than 15 years of experience in industrial technology.