Blueprint for IIoT Timeseries Data in a Lakehouse

IIoT (Industrial Internet of Things) data is a strategic asset. Only when captured and stored efficiently does it become a central element of a truly data-driven organization.

Maintaining a robust and immutable operational history provides organizations with a reliable and verifiable record of production. This foundation facilitates root-cause analysis, continuous optimization, and smarter AI-powered decision-making.

…but where should this data be stored?

Should it go to the cloud? Into a dedicated timeseries database like InfluxDB or TimescaleDB? Or perhaps into plain PostgreSQL?

The most straightforward approach would be to select a cloud offering and call it a day. In an enterprise context, however, the challenge is more complex. The full value of sensor data is realized when it is integrated with ERP, MES, CRM, and SRM systems, providing business and operational context rather than isolated signals. In practice, this operational data is often already moved to cloud platforms like Databricks or Snowflake for analytics.

Therefore, this blueprint focuses on the data modeling and architecture of IoT timeseries in a Lakehouse using Delta Lake (with Apache Iceberg and DuckLake support coming soon). While this approach may sacrifice some timeseries query performance, it prioritizes integration from the outset, avoiding the added latency and complexity of extracting data from isolated timeseries systems.

Append-Only 🥉

Immutable History

Append-only Delta Table
Optimistic concurrency for writes
No deletes or updates
No requirement for chronological timestamps

Append-Only Table

Batch Stream

batch_id	timestamp	value	device_id	sensor_uuid	sensor_name	tag_unit	tag_hz

Append-Only Schema

Trino Delta

column	type	null
batch_id	BIGINT	NO
device_id	VARCHAR	NO
sensor_uuid	VARCHAR	NO
timestamp	VARCHAR	NO
value	VARCHAR	NO
sensor_name	VARCHAR	NO
tag_unit	VARCHAR	NO
tag_hz	VARCHAR	NO
tag_xyz	VARCHAR	YES

Table Schema

The table is denormalized, yet storage-efficient due to Parquet's columnar compression.
batch_id is required to efficiently handle late-arriving values.
sensor_uuid is required to maintain a continuous and consistent timeseries per sensor e.g. if the sensor_name is changing.
The schema should be quasi-static and versioned. Delta Lake can handle schema drift, but ideally only tag columns should change.

Optimizations

Partitioned by read access patterns (e.g., device_id). The partitioning can be extended by batch_id or timestamp to handle late-arriving data.
To optimize query performance and improve compression, small files produced by batch stream ingestion are periodically combined into larger files, and tombstoned files are vacuumed (deleted).
Vector queries and Z-ordering enable fast and efficient data filtering.

📁 Partitioning

⚡ Optimize & Vacuum

append_only_table/

├── _delta_log/

│ ├── 000000.json

│ ├── 000001.json

│ └── 000002.json

├── device_id=device_001/

│ ├── part-00001-a1b2.c000.snappy.parquet

│ ├── part-00002-c3d4.c000.snappy.parquet

│ └── part-00003-e5f6.c000.snappy.parquet

├── device_id=device_002/

│ ├── part-00001-g7h8.c000.snappy.parquet

│ ├── part-00002-i9j0.c000.snappy.parquet

│ └── part-00003-k1l2.c000.snappy.parquet

└── device_id=device_003/

├── part-00001-m3n4.c000.snappy.parquet

├── part-00002-o5p6.c000.snappy.parquet

└── part-00003-q7r8.c000.snappy.parquet

Before OPTIMIZE

device_id=device_001/

├── file1.parquet // 10MB

├── file2.parquet // 12MB

└── file3.parquet // 8MB

After OPTIMIZE

device_id=device_001/

├── file1.parquet // tombstoned

├── file2.parquet // tombstoned

├── file3.parquet // tombstoned

└── optimized_file.parquet // 30MB, active

After VACUUM

device_id=device_001/

└── optimized_file.parquet (30MB, active)

Pivot 🥈

Append-only tables are an optimal solution for storing and querying independent sensor timeseries. However, in the context of timeseries analytics, particularly multivariate timeseries prediction and analysis, it becomes necessary to analyze multiple timeseries simultaneously. In such cases, an append-only layout is no longer ideal.

Consequently, sensor data must be converted into a wide format, although this process is non-trivial. In practice, it is computationally expensive and tends to introduce significant complexity in both pre- and post-processing pipelines. Treating pivoting as an ad hoc, repeated task quickly becomes inefficient and error-prone. Instead, it should be implemented as a standardized, one-time transformation, robust by default and flexible enough to adapt to specific use cases.

Below you can see a perfect pivot table. However, to reach this ideal state, three main challenges must be addressed: row explosion, sparse values, and duplicate timestamps, as illustrated in the subsequent sections.

Append-Only Table LONG
batch_id	timestamp	value	sensor_uuid
B-1	09:00:01	45.1 °C	sensor_1
B-1	09:00:01	1021 hPa	sensor_2
B-1	09:00:01	81.3 %	sensor_3
B-2	09:00:04	48.9 °C	sensor_1
B-2	09:00:04	1045 hPa	sensor_2
B-2	09:00:04	92.0 %	sensor_3
B-3	09:00:07	41.2 °C	sensor_1
B-3	09:00:07	1005 hPa	sensor_2
B-3	09:00:07	77.8 %	sensor_3
B-4	09:00:10	46.5 °C	sensor_1
B-4	09:00:10	1033 hPa	sensor_2
B-4	09:00:10	85.5 %	sensor_3
B-5	09:00:13	43.7 °C	sensor_1
B-5	09:00:13	1011 hPa	sensor_2
B-5	09:00:13	79.9 %	sensor_3

dbt trino

Pivot

Pivot Table WIDE
timestamp	sensor_1	sensor_2	sensor_3
09:00:01	45.1 °C	1021 hPa	81.3 %
09:00:04	48.9 °C	1045 hPa	92.0 %
09:00:07	41.2 °C	1005 hPa	77.8 %
09:00:10	46.5 °C	1033 hPa	85.5 %
09:00:13	43.7 °C	1011 hPa	79.9 %

Row Explosion

The following table demonstrates how non-aligning timestamps during a pivot operation can lead to row explosion. In the worst-case scenario, this results in a separate row for each individual sensor measurement.

Depending on the business requirements, there are several approaches to tackle this challenge. In the most basic cases, it may be sufficient to truncate unnecessary precision or to apply timestamp rounding. However, especially in machine learning workflows, it is critical that the chosen approach does not introduce information leakage. Seemingly harmless preprocessing steps can inadvertently expose future information, leading to overly optimistic model and poor generalization performance in production.

Append-Only Table

6 Rows

timestamp	value	sensor_uuid

Pivot

Pivot Table

Still 6 Rows!

timestamp	sensor_1	sensor_2	sensor_3

Sparse Values

When aligning sensors with different sampling frequencies, sparsity is expected. Lower-frequency signals inherently result in gaps when combined with higher-frequency data on a common timeline. This is not an anomaly. It is a structural property of the data.

Handling these gaps requires explicit design decisions. Techniques range from straightforward methods like forward-filling to more advanced or domain-specific imputation strategies. Regardless of the method, gap-filling is a lossy and opinionated transformation that materially alters the dataset.

Data imputation is inherently context-dependent. In many cases, it is both valid and necessary to use different strategies for different use cases rather than forcing a single global approach. However, these variations should be formalized as gold layer datasets or, ideally, as data products with explicit semantics, ownership, and versioning. Moving this logic out of hidden downstream pipelines and into the lakehouse ensures that assumptions remain transparent and reproducibility is preserved.

Append-Only Table

timestamp	value	sensor_uuid	tag_hz

Pivot

Pivot Table

timestamp	sensor_1	sensor_2

Duplicate Timestamps

Duplicate timestamps commonly arise during batch or streaming ingestion, often due to fixed-size write operations or late-arriving data. While this behavior is expected in distributed systems, it should be handled as early as possible to avoid placing unnecessary complexity on downstream applications.

To resolve duplicate timestamps, upsert semantics are applied. Incoming batches incrementally update existing rows by filling in missing columns while preserving previously ingested values, resulting in a single, fully filled row per timestamp.

Append-Only Table

batch_id	sensor_uuid	timestamp	value

Pivot by Batch

Pivot Table

timestamp	sensor_1	sensor_2	sensor_3

Contextualization 🥇

This is where the architecture shines. Many organizations are already consolidating data from across the ISA-95 automation hierarchy, including ERP, MES, PLM, CRM, etc., into modern lakehouse platforms such as Databricks, Snowflake, and MotherDuck. Sensor data already present in an open table format simplifies the integration with those datasets. Thanks to the decoupling of storage and compute, different query engines and tools can share the same data, making it scalable and accelerating analytical turnaround.

End-to-end processes, such as order-to-cash with reclamations, should be analyzed in context, not isolation. For instance, manufacturers can trace customer complaints back to production conditions by linking temperature anomalies and e.g. bearing vibration directly to product quality issues. Integrating machine data with business outcomes enables organizations to transition from reactive problem-solving to proactive, data-driven decision-making, enhancing quality, reducing costs, and strengthening customer relationships.

The following denormalized table shows the contextualization of sensor data, ERP data, and MES data.

IIOT

MES

ERP

Join over timestamp + sensor_id

Context Table

Order to Cash with Reclamation

20 Records

IIoT

MES

ERP

timestamp

sensor_1

sensor_2

sensor_3

process_id

order_id

product_id

reclamation

Solution

iot-lakehouse provides an opinionated, production-ready selection of technologies and configurations to implement the lakehouse blueprint described above. The entire stack is open-source, CNCF-first, runs on Kubernetes, and is managed via Helm.

Designed with security and modularity in mind, it includes sensible defaults for easy bootstrapping. This allows organizations to rapidly prototype and distinguish between areas where building internal expertise creates a competitive edge, and where technology is merely a commodity. Self-hosting isn't mandatory, but owning the capability is your hedge against vendor lock-in. This approach guarantees that if the landscape shifts, you retain data sovereignty and a clear exit path.

Technology Highlights

Trino Query Engine

Trino serves as the core query engine and trusted compute layer, managing all required integrations across the data platform. It enforces robust data governance by integrating authentication and authorization from external providers. Fine-grained access control is supported even on row and column level.

RustFSObject Storage S3

While delivering excellent object storage performance, RustFS is primarily chosen for its S3 compatibility. This widely adopted S3 standard allows for retaining provider flexibility, while avoiding vendor lock-in. Alternatives like AWS S3, MinIO, or Dell ECS can be swapped without infrastructure refactoring.

Delta and IcebergOpen Table Format

These open table formats are designed to manage tables and their data stored as Parquet files in a data lake. Both provide key capabilities such as ACID transactions, time travel, and support for upserts and deletes. While both formats address the same core problem, they differ in their implementation. It is reasonable to expect that, over time, these formats will converge on a similar feature set. Unity Catalog and Polaris are already compatible with both formats, which is likely to accelerate this convergence. From a user perspective, data is always accessed as a table, making the underlying format mostly irrelevant. However, iot-lakehouse supports both formats ensuring flexibility and the freedom of choice.

Security and Enterprise Features

Cert- & Trust-ManagerTLS Encryption

Provides automated lifecycle management for TLS certificates and trust bundles across iot-lakehouse. By enforcing encrypted communication between all components and external clients, it eliminates manual certificate handling, reduces operational risk, and ensures compliance with enterprise security standards and zero-trust architectures.

AuthentikAuthentication

Acts as the centralized identity and access management (IAM) layer. Authentik integrates seamlessly with enterprise identity providers, enabling single sign-on (SSO), role-based access control, and automated user provisioning and de-provisioning. This ensures consistent identity governance across users, services, and APIs while reducing administrative overhead. Authentik is particularly valuable because of its outpost feature, which enables secure TLS communication, such as for LDAP, in environments where native LDAPS support is not available.

Open Policy Agent - OPAAuthorization

Serves as the unified policy engine for fine-grained authorization and compliance enforcement. OPA decouples policy logic from application code, enabling dynamic, auditable rules for data access, API usage, and operational controls. This allows organizations to implement row-, column-, and context-aware policies that adapt to regulatory and business requirements without redeployments.

TraefikSecure Ingress

Functions as the secure ingress and traffic management layer for all platform services. Traefik provides automatic TLS termination, secure routing, and integration with identity providers for authenticated access. Its dynamic configuration model ensures that new services are exposed securely by default, supporting both internal enterprise traffic and external client access.

Bring you own Governance

In the corporate environment, governance is the critical link between becoming data-driven and securing intellectual property. Governance initiatives rarely fail because an organization lacks permission concepts; they fail because implementing, testing, and verifying those approaches is notoriously difficult.

This is where the Open Policy Agent (OPA) shines: it allows to "Bring Your Own Governance" by implementing permission concepts as versioned, auditable code. For example, organizations can start with a centralized approach and, as data governance literacy matures, transition to a decentralized model (such as Data Mesh or Data Fabric) that grants greater agency to individual business units.

These permission concepts are operationalized using Role-Based Access Control (RBAC) and Attribute-Based Access Control (ABAC). While RBAC handles straightforward role definitions, ABAC enables granular control based on specific context, such as department affiliation or time-of-day access. The access control can be highly granular by supporting column masking and row filters that are strictly enforced by Trino.

Once this governance layer is established, unifying data becomes seamless. As demonstrated previously, you can join data from diverse sources (ERP, MES, IoT, etc.) via Trino simply by registering the tables. This highlights the architectural advantage of decoupling storage from compute. Furthermore, combining Trino with OPA ensures that governance policies from source systems (like Databricks or Snowflake) are translated and enforced uniformly across the platform.

The Future is Bright!

HTAP-systems

One of the most interesting advancements in Hybrid Transactional/Analytical Processing (HTAP), specifically within the lakehouse ecosystem, are solutions like Moonlink (developed by Mooncake Labs). These technologies solve the critical problem of executing real-time queries on open tables while simultaneously optimizing write throughput and query performance. Currently, you are forced to choose only one or the other.

However, open source development stalled when Databricks acquired Mooncake Labs in October 2025 to accelerate their "Lakebase" vision. While this centralization implies a shift away from open source, the technology will undoubtedly mature rapidly within the Databricks ecosystem. We will continue to monitor how such solutions unfold, or perhaps even consider building our own version of Moonlink directly for Delta Lake.

Catalogs

Summer 2024 brought a major transformation as industry leaders open-sourced their catalog technologies. Snowflake donated Apache Polaris to the Apache Software Foundation on July 30, while Databricks released Unity Catalog under Apache 2.0 license on June 12, decoupling governance from proprietary platforms.

Crucially, both now support Delta Lake and Apache Iceberg, ending format lock-in. Unity enables this via UniForm; Polaris natively expanded its support. The choice of format now belongs entirely to the query engine and use case. While Polaris already supports S3-compatible object storage, Unity is still to add this capability.

Both are powerful successors to the Hive Metastore (HMS), moving beyond the "one instance per catalog" limitation. Your decision will be based on ecosystem alignment and technical preferences. Those catalogs will definitely be one of the next features to be implemented.

DuckLake

DuckLake introduces a new take on the lakehouse by simplifying how metadata and catalogs are managed. Instead of spreading metadata across many manifest and JSON files in object storage, DuckLake stores all table metadata, schemas, snapshots, statistics, and the catalog itself directly inside a standard SQL database, making the catalog the single, transactional source of truth. This approach leverages mature database guarantees like ACID transactions, concurrency control, and familiar SQL tooling, while keeping actual table data in Parquet files in object storage.

A particularly interesting innovation is data inlining, which allows small inserts or updates to be stored directly in the metadata database rather than generating numerous tiny Parquet files. This reduces file fragmentation, lowers operational overhead, and significantly improves performance for workloads with frequent small writes. By combining metadata and catalog management into one SQL-backed system and selectively inlining data when it makes sense, DuckLake aims to deliver a simpler, more reliable, and more efficient lakehouse architecture.

Lakehouse is all you need!

Security

Cloud-Native

Governance

Open Standards

Blueprint for IIoT Timeseries Data in a Lakehouse

Append-Only 🥉

Immutable History

Append-Only Table

Append-Only Schema

Table Schema

Optimizations

Pivot 🥈

Append-Only Table

Pivot Table

Row Explosion

Append-Only Table

Pivot Table

Sparse Values

Append-Only Table

Pivot Table

Duplicate Timestamps

Append-Only Table

Pivot Table

Contextualization 🥇

Solution

Technology Highlights

Trino Query Engine

RustFSObject Storage S3

Delta and IcebergOpen Table Format

Security and Enterprise Features

Cert- & Trust-ManagerTLS Encryption

AuthentikAuthentication

Open Policy Agent - OPAAuthorization

TraefikSecure Ingress

Bring you own Governance

The Future is Bright!

HTAP-systems

Catalogs

DuckLake