OAuth 2.0, LDAP & SCIM
Fully TLS Encrypted
Secure Ingress
Kubernetes & Helm
Diagonally Scalable
Easy to Bootstrap
RBAC & ABAC
Row- & Column-Level
Fully Auditable
Delta Lake
Iceberg
Seamless Integration
IIoT (Industrial Internet of Things) data is a strategic asset. Only when captured and stored efficiently does it become a central element of a truly data-driven organization.
Maintaining a robust and immutable operational history provides organizations with a reliable and verifiable record of production. This foundation facilitates root-cause analysis, continuous optimization, and smarter AI-powered decision-making.
…but where should this data be stored?
Should it go to the cloud? Into a dedicated timeseries database like InfluxDB or TimescaleDB? Or perhaps into plain PostgreSQL?
The most straightforward approach would be to select a cloud offering and call it a day. In an enterprise context, however, the challenge is more complex. The full value of sensor data is realized when it is integrated with ERP, MES, CRM, and SRM systems, providing business and operational context rather than isolated signals. In practice, this operational data is often already moved to cloud platforms like Databricks or Snowflake for analytics.
Therefore, this blueprint focuses on the data modeling and architecture of IoT timeseries in a Lakehouse using Delta Lake (with Apache Iceberg and DuckLake support coming soon). While this approach may sacrifice some timeseries query performance, it prioritizes integration from the outset, avoiding the added latency and complexity of extracting data from isolated timeseries systems.
| batch_id | timestamp | value | device_id | sensor_uuid | sensor_name | tag_unit | tag_hz |
|---|---|---|---|---|---|---|---|
| column | type | null |
|---|---|---|
| batch_id | BIGINT | NO |
| device_id | VARCHAR | NO |
| sensor_uuid | VARCHAR | NO |
| timestamp | VARCHAR | NO |
| value | VARCHAR | NO |
| sensor_name | VARCHAR | NO |
| tag_unit | VARCHAR | NO |
| tag_hz | VARCHAR | NO |
| tag_xyz | VARCHAR | YES |
batch_id is required to efficiently handle late-arriving values.
sensor_uuid is required to maintain a continuous and consistent timeseries per sensor, e.g. when the sensor_name changes.
Late-arriving data is handled via batch_id or timestamp, as shown in the sketch below.
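A minimal ingestion sketch for this schema, assuming the deltalake (delta-rs) and pyarrow Python packages; the table path and sample values are placeholders, not part of the blueprint.

```python
import pyarrow as pa
from deltalake import write_deltalake

# Append-only bronze table for raw sensor readings (schema as described above).
# Local path is a placeholder; in practice this points to S3-compatible storage.
TABLE_PATH = "./bronze/sensor_readings"

batch = pa.table({
    "batch_id":    pa.array([1, 1], type=pa.int64()),
    "device_id":   pa.array(["press_07", "press_07"]),
    "sensor_uuid": pa.array(["sensor_1", "sensor_2"]),
    "timestamp":   pa.array(["09:00:01", "09:00:01"]),
    "value":       pa.array(["45.1", "1021"]),
    "sensor_name": pa.array(["temp_head", "pressure_head"]),
    "tag_unit":    pa.array(["°C", "hPa"]),
    "tag_hz":      pa.array(["0.33", "0.33"]),
})

# Append-only writes: late-arriving batches are simply appended and later
# reconciled via batch_id / timestamp, as described above.
write_deltalake(TABLE_PATH, batch, mode="append")
```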
Append-only tables are an optimal solution for storing and querying independent sensor timeseries. However, in the context of timeseries analytics, particularly multivariate timeseries prediction and analysis, it becomes necessary to analyze multiple timeseries simultaneously. In such cases, an append-only layout is no longer ideal.
Consequently, sensor data must be converted into a wide format, although this process is non-trivial. In practice, it is computationally expensive and tends to introduce significant complexity in both pre- and post-processing pipelines. Treating pivoting as an ad hoc, repeated task quickly becomes inefficient and error-prone. Instead, it should be implemented as a standardized, one-time transformation, robust by default and flexible enough to adapt to specific use cases.
Below you can see a perfect pivot table. However, to reach this ideal state, three main challenges must be addressed: row explosion, sparse values, and duplicate timestamps, as illustrated in the subsequent sections.
Append-Only Table (LONG)

| batch_id | timestamp | value | sensor_uuid |
|---|---|---|---|
| B-1 | 09:00:01 | 45.1 °C | sensor_1 |
| B-1 | 09:00:01 | 1021 hPa | sensor_2 |
| B-1 | 09:00:01 | 81.3 % | sensor_3 |
| B-2 | 09:00:04 | 48.9 °C | sensor_1 |
| B-2 | 09:00:04 | 1045 hPa | sensor_2 |
| B-2 | 09:00:04 | 92.0 % | sensor_3 |
| B-3 | 09:00:07 | 41.2 °C | sensor_1 |
| B-3 | 09:00:07 | 1005 hPa | sensor_2 |
| B-3 | 09:00:07 | 77.8 % | sensor_3 |
| B-4 | 09:00:10 | 46.5 °C | sensor_1 |
| B-4 | 09:00:10 | 1033 hPa | sensor_2 |
| B-4 | 09:00:10 | 85.5 % | sensor_3 |
| B-5 | 09:00:13 | 43.7 °C | sensor_1 |
| B-5 | 09:00:13 | 1011 hPa | sensor_2 |
| B-5 | 09:00:13 | 79.9 % | sensor_3 |
Pivot Table (WIDE)

| timestamp | sensor_1 | sensor_2 | sensor_3 |
|---|---|---|---|
| 09:00:01 | 45.1 °C | 1021 hPa | 81.3 % |
| 09:00:04 | 48.9 °C | 1045 hPa | 92.0 % |
| 09:00:07 | 41.2 °C | 1005 hPa | 77.8 % |
| 09:00:10 | 46.5 °C | 1033 hPa | 85.5 % |
| 09:00:13 | 43.7 °C | 1011 hPa | 79.9 % |
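As a sketch of such a standardized long-to-wide transformation, the snippet below uses DuckDB from Python with conditional aggregation; the sensor list is hard-coded here purely for illustration, whereas a production job would derive it from metadata.

```python
import duckdb

con = duckdb.connect()

# Toy append-only (long) table, mirroring the example above.
con.execute("""
    CREATE TABLE readings AS
    SELECT * FROM (VALUES
        ('B-1', '09:00:01', '45.1 °C',  'sensor_1'),
        ('B-1', '09:00:01', '1021 hPa', 'sensor_2'),
        ('B-1', '09:00:01', '81.3 %',   'sensor_3'),
        ('B-2', '09:00:04', '48.9 °C',  'sensor_1'),
        ('B-2', '09:00:04', '1045 hPa', 'sensor_2'),
        ('B-2', '09:00:04', '92.0 %',   'sensor_3')
    ) AS t(batch_id, "timestamp", "value", sensor_uuid)
""")

# Long -> wide: one column per sensor, one row per timestamp.
rows = con.execute("""
    SELECT "timestamp",
           max(CASE WHEN sensor_uuid = 'sensor_1' THEN "value" END) AS sensor_1,
           max(CASE WHEN sensor_uuid = 'sensor_2' THEN "value" END) AS sensor_2,
           max(CASE WHEN sensor_uuid = 'sensor_3' THEN "value" END) AS sensor_3
    FROM readings
    GROUP BY "timestamp"
    ORDER BY "timestamp"
""").fetchall()

for row in rows:
    print(row)
```

The conditional-aggregation pattern is plain ANSI SQL, so the same query shape can also run via Trino directly against the Delta tables; the part worth standardizing once is deriving the sensor columns dynamically instead of hard-coding them.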
The following table demonstrates how non-aligning timestamps during a pivot operation can lead to row explosion. In the worst-case scenario, this results in a separate row for each individual sensor measurement.
Depending on the business requirements, there are several approaches to tackle this challenge. In the most basic cases, it may be sufficient to truncate unnecessary precision or to apply timestamp rounding. However, especially in machine learning workflows, it is critical that the chosen approach does not introduce information leakage. Seemingly harmless preprocessing steps can inadvertently expose future information, leading to overly optimistic models during training and poor generalization in production.
| timestamp | value | sensor_uuid |
|---|---|---|

| timestamp | sensor_1 | sensor_2 | sensor_3 |
|---|---|---|---|
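One possible mitigation, sketched below with DuckDB, is to truncate timestamps to a fixed bucket before pivoting; the one-second bucket and all values are illustrative assumptions and must be validated against the leakage requirements of the concrete use case.

```python
import duckdb

con = duckdb.connect()
con.execute('CREATE TABLE readings (ts TIMESTAMP, "value" DOUBLE, sensor_uuid VARCHAR)')
con.execute("""
    INSERT INTO readings VALUES
        ('2024-01-01 09:00:01.120', 45.1, 'sensor_1'),
        ('2024-01-01 09:00:01.480', 1021, 'sensor_2'),  -- near-miss timestamp
        ('2024-01-01 09:00:04.050', 48.9, 'sensor_1'),
        ('2024-01-01 09:00:04.310', 1045, 'sensor_2')
""")

# Truncating to whole seconds aligns near-miss timestamps onto a shared grid,
# so the subsequent pivot produces one row per bucket instead of one per reading.
rows = con.execute("""
    SELECT date_trunc('second', ts) AS ts_bucket,
           max(CASE WHEN sensor_uuid = 'sensor_1' THEN "value" END) AS sensor_1,
           max(CASE WHEN sensor_uuid = 'sensor_2' THEN "value" END) AS sensor_2
    FROM readings
    GROUP BY ts_bucket
    ORDER BY ts_bucket
""").fetchall()

for row in rows:
    print(row)
```

Note that truncation (flooring) never moves a reading earlier in time, whereas rounding to the nearest bucket can effectively shift future information backwards, which is exactly the kind of subtle leakage to guard against.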
When aligning sensors with different sampling frequencies, sparsity is expected. Lower-frequency signals inherently result in gaps when combined with higher-frequency data on a common timeline. This is not an anomaly. It is a structural property of the data.
Handling these gaps requires explicit design decisions. Techniques range from straightforward methods like forward-filling to more advanced or domain-specific imputation strategies. Regardless of the method, gap-filling is a lossy and opinionated transformation that materially alters the dataset.
Data imputation is inherently context-dependent. In many cases, it is both valid and necessary to use different strategies for different use cases rather than forcing a single global approach. However, these variations should be formalized as gold layer datasets or, ideally, as data products with explicit semantics, ownership, and versioning. Moving this logic out of hidden downstream pipelines and into the lakehouse ensures that assumptions remain transparent and reproducibility is preserved.
| timestamp | value | sensor_uuid | tag_hz |
|---|---|---|---|

| timestamp | sensor_1 | sensor_2 |
|---|---|---|
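A minimal forward-fill sketch using pandas; treating it as a gold-layer transformation means a query like this would be materialized as its own versioned dataset rather than re-run ad hoc downstream. Column names and values are illustrative.

```python
import pandas as pd

# Wide table after pivoting sensors with different sampling rates;
# sensor_2 (lower frequency) is sparse on the shared timeline.
wide = pd.DataFrame({
    "ts":       pd.to_datetime(["09:00:01", "09:00:02", "09:00:03", "09:00:04"]),
    "sensor_1": [45.1, 46.0, 44.8, 45.5],
    "sensor_2": [1021.0, None, None, 1045.0],
})

# Forward-fill: carry the last observed value forward. This only propagates
# past values into the future, but it is still a lossy, opinionated transform
# and should be versioned as its own gold-layer dataset.
gold = wide.sort_values("ts").ffill()
print(gold)
```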
Duplicate timestamps commonly arise during batch or streaming ingestion, often due to fixed-size write operations or late-arriving data. While this behavior is expected in distributed systems, it should be handled as early as possible to avoid placing unnecessary complexity on downstream applications.
To resolve duplicate timestamps, upsert semantics are applied. Incoming batches incrementally update existing rows by filling in missing columns while preserving previously ingested values, resulting in a single, fully filled row per timestamp.
| batch_id | sensor_uuid | timestamp | value |
|---|---|---|---|

| timestamp | sensor_1 | sensor_2 | sensor_3 |
|---|---|---|---|
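One way to express these upsert semantics is sketched below, assuming a recent deltalake (delta-rs) release that exposes the Python merge API; the path, column names, and sample values are placeholders. COALESCE keeps previously ingested values and only fills columns that are still missing.

```python
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

WIDE_PATH = "./silver/sensor_wide"  # placeholder path

# Initial state: 09:00:01 already has sensor_1, but sensor_2 arrived late.
write_deltalake(WIDE_PATH, pa.table({
    "ts":       pa.array(["09:00:01"]),
    "sensor_1": pa.array([45.1], type=pa.float64()),
    "sensor_2": pa.array([None], type=pa.float64()),
}), mode="overwrite")

late_batch = pa.table({
    "ts":       pa.array(["09:00:01"]),
    "sensor_1": pa.array([None], type=pa.float64()),
    "sensor_2": pa.array([1021.0], type=pa.float64()),
})

# Upsert: match on timestamp, keep existing values, fill only missing columns.
(
    DeltaTable(WIDE_PATH)
    .merge(late_batch, predicate="t.ts = s.ts", source_alias="s", target_alias="t")
    .when_matched_update(updates={
        "sensor_1": "COALESCE(t.sensor_1, s.sensor_1)",
        "sensor_2": "COALESCE(t.sensor_2, s.sensor_2)",
    })
    .when_not_matched_insert_all()
    .execute()
)
```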
This is where the architecture shines. Many organizations are already consolidating data from across the ISA-95 automation hierarchy, including ERP, MES, PLM, CRM, etc., into modern lakehouse platforms such as Databricks, Snowflake, and MotherDuck. Sensor data already present in an open table format simplifies the integration with those datasets. Thanks to the decoupling of storage and compute, different query engines and tools can share the same data, making it scalable and accelerating analytical turnaround.
End-to-end processes, such as order-to-cash with reclamations, should be analyzed in context, not in isolation. For instance, manufacturers can trace customer complaints back to production conditions by linking temperature anomalies or bearing vibration directly to product quality issues. Integrating machine data with business outcomes enables organizations to transition from reactive problem-solving to proactive, data-driven decision-making, enhancing quality, reducing costs, and strengthening customer relationships.
The following denormalized table shows the contextualization of sensor data, ERP data, and MES data, keyed on timestamp + sensor_id.
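To illustrate this contextualization, the sketch below issues a federated join through the Trino Python client; host, catalogs, schemas, and all table and column names are illustrative assumptions, not part of the blueprint, and authentication is omitted for brevity.

```python
import trino

# Connection details are placeholders; in iot-lakehouse, Trino sits behind
# Traefik with TLS and OAuth 2.0 via Authentik (auth omitted in this sketch).
conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=443,
    user="analyst",
    http_scheme="https",
    catalog="lakehouse",
    schema="gold",
)

cur = conn.cursor()
# Federated join: sensor readings (Delta Lake) contextualized with MES work
# orders and ERP customer orders. Catalog, table, and column names are illustrative.
cur.execute("""
    SELECT s.ts,
           s.sensor_uuid,
           s.value,
           m.work_order_id,
           e.customer_order_id
    FROM lakehouse.gold.sensor_readings AS s
    JOIN mes.prod.work_orders           AS m ON s.ts BETWEEN m.start_ts AND m.end_ts
    JOIN erp.sales.orders               AS e ON m.customer_order_id = e.customer_order_id
    LIMIT 100
""")

for row in cur.fetchall():
    print(row)
```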
iot-lakehouse provides an opinionated, production-ready selection of technologies and configurations to implement the lakehouse blueprint described above. The entire stack is open-source, CNCF-first, runs on Kubernetes, and is managed via Helm.
Designed with security and modularity in mind, it includes sensible defaults for easy bootstrapping. This allows organizations to rapidly prototype and distinguish between areas where building internal expertise creates a competitive edge, and where technology is merely a commodity. Self-hosting isn't mandatory, but owning the capability is your hedge against vendor lock-in. This approach guarantees that if the landscape shifts, you retain data sovereignty and a clear exit path.
Trino serves as the core query engine and trusted compute layer, managing all required integrations across the data platform. It enforces robust data governance by integrating authentication and authorization from external providers. Fine-grained access control is supported, down to the row and column level.
While delivering excellent object storage performance, RustFS is primarily chosen for its S3 compatibility. The widely adopted S3 API preserves provider flexibility and avoids vendor lock-in: alternatives like AWS S3, MinIO, or Dell ECS can be swapped in without infrastructure refactoring.
These open table formats are designed to manage tables and their data stored as Parquet files in a data lake. Both provide key capabilities such as ACID transactions, time travel, and support for upserts and deletes. While both formats address the same core problem, they differ in their implementation. It is reasonable to expect that, over time, these formats will converge on a similar feature set. Unity Catalog and Polaris are already compatible with both formats, which is likely to accelerate this convergence. From a user perspective, data is always accessed as a table, making the underlying format mostly irrelevant. iot-lakehouse supports both formats, ensuring flexibility and freedom of choice.
Provides automated lifecycle management for TLS certificates and trust bundles across iot-lakehouse. By enforcing encrypted communication between all components and external clients, it eliminates manual certificate handling, reduces operational risk, and ensures compliance with enterprise security standards and zero-trust architectures.
Acts as the centralized identity and access management (IAM) layer. Authentik integrates seamlessly with enterprise identity providers, enabling single sign-on (SSO), role-based access control, and automated user provisioning and de-provisioning. This ensures consistent identity governance across users, services, and APIs while reducing administrative overhead. Authentik is particularly valuable because of its outpost feature, which enables secure TLS communication, such as for LDAP, in environments where native LDAPS support is not available.
Serves as the unified policy engine for fine-grained authorization and compliance enforcement. OPA decouples policy logic from application code, enabling dynamic, auditable rules for data access, API usage, and operational controls. This allows organizations to implement row-, column-, and context-aware policies that adapt to regulatory and business requirements without redeployments.
Functions as the secure ingress and traffic management layer for all platform services. Traefik provides automatic TLS termination, secure routing, and integration with identity providers for authenticated access. Its dynamic configuration model ensures that new services are exposed securely by default, supporting both internal enterprise traffic and external client access.
In the corporate environment, governance is the critical link between becoming data-driven and securing intellectual property. Governance initiatives rarely fail because an organization lacks permission concepts; they fail because implementing, testing, and verifying those approaches is notoriously difficult.
This is where the Open Policy Agent (OPA) shines: it lets you "Bring Your Own Governance" by implementing permission concepts as versioned, auditable code. For example, organizations can start with a centralized approach and, as data governance literacy matures, transition to a decentralized model (such as Data Mesh or Data Fabric) that grants greater agency to individual business units.
These permission concepts are operationalized using Role-Based Access Control (RBAC) and Attribute-Based Access Control (ABAC). While RBAC handles straightforward role definitions, ABAC enables granular control based on specific context, such as department affiliation or time-of-day access. Access control can be highly granular, with column masking and row filters strictly enforced by Trino.
Once this governance layer is established, unifying data becomes seamless. As demonstrated previously, you can join data from diverse sources (ERP, MES, IoT, etc.) via Trino simply by registering the tables. This highlights the architectural advantage of decoupling storage from compute. Furthermore, combining Trino with OPA ensures that governance policies from source systems (like Databricks or Snowflake) are translated and enforced uniformly across the platform.
One of the most interesting advancements in Hybrid Transactional/Analytical Processing (HTAP), specifically within the lakehouse ecosystem, is the emergence of solutions like Moonlink (developed by Mooncake Labs). These technologies solve the critical problem of executing real-time queries on open tables while simultaneously optimizing write throughput and query performance. Today, you are typically forced to choose one or the other.
However, open source development stalled when Databricks acquired Mooncake Labs in October 2025 to accelerate their "Lakebase" vision. While this centralization implies a shift away from open source, the technology will undoubtedly mature rapidly within the Databricks ecosystem. We will continue to monitor how such solutions unfold, or perhaps even consider building our own version of Moonlink directly for Delta Lake.
Summer 2024 brought a major transformation as industry leaders open-sourced their catalog technologies. Snowflake donated Apache Polaris to the Apache Software Foundation on July 30, while Databricks released Unity Catalog under Apache 2.0 license on June 12, decoupling governance from proprietary platforms.
Crucially, both now support Delta Lake and Apache Iceberg, ending format lock-in. Unity enables this via UniForm; Polaris natively expanded its support. The choice of format now belongs entirely to the query engine and use case. While Polaris already supports S3-compatible object storage, Unity Catalog has yet to add this capability.
Both are powerful successors to the Hive Metastore (HMS), moving beyond the "one instance per catalog" limitation. The decision comes down to ecosystem alignment and technical preferences. Support for these catalogs will definitely be one of the next features to be implemented.
DuckLake introduces a new take on the lakehouse by simplifying how metadata and catalogs are managed. Instead of spreading metadata across many manifest and JSON files in object storage, DuckLake stores all table metadata, schemas, snapshots, statistics, and the catalog itself directly inside a standard SQL database, making the catalog the single, transactional source of truth. This approach leverages mature database guarantees like ACID transactions, concurrency control, and familiar SQL tooling, while keeping actual table data in Parquet files in object storage.
A particularly interesting innovation is data inlining, which allows small inserts or updates to be stored directly in the metadata database rather than generating numerous tiny Parquet files. This reduces file fragmentation, lowers operational overhead, and significantly improves performance for workloads with frequent small writes. By combining metadata and catalog management into one SQL-backed system and selectively inlining data when it makes sense, DuckLake aims to deliver a simpler, more reliable, and more efficient lakehouse architecture.
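As a rough sketch of what this looks like from DuckDB, assuming the ducklake extension and its ATTACH syntax as documented by the DuckDB project; the metadata file name, data path, and table contents are placeholders.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake")
con.execute("LOAD ducklake")

# The catalog (schemas, snapshots, statistics) lives in a SQL database file,
# while table data is written as Parquet files under DATA_PATH.
con.execute("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_data/')")

con.execute('CREATE TABLE lake.readings (ts TIMESTAMP, sensor_uuid VARCHAR, "value" DOUBLE)')

# A small insert like this is a candidate for data inlining: it can be kept
# in the metadata database instead of producing a tiny Parquet file.
con.execute("INSERT INTO lake.readings VALUES ('2024-01-01 09:00:01', 'sensor_1', 45.1)")

print(con.execute("SELECT * FROM lake.readings").fetchall())
```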