Let’s talk about data baby, let’s talk about O and T - Data Lake

WRITTEN BY

James Brand

In last weeks post, I talked at a high level about how you can build an MVP for collecting data from your PLC's and display that data in a simple Power Bi dashboard. This provided real-time visibility into your operations that could easily be shared across the organisation but may add only limited value. Your data coming from the PLC is unlikely to give you much context and if you have lots of PLC's in lots of locations it is likely difficult to pin down where that device is and possibly what it's even being used for. If you didn't read last weeks post, you can here.

In this weeks newsletter I'm going to talk about how you can start to build your data platform and import data from other systems. In the coming week's we'll revisit our IoT data and look at how you can combine various datasets to start driving real organisational value. We'll also explore digital twins, look at how you can use machine learning for anomaly detection and discuss how you can implement security and access controls end to end.

Last week you landed your data into a storage account. Let's consider this to be your raw storage since at this point you haven't done any further processing on the data - it remains in the state that it came into the system. I've expanded last weeks diagram to include the following:

On-prem source systems such as self hosted ERP, WMS, MRP etc using popular databases such as SQL or Oracle Databases
On-prem self-hosted Integration Runtime
Azure Data Factory instance
Cloud based source systems. Again ERP, WMS, MRP etc
Additional storage accounts for enriched and curated data. These could be containers within a single storage account but for simplicity they are different accounts in the diagram.
Optional Azure Databricks instance

Azure Data Factory (ADF) is a cloud based tool that can be used to create code-free ETL (extract-transform-load), ELT (extract-load-transform) and data integration projects. It allows data from a huge variety of sources to be ingested, transformed and stored in another destination.

With ADF you can build pipelines - these essentially orchestrate the movement and processing of data by chaining different activities together. An activity is a processing step in a pipeline. You can also use dataflows for transforming the data - dataflows allow activities such as merge, pivot, sort, filter. etc etc . Dataflows are then orchestrated as an activity within a pipeline. The image below show's a simple dataflow.

ADF also uses the concept of connectors. Connectors allow ADF to talk to a data store. You can view a list of connectors available here. Note that even if your source system isn't listed it's likely that behind it is a database such as SQL Server so you would use the connector for that database type or the Generic ODBC connector. Failing that if your system has an API then you can use the Generic REST connector.

If connecting to an on-prem data source then it's recommended to use the self-hosted integration runtime on premise. The IR helps keep things secure and means you don't have to expose your on-prem data stores to the outside world. Instead the IR and ADF take care of routing the data in a secure way.

Initially you'll store the source data in your raw storage account. You'll then use this as your source for onwards processing. You could apply some data retention policies or clean up activities within a pipeline to ensure you don't retain data here long term. Next you can perform data enrichment activities. This could be done via a dataflow or via a Databricks notebook. During the enrichment process you can fix missing values, clean duplicates, perform lookups against other tables and maybe even create lookup tables from the data you've just ingested. You will store this "enriched" data in the enriched storage location for future use. You should also consider the format you are storing the data in. If the data was originally in JSON for example then it is much more performant to store the data in Parquet or Delta file format for onwards processing and storage.

You can now move onto the final step of the processing - data curation. Data curation is the process of optimising the data for analytics and consumption. You can use this process to output the curated data to other services such as Synapse.

While authoring your notebooks you can use ADF to debug them and ensure the output at each stage is what you require. Once you are happy you can configure a trigger so that the ADF pipeline runs periodically as defined by you. Maybe you need the latest data from your systems once a week, once a day or once per hour?

Finally, you can connect Power Bi to either your Synapse instance or your Curated storage account. You can now create interactive reports using the data from the systems you ingested earlier. You can use Power Bi and it's power to create further measures and dimensions as well as computed columns etc. You can create reports that make sense of data from various source systems giving your organisation the "bigger picture".

With the above pattern you can ingest, transform and output data for reporting from various source systems without having to write any code. The pattern also provides the flexibility to write powerful code when needed in the form of Databricks notebooks.

As you can imagine tools like Azure Data Factory are very powerful tools so we've only just scratched the surface of what it's capable of. It does have support for pipeline parameters, git integration (so we can develop in a test environment and promote to production), logging and monitoring features and a whole lot more.

What data sources does your organisation have that could deliver more value if used in the correct way?

Simple dictionary of terms

API (Application Programming Interface): A set of rules that define how applications or devices can connect to and communicate with each other.

Azure Databricks: Azure Databricks is a data analytics platform used for creating data intensive applications. It provides advanced processing for large data sets among many other features.

Container: Think of these as top level folders on a storage device.

ODBC (Open Database Connectivity): A standard application programming interface (API) for accessing database management systems

REST (REpresentational State Transfer): Is a standard that guides the design and development of processes which enable applications to interact with data stored on a web servers.

To read more about the different stages of data processing and storage please read the Microsoft Cloud Adoption Framework article

Let’s talk about data baby, let’s talk about O and T Let’s talk about data baby, let’s talk about O and T