How to Build Data Pipelines on Google Cloud Platform?
In today’s digital world, gigabytes of data are churned out every day. This data includes information essential for businesses to thrive, for governments to function, and for us to receive the right products and services we order from an online marketplace.
As an entrepreneur or business owner in this century, you might have already considered hiring a data analyst to analyze and process the data you collect and transform your business.
To process this data, data analysts use data pipelines. But what do we mean by data pipelines, what are their features, and how can we use a cloud platform like Google Cloud to build them?
This article will help you understand everything about data pipelines, so without further ado, let’s get started!
What is a Data Pipeline?
“Pipeline”, in general, refers to a system of big pipes moving resources like natural gas or oil from one place to another. Undoubtedly, these pipelines are a fast means of carrying large amounts of material over long distances.
Read: What is Data Pipeline Architecture
Similarly, data processing pipelines act as a backbone working on the same principle for data ingestion. A data pipeline is a set of data processing steps in which data is ingested at the initial stage of the pipeline if it has not already been stored in the data platform. The pipeline defines what data will be collected, where it comes from, and how it will be processed.
Simply put, a pipeline is a series of steps where each step produces an output that acts as the input for the next one, and this continues until the pipeline is complete.
Moreover, a data pipeline includes three elements: a source, processing steps, and a destination (sink). With data pipelines, it becomes easier to transfer data from an app to a data warehouse, or from a data lake to an analytics database. A data pipeline can also have the same source and destination, in which case it exists purely to modify an existing data set. A data pipeline might also include filtering and resilience features for better performance.
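To make this concrete, here is a tiny, purely illustrative Python sketch of a pipeline with a source, two processing steps, and a sink; the in-memory rows and the function names are assumptions made for the example, not part of any GCP service.

    # Toy pipeline: the output of each step is the input of the next.
    raw_rows = ["101,John,Present", "102,Jane,absent", "103,,Present"]  # source (in-memory stand-in)

    def parse(rows):
        # Split each CSV-style line into fields.
        return [row.split(",") for row in rows]

    def clean(records):
        # Drop incomplete records and normalize the status field.
        return [[rid, name, status.capitalize()]
                for rid, name, status in records
                if rid and name and status]

    def load(records):
        # Stand-in for the destination (sink); here we simply print.
        for record in records:
            print(record)

    load(clean(parse(raw_rows)))  # source -> processing -> sink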
Types of Data Pipeline
Data pipelines are broadly divided into batch processing pipelines and streaming data pipelines.
Batch Processing Data Pipelines
In batch processing data pipelines, “batches of data” are loaded into the repository at set time intervals, often scheduled during low-peak business hours. The batch is then queried by a software program or a user once it is ready for processing, allowing them to explore and visualize the data.
Batch processing tasks create a workflow of sequenced commands, i.e., the output of one command becomes the input of the next. One command may perform column filtering, for instance, and the next may handle data aggregation.
Batch processing is the optimal choice when there is no immediate requirement to analyze a dataset.
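As a rough illustration of that filter-then-aggregate chaining, the Python sketch below could run as a scheduled batch job; the field names and the in-memory batch are assumptions for the example.

    from collections import Counter

    # A "batch" of records loaded at a scheduled interval (stand-in data).
    batch = [
        {"store": "north", "amount": 120, "status": "paid"},
        {"store": "south", "amount": 80, "status": "refunded"},
        {"store": "north", "amount": 200, "status": "paid"},
    ]

    # Step 1: filtering - keep only paid orders.
    paid = [row for row in batch if row["status"] == "paid"]

    # Step 2: aggregation - count paid orders per store.
    orders_per_store = Counter(row["store"] for row in paid)
    print(orders_per_store)  # Counter({'north': 2})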
Streaming Data Pipelines
Streaming data pipelines are used when there is a near real-time data processing requirement. Unlike batch processing, streaming is about deriving insights from data within milliseconds by ingesting data sets as they are created and continuously updating reports, metrics, or summaries in response to every event.
Read: Top 5 Data Streaming Tools
It enables organizations to gain real-time analytics and up-to-date information about their operations so they can act without delay. Streaming data pipelines are better suited to social media or point-of-sale apps, where data and information must be updated instantly.
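To picture the difference, the plain-Python sketch below simulates a stream where every event updates the running metrics as soon as it arrives, instead of waiting for a scheduled batch; the event fields and the in-memory "stream" are assumptions for the example.

    import time
    from collections import Counter

    def event_stream():
        # Stand-in for a real-time source such as a message queue.
        for event in [{"type": "page_view", "page": "/home"},
                      {"type": "purchase", "page": "/checkout"},
                      {"type": "page_view", "page": "/home"}]:
            yield event
            time.sleep(0.1)  # events trickle in over time

    running_totals = Counter()
    for event in event_stream():
        # Metrics are updated per event, not once per scheduled batch.
        running_totals[event["type"]] += 1
        print(dict(running_totals))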
Data Pipeline Elements
Understanding the elements of a data pipeline will help you understand how it works. So, let’s take a brief look at these data pipeline components.
Read: What is DataOps
Source:
The source is the entry point of the data pipeline. The source can be a storage system of a company like a data lake, data warehouse, etc., or other data sources such as IoT devices, APIs, transaction processing systems, and social media.
Destination:
The destination is the final point of the data pipeline where all the data collected from the source gets stored. More often than not, a data warehouse or data lake acts as the destination.
Dataflow:
Dataflow refers to the entire movement and changes data undergoes while transferring from its source to destination.
Processing:
Processing refers to the steps and activities involved in ingesting or extracting data from sources, transforming it, and moving it to the destination. It determines how the movement of data through the pipeline should be implemented.
Workflow:
In the data pipeline, workflow focuses on defining the process sequence and its dependencies.
Monitoring:
Working with a data pipeline requires continuous monitoring to ensure data integrity and guard against potential data loss. Beyond that, monitoring the data pipeline helps check whether the pipeline’s efficiency is affected by an increasing data load.
Now that we have a better understanding of data pipelines, it would be beneficial to understand what Google Cloud Platform (GCP) is before we move on to building data pipelines on GCP.
Google Cloud Platform - An Overview
Google Cloud Platform is a cloud computing services suite, running on the same infrastructure used by Google internally for its products like Google Drive, Gmail, or Google Search. GCP provides modular cloud services such as data storage, computing, machine learning, and data analytics along with its management tools.
Read: 5 Ways Cloud Computing Can Benefit Web App Development
Platform-as-a-Service (PaaS), Infrastructure-as-a-Service (IaaS), and serverless environments for computing are other examples of services that Google Cloud Platform offers.
Under the Google Cloud brand, Google has over 100 products. Some of the key services that we need to know are listed below.
- App Engine
- Cloud Functions
- Compute Engine
- Cloud Run
- Cloud Storage
- Cloud SQL
- Cloud Bigtable
- Cloud Spanner
- Cloud Datastore
- Persistent Disk
- Cloud Memorystore
- Local SSD
- Filestore
- AlloyDB
- Cloud CDN
- Cloud DNS
- Cloud Interconnect
- Cloud Armor
- Cloud Load Balancing
- Virtual Private Cloud
- Network Service Tiers
- Dataproc
- BigQuery
- Cloud Dataflow
- Cloud Composer
- Cloud Dataprep
- Cloud Datalab
- Cloud Data Studio
- Cloud Shell
- Cloud APIs
- Cloud AutoML
- Cloud TPU
- Cloud Console
- Cloud Identity
- Edge TPU
Methods to Build Data Pipelines on the Google Cloud Platform
Before creating data pipelines, make sure to add the necessary IAM roles, such as datapipelines.admin, datapipelines.invoker, and datapipelines.viewer, so that the corresponding operations are allowed.
To create a data pipeline using Google Cloud Platform, access the data pipelines feature from the console. A setup page will then open where you can enable the listed APIs before creating data pipelines. From there, you can either import a job or create a data pipeline.
Read: Principles of Web API Design
How to Build Data Pipelines on Google Cloud Platform?
To create a data pipeline on the Google Cloud Platform, follow these steps:
- In the Google Cloud Console, go to the Dataflow Pipelines page and select ‘Create Data Pipeline’.
- Give the data pipeline a name, and fill in the other parameters and template selections in the pipeline template.
- For a batch job, you can provide a recurrence schedule for the pipeline.
Now, to create a batch data pipeline, give your project access to a Cloud Storage bucket and a BigQuery dataset for storing the input and output data, creating the required tables along the way.
Let’s take an example pipeline that reads CSV files from Cloud Storage (source), runs a transformation, and then stores the values in a three-column BigQuery table (destination).
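The steps below configure this pipeline through the console template, but as a rough sketch of what such a pipeline does, here is an approximate equivalent written with the Apache Beam Python SDK; the bucket, project, dataset, and column names are placeholders, and the transformation is just a simple CSV-to-row mapping assumed for the example.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    INPUT_PATTERN = "gs://BUCKET_ID/inputs/record01.csv"  # placeholder bucket
    OUTPUT_TABLE = "PROJECT_ID:DATASET.attendance"        # placeholder table

    def to_row(line):
        # Map one CSV line onto the three illustrative columns.
        col1, col2, col3 = line.split(",")
        return {"col1": col1, "col2": col2, "col3": col3}

    options = PipelineOptions()  # add --runner=DataflowRunner, --project, etc. to run on Dataflow
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read CSV" >> beam.io.ReadFromText(INPUT_PATTERN)
            | "Transform" >> beam.Map(to_row)
            | "Write to BigQuery" >> beam.io.WriteToBigQuery(
                OUTPUT_TABLE,
                schema="col1:STRING,col2:STRING,col3:STRING")
        )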
Now, create the files mentioned below on your local drive:
A big-query-column-table.json file that contains the destination table schema. For the Text Files on Cloud Storage to BigQuery template, the schema is wrapped in a "BigQuery Schema" key; a minimal three-column example (the column names are illustrative) could look like this:

    {
      "BigQuery Schema": [
        {"name": "col1", "type": "STRING"},
        {"name": "col2", "type": "STRING"},
        {"name": "col3", "type": "STRING"}
      ]
    }
A transformation.js JavaScript file that implements a simple data transformation. The template calls a user-defined function on each input line and expects a JSON string back; a minimal sketch (the function and column names are illustrative) could be:

    function transform(line) {
      var values = line.split(',');
      var obj = new Object();
      obj.col1 = values[0];
      obj.col2 = values[1];
      obj.col3 = values[2];
      return JSON.stringify(obj);
    }
A record01.csv file with the records to be inserted into the BigQuery table, one comma-separated row per line matching the three columns, for example (the values are illustrative):

    101,John,Present
    102,Jane,Absent
    103,Arjun,Present
Use gsutil to copy the JSON and JS files to the attendance-record path of your project’s Cloud Storage bucket, and the CSV file to the inputs path, with commands along these lines (BUCKET_ID is a placeholder for your bucket):

    gsutil cp big-query-column-table.json gs://BUCKET_ID/attendance-record/
    gsutil cp transformation.js gs://BUCKET_ID/attendance-record/
    gsutil cp record01.csv gs://BUCKET_ID/inputs/
After creating the record folder in Cloud Storage, create an attendance-record pipeline: enter the pipeline name, source, and destination, select “Text Files on Cloud Storage to BigQuery” under Process Data in Bulk (batch), and schedule the pipeline based on your needs.
Besides the batch data pipeline, you can also create a streaming data pipeline by following the batch pipeline instructions, keeping in mind the differences given below:
- Streaming data pipelines do not have a schedule specified under Pipeline schedule, as Dataflow streaming begins immediately.
- When choosing the Dataflow template, go to Process Data Continuously (stream) and then select Text Files on Cloud Storage to BigQuery.
- For the worker machine type, the pipeline processes the records you upload to the inputs/ folder that match the pattern gs://BUCKET_ID/inputs/record01.csv. To avoid out-of-memory errors when the CSV files exceed several gigabytes, select a machine type with more memory than the default n1-standard-4 machine type.
Conclusion
So that was all about data pipelines and the Google Cloud Platform, and this is how you can easily create a simple yet capable data pipeline using GCP. Remember, no exception handling is included in the steps above, so when building pipelines for an organization you will have to add it yourself.
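As a starting point, one common approach is to catch records that fail to parse and divert them to a separate "dead letter" collection instead of letting the whole run fail; the minimal Python sketch below shows the idea, with illustrative data and field layout assumed for the example.

    good_rows, dead_letters = [], []

    for line in ["101,John,Present", "bad-row-without-commas", "103,Arjun,Present"]:
        try:
            rid, name, status = line.split(",")
            good_rows.append({"id": rid, "name": name, "status": status})
        except ValueError as err:
            # Keep the failing record and the reason so it can be inspected and replayed later.
            dead_letters.append({"line": line, "error": str(err)})

    print(len(good_rows), "rows loaded;", len(dead_letters), "sent to the dead-letter list")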