What is a Data Pipeline?

In data engineering, a data pipeline is a sequence of data processing steps. The steps are arranged in series, so that each step produces output that the subsequent step uses as its input. The primary objective of a data pipeline is to speed up the entire data processing operation through automation and by making data sources easier to access. Furthermore, data pipelines control the flow of data to ensure that all data is adequately processed and reaches its destination, where data scientists can use it efficiently for further analysis.
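
The chaining of steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names `run_pipeline`, `parse`, `drop_invalid` and `fahrenheit_to_celsius`, as well as the sample readings, are hypothetical and chosen only for the example.

```python
from functools import reduce
from typing import Any, Callable, Iterable, List, Optional

Step = Callable[[Any], Any]

def run_pipeline(data: Any, steps: Iterable[Step]) -> Any:
    """Pass `data` through each step in order.

    The output of one step becomes the input of the next.
    """
    return reduce(lambda current, step: step(current), steps, data)

# Hypothetical steps, for illustration only.
def parse(raw: List[str]) -> List[Optional[float]]:
    """Convert raw strings to floats; unparseable values become None."""
    parsed: List[Optional[float]] = []
    for value in raw:
        try:
            parsed.append(float(value))
        except ValueError:
            parsed.append(None)
    return parsed

def drop_invalid(values: List[Optional[float]]) -> List[float]:
    """Remove the values that could not be parsed."""
    return [v for v in values if v is not None]

def fahrenheit_to_celsius(values: List[float]) -> List[float]:
    """Convert temperatures from Fahrenheit to Celsius, rounded to one decimal."""
    return [round((v - 32) * 5 / 9, 1) for v in values]

raw_readings = ["68", "n/a", "212", "32"]  # assumed sample input
result = run_pipeline(raw_readings, [parse, drop_invalid, fahrenheit_to_celsius])
print(result)  # [20.0, 100.0, 0.0]
```

Because every step takes one input and returns one output, steps can be added, removed or reordered without changing `run_pipeline` itself.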

Points to Remember

  • Data pipelines are an integral part of data engineering [link to #3.10]. They combine several different elements of data processing into a single process. The most common of these elements are data sources, data joins, data standardisation, extraction, correction, loading and automation.
  • Several different tools and technologies can be used for data pipelining. However, the effectiveness and compatibility of these tools depend on the size, shape and overall structure of the pipeline.
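
As a rough illustration of how the elements listed above might fit together, the following Python sketch uses only the standard library. Everything in it is an assumption made for the example: the two in-memory CSV sources, the column names, the cleaning rules and the list that stands in for a destination. Automation (for instance, running the script on a schedule) is left out.

```python
import csv
import io

# Two hypothetical in-memory data sources standing in for real files or databases.
CUSTOMERS_CSV = "id,name,country\n1, Alice ,uk\n2,Bob,US\n3,,de\n"
ORDERS_CSV = "customer_id,amount\n1,30.50\n2,12.00\n1,7.25\n"

def extract(text):
    """Extraction: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def standardise(customers):
    """Standardisation: trim whitespace and use upper-case country codes."""
    return [
        {
            "id": c["id"],
            "name": c["name"].strip(),
            "country": c["country"].strip().upper(),
        }
        for c in customers
    ]

def correct(customers):
    """Correction: drop records whose name is missing."""
    return [c for c in customers if c["name"]]

def join(customers, orders):
    """Join: attach each customer's total order amount."""
    totals = {}
    for o in orders:
        key = o["customer_id"]
        totals[key] = totals.get(key, 0.0) + float(o["amount"])
    return [{**c, "total": totals.get(c["id"], 0.0)} for c in customers]

def load(rows, destination):
    """Loading: append the processed rows to the destination (a list here)."""
    destination.extend(rows)
    return destination

warehouse = []  # assumed stand-in for a database table or data warehouse
customers = correct(standardise(extract(CUSTOMERS_CSV)))
load(join(customers, extract(ORDERS_CSV)), warehouse)
for row in warehouse:
    print(row)
```

In practice a dedicated tool would usually handle these stages, and which one fits depends on the size, shape and structure of the pipeline, as noted above.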