This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.
Transforming data in Frictionless means modifying data and metadata from state A to state B. For example, it could be transforming a messy Excel file into a cleaned CSV file, or transforming a folder of data files into a data package we can publish more easily. To read more about the concepts behind Frictionless Transform, please check out the Transform Principles section below.
Compared to similar Python software like Pandas, Frictionless provides better control over metadata, has a modular API, and fully supports the Frictionless Specifications. It is also a streaming framework, able to work with large datasets. The downside of this architecture is that it can be slower than other Python packages, especially projects like Pandas.
Keep reading below to learn about the principles underlying Frictionless Transform, or skip ahead to see how to use the Transform code.
Frictionless Transform is based on a few core principles which are shared with other parts of the framework:
Frictionless Transform can be thought of as a list of functions that accept a source resource/package object and return a target resource/package object. Every function updates the input's metadata and data - and nothing more. We tried to make this straightforward and conceptually simple, because we want our users to be able to understand the tools and master them.
There are plenty of great ETL frameworks written in Python and other languages. We use one of them (PETL) under the hood (described in more detail later). The core difference between Frictionless and the others is that we treat metadata as a first-class citizen. This means that you don't lose types and other important information during pipeline evaluation.
Whenever possible, Frictionless streams the data instead of reading it into memory. For example, when sorting big tables it uses a memory usage threshold, and when the threshold is exceeded it unloads the data to the file system. The ability to stream data lets users work with files of any size, even very large ones.
With Frictionless, all data manipulation happens on demand. For example, if you reshape one table in a data package containing 10 big CSV files, Frictionless will not even read the other 9 tables. Frictionless also tries to be as explicit as possible about the actions it takes: for example, it will not use CPU resources to cast data unless a user adds a normalize step (`steps.table_normalize`). So it's possible to transform a rather big file without even casting types, for example, if you only need to reshape it.
For the core transform functions, Frictionless uses the amazing PETL project under the hood. This library provides lazy-loading functionality in running data pipelines. On top of PETL, Frictionless adds metadata management and a bridge between Frictionless concepts like Package/Resource and PETL's processors.
Frictionless supports a few different kinds of data and metadata transformations:
- resource and package transformations
- transformations based on a declarative pipeline
The main difference between these is that resource and package transforms are imperative while pipelines can be created beforehand or shared as a JSON file. We'll talk more about pipelines in the Transforming Pipeline section below. First, we will introduce the transform functions, then go into detail about how to transform a resource and a package. As a reminder, in the Frictionless ecosystem, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema. This concept is described in more detail in the Introduction.
Download `transform.csv` to reproduce the examples (right-click and "Save link as"; you might need to change the file extension from .txt to .csv).
The high-level interface to transform data is a set of `transform` functions:
- `transform`: detects the source type and transforms data accordingly
- `transform_resource`: transforms a resource
- `transform_package`: transforms a package
- `transform_pipeline`: transforms a resource or package based on a declarative pipeline definition
We'll see examples of these functions in the next few sections.
#Transforming a Resource
Let's write our first transformation. Here, we will transform a data file (a resource) by defining a source resource, applying transform steps and getting back a resulting target resource:
Let's break down the transforming steps we applied:
- `steps.table_normalize`: casts data types and shapes the table according to the schema, inferred or provided
- `steps.field_add`: adds a field to the data and metadata based on the information provided by the user
There are many more available steps that we will cover below.
#Transforming a Package
A package is a set of resources. Transforming a package means adding or removing resources and/or transforming those resources themselves. This example shows how transforming a package is similar to transforming a single resource:
We have basically done the same as in the Transforming a Resource section. This example is quite artificial and created only to show how to join two resources, but hopefully it provides a basic understanding of how flexible package transformations can be.
A pipeline is a declarative way to write out transform steps as metadata. With a pipeline, you can transform a resource or a package, and custom plugins can add further pipeline types.
For resource and package types it's mostly the same functionality as we have seen above, but written declaratively. So let's run the same resource transformation as we did in the Transforming a Resource section:
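As an illustration, such a pipeline saved as a JSON file might look like this. This is a sketch assuming the v4 descriptor format; the step codes mirror `table_normalize` and `field_add` from the resource example, and the `cars` field with its formula is made up:

```json
{
  "tasks": [
    {
      "type": "resource",
      "source": {"path": "transform.csv"},
      "steps": [
        {"code": "table-normalize"},
        {"code": "field-add", "name": "cars", "formula": "population * 2"}
      ]
    }
  ]
}
```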
This returns the same result as in Transforming a Resource. So why use declarative pipelines if they work the same as Python code? The main difference is that pipelines can be saved as JSON files, shared among different users, and used with both the CLI and the API. For example, if you implement your own UI based on Frictionless Framework, you can serialize the whole pipeline as a JSON file and send it to the server. The same goes for the CLI: if a colleague has given you a `pipeline.json` file, you can run `frictionless transform pipeline.json` in the CLI to get the same results they got.
Frictionless includes more than 40 built-in transform steps. They are grouped by the object they operate on, so you can find them easily using autocompletion in a code editor: for example, start typing `steps.table...` and you will see all the available steps for that group.
See Transform Steps for a list of all available steps. It is also possible to write custom transform steps: see the next section.
Here is an example of a custom step written as a Python function. This example step removes a field from a data table (note: Frictionless already has a built-in step, `steps.field_remove`, that does the same thing):
As you can see, you can implement any custom step within a Python script. To make it work within a declarative pipeline, you need to implement a plugin. Learn more about Custom Steps and Plugins.
Transform Utils is under construction.
#Working with PETL
In some cases, it's better to use a lower-level API to achieve your goal. A resource can be exported as a PETL table. For more information please visit PETL's documentation portal.