Basic Examples

This example assumes that you are familiar with the concepts behind the Frictionless Framework. For an introduction, please read the Introduction.

Let's start with an example dataset. We will look at a few raw data files that have recently been collected by an anthropologist. The anthropologist wants to publish this data in an open repository so her colleagues can also use this data. Before publishing the data, she wants to add metadata and check the data for errors. We are here to help, so let’s start by exploring the data. We see that the quality of data is far from perfect. In fact, the first row contains comments from the anthropologist! To be able to use this data, we need to clean it up a bit.

Download countries.csv to reproduce the examples (right-click and "Save link as").

CLI
Python

cat countries.csv

countries.csv

# clean this data!
id,neighbor_id,name,population
1,Ireland,Britain,67
2,3,France,n/a,find the population
3,22,Germany,83
4,,Italy,60
5

As we can see, this is data containing information about European countries and their populations. Also, it looks like there are two fields having a relationship based on a country's identifier: neighbor_id is a Foreign Key to id.

Describing Data#

First of all, we're going to describe our dataset. Frictionless uses the powerful Frictionless Data Specifications. They are very handy to describe:

a data table - using Table Schema
a data resource - using Data Resource
a data package - using Data Package
and other objects

Let's describe the countries table:

CLI
Python

frictionless describe countries.csv # optionally add --stats to get statistics

# --------
# metadata: countries.csv
# --------

encoding: utf-8
format: csv
hashing: md5
layout:
  headerRows:
    - 2
name: countries
path: countries.csv
profile: tabular-data-resource
schema:
  fields:
    - name: id
      type: integer
    - name: neighbor_id
      type: string
    - name: name
      type: string
    - name: population
      type: string
scheme: file

As we can see, Frictionless was smart enough to understand that the first row contains a comment. It's good, but we still have a few problems:

we use n/a as a missing values marker
neighbor_id must be numerical: let's edit the schema
population must be numerical: setting proper missing values will solve it
there is a relation between the id and neighbor_id fields

Let's update our metadata and save it to the disc:

CLI
Python

frictionless describe countries.csv --yaml > countries.resource.yaml
editor countries.resource.yaml

Open this file in your favorite editor and update as it's shown below

Let's see what we have created:

CLI
Python

cat countries.resource.yaml

encoding: utf-8
format: csv
hashing: md5
layout:
  headerRows:
    - 2
name: countries
path: countries.csv
profile: tabular-data-resource
schema:
  fields:
    - name: id
      type: integer
    - name: neighbor_id
      type: integer
    - name: name
      type: string
    - name: population
      type: integer
  foreignKeys:
    - fields:
        - neighbor_id
      reference:
        fields:
          - id
        resource: ''
  missingValues:
    - ''
    - n/a
scheme: file

It has the same metadata as we saw above but also includes our editing related to missing values and data types. We didn't change all the wrong data types manually because providing proper missing values had fixed it automatically. Now we have a resource descriptor. In the next section, we will show why metadata matters and how to use it.

Extracting Data#

It's time to try extracting our data as a table. As a first naive attempt, we will ignore the metadata we saved on the previous step:

CLI
Python

frictionless extract countries.csv

# ----
# data: countries.csv
# ----

==  ===========  =======  ==========
id  neighbor_id  name     population
==  ===========  =======  ==========
 1  Ireland      Britain  67
 2  3            France   n/a
 3  22           Germany  83
 4               Italy    60
 5
==  ===========  =======  ==========

Actually, it doesn't look terrible, but in reality, data like this is not quite useful:

it's not possible to export this data e.g., to SQL because integers are mixed with strings
there is still a basically empty row we don't want to have
there are some mistakes in the neighbor_id column

Let's use the metadata we save to try extracting data with the help of Frictionless Data specifications:

CLI
Python

frictionless extract countries.resource.yaml

# ----
# data: countries.resource.yaml
# ----

==  ===========  =======  ==========
id  neighbor_id  name     population
==  ===========  =======  ==========
 1               Britain          67
 2            3  France
 3           22  Germany          83
 4               Italy            60
 5
==  ===========  =======  ==========

It's now much better! Numerical fields are numerical fields, and there are no more textual missing values markers. We can't see in the command-line, but missing values are now None values in Python, and the data can be e.g., exported to SQL. Although, it's still not ready for being published. In the next section, we will validate it!

Validating Data#

Data validation with Frictionless is as easy as describing or extracting data:

CLI
Python

frictionless validate countries.csv

# -------
# invalid: countries.csv
# -------

===  =====  ============  =============================================================================
row  field  code          message
===  =====  ============  =============================================================================
  4      5  extra-cell    Row at position "4" has an extra value in field at position "5"
  7      2  missing-cell  Row at position "7" has a missing cell in field "neighbor_id" at position "2"
  7      3  missing-cell  Row at position "7" has a missing cell in field "name" at position "3"
  7      4  missing-cell  Row at position "7" has a missing cell in field "population" at position "4"
===  =====  ============  =============================================================================

Ahh, we had seen that coming. The data is not valid; there are some missing and extra cells. But wait a minute, in the first step, we created the metadata file with more information about our table. We have to use it.

CLI
Python

CLI

frictionless validate countries.resource.yaml

# -------
# invalid: countries.csv
# -------

===  =====  =================  ==============================================================================================================
row  field  code               message
===  =====  =================  ==============================================================================================================
  3      2  type-error         Type error in the cell "Ireland" in row "3" and field "neighbor_id" at position "2": type is "integer/default"
  4      5  extra-cell         Row at position "4" has an extra value in field at position "5"
  5         foreign-key-error  Row at position "5" violates the foreign key: not found in the lookup table
  7      2  missing-cell       Row at position "7" has a missing cell in field "neighbor_id" at position "2"
  7      3  missing-cell       Row at position "7" has a missing cell in field "name" at position "3"
  7      4  missing-cell       Row at position "7" has a missing cell in field "population" at position "4"
===  =====  =================  ==============================================================================================================

Now it's even worse, but regarding data validation errors, the more, the better, actually. Thanks to the metadata, we were able to reveal some critical errors:

the bad data types, i.e. Ireland instead of an id
the bad relation between id and neighbor_id: we don't have a country with id 22

In the next section, we will clean up the data.

Transforming Data#

We will use metadata to fix all the data type problems automatically. The only two things we need to handle manually:

France's population
Germany's neighborhood

CLI
Python

$ cat > countries.pipeline.yaml <<EOF
tasks:
  - type: resource
    source: countries.csv
    steps:
      - code: cell-replace
        fieldName: neighbor_id
        pattern: '22'
        replace: '2'
      - code: cell-replace
        fieldName: population
        pattern: 'n/a'
        replace: '67'
      - code: row-filter
        formula: population
      - code: field-update
        name: neighbor_id
        type: integer
      - code: table-normalize
      - code: table-write
        path: countries.csv
EOF
$ frictionless transform countries.pipeline.yaml

# -------
# success: countries.pipeline.yaml
# -------

Finally, we've got the cleaned version of our data, which can be exported to a database or published. We have used a CSV as an output format but could have used Excel, JSON, SQL, and others.

CLI
Python

cat countries.csv

id,neighbor_id,name,population
1,,Britain,67
2,3,France,67
3,2,Germany,83
4,,Italy,60

Basically, that's it; now, we have a valid data file and a corresponding metadata file. It can be shared with other people or stored without fear of type errors or other problems making research data not reproducible.

CLI
Python

ls countries.*

countries.csv
countries.resource.yaml

In the next articles, we will explore more advanced Frictionless functionality.