Skip to main content

Extracting Data

This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.

Extracting data means reading tabular data from a source. We can use various customizations for this process such as providing a file format, table schema, limiting fields or rows amount, and much more. This guide will discuss the main extract functions (extract, extract_resource, extract_package) and will then go into more advanced details about the Resource Class, Package Class, Header Class, and Row Class.

Let's see this with some real files:

Download country-3.csv to reproduce the examples (right-click and "Save link as").

cat country-3.csv
Data: country-3.csv
id,capital_id,name,population
1,1,Britain,67
2,3,France,67
3,2,Germany,83
4,5,Italy,60
5,4,Spain,47

Download capital-3.csv to reproduce the examples (right-click and "Save link as").

cat capital-3.csv
Data: capital-3.csv
id,name
1,London
2,Berlin
3,Paris
4,Madrid
5,Rome

To start, we will extract data from a resource:

frictionless extract country-3.csv
Data: capital-3.csv
# ----
# data: country-3.csv
# ----
== ========== ======= ==========
id capital_id name population
== ========== ======= ==========
1 1 Britain 67
2 3 France 67
3 2 Germany 83
4 5 Italy 60
5 4 Spain 47
== ========== ======= ==========

Extract Functions#

The high-level interface for extracting data provided by Frictionless is a set of extract functions:

  • extract: detects the source file type and extracts data accordingly
  • extract_resource: accepts a resource descriptor and returns a data table
  • extract_package: accepts a package descriptor and returns a map of the package's tables

As described in more detail in the Introduction, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema.

The command/function would be used as follows:

frictionless extract your-table.csv
frictionless extract your-resource.json --type resource
frictionless extract your-package.json --type package

The extract functions always reads data in the form of rows, into memory. The lower-level interfaces will allow you to stream data, which you can read about in the Resource Class section below.

Extracting a Resource#

A resource contains only one file. To extract a resource, we have three options. First, we can use the same approach as above, extracting from the data file itself:

frictionless extract capital-3.csv
Data: capital-3.csv
# ----
# data: capital-3.csv
# ----
== ======
id name
== ======
1 London
2 Berlin
3 Paris
4 Madrid
5 Rome
== ======

Our second option is to extract the resource from a descriptor file by using the extract_resource function. A descriptor file is useful because it can contain different metadata and be stored on the disc.

As an example of how to use extract_resource, let's first create a descriptor file (note: this example uses YAML for the descriptor, but Frictionless also supports JSON):

Python
from frictionless import Resource
resource = Resource('capital-3.csv')
resource.infer()
# as an example, in the next line we will append the schema
resource.schema.missing_values.append('3') # will interpret 3 as a missing value
resource.to_yaml('capital.resource.yaml') # use resource.to_json for JSON format

You can also use a pre-made descriptor file.

Now, this descriptor file can be used to extract the resource:

frictionless extract capital.resource.yaml
Data: capital.resource.yaml
# ----
# data: capital.resource.yaml
# ----
== ======
id name
== ======
1 London
2 Berlin
Paris
4 Madrid
5 Rome
== ======

So what has happened in this example? We set the textual representation of the number "3" to be a missing value. In the output we can see how the id number 3 now appears as None representing a missing value. This toy example demonstrates how the metadata in a descriptor can be used; other values like "NA" are more common for missing values.

You can read more advanced details about the Resource Class below.

Extracting a Package#

The third way we can extract information is from a package, which is a set of two or more files, for instance, two data files and a corresponding metadata file.

As a primary example, we provide two data files to the extract command which will be enough to detect that it's a dataset. Let's start by using the command-line interface:

frictionless extract *-3.csv
Data: *-3.csv
# ----
# data: capital-3.csv
# ----
== ======
id name
== ======
1 London
2 Berlin
3 Paris
4 Madrid
5 Rome
== ======
# ----
# data: country-3.csv
# ----
== ========== ======= ==========
id capital_id name population
== ========== ======= ==========
1 1 Britain 67
2 3 France 67
3 2 Germany 83
4 5 Italy 60
5 4 Spain 47
== ========== ======= ==========

We can also extract the package from a descriptor file using the extract_package function (Note: see the Package Class section for the creation of the country.package.yaml file):

Python
from frictionless import Package
package = Package('country.package.yaml')
pprint(package)

You can read more advanced details about the Package Class below.

The following sections contain further, advanced details about the Resource Class, Package Class, Header Class, and Row Class.

Resource Class#

The Resource class provides metadata about a resource with read and stream functions. The extract functions always read rows into memory; Resource can do the same but it also gives a choice regarding output data which can be rows, data, text, or bytes. Let's try reading all of them.

Reading Bytes#

It's a byte representation of the contents:

Python
from frictionless import Resource
resource = Resource('country-3.csv')
pprint(resource.read_bytes())
Data: country-3.csv
(b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n3,2,Germany,8'
b'3\n4,5,Italy,60\n5,4,Spain,47\n')

Reading Text#

It's a textual representation of the contents:

Python
from frictionless import Resource
resource = Resource('country-3.csv')
pprint(resource.read_text())
Data: country-3.csv
('id,capital_id,name,population\n'
'1,1,Britain,67\n'
'2,3,France,67\n'
'3,2,Germany,83\n'
'4,5,Italy,60\n'
'5,4,Spain,47\n')

Reading Lists#

For a tabular data there are raw representaion of the tabular contents:

Python
from frictionless import Resource
resource = Resource('country-3.csv')
pprint(resource.read_lists())
Data: country-3.csv
[['id', 'capital_id', 'name', 'population'],
['1', '1', 'Britain', '67'],
['2', '3', 'France', '67'],
['3', '2', 'Germany', '83'],
['4', '5', 'Italy', '60'],
['5', '4', 'Spain', '47']]

Reading Rows#

For a tabular data there are row available which is are normalized lists presented as dictionaries:

Python
from frictionless import Resource
resource = Resource('country-3.csv')
pprint(resource.read_rows())
Data: country-3.csv
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]

Reading a Header#

For a tabular data there is the Header object available:

Python
from frictionless import Resource
with Resource('country-3.csv') as resource:
pprint(resource.header)
Data header: country-3.csv
['id', 'capital_id', 'name', 'population']

Streaming Interfaces#

It's really handy to read all your data into memory but it's not always possible if a file is very big. For such cases, Frictionless provides streaming functions:

Python
from frictionless import Resource
with Resource('country-3.csv') as resource:
resource.byte_stream
resource.text_stream
resource.list_stream
resource.row_stream

Package Class#

The Package class provides functions to read the contents of a package. First of all, let's create a package descriptor:

frictionless describe *-3.csv --json > country.package.json

Note that --json is used here to output the descriptor in JSON format. Without this, the default output is in YAML format as we saw above.

We can create a package from data files (using their paths) and then read the package's resources:

Python
from frictionless import Package
package = Package('*-3.csv')
pprint(package.get_resource('country-3').read_rows())
pprint(package.get_resource('capital-3').read_rows())
Data: country-3.csv, capital-3.csv
[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
[{'id': 1, 'name': 'London'},
{'id': 2, 'name': 'Berlin'},
{'id': 3, 'name': 'Paris'},
{'id': 4, 'name': 'Madrid'},
{'id': 5, 'name': 'Rome'}]

The package by itself doesn't provide any read functions directly because it's just a contrainer. You can select a pacakge's resource and use the Resource API from above for data reading.

Header Class#

After opening a resource you get access to a resource.header object which describes the resource in more detail. This is a list of normalized labels but also provides some additional functionality. Let's take a look:

Python
from frictionless import Resource
with Resource('capital-3.csv') as resource:
print(f'Header: {resource.header}')
print(f'Labels: {resource.header.labels}')
print(f'Fields: {resource.header.fields}')
print(f'Field Names: {resource.header.field_names}')
print(f'Field Positions: {resource.header.field_positions}')
print(f'Errors: {resource.header.errors}')
print(f'Valid: {resource.header.valid}')
print(f'As List: {resource.header.to_list()}')
Header: ['id', 'name']
Labels: ['id', 'name']
Fields: [{'name': 'id', 'type': 'integer'}, {'name': 'name', 'type': 'string'}]
Field Names: ['id', 'name']
Field Positions: [1, 2]
Errors: []
Valid: True
As List: ['id', 'name']

The example above shows a case when a header is valid. For a header that contains errors in its tabular structure, this information can be very useful, revealing discrepancies, duplicates or missing cell information:

Python
from pprint import pprint
from frictionless import Resource
with Resource([['name', 'name'], ['value', 'value']]) as resource:
pprint(resource.header.errors)
[{'code': 'duplicate-label',
'description': 'Two columns in the header row have the same value. Column '
'names should be unique.',
'fieldName': 'name2',
'fieldNumber': 2,
'fieldPosition': 2,
'label': 'name',
'labels': ['name', 'name'],
'message': 'Label "name" in the header at position "2" is duplicated to a '
'label: at position "1"',
'name': 'Duplicate Label',
'note': 'at position "1"',
'rowPositions': [1],
'tags': ['#table', '#header', '#label']}]

Please read the API Reference for more details.

Row Class#

The extract, resource.read_rows() and other functions return or yield row objects. In Python, this returns a dictionary with the following information. Note: this example uses the Detector object, which tweaks how different aspects of metadata are detected.

Python
from frictionless import Resource, Detector
detector = Detector(schema_patch={'missingValues': ['1']})
with Resource('data/capital-3.csv', detector=detector) as resource:
for row in resource:
print(f'Row: {row}')
print(f'Cells: {row.cells}')
print(f'Fields: {row.fields}')
print(f'Field Names: {row.field_names}')
print(f'Field Positions: {row.field_positions}')
print(f'Value of field "name": {row["name"]}') # accessed as a dict
print(f'Row Position: {row.row_position}') # physical line number starting from 1
print(f'Row Number: {row.row_number}') # counted row number starting from 1
print(f'Blank Cells: {row.blank_cells}')
print(f'Error Cells: {row.error_cells}')
print(f'Errors: {row.errors}')
print(f'Valid: {row.valid}')
print(f'As Dict: {row.to_dict(json=False)}')
print(f'As List: {row.to_list(json=True)}') # JSON compatible data types
break
Row: {'id': None, 'name': 'London'}
Cells: ['1', 'London']
Fields: [{'name': 'id', 'type': 'integer'}, {'name': 'name', 'type': 'string'}]
Field Names: ['id', 'name']
Field Positions: [1, 2]
Value of field "name": London
Row Position: 2
Row Number: 1
Blank Cells: {'id': '1'}
Error Cells: {}
Errors: []
Valid: True
As Dict: {'id': None, 'name': 'London'}
As List: [None, 'London']

As we can see, this output provides a lot of information which is especially useful when a row is not valid. Our row is valid but we demonstrated how it can preserve data about missing values. It also preserves data about all cells that contain errors:

Python
from pprint import pprint
from frictionless import Resource
with Resource([['name'], ['value', 'value']]) as resource:
for row in resource.row_stream:
pprint(row.errors)
[{'cell': 'value',
'cells': ['value', 'value'],
'code': 'extra-cell',
'description': 'This row has more values compared to the header row (the '
'first row in the data source). A key concept is that all the '
'rows in tabular data must have the same number of columns.',
'fieldName': '',
'fieldNumber': 1,
'fieldPosition': 2,
'message': 'Row at position "2" has an extra value in field at position "2"',
'name': 'Extra Cell',
'note': '',
'rowNumber': 1,
'rowPosition': 2,
'tags': ['#table', '#row', '#cell']}]

Please read the API Reference for more details.