Extracting Data
This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.
Extracting data means reading tabular data from a source. This process can be customized in many ways, such as specifying a file format, providing a table schema, or limiting the fields or rows that are read. This guide discusses the main extract functions (extract, extract_resource, extract_package) and then goes into more advanced details about the Resource Class, Package Class, Header Class, and Row Class.
Let's see this with some real files:
Download country-3.csv to reproduce the examples (right-click and "Save link as").
Download capital-3.csv to reproduce the examples (right-click and "Save link as").
To start, we will extract data from a resource:

Extract Functions
The high-level interface for extracting data provided by Frictionless is a set of extract functions:
- extract: detects the source file type and extracts data accordingly
- extract_resource: accepts a resource descriptor and returns a data table
- extract_package: accepts a package descriptor and returns a map of the package's tables
As described in more detail in the Introduction, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema.
The command/function would be used as follows:
The extract functions always read data into memory as rows. The lower-level interfaces allow you to stream data instead, which you can read about in the Resource Class section below.

Extracting a Resource
A resource contains only one file. To extract a resource, we have three options. First, we can use the same approach as above, extracting from the data file itself.
Our second option is to extract the resource from a descriptor file by using the extract_resource function. A descriptor file is useful because it can hold additional metadata and can be stored on disk.
As an example of how to use extract_resource, let's first create a descriptor file (note: this example uses YAML for the descriptor, but Frictionless also supports JSON):
You can also use a pre-made descriptor file.
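As a rough illustration of what such a descriptor can hold, the sketch below writes a trimmed JSON descriptor using only Python's standard library, rather than generating it with Frictionless. The path and missingValues keys come from the Data Resource and Table Schema specifications; the file name country.resource.json and the exact values are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path

# Trimmed resource descriptor (assumed content, for illustration only).
# A full descriptor would normally also list the schema fields.
descriptor = {
    "path": "country-3.csv",
    "schema": {
        # Treat empty strings and the text "3" as missing values.
        "missingValues": ["", "3"],
    },
}

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "country.resource.json"  # hypothetical file name
    target.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    loaded = json.loads(target.read_text(encoding="utf-8"))

print(loaded["schema"]["missingValues"])
```

Because the descriptor is plain data, it round-trips through JSON unchanged, which is what makes it convenient to store next to the data file.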
Now, this descriptor file can be used to extract the resource:
So what has happened in this example? We set the textual representation of the number "3" to be a missing value. In the output we can see how the id number 3 now appears as None, representing a missing value. This toy example demonstrates how the metadata in a descriptor can be used; other values like "NA" are more common for missing values.
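To make the effect of a missing-values setting concrete, here is a minimal standard-library sketch of the idea. It does not call Frictionless and is not how the library implements the feature; the sample data and the helper name apply_missing_values are hypothetical, chosen only to mirror the "3 becomes None" behaviour described above.

```python
import csv
import io

# Hypothetical sample data; not the contents of country-3.csv.
SAMPLE = "id,name\n1,alpha\n2,beta\n3,gamma\n"


def apply_missing_values(text, missing_values):
    """Read CSV text into dict rows, replacing listed tokens with None."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for record in reader:
        rows.append({
            key: (None if value in missing_values else value)
            for key, value in record.items()
        })
    return rows


rows = apply_missing_values(SAMPLE, missing_values=["", "3"])
for row in rows:
    print(row)
```

Note that the replacement applies to every cell whose text matches a listed token, not only to the id column, which is why "3" makes a poor missing-value marker outside a toy example.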
You can read more advanced details about the Resource Class below.

Extracting a Package
The third way we can extract information is from a package, which is a set of two or more files, for instance, two data files and a corresponding metadata file.
As a first example, we pass two data files to the extract command, which is enough for it to detect that the input is a dataset. Let's start by using the command-line interface.
We can also extract the package from a descriptor file using the extract_package function (note: see the Package Class section for the creation of the country.package.yaml file):
You can read more advanced details about the Package Class below.
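To picture the "map of the package's tables" that package extraction is described as returning, the following standard-library sketch reads several CSV files into a dictionary keyed by file name. It is only an analogy and does not use Frictionless; the function name read_tables, the file names, and the file contents are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def read_tables(paths):
    """Return {file name: list of row dicts} for several CSV files."""
    tables = {}
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            tables[Path(path).name] = list(csv.DictReader(handle))
    return tables


# Hypothetical stand-ins for the guide's two data files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.csv"
    second = Path(tmp) / "second.csv"
    first.write_text("id,name\n1,alpha\n", encoding="utf-8")
    second.write_text("id,label\n1,one\n2,two\n", encoding="utf-8")
    tables = read_tables([first, second])

for name, rows in tables.items():
    print(name, rows)
```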
The following sections contain further, advanced details about the Resource Class, Package Class, Header Class, and Row Class.

Resource Class
The Resource class provides metadata about a resource with read and stream functions. The extract functions always read rows into memory; Resource can do the same but it also gives a choice regarding output data which can be rows, data, text, or bytes. Let's try reading all of them.

Reading Bytes
The bytes output is a byte representation of the contents:

Reading Text
The text output is a textual representation of the contents:

Reading Lists
For tabular data, there is a raw representation of the contents as lists:

Reading Rows
For tabular data, rows are available, which are normalized lists presented as dictionaries:
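The relationship between these four representations can be sketched with the standard library alone. The snippet below is an analogy rather than the Resource API: it starts from hypothetical file contents and derives text, raw lists, and dictionary rows from the same bytes.

```python
import csv
import io

# Hypothetical file contents used only for this sketch.
raw_bytes = b"id,name\n1,alpha\n2,beta\n"

text = raw_bytes.decode("utf-8")                      # textual representation
lists = list(csv.reader(io.StringIO(text)))           # raw lists, header included
header, *body = lists
rows = [dict(zip(header, cells)) for cells in body]   # dictionaries keyed by label

print(raw_bytes)
print(text)
print(lists)
print(rows)
```

This sketch only keys cells by their labels and leaves every value as a string; it performs no type casting or other normalization.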

Reading a Header
For tabular data, the Header object is available:

Streaming Interfaces
It's really handy to read all your data into memory but it's not always possible if a file is very big. For such cases, Frictionless provides streaming functions:
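The difference between reading everything into memory and streaming can be illustrated with a plain Python generator. This is a conceptual sketch using the csv module, not the Frictionless streaming API; stream_rows and the in-memory sample are hypothetical.

```python
import csv
import io


def stream_rows(handle):
    """Yield one dict per CSV record without loading the whole file."""
    for record in csv.DictReader(handle):
        yield record


# Hypothetical in-memory file standing in for a large CSV on disk.
source = io.StringIO("id,name\n1,alpha\n2,beta\n3,gamma\n")

stream = stream_rows(source)
first = next(stream)        # only the header and first record are parsed so far
remaining = list(stream)    # the rest is consumed on demand
print(first, len(remaining))
```

With a generator, memory use stays proportional to one record at a time, at the cost of being able to iterate over the data only once.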

Package Class
The Package class provides functions to read the contents of a package. First of all, let's create a package descriptor:
Note that --json is used here to output the descriptor in JSON format. Without this, the default output is in YAML format as we saw above.
We can create a package from data files (using their paths) and then read the package's resources:
The package by itself doesn't provide any read functions directly because it's just a container. You can select a package's resource and use the Resource API from above for data reading.

Header Class
After opening a resource you get access to a resource.header object which describes the resource in more detail. This is a list of normalized labels but also provides some additional functionality. Let's take a look:
The example above shows a case when a header is valid. For a header that contains errors in its tabular structure, this information can be very useful, revealing discrepancies, duplicates or missing cell information:
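As a rough idea of the kind of checks a header can carry, the sketch below scans a list of labels for blanks and duplicates. It is a standard-library illustration only: the function name check_labels and the problem descriptions are hypothetical, and the rules Frictionless actually applies may differ (this sketch, for instance, flags every occurrence of a repeated label).

```python
from collections import Counter


def check_labels(labels):
    """Return a list of (position, label, problem) tuples for a header."""
    counts = Counter(label for label in labels if label.strip())
    problems = []
    for position, label in enumerate(labels, start=1):
        if not label.strip():
            problems.append((position, label, "blank label"))
        elif counts[label] > 1:
            problems.append((position, label, "duplicate label"))
    return problems


valid = check_labels(["id", "name", "population"])
broken = check_labels(["id", "name", "", "name"])
print(valid)
print(broken)
```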
Please read the API Reference for more details.

Row Class
The extract, resource.read_rows() and other functions return or yield row objects. In Python, each row is a dictionary with the following information. Note: this example uses the Detector object, which tweaks how different aspects of metadata are detected.
As we can see, this output provides a lot of information which is especially useful when a row is not valid. Our row is valid but we demonstrated how it can preserve data about missing values. It also preserves data about all cells that contain errors:
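The idea of a row that remembers its blank and erroneous cells can be sketched without Frictionless. In the hypothetical helper below, read_row casts each cell with a given function and records the original text of any cell that was blank or failed to cast; the key names values, blank_cells, error_cells, and valid are illustrative and should not be read as the exact attributes of the Row class.

```python
def read_row(labels, cells, casters, missing_values=("",)):
    """Cast one record and keep track of blank and erroneous cells."""
    values, blank_cells, error_cells = {}, {}, {}
    for label, cell, cast in zip(labels, cells, casters):
        if cell in missing_values:
            values[label] = None
            blank_cells[label] = cell      # remember the original token
            continue
        try:
            values[label] = cast(cell)
        except ValueError:
            values[label] = None
            error_cells[label] = cell      # keep the source text for reporting
    return {
        "values": values,
        "blank_cells": blank_cells,
        "error_cells": error_cells,
        "valid": not error_cells,
    }


row = read_row(["id", "name", "population"], ["1", "", "bad"], [int, str, int])
print(row)
```

Keeping the source text alongside the cast values is what allows an invalid row to be reported precisely instead of being silently dropped.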
Please read the API Reference for more details.