diff --git a/.gitignore b/.gitignore index 088d28f607..c24a2d10aa 100644 --- a/.gitignore +++ b/.gitignore @@ -95,3 +95,4 @@ coverage/ site/ tmp/ .vim +!**/*.html diff --git a/404.html b/404.html new file mode 100644 index 0000000000..b4bbabf792 --- /dev/null +++ b/404.html @@ -0,0 +1,3439 @@ + + +
+ + + + + ++
+ By Evgeny Karev on 2022-08-22 » + Blog Index +
We're releasing the first beta of Frictionless Framework (v5)! +Since the initial Frictionless Framework release we have been collecting feedback and analyzing both high-level user needs and bug reports to identify shortcomings and areas that can be improved in the next version of the framework. Once that process was done, we started working on a new v5 with the goal of making the framework more bullet-proof, easier to maintain, and simpler to use. Today, this version is almost stable and ready to be published. Let's go through the main improvements we have made:
+This year we started working on the Frictionless Application; at the same time, we were thinking about the next steps for the Frictionless Standards. For both we need a well-defined and easy-to-understand metadata model. Part of it is already published as standards like Table Schema, and part of it is going to be published as standards like File Dialect and possibly validation/transform metadata.
+In v4 of the framework we had Control/Dialect/Layout concepts to describe resource details related to different formats and schemes, as well as tabular details like header rows. In v5 they are merged into a single concept called Dialect, which is going to be standardised as the File Dialect spec. Here is an example:
+ +header: true
+headerRows: [2, 3]
+commentChar: '#'
+csv:
+ delimiter: ';'
+
+
+ from frictionless import Dialect, Control, formats
+
+dialect = Dialect(header=True, header_rows=[2, 3], comment_char='#')
+dialect.add_control(formats.CsvControl(delimiter=';'))
+print(dialect)
+
+
+ A dialect descriptor can be saved and reused within a resource. Technically, one Dialect can hold settings for several schemes and formats (e.g. for CSV and Excel), so it's possible to create a single reusable dialect for a whole data package. The legacy CSV Dialect spec is supported and will be supported forever, so it's possible to provide CSV properties on the root level:
+ +header: true
+delimiter: ';'
+
+
+ from frictionless import Dialect, Control, formats
+
+dialect = Dialect.from_descriptor({"header": True, "delimiter": ';'})
+print(dialect)
+
+
+ For performance and codebase maintainability reasons some marginal Layout features have been removed completely such as skip/pick/limit/offsetFields/etc
. It's possible to achieve the same results using the Pipeline concept as a part of the transformation workflow.
Read an article about Dialect Class for more information.
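To make the Dialect settings above more concrete, here is a minimal standard-library sketch of what `headerRows`, `commentChar` and a CSV `delimiter` mean when reading a file. It does not use Frictionless: `read_table` is a hypothetical helper, and it assumes that comment lines are dropped before rows are numbered and that multi-row headers are joined with a space.

```python
import csv
import io


def read_table(text, *, delimiter=",", comment_char=None, header_rows=(1,)):
    """Hypothetical helper: parse delimited text into (labels, rows)."""
    lines = text.splitlines()
    if comment_char is not None:
        # Assumption of this sketch: whole-line comments are removed first
        lines = [line for line in lines if not line.startswith(comment_char)]
    cells = list(csv.reader(io.StringIO("\n".join(lines)), delimiter=delimiter))
    header = [cells[number - 1] for number in header_rows]
    # Join the labels of multi-row headers column by column
    labels = [" ".join(parts) for parts in zip(*header)]
    last_header_row = max(header_rows)
    rows = [row for number, row in enumerate(cells, start=1) if number > last_header_row]
    return labels, rows


text = "# exported 2022\nid;name\n1;english\n2;german\n"
labels, rows = read_table(text, delimiter=";", comment_char="#")
print(labels)
print(rows)
```

In the framework itself these options live in the Dialect descriptor, so the same settings can be reused for many files instead of being passed on every call.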
+Checklist is a new concept introduced in v5. It's basically a collection of validation steps and a few other settings to make "validation rules" sharable. For example:
+ +checks:
+  - type: ascii-value
+  - type: row-constraint
+    formula: id > 1
+skipErrors:
+  - duplicate-label
+
+
+ from frictionless import Checklist, checks
+
+checklist = Checklist(
+ checks=[checks.ascii_value(), checks.row_constraint(formula='id > 1')],
+ skip_errors=['duplicate-label'],
+)
+print(checklist)
+
+
+ Having and sharing this checklist it's possible to tune data quality requirements for some data file or set of data files. This concept will provide an ability for creating data quality "libraries" within projects or domains. We can use a checklist for validation:
+ +frictionless validate table1.csv --checklist checklist.yaml
+frictionless validate table2.csv --checklist checklist.yaml
+
+
+ Here is a list of other changes:
+From (v4) | +To (v5) | +
---|---|
Check(descriptor) | +Check.from_descriptor(descriptor) | +
check.code | +check.type | +
Read an article about Checklist Class for more information.
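The idea of a shareable set of validation rules can be sketched without the framework. In this illustration a "checklist" is just a mapping from a rule name to a predicate, and `run_checklist` is a hypothetical helper; Frictionless evaluates formula strings such as `id > 1` itself, which this sketch replaces with plain Python callables.

```python
def run_checklist(rows, checks):
    """Hypothetical helper: return (row_number, check_name) for each failed check."""
    failures = []
    for number, row in enumerate(rows, start=1):
        for name, predicate in checks.items():
            if not predicate(row):
                failures.append((number, name))
    return failures


# One rule set, defined once and applied to two different tables
checklist = {"id > 1": lambda row: row["id"] > 1}
table1 = [{"id": 1}, {"id": 2}]
table2 = [{"id": 5}, {"id": 0}]

failures1 = run_checklist(table1, checklist)
failures2 = run_checklist(table2, checklist)
print(failures1)
print(failures2)
```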
+In v4 Pipeline was a complex concept similar to validation Inquiry. We reworked it for v5 to be a lightweight set of transform steps that can be applied to a data resource or a data package. For example:
+ +steps:
+  - type: table-normalize
+  - type: cell-set
+    fieldName: version
+    value: v5
+
+
+ from frictionless import Pipeline, steps
+
+pipeline = Pipeline(
+ steps=[steps.table_normalize(), steps.cell_set(field_name='version', value='v5')],
+)
+print(pipeline)
+
+
+ Similar to the Checklist concept, Pipeline is a reusable (data-abstract) object that can be saved to a descriptor and used in some complex data workflow:
+ +frictionless transform table1.csv --pipeline pipeline.yaml
+frictionless transform table2.csv --pipeline pipeline.yaml
+
+
+ Here is a list of other changes:
+From (v4) | +To (v5) | +
---|---|
Step(descriptor) | +Step.from_descriptor(descriptor) | +
step.code | +step.type | +
Read an article about Pipeline Class for more information.
+frictionless@5.7
this experimental feature (resource.checklist/pipeline
) has been disabled to conform better with the standards.
+ There are no changes in the Resource related to the standards although currently by default instead of profile
the type
property will be used to mark a resource as a table. It can be changed using the --standards v1
flag.
It's now possible to set Checklist and Pipeline as a Resource property similar to Dialect and Schema:
+ +path: table.csv
+# ...
+checklist:
+  checks:
+    - type: ascii-value
+    - type: row-constraint
+      formula: id > 1
+pipeline:
+  steps:
+    - type: table-normalize
+    - type: cell-set
+      fieldName: version
+      value: v5
+
+
+ Or using dereference:
+ +path: table.csv
+# ...
+checklist: checklist.yaml
+pipeline: pipeline.yaml
+
+
+ In this case the validation/transformation will use it by default, making it possible to ship validation rules and transformation pipelines within resources and packages. This is an important development for data publishers who want to define what they consider valid for their datasets, as well as to share raw data together with its cleaning pipeline steps:
+ +frictionless validate resource.yaml # will use the checklist above
+frictionless transform resource.yaml # will use the pipeline above
+
+
+ There are minor changes in the stats
property. Now it uses named keys to simplify hash distinction (md5/sha256 are calculated by default and, for performance reasons, this can no longer be changed as it could be in v4):
from frictionless import describe
+
+resource = describe('table.csv', stats=True)
+print(resource.stats)
+
+
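As a rough illustration of stats with named keys, the sketch below computes both hashes for a byte string with the standard library. `compute_stats` is a hypothetical helper and the `bytes` key is an assumption of this sketch; the source only states that md5 and sha256 are calculated by default.

```python
import hashlib


def compute_stats(data: bytes) -> dict:
    """Hypothetical helper: hashes and size stored under named keys."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),  # assumed key name, for illustration only
    }


stats = compute_stats(b"id,name\n1,english\n")
print(stats)
```

With named keys a consumer can read `stats["sha256"]` directly instead of inspecting a single hash value to work out which algorithm produced it.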
+ Here is a list of other changes:
+From (v4) | +To (v5) | +
---|---|
for row in resource: | +for row in resource.row_stream | +
Read an article about Resource Class for more information.
+There are no changes in the Package related to the standards although it's now possible to use resource dereference:
+ +name: package
+resources:
+ - resource1.yaml
+ - resource2.yaml
+
+
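Dereferencing can be pictured as replacing each path string in a list with the descriptor stored at that path, while leaving inline descriptors untouched. The sketch below is only an illustration of that idea: `dereference` is a hypothetical helper, and it uses JSON files instead of YAML to stay within the standard library.

```python
import json
import tempfile
from pathlib import Path


def dereference(items, basepath):
    """Hypothetical helper: load path strings, keep inline descriptors as they are."""
    resolved = []
    for item in items:
        if isinstance(item, str):
            item = json.loads((Path(basepath) / item).read_text())
        resolved.append(item)
    return resolved


with tempfile.TemporaryDirectory() as basepath:
    descriptor = {"name": "resource1", "path": "table1.csv"}
    Path(basepath, "resource1.json").write_text(json.dumps(descriptor))
    package = {
        "name": "package",
        "resources": ["resource1.json", {"name": "inline", "path": "table2.csv"}],
    }
    resources = dereference(package["resources"], basepath)

print([resource["name"] for resource in resources])
```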
+ Read an article about Package Class for more information.
+frictionless@5.7
this experimental feature has changed and now requires the catalog.datasets[].package
structure.
+ Catalog is a new concept that is a collection of data packages that can be written inline or using dereference:
+ +name: catalog
+packages:
+ - package1.yaml
+ - package2.yaml
+
+
+ Read an article about Catalog Class for more information.
+Detector is now a metadata class (it wasn't in v4) so it can be saved and shared like other metadata classes:
+ +from frictionless import Detector
+
+detector = Detector(sample_size=1000)
+print(detector)
+
+
+ Read an article about Detector Class for more information.
+There are a few changes in the Inquiry concept, which is known for its use in the Frictionless Repository project:
+From (v4) | +To (v5) | +
---|---|
inquiryTask.source | +inquiryTask.path | +
inquiryTask.source | +inquiryTask.resource | +
inquiryTask.source | +inquiryTask.package | +
Read an article about Inquiry Class for more information.
+The Report concept has been significantly simplified by removing the resource
property from reportTask
. It's been replaced by name/type/place/labels
properties. Also report.time
is now report.stats.seconds
. The report/reportTask.warnings: List[str]
have been added to provide non-error information like reached limits:
frictionless validate table.csv --yaml
+
+
+ Here is a list of changes:
+From (v4) | +To (v5) | +
---|---|
report.time | +report.stats.seconds | +
reportTask.time | +reportTask.stats.seconds | +
reportTask.resource.name | +reportTask.name | +
reportTask.resource.profile | +reportTask.type | +
reportTask.resource.path | +reportTask.place | +
reportTask.resource.schema | +reportTask.labels | +
Read an article about Report Class for more information.
+Changes in the Schema class:
+From (v4) | +To (v5) | +
---|---|
Schema(descriptor) | +Schema.from_descriptor(descriptor) | +
There are a few changes in the Error data structure:
+From (v4) | +To (v5) | +
---|---|
error.code | +error.type | +
error.name | +error.title | +
error.rowPosition | +error.rowNumber | +
error.fieldPosition | +error.fieldNumber | +
Note that all the metadata entities that have multiple implementations in v5 are based on a unified type model. It means that they use the type
property to provide type information:
From (v4) | +To (v5) | +
---|---|
resource.profile | +resource.type | +
check.code | +check.type | +
control.code | +control.type | +
error.code | +error.type | +
field.type | +field.type | +
step.type | +step.type | +
The new v5 version still supports old notation in descriptors for backward-compatibility.
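A descriptor written in the v4 notation can be brought to the v5 notation with a small rename. The sketch below covers only the `code` to `type` rename from the table above; `upgrade_descriptor` is a hypothetical helper, not part of the framework, which accepts the old notation on its own.

```python
def upgrade_descriptor(descriptor: dict) -> dict:
    """Hypothetical helper: rename the v4 `code` key to the v5 `type` key."""
    upgraded = dict(descriptor)  # leave the input untouched
    if "code" in upgraded:
        code = upgraded.pop("code")
        # An explicit `type` wins if both keys are present
        upgraded.setdefault("type", code)
    return upgraded


print(upgrade_descriptor({"code": "cell-set", "fieldName": "version", "value": "v5"}))
```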
+For many years, and for historical reasons, Frictionless mixed declarative metadata and an object model. Since the first implementation of the datapackage
library we have used different approaches to sync internal state in order to provide both interfaces: descriptor and object model. In Frictionless Framework v4 this technique was taken to a really sophisticated level with special observable dictionary classes. It was quite smart and nice to use for quick prototyping in a REPL, but it was really hard to maintain and error-prone.
In Framework v5 we finally decided to follow the "right way" for handling this problem and split descriptors and object model completely.
+In the Frictionless World we deal with a lot of declarative metadata descriptors such as packages, schemas, pipelines, etc. Nothing changes in v5 regarding this. So for example here is a Table Schema:
+ +fields:
+  - name: id
+    type: integer
+  - name: name
+    type: string
+
+
+ The difference appears when we create a metadata instance based on this descriptor. In v4 all the metadata classes were subclasses of the dict class, providing a mix between a descriptor and an object model for state management. In v5 there is a clear boundary between the descriptor and the object model. All the state is managed as it should be in a normal Python class, using class attributes:
+ +from frictionless import Schema
+
+schema = Schema.from_descriptor('schema.yaml')
+# Here we deal with a proper object model
+descriptor = schema.to_descriptor()
+# Here we export it back to be a descriptor
+
+
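The pattern of keeping the descriptor and the object model apart can be shown with plain dataclasses. This is a toy sketch of the idea, not the framework's implementation: `ToySchema` and `ToyField` are hypothetical classes that only mirror the `from_descriptor` / `to_descriptor` naming used above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyField:
    name: str
    type: str = "any"


@dataclass
class ToySchema:
    fields: List[ToyField] = field(default_factory=list)

    @classmethod
    def from_descriptor(cls, descriptor: Dict[str, Any]) -> "ToySchema":
        # Build typed objects; the descriptor dict is only read, never stored
        return cls(fields=[ToyField(**item) for item in descriptor.get("fields", [])])

    def to_descriptor(self) -> Dict[str, Any]:
        # Export the current state back to plain data
        return {"fields": [{"name": item.name, "type": item.type} for item in self.fields]}


descriptor = {"fields": [{"name": "id", "type": "integer"}, {"name": "name", "type": "string"}]}
schema = ToySchema.from_descriptor(descriptor)
schema.fields[0].type = "string"  # state lives in attributes, not in the dict
print(descriptor["fields"][0]["type"])
print(schema.to_descriptor()["fields"][0]["type"])
```

Because the object never holds a reference to the input dict, changing the object cannot silently change the descriptor, and the reverse is also true.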
+ There are a few important traits of the new model:
+package.add_resource
This separation might require a few additional lines of code, but it gives us much less fragile programs in the end. It's especially important for software integrators who want to be sure that they write working code. At the same time, for quick prototyping and discovery Frictionless still provides high-level actions like the validate
function that are more forgiving regarding user input.
One of the most important consequences of "fixing" state management in Frictionless is our new ability to provide static typing for the framework codebase. This work is in progress, but we have already added a lot of types and the codebase successfully passes pyright
validation. We highly recommend enabling pyright
in your IDE to see type problems in advance:
We're happy to announce that we're finally ready to drop a JavaScript dependency for the docs generation as we migrated it to Livemark. Moreover, Livemark's ability to execute scripts inside the documentation and other nifty features like simple Tabs or a reference generator will save us hours and hours for writing better docs.
+We hope that the Livemark docs writing experience will make our contributors happier and allow us to grow our community of Frictionless Authors and Users. Let's chat in our Slack if you have questions or just want to say hi.
+Read Livemark Docs for more information.
+ ++
+ By Shashi Gharti on 2022-09-07 » + Blog Index +
We are happy to announce the GitHub plugin, which makes sharing data between Frictionless and GitHub easier without any extra work or configuration. All the GitHub plugin functionality is wrapped around the PyGithub library. The main idea is to make the interaction between the framework and GitHub seamless using read and write functions developed on top of the Frictionless Python library. Here is a short introduction and examples of the features. +Reading a package from a GitHub repository is made easy! The existing Package
class can identify the GitHub URL and read the packages and resources from the repo. It can read packages from repos with or without package descriptors. If a package descriptor is not defined, it will create a package descriptor with the resources that it finds in the repo.
from frictionless import Package
+
+package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
+print(package)
+
+
+ Writing and publishing can be done easily by passing the repository link to the publish
function.
from frictionless import Package, portals
+
+apikey = 'YOUR-GITHUB-API-KEY'
+package = Package('data/datapackage.json')
+response = package.publish("https://github.com/fdtester/test-repo-write",
+ control=portals.GithubControl(apikey=apikey)
+ )
+
+
+ A Catalog can be created from a single repository by using 'search' queries. Repositories can be searched using a combination of any search text and GitHub qualifiers. A simple example of creating a catalog from a search is as follows:
+ +from frictionless import Catalog, portals
+
+catalog = Catalog(
+ control=portals.GithubControl(search="user:fdtester", per_page=1, page=1),
+ )
+
+
+ We will have more updates in the future and would love to hear from you about this new feature. Let's chat in our Slack if you have questions or just want to say hi.
+Read Github Plugin Docs for more information.
+ ++
+ By Shashi Gharti on 2022-11-07 » + Blog Index +
Zenodo integration was a highly requested feature, and we are happy to share our first draft of the plugin, which makes sharing data between Frictionless and Zenodo easier without any extra work or configuration. This plugin uses the zenodopy library underneath to communicate with the Zenodo REST API. A Frictionless user can use the framework functionality and then easily publish data to Zenodo and vice versa. Here is a short description of the features with examples. +You can simply read the package, or create a new package from the Zenodo repository if the package does not exist. No additional configuration is required. The existing Package
class identifies the Zenodo URL and reads the packages and resources from the repo. An example of reading a package from a Zenodo repo is as follows:
from frictionless import Package
+
+package = Package("https://zenodo.org/record/7078760")
+print(package)
+
+
+ Once read you can apply all the available functions to the package such as validation, transformation etc.
+To write the package we can simply use publish
function, which will write the package and resource files to a Zenodo repository. While publishing, we need to provide metadata for the repository, which we pass as a JSON file (metadata.json in the example below):
from frictionless import Package, portals
+
+apikey = 'YOUR-ZENODO-API-KEY'  # placeholder: replace with your own token
+
+control = portals.ZenodoControl(
+ metafn="data/zenodo/metadata.json",
+ apikey=apikey
+)
+package = Package("data/datapackage.json")
+deposition_id = package.publish(control=control)
+print(deposition_id)
+
+
+
+ Once the package is published, deposition_id will be returned.
+A Catalog can be created from a single repository or from multiple repositories. Repositories can be searched using any search terms, phrases, field searches, or a combination of these. A simple example of creating a catalog from a search is as follows:
+ +from frictionless import Catalog, portals
+control=portals.ZenodoControl(search='title:"open science"')
+catalog = Catalog(
+ control=control,
+ )
+
+
+ We will have more updates in the future and would love to hear from you about this new feature. Let's chat in our Slack if you have questions or just want to say hi.
+Read Zenodo Plugin Docs for more information.
+ ++ + By Shashi Gharti + on 2022-11-07 + +
+ This blog gives the introduction of the zenodo plugin which helps to easily read data from and write data to Zenodo. + Read more » ++ + By Shashi Gharti + on 2022-09-07 + +
+ This blog gives the introduction of the github plugin which helps to seamlessly transfer/read data to/from Github. + Read more » ++ + By Evgeny Karev + on 2022-08-22 + +
+ Since the initial Frictionless Framework release we'd been collecting feedback and analyzing both high-level users' needs and bug reports to identify shortcomings and areas that can be improved in the next version of the framework. + Read more » +This guide provides a high-level overview of the Frictionless Framework architecture. It will be useful for plugin authors and advanced users.
+Frictionless uses a modular approach for its architecture. During reading, a data source goes through various subsystems which are selected depending on the data characteristics:
+ + +Frictionless is built on top of a powerful plugin system which is used internally and allows extending the framework.
+To create a plugin you need:
+frictionless_<name>
available in PYTHONPATH
+Please consult System/Plugin for detailed information about the Plugin interface and how these methods can be implemented.
+Let's say we're interested in supporting the csv2k
format that we have just invented. For simplicity, let's use a format that is exactly the same as CSV.
First of all, we need to create a frictionless_csv2k
module containing a Plugin implementation and a Parser implementation but we're going to re-use the CsvParser as our new format is the same:
++ +frictionless_csv2k.py
+
from frictionless import Plugin, system
+from frictionless.plugins.csv import CsvParser
+
+class Csv2kPlugin(Plugin):
+    def create_parser(self, resource):
+        if resource.format == "csv2k":
+            return Csv2kParser(resource)
+
+class Csv2kParser(CsvParser):
+    pass
+
+system.register('csv2k', Csv2kPlugin())
+
+
+ Now, we can use our new format in any of the Frictionless functions that accept a table source, for example, extract
or Table
:
from frictionless import extract
+
+rows = extract('data/table.csv2k')
+print(rows)
+
+
+ This example is over-simplified to show the high-level mechanics but writing Frictionless Plugins is designed to be easy. For inspiration, you can check the frictionless/plugins
directory and learn from real-life examples. Also, in the Frictionless codebase there are many Check
, Control
, Dialect
, Loader
, Parser
, and Server
implementations - you can read their code for better understanding of how to write your own subclass or reach out to us for support.
Plugin representation + +It's an interface for writing Frictionless plugins. +You can implement one or more methods to hook into Frictionless system.
+Create adapter
+(source: Any, *, control: Optional[Control] = None, basepath: Optional[str] = None, packagify: bool = False) -> Optional[Adapter]
+Create loader
+(resource: Resource) -> Optional[Loader]
+Create parser
+(resource: Resource) -> Optional[Parser]
+Detect field candidates
+(candidates: List[dict[str, Any]]) -> None
+Hook into resource detection
+(resource: Resource) -> None
+The most important underlying object in the Frictionless Framework is system
. It's a singleton object available as frictionless.system
.
Using the system
object a user can alter the execution context. It uses a Python context manager, so it can be used in any way that is possible in Python; for example, it can be nested or combined.
If data or metadata comes from a trusted origin, it's possible to disable safety checks for paths:
+with system.use_context(trusted=True):
+ extract('/path/to/file/is/absolute.csv')
+
+To raise warnings or errors on data problems, it's possible to use the onerror
context value. It defaults to ignore
and can be set to warn
or error
:
with system.use_context(onerror='error'):
+ extract('table-with-error-will-raise-an-exeption.csv')
+
+By default, the framework uses the upcoming v2
version of the standards for outputting metadata. It's possible to alter this behaviour:
with system.use_context(standards='v1'):
+ describe('metadata-will-be-in-v1.csv')
+
+It's possible to provide a custom requests.Session
:
session = requests.Session()
+with system.use_context(http_session=session):
+    with Resource(BASEURL % "data/table.csv") as resource:
+        assert resource.header == ["id", "name"]
+
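The nesting behaviour described above can be sketched with `contextlib`. This is an illustrative toy, not the framework's code: `ToySystem` is a hypothetical class that only models two of the context values mentioned in this section and restores the previous values when a block exits.

```python
from contextlib import contextmanager


class ToySystem:
    """Hypothetical stand-in that models two context values."""

    def __init__(self):
        self.trusted = False
        self.onerror = "ignore"

    @contextmanager
    def use_context(self, **options):
        # Remember the current values so they can be restored on exit
        previous = {name: getattr(self, name) for name in options}
        for name, value in options.items():
            setattr(self, name, value)
        try:
            yield
        finally:
            for name, value in previous.items():
                setattr(self, name, value)


toy = ToySystem()
with toy.use_context(trusted=True):
    with toy.use_context(onerror="warn"):
        inner = (toy.trusted, toy.onerror)
    middle = (toy.trusted, toy.onerror)
outer = (toy.trusted, toy.onerror)
print(inner, middle, outer)
```

Restoring values in a `finally` block keeps the outer context intact even if the code inside the `with` block raises.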
+This object can be used to instantiate different kinds of lower-level objects such as Check
, Step
, or Field
. Here is a quick example of using the system
object:
from frictionless import Resource, system
+
+# Create
+
+adapter = system.create_adapter(source, control=control)
+loader = system.create_loader(resource)
+parser = system.create_parser(resource)
+
+# Detect
+
+system.detect_resource(resource)
+field_candidates = system.detect_field_candidates()
+
+# Select
+
+Check = system.selectCheck('type')
+Control = system.selectControl('type')
+Error = system.selectError('type')
+Field = system.selectField('type')
+Step = system.selectStep('type')
+
+
+ As an extension author you might use the system
object in various cases. For example, take a look at this MultipartLoader
excerpt:
def read_line_stream(self):
+    for number, path in enumerate(self.__path, start=1):
+        resource = Resource(path=path)
+        resource.infer(sample=False)
+        with system.create_loader(resource) as loader:
+            for line_number, line in enumerate(loader.byte_stream, start=1):
+                if not self.__headless and number > 1 and line_number == 1:
+                    continue
+                yield line
+
+
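The header-skipping logic of the excerpt can be tried in isolation. In the sketch below the loaders are replaced by in-memory lists of byte lines, so it runs without the framework; the standalone `read_line_stream` function and its `parts` and `headless` parameters are simplifications made for illustration.

```python
def read_line_stream(parts, headless=False):
    """Chain the lines of several parts, keeping only the first part's header."""
    for number, lines in enumerate(parts, start=1):
        for line_number, line in enumerate(lines, start=1):
            # Every part after the first repeats the header on its first line
            if not headless and number > 1 and line_number == 1:
                continue
            yield line


parts = [
    [b"id,name\n", b"1,english\n"],
    [b"id,name\n", b"2,german\n"],
]
lines = list(read_line_stream(parts))
print(lines)
```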
+ It's important to understand that it is generally more correct to create low-level objects using the system
object than by instantiating classes directly, because it will include all the available plugins in the process.
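Why going through a central object picks up plugins can be shown with a small registry. This is a simplified sketch under stated assumptions: `ToyRegistry` and the toy plugins are hypothetical, parsers are represented by plain strings, and plugins are asked in registration order, which may differ from what the framework does.

```python
class ToyPlugin:
    """Hypothetical base class: hooks return None when they do not apply."""

    def create_parser(self, resource):
        return None


class ToyCsvPlugin(ToyPlugin):
    def create_parser(self, resource):
        if resource["format"] == "csv":
            return "csv-parser"


class ToyCsv2kPlugin(ToyPlugin):
    def create_parser(self, resource):
        if resource["format"] == "csv2k":
            return "csv2k-parser"


class ToyRegistry:
    def __init__(self):
        self.plugins = {}

    def register(self, name, plugin):
        self.plugins[name] = plugin

    def create_parser(self, resource):
        # Ask each registered plugin until one of them handles the resource
        for plugin in self.plugins.values():
            parser = plugin.create_parser(resource)
            if parser is not None:
                return parser
        raise LookupError(f"no parser for format {resource['format']!r}")


registry = ToyRegistry()
registry.register("csv", ToyCsvPlugin())
registry.register("csv2k", ToyCsv2kPlugin())
parser = registry.create_parser({"format": "csv2k"})
print(parser)
```

Code that instantiated a parser class directly would never see the `csv2k` plugin, while code that asks the registry gets it as soon as it is registered.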
The Plugin API almost fully follows the system object's API. So as a plugin author you need to hook into the same methods. For example, let's take a look at a builtin Csv Plugin:
+ +class CsvPlugin(Plugin):
+    """Plugin for CSV"""
+
+    # Hooks
+
+    def create_parser(self, resource: Resource):
+        if resource.format in ["csv", "tsv"]:
+            return CsvParser(resource)
+
+    def detect_resource(self, resource: Resource):
+        if resource.format in ["csv", "tsv"]:
+            resource.type = "table"
+            resource.mediatype = f"text/{resource.format}"
+
+    def select_Control(self, type: str):
+        if type == "csv":
+            return CsvControl
+
+
+ Loader representation
+(resource: Resource)
++ Specifies if the resource is remote. +
+bool
+types.IBuffer
+Resource byte stream + +The stream is available after opening the loader
+types.IByteStream
+Whether the loader is closed
+bool
+Resource
+Resource text stream + +The stream is available after opening the loader
+types.ITextStream
+Close the loader as "filelike.close" does
+() -> None
+Open the loader as "io.open" does
+Read bytes stream
+() -> types.IByteStream
+Detect metadata using sample
+(buffer: bytes)
+Buffer byte stream
+(byte_stream: types.IByteStream)
+Create bytes stream
+() -> types.IByteStream
+Decompress byte stream
+(byte_stream: types.IByteStream) -> types.IByteStream
+Process byte stream
+(byte_stream: types.IByteStream) -> ByteStreamWithStatsHandling
+Read text stream
+Write from a temporary file
+(path: str) -> Any
+Create byte stream for writing
+(path: str) -> types.IByteStream
+Store byte stream
+(byte_stream: types.IByteStream) -> Any
+Parser representation
+(resource: Resource)
++ Specifies if parser requires the loader to load the + data. +
+ClassVar[bool]
++ Data types supported by the parser. +
+ClassVar[List[str]]
+types.ICellStream
+Whether the parser is closed
+bool
+Loader
+Resource
+types.ISample
+Close the parser as "filelike.close" does
+() -> None
+Open the parser as "io.open" does
+Read list stream
+() -> types.ICellStream
+Create list stream from loader
+() -> types.ICellStream
+Wrap list stream into error handler
+(cell_stream: types.ICellStream) -> CellStreamWithErrorHandling
+Create and open loader
+() -> Optional[Loader]
+Write row stream from the source resource
+(source: TableResource) -> Any
+Plugin representation + +It's an interface for writing Frictionless plugins. +You can implement one or more methods to hook into Frictionless system.
+Create adapter
+(source: Any, *, control: Optional[Control] = None, basepath: Optional[str] = None, packagify: bool = False) -> Optional[Adapter]
+Create loader
+(resource: Resource) -> Optional[Loader]
+Create parser
+(resource: Resource) -> Optional[Parser]
+Detect field candidates
+(candidates: List[dict[str, Any]]) -> None
+Hook into resource detection
+(resource: Resource) -> None
+System representation + +This class provides an ability to make system Frictionless calls. +It's available as `frictionless.system` singleton.
++ A flag that indicates if resource, path or package is trusted. +
+ClassVar[List[str]]
++ A flag that indicates if resource, path or package is trusted. +
+bool
++ Type of action to take on Error such as "warn", "raise" or "ignore". +
+types.IOnerror
++ Setting this value user can use feature of the specific version. + The default value is v2. +
+types.IStandards
+Return a HTTP session + +This method will return a new session or the session +from `system.use_http_session` context manager
+Create adapter
+(source: Any, *, control: Optional[Control] = None, basepath: Optional[str] = None, packagify: bool = False) -> Optional[Adapter]
+Create loader
+(resource: Resource) -> Loader
+Create parser
+(resource: Resource) -> Parser
+Deregister a plugin
+(name: str)
+Create candidates
+() -> List[dict[str, Any]]
+Hook into resource detection
+(resource: Resource) -> None
+Register a plugin
+(name: str, plugin: Plugin)
+Let's start with an example dataset. We will look at a few raw data files that have recently been collected by an anthropologist. The anthropologist wants to publish this data in an open repository so her colleagues can also use this data. Before publishing the data, she wants to add metadata and check the data for errors. We are here to help, so let’s start by exploring the data. We see that the quality of data is far from perfect. In fact, the first row contains comments from the anthropologist! To be able to use this data, we need to clean it up a bit.
+++ +Download
+countries.csv
to reproduce the examples (right-click and "Save link as").
cat countries.csv
+
+
+# clean this data!
+id,neighbor_id,name,population
+1,Ireland,Britain,67
+2,3,France,n/a,find the population
+3,22,Germany,83
+4,,Italy,60
+5
+
+ with open('countries.csv') as file:
+ print(file.read())
+
+
+# clean this data!
+id,neighbor_id,name,population
+1,Ireland,Britain,67
+2,3,France,n/a,find the population
+3,22,Germany,83
+4,,Italy,60
+5
+
+ As we can see, this is data containing information about European countries and their populations. Also, it looks like there are two fields having a relationship based on a country's identifier: neighbor_id is a Foreign Key to id.
+First of all, we're going to describe our dataset. Frictionless uses the powerful Frictionless Data Specifications. They are very handy to describe:
+Let's describe the countries
table:
frictionless describe countries.csv # optionally add --stats to get statistics
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ countries │ table │ countries.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ countries
+┏━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ neighbor_id ┃ name ┃ population ┃
+┡━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
+│ integer │ string │ string │ string │
+└─────────┴─────────────┴────────┴────────────┘
+
+ from pprint import pprint
+from frictionless import describe
+
+resource = describe('countries.csv')
+pprint(resource)
+
+
+{'name': 'countries',
+ 'type': 'table',
+ 'path': 'countries.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'encoding': 'utf-8',
+ 'dialect': {'headerRows': [2]},
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'neighbor_id', 'type': 'string'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'string'}]}}
+
+ As we can see, Frictionless was smart enough to understand that the first row contains a comment. It's good, but we still have a few problems:
+- n/a is used as a missing values marker
+- neighbor_id must be numerical: let's edit the schema
+- population must be numerical: setting proper missing values will solve it
+- there is a relation between the id and neighbor_id fields
+Let's update our metadata and save it to the disk:
+++ +Open this file in your favorite editor and update it as shown below
+
frictionless describe countries.csv --yaml > countries.resource.yaml
+editor countries.resource.yaml
+
+
+ from frictionless import Detector, describe
+
+detector = Detector(field_missing_values=["", "n/a"])
+resource = describe("countries.csv", detector=detector)
+resource.schema.set_field_type("neighbor_id", "integer")
+resource.schema.foreign_keys.append(
+ {"fields": ["neighbor_id"], "reference": {"resource": "", "fields": ["id"]}}
+)
+resource.to_yaml("countries.resource.yaml")
+
+
+ Let's see what we have created:
+ +cat countries.resource.yaml
+
+
+name: countries
+type: table
+path: countries.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+ headerRows:
+ - 2
+schema:
+ fields:
+ - name: id
+ type: integer
+ - name: neighbor_id
+ type: integer
+ - name: name
+ type: string
+ - name: population
+ type: integer
+ missingValues:
+ - ''
+ - n/a
+ foreignKeys:
+ - fields:
+ - neighbor_id
+ reference:
+ resource: ''
+ fields:
+ - id
+
+ with open('countries.resource.yaml') as file:
+ print(file.read())
+
+
+name: countries
+type: table
+path: countries.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+ headerRows:
+ - 2
+schema:
+ fields:
+ - name: id
+ type: integer
+ - name: neighbor_id
+ type: integer
+ - name: name
+ type: string
+ - name: population
+ type: integer
+ missingValues:
+ - ''
+ - n/a
+ foreignKeys:
+ - fields:
+ - neighbor_id
+ reference:
+ resource: ''
+ fields:
+ - id
+
+ It has the same metadata as we saw above but also includes our edits related to missing values and data types. We didn't change all the wrong data types manually because providing proper missing values fixed them automatically. Now we have a resource descriptor. In the next section, we will show why metadata matters and how to use it.
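To see why declaring missing values and a foreign key matters, here is a standard-library sketch that applies the same ideas to the sample rows by hand. It is an approximation for illustration only: `cast_integer` is a hypothetical helper that maps both missing markers and non-integer cells to `None`, whereas the framework reports type errors separately during validation.

```python
MISSING_VALUES = ["", "n/a"]


def cast_integer(cell):
    """Hypothetical helper: return an int, or None for missing or invalid cells."""
    if cell in MISSING_VALUES:
        return None
    try:
        return int(cell)
    except ValueError:
        return None


raw_rows = [
    {"id": "1", "neighbor_id": "Ireland", "population": "67"},
    {"id": "2", "neighbor_id": "3", "population": "n/a"},
    {"id": "3", "neighbor_id": "22", "population": "83"},
    {"id": "4", "neighbor_id": "", "population": "60"},
]
rows = [{key: cast_integer(value) for key, value in row.items()} for row in raw_rows]

# Foreign key check: every non-missing neighbor_id must match an existing id
ids = {row["id"] for row in rows}
dangling = [
    row["neighbor_id"]
    for row in rows
    if row["neighbor_id"] is not None and row["neighbor_id"] not in ids
]
print(rows)
print(dangling)
```

The single dangling reference, `22`, is exactly the kind of problem that the `foreignKeys` entry in the descriptor lets a validator detect.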
+It's time to try extracting our data as a table. As a first naive attempt, we will ignore the metadata we saved in the previous step:
+ +frictionless extract countries.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ countries │ table │ countries.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ countries
+┏━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ neighbor_id ┃ name ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1 │ Ireland │ Britain │ 67 │
+│ 2 │ 3 │ France │ n/a │
+│ 3 │ 22 │ Germany │ 83 │
+│ 4 │ None │ Italy │ 60 │
+│ 5 │ None │ None │ None │
+└────┴─────────────┴─────────┴────────────┘
+
+ from pprint import pprint
+from frictionless import extract
+
+rows = extract('countries.csv')
+pprint(rows)
+
+
+{'countries': [{'id': 1,
+ 'name': 'Britain',
+ 'neighbor_id': 'Ireland',
+ 'population': '67'},
+ {'id': 2,
+ 'name': 'France',
+ 'neighbor_id': '3',
+ 'population': 'n/a'},
+ {'id': 3,
+ 'name': 'Germany',
+ 'neighbor_id': '22',
+ 'population': '83'},
+ {'id': 4,
+ 'name': 'Italy',
+ 'neighbor_id': None,
+ 'population': '60'},
+ {'id': 5,
+ 'name': None,
+ 'neighbor_id': None,
+ 'population': None}]}
+
+ Actually, it doesn't look terrible, but in reality, data like this is not quite useful:
+The output of the extract is UTF-8 encoded. Let's use the metadata we saved to try extracting the data with the help of the Frictionless Data specifications:
+ +frictionless extract countries.resource.yaml
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ countries │ table │ countries.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ countries
+┏━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ neighbor_id ┃ name ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1 │ None │ Britain │ 67 │
+│ 2 │ 3 │ France │ None │
+│ 3 │ 22 │ Germany │ 83 │
+│ 4 │ None │ Italy │ 60 │
+│ 5 │ None │ None │ None │
+└────┴─────────────┴─────────┴────────────┘
+
+ from pprint import pprint
+from frictionless import extract
+
+rows = extract('countries.resource.yaml')
+pprint(rows)
+
+
+{'countries': [{'id': 1,
+ 'name': 'Britain',
+ 'neighbor_id': None,
+ 'population': 67},
+ {'id': 2,
+ 'name': 'France',
+ 'neighbor_id': 3,
+ 'population': None},
+ {'id': 3,
+ 'name': 'Germany',
+ 'neighbor_id': 22,
+ 'population': 83},
+ {'id': 4,
+ 'name': 'Italy',
+ 'neighbor_id': None,
+ 'population': 60},
+ {'id': 5,
+ 'name': None,
+ 'neighbor_id': None,
+ 'population': None}]}
+
+  It's now much better! Numeric fields are parsed as numbers, and there are no more textual missing-value markers. We can't see it in the command line, but missing values are now None
 values in Python, and the data can be exported, e.g., to SQL. However, it's still not ready to be published. In the next section, we will validate it!
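The effect of the metadata on a single row can be sketched in plain Python. This is a minimal illustration, not the framework's implementation: `cast_cell` is a hypothetical helper, and the missing-value markers and field types are copied from the descriptor shown earlier.

```python
# Hypothetical helper (not part of frictionless): cast one raw CSV cell
# using a field type and a list of missing-value markers.
MISSING_VALUES = ["", "n/a"]


def cast_cell(raw, field_type, missing_values=MISSING_VALUES):
    # Missing-value markers become None before any type casting.
    if raw is None or raw in missing_values:
        return None
    if field_type == "integer":
        try:
            return int(raw)
        except ValueError:
            # Assumption: an uncastable cell is shown as None on extraction.
            return None
    return raw


raw_row = {"id": "2", "neighbor_id": "3", "name": "France", "population": "n/a"}
types = {"id": "integer", "neighbor_id": "integer", "name": "string", "population": "integer"}
typed_row = {name: cast_cell(value, types[name]) for name, value in raw_row.items()}
print(typed_row)
```

The `n/a` marker ends up as `None` and the numeric strings become integers, which mirrors the difference between the two extraction outputs.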
Data validation with Frictionless is as easy as describing or extracting data:
+ +frictionless validate countries.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃ status ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ countries │ table │ countries.csv │ INVALID │
+└───────────┴───────┴───────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ countries
+┏━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ 4 │ 5 │ extra-cell │ Row at position "4" has an extra value in field │
+│ │ │ │ at position "5" │
+│ 7 │ 2 │ missing-cell │ Row at position "7" has a missing cell in field │
+│ │ │ │ "neighbor_id" at position "2" │
+│ 7 │ 3 │ missing-cell │ Row at position "7" has a missing cell in field │
+│ │ │ │ "name" at position "3" │
+│ 7 │ 4 │ missing-cell │ Row at position "7" has a missing cell in field │
+│ │ │ │ "population" at position "4" │
+└─────┴───────┴──────────────┴─────────────────────────────────────────────────┘
+
+ from pprint import pprint
+from frictionless import validate
+
+report = validate('countries.csv')
+pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
+
+
+[[4, 5, 'extra-cell'],
+ [7, 2, 'missing-cell'],
+ [7, 3, 'missing-cell'],
+ [7, 4, 'missing-cell']]
+
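The extra-cell and missing-cell errors reported above come down to comparing the number of cells in each row with the number of header fields. The following is a rough sketch under that assumption; the row contents are hypothetical, since only the cell counts matter here, and it is not the framework's actual code.

```python
# Hypothetical rows keyed by row position; only the cell counts matter.
header = ["id", "neighbor_id", "name", "population"]
rows = {
    4: ["2", "3", "France", "n/a", "extra"],  # one cell too many
    7: ["5"],  # three cells too few
}

errors = []
for row_position, cells in rows.items():
    # Cells beyond the header width are reported as extra.
    for field_position in range(len(header) + 1, len(cells) + 1):
        errors.append([row_position, field_position, "extra-cell"])
    # Header fields without a matching cell are reported as missing.
    for field_position in range(len(cells) + 1, len(header) + 1):
        errors.append([row_position, field_position, "missing-cell"])

print(errors)
```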
+  Ahh, we saw that coming. The data is not valid; there are some missing and extra cells. But wait a minute: in the first step, we created a metadata file with more information about our table. Let's use it.
+ +frictionless validate countries.resource.yaml
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃ status ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ countries │ table │ countries.csv │ INVALID │
+└───────────┴───────┴───────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ countries
+┏━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ 3 │ 2 │ type-error │ Type error in the cell "Ireland" in row "3" and │
+│ │ │ │ field "neighbor_id" at position "2": type is │
+│ │ │ │ "integer/default" │
+│ 4 │ 5 │ extra-cell │ Row at position "4" has an extra value in field │
+│ │ │ │ at position "5" │
+│ 5 │ None │ foreign-key │ Row at position "5" violates the foreign key: │
+│ │ │ │ for "neighbor_id": values "22" not found in the │
+│ │ │ │ lookup table "" as "id" │
+│ 7 │ 2 │ missing-cell │ Row at position "7" has a missing cell in field │
+│ │ │ │ "neighbor_id" at position "2" │
+│ 7 │ 3 │ missing-cell │ Row at position "7" has a missing cell in field │
+│ │ │ │ "name" at position "3" │
+│ 7 │ 4 │ missing-cell │ Row at position "7" has a missing cell in field │
+│ │ │ │ "population" at position "4" │
+└─────┴───────┴──────────────┴─────────────────────────────────────────────────┘
+
+ from pprint import pprint
+from frictionless import validate
+
+report = validate('countries.resource.yaml')
+pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
+
+
+[[3, 2, 'type-error'],
+ [4, 5, 'extra-cell'],
+ [5, None, 'foreign-key'],
+ [7, 2, 'missing-cell'],
+ [7, 3, 'missing-cell'],
+ [7, 4, 'missing-cell']]
+
+  Now it looks even worse, but when it comes to data validation errors, the more we find, the better. Thanks to the metadata, we were able to reveal some critical errors:
+a type error: Ireland instead of an id in neighbor_id
+a relation error between id and neighbor_id: we don't have a country with id 22
+In the next section, we will clean up the data.
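The foreign-key error can be reproduced with a few lines of plain Python. This is only a sketch of the idea behind a self-referencing foreign key (the empty `resource: ''` in the descriptor), using the typed values from the extraction step; it is not how the framework implements the check.

```python
# Typed rows as produced by the metadata-aware extraction above.
rows = [
    {"id": 1, "neighbor_id": None},
    {"id": 2, "neighbor_id": 3},
    {"id": 3, "neighbor_id": 22},
    {"id": 4, "neighbor_id": None},
]

# A self-referencing foreign key: every neighbor_id must exist as an id.
known_ids = {row["id"] for row in rows}
violations = [
    row
    for row in rows
    if row["neighbor_id"] is not None and row["neighbor_id"] not in known_ids
]
print(violations)
```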
+We will use metadata to fix all the data type problems automatically. The only two things we need to handle manually:
+cat > countries.pipeline.yaml <<EOF
+steps:
+ - type: cell-replace
+ fieldName: neighbor_id
+ pattern: '22'
+ replace: '2'
+ - type: cell-replace
+ fieldName: population
+ pattern: 'n/a'
+ replace: '67'
+ - type: row-filter
+ formula: population
+ - type: field-update
+ name: neighbor_id
+ descriptor:
+ type: integer
+ - type: field-update
+ name: population
+ descriptor:
+ type: integer
+ - type: table-normalize
+ - type: table-write
+ path: countries-cleaned.csv
+EOF
+frictionless transform countries.csv --pipeline countries.pipeline.yaml
+
+
+## Schema
+
++-------------+---------+------------+
+| name | type | required |
++=============+=========+============+
+| id | integer | |
++-------------+---------+------------+
+| neighbor_id | integer | |
++-------------+---------+------------+
+| name | string | |
++-------------+---------+------------+
+| population | integer | |
++-------------+---------+------------+
+
+## Table
+
++----+-------------+---------+------------+
+| id | neighbor_id | name | population |
++====+=============+=========+============+
+| 1 | None | Britain | 67 |
++----+-------------+---------+------------+
+| 2 | 3 | France | 67 |
++----+-------------+---------+------------+
+| 3 | 2 | Germany | 83 |
++----+-------------+---------+------------+
+| 4 | None | Italy | 60 |
++----+-------------+---------+------------+
+
+ from pprint import pprint
+from frictionless import Resource, Pipeline, describe, transform, steps
+
+pipeline = Pipeline(steps=[
+ steps.cell_replace(field_name='neighbor_id', pattern='22', replace='2'),
+ steps.cell_replace(field_name='population', pattern='n/a', replace='67'),
+ steps.row_filter(formula='population'),
+ steps.field_update(name='neighbor_id', descriptor={"type": "integer"}),
+ steps.table_normalize(),
+ steps.table_write(path="countries-cleaned.csv"),
+])
+
+source = Resource('countries.csv')
+target = source.transform(pipeline)
+pprint(target.read_rows())
+
+
+[{'id': 1, 'neighbor_id': None, 'name': 'Britain', 'population': '67'},
+ {'id': 2, 'neighbor_id': 3, 'name': 'France', 'population': '67'},
+ {'id': 3, 'neighbor_id': 2, 'name': 'Germany', 'population': '83'},
+ {'id': 4, 'neighbor_id': None, 'name': 'Italy', 'population': '60'}]
+
+ Finally, we've got the cleaned version of our data, which can be exported to a database or published. We have used a CSV as an output format but could have used Excel, JSON, SQL, and others.
+ +cat countries-cleaned.csv
+
+
+id,neighbor_id,name,population
+1,,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,,Italy,60
+
+ with open('countries-cleaned.csv') as file:
+ print(file.read())
+
+
+id,neighbor_id,name,population
+1,,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,,Italy,60
+
+ Basically, that's it; now, we have a valid data file and a corresponding metadata file. It can be shared with other people or stored without fear of type errors or other problems making research data not reproducible.
+ +ls countries-cleaned.*
+
+
+countries-cleaned.csv
+
+ import os
+
+files = [f for f in os.listdir('.') if os.path.isfile(f) and f.startswith('countries-cleaned.')]
+print(files)
+
+
+['countries-cleaned.csv']
+
+ In the next articles, we will explore more advanced Frictionless functionality.
+ +The Baseline Check is always enabled. It makes various small checks that reveal a great deal of tabular errors. You can create an empty Checklist
to see the baseline check scope:
++ +Download
+capital-invalid.csv
 to reproduce the examples (right-click and "Save link as").
from pprint import pprint
+from frictionless import Checklist, validate
+
+checklist = Checklist()
+pprint(checklist.scope)
+report = validate('capital-invalid.csv') # we don't pass the checklist as the empty one is default
+pprint(report.flatten(['type', 'message']))
+
+
+['hash-count',
+ 'byte-count',
+ 'field-count',
+ 'row-count',
+ 'blank-header',
+ 'extra-label',
+ 'missing-label',
+ 'blank-label',
+ 'duplicate-label',
+ 'incorrect-label',
+ 'blank-row',
+ 'primary-key',
+ 'foreign-key',
+ 'extra-cell',
+ 'missing-cell',
+ 'type-error',
+ 'constraint-error',
+ 'unique-error']
+[['duplicate-label',
+ 'Label "name" in the header at position "3" is duplicated to a label: at '
+ 'position "2"'],
+ ['missing-cell',
+ 'Row at position "10" has a missing cell in field "name2" at position "3"'],
+ ['blank-row', 'Row at position "11" is completely blank'],
+ ['type-error',
+ 'Type error in the cell "x" in row "12" and field "id" at position "1": type '
+ 'is "integer/default"'],
+ ['extra-cell',
+ 'Row at position "12" has an extra value in field at position "4"']]
+
+  The Baseline Check is incorporated into base Frictionless classes such as Resource, Header, and Row. The errors are not revealed in any particular order because the check is highly optimized. One should consider the Baseline Check as one unit of validation.
+Check a table for basic errors + +This check is enabled by default for any `validate` function run.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
+If your data must not contain non-ASCII characters, this check reports any that are found during validation. Here is how we can use it.
+from pprint import pprint
+from frictionless import validate, checks
+
+source=[["s.no","code"],[1,"ssµ"]]
+report = validate(source, checks=[checks.ascii_value()])
+pprint(report.flatten(["type", "message"]))
+
+
+[['ascii-value',
+ 'The cell ssµ in row at position 2 and field code at position 2 has an '
+ 'error: the cell contains non-ascii characters']]
+
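For intuition, the same condition can be tested with Python's built-in `str.isascii` (Python 3.7+). This is a minimal sketch of the idea, not the check's actual implementation, and the tuple layout of the result is an assumption made for the example.

```python
source = [["s.no", "code"], [1, "ssµ"]]
header, *rows = source

# Row positions start at 2 because the header occupies position 1.
non_ascii = [
    (row_position, field_name, cell)
    for row_position, row in enumerate(rows, start=2)
    for field_name, cell in zip(header, row)
    if isinstance(cell, str) and not cell.isascii()
]
print(non_ascii)
```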
+ Check whether all the string characters in the data are ASCII + +This check can be enabled using the `checks` parameter +for the `validate` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
+This check identifies cells that deviate from the normal ones. To flag a deviated cell, the check compares the number of characters in each cell with a threshold value. The threshold is either 5000 or a value calculated using Python's built-in statistics
 module: the average plus three standard deviations. The exact algorithm can be found here. For example:
++ +Download
+issue-1066.csv
 to reproduce the examples (right-click and "Save link as").
from pprint import pprint
+from frictionless import validate, checks
+
+report = validate("issue-1066.csv", checks=[checks.deviated_cell()])
+pprint(report.flatten(["type", "message"]))
+
+
+[['deviated-cell',
+ 'There is a possible error because the cell is deviated: cell at row "35" '
+ 'and field "Gestore" has deviated size']]
+
+ Check if the cell size is deviated
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, interval: int = 3, ignore_fields: List[str] = NOTHING) -> None
++ Interval specifies number of standard deviation away from the center. + The median is used to find the center of the data. The default value + is 3. +
+int
++ List of data columns to be skipped by check. To all the data columns + listed here, check will not be applied. The default value is []. +
+List[str]
+This check uses Python's built-in statistics
module to check a field's data for deviations. By default, deviated values are outside of the average +- three standard deviations. Take a look at the API Reference for more details about available options and default values. The exact algorithm can be found here. For example:
from pprint import pprint
+from frictionless import validate, checks
+
+source = [["temperature"], [1], [-2], [7], [0], [1], [2], [5], [-4], [1000], [8], [3]]
+report = validate(source, checks=[checks.deviated_value(field_name="temperature")])
+pprint(report.flatten(["type", "message"]))
+
+
+[['deviated-value',
+ 'There is a possible error because the value is deviated: value "1000" in '
+ 'row at position "10" and field "temperature" is deviated "[-809.88, '
+ '995.52]"']]
+
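The interval in the message above can be recomputed with the `statistics` module. This sketch assumes the sample standard deviation (`statistics.stdev`) and the default `mean` average, which reproduces the reported bounds for this data; treat it as an illustration rather than the check's exact code.

```python
import statistics

values = [1, -2, 7, 0, 1, 2, 5, -4, 1000, 8, 3]
interval = 3  # number of standard deviations, the documented default

center = statistics.mean(values)
spread = statistics.stdev(values)  # assumption: sample standard deviation
lower = center - interval * spread
upper = center + interval * spread

# Values outside [lower, upper] are considered deviated.
deviated = [value for value in values if not lower <= value <= upper]
print(round(lower, 2), round(upper, 2), deviated)
```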
+ Check for deviated values in a field.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_name: str, interval: int = 3, average: str = mean) -> None
++ Name of the field to which the check will be applied. + Check will not be applied to fields other than this. +
+str
++ Interval specifies number of standard deviation away from the mean. + The default value is 3. +
+int
++ It specifies preferred method to calculate average of the data. + Default value is "mean". Supported average calculation methods + are "mean", "median", and "mode". +
+str
+This check ensures that a field doesn't contain any forbidden (denylisted) values.
+from pprint import pprint
+from frictionless import validate, checks
+
+source = b'header\nvalue1\nvalue2'
+checks = [checks.forbidden_value(field_name='header', values=['value2'])]
+report = validate(source, format='csv', checks=checks)
+pprint(report.flatten(['type', 'message']))
+
+
+[['forbidden-value',
+ 'The cell value2 in row at position 3 and field header at position 1 has an '
+ 'error: forbidden values are "[\'value2\']"']]
+
+ Check for forbidden values in a field.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_name: str, values: List[Any]) -> None
++ The name of the field to apply the check. Check will not be applied to + other fields. +
+str
++ Specify the forbidden values to check for, in the field specified by + "field_name". +
+List[Any]
+This check lets us validate sequential fields such as primary keys or other similar data. The sequence doesn't need to start from 0 or 1. We only need to provide a field name.
+from pprint import pprint
+from frictionless import validate, checks
+
+source = b'header\n2\n3\n5'
+report = validate(source, format='csv', checks=[checks.sequential_value(field_name='header')])
+pprint(report.flatten(['type', 'message']))
+
+
+[['sequential-value',
+ 'The cell 5 in row at position 4 and field header at position 1 has an '
+ 'error: the value is not sequential']]
+
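The rule can be sketched as "each value equals the previous value plus one". The snippet below is an illustration under that assumption; how the real check behaves after the first broken value is not covered here.

```python
cells = [2, 3, 5]  # the column from the example above

errors = []
expected = None
# Row positions start at 2 because the header occupies position 1.
for row_position, cell in enumerate(cells, start=2):
    if expected is not None and cell != expected:
        errors.append((row_position, cell))
    expected = cell + 1

print(errors)
```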
+  Check that a column has sequential values.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_name: str) -> None
++ The name of the field to apply the check. Check will not be + applied to other fields. +
+str
+Sometimes during data export from a database or other storage, data values can be truncated. This check tries to detect such truncation. Let's explore some truncation indicators.
+from pprint import pprint
+from frictionless import validate, checks
+
+source = [["int", "str"], ["a" * 255, 32767], ["good", 2147483647]]
+report = validate(source, checks=[checks.truncated_value()])
+pprint(report.flatten(["type", "message"]))
+
+
+[['truncated-value',
+ 'The cell '
+ 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa '
+ 'in row at position 2 and field int at position 1 has an error: value is '
+ 'probably truncated'],
+ ['truncated-value',
+ 'The cell 32767 in row at position 2 and field str at position 2 has an '
+ 'error: value is probably truncated'],
+ ['truncated-value',
+ 'The cell 2147483647 in row at position 3 and field str at position 2 has an '
+ 'error: value is probably truncated']]
+
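The indicators in this example are values that sit exactly at common storage limits. A rough sketch follows; the sets below contain only the limits visible in the example output, and the full list used by the check is an assumption not confirmed here.

```python
# Assumed indicators, taken from the example above only.
TRUNCATED_STRING_LENGTHS = {255}  # e.g. a VARCHAR(255) column
TRUNCATED_INTEGER_VALUES = {32767, 2147483647}  # 16-bit and 32-bit signed maxima


def looks_truncated(cell):
    """Return True if the cell sits exactly at a known storage limit."""
    if isinstance(cell, str):
        return len(cell) in TRUNCATED_STRING_LENGTHS
    if isinstance(cell, int):
        return cell in TRUNCATED_INTEGER_VALUES
    return False


source = [["a" * 255, 32767], ["good", 2147483647]]
flags = [[looks_truncated(cell) for cell in row] for row in source]
print(flags)
```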
+ Check for possible truncated values + +This check can be enabled using the `checks` parameter +for the `validate` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
+This checks for duplicate rows. You need to take into account that checking for duplicate rows can lead to high memory consumption on big files. Here is an example.
+from pprint import pprint
+from frictionless import validate, checks
+
+source = b"header\nvalue\nvalue"
+report = validate(source, format="csv", checks=[checks.duplicate_row()])
+pprint(report.flatten(["type", "message"]))
+
+
+[['duplicate-row',
+ 'Row at position 3 is duplicated: the same as row at position "2"']]
+
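The memory cost mentioned above follows from the need to remember every distinct row seen so far. A minimal sketch of that idea (not the framework's implementation; a real check might store hashes instead of full rows):

```python
rows = [["value"], ["value"], ["other"]]

seen = {}  # row content -> position of its first occurrence
duplicates = []
# Row positions start at 2 because the header occupies position 1.
for row_position, row in enumerate(rows, start=2):
    key = tuple(row)
    if key in seen:
        duplicates.append((row_position, seen[key]))
    else:
        seen[key] = row_position

print(duplicates)
```

The `seen` dictionary grows with the number of distinct rows, which is why large files can be expensive to check.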
+ Check for duplicate rows + +This check can be enabled using the `checks` parameter +for the `validate` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
+This check is the most powerful one as it uses the external simpleeval
 package, which allows you to evaluate arbitrary Python expressions on data rows. Here is an example.
from pprint import pprint
+from frictionless import validate, checks
+
+source = [
+ ["row", "salary", "bonus"],
+ [2, 1000, 200],
+ [3, 2500, 500],
+ [4, 1300, 500],
+ [5, 5000, 1000],
+]
+report = validate(source, checks=[checks.row_constraint(formula="salary == bonus * 5")])
+pprint(report.flatten(["type", "message"]))
+
+
+[['row-constraint',
+ 'The row at position 4 has an error: the row constraint to conform is '
+ '"salary == bonus * 5"']]
+
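To see what the formula evaluation amounts to, here is a sketch that evaluates the same expression per row with field names bound to cell values. It uses Python's built-in `eval` with builtins disabled purely as a stand-in for `simpleeval`; this stand-in is not safe for untrusted formulas and is not what the framework does.

```python
header = ["row", "salary", "bonus"]
rows = [[2, 1000, 200], [3, 2500, 500], [4, 1300, 500], [5, 5000, 1000]]
formula = "salary == bonus * 5"

# Compile once, then evaluate against each row's field names.
code = compile(formula, "<formula>", "eval")
failed = []
for row_position, cells in enumerate(rows, start=2):
    names = dict(zip(header, cells))
    # Stand-in for simpleeval: builtins disabled, fields exposed as names.
    if not eval(code, {"__builtins__": {}}, names):
        failed.append(row_position)

print(failed)
```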
+ Check that every row satisfies a provided Python expression.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, formula: str) -> None
++ Python expression to apply to all rows. To evaluate the formula + simpleeval library is used. +
+str
+This check validates that your data has the expected dimensions: an exact number of rows, a minimum and maximum number of rows, an exact number of fields, or a minimum and maximum number of fields.
+from pprint import pprint
+from frictionless import validate, checks
+
+source = [
+ ["row", "salary", "bonus"],
+ [2, 1000, 200],
+ [3, 2500, 500],
+ [4, 1300, 500],
+ [5, 5000, 1000],
+]
+report = validate(source, checks=[checks.table_dimensions(num_rows=5)])
+pprint(report.flatten(["type", "message"]))
+
+
+[['table-dimensions',
+ 'The data source does not have the required dimensions: number of rows is 4, '
+ 'the required is 5']]
+
+  You can also give multiple limits at the same time:
+ +from pprint import pprint
+from frictionless import validate, checks
+
+source = [
+ ["row", "salary", "bonus"],
+ [2, 1000, 200],
+ [3, 2500, 500],
+ [4, 1300, 500],
+ [5, 5000, 1000],
+]
+report = validate(source, checks=[checks.table_dimensions(num_rows=5, num_fields=4)])
+pprint(report.flatten(["type", "message"]))
+
+
+[['table-dimensions',
+ 'The data source does not have the required dimensions: number of fields is '
+ '3, the required is 4'],
+ ['table-dimensions',
+ 'The data source does not have the required dimensions: number of rows is 4, '
+ 'the required is 5']]
+
+  It is possible to use the check declaratively:
+ +from pprint import pprint
+from frictionless import Check, validate, checks
+
+source = [
+ ["row", "salary", "bonus"],
+ [2, 1000, 200],
+ [3, 2500, 500],
+ [4, 1300, 500],
+ [5, 5000, 1000],
+]
+
+check = Check.from_descriptor({"type": "table-dimensions", "minFields": 4, "maxRows": 3})
+report = validate(source, checks=[check])
+pprint(report.flatten(["type", "message"]))
+
+
+[['table-dimensions',
+ 'The data source does not have the required dimensions: number of fields is '
+ '3, the minimum is 4'],
+ ['table-dimensions',
+ 'The data source does not have the required dimensions: number of rows is 4, '
+ 'the maximum is 3']]
+
+ But the table dimensions check arguments num_rows
, min_rows
, max_rows
, num_fields
, min_fields
, max_fields
 must be passed in camelCase in a descriptor, as in the example above, i.e. numRows
, minRows
, maxRows
, numFields
, minFields
and maxFields
.
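If the options are kept in Python-style names, the mapping to descriptor keys can be automated. The helper below is hypothetical (`to_camel_case` is not a frictionless function) and only illustrates the naming rule described above.

```python
def to_camel_case(name):
    """Convert a snake_case option name to its camelCase descriptor key."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


options = {"min_fields": 4, "max_rows": 3}
descriptor = {"type": "table-dimensions"}
descriptor.update({to_camel_case(key): value for key, value in options.items()})
print(descriptor)
```

The resulting dictionary has the same shape as the descriptor passed to `Check.from_descriptor` in the example above.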
Check for minimum and maximum table dimensions.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, num_rows: Optional[int] = None, min_rows: Optional[int] = None, max_rows: Optional[int] = None, num_fields: Optional[int] = None, min_fields: Optional[int] = None, max_fields: Optional[int] = None) -> None
++ Specify the number of rows to compare with actual rows in + the table. If the actual number of rows are less than num_rows it will + notify user as errors. +
+Optional[int]
++ Specify the minimum number of rows that should be in the table. + If the actual number of rows are less than min_rows it will notify user + as errors. +
+Optional[int]
++ Specify the maximum number of rows allowed. + If the actual number of rows are more than max_rows it will notify user + as errors. +
+Optional[int]
++ Specify the number of fields to compare with actual fields in + the table. If the actual number of fields are less than num_fields it will + notify user as errors. +
+Optional[int]
++ Specify the minimum number of fields that should be in the table. + If the actual number of fields are less than min_fields it will notify user + as errors. +
+Optional[int]
++ Specify the maximum number of expected fields. + If the actual number of fields are more than max_fields it will notify user + as errors. +
+Optional[int]
+++This page is powered by contributors-img
+
This package is a collective effort made by many great people working on various projects. You can click on the pictures below to see their contribution in detail.
+Only the breaking and most significant changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.
+ignore_constraints
option for the Indexer
(#1691)header_case = False
(#1635)extract
returning the results depends on the source type (now it's always a dictionary indexed by the resource name)frictionless.Resource(source)
guessing abilities; if you just like to open a table resource use frictionless.resources.TableResource(path=path)
catalog/dataset/package/resource.deference
(#1451)resources
module including File/Text/Json/TableResource
resource.type
argument -- use the classes abovecatalog.packages[]
to catalog.datasets[].package
resource.schema
optional (resource.has_schema
is removed)resource.normpath
optional (resource.normdata
is removed)system/plugin.select_Check/etc
to system/plugin.select_check_class/etc
sqlalchemy@2
(#1427)program/resource.index
preview (#1395)dialect.skip_blank_rows
(#1387)steps.resource_update
for resource transformations (#1381)wkt
format in fields.StringField
(#1363 by @jze)descriptor
argument for actions/program.extract
(#1372)fieldNames
, fieldCells
, referenceName
, and referenceFieldNames
summary
and partial validation details (#1106)summary
(#1127)schema.to_summary
report.to_summary
summary
package.to_zip
(#1104)extract
command (#1130)package.to_er_diagram
(#1135)checks.ascii_value
(#1064)checks.deviated_cell
(#1069)detector.field_true/false_values
(#1074)describe_*
extract_*
transform_*
validate_*
pipeline.validate
(will replace validate_pipeline
in v5)pipeline.transform
(will replace transform_pipeline
 in v5)inquiry.validate
 (will replace validate_inquiry
in v5)Schema.describe
(will replace describe_schema
in v5)schema.validate
(will replace validate_schema
in v5)steps.field_merge
steps.field_pack
Package.describe
(will replace describe_package
in v5)package.extract
(will replace extract_package
in v5)package.validate
(will replace validate_package
in v5)package.transform
(will replace transform_package
in v5)Resource.describe
(will replace describe_resource
in v5)resource.extract
(will replace extract_resource
in v5)resource.validate
(will replace validate_resource
in v5)resource.transform
(will replace transform_resource
in v5)sheet
, table
, keys
, and keyed
(#886)SqlDialect.basepath
(#982) (https://framework.frictionlessdata.io/docs/tutorials/formats/sql-tutorial)inlineDialect.keys
to inlineDialect.data_keys
due to a conflict with dict.keys
propertysystem/plugin.create_candidates
(#893)system.get/use_http_session
(#892)extract/validate
(#881)json
argument to resource.to_snap
--path
CLI argument (#829)Package(innerpath)
argument for unzipping a data package's descriptordescribe_dialect
and describe(path, type="dialect")
--dialect
argument in CLISchema.from_jsonschema
(#797)field.constraints.maxLength
for SQL's VARCHAR (#795)resource.to_view()
(#781)fields[].arrayItem
errors more granular (#767)fields[].arrayItem
(#750)frictionless@4
:tada:filelike
loader to stream
loadertext
loader to buffer
loadertransform_resource(resource)
signaturetransform_package(package)
signatureparser.write_row_stream
APIresource.from/to
APIpackage.from/to
APIStorage
APIsystem.create_storage
APIPandasStorage
into PandasParser
SpssStorage
into SpssParser
data_stream
to list_stream
readData
to readLists
sample
to fragment
(sample
now is raw lists)Detector
class (BREAKING)Detector
resource.infer
omit empty objectsresource.read_*(size)
argumentresource.labels
propertyvalidate(extra_checks=[...])
to validate(checks=[{"code": 'code', ...}])
validate_table
(use validate_resource
)Table
and File
classesdataflows
pluginnopool
by parallel
(not parallel by default)report.tables
to report.tasks
report.tasks[].resource
(instead of plain path/scheme/format/etc)Query
class and arguments/properties to Layout
header
options from Dialect
to Layout
transform(type)
argumentdescribe(source_type)
argument to type
extract_table
(use extract_resource
with the same API)extract(source_type)
argument to type
Package/Resource(source)
notation (guess descriptor/path/etc)schema.infer
-> Schema.from_sample
resource.inline
-> resource.memory
compression_path
-> innerpath
compression: no
-> compression: ""
Package/Resource.infer
not to infer stats (use stats=True
)Package/Resource.infer(only_sample)
 argumentResource.from/to_zip
 (use Package.from/to_zip
)Resource.source
(use Resource.data
or Resource.fullpath
)package/resource.infer(source)
argument (use constructors)resource.read_sample()
to resource.sample
resource.read_header()
to resource.header
resource.read_stats()
to resource.stats
resource.to_table()
resource.to_file()
resource/table.read_data(_stream)
now includes a header row if presenterrors.ExtraHeaderError->ExtraLabelError
(extra-label-error
)errors.MissingHeaderError->MissingLabelError
(missing-label-error
)errors.BlankHeaderError->BlankLabelError
(blank-label-error
)errors.DuplicateHeaderError->DuplicateLabelError
(duplicate-label-error
)errors.NonMatchingHeaderError->IncorrectLabelError
(incorrect-label-error
)schema.read/write_data->read/write_cells
$ pip install frictionless[aws] # before
+$ pip install frictionless[s3] # after
+
+gsheet
plugin/format to gsheets
(BREAKING: minor)frictionless.controls
to frictionless.plugins.*
(BREAKING)frictionless.dialects
to frictionless.plugins.*
(BREAKING)frictionless.exceptions.FrictionlessException
to frictionless.FrictionlessException
(BREAKING)excel
dependencies to frictionless[excel]
extras (BREAKING)json
dependencies to frictionless[json]
extras (BREAKING)json
files to be a metadata by default (BREAKING)Code example:
+# Before
+# pip install frictionless
+from frictionless import dialects, exceptions
+excel_dialect = dialects.ExcelDialect()
+json_dialect = dialects.JsonDialect()
+exception = exceptions.FrictionlessException()
+
+# After
+# pip install frictionless[excel,json]
+from frictionless import FrictionlessException
+from frictionless.plugins.excel import ExcelDialect
+from frictionless.plugins.json import JsonDialect
+excel_dialect = ExcelDialect()
+json_dialect = JsonDialect()
+exception = FrictionlessException()
+
+schema.get/remove_field
now raise if not found (#505) (BREAKING)package.get/remove_resource
now raise if not found (#505) (BREAKING)hashing
parameter to describe/describe_package
table.onerror
property (BREAKING)expand
argument from metadata.to_dict
(BREAKING)resource.stats
(BREAKING)on_error
to onerror
(BREAKING)resource.stats.fields
on_error
argument to Table/Resource/Package (#445)goodtables
successor versioningWe welcome contributions from anyone! Please read the following guidelines, and feel free to reach out to us if you have questions. Thanks for your interest in helping make Frictionless awesome!
+We use GitHub as a code and issue hosting platform. To report a bug or propose a new feature, please open an issue. For pull requests, we ask that you first create an issue and then create a pull request linked to this issue.
+To start working on the project:
+Install Python headers if they are missing:
+sudo apt-get install libpython3.10-dev
+
+For development orchestration we use Hatch for Python (defined in pyproject.toml
). We use make
to run high-level commands (defined in Makefile
)
pip3 install hatch
+
+Before starting with the project we recommend configuring hatch
. The following line will ensure that all the virtual environments will be stored in the .python
directory in the project root:
hatch config set 'dirs.env.virtual' '.python'
+
+Now you can set up your IDE to use a proper Python path:
+.python/frictionless/bin/python
+
+Enter the virtual environment before starting the work. It will ensure that all the development dependencies are installed into a virtual environment:
+hatch shell
+
+Use the following command to build the container:
+ +hatch run image
+
+
+ This should take care of setting up everything. If the container is built without errors, you can then run commands like hatch
inside the container to accomplish various tasks (see the next section for details).
To make things easier, we can create an alias:
+ +alias "frictionless-dev=docker run --rm -v $PWD:/home/frictionless -it frictionless-dev"
+
+
+ Then, for example, to run the tests, we can use:
+ +frictionless-dev hatch run test
+
+
+  Frictionless is a Python 3.8+ framework, and it uses some common Python tools for the development process (we recommend enabling support of these tools in your IDE):
+ruff
pyright
pytest
You also need git
to work on the project.
To contribute to the documentation, please find an article in the docs
folder and update its contents. We write our documentation using Livemark. Livemark lets you include examples without writing their output by hand, because the output is generated automatically.
It's possible to run this documentation portal locally:
+ +livemark start
+
+
 The VCR library records the responses to HTTP requests locally as a cassette on the first run. All subsequent calls reuse the data recorded +from the previous HTTP request, which speeds up the testing process. To record a unit test (as a cassette), we mark it with a decorator:
+@pytest.mark.vcr
+def test_connect_with_server():
+ pass
+
+The cassette will be recorded as "test_connect_with_server.yaml". A new call is made when the parameters change. To skip sensitive data, +we can use filters:
+@pytest.fixture(scope="module")
+def vcr_config():
+ return {"filter_headers": ["authorization"]}
+
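To make the effect of a header filter concrete, here is a minimal standard-library sketch of the idea: the listed headers are dropped before a request is stored. It is only an illustration and not how VCR itself is implemented; `scrub_headers` and the sample `recorded` dictionary are hypothetical names.

```python
from typing import Dict, Iterable


def scrub_headers(headers: Dict[str, str], filtered: Iterable[str]) -> Dict[str, str]:
    """Return a copy of `headers` without the filtered names (case-insensitive)."""
    blocked = {name.lower() for name in filtered}
    return {key: value for key, value in headers.items() if key.lower() not in blocked}


# Hypothetical request headers as they might look before being written to a cassette
recorded = {"Authorization": "token secret", "Accept": "application/json"}
safe = scrub_headers(recorded, ["authorization"])
print(safe)  # {'Accept': 'application/json'}
```

The original dictionary is left untouched, so the live request can still authenticate while the stored copy carries no secret.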
+CKAN_APIKEY=***************************
+
+Read
+ZENODO_ACCESS_TOKEN=***************************
+
+Write
+ZENODO_SANDBOX_ACCESS_TOKEN=***************************
+
+base_url="https://sandbox.zenodo.org/api/"
+
+GITHUB_NAME=FD
+GITHUB_EMAIL=frictionlessdata@okfn.org
+GITHUB_ACCESS_TOKEN=***************************
+
+To release a new version:
+main
branch
hatch version <major|minor|micro>
to update the version
CHANGELOG.md
if it's not a patch release (major or minor)
hatch run release
which creates a release commit and tag and pushes them to GitHub
Copyright © 2020
Open Knowledge Foundation
Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE.
+ +Frictionless is a logical continuation of many existing packages created for Frictionless Data, such as datapackage
or tableschema
. Although most of these packages will continue to be supported, you can migrate to Frictionless, which requires Python 3.8+, as it improves many aspects of working with data and metadata. This document also covers migration from one version of the framework to another.
Since the initial Frictionless Framework release we'd been collecting feedback and analyzing both high-level users' needs and bug reports to identify shortcomings and areas that can be improved in the next version of the framework. Read about a new version of the framework and migration details in this blog:
+ +Frictionless Framework provides the frictionless transform
function for data transformation. It can be used to migrate from dataflows
or datapackage-pipelines
:
Frictionless Framework provides the frictionless validate
function, which at a high level works exactly the same as goodtables validate
. Also frictionless describe
is an improved version of goodtables init
. You need to use the frictionless
command instead of the goodtables
command:
Frictionless Framework has Package
and Resource
classes, which are almost the same as the ones datapackage
has:
Frictionless Framework has Schema
and Field
classes, which are almost the same as the ones tableschema
has:
Frictionless has the Resource
class, which is the equivalent of tabulator's Stream
class:
validate
if needed.
+ With convert
command you can quickly convert a tabular data file from one format to another (or to the same format with a different dialect):
For example, let's convert a CSV file into an Excel file:
+ +frictionless convert table.csv table.xlsx
+
+
+ The command can be used for downloading files as well. For example, let's cherry-pick one CSV file from a Zenodo dataset:
+ +frictionless convert https://zenodo.org/record/3977957 --name aaawrestlers --to-path test.csv
+
+
 Suppose we want to change the CSV delimiter (the semicolon is quoted so that the shell does not treat it as a command separator):
+ + + +frictionless convert table.csv table-copy.csv --csv-delimiter ';'
+
+
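To show what a delimiter change does to the data itself, here is a minimal sketch using only Python's standard `csv` module. It does not reflect how Frictionless implements `convert`; the helper name `change_delimiter` and the sample rows are assumptions made for illustration.

```python
import csv
import io


def change_delimiter(text: str, source: str = ",", target: str = ";") -> str:
    """Re-serialize CSV text, replacing the field delimiter."""
    rows = csv.reader(io.StringIO(text), delimiter=source)
    output = io.StringIO()
    writer = csv.writer(output, delimiter=target, lineterminator="\n")
    writer.writerows(rows)
    return output.getvalue()


# Hypothetical sample data
converted = change_delimiter("id,name\n1,english\n2,german\n")
print(converted)
```

Parsing and re-writing the rows, rather than replacing characters in the text, means a field that already contains the new delimiter gets quoted on output instead of silently splitting into two columns.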
+ describe
and list
command: if datapackage.json
is not provided, describe
will load a sample from every tabular data file in a dataset and infer a schema, while list
is a very lean and quick command that operates only on the available metadata and does not touch the actual data files.
+ With the Frictionless describe
command you can get the metadata of a file or a dataset.
By default, it outputs metadata visually formatted:
+ +frictionless describe tables/*.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ chunk1 │ table │ tables/chunk1.csv │
+│ chunk2 │ table │ tables/chunk2.csv │
+└────────┴───────┴───────────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ chunk1
+┏━━━━━━━━━┳━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━━━━━━╇━━━━━━━━┩
+│ integer │ string │
+└─────────┴────────┘
+ chunk2
+┏━━━━━━━━━┳━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━━━━━━╇━━━━━━━━┩
+│ integer │ string │
+└─────────┴────────┘
+
+ It's possible to output as YAML
or JSON
, for example:
frictionless describe tables/*.csv --yaml
+
+
+resources:
+ - name: chunk1
+ type: table
+ path: tables/chunk1.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+ encoding: utf-8
+ schema:
+ fields:
+ - name: id
+ type: integer
+ - name: name
+ type: string
+ - name: chunk2
+ type: table
+ path: tables/chunk2.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+ encoding: utf-8
+ schema:
+ fields:
+ - name: id
+ type: integer
+ - name: name
+ type: string
+
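The `integer` and `string` field types in the output above come from inference over the data. The following is a deliberately simplified sketch of sample-based type inference using only the standard library. It is not the Frictionless algorithm, which supports many more types and options; `infer_type`, `infer_fields`, and the `sample_size` default are hypothetical.

```python
import csv
import io
from itertools import islice
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Return "integer" if every sampled value parses as an int, else "string"."""
    if not values:
        return "string"
    try:
        for value in values:
            int(value)
    except ValueError:
        return "string"
    return "integer"


def infer_fields(text: str, sample_size: int = 100) -> List[Dict[str, str]]:
    """Infer field names and types from the header row and a sample of data rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    sample = list(islice(reader, sample_size))  # only the sample is inspected
    return [
        {"name": name, "type": infer_type([row[i] for row in sample if i < len(row)])}
        for i, name in enumerate(header)
    ]


fields = infer_fields("id,name\n1,english\n2,german\n")
print(fields)  # [{'name': 'id', 'type': 'integer'}, {'name': 'name', 'type': 'string'}]
```

Because only a sample is read, a value further down the file can contradict the inferred type, which is one reason to validate the data after describing it.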
+ With the explore
command you can open your dataset in Visidata, which is an amazing visual tool for working with tabular data in the console. For example, try "Shift+F" to create data histograms!
pip install frictionless[visidata]
+pip install frictionless[visidata,zenodo] # for examples in this tutorial
+
+
 For example, let's explore this interesting dataset:
+ +frictionless explore https://zenodo.org/record/3977957
+
+
+ Before entering Visidata, it's highly recommended to read its documentation:
+ +You can get it in Console as well:
+ + + +vd --help
+
+
+vd(1) Quick Reference Guide vd(1)
+
+NAME
+ VisiData — a terminal utility for exploring and arranging tabular data
+
+SYNOPSIS
+ vd [options] [input ...]
+ visidata [options] [input ...]
+ vd [options] --play cmdlog [-w waitsecs] [--batch] [-i] [-o output]
+ [field=value]
+ vd [options] [input ...] +toplevel:subsheet:row:col
+
+DESCRIPTION
+ VisiData is an easy-to-use multipurpose tool to explore, clean, edit, and
+ restructure data. Rows can be selected, filtered, and grouped; columns
+ can be rearranged, transformed, and derived via regex or Python expres‐
+ sions; and workflows can be saved, documented, and replayed.
+
+ REPLAY MODE
+ -p, --play=cmdlog replay a saved cmdlog within the interface
+ -w, --replay-wait=seconds
+ wait seconds between commands
+ -b, --batch replay in batch mode (with no interface)
+ -i, --interactive launch VisiData in interactive mode after batch
+ -o, --output=file save final visible sheet to file as .tsv
+ field=value replace "{field}" in cmdlog contents with value
+
+ Commands During Replay
+ ^K cancel current replay
+
+ GLOBAL COMMANDS
+ All keystrokes are case sensitive. The ^ prefix is shorthand for Ctrl.
+
+ Keystrokes to start off with
+ ^Q abort program immediately
+ ^C cancel user input or abort all async threads on current
+ sheet
+ g^C abort all secondary threads
+ q quit current sheet or menu
+ Q quit current sheet and free associated memory
+ gq quit all sheets (clean exit)
+
+ Alt+H activate help menu (Enter/left-mouse to expand submenu
+ or execute command)
+ g^H view this man page
+ z^H view sheet of command longnames and keybindings for cur‐
+ rent sheet
+
+ gb open sidebar in a new sheet
+ b toggle sidebar
+
+ U undo the most recent modification (requires enabled
+ options.undo)
+ R redo the most recent undo (requires enabled
+ options.undo)
+
+ Space longname open command palette; execute top command by its
+ longname
+
+ Command Palette
+ Tab Move to command palette, and cycle through commands
+ 0-9 Execute numbered command
+ Enter Execute highlighted command
+
+ Cursor Movement
+ Arrow PgUp go as expected
+ h j k l go left/down/up/right
+ gh gj gk gl go all the way to the left/bottom/top/right of sheet
+ G gg go all the way to the bottom/top of sheet
+       End  Home       go all the way to the bottom/top of sheet
+ ^B ^F scroll one page back/forward
+ ^Left ^Right scroll one page left/right
+ zz scroll current row to center of screen
+
+ ^^ (Ctrl+^) jump to previous sheet (swaps with current sheet)
+
+ / ? regex search for regex forward/backward in current column's
+ displayed values
+ g/ g? regex search for regex forward/backward over all visible
+ columns' displayed values
+ z/ z? expr search by Python expr forward/backward in current column
+ (with column names as variables)
+ n N go to next/previous match from last regex search
+
+ < > go up/down current column to next value
+ z< z> go up/down current column to next null value
+ { } go up/down current column to next selected row
+
+ c regex go to next column with name matching regex
+ r regex go to next row with key matching regex
+ zc zr number go to column/row number (0-based)
+
+ H J K L slide current row/column left/down/up/right
+ gH gJ gK gL slide current row/column all the way to the left/bot‐
+ tom/top/right of sheet
+       zH  zJ  zK  zL number
+ slide current row/column number positions to the
+ left/down/up/right
+
+ zh zj zk zl scroll one left/down/up/right
+
+ Column Manipulation
+ _ (underbar) toggle width of current column between full and default
+ width
+ g_ toggle widths of all visible columns between full and
+ default width
+ z_ number adjust width of current column to number
+       gz_ number  adjust widths of all visible columns to number
+
+ - (hyphen) hide current column
+ z- reduce width of current column by half
+ gv unhide all columns
+
+ ! z! toggle/unset current column as a key column
+ ~ # % $ @ z#
+ set type of current column to str/int/float/cur‐
+ rency/date/len
+ Alt++ Alt+- show more/less precision in current numerical column
+ ^ rename current column
+ g^ rename all unnamed visible columns to contents of se‐
+ lected rows (or current row)
+ z^ rename current column to combined contents of current
+ cell in selected rows (or current row)
+ gz^ rename all visible columns to combined contents of cur‐
+ rent column for selected rows (or current row)
+
+ = expr create new column from Python expr, with column names,
+ and attributes, as variables
+ g= expr set current column for selected rows to result of Python
+ expr
+ gz= expr set current column for selected rows to the items in
+ result of Python sequence expr
+ z= expr evaluate Python expression on current row and set
+ current cell with result of Python expr
+
+ i add column with incremental values
+ gi set current column for selected rows to incremental
+ values
+ zi step add column with values at increment step
+ gzi step set current column for selected rows at increment step
+
+ ' (tick) add a frozen copy of current column with all cells eval‐
+ uated
+ g' open a frozen copy of current sheet with all visible
+ columns evaluated
+ z' gz' add/reset cache for current/all visible column(s)
+
+ Note that regex operations apply to the displayed value in a cell.
+ : regex add new columns from regex split; number of columns
+ determined by example row at cursor
+ ; regex add new columns from capture groups of regex (also
+ requires example row)
+ z; expr create new column from bash expr, with $columnNames as
+ variables
+ * regex/subst add column derived from current column, replacing regex
+ with subst (may include \1 backrefs)
+ g* gz* regex/subst
+ modify selected rows in current/all visible column(s),
+ replacing regex with subst (may include \1 backrefs)
+
+ ( g( expand current/all visible column(s) of lists (e.g. [3])
+ or dicts (e.g. {3}) one level
+ z( gz( depth expand current/all visible column(s) of lists (e.g. [3])
+ or dicts (e.g. {3}) to given depth (0= fully)
+       )   g)      unexpand current/all visible column(s); restore original
+ column and remove other columns at this level
+ z) gz) depth contract current/all visible column(s) of former lists
+ (e.g. [3]) or dicts (e.g. {3}) to given depth (0= fully)
+ zM row-wise expand current column of lists (e.g. [3]) or
+ dicts (e.g. {3}) within that column
+
+ Row Selection
+ s t u select/toggle/unselect current row
+ gs gt gu select/toggle/unselect all rows
+ zs zt zu select/toggle/unselect all rows from top to cursor
+ gzs gzt gzu select/toggle/unselect all rows from cursor to bottom
+ | \ regex select/unselect rows matching regex in current column
+ g| g\ regex select/unselect rows matching regex in any visible
+ column
+ z| z\ expr select/unselect rows matching Python expr in any visible
+ column
+ , (comma) select rows matching display value of current cell in
+ current column
+ g, select rows matching display value of current row in all
+ visible columns
+ z, gz, select rows matching typed value of current cell/row in
+ current column/all visible columns
+
+ Row Sorting/Filtering
+ [ ] sort ascending/descending by current column; replace any
+ existing sort criteria
+ g[ g] sort ascending/descending by all key columns; replace
+ any existing sort criteria
+ z[ z] sort ascending/descending by current column; add to ex‐
+ isting sort criteria
+ gz[ gz] sort ascending/descending by all key columns; add to ex‐
+ isting sort criteria
+ " open duplicate sheet with only selected rows
+ g" open duplicate sheet with all rows
+ gz" open duplicate sheet with deepcopy of selected rows
+
+ The rows in these duplicated sheets (except deepcopy) are references to
+ rows on the original source sheets, and so edits to the filtered rows
+ will naturally be reflected in the original rows. Use g' to freeze sheet
+ contents in a deliberate copy.
+
+ Editing Rows and Cells
+ a za append blank row/column; appended columns cannot be
+ copied to clipboard
+ ga gza number append number blank rows/columns
+ d gd delete current/selected row(s)
+ y gy yank (copy) current/all selected row(s) to clipboard in
+ Memory Sheet
+ x gx cut (copy and delete) current/all selected row(s) to
+ clipboard in Memory Sheet
+ zy gzy yank (copy) contents of current column for
+ current/selected row(s) to clipboard in Memory Sheet
+ zd gzd set contents of current column for current/selected
+ row(s) to options.null_value
+ zx gzx cut (copy and delete) contents of current column for
+ current/selected row(s) to clipboard in Memory Sheet
+ p P paste clipboard rows after/before current row
+ zp gzp set cells of current column for current/selected row(s)
+ to last clipboard value
+ zP gzP paste to cells of current column for current/selected
+ row(s) using the system clipboard
+ Y gY yank (copy) current/all selected row(s) to system
+ clipboard (using options.clipboard_copy_cmd)
+ zY gzY yank (copy) contents of current column for
+ current/selected row(s) to system clipboard (using
+ options.clipboard_copy_cmd)
+ f fill null cells in current column with contents of non-
+ null cells up the current column
+ e text edit contents of current cell
+ ge text set contents of current column for selected rows to text
+
+ Commands While Editing Input
+ Enter ^C accept/abort input
+ ^O g^O open external $EDITOR to edit contents of current/se‐
+ lected rows in current column
+ ^R reload initial value
+ ^A ^E go to beginning/end of line
+ ^B ^F go back/forward one character
+ ^← ^→ (arrow) go back/forward one word
+ ^H ^D delete previous/current character
+ ^T transpose previous and current characters
+ ^U ^K clear from cursor to beginning/end of line
+ ^Y paste from cell clipboard
+ Backspace Del delete previous/current character
+ Insert toggle insert mode
+ Up Down set contents to previous/next in history
+ Tab Shift+Tab move cursor left/right and re-enter edit mode
+ Shift+Arrow move cursor in direction of Arrow and re-enter edit
+ mode
+
+ Data Toolkit
+ o input open input in VisiData
+ zo open file or url from path in current cell
+ ^S g^S filename save current/all sheet(s) to filename in format
+ determined by extension (default .tsv)
+ Note: if the format does not support multisave, or the
+ filename ends in a /, a directory will be created.
+ z^S filename save current column only to filename in format
+ determined by extension (default .tsv)
+ ^D filename.vdj save CommandLog to filename.vdj file
+ A open new blank sheet with one column
+ T open new sheet that has rows and columns of current
+ sheet transposed
+
+ + aggregator add aggregator to current column (see Frequency Table)
+ z+ aggregator display result of aggregator over values in selected
+ rows for current column; store result in Memory Sheet
+ & append top two sheets in Sheets Stack
+ g& append all sheets in Sheets Stack
+
+ w nBefore nAfter
+ add column where each row contains a list of that row,
+ nBefore rows, and nAfter rows
+
+ Data Visualization
+ . (dot) plot current numeric column vs key columns. The numeric
+ key column is used for the x-axis; categorical key column
+ values determine color.
+ g. plot a graph of all visible numeric columns vs key
+ columns.
+
+ If rows on the current sheet represent plottable coordinates (as in .shp
+ or vector .mbtiles sources), . plots the current row, and g. plots all
+ selected rows (or all rows if none selected).
+
+ Canvas-specific Commands
+ + - increase/decrease zoom level, centered on cursor
+ _ (underbar) zoom to fit full extent
+ z_ (underbar) set aspect ratio
+ x xmin xmax set xmin/xmax on graph
+ y ymin ymax set ymin/ymax on graph
+ s t u select/toggle/unselect rows on source sheet con‐
+ tained within canvas cursor
+ gs gt gu select/toggle/unselect rows on source sheet visi‐
+ ble on screen
+ d delete rows on source sheet contained within can‐
+ vas cursor
+ gd delete rows on source sheet visible on screen
+ Enter open sheet of source rows contained within canvas
+ cursor
+ gEnter open sheet of source rows visible on screen
+ 1 - 9 toggle display of layers
+ ^L redraw all pixels on canvas
+ v toggle show_graph_labels option
+ mouse scrollwheel zoom in/out of canvas
+ left click-drag set canvas cursor
+ right click-drag scroll canvas
+
+ Split Screen
+ Z split screen in half, so that second sheet on the stack is
+ visible in a second pane
+ zZ split screen, and queries for height of second pane
+
+ Split Window specific Commands
+ gZ close an already split screen, current pane full
+ screens
+ Z push second sheet on current pane's stack to the
+ top of the other pane's stack
+ Tab jump to other pane
+ gTab swap panes
+ g Ctrl+^ cycle through sheets
+
+ Other Commands
+ Q quit current sheet and remove it from the CommandLog
+ v toggle sheet-specific visibility (multi-line rows on
+ Sheet, legends/axes on Graph)
+
+ ^E g^E view traceback for most recent error(s)
+ z^E view traceback for error in current cell
+
+ ^L refresh screen
+ ^R reload current sheet
+ ^Z suspend VisiData process
+ ^G show cursor position and bounds of current sheet on sta‐
+ tus line
+ ^V show version and copyright information on status line
+ ^P open Status History
+ m keystroke first, begin recording macro; second, prompt for
+ keystroke , and complete recording. Macro can then be
+ executed everytime provided keystroke is used. Will
+ override existing keybinding. Macros will run on current
+ row, column, sheet.
+ gm open an index of all existing macros. Can be directly
+ viewed with Enter, and then modified with ^S.
+
+ ^Y z^Y g^Y open current row/cell/sheet as Python object
+ ^X expr evaluate Python expr and opens result as Python object
+ z^X expr evaluate Python expr, in context of current row, and
+ open result as Python object
+ g^X module import Python module in the global scope
+
+ Internal Sheets List
+ . Directory Sheet browse properties of files in a directory
+ . Guide Index read documentation from within VisiData
+ . Memory Sheet (Alt+Shift+M) browse saved values, including clipboard
+
+ Metasheets
+ . Columns Sheet (Shift+C) edit column properties
+ . Sheets Sheet (Shift+S) jump between sheets or join them together
+ . Options Sheet (Shift+O) edit configuration options
+ . Commandlog (Shift+D) modify and save commands for replay
+ . Error Sheet (Ctrl+E) view last error
+ . Status History (Ctrl+P) view history of status messages
+ . Threads Sheet (Ctrl+T) view, cancel, and profile
+ asynchronous threads
+
+ Derived Sheets
+ . Frequency Table (Shift+F) group rows by column value, with
+ aggregations of other columns
+ . Describe Sheet (Shift+I) view summary statistics for each column
+ . Pivot Table (Shift+W) group rows by key and summarize current
+ column
+ . Melted Sheet (Shift+M) unpivot non-key columns into
+ variable/value columns
+ . Transposed Sheet (Shift+T) open new sheet with rows and columns
+ transposed
+
+ INTERNAL SHEETS
+ Directory Sheet
+ (global commands)
+ Space open-dir-current
+ open the Directory Sheet for the current directory
+ (sheet-specific commands)
+ Enter gEnter open current/selected file(s) as new sheet(s)
+ ^O g^O open current/selected file(s) in external $EDITOR
+ ^R z^R gz^R reload information for all/current/selected file(s)
+ d gd delete current/selected file(s) from filesystem, upon
+ commit
+ y gy directory
+ copy current/selected file(s) to given directory,
+ upon commit
+ e ge name rename current/selected file(s) to name
+ ` (backtick) open parent directory
+ z^S commit changes to file system
+
+ Guide Index
+ Browse through a list of available guides. Each guide shows you how to
+ use a particular feature. Gray guides have not been written yet.
+ (global commands)
+ Space open-guide-index
+ open the Guide Index
+ (sheet-specific commands)
+ Enter open a guide
+
+ Memory Sheet
+     Browse through a list of stored values, referenceable in expressions
+ through their name.
+ (global commands)
+ Alt+Shift+M open the Memory Sheet
+ Alt+M name store value in current cell in Memory Sheet under
+ name
+ (sheet-specific commands)
+ e edit either value or name, to edit reference
+
+ METASHEETS
+ Columns Sheet (Shift+C)
+ Properties of columns on the source sheet can be changed with standard
+ editing commands (e ge g= Del) on the Columns Sheet. Multiple aggregators
+ can be set by listing them (separated by spaces) in the aggregators
+ column. The 'g' commands affect the selected rows, which are the literal
+ columns on the source sheet.
+ (global commands)
+ gC open Columns Sheet with all visible columns from all
+ sheets
+ (sheet-specific commands)
+ & add column from appending selected source columns
+ g! gz! toggle/unset selected columns as key columns on
+ source sheet
+           g+ aggregator  add aggregator to selected source columns
+ g- (hyphen) hide selected columns on source sheet
+ g~ g# g% g$ g@ gz# z%
+ set type of selected columns on source sheet to
+ str/int/float/currency/date/len/floatsi
+ Enter open a Frequency Table sheet grouped by column
+ referenced in current row
+
+ Sheets Sheet (Shift+S)
+ open Sheets Stack, which contains only the active sheets on the current
+ stack
+ (global commands)
+ gS open Sheets Sheet, which contains all sheets from
+ current session, active and inactive
+ Alt number jump to sheet number
+ (sheet-specific commands)
+ Enter jump to sheet referenced in current row
+ gEnter push selected sheets to top of sheet stack
+ a add row to reference a new blank sheet
+ gC gI open Columns Sheet/Describe Sheet with all visible
+ columns from selected sheets
+ g^R reload all selected sheets
+ z^C gz^C abort async threads for current/selected sheets(s)
+ g^S save selected or all sheets
+ & jointype merge selected sheets with visible columns from all,
+ keeping rows according to jointype:
+ . inner keep only rows which match keys on all
+ sheets
+ . outer keep all rows from first selected sheet
+ . full keep all rows from all sheets (union)
+ . diff keep only rows NOT in all sheets
+ . append combine all rows from all sheets
+ . concat similar to 'append' but keep first sheet
+ type and columns
+ . extend copy first selected sheet, keeping all rows
+ and sheet type, and extend with columns from other
+ sheets
+ . merge keep all rows from first sheet, updating
+ any False-y cells with non-False-y values from
+ second sheet; add unique rows from second sheet
+
+ Options Sheet (Shift+O)
+ (global commands)
+ Shift+O edit global options (apply to all sheets)
+ zO edit sheet options (apply to current sheet only)
+ gO open options.config as TextSheet
+ (sheet-specific commands)
+ Enter e edit option at current row
+ d remove option override for this context
+ ^S save option configuration to foo.visidatarc
+
+ CommandLog (Shift+D)
+ (global commands)
+ D open current sheet's CommandLog with all other loose
+ ends removed; includes commands from parent sheets
+ gD open global CommandLog for all commands executed in
+ the current session
+ zD open current sheet's CommandLog with the parent
+ sheets commands' removed
+ (sheet-specific commands)
+ x replay command in current row
+ gx replay contents of entire CommandLog
+ ^C abort replay
+
+ Threads Sheet (Ctrl+T)
+ (global commands)
+ ^T open global Threads Sheet for all asynchronous
+ threads running
+ z^T open current sheet's Threads Sheet
+ (sheet-specific commands)
+ ^C abort thread at current row
+ g^C abort all threads on current Threads Sheet
+
+ DERIVED SHEETS
+ Frequency Table (Shift+F)
+ A Frequency Table groups rows by one or more columns, and includes
+ summary columns for those with aggregators.
+ (global commands)
+ gF open Frequency Table, grouped by all key columns on
+ source sheet
+ zF open one-line summary for all rows and selected rows
+ (sheet-specific commands)
+ s t u select/toggle/unselect these entries in source sheet
+ Enter gEnter open copy of source sheet with rows that are grouped
+ in current cell / selected rows
+
+ Describe Sheet (Shift+I)
+ A Describe Sheet contains descriptive statistics for all visible columns.
+ (global commands)
+ gI open Describe Sheet for all visible columns on all
+ sheets
+ (sheet-specific commands)
+ zs zu select/unselect rows on source sheet that are being
+ described in current cell
+ ! toggle/unset current column as a key column on source
+ sheet
+ Enter open a Frequency Table sheet grouped on column
+ referenced in current row
+ zEnter open copy of source sheet with rows described in cur‐
+ rent cell
+
+ Pivot Table (Shift+W)
+ Set key column(s) and aggregators on column(s) before pressing Shift+W on
+ the column to pivot.
+ (sheet-specific commands)
+ Enter open sheet of source rows aggregated in current pivot
+ row
+ zEnter open sheet of source rows aggregated in current pivot
+ cell
+
+ Melted Sheet (Shift+M)
+ Open Melted Sheet (unpivot), with key columns retained and all non-key
+ columns reduced to Variable-Value rows.
+ (global commands)
+ gM regex open Melted Sheet (unpivot), with key columns
+ retained and regex capture groups determining how the
+ non-key columns will be reduced to Variable-Value
+ rows.
+
+ Python Object Sheet (^X ^Y g^Y z^Y)
+ (sheet-specific commands)
+ Enter dive further into Python object
+ v toggle show/hide for methods and hidden properties
+ gv zv show/hide methods and hidden properties
+
+COMMANDLINE OPTIONS
+ Add -n/--nonglobal to make subsequent CLI options sheet-specific
+ (applying only to paths specified directly on the CLI). By default, CLI
+ options apply to all sheets.
+
+ Options can also be set via the Options Sheet or a .visidatarc (see
+ FILES).
+
+ -P=longname preplay longname before replay or regular
+ launch; limited to Base Sheet bound commands
+ +toplevel:subsheet:row:col launch vd with subsheet of toplevel at
+ top-of-stack, and cursor at row and col; all
+ arguments are optional
+ --overwrite=c Overwrite with confirmation
+ --guides open Guide Index
+
+ -f, --filetype=filetype tsv set loader to use for
+ filetype instead of file extension
+ -d, --delimiter=delimiter \t field delimiter to use
+ for tsv/usv filetype
+ -y, --overwrite=y y overwrite existing files
+ without confirmation
+ -ro, --overwrite=n n do not overwrite existing
+ files
+ -N, --nothing=T False disable loading
+ .visidatarc and plugin addons
+ --visidata-dir=str ~/.visidata/ directory to load and
+ store additional files
+ --debug False exit on error and display
+ stacktrace
+ --undo=bool True enable undo/redo
+ --col-cache-size=int 0 max number of cache en‐
+ tries in each cached col‐
+ umn
+ --scroll-incr=int -3 amount to scroll with
+ scrollwheel
+ --force-256-colors False use 256 colors even if
+ curses reports fewer
+ --quitguard False confirm before quitting
+ modified sheet
+ --default-width=int 20 default column width
+ --default-height=int 4 default column height
+ --name-joiner=str _ string to join sheet or
+ column names
+ --value-joiner=str string to join display
+ values
+ --max-rows=int 1000000000 number of rows to load
+ from source
+ --wrap False wrap text to fit window
+ width on TextSheet
+ --save-filetype=str tsv specify default file type
+ to save as
+ --profile False enable profiling on
+ threads
+ --min-memory-mb=int 0 minimum memory to con‐
+ tinue loading and async
+ processing
+ --encoding=str utf-8-sig encoding passed to
+ codecs.open when reading
+ a file
+ --encoding-errors=str surrogateescape encoding_errors passed to
+ codecs.open
+ --mouse-interval=int 1 max time between
+ press/release for click
+ (ms)
+ --bulk-select-clear False clear selected rows be‐
+ fore new bulk selections
+ --some-selected-rows False if no rows selected, if
+ True, someSelectedRows
+ returns all rows; if
+ False, fails
+ --regex-skip=str regex of lines to skip in
+ text sources
+ --regex-flags=str I flags to pass to re.com‐
+ pile() [AILMSUX]
+ --load-lazy False load subsheets always
+ (False) or lazily (True)
+ --skip=int 0 skip N rows before header
+ --header=int 1 parse first N rows as
+ column names
+ --delimiter=str field delimiter to use
+ for tsv/usv filetype
+ --row-delimiter=str " row delimiter to use
+ for tsv/usv filetype
+ --tsv-safe-newline=str replacement for newline
+ character when saving to
+ tsv
+ --tsv-safe-tab=str replacement for tab char‐
+ acter when saving to tsv
+ --visibility=int 0 visibility level
+ --default-sample-size=int 100 number of rows to sample
+ for regex.split (0=all)
+ --fmt-expand-dict=str %s.%s format str to use for
+ names of columns expanded
+ from dict (colname, key)
+ --fmt-expand-list=str %s[%s] format str to use for
+ names of columns expanded
+ from list (colname, in‐
+ dex)
+ --json-indent=NoneType None indent to use when saving
+ json
+ --json-sort-keys False sort object keys when
+ saving to json
+ --json-ensure-ascii=bool True ensure ascii encode when
+ saving json
+ --default-colname=str column name to use for
+ non-dict rows
+ --filetype=str specify file type
+ --safe-error=str #ERR error string to use while
+ saving
+ --save-encoding=str utf-8 encoding passed to
+ codecs.open when saving a
+ file
+ --clean-names False clean column/sheet names
+ to be valid Python iden‐
+ tifiers
+ --replay-wait=float 0.0 time to wait between re‐
+ played commands, in sec‐
+ onds
+ --rowkey-prefix=str キ string prefix for rowkey
+ in the cmdlog
+ --clipboard-copy-cmd=str xclip -selection clipboard -filter
+ command to copy stdin to
+ system clipboard
+ --clipboard-paste-cmd=str xclip -selection clipboard -o
+ command to send contents
+ of system clipboard to
+ stdout
+ --fancy-chooser False a nicer selection inter‐
+ face for aggregators and
+ jointype
+ --null-value=NoneType None a value to be counted as
+ null
+ --histogram-bins=int 0 number of bins for his‐
+ togram of numeric columns
+ --numeric-binning False bin numeric columns into
+ ranges
+ --plot-colors=str list of distinct colors
+ to use for plotting dis‐
+ tinct objects
+ --motd-url=str source of randomized
+ startup messages
+ --dir-depth=int 0 folder recursion depth on
+ DirSheet
+ --dir-hidden False load hidden files on
+ DirSheet
+ --config=Path /home/saul/.visidatarc
+ config file to exec in
+ Python
+ --play=str file.vdj to replay
+ --batch False replay in batch mode
+ (with no interface and
+ all status sent to std‐
+ out)
+ --output=NoneType None save the final visible
+ sheet to output at the
+ end of replay
+ --preplay=str longnames to preplay be‐
+ fore replay
+ --imports=str plugins imports to preload before
+ .visidatarc (command-line
+ only)
+ --nothing False no config, no plugins,
+ nothing extra
+ --interactive False run interactive mode af‐
+ ter batch replay
+ --overwrite=str c overwrite existing files
+ {y=yes|c=confirm|n=no}
+ --plugins-autoload=bool True do not autoload plugins
+ if False
+ --theme=str display/color theme to
+ use
+ --airtable-auth-token=str Airtable API key from
+ https://airtable.com/ac‐
+ count
+ --matrix-token=str matrix API token
+ --matrix-user-id=str matrix user ID associated
+ with token
+ --matrix-device-id=str VisiData device ID associated with
+ matrix login
+ --reddit-client-id=str client_id for reddit api
+ --reddit-client-secret=str client_secret for reddit
+ api
+ --reddit-user-agent=str 3.1dev user_agent for reddit api
+ --zulip-batch-size=int -100 number of messages to
+ fetch per call (<0 to
+ fetch before anchor)
+ --zulip-anchor=int 1000000000 message id to start
+ fetching from
+ --zulip-delay-s=float 1e-05 seconds to wait between
+ calls (0 to stop after
+ first)
+ --zulip-api-key=str Zulip API key
+ --zulip-email=str Email for use with Zulip
+ API key
+ --csv-dialect=str excel dialect passed to
+ csv.reader
+ --csv-delimiter=str , delimiter passed to
+ csv.reader
+ --csv-doublequote=bool True quote-doubling setting
+ passed to csv.reader
+ --csv-quotechar=str " quotechar passed to
+ csv.reader
+ --csv-quoting=int 0 quoting style passed to
+ csv.reader and csv.writer
+ --csv-skipinitialspace=bool True skipinitialspace passed
+ to csv.reader
+ --csv-escapechar=NoneType None escapechar passed to
+ csv.reader
+ --csv-lineterminator=str " lineterminator passed
+ to csv.writer
+ --safety-first False sanitize input/output to
+ handle edge cases, with a
+ performance cost
+ --f5log-object-regex=NoneType None A regex to perform on the
+ object name, useful where
+ object names have a
+ structure to extract. Use
+ the (?P<foo>...) named
+ groups form to get column
+ names.
+ --f5log-log-year=NoneType None Override the default year
+ used for log parsing. Use
+ all four digits of the
+ year (e.g., 2022). By de‐
+ fault (None) use the year
+ from the ctime of the
+ file, or failing that the
+ current year.
+ --f5log-log-timezone=str UTC The timezone the source
+ file is in, by default
+ UTC.
+ --fixed-rows=int 1000 number of rows to check
+ for fixed width columns
+ --fixed-maxcols=int 0 max number of fixed-width
+ columns to create (0 is
+ no max)
+ --graphviz-edge-labels=bool True whether to include edge
+ labels on graphviz dia‐
+ grams
+ --grep-base-dir=NoneType None base directory for rela‐
+ tive paths opened with
+ sysopen-row
+ --html-title=str <h2>{sheet.name}</h2>
+ table header when saving
+ to html
+ --http-max-next=int 0 max next.url pages to
+ follow in http response
+ --http-req-headers=dict {} http headers to send to
+ requests
+ --http-ssl-verify=bool True verify host and certifi‐
+ cates for https
+ --npy-allow-pickle False numpy allow unpickling
+ objects (unsafe)
+ --pcap-internet=str n (y/s/n) if save_dot in‐
+ cludes all internet hosts
+ separately (y), combined
+ (s), or does not include
+ the internet (n)
+ --pdf-tables False parse PDF for tables in‐
+ stead of pages of text
+ --postgres-schema=str public The desired schema for
+ the Postgres database
+ --s3-endpoint=str alternate S3 endpoint,
+ used for local testing or
+ alternative S3-compatible
+ services
+ --s3-glob=bool True enable glob-matching for
+ S3 paths
+ --s3-version-aware False show all object versions
+ in a versioned bucket
+ --sqlite-onconnect=str sqlite statement to exe‐
+ cute after opening a con‐
+ nection
+ --xlsx-meta-columns False include columns for cell
+ objects, font colors, and
+ fill colors
+ --xlsx-color-cells=bool True color cells based on xlsx
+ source
+ --xml-parser-huge-tree=bool True allow very deep trees and
+ very long text content
+ --plt-marker=str . matplotlib.markers
+ --plot-palette=str Set3 colorbrewer palette to
+ use
+ --server-addr=str 127.0.0.1 IP address to listen for
+ commands
+ --server-port=int 0 port to listen for com‐
+ mands
+ --fixer-api-key=str API Key for api.api‐
+ layer.com/fixer
+ --fixer-cache-days=int 1 Cache days for currency
+ conversions
+ --describe-aggrs=str mean stdev numeric aggregators to
+ calculate on Describe
+ sheet
+ --hello-world=str ¡Hola mundo! shown by the hello-world
+ command
+ --incr-base=float 1.0 start value for column
+ increments
+ --ping-count=int 3 send this many pings to
+ each host
+ --ping-interval=float 0.1 wait between ping rounds,
+ in seconds
+ --regex-maxsplit=int 0 maxsplit to pass to
+ regex.split
+ --rename-cascade False cascade column renames
+ into expressions
+ --faker-locale=str en_US default locale to use for
+ Faker
+ --faker-extra-providers=NoneType None list of additional
+ Provider classes to load
+ via add_provider()
+ --faker-salt=str Use a non-empty string to
+ enable deterministic
+ fakes
+ --mailcap-mimetype=str force mimetype for
+ sysopen-mailcap
+ --unfurl-empty False if unfurl includes rows
+ for empty containers
+
+ DISPLAY OPTIONS
+ Display options can only be set via the Options Sheet or a .visidatarc
+ (see FILES).
+
+ disp_menu True show menu on top line when not
+ active
+ disp_menu_keys True show keystrokes inline in sub‐
+ menus
+ color_menu black on 68 blue color of menu items in general
+ color_menu_active 223 yellow on black
+ color of active menu items
+ color_menu_spec black on 34 green color of sheet-specific menu
+ items
+ color_menu_help black italic on 68 blue
+ color of helpbox
+ disp_menu_boxchars ││──┌┐└┘├┤ box characters to use for menus
+ disp_menu_more » command submenu indicator
+ disp_menu_push ⎘ indicator if command pushes sheet
+ onto sheet stack
+ disp_menu_input … indicator if input required for
+ command
+ disp_menu_fmt | VisiData {vd.version} | {vd.hintStatus}
+ right-side menu format string
+ disp_float_fmt {:.02f} default fmtstr to format float
+ values
+ disp_int_fmt {:d} default fmtstr to format int val‐
+ ues
+ disp_formatter generic formatter to create the text in
+ each cell (also used by text
+ savers)
+ disp_displayer generic displayer to render the text in
+ each cell
+ disp_splitwin_pct 0 height of second sheet on screen
+ disp_note_none ⌀ visible contents of a cell whose
+ value is None
+ disp_truncator … indicator that the contents are
+ only partially visible
+ disp_oddspace · displayable character for odd
+ whitespace
+ disp_more_left < header note indicating more col‐
+ umns to the left
+ disp_more_right > header note indicating more col‐
+ umns to the right
+ disp_error_val displayed contents for computa‐
+ tion exception
+ disp_ambig_width 1 width to use for unicode chars
+ marked ambiguous
+ disp_pending string to display in pending
+ cells
+ disp_note_pending : note to display for pending cells
+ disp_note_fmtexc ? cell note for an exception during
+ formatting
+ disp_note_getexc ! cell note for an exception during
+ computation
+ disp_note_typeexc ! cell note for an exception during
+ type conversion
+ color_note_pending bold green color of note in pending cells
+ color_note_type 226 yellow color of cell note for non-str
+ types in anytype columns
+ color_note_row 220 yellow color of row note on left edge
+ disp_column_sep │ separator between columns
+ disp_keycol_sep ║ separator between key columns and
+ rest of columns
+ disp_rowtop_sep │
+ disp_rowmid_sep ⁝
+ disp_rowbot_sep ⁝
+ disp_rowend_sep ║
+ disp_keytop_sep ║
+ disp_keymid_sep ║
+ disp_keybot_sep ║
+ disp_endtop_sep ║
+ disp_endmid_sep ║
+ disp_endbot_sep ║
+ disp_selected_note •
+ disp_sort_asc ↑↟⇞⇡⇧⇑ characters for ascending sort
+ disp_sort_desc ↓↡⇟⇣⇩⇓ characters for descending sort
+ color_default white on black the default fg and bg colors
+ color_default_hdr bold white on black
+ color of the column headers
+ color_bottom_hdr underline white on black
+ color of the bottom header row
+ color_current_row reverse color of the cursor row
+ color_current_col bold on 232 color of the cursor column
+ color_current_cell color of current cell, if differ‐
+ ent from color_cur‐
+ rent_row+color_current_col
+ color_current_hdr bold reverse color of the header for the cur‐
+ sor column
+ color_column_sep white on black color of column separators
+ color_key_col 81 cyan color of key columns
+ color_hidden_col 8 color of hidden columns on
+ metasheets
+ color_selected_row 215 yellow color of selected rows
+ color_clickable bold color of internally clickable
+ item
+ color_code bold white on 237 color of code sample
+ color_heading bold black on yellow
+ color of header
+ color_guide_unwritten 243 on black color of unwritten guides in
+ GuideGuide
+ disp_wrap_max_lines 3 max lines for multiline view
+ disp_wrap_break_long_words False break words longer than column
+ width in multiline
+ disp_wrap_replace_whitespace False replace whitespace with spaces in
+ multiline
+ disp_wrap_placeholder … multiline string to indicate
+ truncation
+ disp_multiline_focus True only multiline cursor row
+ color_aggregator bold 255 white on 234 black
+ color of aggregator summary on
+ bottom row
+ disp_rstatus_fmt {sheet.threadStatus} {sheet.keystrokeStatus}
+ [:longname_status]{sheet.longname}[/]
+ {sheet.nRows:9d} {sheet.rowtype}
+ {sheet.modifiedStatus}{sheet.selectedStatus}{vd.replayStatus}{vd.sidebarStatus}
+ right-side status format string
+ disp_status_fmt {sheet.sheetlist}|
+ left-side status format string
+ disp_lstatus_max 0 maximum length of left status
+ line
+ disp_status_sep │ separator between statuses
+ color_keystrokes bold white on 237 color of input keystrokes
+ color_longname_guide 237 color of command longnames
+ color_longname_status white color of command longnames
+ color_keys bold reverse color of keystrokes in help
+ color_status bold on 238 status line color
+ color_error 202 1 error message color
+ color_warning 166 15 warning message color
+ color_top_status underline top window status bar color
+ color_active_status black on 68 blue active window status bar color
+ color_inactive_status 8 on black inactive window status bar color
+ color_highlight_status black on green color of highlighted elements in
+ statusbar
+ color_working 118 5 color of system running smoothly
+ color_edit_unfocused 238 on 110 display color for unfocused input
+ in form
+ color_edit_cell 233 on 110 cell color to use when editing
+ cell
+ disp_edit_fill _ edit field fill character
+ disp_unprintable · substitute character for unprint‐
+ ables
+ disp_date_fmt %Y-%m-%d default fmtstr passed to strftime
+ for date values
+ disp_currency_fmt %.02f default fmtstr to format for cur‐
+ rency values
+ color_currency_neg red color for negative values in cur‐
+ rency displayer
+ disp_replay_play ▶ status indicator for active re‐
+ play
+ disp_replay_record ⏺ status indicator for macro record
+ color_status_replay green color of replay status indicator
+ disp_histogram ■ histogram element character
+ disp_graph_labels True show axes and legend on graph
+ disp_canvas_charset
+ ⠀⠁⠂⠃⠄⠅⠆⠇⠈⠉⠊⠋⠌⠍⠎⠏⠐⠑⠒⠓⠔⠕⠖⠗⠘⠙⠚⠛⠜⠝⠞⠟⠠⠡⠢⠣⠤⠥⠦⠧⠨⠩⠪⠫⠬⠭⠮⠯⠰⠱⠲⠳⠴⠵⠶⠷⠸⠹⠺⠻⠼⠽⠾⠿⡀⡁⡂⡃⡄⡅⡆⡇⡈⡉⡊⡋⡌⡍⡎⡏⡐⡑⡒⡓⡔⡕⡖⡗⡘⡙⡚⡛⡜⡝⡞⡟⡠⡡⡢⡣⡤⡥⡦⡧⡨⡩⡪⡫⡬⡭⡮⡯⡰⡱⡲⡳⡴⡵⡶⡷⡸⡹⡺⡻⡼⡽⡾⡿⢀⢁⢂⢃⢄⢅⢆⢇⢈⢉⢊⢋⢌⢍⢎⢏⢐⢑⢒⢓⢔⢕⢖⢗⢘⢙⢚⢛⢜⢝⢞⢟⢠⢡⢢⢣⢤⢥⢦⢧⢨⢩⢪⢫⢬⢭⢮⢯⢰⢱⢲⢳⢴⢵⢶⢷⢸⢹⢺⢻⢼⢽⢾⢿⣀⣁⣂⣃⣄⣅⣆⣇⣈⣉⣊⣋⣌⣍⣎⣏⣐⣑⣒⣓⣔⣕⣖⣗⣘⣙⣚⣛⣜⣝⣞⣟⣠⣡⣢⣣⣤⣥⣦⣧⣨⣩⣪⣫⣬⣭⣮⣯⣰⣱⣲⣳⣴⣵⣶⣷⣸⣹⣺⣻⣼⣽⣾⣿
+ charset to render 2x4 blocks on
+ canvas
+ disp_pixel_random False randomly choose attr from set of
+ pixels instead of most common
+ disp_zoom_incr 2.0 amount to multiply current zoom‐
+ level when zooming
+ color_graph_hidden 238 blue color of legend for hidden attri‐
+ bute
+ color_graph_selected bold color of selected graph points
+ color_graph_axis bold color for graph axis labels
+ disp_graph_tick_x ╵ character for graph x-axis ticks
+ color_graph_refline color for graph reference value
+ lines
+ disp_graph_reflines_x_charset ▏││▕ charset to render vertical refer‐
+ ence lines on graph
+ disp_graph_reflines_y_charset ▔──▁ charset to render horizontal ref‐
+ erence lines on graph
+ disp_graph_multiple_reflines_char ▒ char to render multiple parallel
+ reflines
+ disp_expert 0 max level of options and columns
+ to include
+ color_add_pending green color for rows pending add
+ color_change_pending reverse yellow color for cells pending modifica‐
+ tion
+ color_delete_pending red color for rows pending delete
+ disp_sidebar True whether to display sidebar
+ disp_sidebar_fmt format string for default sidebar
+ disp_sidebar_width 0 max width for sidebar
+ disp_sidebar_height 0 max height for sidebar
+ color_sidebar black on 114 blue base color of sidebar
+ color_sidebar_title black on yellow color of sidebar title
+ color_match red color for matching chars in pal‐
+ ette chooser
+ color_f5log_mon_up green color of f5log monitor status up
+ color_f5log_mon_down red color of f5log monitor status
+ down
+ color_f5log_mon_unknown blue color of f5log monitor status un‐
+ known
+ color_f5log_mon_checking magenta color of monitor status checking
+ color_f5log_mon_disabled black color of monitor status disabled
+ color_f5log_logid_alarm red color of alarms
+ color_f5log_logid_warn yellow color of warnings
+ color_f5log_logid_notice cyan color of notice
+ color_f5log_logid_info green color of info
+ color_xword_active green color of active clue
+ color_cmdpalette black on 72 base color of command palette
+ disp_cmdpal_max 10 max number of suggestions for
+ command palette
+ disp_scroll_context 0 minimum number of lines to keep
+ visible above/below cursor when
+ scrolling
+ disp_sparkline ▁▂▃▄▅▆▇ characters to display sparkline
+
+EXAMPLES
+ vd
+ launch DirSheet for current directory
+
+ vd foo.tsv
+ open the file foo.tsv in the current directory
+
+ vd -f ddw
+ open blank sheet of type ddw
+
+ vd new.tsv
+ open new blank tsv sheet named new
+
+ vd -f sqlite bar.db
+ open the file bar.db as a sqlite database
+
+ vd foo.tsv -n -f sqlite bar.db
+ open foo.tsv as tsv and bar.db as a sqlite database
+
+ vd -f sqlite foo.tsv bar.db
+ open both foo.tsv and bar.db as a sqlite database
+
+ vd -b countries.fixed -o countries.tsv
+ convert countries.fixed (in fixed width format) to countries.tsv (in tsv
+ format)
+
+ vd postgres://username:password@hostname:port/database
+ open a connection to the given postgres database
+
+ vd --play tests/pivot.vdj --replay-wait 1 --output tests/pivot.tsv
+ replay tests/pivot.vdj, waiting 1 second between commands, and output the
+ final sheet to tests/pivot.tsv
+
+ ls -l | vd -f fixed --skip 1 --header 0
+ parse the output of ls -l into usable data
+
+ ls | vd | lpr
+ interactively select a list of filenames to send to the printer
+
+ vd newfile.tsv
+ open a blank sheet named newfile if file does not exist
+
+ vd sample.xlsx +:sheet1:2:3
+ launch with sheet1 at top-of-stack, and cursor at column 2 and row 3
+
+ vd -P open-plugins
+ preplay longname open-plugins before starting the session
+
+FILES
+ At the start of every session, VisiData looks for $HOME/.visidatarc, and
+ calls Python exec() on its contents if it exists. For example:
+
+ options.min_memory_mb=100 # stop processing without 100MB free
+
+ bindkey('0', 'go-leftmost') # alias '0' to go to first column, like vim
+
+ def median(values):
+     L = sorted(values)
+     return L[len(L)//2]
+
+ vd.aggregator('median', median)
+
+ Functions defined in .visidatarc are available in python expressions
+ (e.g. in derived columns).
+
+SUPPORTED SOURCES
+ Core VisiData includes these sources:
+
+ tsv (tab-separated value)
+ Plain and simple. VisiData writes tsv format by default. See the
+ --delimiter option.
+
+ csv (comma-separated value)
+ .csv files are a scourge upon the earth, and still regrettably
+ common.
+ See the --csv-dialect, --csv-delimiter, --csv-quotechar, and
+ --csv-skipinitialspace options.
+ Accepted dialects are excel-tab, unix, and excel.
+
+ fixed (fixed width text)
+ Columns are autodetected from the first 1000 rows (adjustable with
+ --fixed-rows).
+
+ json (single object) and jsonl/ndjson/ldjson (one object per line).
+ Cells containing lists (e.g. [3]) or dicts ({3}) can be expanded
+ into new columns with ( and unexpanded with ).
+
+ sqlite
+ May include multiple tables. The initial sheet is the table
+ directory; Enter loads the entire table into memory. z^S saves
+ modifications to source.
+
+ URL schemes are also supported:
+ http (requires requests); can be used as transport for another
+ filetype
+
+ For a list of all remaining formats supported by VisiData, see
+ https://visidata.org/formats.
+
+ In addition, .zip, .gz, .bz2, .xz, .zstd, and .zst files are decompressed
+ on the fly.
+
+AUTHOR
+ VisiData was made by Saul Pwanson <vd@saul.pw>.
+
+Linux/MacOS October 13, 2024 Linux/MacOS
+
+ validate
if needed.
+ With the Frictionless extract
command you can extract data from a file or a dataset.
By default, it outputs metadata visually formatted:
+ +frictionless extract tables/*.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ chunk1 │ table │ tables/chunk1.csv │
+│ chunk2 │ table │ tables/chunk2.csv │
+└────────┴───────┴───────────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ chunk1
+┏━━━━┳━━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━╇━━━━━━━━━┩
+│ 1 │ english │
+└────┴─────────┘
+ chunk2
+┏━━━━┳━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━╇━━━━━━━━┩
+│ 2 │ 中国人 │
+└────┴────────┘
+
+ It's possible to output as YAML
or JSON
, for example:
frictionless extract tables/*.csv --yaml
+
+
+chunk1:
+- id: 1
+ name: english
+chunk2:
+- id: 2
+ name: 中国人
+
+ validate
if needed.
+ frictionless@5.5
as a feature preview and request for comments. The implementation is raw and doesn't cover many edge cases.
+ Indexing a resource, in Frictionless terms, means loading a data table into a database. Let's explore how this feature works in different modes.
+pip install frictionless[sql]
+
+
+ This mode is supported for any database that is supported by sqlalchemy
. Under the hood, Frictionless infers a Table Schema and populates the database table as it normally reads data. This means that type errors are replaced by null
values, so indexing is guaranteed to finish successfully for any data, even highly invalid data.
frictionless index table.csv --database sqlite:///index/project.db
+frictionless extract sqlite:///index/project.db --table table --json
+
+
+──────────────────────────────────── Index ─────────────────────────────────────
+
+[table] Indexed 3 rows in 0.21 seconds
+──────────────────────────────────── Result ────────────────────────────────────
+Succesefully indexed 1 tables
+{
+ "project": [
+ {
+ "id": 1,
+ "name": "english"
+ },
+ {
+ "id": 2,
+ "name": "中国人"
+ }
+ ]
+}
+
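The behaviour described for normal mode (a cell that cannot be cast to its field type becomes null instead of aborting the load) can be sketched in plain Python. This is only an illustration of the idea, not Frictionless code: `cast_or_null` and the `casters` list are hypothetical names, and the sample rows are invented.

```python
def cast_or_null(cell, caster):
    """Cast a raw cell with the given caster; return None (null) on a type error."""
    try:
        return caster(cell)
    except (TypeError, ValueError):
        return None


# Hypothetical inferred schema: first field is an integer, second is a string.
casters = [int, str]
rows = [["1", "english"], ["two", "中国人"]]

indexed = [[cast_or_null(cell, caster) for cell, caster in zip(row, casters)]
           for row in rows]
print(indexed)  # [[1, 'english'], [None, '中国人']]
```

Because every failure is turned into a null, the loop always runs to completion, which is the guarantee the paragraph above describes.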
+ sqlite3@3.34+
command to be available.
+ Fast mode is supported for SQLite and PostgreSQL databases. It infers a Table Schema from a data sample and indexes data using COPY
in PostgreSQL and .import
in SQLite. For big data files this mode will be 10-30x faster than normal indexing, but the speed comes at a price: if there is invalid data, the indexing will fail.
frictionless index table.csv --database sqlite:///index/project.db --fast
+frictionless extract sqlite:///index/project.db --table table --json
+
+
+──────────────────────────────────── Index ─────────────────────────────────────
+
+[table] Indexed 30 bytes in 0.368 seconds
+──────────────────────────────────── Result ────────────────────────────────────
+Succesefully indexed 1 tables
+{
+ "project": [
+ {
+ "id": 1,
+ "name": "english"
+ },
+ {
+ "id": 2,
+ "name": "中国人"
+ }
+ ]
+}
+
+ To ensure that the data will be successfully indexed, it's possible to use the fallback
option. If fast indexing fails, Frictionless will start over in normal mode and finish the process successfully.
frictionless index table.csv --database sqlite:///index/project.db --name table --fast --fallback
+
+
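The fast-then-fallback control flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the Frictionless implementation: `strict_load` stands in for fast mode (it fails on the first invalid cell), `tolerant_load` stands in for normal mode (it replaces invalid cells with null), and all three function names are hypothetical.

```python
def strict_load(cells):
    """Stand-in for fast mode: any uncastable cell aborts the whole load."""
    return [int(cell) for cell in cells]


def tolerant_load(cells):
    """Stand-in for normal mode: uncastable cells become None."""
    loaded = []
    for cell in cells:
        try:
            loaded.append(int(cell))
        except ValueError:
            loaded.append(None)
    return loaded


def index_with_fallback(cells, fast=strict_load, normal=tolerant_load):
    """Try the fast loader first; on failure, start over with the normal one."""
    try:
        return "fast", fast(cells)
    except ValueError:
        return "normal", normal(cells)


print(index_with_fallback(["1", "2"]))  # ('fast', [1, 2])
print(index_with_fallback(["1", "x"]))  # ('normal', [1, None])
```

The fallback restarts from the beginning rather than resuming, mirroring the "start over in normal mode" wording above.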
+ Another option is to provide a path to the QSV binary. In this case, the initial schema inference is based on the whole data file, which guarantees that the table is valid type-wise:
+ + + +frictionless index table.csv --database sqlite:///index/project.db --name table --fast --qsv qsv_path
+
+
+ describe
and list
commands: if datapackage.json
is not provided, describe
will load a sample from every tabular data file in a dataset and infer a schema while list
is a very lean and quick command operating only with available metadata and not touching actual data files.
+ With the Frictionless list
command you can get a list of resources from a data source. For more detailed output, see the describe
command.
By default, it outputs metadata visually formatted:
+frictionless list tables/*.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
+│ chunk1 │ table │ tables/chunk1.csv │
+│ chunk2 │ table │ tables/chunk2.csv │
+└────────┴───────┴───────────────────┘
+It's possible to output as YAML
or JSON
, for example:
frictionless list tables/*.csv --yaml
+
+
+- name: chunk1
+ type: table
+ path: tables/chunk1.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+- name: chunk2
+ type: table
+ path: tables/chunk2.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+ The command-line interface is a vital part of the Frictionless Framework. While working within Python provides more flexibility, the CLI is the easiest way to interact with Frictionless.
+To install the package please follow the Getting Started guide. Usually, a simple installation using Pip or Anaconda will install the frictionless
binary on your computer so you don't need to install the CLI separately.
The frictionless
binary requires providing a command like describe
or validate
:
frictionless describe # to describe your data
+frictionless explore # to explore your data
+frictionless extract # to extract your data
+frictionless index # to index your data
+frictionless list # to list your data
+frictionless publish # to publish your data
+frictionless query # to query your data
+frictionless script # to script your data
+frictionless validate # to validate your data
+frictionless --help # to get the list of commands
+frictionless --version # to get the version
+
+
+ All the arguments for the main CLI commands are the same as they are in Python. You can read the Guides and use almost all the information from there within the command line. There is an important difference in how arguments are written (note the dashes):
+Python: validate('data/table.csv', limit_errors=1)
+CLI: $ frictionless validate data/table.csv --limit-errors 1
+
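The naming rule above (underscores in Python keyword arguments become dashes in CLI options) can be sketched with a small helper. `to_cli_args` is a hypothetical function written for illustration; it is not part of Frictionless, and it assumes that every option follows the same underscore-to-dash convention and that boolean options are bare flags.

```python
def to_cli_args(command, *paths, **options):
    """Build a CLI argument list from Python-style keyword options."""
    args = ["frictionless", command, *paths]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")  # limit_errors -> --limit-errors
        if value is True:
            args.append(flag)                 # boolean options become bare flags
        elif value is not False and value is not None:
            args.extend([flag, str(value)])   # other options take a value
    return args


print(" ".join(to_cli_args("validate", "data/table.csv", limit_errors=1)))
# frictionless validate data/table.csv --limit-errors 1
```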
+To get help for a command and its arguments you can use the help flag with the command:
+ +frictionless describe --help # to get help for describe
+frictionless extract --help # to get help for extract
+frictionless validate --help # to get help for validate
+frictionless transform --help # to get help for transform
+
+
+ Usually, Frictionless commands return pretty-formatted tabular data, as extract
or validate
do. For the describe
command you get metadata back, and you can choose the format in which it is returned:
frictionless describe # default YAML with a commented front-matter
+frictionless describe --yaml # standard YAML
+frictionless describe --json # standard JSON
+
+
+ The Frictionless CLI should not fail with an internal Python error and a traceback (a long listing of related code). If you see something like this, please create an issue in the Issue Tracker.
+To debug a problem please use:
+ + + +frictionless command --debug
+
+
+ With the publish
command you can publish your dataset to a data publishing platform like CKAN:
frictionless publish data/tables/*.csv --target http://ckan:5000/dataset/my-best --title "My best dataset"
+
+It will ask for an API Key to upload your metadata and data. As a result:
+validate
if needed.
+ With the query
command you can explore tabular files within a SQLite database.
pip install frictionless[sql]
+pip install frictionless[sql,zenodo] # for examples in this tutorial
+
+
+ frictionless query https://zenodo.org/record/3977957
+
+validate
if needed.
+ With the script
command you can explore tabular files with Pandas using a single console command.
pip install frictionless[sql]
+pip install frictionless[sql,zenodo] # for examples in this tutorial
+
+
+ frictionless script https://zenodo.org/record/3977957
+
+With the validate
command you can validate your tabular files (individual files or a whole dataset). For example:
frictionless validate table.csv invalid.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃ status ┃
+┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ table │ table │ table.csv │ VALID │
+│ invalid │ table │ invalid.csv │ INVALID │
+└─────────┴───────┴─────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3 │ blank-label │ Label in the header in field at position │
+│ │ │ │ "3" is blank │
+│ None │ 4 │ duplicate-label │ Label "name" in the header at position "4" │
+│ │ │ │ is duplicated to a label: at position "2" │
+│ 2 │ 3 │ missing-cell │ Row at position "2" has a missing cell in │
+│ │ │ │ field "field3" at position "3" │
+│ 2 │ 4 │ missing-cell │ Row at position "2" has a missing cell in │
+│ │ │ │ field "name2" at position "4" │
+│ 3 │ 3 │ missing-cell │ Row at position "3" has a missing cell in │
+│ │ │ │ field "field3" at position "3" │
+│ 3 │ 4 │ missing-cell │ Row at position "3" has a missing cell in │
+│ │ │ │ field "name2" at position "4" │
+│ 4 │ None │ blank-row │ Row at position "4" is completely blank │
+│ 5 │ 5 │ extra-cell │ Row at position "5" has an extra value in │
+│ │ │ │ field at position "5" │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+
+ Name | +Value | +
---|---|
Type | +cell-error | +
Title | +Cell Error | +
Description | +Cell Error | +
Template | +Cell Error | +
Tags | +#table #row #cell | +
Name | +Value | +
---|---|
Type | +extra-cell | +
Title | +Extra Cell | +
Description | +This row has more values compared to the header row (the first row in the data source). A key concept is that all the rows in tabular data must have the same number of columns. | +
Template | +Row at position "{rowNumber}" has an extra value in field at position "{fieldNumber}" | +
Tags | +#table #row #cell | +
Name | +Value | +
---|---|
Type | +missing-cell | +
Title | +Missing Cell | +
Description | +This row has fewer values than the header row (the first row in the data source). A key concept is that all the rows in tabular data must have the same number of columns. | +
Template | +Row at position "{rowNumber}" has a missing cell in field "{fieldName}" at position "{fieldNumber}" | +
Tags | +#table #row #cell | +
Name | +Value | +
---|---|
Type | +type-error | +
Title | +Type Error | +
Description | +The value does not match the schema type and format for this field. | +
Template | +Type error in the cell "{cell}" in row "{rowNumber}" and field "{fieldName}" at position "{fieldNumber}": {note} | +
Tags | +#table #row #cell | +
Name | +Value | +
---|---|
Type | +constraint-error | +
Title | +Constraint Error | +
Description | +A field value does not conform to a constraint. | +
Template | +The cell "{cell}" in row at position "{rowNumber}" and field "{fieldName}" at position "{fieldNumber}" does not conform to a constraint: {note} | +
Tags | +#table #row #cell | +
Name | +Value | +
---|---|
Type | +unique-error | +
Title | +Unique Error | +
Description | +This field is a unique field but it contains a value that has been used in another row. | +
Template | +Row at position "{rowNumber}" has unique constraint violation in field "{fieldName}" at position "{fieldNumber}": {note} | +
Tags | +#table #row #cell | +
Name | +Value | +
---|---|
Type | +truncated-value | +
Title | +Truncated Value | +
Description | +The value is possibly truncated. | +
Template | +The cell {cell} in row at position {rowNumber} and field {fieldName} at position {fieldNumber} has an error: {note} | +
Tags | +#table #row #cell | +
Name | +Value | +
---|---|
Type | +forbidden-value | +
Title | +Forbidden Value | +
Description | +The value is forbidden. | +
Template | +The cell {cell} in row at position {rowNumber} and field {fieldName} at position {fieldNumber} has an error: {note} | +
Tags | +#table #row #cell | +
Name | +Value | +
---|---|
Type | +sequential-value | +
Title | +Sequential Value | +
Description | +The value is not sequential. | +
Template | +The cell {cell} in row at position {rowNumber} and field {fieldName} at position {fieldNumber} has an error: {note} | +
Tags | +#table #row #cell | +
Name | +Value | +
---|---|
Type | +ascii-value | +
Title | +Ascii Value | +
Description | +The cell contains non-ascii characters. | +
Template | +The cell {cell} in row at position {rowNumber} and field {fieldName} at position {fieldNumber} has an error: {note} | +
Tags | +#table #row #cell | +
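The extra-cell and missing-cell descriptions above rest on one rule: every row must have as many cells as the header has labels. A minimal sketch of that check, using the two message templates listed in the tables above, might look like this. `check_row` is a hypothetical helper, not the Frictionless implementation, and the row number is passed in by the caller so that no numbering convention is assumed.

```python
TEMPLATES = {
    "extra-cell": 'Row at position "{rowNumber}" has an extra value '
                  'in field at position "{fieldNumber}"',
    "missing-cell": 'Row at position "{rowNumber}" has a missing cell '
                    'in field "{fieldName}" at position "{fieldNumber}"',
}


def check_row(labels, cells, row_number):
    """Compare one row against the header labels and return error messages."""
    errors = []
    # Fields the header expects but the row does not provide.
    for field_number in range(len(cells) + 1, len(labels) + 1):
        errors.append(TEMPLATES["missing-cell"].format(
            rowNumber=row_number,
            fieldName=labels[field_number - 1],
            fieldNumber=field_number,
        ))
    # Cells the row provides beyond the last header label.
    for field_number in range(len(labels) + 1, len(cells) + 1):
        errors.append(TEMPLATES["extra-cell"].format(
            rowNumber=row_number, fieldNumber=field_number))
    return errors


for message in check_row(["id", "name"], ["1", "x", "y"], row_number=5):
    print(message)
# Row at position "5" has an extra value in field at position "3"
```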
Cell error representation. + +A base class for all the errors related to the cell value.
+(*, note: str, cells: List[str], row_number: int, cell: str, field_name: str, field_number: int) -> None
++ Cell where the error occurred. +
+str
++ Name of the field that has an error. +
+str
++ Index of the field that has an error. +
+int
+Create an error from a cell
+(row: Row, *, note: str, field_name: str)
+Name | +Value | +
---|---|
Type | +data-error | +
Title | +Data Error | +
Description | +There is a data error. | +
Template | +Data error: {note} | +
Error representation. + +It is a base class from which other error classes are +derived.
+(*, note: str) -> None
+Name | +Value | +
---|---|
Type | +file-error | +
Title | +File Error | +
Description | +There is a file error. | +
Template | +General file error: {note} | +
Tags | +#file | +
Name | +Value | +
---|---|
Type | +hash-count | +
Title | +Hash Count Error | +
Description | +This error can happen if the data is corrupted. | +
Template | +The data source does not match the expected hash count: {note} | +
Tags | +#file | +
Name | +Value | +
---|---|
Type | +byte-count | +
Title | +Byte Count Error | +
Description | +This error can happen if the data is corrupted. | +
Template | +The data source does not match the expected byte count: {note} | +
Tags | +#file | +
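Both file errors above compare a property of the bytes actually read with an expected value. The following sketch shows that comparison using the two templates from the tables above. `check_file` is a hypothetical helper, the wording of the note is invented, and the use of SHA-256 is an assumption made for the example; the source does not say which hashing algorithm is expected.

```python
import hashlib

TEMPLATES = {
    "byte-count": "The data source does not match the expected byte count: {note}",
    "hash-count": "The data source does not match the expected hash count: {note}",
}


def check_file(data, expected_bytes=None, expected_hash=None):
    """Return (type, message) pairs for every failed integrity check."""
    errors = []
    if expected_bytes is not None and len(data) != expected_bytes:
        note = f'expected is "{expected_bytes}" and actual is "{len(data)}"'
        errors.append(("byte-count", TEMPLATES["byte-count"].format(note=note)))
    if expected_hash is not None:
        actual = hashlib.sha256(data).hexdigest()  # assumption: SHA-256 hex digest
        if actual != expected_hash:
            note = f'expected is "{expected_hash}" and actual is "{actual}"'
            errors.append(("hash-count", TEMPLATES["hash-count"].format(note=note)))
    return errors


data = b"id,name\n1,english\n"
print(check_file(data, expected_bytes=len(data)))  # []
print(check_file(data, expected_bytes=1)[0][0])    # byte-count
```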
Error representation. + +It is a base class from which other error classes are +derived.
+(*, note: str) -> None
+Name | +Value | +
---|---|
Type | +header-error | +
Title | +Header Error | +
Description | +Cell Error | +
Template | +Cell Error | +
Tags | +#table #header | +
Name | +Value | +
---|---|
Type | +blank-header | +
Title | +Blank Header | +
Description | +This header is empty. A header should contain at least one value. | +
Template | +Header is completely blank | +
Tags | +#table #header | +
Header error representation. + +A base class for all the errors related to the resource header.
+(*, note: str, labels: List[str], row_numbers: List[int]) -> None
++ List of labels that have errors. +
+List[str]
++ Row numbers where the error occurred. +
+List[int]
+Name | +Value | +
---|---|
Type | +label-error | +
Title | +Label Error | +
Description | +Label Error | +
Template | +Label Error | +
Tags | +#table #header #label | +
Name | +Value | +
---|---|
Type | +extra-label | +
Title | +Extra Label | +
Description | +The header of the data source contains a label that does not exist in the provided schema. | +
Template | +There is an extra label "{label}" in header at position "{fieldNumber}" | +
Tags | +#table #header #label | +
Name | +Value | +
---|---|
Type | +missing-label | +
Title | +Missing Label | +
Description | +Based on the schema there should be a label that is missing in the data's header. | +
Template | +There is a missing label in the header's field "{fieldName}" at position "{fieldNumber}" | +
Tags | +#table #header #label | +
Name | +Value | +
---|---|
Type | +blank-label | +
Title | +Blank Label | +
Description | +A label in the header row is missing a value. Label should be provided and not be blank. | +
Template | +Label in the header in field at position "{fieldNumber}" is blank | +
Tags | +#table #header #label | +
Name | +Value | +
---|---|
Type | +duplicate-label | +
Title | +Duplicate Label | +
Description | +Two columns in the header row have the same value. Column names should be unique. | +
Template | +Label "{label}" in the header at position "{fieldNumber}" is duplicated to a label: {note} | +
Tags | +#table #header #label | +
Name | +Value | +
---|---|
Type | +incorrect-label | +
Title | +Incorrect Label | +
Description | +One of the data source's header labels does not match the field name defined in the schema. | +
Template | +Label "{label}" in field {fieldName} at position "{fieldNumber}" does not match the field name in the schema | +
Tags | +#table #header #label | +
Label error representation. + +A base class for all the errors related to the labels of the columns/fields.
+(*, note: str, labels: List[str], row_numbers: List[int], label: str, field_name: str, field_number: int) -> None
++ Label of the field that has an error. +
+str
++ Name of the field that has an error. +
+str
++ Index of the field that has an error. +
+int
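To make the five label errors above easier to tell apart, here is a small sketch that compares a header's labels with the field names of a schema and reports which error type applies at which position. The function `check_labels` is hypothetical and its rules are a simplified reading of the descriptions above; the framework's own rules may differ in detail.

```python
from typing import Dict, List, Tuple


def check_labels(labels: List[str], field_names: List[str]) -> List[Tuple[str, int]]:
    """Return (error_type, position) pairs; positions are 1-based."""
    errors: List[Tuple[str, int]] = []
    seen: Dict[str, int] = {}
    for position, label in enumerate(labels, start=1):
        if position > len(field_names):
            # More labels than fields in the schema.
            errors.append(("extra-label", position))
            continue
        if label == "":
            errors.append(("blank-label", position))
            continue
        if label in seen:
            errors.append(("duplicate-label", position))
        elif label != field_names[position - 1]:
            errors.append(("incorrect-label", position))
        seen.setdefault(label, position)
    # Fewer labels than fields in the schema.
    for position in range(len(labels) + 1, len(field_names) + 1):
        errors.append(("missing-label", position))
    return errors


label_errors = check_labels(["id", "", "id", "extra"], ["id", "name", "age"])
print(label_errors)
```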
+Name | +Value | +
---|---|
Type | +metadata-error | +
Title | +Metadata Error | +
Description | +There is a metadata error. | +
Template | +Metadata error: {note} | +
Name | +Value | +
---|---|
Type | +catalog-error | +
Title | +Catalog Error | +
Description | +A validation cannot be processed. | +
Template | +The data catalog has an error: {note} | +
Name | +Value | +
---|---|
Type | +dataset-error | +
Title | +Dataset Error | +
Description | +A validation cannot be processed. | +
Template | +The dataset has an error: {note} | +
Name | +Value | +
---|---|
Type | +checklist-error | +
Title | +Checklist Error | +
Description | +Provided checklist is not valid. | +
Template | +Checklist is not valid: {note} | +
Name | +Value | +
---|---|
Type | +check-error | +
Title | +Check Error | +
Description | +Provided check is not valid. | +
Template | +Check is not valid: {note} | +
Name | +Value | +
---|---|
Type | +detector-error | +
Title | +Detector Error | +
Description | +Provided detector is not valid. | +
Template | +Detector is not valid: {note} | +
Name | +Value | +
---|---|
Type | +dialect-error | +
Title | +Dialect Error | +
Description | +Provided dialect is not valid. | +
Template | +Dialect is not valid: {note} | +
Name | +Value | +
---|---|
Type | +control-error | +
Title | +Control Error | +
Description | +Provided control is not valid. | +
Template | +Control is not valid: {note} | +
Name | +Value | +
---|---|
Type | +inquiry-error | +
Title | +Inquiry Error | +
Description | +Provided inquiry is not valid. | +
Template | +Inquiry is not valid: {note} | +
Name | +Value | +
---|---|
Type | +inquiry-task-error | +
Title | +Inquiry Task Error | +
Description | +Provided inquiry task is not valid. | +
Template | +Inquiry task is not valid: {note} | +
Name | +Value | +
---|---|
Type | +package-error | +
Title | +Package Error | +
Description | +A validation cannot be processed. | +
Template | +The data package has an error: {note} | +
Name | +Value | +
---|---|
Type | +pipeline-error | +
Title | +Pipeline Error | +
Description | +Provided pipeline is not valid. | +
Template | +Pipeline is not valid: {note} | +
Name | +Value | +
---|---|
Type | +step-error | +
Title | +Step Error | +
Description | +Provided step is not valid. | +
Template | +Step is not valid: {note} | +
Name | +Value | +
---|---|
Type | +report-error | +
Title | +Report Error | +
Description | +Provided report is not valid. | +
Template | +Report is not valid: {note} | +
Name | +Value | +
---|---|
Type | +report-task-error | +
Title | +Report Task Error | +
Description | +Provided report task is not valid. | +
Template | +Report task is not valid: {note} | +
Name | +Value | +
---|---|
Type | +schema-error | +
Title | +Schema Error | +
Description | +Provided schema is not valid. | +
Template | +Schema is not valid: {note} | +
Name | +Value | +
---|---|
Type | +field-error | +
Title | +Field Error | +
Description | +Provided field is not valid. | +
Template | +Field is not valid: {note} | +
Name | +Value | +
---|---|
Type | +stats-error | +
Title | +Stats Error | +
Description | +Stats object has an error. | +
Template | +Stats object has an error: {note} | +
Error representation. + +It is a base class from which the other error classes are derived.
+(*, note: str) -> None
+Name | +Value | +
---|---|
Type | +resource-error | +
Title | +Resource Error | +
Description | +A validation cannot be processed. | +
Template | +The data resource has an error: {note} | +
Name | +Value | +
---|---|
Type | +source-error | +
Title | +Source Error | +
Description | +Data reading error because of unsupported or inconsistent contents. | +
Template | +The data source has not supported or has inconsistent contents: {note} | +
Name | +Value | +
---|---|
Type | +scheme-error | +
Title | +Scheme Error | +
Description | +Data reading error because of incorrect scheme. | +
Template | +The data source could not be successfully loaded: {note} | +
Name | +Value | +
---|---|
Type | +format-error | +
Title | +Format Error | +
Description | +Data reading error because of incorrect format. | +
Template | +The data source could not be successfully parsed: {note} | +
Name | +Value | +
---|---|
Type | +encoding-error | +
Title | +Encoding Error | +
Description | +Data reading error because of an encoding problem. | +
Template | +The data source could not be successfully decoded: {note} | +
Name | +Value | +
---|---|
Type | +compression-error | +
Title | +Compression Error | +
Description | +Data reading error because of a decompression problem. | +
Template | +The data source could not be successfully decompressed: {note} | +
Error representation. + +It is a base class from which the other error classes are derived.
+(*, note: str) -> None
+Name | +Value | +
---|---|
Type | +row-error | +
Title | +Row Error | +
Description | +Row Error | +
Template | +Row Error | +
Tags | +#table #row | +
Name | +Value | +
---|---|
Type | +blank-row | +
Title | +Blank Row | +
Description | +This row is empty. A row should contain at least one value. | +
Template | +Row at position "{rowNumber}" is completely blank | +
Tags | +#table #row | +
Name | +Value | +
---|---|
Type | +primary-key | +
Title | +PrimaryKey Error | +
Description | +Values in the primary key fields should be unique for every row | +
Template | +Row at position "{rowNumber}" violates the primary key: {note} | +
Tags | +#table #row | +
Name | +Value | +
---|---|
Type | +foreign-key | +
Title | +ForeignKey Error | +
Description | +Values in the foreign key fields should exist in the reference table | +
Template | +Row at position "{rowNumber}" violates the foreign key: {note} | +
Tags | +#table #row | +
Name | +Value | +
---|---|
Type | +duplicate-row | +
Title | +Duplicate Row | +
Description | +The row is duplicated. | +
Template | +Row at position {rowNumber} is duplicated: {note} | +
Tags | +#table #row | +
Name | +Value | +
---|---|
Type | +row-constraint | +
Title | +Row Constraint | +
Description | +The value does not conform to the row constraint. | +
Template | +The row at position {rowNumber} has an error: {note} | +
Tags | +#table #row | +
Row error representation. + +A base class for all the errors related to a row of the +tabular data.
+(*, note: str, cells: List[str], row_number: int) -> None
++ Values of all the cells in the row that has an error. +
+List[str]
++ Index of the row that has an error. +
+int
+Create an error from a row
+(row: Row, *, note: str)
+Row error representation. + +A base class for all the errors related to a row of the +tabular data.
+(*, note: str, cells: List[str], row_number: int, field_names: List[str], field_cells: List[str], reference_name: str, reference_field_names: List[str]) -> None
++ Keys in the resource target column. +
+List[str]
++ Cells not found in the lookup table. +
+List[str]
++ Name of the lookup table the keys were searched on +
+str
++ Key names in the lookup table defined as foreign keys in the resource. +
+List[str]
+Create a foreign-key-error from a row
+(row: Row, *, note: str, field_names: List[str], field_values: List[Any], reference_name: str, reference_field_names: List[str])
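The `primary-key` and `foreign-key` descriptions above boil down to two set lookups: key values must be unique within the table, and foreign key values must exist in the reference table. The sketch below shows both checks on plain dictionaries. The function names and sample data are hypothetical, and the indexes returned are 0-based list indexes rather than the framework's row positions.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def primary_key_violations(rows: Sequence[Row], key_fields: Sequence[str]) -> List[int]:
    """Return indexes of rows whose key repeats the key of an earlier row."""
    seen = set()
    violations = []
    for index, row in enumerate(rows):
        key = tuple(row[name] for name in key_fields)
        if key in seen:
            violations.append(index)
        seen.add(key)
    return violations


def foreign_key_violations(
    rows: Sequence[Row],
    field_names: Sequence[str],
    reference_rows: Sequence[Row],
    reference_field_names: Sequence[str],
) -> List[int]:
    """Return indexes of rows whose key is absent from the reference table."""
    known = {tuple(row[name] for name in reference_field_names) for row in reference_rows}
    return [
        index
        for index, row in enumerate(rows)
        if tuple(row[name] for name in field_names) not in known
    ]


cities = [
    {"id": 1, "country": "fr"},
    {"id": 2, "country": "de"},
    {"id": 2, "country": "xx"},  # repeats id 2 and points at an unknown country
]
countries = [{"code": "fr"}, {"code": "de"}]

pk_errors = primary_key_violations(cities, ["id"])
fk_errors = foreign_key_violations(cities, ["country"], countries, ["code"])
print(pk_errors, fk_errors)
```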
+Name | +Value | +
---|---|
Type | +table-error | +
Title | +Table Error | +
Description | +There is a table error. | +
Template | +General table error: {note} | +
Tags | +#table | +
Name | +Value | +
---|---|
Type | +field-count | +
Title | +Field Count Error | +
Description | +This error can happen if the data is corrupted. | +
Template | +The data source does not match the expected field count: {note} | +
Tags | +#table | +
Name | +Value | +
---|---|
Type | +row-count | +
Title | +Row Count Error | +
Description | +This error can happen if the data is corrupted. | +
Template | +The data source does not match the expected row count: {note} | +
Tags | +#table | +
Name | +Value | +
---|---|
Type | +table-dimensions | +
Title | +Table dimensions error | +
Description | +This error can happen if the data is corrupted. | +
Template | +The data source does not have the required dimensions: {note} | +
Tags | +#table | +
Name | +Value | +
---|---|
Type | +deviated-value | +
Title | +Deviated Value | +
Description | +The value is deviated. | +
Template | +There is a possible error because the value is deviated: {note} | +
Tags | +#table | +
Name | +Value | +
---|---|
Type | +deviated-cell | +
Title | +Deviated cell | +
Description | +The cell is deviated. | +
Template | +There is a possible error because the cell is deviated: {note} | +
Tags | +#table | +
Name | +Value | +
---|---|
Type | +required-value | +
Title | +Required Value | +
Description | +The required values are missing. | +
Template | +Required values not found: {note} | +
Tags | +#table | +
Error representation. + +It is a base class from which the other error classes are derived.
+(*, note: str) -> None
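The `deviated-value` error above flags values that lie unusually far from the rest of a column. One common way to express that idea is "more than N standard deviations away from the mean", sketched below with the standard `statistics` module. The function `find_deviated`, the default interval of 3 and the choice of the mean are assumptions for illustration; the framework's check may be configured differently.

```python
import statistics
from typing import List, Sequence


def find_deviated(values: Sequence[float], interval: float = 3.0) -> List[int]:
    """Return 0-based indexes of values farther than interval * stdev from the mean."""
    if len(values) < 2:
        return []  # a sample standard deviation needs at least two values
    mean = statistics.mean(values)
    spread = statistics.stdev(values) * interval
    return [index for index, value in enumerate(values) if abs(value - mean) > spread]


values = [10] * 10 + [11] * 5 + [9] * 5 + [100]
deviated = find_deviated(values)
print(deviated)
```

Note that with very few values a single outlier inflates the standard deviation so much that it may not be flagged, which is why the sample here has 21 values.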
+AnyField provides an ability to skip any cell parsing. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], [1], ['1']]
+rows = extract(data, schema=Schema(fields=[fields.AnyField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': 1}, {'name': '1'}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+The field contains data that is a valid JSON array. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['["value1", "value2"]']]
+rows = extract(data, schema=Schema(fields=[fields.ArrayField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': ['value1', 'value2']}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None, array_item: Optional[Dict[str, Any]] = NOTHING) -> None
++ A dictionary that specifies the type and other constraints for the + data that will be read in this data type field. +
+Optional[Dict[str, Any]]
+The field contains boolean (true/false) data.
+In the physical representations of data where boolean values are represented with strings, the values set in trueValues and falseValues are to be cast to their logical representation as booleans. trueValues and falseValues are arrays which can be customised to user need. The default values for these are in the additional properties section below. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['true'], ['false']]
+rows = extract(data, schema=Schema(fields=[fields.BooleanField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': True}, {'name': False}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None, true_values: List[str] = NOTHING, false_values: List[str] = NOTHING) -> None
++ It defines the values to be read as true values while reading data. The default + true values are ["true", "True", "TRUE", "1"]. +
+List[str]
++ It defines the values to be read as false values while reading data. The default + false values are ["false", "False", "FALSE", "0"]. +
+List[str]
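The casting rule described above for `true_values` and `false_values` can be pictured as two membership tests. The following sketch uses the default lists quoted above; the function `cast_boolean` is hypothetical and returns `None` for an unrecognised cell, which is a choice made for this example rather than a statement about the framework's behaviour.

```python
from typing import List, Optional

# Defaults as documented above.
DEFAULT_TRUE_VALUES = ["true", "True", "TRUE", "1"]
DEFAULT_FALSE_VALUES = ["false", "False", "FALSE", "0"]


def cast_boolean(
    cell: str,
    true_values: Optional[List[str]] = None,
    false_values: Optional[List[str]] = None,
) -> Optional[bool]:
    """Cast a string cell to a boolean, or return None if it matches neither list."""
    true_values = DEFAULT_TRUE_VALUES if true_values is None else true_values
    false_values = DEFAULT_FALSE_VALUES if false_values is None else false_values
    if cell in true_values:
        return True
    if cell in false_values:
        return False
    return None


print(cast_boolean("TRUE"), cast_boolean("0"), cast_boolean("yes"))
print(cast_boolean("yes", true_values=["yes"], false_values=["no"]))
```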
+A date without a time (by default in ISO8601 format). Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['2022-08-22']]
+rows = extract(data, schema=Schema(fields=[fields.DateField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': datetime.date(2022, 8, 22)}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+A date with a time (by default in ISO8601 format). Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['2022-08-22T12:00:00']]
+rows = extract(data, schema=Schema(fields=[fields.DatetimeField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': datetime.datetime(2022, 8, 22, 12, 0)}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+A duration of time. We follow the definition of XML Schema duration datatype directly +and that definition is implicitly inlined here. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['P1Y']]
+rows = extract(data, schema=Schema(fields=[fields.DurationField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': isodate.duration.Duration(0, 0, 0, years=1, months=0)}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+The field contains a JSON object according to GeoJSON or TopoJSON spec. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['{"geometry": null, "type": "Feature", "properties": {"k": "v"}}']]
+rows = extract(data, schema=Schema(fields=[fields.GeojsonField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': {'geometry': None, 'type': 'Feature', 'properties': {'k': 'v'}}}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+The field contains data describing a geographic point. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ["180, -90"]]
+rows = extract(data, schema=Schema(fields=[fields.GeopointField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': [180.0, -90.0]}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+The field contains integers - that is whole numbers. Integer values are indicated in the standard way for any valid integer. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['1'], ['2'], ['3']]
+rows = extract(data, schema=Schema(fields=[fields.IntegerField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': 1}, {'name': 2}, {'name': 3}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None, bare_number: bool = True) -> None
++ It specifies that the value is a bare number. If True, the pattern that + removes non-digit characters is not applied; if False, it is applied. + The default value is True. +
+bool
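To illustrate what `bare_number` changes, the sketch below parses an integer cell and, when `bare_number` is False, first strips non-digit characters from both ends of the text. The function `parse_integer` and the exact pattern are assumptions for this example and may not match the pattern the framework applies.

```python
import re
from typing import Optional


def parse_integer(cell: str, *, bare_number: bool = True) -> Optional[int]:
    """Parse an integer cell; return None if the text is not a valid integer."""
    text = cell.strip()
    if not bare_number:
        # Assumed pattern: drop leading and trailing runs of non-digit characters.
        text = re.sub(r"^\D+|\D+$", "", text)
    try:
        return int(text)
    except ValueError:
        return None


print(parse_integer("95%"))                     # not a bare number, so parsing fails
print(parse_integer("95%", bare_number=False))  # the "%" suffix is stripped first
```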
+The field contains numbers of any kind including decimals. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['1.1'], ['2.2'], ['3.3']]
+rows = extract(data, schema=Schema(fields=[fields.NumberField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': Decimal('1.1')}, {'name': Decimal('2.2')}, {'name': Decimal('3.3')}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None, bare_number: bool = True, float_number: bool = False, decimal_char: str = ., group_char: str = ) -> None
++ It specifies that the value is a bare number. If True, the pattern that removes non-digit + characters is not applied; if False, it is applied. The default value is True. +
+bool
++ It specifies that the value is a float number. +
+bool
++ It specifies the char to be used as the decimal character. The default + value is ".". Its values can be: ".", "@" etc. +
+str
++ It specifies the char to be used as group character. The default value + is "". It can take values such as: ",", "#" etc. +
+str
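The `decimal_char` and `group_char` options above can be understood as a normalisation step before the number is parsed. The sketch below removes the group character, swaps the decimal character for ".", and then builds a `Decimal`. The function `parse_number` is hypothetical and only approximates what a real parser would do.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_number(cell: str, *, decimal_char: str = ".", group_char: str = "") -> Optional[Decimal]:
    """Parse a number cell using custom decimal and group characters."""
    text = cell.strip()
    if group_char:
        # Remove group separators first so they cannot be confused with the decimal char.
        text = text.replace(group_char, "")
    if decimal_char != ".":
        text = text.replace(decimal_char, ".")
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


print(parse_number("1.234,5", decimal_char=",", group_char="."))
print(parse_number("1,000.5", group_char=","))
```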
+The field contains data which is valid JSON. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['{"key": "value"}']]
+rows = extract(data, schema=Schema(fields=[fields.ObjectField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': {'key': 'value'}}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+The field contains strings, that is, sequences of characters. Read more in Table Schema Standard. Currently supported formats:
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['value']]
+rows = extract(data, schema=Schema(fields=[fields.StringField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': 'value'}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+A time without a date. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['15:00:00']]
+rows = extract(data, schema=Schema(fields=[fields.TimeField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': datetime.time(15, 0)}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+A calendar year as per XMLSchema gYear. Usual lexical representation is YYYY. There are no format options. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['2022']]
+rows = extract(data, schema=Schema(fields=[fields.YearField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': 2022}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+A specific month in a specific year as per XMLSchema gYearMonth. Usual lexical representation is: YYYY-MM. Read more in Table Schema Standard.
+from frictionless import Schema, extract, fields
+
+data = [['name'], ['2022-08']]
+rows = extract(data, schema=Schema(fields=[fields.YearmonthField(name='name')]))
+print(rows)
+
+
+{'memory': [{'name': yearmonth(year=2022, month=8)}]}
+
+ Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
+CSV is a file format which you can use in Frictionless for reading and writing. It is arguably the main Open Data format, so it's supported very well in Frictionless.
+You can read this format using Package/Resource
, for example:
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('table.csv')
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ The same is true for writing:
+ +from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write('table-output.csv')
+print(target)
+print(target.to_view())
+
+
+{'name': 'table-output',
+ 'type': 'table',
+ 'path': 'table-output.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
++----+-----------+
+| id | name |
++====+===========+
+| 1 | 'english' |
++----+-----------+
+| 2 | 'german' |
++----+-----------+
+
+ There is a control to configure how Frictionless reads and writes files in this format. For example:
+ +from frictionless import Resource, formats
+
+resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+resource.write('tmp/table.csv', control=formats.CsvControl(delimiter=';'))
+
+
+ Csv dialect representation. + +Control class to set params for CSV reader/writer.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, delimiter: str = ,, line_terminator: str = \r\n, quote_char: str = ", double_quote: bool = True, escape_char: Optional[str] = None, null_sequence: Optional[str] = None, skip_initial_space: bool = False) -> None
++ Specify the delimiter used to separate text strings while + reading from or writing to the csv file. Default value is ",". + For example: delimiter=";" +
+str
++ Specify the line terminator for the csv file while reading/writing. + For example: line_terminator="\n". Default line_terminator is "\r\n". +
+str
++ Specify the quote char for fields that contain a special character + such as comma, CR, LF or double quote. Default value is '"'. + For example: quote_char='|' +
+str
++ It controls how instances of 'quote_char' appearing inside a field are + quoted. When set to True, the 'quote_char' is doubled; otherwise the escape char is + used. Default value is True. +
+bool
++ A one-character string used by the csv writer to escape. Default is None, which disables + escaping. It is used to escape 'quote_char' if double_quote is False. +
+Optional[str]
++ Specify the null sequence. It is not set by default. + For example: \\N +
+Optional[str]
++ Ignores spaces following the comma if set to True. + For example, the space after the comma in a CSV header such as: "Name", "Team" +
+bool
+Convert to Python's `csv.Dialect`
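Since the control can be converted to Python's `csv.Dialect`, its properties can be tried out directly with the standard `csv` module. The sketch below assumes a one-to-one mapping between the control's names and the `csv` module's formatting parameters (`quote_char` to `quotechar`, `double_quote` to `doublequote`, and so on). That mapping and the helper names `write_csv` and `read_csv` are assumptions for illustration, not the framework's code.

```python
import csv
import io
from typing import Any, List, Optional


def write_csv(
    rows: List[List[Any]],
    *,
    delimiter: str = ",",
    line_terminator: str = "\r\n",
    quote_char: str = '"',
    double_quote: bool = True,
    escape_char: Optional[str] = None,
) -> str:
    """Serialise rows to CSV text using control-style option names."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        lineterminator=line_terminator,
        quotechar=quote_char,
        doublequote=double_quote,
        escapechar=escape_char,
    )
    writer.writerows(rows)
    return buffer.getvalue()


def read_csv(text: str, *, delimiter: str = ",", skip_initial_space: bool = False) -> List[List[str]]:
    """Parse CSV text back into rows of strings."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, skipinitialspace=skip_initial_space)
    return list(reader)


text = write_csv([["id", "name"], [1, 'say "hi"']], delimiter=";", line_terminator="\n")
print(text)
print(read_csv(text, delimiter=";"))
print(read_csv('"Name", "Team"\n', skip_initial_space=True))
```

With `double_quote` left at True, the quote character inside `say "hi"` is doubled on output and restored on input, which is the behaviour described for that property above.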
+Frictionless supports exporting a data package as an ER-diagram dot
file. For example:
from frictionless import Package
+
+package = Package('datapackage.zip')
+package.to_er_diagram(path='erd.dot')
+
+Excel is a very popular tabular data format that usually has xlsx
(newer) and xls
(older) file extensions. Frictionless supports Excel files extensively.
pip install frictionless[excel]
+pip install 'frictionless[excel]' # for zsh shell
+
+
+ You can read this format using Package/Resource
, for example:
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(path='table.xlsx')
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ The same is true for writing:
+ +from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write('table-output.xlsx')
+print(target)
+print(target.to_view())
+
+
+ There is a control to configure how Frictionless reads and writes files in this format. For example:
+ +from frictionless import Resource, formats
+
+resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+resource.write('table-output-sheet.xls', control=formats.ExcelControl(sheet='My Table'))
+
+
+ Excel control representation. + +Control class to set params for Excel reader/writer.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, sheet: Union[str, int] = 1, workbook_cache: Optional[Any] = None, fill_merged_cells: bool = False, preserve_formatting: bool = False, adjust_floating_point_error: bool = False, stringified: bool = False) -> None
++ Name of the sheet from where to read or write data. +
+Union[str, int]
++ An empty dictionary which is used to handle workbook caching for remote workbooks. + It stores the path to the temporary file while reading remote workbooks. +
+Optional[Any]
++ If True, it will unmerge and fill all merged cells by the visible value. + Default value is False. +
+bool
++ If set to True, it preserves text formatting for numeric and temporal cells. If not set, + it will return all cell values as strings. Default value is False. +
+bool
++ If True, it corrects the Excel behavior regarding floating point numbers. + For example: 274.65999999999997 -> 274.66 (When True). +
+bool
++ Stringifies all the cell values. Default value + is False. + + Note that a table resource schema will still be applied and types coerced to match the schema + (either provided or inferred) _after_ the rows are read as strings. + + To return all cells as strings then both set `stringified=True` and specify a + schema that defines all fields to be of type string (see #1659). +
+bool
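The `adjust_floating_point_error` example above (274.65999999999997 becoming 274.66) is the kind of result you get by rounding a float to 15 significant digits. The sketch below shows that idea with standard string formatting. The function `adjust_float` and the choice of 15 digits are assumptions; the framework may correct the value in a different way.

```python
def adjust_float(value: float, digits: int = 15) -> float:
    """Round a float to the given number of significant digits."""
    # The "g" format keeps `digits` significant digits and drops trailing zeros.
    return float(format(value, f".{digits}g"))


print(adjust_float(274.65999999999997))
print(adjust_float(0.1 + 0.2))
```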
+Frictionless supports parsing Google Sheets data as a file format.
+ +pip install frictionless[gsheets]
+pip install 'frictionless[gsheets]' # for zsh shell
+
+
+ You can read from Google Sheets using Package/Resource
, for example:
from pprint import pprint
+from frictionless import Resource
+
+path='https://docs.google.com/spreadsheets/d/1mHIWnDvW9cALRMq9OdNfRwjAthCUFUOACPp0Lkyl7b4/edit?usp=sharing'
+resource = Resource(path=path)
+pprint(resource.read_rows())
+
+
+ [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+The same is true for writing:
+ +from frictionless import Resource, formats
+
+control = formats.GsheetsControl(credentials=".google.json")
+resource = Resource(path='data/table.csv')
+resource.write("https://docs.google.com/spreadsheets/d/<id>/edit", control=control)
+
+
+ There is a control to configure how Frictionless reads and writes files in this format. For example:
+ +from frictionless import Resource, formats
+
+control = formats.GsheetsControl(credentials=".google.json")
+resource = Resource(path='data/table.csv')
+resource.write("https://docs.google.com/spreadsheets/d/<id>/edit", control=control)
+
+
+ Gsheets control representation. + +Control class to set params for Gsheets api.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, credentials: Optional[str] = None) -> None
++ API key to access Google Sheets. +
+Optional[str]
+Frictionless supports parsing HTML format:
+ +pip install frictionless[html]
+pip install 'frictionless[html]' # for zsh shell
+
+
+ You can read this file format using Package/Resource
, for example:
from pprint import pprint
+from frictionless import resources
+
+resource = resources.TableResource(path='table1.html')
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ The same is true for writing:
+ +from frictionless import Resource, resources
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = resources.TableResource(path='table-output.html')
+source.write(target)
+print(target)
+print(target.to_view())
+
+
+{'name': 'table-output',
+ 'type': 'table',
+ 'path': 'table-output.html',
+ 'scheme': 'file',
+ 'format': 'html',
+ 'mediatype': 'text/html'}
++----+-----------+
+| id | name |
++====+===========+
+| 1 | 'english' |
++----+-----------+
+| 2 | 'german' |
++----+-----------+
+
+ There is a control to configure HTML, for example:
+ +from frictionless import Resource, resources, formats
+
+control=formats.HtmlControl(selector='#id')
+resource = resources.TableResource(path='table1.html', control=control)
+print(resource.read_rows())
+
+
+[]
+
+ Html control representation. + +Control class to set params for Html reader/writer.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, selector: str = table) -> None
++ Any valid css selector. Default selector is 'table'. + For example: "table", "#id", ".meme" etc. +
+str
+Frictionless supports working with Inline Data from memory.
+You can read data in this format using Package/Resource
, for example:
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': 'german'}]
+
+ The same is true for writing:
+ +from frictionless import Resource
+
+source = Resource('table.csv')
+target = source.write(format='inline', datatype='table')
+print(target)
+print(target.to_view())
+
+
+{'name': 'memory',
+ 'type': 'table',
+ 'data': [['id', 'name'], [1, 'english'], [2, '中国人']],
+ 'format': 'inline'}
++----+-----------+
+| id | name |
++====+===========+
+| 1 | 'english' |
++----+-----------+
+| 2 | '中国人' |
++----+-----------+
+
+ There is a dialect to configure this format, for example:
+ +from frictionless import Resource, formats
+
+control = formats.InlineControl(keyed=True, keys=['name', 'id'])
+resource = Resource(data=[{'id': 1, 'name': 'english'}, {'id': 2, 'name': 'german'}], control=control)
+print(resource.to_view())
+
+
++-----------+----+
+| name | id |
++===========+====+
+| 'english' | 1 |
++-----------+----+
+| 'german' | 2 |
++-----------+----+
+
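To make the effect of `keyed` and `keys` concrete, here is a minimal standard-library sketch of how a list of keyed records can be flattened into a header row plus data rows in a chosen column order. The helper name `keyed_to_rows` is hypothetical and this is not the Frictionless implementation, only an illustration of the idea.

```python
from typing import Any, Dict, List, Optional


def keyed_to_rows(
    data: List[Dict[str, Any]], keys: Optional[List[str]] = None
) -> List[List[Any]]:
    """Convert keyed records into a header row followed by data rows."""
    if not data:
        return []
    # Fall back to the key order of the first record when no keys are given
    header = list(keys) if keys is not None else list(data[0].keys())
    rows: List[List[Any]] = [header]
    for record in data:
        # Missing keys become None so every row has the same width
        rows.append([record.get(key) for key in header])
    return rows


data = [{'id': 1, 'name': 'english'}, {'id': 2, 'name': 'german'}]
table = keyed_to_rows(data, keys=['name', 'id'])
print(table)
```

With `keys=['name', 'id']` the `name` column comes first, which mirrors the column order in the view above.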
+ Inline control representation. + +Control class to set params for Inline reader/writer.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, keys: Optional[List[str]] = None, keyed: bool = False) -> None
++ Specify the keys/columns to read from the resource. + For example: keys=["id","name"]. +
+Optional[List[str]]
++ If set to True, it returns the data as key:value pairs. +
+bool
+Frictionless supports parsing JSON tables (JSON and JSONL/NDJSON).
+ +pip install frictionless[json]
+pip install 'frictionless[json]' # for zsh shell
+
+
+ ++We use the
+path
argument to ensure that it will not be guessed to be a metadata file
You can read this format using Package/Resource
, for example:
from pprint import pprint
+from frictionless import resources
+
+resource = resources.TableResource(path='table.json')
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ The same applies to writing:
+ +from frictionless import Resource, resources
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = resources.TableResource(path='table-output.json')
+source.write(target)
+print(target)
+print(target.to_view())
+
+
+{'name': 'table-output',
+ 'type': 'table',
+ 'path': 'table-output.json',
+ 'scheme': 'file',
+ 'format': 'json',
+ 'mediatype': 'text/json'}
++----+-----------+
+| id | name |
++====+===========+
+| 1 | 'english' |
++----+-----------+
+| 2 | 'german' |
++----+-----------+
+
+ There is a dialect to configure how Frictionless reads and writes files in this format. For example:
+ +from pprint import pprint
+from frictionless import Resource, resources, formats
+
+control=formats.JsonControl(keyed=True)
+resource = resources.TableResource(path='table.keyed.json', control=control)
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
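The difference between the default layout (an array of arrays) and the keyed layout (an array of objects) can be sketched with the standard `json` module. The helper `load_table` and the sample strings are hypothetical and only illustrate the two shapes; they are not how Frictionless parses JSON internally.

```python
import json

# Two layouts that carry the same table
rows_layout = '[["id", "name"], [1, "english"], [2, "german"]]'
keyed_layout = '[{"id": 1, "name": "english"}, {"id": 2, "name": "german"}]'


def load_table(text, keyed=False):
    """Return a list of row dicts from either JSON layout."""
    payload = json.loads(text)
    if keyed:
        # Each item is already a mapping of field name to value
        return [dict(item) for item in payload]
    # The first array is the header, the rest are data rows
    header, *body = payload
    return [dict(zip(header, cells)) for cells in body]


plain_rows = load_table(rows_layout)
keyed_rows = load_table(keyed_layout, keyed=True)
print(plain_rows == keyed_rows)
```

Both layouts yield the same rows once the reader knows which one to expect, which is what the `keyed` flag communicates.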
+ Json control representation. + +Control class to set params for JSON reader/writer class.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, keys: Optional[List[str]] = None, keyed: bool = False, property: Optional[str] = None) -> None
++ Specifies the keys/columns to read from the resource. + For example: keys=["id","name"]. +
+Optional[List[str]]
++ If set to True, it returns the data as key:value pairs. Default value is False. +
+bool
++ This property specifies the path to the attribute in a JSON file, if it has + nested fields. +
+Optional[str]
+Frictionless supports importing a JsonSchema profile as a Table Schema. For example:
+schema = Schema.from_jsonschema('table.jsonschema')
+
+Frictionless supports exporting a metadata object as a Markdown document. For example:
+schema = Schema('schema.json')
+schema.to_markdown('schema.md')
+
+Frictionless supports ODS parsing.
+ +pip install frictionless[ods]
+pip install 'frictionless[ods]' # for zsh shell
+
+
+ You can read this format using Package/Resource
, for example:
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(path='table.ods')
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ The same applies to writing:
+ +from pprint import pprint
+from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write('table-output.ods')
+pprint(target)
+
+
+ There is a dialect to configure how Frictionless reads and writes files in this format. For example:
+ +from frictionless import Resource, formats
+
+resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+resource.write('table-output-sheet.ods', control=formats.OdsControl(sheet='My Table'))
+
+
+ Ods control representation. + +Control class to set params for ODS reader/writer.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, sheet: Union[str, int] = 1) -> None
++ Name or index of the sheet to read/write. +
+Union[str, int]
+Frictionless supports reading and writing Pandas dataframes.
+ +pip install frictionless[pandas]
+pip install 'frictionless[pandas]' # for zsh shell
+
+
+ You can read a Pandas dataframe:
+ +from pprint import pprint
+from frictionless import Resource
+
+# df is assumed to be an existing pandas DataFrame
+resource = Resource(df)
+pprint(resource.read_rows())
+
+
+ You can write a dataset to Pandas:
+ + + +from frictionless import Resource
+
+resource = Resource('table.csv')
+df = resource.to_pandas()
+
+
+ Frictionless supports reading and writing Parquet files.
+ +pip install frictionless[parquet]
+pip install 'frictionless[parquet]' # for zsh shell
+
+
+ You can read a Parquet file:
+ +from frictionless import Resource
+
+resource = Resource('table.parq')
+print(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ You can write a dataset to Parquet:
+ +from frictionless import Resource
+
+resource = Resource('table.csv')
+target = resource.write('table-output.parq')
+print(target)
+print(target.read_rows())
+
+
+{'name': 'table-output',
+ 'type': 'table',
+ 'path': 'table-output.parq',
+ 'scheme': 'file',
+ 'format': 'parq',
+ 'mediatype': 'application/parquet'}
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ Parquet control representation. + +Control class to set params for Parquet read/write class.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, columns: Optional[List[str]] = None, categories: Optional[Any] = None, filters: Optional[Any] = False) -> None
++ A list of columns to load. By selecting columns, we can only access + parts of file that we are interested in and skip columns that are + not of interest. Default value is None. +
+Optional[List[str]]
++ List of columns that should be returned as Pandas Category-type column. + The second example specifies the number of expected labels for that column. + For example: categories=['col1'] or categories={'col1': 12} +
+Optional[Any]
++ Specifies the condition to filter data (row-groups). + For example: [('col3', 'in', [1, 2, 3, 4])] +
+Optional[Any]
+Convert to options
+Frictionless supports reading and writing SPSS files.
+ +pip install frictionless[spss]
+pip install 'frictionless[spss]' # for zsh shell
+
+
+ You can read SPSS files:
+ +from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('table.sav')
+pprint(resource.read_rows())
+
+
+ You can write SPSS files:
+ + + +from pprint import pprint
+from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write('table-output.sav')
+pprint(target)
+pprint(target.read_rows())
+
+
+ Frictionless supports reading and writing SQL databases.
+Frictionless Framework generally supports many databases that can be used with sqlalchemy
. Here is a list of the databases with tested support:
+ ++
It's a well-tested default database used by Frictionless:
+ +pip install frictionless[sql]
+
+
+ + ++
This database is well-tested and provides the most data types:
+ +pip install frictionless[postgresql]
+
+
+ + ++
Another popular database that has been tested with Frictionless:
+ +pip install frictionless[mysql]
+
+
+ + ++
DuckDB is a relatively new database and, currently, Frictionless support is experimental:
+ +pip install frictionless[duckdb]
+
+
+ You can read SQL databases:
+ +from frictionless import Resource, formats
+
+control = formats.SqlControl(table="test_table", basepath='data')
+with Resource(path="sqlite:///sqlite.db", control=control) as resource:
+ print(resource.read_rows())
+
+
+ You can write SQL databases:
+ +from frictionless import Package
+
+package = Package('path/to/datapackage.json')
+package.publish('postgresql://database')
+
+
+ There is a dialect to configure how Frictionless reads and writes files in this format. For example:
+ +from frictionless import Resource, formats
+
+control = formats.SqlControl(table='table', order_by='field', where='field > 20')
+resource = Resource('postgresql://database', control=control)
+
+
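The `where` and `order_by` options read like the clauses of a plain SQL query. As a rough illustration only, the sketch below runs what is assumed to be an equivalent query with the standard `sqlite3` module. The table `items` and its columns are hypothetical, and the SQL that Frictionless actually generates may differ.

```python
import sqlite3

# An in-memory database with a hypothetical table
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE items (field INTEGER, label TEXT)")
connection.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(30, "c"), (10, "a"), (25, "b")],
)

# Assumed plain-SQL counterpart of where='field > 20' and order_by='field'
query = "SELECT field, label FROM items WHERE field > 20 ORDER BY field"
selected = connection.execute(query).fetchall()
print(selected)
connection.close()
```

Only the rows that satisfy the filter are returned, sorted by the ordering column.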
+ SQL control representation. + +Control class to set params for Sql read/write class.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, table: Optional[str] = None, order_by: Optional[str] = None, where: Optional[str] = None, namespace: Optional[str] = None, basepath: Optional[str] = None, with_metadata: bool = False) -> None
++ Table name from which to read the data. +
+Optional[str]
++ It specifies the ORDER BY keyword for SQL queries to sort the + results that are being read. The default value is None. +
+Optional[str]
++ It specifies the WHERE clause to filter the records in SQL + queries. The default value is None. +
+Optional[str]
++ To refer to table using schema or namespace or database such as + `FOO`.`TABLEFOO1` we can specify namespace. For example: + control = formats.SqlControl(table="test_table", namespace="FOO") +
+Optional[str]
++ It specifies the base path for the database. The basepath will + be appended to the db path. The default value is None. For example: + formats.SqlControl(table="test_table", basepath="data") +
+Optional[str]
++ Indicates if a table contains metadata columns like + _rowNumber or _rowValid +
+bool
+Frictionless supports parsing YAML tables.
+++We use the
+path
argument to ensure that it will not be guessed to be a metadata file
You can read this format using Package/Resource
, for example:
from pprint import pprint
+from frictionless import Resource, resources
+
+resource = resources.TableResource(path='table.yaml')
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ The same applies to writing:
+ +from frictionless import Resource, resources
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = resources.TableResource(path='table-output.yaml')
+source.write(target)
+print(target)
+print(target.to_view())
+
+
+{'name': 'table-output',
+ 'type': 'table',
+ 'path': 'table-output.yaml',
+ 'scheme': 'file',
+ 'format': 'yaml',
+ 'mediatype': 'text/yaml'}
++----+-----------+
+| id | name |
++====+===========+
+| 1 | 'english' |
++----+-----------+
+| 2 | 'german' |
++----+-----------+
+
+ There is a dialect to configure how Frictionless reads and writes files in this format. For example:
+ +from pprint import pprint
+from frictionless import Resource, resources, formats
+
+control=formats.YamlControl(keyed=True)
+resource = resources.TableResource(path='table.keyed.yaml', control=control)
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ Yaml control representation.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, keys: Optional[List[str]] = None, keyed: bool = False, property: Optional[str] = None) -> None
++ Specifies the keys/columns to read from the resource. + For example: keys=["id","name"]. +
+Optional[List[str]]
++ If set to True, it returns the data as key:value pairs. Default value is False. +
+bool
++ This property specifies the path to the attribute in a json file, if it has + nested fields. +
+Optional[str]
+Frictionless supports zipped resources and reading/publishing data packages as a zip archive. For example:
+package = Package('datapackage.zip')
+package.publish('otherpackage.zip')
+
+Describe is a high-level function (action) to infer metadata from a data source.
+from frictionless import describe
+
+resource = describe('table.csv')
+print(resource)
+
+
+{'name': 'table',
+ 'type': 'table',
+ 'path': 'table.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'encoding': 'utf-8',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}
+
+ Describe the data source
+(source: Optional[Any] = None, *, name: Optional[str] = None, type: Optional[str] = None, stats: bool = False, **options: Any) -> Metadata
+Extract is a high-level function (action) to read tabular data from a data source. The output is encoded in UTF-8.
+from pprint import pprint
+from frictionless import extract
+
+rows = extract('table.csv')
+pprint(rows)
+
+
+{'table': [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]}
+
+ Extract rows
+(source: Optional[Any] = None, *, name: Optional[str] = None, type: Optional[str] = None, filter: Optional[types.IFilterFunction] = None, process: Optional[types.IProcessFunction] = None, limit_rows: Optional[int] = None, resource_name: Optional[str] = None, **options: Any)
+Validate is a high-level function (action) to validate data from a data source.
+from frictionless import validate
+
+report = validate('table.csv')
+print(report.valid)
+
+
+True
+
+ Validate resource
+(source: Optional[Any] = None, *, name: Optional[str] = None, type: Optional[str] = None, checklist: Union[frictionless.checklist.checklist.Checklist, str, NoneType] = None, checks: List[frictionless.checklist.check.Check] = [], pick_errors: List[str] = [], skip_errors: List[str] = [], limit_errors: int = 1000, limit_rows: Optional[int] = None, parallel: bool = False, resource_name: Optional[str] = None, **options: Any)
+Transform is a high-level function (action) to transform tabular data from a data source.
+from frictionless import transform, steps
+
+resource = transform('table.csv', steps=[steps.cell_set(field_name='name', value='new')])
+print(resource.read_rows())
+
+
+[{'id': 1, 'name': 'new'}, {'id': 2, 'name': 'new'}]
+
+ Transform resource
+(source: Optional[Any] = None, *, type: Optional[str] = None, pipeline: Union[frictionless.pipeline.pipeline.Pipeline, str, NoneType] = None, steps: Optional[List[frictionless.pipeline.step.Step]] = None, **options: Any)
+Catalog is a set of data packages.
+We can create a catalog providing a list of data packages:
+ +from frictionless import Catalog, Dataset, Package
+
+catalog = Catalog(datasets=[Dataset(name='name', package=Package('tables/*'))])
+
+
+ Usually a Catalog is used to describe an external set of datasets, such as a CKAN instance or a GitHub user or search. For example:
+ +from frictionless import Catalog
+
+catalog = Catalog('https://demo.ckan.org/dataset/')
+print(catalog)
+
+
+ The core purpose of a catalog is to group a set of datasets. The Catalog class provides useful methods to manage datasets:
+ +from frictionless import Catalog
+
+catalog = Catalog('https://demo.ckan.org/dataset/')
+catalog.dataset_names
+catalog.has_dataset
+catalog.add_dataset
+catalog.get_dataset
+catalog.clear_datasets
+
+
+ Like any of the Metadata classes, the Catalog class can be saved as JSON or YAML:
+ +from frictionless import Catalog
+
+catalog = Catalog('https://demo.ckan.org/dataset/')
+catalog.to_json('datacatalog.json') # Save as JSON
+catalog.to_yaml('datacatalog.yaml') # Save as YAML
+
+
+ Catalog representation
+(*, source: Optional[Any] = None, control: Optional[Control] = None, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, datasets: List[Dataset] = NOTHING, basepath: Optional[str] = None) -> None
++ # TODO: add docs +
+Optional[Any]
++ # TODO: add docs +
+Optional[Control]
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “.”, “_” or “-” characters. +
+Optional[str]
++ Type of the object +
+ClassVar[Union[str, None]]
++ A Catalog title according to the specs. It should be a + human-oriented title of the resource. +
+Optional[str]
++ A Catalog description according to the specs. It should be a + human-oriented description of the resource. +
+Optional[str]
++ A list of datasets. Each item in the list is a Dataset. +
+List[Dataset]
++ A basepath of the catalog. The normpath of the resource is joined + `basepath` and `/path` +
+Optional[str]
+Return names of datasets
+List[str]
+Add new dataset to the catalog
+(dataset: Union[Dataset, str]) -> Dataset
+Remove all the datasets
+Dereference underlying metadata + +If some of the underlying metadata is provided as a string +it will be replaced by the metadata object
+Get dataset by name
+(name: str) -> Dataset
+Check if a dataset is present
+(name: str) -> bool
+Infer catalog's metadata
+(*, stats: bool = False)
+Remove dataset by name
+(name: str) -> Dataset
+Set dataset by name
+(dataset: Dataset) -> Optional[Dataset]
+Create a copy of the catalog
+(**options: Any)
+Dataset representation.
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, package: Union[Package, str], basepath: Optional[str] = None, catalog: Optional[Catalog] = None) -> None
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “.”, “_” or “-” characters. +
+str
++ A short name(preferably human-readable) for the Check. + This MUST be lower-case and contain only alphanumeric characters + along with "-" or "_". +
+ClassVar[str]
++ A human-readable title for the Check. +
+Optional[str]
++ A detailed description for the Check. +
+Optional[str]
++ # TODO: add docs +
+Union[Package, str]
++ # TODO: add docs +
+Optional[str]
++ # TODO: add docs +
+Optional[Catalog]
+Dereference underlying metadata + +If some of the underlying metadata is provided as a string +it will be replaced by the metadata object
+Infer dataset's metadata
+(*, stats: bool = False)
+Checklist is a set of validation checks and a few additional settings. Let's create a checklist:
+ +from frictionless import Checklist, checks
+
+checklist = Checklist(checks=[checks.row_constraint(formula='id > 1')])
+print(checklist)
+
+
+{'checks': [{'type': 'row-constraint', 'formula': 'id > 1'}]}
+
+ The Check concept is a part of the Validation API. You can create a custom Check to be used as part of resource or package validation.
+import hashlib
+from frictionless import Check, errors
+
+class duplicate_row(Check):
+ code = "duplicate-row"
+ Errors = [errors.DuplicateRowError]
+
+ def __init__(self, descriptor=None):
+ super().__init__(descriptor)
+ self.__memory = {}
+
+ def validate_row(self, row):
+ text = ",".join(map(str, row.values()))
+ hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
+ match = self.__memory.get(hash)
+ if match:
+ note = 'the same as row at position "%s"' % match
+ yield errors.DuplicateRowError.from_row(row, note=note)
+ self.__memory[hash] = row.row_position
+
+ # Metadata
+
+ metadata_profile = { # type: ignore
+ "type": "object",
+ "properties": {},
+ }
+
+It's common to create a custom Error alongside a custom Check.
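The row-hashing idea used by the custom check can be tried on its own, without the framework. The sketch below is a standard-library illustration with a hypothetical helper `find_duplicate_rows`. Unlike the check above, which stores the latest position for each hash, it remembers the first position at which each row was seen.

```python
import hashlib


def find_duplicate_rows(rows):
    """Yield (row_number, first_seen_row_number) for every repeated row."""
    seen = {}
    for number, row in enumerate(rows, start=1):
        # Serialize the cells and hash them so memory use stays bounded per row
        text = ",".join(map(str, row))
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            yield number, seen[digest]
        else:
            seen[digest] = number


rows = [[1, "english"], [2, "german"], [1, "english"]]
duplicates = list(find_duplicate_rows(rows))
print(duplicates)
```

Joining cells with a comma is a simplification: two different rows whose cells contain commas could serialize to the same text, so a production check may prefer a stricter serialization.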
+Checklist representation. + +A class that combines multiple checks to be applied while validating +a resource or package.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, checks: List[Check] = NOTHING, pick_errors: List[str] = NOTHING, skip_errors: List[str] = NOTHING) -> None
++ A short name(preferably human-readable) for the Checklist. + This MUST be lower-case and contain only alphanumeric characters + along with "-" or "_". +
+Optional[str]
++ Type of the object +
+ClassVar[Union[str, None]]
++ A human-readable title for the Checklist. +
+Optional[str]
++ A detailed description for the Checklist. +
+Optional[str]
++ List of checks to be applied during validation such as "deviated-cell", + "required-value" etc. +
+List[Check]
++ Specify the error names to be picked during validation, such as "sha256-count", + "byte-count". Errors other than those specified will be ignored. +
+List[str]
++ Specify the error names to be skipped during validation, such as "sha256-count", + "byte-count". Other errors will be included. +
+List[str]
+Add new check to the checklist
+(check: Check) -> None
+Remove all the checks
+() -> None
+Get check by type
+(type: str) -> Check
+Check if a check is present
+(type: str) -> bool
+Remove check by type
+(type: str) -> Check
+Set check by type
+(check: Check) -> Optional[Check]
+Check representation. + +A base class for all the checks. To add a new custom check, it has to be derived +from this class.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “.”, “_” or “-” characters. +
+Optional[str]
++ A short name(preferably human-readable) for the Check. + This MUST be lower-case and contain only alphanumeric characters + along with "-" or "_". +
+ClassVar[str]
++ A human-readable title for the Check. +
+Optional[str]
++ A detailed description for the Check. +
+Optional[str]
++ List of errors that are being used in the Check. +
+ClassVar[List[Type[Error]]]
+Resource
+Connect to the given resource
+(resource: Resource)
+Called to validate the resource before closing
+() -> Iterable[Error]
+Called to validate the given row (on every row)
+(row: Row) -> Iterable[Error]
+Called to validate the resource after opening
+() -> Iterable[Error]
+The Detector object can be used in various places within the Framework. The main purpose of this class is to tweak how different aspects of metadata are detected.
+Here is a quick example:
+ +frictionless extract table.csv --field-missing-values 1,2
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
+│ table │ table │ table.csv │
+└───────┴───────┴───────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ table
+┏━━━━━━┳━━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━━━╇━━━━━━━━━┩
+│ None │ english │
+│ None │ 中国人 │
+└──────┴─────────┘
+
+ from frictionless import Detector, Resource
+
+detector = Detector(field_missing_values=['1', '2'])
+resource = Resource('table.csv', detector=detector)
+print(resource.read_rows())
+
+
+[{'id': None, 'name': 'english'}, {'id': None, 'name': '中国人'}]
+
+ Many options below have their CLI equivalent. Please consult the CLI help.
+Detector class instances are accepted by many classes and functions:
+You just need to create a Detector instance using the desired options and pass it to the classes and functions listed above.
+By default, Frictionless will use the first 10000 bytes to detect encoding. Increasing buffer_size includes more bytes, which makes encoding detection more accurate but slower.
+ +from frictionless import Detector, describe
+
+detector = Detector(buffer_size=100000)
+resource = describe("country-1.csv", detector=detector)
+print(resource.encoding)
+
+
+utf-8
+
+ By default, Frictionless will use the first 100 rows to detect field types. Increasing sample_size includes more rows, which makes the inference more accurate but slower.
+ +from frictionless import Detector, describe
+
+detector = Detector(sample_size=1000)
+resource = describe("country-1.csv", detector=detector)
+print(resource.schema)
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'neighbor_id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
+
+ By default, encoding_function is None and Frictionless uses its built-in encoding detection. You can provide your own function instead. The following example simply returns utf-8, but the function can implement more complex logic.
+ +from frictionless import Detector, Resource
+
+detector = Detector(encoding_function=lambda sample: "utf-8")
+with Resource("table.csv", detector=detector) as resource:
+ print(resource.encoding)
+
+
+utf-8
+
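As an illustration of slightly more involved logic, a custom encoding function could inspect the sample for a byte-order mark. This is a minimal sketch that assumes the function receives the raw byte buffer, as the lambda above suggests. The name `detect_encoding` is hypothetical, and falling back to utf-8 is a choice made for this example.

```python
import codecs


def detect_encoding(sample: bytes) -> str:
    """Return an encoding name based on a byte-order mark, else utf-8."""
    if sample.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # UTF-32 must be checked before UTF-16 because their BOMs share a prefix
    if sample.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"


print(detect_encoding(b"id,name\n1,english\n"))
print(detect_encoding(codecs.BOM_UTF8 + b"id,name\n"))
```

Such a function would be passed the same way as the lambda above, for example `Detector(encoding_function=detect_encoding)`.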
+ This option allows manually setting all the field types to a given type. It's useful when you need to skip data casting (setting any
type) or have everything as a string (setting string
type):
from frictionless import Detector, describe
+
+detector = Detector(field_type='string')
+resource = describe("country-1.csv", detector=detector)
+print(resource.schema)
+
+
+{'fields': [{'name': 'id', 'type': 'string'},
+ {'name': 'neighbor_id', 'type': 'string'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'string'}]}
+
+ Sometimes you don't want to use the existing header row to compose field names. It's possible to provide custom names:
+ +from frictionless import Detector, describe
+
+detector = Detector(field_names=["f1", "f2", "f3", "f4"])
+resource = describe("country-1.csv", detector=detector)
+print(resource.schema.field_names)
+
+
+['f1', 'f2', 'f3', 'f4']
+
+ By default, Frictionless uses a 0.9 (90%) confidence level for data type detection. It means that if there are 9 integers in a field and one string, the field will be inferred as an integer. If you want a guarantee that an inferred schema will conform to the data, you can set it to 1 (100%):
+ +from frictionless import Detector, describe
+
+detector = Detector(field_confidence=1)
+resource = describe("country-1.csv", detector=detector)
+print(resource.schema)
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'neighbor_id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
+
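The confidence threshold can be pictured with a small standard-library sketch. The function `infer_type` is hypothetical and only distinguishes integers from strings. The real detector considers many more candidate types, so treat this as an illustration of the threshold idea rather than the actual algorithm.

```python
def infer_type(values, confidence=0.9):
    """Return 'integer' when enough values parse as integers, else 'string'."""

    def is_integer(text):
        try:
            int(text)
            return True
        except ValueError:
            return False

    if not values:
        return "string"
    # Share of sampled values that match the candidate type
    share = sum(is_integer(value) for value in values) / len(values)
    return "integer" if share >= confidence else "string"


sample = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "n/a"]
print(infer_type(sample))                # nine of ten values are integers
print(infer_type(sample, confidence=1))  # every value must match
```

With the default threshold the single stray value is tolerated. With a threshold of 1 the field falls back to a type that fits every value.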
+ By default, Frictionless will consider all non-integer numbers to be decimals. It's possible to make them floats, which is a faster data type:
+ +from frictionless import Detector, describe
+
+detector = Detector(field_float_numbers=True)
+resource = describe("floats.csv", detector=detector)
+print(resource.schema)
+print(resource.read_rows())
+
+
+{'fields': [{'name': 'number', 'type': 'number', 'floatNumber': True}]}
+[{'number': 1.1}, {'number': 1.2}, {'number': 1.3}, {'number': 1.4}, {'number': 1.5}]
+
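The trade-off between the two number types is a general Python one and can be seen with the standard library alone. The short sketch below shows why decimals are the safer default and floats the faster, less exact option.

```python
from decimal import Decimal

# Decimal arithmetic is exact for decimal fractions
decimal_total = Decimal("0.1") + Decimal("0.2")
print(decimal_total == Decimal("0.3"))  # True

# Binary floats cannot represent 0.1 or 0.2 exactly
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False
```

If exact equality or money-like values matter, keeping decimals is the safer choice. If speed matters more and small rounding errors are acceptable, floats may be preferable.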
+ Missing Values is an important concept in data description. It provides information about what cell values should be considered as nulls. We can customize the defaults:
+ +from frictionless import Detector, describe
+
+detector = Detector(field_missing_values=["", "1", "2"])
+resource = describe("table.csv", detector=detector)
+print(resource.schema.missing_values)
+print(resource.read_rows())
+
+
+['', '1', '2']
+[{'id': None, 'name': 'english'}, {'id': None, 'name': '中国人'}]
+
+ As we can see, the textual values equal to "1" and "2" are now considered nulls. Usually, it's handy when you have data with values like: '-', 'n/a', and similar.
+There is a way to sync a provided schema based on a header row's field order. It's very useful when you have a schema that describes a subset or a superset of the resource's fields:
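The mechanism can be pictured as a check that happens before any type casting: a cell whose raw text is listed as a missing value becomes `None`. The sketch below uses a hypothetical helper `cast_cell` and only handles integers and strings. It illustrates the ordering of the two steps and is not the framework's casting code.

```python
def cast_cell(text, missing_values=("",)):
    """Return None for missing values, an int when possible, else the text."""
    # The missing-value check runs on the raw text, before casting
    if text in missing_values:
        return None
    try:
        return int(text)
    except ValueError:
        return text


raw_rows = [["1", "english"], ["2", "中国人"]]
missing = ["", "1", "2"]
cast_rows = [[cast_cell(cell, missing) for cell in row] for row in raw_rows]
print(cast_rows)
```

Because the comparison is textual, the missing values are given as strings such as `"1"` rather than numbers.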
+ +from frictionless import Detector, Resource, Schema, fields
+
+# Note the order of the fields
+detector = Detector(schema_sync=True)
+schema = Schema(fields=[fields.StringField(name='name'), fields.IntegerField(name='id')])
+with Resource('table.csv', schema=schema, detector=detector) as resource:
+ print(resource.schema)
+ print(resource.read_rows())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
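Conceptually, syncing means looking up each header label in the provided schema and emitting the fields in header order. The sketch below works on plain descriptor dictionaries with a hypothetical helper `sync_fields`. Giving unmatched labels the type `any` is an assumption made for this example, and fields that are missing from the labels are dropped.

```python
def sync_fields(labels, provided, default_type="any"):
    """Reorder provided field descriptors to follow the header labels."""
    by_name = {field["name"]: field for field in provided}
    synced = []
    for label in labels:
        # Labels without a provided field get a placeholder descriptor
        synced.append(by_name.get(label, {"name": label, "type": default_type}))
    return synced


labels = ["id", "name"]
provided = [
    {"name": "name", "type": "string"},
    {"name": "id", "type": "integer"},
]
synced = sync_fields(labels, provided)
print(synced)
```

The provided fields come out in the order of the header row, matching the schema printed above.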
+ Sometimes we just want to update only a few fields or some schema's properties without providing a brand new schema. For example, the two examples above can be simplified as:
+ +from frictionless import Detector, Resource
+
+detector = Detector(schema_patch={'fields': {'id': {'type': 'string'}}})
+with Resource('table.csv', detector=detector) as resource:
+ print(resource.schema)
+ print(resource.read_rows())
+
+
+{'fields': [{'name': 'id', 'type': 'string'},
+ {'name': 'name', 'type': 'string'}]}
+[{'id': '1', 'name': 'english'}, {'id': '2', 'name': '中国人'}]
+
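A schema patch can be thought of as a dictionary merge in which `fields` is keyed by field name. The sketch below applies such a patch to a plain schema descriptor. The helper `apply_schema_patch` is hypothetical and is meant as an illustration of the patch shape, not as the framework's implementation.

```python
import copy


def apply_schema_patch(schema, patch):
    """Return a patched copy of a schema descriptor."""
    patched = copy.deepcopy(schema)
    patch = dict(patch)
    # Field patches are keyed by field name rather than given as a list
    field_patches = patch.pop("fields", {})
    # Remaining keys are top-level schema properties
    patched.update(patch)
    for field in patched.get("fields", []):
        field.update(field_patches.get(field["name"], {}))
    return patched


inferred = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
    ]
}
patched = apply_schema_patch(inferred, {"fields": {"id": {"type": "string"}}})
print(patched)
```

Only the named field changes, and the original descriptor is left untouched because the helper works on a deep copy.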
+ Detector representation. + +The main purpose of this class is to set the parameters that define +how different aspects of metadata are detected.
+(*, buffer_size: int = 10000, sample_size: int = 100, encoding_function: Optional[types.IEncodingFunction] = None, encoding_confidence: float = 0.5, field_type: Optional[str] = None, field_names: Optional[List[str]] = None, field_confidence: float = 0.9, field_float_numbers: bool = False, field_missing_values: List[str] = NOTHING, field_true_values: List[str] = NOTHING, field_false_values: List[str] = NOTHING, schema_sync: bool = False, schema_patch: Optional[Dict[str, Any]] = None) -> None
++ The amount of bytes to be extracted as a buffer. It defaults to 10000. + The buffer_size can be increased to improve the inference accuracy to + detect file encoding. +
+int
++ The amount of rows to be extracted as a sample for dialect/schema inferring. + It defaults to 100. The sample_size can be increased to improve the inference + accuracy. +
+int
++ A custom encoding function for the file. +
+Optional[types.IEncodingFunction]
++ Confidence value for encoding function. +
+float
++ Enforce all the inferred types to be this type. + For more information, please check "Describing Data" guide. +
+Optional[str]
++ Enforce all the inferred fields to have provided names. + For more information, please check "Describing Data" guide. +
+Optional[List[str]]
++ A number from 0 to 1 setting the infer confidence. + If 1 the data is guaranteed to be valid against the inferred schema. + For more information, please check "Describing Data" guide. + It defaults to 0.9 +
+float
++ Flag to indicate desired number type. + By default numbers will be `Decimal`; if `True` - `float`. + For more information, please check "Describing Data" guide. + It defaults to `False` +
+bool
++ String to be considered as missing values. + For more information, please check "Describing Data" guide. + It defaults to `['']` +
+List[str]
++ String to be considered as true values. + For more information, please check "Describing Data" guide. + It defaults to `["true", "True", "TRUE", "1"]` +
+List[str]
++ String to be considered as false values. + For more information, please check "Describing Data" guide. + It defaults to `["false", "False", "FALSE", "0"]` +
+List[str]
++ Whether to sync the schema. + If it is set to `True`, the provided schema will be mapped to + the inferred schema. It means that, for example, you can + provide a subset of fields to be applied on top of the inferred + fields or the provided schema can have a different order of fields. +
+bool
++ A dictionary to be used as an inferred schema patch. + The form of this dictionary should follow the Schema descriptor form + except for the `fields` property which should be a mapping with the + key named after a field name and the values being a field patch. + For more information, please check "Extracting Data" guide. +
+Optional[Dict[str, Any]]
+This method aims to add missing required labels and + +primary key field not in labels to schema fields.
+(fields_mapping: Dict[str, Field], schema: Schema, labels: List[str], case_sensitive: bool)
+Detect dialect from sample
+(sample: types.ISample, *, dialect: Optional[Dialect] = None) -> Dialect
+Detect encoding from buffer
+(buffer: types.IBuffer, *, encoding: Optional[str] = None) -> str
+Return a descriptor type as 'resource' or 'package'
+(source: Any, *, format: Optional[str] = None) -> Optional[str]
+Detects path details
+(resource: Resource) -> None
+Detect schema from fragment
+(fragment: types.IFragment, *, labels: Optional[List[str]] = None, schema: Optional[Schema] = None, field_candidates: List[Dict[str, Any]] = [{type: yearmonth}, {type: geopoint}, {type: duration}, {type: geojson}, {type: object}, {type: array}, {type: datetime}, {type: time}, {type: date}, {type: integer}, {type: number}, {type: boolean}, {type: year}, {type: string}], **options: Any) -> Schema
+Create a dictionary that maps field names to schema fields
+(fields: List[Field], case_sensitive: bool) -> Dict[str, Field]
+Rearrange fields according to the order of labels. All fields + +missing from labels are dropped
+(fields_mapping: Dict[str, Field], schema: Schema, labels: List[str])
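+The schema sync and schema patch options described in the Detector reference above can be illustrated with a small plain-Python sketch. It works on plain dictionaries rather than Schema objects, and the helper names `sync_fields` and `patch_schema` are hypothetical; it approximates the idea and is not the framework's implementation.

```python
from typing import Any, Dict, List

Field = Dict[str, Any]


def sync_fields(inferred: List[Field], provided: List[Field]) -> List[Field]:
    """Keep the inferred field order, overriding fields that were provided by name."""
    provided_by_name = {field["name"]: field for field in provided}
    return [provided_by_name.get(field["name"], field) for field in inferred]


def patch_schema(inferred: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Apply a patch whose `fields` property maps a field name to a field patch."""
    patch = dict(patch)  # do not mutate the caller's dictionary
    field_patches = patch.pop("fields", {})
    schema = {**inferred, **patch}
    schema["fields"] = [
        {**field, **field_patches.get(field["name"], {})}
        for field in inferred["fields"]
    ]
    return schema


inferred_schema = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
    ]
}
synced = sync_fields(inferred_schema["fields"], [{"name": "name", "type": "any"}])
patched = patch_schema(
    inferred_schema,
    {"missingValues": ["-"], "fields": {"id": {"type": "string"}}},
)
print(synced)
print(patched)
```

+In this sketch, syncing only overrides fields that the inferred schema already contains, while patching changes individual properties of the inferred schema and its fields.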
+The Table Dialect is a core Frictionless Data concept: metadata describing a tabular data source. It lets us manage the table header and any details related to specific formats.
+Dialect class instances are accepted by many classes and functions:
+You just need to create a Dialect instance with the desired options and pass it to one of the classes or functions above. We will demonstrate this with the following example table:
+ +cat capital-3.csv
+
+
+id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+
+ It's a boolean flag which defaults to True
indicating whether the data has a header row or not. In the following example the header row will be treated as a data row:
from frictionless import Resource, Dialect
+
+dialect = Dialect(header=False)
+with Resource('capital-3.csv', dialect=dialect) as resource:
+ print(resource.header.labels)
+ print(resource.to_view())
+
+
+[]
++--------+----------+
+| field1 | field2 |
++========+==========+
+| 'id' | 'name' |
++--------+----------+
+| '1' | 'London' |
++--------+----------+
+| '2' | 'Berlin' |
++--------+----------+
+| '3' | 'Paris' |
++--------+----------+
+| '4' | 'Madrid' |
++--------+----------+
+...
+
+ If header is True
which is the default, this parameter indicates where to find the header row, or header rows for a multiline header. Let's see in an example how the first two data rows can be treated as part of the header:
from frictionless import Resource, Dialect
+
+dialect = Dialect(header_rows=[1, 2, 3])
+with Resource('capital-3.csv', dialect=dialect) as resource:
+ print(resource.header)
+ print(resource.to_view())
+
+
+['id 1 2', 'name London Berlin']
++--------+--------------------+
+| id 1 2 | name London Berlin |
++========+====================+
+| 3 | 'Paris' |
++--------+--------------------+
+| 4 | 'Madrid' |
++--------+--------------------+
+| 5 | 'Rome' |
++--------+--------------------+
+
+If there are multiple header rows, which is controlled by the header_rows
parameter, we can set the string used as a separator when joining the header cells. This is usually very handy for some "fancy" Excel files. For the sake of simplicity, we will demonstrate it on a CSV file:
from frictionless import Resource, Dialect
+
+dialect = Dialect(header_rows=[1, 2, 3], header_join='/')
+with Resource('capital-3.csv', dialect=dialect) as resource:
+ print(resource.header)
+ print(resource.to_view())
+
+
+['id/1/2', 'name/London/Berlin']
++--------+--------------------+
+| id/1/2 | name/London/Berlin |
++========+====================+
+| 3 | 'Paris' |
++--------+--------------------+
+| 4 | 'Madrid' |
++--------+--------------------+
+| 5 | 'Rome' |
++--------+--------------------+
+
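+To make the effect of the header rows and header join options more concrete, here is a minimal plain-Python sketch of how cells from several header rows could be combined into one label per column. The function `join_header_rows` is hypothetical and only approximates the behaviour shown in the examples above; it is not the framework's implementation.

```python
from typing import List


def join_header_rows(
    rows: List[List[str]],
    header_rows: List[int],
    header_join: str = " ",
) -> List[str]:
    """Combine the cells of several header rows into one label per column."""
    # Row numbers are 1-based, as in the Dialect examples.
    picked = [rows[number - 1] for number in header_rows]
    width = max(len(row) for row in picked)
    labels = []
    for index in range(width):
        # Skip rows that are too short and cells that are empty.
        parts = [row[index] for row in picked if index < len(row) and row[index] != ""]
        labels.append(header_join.join(parts))
    return labels


rows = [["id", "name"], ["1", "London"], ["2", "Berlin"], ["3", "Paris"]]
print(join_header_rows(rows, [1, 2, 3]))
print(join_header_rows(rows, [1, 2, 3], header_join="/"))
```

+With the sample rows, the two calls produce the same labels as the two examples above: first joined by a space, then joined by a slash.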
+ By default a header is validated in a case sensitive mode. To disable this behaviour we can set the header_case
parameter to False
. This option is accepted by any Dialect and a dialect can be passed to extract
, validate
and other functions. Please note that it doesn't affect the resulting header; it only affects how the header is validated:
from frictionless import Resource, Schema, Dialect, fields
+
+dialect = Dialect(header_case=False)
+schema = Schema(fields=[fields.StringField(name="ID"), fields.StringField(name="NAME")])
+with Resource('capital-3.csv', dialect=dialect, schema=schema) as resource:
+ print(f'Header: {resource.header}')
+ print(f'Valid: {resource.header.valid}') # without "header_case" it will have 2 errors
+
+
+Header: ['ID', 'NAME']
+Valid: True
+
+Specifies the character used to mark comment rows:
+ +from frictionless import Resource, Dialect
+
+dialect = Dialect(comment_char="#")
+with Resource(b'name\n#row1\nrow2', format="csv", dialect=dialect) as resource:
+ print(resource.read_rows())
+
+
+[{'name': 'row2'}]
+
+ A list of rows to ignore:
+ +from frictionless import Resource, Dialect
+
+dialect = Dialect(comment_rows=[2])
+with Resource(b'name\nrow1\nrow2', format="csv", dialect=dialect) as resource:
+ print(resource.read_rows())
+
+
+[{'name': 'row2'}]
+
+ Ignores rows if they are completely blank.
+ +from frictionless import Resource, Dialect
+
+dialect = Dialect(skip_blank_rows=True)
+with Resource(b'name\n\nrow2', format="csv", dialect=dialect) as resource:
+ print(resource.read_rows())
+
+
+[{'name': 'row2'}]
+
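+The three row-filtering options above (comment character, comment rows and skipping blank rows) can be pictured as one filtering pass over the raw rows. The following plain-Python sketch assumes rows are lists of strings and row numbers are 1-based; `filter_rows` is a hypothetical helper, not part of the framework.

```python
from typing import Iterable, Iterator, List, Optional, Sequence


def filter_rows(
    rows: Iterable[List[str]],
    *,
    comment_char: Optional[str] = None,
    comment_rows: Sequence[int] = (),
    skip_blank_rows: bool = False,
) -> Iterator[List[str]]:
    """Yield only the rows that are not commented out or blank."""
    for number, cells in enumerate(rows, start=1):
        if number in comment_rows:
            continue  # explicitly ignored by row number
        if comment_char and cells and cells[0].startswith(comment_char):
            continue  # the first cell starts with the comment character
        if skip_blank_rows and all(cell == "" for cell in cells):
            continue  # every cell is empty
        yield cells


print(list(filter_rows([["name"], ["#row1"], ["row2"]], comment_char="#")))
print(list(filter_rows([["name"], ["row1"], ["row2"]], comment_rows=[2])))
print(list(filter_rows([["name"], [""], ["row2"]], skip_blank_rows=True)))
```

+In all three calls only the header row and `row2` remain, which mirrors the output of the three examples above.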
+ Dialect representation
+(*, descriptor: Optional[Union[types.IDescriptor, str]] = None, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, header: bool = True, header_rows: List[int] = NOTHING, header_join: str = " ", header_case: bool = True, comment_char: Optional[str] = None, comment_rows: List[int] = NOTHING, skip_blank_rows: bool = False, controls: List[Control] = NOTHING) -> None
++ # TODO: add docs +
+Optional[Union[types.IDescriptor, str]]
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +
+Optional[str]
++ Type of the object +
+ClassVar[Union[str, None]]
++ A human-oriented title for the Dialect. +
+Optional[str]
++ A brief description of the Dialect. +
+Optional[str]
++ If true, the header will be read; otherwise the header will be skipped. +
+bool
++ Specifies the row numbers for the header. Default is [1]. +
+List[int]
++ Separator used to join the text of multi-row header cells. The default value is " " and other values + could be ":", "-" etc. +
+str
++ If set to false, it does case insensitive matching of header. The default value + is True. +
+bool
++ Specifies char used to comment the rows. The default value is None. + For example: "#". +
+Optional[str]
++ A list of rows to ignore. For example: [1, 2] +
+List[int]
++ Ignores rows if they are completely blank +
+bool
++ A list of controls which defines different aspects of reading data. +
+List[Control]
+Add new control to the dialect
+(control: Control) -> None
+Describe the given source as a dialect
+(source: Optional[Any] = None, **options: Any) -> Dialect
+Get control by type
+(type: str) -> Control
+Check if control is present
+(type: str)
+Set control by type
+(control: Control) -> Optional[Control]
+Control representation. + +This class is the base class for all the control classes that are +used to set the states of various different components.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +
+Optional[str]
++ Type of the control. It could be a zenodo plugin control, csv control etc. + For example: "csv", "zenodo" etc +
+ClassVar[str]
++ A human-oriented title for the control. +
+Optional[str]
++ A brief description of the control. +
+Optional[str]
+The Error class is metadata with no behavior. It's used to describe an error that happened while the framework was running or during validation.
+To create a custom error you basically just need to fill the required class fields:
+from frictionless import errors
+
+class DuplicateRowError(errors.RowError):
+    # v5 errors are identified by "type" and "title" (see the API reference below)
+    type = "duplicate-row"
+    title = "Duplicate Row"
+    tags = ["#table", "#row", "#duplicate"]
+    template = "Row at position {rowNumber} is duplicated: {note}"
+    description = "The row is duplicated."
+
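+To show how the class fields of an error fit together, here is a minimal plain-Python sketch in which the message is produced by filling the template with the note and any extra context. The classes `SketchError` and `DuplicateRowSketch` and the `rowNumber` placeholder are assumptions made for illustration; they do not reproduce the framework's Error class.

```python
from typing import Any, ClassVar, List


class SketchError:
    """A stand-in for an error class: class-level metadata plus a rendered message."""

    type: ClassVar[str] = "error"
    title: ClassVar[str] = "Error"
    description: ClassVar[str] = "Error"
    template: ClassVar[str] = "{note}"
    tags: ClassVar[List[str]] = []

    def __init__(self, *, note: str, **context: Any) -> None:
        self.note = note
        self.context = context
        # The message is the template filled with the note and the extra context.
        self.message = self.template.format(note=note, **context)


class DuplicateRowSketch(SketchError):
    type = "duplicate-row"
    title = "Duplicate Row"
    description = "The row is duplicated."
    template = "Row at position {rowNumber} is duplicated: {note}"
    tags = ["#table", "#row", "#duplicate"]


error = DuplicateRowSketch(note="the same as row at position 2", rowNumber=5)
print(error.type)
print(error.message)
```

+The point of the sketch is that a custom error mostly consists of declarative fields; the only dynamic part is the message rendered from the template.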
+Error representation. + +It is the base class from which all other error classes are +derived.
+(*, note: str) -> None
++ A human readable informative comprehensive description of the error. It can be set to any custom text. + If not set, default description is more comprehensive with error type, message and reasons included. +
+ClassVar[str]
++ A human readable informative comprehensive description of the error. It can be set to any custom text. + If not set, default description is more comprehensive with error type, message and reasons included. +
+ClassVar[str]
++ A human readable informative comprehensive description of the error. It can be set to any custom text. + If not set, default description is more comprehensive with error type, message and reasons included. +
+ClassVar[str]
++ A human readable informative comprehensive description of the error. It can be set to any custom text. + If not set, default description is more comprehensive with error type, message and reasons included. +
+ClassVar[str]
++ A human readable informative comprehensive description of the error. It can be set to any custom text. + If not set, default description is more comprehensive with error type, message and reasons included. +
+ClassVar[List[str]]
++ A human readable informative comprehensive description of the error. It can be set to any custom text. + If not set, default description is more comprehensive with error type, message and reasons included. +
+str
++ A short human readable description of the error. It can be set to any custom text. +
+str
+The Inquiry gives you an ability to create arbitrary validation jobs containing a set of individual validation tasks.
+Let's create an inquiry that includes an individual file validation and a resource validation:
+ +from frictionless import Inquiry
+
+inquiry = Inquiry.from_descriptor({'tasks': [
+ {'path': 'capital-valid.csv'},
+ {'path': 'capital-invalid.csv'},
+]})
+inquiry.to_yaml('capital.inquiry-example.yaml')
+print(inquiry)
+
+
+{'tasks': [{'path': 'capital-valid.csv'}, {'path': 'capital-invalid.csv'}]}
+
+ Tasks in the Inquiry accept the same arguments written in camelCase as the corresponding validate
functions have. As usual, let's run validation:
frictionless validate capital.inquiry-example.yaml
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃ status ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ capital-valid │ table │ capital-valid.csv │ VALID │
+│ capital-invalid │ table │ capital-invalid.csv │ INVALID │
+└─────────────────┴───────┴─────────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ capital-invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3 │ duplicate-label │ Label "name" in the header at position "3" │
+│ │ │ │ is duplicated to a label: at position "2" │
+│ 10 │ 3 │ missing-cell │ Row at position "10" has a missing cell in │
+│ │ │ │ field "name2" at position "3" │
+│ 11 │ None │ blank-row │ Row at position "11" is completely blank │
+│ 12 │ 1 │ type-error │ Type error in the cell "x" in row "12" and │
+│ │ │ │ field "id" at position "1": type is │
+│ │ │ │ "integer/default" │
+│ 12 │ 4 │ extra-cell │ Row at position "12" has an extra value in │
+│ │ │ │ field at position "4" │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+
+At first sight, it's not clear why such a construct exists, but when your validation workflow gets complex, the Inquiry can provide a lot of flexibility and power. Last but not least, the Inquiry will use multiprocessing if more than one task is provided.
+Inquiry representation.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, tasks: List[InquiryTask] = NOTHING) -> None
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +
+Optional[str]
++ Type of the object +
+ClassVar[Union[str, None]]
++ A human-oriented title for the Inquiry. +
+Optional[str]
++ A brief description of the Inquiry. +
+Optional[str]
++ List of underlying tasks to be validated. +
+List[InquiryTask]
+Validate inquiry
+(*, parallel: bool = False)
+Inquiry task representation.
+(*, name: Optional[str] = None, type: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, path: Optional[str] = None, scheme: Optional[str] = None, format: Optional[str] = None, encoding: Optional[str] = None, mediatype: Optional[str] = None, compression: Optional[str] = None, extrapaths: Optional[List[str]] = None, innerpath: Optional[str] = None, dialect: Optional[Dialect] = None, schema: Optional[Schema] = None, checklist: Optional[Checklist] = None, resource: Optional[str] = None, package: Optional[str] = None) -> None
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +
+Optional[str]
++ Type of the source to be validated such as "package", "resource" etc. +
+Optional[str]
++ A human-oriented title for the Inquiry. +
+Optional[str]
++ A brief description of the Inquiry. +
+Optional[str]
++ Path to the data source. +
+Optional[str]
++ Scheme for loading the file (file, http, ...). If not set, it'll be + inferred from `source`. +
+Optional[str]
++ File source's format (csv, xls, ...). If not set, it'll be + inferred from `source`. +
+Optional[str]
++ Source encoding. If not set, it'll be inferred from `source`. +
+Optional[str]
++ Mediatype/mimetype of the resource e.g. “text/csv”, or “application/vnd.ms-excel”. + Mediatypes are maintained by the Internet Assigned Numbers Authority (IANA) in a + media type registry. +
+Optional[str]
++ Source file compression (zip, ...). If not set, it'll be inferred from `source`. +
+Optional[str]
++ List of paths to concatenate to the main path. It's used for multipart resources. +
+Optional[List[str]]
++ Path within the compressed file. It defaults to the first file in the archive + (if the source is an archive). +
+Optional[str]
++ Specific set of formatting parameters applied while reading data source. + The parameters are set as a Dialect class. For more information, please + check the Dialect Class documentation. +
+Optional[Dialect]
++ Schema descriptor. A string descriptor or path to schema file. +
+Optional[Schema]
++ Checklist class with a set of validation checks to be applied to the + data source being read. For more information, please check the + Validation Checks documentation. +
+Optional[Checklist]
++ Resource descriptor. A string descriptor or path to resource file. +
+Optional[str]
++ Package descriptor. A string descriptor or path to package + file. +
+Optional[str]
+The Data Package is a core Frictionless Data concept meaning a set of resources with additional metadata provided. You can read the Data Package Standard for more information.
+Let's create a data package:
+ +from frictionless import Package, Resource
+
+package = Package('table.csv') # from a resource path
+package = Package('tables/*') # from a resources glob
+package = Package(['tables/chunk1.csv', 'tables/chunk2.csv']) # from a list
+package = Package('package/datapackage.json') # from a descriptor path
+package = Package({'resources': [{'path': 'table.csv'}]}) # from a descriptor
+package = Package(resources=[Resource(path='table.csv')]) # from arguments
+
+
+ As you can see it's possible to create a package providing different kinds of sources which will be detected to have some type automatically (e.g. whether it's a glob or a path). It's possible to make this step more explicit:
+ +from frictionless import Package, Resource
+
+package = Package(resources=[Resource(path='table.csv')]) # from arguments
+package = Package('datapackage.json') # from a descriptor
+
+
+The standards support a great deal of package metadata, which Frictionless Framework supports too:
+ +from frictionless import Package, Resource
+
+package = Package(
+ name='package',
+ title='My Package',
+ description='My Package for the Guide',
+ resources=[Resource(path='table.csv')],
+ # it's possible to provide all the official properties like homepage, version, etc
+)
+print(package)
+
+
+{'name': 'package',
+ 'title': 'My Package',
+ 'description': 'My Package for the Guide',
+ 'resources': [{'name': 'table',
+ 'type': 'table',
+ 'path': 'table.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}]}
+
+If you have created a package, for example, from a descriptor, you can access these properties:
+ +from frictionless import Package
+
+package = Package('datapackage.json')
+print(package.name)
+# and others
+
+
+test-tabulator
+
+ And edit them:
+ +from frictionless import Package
+
+package = Package('datapackage.json')
+package.name = 'new-name'
+package.title = 'New Title'
+package.description = 'New Description'
+# and others
+print(package)
+
+
+{'name': 'new-name',
+ 'title': 'New Title',
+ 'description': 'New Description',
+ 'resources': [{'name': 'first-resource',
+ 'type': 'table',
+ 'path': 'table.xls',
+ 'scheme': 'file',
+ 'format': 'xls',
+ 'mediatype': 'application/vnd.ms-excel',
+ 'schema': {'fields': [{'name': 'id', 'type': 'number'},
+ {'name': 'name', 'type': 'string'}]}},
+ {'name': 'number-two',
+ 'type': 'table',
+ 'path': 'table-reverse.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}]}
+
+ The core purpose of having a package is to provide an ability to have a set of resources. The Package class provides useful methods to manage resources:
+ +from frictionless import Package, Resource
+
+package = Package('datapackage.json')
+print(package.resources)
+print(package.resource_names)
+package.add_resource(Resource(name='new', data=[['key1', 'key2'], ['val1', 'val2']]))
+resource = package.get_resource('new')
+print(package.has_resource('new'))
+package.remove_resource('new')
+
+
+[{'name': 'first-resource',
+ 'type': 'table',
+ 'path': 'table.xls',
+ 'scheme': 'file',
+ 'format': 'xls',
+ 'mediatype': 'application/vnd.ms-excel',
+ 'schema': {'fields': [{'name': 'id', 'type': 'number'},
+ {'name': 'name', 'type': 'string'}]}}, {'name': 'number-two',
+ 'type': 'table',
+ 'path': 'table-reverse.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}]
+['first-resource', 'number-two']
+True
+
+Like any of the Metadata classes, the Package class can be saved as JSON or YAML:
+ +from frictionless import Package
+package = Package('tables/*')
+package.to_json('datapackage.json') # Save as JSON
+package.to_yaml('datapackage.yaml') # Save as YAML
+
+
+Package representation + +This class is one of the cornerstones of the Frictionless framework. +It manages underlying resources and provides an ability to describe a package. + +```python +package = Package(resources=[Resource(path="data/table.csv")]) +package.get_resource('table').read_rows() == [ + {'id': 1, 'name': 'english'}, + {'id': 2, 'name': '中国人'}, +] +```
+(*, source: Optional[Any] = None, control: Optional[Control] = None, basepath: Optional[str] = None, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, homepage: Optional[str] = None, profile: Optional[str] = None, licenses: List[Dict[str, Any]] = NOTHING, sources: List[Dict[str, Any]] = NOTHING, contributors: List[Dict[str, Any]] = NOTHING, keywords: List[str] = NOTHING, image: Optional[str] = None, version: Optional[str] = None, created: Optional[str] = None, resources: List[Resource] = NOTHING, dataset: Optional[Dataset] = None, dialect: Optional[Dialect] = None, detector: Optional[Detector] = None) -> None
++ # TODO: add docs +
+Optional[Any]
++ # TODO: add docs +
+Optional[Control]
++ # TODO: add docs +
+Optional[str]
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “.”, “_” or “-” characters. +
+Optional[str]
++ Type of the package +
+ClassVar[Union[str, None]]
++ A Package title according to the specs + It should be a human-oriented title of the package. +
+Optional[str]
++ A Package description according to the specs + It should be a human-oriented description of the package. +
+Optional[str]
++ A URL for the home on the web that is related to this package. + For example, github repository or ckan dataset address. +
+Optional[str]
++ A fully-qualified URL that points directly to a JSON Schema + that can be used to validate the descriptor +
+Optional[str]
++ The license(s) under which the package is provided. +
+List[Dict[str, Any]]
++ The raw sources for this data package. + It MUST be an array of Source objects. + Each Source object MUST have a title and + MAY have path and/or email properties. +
+List[Dict[str, Any]]
++ The people or organizations who contributed to this package. + It MUST be an array. Each entry is a Contributor and MUST be an object. + A Contributor MUST have a title property and MAY contain + path, email, role and organization properties. +
+List[Dict[str, Any]]
++ An Array of string keywords to assist users searching. + For example, ['data', 'fiscal'] +
+List[str]
++ An image to use for this data package. + For example, when showing the package in a listing. +
+Optional[str]
++ A version string identifying the version of the package. + It should conform to the Semantic Versioning requirements and + should follow the Data Package Version pattern. +
+Optional[str]
++ The datetime on which this was created. + The datetime must conform to the string formats for RFC3339 datetime. +
+Optional[str]
++ A list of resource descriptors. + It can be dicts or Resource instances +
+List[Resource]
++ It returns reference to dataset of which catalog the package is part of. If package + is not part of any catalog, then it is set to None. +
+Optional[Dataset]
++ # TODO: add docs +
+Optional[Dialect]
++ # TODO: add docs +
+Optional[Detector]
+A basepath of the package + +The normpath of the resource is joined `basepath` and `/path`
+Optional[str]
+Return names of resources
+List[str]
+Return names of resources
+List[str]
+Add new resource to the package
+(resource: Union[Resource, str]) -> Resource
+Analyze the resources of the package + +This feature is currently experimental, and its API may change +without warning.
+(*, detailed: bool = False)
+Remove all the resources
+Dereference underlying metadata + +If some of the underlying metadata is provided as a string, +it will be replaced by the metadata object
+Describe the given source as a package
+(source: Optional[Any] = None, *, stats: bool = False, **options: Any)
+Extract rows
+(*, name: Optional[str] = None, filter: Optional[types.IFilterFunction] = None, process: Optional[types.IProcessFunction] = None, limit_rows: Optional[int] = None) -> types.ITabularData
+Flatten the package + +Parameters + spec (str[]): flatten specification
+(spec: List[str] = [name, path])
+Get resource by name
+(name: str) -> Resource
+Get table resource by name (raise if not table)
+(name: str) -> TableResource
+Check if a resource is present
+(name: str) -> bool
+Check if a table resource is present
+(name: str) -> bool
+Infer metadata
+(*, stats: bool = False) -> None
+Publish package to any supported data portal
+(target: Any = None, *, control: Optional[Control] = None) -> PublishResult
+Remove resource by name
+(name: str) -> Resource
+Set resource by name
+(resource: Resource) -> Optional[Resource]
+Create a copy of the package
+(**options: Any) -> Self
+Generate an ERD (Entity Relationship Diagram) from package resources + +and export it as a .dot file + +Based on: +- https://github.com/frictionlessdata/frictionless-py/issues/1118
+(path: Optional[str] = None) -> str
+Transform package
+(self: Package, pipeline: Pipeline)
+Update resource
+(name: str, descriptor: types.IDescriptor) -> Resource
+Validate package
+(self: Package, checklist: Optional[Checklist] = None, *, name: Optional[str] = None, parallel: bool = False, limit_rows: Optional[int] = None, limit_errors: int = 1000)
+Pipeline is an object containing a list of transformation steps.
+Let's create a pipeline using Python interface:
+ +from frictionless import Pipeline, transform, steps
+
+pipeline = Pipeline(steps=[steps.table_normalize(), steps.table_melt(field_name='name')])
+print(pipeline)
+
+
+{'steps': [{'type': 'table-normalize'},
+ {'type': 'table-melt', 'fieldName': 'name'}]}
+
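+Note how the Python argument `field_name` appears as `fieldName` in the printed descriptor. A minimal plain-Python sketch of that naming convention follows; the helpers `to_camel_case` and `step_descriptor` are hypothetical and only illustrate the mapping, not how the framework serializes its metadata.

```python
from typing import Any, Dict


def to_camel_case(name: str) -> str:
    """Convert a snake_case option name to its camelCase descriptor key."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def step_descriptor(step_type: str, **options: Any) -> Dict[str, Any]:
    """Build a step descriptor dictionary from Python-style keyword options."""
    descriptor: Dict[str, Any] = {"type": step_type}
    for key, value in options.items():
        descriptor[to_camel_case(key)] = value
    return descriptor


print(step_descriptor("table-normalize"))
print(step_descriptor("table-melt", field_name="name"))
```

+The same convention applies to other descriptors in this section, for example `header_rows` in Python and `headerRows` in a dialect descriptor.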
+ To run a pipeline you need to use a transform function or method:
+ +from frictionless import Pipeline, transform, steps
+
+pipeline = Pipeline(steps=[steps.table_normalize(), steps.table_melt(field_name='name')])
+resource = transform('table.csv', pipeline=pipeline)
+print(resource.schema)
+print(resource.read_rows())
+
+
+{'fields': [{'name': 'name', 'type': 'string'},
+ {'name': 'variable', 'type': 'string'},
+ {'name': 'value', 'type': 'any'}]}
+[{'name': 'english', 'variable': 'id', 'value': 1}, {'name': '中国人', 'variable': 'id', 'value': 2}]
+
+ The Step concept is a part of the Transform API. You can create a custom Step to be used as part of resource or package transformation.
+++This step uses PETL under the hood.
+
from frictionless import Step
+
+class cell_set(Step):
+ code = "cell-set"
+
+ def __init__(self, descriptor=None, *, value=None, field_name=None):
+ self.setinitial("value", value)
+ self.setinitial("fieldName", field_name)
+ super().__init__(descriptor)
+
+ def transform_resource(self, resource):
+ value = self.get("value")
+ field_name = self.get("fieldName")
+ yield from resource.to_petl().update(field_name, value)
+
+Pipeline representation
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, steps: List[Step] = NOTHING) -> None
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +
+Optional[str]
++ Type of the package +
+ClassVar[Union[str, None]]
++ A human-oriented title for the Pipeline. +
+Optional[str]
++ A brief description of the Pipeline. +
+Optional[str]
++ List of transformation steps to apply. +
+List[Step]
+Return type list of the steps
+List[str]
+Add new step to the pipeline
+(step: Step) -> None
+Remove all the steps
+() -> None
+Get step by type
+(type: str) -> Step
+Check if a step is present
+(type: str) -> bool
+Remove step by type
+(type: str) -> Step
+Set step by type
+(step: Step) -> Optional[Step]
+Step representation. + +A base class for all the step subclasses.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +
+Optional[str]
++ A short url-usable (and preferably human-readable) name/type. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. For example: "cell-fill". +
+ClassVar[str]
++ A human-oriented title for the Step. +
+Optional[str]
++ A brief description of the Step. +
+Optional[str]
+Transform package
+(package: Package)
+Transform resource
+(resource: Resource)
+All the validate
functions return the Validation Report. It's a unified object containing information about a validation: source details, errors found, etc. Let's explore a report:
from frictionless import validate
+
+report = validate('capital-invalid.csv', pick_errors=['duplicate-label'])
+print(report)
+
+
+{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.007},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+ 'type': 'table',
+ 'valid': False,
+ 'place': 'capital-invalid.csv',
+ 'labels': ['id', 'name', 'name'],
+ 'stats': {'errors': 1,
+ 'warnings': 0,
+ 'seconds': 0.007,
+ 'md5': 'dcdeae358cfd50860c18d953e021f836',
+ 'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+ 'bytes': 171,
+ 'fields': 3,
+ 'rows': 11},
+ 'warnings': [],
+ 'errors': [{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the '
+ 'same value. Column names should be '
+ 'unique.',
+ 'message': 'Label "name" in the header at position "3" '
+ 'is duplicated to a label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3}]}]}
+
+As we can see, there is a lot of information; you can find a detailed description in the "API Reference". Errors are grouped by tables; for some validations there can be dozens of tables. Let's use the report.flatten
function to simplify the representation of errors:
from pprint import pprint
+from frictionless import validate
+
+report = validate('capital-invalid.csv', pick_errors=['duplicate-label'])
+pprint(report.flatten(['rowNumber', 'fieldNumber', 'type', 'message']))
+
+
+[[None,
+ 3,
+ 'duplicate-label',
+ 'Label "name" in the header at position "3" is duplicated to a label: at '
+ 'position "2"']]
+
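+Conceptually, flattening walks over every error in the report and picks the requested properties, returning `None` for any property an error does not have. The plain-Python sketch below works on a report dictionary shaped like the output printed earlier; `flatten_report` is a hypothetical helper, not the framework's method.

```python
from typing import Any, Dict, List


def flatten_report(report: Dict[str, Any], spec: List[str]) -> List[List[Any]]:
    """Return one flat list per error, with values picked according to the spec."""
    result = []
    # Top-level errors are not tied to any task.
    for error in report.get("errors", []):
        context = {"taskNumber": None, **error}
        result.append([context.get(name) for name in spec])
    # Task-level errors additionally know which task they belong to.
    for task_number, task in enumerate(report.get("tasks", []), start=1):
        for error in task.get("errors", []):
            context = {"taskNumber": task_number, **error}
            result.append([context.get(name) for name in spec])
    return result


sample_report = {
    "errors": [],
    "tasks": [
        {
            "name": "capital-invalid",
            "errors": [
                {"type": "duplicate-label", "fieldNumber": 3, "label": "name"},
            ],
        }
    ],
}
print(flatten_report(sample_report, ["taskNumber", "rowNumber", "fieldNumber", "type"]))
```

+A property missing from an error, such as `rowNumber` for a header error, simply shows up as `None` in the flattened row.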
+In some situations, an error can't be associated with a table; then it goes to the top-level report.errors
property:
from frictionless import validate
+
+report = validate('bad.json', type='schema')
+print(report)
+
+
+{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.0},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'bad',
+ 'type': 'json',
+ 'valid': False,
+ 'place': 'bad.json',
+ 'labels': [],
+ 'stats': {'errors': 1, 'warnings': 0, 'seconds': 0.0},
+ 'warnings': [],
+ 'errors': [{'type': 'schema-error',
+ 'title': 'Schema Error',
+ 'description': 'Provided schema is not valid.',
+ 'message': 'Schema is not valid: cannot retrieve '
+ 'metadata "bad.json" because "[Errno 2] No '
+ 'such file or directory: \'bad.json\'"',
+ 'tags': [],
+ 'note': 'cannot retrieve metadata "bad.json" because '
+ '"[Errno 2] No such file or directory: '
+ '\'bad.json\'"'}]}]}
+
+ The Error object is at the heart of the validation process. The Report has report.errors
and report.tasks[].errors
properties that can contain the Error object. Let's explore it:
from frictionless import validate
+
+report = validate('capital-invalid.csv', pick_errors=['duplicate-label'])
+error = report.task.error # it's only available for 1 table / 1 error situation
+print(f'Type: "{error.type}"')
+print(f'Title: "{error.title}"')
+print(f'Tags: "{error.tags}"')
+print(f'Note: "{error.note}"')
+print(f'Message: "{error.message}"')
+print(f'Description: "{error.description}"')
+
+
+Type: "duplicate-label"
+Title: "Duplicate Label"
+Tags: "['#table', '#header', '#label']"
+Note: "at position "2""
+Message: "Label "name" in the header at position "3" is duplicated to a label: at position "2""
+Description: "Two columns in the header row have the same value. Column names should be unique."
+
+ Above, we have listed universal error properties. Depending on the type of an error there can be additional ones. For example, for our duplicate-label
error:
from frictionless import validate
+
+report = validate('capital-invalid.csv', pick_errors=['duplicate-label'])
+error = report.task.error # it's only available for 1 table / 1 error situation
+print(error)
+
+
+{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the same value. Column '
+ 'names should be unique.',
+ 'message': 'Label "name" in the header at position "3" is duplicated to a '
+ 'label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3}
+
+ Please explore "Errors Reference" to learn about all the available errors and their properties.
+Report representation. + +A class that stores the summary of the validation action.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, valid: bool, stats: types.IReportStats, warnings: List[str] = NOTHING, errors: List[Error] = NOTHING, tasks: List[ReportTask] = NOTHING) -> None
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +
+Optional[str]
++ Type of the package +
+ClassVar[Union[str, None]]
++ A human-oriented title for the Report. +
+Optional[str]
+ A brief description of the Report. +
+Optional[str]
++ Flag to specify if the data is valid or not. +
+bool
++ Additional statistics of the data as defined in Stats class. +
+types.IReportStats
++ List of warnings raised while validating the data. +
+List[str]
++ List of errors raised while validating the data. +
+List[Error]
+ List of tasks that were applied during data validation. +
+List[ReportTask]
+Validation error (if there is only one)
+Validation task (if there is only one)
+Flatten the report + +Parameters + spec (str[]): flatten specification
+(spec: List[str] = [taskNumber, rowNumber, fieldNumber, type])
+Create a report from a validation
+(*, time: float = 0, tasks: List[ReportTask] = [], errors: List[Error] = [], warnings: List[str] = [])
+Create a report from a set of validation reports
+(*, time: float, reports: List[Report])
+Create a report from a validation task
+(resource: Resource, *, time: float, labels: List[str] = [], errors: List[Error] = [], warnings: List[str] = [])
+Summary of the report
+Report task representation.
+(*, name: str, type: Optional[str], title: Optional[str] = None, description: Optional[str] = None, valid: bool, place: str, labels: List[str], stats: types.IReportTaskStats, warnings: List[str] = NOTHING, errors: List[Error] = NOTHING) -> None
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +
+str
++ Sets the property tabular to True if the type is "table". +
+Optional[str]
++ A human-oriented title for the Report. +
+Optional[str]
+ A brief description of the task. +
+Optional[str]
++ Flag to specify if the data is valid or not. +
+bool
+
+ Specifies the place of the file. For example: "
str
++ List of labels of the task resource. +
+List[str]
++ Additional statistics of the data as defined in Stats class. +
+types.IReportTaskStats
++ List of warnings raised while validating the data. +
+List[str]
++ List of errors raised while validating the data. +
+List[Error]
+Validation error if there is only one
+Whether task's resource is tabular
+bool
+Flatten the report + +Parameters + spec (any[]): flatten specification
+(spec: List[str] = [rowNumber, fieldNumber, type])
+Generate summary for validation task
+() -> str
+The Resource class is arguably the most important class of the whole Frictionless Framework. It's based on the Data Resource Standard and the Tabular Data Resource Standard.
+Let's create a data resource:
+ +from frictionless import Resource
+
+resource = Resource('table.csv') # from a resource path
+resource = Resource('resource.json') # from a descriptor path
+resource = Resource({'path': 'table.csv'}) # from a descriptor
+resource = Resource(path='table.csv') # from arguments
+
+
 As you can see, it's possible to create a resource from different kinds of sources, and the kind of source is detected automatically (e.g. whether it's a descriptor or a path). It's possible to make this step more explicit:
+ +from frictionless import Resource
+
+resource = Resource(path='data/table.csv') # from a path
+resource = Resource('data/resource.json') # from a descriptor
+
+
 The standards support a great deal of resource metadata, all of which is available in Frictionless Framework too:
+ +from frictionless import Resource
+
+resource = Resource(
+ name='resource',
+ title='My Resource',
+ description='My Resource for the Guide',
+ path='table.csv',
+ # it's possible to provide all the official properties like mediatype, etc
+)
+print(resource)
+
+
+{'name': 'resource',
+ 'type': 'table',
+ 'title': 'My Resource',
+ 'description': 'My Resource for the Guide',
+ 'path': 'table.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
+
 If you have created a resource, for example, from a descriptor, you can access these properties:
+ +from frictionless import Resource
+
+resource = Resource('resource.json')
+print(resource.name)
+# and others
+
+
+name
+
+ And edit them:
+ +from frictionless import Resource
+
+resource = Resource('resource.json')
+resource.name = 'new-name'
+resource.title = 'New Title'
+resource.description = 'New Description'
+# and others
+print(resource)
+
+
+{'name': 'new-name',
+ 'type': 'table',
+ 'title': 'New Title',
+ 'description': 'New Description',
+ 'path': 'table.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
+
 Like any of the Metadata classes, the Resource class can be saved as JSON or YAML:
+ +from frictionless import Resource
+resource = Resource('table.csv')
+resource.to_json('resource.json') # Save as JSON
+resource.to_yaml('resource.yaml') # Save as YAML
+
+
+ You might have noticed that we had to duplicate the with Resource(...)
statement in some examples. The reason is that Resource is a streaming interface. Once it's read you need to open it again. Let's show it in an example:
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('capital-3.csv')
+resource.open()
+pprint(resource.read_rows())
+pprint(resource.read_rows())
+# We need to re-open: there is no data left
+resource.open()
+pprint(resource.read_rows())
+# We need to close manually: no context manager is used
+resource.close()
+
+
+[{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}]
+[]
+[{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}]
+
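The read-once behaviour shown above is typical of any iterator-backed reader. The following stdlib-only sketch illustrates the idea; the class `OneShotTable` is hypothetical and only mimics the `open` / `read_rows` / `close` life cycle, it is not how Frictionless is implemented.

```python
import csv
import io
from typing import Dict, Iterator, List, Optional


class OneShotTable:
    """Hypothetical reader whose rows come from an iterator that is consumed once."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._rows: Optional[Iterator[Dict[str, str]]] = None

    def open(self) -> None:
        # (Re)create the underlying iterator from the start of the data
        self._rows = csv.DictReader(io.StringIO(self._text))

    def read_rows(self) -> List[Dict[str, str]]:
        if self._rows is None:
            raise RuntimeError("table is not open")
        # Draining the iterator leaves nothing for the next call
        return list(self._rows)

    def close(self) -> None:
        self._rows = None


table = OneShotTable("id,name\n1,London\n2,Berlin\n")
table.open()
first = table.read_rows()   # all rows
second = table.read_rows()  # empty list: the iterator is exhausted
table.open()
third = table.read_rows()   # all rows again after re-opening
table.close()
```

Unlike the Frictionless output above, this sketch keeps every value as a string because it does no type casting.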
 At the same time, you can read data for a resource without opening and closing it explicitly. In this case Frictionless Framework will open and close the resource for you, so it will basically be a one-time operation:
 +from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('capital-3.csv')
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}]
+
+ The Resource class is also a metadata class which provides various read and stream functions. The extract
functions always read rows into memory; Resource can do the same but it also gives a choice regarding output data. It can be rows
, data
, text
, or bytes
. Let's try reading all of them:
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('country-3.csv')
+pprint(resource.read_bytes())
+pprint(resource.read_text())
+pprint(resource.read_cells())
+pprint(resource.read_rows())
+
+
+(b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n3,2,Germany,8'
+ b'3\n4,5,Italy,60\n5,4,Spain,47\n')
+''
+[['id', 'capital_id', 'name', 'population'],
+ ['1', '1', 'Britain', '67'],
+ ['2', '3', 'France', '67'],
+ ['3', '2', 'Germany', '83'],
+ ['4', '5', 'Italy', '60'],
+ ['5', '4', 'Spain', '47']]
+[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
+ {'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
+ {'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
+ {'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
+ {'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
+
 It's really handy to read all your data into memory, but that's not always possible if a file is really big. For such cases, Frictionless provides streaming functions:
 +from pprint import pprint
+from frictionless import Resource
+
+with Resource('country-3.csv') as resource:
+ pprint(resource.byte_stream)
+ pprint(resource.text_stream)
+ pprint(resource.cell_stream)
+ pprint(resource.row_stream)
+ for row in resource.row_stream:
+ print(row)
+
+
+<frictionless.system.loader.ByteStreamWithStatsHandling object at 0x7fe60aee8190>
+<_io.TextIOWrapper name='country-3.csv' encoding='utf-8'>
+<itertools.chain object at 0x7fe60b1285b0>
+<generator object TableResource.__open_row_stream.<locals>.row_stream at 0x7fe60cf2af10>
+{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67}
+{'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67}
+{'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83}
+{'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60}
+{'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}
+
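As a rough illustration of why streaming helps with big files, the sketch below processes rows one at a time with a generator, so only the current row is held in memory. It uses the standard `csv` module rather than Frictionless, and the helper name `stream_rows` and the sample data are assumptions made for the example.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def stream_rows(text_stream: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one row at a time instead of building a list of all rows."""
    for row in csv.DictReader(text_stream):
        yield row


# An in-memory stand-in for a large file
source = io.StringIO("id,name,population\n1,Britain,67\n2,France,67\n3,Germany,83\n")

total = 0
for row in stream_rows(source):
    # Each row is discarded after use, so memory use stays flat
    total += int(row["population"])
print(total)
```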
+ frictionless@5.5
as a feature preview and request for comments. The implementation is raw and doesn't cover many edge cases.
+ Indexing resource in Frictionless terms means loading a data table into a database. Let's explore how this feature works in different modes.
+++All the examples are written for SQLite for simplicity.
+
This mode is supported for any database that is supported by sqlalchemy
. Under the hood, Frictionless will infer a Table Schema and populate the data table as it normally reads data. This means that type errors will be replaced by null
 values and, in general, indexing is guaranteed to finish successfully for any data, even highly invalid data.
frictionless index table.csv --database sqlite:///index/project.db --name table
+frictionless extract sqlite:///index/project.db --table table --json
+
+
+──────────────────────────────────── Index ─────────────────────────────────────
+
+[table] Indexed 3 rows in 0.204 seconds
+──────────────────────────────────── Result ────────────────────────────────────
+Succesefully indexed 1 tables
+{
+ "project": [
+ {
+ "id": 1,
+ "name": "english"
+ },
+ {
+ "id": 2,
+ "name": "中国人"
+ }
+ ]
+}
+
+ import sqlite3
+from frictionless import Resource, formats
+
+resource = Resource('table.csv')
+resource.index('sqlite:///index/project.db', name='table')
+print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
+
+
+{'project': [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]}
+
+ sqlite3@3.34+
command to be available.
 Fast mode is supported for SQLite and PostgreSQL databases. It will infer a Table Schema using a data sample and index data using COPY
 in PostgreSQL and .import
 in SQLite. For big data files this mode will be 10-30x faster than normal indexing, but the speed comes at a price: if there is invalid data, the indexing will fail.
frictionless index table.csv --database sqlite:///index/project.db --name table --fast
+frictionless extract sqlite:///index/project.db --table table --json
+
+
+──────────────────────────────────── Index ─────────────────────────────────────
+
+[table] Indexed 30 bytes in 0.208 seconds
+──────────────────────────────────── Result ────────────────────────────────────
+Succesefully indexed 1 tables
+{
+ "project": [
+ {
+ "id": 1,
+ "name": "english"
+ },
+ {
+ "id": 2,
+ "name": "中国人"
+ }
+ ]
+}
+
+ import sqlite3
+from frictionless import Resource, formats
+
+resource = Resource('table.csv')
+resource.index('sqlite:///index/project.db', name='table', fast=True)
+print(Resource('sqlite:///index/project.db', control=formats.sql.SqlControl(table='table')).extract())
+
+
+{'project': [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]}
+
 To ensure that the data is indexed successfully, you can use the fallback
 option. If fast indexing fails, Frictionless will start over in normal mode and finish the process successfully.
frictionless index table.csv --database sqlite:///index/project.db --name table --fast --fallback
+
+
+ import sqlite3
+from frictionless import Resource, formats
+
+resource = Resource('table.csv')
+resource.index('sqlite:///index/project.db', name='table', fast=True, fallback=True)
+
+
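The relationship between the two modes and the fallback option can be sketched with the standard `sqlite3` module. This is an illustration under stated assumptions, not the Frictionless implementation: the table name `data`, its fixed two-column schema, and the helpers `cast_strict`, `cast_lenient` and `index_rows` are all hypothetical. The strict path fails on the first invalid value, while the lenient path replaces type errors with NULL.

```python
import sqlite3
from typing import Optional, Sequence, Tuple

RawRow = Tuple[str, str]


def cast_strict(row: RawRow) -> Tuple[int, str]:
    # "Fast" path: any value that is not a valid integer raises ValueError
    return int(row[0]), row[1]


def cast_lenient(row: RawRow) -> Tuple[Optional[int], str]:
    # "Normal" path: a type error becomes NULL instead of aborting the load
    try:
        return int(row[0]), row[1]
    except ValueError:
        return None, row[1]


def index_rows(conn: sqlite3.Connection, rows: Sequence[RawRow], *, fallback: bool = False) -> str:
    """Load rows into a table and return the mode that succeeded."""
    conn.execute("DROP TABLE IF EXISTS data")
    conn.execute("CREATE TABLE data (id INTEGER, name TEXT)")
    try:
        strict = [cast_strict(row) for row in rows]
        conn.executemany("INSERT INTO data VALUES (?, ?)", strict)
        return "fast"
    except ValueError:
        if not fallback:
            raise
    # Start over in lenient mode; nothing was inserted by the failed attempt
    lenient = [cast_lenient(row) for row in rows]
    conn.executemany("INSERT INTO data VALUES (?, ?)", lenient)
    return "normal"


conn = sqlite3.connect(":memory:")
mode = index_rows(conn, [("1", "english"), ("two", "中国人")], fallback=True)
stored = conn.execute("SELECT id, name FROM data ORDER BY rowid").fetchall()
print(mode, stored)
```

Without `fallback=True` the same call would raise `ValueError`, which corresponds to fast indexing failing on invalid data.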
 Another option is to provide a path to the QSV binary. In this case, the initial schema inference will be based on the whole data file, which guarantees that the table is valid type-wise:
+ +frictionless index table.csv --database sqlite:///index/project.db --name table --fast --qsv qsv_path
+
+
+ import sqlite3
+from frictionless import Resource, formats
+
+resource = Resource('table.csv')
+resource.index('sqlite:///index/project.db', name='table', fast=True, qsv_path='qsv_path')
+
+
 The scheme, also known as the protocol, indicates which loader Frictionless should use to read or write data. It can be file
(default), text
, http
, https
, s3
, and others.
from frictionless import Resource
+
+with Resource(b'header1,header2\nvalue1,value2', format='csv') as resource:
+ print(resource.scheme)
+ print(resource.to_view())
+
+
+buffer
++----------+----------+
+| header1 | header2 |
++==========+==========+
+| 'value1' | 'value2' |
++----------+----------+
+
 The format, also called the extension, helps Frictionless choose a proper parser to handle the file. Popular formats are csv
, xlsx
, json
 and others.
from frictionless import Resource
+
+with Resource(b'header1,header2\nvalue1,value2.csv', format='csv') as resource:
+ print(resource.format)
+ print(resource.to_view())
+
+
+csv
++----------+--------------+
+| header1 | header2 |
++==========+==============+
+| 'value1' | 'value2.csv' |
++----------+--------------+
+
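When neither scheme nor format is given explicitly, they have to be derived from the source. A simplified sketch of such detection for path-like sources is shown below. The function `detect_scheme_and_format` is hypothetical and does not reproduce Frictionless' actual rules (for example, it does not handle byte buffers or inline data).

```python
import os
from typing import Tuple
from urllib.parse import urlparse


def detect_scheme_and_format(path: str) -> Tuple[str, str]:
    """Guess the scheme from the URL prefix and the format from the file extension."""
    parsed = urlparse(path)
    # No scheme means a local file; a one-letter "scheme" is most likely a Windows drive letter
    scheme = parsed.scheme if len(parsed.scheme) > 1 else "file"
    extension = os.path.splitext(parsed.path)[1]
    return scheme, extension.lstrip(".").lower()


print(detect_scheme_and_format("table.csv"))
print(detect_scheme_and_format("https://example.com/data/table.XLSX?download=1"))
print(detect_scheme_and_format("s3://bucket/key.json"))
```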
 Frictionless automatically detects the encoding of files, but the detection can sometimes be inaccurate. It's possible to provide an encoding manually:
+ +from frictionless import Resource
+
+with Resource('country-3.csv', encoding='utf-8') as resource:
+ print(resource.encoding)
+ print(resource.path)
+
+
+utf-8
+country-3.csv
+
+ utf-8
+data/country-3.csv
+
+By default, Frictionless uses the first file found in a zip archive. It's possible to adjust this behaviour:
+ +from frictionless import Resource
+
+with Resource('table-multiple-files.zip', innerpath='table-reverse.csv') as resource:
+ print(resource.compression)
+ print(resource.innerpath)
+ print(resource.to_view())
+
+
+zip
+table-reverse.csv
++----+-----------+
+| id | name |
++====+===========+
+| 1 | '中国人' |
++----+-----------+
+| 2 | 'english' |
++----+-----------+
+
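The "first file by default, explicit inner path otherwise" behaviour can be illustrated with the standard `zipfile` module. This is only a sketch of the idea: the helper `read_from_zip` and the in-memory archive are assumptions for the example, not part of Frictionless.

```python
import io
import zipfile
from typing import Optional


def read_from_zip(archive: bytes, innerpath: Optional[str] = None) -> bytes:
    """Read one member of a zip archive, defaulting to the first one listed."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        name = innerpath if innerpath is not None else zf.namelist()[0]
        return zf.read(name)


# Build a small archive in memory with two hypothetical members
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("table.csv", "id,name\n1,english\n")
    zf.writestr("table-reverse.csv", "id,name\n1,中国人\n")
archive = buffer.getvalue()

default_member = read_from_zip(archive)
chosen_member = read_from_zip(archive, innerpath="table-reverse.csv")
print(default_member.decode("utf-8"))
print(chosen_member.decode("utf-8"))
```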
+ It's possible to adjust compression detection by providing the algorithm explicitly. For the example below it's not required as it would be detected anyway:
+ +from frictionless import Resource
+
+with Resource('table.csv.zip', compression='zip') as resource:
+ print(resource.compression)
+ print(resource.to_view())
+
+
+zip
++----+-----------+
+| id | name |
++====+===========+
+| 1 | 'english' |
++----+-----------+
+| 2 | '中国人' |
++----+-----------+
+
+ Please read Table Dialect Guide for more information.
+Please read Table Schema Guide for more information.
+Please read Checklist Guide for more information.
+Please read Pipeline Guide for more information.
+Resource's stats can be accessed with resource.stats
:
from frictionless import Resource
+
+resource = Resource('table.csv')
+resource.infer(stats=True)
+print(resource.stats)
+
+
+<frictionless.resource.stats.ResourceStats object at 0x7fe60a078a90>
+
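The printed stats object above does not show its contents, so here is a hedged sketch of the kind of values such statistics describe: a hash, a byte count, a field count, and a row count. The function `compute_stats`, the dictionary keys, and the choice of SHA-256 are assumptions for illustration and are not taken from the Frictionless API.

```python
import csv
import hashlib
import io
from typing import Dict, Union


def compute_stats(data: bytes) -> Dict[str, Union[str, int]]:
    """Compute simple statistics for UTF-8 encoded CSV bytes with a header row."""
    table = list(csv.reader(io.StringIO(data.decode("utf-8"))))
    header, rows = table[0], table[1:]
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),     # size of the raw file content
        "fields": len(header),  # number of columns
        "rows": len(rows),      # number of data rows, header excluded
    }


stats = compute_stats("id,name\n1,english\n2,中国人\n".encode("utf-8"))
print(stats)
```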
 Resource representation. + +This class is one of the cornerstones of the Frictionless Framework. +It loads a data source, and allows you to stream its parsed contents. +At the same time, it's a metadata class for data description. + +```python +with Resource("data/table.csv") as resource: + resource.header == ["id", "name"] + resource.read_rows() == [ + {'id': 1, 'name': 'english'}, + {'id': 2, 'name': '中国人'}, + ] +```
+(*, source: Optional[Any] = None, control: Optional[Control] = None, packagify: bool = False, name: Optional[str] = , title: Optional[str] = None, description: Optional[str] = None, homepage: Optional[str] = None, profile: Optional[str] = None, licenses: List[Dict[str, Any]] = NOTHING, sources: List[Dict[str, Any]] = NOTHING, path: Optional[str] = None, data: Optional[Any] = None, scheme: Optional[str] = None, format: Optional[str] = None, datatype: Optional[str] = , mediatype: Optional[str] = None, compression: Optional[str] = None, extrapaths: List[str] = NOTHING, innerpath: Optional[str] = None, encoding: Optional[str] = None, hash: Optional[str] = None, bytes: Optional[int] = None, fields: Optional[int] = None, rows: Optional[int] = None, dialect: Union[Dialect, str] = NOTHING, schema: Union[Schema, str] = NOTHING, basepath: Optional[str] = None, detector: Detector = NOTHING, package: Optional[Package] = None) -> None
++ # TODO: add docs +
+Optional[Any]
++ # TODO: add docs +
+Optional[Control]
++ # TODO: add docs +
+bool
++ Resource name according to the specs. + It should be a slugified name of the resource. +
+Optional[str]
++ Type of the resource +
+ClassVar[str]
+ Resource title according to the specs + It should be a human-oriented title of the resource. +
+Optional[str]
+ Resource description according to the specs + It should be a human-oriented description of the resource. +
+Optional[str]
++ A URL for the home on the web that is related to this package. + For example, github repository or ckan dataset address. +
+Optional[str]
++ A fully-qualified URL that points directly to a JSON Schema + that can be used to validate the descriptor +
+Optional[str]
++ The license(s) under which the resource is provided. + If omitted it's considered the same as the package's licenses. +
+List[Dict[str, Any]]
++ The raw sources for this data resource. + It MUST be an array of Source objects. + Each Source object MUST have a title and + MAY have path and/or email properties. +
+List[Dict[str, Any]]
++ Path to data source +
+Optional[str]
++ Inline data source +
+Optional[Any]
++ Scheme for loading the file (file, http, ...). + If not set, it'll be inferred from `source`. +
+Optional[str]
++ File source's format (csv, xls, ...). + If not set, it'll be inferred from `source`. +
+Optional[str]
++ Frictionless Framework specific data type as "table" or "schema" +
+Optional[str]
++ Mediatype/mimetype of the resource e.g. “text/csv”, + or “application/vnd.ms-excel”. Mediatypes are maintained by the + Internet Assigned Numbers Authority (IANA) in a media type registry. +
+Optional[str]
++ Source file compression (zip, ...). + If not set, it'll be inferred from `source`. +
+Optional[str]
++ List of paths to concatenate to the main path. + It's used for multipart resources. +
+List[str]
++ Path within the compressed file. + It defaults to the first file in the archive (if the source is an archive). +
+Optional[str]
++ Source encoding. + If not set, it'll be inferred from `source`. +
+Optional[str]
++ # TODO: add docs +
+Optional[str]
++ # TODO: add docs +
+Optional[int]
++ # TODO: add docs +
+Optional[int]
++ # TODO: add docs +
+Optional[int]
++ # TODO: add docs +
+Union[Dialect, str]
++ # TODO: add docs +
+Union[Schema, str]
++ # TODO: add docs +
+Optional[str]
++ File/table detector. + For more information, please check the Detector documentation. +
+Detector
++ Parental to this resource package. + For more information, please check the Package documentation. +
+Optional[Package]
++ # TODO: add docs +
+ResourceStats
++ Whether the resource is tabular +
+ClassVar[bool]
+A basepath of the resource + +The normpath of the resource is joined `basepath` and `/path`
+Optional[str]
+File's bytes used as a sample + +These buffer bytes are used to infer characteristics of the +source file (e.g. encoding, ...).
+types.IBuffer
+Byte stream in form of a generator
+types.IByteStream
+Whether the table is closed
+bool
+Whether resource is not path based
+bool
+Whether resource is multipart
+bool
+Normalized path of the resource or raise if not set
+Optional[str]
+Normalized paths of the resource
+List[str]
+All paths of the resource
+List[str]
+Stringified resource location
+str
+Whether resource is remote
+bool
+Text stream in form of a generator
+types.ITextStream
+Close the resource as "filelike.close" does
+() -> None
+Dereference underlying metadata + +If some of the underlying metadata is provided as a string +it will be replaced by the metadata object
+Describe the given source as a resource
+(source: Optional[Any] = None, *, name: Optional[str] = None, type: Optional[str] = None, stats: bool = False, **options: Any) -> Metadata
+Infer metadata
+(*, stats: bool = False) -> None
+List dataset resources
+(*, name: Optional[str] = None) -> List[Resource]
+Open the resource as "io.open" does
+Read bytes into memory
+(*, size: Optional[int] = None) -> bytes
+Read data into memory
+(*, size: Optional[int] = None) -> Any
+Read text into memory
+(*, size: Optional[int] = None) -> str
+Create a copy from the resource
+(**options: Any) -> Self
+Validate resource
+(checklist: Optional[Checklist] = None, *, name: Optional[str] = None, on_row: Optional[types.ICallbackFunction] = None, parallel: bool = False, limit_rows: Optional[int] = None, limit_errors: int = 1000) -> Report
+Table Schema is a core Frictionless Data concept: metadata that describes a tabular data source. You can read the Table Schema Standard for more information.
+Let's create a table schema:
+ +from frictionless import Schema, fields, describe
+
+schema = describe('table.csv', type='schema') # from a resource path
+schema = Schema.from_descriptor('schema.json') # from a descriptor path
+schema = Schema.from_descriptor({'fields': [{'name': 'id', 'type': 'integer'}]}) # from a descriptor
+
+
 As you can see, it's possible to create a schema from different kinds of sources, and the kind of source is detected automatically (e.g. whether it's a dict or a path). It's possible to make this step more explicit:
 +from frictionless import Schema, fields
+
+schema = Schema(fields=[fields.StringField(name='id')]) # from fields
+schema = Schema.from_descriptor('schema.json') # from a descriptor
+
+
 The standard supports some additional schema metadata:
+ +from frictionless import Schema, fields
+
+schema = Schema(
+ fields=[fields.StringField(name='id')],
+ missing_values=['na'],
+ primary_key=['id'],
+ # foreign_keys
+)
+print(schema)
+
+
+{'fields': [{'name': 'id', 'type': 'string'}],
+ 'missingValues': ['na'],
+ 'primaryKey': ['id']}
+
 If you have created a schema, for example, from a descriptor, you can access these properties:
+ +from frictionless import Schema
+
+schema = Schema.from_descriptor('schema.json')
+print(schema.missing_values)
+# and others
+
+
+['']
+
+ And edit them:
+ +from frictionless import Schema
+
+schema = Schema.from_descriptor('schema.json')
+schema.missing_values.append('-')
+# and others
+print(schema)
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}],
+ 'missingValues': ['', '-']}
+
+ The Schema class provides useful methods to manage fields:
+ +from frictionless import Schema, fields
+
+schema = Schema.from_descriptor('schema.json')
+print(schema.fields)
+print(schema.field_names)
+schema.add_field(fields.StringField(name='new-name'))
+field = schema.get_field('new-name')
+print(schema.has_field('new-name'))
+schema.remove_field('new-name')
+
+
+[{'name': 'id', 'type': 'integer'}, {'name': 'name', 'type': 'string'}]
+['id', 'name']
+True
+
 Like any of the Metadata classes, the Schema class can be saved as JSON or YAML:
+ +from frictionless import Schema, fields
+schema = Schema(fields=[fields.IntegerField(name='id')])
+schema.to_json('schema.json') # Save as JSON
+schema.to_yaml('schema.yaml') # Save as YAML
+
+
+ During the process of data reading a resource uses a schema to convert data:
+ +from frictionless import Schema, fields
+
+schema = Schema(fields=[fields.IntegerField(name='integer'), fields.StringField(name='string')])
+cells, notes = schema.read_cells(['3', 'value'])
+print(cells)
+
+
+[3, 'value']
+
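The call above returns both the converted cells and a parallel list of notes. A simplified sketch of that idea follows, assuming only two field types and a hypothetical note format. The standalone function `read_cells` and the `CASTERS` table are illustrative and are not the Frictionless implementation.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical mapping from a field type name to a casting function
CASTERS: Dict[str, Callable[[str], Any]] = {"integer": int, "string": str}


def read_cells(
    types: Sequence[str],
    cells: Sequence[str],
    missing_values: Sequence[str] = ("",),
) -> Tuple[List[Any], List[Optional[str]]]:
    """Cast each cell by its field type; return the values plus one note per cell."""
    values: List[Any] = []
    notes: List[Optional[str]] = []
    for type_name, cell in zip(types, cells):
        if cell in missing_values:
            # A missing value is read as None and is not an error
            values.append(None)
            notes.append(None)
            continue
        try:
            values.append(CASTERS[type_name](cell))
            notes.append(None)
        except ValueError:
            # A failed cast yields None plus a note explaining the expectation
            values.append(None)
            notes.append(f'type is "{type_name}"')
    return values, notes


print(read_cells(["integer", "string"], ["3", "value"]))
print(read_cells(["integer", "string"], ["x", ""]))
```

Returning a note instead of raising lets the caller keep reading the rest of the table and report all problems at the end.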
+ During the process of data writing a resource uses a schema to convert data:
+ +from frictionless import Schema, fields
+
+schema = Schema(fields=[fields.IntegerField(name='integer'), fields.StringField(name='string')])
+cells, notes = schema.write_cells([3, 'value'])
+print(cells)
+
+
+[3, 'value']
+
+ Let's create a field:
+ +from frictionless import fields
+
+field = fields.IntegerField(name='name')
+print(field)
+
+
+{'name': 'name', 'type': 'integer'}
+
+ Usually we work with fields which were already created by a schema:
+ +from frictionless import describe
+
+resource = describe('table.csv')
+field = resource.schema.get_field('id')
+print(field)
+
+
+{'name': 'id', 'type': 'integer'}
+
+ Frictionless Framework supports all the Table Schema Standard field types along with an ability to create custom types.
+For some types there are additional properties available:
+ +from frictionless import describe
+
+resource = describe('table.csv')
+field = resource.schema.get_field('id') # it's an integer
+print(field.bare_number)
+
+
+True
+
+ See the complete reference at Tabular Fields.
+During the process of data reading a schema uses a field internally. If needed a user can convert their data using this interface:
+ +from frictionless import fields
+
+field = fields.IntegerField(name='name')
+cell, note = field.read_cell('3')
+print(cell)
+
+
+3
+
+ During the process of data writing a schema uses a field internally. The same as with reading a user can convert their data using this interface:
+ +from frictionless import fields
+
+field = fields.IntegerField(name='name')
+cell, note = field.write_cell(3)
+print(cell)
+
+
+3
+
 Schema representation + +This class is one of the cornerstones of the Frictionless Framework. +It allows you to work with Table Schema and its fields. + +```python +schema = Schema.from_descriptor('schema.json') +schema.add_field(fields.StringField(name='name')) +```
+(*, descriptor: Optional[Union[types.IDescriptor, str]] = None, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, fields: List[Field] = NOTHING, missing_values: List[str] = NOTHING, primary_key: List[str] = NOTHING, foreign_keys: List[Dict[str, Any]] = NOTHING) -> None
++ # TODO: add docs +
+Optional[Union[types.IDescriptor, str]]
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +
+Optional[str]
++ Type of the object +
+ClassVar[Union[str, None]]
++ A human-oriented title for the Schema. +
+Optional[str]
++ A brief description of the Schema. +
+Optional[str]
++ A List of fields in the schema. +
+List[Field]
+ List of string values to be treated as missing values in the schema fields. If a field value matches + any of these strings, it is set to None. +
+List[str]
++ Specifies primary key for the schema. +
+List[str]
++ Specifies the foreign keys for the schema. +
+List[Dict[str, Any]]
+List of field names
+List[str]
+List of field types
+List[str]
+Add new field to the schema
+(field: Field, *, position: Optional[int] = None) -> None
+Remove all the fields
+() -> None
+Describe the given source as a schema
+(source: Optional[Any] = None, **options: Any) -> Schema
+Flatten the schema + +Parameters + spec (str[]): flatten specification
+(spec: List[str] = [name, type])
+Create a Schema from JSONSchema profile
+(profile: Union[types.IDescriptor, str]) -> Schema
+Get field by name
+(name: str) -> Field
+Check if a field is present
+(name: str) -> bool
+Read a list of cells (normalize/cast)
+(cells: List[Any])
+Remove field by name
+(name: str) -> Field
+Set field by name
+(field: Field) -> Optional[Field]
+Set field type
+(name: str, type: str) -> Field
+Export schema as an excel template
+(path: str) -> None
+Summary of the schema in table format
+() -> str
+Update field
+(name: str, descriptor: types.IDescriptor) -> Field
+Write a list of cells (normalize/uncast)
+(cells: List[Any], *, types: List[str] = [])
+Field representation
+(*, name: str, title: Optional[str] = None, description: Optional[str] = None, format: str = default, missing_values: List[str] = NOTHING, constraints: Dict[str, Any] = NOTHING, rdf_type: Optional[str] = None, example: Optional[str] = None, schema: Optional[Schema] = None) -> None
++ A short url-usable (and preferably human-readable) name. + This MUST be lower-case and contain only alphanumeric characters + along with “_” or “-” characters. +
+str
++ Type of the field such as "boolean", "integer" etc. +
+ClassVar[str]
++ A human-oriented title for the Field. +
+Optional[str]
++ A brief description of the Field. +
+Optional[str]
++ Format of the field to specify different value readers for the field type. + For example: "default","array" etc. +
+str
+ List of string values to be treated as missing values in the field. If the field value matches + any of these strings, it is set to None. +
+List[str]
+ A dictionary with rules that constrain the data values permitted for a field. +
+Dict[str, Any]
++ RDF type. Indicates whether the field is of RDF type. +
+Optional[str]
++ An example of a value for the field. +
+Optional[str]
+ Schema that the field is part of. +
+Optional[Schema]
+ Specifies whether the field is a built-in feature. +
+ClassVar[bool]
++ List of supported constraints for a field. +
+ClassVar[List[str]]
+Indicates if field is mandatory.
+(bool) ->
+After opening a resource you get access to a resource.header
object which describes the resource in more detail. This is a list of normalized labels but also provides some additional functionality. Let's take a look:
from frictionless import Resource
+
+with Resource('capital-3.csv') as resource:
+ print(f'Header: {resource.header}')
+ print(f'Labels: {resource.header.labels}')
+ print(f'Fields: {resource.header.fields}')
+ print(f'Field Names: {resource.header.field_names}')
+ print(f'Field Numbers: {resource.header.field_numbers}')
+ print(f'Errors: {resource.header.errors}')
+ print(f'Valid: {resource.header.valid}')
+ print(f'As List: {resource.header.to_list()}')
+
+
+Header: ['id', 'name']
+Labels: ['id', 'name']
+Fields: [{'name': 'id', 'type': 'integer'}, {'name': 'name', 'type': 'string'}]
+Field Names: ['id', 'name']
+Field Numbers: [1, 2]
+Errors: []
+Valid: True
+As List: ['id', 'name']
+
+ The example above shows a case when a header is valid. For a header that contains errors in its tabular structure, this information can be very useful, revealing discrepancies, duplicates or missing cell information:
+ +from pprint import pprint
+from frictionless import Resource
+
+with Resource([['name', 'name'], ['value', 'value']]) as resource:
+ pprint(resource.header.errors)
+
+
+[{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the same value. Column '
+ 'names should be unique.',
+ 'message': 'Label "name" in the header at position "2" is duplicated to a '
+ 'label: at position "1"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "1"',
+ 'labels': ['name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 2}]
+
+ The extract
, resource.read_rows()
and other functions return or yield row objects. In Python, this returns a dictionary with the following information. Note: this example uses the Detector object, which tweaks how different aspects of metadata are detected.
from frictionless import Resource, Detector
+
+detector = Detector(schema_patch={'missingValues': ['1']})
+with Resource('capital-3.csv', detector=detector) as resource:
+ for row in resource.row_stream:
+ print(f'Row: {row}')
+ print(f'Cells: {row.cells}')
+ print(f'Fields: {row.fields}')
+ print(f'Field Names: {row.field_names}')
+ print(f'Value of field "name": {row["name"]}') # accessed as a dict
+ print(f'Row Number: {row.row_number}') # counted row number starting from 1
+ print(f'Blank Cells: {row.blank_cells}')
+ print(f'Error Cells: {row.error_cells}')
+ print(f'Errors: {row.errors}')
+ print(f'Valid: {row.valid}')
+ print(f'As Dict: {row.to_dict(json=False)}')
+ print(f'As List: {row.to_list(json=True)}') # JSON compatible data types
+ break
+
+
+Row: {'id': None, 'name': 'London'}
+Cells: ['1', 'London']
+Fields: [{'name': 'id', 'type': 'integer'}, {'name': 'name', 'type': 'string'}]
+Field Names: ['id', 'name']
+Value of field "name": London
+Row Number: 2
+Blank Cells: {'id': '1'}
+Error Cells: {}
+Errors: []
+Valid: True
+As Dict: {'id': None, 'name': 'London'}
+As List: [None, 'London']
+
+ As we can see, this output provides a lot of information which is especially useful when a row is not valid. Our row is valid but we demonstrated how it can preserve data about missing values. It also preserves data about all cells that contain errors:
+ +from pprint import pprint
+from frictionless import Resource
+
+with Resource([['name'], ['value', 'value']]) as resource:
+ for row in resource.row_stream:
+ pprint(row.errors)
+
+
+[{'type': 'extra-cell',
+ 'title': 'Extra Cell',
+ 'description': 'This row has more values compared to the header row (the '
+ 'first row in the data source). A key concept is that all the '
+ 'rows in tabular data must have the same number of columns.',
+ 'message': 'Row at position "2" has an extra value in field at position "2"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['value', 'value'],
+ 'rowNumber': 2,
+ 'cell': 'value',
+ 'fieldName': '',
+ 'fieldNumber': 2}]
+
+ Header representation + +> Constructor of this object is not Public API
+(labels: List[str], *, fields: List[Field], row_numbers: List[int], ignore_case: bool = False)
+Convert to a list
+Row representation
+
+> Constructor of this object is not Public API
+
+This object is returned by `extract`, `resource.read_rows`, and other functions.
+
+```python
+from frictionless import Resource
+
+rows = Resource("data/table.csv").read_rows()
+for row in rows:
+    print(row.to_dict())  # work with the Row
+```
+(cells: List[Any], *, field_info: Dict[str, Any], row_number: int)
+(*, csv: bool = False, json: bool = False, types: Optional[List[str]] = None) -> Dict[str, Any]
+(*, json: bool = False, types: Optional[List[str]] = None)
+(**options: Any)
+Let's get started with Frictionless! We will learn how to install and use the framework. The simple example below will showcase the framework's basic functionality.
+++ +The framework requires Python 3.8+. Versioning follows the SemVer Standard.
+
pip install frictionless
+pip install frictionless[sql] # to install a core plugin (optional)
+pip install 'frictionless[sql]' # for zsh shell
+
+
+ The framework supports CSV, Excel, and JSON formats by default. The second command above installs a plugin for SQL support. There are plugins for SQL, Pandas, HTML, and others (supported formats are listed in the "File Formats" menu and supported schemes in the "File Schemes" menu). Usually, you don't need to think about it in advance: Frictionless will display a helpful error message about a missing plugin, with installation instructions.
+Did you have an error installing Frictionless? Here are some dependencies and common errors:
+pip: command not found. Please see the pip docs for help installing pip.
+Still having a problem? Ask us for help on our Discord chat or open an issue. We're happy to help!
+The framework can be used:
+For instance, both examples below do the same thing:
+ +frictionless extract data/table.csv
+
+
+ from frictionless import extract
+
+rows = extract('data/table.csv')
+
+
+ The interfaces are as much alike as possible regarding naming conventions and the way you interact with them. Usually, it's straightforward to translate, for instance, Python code to a command-line call. Frictionless provides code completion for Python and the command-line, which should help to get useful hints in real time.
+Arguments conform to the following naming convention:
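+for the Python interface they are snake_cased, e.g. missing_values
+within dictionaries or JSON objects they are camelCased, e.g. missingValues
+for the command-line interface they are kebab-cased, e.g. --missing-values
+To get the documentation for a command-line interface just use the --help flag: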
frictionless --help
+frictionless describe --help
+frictionless extract --help
+frictionless validate --help
+frictionless transform --help
+
+
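The naming convention described above is mechanical, so one spelling can be derived from another. The sketch below only illustrates that mapping with two hypothetical helpers (to_camel_case and to_cli_flag); they are not part of the Frictionless API:

```python
def to_camel_case(name):
    """Convert a snake_cased Python argument name to its camelCased form."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


def to_cli_flag(name):
    """Convert a snake_cased Python argument name to a kebab-cased CLI flag."""
    return "--" + name.replace("_", "-")


print(to_camel_case("missing_values"))  # missingValues
print(to_cli_flag("missing_values"))    # --missing-values
```

Knowing the mapping makes it easier to move between a Python call, a descriptor file, and a command-line call without looking up each option separately.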
+ ++Download invalid.csv to reproduce the examples (right-click and "Save link as"). For more examples, please take a look at the Basic Examples article.
We will take a very messy data file:
+ +cat invalid.csv
+
+
+id,name,,name
+1,english
+1,english
+
+2,german,1,2,3
+
+ with open('invalid.csv') as file:
+ print(file.read())
+
+
+id,name,,name
+1,english
+1,english
+
+2,german,1,2,3
+
+ First of all, let's use describe to infer the metadata directly from the tabular data. We can then edit and save it to provide others with useful information about the data:
+++ +The CLI output is in YAML, which is the default Frictionless output format.
+
frictionless describe invalid.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━┩
+│ invalid │ table │ invalid.csv │
+└─────────┴───────┴─────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ invalid
+┏━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
+┃ id ┃ name ┃ field3 ┃ name2 ┃
+┡━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
+│ integer │ string │ integer │ integer │
+└─────────┴────────┴─────────┴─────────┘
+
+ from pprint import pprint
+from frictionless import describe
+
+resource = describe('invalid.csv')
+pprint(resource)
+
+
+{'name': 'invalid',
+ 'type': 'table',
+ 'path': 'invalid.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'encoding': 'utf-8',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'field3', 'type': 'integer'},
+ {'name': 'name2', 'type': 'integer'}]}}
+
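The field names field3 and name2 in the output above show that blank and duplicated header labels end up as unique names. A minimal sketch of one way such names could be derived follows; derive_field_names is a hypothetical helper written for illustration, and the framework's actual logic may differ (for instance, this sketch does not guard against a generated name colliding with a later label):

```python
def derive_field_names(labels):
    """Return unique, non-empty field names for a list of header labels."""
    names = []
    seen = {}
    for position, label in enumerate(labels, start=1):
        # A blank label falls back to a positional name such as "field3".
        name = label.strip() or f"field{position}"
        seen[name] = seen.get(name, 0) + 1
        # A repeated label gets its occurrence count appended, e.g. "name2".
        if seen[name] > 1:
            name = f"{name}{seen[name]}"
        names.append(name)
    return names


print(derive_field_names(["id", "name", "", "name"]))
# -> ['id', 'name', 'field3', 'name2']
```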
+ Now that we have inferred a table schema from the data file (e.g., expected format of the table, expected type of each value in a column, etc.), we can use extract to read the normalized tabular data from the source CSV file:
frictionless extract invalid.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━┩
+│ invalid │ table │ invalid.csv │
+└─────────┴───────┴─────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ invalid
+┏━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
+┃ id ┃ name ┃ field3 ┃ name2 ┃
+┡━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
+│ 1 │ english │ None │ None │
+│ 1 │ english │ None │ None │
+│ None │ None │ None │ None │
+│ 2 │ german │ 1 │ 2 │
+└──────┴─────────┴────────┴───────┘
+
+ from pprint import pprint
+from frictionless import extract
+
+rows = extract('invalid.csv')
+pprint(rows)
+
+
+{'invalid': [{'field3': None, 'id': 1, 'name': 'english', 'name2': None},
+ {'field3': None, 'id': 1, 'name': 'english', 'name2': None},
+ {'field3': None, 'id': None, 'name': None, 'name2': None},
+ {'field3': 1, 'id': 2, 'name': 'german', 'name2': 2}]}
+
+ Last but not least, let's get a validation report. This report will help us to identify and fix all the errors present in the tabular data, as comprehensive information is provided for every problem:
+ +frictionless validate invalid.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃ status ┃
+┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ invalid │ table │ invalid.csv │ INVALID │
+└─────────┴───────┴─────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3 │ blank-label │ Label in the header in field at position │
+│ │ │ │ "3" is blank │
+│ None │ 4 │ duplicate-label │ Label "name" in the header at position "4" │
+│ │ │ │ is duplicated to a label: at position "2" │
+│ 2 │ 3 │ missing-cell │ Row at position "2" has a missing cell in │
+│ │ │ │ field "field3" at position "3" │
+│ 2 │ 4 │ missing-cell │ Row at position "2" has a missing cell in │
+│ │ │ │ field "name2" at position "4" │
+│ 3 │ 3 │ missing-cell │ Row at position "3" has a missing cell in │
+│ │ │ │ field "field3" at position "3" │
+│ 3 │ 4 │ missing-cell │ Row at position "3" has a missing cell in │
+│ │ │ │ field "name2" at position "4" │
+│ 4 │ None │ blank-row │ Row at position "4" is completely blank │
+│ 5 │ 5 │ extra-cell │ Row at position "5" has an extra value in │
+│ │ │ │ field at position "5" │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+
+ from pprint import pprint
+from frictionless import validate
+
+report = validate('invalid.csv')
+pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
+
+
+[[None, 3, 'blank-label'],
+ [None, 4, 'duplicate-label'],
+ [2, 3, 'missing-cell'],
+ [2, 4, 'missing-cell'],
+ [3, 3, 'missing-cell'],
+ [3, 4, 'missing-cell'],
+ [4, None, 'blank-row'],
+ [5, 5, 'extra-cell']]
+
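To make the error types in the report above more concrete, here is a small standard-library sketch that applies the same structural checks to the contents of invalid.csv. It is an illustration under simplifying assumptions, not the framework's validation code; check_table and its return format are hypothetical:

```python
import csv
import io

TEXT = "id,name,,name\n1,english\n1,english\n\n2,german,1,2,3\n"


def check_table(text):
    """Return [row_number, field_number, error_type] triples for a CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    errors = []
    seen = set()
    for number, label in enumerate(header, start=1):
        if not label:
            errors.append([None, number, "blank-label"])
        elif label in seen:
            errors.append([None, number, "duplicate-label"])
        seen.add(label)
    # The header is row 1, so data rows are numbered from 2.
    for row_number, cells in enumerate(body, start=2):
        if not any(cells):
            errors.append([row_number, None, "blank-row"])
            continue
        # Header positions with no cell in this row.
        for number in range(len(cells) + 1, len(header) + 1):
            errors.append([row_number, number, "missing-cell"])
        # Cells beyond the last header position.
        for number in range(len(header) + 1, len(cells) + 1):
            errors.append([row_number, number, "extra-cell"])
    return errors


for error in check_table(TEXT):
    print(error)
```

For this input the sketch yields the same eight triples as the flattened report, which shows that every reported problem is a structural one: a label or a cell is blank, duplicated, missing, or extra.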
+ Now that we have all this information:
+++This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start. Also, this guide is meant to be read in order from top to bottom, and reuses examples throughout the text. You can use the menu to skip sections, but please note that you might need to run code from earlier sections to make all the examples work.
+
+In Frictionless terms, "Describing data" means creating metadata for your data files. Having metadata is important because data files by themselves usually do not provide enough information to fully understand the data. For example, if you have a data table in CSV format without metadata, you are missing a few critical pieces of information:
+the meaning of the fields, e.g. what the size field means (does that field mean geographic size? Or does it refer to the size of the file?)
+For a dataset, there is even more information that can be provided, like the general purpose of a dataset, information about data sources, list of authors, and more. Also, when there are many tabular files, relational rules can be very important. Usually, there are foreign keys ensuring the integrity of the dataset; for example, think of a reference table containing country names and other data tables using it as a reference. Data in this form is called "normalized data" and it occurs very often in scientific and other kinds of research.
+Now that we have a general understanding of what "describing data" is, we can discuss why it is important:
+These are not the only benefits of having metadata, but they are two of the most important. Please continue reading to learn how Frictionless helps to achieve these advantages by describing your data. This guide will discuss the main describe functions (describe, Schema.describe, Resource.describe, Package.describe) and will then go into more detail about how to create and edit metadata in Frictionless.
For the following examples, you will need to have Frictionless installed. See our Quick Start Guide if you need help.
+ +pip install frictionless
+
+
+ The describe functions are the main Frictionless tool for describing data. In many cases, this high-level interface is enough for data exploration and other needs.
+The Frictionless framework provides four different describe functions in Python:
+describe: detects the source type and returns Data Resource or Data Package metadata
+Schema.describe: always returns Table Schema metadata
+Resource.describe: always returns Data Resource metadata
+Package.describe: always returns Data Package metadata
+As described in more detail in the Introduction, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema.
+On the command line, there is only one command (describe), but there is also a flag to adjust the behavior:
frictionless describe your-table.csv
+frictionless describe your-table.csv --type schema
+frictionless describe your-table.csv --type resource
+frictionless describe your-table.csv --type package
+
+
+ Please take into account that file names might be used by Frictionless to detect a metadata type for data extraction or validation. It's recommended to use corresponding suffixes when you save your metadata to disk. For example, you might name your Table Schema table.schema.yaml, your Data Resource table.resource.yaml, and your Data Package table.package.yaml. If there is no hint in the file name, Frictionless will assume that it's a resource descriptor by default.
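The suffix convention can be pictured with a few lines of Python. This is only a sketch of the idea, assuming the hint is a dotted suffix placed before the file extension; guess_metadata_type is a hypothetical helper and does not reproduce the framework's own detection logic:

```python
from pathlib import Path


def guess_metadata_type(path):
    """Guess a descriptor type from a file name such as 'table.schema.yaml'."""
    suffixes = Path(path).suffixes  # e.g. ['.schema', '.yaml']
    for kind in ("schema", "resource", "package"):
        if f".{kind}" in suffixes:
            return kind
    # No hint in the file name: fall back to a resource descriptor.
    return "resource"


print(guess_metadata_type("table.schema.yaml"))   # schema
print(guess_metadata_type("table.package.yaml"))  # package
print(guess_metadata_type("table.yaml"))          # resource
```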
For example, if we want a Data Package descriptor for a single file:
+++ +Download table.csv to reproduce the examples (right-click and "Save link as").
frictionless describe table.csv --type package
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━╇━━━━━━━╇━━━━━━━━━━━┩
+│ table │ table │ table.csv │
+└───────┴───────┴───────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ table
+┏━━━━━━━━━┳━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━━━━━━╇━━━━━━━━┩
+│ integer │ string │
+└─────────┴────────┘
+
+ from frictionless import describe
+
+package = describe("table.csv", type="package")
+print(package.to_yaml())
+
+
+resources:
+ - name: table
+ type: table
+ path: table.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+ encoding: utf-8
+ schema:
+ fields:
+ - name: id
+ type: integer
+ - name: name
+ type: string
+
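The integer and string types in the descriptor above are inferred from the cell values. A rough sketch of how such inference can work is shown below, assuming only three candidate types and treating empty cells as missing; infer_type is a hypothetical helper and is much simpler than what the framework does:

```python
def infer_type(values):
    """Guess a Table Schema type name for a column of raw string cells."""
    present = [value for value in values if value != ""]
    if not present:
        return "any"
    # Try the narrowest type first and fall back to wider ones.
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for value in present:
                cast(value)
        except ValueError:
            continue
        return type_name
    return "string"


print(infer_type(["1", "2", ""]))       # integer
print(infer_type(["1.5", "2"]))         # number
print(infer_type(["London", "Paris"]))  # string
```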
+ Table Schema is a specification for providing a "schema" (similar to a database schema) for tabular data. This information includes the expected data type for each value in a column ("string", "number", "date", etc.), constraints on the value ("this string can only be at most 10 characters long"), and the expected format of the data ("this field should only contain strings that look like email addresses"). Table Schema can also specify relations between data tables.
+We're going to use this file for the examples in this section. For this guide, we only use CSV files because they are easy to show inline, but in general Frictionless can handle data in Excel, JSON, SQL, and many other formats:
+++ +Download country-1.csv to reproduce the examples (right-click and "Save link as").
cat country-1.csv
+
+
+id,neighbor_id,name,population
+1,,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+
+ with open('country-1.csv') as file:
+ print(file.read())
+
+
+id,neighbor_id,name,population
+1,,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+
+ Let's get a Table Schema using the Frictionless framework (note: this example uses YAML for the schema format, but Frictionless also supports JSON format):
+ +from frictionless import Schema
+
+schema = Schema.describe("country-1.csv")
+schema.to_yaml("country.schema.yaml") # use schema.to_json for JSON
+
+
+ The high-level functions of Frictionless operate on the dataset and resource levels so we have to use a little bit of Python programming to get the schema information. Below we will show how to use a command-line interface for similar tasks.
+ +cat country.schema.yaml
+
+
+fields:
+ - name: id
+ type: integer
+ - name: neighbor_id
+ type: integer
+ - name: name
+ type: string
+ - name: population
+ type: integer
+
+ with open('country.schema.yaml') as file:
+ print(file.read())
+
+
+fields:
+ - name: id
+ type: integer
+ - name: neighbor_id
+ type: integer
+ - name: name
+ type: string
+ - name: population
+ type: integer
+
+ As we can see, we were able to infer basic metadata from our data file. But describing data doesn't end here - we can provide additional information that we discussed earlier:
+++ +You can edit "country.schema.yaml" manually instead of running Python
+
from frictionless import Schema
+
+schema = Schema.describe("country-1.csv")
+schema.get_field("id").title = "Identifier"
+schema.get_field("neighbor_id").title = "Identifier of the neighbor"
+schema.get_field("name").title = "Name of the country"
+schema.get_field("population").title = "Population"
+schema.get_field("population").description = "According to the year 2020's data"
+schema.get_field("population").constraints["minimum"] = 0
+schema.foreign_keys.append(
+ {"fields": ["neighbor_id"], "reference": {"resource": "", "fields": ["id"]}}
+)
+schema.to_yaml("country.schema-full.yaml")
+
+
+ Let's break it down:
+cat country.schema-full.yaml
+
+
+fields:
+ - name: id
+ type: integer
+ title: Identifier
+ - name: neighbor_id
+ type: integer
+ title: Identifier of the neighbor
+ - name: name
+ type: string
+ title: Name of the country
+ - name: population
+ type: integer
+ title: Population
+ description: According to the year 2020's data
+ constraints:
+ minimum: 0
+foreignKeys:
+ - fields:
+ - neighbor_id
+ reference:
+ resource: ''
+ fields:
+ - id
+
+ with open('country.schema-full.yaml') as file:
+ print(file.read())
+
+
+fields:
+ - name: id
+ type: integer
+ title: Identifier
+ - name: neighbor_id
+ type: integer
+ title: Identifier of the neighbor
+ - name: name
+ type: string
+ title: Name of the country
+ - name: population
+ type: integer
+ title: Population
+ description: According to the year 2020's data
+ constraints:
+ minimum: 0
+foreignKeys:
+ - fields:
+ - neighbor_id
+ reference:
+ resource: ''
+ fields:
+ - id
+
+ Later we're going to show how to use the schema we created to ensure the validity of your data; in the next few sections, we will focus on Data Resource and Data Package metadata.
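To give a feel for what the added constraint and foreign key express, the following standard-library sketch applies the same two rules to the country-1.csv contents by hand. It assumes the rules are exactly "population is at least 0" and "neighbor_id, when present, must match some id"; check_rows is a hypothetical helper, not the framework's validator:

```python
import csv
import io

DATA = """id,neighbor_id,name,population
1,,Britain,67
2,3,France,67
3,2,Germany,83
4,5,Italy,60
5,4,Spain,47
"""


def check_rows(text, minimum=0):
    """Return (id, problem) pairs for rows that break the two schema rules."""
    rows = list(csv.DictReader(io.StringIO(text)))
    known_ids = {row["id"] for row in rows}
    problems = []
    for row in rows:
        # Constraint: population must not be below the minimum.
        if int(row["population"]) < minimum:
            problems.append((row["id"], "population below minimum"))
        # Self-referencing foreign key: neighbor_id must point at an existing id.
        if row["neighbor_id"] and row["neighbor_id"] not in known_ids:
            problems.append((row["id"], "unknown neighbor_id"))
    return problems


print(check_rows(DATA))  # -> []
```

An empty neighbor_id (as for Britain) is treated as a missing value here and therefore does not count as a broken reference.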
+To continue learning about table schemas please read:
+ +The Data Resource format describes a data resource such as an individual file or data table. The essence of a Data Resource is a path to the data file it describes. A range of other properties can be declared to provide a richer set of metadata, including Table Schema for tabular data.
+For this section, we will use a file that is slightly more complex to handle. In this example, cells are separated by the ";" character and there is a comment at the top:
+++ +Download country-2.csv to reproduce the examples (right-click and "Save link as").
cat country-2.csv
+
+
+# Author: the scientist
+id;neighbor_id;name;population
+1;;Britain;67
+2;3;France;67
+3;2;Germany;83
+4;5;Italy;60
+5;4;Spain;47
+
+ with open('country-2.csv') as file:
+ print(file.read())
+
+
+# Author: the scientist
+id;neighbor_id;name;population
+1;;Britain;67
+2;3;France;67
+3;2;Germany;83
+4;5;Italy;60
+5;4;Spain;47
+
+ Let's describe it:
+ +frictionless describe country-2.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ country-2 │ table │ country-2.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ country-2
+┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ # Author: the scientist ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ string │
+└─────────────────────────┘
+
+ from frictionless import describe
+
+resource = describe('country-2.csv')
+print(resource.to_yaml())
+
+
+name: country-2
+type: table
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+schema:
+ fields:
+ - name: '# Author: the scientist'
+ type: string
+
+ OK, that looks wrong: the schema has only one inferred field, and even that field does not look correct. As we have seen in the "Introductory Guide", Frictionless can infer metadata for some complicated cases, but this data table is too complex for automatic inference. We need to describe it manually:
+++ +You can edit "country.resource.yaml" manually instead of running Python
+
from frictionless import describe
+
+resource = describe("country-2.csv")
+resource.dialect.header_rows = [2]
+resource.dialect.get_control('csv').delimiter = ";"
+resource.schema = "country.schema.yaml"
+resource.to_yaml("country.resource-cleaned.yaml")
+
+
+ So what we did here:
+cat country.resource-cleaned.yaml
+
+
+name: country-2
+type: table
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+ headerRows:
+ - 2
+ csv:
+ delimiter: ;
+schema: country.schema.yaml
+
+ with open('country.resource-cleaned.yaml') as file:
+ print(file.read())
+
+
+name: country-2
+type: table
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+ headerRows:
+ - 2
+ csv:
+ delimiter: ;
+schema: country.schema.yaml
+
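The two dialect settings in the descriptor above (the header on row 2 and ";" as the delimiter) are all a reader needs in order to parse this file. The sketch below shows the same idea with Python's csv module; read_with_dialect is a hypothetical helper, and its parameters merely mirror the headerRows and delimiter values from the descriptor:

```python
import csv

TEXT = """# Author: the scientist
id;neighbor_id;name;population
1;;Britain;67
2;3;France;67
3;2;Germany;83
4;5;Italy;60
5;4;Spain;47
"""


def read_with_dialect(text, header_row=2, delimiter=";"):
    """Parse text as a table whose header sits on the given 1-based row."""
    lines = text.splitlines()
    # Drop everything above the header row (here: the comment line).
    table_lines = lines[header_row - 1:]
    return list(csv.DictReader(table_lines, delimiter=delimiter))


rows = read_with_dialect(TEXT)
print(rows[0])
# -> {'id': '1', 'neighbor_id': '', 'name': 'Britain', 'population': '67'}
```

Note that the csv module returns every cell as a string; turning "67" into an integer is the job of the schema, not of the dialect.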
+ Our resource metadata includes the schema metadata we created earlier, but it also has:
+But the most important difference is that the resource metadata contains the path property. This is a conceptual distinction of the Data Resource specification compared to the Table Schema specification. While a Table Schema descriptor can describe a class of data files, a Data Resource descriptor describes only one exact data file, country-2.csv in our case.
Using programming terminology we could say that:
+We will show the practical difference in the "Using Metadata" section, but in the next section, we will overview the Data Package specification.
+To continue learning about data resources please read:
+ +A Data Package consists of:
+The Data Package metadata is stored in a "descriptor". This descriptor is what makes a collection of data a Data Package. The structure of this descriptor is the main content of the specification below.
+In addition to this descriptor, a data package will include other resources such as data files. The Data Package specification does NOT impose any requirements on their form or structure and can, therefore, be used for packaging any kind of data.
+The data included in the package may be provided as:
+For this section, we will use the following files:
+++ +Download country-3.csv to reproduce the examples (right-click and "Save link as").
cat country-3.csv
+
+
+id,capital_id,name,population
+1,1,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+
+ with open('country-3.csv') as file:
+ print(file.read())
+
+
+id,capital_id,name,population
+1,1,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+
+ ++ +Download capital-3.csv to reproduce the examples (right-click and "Save link as").
cat capital-3.csv
+
+
+id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+
+ with open('capital-3.csv') as file:
+ print(file.read())
+
+
+id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+
+ First of all, let's describe our package now. We did it before for a resource but now we're going to use a glob pattern to indicate that there are multiple files:
+ +frictionless describe *-3.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ capital-3 │ table │ capital-3.csv │
+│ country-3 │ table │ country-3.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ capital-3
+┏━━━━━━━━━┳━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━━━━━━╇━━━━━━━━┩
+│ integer │ string │
+└─────────┴────────┘
+ country-3
+┏━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ capital_id ┃ name ┃ population ┃
+┡━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
+│ integer │ integer │ string │ integer │
+└─────────┴────────────┴────────┴────────────┘
+
+ from frictionless import describe
+
+package = describe("*-3.csv")
+print(package.to_yaml())
+
+
+resources:
+ - name: capital-3
+ type: table
+ path: capital-3.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+ encoding: utf-8
+ schema:
+ fields:
+ - name: id
+ type: integer
+ - name: name
+ type: string
+ - name: country-3
+ type: table
+ path: country-3.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+ encoding: utf-8
+ schema:
+ fields:
+ - name: id
+ type: integer
+ - name: capital_id
+ type: integer
+ - name: name
+ type: string
+ - name: population
+ type: integer
+
+ We have already learned about many concepts that are reflected in this metadata. We can see resources, schemas, fields, and other familiar entities. The difference is that this descriptor has information about multiple files, which reflects a popular way of sharing data: as datasets. Very often you have not only one data file but also additional data files, textual documents (e.g. PDF), and others. To package all of these files with the corresponding metadata we use data packages.
+Following the pattern that is already familiar to the guide reader, we add some additional metadata:
+++ +You can edit "country.package.yaml" manually instead of running Python
+
from frictionless import describe
+
+package = describe("*-3.csv")
+package.title = "Countries and their capitals"
+package.description = "The data was collected as a research project"
+package.get_resource("country-3").name = "country"
+package.get_resource("capital-3").name = "capital"
+package.get_resource("country").schema.foreign_keys.append(
+ {"fields": ["capital_id"], "reference": {"resource": "capital", "fields": ["id"]}}
+)
+package.to_yaml("country.package.yaml")
+
+
+ In this case, we add a relation between different files connecting id and capital_id. Also, we provide dataset-level metadata to explain the purpose of this dataset. We haven't added individual fields' titles and descriptions, but that can be done as it was shown in the "Table Schema" section.
cat country.package.yaml
+
+
+title: Countries and their capitals
+description: The data was collected as a research project
+resources:
+ - name: capital
+ type: table
+ path: capital-3.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+ encoding: utf-8
+ schema:
+ fields:
+ - name: id
+ type: integer
+ - name: name
+ type: string
+ - name: country
+ type: table
+ path: country-3.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+ encoding: utf-8
+ schema:
+ fields:
+ - name: id
+ type: integer
+ - name: capital_id
+ type: integer
+ - name: name
+ type: string
+ - name: population
+ type: integer
+ foreignKeys:
+ - fields:
+ - capital_id
+ reference:
+ resource: capital
+ fields:
+ - id
+
+ with open('country.package.yaml') as file:
+ print(file.read())
+
+
+title: Countries and their capitals
+description: The data was collected as a research project
+resources:
+ - name: capital
+ type: table
+ path: capital-3.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+ encoding: utf-8
+ schema:
+ fields:
+ - name: id
+ type: integer
+ - name: name
+ type: string
+ - name: country
+ type: table
+ path: country-3.csv
+ scheme: file
+ format: csv
+ mediatype: text/csv
+ encoding: utf-8
+ schema:
+ fields:
+ - name: id
+ type: integer
+ - name: capital_id
+ type: integer
+ - name: name
+ type: string
+ - name: population
+ type: integer
+ foreignKeys:
+ - fields:
+ - capital_id
+ reference:
+ resource: capital
+ fields:
+ - id
+
+ The main role of the Data Package descriptor is describing a dataset; as we can see, it includes previously shown descriptors like schema, dialect, and resource. But it would be a mistake to think that Data Package is the least important specification; actually, it completes the Frictionless Data suite, making it possible to share and validate not only individual files but also complete datasets.
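The foreign key declared in the package descriptor links two separate files, which is something a single-file schema cannot express. A minimal standard-library sketch of what that relation means for the country-3.csv and capital-3.csv contents is given below; find_broken_references is a hypothetical helper and not the way the framework checks foreign keys:

```python
import csv
import io

COUNTRIES = """id,capital_id,name,population
1,1,Britain,67
2,3,France,67
3,2,Germany,83
4,5,Italy,60
5,4,Spain,47
"""

CAPITALS = """id,name
1,London
2,Berlin
3,Paris
4,Madrid
5,Rome
"""


def find_broken_references(child_text, parent_text, child_field, parent_field):
    """Return child rows whose key has no matching row in the parent table."""
    parent_rows = csv.DictReader(io.StringIO(parent_text))
    parent_keys = {row[parent_field] for row in parent_rows}
    child_rows = csv.DictReader(io.StringIO(child_text))
    return [row for row in child_rows if row[child_field] not in parent_keys]


print(find_broken_references(COUNTRIES, CAPITALS, "capital_id", "id"))  # -> []
```

Every capital_id in the country table has a matching id in the capital table, so the dataset is consistent as a whole, not just file by file.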
To continue learning about data packages please read:
+ +This documentation contains a great deal of information on how to use metadata and why it's vital for your data. In this section, we're going to provide a quick example based on the "Data Resource" section but please read other documents to get the full picture.
+Let's get back to this complex data table:
+ +cat country-2.csv
+
+
+# Author: the scientist
+id;neighbor_id;name;population
+1;;Britain;67
+2;3;France;67
+3;2;Germany;83
+4;5;Italy;60
+5;4;Spain;47
+
+ with open('country-2.csv') as file:
+ print(file.read())
+
+
+# Author: the scientist
+id;neighbor_id;name;population
+1;;Britain;67
+2;3;France;67
+3;2;Germany;83
+4;5;Italy;60
+5;4;Spain;47
+
+ As we tried before, by default Frictionless can't properly describe this file so we got something like:
+ +frictionless describe country-2.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ country-2 │ table │ country-2.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ country-2
+┏━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ # Author: the scientist ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ string │
+└─────────────────────────┘
+
+ from frictionless import describe
+
+resource = describe("country-2.csv")
+print(resource.to_yaml())
+
+
+name: country-2
+type: table
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+schema:
+ fields:
+ - name: '# Author: the scientist'
+ type: string
+
+ Trying to extract the data will fail this way:
+ +frictionless extract country-2.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ country-2 │ table │ country-2.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ country-2
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ # Author: the scientist ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ id;neighbor_id;name;population │
+│ 1;;Britain;67 │
+│ 2;3;France;67 │
+│ 3;2;Germany;83 │
+│ 4;5;Italy;60 │
+│ 5;4;Spain;47 │
+└────────────────────────────────┘
+
+ from pprint import pprint
+from frictionless import extract
+
+rows = extract("country-2.csv")
+pprint(rows)
+
+
+{'country-2': [{'# Author: the scientist': 'id;neighbor_id;name;population'},
+ {'# Author: the scientist': '1;;Britain;67'},
+ {'# Author: the scientist': '2;3;France;67'},
+ {'# Author: the scientist': '3;2;Germany;83'},
+ {'# Author: the scientist': '4;5;Italy;60'},
+ {'# Author: the scientist': '5;4;Spain;47'}]}
+
+ This example highlights a really important idea: without metadata, many tools will not even be able to read this data file correctly. Furthermore, without metadata people cannot understand the purpose of this data. To see how we can use metadata to fix our data, let's now use the country.resource-cleaned.yaml file we created in the "Data Resource" section with Frictionless extract:
frictionless extract country.resource-cleaned.yaml
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ country-2 │ table │ country-2.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ country-2
+┏━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ neighbor_id ┃ name ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1 │ None │ Britain │ 67 │
+│ 2 │ 3 │ France │ 67 │
+│ 3 │ 2 │ Germany │ 83 │
+│ 4 │ 5 │ Italy │ 60 │
+│ 5 │ 4 │ Spain │ 47 │
+└────┴─────────────┴─────────┴────────────┘
+
+ from pprint import pprint
+from frictionless import extract
+
+rows = extract("country.resource-cleaned.yaml")
+pprint(rows)
+
+
+{'country-2': [{'id': 1,
+ 'name': 'Britain',
+ 'neighbor_id': None,
+ 'population': 67},
+ {'id': 2, 'name': 'France', 'neighbor_id': 3, 'population': 67},
+ {'id': 3, 'name': 'Germany', 'neighbor_id': 2, 'population': 83},
+ {'id': 4, 'name': 'Italy', 'neighbor_id': 5, 'population': 60},
+ {'id': 5, 'name': 'Spain', 'neighbor_id': 4, 'population': 47}]}
+
+ As we can see, the data is now fixed. The metadata we had saved the day! If we explore this data in Python, we can see that the data types were also corrected, e.g. id is a Python integer, not a string. We can now export and share this data without any worries.
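The type correction mentioned above comes from the schema: each raw cell is cast according to the type of its field. A simplified sketch of that step follows, assuming only integer, number, and string fields and treating an empty cell as a missing value; CASTERS and cast_row are hypothetical names used for illustration:

```python
CASTERS = {"integer": int, "number": float, "string": str}

FIELDS = [
    {"name": "id", "type": "integer"},
    {"name": "neighbor_id", "type": "integer"},
    {"name": "name", "type": "string"},
    {"name": "population", "type": "integer"},
]


def cast_row(raw_row, fields):
    """Convert a row of raw strings into typed values using schema fields."""
    typed = {}
    for field in fields:
        value = raw_row[field["name"]]
        # An empty cell is a missing value and becomes None.
        typed[field["name"]] = None if value == "" else CASTERS[field["type"]](value)
    return typed


raw = {"id": "1", "neighbor_id": "", "name": "Britain", "population": "67"}
print(cast_row(raw, FIELDS))
# -> {'id': 1, 'neighbor_id': None, 'name': 'Britain', 'population': 67}
```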
++Many of the Frictionless Framework's classes are metadata classes, such as Schema, Resource, or Package. All the sections below apply to all of these classes. You can read about the base Metadata class in more detail in the API Reference.
+
Many Frictionless functions infer metadata under the hood such as describe
, extract
, and many more. At a lower level, it's possible to control this process. To see this, let's create a Resource
.
from frictionless import Resource
+
+resource = Resource("country-1.csv")
+print(resource)
+
+
+{'name': 'country-1',
+ 'type': 'table',
+ 'path': 'country-1.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
+{'path': 'country-1.csv'}
+
+Frictionless always tries to be as explicit as possible. We didn't provide any metadata except for path
so we got the expected result. But now, we'd like to infer
additional metadata:
++ +We can ask for stats using CLI with
+frictionless describe data/table.csv --stats
. Note that we use the stats
argument for the resource.infer
function.
frictionless describe country-1.csv --stats --json
+
+
+{
+ "name": "country-1",
+ "type": "table",
+ "path": "country-1.csv",
+ "scheme": "file",
+ "format": "csv",
+ "mediatype": "text/csv",
+ "encoding": "utf-8",
+ "hash": "sha256:7cf6ce03c75461e1d9862b89250dbacf43e97976d1f25c056173971dfb203671",
+ "bytes": 100,
+ "fields": 4,
+ "rows": 5,
+ "schema": {
+ "fields": [
+ {
+ "name": "id",
+ "type": "integer"
+ },
+ {
+ "name": "neighbor_id",
+ "type": "integer"
+ },
+ {
+ "name": "name",
+ "type": "string"
+ },
+ {
+ "name": "population",
+ "type": "integer"
+ }
+ ]
+ }
+}
+
+ from pprint import pprint
+from frictionless import Resource
+
+resource = Resource("country-1.csv")
+resource.infer(stats=True)
+pprint(resource)
+
+
+{'name': 'country-1',
+ 'type': 'table',
+ 'path': 'country-1.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'encoding': 'utf-8',
+ 'hash': 'sha256:7cf6ce03c75461e1d9862b89250dbacf43e97976d1f25c056173971dfb203671',
+ 'bytes': 100,
+ 'fields': 4,
+ 'rows': 5,
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'neighbor_id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}}
+
+ The result is really familiar to us already. We have seen it a lot as an output of the describe
function or command. Basically, that's what this high-level function does under the hood: create a resource and then infer additional metadata.
All the main Metadata
classes have this method with different available options but with the same conceptual purpose:
package.infer
resource.infer
For more advanced detection options, please read the Detector Guide
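To give a feel for what inferring a schema involves, here is a deliberately simplified sketch in plain Python. It does not use the Frictionless API: `infer_field_type` is a hypothetical helper, the sample rows are made up to resemble the country data, and the real detector supports many more types and options.

```python
# Simplified, hypothetical inference; not the actual Frictionless detector.

def infer_field_type(cells):
    """Return "integer" if every non-empty cell parses as int, else "string"."""
    values = [cell for cell in cells if cell != ""]
    try:
        for value in values:
            int(value)
    except ValueError:
        return "string"
    return "integer"


# Made-up sample resembling the country data (an empty cell is a missing value)
header = ["id", "neighbor_id", "name", "population"]
rows = [
    ["1", "", "Britain", "67"],
    ["2", "3", "France", "67"],
    ["3", "2", "Germany", "83"],
]

# Transpose rows into columns and infer a type for each field
columns = list(zip(*rows))
schema = {
    "fields": [
        {"name": name, "type": infer_field_type(column)}
        for name, column in zip(header, columns)
    ]
}
print(schema)
```

The resulting dictionary has the same shape as the `schema` property in the inferred metadata shown earlier.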
+Metadata validity is an important topic, and we recommend validating your metadata before publishing. For example, let's first make it invalid:
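+ +from frictionless import FrictionlessException, Resource
+
+# Build a descriptor with an invalid title (it must be a string)
+descriptor = {}
+descriptor['path'] = 'country-1.csv'
+descriptor['title'] = 1
+try:
+    Resource(descriptor)
+except FrictionlessException as exception:
+    # The top-level error and the list of underlying reasons
+    print(exception.error)
+    print(exception.reasons)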
+
+
+{'type': 'resource-error',
+ 'title': 'Resource Error',
+ 'description': 'A validation cannot be processed.',
+ 'message': 'The data resource has an error: descriptor is not valid',
+ 'tags': [],
+ 'note': 'descriptor is not valid'}
+[{'type': 'resource-error',
+ 'title': 'Resource Error',
+ 'description': 'A validation cannot be processed.',
+ 'message': "The data resource has an error: 'name' is a required property",
+ 'tags': [],
+ 'note': "'name' is a required property"}, {'type': 'resource-error',
+ 'title': 'Resource Error',
+ 'description': 'A validation cannot be processed.',
+ 'message': "The data resource has an error: 1 is not of type 'string' at "
+ "property 'title'",
+ 'tags': [],
+ 'note': "1 is not of type 'string' at property 'title'"}]
+
+ False
+[{'code': 'resource-error', 'name': 'Resource Error', 'tags': [], 'note': '"1 is not of type \'string\'" at "title" in metadata and at "properties/title/type" in profile', 'message': 'The data resource has an error: "1 is not of type \'string\'" at "title" in metadata and at "properties/title/type" in profile', 'description': 'A validation cannot be processed.'}]
+
+We see this error '"1 is not of type \'string\'" at "title" in metadata and at "properties/title/type" in profile'
as we set title
to be an integer.
Frictionless' high-level functions like validate
run all metadata checks by default.
We have seen this before but let's re-iterate; it's possible to transform core metadata properties using Python's interface:
+ +from frictionless import Resource
+
+resource = Resource("country.resource-cleaned.yaml")
+resource.title = "Countries"
+resource.description = "It's a research project"
+resource.dialect.header_rows = [2]
+resource.dialect.get_control('csv').delimiter = ";"
+resource.to_yaml("country.resource-updated.yaml")
+
+
+ We can add custom options using the custom
property:
from frictionless import Resource
+
+resource = Resource("country.resource-updated.yaml")
+resource.custom["customKey1"] = "Value1"
+resource.custom["customKey2"] = "Value2"
+resource.to_yaml("country.resource-updated2.yaml")
+
+
+ Let's check it out:
+ +cat country.resource-updated2.yaml
+
+
+name: country-2
+type: table
+title: Countries
+description: It's a research project
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+ headerRows:
+ - 2
+ csv:
+ delimiter: ;
+schema: country.schema.yaml
+customKey1: Value1
+customKey2: Value2
+
+ with open('country.resource-updated2.yaml') as file:
+ print(file.read())
+
+
+name: country-2
+type: table
+title: Countries
+description: It's a research project
+path: country-2.csv
+scheme: file
+format: csv
+mediatype: text/csv
+encoding: utf-8
+dialect:
+ headerRows:
+ - 2
+ csv:
+ delimiter: ;
+schema: country.schema.yaml
+customKey1: Value1
+customKey2: Value2
+
+ ++This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.
+
Extracting data means reading tabular data from a source. We can use various customizations for this process such as providing a file format, table schema, limiting the number of fields or rows, and much more. This guide will discuss the main extract
functions (extract
, extract_resource
, extract_package
) and will then go into more advanced details about the Resource Class
, Package Class
, Header Class
, and Row Class
. The output of the extract functions uses UTF-8 encoding.
Let's see this with some real files:
+++ +Download
+country-3.csv
to reproduce the examples (right-click and "Save link as").
cat country-3.csv
+
+
+id,capital_id,name,population
+1,1,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+
+ with open('country-3.csv') as file:
+ print(file.read())
+
+
+id,capital_id,name,population
+1,1,Britain,67
+2,3,France,67
+3,2,Germany,83
+4,5,Italy,60
+5,4,Spain,47
+
+ ++ +Download
+capital-3.csv
to reproduce the examples (right-click and "Save link as").
cat capital-3.csv
+
+
+id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+
+ with open('capital-3.csv') as file:
+ print(file.read())
+
+
+id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+
+ To start, we will extract data from a resource:
+ +frictionless extract country-3.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ country-3 │ table │ country-3.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ country-3
+┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ capital_id ┃ name ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1 │ 1 │ Britain │ 67 │
+│ 2 │ 3 │ France │ 67 │
+│ 3 │ 2 │ Germany │ 83 │
+│ 4 │ 5 │ Italy │ 60 │
+│ 5 │ 4 │ Spain │ 47 │
+└────┴────────────┴─────────┴────────────┘
+
+ from pprint import pprint
+from frictionless import extract
+
+rows = extract('country-3.csv')
+pprint(rows)
+
+
+{'country-3': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
+ {'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
+ {'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
+ {'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
+ {'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}
+
+ The high-level interface for extracting data provided by Frictionless is a set of extract
functions:
+extract: detects the source file type and extracts data accordingly
+resource.extract: returns a data table
+package.extract: returns a map of the package's tables
+As described in more detail in the Introduction, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema.
+The command/function would be used as follows:
+ +frictionless extract your-table.csv
+frictionless extract your-resource.json --type resource
+frictionless extract your-package.json --type package
+
+
+ from frictionless import extract
+
+rows = extract('capital-3.csv')
+resource = extract('capital-3.csv', type="resource")
+package = extract('capital-3.csv', type="package")
+
+
+ The extract
functions always read data into memory in the form of rows. The lower-level interfaces will allow you to stream data, which you can read about in the Resource Class section below.
A resource contains only one file. To extract a resource, we have three options. First, we can use the same approach as above, extracting from the data file itself:
+ +frictionless extract capital-3.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ capital-3 │ table │ capital-3.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ capital-3
+┏━━━━┳━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━╇━━━━━━━━┩
+│ 1 │ London │
+│ 2 │ Berlin │
+│ 3 │ Paris │
+│ 4 │ Madrid │
+│ 5 │ Rome │
+└────┴────────┘
+
+ from pprint import pprint
+from frictionless import extract
+
+rows = extract('capital-3.csv')
+pprint(rows)
+
+
+{'capital-3': [{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}]}
+
+ Our second option is to extract the resource from a descriptor file by using the extract_resource
function. A descriptor file is useful because it can contain different metadata and be stored on disk.
As an example of how to use extract_resource
, let's first create a descriptor file (note: this example uses YAML for the descriptor, but Frictionless also supports JSON):
from frictionless import Resource
+
+resource = Resource('capital-3.csv')
+resource.infer()
+# as an example, the next line adds a missing value to the schema
+resource.schema.missing_values.append('3') # will interpret 3 as a missing value
+resource.to_yaml('capital.resource-test.yaml') # use resource.to_json for JSON format
+
+
+ You can also use a pre-made descriptor file.
+Now, this descriptor file can be used to extract the resource:
+ +frictionless extract capital.resource-test.yaml
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ capital-3 │ table │ capital-3.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ capital-3
+┏━━━━━━┳━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━━━╇━━━━━━━━┩
+│ 1 │ London │
+│ 2 │ Berlin │
+│ None │ Paris │
+│ 4 │ Madrid │
+│ 5 │ Rome │
+└──────┴────────┘
+
+ from pprint import pprint
+from frictionless import extract
+
+rows = extract('capital.resource-test.yaml')
+pprint(rows)
+
+
+{'capital-3': [{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': None, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}]}
+
+ So what has happened in this example? We set the textual representation of the number "3" to be a missing value. In the output we can see how the id
number 3 now appears as None
representing a missing value. This toy example demonstrates how the metadata in a descriptor can be used; other values like "NA" are more common for missing values.
You can read more advanced details about the Resource Class below.
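The missing-value behaviour described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Frictionless implements it: `cast_integer_cell` is a hypothetical helper, and the empty string is assumed as the default missing value, as in the Table Schema specification.

```python
# Minimal sketch of missing-value handling during type casting.
# `cast_integer_cell` is a hypothetical helper, not a Frictionless function.

def cast_integer_cell(cell, missing_values=("",)):
    """Return None for cells listed as missing; otherwise cast to int."""
    if cell in missing_values:
        return None
    return int(cell)


# Raw "id" cells as they appear in capital-3.csv
raw_ids = ["1", "2", "3", "4", "5"]

# Default behaviour: only the empty string is treated as missing
default_ids = [cast_integer_cell(cell) for cell in raw_ids]

# With "3" added to the missing values, as in the descriptor above
custom_ids = [cast_integer_cell(cell, missing_values=("", "3")) for cell in raw_ids]

print(default_ids)
print(custom_ids)
```

The comparison happens on the textual cell before any casting, which is why the string "3" rather than the number 3 is added to the missing values.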
+The third way we can extract information is from a package, which is a set of two or more files, for instance, two data files and a corresponding metadata file.
+As a primary example, we provide two data files to the extract
command which will be enough to detect that it's a dataset. Let's start by using the command-line interface:
frictionless extract *-3.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ capital-3 │ table │ capital-3.csv │
+│ country-3 │ table │ country-3.csv │
+└───────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ capital-3
+┏━━━━┳━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━╇━━━━━━━━┩
+│ 1 │ London │
+│ 2 │ Berlin │
+│ 3 │ Paris │
+│ 4 │ Madrid │
+│ 5 │ Rome │
+└────┴────────┘
+ country-3
+┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ capital_id ┃ name ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1 │ 1 │ Britain │ 67 │
+│ 2 │ 3 │ France │ 67 │
+│ 3 │ 2 │ Germany │ 83 │
+│ 4 │ 5 │ Italy │ 60 │
+│ 5 │ 4 │ Spain │ 47 │
+└────┴────────────┴─────────┴────────────┘
+
+ from pprint import pprint
+from frictionless import extract
+
+data = extract('*-3.csv')
+pprint(data)
+
+
+{'capital-3': [{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}],
+ 'country-3': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
+ {'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
+ {'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
+ {'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
+ {'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}
+
+ We can also extract the package from a descriptor file using the package.extract
function (Note: see the Package Class section for the creation of the country.package.yaml
file):
frictionless extract country.package.yaml
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃
+┡━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━┩
+│ capital │ table │ capital-3.csv │
+│ country │ table │ country-3.csv │
+└─────────┴───────┴───────────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ capital
+┏━━━━┳━━━━━━━━┓
+┃ id ┃ name ┃
+┡━━━━╇━━━━━━━━┩
+│ 1 │ London │
+│ 2 │ Berlin │
+│ 3 │ Paris │
+│ 4 │ Madrid │
+│ 5 │ Rome │
+└────┴────────┘
+ country
+┏━━━━┳━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
+┃ id ┃ capital_id ┃ name ┃ population ┃
+┡━━━━╇━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
+│ 1 │ 1 │ Britain │ 67 │
+│ 2 │ 3 │ France │ 67 │
+│ 3 │ 2 │ Germany │ 83 │
+│ 4 │ 5 │ Italy │ 60 │
+│ 5 │ 4 │ Spain │ 47 │
+└────┴────────────┴─────────┴────────────┘
+
+ from pprint import pprint
+from frictionless import Package
+
+package = Package('country.package.yaml')
+pprint(package.extract())
+
+
+{'capital': [{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}],
+ 'country': [{'capital_id': 1, 'id': 1, 'name': 'Britain', 'population': 67},
+ {'capital_id': 3, 'id': 2, 'name': 'France', 'population': 67},
+ {'capital_id': 2, 'id': 3, 'name': 'Germany', 'population': 83},
+ {'capital_id': 5, 'id': 4, 'name': 'Italy', 'population': 60},
+ {'capital_id': 4, 'id': 5, 'name': 'Spain', 'population': 47}]}
+
+ You can read more advanced details about the Package Class below.
+++The following sections contain further, advanced details about the
+Resource Class
,Package Class
,Header Class
, andRow Class
.
The Resource class provides metadata about a resource with read and stream functions. The extract
functions always read rows into memory; Resource can do the same but it also gives a choice regarding output data which can be rows
, data
, text
, or bytes
. Let's try reading all of them.
It's a byte representation of the contents:
+ +from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('country-3.csv')
+pprint(resource.read_bytes())
+
+
+(b'id,capital_id,name,population\n1,1,Britain,67\n2,3,France,67\n3,2,Germany,8'
+ b'3\n4,5,Italy,60\n5,4,Spain,47\n')
+
+ It's a textual representation of the contents:
+ +from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('country-3.csv')
+pprint(resource.read_text())
+
+
+('id,capital_id,name,population\n'
+ '1,1,Britain,67\n'
+ '2,3,France,67\n'
+ '3,2,Germany,83\n'
+ '4,5,Italy,60\n'
+ '5,4,Spain,47\n')
+
+ For tabular data, there is a raw representation of the tabular contents:
+ +from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('country-3.csv')
+pprint(resource.read_cells())
+
+
+[['id', 'capital_id', 'name', 'population'],
+ ['1', '1', 'Britain', '67'],
+ ['2', '3', 'France', '67'],
+ ['3', '2', 'Germany', '83'],
+ ['4', '5', 'Italy', '60'],
+ ['5', '4', 'Spain', '47']]
+
+ For tabular data, rows are available, which are normalized lists presented as dictionaries:
+ +from pprint import pprint
+from frictionless import Resource
+
+resource = Resource('country-3.csv')
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
+ {'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
+ {'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
+ {'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
+ {'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
+
+ For tabular data, the Header object is available:
+ +from pprint import pprint
+from frictionless import Resource
+
+with Resource('country-3.csv') as resource:
+ pprint(resource.header)
+
+
+['id', 'capital_id', 'name', 'population']
+
+ It's really handy to read all your data into memory but it's not always possible if a file is very big. For such cases, Frictionless provides streaming functions:
+ +from frictionless import Resource
+
+with Resource('country-3.csv') as resource:
+ resource.byte_stream
+ resource.text_stream
+ resource.cell_stream
+ resource.row_stream
+
+
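As a rough illustration of why streaming helps with big files, here is a standard-library sketch. It does not use the Frictionless API: `stream_rows` is a hypothetical helper and the in-memory `source` stands in for a large file. The streams listed above are meant to be consumed in a similar way, by iterating over them inside the `with` block.

```python
import csv
import io

# Stand-in for a large file; the contents mirror the first rows of country-3.csv
source = io.StringIO(
    "id,capital_id,name,population\n"
    "1,1,Britain,67\n"
    "2,3,France,67\n"
    "3,2,Germany,83\n"
)


def stream_rows(file):
    """Yield one row dictionary at a time instead of building a full list."""
    for row in csv.DictReader(file):
        yield row


# Only one row is held in memory at any moment
total_population = 0
for row in stream_rows(source):
    total_population += int(row["population"])

print(total_population)
```

Because the rows are consumed one by one, memory use stays flat however long the file is; the trade-off is that the data can only be traversed once per open.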
+ The Package class provides functions to read the contents of a package. First of all, let's create a package descriptor:
+ +frictionless describe *-3.csv --json > country.package.json
+
+
+ from frictionless import describe
+
+package = describe('*-3.csv')
+package.to_json('country.package.json')
+
+
+ Note that --json is used here to output the descriptor in JSON format. Without this, the default output is in YAML format as we saw above.
+We can create a package from data files (using their paths) and then read the package's resources:
+ +from pprint import pprint
+from frictionless import Package
+
+package = Package('*-3.csv')
+pprint(package.get_resource('country-3').read_rows())
+pprint(package.get_resource('capital-3').read_rows())
+
+
+[{'id': 1, 'capital_id': 1, 'name': 'Britain', 'population': 67},
+ {'id': 2, 'capital_id': 3, 'name': 'France', 'population': 67},
+ {'id': 3, 'capital_id': 2, 'name': 'Germany', 'population': 83},
+ {'id': 4, 'capital_id': 5, 'name': 'Italy', 'population': 60},
+ {'id': 5, 'capital_id': 4, 'name': 'Spain', 'population': 47}]
+[{'id': 1, 'name': 'London'},
+ {'id': 2, 'name': 'Berlin'},
+ {'id': 3, 'name': 'Paris'},
+ {'id': 4, 'name': 'Madrid'},
+ {'id': 5, 'name': 'Rome'}]
+
+ The package by itself doesn't provide any read functions directly because it's just a container. You can select a package's resource and use the Resource API from above for data reading.
+ +++This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.
+
Transforming data in Frictionless means modifying data and metadata from state A to state B. For example, it could be transforming a messy Excel file to a cleaned CSV file, or transforming a folder of data files to a data package we can publish more easily. To read more about the concepts behind Frictionless Transform, please check out the Transform Principles section below.
+In comparison to similar Python software like Pandas, Frictionless provides better control over metadata, has a modular API, and fully supports the Frictionless Specifications. Also, it is a streaming framework with the ability to work with large data. As a downside of the Frictionless architecture, it might be slower than other Python packages, especially projects like Pandas.
+Keep reading below to learn about the principles underlying Frictionless Transform, or skip ahead to see how to use the Transform code.
+Frictionless Transform is based on a few core principles which are shared with other parts of the framework:
+Frictionless Transform can be thought of as a list of functions that accept a source resource/package object and return a target resource/package object. Every function updates the input's metadata and data - and nothing more. We tried to make this straightforward and conceptually simple, because we want our users to be able to understand the tools and master them.
+There are plenty of great ETL-frameworks written in Python and other languages. We use one of them (PETL) under the hood (described in more detail later). The core difference between Frictionless and others is that we treat metadata as a first-class citizen. This means that you don't lose type and other important information during the pipeline evaluation.
+Whenever possible, Frictionless streams the data instead of reading it into memory. For example, for sorting big tables we use a memory usage threshold and when it is met we use the file system to unload the data. The ability to stream data gives users power to work with files of any size, even very large files.
+With Frictionless all data manipulation happens on-demand. For example, if you reshape one table in a data package containing 10 big csv files, Frictionless will not even read the 9 other tables. Frictionless tries to be as explicit as possible regarding actions taken. For example, it will not use CPU resources to cast data unless a user adds a normalize
step. So it's possible to transform a rather big file without even casting types, for example, if you only need to reshape it.
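The on-demand behaviour described above can be illustrated with plain Python generators. This is a conceptual sketch only, not Frictionless code: `read_cells`, `drop_first_column` and `read_log` are hypothetical names introduced for the example.

```python
# Hypothetical helpers illustrating on-demand (lazy) processing with generators.
read_log = []  # records which source rows were actually read


def read_cells(rows):
    """Yield rows one by one, logging each read."""
    for row in rows:
        read_log.append(row[0])
        yield row


def drop_first_column(rows):
    """Reshape each row without casting any cell."""
    for row in rows:
        yield row[1:]


data = [["id", "name"], ["1", "germany"], ["2", "france"], ["3", "spain"]]

# Building the pipeline reads nothing
pipeline = drop_first_column(read_cells(data))
reads_before = len(read_log)

# Pulling two rows reads exactly two rows from the source
first = next(pipeline)
second = next(pipeline)
reads_after = len(read_log)

print(reads_before, reads_after, first, second)
```

Note that the cells stay strings throughout: reshaping does not require casting, which mirrors the point about skipping the normalize step.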
For the core transform functions, Frictionless uses the amazing PETL project under the hood. This library provides lazy-loading functionality in running data pipelines. On top of PETL, Frictionless adds metadata management and a bridge between Frictionless concepts like Package/Resource and PETL's processors.
+Frictionless supports a few different kinds of data and metadata transformations:
+The main difference between these is that resource and package transforms are imperative while pipelines can be created beforehand or shared as a JSON file. We'll talk more about pipelines in the Transforming Pipeline section below. First, we will introduce the transform functions, then go into detail about how to transform a resource and a package. As a reminder, in the Frictionless ecosystem, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema. This concept is described in more detail in the Introduction.
+++ +Download
+transform.csv
to reproduce the examples (right-click and "Save link as". You might need to change the file extension from .txt to .csv).
cat transform.csv
+
+
+id,name,population
+1,germany,83
+2,france,66
+3,spain,47
+
+ The high-level interface to transform data is a set of transform
functions:
+transform: detects the source type and transforms data accordingly
+resource.transform: transforms a resource
+package.transform: transforms a package
+We'll see examples of these functions in the next few sections.
+Let's write our first transformation. Here, we will transform a data file (a resource) by defining a source resource, applying transform steps and getting back a resulting target resource:
+ +from frictionless import Resource, Pipeline, steps
+
+# Define source resource
+source = Resource(path="transform.csv")
+
+# Create a pipeline
+pipeline = Pipeline(steps=[
+ steps.table_normalize(),
+ steps.field_add(name="cars", formula='population*2', descriptor={'type': 'integer'}),
+])
+
+# Apply transform pipeline
+target = source.transform(pipeline)
+
+# Print resulting schema and data
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'cars', 'type': 'integer'}]}
++----+-----------+------------+------+
+| id | name | population | cars |
++====+===========+============+======+
+| 1 | 'germany' | 83 | 166 |
++----+-----------+------------+------+
+| 2 | 'france' | 66 | 132 |
++----+-----------+------------+------+
+| 3 | 'spain' | 47 | 94 |
++----+-----------+------------+------+
+
+ Let's break down the transforming steps we applied:
+steps.table_normalize - casts data types and shapes the table according to the schema, inferred or provided
+steps.field_add - adds a field to data and metadata based on the information provided by the user
+There are many more available steps that we will cover below.
+A package is a set of resources. Transforming a package means adding or removing resources and/or transforming those resources themselves. This example shows how transforming a package is similar to transforming a single resource:
+ +from frictionless import Package, Pipeline, Resource, steps
+
+# Define source package
+source = Package(resources=[Resource(name='main', path="transform.csv")])
+
+# Create a pipeline
+pipeline = Pipeline(steps=[
+ steps.resource_add(name="extra", descriptor={"data": [['id', 'cars'], [1, 166], [2, 132], [3, 94]]}),
+ steps.resource_transform(
+ name="main",
+ steps=[
+ steps.table_normalize(),
+ steps.table_join(resource="extra", field_name="id"),
+ ],
+ ),
+ steps.resource_remove(name="extra"),
+])
+
+# Apply transform steps
+target = source.transform(pipeline)
+
+# Print resulting resources, schema and data
+print(target.resource_names)
+print(target.get_resource("main").schema)
+print(target.get_resource("main").to_view())
+
+
+['main']
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'cars', 'type': 'integer'}]}
++----+-----------+------------+------+
+| id | name | population | cars |
++====+===========+============+======+
+| 1 | 'germany' | 83 | 166 |
++----+-----------+------------+------+
+| 2 | 'france' | 66 | 132 |
++----+-----------+------------+------+
+| 3 | 'spain' | 47 | 94 |
++----+-----------+------------+------+
+
+ We have basically done the same as in the Transforming a Resource section. This example is quite artificial and created only to show how to join two resources, but hopefully it provides a basic understanding of how flexible package transformations can be.
+A pipeline is a declarative way to write out metadata transform steps. With a pipeline, you can transform a resource, package, or write custom plugins too.
+For resource and package types it's mostly the same functionality as we have seen above, but written declaratively. So let's run the same resource transformation as we did in the Transforming a Resource section:
+ +from frictionless import Pipeline, transform
+
+pipeline = Pipeline.from_descriptor({
+ "steps": [
+ {"type": "table-normalize"},
+ {
+ "type": "field-add",
+ "name": "cars",
+ "formula": "population*2",
+ "descriptor": {"type": "integer"}
+ },
+ ],
+})
+print(pipeline)
+
+
+{'steps': [{'type': 'table-normalize'},
+ {'name': 'cars',
+ 'type': 'field-add',
+ 'formula': 'population*2',
+ 'descriptor': {'type': 'integer'}}]}
+
+ So what's the reason to use declarative pipelines if it works the same as the Python code? The main difference is that pipelines can be saved as JSON files which can be shared among different users and used with CLI and API. For example, if you implement your own UI based on Frictionless Framework you can serialize the whole pipeline as a JSON file and send it to the server. This is the same for CLI - if your colleague has given you a pipeline.json
file, you can run frictionless transform pipeline.json
in the CLI to get the same results as they got.
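Since a pipeline descriptor is plain data, sharing it comes down to ordinary JSON serialization. The sketch below uses only the standard library to write the descriptor from the example above to a file and read it back; the temporary directory is an assumption made so the example is self-contained.

```python
import json
import os
import tempfile

# The same descriptor as in the example above
descriptor = {
    "steps": [
        {"type": "table-normalize"},
        {
            "type": "field-add",
            "name": "cars",
            "formula": "population*2",
            "descriptor": {"type": "integer"},
        },
    ],
}

# Write the descriptor to a JSON file (a temporary directory is used here)
path = os.path.join(tempfile.mkdtemp(), "pipeline.json")
with open(path, "w") as file:
    json.dump(descriptor, file, indent=2)

# A colleague can load the identical descriptor back
with open(path) as file:
    loaded = json.load(file)

print(loaded == descriptor)
```

The loaded dictionary can then be turned back into a pipeline object in the same way as the inline descriptor above, with `Pipeline.from_descriptor`.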
Frictionless includes more than 40 built-in transform steps. They are grouped by object so you can find them easily using code autocompletion in a code editor. For example, start typing steps.table...
and you will see all the available steps for that group. The available groups are:
See Transform Steps for a list of all available steps. It is also possible to write custom transform steps: see the next section.
+Here is an example of a custom step written as a Python class. This example step removes a field from a data table (note: Frictionless already has a built-in step that does the same thing: steps.field_remove
).
from frictionless import Pipeline, Resource, Step
+
+class custom_step(Step):
+    def transform_resource(self, resource):
+        current = resource.to_copy()
+
+        # Data: lazily yield every row without its first cell (the "id" column)
+        def data():
+            with current:
+                for cells in current.cell_stream:
+                    yield cells[1:]
+
+        # Meta: replace the data and drop the field from the schema
+        resource.data = data
+        resource.schema.remove_field("id")
+
+source = Resource("transform.csv")
+pipeline = Pipeline(steps=[custom_step()])
+target = source.transform(pipeline)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++-----------+------------+
+| name | population |
++===========+============+
+| 'germany' | 83 |
++-----------+------------+
+| 'france' | 66 |
++-----------+------------+
+| 'spain' | 47 |
++-----------+------------+
+
+ As you can see you can implement any custom steps within a Python script. To make it work within a declarative pipeline you need to implement a plugin. Learn more about Custom Steps and Plugins.
+++Transform Utils is under construction.
+
In some cases, it's better to use a lower-level API to achieve your goal. A resource can be exported as a PETL table. For more information please visit PETL's documentation portal.
+ + + +from frictionless import Resource
+
+resource = Resource(path='transform.csv')
+petl_table = resource.to_petl()
+# Use it with PETL framework
+print(petl_table)
+
+
++----+---------+------------+
+| id | name | population |
++====+=========+============+
+| 1 | germany | 83 |
++----+---------+------------+
+| 2 | france | 66 |
++----+---------+------------+
+| 3 | spain | 47 |
++----+---------+------------+
+
+ ++This guide assumes basic familiarity with the Frictionless Framework. To learn more, please read the Introduction and Quick Start.
+
Tabular data validation is a process of identifying problems that have occurred in your data so you can correct them. Let's explore how Frictionless helps to achieve this task using an invalid data table example:
+++ +Download
+capital-invalid.csv
to reproduce the examples (right-click and "Save link as").
cat capital-invalid.csv
+
+
+id,name,name
+1,London,Britain
+2,Berlin,Germany
+3,Paris,France
+4,Madrid,Spain
+5,Rome,Italy
+6,Zagreb,Croatia
+7,Athens,Greece
+8,Vienna,Austria
+8,Warsaw
+
+x,Tokio,Japan,review
+
+ with open('capital-invalid.csv') as file:
+ print(file.read())
+
+
+id,name,name
+1,London,Britain
+2,Berlin,Germany
+3,Paris,France
+4,Madrid,Spain
+5,Rome,Italy
+6,Zagreb,Croatia
+7,Athens,Greece
+8,Vienna,Austria
+8,Warsaw
+
+x,Tokio,Japan,review
+
+ We can validate this file using either the command-line interface or the high-level functions. Frictionless provides comprehensive error details so that errors can be understood by the user. Continue reading to learn the validation process in detail.
+ +frictionless validate capital-invalid.csv
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃ status ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ capital-invalid │ table │ capital-invalid.csv │ INVALID │
+└─────────────────┴───────┴─────────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ capital-invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3 │ duplicate-label │ Label "name" in the header at position "3" │
+│ │ │ │ is duplicated to a label: at position "2" │
+│ 10 │ 3 │ missing-cell │ Row at position "10" has a missing cell in │
+│ │ │ │ field "name2" at position "3" │
+│ 11 │ None │ blank-row │ Row at position "11" is completely blank │
+│ 12 │ 1 │ type-error │ Type error in the cell "x" in row "12" and │
+│ │ │ │ field "id" at position "1": type is │
+│ │ │ │ "integer/default" │
+│ 12 │ 4 │ extra-cell │ Row at position "12" has an extra value in │
+│ │ │ │ field at position "4" │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+
+ from pprint import pprint
+from frictionless import validate
+
+report = validate('capital-invalid.csv')
+print(report)
+
+
+{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 5, 'warnings': 0, 'seconds': 0.007},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+ 'type': 'table',
+ 'valid': False,
+ 'place': 'capital-invalid.csv',
+ 'labels': ['id', 'name', 'name'],
+ 'stats': {'errors': 5,
+ 'warnings': 0,
+ 'seconds': 0.007,
+ 'md5': 'dcdeae358cfd50860c18d953e021f836',
+ 'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+ 'bytes': 171,
+ 'fields': 3,
+ 'rows': 11},
+ 'warnings': [],
+ 'errors': [{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the '
+ 'same value. Column names should be '
+ 'unique.',
+ 'message': 'Label "name" in the header at position "3" '
+ 'is duplicated to a label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3},
+ {'type': 'missing-cell',
+ 'title': 'Missing Cell',
+ 'description': 'This row has less values compared to '
+ 'the header row (the first row in the '
+ 'data source). A key concept is that '
+ 'all the rows in tabular data must have '
+ 'the same number of columns.',
+ 'message': 'Row at position "10" has a missing cell in '
+ 'field "name2" at position "3"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['8', 'Warsaw'],
+ 'rowNumber': 10,
+ 'cell': '',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3},
+ {'type': 'blank-row',
+ 'title': 'Blank Row',
+ 'description': 'This row is empty. A row should '
+ 'contain at least one value.',
+ 'message': 'Row at position "11" is completely blank',
+ 'tags': ['#table', '#row'],
+ 'note': '',
+ 'cells': [],
+ 'rowNumber': 11},
+ {'type': 'type-error',
+ 'title': 'Type Error',
+ 'description': 'The value does not match the schema '
+ 'type and format for this field.',
+ 'message': 'Type error in the cell "x" in row "12" and '
+ 'field "id" at position "1": type is '
+ '"integer/default"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': 'type is "integer/default"',
+ 'cells': ['x', 'Tokio', 'Japan', 'review'],
+ 'rowNumber': 12,
+ 'cell': 'x',
+ 'fieldName': 'id',
+ 'fieldNumber': 1},
+ {'type': 'extra-cell',
+ 'title': 'Extra Cell',
+ 'description': 'This row has more values compared to '
+ 'the header row (the first row in the '
+ 'data source). A key concept is that '
+ 'all the rows in tabular data must have '
+ 'the same number of columns.',
+ 'message': 'Row at position "12" has an extra value in '
+ 'field at position "4"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['x', 'Tokio', 'Japan', 'review'],
+ 'rowNumber': 12,
+ 'cell': 'review',
+ 'fieldName': '',
+ 'fieldNumber': 4}]}]}
+
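To make the structural error types in the report more concrete, here is a rough stdlib-only sketch that finds duplicate labels, missing cells, blank rows, and extra cells in the same content. It is an illustration under simplifying assumptions, not how Frictionless implements its checks: it ignores type errors, and the function name `structural_errors` is hypothetical.

```python
import csv
import io

# Same content as capital-invalid.csv, inlined to keep the sketch self-contained
CSV_TEXT = """id,name,name
1,London,Britain
2,Berlin,Germany
3,Paris,France
4,Madrid,Spain
5,Rome,Italy
6,Zagreb,Croatia
7,Athens,Greece
8,Vienna,Austria
8,Warsaw

x,Tokio,Japan,review
"""


def structural_errors(text):
    """Return (row_number, field_number, error_type) tuples; row 1 is the header."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    errors = []
    # Header check: a label that repeats an earlier label
    for position, label in enumerate(header, start=1):
        if label in header[: position - 1]:
            errors.append((None, position, "duplicate-label"))
    # Row checks: compare the width of every row with the width of the header
    for row_number, cells in enumerate(rows[1:], start=2):
        if not cells:
            errors.append((row_number, None, "blank-row"))
            continue
        for position in range(len(cells) + 1, len(header) + 1):
            errors.append((row_number, position, "missing-cell"))
        for position in range(len(header) + 1, len(cells) + 1):
            errors.append((row_number, position, "extra-cell"))
    return errors


errors = structural_errors(CSV_TEXT)
for error in errors:
    print(error)
```

The row and field positions it reports line up with the Row and Field columns of the table above for these four error types.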
+ The high-level interface for validating data provided by Frictionless is a set of validate functions:
+ - validate: detects the source type and validates data accordingly
+ - Schema.validate_descriptor: validates a schema's metadata
+ - resource.validate: validates a resource's data and metadata
+ - package.validate: validates a package's data and metadata
+ - inquiry.validate: validates a special Inquiry object which represents a validation task instruction
+ On the command-line, there is only one command, but there is a flag to adjust its behavior. It's useful when you have a file with an ambiguous type, for example, a JSON file containing data instead of metadata:
+ +frictionless validate your-data.csv
+frictionless validate your-schema.yaml --type schema
+frictionless validate your-data.csv --type resource
+frictionless validate your-package.json --type package
+frictionless validate your-inquiry.yaml --type inquiry
+
+
+ As a reminder, in the Frictionless ecosystem, a resource is a single file, such as a data file, and a package is a set of files, such as a data file and a schema. This concept is described in more detail in the Introduction.
+The Schema.validate_descriptor
function is the only function validating solely metadata. To see this work, let's create an invalid table schema:
import yaml
+from frictionless import Schema
+
+descriptor = {}
+descriptor['fields'] = 'bad' # must be a list
+with open('bad.schema.yaml', 'w') as file:
+ yaml.dump(descriptor, file)
+
+
+ And let's validate this schema:
+ +frictionless validate bad.schema.yaml
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃ status ┃
+┡━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ bad.schema │ json │ bad.schema.yaml │ INVALID │
+└────────────┴──────┴─────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ bad.schema
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ None │ schema-error │ Schema is not valid: 'bad' is not of type │
+│ │ │ │ 'array' at property 'fields' │
+└──────┴───────┴──────────────┴────────────────────────────────────────────────┘
+
+ from pprint import pprint
+from frictionless import validate
+
+report = validate('bad.schema.yaml')
+pprint(report)
+
+
+{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.001},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'bad.schema',
+ 'type': 'json',
+ 'valid': False,
+ 'place': 'bad.schema.yaml',
+ 'labels': [],
+ 'stats': {'errors': 1, 'warnings': 0, 'seconds': 0.001},
+ 'warnings': [],
+ 'errors': [{'type': 'schema-error',
+ 'title': 'Schema Error',
+ 'description': 'Provided schema is not valid.',
+ 'message': "Schema is not valid: 'bad' is not of type "
+ "'array' at property 'fields'",
+ 'tags': [],
+ 'note': "'bad' is not of type 'array' at property "
+ "'fields'"}]}]}
+
+ We see that the schema is invalid and the error is displayed. Schema validation can be very useful when you work with different classes of tables and create schemas for them. Using this function will ensure that the metadata is valid.
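As a rough illustration of what a metadata-only check involves, the sketch below inspects the `fields` property of a schema-like dictionary without touching any data. It is a simplified stand-in, not the validation Frictionless performs; `check_fields` is a hypothetical helper, and the only rules assumed are that `fields` is a list and that each field has a `name`.

```python
def check_fields(descriptor):
    """Return a list of notes about the 'fields' property of a schema-like dict."""
    notes = []
    fields = descriptor.get("fields", [])
    # The whole property must be a list (a JSON array)
    if not isinstance(fields, list):
        notes.append(f"{fields!r} is not of type 'array' at property 'fields'")
        return notes
    # Every field must be a mapping that carries a name
    for number, field in enumerate(fields, start=1):
        if not isinstance(field, dict) or "name" not in field:
            notes.append(f"field at position {number} has no 'name'")
    return notes


bad_notes = check_fields({"fields": "bad"})
good_notes = check_fields({"fields": [{"name": "id", "type": "integer"}]})
print(bad_notes)
print(good_notes)
```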
+As was shown in the "Describing Data" guide, a resource is a container having both metadata and data. We need to create a resource descriptor and then we can validate it:
+ +frictionless describe capital-invalid.csv > capital.resource.yaml
+
+
+ from frictionless import describe
+
+resource = describe('capital-invalid.csv')
+resource.to_yaml('capital.resource.yaml')
+
+
+ Note: this example uses YAML for the resource descriptor format, but Frictionless also supports JSON.
+Let's now validate to ensure that we are getting the same result that we got without using a resource:
+ +frictionless validate capital.resource.yaml
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃ status ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ capital-invalid │ table │ capital-invalid.csv │ INVALID │
+└─────────────────┴───────┴─────────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ capital-invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3 │ duplicate-label │ Label "name" in the header at position "3" │
+│ │ │ │ is duplicated to a label: at position "2" │
+│ 10 │ 3 │ missing-cell │ Row at position "10" has a missing cell in │
+│ │ │ │ field "name2" at position "3" │
+│ 11 │ None │ blank-row │ Row at position "11" is completely blank │
+│ 12 │ 1 │ type-error │ Type error in the cell "x" in row "12" and │
+│ │ │ │ field "id" at position "1": type is │
+│ │ │ │ "integer/default" │
+│ 12 │ 4 │ extra-cell │ Row at position "12" has an extra value in │
+│ │ │ │ field at position "4" │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+
+ from frictionless import validate
+
+report = validate('capital.resource.yaml')
+print(report)
+
+
+{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 5, 'warnings': 0, 'seconds': 0.004},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+ 'type': 'table',
+ 'valid': False,
+ 'place': 'capital-invalid.csv',
+ 'labels': ['id', 'name', 'name'],
+ 'stats': {'errors': 5,
+ 'warnings': 0,
+ 'seconds': 0.004,
+ 'md5': 'dcdeae358cfd50860c18d953e021f836',
+ 'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+ 'bytes': 171,
+ 'fields': 3,
+ 'rows': 11},
+ 'warnings': [],
+ 'errors': [{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the '
+ 'same value. Column names should be '
+ 'unique.',
+ 'message': 'Label "name" in the header at position "3" '
+ 'is duplicated to a label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3},
+ {'type': 'missing-cell',
+ 'title': 'Missing Cell',
+ 'description': 'This row has less values compared to '
+ 'the header row (the first row in the '
+ 'data source). A key concept is that '
+ 'all the rows in tabular data must have '
+ 'the same number of columns.',
+ 'message': 'Row at position "10" has a missing cell in '
+ 'field "name2" at position "3"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['8', 'Warsaw'],
+ 'rowNumber': 10,
+ 'cell': '',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3},
+ {'type': 'blank-row',
+ 'title': 'Blank Row',
+ 'description': 'This row is empty. A row should '
+ 'contain at least one value.',
+ 'message': 'Row at position "11" is completely blank',
+ 'tags': ['#table', '#row'],
+ 'note': '',
+ 'cells': [],
+ 'rowNumber': 11},
+ {'type': 'type-error',
+ 'title': 'Type Error',
+ 'description': 'The value does not match the schema '
+ 'type and format for this field.',
+ 'message': 'Type error in the cell "x" in row "12" and '
+ 'field "id" at position "1": type is '
+ '"integer/default"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': 'type is "integer/default"',
+ 'cells': ['x', 'Tokio', 'Japan', 'review'],
+ 'rowNumber': 12,
+ 'cell': 'x',
+ 'fieldName': 'id',
+ 'fieldNumber': 1},
+ {'type': 'extra-cell',
+ 'title': 'Extra Cell',
+ 'description': 'This row has more values compared to '
+ 'the header row (the first row in the '
+ 'data source). A key concept is that '
+ 'all the rows in tabular data must have '
+ 'the same number of columns.',
+ 'message': 'Row at position "12" has an extra value in '
+ 'field at position "4"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['x', 'Tokio', 'Japan', 'review'],
+ 'rowNumber': 12,
+ 'cell': 'review',
+ 'fieldName': '',
+ 'fieldNumber': 4}]}]}
+
+ Okay, why do we need to use a resource descriptor if the result is the same? The reason is metadata + data packaging. Let's extend our resource descriptor to show how you can edit and validate metadata:
+ +from frictionless import describe
+
+resource = describe('capital-invalid.csv')
+resource.add_defined('stats') # TODO: fix and remove this line
+resource.stats.md5 = 'ae23c74693ca2d3f0e38b9ba3570775b' # this is a made-up, incorrect hash
+resource.stats.bytes = 100 # this is wrong
+resource.to_yaml('capital.resource-bad.yaml')
+
+
+ We have added a few incorrect, made up attributes to our resource descriptor as an example. Now, the validation below reports these errors in addition to all the errors we had before. This example shows how concepts like Data Resource can be extremely useful when working with data.
+ +frictionless validate capital.resource-bad.yaml # TODO: it should have 7 errors
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃ status ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ capital-invalid │ table │ capital-invalid.csv │ INVALID │
+└─────────────────┴───────┴─────────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ capital-invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3 │ duplicate-label │ Label "name" in the header at position "3" │
+│ │ │ │ is duplicated to a label: at position "2" │
+│ 10 │ 3 │ missing-cell │ Row at position "10" has a missing cell in │
+│ │ │ │ field "name2" at position "3" │
+│ 11 │ None │ blank-row │ Row at position "11" is completely blank │
+│ 12 │ 1 │ type-error │ Type error in the cell "x" in row "12" and │
+│ │ │ │ field "id" at position "1": type is │
+│ │ │ │ "integer/default" │
+│ 12 │ 4 │ extra-cell │ Row at position "12" has an extra value in │
+│ │ │ │ field at position "4" │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+
+ from frictionless import validate
+
+report = validate('capital.resource-bad.yaml')
+print(report)
+
+
+{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 5, 'warnings': 0, 'seconds': 0.004},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+ 'type': 'table',
+ 'valid': False,
+ 'place': 'capital-invalid.csv',
+ 'labels': ['id', 'name', 'name'],
+ 'stats': {'errors': 5,
+ 'warnings': 0,
+ 'seconds': 0.004,
+ 'md5': 'dcdeae358cfd50860c18d953e021f836',
+ 'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+ 'bytes': 171,
+ 'fields': 3,
+ 'rows': 11},
+ 'warnings': [],
+ 'errors': [{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the '
+ 'same value. Column names should be '
+ 'unique.',
+ 'message': 'Label "name" in the header at position "3" '
+ 'is duplicated to a label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3},
+ {'type': 'missing-cell',
+ 'title': 'Missing Cell',
+ 'description': 'This row has less values compared to '
+ 'the header row (the first row in the '
+ 'data source). A key concept is that '
+ 'all the rows in tabular data must have '
+ 'the same number of columns.',
+ 'message': 'Row at position "10" has a missing cell in '
+ 'field "name2" at position "3"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['8', 'Warsaw'],
+ 'rowNumber': 10,
+ 'cell': '',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3},
+ {'type': 'blank-row',
+ 'title': 'Blank Row',
+ 'description': 'This row is empty. A row should '
+ 'contain at least one value.',
+ 'message': 'Row at position "11" is completely blank',
+ 'tags': ['#table', '#row'],
+ 'note': '',
+ 'cells': [],
+ 'rowNumber': 11},
+ {'type': 'type-error',
+ 'title': 'Type Error',
+ 'description': 'The value does not match the schema '
+ 'type and format for this field.',
+ 'message': 'Type error in the cell "x" in row "12" and '
+ 'field "id" at position "1": type is '
+ '"integer/default"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': 'type is "integer/default"',
+ 'cells': ['x', 'Tokio', 'Japan', 'review'],
+ 'rowNumber': 12,
+ 'cell': 'x',
+ 'fieldName': 'id',
+ 'fieldNumber': 1},
+ {'type': 'extra-cell',
+ 'title': 'Extra Cell',
+ 'description': 'This row has more values compared to '
+ 'the header row (the first row in the '
+ 'data source). A key concept is that '
+ 'all the rows in tabular data must have '
+ 'the same number of columns.',
+ 'message': 'Row at position "12" has an extra value in '
+ 'field at position "4"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['x', 'Tokio', 'Japan', 'review'],
+ 'rowNumber': 12,
+ 'cell': 'review',
+ 'fieldName': '',
+ 'fieldNumber': 4}]}]}
+
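The idea behind declaring `md5` and `bytes` in a descriptor is that a validator can recompute them from the file and compare. The following stdlib-only sketch shows that comparison under stated assumptions: `check_stats` is a hypothetical helper, the data is an inline byte string rather than a real file, and this is not the code path Frictionless uses.

```python
import hashlib


def check_stats(data, expected):
    """Compare expected 'md5' and 'bytes' values with the ones computed from data."""
    actual = {"md5": hashlib.md5(data).hexdigest(), "bytes": len(data)}
    problems = []
    for key, value in expected.items():
        if key in actual and actual[key] != value:
            problems.append(f'{key} expected "{value}" but got "{actual[key]}"')
    return problems


data = b"id,name\n1,London\n2,Berlin\n"
# Stats that match the data, and made-up stats like the ones in the example above
good = {"md5": hashlib.md5(data).hexdigest(), "bytes": len(data)}
bad = {"md5": "ae23c74693ca2d3f0e38b9ba3570775b", "bytes": 100}

good_problems = check_stats(data, good)
bad_problems = check_stats(data, bad)
print(good_problems)
print(bad_problems)
```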
+ A package is a set of resources + additional metadata. To showcase a package validation we need to use one more tabular file:
+++ +Download
+capital-valid.csv
to reproduce the examples (right-click and "Save link as").
cat capital-valid.csv
+
+
+id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+
+ with open('capital-valid.csv') as file:
+ print(file.read())
+
+
+id,name
+1,London
+2,Berlin
+3,Paris
+4,Madrid
+5,Rome
+
+ Now let's describe and validate a package which contains the data files we have seen so far:
+ +frictionless describe capital-*id.csv > capital.package.yaml
+frictionless validate capital.package.yaml
+
+
+──────────────────────────────────── Tables ────────────────────────────────────
+ dataset
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ None │ package-error │ The data package has an error: cannot │
+│ │ │ │ retrieve metadata "capital.package.yaml" │
+│ │ │ │ because "" │
+└──────┴───────┴───────────────┴───────────────────────────────────────────────┘
+
+ from frictionless import describe, validate
+
+# create package descriptor
+package = describe("capital-*id.csv")
+package.to_yaml("capital.package.yaml")
+# validate
+report = validate("capital.package.yaml")
+print(report)
+
+
+{'valid': False,
+ 'stats': {'tasks': 2, 'errors': 5, 'warnings': 0, 'seconds': 0.007},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+ 'type': 'table',
+ 'valid': False,
+ 'place': 'capital-invalid.csv',
+ 'labels': ['id', 'name', 'name'],
+ 'stats': {'errors': 5,
+ 'warnings': 0,
+ 'seconds': 0.003,
+ 'md5': 'dcdeae358cfd50860c18d953e021f836',
+ 'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+ 'bytes': 171,
+ 'fields': 3,
+ 'rows': 11},
+ 'warnings': [],
+ 'errors': [{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the '
+ 'same value. Column names should be '
+ 'unique.',
+ 'message': 'Label "name" in the header at position "3" '
+ 'is duplicated to a label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3},
+ {'type': 'missing-cell',
+ 'title': 'Missing Cell',
+ 'description': 'This row has less values compared to '
+ 'the header row (the first row in the '
+ 'data source). A key concept is that '
+ 'all the rows in tabular data must have '
+ 'the same number of columns.',
+ 'message': 'Row at position "10" has a missing cell in '
+ 'field "name2" at position "3"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['8', 'Warsaw'],
+ 'rowNumber': 10,
+ 'cell': '',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3},
+ {'type': 'blank-row',
+ 'title': 'Blank Row',
+ 'description': 'This row is empty. A row should '
+ 'contain at least one value.',
+ 'message': 'Row at position "11" is completely blank',
+ 'tags': ['#table', '#row'],
+ 'note': '',
+ 'cells': [],
+ 'rowNumber': 11},
+ {'type': 'type-error',
+ 'title': 'Type Error',
+ 'description': 'The value does not match the schema '
+ 'type and format for this field.',
+ 'message': 'Type error in the cell "x" in row "12" and '
+ 'field "id" at position "1": type is '
+ '"integer/default"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': 'type is "integer/default"',
+ 'cells': ['x', 'Tokio', 'Japan', 'review'],
+ 'rowNumber': 12,
+ 'cell': 'x',
+ 'fieldName': 'id',
+ 'fieldNumber': 1},
+ {'type': 'extra-cell',
+ 'title': 'Extra Cell',
+ 'description': 'This row has more values compared to '
+ 'the header row (the first row in the '
+ 'data source). A key concept is that '
+ 'all the rows in tabular data must have '
+ 'the same number of columns.',
+ 'message': 'Row at position "12" has an extra value in '
+ 'field at position "4"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['x', 'Tokio', 'Japan', 'review'],
+ 'rowNumber': 12,
+ 'cell': 'review',
+ 'fieldName': '',
+ 'fieldNumber': 4}]},
+ {'name': 'capital-valid',
+ 'type': 'table',
+ 'valid': True,
+ 'place': 'capital-valid.csv',
+ 'labels': ['id', 'name'],
+ 'stats': {'errors': 0,
+ 'warnings': 0,
+ 'seconds': 0.002,
+ 'md5': 'e7b6592a0a4356ba834e4bf1c8e8c7f8',
+ 'sha256': '04202244cbb3662b0f97bfa65adfad045724cbc8d798a7c0eb85533e9da40a5b',
+ 'bytes': 50,
+ 'fields': 2,
+ 'rows': 5},
+ 'warnings': [],
+ 'errors': []}]}
+
+ As we can see, the result is in a similar format to what we have already seen, and shows errors as we expected: we have one invalid resource and one valid resource.
+++The Inquiry is an advanced concept mostly used by software integrators. For example, under the hood, Frictionless Framework uses inquiries to implement client-server validation within the built-in API. Please skip this section if this information feels unnecessary for you.
+
Inquiry is a declarative representation of a validation job. It gives you the ability to create, export, and share arbitrary validation jobs containing a set of individual validation tasks. Tasks in the Inquiry accept the same arguments written in camelCase as the corresponding validate
functions.
Let's create an Inquiry that includes an individual file validation and a resource validation. In this example we will use the data file, capital-valid.csv
and the resource, capital.resource.yaml
which describes the invalid data file we have already seen:
from frictionless import Inquiry, InquiryTask
+
+inquiry = Inquiry(tasks=[
+ InquiryTask(path='capital-valid.csv'),
+ InquiryTask(resource='capital.resource.yaml'),
+])
+inquiry.to_yaml('capital.inquiry.yaml')
+
+
+ As usual, let's run validation:
+ +frictionless validate capital.inquiry.yaml
+
+
+─────────────────────────────────── Dataset ────────────────────────────────────
+ dataset
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ name ┃ type ┃ path ┃ status ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ capital-valid │ table │ capital-valid.csv │ VALID │
+│ capital-invalid │ table │ capital-invalid.csv │ INVALID │
+└─────────────────┴───────┴─────────────────────┴─────────┘
+──────────────────────────────────── Tables ────────────────────────────────────
+ capital-invalid
+┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ Row ┃ Field ┃ Type ┃ Message ┃
+┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ None │ 3 │ duplicate-label │ Label "name" in the header at position "3" │
+│ │ │ │ is duplicated to a label: at position "2" │
+│ 10 │ 3 │ missing-cell │ Row at position "10" has a missing cell in │
+│ │ │ │ field "name2" at position "3" │
+│ 11 │ None │ blank-row │ Row at position "11" is completely blank │
+│ 12 │ 1 │ type-error │ Type error in the cell "x" in row "12" and │
+│ │ │ │ field "id" at position "1": type is │
+│ │ │ │ "integer/default" │
+│ 12 │ 4 │ extra-cell │ Row at position "12" has an extra value in │
+│ │ │ │ field at position "4" │
+└──────┴───────┴─────────────────┴─────────────────────────────────────────────┘
+
+ from frictionless import validate
+
+report = validate("capital.inquiry.yaml")
+print(report)
+
+
+{'valid': False,
+ 'stats': {'tasks': 2, 'errors': 5, 'warnings': 0, 'seconds': 0.01},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-valid',
+ 'type': 'table',
+ 'valid': True,
+ 'place': 'capital-valid.csv',
+ 'labels': ['id', 'name'],
+ 'stats': {'errors': 0,
+ 'warnings': 0,
+ 'seconds': 0.004,
+ 'md5': 'e7b6592a0a4356ba834e4bf1c8e8c7f8',
+ 'sha256': '04202244cbb3662b0f97bfa65adfad045724cbc8d798a7c0eb85533e9da40a5b',
+ 'bytes': 50,
+ 'fields': 2,
+ 'rows': 5},
+ 'warnings': [],
+ 'errors': []},
+ {'name': 'capital-invalid',
+ 'type': 'table',
+ 'valid': False,
+ 'place': 'capital-invalid.csv',
+ 'labels': ['id', 'name', 'name'],
+ 'stats': {'errors': 5,
+ 'warnings': 0,
+ 'seconds': 0.003,
+ 'md5': 'dcdeae358cfd50860c18d953e021f836',
+ 'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+ 'bytes': 171,
+ 'fields': 3,
+ 'rows': 11},
+ 'warnings': [],
+ 'errors': [{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the '
+ 'same value. Column names should be '
+ 'unique.',
+ 'message': 'Label "name" in the header at position "3" '
+ 'is duplicated to a label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3},
+ {'type': 'missing-cell',
+ 'title': 'Missing Cell',
+ 'description': 'This row has less values compared to '
+ 'the header row (the first row in the '
+ 'data source). A key concept is that '
+ 'all the rows in tabular data must have '
+ 'the same number of columns.',
+ 'message': 'Row at position "10" has a missing cell in '
+ 'field "name2" at position "3"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['8', 'Warsaw'],
+ 'rowNumber': 10,
+ 'cell': '',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3},
+ {'type': 'blank-row',
+ 'title': 'Blank Row',
+ 'description': 'This row is empty. A row should '
+ 'contain at least one value.',
+ 'message': 'Row at position "11" is completely blank',
+ 'tags': ['#table', '#row'],
+ 'note': '',
+ 'cells': [],
+ 'rowNumber': 11},
+ {'type': 'type-error',
+ 'title': 'Type Error',
+ 'description': 'The value does not match the schema '
+ 'type and format for this field.',
+ 'message': 'Type error in the cell "x" in row "12" and '
+ 'field "id" at position "1": type is '
+ '"integer/default"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': 'type is "integer/default"',
+ 'cells': ['x', 'Tokio', 'Japan', 'review'],
+ 'rowNumber': 12,
+ 'cell': 'x',
+ 'fieldName': 'id',
+ 'fieldNumber': 1},
+ {'type': 'extra-cell',
+ 'title': 'Extra Cell',
+ 'description': 'This row has more values compared to '
+ 'the header row (the first row in the '
+ 'data source). A key concept is that '
+ 'all the rows in tabular data must have '
+ 'the same number of columns.',
+ 'message': 'Row at position "12" has an extra value in '
+ 'field at position "4"',
+ 'tags': ['#table', '#row', '#cell'],
+ 'note': '',
+ 'cells': ['x', 'Tokio', 'Japan', 'review'],
+ 'rowNumber': 12,
+ 'cell': 'review',
+ 'fieldName': '',
+ 'fieldNumber': 4}]}]}
+
+ At first sight, it might not be clear why such a construct exists, but when your validation workflow gets complex, the Inquiry can provide a lot of flexibility and power.
+++The Inquiry will use multiprocessing if the
+parallel
flag is provided. This can speed up validation considerably, especially on a processor with 4 or more cores.
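The general pattern behind an inquiry, a declarative list of tasks that a runner executes and merges into one result, can be sketched with the standard library alone. Everything here is an assumption made for illustration: the `job` dictionary is a toy shape and not the Inquiry descriptor format, `run_task` and `run_job` are hypothetical names, the check is a simple row-width comparison, and a thread pool stands in for the multiprocessing mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task):
    """Run one toy task: every row must have as many cells as the header."""
    rows = [line.split(",") for line in task["text"].splitlines()]
    width = len(rows[0])
    errors = [n for n, cells in enumerate(rows, start=1) if len(cells) != width]
    return {"name": task["name"], "valid": not errors, "errorRows": errors}


def run_job(job, parallel=False):
    """Run all tasks of a job, optionally concurrently, and merge the results."""
    if parallel:
        with ThreadPoolExecutor() as pool:
            # map() keeps the results in the same order as the tasks
            results = list(pool.map(run_task, job["tasks"]))
    else:
        results = [run_task(task) for task in job["tasks"]]
    return {"valid": all(r["valid"] for r in results), "tasks": results}


job = {
    "tasks": [
        {"name": "capital-valid", "text": "id,name\n1,London\n2,Berlin"},
        {"name": "capital-invalid", "text": "id,name\n1,London\n8"},
    ]
}

summary = run_job(job, parallel=True)
print(summary)
```

Since the job is plain data, it can be serialized, shared, and re-run elsewhere, which is the property that makes a declarative job description useful.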
All the validate
functions return a Validation Report. This is a unified object containing information about a validation: source details, the errors found, etc. Let's explore a report:
from frictionless import validate
+
+report = validate('capital-invalid.csv', pick_errors=['duplicate-label'])
+print(report)
+
+
+{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.006},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'capital-invalid',
+ 'type': 'table',
+ 'valid': False,
+ 'place': 'capital-invalid.csv',
+ 'labels': ['id', 'name', 'name'],
+ 'stats': {'errors': 1,
+ 'warnings': 0,
+ 'seconds': 0.006,
+ 'md5': 'dcdeae358cfd50860c18d953e021f836',
+ 'sha256': '95cc611e3b2457447ce62721a9b79d1a063d82058fc144d6d2a8dda53f30c3a6',
+ 'bytes': 171,
+ 'fields': 3,
+ 'rows': 11},
+ 'warnings': [],
+ 'errors': [{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the '
+ 'same value. Column names should be '
+ 'unique.',
+ 'message': 'Label "name" in the header at position "3" '
+ 'is duplicated to a label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3}]}]}
+
+ As we can see, there is a lot of information; you can find a detailed description of the Validation Report in the API Reference. Errors are grouped by tasks (i.e. data files); for some validations there can be dozens of tasks. Let's use the report.flatten
function to simplify the representation of errors. This function helps to represent a report as a list of errors:
from pprint import pprint
+from frictionless import validate
+
+report = validate("capital-invalid.csv", pick_errors=["duplicate-label"])
+pprint(report.flatten(["rowNumber", "fieldNumber", "type", "message"]))
+
+
+[[None,
+ 3,
+ 'duplicate-label',
+ 'Label "name" in the header at position "3" is duplicated to a label: at '
+ 'position "2"']]
+
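Conceptually, flattening walks the top-level errors and then the errors of every task, picking the requested keys from each one. The sketch below shows that idea on a plain dictionary shaped like the printed report. It is an approximation for illustration, not the library's implementation, and the standalone `flatten` function here is hypothetical.

```python
def flatten(report, spec):
    """Collect one list per error, holding the values of the keys named in spec."""
    rows = []
    # Top-level errors first, then the errors of every task
    for error in report.get("errors", []):
        rows.append([error.get(key) for key in spec])
    for task in report.get("tasks", []):
        for error in task.get("errors", []):
            rows.append([error.get(key) for key in spec])
    return rows


# A trimmed-down dictionary with the same nesting as the report shown earlier
report = {
    "valid": False,
    "errors": [],
    "tasks": [
        {
            "name": "capital-invalid",
            "errors": [
                {"type": "duplicate-label", "fieldNumber": 3, "rowNumbers": [1]},
                {"type": "missing-cell", "fieldNumber": 3, "rowNumber": 10},
            ],
        }
    ],
}

rows = flatten(report, ["rowNumber", "fieldNumber", "type"])
print(rows)
```

A key that an error does not carry comes back as None, which is why a header error (it has `rowNumbers` rather than `rowNumber`) shows None in the row position.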
+ In some situations, an error can't be associated with a task; then it goes to the top-level report.errors
property:
from frictionless import validate
+
+report = validate("bad.json", type='schema')
+print(report)
+
+
+{'valid': False,
+ 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.0},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'name': 'bad',
+ 'type': 'json',
+ 'valid': False,
+ 'place': 'bad.json',
+ 'labels': [],
+ 'stats': {'errors': 1, 'warnings': 0, 'seconds': 0.0},
+ 'warnings': [],
+ 'errors': [{'type': 'schema-error',
+ 'title': 'Schema Error',
+ 'description': 'Provided schema is not valid.',
+ 'message': 'Schema is not valid: cannot retrieve '
+ 'metadata "bad.json" because "[Errno 2] No '
+ 'such file or directory: \'bad.json\'"',
+ 'tags': [],
+ 'note': 'cannot retrieve metadata "bad.json" because '
+ '"[Errno 2] No such file or directory: '
+ '\'bad.json\'"'}]}]}
+
+ The Error object is at the heart of the validation process. The Report has report.errors
and report.tasks[].errors
, properties that can contain the Error object. Let's explore it by taking a deeper look at the duplicate-label
error:
from frictionless import validate
+
+report = validate("capital-invalid.csv", pick_errors=["duplicate-label"])
+error = report.error  # this is only available for one table / one error situation
+print(f'Type: "{error.type}"')
+print(f'Title: "{error.title}"')
+print(f'Tags: "{error.tags}"')
+print(f'Note: "{error.note}"')
+print(f'Message: "{error.message}"')
+print(f'Description: "{error.description}"')
+
+
+Type: "duplicate-label"
+Title: "Duplicate Label"
+Tags: "['#table', '#header', '#label']"
+Note: "at position "2""
+Message: "Label "name" in the header at position "3" is duplicated to a label: at position "2""
+Description: "Two columns in the header row have the same value. Column names should be unique."
+
+ Above, we have listed universal error properties. Depending on the type of an error there can be additional ones. For example, for our duplicate-label
error:
from frictionless import validate
+
+report = validate("capital-invalid.csv", pick_errors=["duplicate-label"])
+error = report.error  # this is only available for one table / one error situation
+print(error)
+
+
+{'type': 'duplicate-label',
+ 'title': 'Duplicate Label',
+ 'description': 'Two columns in the header row have the same value. Column '
+ 'names should be unique.',
+ 'message': 'Label "name" in the header at position "3" is duplicated to a '
+ 'label: at position "2"',
+ 'tags': ['#table', '#header', '#label'],
+ 'note': 'at position "2"',
+ 'labels': ['id', 'name', 'name'],
+ 'rowNumbers': [1],
+ 'label': 'name',
+ 'fieldName': 'name2',
+ 'fieldNumber': 3}
+
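 As a small illustration of the split between universal and type-specific properties, the sketch below separates the two groups for the error dict printed above. The UNIVERSAL tuple and the helper split_properties are assumptions made for this example, based only on the properties listed in this section.

```python
# Property names taken from the "universal" listing above (an assumption of this sketch).
UNIVERSAL = ("type", "title", "description", "message", "tags", "note")


def split_properties(error):
    """Split an error dict into universal and type-specific properties."""
    universal = {key: value for key, value in error.items() if key in UNIVERSAL}
    specific = {key: value for key, value in error.items() if key not in UNIVERSAL}
    return universal, specific


error = {
    "type": "duplicate-label",
    "title": "Duplicate Label",
    "tags": ["#table", "#header", "#label"],
    "note": 'at position "2"',
    "labels": ["id", "name", "name"],
    "rowNumbers": [1],
    "label": "name",
    "fieldName": "name2",
    "fieldNumber": 3,
}

universal, specific = split_properties(error)
print(sorted(specific))
```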
+
+Please explore the Errors Reference to learn about all the available errors and their properties.
+There are various validation checks included in the core Frictionless Framework along with an ability to create custom checks. See Validation Checks for a list of available checks.
+ +from pprint import pprint
+from frictionless import validate, checks
+
+checks = [checks.sequential_value(field_name='id')]
+report = validate('capital-invalid.csv', checks=checks)
+pprint(report.flatten(["rowNumber", "fieldNumber", "type", "note"]))
+
+
+[[None, 3, 'duplicate-label', 'at position "2"'],
+ [10, 3, 'missing-cell', ''],
+ [10, 1, 'sequential-value', 'the value is not sequential'],
+ [11, None, 'blank-row', ''],
+ [12, 1, 'type-error', 'type is "integer/default"'],
+ [12, 4, 'extra-cell', '']]
+
+
+++Note that only the Baseline Check is enabled by default. Other built-in checks need to be activated as shown below.
+
There are many cases when built-in Frictionless checks are not enough. For instance, you might want to create a business logic rule or specific quality requirement for the data. With Frictionless it's very easy to use your own custom checks. Let's see with an example:
+ +from pprint import pprint
+from frictionless import Check, validate, errors
+
+# Create check
+class forbidden_two(Check):
+ Errors = [errors.CellError]
+ def validate_row(self, row):
+ if row['header'] == 2:
+ note = '2 is forbidden!'
+ yield errors.CellError.from_row(row, note=note, field_name='header')
+
+# Validate table
+source = b'header\n1\n2\n3'
+report = validate(source, format='csv', checks=[forbidden_two()])
+pprint(report.flatten(["rowNumber", "fieldNumber", "type", "note"]))
+
+
+[[3, 1, 'cell-error', '2 is forbidden!']]
+
+ Usually, it also makes sense to create a custom error for your custom check. The Check class provides other useful methods like validate_header
etc. Please read the API Reference for more details.
Learn more about custom checks in the Check Guide.
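 The row-by-row pattern behind a custom check can be sketched without the framework. The code below is a library-free illustration, not the Check API: the function forbidden_two, the runner run_checks, and the 'forbidden-two' error type are all hypothetical names chosen for this example.

```python
import csv
import io


def forbidden_two(row):
    """Yield an error dict when the 'header' cell equals 2 (hypothetical check)."""
    if row["header"] == 2:
        yield {"type": "forbidden-two", "note": "2 is forbidden!", "fieldName": "header"}


def run_checks(text, checks):
    """Apply every check to every data row and collect the yielded errors."""
    reader = csv.DictReader(io.StringIO(text))
    found = []
    # Row numbers start at 2 because the header occupies row 1.
    for row_number, raw in enumerate(reader, start=2):
        row = {name: int(value) for name, value in raw.items()}
        for check in checks:
            for error in check(row):
                found.append({"rowNumber": row_number, **error})
    return found


errors_found = run_checks("header\n1\n2\n3", [forbidden_two])
print(errors_found)
```

 Writing the check as a generator lets one row produce zero, one, or several errors, which mirrors the yield in validate_row above.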
+We can pick or skip errors by providing a list of error types. This is useful when you already know your data has some errors but want to ignore them for now, for instance when a data table has repeating header names. Let's see an example of how to pick and skip errors:
+ +from pprint import pprint
+from frictionless import validate
+
+report1 = validate("capital-invalid.csv", pick_errors=["duplicate-label"])
+report2 = validate("capital-invalid.csv", skip_errors=["duplicate-label"])
+pprint(report1.flatten(["rowNumber", "fieldNumber", "type"]))
+pprint(report2.flatten(["rowNumber", "fieldNumber", "type"]))
+
+
+[[None, 3, 'duplicate-label']]
+[[10, 3, 'missing-cell'],
+ [11, None, 'blank-row'],
+ [12, 1, 'type-error'],
+ [12, 4, 'extra-cell']]
+
+ It's also possible to use error tags (for more information please consult the Errors Reference):
+ +from pprint import pprint
+from frictionless import validate
+
+report1 = validate("capital-invalid.csv", pick_errors=["#header"])
+report2 = validate("capital-invalid.csv", skip_errors=["#row"])
+pprint(report1.flatten(["rowNumber", "fieldNumber", "type"]))
+pprint(report2.flatten(["rowNumber", "fieldNumber", "type"]))
+
+
+[[None, 3, 'duplicate-label']]
+[[None, 3, 'duplicate-label']]
+
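 The selection rule used in the two examples above can be sketched in plain Python. This is an illustration under stated assumptions, not the framework's code: the helper select_errors is hypothetical, and a selector is assumed to match either an error's type or one of its tags.

```python
def matches(error, selectors):
    """True if any selector equals the error type or one of its tags."""
    return any(s == error["type"] or s in error["tags"] for s in selectors)


def select_errors(errors, pick=None, skip=None):
    """Keep errors that match `pick` (if given) and do not match `skip`."""
    selected = []
    for error in errors:
        if pick and not matches(error, pick):
            continue
        if skip and matches(error, skip):
            continue
        selected.append(error)
    return selected


# Illustrative errors; the tag lists are assumptions for this sketch.
errors = [
    {"type": "duplicate-label", "tags": ["#table", "#header", "#label"]},
    {"type": "missing-cell", "tags": ["#table", "#row", "#cell"]},
    {"type": "blank-row", "tags": ["#table", "#row"]},
]

picked = [e["type"] for e in select_errors(errors, pick=["#header"])]
skipped = [e["type"] for e in select_errors(errors, skip=["#row"])]
print(picked, skipped)
```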
 This option allows you to limit the number of errors, and can be used when you need to do a quick check or want to "fail fast". For instance, here we use limit_errors
to find just the 1st error and add it to our report:
from pprint import pprint
+from frictionless import validate
+
+report = validate("capital-invalid.csv", limit_errors=1)
+pprint(report.flatten(["rowNumber", "fieldNumber", "type"]))
+
+[[None, 3, 'duplicate-label']]
+
+With the CKAN portal feature you can load and publish packages from CKAN, an open-source Data Management System.
+To install this plugin you need to do:
+ +pip install frictionless[ckan] --pre
+pip install 'frictionless[ckan]' --pre # for zsh shell
+
+
 To import a dataset from a CKAN instance as a Frictionless Package, you can do as below:
+ +from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl()
+package = Package('https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos', control=ckan_control)
+
+
 Where 'https://legado.dados.gov.br/dataset/bolsa-familia-pagamentos' is the URL of the CKAN dataset. This will download the dataset and the metadata of all its resources.
+You can pass parameters to CKAN Control to configure it, like the CKAN instance
+base URL (baseurl
 ) and the dataset that you want to download (dataset
):
from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', dataset='bolsa-familia-pagamentos')
+package = Package(control=ckan_control)
+
+
+ You don't need to pass the dataset
parameter to CkanControl. In the case that
+you pass only the baseurl
you can download a package as:
from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl(baseurl='https://legado.dados.gov.br')
+package = Package('bolsa-familia-pagamentos', control=ckan_control)
+
+
 If the CKAN dataset has a resource containing errors in its schema,
+you can still load the package by passing the parameter ignore_schema=True
to
+CKAN Control:
from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', ignore_schema=True)
+package = Package('bolsa-familia-pagamentos', control=ckan_control)
+
+
+ This will download the dataset and all its resources, saving the resources'
+original schemas on original_schema
.
 To publish a Package to a CKAN instance you will need an API key from a CKAN
+user who has permission to create datasets. This key can be passed to CKAN
+Control as the parameter apikey
.
from frictionless.portals import CkanControl
+from frictionless import Package
+
+ckan_control = CkanControl(baseurl='https://legado.dados.gov.br', apikey='YOUR-SECRET-API-KEY')
+package = Package(...) # Create your package
+package.publish(control=ckan_control)
+
+
+ You can download a list of CKAN datasets using the Catalog.
+ +
+import frictionless
+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br')
+c = Catalog(control=ckan_control)
+
+
+ This will download all datasets from the instance, limited only by the maximum
+number of datasets returned by the instance CKAN API. If the instance returns
+only 10 datasets by default, you can request more packages by passing the
+parameter num_packages
. In the example above if you want to download 1000
+datasets you can do as:
+import frictionless
+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', num_packages=1000)
+c = Catalog(control=ckan_control)
+
+
 When you request a large number of packages from CKAN, some of them may not
+have a valid Package descriptor according to the specifications. In that case
+the standard behaviour is to stop downloading and raise an exception. If you
+want to ignore individual package errors, you can
+pass the parameter ignore_package_errors=True
:
+import frictionless
+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', ignore_package_errors=True, num_packages=1000)
+c = Catalog(control=ckan_control)
+
+
 The output of the command above lists the ids of the CKAN datasets with errors and the total number of packages returned by your query to the CKAN instance:
+Error in CKAN dataset 8d60eff7-1a46-42ef-be64-e8979117a378: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email")
+Error in CKAN dataset 933d7164-8128-4e12-97e6-208bc4935bcb: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email")
+Error in CKAN dataset 93114fec-01c2-4ef5-8dfe-67da5027d568: [package-error] The data package has an error: descriptor is not valid (The data package has an error: property "contributors[].email" is not valid "email") (The data package has an error: property "contributors[].email" is not valid "email")
+Total number of packages: 13786
+
+You can see in the example above that 1000 packages were downloaded from a total of 13786 packages. You can download other packages by passing an offset:
+ +
+import frictionless
+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', ignore_package_errors=True, results_offset=1000)
+c = Catalog(control=ckan_control)
+
+
 This will download 1000 packages after the first 1000 packages.
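 Walking through a large instance in batches comes down to simple offset arithmetic. The sketch below computes the num_packages and results_offset pairs needed to cover the 13786 packages reported above in batches of 1000. The helper batch_windows is hypothetical, and the sketch assumes results_offset counts packages, as the example above suggests.

```python
def batch_windows(total, batch_size):
    """Yield keyword arguments covering `total` results in fixed-size batches."""
    for offset in range(0, total, batch_size):
        yield {"num_packages": batch_size, "results_offset": offset}


windows = list(batch_windows(13786, 1000))
print(len(windows))
print(windows[1])
```

 Each dict could then be passed along with baseurl when creating a CkanControl for that batch; the last batch simply returns fewer packages than requested.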
+To fetch all packages from an organization you can use the CKAN Control
+parameter organization_name
 . For example, if you want to fetch all datasets from the
+organization https://legado.dados.gov.br/organization/agencia-espacial-brasileira-aeb
you can do
+as follows:
import frictionless
+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', organization_name='agencia-espacial-brasileira-aeb')
+c = Catalog(control=ckan_control)
+
+
+ Similarly, if you want to download all datasets from a CKAN Group you can pass
+the parameter group_id
to the CKAN Control as:
import frictionless
+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', group_id='ciencia-informacao-e-comunicacao')
+c = Catalog(control=ckan_control)
+
+
+ You can also fetch only the datasets that are returned by the CKAN Package
+Search endpoint.
+You can pass the search parameters as the parameter search
to CKAN Control.
import frictionless
+from frictionless import portals, Catalog
+
+ckan_control = portals.CkanControl(baseurl='https://legado.dados.gov.br', search={'q': 'name:bolsa*'})
+c = Catalog(control=ckan_control)
+
+
+ Ckan control representation
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, baseurl: Optional[str] = None, dataset: Optional[str] = None, apikey: Optional[str] = None, ignore_package_errors: Optional[bool] = False, ignore_schema: Optional[bool] = False, group_id: Optional[str] = None, organization_name: Optional[str] = None, search: Optional[Dict[str, Any]] = None, num_packages: Optional[int] = None, results_offset: Optional[int] = None, allow_update: Optional[bool] = False) -> None
+baseurl: Endpoint URL for the CKAN instance, e.g. https://dados.gov.br
+Optional[str]
+dataset: Unique identifier of the dataset to read or write.
+Optional[str]
+apikey: The access token to authenticate to the CKAN instance. It is required to write files to the CKAN instance.
+Optional[str]
+ignore_package_errors: Ignore Package errors in a Catalog. If multiple packages are being downloaded and one fails with an invalid descriptor, continue downloading the rest.
+Optional[bool]
+ignore_schema: Ignore dataset resources schemas.
+Optional[bool]
+group_id: CKAN Group id to get datasets in a Catalog.
+Optional[str]
+organization_name: CKAN Organization name to get datasets in a Catalog.
+Optional[str]
+search: CKAN Search parameters as defined on https://docs.ckan.org/en/2.9/api/#ckan.logic.action.get.package_search
+Optional[Dict[str, Any]]
+num_packages: Maximum number of packages to fetch.
+Optional[int]
+results_offset: Results page number.
+Optional[int]
+allow_update: Update a dataset on publish if an id is provided on the package descriptor.
+Optional[bool]
+The GitHub read and publish feature makes it easy to share data between Frictionless and GitHub repositories. All read/write functionality is a wrapper around the PyGithub library, which is used under the hood to connect to the GitHub API.
+We need to install github extra dependencies to use this feature:
+ +pip install frictionless[github] --pre
+pip install 'frictionless[github]' --pre # for zsh shell
+
+
+ You can read data from a github repository as follows:
+ +from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://github.com/fdtester/test-repo-without-datapackage")
+print(package)
+
+
+ {'name': 'test-repo-without-datapackage',
+ 'resources': [{'name': 'capitals',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'},
+ {'name': 'countries',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'},
+ {'name': 'student',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/student.xlsx',
+ 'scheme': 'https',
+ 'format': 'xlsx',
+ 'mediatype': 'application/vnd.ms-excel'}]}
+
+
+
+ To increase the access limit, pass 'apikey' as the param to the reader function as follows:
+ +from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.GithubControl(apikey=apikey)
+package = Package("https://github.com/fdtester/test-repo-without-datapackage", control=control)
+print(package)
+
+
+ The reader
 function can read a package from repos with or without a data package descriptor. If the repo does not have a descriptor, it creates one named after the repo, as shown in the example above. By default, the function reads files of type csv, xlsx and xls, but we can set the file types using control parameters.
If the repo has a descriptor it simply returns the descriptor as shown below
+ +from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
+print(package)
+
+
+{'name': 'test-tabulator',
+ 'resources': [{'name': 'first-resource',
+ 'path': 'table.xls',
+ 'schema': {'fields': [{'name': 'id', 'type': 'number'},
+ {'name': 'name', 'type': 'string'}]}},
+ {'name': 'number-two',
+ 'path': 'table-reverse.csv',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}]}
+
+Once you have read the package from the repo, you can easily access the resources and their data, for example:
+ +from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://github.com/fdtester/test-repo-without-datapackage")
+pprint(package.get_resource('capitals').read_rows())
+
+
+ [{'id': 1, 'cid': 1, 'name': 'London'},
+ {'id': 2, 'cid': 2, 'name': 'Paris'},
+ {'id': 3, 'cid': 3, 'name': 'Berlin'},
+ {'id': 4, 'cid': 4, 'name': 'Rome'},
+ {'id': 5, 'cid': 5, 'name': 'Lisbon'}]
+
+Catalog is a container for packages. We can read one or more repositories from GitHub and create a catalog.
+ +from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.GithubControl(search="'TestAction: Read' in:readme", apikey=apikey)
+catalog = Catalog(
+ "https://github.com/fdtester", control=control
+ )
+print("Total packages", len(catalog.packages))
+print(catalog.packages[:2])
+
+
+ Total packages 4
+[{'resources': [{'name': 'capitals',
+ 'type': 'table',
+ 'path': 'data/capitals.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'encoding': 'utf-8',
+ 'mediatype': 'text/csv',
+ 'dialect': {'csv': {'skipInitialSpace': True}},
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'cid', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}]},
+ {'name': 'test-repo-jquery',
+ 'resources': [{'name': 'country-1',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}]}]
+
+To read a catalog, we need an authenticated user, so we have to pass the token as 'apikey' to the function. In the above example we use search text to narrow the repositories down to a small number. The search field is not mandatory.
+We can simply use 'control' parameters and get the same result as above, for example:
+ +from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.GithubControl(search="'TestAction: Read' in:readme", user="fdtester", apikey=apikey)
+catalog = Catalog(control=control)
+print("Total packages", len(catalog.packages))
+print(catalog.packages[:2])
+
+
 As shown in the example above, we can use different qualifiers to search the repos. The above example searches for all the repos that have the text 'TestAction: Read' in their readme files. Similarly, we can use many different qualifiers and combinations of them. For the full list of qualifiers, see the GitHub documentation here.
+Some examples of the qualifiers:
+‘jquery’ in:name
+‘jquery’ in:name user:name
+sort:updated-asc ‘TestAction: Read’ in:readme
+
+To read the repositories of user 'fdtester' that have 'jquery' in their name, we write the search query as follows:
+ +from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.GithubControl(apikey=apikey, search="user:fdtester jquery in:name")
+catalog = Catalog(control=control)
+print(catalog.packages)
+
+
+ [{'name': 'test-repo-jquery',
+ 'resources': [{'name': 'country-1',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}]}]
+
+There is only one repository having 'jquery' in name for this user's account, so it returned only one repository.
+We can also read repositories in a defined order using the 'sort' param or qualifier. Here we read the repos that have the text 'TestAction: Read' in their readme file, most recently updated first:
+ +from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.GithubControl(apikey=apikey, search="user:fdtester sort:updated-desc 'TestAction: Read' in:readme")
+catalog = Catalog(control=control)
+for index,package in enumerate(catalog.packages):
+ print(f"package:{index}", "\n")
+ print(package)
+
+
+ package:0
+
+{'name': 'test-repo-without-datapackage',
+ 'resources': [{'name': 'capitals',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'},
+ {'name': 'countries',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'},
+ {'name': 'student',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/student.xlsx',
+ 'scheme': 'https',
+ 'format': 'xlsx',
+ 'mediatype': 'application/vnd.ms-excel'}]}
+package:1
+
+{'name': 'test-repo-jquery',
+ 'resources': [{'name': 'country-1',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}]}
+package:2
+
+{'resources': [{'name': 'capitals',
+ 'type': 'table',
+ 'path': 'data/capitals.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'encoding': 'utf-8',
+ 'mediatype': 'text/csv',
+ 'dialect': {'csv': {'skipInitialSpace': True}},
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'cid', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}]}
+package:3
+
+{'name': 'test-tabulator',
+ 'resources': [{'name': 'first-resource',
+ 'path': 'table.xls',
+ 'schema': {'fields': [{'name': 'id', 'type': 'number'},
+ {'name': 'name', 'type': 'string'}]}},
+ {'name': 'number-two',
+ 'path': 'table-reverse.csv',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}]}
+
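 The search strings used in the examples above follow a regular shape, so they can be assembled programmatically. The helper build_search below is hypothetical and only concatenates the qualifiers already shown in this section (user, sort, in); it is a sketch, not part of the framework or of the GitHub API.

```python
def build_search(keywords, where=None, user=None, sort=None):
    """Assemble a repository search string from keywords and a few qualifiers."""
    parts = []
    if user:
        parts.append(f"user:{user}")
    if sort:
        parts.append(f"sort:{sort}")
    # Quote multi-word keywords so they are searched as one phrase.
    parts.append(f"'{keywords}'" if " " in keywords else keywords)
    if where:
        parts.append(f"in:{where}")
    return " ".join(parts)


by_name = build_search("jquery", where="name", user="fdtester")
by_readme = build_search(
    "TestAction: Read", where="readme", user="fdtester", sort="updated-desc"
)
print(by_name)
print(by_readme)
```

 The resulting strings are the same ones passed as the search parameter in the two examples above.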
+To write data to the repository, we use Package.publish
function as follows:
from pprint import pprint
+from frictionless import portals, Package
+
+package = Package('1174/datapackage.json')
+control = portals.GithubControl(repo="test-new-repo-doc", name='FD', email=email, apikey=apikey)
+response = package.publish(control=control)
+print(response)
+
+
+ Repository(full_name="fdtester/test-new-repo-doc")
+
+We need to mention name
and email
 explicitly if the user doesn't have a name set in their GitHub account or if the email is private and hidden. Otherwise, this information is taken from the user account. To publish/write to a repository, we need an API token with 'repository write' access.
If the package is successfully published, the response is a 'Repository' instance.
+We can control the behavior of all the above three functions using various params.
+For example, to read only 'csv' files in package we use the following code:
+ +from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.GithubControl(user="fdtester", formats=["csv"], repo="test-repo-without-datapackage", apikey=apikey)
+package = Package("https://github.com/fdtester/test-repo-without-datapackage", control=control)
+print(package)
+
+
+ {'name': 'test-repo-without-datapackage',
+ 'resources': [{'name': 'capitals',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'},
+ {'name': 'countries',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}]}
+
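 The effect of the formats parameter can be pictured as a filter on file extensions. The sketch below is a library-free illustration with a hypothetical helper, filter_by_format; it does not reproduce how the plugin walks a repository.

```python
from pathlib import PurePosixPath


def filter_by_format(paths, formats):
    """Keep only the paths whose extension is in `formats` (case-insensitive)."""
    allowed = {fmt.lower() for fmt in formats}
    return [
        path
        for path in paths
        if PurePosixPath(path).suffix.lstrip(".").lower() in allowed
    ]


paths = ["data/capitals.csv", "data/countries.csv", "data/student.xlsx"]
selected = filter_by_format(paths, ["csv"])
print(selected)
```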
+In order to read first page of the search result and create a catalog, we use per_page
and page
params as follows:
from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.GithubControl(apikey=apikey, search="user:fdtester sort:updated-desc 'TestAction: Read' in:readme", per_page=1, page=1)
+catalog = Catalog(control=control)
+print(catalog.packages)
+
+
+ [{'name': 'test-repo-jquery',
+ 'resources': [{'name': 'country-1',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}]}]
+
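 To see how per_page and page interact, here is a small sketch of page arithmetic over an already sorted list of results. The helper page_slice and the repository names are hypothetical; pages are assumed to be numbered from 1.

```python
def page_slice(items, per_page=30, page=1):
    """Return the items that fall on the given 1-based page."""
    start = (page - 1) * per_page
    return items[start:start + per_page]


repos = ["repo-a", "repo-b", "repo-c", "repo-d"]
first = page_slice(repos, per_page=1, page=1)
second_pair = page_slice(repos, per_page=2, page=2)
print(first, second_pair)
```

 With per_page=1 and page=1 only the first search result is returned, which is why the catalog above contains a single package.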
+Similarly, we can also control the write function using params as follows:
+from pprint import pprint
+from frictionless import portals, Package
+
+package = Package('datapackage.json')
+control = portals.GithubControl(repo="test-repo", name='FD Test', email="test@gmail", apikey=apikey)
+response = package.publish(control=control)
+print(response)
+
+Repository(full_name="fdtester/test-repo")
+
+Github control representation
+(*, title: Optional[str] = None, description: Optional[str] = None, apikey: Optional[str] = None, basepath: Optional[str] = None, email: Optional[str] = None, formats: Optional[List[str]] = [csv, tsv, xlsx, xls, jsonl, ndjson], name: Optional[str] = None, order: Optional[str] = None, page: Optional[int] = None, per_page: Optional[int] = 30, repo: Optional[str] = None, search: Optional[str] = None, sort: Optional[str] = None, user: Optional[str] = None, filename: Optional[str] = None, enable_pages: Optional[bool] = None) -> None
+apikey: The access token to authenticate to the GitHub API. It is required to write files to a GitHub repo. For reading it is optional; however, using an apikey increases the API access limit from 60 to 5000 requests per hour. To write, the access token has to have write repository access.
+Optional[str]
+basepath: The base folder that the package and resource files will be written to.
+Optional[str]
+email: Email is used while publishing the data to the GitHub repo. It should be set explicitly if the primary email for the GitHub account is not set to public.
+Optional[str]
+formats: Instructs the plugin to only read the specified types of files. By default it is set to 'csv,xls,xlsx'.
+Optional[List[str]]
+name: Name of the GitHub user, which is used while publishing the data. It should be provided explicitly if the name of the user is not set in the GitHub account.
+Optional[str]
+order: The order in which to retrieve the data sorted by the 'sort' param. It can be one of: 'asc', 'desc'. This parameter is ignored if 'sort' is not provided.
+Optional[str]
+page: If specified, only the given page is returned.
+Optional[int]
+per_page: The number of results per page. Default value is 30. Max value is 100.
+Optional[int]
+repo: Name of the repo to read or write.
+Optional[str]
+search: Search query containing one or more search keywords and qualifiers to filter the repositories. For example, 'windows+label:bug+language:python'.
+Optional[str]
+sort: Sorts the result of the query by number of stars, forks, help-wanted-issues or updated. By default the results are sorted by best match in desc order.
+Optional[str]
+user: Username of the GitHub account.
+Optional[str]
+filename: Custom data package file name while publishing the data. By default it will use 'datapackage.json'.
+Optional[str]
+Optional[bool]
+The Zenodo API makes data sharing between Frictionless Framework and Zenodo easy. Data can be read from a Zenodo repo as well as written to Zenodo seamlessly. The API uses the 'zenodopy' library underneath to communicate with the Zenodo REST API.
+We need to install zenodo extra dependencies to use this feature:
+ +pip install frictionless[zenodo] --pre
+pip install 'frictionless[zenodo]' --pre # for zsh shell
+
+
+ You can read data from a zenodo repository as follows:
+ +from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://zenodo.org/record/7078768")
+package.infer()
+print(package)
+
+
+ {'title': 'Frictionless Data Test Dataset Without Descriptor',
+ 'resources': [{'name': 'capitals',
+ 'type': 'table',
+ 'path': 'capitals.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'encoding': 'utf-8',
+ 'mediatype': 'text/csv',
+ 'dialect': {'csv': {'skipInitialSpace': True}},
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'cid', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}},
+ {'name': 'table',
+ 'type': 'table',
+ 'path': 'table.xls',
+ 'scheme': 'https',
+ 'format': 'xls',
+ 'encoding': 'utf-8',
+ 'mediatype': 'application/vnd.ms-excel',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}]}
+
+To increase the access limit, pass 'apikey' as the param to the reader function as follows:
+ +from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.ZenodoControl(apikey=apikey)
+package = Package("https://zenodo.org/record/7078768", control=control)
+print(package)
+
+
+ The reader
 function can read a package from repos with or without a data package descriptor. If the repo does not have a descriptor, it creates one named after the repo, as shown in the example above. By default, the function reads the file types supported by Frictionless Framework, such as csv, xlsx and xls, but we can also set the file types using control parameters.
If the repo has a descriptor it simply returns the descriptor as shown below:
+ +from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://zenodo.org/record/7078760")
+package.infer()
+print(package)
+
+
+ {'name': 'testing',
+ 'title': 'Frictionless Data Test Dataset',
+ 'resources': [{'name': 'data',
+ 'path': 'data.csv',
+ 'schema': {'fields': [{'name': 'id',
+ 'type': 'string',
+ 'constraints': {'required': True}},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'description', 'type': 'string'},
+ {'name': 'amount', 'type': 'number'}],
+ 'primaryKey': ['id']}},
+ {'name': 'data2',
+ 'path': 'data2.csv',
+ 'schema': {'fields': [{'name': 'parent', 'type': 'string'},
+ {'name': 'comment', 'type': 'string'}],
+ 'foreignKeys': [{'fields': ['parent'],
+ 'reference': {'resource': 'data',
+ 'fields': ['id']}}]}}]}
+
+Once you have read the package from the repo, you can easily access the resources and their data, for example:
+ +from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://zenodo.org/record/7078760")
+pprint(package.get_resource('data').read_rows())
+
+
+ [{'amount': Decimal('10000.5'),
+ 'description': 'Taxes we collect',
+ 'id': 'A3001',
+ 'name': 'Taxes'},
+ {'amount': Decimal('2000.5'),
+ 'description': 'Parking fees we collect',
+ 'id': 'A5032',
+ 'name': 'Parking Fees'}]
+
+You can apply any function available in Frictionless Framework. Here is an example of applying validation to the package that was read.
+ +from pprint import pprint
+from frictionless import portals, Package
+
+package = Package("https://zenodo.org/record/7078760")
+report = package.validate()
+pprint(report)
+
+
+ {'valid': True,
+ 'stats': {'tasks': 1, 'warnings': 0, 'errors': 0, 'seconds': 0.655},
+ 'warnings': [],
+ 'errors': [],
+ 'tasks': [{'valid': True,
+ 'name': 'first-http-resource',
+ 'type': 'table',
+ 'place': 'https://raw.githubusercontent.com/fdtester/test-repo-with-datapackage-yaml/master/data/capitals.csv',
+ 'labels': ['id', 'cid', 'name'],
+ 'stats': {'md5': '154d822b8c2aa259867067f01c0efee5',
+ 'sha256': '5ec3d8a4d137891f2f19ab9d244cbc2c30a7493f895c6b8af2506d9b229ed6a8',
+ 'bytes': 76,
+ 'fields': 3,
+ 'rows': 5,
+ 'warnings': 0,
+ 'errors': 0,
+ 'seconds': 0.651},
+ 'warnings': [],
+ 'errors': []}]}
+
+
+A Catalog is a container for packages. We can read one or more repositories from Zenodo and create a catalog:
+ +from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.ZenodoControl(search='notes:"TDWD"')
+catalog = Catalog(control=control)
+catalog.infer()
+print("Total packages", len(catalog.packages))
+print(catalog.packages)
+
+
+ Total packages 2
+[{'title': 'Frictionless Data Test Dataset Without Descriptor',
+ 'resources': [{'name': 'countries',
+ 'type': 'table',
+ 'path': 'countries.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'encoding': 'utf-8',
+ 'mediatype': 'text/csv',
+ 'dialect': {'headerRows': [2]},
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'neighbor_id', 'type': 'string'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population',
+ 'type': 'string'}]}}]}, {'title': 'Frictionless Data Test Dataset Without Descriptor',
+ 'resources': [{'name': 'capitals',
+ 'type': 'table',
+ 'path': 'capitals.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'encoding': 'utf-8',
+ 'mediatype': 'text/csv',
+ 'dialect': {'csv': {'skipInitialSpace': True}},
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'cid', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}},
+ {'name': 'table',
+ 'type': 'table',
+ 'path': 'table.xls',
+ 'scheme': 'https',
+ 'format': 'xls',
+ 'encoding': 'utf-8',
+ 'mediatype': 'application/vnd.ms-excel',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}]}]
+
+In the above example we use search text to filter the repositories and keep the result small. However, the search field is not mandatory. We can simply use 'control' parameters and create the catalog from a single repo, for example:
+ +from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.ZenodoControl(record="7078768")
+catalog = Catalog(control=control)
+catalog.infer()
+print("Total packages", len(catalog.packages))
+print(catalog.packages)
+
+
+ Total packages 1
+[{'title': 'Frictionless Data Test Dataset Without Descriptor',
+ 'resources': [{'name': 'capitals',
+ 'type': 'table',
+ 'path': 'capitals.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'encoding': 'utf-8',
+ 'mediatype': 'text/csv',
+ 'dialect': {'csv': {'skipInitialSpace': True}},
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'cid', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}},
+ {'name': 'table',
+ 'type': 'table',
+ 'path': 'table.xls',
+ 'scheme': 'https',
+ 'format': 'xls',
+ 'encoding': 'utf-8',
+ 'mediatype': 'application/vnd.ms-excel',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}]}]
+
+As shown in the first catalog example above, we can use different search queries to filter the repos. That example searches for all repos that have the text 'notes:"TDWD"' in their readme files. Similarly, we can build many different queries by combining terms, phrases, or field searches. For the full list of supported queries, check the official Zenodo documentation here.
+Some examples of the search queries are:
+"open science"
+title:"open science"
++description:"frictionless" +title:"Bionomia"
++publication_date:[2022-10-01 TO 2022-11-01] +title:"frictionless"
+
+We can search for terms such as "open science" and use '+' to mark a term as mandatory. If '+' is not specified, the term is optional and 'OR' logic is applied to the search. We can also use field search. All the search queries supported by the Zenodo REST API can be used.
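The '+' convention described above can be sketched in plain Python. This is only an illustration of how such a query string is assembled: `build_query` is a hypothetical helper, not part of Frictionless or the Zenodo API, and it does not contact Zenodo.

```python
def build_query(required=(), optional=()):
    """Assemble a search string in which '+' marks mandatory terms.

    Hypothetical helper for illustration only; it just joins strings.
    """
    parts = [f"+{term}" for term in required]  # mandatory terms get a '+' prefix
    parts.extend(optional)                     # optional terms are left as-is
    return " ".join(parts)


query = build_query(required=["frictionlessdata", "science"])
print(query)
```

The resulting string could then be passed as the `search` argument of the control.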
+If we want to read the list of repositories that match the terms "+frictionlessdata +science", we write the search query as follows:
+ +from pprint import pprint
+from frictionless import portals, Catalog
+
+control = portals.ZenodoControl(search='+frictionlessdata +science')
+catalog = Catalog(control=control)
+print("Total Packages", len(catalog.packages))
+
+
+ Total Packages 1
+
+Only one repository matches the terms '+frictionlessdata +science', so only one package is returned.
+We can also read repositories in a defined order using the 'sort' param. Here we read the repos matching 'creators.name:"FD Tester"', most recently updated first:
+ +from pprint import pprint
+from frictionless import portals, Catalog
+
+catalog = Catalog(
+ control=portals.ZenodoControl(
+ search='creators.name:"FD Tester"',
+ sort="mostrecent",
+ page=1,
+ size=1,
+ ),
+ )
+catalog.infer()
+
+
+ [{'name': 'test-repo-resources-with-http-data-csv',
+ 'title': 'Test Write File - Remote',
+ 'resources': [{'name': 'first-http-resource',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-with-datapackage-yaml/master/data/capitals.csv',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'cid', 'type': 'string'},
+ {'name': 'name', 'type': 'string'}]}}]}]
+
+To write data to the repository, we use the Package.publish function as follows:
from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.ZenodoControl(
+ metafn="data/zenodo/meta.json",
+ apikey=apikey
+ )
+package = Package("484/package-to-write/datapackage.json")
+deposition_id = package.publish(control=control)
+print(deposition_id)
+
+
+ 1123500
+
+To publish the data, we need to provide metadata for the Zenodo repo, which we send using "meta.json". To be able to publish/write to the repository, we need an API token with 'repository write' access. If the package is successfully published, the deposition_id is returned as shown in the example above.
+For testing, we can pass the sandbox URL using the base_url param:
+ +from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.ZenodoControl(
+ metafn="data/zenodo/meta.json",
+ apikey=apikey_sandbox,
+ base_url="https://sandbox.zenodo.org/api/"
+ )
+package = Package("484/package-to-write/datapackage.json")
+deposition_id = package.publish(control=control)
+
+
+ If the metadata file is not provided, then the api will read available data from the package file. Metadata will be generated using title, contributors and description from Package descriptor.
+ +from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.ZenodoControl(
+ apikey=apikey_sandbox,
+ base_url="https://sandbox.zenodo.org/api/"
+ )
+package = Package("484/package-to-write/datapackage.json")
+deposition_id = package.publish(control=control)
+
+
+ We can control the behavior of all the above three functions using various params.
+For example, to read only 'csv' files in package we use the following code:
+ +from pprint import pprint
+from frictionless import portals, Package
+
+control = portals.ZenodoControl(formats=["csv"], record="7078725", apikey=apikey)
+package = Package(control=control)
+print(package)
+
+
+ {'name': 'test-repo-without-datapackage',
+ 'resources': [{'name': 'capitals',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'},
+ {'name': 'countries',
+ 'type': 'table',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
+ 'scheme': 'https',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}]}
+
+To read the first page of the search result and create a catalog, we use the page and size params as follows:
from pprint import pprint
+from frictionless import portals, Catalog
+
+catalog = Catalog(
+ control=portals.ZenodoControl(
+        search='creators.name:"FD Tester"',
+ sort="mostrecent",
+ page=1,
+ size=1,
+ ),
+ )
+print(catalog.packages)
+
+
+ [{'name': 'test-repo-resources-with-http-data-csv',
+ 'title': 'Test Write File - Remote',
+ 'resources': [{'name': 'first-http-resource',
+ 'path': 'https://raw.githubusercontent.com/fdtester/test-repo-with-datapackage-yaml/master/data/capitals.csv',
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'cid', 'type': 'string'},
+ {'name': 'name', 'type': 'string'}]}}]}]
+
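How `page` and `size` select part of a result list can be sketched in plain Python. This is a minimal illustration that assumes 1-based page numbers, as the `page=1` example above suggests; `paginate` is a hypothetical helper, and Zenodo performs the real paging on the server.

```python
def paginate(records, page=1, size=10):
    """Return the items on a 1-based `page` holding `size` items each."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    start = (page - 1) * size          # index of the first item on the page
    return records[start:start + size]


records = ["a", "b", "c", "d", "e"]
first = paginate(records, page=1, size=1)
second = paginate(records, page=2, size=2)
print(first)
print(second)
```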
+Zenodo control representation
+(*, all_versions: Optional[int] = None, apikey: Optional[str] = None, base_url: str = https://zenodo.org/api/, title: Optional[str] = None, description: Optional[str] = None, author: Optional[str] = None, company: Optional[str] = None, bounds: Optional[str] = None, communities: Optional[str] = None, deposition_id: Optional[int] = None, doi: Optional[str] = None, formats: Optional[List[str]] = [csv, tsv, xlsx, xls, jsonl, ndjson, csv.zip, tsv.zip, xlsx.zip, xls.zip, jsonl.zip, ndjson.zip], name: Optional[str] = None, metafn: Optional[str] = None, page: Optional[str] = None, rcustom: Optional[str] = None, record: Optional[str] = None, rtype: Optional[str] = None, search: Optional[str] = None, size: Optional[int] = None, sort: Optional[str] = None, status: Optional[str] = None, subtype: Optional[str] = None, tmp_path: Optional[str] = None) -> None
+Show (true or 1) or hide (false or 0) all versions of records.
+Optional[int]
+The access token used to authenticate to the Zenodo API. It is required to write files to a Zenodo deposit resource. For reading it is optional; however, using an apikey increases the API access limit from 60 to 100 requests per hour. To write, the access token must have deposit:write access.
+Optional[str]
+Endpoint for Zenodo. By default it is set to the live site (https://zenodo.org/api). For testing uploads, we can use the sandbox, for example https://sandbox.zenodo.org/api. The sandbox does not work for reading.
+str
+Return records filtered by a geolocation bounding box. + For example, (Format bounds=143.37158,-38.99357,146.90918,-37.35269)
+Optional[str]
+Return records filtered by a geolocation bounding box. + For example, (Format bounds=143.37158,-38.99357,146.90918,-37.35269)
+Optional[str]
+Return records filtered by a geolocation bounding box. + For example, (Format bounds=143.37158,-38.99357,146.90918,-37.35269)
+Optional[str]
+Return records filtered by a geolocation bounding box. + For example, (Format bounds=143.37158,-38.99357,146.90918,-37.35269)
+Optional[str]
+Return records filtered by a geolocation bounding box. + For example, (Format bounds=143.37158,-38.99357,146.90918,-37.35269)
+Optional[str]
+Return records that are part of the specified communities. (Use of community identifier).
+Optional[str]
+Id of the deposition resource. A deposition resource is used for uploading and editing files on Zenodo.
+Optional[int]
+Digital Object Identifier (DOI). When the deposition is published, a unique DOI is registered by Zenodo, or the user can set it manually. This applies only to published depositions. If set, the record that matches this DOI is returned.
+Optional[str]
+Formats instructs the plugin to read only the specified types of files. By default it is set to '"csv", "tsv", "xlsx", "xls", "jsonl", "ndjson"'.
+Optional[List[str]]
+Custom name for a catalog or a package. Default name is 'catalog' or 'package'
+Optional[str]
+Metadata file path for the deposition resource. A deposition resource is used for uploading and editing records on Zenodo.
+Optional[str]
+Page number to retrieve from the search result.
+Optional[str]
+Return records containing the specified custom keywords. (Format custom=[field_name]:field_value)
+Optional[str]
+Unique identifier of a record. We can use it to find a specific record while creating a package or a catalog. For example, 7078768
+Optional[str]
+Return records of the specified type. (Publication, Poster, Presentation…)
+Optional[str]
+Search query containing one or more search keywords to filter the records. For example, 'notes:"TDBASIC"'.
+Optional[str]
+Number of results to return per page.
+Optional[int]
+Sort order (bestmatch or mostrecent). Prefix with minus to change from ascending to descending (e.g. -mostrecent)
+Optional[str]
+Filter result based on the deposit status (either draft or published)
+Optional[str]
+Return records that are part of the specified communities. (Use of community identifier).
+Optional[str]
+Temp path to create intermediate package/resource file/s to upload to the zenodo instance
+Optional[str]
+
+
+A file resource is the most basic one. In fact, every data file can be marked as file. For example:
from frictionless.resources import FileResource
+
+resource = FileResource(path='text.txt')
+resource.infer(stats=True)
+print(resource)
+
+
+{'name': 'text',
+ 'type': 'file',
+ 'path': 'text.txt',
+ 'scheme': 'file',
+ 'format': 'txt',
+ 'mediatype': 'text/txt',
+ 'encoding': 'utf-8',
+ 'hash': 'sha256:b9e68e1bea3e5b19ca6b2f98b73a54b73daafaa250484902e09982e07a12e733',
+ 'bytes': 5}
+
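The `hash` and `bytes` entries in the output above are a SHA-256 digest and a byte count. The following sketch shows how such values can be computed with the standard library. It is an illustration only, not Frictionless's own implementation, and `file_stats` is a hypothetical helper name.

```python
import hashlib


def file_stats(data: bytes) -> dict:
    """Compute a 'sha256:<hex>' hash string and the byte count of `data`."""
    digest = hashlib.sha256(data).hexdigest()
    return {"hash": f"sha256:{digest}", "bytes": len(data)}


stats = file_stats(b"")
print(stats)
```

For a file on disk, the same function can be applied to the bytes returned by `open(path, "rb").read()`.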
+ A json resource contains structured data such as JSON or YAML (it can be validated with JSONSchema; this is under development):
from frictionless.resources import JsonResource
+
+resource = JsonResource(path='data.json')
+resource.infer(stats=True)
+print(resource)
+
+
+{'name': 'data',
+ 'type': 'json',
+ 'path': 'data.json',
+ 'scheme': 'file',
+ 'format': 'json',
+ 'mediatype': 'text/json',
+ 'encoding': 'utf-8',
+ 'hash': 'sha256:80af3283a5c57e5d3a8d1d4099bebe639c610c4ecc8ce39fe53f9f9d9c441c4a',
+ 'bytes': 21}
+
+ We can read the contents:
+ + + +from frictionless.resources import JsonResource
+
+resource = JsonResource(path='data.json')
+resource.infer(stats=True)
+print(resource.read_data())
+
+
+{'key': 'value'}
+
+ A table resource contains a tabular data file (it can be validated with Table Schema):
from frictionless.resources import TableResource
+
+resource = TableResource(path='table.csv')
+resource.infer(stats=True)
+print(resource)
+
+
+{'name': 'table',
+ 'type': 'table',
+ 'path': 'table.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv',
+ 'encoding': 'utf-8',
+ 'hash': 'sha256:a1fd6c5ff3494f697874deeb07f69f8667e903dd94a7bc062dd57550cea26da8',
+ 'bytes': 30,
+ 'fields': 2,
+ 'rows': 2,
+ 'schema': {'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}}
+
+ We can read the contents:
+ + + +from frictionless.resources import TableResource
+
+resource = TableResource(path='table.csv')
+resource.infer(stats=True)
+print(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ A text resource represents a textual file such as a markdown document, for example:
from frictionless.resources import TextResource
+
+resource = TextResource(path='article.md')
+resource.infer(stats=True)
+print(resource)
+
+
+{'name': 'article',
+ 'type': 'text',
+ 'path': 'article.md',
+ 'scheme': 'file',
+ 'format': 'md',
+ 'mediatype': 'text/markdown',
+ 'encoding': 'utf-8',
+ 'hash': 'sha256:c3d88243a8bbb2d95787af6edd6b0017791a090d18c80765f92b486ab502cebb',
+ 'bytes': 20}
+
+ We can read the contents:
+ + + +from frictionless.resources import TextResource
+
+resource = TextResource(path='article.md')
+resource.infer(stats=True)
+print(resource.read_text())
+
+
+# Article
+
+Contents
+
+ Frictionless supports reading data from an AWS cloud source. You can read files in any format that is available in your S3 bucket.
+ +pip install frictionless[aws]
+pip install 'frictionless[aws]' # for zsh shell
+
+
+ You can read from this source using Package/Resource, for example:
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(path='s3://bucket/table.csv')
+pprint(resource.read_rows())
+
+
+ For reading from a private bucket you need to set up AWS credentials as described in the Boto3 documentation.
+A similar approach can be used for writing:
+ +from frictionless import Resource
+
+resource = Resource(path='data/table.csv')
+resource.write('s3://bucket/table.csv')
+
+
+ There is a Control to configure how Frictionless reads files in this storage. For example:
from frictionless import Resource, schemes
+
+resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+resource.write('table.new.csv', control=schemes.AwsControl(s3_endpoint_url='<url>'))
+
+
+ Aws control representation
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, s3_endpoint_url: str = https://s3.amazonaws.com) -> None
+str
+Frictionless supports working with bytes loaded into memory.
+You can read Buffer Data using the Package/Resource API, for example:
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(b'id,name\n1,english\n2,german', format='csv')
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': 'german'}]
+
+ A similar approach can be used for writing:
+ + + +from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write(scheme='buffer', format='csv')
+print(target)
+print(target.read_rows())
+
+
+{'name': 'memory',
+ 'type': 'table',
+ 'data': [],
+ 'scheme': 'buffer',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': 'german'}]
+
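For comparison, the same in-memory bytes can be parsed with only the standard library. This sketch is not how Frictionless reads buffers; it just shows what the buffer contains. Note that `csv.DictReader` leaves every value as a string, whereas the Frictionless output above shows `id` parsed as an integer.

```python
import csv
import io

# The same bytes as in the Buffer example above
buffer = b"id,name\n1,english\n2,german"

# Decode to text and parse as CSV; no type inference is performed here
reader = csv.DictReader(io.StringIO(buffer.decode("utf-8")))
rows = list(reader)
print(rows)
```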
+ You can read and write files locally with Frictionless. This is basic Frictionless functionality.
+You can read using Package/Resource, for example:
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(path='table.csv')
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ A similar approach can be used for writing:
+ + + +from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write('table-output.csv')
+print(target)
+print(target.to_view())
+
+
+{'name': 'table-output',
+ 'type': 'table',
+ 'path': 'table-output.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
++----+-----------+
+| id | name |
++====+===========+
+| 1 | 'english' |
++----+-----------+
+| 2 | 'german' |
++----+-----------+
+
+ You can read and write files split into chunks with Frictionless.
+You can read using Package/Resource, for example:
from pprint import pprint
+from frictionless import Resource
+
+resource = Resource(path='chunk1.csv', extrapaths=['chunk2.csv'])
+pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ A similar approach can be used for writing:
+ +from frictionless import Resource
+
+resource = Resource(path='table.json')
+resource.write('table{number}.json', scheme="multipart", control={"chunkSize": 1000000})
+
+
+ There is a Control to configure how Frictionless reads files using this scheme. For example:
from frictionless import Resource, schemes
+
+control = schemes.MultipartControl(chunk_size=1000000)
+resource = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+resource.write('table{number}.json', scheme="multipart", control=control)
+
+
+ Multipart control representation
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, chunk_size: int = 100000000) -> None
++ Specifies chunk size for the multipart file. +
+int
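The effect of a chunk size can be sketched in plain Python. This is a minimal illustration, not the Frictionless multipart writer: `split_chunks` and `chunk_paths` are hypothetical helpers, and numbering the files from 1 is an assumption made for the example.

```python
def split_chunks(data: bytes, chunk_size: int):
    """Split `data` into consecutive pieces of at most `chunk_size` bytes."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_paths(template: str, count: int):
    """Fill the '{number}' placeholder for each chunk (assumed to start at 1)."""
    return [template.format(number=n) for n in range(1, count + 1)]


chunks = split_chunks(b"abcdefghij", chunk_size=4)
paths = chunk_paths("table{number}.json", len(chunks))
print(chunks)
print(paths)
```

Only the last chunk may be shorter than the chunk size.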
+You can read files remotely with Frictionless. This is basic Frictionless functionality.
+You can read using Package/Resource, for example:
from pprint import pprint
+from frictionless import Resource
+
+path='https://raw.githubusercontent.com/frictionlessdata/frictionless-py/master/data/table.csv'
+resource = Resource(path=path)
+pprint(resource.read_rows())
+
+
+ [{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+A similar approach can be used for writing:
+ +from frictionless import Resource
+
+resource = Resource(path='data/table.csv')
+resource.write('https://example.com/data/table.csv') # will POST the file to the server
+
+
+ There is a Control to configure remote data, for example:
from frictionless import Resource, schemes
+
+control = schemes.RemoteControl(http_timeout=10)
+path='https://raw.githubusercontent.com/frictionlessdata/frictionless-py/master/data/table.csv'
+resource = Resource(path=path, control=control)
+print(resource.to_view())
+
+
+ +----+-----------+
+| id | name |
++====+===========+
+| 1 | 'english' |
++----+-----------+
+| 2 | '中国人' |
++----+-----------+
+
+Remote control representation
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, http_timeout: int = 10, http_preload: bool = False) -> None
++ Specifies the time to wait, if the remote server + does not respond before raising an error. The default + value is 10. +
+int
++ Preloads data to the memory if set to True. It is set + to False by default. +
+bool
+Frictionless supports using data stored as File-Like objects in Python.
+It's recommended to open files in byte mode. If a file is opened in text mode, Frictionless will try to re-open it in byte mode.
+You can read a stream using Package/Resource, for example:
from pprint import pprint
+from frictionless import Resource
+
+with open('table.csv', 'rb') as file:
+ resource = Resource(file, format='csv')
+ pprint(resource.read_rows())
+
+
+[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
+
+ A similar approach can be used for writing:
+ + + +from frictionless import Resource
+
+source = Resource(data=[['id', 'name'], [1, 'english'], [2, 'german']])
+target = source.write(scheme='stream', format='csv')
+print(target)
+print(target.to_view())
+
+
+{'name': 'memory',
+ 'type': 'table',
+ 'data': [],
+ 'scheme': 'stream',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
++----+-----------+
+| id | name |
++====+===========+
+| 1 | 'english' |
++----+-----------+
+| 2 | 'german' |
++----+-----------+
+
+ The Cell steps are responsible for cell operations like converting, replacing, or formatting, along with others.
+Converts cell values of one or more fields using arbitrary functions, method invocations or dictionary translations.
+We can provide a value to be set as a value of all cells of this field. Take into account that the value type needs to conform to the field type otherwise it will lead to a validation error:
+ +from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.cell_convert(field_name='population', value="100"),
+ ],
+)
+print(target.to_view())
+
+
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 100 |
++----+-----------+------------+
+| 2 | 'france' | 100 |
++----+-----------+------------+
+| 3 | 'spain' | 100 |
++----+-----------+------------+
+
+ Another option to modify the field's cell is to provide a mapping. It's a translation table that uses literal matching to replace values. It's usually used for string fields:
+ +from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.cell_convert(field_name='name', mapping = {'germany': 'GERMANY'}),
+ ],
+)
+print(target.to_view())
+
+
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'GERMANY' | 83 |
++----+-----------+------------+
+| 2 | 'france' | 66 |
++----+-----------+------------+
+| 3 | 'spain' | 47 |
++----+-----------+------------+
+
+ We can provide an arbitrary function to update the field cells. If you want to modify a non-string field it's really important to normalize the table first otherwise the function will be applied to a non-parsed value:
+ +from frictionless import Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.cell_convert(field_name='population', function=lambda v: v*2),
+ ],
+)
+print(target.to_view())
+
+
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 166 |
++----+-----------+------------+
+| 2 | 'france' | 132 |
++----+-----------+------------+
+| 3 | 'spain' | 94 |
++----+-----------+------------+
+
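Why normalization matters can be seen with the same doubling function in plain Python. The sketch below does not use Frictionless. It only shows what happens when the function receives a non-parsed (string) value instead of a parsed integer.

```python
double = lambda v: v * 2  # the same function as in the example above

raw = "83"         # a cell value before parsing
parsed = int(raw)  # the same cell after it has been parsed as an integer

raw_result = double(raw)        # string repetition, not arithmetic
parsed_result = double(parsed)  # arithmetic doubling
print(repr(raw_result), parsed_result)
```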
+ Convert cell + +Converts cell values of one or more fields using arbitrary functions, method +invocations or dictionary translations.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, value: Optional[Any] = None, mapping: Optional[Dict[str, Any]] = None, function: Optional[Any] = None, field_name: Optional[str] = None) -> None
+Value to set in the field's cells
+Optional[Any]
+Mapping to apply to the column
+Optional[Dict[str, Any]]
+Function to apply to the column
+Optional[Any]
+Name of the field to apply the transform on
+Optional[str]
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.cell_replace(pattern="france", replace=None),
+ steps.cell_fill(field_name="name", value="FRANCE"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 83 |
++----+-----------+------------+
+| 2 | 'FRANCE' | 66 |
++----+-----------+------------+
+| 3 | 'spain' | 47 |
++----+-----------+------------+
+
+ Fill cell + +Replaces missing values with non-missing values from the adjacent row/column.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, value: Optional[Any] = None, field_name: Optional[str] = None, direction: Optional[str] = None) -> None
+Value to replace in the field cell with missing value
+Optional[Any]
+Name of the field to replace the missing value cells
+Optional[str]
+Direction to read the non-missing value from (left/right/above)
+Optional[str]
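Filling from an adjacent row can be sketched in plain Python. This is an illustration of the idea under the assumption that "above" means copying the most recent non-missing value downward. `fill_from_above` is a hypothetical helper, not the Frictionless implementation.

```python
def fill_from_above(rows, field_name):
    """Replace None in `field_name` with the last non-missing value seen above."""
    filled = []
    previous = None
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[field_name] is None:
            row[field_name] = previous
        else:
            previous = row[field_name]
        filled.append(row)
    return filled


rows = [
    {"id": 1, "name": "germany"},
    {"id": 2, "name": None},
    {"id": 3, "name": "spain"},
]
result = fill_from_above(rows, "name")
print(result)
```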
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.cell_format(template="Prefix: {0}", field_name="name"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-------------------+------------+
+| id | name | population |
++====+===================+============+
+| 1 | 'Prefix: germany' | 83 |
++----+-------------------+------------+
+| 2 | 'Prefix: france' | 66 |
++----+-------------------+------------+
+| 3 | 'Prefix: spain' | 47 |
++----+-------------------+------------+
+
+ Format cell + +Formats all values in the given or all string fields using the `template` format string.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, template: str, field_name: Optional[str] = None) -> None
+format string to apply to cells
+str
+field name to apply template format
+Optional[str]
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.cell_interpolate(template="Prefix: %s", field_name="name"),
+ ]
+)
+pprint(target.schema)
+pprint(target.read_rows())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
+[{'id': 1, 'name': 'Prefix: germany', 'population': 83},
+ {'id': 2, 'name': 'Prefix: france', 'population': 66},
+ {'id': 3, 'name': 'Prefix: spain', 'population': 47}]
+
+ Interpolate cell + +Interpolate all values in a given or all string fields using the `template` string.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, template: str, field_name: Optional[str] = None) -> None
+template string to apply to the field cells
+str
+field name to apply template string
+Optional[str]
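The two template styles used by the format and interpolate examples above correspond to Python's two string formatting syntaxes. The sketch below assumes that `"Prefix: {0}"` is applied in `str.format` style and `"Prefix: %s"` in `%` style, as the templates suggest. It uses plain Python only.

```python
names = ["germany", "france", "spain"]

# Format-string style, as in the cell_format template
formatted = ["Prefix: {0}".format(name) for name in names]

# Percent-interpolation style, as in the cell_interpolate template
interpolated = ["Prefix: %s" % name for name in names]

print(formatted)
print(interpolated)
```

Both styles give the same text for these templates; they differ only in placeholder syntax.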
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.cell_replace(pattern="france", replace="FRANCE"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 83 |
++----+-----------+------------+
+| 2 | 'FRANCE' | 66 |
++----+-----------+------------+
+| 3 | 'spain' | 47 |
++----+-----------+------------+
+
+ Replace cell + +Replace cell values in a given field or all fields using user defined pattern.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, pattern: str, replace: str, field_name: Optional[str] = None) -> None
+Pattern to search for in single or all fields
+str
+String to replace
+str
+field name to apply the replacement to
+Optional[str]
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.cell_set(field_name="population", value=100),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 100 |
++----+-----------+------------+
+| 2 | 'france' | 100 |
++----+-----------+------------+
+| 3 | 'spain' | 100 |
++----+-----------+------------+
+
+ Set cell
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, value: Any, field_name: str) -> None
++ Value to be set in cell of the given field. +
+Any
++ Specifies the field name where to set/replace the value. +
+str
+The Field steps are responsible for managing a Table Schema's fields. You can add or remove them along with more complex operations like unpacking.
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.field_add(name="note", value="eu", descriptor={"type": "string"}),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'note', 'type': 'string'}]}
++----+-----------+------------+------+
+| id | name | population | note |
++====+===========+============+======+
+| 1 | 'germany' | 83 | 'eu' |
++----+-----------+------------+------+
+| 2 | 'france' | 66 | 'eu' |
++----+-----------+------------+------+
+| 3 | 'spain' | 47 | 'eu' |
++----+-----------+------------+------+
+
+ Add field. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: str, value: Optional[Any] = None, formula: Optional[Any] = None, function: Optional[Any] = None, position: Optional[int] = None, descriptor: Optional[types.IDescriptor] = None, incremental: bool = False) -> None
++ A human-oriented name for the field. +
+str
++ Specifies value for the field. +
+Optional[Any]
++ Evaluatable expressions to set the value for the field. The expressions are + processed using simpleeval library. +
+Optional[Any]
++ Python function to set the value for the field. +
+Optional[Any]
++ Position index where to add the field. For example, to + add the field in second position, we need to set it as 'position=2'. +
+Optional[int]
++ A dictionary, which contains metadata for the field which + describes the properties of the field. +
+Optional[types.IDescriptor]
++ Indicates if it is an incremental value. If True, the sequential value is set + to the new field. The default value is false. +
+bool
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.field_filter(names=["id", "name"]),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'}]}
++----+-----------+
+| id | name |
++====+===========+
+| 1 | 'germany' |
++----+-----------+
+| 2 | 'france' |
++----+-----------+
+| 3 | 'spain' |
++----+-----------+
+
+ Filter fields. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, names: List[str]) -> None
++ Names of the fields to be read. Other fields will be ignored. +
+List[str]
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ # separator argument can be used to set the delimiter. Default value is '-'
+ # preserve argument keeps the original fields
+ steps.field_merge(name="details", from_names=["name", "population"], preserve=True)
+ ],
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'details', 'type': 'string'}]}
++----+-----------+------------+--------------+
+| id | name | population | details |
++====+===========+============+==============+
+| 1 | 'germany' | 83 | 'germany-83' |
++----+-----------+------------+--------------+
+| 2 | 'france' | 66 | 'france-66' |
++----+-----------+------------+--------------+
+| 3 | 'spain' | 47 | 'spain-47' |
++----+-----------+------------+--------------+
+
+ Merge fields. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: str, from_names: List[str], separator: str = -, preserve: bool = False) -> None
++ Name of the new field that will be created after merge. +
+str
++ List of field names to merge. +
+List[str]
++ Separator to use while merging values of the two fields. +
+str
++ Indicates whether the source fields are preserved after merging. If True, + the fields are kept; otherwise they are removed. +
+bool
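The effect of `separator` and `preserve` can be illustrated with a plain-Python sketch. This is not how the framework implements the step: the `merge_fields` helper and the dict-based rows are hypothetical and exist only to show the semantics described above.

```python
def merge_fields(rows, name, from_names, separator="-", preserve=False):
    """Join the values of `from_names` into a new field called `name`."""
    result = []
    for row in rows:
        merged = separator.join(str(row[key]) for key in from_names)
        # Drop the source fields unless the caller asked to keep them
        kept = {k: v for k, v in row.items() if preserve or k not in from_names}
        kept[name] = merged
        result.append(kept)
    return result


rows = [
    {"id": 1, "name": "germany", "population": 83},
    {"id": 2, "name": "france", "population": 66},
]
preserved = merge_fields(rows, "details", ["name", "population"], preserve=True)
replaced = merge_fields(rows, "details", ["name", "population"])
print(preserved[0]["details"])  # germany-83
```

With `preserve=False` (the default in the signature above), only `id` and the new `details` field remain in each row.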
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.field_move(name="id", position=3),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'id', 'type': 'integer'}]}
++-----------+------------+----+
+| name | population | id |
++===========+============+====+
+| 'germany' | 83 | 1 |
++-----------+------------+----+
+| 'france' | 66 | 2 |
++-----------+------------+----+
+| 'spain' | 47 | 3 |
++-----------+------------+----+
+
+ Move field. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: str, position: int) -> None
++ Field name to move. +
+str
++ New position for the field being moved. +
+int
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ # as_object=True packs the fields into a JSON object. By default (False) they are packed into an array
+ # preserve argument keeps the original fields
+ steps.field_pack(name="details", from_names=["name", "population"], as_object=True, preserve=True)
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'details', 'type': 'object'}]}
++----+-----------+------------+-----------------------------------------+
+| id | name | population | details |
++====+===========+============+=========================================+
+| 1 | 'germany' | 83 | {'name': 'germany', 'population': '83'} |
++----+-----------+------------+-----------------------------------------+
+| 2 | 'france' | 66 | {'name': 'france', 'population': '66'} |
++----+-----------+------------+-----------------------------------------+
+| 3 | 'spain' | 47 | {'name': 'spain', 'population': '47'} |
++----+-----------+------------+-----------------------------------------+
+
+ Pack fields. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: str, from_names: List[str], as_object: bool = False, preserve: bool = False) -> None
++ Name of the new field. +
+str
++ List of fields to be packed. +
+List[str]
++ The packed value of the field will be stored as object if set to + True. +
+bool
++ Specifies whether the source fields should be preserved. If True, the fields + that were packed are kept. +
+bool
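The difference between array and object packing is easiest to see side by side. The following is a minimal plain-Python sketch of the behaviour described by the `as_object` and `preserve` parameters; `pack_fields` is a hypothetical helper, not part of the framework.

```python
def pack_fields(rows, name, from_names, as_object=False, preserve=False):
    """Pack the values of `from_names` into a single new field `name`."""
    result = []
    for row in rows:
        if as_object:
            # Object packing keeps the source field names as keys
            packed = {key: row[key] for key in from_names}
        else:
            # Array packing keeps only the values, in the given order
            packed = [row[key] for key in from_names]
        kept = {k: v for k, v in row.items() if preserve or k not in from_names}
        kept[name] = packed
        result.append(kept)
    return result


rows = [{"id": 1, "name": "germany", "population": 83}]
as_array = pack_fields(rows, "details", ["name", "population"])
as_object = pack_fields(rows, "details", ["name", "population"], as_object=True, preserve=True)
print(as_array[0]["details"])
print(as_object[0]["details"])
```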
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.field_remove(names=["id"]),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++-----------+------------+
+| name | population |
++===========+============+
+| 'germany' | 83 |
++-----------+------------+
+| 'france' | 66 |
++-----------+------------+
+| 'spain' | 47 |
++-----------+------------+
+
+ Remove field. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, names: List[str]) -> None
++ List of fields to remove. +
+List[str]
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.field_split(name="name", to_names=["name1", "name2"], pattern="a"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'name1', 'type': 'string'},
+ {'name': 'name2', 'type': 'string'}]}
++----+------------+--------+-------+
+| id | population | name1 | name2 |
++====+============+========+=======+
+| 1 | 83 | 'germ' | 'ny' |
++----+------------+--------+-------+
+| 2 | 66 | 'fr' | 'nce' |
++----+------------+--------+-------+
+| 3 | 47 | 'sp' | 'in' |
++----+------------+--------+-------+
+
+ Split field. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: str, to_names: List[str], pattern: str, preserve: bool = False) -> None
++ Name of the field to split. +
+str
++ List of names of new fields. +
+List[str]
++ Pattern to split the field value, for example: "a". +
+str
++ Whether to preserve the fields after the split. If True, + the fields are not removed after split. +
+bool
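As a rough illustration of how `pattern`, `to_names` and `preserve` interact, here is a plain-Python sketch. It assumes the pattern is treated as a regular expression and uses a hypothetical `split_field` helper; the framework's own implementation may differ in details such as how extra pieces are handled.

```python
import re


def split_field(rows, name, to_names, pattern, preserve=False):
    """Split the value of field `name` on `pattern` into the `to_names` fields."""
    result = []
    for row in rows:
        parts = re.split(pattern, row[name])
        kept = {k: v for k, v in row.items() if preserve or k != name}
        # Pair each new field name with one piece; extra pieces are dropped in this sketch
        kept.update(zip(to_names, parts))
        result.append(kept)
    return result


rows = [
    {"id": 1, "name": "germany", "population": 83},
    {"id": 2, "name": "france", "population": 66},
]
split = split_field(rows, "name", ["name1", "name2"], "a")
print(split[0])
```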
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.field_update(name="id", value=[1, 1], descriptor={"type": "string"}),
+ steps.field_unpack(name="id", to_names=["id2", "id3"]),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'id2', 'type': 'any'},
+ {'name': 'id3', 'type': 'any'}]}
++-----------+------------+-----+-----+
+| name | population | id2 | id3 |
++===========+============+=====+=====+
+| 'germany' | 83 | 1 | 1 |
++-----------+------------+-----+-----+
+| 'france' | 66 | 1 | 1 |
++-----------+------------+-----+-----+
+| 'spain' | 47 | 1 | 1 |
++-----------+------------+-----+-----+
+
+ Unpack field. + +This step can be added using the `steps` parameter for the +`transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: str, to_names: List[str], preserve: bool = False) -> None
++ Name of the field to unpack. +
+str
++ List of names for new fields that will be created + after unpacking. +
+List[str]
++ Whether to preserve the source fields after unpacking. +
+bool
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.field_update(name="id", value=str, descriptor={"type": "string"}),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'string'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++------+-----------+------------+
+| id | name | population |
++======+===========+============+
+| None | 'germany' | 83 |
++------+-----------+------------+
+| None | 'france' | 66 |
++------+-----------+------------+
+| None | 'spain' | 47 |
++------+-----------+------------+
+
+ Update field. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: str, value: Optional[Any] = None, formula: Optional[Any] = None, function: Optional[Any] = None, descriptor: Optional[types.IDescriptor] = None) -> None
++ Name of the field to update. +
+str
++ Cell value to set for the field. +
+Optional[Any]
++ Evaluatable expressions to set the value for the field. The expressions + are processed using simpleeval library. +
+Optional[Any]
++ Python function to set the value for the field. +
+Optional[Any]
++ A descriptor for the field to set the metadata. +
+Optional[types.IDescriptor]
+The Resource steps are only available for a package transformation (except for steps.resource_update
 which is also available for standalone resources). These steps cover basic resource management operations like adding or removing resources, along with the hierarchical resource_transform
step.
This step adds a new resource to a data package.
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Package(resources=[Resource(name='main', path="transform.csv")])
+target = transform(
+ source,
+ steps=[
+ steps.resource_add(name='extra', descriptor={'path': 'transform.csv'}),
+ ],
+)
+print(target.resource_names)
+print(target.get_resource('extra').schema)
+print(target.get_resource('extra').to_view())
+
+
+['main', 'extra']
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 83 |
++----+-----------+------------+
+| 2 | 'france' | 66 |
++----+-----------+------------+
+| 3 | 'spain' | 47 |
++----+-----------+------------+
+
+ Add resource. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: str, descriptor: Dict[str, Any]) -> None
++ Name of the resource to add. +
+str
++ A descriptor for the resource. +
+Dict[str, Any]
+This step removes an existing resource from a data package.
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Package(resources=[Resource(name='main', path="transform.csv")])
+target = transform(
+ source,
+ steps=[
+ steps.resource_remove(name='main'),
+ ],
+)
+print(target)
+
+
+{'resources': []}
+
+ Remove resource. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: str) -> None
++ Name of the resource to remove. +
+str
+This is a hierarchical step that transforms a single resource within a data package. Any resource step can be used as a part of this package step.
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Package(resources=[Resource(name='main', path="transform.csv")])
+target = transform(
+ source,
+ steps=[
+ steps.resource_transform(name='main', steps=[
+ steps.row_sort(field_names=['name'])
+ ]),
+ ],
+)
+print(target.resource_names)
+print(target.get_resource('main').schema)
+print(target.get_resource('main').to_view())
+
+
+['main']
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 2 | 'france' | 66 |
++----+-----------+------------+
+| 1 | 'germany' | 83 |
++----+-----------+------------+
+| 3 | 'spain' | 47 |
++----+-----------+------------+
+
+ Transform resource. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: str, steps: List[Step]) -> None
++ Name of the resource to transform. +
+str
++ List of transformation steps to apply to the given + resource. +
+List[Step]
+This step updates a resource's metadata. It can be used for both resource and package transformations.
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Package(resources=[Resource(name='main', path="transform.csv")])
+target = transform(
+ source,
+ steps=[
+ steps.resource_update(
+ name='main',
+ descriptor={'title': 'Main Resource', 'description': 'For the docs'}
+ ),
+ ],
+)
+print(target.get_resource('main'))
+
+
+{'name': 'main',
+ 'type': 'table',
+ 'title': 'Main Resource',
+ 'description': 'For the docs',
+ 'path': 'transform.csv',
+ 'scheme': 'file',
+ 'format': 'csv',
+ 'mediatype': 'text/csv'}
+
+ Update resource. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, title: Optional[str] = None, description: Optional[str] = None, name: Optional[str] = None, descriptor: types.IDescriptor) -> None
++ Name of the resource to update. +
+Optional[str]
++ New descriptor for the resource to update metadata. +
+types.IDescriptor
+These steps operate on rows: filtering, searching, slicing, sorting, and more.
+This step filters rows based on a provided formula or function.
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.row_filter(formula="id > 1"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+----------+------------+
+| id | name | population |
++====+==========+============+
+| 2 | 'france' | 66 |
++----+----------+------------+
+| 3 | 'spain' | 47 |
++----+----------+------------+
+
+ Filter rows. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, formula: Optional[Any] = None, function: Optional[Any] = None) -> None
++ Evaluatable expressions to filter the rows. Rows that match the formula + are returned and others are ignored. The expressions are processed using + the simpleeval library. +
+Optional[Any]
++ Python function to filter the row. +
+Optional[Any]
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.row_search(regex=r"^f.*"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+----------+------------+
+| id | name | population |
++====+==========+============+
+| 2 | 'france' | 66 |
++----+----------+------------+
+
+ Search rows. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, regex: str, field_name: Optional[str] = None, negate: bool = False) -> None
++ Regex pattern to search for rows. If field_name is set it + will only be applied to the specified field. For example, regex=r"^e.*". +
+str
++ Field name in which to search. +
+Optional[str]
++ Whether to invert the result. If True, all the rows that do + not match the pattern will be returned. +
+bool
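The interplay of `regex`, `field_name` and `negate` can be sketched in plain Python. The `search_rows` helper below is hypothetical and only mirrors the parameter descriptions above; it assumes the pattern is tested against the string form of each cell.

```python
import re


def search_rows(rows, regex, field_name=None, negate=False):
    """Keep rows where any cell (or the cell in `field_name`) matches `regex`."""
    prog = re.compile(regex)
    result = []
    for row in rows:
        values = [row[field_name]] if field_name else row.values()
        matched = any(prog.search(str(value)) for value in values)
        # With negate=True the rows that did NOT match are kept instead
        if matched != negate:
            result.append(row)
    return result


rows = [
    {"id": 1, "name": "germany", "population": 83},
    {"id": 2, "name": "france", "population": 66},
    {"id": 3, "name": "spain", "population": 47},
]
found = search_rows(rows, r"^f.*")
others = search_rows(rows, r"^f.*", negate=True)
print([row["name"] for row in found])  # ['france']
```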
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.row_slice(head=2),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 83 |
++----+-----------+------------+
+| 2 | 'france' | 66 |
++----+-----------+------------+
+
+ Slice rows. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, start: Optional[int] = None, stop: Optional[int] = None, step: Optional[int] = None, head: Optional[int] = None, tail: Optional[int] = None) -> None
++ Starting point from where to read the rows. If None, + defaults to the beginning. +
+Optional[int]
++ Stopping point for reading rows. If None, defaults to + the end. +
+Optional[int]
++ Step size between the rows that are read. If None, it defaults + to 1. +
+Optional[int]
++ Number of rows to read from head. +
+Optional[int]
++ Number of rows to read from the bottom. +
+Optional[int]
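The slicing parameters map naturally onto Python's own slice notation. The sketch below uses a hypothetical `slice_rows` helper over a list; giving `head` precedence over `tail`, and both over `start`/`stop`/`step`, is an assumption made for this illustration.

```python
def slice_rows(rows, start=None, stop=None, step=None, head=None, tail=None):
    """Select rows the way the parameter descriptions above suggest."""
    if head:
        return rows[:head]      # first `head` rows
    if tail:
        return rows[-tail:]     # last `tail` rows
    return rows[start:stop:step]


rows = ["germany", "france", "spain"]
print(slice_rows(rows, head=2))          # ['germany', 'france']
print(slice_rows(rows, tail=1))          # ['spain']
print(slice_rows(rows, start=1))         # ['france', 'spain']
print(slice_rows(rows, step=2))          # ['germany', 'spain']
```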
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.row_sort(field_names=["name"]),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 2 | 'france' | 66 |
++----+-----------+------------+
+| 1 | 'germany' | 83 |
++----+-----------+------------+
+| 3 | 'spain' | 47 |
++----+-----------+------------+
+
+ Sort rows. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_names: List[str], reverse: bool = False) -> None
++ List of field names by which the rows will be + sorted. If more than one field is given, the sort applies from + left to right. +
+List[str]
++ The sort will be reversed if it is set to True. +
+bool
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.row_split(field_name="name", pattern="a"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+--------+------------+
+| id | name | population |
++====+========+============+
+| 1 | 'germ' | 83 |
++----+--------+------------+
+| 1 | 'ny' | 83 |
++----+--------+------------+
+| 2 | 'fr' | 66 |
++----+--------+------------+
+| 2 | 'nce' | 66 |
++----+--------+------------+
+| 3 | 'sp' | 47 |
++----+--------+------------+
+...
+
+ Split rows. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, pattern: str, field_name: str) -> None
++ Pattern to search for in one or more fields. +
+str
++ Field name whose cell value will be split. +
+str
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.field_update(name="id", value=1),
+ steps.row_subset(subset="conflicts", field_name="id"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 83 |
++----+-----------+------------+
+| 1 | 'france' | 66 |
++----+-----------+------------+
+| 1 | 'spain' | 47 |
++----+-----------+------------+
+
+ Subset rows. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, subset: str, field_name: Optional[str] = None) -> None
++ Type of subset to select. The value can be "conflicts", "distinct", "duplicates" + or "unique". +
+str
++ Name of field to which the subset functions will be applied. +
+Optional[str]
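To give a feel for what some of the `subset` values select, here is a plain-Python sketch of three of them. The meanings used here ("duplicates" keeps rows whose key occurs more than once, "unique" keeps rows whose key occurs exactly once, "distinct" keeps the first row per key) are an interpretation of the names rather than something stated above, "conflicts" is left out, and `subset_rows` is a hypothetical helper.

```python
from collections import Counter


def subset_rows(rows, subset, field_name):
    """Select a subset of rows based on how often each key value occurs."""
    counts = Counter(row[field_name] for row in rows)
    if subset == "duplicates":
        return [row for row in rows if counts[row[field_name]] > 1]
    if subset == "unique":
        return [row for row in rows if counts[row[field_name]] == 1]
    if subset == "distinct":
        seen = set()
        result = []
        for row in rows:
            if row[field_name] not in seen:
                seen.add(row[field_name])
                result.append(row)
        return result
    raise ValueError(f"subset not covered by this sketch: {subset}")


rows = [
    {"id": 1, "name": "germany"},
    {"id": 1, "name": "france"},
    {"id": 2, "name": "spain"},
]
print([row["name"] for row in subset_rows(rows, "duplicates", "id")])
print([row["name"] for row in subset_rows(rows, "unique", "id")])
print([row["name"] for row in subset_rows(rows, "distinct", "id")])
```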
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform-groups.csv")
+target = transform(
+ source,
+ steps=[
+ steps.row_ungroup(group_name="name", selection="first"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'year', 'type': 'integer'}]}
++----+-----------+------------+------+
+| id | name | population | year |
++====+===========+============+======+
+| 3 | 'france' | 66 | 2020 |
++----+-----------+------------+------+
+| 1 | 'germany' | 83 | 2020 |
++----+-----------+------------+------+
+| 5 | 'spain' | 47 | 2020 |
++----+-----------+------------+------+
+
+ Ungroup rows. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, selection: str, group_name: str, value_name: Optional[str] = None) -> None
++ Specifies which row to return from each group. The value + can be "first", "last", "min" or "max". +
+str
++ Field name used to group the rows. One row is returned from + each group, chosen according to the 'selection'. +
+str
++ If the selection is set to "min" or "max", the rows will be grouped by + "group_name" field and min or max value will be then selected from the + "value_name" field. +
+Optional[str]
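The relationship between `selection`, `group_name` and `value_name` can be sketched in plain Python. The `ungroup_rows` helper and the sample rows below are hypothetical; the sketch sorts by the group field before grouping, which is an assumption about ordering rather than a documented guarantee.

```python
from itertools import groupby


def ungroup_rows(rows, group_name, selection, value_name=None):
    """Return one row per group, chosen according to `selection`."""
    def key(row):
        return row[group_name]

    result = []
    # sorted() is stable, so rows keep their original order inside each group
    for _, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        if selection == "first":
            result.append(group[0])
        elif selection == "last":
            result.append(group[-1])
        elif selection == "min":
            result.append(min(group, key=lambda row: row[value_name]))
        elif selection == "max":
            result.append(max(group, key=lambda row: row[value_name]))
        else:
            raise ValueError(f"unknown selection: {selection}")
    return result


# Hypothetical sample rows, two per country
rows = [
    {"id": 1, "name": "germany", "population": 83, "year": 2020},
    {"id": 2, "name": "germany", "population": 77, "year": 1920},
    {"id": 3, "name": "france", "population": 66, "year": 2020},
    {"id": 4, "name": "france", "population": 54, "year": 1920},
]
first = ungroup_rows(rows, "name", "first")
smallest = ungroup_rows(rows, "name", "min", value_name="population")
print([row["id"] for row in first])     # [3, 1]
print([row["id"] for row in smallest])  # [4, 2]
```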
+These steps are meant to be used at the table level of a resource. They range from simple operations, such as validation or writing to disk, to complex reshaping like pivoting or melting.
+Group rows by the given group_name, then apply the aggregation functions provided in the aggregation dictionary (see the example).
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform-groups.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.table_aggregate(
+ group_name="name", aggregation={"sum": ("population", sum)}
+ ),
+ ],
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'name', 'type': 'string'}, {'name': 'sum', 'type': 'any'}]}
++-----------+-----+
+| name | sum |
++===========+=====+
+| 'france' | 120 |
++-----------+-----+
+| 'germany' | 160 |
++-----------+-----+
+| 'spain' | 80 |
++-----------+-----+
+
+ Aggregate table. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, aggregation: Dict[str, Any], group_name: str) -> None
++ A dictionary of aggregation functions. The functions + could be max, min, len or sum. +
+Dict[str, Any]
++ Field by which the rows will be grouped. +
+str
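The shape of the `aggregation` dictionary (new field name mapped to a `(source_field, function)` pair, as in the example above) can be illustrated with a plain-Python sketch. The `aggregate` helper and the sample rows are hypothetical and are not the framework's implementation.

```python
from itertools import groupby


def aggregate(rows, group_name, aggregation):
    """Group rows by `group_name` and apply each aggregation to its source field."""
    def key(row):
        return row[group_name]

    result = []
    for group_value, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        out = {group_name: group_value}
        for new_name, (source_name, function) in aggregation.items():
            # A list (not a generator) is passed so that len works as well as sum/min/max
            out[new_name] = function([row[source_name] for row in group])
        result.append(out)
    return result


# Hypothetical sample rows, two per country
rows = [
    {"name": "germany", "population": 83},
    {"name": "germany", "population": 77},
    {"name": "france", "population": 66},
    {"name": "france", "population": 54},
]
totals = aggregate(rows, "name", {"sum": ("population", sum), "count": ("population", len)})
print(totals)
```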
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_attach(resource=Resource(data=[["note"], ["large"], ["mid"]])),
+ ],
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'note', 'type': 'string'}]}
++----+-----------+------------+---------+
+| id | name | population | note |
++====+===========+============+=========+
+| 1 | 'germany' | 83 | 'large' |
++----+-----------+------------+---------+
+| 2 | 'france' | 66 | 'mid' |
++----+-----------+------------+---------+
+| 3 | 'spain' | 47 | None |
++----+-----------+------------+---------+
+
+ Attach table. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, resource: Union[Resource, str]) -> None
++ Data Resource to attach to the existing table. +
+Union[Resource, str]
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_debug(function=print),
+ ],
+)
+print(target.to_view())
+
+
+{'id': 1, 'name': 'germany', 'population': 83}
+{'id': 2, 'name': 'france', 'population': 66}
+{'id': 3, 'name': 'spain', 'population': 47}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 83 |
++----+-----------+------------+
+| 2 | 'france' | 66 |
++----+-----------+------------+
+| 3 | 'spain' | 47 |
++----+-----------+------------+
+
+ Debug table. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, function: Any) -> None
++ Debug function to apply to the table row. +
+Any
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.table_diff(
+ resource=Resource(
+ data=[
+ ["id", "name", "population"],
+ [1, "germany", 83],
+ [2, "france", 50],
+ [3, "spain", 47],
+ ]
+ )
+ ),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+----------+------------+
+| id | name | population |
++====+==========+============+
+| 2 | 'france' | 66 |
++----+----------+------------+
+
+ Diff tables. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, resource: Union[Resource, str], ignore_order: bool = False, use_hash: bool = False) -> None
++ Resource with which to compare. +
+Union[Resource, str]
++ Specifies whether to ignore the order of the rows. +
+bool
++ Specifies whether to use hash or not. If yes, alternative implementation will + be used where the complement is executed by constructing an in-memory set for + all rows found in the right hand table. For more information + please see the link below: + https://petl.readthedocs.io/en/stable/transform.html#petl.transform.setops.hashcomplement +
+bool
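Conceptually, the diff keeps the rows of the source table that do not appear in the other table. The sketch below shows that idea with an in-memory set, similar in spirit to what the `use_hash` description mentions; `diff_rows` is a hypothetical helper and it ignores subtleties such as duplicate rows.

```python
def diff_rows(rows, other_rows):
    """Return the rows from `rows` that are absent from `other_rows`."""
    other = {tuple(row) for row in other_rows}  # set lookup instead of sorting
    return [row for row in rows if tuple(row) not in other]


source_rows = [
    (1, "germany", 83),
    (2, "france", 66),
    (3, "spain", 47),
]
other_rows = [
    (1, "germany", 83),
    (2, "france", 50),
    (3, "spain", 47),
]
difference = diff_rows(source_rows, other_rows)
print(difference)  # [(2, 'france', 66)]
```

Only the `france` row differs between the two tables, so it is the only row that survives.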
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.table_intersect(
+ resource=Resource(
+ data=[
+ ["id", "name", "population"],
+ [1, "germany", 83],
+ [2, "france", 50],
+ [3, "spain", 47],
+ ]
+ ),
+ ),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 83 |
++----+-----------+------------+
+| 3 | 'spain' | 47 |
++----+-----------+------------+
+
+ Intersect tables. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, resource: Union[Resource, str], use_hash: bool = False) -> None
++ Resource with which to apply intersection. +
+Union[Resource, str]
++ Specifies whether to use hash or not. If yes, an + alternative implementation will be used. For more + information please see the link below: + https://petl.readthedocs.io/en/stable/transform.html#petl.transform.setops.hashintersection +
+bool
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.table_join(
+ resource=Resource(data=[["id", "note"], [1, "beer"], [2, "vine"]]),
+ field_name="id",
+ ),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'},
+ {'name': 'note', 'type': 'string'}]}
++----+-----------+------------+--------+
+| id | name | population | note |
++====+===========+============+========+
+| 1 | 'germany' | 83 | 'beer' |
++----+-----------+------------+--------+
+| 2 | 'france' | 66 | 'vine' |
++----+-----------+------------+--------+
+
+ Join tables. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, resource: Union[Resource, str], field_name: Optional[str] = None, use_hash: bool = False, mode: str = inner) -> None
++ Resource with which to apply join. +
+Union[Resource, str]
++ Field name on which the join is performed, comparing its value between the two tables. + If not provided, a natural join is attempted. For more information, please see the following document: + https://petl.readthedocs.io/en/stable/_modules/petl/transform/joins.html +
+Optional[str]
++ Specify whether to use hash or not. If True, an alternative implementation of join will be used. +
+bool
++ Specifies which mode to use. The available modes are: "inner", "left", "right", "outer", "cross" and + "negate". The default mode is "inner". +
+str
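To make the difference between join modes concrete, here is a plain-Python sketch covering only the "inner" and "left" modes. The `join_rows` helper is hypothetical and simplified (dict rows, a single key field); the other modes listed above are not covered.

```python
def join_rows(left, right, field_name, mode="inner"):
    """Join two lists of dict rows on `field_name` (inner or left join only)."""
    index = {}
    for row in right:
        index.setdefault(row[field_name], []).append(row)
    right_fields = [k for k in (right[0] if right else {}) if k != field_name]

    result = []
    for row in left:
        matches = index.get(row[field_name], [])
        for match in matches:
            result.append({**row, **match})
        if not matches and mode == "left":
            # A left join keeps unmatched rows and fills the missing fields with None
            result.append({**row, **{k: None for k in right_fields}})
    return result


left = [
    {"id": 1, "name": "germany"},
    {"id": 2, "name": "france"},
    {"id": 3, "name": "spain"},
]
right = [{"id": 1, "note": "beer"}, {"id": 2, "note": "vine"}]
inner = join_rows(left, right, "id")
outer_left = join_rows(left, right, "id", mode="left")
print(len(inner), len(outer_left))  # 2 3
```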
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.table_melt(field_name="name"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'name', 'type': 'string'},
+ {'name': 'variable', 'type': 'string'},
+ {'name': 'value', 'type': 'any'}]}
++-----------+--------------+-------+
+| name | variable | value |
++===========+==============+=======+
+| 'germany' | 'id' | 1 |
++-----------+--------------+-------+
+| 'germany' | 'population' | 83 |
++-----------+--------------+-------+
+| 'france' | 'id' | 2 |
++-----------+--------------+-------+
+| 'france' | 'population' | 66 |
++-----------+--------------+-------+
+| 'spain' | 'id' | 3 |
++-----------+--------------+-------+
+...
+
+ Melt tables. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_name: str, variables: Optional[str] = None, to_field_names: List[str] = NOTHING) -> None
+ Field name used to melt the table. The field 'field_name' is kept + as it is, while the other fields are melted into + data. +
+str
+ List of names of the fields which will be melted into data. +
+Optional[str]
+ Labels for the new fields that will be created (by default "variable" and "value"). +
+List[str]
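The reshaping performed by a melt can be sketched in plain Python as follows. This is only an illustration of the idea under simplified assumptions (rows as dicts), not the framework's implementation; the helper `melt` is hypothetical.

```python
# Illustrative sketch of "melting" a table; NOT the frictionless implementation.
def melt(rows, field_name, to_field_names=("variable", "value")):
    """Keep `field_name` as is and turn every other field into a (variable, value) row."""
    variable_label, value_label = to_field_names
    melted = []
    for row in rows:
        for name, value in row.items():
            if name == field_name:
                continue  # the key field is kept, not melted
            melted.append(
                {field_name: row[field_name], variable_label: name, value_label: value}
            )
    return melted


rows = [
    {"id": 1, "name": "germany", "population": 83},
    {"id": 2, "name": "france", "population": 66},
]
melted = melt(rows, "name")
```

Each input row produces one output row per melted field, which matches the shape of the view printed in the example above.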
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_merge(
+ resource=Resource(data=[["id", "name", "note"], [4, "malta", "island"]])
+ ),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+ Merge tables. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, resource: Union[Resource, str], field_names: List[str] = NOTHING, sort_by_field: Optional[str] = None, ignore_fields: bool = False) -> None
++ Resource to merge with. +
+Union[Resource, str]
++ Specifies fixed headers for output table. +
+List[str]
+ Field name by which to sort the records after merging. +
+Optional[str]
+ If ignore_fields is set to True, it will merge two resources + without matching headers. +
+bool
+The table_normalize
 step normalizes an underlying tabular stream (casting types and fixing dimensions) according to a provided or inferred schema. Unless your data is very large, it's recommended to normalize a table before any other steps.
from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource("table.csv")
+print(source.read_cells())
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ ]
+)
+print(target.read_cells())
+
+
+[['id', 'name'], ['1', 'english'], ['2', '中国人']]
+[['id', 'name'], [1, 'english'], [2, '中国人']]
+
+ Normalize table. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
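As a rough illustration of what "cast types and fix dimensions" means, the sketch below normalizes a small table in plain Python. It assumes a hypothetical helper `normalize` that receives one cast function per column; the framework itself derives this from the schema, so treat this as a conceptual sketch only.

```python
# Illustrative sketch of table normalization; NOT the frictionless implementation.
def normalize(table, casts):
    """Cast each cell with the column's cast function and pad/trim rows to header width."""
    header, *rows = table
    width = len(header)
    normalized = [list(header)]
    for row in rows:
        # Fix dimensions: drop extra cells and pad missing ones with None
        cells = list(row[:width]) + [None] * (width - len(row))
        # Cast types: apply the per-column cast, leaving missing cells as None
        normalized.append(
            [cast(cell) if cell is not None else None for cast, cell in zip(casts, cells)]
        )
    return normalized


raw = [["id", "name"], ["1", "english"], ["2", "中国人"], ["3"]]
clean = normalize(raw, [int, str])
```

The third data row is deliberately short to show the padding; it is not part of the `table.csv` file used in the example above.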
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform-pivot.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.table_pivot(f1="region", f2="gender", f3="units", aggfun=sum),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+ Pivot table. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, f1: str, f2: str, f3: str, aggfun: Any) -> None
++ Field that makes the rows in the output pivot table. +
+str
++ Field that makes the columns in the output pivot table. +
+str
++ Field that forms the data in the output pivot table. +
+str
++ Function to process and create data in the output pivot table. + The function can be "sum", "max", "min", "len" etc. +
+Any
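The roles of `f1`, `f2`, `f3` and `aggfun` can be sketched in plain Python as follows. The helper `pivot` and the sample data are hypothetical (the contents of `transform-pivot.csv` are not shown here), and this is not the framework's implementation.

```python
# Illustrative sketch of a pivot; NOT the frictionless/petl implementation.
from collections import defaultdict


def pivot(rows, f1, f2, f3, aggfun):
    """Rows come from f1, columns from f2, cells are aggfun over the f3 values."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[f1], row[f2])].append(row[f3])
    row_keys = sorted({key[0] for key in groups})
    column_keys = sorted({key[1] for key in groups})
    table = []
    for row_key in row_keys:
        record = {f1: row_key}
        for column_key in column_keys:
            values = groups.get((row_key, column_key))
            # Combinations that never occur in the data are left empty
            record[column_key] = aggfun(values) if values else None
        table.append(record)
    return table


# Hypothetical sample data
sales = [
    {"region": "east", "gender": "boy", "units": 12},
    {"region": "east", "gender": "boy", "units": 8},
    {"region": "east", "gender": "girl", "units": 5},
    {"region": "west", "gender": "girl", "units": 7},
]
totals = pivot(sales, "region", "gender", "units", sum)
counts = pivot(sales, "region", "gender", "units", len)
```

Passing `sum` yields totals per region and gender, while `len` yields the number of source rows behind each cell.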
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.table_print(),
+ ]
+)
+
+
+== ======= ==========
+id name population
+== ======= ==========
+ 1 germany 83
+ 2 france 66
+ 3 spain 47
+== ======= ==========
+
+ Print table. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.table_melt(field_name="id"),
+ steps.table_recast(field_name="id"),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
++----+-----------+------------+
+| id | name | population |
++====+===========+============+
+| 1 | 'germany' | 83 |
++----+-----------+------------+
+| 2 | 'france' | 66 |
++----+-----------+------------+
+| 3 | 'spain' | 47 |
++----+-----------+------------+
+
+ Recast table. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, field_name: str, from_field_names: List[str] = NOTHING) -> None
++ Recast table by the field 'field_name'. +
+str
++ List of field names for the output table. +
+List[str]
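Recasting can be thought of as the inverse of melting: long (variable, value) rows are folded back into one wide row per key. A minimal plain-Python sketch, assuming a hypothetical helper `recast` and rows represented as dicts (not the framework's implementation):

```python
# Illustrative sketch of recasting a melted table; NOT the frictionless implementation.
def recast(rows, field_name, from_field_names=("variable", "value")):
    """Fold (variable, value) rows back into one wide row per `field_name` value."""
    variable_label, value_label = from_field_names
    wide = {}
    for row in rows:
        key = row[field_name]
        # Create the wide record on first sight of the key, then add the field
        record = wide.setdefault(key, {field_name: key})
        record[row[variable_label]] = row[value_label]
    return list(wide.values())


long_rows = [
    {"id": 1, "variable": "name", "value": "germany"},
    {"id": 1, "variable": "population", "value": 83},
    {"id": 2, "variable": "name", "value": "france"},
    {"id": 2, "variable": "population", "value": 66},
]
wide_rows = recast(long_rows, "id")
```

This mirrors the example above, where melting by "id" and then recasting by "id" returns the original table shape.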
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_normalize(),
+ steps.table_transpose(),
+ ]
+)
+print(target.schema)
+print(target.to_view())
+
+
+{'fields': [{'name': 'id', 'type': 'string'},
+ {'name': '1', 'type': 'any'},
+ {'name': '2', 'type': 'any'},
+ {'name': '3', 'type': 'any'}]}
++--------------+-----------+----------+---------+
+| id | 1 | 2 | 3 |
++==============+===========+==========+=========+
+| 'name' | 'germany' | 'france' | 'spain' |
++--------------+-----------+----------+---------+
+| 'population' | 83 | 66 | 47 |
++--------------+-----------+----------+---------+
+
+ Transpose table. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
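Conceptually, transposing swaps rows and columns, so the first column becomes the new header. A minimal plain-Python sketch using `zip` (an illustration only, not the framework's implementation; the helper `transpose` is hypothetical):

```python
# Illustrative sketch of transposing a table; NOT the frictionless implementation.
def transpose(table):
    """Swap rows and columns of a list-of-lists table (header row included)."""
    return [list(column) for column in zip(*table)]


table = [
    ["id", "name", "population"],
    [1, "germany", 83],
    [2, "france", 66],
    [3, "spain", 47],
]
transposed = transpose(table)
```

The first transposed row, `["id", 1, 2, 3]`, corresponds to the header of the view shown in the example above.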
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.cell_set(field_name="population", value="bad"),
+ steps.table_validate(),
+ ]
+)
+pprint(target.schema)
+try:
+ pprint(target.to_view())
+except Exception as exception:
+ pprint(exception)
+
+
+{'fields': [{'name': 'id', 'type': 'integer'},
+ {'name': 'name', 'type': 'string'},
+ {'name': 'population', 'type': 'integer'}]}
+FrictionlessException('[step-error] Step is not valid: "table_validate" raises "[type-error] Type error in the cell "bad" in row "2" and field "population" at position "3": type is "integer/default" " ')
+
+ Validate table. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None) -> None
+from pprint import pprint
+from frictionless import Package, Resource, transform, steps
+
+source = Resource(path="transform.csv")
+target = transform(
+ source,
+ steps=[
+ steps.table_write(path='transform.json'),
+ ]
+)
+
+
+ Let's read the output:
+ +cat transform.json
+
+
+[
+ [
+ "id",
+ "name",
+ "population"
+ ],
+ [
+ 1,
+ "germany",
+ 83
+ ],
+ [
+ 2,
+ "france",
+ 66
+ ],
+ [
+ 3,
+ "spain",
+ 47
+ ]
+]
+
+ with open('transform.json') as file:
+ print(file.read())
+
+
+[
+ [
+ "id",
+ "name",
+ "population"
+ ],
+ [
+ 1,
+ "germany",
+ 83
+ ],
+ [
+ 2,
+ "france",
+ 66
+ ],
+ [
+ 3,
+ "spain",
+ 47
+ ]
+]
+
+ Write table. + +This step can be added using the `steps` parameter +for the `transform` function.
+(*, name: Optional[str] = None, title: Optional[str] = None, description: Optional[str] = None, path: str) -> None
++ Path of the file to write the table content. +
+str
+Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data (DEVT Framework). It supports a wide range of data sources and formats, and provides integrations with popular platforms. The framework is powered by the lightweight yet comprehensive Frictionless Standards.
+$ pip install frictionless
+
+$ frictionless validate data/invalid.csv
+[invalid] data/invalid.csv
+
+ row field code message
+----- ------- ---------------- --------------------------------------------
+ 3 blank-header Header in field at position "3" is blank
+ 4 duplicate-header Header "name" in field "4" is duplicated
+ 2 3 missing-cell Row "2" has a missing cell in field "field3"
+ 2 4 missing-cell Row "2" has a missing cell in field "name2"
+ 3 3 missing-cell Row "3" has a missing cell in field "field3"
+ 3 4 missing-cell Row "3" has a missing cell in field "name2"
+ 4 blank-row Row "4" is completely blank
+ 5 5 extra-cell Row "5" has an extra value in field "5"
+
+Please visit our documentation portal:
+ + +