fix(docs): github portal documentation (#1707)
Examples of the github portal documentation were pointing to resources
that no longer exist.

I updated them to have working examples. 

Also removed the unnecessary and unused `pprint` and `portals` imports.
pierrecamilleri authored Nov 21, 2024
1 parent e94ecb0 commit 72b74a8
Showing 1 changed file with 38 additions and 106 deletions.
144 changes: 38 additions & 106 deletions docs/portals/github.md
@@ -16,98 +16,54 @@ pip install 'frictionless[github]' --pre # for zsh shell
You can read data from a github repository as follows:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Package
from frictionless import Package

package = Package("https://github.com/fdtester/test-repo-without-datapackage")
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
print(package)
```
```
{'name': 'test-repo-without-datapackage',
'resources': [{'name': 'capitals',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'},
{'name': 'countries',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'},
{'name': 'student',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/student.xlsx',
'scheme': 'https',
'format': 'xlsx',
'mediatype': 'application/vnd.ms-excel'}]}
```
You can also use alias function instead, for example:
```python tabs=Python
from pprint import pprint
from frictionless import portals, Package

package = Package("https://github.com/fdtester/test-repo-without-datapackage")
print(package)
{'name': 'test-package',
'resources': [{'name': 'first-resource',
'type': 'table',
'path': 'table.xls',
'scheme': 'file',
'format': 'xls',
'mediatype': 'application/vnd.ms-excel',
'schema': {'fields': [{'name': 'id', 'type': 'number'},
{'name': 'name', 'type': 'string'}]}}]}
```

To increase the access limit, pass 'apikey' as the param to the reader function as follows:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Package

control = portals.GithubControl(apikey=apikey)
package = Package("https://github.com/fdtester/test-repo-without-datapackage", control=control)
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json", control=control)
print(package)
```

The `reader` function can read a package from repos with or without a data package descriptor. If the repo does not have a descriptor, it will create one with the name same as the repo name as shown in the example above. By default, the function reads files of type csv, xlsx and xls, but we can set the file types using control parameters.
The `reader` function can read a package from repos with or without a data package descriptor. If the repo does not have a descriptor, it will create one with the same name as the repo. By default, the function reads files of type csv, xlsx and xls, but we can set the file types using control parameters.
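
To make the effect of restricting file types concrete, here is a minimal standalone sketch. It is not frictionless's internal code: the `filter_by_formats` helper and the sample paths are hypothetical, and the default list simply mirrors the formats named in the paragraph above.

```python
from pathlib import PurePosixPath

# Formats named in the text above; assumed here for illustration only.
DEFAULT_FORMATS = ["csv", "xlsx", "xls"]


def filter_by_formats(paths, formats=None):
    """Keep only the paths whose file extension is one of the given formats."""
    allowed = {fmt.lower() for fmt in (formats or DEFAULT_FORMATS)}
    return [
        path
        for path in paths
        if PurePosixPath(path).suffix.lstrip(".").lower() in allowed
    ]


# Hypothetical repository listing.
paths = ["data/capitals.csv", "data/student.xlsx", "README.md"]
print(filter_by_formats(paths))                   # ['data/capitals.csv', 'data/student.xlsx']
print(filter_by_formats(paths, formats=["csv"]))  # ['data/capitals.csv']
```

In the real library, the equivalent choice is made through the `formats` control parameter, as in the `formats=["csv"]` example further down in this diff.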

If the repo has a descriptor it simply returns the descriptor as shown below

```python tabs=Python
from pprint import pprint
from frictionless import portals, Package
If the repo has a descriptor it simply returns the descriptor as shown above.

package = Package("https://https://github.com/fdtester/test-repo-with-datapackage-json")
```
```
print(package)
{'name': 'test-tabulator',
'resources': [{'name': 'first-resource',
'path': 'table.xls',
'schema': {'fields': [{'name': 'id', 'type': 'number'},
{'name': 'name', 'type': 'string'}]}},
{'name': 'number-two',
'path': 'table-reverse.csv',
'schema': {'fields': [{'name': 'id', 'type': 'integer'},
{'name': 'name', 'type': 'string'}]}}]}
```
Once you have read the package from the repo, you can easily access its resources and their data, for example:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Package
from frictionless import Package

package = Package("https://github.com/fdtester/test-repo-without-datapackage")
pprint(package.get_resource('capitals').read_rows())
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
print(package.get_resource('first-resource').read_rows())
```
```
[{'id': 1, 'cid': 1, 'name': 'London'},
{'id': 2, 'cid': 2, 'name': 'Paris'},
{'id': 3, 'cid': 3, 'name': 'Berlin'},
{'id': 4, 'cid': 4, 'name': 'Rome'},
{'id': 5, 'cid': 5, 'name': 'Lisbon'}]
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
```

## Reading Catalog

Catalog is a container for the packages. We can read single/multiple repositories from github and create a catalog.

```python tabs=Python
from pprint import pprint
from frictionless import portals, Catalog

control = portals.GithubControl(search="'TestAction: Read' in:readme", apikey=apikey)
@@ -138,18 +94,20 @@ Total packages 4
'format': 'csv',
'mediatype': 'text/csv'}]}]
```

To read a catalog, we need an authenticated user, so we have to pass the token as 'apikey' to the function. In the above example we are using search text to filter the repositories down to a small number. The search field is not mandatory.
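
The examples in this document use an `apikey` variable without showing where it comes from. One common approach, sketched here under assumptions, is to read the token from an environment variable. The helper name `load_github_token` and the variable name `GITHUB_TOKEN` are illustrative choices, not part of frictionless.

```python
import os


def load_github_token(env_var="GITHUB_TOKEN"):
    """Return the token stored in the given environment variable.

    Raises a clear error instead of silently passing an empty apikey.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to a GitHub personal access token")
    return token


# Intended use (requires the variable to be set in your shell):
#   apikey = load_github_token()
#   control = portals.GithubControl(apikey=apikey)
```

Keeping the token out of the source file also avoids committing it to the repository by accident.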

We can simply use 'control' parameters and get the same result as above, for example:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Catalog

control = portals.GithubControl(search="'TestAction: Read' in:readme", user="fdtester", apikey=apikey)
catalog = Catalog(control=control)
print("Total packages", len(catalog.packages))
print(catalog.packages[:2])
```

As shown in the example above, we can use different qualifiers to search the repos. The above example searches for all the repos that have the text 'TestAction: Read' in their readme files. Similarly, we can use many different qualifiers and combinations of them. For the full list of qualifiers, see the GitHub documentation [here](https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories).

Some examples of the qualifiers:
@@ -159,9 +117,10 @@ Some examples of the qualifiers:
‘jquery’ in:name user:name
sort:updated-asc ‘TestAction: Read’ in:readme
```
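
Qualifier strings like the ones above are plain text made of an optional search term followed by space-separated `name:value` pairs. As a small sketch (the `build_search` helper is hypothetical and does no quoting or validation), such a string could be assembled like this before being passed as the `search` parameter:

```python
def build_search(text=None, qualifiers=None):
    """Join free text and qualifier:value pairs into one search string."""
    parts = []
    if text:
        parts.append(text)
    # Each qualifier takes the form `name:value`; pairs are separated by spaces.
    for name, value in (qualifiers or {}).items():
        if value:
            parts.append(f"{name}:{value}")
    return " ".join(parts)


print(build_search("jquery", {"in": "name", "user": "fdtester"}))
# jquery in:name user:fdtester
```

A dictionary is used rather than keyword arguments because `in` is a reserved word in Python and cannot be written as a keyword argument name directly.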

If we want to read the repositories of user 'fdtester' that have 'jquery' in their names, we write the search query as follows:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Catalog

control = portals.GithubControl(apikey=apikey, search="user:fdtester jquery in:name")
@@ -177,11 +136,12 @@ print(catalog.packages)
'format': 'csv',
'mediatype': 'text/csv'}]}]
```

This user's account has only one repository with 'jquery' in its name, so the catalog contains only one package.

We can also read repositories in defined order using 'sort' param or qualifier. Here we are trying to read the repos with 'TestAction: Read' text in readme file in recently updated order, for example:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Catalog

control = portals.GithubControl(apikey=apikey, search="user:fdtester sort:updated-desc 'TestAction: Read' in:readme")
@@ -193,35 +153,14 @@ for index,package in enumerate(catalog.packages):
```
package:0
{'name': 'test-repo-without-datapackage',
'resources': [{'name': 'capitals',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'},
{'name': 'countries',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'},
{'name': 'student',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/student.xlsx',
'scheme': 'https',
'format': 'xlsx',
'mediatype': 'application/vnd.ms-excel'}]}
package:1
{'name': 'test-repo-jquery',
'resources': [{'name': 'country-1',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'}]}
package:2
package:1
{'resources': [{'name': 'capitals',
'type': 'table',
@@ -234,7 +173,7 @@ package:2
'schema': {'fields': [{'name': 'id', 'type': 'integer'},
{'name': 'cid', 'type': 'integer'},
{'name': 'name', 'type': 'string'}]}}]}
package:3
package:2
{'name': 'test-tabulator',
'resources': [{'name': 'first-resource',
@@ -251,7 +190,6 @@ package:3

To write data to the repository, we use the `Package.publish` function as follows:
```python tabs=Python
from pprint import pprint
from frictionless import portals, Package

package = Package('1174/datapackage.json')
@@ -273,33 +211,27 @@ We can control the behavior of all the above three functions using various param
For example, to read only 'csv' files in package we use the following code:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Package

control = portals.GithubControl(user="fdtester", formats=["csv"], repo="test-repo-without-datapackage", apikey=apikey)
package = Package("https://github.com/fdtester/test-repo-without-datapackage")
control = portals.GithubControl(user="fdtester", formats=["csv"], repo="test-repo-without-datapackage")
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
print(package)
```
```
{'name': 'test-repo-without-datapackage',
'resources': [{'name': 'capitals',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'},
{'name': 'countries',
{'name': 'test-package',
'resources': [{'name': 'first-resource',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'}]}
'path': 'table.xls',
'scheme': 'file',
'format': 'xls',
'mediatype': 'application/vnd.ms-excel',
'schema': {'fields': [{'name': 'id', 'type': 'number'},
{'name': 'name', 'type': 'string'}]}}]}
```

To read the first page of the search results and create a catalog, we use the `per_page` and `page` params as follows:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Catalog

control = portals.GithubControl(apikey=apikey, search="user:fdtester sort:updated-desc 'TestAction: Read' in:readme", per_page=1, page=1)
@@ -316,8 +248,8 @@ catalog = Catalog(control=control)
```
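
For readers unfamiliar with `per_page` and `page`, the following standalone sketch shows how 1-based page selection picks a slice of a result list. The `paginate` helper and the repository names are hypothetical; in practice the search service applies the paging, not client-side code like this.

```python
def paginate(items, per_page, page):
    """Return the items that fall on the given 1-based page."""
    if per_page < 1 or page < 1:
        raise ValueError("per_page and page must be positive")
    start = (page - 1) * per_page
    return items[start:start + per_page]


# Hypothetical search results, already sorted.
repos = ["repo-a", "repo-b", "repo-c"]
print(paginate(repos, per_page=1, page=1))  # ['repo-a']
print(paginate(repos, per_page=2, page=2))  # ['repo-c']
```

With `per_page=1, page=1`, as in the example above, only the first matching repository ends up in the catalog.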

Similarly, we can also control the write function using params as follows:

```
from pprint import pprint
from frictionless import portals, Package
package = Package('datapackage.json')