fix(docs): github portal documentation #1707

Merged
merged 2 commits into from
Nov 21, 2024
144 changes: 38 additions & 106 deletions docs/portals/github.md
@@ -16,98 +16,54 @@ pip install 'frictionless[github]' --pre # for zsh shell
You can read data from a github repository as follows:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Package
from frictionless import Package

package = Package("https://github.com/fdtester/test-repo-without-datapackage")
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
print(package)
```
```
{'name': 'test-repo-without-datapackage',
'resources': [{'name': 'capitals',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'},
{'name': 'countries',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'},
{'name': 'student',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/student.xlsx',
'scheme': 'https',
'format': 'xlsx',
'mediatype': 'application/vnd.ms-excel'}]}
```
You can also use alias function instead, for example:
```python tabs=Python
from pprint import pprint
from frictionless import portals, Package

package = Package("https://github.com/fdtester/test-repo-without-datapackage")
print(package)
{'name': 'test-package',
'resources': [{'name': 'first-resource',
'type': 'table',
'path': 'table.xls',
'scheme': 'file',
'format': 'xls',
'mediatype': 'application/vnd.ms-excel',
'schema': {'fields': [{'name': 'id', 'type': 'number'},
{'name': 'name', 'type': 'string'}]}}]}
```

To increase the access limit, pass 'apikey' as the param to the reader function as follows:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Package

control = portals.GithubControl(apikey=apikey)
package = Package("https://github.com/fdtester/test-repo-without-datapackage", control=control)
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json", control=control)
print(package)
```

The `reader` function can read package from repos with/without data package descriptor. If the repo does not have the descriptor it will create the descriptor with the name same as the repo name as shown in the example above. By default, the function reads files of type csv, xlsx and xls but we can set the file types using control parameters.
The `reader` function can read a package from repos with or without a data package descriptor. If the repo does not have a descriptor, it creates one with the same name as the repo. By default, the function reads files of type csv, xlsx and xls, but we can set the file types using control parameters.
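
The default format filter mentioned above can be pictured as a simple extension check. The following is a minimal sketch in plain Python, not frictionless's actual implementation; `DEFAULT_FORMATS` and `filter_by_format` are hypothetical names introduced only for illustration.

```python
from pathlib import PurePosixPath

# Assumption: mirrors the default file types named above (csv, xlsx, xls).
DEFAULT_FORMATS = ("csv", "xlsx", "xls")


def filter_by_format(paths, formats=DEFAULT_FORMATS):
    """Keep only the paths whose file extension is in `formats`."""
    allowed = {fmt.lower() for fmt in formats}
    return [
        path
        for path in paths
        if PurePosixPath(path).suffix.lstrip(".").lower() in allowed
    ]


repo_files = ["README.md", "data/capitals.csv", "data/countries.csv", "data/student.xlsx"]
print(filter_by_format(repo_files))                   # default formats
print(filter_by_format(repo_files, formats=["csv"]))  # csv only
```

Passing `formats=["csv"]` here plays the same role as the `formats` control parameter: it narrows which files become resources.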

If the repo has a descriptor it simply returns the descriptor as shown below

```python tabs=Python
from pprint import pprint
from frictionless import portals, Package
If the repo has a descriptor it simply returns the descriptor as shown above.

package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
```
```
print(package)
{'name': 'test-tabulator',
'resources': [{'name': 'first-resource',
'path': 'table.xls',
'schema': {'fields': [{'name': 'id', 'type': 'number'},
{'name': 'name', 'type': 'string'}]}},
{'name': 'number-two',
'path': 'table-reverse.csv',
'schema': {'fields': [{'name': 'id', 'type': 'integer'},
{'name': 'name', 'type': 'string'}]}}]}
```
Once you have read the package from the repo, you can easily access its resources and their data, for example:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Package
from frictionless import Package

package = Package("https://github.com/fdtester/test-repo-without-datapackage")
pprint(package.get_resource('capitals').read_rows())
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
print(package.get_resource('first-resource').read_rows())
```
```
[{'id': 1, 'cid': 1, 'name': 'London'},
{'id': 2, 'cid': 2, 'name': 'Paris'},
{'id': 3, 'cid': 3, 'name': 'Berlin'},
{'id': 4, 'cid': 4, 'name': 'Rome'},
{'id': 5, 'cid': 5, 'name': 'Lisbon'}]
[{'id': 1, 'name': 'english'}, {'id': 2, 'name': '中国人'}]
```

## Reading Catalog

A catalog is a container for packages. We can read one or more repositories from GitHub and create a catalog from them.

```python tabs=Python
from pprint import pprint
from frictionless import portals, Catalog

control = portals.GithubControl(search="'TestAction: Read' in:readme", apikey=apikey)
@@ -138,18 +94,20 @@ Total packages 4
'format': 'csv',
'mediatype': 'text/csv'}]}]
```

Reading a catalog requires an authenticated user, so we have to pass the token as 'apikey' to the function. The example above uses search text to narrow the result to a small number of repositories. The search field is optional.
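
Because the token is a secret, it is usually better not to hard-code it. The sketch below reads it from an environment variable before it is passed as `apikey`; the variable name `GITHUB_TOKEN` and the helper `load_apikey` are assumptions made for this example and are not part of frictionless.

```python
import os


def load_apikey(env=None, name="GITHUB_TOKEN"):
    """Return the token stored under `name`, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable to a GitHub token")
    return token


# A plain dict stands in for os.environ so the example runs anywhere.
apikey = load_apikey({"GITHUB_TOKEN": "example-token"})
# `apikey` can then be passed on, e.g. portals.GithubControl(apikey=apikey).
```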

We can simply use 'control' parameters and get the same result as above, for example:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Catalog

control = portals.GithubControl(search="'TestAction: Read' in:readme", user="fdtester", apikey=apikey)
catalog = Catalog(control=control)
print("Total packages", len(catalog.packages))
print(catalog.packages[:2])
```

As shown in the example above, we can use different qualifiers to search the repos. The above example searches for all the repos that have 'TestAction: Read' text in their readme files. Similarly, we can use many other qualifiers and combinations of them. For the full list of qualifiers, see the GitHub documentation [here](https://docs.github.com/en/search-github/searching-on-github/searching-for-repositories).
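
A search string of this kind is just free text plus space-separated `key:value` qualifiers, so it can be assembled programmatically. The helper below is a hypothetical convenience for illustration (`build_search` is not a frictionless or GitHub function); the resulting string is what would be passed as `search`.

```python
def build_search(text="", **qualifiers):
    """Join free text and `key:value` qualifiers into one search string."""
    parts = [text] if text else []
    # A trailing underscore allows Python keywords: in_="readme" -> in:readme
    parts += [f"{key.rstrip('_')}:{value}" for key, value in qualifiers.items()]
    return " ".join(parts)


search = build_search("'TestAction: Read'", in_="readme", user="fdtester", sort="updated-desc")
print(search)
```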

Some examples of the qualifiers:
@@ -159,9 +117,10 @@ Some examples of the qualifiers:
‘jquery’ in:name user:name
sort:updated-asc ‘TestAction: Read’ in:readme
```

To read the repositories of user 'fdtester' that have 'jquery' in their name, we write the search query as follows:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Catalog

control = portals.GithubControl(apikey=apikey, search="user:fdtester jquery in:name")
@@ -177,11 +136,12 @@ print(catalog.packages)
'format': 'csv',
'mediatype': 'text/csv'}]}]
```

This user's account has only one repository with 'jquery' in its name, so the search returned only that repository.

We can also read repositories in a defined order using the 'sort' param or qualifier. For example, the following reads the repos with 'TestAction: Read' text in the readme file, most recently updated first:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Catalog

control = portals.GithubControl(apikey=apikey, search="user:fdtester sort:updated-desc 'TestAction: Read' in:readme")
@@ -193,35 +153,14 @@ for index,package in enumerate(catalog.packages):
```
package:0

{'name': 'test-repo-without-datapackage',
'resources': [{'name': 'capitals',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'},
{'name': 'countries',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'},
{'name': 'student',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/student.xlsx',
'scheme': 'https',
'format': 'xlsx',
'mediatype': 'application/vnd.ms-excel'}]}
package:1

{'name': 'test-repo-jquery',
'resources': [{'name': 'country-1',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-jquery/main/country-1.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'}]}
package:2
package:1

{'resources': [{'name': 'capitals',
'type': 'table',
@@ -234,7 +173,7 @@ package:2
'schema': {'fields': [{'name': 'id', 'type': 'integer'},
{'name': 'cid', 'type': 'integer'},
{'name': 'name', 'type': 'string'}]}}]}
package:3
package:2

{'name': 'test-tabulator',
'resources': [{'name': 'first-resource',
@@ -251,7 +190,6 @@ package:3

To write data to the repository, we use the `Package.publish` function as follows:
```python tabs=Python
from pprint import pprint
from frictionless import portals, Package

package = Package('1174/datapackage.json')
@@ -273,33 +211,27 @@ We can control the behavior of all the above three functions using various param
For example, to read only 'csv' files in a package, we use the following code:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Package

control = portals.GithubControl(user="fdtester", formats=["csv"], repo="test-repo-without-datapackage", apikey=apikey)
package = Package("https://github.com/fdtester/test-repo-without-datapackage")
control = portals.GithubControl(user="fdtester", formats=["csv"], repo="test-repo-without-datapackage")
package = Package("https://github.com/fdtester/test-repo-with-datapackage-json")
print(package)
```
```
{'name': 'test-repo-without-datapackage',
'resources': [{'name': 'capitals',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/capitals.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'},
{'name': 'countries',
{'name': 'test-package',
'resources': [{'name': 'first-resource',
'type': 'table',
'path': 'https://raw.githubusercontent.com/fdtester/test-repo-without-datapackage/master/data/countries.csv',
'scheme': 'https',
'format': 'csv',
'mediatype': 'text/csv'}]}
'path': 'table.xls',
'scheme': 'file',
'format': 'xls',
'mediatype': 'application/vnd.ms-excel',
'schema': {'fields': [{'name': 'id', 'type': 'number'},
{'name': 'name', 'type': 'string'}]}}]}
```

To read the first page of the search results and create a catalog, we use the `per_page` and `page` params as follows:

```python tabs=Python
from pprint import pprint
from frictionless import portals, Catalog

control = portals.GithubControl(apikey=apikey, search="user:fdtester sort:updated-desc 'TestAction: Read' in:readme", per_page=1, page=1)
@@ -316,8 +248,8 @@ catalog = Catalog(control=control)
```
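
To make the effect of `per_page` and `page` concrete, here is a small local sketch of how such paging selects a slice of results. It assumes 1-based page numbers, as in the GitHub search API; `paginate` is a hypothetical helper, and the real paging is done on GitHub's side rather than in client code like this.

```python
def paginate(items, per_page, page):
    """Return the 1-based `page` of `items`, `per_page` items at a time."""
    if per_page < 1 or page < 1:
        raise ValueError("per_page and page must both be >= 1")
    start = (page - 1) * per_page
    return items[start:start + per_page]


repos = ["repo-a", "repo-b", "repo-c"]
print(paginate(repos, per_page=1, page=1))  # first page, one result
print(paginate(repos, per_page=2, page=2))  # second page holds the remainder
```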

Similarly, we can also control the write function using params as follows:

```
from pprint import pprint
from frictionless import portals, Package

package = Package('datapackage.json')