Commit

vignette updates
mustberuss committed Jan 4, 2025
1 parent b93ea64 commit dab22c3
Showing 2 changed files with 80 additions and 67 deletions.
36 changes: 20 additions & 16 deletions vignettes/patentsview-breaking-release.Rmd
@@ -1,6 +1,6 @@
---
slug: patentsview-breaking-release
title: "Breaking release of the Patentsview Package"
title: "Breaking Release of the Patentsview Package"
package_version: 1.0.0
author:
- Russ Allen
@@ -19,28 +19,29 @@ tags:
- API client
- USPTO
- r-universe
description: "Breaking release of the Patentsview Package"
description: "Breaking Release of the Patentsview Package"
editor:
vignette: >
%\VignetteIndexEntry{Breaking release of the Patentsview Package}
%\VignetteIndexEntry{Breaking Release of the Patentsview Package}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---



*This is a proposed Tech Note to be submitted to rOpenSci. It's here as an Rmd so it will be
knitted by the build process but can be submitted as an md file.*

The Patentsview API team has released a new version of their API, which is used by a
correspondingly new version of the [patentsview](https://docs.ropensci.org/patentsview/) R package. The problem for users
is that the API team has made **breaking changes**, so existing programs will not run
correspondingly new version of the [patentsview](https://docs.ropensci.org/patentsview/) R package.
The problem for users is that the API team has made **breaking changes**, so existing programs will not run
with the new version of the R package. Please don't shoot the messenger!

The new version of the R package handles some of the API team's changes where possible,
however an API key is now required. The Patentsview API team plans to shut down
the original version of the API on February 12, 2025. At that
time the original version of the R package will stop working.

The original version of the R package is available on CRAN with the new version available on r-universe. After the original version of the API is shut down, the new R package will be submitted to CRAN.
The original version of the R package is available on CRAN with the new version available on r-universe.
After the original version of the API is shut down, the new R package will be submitted to CRAN.

## User Impacting API changes:
1. Users will need to [request an API key](https://patentsview-support.atlassian.net/servicedesk/customer/portals) and set the environment variable `PATENTSVIEW_API_KEY` to its value.
@@ -56,10 +57,12 @@ available from the original endpoints (now some endpoint's returns are lighter,
3. Some fields are now nested and need to be fully qualified when used in a query,
for instance, ```search_pv('{"cpc_current.cpc_group_id":"A01B1/00"}')``` when using the patent endpoint.

In the fields parameter, nested fields can be fully qualified or a new API shorthand can be used, where group names can be specified. When group names are used, all of the group's nested fields will be returned by the API. For example, the new version of the API and R package will accept fields=c("assignees") when using the patent endpoint and all nested assignees fields will be returned by the API.

In the fields parameter, nested fields can be fully qualified or a new API shorthand can be used, where
group names can be specified. When group names are used, all of the group's nested fields will be returned
by the API. For example, the new version of the API and R package will accept fields=c("assignees") when
using the patent endpoint and all nested assignees' fields will be returned by the API.
4. Some field names have changed; most significantly, patent_number is now patent_id,
and some fields were removed entirely, for instance, rawinventor_first_name and rawinventor_last_name.
and some fields were removed entirely, for instance, rawinventor_first_name and rawinventor_last_name.
5. The original version of the API had queryable fields and additional fields which could be
retrieved but couldn't be part of a conditional query. That notion does not apply to the
new version of the API as all fields are now queryable. You may be able
@@ -73,11 +76,11 @@ testthat/test-api-bugs.R in the testthat folder.
7. Result set paging has changed significantly. This would matter only if users implemented their own
paging; the R package continues to handle result set paging when search_pv's all_pages = TRUE.
There is a new result set paging vignette to explain the way the API now pages,
using the `after` parameter rather than using `page` and `per_page`.
using the `size` and `after` parameters rather than using `per_page` and `page`.
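
As a minimal sketch of item 1 above, the check below confirms that the key is visible to the R session before any API call is attempted. `has_api_key()` is a hypothetical helper written for this note, not a function in the R package, and `"my-key-value"` is a placeholder.

```r
# Sketch only: has_api_key() is a hypothetical helper, not part of patentsview.
# It reports whether PATENTSVIEW_API_KEY is set to a non-empty value.
has_api_key <- function() {
  nzchar(Sys.getenv("PATENTSVIEW_API_KEY"))
}

if (!has_api_key()) {
  message(
    "PATENTSVIEW_API_KEY is not set. Set it for this session with ",
    "Sys.setenv(PATENTSVIEW_API_KEY = \"my-key-value\") or add it to .Renviron."
  )
}
```

Putting the key in `.Renviron` rather than in a script keeps it out of version control.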

The API team also [renamed the API](https://search.patentsview.org/docs/#naming-update),
PatentsView's Search API is now the PatentSearch API.
Note that the R package will retain its name; continue to use library(patentsview).
Note that the R package will retain its name; continue to use `library(patentsview)`.

## Highlights of the R package:

@@ -99,7 +102,7 @@ httr2::last_response() |> httr2::resp_body_json()

6. Three new functions were added
- `retrieve_linked_data()` to retrieve data from a HATEOAS link the API sent back, retrying if throttled
- `pad_patent_id()`
- `pad_patent_id()`, needed in custom paging using `patent_id`, see the new Result Set Paging vignette
- qry_funs$in_range() to generate range queries for you.

``` r
@@ -122,8 +125,9 @@ an endpoint's "Try it out" and "Execute" buttons. Even error responses can be i
usually pointing out what went wrong.

In a similar format, the [updated API documentation](https://search.patentsview.org/docs/docs/Search%20API/SearchAPIReference/#endpoints)
lists what each endpoint does. Additionally, the R package's fieldsdf data frame has been updated, now listing the new set
of endpoints and retrievable/queryable fields. The R package's reference pages have also been updated.
lists what each endpoint does. Additionally, the R package's fieldsdf data frame has been updated,
now listing the new set of endpoints and fields that can be queried and/or returned. The R package's
reference pages have also been updated.

## Final Thoughts
As shown in the updated Top Assignees vignette, there will be occasions now where multiple API calls are needed to retrieve the same data as in a single API call in the original version of the API and R package.
111 changes: 60 additions & 51 deletions vignettes/understanding-the-api.Rmd.orig
@@ -30,43 +30,57 @@ See this under "constructing your query", I don't remember seeing this anywhere
> related entities can be requested in the API request's fields parameter as a group by using the
> group name in the fields parameter*, or individually by specifying the required field as "{entity_type}.{subfield}".

Mind blown, so we can, for example, request all the nested application fields from the patent endpoint by simply requesting "application" in the fields list.
Mind blown, so we can, for example, request all the nested application fields from the patent endpoint
by simply requesting "application" in the fields list.

The new version of the R package will let its users leverage this same "feature". (Purists will probably frown upon using it, as they
would with a select * in SQL. It can be helpful to see exactly what fields the API can return, should the documentation
be lagging.)
<a name="all-fields">
The new version of the R package will let its users leverage this same feature, allowing
group names to be specified in the fields parameter.

```{r}
library(patentsview)

pat_res <- search_pv(qry_funs$eq(patent_id = "10568228"), fields=c("application"), method="POST")
pat_res$data$patent$application
pat_res$request
query <- qry_funs$eq(patent_id = "10568228")

```
# get_fields() now uses the new API shorthand rather than returning all of the
# group's nested field names
shorthand <- get_fields("patent", groups=c("application"))
shorthand

The results should be the same if we used ```fields=get_fields("patent", groups=c("application"))```. The difference
is that in the case above, it's the API deciding what fields to return while in the get_fields() case, we parsed
the [API's OpenAPI object](https://search.patentsview.org/static/openapi.json) when building the R package to determine what fields can be requested. The results
could be different if the API's actual return is not in sync with the API's OpenAPI object. Here we see that
the requests are different but the results are the same (we used POSTs so the requests are easier to read since they don't need to be urlencoded):
shorthand_results <- search_pv(query, fields=shorthand, method="POST")

```{r}
app_fields <- get_fields("patent", groups=c("application"))
app_fields
# Now that the R package uses httr2, we can use its last_request()
# to see what was POSTed to the API
cat(httr2::last_request()$body$data)

# Here we view the results
shorthand_results$data$patent$application

pat_res <- search_pv(qry_funs$eq(patent_id = "10568228"), fields=app_fields, method="POST")
pat_res$data$patent$application
# Now we'll explicitly request all the application fields and make a POST to the API
explicit_fields <- fieldsdf[fieldsdf$endpoint == "patent" & fieldsdf$group == "application", "field"]
explicit_results <- search_pv(query, fields=explicit_fields, method="POST")

# the request here and the one above differ, but the results were the same!
pat_res$request
# what was POSTed is different
cat(httr2::last_request()$body$data)

# but the results from the API are the same
explicit_results$data$patent$application

# (Observation reported to the API team: application_type, series_code and filing_type
# all seem to have the same values and not just in this one example.)

```
<a name="all-fields">
Now, when requesting all fields, `get_fields()` uses the API's shorthand notation
rather than explicitly calling out every field. If it didn't do that, it would
be possible to get an error when doing a GET of every field at the patent endpoint.
It currently has 20 groups and 129 fields overall.

The difference in the requests is that in the former case, it's the API deciding what fields to return
while in the latter case, we used fieldsdf.
fieldsdf is created from the [API's OpenAPI object](https://search.patentsview.org/static/openapi.json)
when building the R package and determines what fields can be requested from each endpoint. The results
could be different if the API's actual return is not in sync with the API's OpenAPI object. Here we see
that the requests are different but the results are the same (we used POSTs so the requests are easier
to read since they don't need to be urlencoded).

The motivation to adopt the API's shorthand is that, with a modest query, explicitly requesting all of the
patent endpoint's fields can be too much to send via a GET request (the resulting URL can exceed 4K).
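
To give a rough feel for the size difference, here is a sketch that compares the encoded length of an explicit field list with the shorthand. The field and group names are made up (the real ones are in fieldsdf); only the counts, 20 groups and 129 fields, come from the text above. `as_param()` is a hypothetical helper, and its encoding only approximates what is actually sent in a GET.

```r
# Hypothetical names, used only to show how the parameter grows with the
# number of fields; real field names come from fieldsdf.
explicit_fields  <- sprintf("group%02d.field_%03d", rep(1:20, length.out = 129), 1:129)
shorthand_fields <- sprintf("group%02d", 1:20)

# Hypothetical helper: build a JSON-style array of field names and URL-encode
# it, roughly as it would appear in a GET request's query string.
as_param <- function(fields) {
  json <- paste0("[", paste0('"', fields, '"', collapse = ","), "]")
  utils::URLencode(json, reserved = TRUE)
}

explicit_len  <- nchar(as_param(explicit_fields))
shorthand_len <- nchar(as_param(shorthand_fields))

# The explicit list is several times longer before the query itself is added
c(explicit = explicit_len, shorthand = shorthand_len)
```

Real field names are often longer than these placeholders, and the encoded query is added on top, which is how a GET can run past the limit mentioned above.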

## Unexpected Results

@@ -108,18 +122,19 @@ pat_res <- search_pv(patents_query, fields=patent_fields, endpoint="patent")
dl <- unnest_pv_data(pat_res$data)

# We got back all the inventors on the patents that met our search criteria. We'll filter out
# the inventors that didn't strictly meet our criteria (they came along for the ride with
# the ones that met our criteria) so that the noted behavior is clear.
# the inventors that didn't strictly meet our criteria (they're coinventors that came along for
# the ride with the ones that met our criteria) so that the noted behavior is clear.

display_inventors <-
dl$inventors %>%
filter(grepl("^(George|Thomas)", inventor_name_first ) | grepl("^(Washington|Jefferson)", inventor_name_last)) %>%
arrange(nchar(patent_id), patent_id) # numeric sort on a string field
filter(grepl("^(George|Thomas)", inventor_name_first ) | grepl("^(Washington|Jefferson)", inventor_name_last))

display_inventors

```

Some rows act as you'd expect, like patent 4078607's Thomas Jefferson. In others, two inventors
combine to meet the search criteria, like 6905071's
**Thomas** Amundsen and Matthew **Jefferson**.
Now we'll hit the inventor endpoint with a similar query, as the jupyter notebook suggests.

```{r}
@@ -140,41 +155,33 @@ inventors_query <-

inventor_fields <- c("inventor_id","inventor_name_first","inventor_name_last")
inventor_res <- search_pv(inventors_query, fields=inventor_fields, endpoint="inventor")
dl2 <- unnest_pv_data(inventor_res$data)

actual_inventors <-
dl2$inventors %>%
arrange(inventor_name_last, inventor_name_first)
actual_inventors <- unnest_pv_data(inventor_res$data)

actual_inventors
actual_inventors[[1]]
```

Now, with actual_inventors' inventor_ids in hand, we'll ask the patent endpoint for their patents.
The results are quite different than what the first query returned. (These patents would
have names matching at least one of our two famous forefathers. The first query non-intuitively
have names matching at least one of our two famous forefathers' names. The first query non-intuitively
matched names where the first and last name matches did not necessarily both occur on the same inventor.)
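
The difference can be sketched offline with the three inventor rows mentioned earlier (patents 4078607 and 6905071). This is only our reading of the behavior, reproduced with plain data frame logic. It is not the API's implementation, and the variable names are ours.

```r
# Inventor rows taken from the two patents discussed above
inventors <- data.frame(
  patent_id           = c("4078607", "6905071", "6905071"),
  inventor_name_first = c("Thomas", "Thomas", "Matthew"),
  inventor_name_last  = c("Jefferson", "Amundsen", "Jefferson"),
  stringsAsFactors    = FALSE
)

first_ok <- grepl("^(George|Thomas)", inventors$inventor_name_first)
last_ok  <- grepl("^(Washington|Jefferson)", inventors$inventor_name_last)

# Patent-level matching: some inventor on the patent meets the first-name
# condition and some (possibly different) inventor meets the last-name one
patent_level <- intersect(inventors$patent_id[first_ok],
                          inventors$patent_id[last_ok])

# Inventor-level matching: both conditions hold on the same inventor row
inventor_level <- unique(inventors$patent_id[first_ok & last_ok])

patent_level
inventor_level
```

Patent-level matching keeps both patents, while inventor-level matching keeps only 4078607. That is the gap between the first query and the two-step approach through the inventor endpoint.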

```{r}
id_query <- qry_funs$eq(inventors.inventor_id=actual_inventors$inventor_id)
id_query <- qry_funs$eq(inventors.inventor_id=actual_inventors$inventors$inventor_id)

# We need to pass fields since we're sorting (sort field has to be passed as a field)
# Without a sort we could rely on the default fields being returned if we liked

patent_fields <-c("patent_id", "inventors.inventor_name_first", "inventors.inventor_name_last")
pat_res <- search_pv(id_query, fields=patent_fields, sort=c("patent_id" = "asc"))
patent_fields <-c("patent_id", "inventors.inventor_name_first", "inventors.inventor_name_last",
"inventors.inventor_id")
pat_res <- search_pv(id_query, fields=patent_fields, sort=c(patent_id = "asc"))

dl <- unnest_pv_data(pat_res$data)

# Also, the API's sort on patent_id, a string field, puts 10568228 first at the time of
# this writing. Would that be a bug or feature? Below we'll apply our own sort
dl$patents[[1]][[1]]

# we'll repeat the same filter we used on the first query's results
# we'll repeat the same name filter we used on the first query's results
display_inventors <-
dl$inventors %>%
filter(grepl("^(George|Thomas)", inventor_name_first ) | grepl("^(Washington|Jefferson)", inventor_name_last)) %>%
arrange(nchar(patent_id), patent_id) # numeric sort on a string field
filter(grepl("^(George|Thomas)", inventor_name_first ) | grepl("^(Washington|Jefferson)", inventor_name_last)) %>%
mutate(inventor = sub(".*/([^/]+)/$", "\\1", inventor)) # extract id from HATEOAS link

# sample pre-mutate value, note that we requested inventor_id but the API sent back `inventor`
dl$inventors$inventor[[1]]

display_inventors

@@ -188,4 +195,6 @@ in R package form. The repo doesn't have a stated license but when I checked, I
> For the repo license we are looking at the [GNU General Public License v3](https://www.gnu.org/licenses/quick-guide-gplv3.html) (GPL3).

That is the same license as R itself so I don't think we've violated anything. For extra fun check
out [Russ' fork](https://github.com/mustberuss/PatentsView-Code-Snippets/blob/master/07_PatentSearch_API_demo/PV%20PatentSearch%20API%20tutorial.ipynb). There was no reply when I asked if they'd be receptive to a PR.
out [Russ' fork](https://github.com/mustberuss/PatentsView-Code-Snippets/blob/master/07_PatentSearch_API_demo/PV%20PatentSearch%20API%20tutorial.ipynb)
where there's Python code for retrieving Mr. Jefferson's patents etc. There was no reply when we asked if
they'd be receptive to a PR.
