vignette updates

mustberuss · Jan 4, 2025 · dab22c3
1 parent b93ea64
Showing 2 changed files with 80 additions and 67 deletions.
diff --git a/vignettes/patentsview-breaking-release.Rmd b/vignettes/patentsview-breaking-release.Rmd
@@ -1,6 +1,6 @@
 ---
 slug: patentsview-breaking-release
-title: "Breaking release of the Patentsview Package"
+title: "Breaking Release of the Patentsview Package"
 package_version: 1.0.0
 author:
   - Russ Allen
@@ -19,28 +19,29 @@ tags:
   - API client
   - USPTO
   - r-universe
-description: "Breaking release of the Patentsview Package"
+description: "Breaking Release of the Patentsview Package"
 editor:
 vignette: >
-  %\VignetteIndexEntry{Breaking release of the Patentsview Package}
+  %\VignetteIndexEntry{Breaking Release of the Patentsview Package}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-
-
+*This is a proposed Tech Note to be submitted to rOpenSci.  It's here as an Rmd so it will be
+knitted by the build process but can be submitted as an md file.*
 
 The Patentsview API team has released a new version of their API, which is used by a
-correspondingly new version of the [patentsview](https://docs.ropensci.org/patentsview/) R package.  The problem for users
-is that the API team has made **breaking changes**, existing programs will not run
+correspondingly new version of the [patentsview](https://docs.ropensci.org/patentsview/) R package.
+The problem for users is that the API team has made **breaking changes**: existing programs will not run
 with the new version of the R package. Please don't shoot the messenger!
 
 The new version of the R package handles some of the API team's changes where possible, 
 however an API key is now required.  The Patentsview API team plans to shutdown 
 the original version of the API on February 12, 2025. At that
 time the original version of the R package will stop working.
 
-The original version of the R package is available on CRAN with the new version available on r-universe.  After the original version of the API is shutdown, the new R package will be submitted to CRAN.
+The original version of the R package is available on CRAN with the new version available on r-universe.
+After the original version of the API is shut down, the new R package will be submitted to CRAN.
 
 ## User Impacting API changes:
 1. Users will need to [request an API key](https://patentsview-support.atlassian.net/servicedesk/customer/portals) and set an environmental variable PATENTSVIEW_API_KEY to its value.
@@ -56,10 +57,12 @@ available from the original endpoints (now some endpoint's returns are lighter,
 3. Some fields are now nested and need to be fully qualified when used in a query,
 for instance, ```search_pv('{"cpc_current.cpc_group_id":"A01B1/00"}')``` when using the patent endpoint.
 
-   In the fields parameter, nested fields can be fully qualified or a new API shorthand can be used, where group names can specified. When group names are used, all of the group's nested fields will be returned by the API. For example, the new version of the API and R package will accept fields=c("assignees") when using the patent endpoint and all nested assignees fields will be returned by the API.
-
+   In the fields parameter, nested fields can be fully qualified or a new API shorthand can be used, where
+group names can be specified. When group names are used, all of the group's nested fields will be returned
+by the API. For example, the new version of the API and R package will accept fields=c("assignees") when
+using the patent endpoint and all nested assignees' fields will be returned by the API.
 4. Some field's names have changed, most significantly, patent_number is now patent_id,
- and some fields were removed entirely, for instance, rawinventor_first_name and rawinventor_last_name.
+and some fields were removed entirely, for instance, rawinventor_first_name and rawinventor_last_name.
 5. The original version of the API had queryable fields and additional fields which could be 
 retrieved but couldn't be part of a conditional query.  That notion does not apply to the 
 new version of the API as all fields are now queryable.  You may be able
@@ -73,11 +76,11 @@ testthat/test-api-bugs.R in the testthat folder.
 7. Result set paging has changed significantly.  This would matter only if users implemented their own
 paging, the R package continues to handle result set paging when search_pv's all_pages = TRUE. 
 There is a new result set paging vignette to explain the way the API now pages, 
-using the `after` parameter rather than using `page` and `per_page`.
+using the `size` and `after` parameters rather than using `per_page` and `page`.
 
 The API team also [renamed the API](https://search.patentsview.org/docs/#naming-update), 
 PatentsView's Search API is now the PatentSearch API. 
-Note that the R package will retain its name, continue to use library(patentsview)
+Note that the R package will retain its name, continue to use `library(patentsview)`
 
 ## Highlights of the R package:
 
@@ -99,7 +102,7 @@ httr2::last_response() |> httr2::resp_body_json()
 
 6. Three new functions were added
    - `retrieve_linked_data()` to retrieve data from a HATEOAS link the API sent back, retrying if throttled
-   - `pad_patent_id()`
+   - `pad_patent_id()`, needed in custom paging using `patent_id`, see the new Result Set Paging vignette
    - qry_funs$in_range() to generate range queries for you.
 
 ``` r
@@ -122,8 +125,9 @@ an endpoint's "Try it out" and "Execute" buttons.  Even error responses can be i
 usually pointing out what went wrong.
 
 In a similar format, the [updated API documentation](https://search.patentsview.org/docs/docs/Search%20API/SearchAPIReference/#endpoints)
-lists what each endpoint does.  Additionally, the R package's fieldsdf data frame has been updated, now listing the new set 
-of endpoints and retrievable/queryable fields.  The R package's reference pages have also been updated.
+lists what each endpoint does.  Additionally, the R package's fieldsdf data frame has been updated,
+now listing the new set of endpoints and fields that can be queried and/or returned.  The R package's
+reference pages have also been updated.
 
 ## Final Thoughts
 As shown in the updated Top Assignees vignette, there will be occasions now where multiple API calls are needed to retrieve the same data as in a single API call in the original version of the API and R package.
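
Item 1 of the API changes list above requires a `PATENTSVIEW_API_KEY` environment variable. A minimal base-R sketch of checking for it before calling the package follows. The helper name `has_api_key()` and the placeholder key value are hypothetical and not part of the patentsview package.

``` r
# Hypothetical helper (not part of patentsview):
# TRUE when the environment variable is set and non-empty.
has_api_key <- function(var = "PATENTSVIEW_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

# Set the variable for the current R session only; the value is a placeholder.
# To persist it across sessions, the same NAME=value pair could go in ~/.Renviron.
Sys.setenv(PATENTSVIEW_API_KEY = "my-placeholder-key")

if (!has_api_key()) {
  stop("Set PATENTSVIEW_API_KEY before calling search_pv()")
}
```

Checking up front gives a clear local error message instead of relying on whatever the API returns for an unauthenticated request.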

diff --git a/vignettes/understanding-the-api.Rmd.orig b/vignettes/understanding-the-api.Rmd.orig
@@ -30,43 +30,57 @@ See this under "constructing your query",  I don't remember seeing this anywhere
 > related entities can be requested in the API request's fields parameter as a group by using the
 > group name in the fields parameter*, or individually by specifying the required field as "{entity_type}.{subfield}".
 
-Mind blown, so we can, for example, request all the nested application fields from the patent endpoint by simply requesting "application" in the fields list.  
+Mind blown, so we can, for example, request all the nested application fields from the patent endpoint
+by simply requesting "application" in the fields list.  
 
-The new version of the R package will let its users leverage this same "feature".  (Purists will probably frown upon using it, as they 
-would with a select * in SQL.  It can be helpful to see exactly what fields the API can return, should the documentation
-be lagging.)
+<a name="all-fields"></a>
+The new version of the R package will let its users leverage this same feature, allowing
+group names to be specified in the fields parameter.  
 
 ```{r}
 library(patentsview)
 
-pat_res <- search_pv(qry_funs$eq(patent_id = "10568228"), fields=c("application"), method="POST")
-pat_res$data$patent$application
-pat_res$request
+query <- qry_funs$eq(patent_id = "10568228")
 
-```
+# get_fields() now uses the new API shorthand rather than returning all of the 
+# group's nested field names
+shorthand <- get_fields("patent", groups=c("application"))
+shorthand
 
-The results should be the same if we used ```fields=get_fields("patent", groups=c("application"))```.  The difference
-is that in the case above, it's the API deciding what fields to return while in the get_fields() case, we parsed
-the [API's OpenAPI object](https://search.patentsview.org/static/openapi.json) when building the R package to determine what fields can be requested.  The results 
-could be different if the API's actual return is not in sync with the API's OpenAPI object.  Here we see that
-the requests are different but the results are the same (we used POSTs so the requests are easier to read since they don't need to be urlencoded):
+shorthand_results <- search_pv(query, fields=shorthand, method="POST")
 
-```{r}
-app_fields <- get_fields("patent", groups=c("application"))
-app_fields
+# Now that the R package uses httr2, we can use its last_request()
+# to see what was POSTed to the API
+cat(httr2::last_request()$body$data)
+
+# Here we view the results
+shorthand_results$data$patent$application
 
-pat_res <- search_pv(qry_funs$eq(patent_id = "10568228"), fields=app_fields, method="POST")
-pat_res$data$patent$application
+# Now we'll explicitly request all the application fields and make a POST to the API
+explicit_fields <- fieldsdf[fieldsdf$endpoint == "patent" & fieldsdf$group == "application", "field"]
+explicit_results <- search_pv(query, fields=explicit_fields, method="POST")
 
-# the request here and the one above differ, but the results were the same!
-pat_res$request
+# what was POSTed is different
+cat(httr2::last_request()$body$data)
+
+# but the results from the API are the same
+explicit_results$data$patent$application
+
+# (Observation reported to the API team: application_type, series_code and filing_type
+# all seem to have the same values and not just in this one example.)
 
 ```
-<a name="all-fields">
-Now, when requesting all fields, `get_fields()` uses the API's shorthand notation
-rather than explicitly calling out every field.  If it didn't do that, it would
-be possible to get an error when doing a GET of every field at the patent endpoint.
-It currently has 20 groups and 129 fields overall.
+
+The difference in the requests is that in the former case, it's the API deciding what fields to return
+while in the latter case, we used fieldsdf.
+fieldsdf is created from the [API's OpenAPI object](https://search.patentsview.org/static/openapi.json)
+when building the R package and determines what fields can be requested from each endpoint.  The results 
+could be different if the API's actual return is not in sync with the API's OpenAPI object.  Here we see 
+that the requests are different but the results are the same (we used POSTs so the requests are easier 
+to read since they don't need to be urlencoded).
+
+The motivation to adopt the API's shorthand is that, with a modest query, explicitly requesting all of the
+patent endpoint's fields can be too much to send via a GET request (the resulting URL can exceed 4K). 
 
 ## Unexpected Results
 
@@ -108,18 +122,19 @@ pat_res <- search_pv(patents_query, fields=patent_fields, endpoint="patent")
 dl <- unnest_pv_data(pat_res$data)
 
 # We got back all the inventors on the patents that met our search criteria.  We'll filter out
-# the inventors that didn't strictly meet our criteria (they came along for the ride with
-# the ones that met our criteria), we want the noted behavior to be clear.
+# the inventors that didn't strictly meet our criteria (they're coinventors that came along for 
+# the ride with the ones that met our criteria), we want the noted behavior to be clear.
 
 display_inventors <- 
    dl$inventors %>%
-   filter(grepl("^(George|Thomas)", inventor_name_first ) | grepl("^(Washington|Jefferson)", inventor_name_last))  %>%
-   arrange(nchar(patent_id), patent_id)  # numeric sort on a string field
+   filter(grepl("^(George|Thomas)", inventor_name_first ) | grepl("^(Washington|Jefferson)", inventor_name_last))
 
 display_inventors
 
 ```
-
+Some rows act as you'd expect, like patent 4078607's Thomas Jefferson.  In others, two inventors
+combine to meet the search criteria, like 6905071's
+ **Thomas** Amundsen and Matthew **Jefferson**.
 Now we'll hit the inventor endpoint with a similar query, as the jupyter notebook suggests.
 
 ```{r}
@@ -140,41 +155,33 @@ inventors_query <-
 
 inventor_fields <- c("inventor_id","inventor_name_first","inventor_name_last")
 inventor_res <- search_pv(inventors_query, fields=inventor_fields, endpoint="inventor")
-dl2 <- unnest_pv_data(inventor_res$data)
-
-actual_inventors <-
-   dl2$inventors %>%
-   arrange(inventor_name_last, inventor_name_first)
+actual_inventors <- unnest_pv_data(inventor_res$data)
 
-actual_inventors
+actual_inventors[[1]]
 ```
 
 Now, with actual_inventors' inventor_ids in hand, we'll ask the patent endpoint for their patents.
 The results are quite different than what the first query returned. (These patents would 
-have names matching at least one of our two famous forefathers.  The first query non-intuitively
+have names matching at least one of our two famous forefathers' names.  The first query non-intuitively
 matched names where the first and last name matches did not necessarily both occur on the same inventor.)
 
 ```{r}
-id_query <- qry_funs$eq(inventors.inventor_id=actual_inventors$inventor_id)
+id_query <- qry_funs$eq(inventors.inventor_id=actual_inventors$inventors$inventor_id)
 
-# We need to pass fields since we're sorting (sort field has to be passed as a field)
-# Without a sort we could rely on the default fields being returned if we liked
-
-patent_fields <-c("patent_id", "inventors.inventor_name_first", "inventors.inventor_name_last")
-pat_res <- search_pv(id_query, fields=patent_fields, sort=c("patent_id" = "asc"))
+patent_fields <-c("patent_id", "inventors.inventor_name_first", "inventors.inventor_name_last",
+  "inventors.inventor_id")
+pat_res <- search_pv(id_query, fields=patent_fields, sort=c(patent_id = "asc"))
 
 dl <- unnest_pv_data(pat_res$data)
 
-# Also, the API's sort on patent_id, a string field, puts 10568228 first at the time of 
-# this writing.  Would that be a bug or feature?  Below we'll apply our own sort
-dl$patents[[1]][[1]]
-
-# we'll repeat the same filter we used on the first query's results
+# we'll repeat the same name filter we used on the first query's results
 display_inventors <- 
    dl$inventors %>%
-   filter(grepl("^(George|Thomas)", inventor_name_first ) | grepl("^(Washington|Jefferson)", inventor_name_last))  %>%
-   arrange(nchar(patent_id), patent_id)  # numeric sort on a string field
+   filter(grepl("^(George|Thomas)", inventor_name_first ) | grepl("^(Washington|Jefferson)", inventor_name_last)) %>%
+   mutate(inventor = sub(".*/([^/]+)/$", "\\1", inventor)) # extract id from HATEOAS link
 
+# sample pre-mutate value, note that we requested inventor_id but the API sent back `inventor`
+dl$inventors$inventor[[1]]
 
 display_inventors
 
@@ -188,4 +195,6 @@ in R package form.  The repo doesn't have a stated license but when I checked, I
 > For the repo license we are looking at the [GNU General Public License v3](https://www.gnu.org/licenses/quick-guide-gplv3.html) (GPL3).
 
 That is the same license as R itself so I don't think we've violated anything.  For extra fun check
-out [Russ' fork](https://github.com/mustberuss/PatentsView-Code-Snippets/blob/master/07_PatentSearch_API_demo/PV%20PatentSearch%20API%20tutorial.ipynb).  There was no reply when I asked if they'd be receptive to a PR.
+out [Russ' fork](https://github.com/mustberuss/PatentsView-Code-Snippets/blob/master/07_PatentSearch_API_demo/PV%20PatentSearch%20API%20tutorial.ipynb)
+where there's Python code for retrieving Mr. Jefferson's patents, etc.  There was no reply when we asked if 
+they'd be receptive to a PR.
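
The `mutate()` call added in the last code chunk of the vignette extracts an inventor id from a HATEOAS link with `sub()`. A minimal base-R sketch of how that regular expression behaves is shown here. The link is a made-up example (its host and path are assumptions, not an actual API response), and `extract_id()` is a hypothetical wrapper used only for illustration.

``` r
# Hypothetical example link; the host and path are assumptions for illustration.
link <- "https://example.org/api/v1/inventor/abc-123/"

# Greedy ".*/" consumes everything up to the last slash-delimited segment,
# "([^/]+)" captures that segment, and "/$" requires a trailing slash.
extract_id <- function(x) sub(".*/([^/]+)/$", "\\1", x)

extract_id(link)
```

Because the pattern requires a trailing slash, a value without one is returned unchanged rather than truncated, so it is worth checking a sample value (as the diff does with `dl$inventors$inventor[[1]]`) before relying on the extracted ids.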