#126: Add documentation for token for private models in prediction UDFs (#127)
exasol · Sep 8, 2023 · c82c6f1
1 parent 8b57c76
commit c82c6f1
Showing 4 changed files with 438 additions and 269 deletions.
diff --git a/doc/changes/changes_0.5.0.md b/doc/changes/changes_0.5.0.md
@@ -1,11 +1,13 @@
-# Transformers Extension 0.5.0, released YYYY-MM-DD
+# Transformers Extension 0.5.0, released 2023-09-08
 
-Code name: TBD
+Code name: Support for transformer 4.31
 
 
 ## Summary
 
-TBD
+This release makes the extension compatible with Huggingface transformers v.4.31.0 and their new model cache format. 
+Furthermore, it makes the deployment scripts compatible with Exasol v8 by enabling encryption 
+and allows the user to configure the TLS verification.
 
 ### Features
 
@@ -33,6 +35,7 @@ TBD
 ### Documentation
 
  - #93: Added the Developer Guide
+ - #126: Add documentation for token for private models in prediction UDFs
 
 
 

diff --git a/doc/user_guide/user_guide.md b/doc/user_guide/user_guide.md
@@ -52,7 +52,7 @@ The extension provides two types of UDFs:
   ```buildoutcfg
     http[s]://<BUCKETFS_HOST>:<BUCKETFS_PORT>/<BUCKET_NAME>/<PATH_IN_BUCKET>;<BUCKETFS_NAME>
   ```
-  - A valid token is required to download private models from the Huggingface hub. 
+  - A valid token is required to download private models from the Huggingface hub and run prediction on them. 
   To avoid exposing such sensitive information, you can use Exasol Connection 
   objects. As seen in the example below, a token can be specified in the 
   password part of the Exasol connection object:
@@ -128,7 +128,7 @@ language container file released in GitHub Releases section.
 (see [the latest release](https://github.com/exasol/transformers-extension/releases/latest)).
 - Before installing the language container, these parts must be combined using the following command::
 ```shell
-cat language_container_part_* > language_container.tar.gz
+ls -v language_container_part_* | xargs cat > language_container.tar.gz
 ```
 
 ##### Install Language Container
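
Note on the hunk above: the switch from `cat language_container_part_*` to `ls -v language_container_part_* | xargs cat` presumably matters once a release has ten or more parts. Shell globs expand in lexical order, so `part_10` would be concatenated before `part_2`. The sketch below illustrates the difference. It assumes GNU coreutils (on BSD/macOS, `ls -v` has a different meaning) and uses twelve hypothetical one-line part files instead of real container parts.

```shell
# Hypothetical demo, not part of the extension: each "part" holds its own index.
set -eu
export LC_ALL=C                      # make glob ordering deterministic
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    printf '%s\n' "$i" > "language_container_part_$i"
done

# Old command: the glob expands lexically (part_1, part_10, part_11, part_12, part_2, ...)
cat language_container_part_* > glob_order.txt

# New command: GNU `ls -v` sorts the embedded numbers naturally (part_1, part_2, ..., part_12)
ls -v language_container_part_* | xargs cat > natural_order.txt

# Join the lines with spaces so the two orders are easy to compare
glob_order=$(paste -sd ' ' glob_order.txt)
natural_order=$(paste -sd ' ' natural_order.txt)
echo "glob:    $glob_order"      # glob:    1 10 11 12 2 3 4 5 6 7 8 9
echo "natural: $natural_order"   # natural: 1 2 3 4 5 6 7 8 9 10 11 12
```

With nine or fewer parts both commands give the same result, so the old command only fails for larger releases.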
@@ -265,6 +265,7 @@ classes of the specified  model. An example usage is given below:
 SELECT TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data
@@ -273,7 +274,11 @@ SELECT TE_SEQUENCE_CLASSIFICATION_SINGLE_TEXT_UDF(
 - Parameters:
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
-  - ```bucketfs_conn```: The BucketFS connection name 
+  - ```bucketfs_conn```: The BucketFS connection name
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -285,10 +290,10 @@ this UDF. In case of any error during model loading or prediction, these new
 columns are set to `null` and column _ERROR_MESSAGE_ is set 
 to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | LABEL   | SCORE | ERROR_MESSAGE  |
-| ------------- | ------- | ---------- | --------- |---------| ----- |----------------|
-| conn_name     | dir/    | model_name | text      | label_1 | 0.75  | None           |          
-| ...           | ...     | ...        | ...       | ...     | ...   | ...            |
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | TEXT_DATA | LABEL   | SCORE | ERROR_MESSAGE  |
+| ------------- |-----------------|---------|------------| --------- |---------| ----- |----------------|
+| conn_name     | token_conn_name | dir/    | model_name | text      | label_1 | 0.75  | None           |          
+| ...           | ...             | ...     | ...        | ...       | ...     | ...   | ...            |
 
 
 ### Sequence Classification for Text Pair UDF
@@ -298,6 +303,7 @@ determine if two sequences are paraphrases of each other. An example usage is gi
 SELECT TE_SEQUENCE_CLASSIFICATION_TEXT_PAIR_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     first_text,
@@ -307,7 +313,11 @@ SELECT TE_SEQUENCE_CLASSIFICATION_TEXT_PAIR_UDF(
 - Parameters:
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
-  - ```bucketfs_conn```: The BucketFS connection name 
+  - ```bucketfs_conn```: The BucketFS connection name
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -329,6 +339,7 @@ An example usage is given below:
 SELECT TE_QUESTION_ANSWERING_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     question,
@@ -340,6 +351,10 @@ SELECT TE_QUESTION_ANSWERING_UDF(
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
   - ```bucketfs_conn```: The BucketFS connection name 
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -354,11 +369,11 @@ If `top_k` > 1, each input row is repeated for each answer. In case of any error
 during model loading or prediction, these new columns are set to `null` and column _ERROR_MESSAGE_ is set 
 to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | QUESTION   | CONTEXT   | TOP_K | ANSWER   | SCORE | RANK | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |------------|-----------| ----- |----------| ----- |------| ------------- |
-| conn_name     | dir/    | model_name | question_1 | context_1 | 2     | answer_1 | 0.75  | 1    | None          |
-| conn_name     | dir/    | model_name | question_2 | context_1 | 2     | answer_2 | 0.70  | 2    | None          |
-| ...           | ...     | ...        | ...        | ...       | ...   | ...      | ...   | ..   | ...           |
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | QUESTION   | CONTEXT   | TOP_K | ANSWER   | SCORE | RANK | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|------------|-----------| ----- |----------| ----- |------| ------------- |
+| conn_name     | token_conn_name | dir/    | model_name | question_1 | context_1 | 2     | answer_1 | 0.75  | 1    | None          |
+| conn_name     | token_conn_name | dir/    | model_name | question_2 | context_1 | 2     | answer_2 | 0.70  | 2    | None          |
+| ...           | ...             | ...     | ...        | ...        | ...       | ...   | ...      | ...   | ..   | ...           |
 
 
 ### Masked Language Modelling UDF
@@ -369,6 +384,7 @@ this UDF is ```<mask>```. An example usage is given below:
 SELECT TE_FILLING_MASK_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data,
@@ -379,7 +395,11 @@ SELECT TE_FILLING_MASK_UDF(
 - Parameters:
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
-  - ```bucketfs_conn```: The BucketFS connection name 
+  - ```bucketfs_conn```: The BucketFS connection name
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -393,11 +413,11 @@ If `top_k` > 1, each input row is repeated for each prediction. In case of any
 error during model loading or prediction, these new columns are set to `null` 
 and column _ERROR_MESSAGE_ is set to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA     | TOP_K | FILLED_TEXT   | SCORE | RANK | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |---------------| ----- |---------------| ----- |------|---------------|
-| conn_name     | dir/    | model_name | text `<mask>` | 2     | text filled_1 | 0.75  |   1  | None          |
-| conn_name     | dir/    | model_name | text `<mask>` | 2     | text filled_2 | 0.70  |   2  | None          |
-| ...           | ...     | ...        | ...           | ...   | ...           | ...   |  ... | ...           |
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | TEXT_DATA     | TOP_K | FILLED_TEXT   | SCORE | RANK | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|---------------| ----- |---------------| ----- |------|---------------|
+| conn_name     | token_conn_name | dir/    | model_name | text `<mask>` | 2     | text filled_1 | 0.75  |   1  | None          |
+| conn_name     | token_conn_name | dir/    | model_name | text `<mask>` | 2     | text filled_2 | 0.70  |   2  | None          |
+| ...           | ...             | ...     | ...        | ...           | ...   | ...           | ...   |  ... | ...           |
 
 
 ### Text Generation UDF
@@ -409,6 +429,7 @@ An example usage is given below:
 SELECT TE_TEXT_GENERATION_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data,
@@ -419,7 +440,11 @@ SELECT TE_TEXT_GENERATION_UDF(
 - Parameters:
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
-  - ```bucketfs_conn```: The BucketFS connection name. 
+  - ```bucketfs_conn```: The BucketFS connection name.
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -443,6 +468,7 @@ There are two popular subtasks of token classification:
 SELECT TE_TOKEN_CLASSIFICATION_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data,
@@ -453,6 +479,10 @@ SELECT TE_TOKEN_CLASSIFICATION_UDF(
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
   - ```bucketfs_conn```: The BucketFS connection name. 
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -469,10 +499,10 @@ In case of any error during model loading or prediction, these new
 columns are set to `null`, and column _ERROR_MESSAGE_ is set 
 to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | AGGREGATION_STRATEGY | START_POS | END_POS | WORD | ENTITY | SCORE | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |-----------|----------------------|-----------|---------|------|--------|-------| ------------- |
-| conn_name     | dir/    | model_name | text      | simple               | 0         | 4       | text | noun   | 0.75  | None          |
-| ...           | ...     | ...        | ...       | ...                  | ...       | ...     | ...  | ..     | ...   | ...           |
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | TEXT_DATA | AGGREGATION_STRATEGY | START_POS | END_POS | WORD | ENTITY | SCORE | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|-----------|----------------------|-----------|---------|------|--------|-------| ------------- |
+| conn_name     | token_conn_name | dir/    | model_name | text      | simple               | 0         | 4       | text | noun   | 0.75  | None          |
+| ...           | ...             | ...     | ...        | ...       | ...                  | ...       | ...     | ...  | ..     | ...   | ...           |
 
 
 
@@ -483,6 +513,7 @@ This UDF translates a given text from one language to another.
 SELECT TE_TRANSLATION_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data,
@@ -495,7 +526,11 @@ SELECT TE_TRANSLATION_UDF(
 - Parameters:
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
-  - ```bucketfs_conn```: The BucketFS connection name. 
+  - ```bucketfs_conn```: The BucketFS connection name.
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -511,10 +546,10 @@ combined with the inputs used when calling this UDF. In case of any error during
 model loading or prediction, these new columns are set to `null`, and 
 column _ERROR_MESSAGE_ is set to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | SOURCE_LANGUAGE | TARGET_LANGUAGE | MAX_LENGTH | TRANSLATION_TEXT | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |-----------|-----------------|-----------------|------------| ---------------- |---------------|
-| conn_name     | dir/    | model_name | context   | English         | German          | 100        | kontext          | None          |
-| ...           | ...     | ...        | ...       | ...             | ...             | ...        | ...              | ...           |
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | TEXT_DATA | SOURCE_LANGUAGE | TARGET_LANGUAGE | MAX_LENGTH | TRANSLATION_TEXT | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|-----------|-----------------|-----------------|------------| ---------------- |---------------|
+| conn_name     | token_conn_name | dir/    | model_name | context   | English         | German          | 100        | kontext          | None          |
+| ...           | ...             | ...     | ...        | ...       | ...             | ...             | ...        | ...              | ...           |
 
 
 ### Zero-Shot Text Classification UDF
@@ -526,6 +561,7 @@ string, and generate probability scores prediction for each label.
 SELECT TE_ZERO_SHOT_TEXT_CLASSIFICATION_UDF(
     device_id,
     bucketfs_conn,
+    token_conn,
     sub_dir,
     model_name,
     text_data,
@@ -537,6 +573,10 @@ SELECT TE_ZERO_SHOT_TEXT_CLASSIFICATION_UDF(
   - ```device_id```: To run on GPU, specify the valid cuda device ID. Otherwise, 
   you can provide NULL for this parameter.
   - ```bucketfs_conn```: The BucketFS connection name. 
+  - ```token_conn```: The connection name containing the token required for 
+  private models. You can use NULL for public models. For details 
+  on how to create a connection object with token information, please check 
+  [here](#getting-started).
   - ```sub_dir```: The directory where the model is stored in the BucketFS.
   - ```model_name```: The name of the model to use for prediction. You can find the 
   details of the models in [huggingface models page](https://huggingface.co/models).
@@ -549,8 +589,8 @@ columns, combined with the inputs used when calling this UDF. In case of any
 error during model loading or prediction, these new  columns are set to `null`, 
 and column _ERROR_MESSAGE_ is set to the stacktrace of the error. For example:
 
-| BUCKETFS_CONN | SUB_DIR | MODEL_NAME | TEXT_DATA | CANDIDATE LABELS | LABEL  | SCORE | RANK | ERROR_MESSAGE |
-| ------------- | ------- | ---------- |-----------|------------------|--------|-------|------|---------------|
-| conn_name     | dir/    | model_name | text      | label1,label2..  | label1 | 0.75  | 1    | None          |
-| conn_name     | dir/    | model_name | text      | label1,label2..  | label2 | 0.70  | 2    | None          |
-| ...           | ...     | ...        | ...       | ...              | ...    | ...   | ..   | ...           |  
+| BUCKETFS_CONN | TOKEN_CONN      | SUB_DIR | MODEL_NAME | TEXT_DATA | CANDIDATE LABELS | LABEL  | SCORE | RANK | ERROR_MESSAGE |
+| ------------- |-----------------|---------|------------|-----------|------------------|--------|-------|------|---------------|
+| conn_name     | token_conn_name | dir/    | model_name | text      | label1,label2..  | label1 | 0.75  | 1    | None          |
+| conn_name     | token_conn_name | dir/    | model_name | text      | label1,label2..  | label2 | 0.70  | 2    | None          |
+| ...           | ...             | ...     | ...        | ...       | ...              | ...    | ...   | ..   | ...           |