Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add uax_url_email tokenizer docs #8485

84 changes: 84 additions & 0 deletions _analyzers/tokenizers/uax-url-email.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
---
layout: default
title: UAX URL email
parent: Tokenizers
nav_order: 150
---

# UAX URL email tokenizer

In addition to regular text, the `uax_url_email` tokenizer is designed to handle URLs, email addresses, and domain names. It is based on the Unicode Text Segmentation algorithm ([UAX #29](https://www.unicode.org/reports/tr29/)), which allows it to correctly tokenize complex text, including URLs and email addresses.

## Example usage

The following example request creates a new index named `my_index` and configures an analyzer with a `uax_url_email` tokenizer:

```json
PUT /my_index
{
"settings": {
"analysis": {
"tokenizer": {
"uax_url_email_tokenizer": {
"type": "uax_url_email"
}
},
"analyzer": {
"my_uax_analyzer": {
"type": "custom",
"tokenizer": "uax_url_email_tokenizer"
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "my_uax_analyzer"
}
}
}
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
"analyzer": "my_uax_analyzer",
"text": "Contact us at [email protected] or visit https://example.com for details."
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{"token": "Contact","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0},
{"token": "us","start_offset": 8,"end_offset": 10,"type": "<ALPHANUM>","position": 1},
{"token": "at","start_offset": 11,"end_offset": 13,"type": "<ALPHANUM>","position": 2},
{"token": "[email protected]","start_offset": 14,"end_offset": 33,"type": "<EMAIL>","position": 3},
{"token": "or","start_offset": 34,"end_offset": 36,"type": "<ALPHANUM>","position": 4},
{"token": "visit","start_offset": 37,"end_offset": 42,"type": "<ALPHANUM>","position": 5},
{"token": "https://example.com","start_offset": 43,"end_offset": 62,"type": "<URL>","position": 6},
{"token": "for","start_offset": 63,"end_offset": 66,"type": "<ALPHANUM>","position": 7},
{"token": "details","start_offset": 67,"end_offset": 74,"type": "<ALPHANUM>","position": 8}
]
}
```

## Parameters

The `uax_url_email` tokenizer can be configured with the following parameter.

Parameter | Required/Optional | Data type | Description
:--- | :--- | :--- | :---
`max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`.

Loading