Merge branch 'main' into lambda-config
vagimeli authored Nov 21, 2024
2 parents 67985cd + b50a3eb commit 1fbdced
Showing 9 changed files with 200 additions and 60 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -84,7 +84,7 @@ Follow these steps to set up your local copy of the repository:

```
curl -sSL https://get.rvm.io | bash -s stable
rvm install 3.2.4
rvm install 3.3.2
ruby -v
```

44 changes: 23 additions & 21 deletions DEVELOPER_GUIDE.md
@@ -72,6 +72,29 @@ All spec insert components accept the following arguments:
- `component` (String; required): The name of the component to render, such as `query_parameters`, `path_parameters`, or `paths_and_http_methods`.
- `omit_header` (Boolean; Default is `false`): If set to `true`, the markdown header of the component will not be rendered.

### Paths and HTTP methods
To insert paths and HTTP methods for the `search` API, use the following snippet:
```markdown
<!-- spec_insert_start
api: search
component: paths_and_http_methods
-->
<!-- spec_insert_end -->
```
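When rendered, this component lists the API's paths and HTTP methods. For `search`, the output would resemble the following (illustrative, not generated by the tool):

```
GET /_search
POST /_search
GET /{index}/_search
POST /{index}/_search
```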

### Path parameters

To insert a path parameters table for the `indices.create` API, use the following snippet. Use the `x-operation-group` field from the OpenSearch OpenAPI spec as the `api` value:

```markdown
<!-- spec_insert_start
api: indices.create
component: path_parameters
-->
<!-- spec_insert_end -->
```
This table accepts the same arguments as the query parameters table, except for the `include_global` argument.
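For example, assuming the optional arguments listed at the top of this section apply to this component, a snippet that suppresses the markdown header might look like the following (argument placement is illustrative):

```markdown
<!-- spec_insert_start
api: indices.create
component: path_parameters
omit_header: true
-->
<!-- spec_insert_end -->
```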

### Query parameters
To insert a query parameters table for the `cat.indices` API, use the following snippet:
```markdown
@@ -110,24 +133,3 @@ pretty: true
-->
<!-- spec_insert_end -->
```
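Because the path parameters table accepts the same arguments except `include_global`, the query parameters component presumably supports `include_global` in addition to `pretty`; a sketch:

```markdown
<!-- spec_insert_start
api: cat.indices
component: query_parameters
include_global: true
pretty: true
-->
<!-- spec_insert_end -->
```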

### Path parameters
To insert a path parameters table of the `indices.create` API, use the following snippet:
```markdown
<!-- spec_insert_start
api: indices.create
component: path_parameters
-->
<!-- spec_insert_end -->
```
This table behaves the same as the query parameters table except that it does not accept the `include_global` argument.

### Paths and HTTP methods
To insert paths and HTTP methods for the `search` API, use the following snippet:
```markdown
<!-- spec_insert_start
api: search
component: paths_and_http_methods
-->
<!-- spec_insert_end -->
```
2 changes: 1 addition & 1 deletion _api-reference/index-apis/update-settings.md
@@ -11,7 +11,7 @@ redirect_from:
**Introduced 1.0**
{: .label .label-purple }

You can use the update settings API operation to update index-level settings. You can change dynamic index settings at any time, but static settings cannot be changed after index creation. For more information about static and dynamic index settings, see [Create index]({{site.url}}{{site.baseurl}}/api-reference/index-apis/create-index/).
You can use the update settings API operation to update index-level settings. You can change dynamic index settings at any time, but static settings cannot be changed after index creation. For more information about static and dynamic index settings, see [Configuring OpenSearch]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index/).

Aside from the static and dynamic index settings, you can also update individual plugins' settings. To get the full list of updatable settings, run `GET <target-index>/_settings?include_defaults=true`.
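For example, a dynamic setting such as `index.refresh_interval` can be changed on a live index with a request along these lines (the index name is illustrative):

```bash
# "my-index" is an illustrative index name
curl -X PUT "http://localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "refresh_interval": "30s"
  }
}'
```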

176 changes: 155 additions & 21 deletions _dashboards/management/accelerate-external-data.md
@@ -1,10 +1,8 @@
---
layout: default
title: Optimize query performance using OpenSearch indexing
parent: Connecting Amazon S3 to OpenSearch
grand_parent: Data sources
nav_order: 15
has_children: false
parent: Data sources
nav_order: 17
---

# Optimize query performance using OpenSearch indexing
@@ -14,35 +12,171 @@ Introduced 2.11

Query performance can be slow when using external data sources for reasons such as network latency, data transformation, and data volume. You can optimize your query performance by using OpenSearch indexes, such as a skipping index or a covering index.

A _skipping index_ uses skip acceleration methods, such as partition, minimum and maximum values, and value sets, to ingest and create compact aggregate data structures. This makes it an economical option for direct querying scenarios.
- A _skipping index_ uses skip acceleration methods, such as partition, minimum and maximum values, and value sets, to ingest and create compact aggregate data structures. This makes it an economical option for direct querying scenarios. For more information, see [Skipping indexes](https://opensearch.org/docs/latest/dashboards/management/accelerate-external-data/#skipping-indexes).
- A _covering index_ ingests all or some of the data from the source into OpenSearch and makes it possible to use all OpenSearch Dashboards and plugin functionality. For more information, see [Covering indexes](https://opensearch.org/docs/latest/dashboards/management/accelerate-external-data/#covering-indexes).
- A _materialized view_ enhances query performance by storing precomputed and aggregated data from the source data. For more information, see [Materialized views](https://opensearch.org/docs/latest/dashboards/management/accelerate-external-data/#materialized-views).

A _covering index_ ingests all or some of the data from the source into OpenSearch and makes it possible to use all OpenSearch Dashboards and plugin functionality. See the [Flint Index Reference Manual](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md) for comprehensive guidance on this feature's indexing process.
For comprehensive guidance on each indexing process, see the [Flint Index Reference Manual](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md).

## Data sources use case: Accelerate performance

To get started with the **Accelerate performance** use case available in **Data sources**, follow these steps:
To get started with accelerating query performance, perform the following steps:

1. Go to **OpenSearch Dashboards** > **Query Workbench** and select your Amazon S3 data source from the **Data sources** dropdown menu in the upper-left corner.
2. From the left-side navigation menu, select a database.
3. View the results in the table and confirm that you have the desired data.
1. Go to **OpenSearch Dashboards** > **Query Workbench** and select your data source from the **Data sources** dropdown menu.
2. From the navigation menu, select a database.
3. View the results in the table and confirm that you have the correct data.
4. Create an OpenSearch index by following these steps:
1. Select the **Accelerate data** button. A pop-up window appears.
2. Enter your details in **Select data fields**. In the **Database** field, select the desired acceleration index: **Skipping index** or **Covering index**. A _skipping index_ uses skip acceleration methods, such as partition, min/max, and value sets, to ingest data using compact aggregate data structures. This makes it an economical option for direct querying scenarios. A _covering index_ ingests all or some of the data from the source into OpenSearch and makes it possible to use all OpenSearch Dashboards and plugin functionality.
5. Under **Index settings**, enter the information for your acceleration index. For information about naming, select **Help**. Note that an Amazon S3 table can only have one skipping index at a time.
1. Select **Accelerate data**. A pop-up window appears.
2. Enter your database and table details under **Select data fields**.
5. For **Acceleration type**, select the type of acceleration according to your use case. Then, enter the information for your acceleration type. For more information, see the following sections:
- [Skipping indexes](https://opensearch.org/docs/latest/dashboards/management/accelerate-external-data/#skipping-indexes)
- [Covering indexes](https://opensearch.org/docs/latest/dashboards/management/accelerate-external-data/#covering-indexes)
- [Materialized views](https://opensearch.org/docs/latest/dashboards/management/accelerate-external-data/#materialized-views)

## Skipping indexes

A _skipping index_ uses skip acceleration methods, such as partition, min/max, and value sets, to ingest data using compact aggregate data structures. This makes it an economical option for direct querying scenarios.

With a skipping index, you can index only the metadata of the data stored in Amazon S3. When you query a table with a skipping index, the query planner references the index and rewrites the query to efficiently locate the data, instead of scanning all partitions and files. This allows the skipping index to quickly narrow down the specific location of the stored data.
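For example, given a skipping index with a `PARTITION` entry on a `day` column, a query like the following could be pruned to the matching partitions instead of scanning every file (table, column, and partition values are illustrative):

```sql
-- Illustrative names; the PARTITION entry on `day` lets the query planner
-- skip partitions whose metadata cannot match the predicate.
SELECT srcaddr, dstaddr, bytes
FROM datasourcename.gluedatabasename.vpclogstable
WHERE day = '2024-11-21'
```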

### Define skipping index settings

1. Under **Skipping index definition**, select the **Add fields** button to define the skipping index acceleration method and choose the fields you want to add.
2. Select the **Copy Query to Editor** button to apply your skipping index settings.
3. View the skipping index query details in the table pane and then select the **Run** button. Your index is added to the left-side navigation menu containing the list of your databases.
1. Under **Skipping index definition**, select **Generate** to automatically generate a skipping index. Alternatively, to choose the fields manually, select **Add fields**. Choose from the following types:
- `Partition`: Uses data partition details to locate data. This type is best for partitioning-based columns, such as year, month, day, and hour.
- `MinMax`: Uses the lower and upper bounds of the indexed column to locate data. This type is best for numeric columns.
- `ValueSet`: Uses a unique value set to locate data. This type is best for columns with low to moderate cardinality that require exact matching.
- `BloomFilter`: Uses the Bloom filter algorithm to locate data. This type is best for columns with high cardinality that do not require exact matching.
2. Select **Create acceleration** to apply your skipping index settings.
3. View the skipping index query details and then select **Run**. OpenSearch adds your index to the left navigation pane.

Alternatively, you can manually create a skipping index using Query Workbench. Select your data source from the dropdown menu and run a query like the following:

```sql
CREATE SKIPPING INDEX
ON datasourcename.gluedatabasename.vpclogstable(
`srcaddr` BLOOM_FILTER,
`dstaddr` BLOOM_FILTER,
`day` PARTITION,
`account_id` BLOOM_FILTER
) WITH (
index_settings = '{"number_of_shards":5,"number_of_replicas":1}',
auto_refresh = true,
checkpoint_location = 's3://accountnum-vpcflow/AWSLogs/checkpoint'
)
```
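If `auto_refresh` is disabled, the Flint reference manual describes manually triggering a refresh; a sketch (verify the exact statement against your opensearch-spark version):

```sql
-- Manual refresh of the skipping index created above (sketch)
REFRESH SKIPPING INDEX ON datasourcename.gluedatabasename.vpclogstable
```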

## Covering indexes

A _covering index_ ingests all or some of the data from the source into OpenSearch and makes it possible to use all OpenSearch Dashboards and plugin functionality.

With a covering index, you can ingest data from a specified column in a table. This is the most performant of the three indexing types. Because OpenSearch ingests all data from your desired column, you get better performance and can perform advanced analytics.

OpenSearch creates a new index from the covering index data. You can use this new index to create visualizations or for anomaly detection and geospatial capabilities. You can manage the covering index with Index State Management. For more information, see [Index State Management](https://opensearch.org/docs/latest/im-plugin/ism/index/).

### Define covering index settings

1. Under **Index settings**, enter a valid index name. Note that each Amazon S3 table can have multiple covering indexes.
2. Once you have added the index name, define the covering index fields by selecting `(add fields here)` under **Covering index definition**.
3. Select the **Copy Query to Editor** button to apply your covering index settings.
4. View the covering index query details in the table pane and then select the **Run** button. Your index is added to the left-side navigation menu containing the list of your databases.
1. For **Index name**, enter a valid index name. Note that each table can have multiple covering indexes.
2. Choose a **Refresh type**. By default, OpenSearch automatically refreshes the index. Otherwise, you must manually trigger a refresh using a `REFRESH` statement.
3. Enter a **Checkpoint location**, which is a path for refresh job checkpoints. The location must be a path in an HDFS-compatible file system.
4. Define the covering index fields by selecting **(add fields here)** under **Covering index definition**.
5. Select **Create acceleration** to apply your covering index settings.
6. View the covering index query details and then select **Run**. OpenSearch adds your index to the left navigation pane.

Alternatively, you can manually create a covering index on your table using Query Workbench. Select your data source from the dropdown menu and run a query like the following:

```sql
CREATE INDEX vpc_covering_index
ON datasourcename.gluedatabasename.vpclogstable (version, account_id, interface_id,
srcaddr, dstaddr, srcport, dstport, protocol, packets,
bytes, start, action, log_status STRING,
`aws-account-id`, `aws-service`, `aws-region`, year,
month, day, hour )
WITH (
auto_refresh = true,
refresh_interval = '15 minute',
checkpoint_location = 's3://accountnum-vpcflow/AWSLogs/checkpoint'
)
```
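To list or remove the covering indexes on a table, the Flint reference manual documents statements along the following lines (a sketch; confirm the syntax for your version):

```sql
-- List and drop covering indexes on the example table (sketch)
SHOW INDEX ON datasourcename.gluedatabasename.vpclogstable
DROP INDEX vpc_covering_index ON datasourcename.gluedatabasename.vpclogstable
```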

## Materialized views

With _materialized views_, you can use complex queries, such as aggregations, to power Dashboards visualizations. Materialized views ingest a small amount of your data, depending on the query, into OpenSearch. OpenSearch then forms an index from the ingested data that you can use for visualizations. You can manage the materialized view index with Index State Management. For more information, see [Index State Management](https://opensearch.org/docs/latest/im-plugin/ism/index/).

### Define materialized view settings

1. For **Index name**, enter a valid index name. Note that each table can have multiple materialized views.
2. Choose a **Refresh type**. By default, OpenSearch automatically refreshes the index. Otherwise, you must manually trigger a refresh using a `REFRESH` statement.
3. Enter a **Checkpoint location**, which is a path for refresh job checkpoints. The location must be a path in an HDFS-compatible file system.
4. Enter a **Watermark delay**, which defines how late data can arrive and still be processed (for example, 1 minute or 10 seconds).
5. Define the materialized view fields under **Materialized view definition**.
6. Select **Create acceleration** to apply your materialized view index settings.
7. View the materialized view query details and then select **Run**. OpenSearch adds your index to the left navigation pane.

Alternatively, you can manually create a materialized view index on your table using Query Workbench. Select your data source from the dropdown menu and run a query like the following:

```sql
CREATE MATERIALIZED VIEW {table_name}__week_live_mview AS
SELECT
cloud.account_uid AS `aws.vpc.cloud_account_uid`,
cloud.region AS `aws.vpc.cloud_region`,
cloud.zone AS `aws.vpc.cloud_zone`,
cloud.provider AS `aws.vpc.cloud_provider`,

CAST(IFNULL(src_endpoint.port, 0) AS LONG) AS `aws.vpc.srcport`,
CAST(IFNULL(src_endpoint.svc_name, 'Unknown') AS STRING) AS `aws.vpc.pkt-src-aws-service`,
CAST(IFNULL(src_endpoint.ip, '0.0.0.0') AS STRING) AS `aws.vpc.srcaddr`,
CAST(IFNULL(src_endpoint.interface_uid, 'Unknown') AS STRING) AS `aws.vpc.src-interface_uid`,
CAST(IFNULL(src_endpoint.vpc_uid, 'Unknown') AS STRING) AS `aws.vpc.src-vpc_uid`,
CAST(IFNULL(src_endpoint.instance_uid, 'Unknown') AS STRING) AS `aws.vpc.src-instance_uid`,
CAST(IFNULL(src_endpoint.subnet_uid, 'Unknown') AS STRING) AS `aws.vpc.src-subnet_uid`,

CAST(IFNULL(dst_endpoint.port, 0) AS LONG) AS `aws.vpc.dstport`,
CAST(IFNULL(dst_endpoint.svc_name, 'Unknown') AS STRING) AS `aws.vpc.pkt-dst-aws-service`,
CAST(IFNULL(dst_endpoint.ip, '0.0.0.0') AS STRING) AS `aws.vpc.dstaddr`,
CAST(IFNULL(dst_endpoint.interface_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-interface_uid`,
CAST(IFNULL(dst_endpoint.vpc_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-vpc_uid`,
CAST(IFNULL(dst_endpoint.instance_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-instance_uid`,
CAST(IFNULL(dst_endpoint.subnet_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-subnet_uid`,
CASE
WHEN regexp(dst_endpoint.ip, '(10\\..*)|(192\\.168\\..*)|(172\\.1[6-9]\\..*)|(172\\.2[0-9]\\..*)|(172\\.3[0-1]\\..*)')
THEN 'ingress'
ELSE 'egress'
END AS `aws.vpc.flow-direction`,

CAST(IFNULL(connection_info['protocol_num'], 0) AS INT) AS `aws.vpc.connection.protocol_num`,
CAST(IFNULL(connection_info['tcp_flags'], '0') AS STRING) AS `aws.vpc.connection.tcp_flags`,
CAST(IFNULL(connection_info['protocol_ver'], '0') AS STRING) AS `aws.vpc.connection.protocol_ver`,
CAST(IFNULL(connection_info['boundary'], 'Unknown') AS STRING) AS `aws.vpc.connection.boundary`,
CAST(IFNULL(connection_info['direction'], 'Unknown') AS STRING) AS `aws.vpc.connection.direction`,

CAST(IFNULL(traffic.packets, 0) AS LONG) AS `aws.vpc.packets`,
CAST(IFNULL(traffic.bytes, 0) AS LONG) AS `aws.vpc.bytes`,

CAST(FROM_UNIXTIME(time / 1000) AS TIMESTAMP) AS `@timestamp`,
CAST(FROM_UNIXTIME(start_time / 1000) AS TIMESTAMP) AS `start_time`,
CAST(FROM_UNIXTIME(start_time / 1000) AS TIMESTAMP) AS `interval_start_time`,
CAST(FROM_UNIXTIME(end_time / 1000) AS TIMESTAMP) AS `end_time`,
status_code AS `aws.vpc.status_code`,

severity AS `aws.vpc.severity`,
class_name AS `aws.vpc.class_name`,
category_name AS `aws.vpc.category_name`,
activity_name AS `aws.vpc.activity_name`,
disposition AS `aws.vpc.disposition`,
type_name AS `aws.vpc.type_name`,

region AS `aws.vpc.region`,
accountid AS `aws.vpc.account-id`
FROM
datasourcename.gluedatabasename.vpclogstable
WITH (
auto_refresh = true,
refresh_interval = '15 Minute',
checkpoint_location = 's3://accountnum-vpcflow/AWSLogs/checkpoint',
watermark_delay = '1 Minute'
)
```
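With `auto_refresh = true`, the refresh job runs on the configured interval. For a manually refreshed view, the Flint reference manual describes a statement along these lines (a sketch; the view name is illustrative, substituting the `{table_name}` placeholder):

```sql
-- Illustrative fully qualified view name (sketch)
REFRESH MATERIALIZED VIEW datasourcename.gluedatabasename.vpclogstable__week_live_mview
```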

## Limitations

This feature is still under development, so there are some limitations. For real-time updates, refer to the [developer documentation on GitHub](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#limitations).
This feature is still under development, so there are some limitations. For real-time updates, see the [developer documentation on GitHub](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#limitations).
4 changes: 2 additions & 2 deletions _getting-started/communicate.md
@@ -28,7 +28,7 @@ curl -X GET "http://localhost:9200/_cluster/health"
If you're using the Security plugin, provide the username and password in the request:

```bash
curl -X GET "http://localhost:9200/_cluster/health" -ku admin:<custom-admin-password>
curl -X GET "https://localhost:9200/_cluster/health" -ku admin:<custom-admin-password>
```
{% include copy.html %}
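A healthy cluster returns `200 OK` with a JSON body similar to the following (values depend on your cluster and are abbreviated here):

```json
{
  "cluster_name": "opensearch-cluster",
  "status": "green",
  "number_of_nodes": 1,
  "active_primary_shards": 1,
  "active_shards": 1,
  "unassigned_shards": 0
}
```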

@@ -317,4 +317,4 @@ Once a field is created, you cannot change its type. Changing a field type requi

## Next steps

- See [Ingest data into OpenSearch]({{site.url}}{{site.baseurl}}/getting-started/ingest-data/) to learn about ingestion options.
1 change: 1 addition & 0 deletions _search-plugins/caching/request-cache.md
@@ -28,6 +28,7 @@ Setting | Data type | Default | Level | Static/Dynamic | Description
`indices.cache.cleanup_interval` | Time unit | `1m` (1 minute) | Cluster | Static | Schedules a recurring background task that cleans up expired entries from the cache at the specified interval.
`indices.requests.cache.size` | Percentage | `1%` | Cluster | Static | The cache size as a percentage of the heap size (for example, to use 1% of the heap, specify `1%`).
`index.requests.cache.enable` | Boolean | `true` | Index | Dynamic | Enables or disables the request cache.
`indices.requests.cache.maximum_cacheable_size` | Integer | `0` | Cluster | Dynamic | Sets the maximum `size` of queries whose results can be added to the request cache.

### Example
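Because `indices.requests.cache.maximum_cacheable_size` is a dynamic cluster-level setting, it can presumably be updated at runtime through the cluster settings API; a sketch:

```bash
# Sketch: raise the cacheable query size cluster-wide (the value is illustrative)
curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.requests.cache.maximum_cacheable_size": 100
  }
}'
```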
