Skip to content

Latest commit

 

History

History
380 lines (293 loc) · 13.7 KB

README.md

File metadata and controls

380 lines (293 loc) · 13.7 KB

Vector Test Harness


Full end-to-end test harness for the Vector log & metrics router. This is the test framework used to generate the performance and correctness results displayed in the Vector docs. You can learn more about how this test harness works in the How It Works section, and you can begin using this test harness via the Usage section.


TOC

Performance Tests

Correctness Tests


Directories

  • /ansible - global ansible resources and tasks
  • /bin - contains all scripts
  • /cases - contains all test cases
  • /packer - packer script to build the AMIs necessart for tests
  • /terraform - global terraform state, resources, and modules

Setup

  1. Ensure you have Ansible (2.7+) and Terraform (0.12.20+) installed.

  2. This step is optional, but highly recommended. Setup a vector specific AWS profile in your ~/.aws/credentials file. We highly recommend running the Vector test harness in a separate AWS sandbox account if possible.

  3. Create an Amazon compatible key pair. This will be used for SSH access to test instances.

  4. Run cp .envrc.example .envrc. Read through the file, update as necessary.

  5. Run source .envrc to prepare the environment. Alternatively install direnv to do this automatically. Note that the .env file, if it exists, will be automatically sourced into the scripts environment - so it's another option to set the environment variables for the bin/* commands of this repo.

  6. Run:

    ./bin/test -t [tcp_to_tcp_performance]

    This script will take care of running the necessary Terraform and Ansible scripts.

Usage

Results

  • High-level results can be found in the Vector performance and correctness documentation sections.
  • Detailed results can be found within each test case's README.
  • Raw performance result data can be found in our public S3 bucket.
  • You can run your own queries against the raw data. See the Usage section.

Development

Adding a test

We recommend cloning a similar to test since it removes a lot of the boilerplate. If you prefer to start from scratch:

  1. Create a new folder in the /cases directory. Your name should end with _performance or _correctness to clarify the type of test this is.
  2. Add a README.md providing an overview of the test. See the tcp_to_tcp_performance test for an example.
  3. Add a terraform/main.tf file for provisioning test resources.
  4. Add a ansible/bootstrap.yml to bootstrap the environment.
  5. Add a ansible/run.yml to run the test againt each subject.
  6. Add any additional files as you see fit for each test.
  7. Run bin/test -t <name_of_test>.

Changing a test

You should not be changing tests with historical test data. You can change test subject versions since test data is partitioned by version, but you cannot change a test's execution strategy as this would corrupt historical test data. If you need to change the test in such a way that would violate historical data we recommend creating an entirely new test.

Deleting a test

Simply delete the folder and any data in the s3 bucket.

Debugging

On a VM end

If you encounter an error it's likely you'll need to SSH onto the server to investigate.

SSHing

ssh  -o 'IdentityFile="~/.ssh/vector_management"' [email protected]

Where:

  • ~/.ssh/vector_management = the VECTOR_TEST_SSH_PRIVATE_KEY value provided in your .envrc file.
  • ubuntu = the default root username for the instance.
  • 51.5.210.84 = the public IP address of the instance.

We provide a command that wraps the system ssh and provides the same credentials that ansible uses when connecting to the VM:

./bin/ssh 51.5.210.84

Viewing logs

All services are configured with systemd where their logs can be accessed with journalctl:

sudo journactl -fu <service>

Failed services

If you find that the service failed to start, I find it helpful to manually attempt to start the service by inspecting the command in the .service file:

cat /etc/systemd/system/<name>.service

Then copy the command specified in ExecStart and run it manually. Ex:

/usr/bin/vector

On your end

Things can go wrong on your end (i.e. on the local system you're running the test harness) too.

Ansible Task Debugger

export ANSIBLE_ENABLE_TASK_DEBUGGER=True

Set the environment variable above, and Ansible will drop you in a debug mode on any task failure.

See Ansible documentation on Playbook Debugger to learn more.

Some useful commands:

pprint task_vars['hostvars'][str(host)]['last_message']

Verbose Ansible Execution

export ANSIBLE_EXTRA_ARGS=-vvv

Set the environment variable above, and Ansible will print verbose debug information for every task it executes.

How It Works

Design

The Vector test harness is a mix of bash, Terraform, and Ansible scripts. Each test case lives in the /cases directory and has full reign of it's bootstrap and test process via it's own Terraform and Ansible scripts. The location of these scripts is dictated by the test script and is outlined in more detail in the Adding a test section. Each test falls into one of 2 categories: performance tests and correctness tests:

Performance tests

Performance tests measure performance and MUST capture detailed performance data as outlined in the Performance Data and Rules sections.

In addition to the test script, there is a compare scripts. This script analyzes the performance data captured when executing a test. More information on this data and how it's captured and analyzed can be found in the Performance Data section. Finally, each script includes a usage overview that you can access with the --help flag.

Performance data

Performance test data is captured via dstat, which is a lightweight utility that captures a variety of system statistics in 1-second snapshot intervals. The final result is a CSV where each row represents a snapshot. You can see the dstat command used in the ansible/roles/profiling/start.yml file.

Performance data schema

The performance data schema is reflected in the Athena table definition as well as the CSV itself. The following is an ordered list of columns:

Name Type
epoch double
cpu_usr double
cpu_sys double
cpu_idl double
cpu_wai double
cpu_hiq double
cpu_siq double
disk_read double
disk_writ double
io_read double
io_writ double
load_avg_1m double
load_avg_5m double
load_avg_15m double
mem_used double
mem_buff double
mem_cach double
mem_free double
net_recv double
net_send double
procs_run double
procs_bulk double
procs_new double
procs_total double
sys_init double
sys_csw double
sock_total double
sock_tcp double
sock_udp double
sock_raw double
sock_frg double
tcp_lis double
tcp_act double
tcp_syn double
tcp_tim double
tcp_clo double
Performance data location

All performance data is made public via the vector-tests S3 bucket in the us-east-1 region. The partitioning structure follows the Hive partitioning structure with variable names in the path. For example:

name=tcp_to_tcp_performance/configuration=default/subject=vector/version=v0.2.0-dev.1-20-gae8eba2/timestamp=1559073720

And the same in a tree form:

name=tcp_to_tcp_performance/
  configuration=default/
    subject=vector/
      version=v0.2.0-dev.1-20-gae8eba2/
        timestamp=1559073720
  • name = the test name.
  • configuration = refers to the test's specific configuration (tests can have multiple configurations if necessary).
  • subject = the test subject, such as vector.
  • version = the version fo the test subject.
  • timestamp = when the test was executed.
Performance data analysis

Analysis of this data is performed through the AWS Athena service. This allows us to execute complex queries on the performance data stored in S3. You can see the queries ran in the compare script.

Correctness tests

Correctness tests simply verify behavior. These tests are not required to capture or to persist any data. The results can be manually verified and placed in the test's README.

Correctness data

Since correctness tests are pass/fail there is no data to capture other than the successful running of the test.

Correctness output

Generally, correctness tests verify the output. Because of the various test subjects, we use a variety of output methods to capture output (tcp, http, and file). This is highly dependent on the test subject and the methods available. For example, the Splunk Forwarders only support TCP and Splunk specific outputs.

To make capturing this data easy, we created a test_server Ansible role that spins up various test servers and provides a simple way to capture summary output.

Environments

Tests must operate in isolated reproducible environments, they must never run locally. The obvious benefit is that it removes variables across tests, but it also improves collaboration since remote environments are easily accessible and reproducible by other engineers.

Rules

  1. ALWAYS filter to resources specific to your test_name, test_configuration, and user_id (ex: ansible host targeting)
  2. ALWAYS make sure the initial instance state is identical across test subjects. We recommend explicitly stopping all test subjects to properly handle the case of preceding failure and the situation where a subject was not cleanly shutdown.
  3. ALWAYS use the profile ansible role to capture data. This ensures a consistent data structure across tests.
  4. ALWAYS run performance tests for at least 1 minute to calculate a 1m CPU load average.
  5. Use ansible roles whenever possible.
  6. If you are not testing local data collection we recommend using TCP as a data source since it is a lightweight source that is more likely to be consistent, performance wise, across subjects.