
Integration Test Analysis


The first thing to note about Druid integration tests is that they are a mess. The purpose of this page is to sort out that mess.

Integration tests live in the druid-integration-tests module.

Maven Lifecycle Mapping

Maven defines a lifecycle for projects. Druid integration tests map into that lifecycle (somewhat incorrectly) as follows:

  • process-resources:
    • copy ${project.build.outputDirectory}/wikipedia_hadoop_azure_input_index_task_template.json to ${project.build.outputDirectory}/wikipedia_hadoop_azure_input_index_task.sh
    • copy ${project.build.outputDirectory}/wikipedia_hadoop_s3_input_index_task_template.json to ${project.build.outputDirectory}/wikipedia_hadoop_s3_input_index_task.json
    • copy ${project.build.outputDirectory}/copy_resources_template.sh to target/gen-scripts/copy_resources.sh
    • ...
  • pre-integration-test: sets a number of env vars, then runs build_run_cluster.sh to build the Docker images and start them.
  • integration-test: the verify goal invokes the DruidTestRunnerFactory via TestNG.
  • post-integration-test: sets a number of env vars, then runs stop_cluster.sh to stop the Docker containers.
  • ...
  • verify: verifies the results of the integration tests.

Notes:

  • WTF is going on with copying a JSON file to an .sh file?
  • The process-resources phase occurs early; it is not clear whether those files are already in the output directory at that point.
  • The copy_resources.sh file was copied into the source tree. An in-flight PR moves it into the target tree where all artifacts should live.
  • Since test output is verified in the verify phase, the README.md for the integration tests says to run to the verify phase.
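
One way to check which executions actually bind to which phase is to dump the effective POM for the module and look for the declared <phase> elements. This is a sketch, run from the Druid source root:

mvn -pl integration-tests help:effective-pom | grep -B 2 -A 2 '<phase>'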

Maven Profiles

The profiles which affect integration tests:

  • hadoop3: Sets a number of Maven variables to include Hadoop 3 jars.
  • integration-tests: (in the druid-integration-tests module) builds Docker images and runs tests.
  • integration-test: (in the distribution module) builds the integration test output directory via integration-test-assembly.xml

Notes:

  • Note the two spellings of the profile: integration-test (singular) and integration-tests (plural). The integration-test profile has the same name as a Maven phase, which is a bit confusing.
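
The maven-help-plugin can list every profile visible to these modules, which makes the singular/plural naming split easy to confirm (a sketch, run from the source root):

mvn -pl distribution,integration-tests help:all-profiles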

Ad-hoc Build

The pre-integration-test phase invokes build_run_cluster.sh, which invokes the copied copy_resources.sh (generated from copy_resources_template.sh), which in turn launches another Maven build:

mvn -DskipTests -T1C -Danimal.sniffer.skip=true -Dcheckstyle.skip=true -Ddruid.console.skip=true -Denforcer.skip=true -Dforbiddenapis.skip=true -Dmaven.javadoc.skip=true -Dpmd.skip=true -Dspotbugs.skip=true install -Pintegration-test

Notes:

  • The above line passes -Pintegration-test (singular), which activates the distribution module's integration-test profile from the list above. That profile shares its name with a Maven phase, and is distinct from the integration-tests (plural) profile used to run the tests themselves.
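
To see what -Pintegration-test actually activates, ask Maven directly (a sketch, run from the source root):

mvn -pl distribution -Pintegration-test help:active-profiles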

Copying Resources

The pom.xml file contains this section to copy certain resources to the output directory:

            <plugin>
                <artifactId>maven-resources-plugin</artifactId>
                <groupId>org.apache.maven.plugins</groupId>
                <configuration>
                    <outputDirectory>${project.build.outputDirectory}</outputDirectory>
                    <resources>
                        <resource>
                            <directory>script</directory>
                            <includes>copy_resources_template.sh</includes>
                            <filtering>true</filtering>
                        </resource>
                        <resource>
                            <directory>src/test/resources/hadoop/</directory>
                            <includes>*template.json</includes>
                            <filtering>true</filtering>
                        </resource>
                        <resource>
                            <directory>src/test/resources</directory>
                            <filtering>false</filtering>
                        </resource>
                        <resource>
                            <directory>src/main/resources</directory>
                            <filtering>false</filtering>
                        </resource>
                    </resources>
                </configuration>
            </plugin>

Some issues with the above:

  • Per this documentation, the last two entries above are either wrong or unnecessary. The "standard" resources are copied automatically to the correct output locations.
  • Per this documentation, and the earlier reference, the list of resources must be within an execution section that specifies the copy-resources goal and binds to a phase. Since that is not done here, it is very likely that the entire section is a no-op. However, an inspection of the output suggests that the rules did run, though it is not clear in which phase.
  • According to this documentation the target ${project.build.outputDirectory} is the target/classes folder, which is decidedly not the place to put resources.
  • If the above ran (which it probably did not), it would create a second copy of the main resources, put the test resources into the main classes folder, and add shell scripts to the classes folder. This is probably all wrong.
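
A quick sanity check, assuming a prior local build has populated target/classes, is to look at what actually landed in the module's classes directory:

ls integration-tests/target/classes | grep -E 'template|\.sh$'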

Docker Build Scripts

The Docker build process is convoluted and just plain wrong in many ways.

The integration-tests/pom.xml file:

  • Incorrectly copies both test and compile resources into the compile target directory.
  • Adds other resources to the compile target directory (but not within a subdirectory, polluting the root namespace as a result).
  • Launches the build process in the pre-integration-test phase: build_run_cluster.sh.
  • (Somehow) copies script/copy_resources_template.sh to gen-scripts/copy_resources.sh (in the source tree, not target!)

build_run_cluster.sh:

  • Creates a shared directory (as in, shared into the container) at ~/shared (note, this is outside of the build hierarchy!)
  • Creates a file called docker_ip within the integration-tests/docker source directory (not in the target directory!)
  • Creates keys in the integration-tests/docker/tls source directory (should use target).
  • Runs gen-scripts/copy_resources.sh (copied above).
  • Runs script/docker_build_containers.sh
  • Runs stop_cluster.sh to stop the cluster.
  • Runs script/docker_run_cluster.sh to start the cluster
  • Runs script/copy_hadoop_resources.sh to do what the name suggests.

Issues:

  • Incorrect use of resources.
  • Unnecessary copying of a file.
  • Derived files are placed in the source directory tree.
  • Scripts are scattered in multiple locations.

Docker Compose

The tests make use of Docker Compose (docker-compose) to run the cluster. A single image is used for all services.

docker_compose_args.sh defines the function getComposeArgs() to select the YAML files to use based on:

  • DRUID_INTEGRATION_TEST_GROUP: one of the test groups
  • DRUID_INTEGRATION_TEST_INDEXER: either indexer or middleManager. (The indexer option supports a subset of tests.)

Actual configuration is done in a large number of docker/docker-compose*.yml files. These files define some number of Druid services, sometimes running the same service twice on different ports. These files are similar to those described in the documentation. The basic idea is that there are environment variables that map to config file entries, with a process to generate the files from the environment variables.
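
The convention is that a druid_-prefixed environment variable becomes a runtime property, with underscores rewritten as dots. The druid.sh logic shown later on this page does the rewriting; the property below is only an illustration:

# in a compose environment config:
druid_processing_buffer_sizeBytes=25000000
# becomes, in the generated runtime.properties:
#   druid.processing.buffer.sizeBytes=25000000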

There is one Druid or external service (in the Docker sense) per container. Services include:

  • Zookeeper plus Kafka
  • Metadata storage (MySQL)
  • Coordinator (one or two)
  • Overlord (one or two)
  • Broker
  • Router
  • Custom node role
  • Etc.

Comments:

  • Druid uses an ad-hoc way to define the various services. Compose provides a "profile" which is a simpler approach.
  • Druid uses an ad-hoc assortment of environment variables to configure services, but Compose provides a "config" option which is more general, and an "env file" feature which may be a simpler way to set environment variables.

Suggestions:

  • Use the test group name as a profile. Use profiles to enable services. Reduce the resulting number of YAML files and thus the large amount of duplication in the current design.
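
A sketch of what profile-based selection could look like, assuming the services were tagged with one Compose profile per test group (the profile name below is illustrative):

docker compose --profile high-availability up -d
docker compose --profile high-availability down

With profiles, a single compose file can describe all services, and each test group simply selects the subset it needs.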

Explicit Dependency Population

The distribution/pom.xml file, for the integration-tests profile, invokes a Druid command that appears to bypass Maven to populate the local Maven repository with a selected set of dependencies. The class in question is PullDependencies which explains the implementation, but not purpose, of this effort.

The dependencies populated are mostly Druid build artifacts. Shouldn't these already be in the local repository as a result of the install action? Or, perhaps install comes too late and so the integration tests need to pull the artifacts from a location other than the build itself? If so, aren't we then pulling artifacts other than those we are building and trying to test?

The actual fact appears to be that Maven builds projects recursively. Each project is built through the entire lifecycle, as shown by inspecting the details of a build:

[INFO] --- maven-install-plugin:2.3.1:install (default) @ druid-lookups-cached-global ---
[INFO] Installing /Users/paul/git/druid/extensions-core/lookups-cached-global/target/druid-lookups-cached-global-0.23.0-SNAPSHOT.jar to /Users/paul/.m2/repository/org/apache/druid/extensions/druid-lookups-cached-global/0.23.0-SNAPSHOT/druid-lookups-cached-global-0.23.0-SNAPSHOT.jar
[INFO] Installing /Users/paul/git/druid/extensions-core/lookups-cached-global/pom.xml to /Users/paul/.m2/repository/org/apache/druid/extensions/druid-lookups-cached-global/0.23.0-SNAPSHOT/druid-lookups-cached-global-0.23.0-SNAPSHOT.pom

The above is for one Druid project; the others follow the same pattern. Because each module is installed as it is built, its artifacts land in the local repository, where dependent modules (and later builds) can resolve them.

As a result, the local repository already contains all the needed artifacts. If the integration-tests profile somehow found it must use a tool to download them, then something is wrong. Perhaps the integration tests are built without the install step? Perhaps the module placement is wrong and modules are missing? Whatever the reason, the logic is wrong.
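
A quick check that the local repository already holds the just-built artifacts (paths follow the install log above):

find ~/.m2/repository/org/apache/druid -name '*-SNAPSHOT.jar' | head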

Furthermore, if the code above actually downloads Druid build artifacts from a repository, then the local Maven repository is corrupted: its contents are not what was built. A subsequent incremental build will put the build in an inconsistent state as some artifacts will be from a source other than the local source code. (It should go without saying that a corrupt build is seldom a helpful situation.)

Proposed change: remove the tool invocation. Determine how to use the artifacts from the current build.

Commands

The command line from the README.md is:

mvn verify -P integration-tests 

This seems to say:

  • Run the package step using the integration-tests profile.
  • Run to the verify phase, which includes the integration-test phase that runs the tests.
  • The verify phase itself checks the results of the integration tests.
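
Typical invocations then look like the following. The -Dgroups property is the usual way a TestNG group is selected, and the group name here is illustrative; treat the README as the authority on the exact flags:

mvn verify -P integration-tests
mvn verify -P integration-tests -Dgroups=query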

Docker File

See integration-tests/docker/Dockerfile and scripts/docker_build_containers.sh. The local working directory is $SHARED_DIR/docker.

Typical command line:

  docker build -t druid/cluster --build-arg ZK_VERSION --build-arg KAFKA_VERSION --build-arg CONFLUENT_VERSION --build-arg MYSQL_VERSION --build-arg MARIA_VERSION --build-arg MYSQL_DRIVER_CLASSNAME $SHARED_DIR/docker

This is:

  • -t druid/cluster - Set a tag on the resulting image.
  • --build-arg ZK_VERSION - Pass a build argument into the Dockerfile.
  • $SHARED_DIR/docker - Specify the build context (the working directory for the build).

Note the bare form --build-arg ZK_VERSION (with no =value): Docker takes the value from the exported environment variable of the same name, so it is equivalent to the more explicit --build-arg ZK_VERSION=$ZK_VERSION.
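
Both forms below are equivalent when the calling script has exported the variable (the version value is illustrative):

export ZK_VERSION=3.5.9
docker build -t druid/cluster --build-arg ZK_VERSION             $SHARED_DIR/docker
docker build -t druid/cluster --build-arg ZK_VERSION=$ZK_VERSION $SHARED_DIR/docker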

Thus, the context is ~/shared/docker which contains:

Dockerfile					docker-compose.yml
base-setup.sh					docker_ip
client_tls					druid.sh
docker-compose.base.yml				environment-configs
docker-compose.cli-indexer.yml			ldap-configs
docker-compose.druid-hadoop.yml			run-mysql.sh
docker-compose.high-availability.yml		schema-registry
docker-compose.ldap-security.yml		service-supervisords
docker-compose.query-error-test.yml		supervisord.conf
docker-compose.query-retry-test.yml		test-data
docker-compose.schema-registry-indexer.yml	tls
docker-compose.schema-registry.yml		wiki-simple-lookup.json
docker-compose.security.yml

The prior scripts set up this directory.

The Dockerfile itself:

  • Base image is the target JDK version: FROM openjdk:$JDK_VERSION as druidbase
  • Generates a single Docker image called druidbase
  • Requires a set of arguments:
    • JDK_VERSION
    • KAFKA_VERSION (not used)
    • ZK_VERSION (not used)
    • APACHE_ARCHIVE_MIRROR_HOST
    • MYSQL_VERSION
    • MARIA_VERSION
    • MYSQL_DRIVER_CLASSNAME
    • CONFLUENT_VERSION
  • Uses a combination of files copied in and downloaded files that are then modified
  • Starts and stops MySQL twice
    • Once to create the metadata store (surprisingly, we have no existing script for this)
    • A second time to run Druid to create metastore tables.
  • Performs some operations in a script, others via Docker RUN commands.
  • Uses Perl (!) to adjust Kafka properties
  • Exposes a great number of ports, including those used by Druid and ZK.
  • Entry point "work dir" is /var/lib/druid
  • Has a complex entry point rather than just using a script.
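
A sketch of the "just a script" alternative suggested in the last bullet. The steps mirror the inline ENTRYPOINT shown later on this page; launch.sh itself is a hypothetical name:

#!/bin/bash
# launch.sh (hypothetical): same steps as the current inline ENTRYPOINT
set -e
/tls/generate-server-certs-and-keystores.sh
. /druid.sh
setupConfig        # build config files from druid_* environment variables
setupData          # load sample data for the groups that need it
export DRUID_SERVICE_CONF_DIR="$(getConfPath ${DRUID_SERVICE})"
export DRUID_COMMON_CONF_DIR="$(getConfPath _common)"
exec /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf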

Public Docker File (for Tutorial)

  • Resides in distribution/docker
  • Does its own build inside a docker image (!)
  • Uses a different set of base images than the one used for testing.

Docker Compose Structure

The Docker compose files configure the cluster. There is one file per test group; in fact, the test group is defined by the Compose file it uses. For example: docker-compose.high-availability.yml is the file for the high-availability test group.

Each of these files depends on a base file: docker-compose.base.yml which configures the common services and sets up defaults for the Druid services.

Service Launch

The test containers do not use the out-of-the-box Druid configs or launch scripts: it is all bespoke. The process is:

  • A druid.sh file exists in the public Docker image to allow configs to be set via environment variables.
  • That file was cloned, and extended, in integration tests to add functionality.
  • The Dockerfile wraps that script to create TLS keys and to register sample data.
  • A set of supervisord scripts does the work of the launch.

For launch:

  • Generate TLS keys for the server instance.
  • Use the env vars (set in the docker compose files) to edit a bespoke set of config files in /tmp/conf.
  • Launch MySQL and install some S3 keys, etc. for sample data, then shut it down. (Note that this is done for every service, so it is done multiple times per cluster.)
  • Set up some config variables used by Druid to point to the configuration files.
  • Launch supervisord with the service launch script.
  • In druid.conf, assemble the command line from 7 different environment variables.

Expanded:

  • Each test group is run separately as a distinct Maven job and requires a full build of Druid (actually two builds, as above.) Each provides a set of options as described in README.md: "-Doverride.config.path=<PATH_TO_FILE> with your Cloud credentials".
  • The integration-tests/pom.xml launches the build_run_cluster.sh script to build and run the cluster. Much of this is explained above. For the config file: <DRUID_INTEGRATION_TEST_OVERRIDE_CONFIG_PATH>${override.config.path}</DRUID_INTEGRATION_TEST_OVERRIDE_CONFIG_PATH>
  • The above builds the containers and the shared directories, as explained above.
  • The above calls integration-tests/script/docker_run_cluster.sh to start the cluster.
  • 'docker_run_cluster.sh' sources script/docker_compose_args.sh to map from test group name to Docker compose file and other args.
  • Checks for the existence of DRUID_INTEGRATION_TEST_OVERRIDE_CONFIG_PATH for certain tests.

Missing:

  • How are supervisord.confs placed in the container?
  • The pom.xml file is "one pass": it starts the cluster and runs tests. How do we get specific test group launches?

Druid Configuration

Druid configuration is difficult to work with in normal times, and Docker makes the problem worse. Basically, Druid config consists of a set of static files. The distribution includes a variety of configurations. Tests want to override various properties. But, Druid has no form of configuration inheritance, forcing the code into a variety of ad-hoc solutions.

Config layers, in order of priority:

  • DRUID_INTEGRATION_TEST_OVERRIDE_CONFIG_PATH
  • Base compose environment variables
  • Group-specific compose environment variables.
  • Hard-coded defaults.
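
One way to see which layer actually won for a given property is to inspect the environment of a running container (the container name is assumed to match the druid-overlord service name in docker-compose.base.yml):

docker exec druid-overlord env | grep '^druid_' | sort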

Existing Solution

The Docker compose files define a set of services, e.g. docker-compose.cli-indexer.yml. For each service, the compose files reference one or more "environment configs." Example:

  druid-overlord:
    extends:
      file: docker-compose.base.yml
      service: druid-overlord
    environment:
      - DRUID_INTEGRATION_TEST_GROUP=${DRUID_INTEGRATION_TEST_GROUP}
    depends_on:
      - druid-metadata-storage
      - druid-zookeeper-kafka
  • These environment configs define specially encoded environment variables (druid_-prefixed keys that are later rewritten into Druid runtime properties).

Service Launch Details

The images use supervisor to run processes. An important feature of Supervisor is that the set of program(s) to run is defined statically as a set of files. This is a challenge when creating a single image that can run multiple kinds of services. Druid works around this by only mounting the service configs needed for a given container. This works, but has much redundancy. See this article for more background.

Non-Druid Services

There are three services other than Druid:

  • ZooKeeper
  • Kafka
  • MySQL

It seems that ZK and Kafka are run together in one container, MySQL in another.

  • Each resides in /usr/local/<proj>.
  • A Supervisord script launches the service. Example for Kafka:
[program:kafka]
command=/usr/local/kafka/bin/kafka-server-start.sh /usr/local/kafka/config/server.properties
priority=0
stdout_logfile=/shared/logs/kafka.log
  • docker-compose.base.yml defines the container:
  druid-zookeeper-kafka:
    image: org.apache.druid/test:${DRUID_VERSION}
    container_name: druid-zookeeper-kafka
    ...
    volumes:
      - ${HOME}/shared:/shared
      - ./service-supervisords/zookeeper.conf:/usr/lib/druid/conf/zookeeper.conf
      - ./service-supervisords/kafka.conf:/usr/lib/druid/conf/kafka.conf
    env_file:
      - ./environment-configs/common
  • docker-compose.<group>.yml defines the actual service as a reference:
  druid-zookeeper-kafka:
    extends:
      file: ../../docker/docker-compose.base.yml
      service: druid-zookeeper-kafka
  • The environment-configs/common file defines Druid config: it is not clear why (or if) it is needed for non-Druid services.
  • The entry point (check) does (what) to launch supervisord.
  • The common config file, /etc/supervisor/conf.d/supervisord.conf includes any config files in /usr/lib/druid/conf/*.conf:
[supervisord]
nodaemon=true
logfile = /shared/logs/supervisord.log

[include]
files = /usr/lib/druid/conf/*.conf

MySQL

Several scripts start MySQL, do stuff, and shut it down. The problem here is that the database is not shared: creating it in one image doesn't actually affect the DB created in the actual MySQL image.

Druid configuration

  • docker-compose.*.yml files list all Druid properties as env vars.
  • Compose inheritance merges test-specific settings with base settings.
  • environment-configs/common has common properties for all services, and JVM args
  • These files are composed in the compose files:
    env_file:
      - ./environment-configs/common
      - ./environment-configs/overlord
      - ${OVERRIDE_ENV}
  • (What does OVERRIDE_ENV do? Where is it set?)
  • The configs set the location of the shared folders for logs.
  • The configs identify the names of the dependent services, such as MySQL or ZK.
  • The service specific env files, such as coordinator, identify the service:
DRUID_SERVICE=coordinator
  • Docker compose, and Docker, then set these within the environment of the container.
  • The entrypoint uses bits of druid.sh to set up the configuration:
ENTRYPOINT /tls/generate-server-certs-and-keystores.sh \
            && . /druid.sh \
            # Create druid service config files with all the config variables
            && setupConfig \
            # Some test groups require pre-existing data to be setup
            && setupData \
            # Export the service config file path to use in supervisord conf file
            && export DRUID_SERVICE_CONF_DIR="$(. /druid.sh; getConfPath ${DRUID_SERVICE})" \
            # Export the common config file path to use in supervisord conf file
            && export DRUID_COMMON_CONF_DIR="$(. /druid.sh; getConfPath _common)" \
            # Run Druid service using supervisord
            && exec /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf
  • setupConfig parses the environment variables to create the config files:
setupConfig()
{
  echo "$(date -Is) configuring service $DRUID_SERVICE"

  # We put all the config in /tmp/conf to allow for a
  # read-only root filesystem
  mkdir -p /tmp/conf/druid

  COMMON_CONF_DIR=$(getConfPath _common)
  SERVICE_CONF_DIR=$(getConfPath ${DRUID_SERVICE})

  mkdir -p $COMMON_CONF_DIR
  mkdir -p $SERVICE_CONF_DIR
  touch $COMMON_CONF_DIR/common.runtime.properties
  touch $SERVICE_CONF_DIR/runtime.properties

  setKey $DRUID_SERVICE druid.host $(resolveip -s $HOSTNAME)
  setKey $DRUID_SERVICE druid.worker.ip $(resolveip -s $HOSTNAME)

  # Write out all the environment variables starting with druid_ to druid service config file
  # This will replace _ with . in the key
  env | grep ^druid_ | while read evar;
  do
      # Can't use IFS='=' to parse since var might have = in it (e.g. password)
      val=$(echo "$evar" | sed -e 's?[^=]*=??')
      var=$(echo "$evar" | sed -e 's?^\([^=]*\)=.*?\1?g' -e 's?_?.?g')
      setKey $DRUID_SERVICE "$var" "$val"
  done
}
  • The Supervisor launch script combines the information to actually launch Druid:
[program:druid-service]
command=java %(ENV_SERVICE_DRUID_JAVA_OPTS)s %(ENV_COMMON_DRUID_JAVA_OPTS)s -cp %(ENV_DRUID_COMMON_CONF_DIR)s:%(ENV_DRUID_SERVICE_CONF_DIR)s:%(ENV_DRUID_DEP_LIB_DIR)s org.apache.druid.cli.Main server %(ENV_DRUID_SERVICE)s
redirect_stderr=true
priority=100
autorestart=false
stdout_logfile=%(ENV_DRUID_LOG_PATH)s
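
With the %(ENV_*)s placeholders expanded, the supervised command is roughly the following (the variable values come from the environment configs described above):

java $SERVICE_DRUID_JAVA_OPTS $COMMON_DRUID_JAVA_OPTS \
  -cp $DRUID_COMMON_CONF_DIR:$DRUID_SERVICE_CONF_DIR:$DRUID_DEP_LIB_DIR \
  org.apache.druid.cli.Main server $DRUID_SERVICE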

Test Internals

  • Based on the long-obsolete TestNG (the linked TestNG documentation returns a 404).
  • IntegrationTestingConfig holds the configuration of the cluster, etc.
    • ConfigFileConfigProvider creates an instance from a pile of information from a JSON config file.
  • Injected into the test.
  • Startup is via the TestNG class ITestRunnerFactory and its subclass DruidTestRunnerFactory.
  • testng.xml provides the list of tests, which reside in src/test under org.apache.druid.tests and below.
  • TestNGGroup lists the test groups.
  • SuiteListener does suite (group?) specific setup.
    • DruidTestModuleFactory defines an injector.
  • DruidTestModule seems to be the only test-specific module.
    • Uses a Properties-style config file, with keys under druid.test.config.
    • Properties seem to be passed in from the pom.xml file.
    • Creates a dummy self node for tests.

Shared Directory Contents

ls ~/shared
docker			hadoop-dependencies	logs			tasklogs
druid			hadoop_xml		storage			wikiticker-it

~/shared/docker
Dockerfile					docker-compose.yml
base-setup.sh					docker_ip
client_tls					druid.sh
docker-compose.base.yml				environment-configs
docker-compose.cli-indexer.yml			ldap-configs
docker-compose.druid-hadoop.yml			run-mysql.sh
docker-compose.high-availability.yml		schema-registry
docker-compose.ldap-security.yml		service-supervisords
docker-compose.query-error-test.yml		supervisord.conf
docker-compose.query-retry-test.yml		test-data
docker-compose.schema-registry-indexer.yml	tls
docker-compose.schema-registry.yml		wiki-simple-lookup.json
docker-compose.security.yml