In addition to unit tests, the code in src/test
contains harnesses for
integration testing and benchmarking against Pokémon Showdown.
Due to the various build options (e.g. -Dshowdown or -Dlog) and the stochastic nature of Pokémon
as a game, testing the pkmn engine requires a little extra work. Helper functions exist to remove
the majority of the boilerplate from the library's unit tests:
- Test: the main helper type for testing. A test can be initialized with Test(rolls).init(p1, p2) (Test.deinit() should be defer-ed immediately after initialization to free resources), expected updates and logs can be tracked on the expected fields, and finally the actual state can be verify-ed at the end of the test. expectProbability can be used to check probabilities when -Dchance is enabled, and if -Dcalc is also enabled each update gets rerun on the original state with the chance actions from the original update used as overrides to ensure all RNG is accounted for.
- Battle.fixed: under the hood, Test uses this helper to create a battle with a FixedRNG that returns a fixed sequence of results ("rolls"), which provides complete control over whether or not events should occur. One problem is that the -Dshowdown Pokémon Showdown compatibility mode requires a different number and order of rolls, meaning both must be specified. Furthermore, at the end of the test it's important to verify that all of the rolls provided were actually required with try expect(battle.rng.exhausted()), as unexpectedly unused rolls could point to bugs (Test.verify() automatically checks that the rng is exhausted).
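The idea behind a fixed RNG and its exhaustion check can be sketched in a few lines. This is a minimal TypeScript illustration, not the engine's Zig implementation: the FixedRNG class below, the resolveMove function, and the thresholds it uses are all hypothetical names and values chosen for the example.

```typescript
// Hypothetical illustration only: none of these names are the engine's API.
class FixedRNG {
  private readonly rolls: readonly number[];
  private index = 0;

  constructor(rolls: readonly number[]) {
    this.rolls = rolls;
  }

  // Return the next scripted roll, failing loudly if too few were provided.
  next(): number {
    if (this.index >= this.rolls.length) {
      throw new Error(`FixedRNG ran out of rolls after ${this.index} calls`);
    }
    return this.rolls[this.index++];
  }

  // True once every scripted roll has been consumed.
  exhausted(): boolean {
    return this.index === this.rolls.length;
  }
}

// Toy effect that needs up to two rolls: accuracy, then a secondary effect.
// The thresholds are made up for the example.
function resolveMove(rng: FixedRNG): {hit: boolean; secondary: boolean} {
  const hit = rng.next() < 230;
  const secondary = hit && rng.next() < 77;
  return {hit, secondary};
}

// Both rolls are needed when the move hits, so the RNG ends up exhausted.
const hitRng = new FixedRNG([100, 200]);
const hitResult = resolveMove(hitRng);
console.log(hitResult, hitRng.exhausted());

// On a miss the second roll is never requested: the leftover roll signals that
// the test's expectations and the code under test disagree.
const missRng = new FixedRNG([250, 10]);
const missResult = resolveMove(missRng);
console.log(missResult, missRng.exhausted());
```

In the second case the scripted sequence is longer than what the code actually consumed, which is exactly the situation the exhaustion check is meant to surface.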
The pkmn engine aims to match Pokémon Showdown when run in -Dshowdown
compatibility mode, but
unfortunately it's impossible to match Pokémon Showdown's behavior without also duplicating its
incorrect architecture and event/handler/action system due to how this architecture results in many
artificial "speed ties" which cause RNG frame advances. This is deemed to be out of scope for the
pkmn engine, as it seeks to match Pokémon Showdown purely for practical reasons (to leverage for
integration testing purposes/to provide more "accurate" playouts for AI applications built to play
on Pokémon Showdown) only, and adding the byzantine logic and fields required to be able to
perfectly replicate Pokémon Showdown's bugs simply distracts from the goal of building an optimal
Pokémon battle engine.
In order to reconcile this, the pkmn engine instead aims to match a patched version of Pokémon Showdown, where minimal changes have been made to Pokémon Showdown to improve correctness and eliminate unnecessary nondeterministic elements:
- Battle#eachEvent and Battle#residualEvent have been changed to not perform a Battle#speedSort in Generation I and II, which should result in events being executed in the order they're added, ultimately resulting in Player 1's events occurring before Player 2's regardless of speed, effectively recreating the cartridge's default "host" ordering semantics
- BattleQueue#insertChoice is patched to also obey "host" ordering in Generation I and II
- "priorities" have been added to various handler functions to break speed ties and ensure that either there are no unnecessary rolls or events deterministically get resolved in the order they're resolved on the cartridge
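To see why skipping the sort matters, consider a toy model of the two orderings. This sketch is not Pokémon Showdown's code: QueuedEvent, CountingRNG, speedSort, and hostOrder are hypothetical stand-ins, and the PRNG is a textbook LCG used only to count calls. It assumes Player 1's events are added to the queue first.

```typescript
interface QueuedEvent {
  player: 1 | 2;
  speed: number;
}

// Counts calls so the cost of tie-breaking is visible. The LCG constants are a
// textbook choice for illustration, not either project's PRNG.
class CountingRNG {
  calls = 0;
  private state: number;

  constructor(seed: number) {
    this.state = seed >>> 0;
  }

  // Returns an integer in [0, n) and records that a frame was consumed.
  next(n: number): number {
    this.calls++;
    this.state = (Math.imul(this.state, 1664525) + 1013904223) >>> 0;
    return this.state % n;
  }
}

// Faster events first; every run of tied speeds is shuffled, consuming rolls.
function speedSort(events: readonly QueuedEvent[], rng: CountingRNG): QueuedEvent[] {
  const sorted = [...events].sort((a, b) => b.speed - a.speed);
  let i = 0;
  while (i < sorted.length) {
    let j = i + 1;
    while (j < sorted.length && sorted[j].speed === sorted[i].speed) j++;
    // Fisher-Yates shuffle restricted to the tied run [i, j).
    for (let k = j - 1; k > i; k--) {
      const r = i + rng.next(k - i + 1);
      const tmp = sorted[k];
      sorted[k] = sorted[r];
      sorted[r] = tmp;
    }
    i = j;
  }
  return sorted;
}

// "Host" ordering: keep the order in which events were added, with no RNG.
function hostOrder(events: readonly QueuedEvent[]): QueuedEvent[] {
  return [...events];
}

// Player 1's event is added first; both events share the same speed.
const tied: QueuedEvent[] = [{player: 1, speed: 100}, {player: 2, speed: 100}];
const tieRng = new CountingRNG(42);
speedSort(tied, tieRng);
console.log('speedSort RNG calls:', tieRng.calls);
console.log('hostOrder:', hostOrder(tied).map(e => e.player));
```

The speed-sorted path has to spend a roll on the tie, whereas the host-ordered path is deterministic and leaves the RNG untouched.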
These patches do not fix Pokémon Showdown implementation bugs beyond a subset of speed tie
semantics, and do not fix all issues regarding unnecessary RNG frame advances from speed ties
(e.g. moves with a beforeTurnCallback
on Pokémon Showdown still potentially result in speed tie
rolls); they simply aim to make minimally intrusive changes that allow for Pokémon Showdown behavior
to be reproduced by the pkmn engine. These patches should also strictly result in a performance
improvement compared to vanilla Pokémon Showdown, as they cause Pokémon Showdown to perform less
sorting and fewer RNG frame advances than it otherwise would, which effectively "steelmans" the
implementation for benchmarking purposes.
In order to verify Pokémon Showdown's behavior, many of the pkmn engine's unit tests are mirrored in
the showdown
directory. It should be emphasized that these are tests
against patched Pokémon Showdown, not the pkmn engine (engine code isn't being
tested). Pokémon Showdown's own unit tests are inadequate for the pkmn engine's purposes as they
mostly cover the latest generation, don't use a fixed RNG, and don't verify logs (both of which
are crucial for matching Pokémon Showdown's RNG and output).
The following guidelines should be taken into consideration when adding new unit tests:
- Support should be added to the generate tool for the generation in question - npm run generate -- tests <GEN> is used to first generate stubs for all of the effects that need to be tested (the generated stubs need to be massaged quite a bit, but serve as a good first rough draft).
- Tests should be ordered such that they match the order from previous generations as closely as possible, and new effects should be grouped with similar effects. General engine/battle flow test cases should also be preserved from past generations where applicable.
- Effects should have all of the behavior outlined in their descriptions on Bulbapedia and Smogon tested (though these sources shouldn't be assumed to be correct). Additionally, any reported Pokémon Showdown bugs and all documented cartridge glitches should be tested.
- Copy as much as possible from previous generations' test cases - preserving the original test makes it easier to see how much of the behavior has changed vs. remained over the generations. If the behavior has significantly diverged then coming up with a brand new test case might be preferable.
- Ideally, every species, item, move, ability, etc should show up at least once in the test file - try to use as diverse a range of options as possible, though prefer movesets which are "natural" (choose similarly tiered species using moves that occur in their movesets, prefer that "signature" moves and abilities are present on the correct evolutionary lines, etc).
- If creating a test for a Pokémon Showdown bug or a video demonstrating a glitch, prefer to match the original reproduction's setup.
- Prefer most tests use the "default" stats (level 100, full stat experience in Generation I & II and no Effort Values in Generation III+, etc).
- Long test cases that demonstrate the majority of an effect's behavior are preferable to individually scoped testing (this is contrary to most testing best practices, but minimizing the amount of setup necessary is deemed preferable to the alternative).
- For complex effects which interact with a wide variety of other effects (e.g. Substitute, Baton Pass), prefer to test just the "base" functionality in the complex effect's test case and its various interactions with other effects in the test case demonstrating the other effect's behavior.
- Avoid unnecessary rolls and log messages. Fall back on moves like Splash where reasonable and avoid test setups involving speed ties unless speed ties are specifically being tested.
- Avoid performing artificial alterations of the battle state mid-test unless required. Tests which require status should acquire it as part of the test (Pokémon Showdown's cleric clause means starting the battle with status is difficult), and HP/PP should ideally be manipulated before the test begins.
- Minimize the number of sub-tests that are created - they should only be necessary if they involve a vastly different setup than the primary test case.
- If behavior differs in single vs. double battle styles, test both. Otherwise choose a mix of single and double battles for testing.
The integration test exists to ensure the pkmn engine compiled in
Pokémon Showdown compatibility mode with -Dshowdown
produces comparable output to
patched Pokémon Showdown. For each supported generation, both Pokémon Showdown and the
pkmn engine are run with an
ExhaustiveRunner
that attempts to use as many different effects as possible in the battles it randomly simulates and
the results are collected. While Pokémon Showdown always produces its text protocol streams, pkmn
must be built specially to opt in to introspection support (-Dlog).
The pkmn binary protocol isn't expected to be equivalent to Pokémon Showdown for several reasons:
- pkmn doesn't have any notion of a 'format' or custom 'rules'
- the ordering of keyword arguments in Pokémon Showdown isn't strictly defined
- several of Pokémon Showdown's protocol messages are redundant/implementation specific
- pkmn always returns a single "stream" and always includes exact HP (i.e. Pokémon Showdown's "omniscient" stream) - other streams of information must be computed from this
- despite what it may claim, Pokémon Showdown does not implement the correct pseudo-random number generator for each format (it implements the Generation V & VI PRNG and applies it to all generations, and performs a different number of calls, with different arguments and in a different order than the cartridge)
The integration test contains logic to configure Pokémon Showdown to produce the correct results and to massage the output from the pkmn engine into something which can be compared to Pokémon Showdown. Care is taken to ensure that where they disagree the actual cartridge decompilations are used as the arbiter of correctness, but because Pokémon Showdown and the pkmn engine are both independent implementations of the actual Pokémon cartridge logic, it's still possible for them to agree with each other and yet both be incorrect with respect to the actual cartridge [1].
Most integration test failures result in new unit tests being added, though the failing logs are
also saved as fixtures which can then be replayed to protect
against regressions. The integration test also supports being run in
standalone mode for various durations, e.g. npm run integration -- --duration=15m
which can be
useful for fuzzing purposes.
Some of Pokémon Showdown's bugs are too convoluted to be implemented in the pkmn engine, even after patches are applied. The engine tries its best to reproduce the behavior of even the most misunderstood and broken mechanics of Pokémon Showdown, but in the same way that implementing the cartridge behavior correctly is difficult starting from Pokémon Showdown's architecture, implementing Pokémon Showdown's mechanics is also difficult starting from an architecture that mirrors the cartridge.
For the purposes of the benchmark one could simply choose to not generate any sets with problematic Pokémon / Items / Abilities / Moves, but for integration testing purposes it makes sense to add some complexity to be able to test as much as possible (teams are validated before starting a battle to ensure a battle isn't started with moves that have issues when used together, and during a battle if Pokémon Showdown is observed to be in an undesirable state it simply aborts and moves to the next battle).
Benchmarking the pkmn engine vs. Pokémon Showdown is slightly more complicated than simply using a
tool like hyperfine
due to the need to account for the
runtime overhead and warmup period required by V8 (hyperfine --warmup
is intended to help with
disk caching, not JIT warmup). As such, a custom benchmark tool exists
which can be used to run the benchmark. The benchmark measures how long it takes to play out N
randomly generated battles, excluding any set up time and time spent warming up the JS
configurations. This benchmark scenario is useful for approximating the Monte Carlo tree
search use case where various battles are
played out each turn to the end numerous times to determine the best course of action.
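The measurement described above (untimed warmup followed by timed playouts) might look roughly like the following. This is a hedged sketch rather than the benchmark tool's code: Playout, measure, and the toy playout are hypothetical, and Date.now() is used only to keep the example dependency-free.

```typescript
// Hypothetical stand-in for one configuration: plays the battle for `seed`
// to completion and returns its turn count.
type Playout = (seed: number) => number;

function measure(
  play: Playout, battles: number, warmup: number
): {duration: number; turns: number} {
  // Untimed warmup playouts give the JIT a chance to optimize hot paths.
  for (let i = 0; i < warmup; i++) play(i);

  let turns = 0;
  const begin = Date.now(); // millisecond clock; a real tool would want finer resolution
  for (let i = 0; i < battles; i++) turns += play(i);
  return {duration: Date.now() - begin, turns};
}

// Toy playout: a deterministic "turn count" derived from the seed.
const toy: Playout = seed => 10 + (seed % 7);
const measured = measure(toy, 1000, 100);
console.log(`${measured.turns} turns in ${measured.duration} ms`);
```

Only the second loop contributes to the reported duration, so setup and warmup costs do not skew the comparison between configurations.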
Notably, the benchmark doesn't attempt to measure the performance of Pokémon Showdown via either its
BattleStream abstraction or the pokemon-showdown binary. The BattleStream isn't that difficult to
use, though doing so requires a special RandomPlayerAI that directly inspects the Battle to avoid
making unavailable choices and that matches the AI used by all of the other configurations, in
addition to directly accessing the BattleStream's internal Battle object to more easily grab the
turn count and to patch various speed ties. The main concern is that due to Pokémon Showdown's poor
handling of promises internally it's fairly trivial to encounter race conditions that desync the
benchmark. Pokémon Showdown's root pokemon-showdown binary is technically the blessed approach to
using the simulator, but BattleStream is effectively the same thing without the (sizeable) I/O
overhead. Attempting to use the actual pokemon-showdown binary is deemed too difficult as there
would then be no way to inspect the Battle to avoid making unavailable choices [2], meaning it would
be difficult to keep in sync with the other configurations.
Before running the benchmark, care needs to be taken to set up the environment to be as stable as possible, e.g. disabling CPU performance scaling, Intel Turbo Boost, etc. The benchmark tool measures 3 different configurations:
- DirectBattle: this configuration introduces the concept of a DirectBattle which overrides the Pokémon Showdown Battle class to strip out unused functionality:
  - methods which add to the battle log are overridden to drop any messages immediately
  - sendUpdates is overridden to not send any updates
  - makeRequest avoids serializing the request for each side

  The DirectBattle is then used synchronously as opposed to via the async BattleStream, which is about 10% faster and obviates needing to care about races. This configuration minimizes string processing overhead and unnecessary delays due to async calls and is as close to as fast as Pokémon Showdown can be run (there is room for further optimization by simplifying choice parsing to not perform any verification, though this is significantly less trivial than the aforementioned optimizations). This is closer to how the pkmn engine runs without -Dlog. Finally, DirectBattle is patched to eliminate unnecessary nondeterministic elements, as covered earlier.
- @pkmn/engine: this configuration uses the @pkmn/engine driver package to run battles with the pkmn engine.
- libpkmn: this configuration runs battles directly with the libpkmn library and doesn't interface with JS at all. The benchmark runner invokes benchmark.zig to directly run the benchmark and report the results.
Both pkmn engine configurations are intended to be used with the -Dshowdown build option but with
all other build options turned off. The JS configurations are run beforehand for a warmup period to
ensure the measured duration is representative of the actual best case runtime.
In order to ensure all configurations are testing the same thing, one must ensure that the exact same battles are generated, the same sequence of moves are chosen, and the battle results match. As such, all benchmarks are run with the same PRNGs that have been initialized with the same seeds, and the logic for generating battles/randomly choosing moves is duplicated across both the Zig and TypeScript implementations. Finally, in addition to total duration, the benchmarking tool tracks and compares the total number of turns across all battles and the final RNG seed to serve as a "checksum" and verify that all of the configurations are in agreement. To achieve this, Pokémon Showdown requires that one:
- serialize the players' teams passed to the Battle constructor, as Pokémon Showdown mutates them
- drive both players with separate PRNGs from each other and from the Battle, as there is no guarantee around the order of operations (Pokémon Showdown has numerous races and unpleasantries)
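The "checksum" comparison can be sketched as follows. RunSummary and verifyAgreement are hypothetical names, the seed is modeled as a plain array of numbers as an assumption, and the values are placeholders rather than real measurements.

```typescript
interface RunSummary {
  name: string;
  turns: number;            // total turns across all battles
  seed: readonly number[];  // final PRNG state, in whatever form the PRNG exposes
}

// Throws if any configuration's (turns, seed) pair differs from the first one.
function verifyAgreement(runs: readonly RunSummary[]): void {
  const [first, ...rest] = runs;
  for (const run of rest) {
    const sameSeed = run.seed.length === first.seed.length &&
      run.seed.every((x, i) => x === first.seed[i]);
    if (run.turns !== first.turns || !sameSeed) {
      throw new Error(`${run.name} disagrees with ${first.name}`);
    }
  }
}

// Placeholder values, not measurements.
const runs: RunSummary[] = [
  {name: 'libpkmn', turns: 1234, seed: [1, 2, 3, 4]},
  {name: '@pkmn/engine', turns: 1234, seed: [1, 2, 3, 4]},
  {name: 'DirectBattle', turns: 1234, seed: [1, 2, 3, 5]},
];

try {
  verifyAgreement(runs);
} catch (err) {
  console.log((err as Error).message);
}
```

Comparing the final seed in addition to the turn count catches cases where two configurations happen to play the same number of turns while having consumed the PRNG differently along the way.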
Note that how long a given battle takes is heavily dependent on the teams in question. The benchmark runs on teams that have effectively been generated using "Challenge Cup" semantics, and because this includes numerous sub-optimal moves (e.g. Thunder Shock in addition to Thunderbolt, instead of just the latter) it's expected to take substantially longer than more traditional "Random Battle" sets or handcrafted teams. Experimentally the random sets used by the benchmark are expected to be roughly 2-3× slower than what would be typical in practice.
The results in the table below come from running the benchmarks against pkmn/engine@9ce6e379 on an
n2d-standard-48 Google Cloud Compute Engine machine with 192 GB of memory and an AMD EPYC 7B12 CPU
running 64-bit x86 Linux which has undergone the pre-benchmark tuning detailed below via the command
npm run benchmark -- --battles=10000:
Generation | libpkmn | @pkmn/engine | DirectBattle |
---|---|---|---|
RBY | 195 ms | 737 ms (3.78×) | 618 s (3167×) |
It's important to note that the relative performance differences between the various configurations depend on the exact choice of machine used for testing (though the orders of magnitude seen here are expected to hold).
CPU Details
Architecture: x86_64
  CPU op-mode(s): 32-bit, 64-bit
  Address sizes: 48 bits physical, 48 bits virtual
  Byte Order: Little Endian
CPU(s): 48
  On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
  Model name: AMD EPYC 7B12
  CPU family: 23
  Model: 49
  Thread(s) per core: 2
  Core(s) per socket: 12
  Socket(s): 2
  Stepping: 0
  BogoMIPS: 4499.99
  Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid
Virtualization features:
  Hypervisor vendor: KVM
  Virtualization type: full
Caches (sum of all):
  L1d: 768 KiB (24 instances)
  L1i: 768 KiB (24 instances)
  L2: 12 MiB (24 instances)
  L3: 96 MiB (6 instances)
NUMA:
  NUMA node(s): 2
  NUMA node0 CPU(s): 0-11,24-35
  NUMA node1 CPU(s): 12-23,36-47
Vulnerabilities:
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
  Srbds: Not affected
  Tsx async abort: Not affected
Setup
Provision a spot Google Cloud Compute Engine instance with a Minimal Ubuntu LTS image that's deleted
after 30 minutes using the gcloud command-line tool and SSH into it:
# Configure to auto-terminate for safety; when done we can also manually run:
#
# $ gcloud compute instances stop pkmn-engine-benchmark
# $ gcloud compute instances delete pkmn-engine-benchmark
#
gcloud beta compute instances create pkmn-engine-benchmark \
--zone=us-central1-a \
--machine-type=n2d-standard-48 \
--image-project=ubuntu-os-cloud \
--image-family=ubuntu-minimal-2204-lts \
--max-run-duration=30m \
--provisioning-model=SPOT \
--instance-termination-action=DELETE
# Need to wait a bit before SSH will succeed...
sleep 15
gcloud compute ssh pkmn-engine-benchmark
On the VM, install dependencies:
# Install system packages
sudo apt update
sudo apt --assume-yes install git cpuset
# Shallow clone the pkmn engine code
git clone --depth 1 https://github.com/pkmn/engine.git
cd engine
# Set up Node
curl -fsSL https://raw.githubusercontent.com/tj/n/master/bin/n | sudo bash -s lts
# Install package dependencies + put the locally installed Zig on the PATH
npm install
export PATH="$(pwd)/build/bin/zig:$PATH"
The tuning script can then be run as root to perform the benchmark:
sudo --preserve-env=PATH env ./benchmark.sh
Tuning
#!/bin/bash
function cleanup() {
  # Turn back on hyperthreading
  for cpu in {1..47}
  do
    echo 1 > /sys/devices/system/cpu/cpu$cpu/online
  done
  # Remove CPU shielding
  cset shield --reset >/dev/null 2>&1
}
trap cleanup EXIT
# Turn off hyperthreading based on /sys/devices/system/cpu/cpu*/topology/thread_siblings
for cpu in {24..47}
do
  echo 0 > /sys/devices/system/cpu/cpu$cpu/online
done
# Sadly we are unable to disable CPU boosting or change the CPU governor to performance
# Set up a shield and move all threads (including kernel threads) out
cset shield -c 1-9 -k on >/dev/null 2>&1
# Flush dirty pages first (they cannot be freed otherwise), then drop the filesystem cache
sync
echo 3 > /proc/sys/vm/drop_caches
# Run benchmark command within shield at highest possible priority
# (can add '--battles=100000 --iterations=50' flags to execute regression benchmark)
cset shield --exec -- nice -n -19 node build/test/benchmark
In addition to being used to compare the pkmn engine to Pokémon Showdown, the benchmark
tool has an alternative mode that allows it to better detect regressions
in the engine's performance. When the --iterations
flag is used the tool instead runs multiple
iterations of battle playouts from the same seed against the engine and outputs a TSV with the
results. These results can then be fed back into the script to determine how performance changed:
npm --silent run benchmark -- --iterations=50 > logs/before.tsv
# <make changes>
npm --silent run benchmark -- --iterations=50 logs/before.tsv
Alternatively, a text or JSON --summary
can be produced - in order to minimize noise, the mean of
all the iterations is reported after outliers have been removed.
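As an illustration of reporting a mean with outliers removed, the sketch below discards samples outside Tukey's fences (1.5 times the interquartile range beyond the quartiles) before averaging. The source does not say which outlier rule the tool uses, so the rule, the quantile interpolation, and the function names here are all assumptions; the sample values are made up.

```typescript
// Linear-interpolation quantile over an already sorted, non-empty array.
function quantile(sorted: readonly number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Mean of the samples that fall within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
// Assumes at least one sample is provided.
function meanWithoutOutliers(samples: readonly number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const kept = sorted.filter(x => x >= q1 - 1.5 * iqr && x <= q3 + 1.5 * iqr);
  return kept.reduce((a, b) => a + b, 0) / kept.length;
}

// Made-up iteration durations with one obviously noisy sample.
const durations = [10, 11, 9, 10, 10, 50];
console.log(meanWithoutOutliers(durations));
```

With these inputs the noisy sample is excluded and the remaining five values average to 10, whereas the plain mean would have been pulled up to roughly 16.7.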
The integration tests and a standalone fuzz test are also used
for fuzzing. A GitHub
workflow exists to run these tests on a schedule from random seeds
for various durations to attempt to uncover latent bugs. The fuzz tests differ from the benchmark in
that they run for predefined time durations as opposed to a given number of battles and enable the
unimplementable effects that are usually excluded in -Dshowdown
compatibility
mode. When run with the -Dlog
flag, additional binary data is dumped on crashes to allow for
debugging with the help of fuzz.ts
and the debug UI
rendered by display.ts
. If -Dchance
and -Dcalc
are enabled the fuzz test also
ensures a transitions
function can correctly detect all valid transitions without crashing.
To run the fuzz tool locally use:
$ npm run --silent fuzz -- <pkmn|showdown> <GEN> <DURATION> <SEED?>
Footnotes
1. A stretch goal for the project is to be able to run integration tests against the actual cartridge code. Examples exist of scripting battles to run on the cartridge via an emulator, though the fact that integration testing the engine properly requires support for "link" battling and the ability to detect desyncs makes such a goal decidedly nontrivial.
2. It's possible to remain in sync between configurations which can inspect Battle and those that can't by always saving the raw result returned by the last RNG call and reapplying it to the next request in the event of an "[Unavailable choice]" error (e.g. call the RNG and get back r, attempt to choose the r % N-th choice, get rejected, on the next request don't generate a new r but instead now make the r % M-th choice where M is the actual number of available choices post rejection). Since it isn't especially important to demonstrate how much slower the (already slow) async BattleStream API is when you introduce syscall overhead into the mix, this workaround is left as an exercise to the reader.
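For the curious, the saved-roll workaround from the second footnote can be sketched like this. ChoiceMaker and its methods are hypothetical and do not correspond to any Pokémon Showdown or pkmn API; the RNG is stubbed to return a constant so the reuse is easy to follow.

```typescript
// Hypothetical driver state: the raw roll is remembered until a choice is accepted.
class ChoiceMaker {
  private saved: number | undefined = undefined;
  private readonly rng: () => number;

  constructor(rng: () => number) {
    this.rng = rng;
  }

  // Pick an index among `count` apparent options, reusing the saved roll after
  // a rejection instead of advancing the RNG again.
  choose(count: number): number {
    const raw = this.saved ?? this.rng();
    this.saved = raw;
    return raw % count;
  }

  // Call once the simulator accepts the choice so the next request rolls afresh.
  accept(): void {
    this.saved = undefined;
  }
}

let rngCalls = 0;
const maker = new ChoiceMaker(() => { rngCalls++; return 7; });
const firstTry = maker.choose(4); // r % N with N = 4 apparent choices
// ... the simulator replies "[Unavailable choice]", leaving M = 3 real choices ...
const retry = maker.choose(3);    // r % M with the same r
maker.accept();
console.log(firstTry, retry, rngCalls);
```

Because the retry reuses r, the RNG advances exactly once per accepted choice, which is what keeps a configuration that cannot inspect the Battle aligned with those that can.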