# [EMNLP 2024] CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research
We create the first similar-command-lines dataset generated by large language models and introduce CmdCaliper, an efficient command-line embedding model that surpasses current models in performance. All data and model weights are publicly released.
This research addresses command-line embedding in cybersecurity, a field hindered by the lack of comprehensive datasets due to privacy and regulation concerns. We propose CyPHER, the first dataset of similar command lines, for training and unbiased evaluation. The training set, comprising 28,520 similar command-line pairs, is generated by a set of large language models (LLMs). Our testing set consists of 2,807 similar command-line pairs sourced from authentic command-line data.
In addition, we propose CmdCaliper, a command-line embedding model that enables the computation of semantic similarity between command lines. Performance evaluations demonstrate that the smallest version of CmdCaliper (30 million parameters) surpasses state-of-the-art (SOTA) sentence embedding models with ten times more parameters across various tasks (e.g., malicious command-line detection and similar command-line retrieval).
Our study explores the feasibility of data generation with LLMs in the cybersecurity domain. Furthermore, we release our proposed command-line dataset, the weights of our embedding models, and all program code to the public. This paves the way for more effective command-line embedding for future researchers.
```bash
conda create -yn cmdcaliper python=3.10
conda activate cmdcaliper
pip install -r requirements.txt
```
- Copy the template configuration file to create your own configuration file first:

  ```bash
  cp credential_config_template.yaml credential_config.yaml
  ```
- Please configure your `llm_pool_info` in the `credential_config.yaml` file, including specifying the inference `engine_name`, `model_name`, and the corresponding `api_key` and `base_url`.
- Below is an example of using both gpt-4o and gpt-4o-mini in the LLM pool:
  ```yaml
  # credential_config.yaml
  llm_pool_info:
    - engine_name: "OpenAIInferenceEngine"
      model_name: "gpt-4o-mini"
      engine_arguments:
        api_key: [YOUR OPENAI API KEY]
    - engine_name: "OpenAIInferenceEngine"
      model_name: "gpt-4o"
      engine_arguments:
        api_key: [YOUR OPENAI API KEY]
  ```
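For reference, a config in this shape can be read with PyYAML; the snippet below is a minimal sketch, not the repo's actual loading code:

```python
# Minimal sketch: load credential_config.yaml and list the configured
# engines. Assumes PyYAML is installed; the repo's loader may differ.
import yaml

with open("credential_config.yaml") as f:
    config = yaml.safe_load(f)

for entry in config["llm_pool_info"]:
    print(entry["engine_name"], entry["model_name"])
```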
- The currently supported engines are `OpenAIInferenceEngine`, `GoogleInferenceEngine`, and `AnthropicInferenceEngine`. (Please refer to `./src/inferencer.py` for more details; a rough sketch of the idea is shown below.)
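As a mental model only, each engine wraps one provider's chat API behind a common interface. The class below is a hypothetical sketch, not the actual code in `./src/inferencer.py`; only the class name comes from the config above, and the method name and signature here are assumptions:

```python
# Hypothetical sketch of an inference engine (NOT the repo's actual
# implementation). Uses the openai>=1.0 Python client.
from openai import OpenAI

class OpenAIInferenceEngine:
    def __init__(self, model_name: str, api_key: str, base_url: str | None = None):
        # One provider-specific client per engine entry in llm_pool_info.
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model_name = model_name

    def inference(self, prompt: str) -> str:
        # Single-turn chat completion; a pool can round-robin across engines.
        resp = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```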
```bash
python3 synthesized_cmds.py \
  --path-to-seed-cmds [PATH TO SEED CMDS] \
  --path-to-output-dir [PATH TO OUTPUT DIR] \
  --max-generation-num [DATA NUM TO GENERATE] \
  --path-to-credential-config ./credential_config.yaml
```
- `--path-to-seed-cmds`: Path to the initial seed commands file. This JSON file contains the starting data required for the synthesis process (a plausible layout is sketched below).
- `--path-to-output-dir`: Directory where the generated data and logs will be stored.
- `--max-generation-num`: The total number of data items to generate.
- `--path-to-credential-config`: Path to the credential configuration file. This YAML file includes settings for the LLM pool and API key information.
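The exact schema of the seed file is not documented in this README; a plausible layout, assuming a flat JSON array of seed command lines (verify against the repo's bundled seed file):

```json
[
  "netstat -ano",
  "tasklist /v",
  "reg query HKLM\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run"
]
```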
```bash
python3 synthesize_positive_cmds.py \
  --path-to-all-cmds [PATH TO ALL CMDS] \
  --path-to-output-dir [PATH TO OUTPUT DIR] \
  --path-to-credential-config ./credential_config.yaml
```
- `--path-to-all-cmds`: Path to a file containing command lines for which you want to generate similar commands.
- `--path-to-output-dir`: Directory where the generated data and logs will be stored.
- `--path-to-credential-config`: Path to the credential configuration file. This YAML file includes settings for the LLM pool and API key information.
| Model | Params (B) | MRR@10 | Top@10 |
|---|---|---|---|
| CmdCaliper-Small | 0.03 | 87.78 | 94.76 |
| CmdCaliper-Base | 0.11 | 88.47 | 95.26 |
| CmdCaliper-Large | 0.335 | 89.90 | 95.65 |
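To try one of these checkpoints directly, the snippet below is a minimal sketch assuming the released models load with the `sentence-transformers` library (check the Hugging Face model cards for the officially supported loading path):

```python
# Minimal sketch: embed two command lines and compare them, assuming the
# CyCraftAI/CmdCaliper-small checkpoint is sentence-transformers compatible.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("CyCraftAI/CmdCaliper-small")

cmds = [
    "netstat -ano",
    "Get-NetTCPConnection | Select-Object LocalAddress, LocalPort",
]
embeddings = model.encode(cmds)

# Cosine similarity between the two command-line embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```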
## Evaluation on CyPHER
- To reproduce the performance on the testing set of CyPHER as reported in the paper, you can evaluate different models using the following command:
  ```bash
  python3 evaluate.py --model-name [MODEL_NAME] \
    --batch-size 16 --device cuda
  ```
- Replace `[MODEL_NAME]` with one of the following options to evaluate the respective model:
  - `"CyCraftAI/CmdCaliper-small"`
  - `"CyCraftAI/CmdCaliper-base"`
  - `"CyCraftAI/CmdCaliper-large"`
  - `"thenlper/gte-small"`
- Adjust the `--batch-size` parameter if necessary to accommodate hardware constraints.
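For intuition about the reported metrics, MRR@10 and Top@10 can be computed from a query-candidate similarity matrix roughly as below. This is a minimal sketch, not the repo's `evaluate.py`; it assumes L2-normalized embeddings and that candidate `i` is the true match for query `i`:

```python
# Minimal metric sketch (assumptions: rows are L2-normalized embeddings,
# and the positive candidate for query i sits at index i).
import numpy as np

def mrr_and_top_at_10(query_embs: np.ndarray, cand_embs: np.ndarray):
    sims = query_embs @ cand_embs.T          # cosine similarities
    order = (-sims).argsort(axis=1)          # candidate indices, best first
    reciprocal_ranks, hits = [], 0
    for i, ranking in enumerate(order):
        rank = int(np.where(ranking == i)[0][0]) + 1  # 1-based rank of true pair
        reciprocal_ranks.append(1.0 / rank if rank <= 10 else 0.0)
        hits += int(rank <= 10)
    # Multiply by 100 to match the percentage-style numbers in the table above.
    return 100 * float(np.mean(reciprocal_ranks)), 100 * hits / len(order)
```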
- The training script for CmdCaliper.
```bibtex
@inproceedings{huang2024cmdcaliper,
  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
  author={Sian-Yao Huang and Cheng-Lin Yang and Che-Yu Lin and Chun-Ying Huang},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```