LoggingDescriptions

This repository maintains a set of <code, log> pairs extracted from popular open-source projects, which are amenable to logging description generation research. More details about the dataset can be found in our paper, listed under Cite below.

The projects are listed as follows, including 10 Java projects and 7 C# projects:

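| No | Java Projects | C# Projects |
|----|---------------|-------------|
| 01 | ActiveMQ | Azure SDK |
| 02 | Ambari | CoreRT |
| 03 | Brooklyn | CoreFX |
| 04 | Camel | Mono |
| 05 | CloudStack | MonoDevelop |
| 06 | Hadoop | Orleans |
| 07 | Hbase | SharpDevelop |
| 08 | Hive | |
| 09 | Synapse | |
| 10 | Ignite | |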

Each folder of a project includes the following two files:

  • (project)_code_log_pairs.txt: This file contains all the code-log pairs extracted from the project. The pairs from different files of the project are separated.
  • file_trace.txt: To facilitate our data processing, different files of a project are renamed in the form of "sameple_ID". This file is used to help readers trace back to the original file.

Data Extraction

In the paper, each <code, log> pair is extracted from a single function and consists of two parts: the code text and the logging description. The code text contains the 10 lines of code statements preceding the studied logging statement (or fewer, if fewer are available in the function). The logging description is the descriptive text in that logging statement, with non-descriptive parts such as variables removed.
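As an illustration of the "logging description" part, the following minimal sketch keeps only the string literals of a logging statement and drops everything else. The function name `logging_description` and the regular expression are hypothetical; the README does not say how the authors handle string concatenation or format placeholders, so joining the literals with a single space is an assumption.

```python
import re

# Matches a double-quoted string literal, allowing escaped characters inside.
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')


def logging_description(statement):
    """Return the descriptive text of a logging statement, or "" if it has none.

    Assumption: descriptive text is whatever appears inside string literals;
    variables and other arguments are ignored.
    """
    parts = [p.strip() for p in STRING_LITERAL.findall(statement)]
    return " ".join(p for p in parts if p)


print(logging_description('LOGGER.error("Exception 1 happens", e1);'))
# prints: Exception 1 happens
print(repr(logging_description('LOGGER.error(e2);')))
# prints: ''
```

A statement that yields an empty string here has no description and would not produce a pair.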

Processing Details:

  1. All empty lines are skipped.
  2. All English characters are converted to lowercase.
  3. In the code text, code lines are separated by \tab.
  4. Log statements that contain no descriptive text are not treated as logging descriptions; in this dataset they are ordinary code statements.
  5. The 10 preceding lines of code statements never extend beyond the scope of the current function (see the following example for details).
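Steps 1 to 3 above can be sketched as a small normalization helper. This is an illustration rather than the authors' script: the name `normalize_code_text` is hypothetical, and trimming the indentation of each line is an assumption based on how the code text appears in the example below.

```python
def normalize_code_text(code_lines):
    """Skip empty lines, lowercase the rest, and join them with a tab."""
    # Step 1: drop lines that are empty or whitespace-only.
    # Assumption: leading and trailing whitespace is trimmed from kept lines.
    kept = [line.strip() for line in code_lines if line.strip()]
    # Step 2: lowercase. str.lower() also lowercases non-English cased
    # characters, which goes slightly beyond what the README states.
    # Step 3: a tab character separates consecutive code lines.
    return "\t".join(line.lower() for line in kept)


code_text = normalize_code_text(["try {", "", "    Operation 1;", "}"])
print(code_text.split("\t"))
# prints: ['try {', 'operation 1;', '}']
```

Splitting a code text on the tab character recovers the individual code lines.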

A Simplified Example

For ease of demonstration, the following Java example extracts 6 lines of code instead of 10 for the code text.

public void catchException() {
    try {
        operation 1;
        operation 2;

    } catch (Exception1 e1) {
        LOGGER.error("Exception 1 happens", e1);

    } catch (Exception2 e2) {
        LOGGER.error(e2);

    } catch (Exception3 e3) {
        LOGGER.error("Exception 3 happens", e3);
    }
}

In this function, two <code, log> pairs can be extracted (\tab separates consecutive code lines):

  • <code, log> pair 1:

Code Text:

public void catchexception() {     try {     operation 1;     operation 2;     } catch (exception1 e1) {

Logging Description:

exception 1 happens

  • <code, log> pair 2:

Code Text:

operation 2;     } catch (exception1 e1) {     } catch (exception2 e2) {     logger.error(e2);     } catch (exception3 e3) {

Logging Description:

exception 3 happens

Further Explanation:

  1. The logging statement "LOGGER.error(e2);" cannot produce a <code, log> pair because it contains only a variable and no descriptive text. Such a statement is treated as an ordinary code line (see <code, log> pair 2), whereas logging statements with descriptive text, such as the one in <code, log> pair 1, never appear in the code text of any pair.
  2. In <code, log> pair 1, the code text contains only 5 (<6) code lines because only 5 lines precede the logging statement within the function; code outside the function is never included.
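The example can be reproduced with the following sketch, which is not the authors' extraction tool. All names (`extract_pairs`, `is_log_call`, `description`) are hypothetical, and logging calls are assumed to look like `LOGGER.<level>(...)`. One detail is inferred from the example rather than stated in this README: the window is counted over non-empty lines first, and descriptive logging statements are dropped afterwards, which would explain why pair 2 holds 5 lines rather than 6.

```python
import re

STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')

EXAMPLE = '''public void catchException() {
    try {
        operation 1;
        operation 2;

    } catch (Exception1 e1) {
        LOGGER.error("Exception 1 happens", e1);

    } catch (Exception2 e2) {
        LOGGER.error(e2);

    } catch (Exception3 e3) {
        LOGGER.error("Exception 3 happens", e3);
    }
}'''


def is_log_call(line):
    # Assumption: logging calls have the form LOGGER.<level>(...).
    return re.match(r'\s*logger\.\w+\(', line, re.IGNORECASE) is not None


def description(line):
    # Descriptive text = contents of the string literals in the statement.
    return " ".join(STRING_LITERAL.findall(line)).strip().lower()


def extract_pairs(function_source, window=10):
    """Return (code_text, logging_description) pairs for one function."""
    # Skip empty lines, trim indentation, and lowercase.
    lines = [ln.strip().lower() for ln in function_source.splitlines() if ln.strip()]
    pairs = []
    for i, line in enumerate(lines):
        if not (is_log_call(line) and description(line)):
            continue  # not a logging statement, or no descriptive text
        # Up to `window` preceding lines; the slice cannot start before
        # index 0, so it never reaches outside the function.
        preceding = lines[max(0, i - window):i]
        # Descriptive logging statements are removed from the code text;
        # description-free ones (e.g. logger.error(e2);) stay as code.
        code = [ln for ln in preceding if not (is_log_call(ln) and description(ln))]
        pairs.append(("\t".join(code), description(line)))
    return pairs


for code_text, log_text in extract_pairs(EXAMPLE, window=6):
    print(code_text.split("\t"), "->", log_text)
```

With `window=6`, this yields the two pairs shown above: the first with 5 code lines and the description "exception 1 happens", the second starting at "operation 2;" with the description "exception 3 happens".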

Cite

If you use this dataset, please cite our paper using the following reference:

@inproceedings{he2018characterizing,
title={Characterizing the natural language descriptions in software logging statements},
author={He, Pinjia and Chen, Zhuangbin and He, Shilin and Lyu, Michael R},
booktitle={Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering},
pages={178--189},
year={2018},
organization={ACM}
}

License

All datasets in this repository are released under the MIT license for free reuse.