Order1Smoker: computes order-1 entropy of 8-bit bytes (0.7-1.5 GB/s)
Bulat-Ziganshin committed Feb 8, 2014
1 parent a1d51a2 commit df52968
Showing 2 changed files with 49 additions and 9 deletions.
7 changes: 4 additions & 3 deletions README.md
@@ -3,13 +3,14 @@ DataSmoke

 Datatype detection in order to choose appropriate compression algorithm.

-Since already compressed, text and multimedia files are better compressed with specific algorithms, we need a fast and reliable way to detect those data. I call it data smoking.
+Since incompresible, text and multimedia files are better compressed with specific algorithms, we need a fast and reliable way to detect those data. I call it data smoking.

 This project will provide various experimental algorithms that can recognize some of special datatypes (not necessary all), as well as samples of data that are especially hard to smoke correctly.


-The full list of smells:
+The full list of smells (speeds measured on the single core of i7-4770):

 - ByteSmoker: computes entropy of individual bytes (2 GB/s).
-- WordSmoker: computes entropy of 16-bit words (1 GB/s).
+- WordSmoker: computes entropy of 16-bit words (0.7-1.5 GB/s).
 - DWordSmoker: computes entropy of 32-bit dwords (3 GB/s).
+- Order1Smoker: computes order-1 entropy of 8-bit bytes (0.7-1.5 GB/s).
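
For reference, the value that Order1Smoker reports can be written out as a formula. This is read off the `Order1Smoker::smoke` implementation in this commit; the symbols $B$, $n_{c,s}$, $N_c$ and $R$ are introduced here for illustration only and do not appear in the source.

```latex
R \;=\; \frac{1}{8B} \sum_{c=0}^{255} \;\sum_{\substack{s=0 \\ n_{c,s} > 0}}^{255}
        n_{c,s} \, \log_2 \frac{N_c}{n_{c,s}},
\qquad
N_c \;=\; \sum_{s=0}^{255} n_{c,s}
```

Here $B$ is the buffer size in bytes, $n_{c,s}$ is the number of adjacent byte pairs counted under row $c$ and column $s$ of the 256x256 table, and $N_c$ is the row total. The sum is the estimated coded size in bits, and dividing by $8B$ turns it into a ratio: values near 0 suggest data that is highly predictable from a neighbouring byte, values near 1 suggest data that this model cannot compress.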
51 changes: 45 additions & 6 deletions smoke.cpp
@@ -70,7 +70,6 @@ void ByteSmoker::smoke (void *buf, size_t bufsize, double *entropy)
 class WordSmoker : public Smoker
 {
   uint32_t *count;
-  size_t bits[256];
 public:
   WordSmoker() {count = new uint32_t[256*256];}
   virtual const char* name() {return "WordSmoker";};
@@ -84,7 +83,7 @@ void WordSmoker::smoke (void *buf, size_t bufsize, double *entropy)

   byte *p = (byte*) buf;
   for (int i=0; i<bufsize-1; i++)
-    count[ *(uint16_t*)(p+i) ]++;
+    count[ *(unsigned*)(p+i) & 0xFFFF ]++;

   double order0 = 0;
   for (int i=0; i<256*256; i++)
@@ -97,6 +96,45 @@ void WordSmoker::smoke (void *buf, size_t bufsize, double *entropy)
 }


+/****************************************************************************/
+/* Order-1 smoker: calculate compression ratio with the 8-bit order-1 model */
+/****************************************************************************/
+
+class Order1Smoker : public Smoker
+{
+  uint32_t *count;
+public:
+  Order1Smoker() {count = new uint32_t[256*256];}
+  virtual const char* name() {return "Order1Smoker";};
+  virtual ~Order1Smoker() {delete[] count;}
+  virtual void smoke (void *buf, size_t bufsize, double *entropy);
+};
+
+void Order1Smoker::smoke (void *buf, size_t bufsize, double *entropy)
+{
+  memset (count, 0, 256*256*sizeof(*count));
+
+  byte *p = (byte*) buf;
+  for (int i=0; i<bufsize-1; i++)
+    count[ *(unsigned*)(p+i) & 0xFFFF ]++;
+
+  double order1 = 0;
+  for (int i=0; i<256; i++)
+  {
+    size_t total = 0;
+    for (int j=0; j<256; j++)
+      total += count[i*256+j];
+
+    if (total)
+      for (int j=0; j<256; j++)
+        if (count[i*256+j])
+          order1 += count[i*256+j] * log(double(total)/count[i*256+j])/log(double(2)) / 8;
+  }
+
+  *entropy = order1 / bufsize;
+}
+
+
 /***************************************************************************/
 /* DWord smoker: calculate compression ratio with the 32-bit order-0 model */
 /***************************************************************************/
@@ -187,10 +225,11 @@ int main (int argc, char **argv)
     FILE *infile = fopen (argv[file], "rb"); if (infile==NULL) {fprintf (stderr, "Can't open input file %s!\n", argv[file]); return EXIT_FAILURE;}
     fprintf(stderr, "%sProcessing %s: ", file>1?"\n":"", argv[file]);

-    ByteSmoker ByteS;
-    WordSmoker WordS;
-    DWordSmoker DWordS;
-    Smoker *smokers[] = {&ByteS, &WordS, &DWordS};
+    ByteSmoker ByteS;
+    WordSmoker WordS;
+    DWordSmoker DWordS;
+    Order1Smoker Order1S;
+    Smoker *smokers[] = {&ByteS, &WordS, &Order1S, &DWordS};
     const int NumSmokers = sizeof(smokers)/sizeof(*smokers);
     double entropy, min_entropy[NumSmokers], avg_entropy[NumSmokers] = {0}, max_entropy[NumSmokers] = {0};
     for (int i=0; i<NumSmokers; ++i) min_entropy[i] = 1;
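
A note on the counting loop shared by WordSmoker and Order1Smoker in this commit: `*(unsigned*)(p+i)` loads `sizeof(unsigned)` bytes (typically four) at every position, so the last iterations read slightly past the end of the buffer, and `bufsize-1` wraps around when `bufsize` is 0. The sketch below is one possible bounds-safe variant, not the author's code. `order1_ratio` is a hypothetical free function, and it assumes the intent is to condition each byte on the byte before it. On a little-endian machine the committed loop indexes rows by the later byte of each pair, so its results can differ slightly from this sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the commit: order-1 compression-ratio
// estimate in [0,1], with each byte conditioned on its predecessor.
double order1_ratio(const std::uint8_t* buf, std::size_t bufsize)
{
    if (bufsize < 2) return 0.0;                // no byte pairs to count

    // count[context*256 + symbol]; byte-wise indexing avoids wide loads
    std::vector<std::uint32_t> count(256 * 256, 0);
    for (std::size_t i = 0; i + 1 < bufsize; ++i)
        count[buf[i] * 256 + buf[i + 1]]++;

    double bits = 0;                            // estimated coded size in bits
    for (int ctx = 0; ctx < 256; ++ctx) {
        const std::uint32_t* row = &count[ctx * 256];
        std::size_t total = 0;
        for (int sym = 0; sym < 256; ++sym) total += row[sym];
        if (total == 0) continue;               // context never seen
        for (int sym = 0; sym < 256; ++sym)
            if (row[sym])
                bits += row[sym] * std::log2(double(total) / row[sym]);
    }
    return bits / 8 / bufsize;                  // same normalisation as the commit
}
```

Indexing the table byte by byte is likely slower than the single wide load used in the commit, so the quoted 0.7-1.5 GB/s figure should not be assumed to carry over to this variant.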
