Skip to content
forked from puid/Elixir

Fast and efficient generation of cryptographically strong probably unique indentifiers (puid, aka random string) of specified entropy from various character sets.

License

Notifications You must be signed in to change notification settings

ruby2elixir/Elixir

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Puid

Fast and efficient generation of cryptographically strong probably unique identifier (puid, aka random string) of specified entropy from various character sets.

Build Status   Hex Version   License: MIT

TOC

Installation

Add puid to mix.exs dependencies:

def deps,
  do: [
    {:puid, "~> 1.0"}
  ]

Update dependencies

mix deps.get

TOC

Usage

Create a module for generating puids:

  iex> defmodule(MyPuid, do: use(Puid))
  iex> MyPuid.generate()
  "8nGA2UaIfaawX-Og61go5A"

By default, Puid modules generate puids with 128-bit of entropy using the RFC 4648 file system and URL safe characters. There are 16 pre-defined character sets specified in Puid.CharSet.

To generate puids with 92-bits of entropy from alphanumeric characters:

  iex> defmodule(AlphanumPuid, do: use(Puid, bits: 92, charset: :alphanum))
  iex> AlphanumPuid.generate()
  "4ParCeRyqN8jgWh0"

Or to use custom characters for 64-bit entropy puids:

  iex> defmodule(DingoskyPuid, do: use(Puid, bits: 64, chars: "dingosky"))
  iex> DingoskyPuid.generate()
  "dskyyssgiydygkgndoykgs"

You can even use Unicode characters:

  iex> defmodule(UnicodePuid, do: use(Puid, chars: "ŮήιƈŏδεĊħąŕαсτəř"))
  iex> UnicodePuid.generate()
  "ĊŮəαсŕąδřτąƈιřήсąιŕŮτąąƈτŏřŏτсřŏ"

Rather than explicitly setting the bits parameter, Puid provides a simple, intuitive way to specify the amount of entropy for generated puids. By specifying a total number of IDs with a risk of a repeat, Puid will calculate the required entropy bits.

To generate up to 10 million puids with 1 in a trillion chance of repeat:

  iex> defmodule(SafePuid, do: use(Puid, total: 10.0e6, risk: 1.0e12))
  iex> SafePuid.generate()
  "q4SbN9yEXEiVCyc"

Each defined module has an info/0 function that provides detail on the module specification:

  iex> SafePuid.info()
  %Puid.Info{
    chars: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_",
    charset: :safe64,
    entropy_bits: 90.0,
    entropy_bits_per_char: 6.0,
    ere: 0.75,
    rand_bytes: &:crypto.strong_rand_bytes/1,
    length: 15
  }

TOC

Overview

We frequently have a need for unique identifiers. Regardless of how we generate these identifiers, we should question if they are, well, unique. Guaranteeing uniqueness requires either deterministic generation (e.g., a counter) that is not random, or that each newly created random identifier be compared against all existing IDs. However, often neither deterministic IDs nor the overhead of comparing all generated IDs suites our need.

So we use random IDs (aka random strings), which means we drop guaranteed uniqueness and adopt a weaker strategy of probabilistic uniqueness. Specifically, rather than being absolutely sure of uniqueness, we settle for a statement such as "there is less than a 1 in a billion chance that two of my strings are the same". We use an implicit version of this very strategy every time we use a hash as a key. We assume there will be no hash collision, but we do not have any true guarantee of uniqueness per se.

Understanding probabilistic uniqueness requires an understanding of entropy and of how to estimate the probability of a collision (i.e., the probability that two strings in a set of randomly generated strings might be the same). The blog post Hash Collision Probabilities provides an excellent overview of deriving an expression for calculating the probability of a collision in some number of instances of a perfect N-bit hash. Although the blog post provides detail for calculating the probability for hash collisions, it does not provide an answer to quantifying what we mean by "there is less than a 1 in a billion chance that 1 million strings of this form will have a repeat". That requires we calculate the entropy necessary to generate a total number of strings (from a much larger pool of possible strings) with a given risk of a repeat.

Puid provides exactly this capability by creating modules for randomly generating probably unique identifiers (puids) from a given character set such that a total number of the strings can be generated with an explicitly risk of repeat.

TOC

Why Puid?

As developers, we aren't accustom to thinking of random strings as being probably unique, but that's exactly what they are. But more importantly, we often don't consider what the actual probability of that uniqueness is. As example, consider any of the libraries (other than Puid and EntropyString, a precursor to Puid) or common schemes for generating random strings. They all accept as specification the length of the string, and perhaps the characters to use. However, they do not address the critical question: How likely am I to create a repeat if I use this library for N number of strings?

That question really should drive the parameterization of random string generation. You don't need a string of length 12; you need 100,000 identifiers using some character set with a explicit probability of a repeat. Let the library determine how long the string will be.

TOC

Why Not uuid?

Given the issue raised in Why Puid?, developers often punt and simply adopt a strategy of using uuids instead. The leading reason for using uuids seems to be "I want unique IDs" (with an implicit "and I don't want to think about it any further"). But uuids (the version 4 string representation defined in Section 4.4 of RFC 4122) are neither universal nor unique.

Far too often the rational of using uuids is that the probability of a repeat is low. (This is actually an underspecified statement; calculating the probability requires specifying the total number of uuids generated within a certain context). But if that rationale holds, why not concatenate two uuids and even be "better". And we sink into 7 Minute Abs logic.

Better would be to explicitly declare your intentions. Suppose you need 1 million IDs (for whatever reason). If you use uuids, what is the risk of repeat? Be quick now. ... OK, don't be quick. The risk is about 1 in 10^25. That's 1 in 10 septillion, or perhaps we should call it 1 in a 7 Minute Ab. Sure, that risk is really low. But is it appropriate and necessary?

Suppose instead you accept a risk of repeat as being about the same as that of you being hit by a meteorite as you are writing code. We'll estimate that event as 1 in 2.0e13. Using Puid, we can generate 1 million of these IDs at that risk using:

  iex> defmodule(MeteorId, do: use(Puid, total: 1.0e6, risk: 2.0e13))
  iex> MeteorId.generate()
  "EN8jD6p0NucjpA"

The Puid specification is explicit. The code clearly shows the expected number of IDs to generate under the given risk of a repeat. No guesswork needed.

A uuid has 122-bits of entropy (although most libraries use 128 bits to actually generate the uuid). The string representation requires 36 characters. Let's look at that string length. Using the MeteorId module, your random puids are 14 characters each. Using the overkill of uuids, each random ID is 36 characters. One solution requires 14 characters per ID. The other 36. If you're fine with that inefficiency, enough said.

Well, maybe not. Suppose you really, really want that overkill. OK, let's overkill with Puid:

  iex> defmodule(OverkillPuid, do: use(Puid, bits: 122))
  iex> OverkillPuid.generate()
  "geDpoXs5KMDgPKbDD5ch"

The OverkillPuid pool of random strings is slightly larger size than uuids, hence the risk of repeat in some number of instances is similar. And yet, each OverkillPuid puid still only requires 21 characters, whereas each equivalent uuid is stuck at 36 characters. Basically, a uuid is a puid with a fixed entropy of 122 bits and a comparatively inefficient string representation. Overkill if you must, but even then Puid is more efficient that using uuids.

TOC

Features

Module API

Each module defined using Puid has two auto-generated functions, generate/0 and info/0.

  • generate/0 generates a new puid using module parameterization
  • info/0 returns a Puid.Info structure consisting of
    • puid string length
    • The source character set
    • The pre-defined Puid.CharSet used, or if characters are custom
    • puid entropy bits per character
    • puid total entropy bits
      • May be larger than the specified bits since the total is a product of the puid length and the entropy bits per character.
    • puid entropy representation efficiency.
      • The ratio of the puid total entropy to the bits required for the puid string representation.
    • Source function for entropy

CharSets

  • 16 pre-defined character sets
    • Each has optimized ID generation
  • Custom
    • Any string of unique characters can be used for puids, including Unicode characters.

Random Bytes

By default, Puid uses :crypto.strong_rand_bytes/1 for entropy. Any function of the form (non_neg_integer) -> binary can be used instead.

TOC

Library Comparisons

The following provides comparisons to existing Elixir methods of generating random IDs. In each case, an equivalent Puid module is created. The Timing section includes a rough execution time comparison. Where appropriate, the existing Elixir method is run under pseudo-random number generation (PRNG) as well as cryptographically strong pseudo-random number generation (CSPRNG), the latter being slower. All comparisons use the default Puid CSPRNG entropy source :crypto.strong_rand_bytes/1.

The code used for the Timing output is in the test/timing.exs file. Testing module tags provide an easy means of running the timing test for a particular existing solution. For example, to run the timing test for Misc.Random:

  > mix test test/timing.exs --only misc_random

Creating equivalent Puid modules for the timing comparisons requires use of the Puid.Entropy.bits_for_len!/2 and Puid.Entropy.len_for_bits!/2 functions to determine parameters to match the comparison methods. This is not typical in the expected use of Puid.

TOC

Common Solution

The common solution to generating random strings in just about every computer language boils down to the same strategy: from a source character set, create a string where each character is plucked from the source by randomly indexing into the set. In Elixir, this looks like:

defmodule CommonSolution do
  def rand_string(len, chars) do
    char_count = chars |> String.length()

    for(_ <- 1..len, do: :rand.uniform(char_count) - 1)
    |> Enum.map(&(chars |> String.at(&1)))
    |> List.to_string()
  end
end

Specification

  • specify string length and character set
  • no pre-defined character sets
  • supports custom characters
    • handles Unicode strings

Examples

  iex> len = 12
  iex> chars = ?a..?z |> Enum.to_list() |> to_string()
  "abcdefghijklmnopqrstuvwxyz"
  iex> CommonSolution.rand_string(len, chars)
  "ckukdbpynhev"

  iex> bits = Puid.Entropy.bits_for_len!(len, :alpha_lower)
  iex> defmodule(AlphaLowerPuid, do: use(Puid, bits: bits, charset: :alpha_lower))
  iex> AlphaLowerPuid.generate()
  "atszyoutahxm"

  iex> CommonSolution.rand_string(len, "ŮήιƈŏδεĊħąŕαсτəř")
  "ετδąřŕτŏŮŕəŮ"

  iex> bits = Puid.Entropy.bits_for_len!(len, "ŮήιƈŏδεĊħąŕαсτəř")
  iex> defmodule(UnicodePuid, do: use(Puid, bits: bits, chars: "ŮήιƈŏδεĊħąŕαсτəř"))
  iex> UnicodePuid.generate()
  "ααιĊδħąιссήą"

Timing

Generate 50000 random IDs with 128 bits of entropy using alphanum characters
  Common Solution   (PRNG) : 5.796959
  Common Solution (CSPRNG) : 7.824981
  Puid            (CSPRNG) : 0.528811

Generate 50000 random IDs with 128 bits of entropy using 8 custom characters
  Common Solution   (PRNG) : 1.365407
  Common Solution (CSPRNG) : 4.430985
  Puid            (CSPRNG) : 0.720929

Generate 50000 random IDs with 92 bits of entropy using 16 unicode characters
  Common Solution   (PRNG) : 2.437136
  Common Solution (CSPRNG) : 4.922621
  Puid            (CSPRNG) : 2.760375

TOC

Specification

  • specify string entropy
  • 6 pre-defined character sets
  • supports custom characters
    • character set count is restricted to powers of 2
    • does not handle Unicode

Examples

  iex> defmodule(Safe64ES, do: use(EntropyString, charset: :charset64))
  iex> Safe64ES.random()
  "RfYP7I5fitDij2Ow4eYgnd"

  iex> defmodule(Safe64Puid, do: use(Puid, charset: :safe64))
  iex> Safe64Puid.generate()
  "q0E0ra29Xe-sacO71Y4jjQ"

  iex> defmodule(DingoSkyES64, do: use(EntropyString, bits: 64, charset: "dingosky"))
  iex> DingoSkyES64.random()
  "kggsodyyynkioyigyoyyoo"

  iex> defmodule(DingoskyPuid64, do: use(Puid, bits: 64, chars: "dingosky"))
  iex> DingoskyPuid64.generate()
  "koynkddggokggyinnsogii"

  iex> defmodule(UnicodeES92, do: use(EntropyString, bits: 92, charset: "Unicode-Charsət"))
  iex> UnicodeES92.random()
  <<97, 201, 101, 85, 110, 99, 45, 115, 100, 116, 153, 67, 115, 201, 99, 153, 97,
  85, 153, 104, 99, 110, 97>>

  iex> defmodule(UnicodePuid92, do: use(Puid, bits: 92, chars: "Unicode-Charsət"))
  iex> UnicodePuid92.generate()
  "hsieadCcəhtts-ennastCtcə"

Timing

Generate 100000 random IDs with 128 bits of entropy using safe64 characters
  Entropy String (CSPRNG) : 2.007152
  Puid           (CSPRNG) : 0.410303

Generate 100000 random IDs with 92 bits of entropy using 8 custom characters
  Entropy String (CSPRNG) : 3.024128
  Puid           (CSPRNG) : 2.03751

TOC

defmodule Id do
  def gen_reference() do
    min = String.to_integer("100000", 36)
    max = String.to_integer("ZZZZZZ", 36)

    max
    |> Kernel.-(min)
    |> :rand.uniform()
    |> Kernel.+(min)
    |> Integer.to_string(36)
  end
end

Specification

  • no argument input
    • fixed string length of 6
  • 1 pre-defined character set
    • fixed entropy of 31 bits
  • does not support custom characters

Examples

  iex> Id.gen_reference()
  "DABRC1"

  iex> bits = Puid.Entropy.bits_for_len!(6, :alphanum_upper)
  iex> defmodule(UpperAlphanumPuid, do: use(Puid, bits: bits, charset: :alphanum_upper))
  iex> UpperAlphanumPuid.generate()
  "2GEIUC"

Timing

Generate 500000 random IDs with 31 bits of entropy using alphanum_upper characters
  gen_reference   (PRNG) : 0.489205
  gen_reference (CSPRNG) : 1.459515
  Puid          (CSPRNG) : 3.564886

Comments

This solution is clearly faster than Puid; however, it has significant shortcomings. The generated strings have a fixed 31 bits of entropy, so there are only 2.15 billion possible IDs. That's fine if you don't need many IDs, but quickly becomes problematic as more are generated. Here are the approximate probabilities of a repeat for a given number of generated IDs:

Generated Repeat Risk
5,000 0.5 %
10,000 20 %
50,000 44 %
100,000 90 %
200,000 99 %

With no flexibility and limited utility, this solution is basically an interesting novelty.

TOC

Specification

  • specify string length
  • 1 pre-defined character set
  • does not support custom characters
  • does not support CSPRNG

Examples

  iex> Misc.Random.string(22)
  "DGMnEn91xlPYGVOc2lK3Uv"

  iex> defmodule(AlphanumPuid, do: use(Puid, charset: :alphanum))
  iex> AlphanumPuid.generate()
  "6bOXdwc5aP2qhRWARtZtpM"

Timing

Generate 50000 random IDs with 128 bits of entropy using alphanum characters
  Misc.Random (PRNG) : 13.406748
  Puid      (CSPRNG) :  0.450581

Comments

As of v0.2.6, :misc_random uses the :random module. Although not deprecated, the Erlang docs recommend using :rand instead. :random generated IDs are also not cryptographically strong.

TOC

Specification

  • specify string length
  • 4 pre-defined character sets
  • does not support custom characters

Examples

  iex> len = Puid.Entropy.len_for_bits!(128, :alphanum)
  22
  iex> NotQwerty123.RandomPassword.gen_password(length: len)
  "eM7ZCAtzGTHLQpH85Bav5g"

  iex> defmodule(AlphanumPuid, do: use(Puid, charset: :alphanum))
  iex> AlphanumPuid.generate()
  "RCvrqKohgT5I4bqSHV9sQy"

  iex> len = Puid.Entropy.len_for_bits!(128, :printable_ascii)
  20
  iex> NotQwerty123.RandomPassword.gen_password(length: len, characters: :letters_digits_punc)
  "-y[UNW(qYhHmy1#N'mg,"

  iex> defmodule(PrintablePuid, do: use(Puid, charset: :printable_ascii))
  iex> PrintablePuid.generate()
  "8{\"46>7166BQ!vp+PF;3"

Timing

Generate 50000 random IDs with 128 bits of entropy using alphanum characters
  NotQwerty123 (CSPRNG) : 7.079803
  Puid         (CSPRNG) : 0.524233

Generate 50000 random IDs with 128 bits of entropy using printable_ascii characters
  NotQwerty123 (CSPRNG) : 8.090046
  Puid         (CSPRNG) : 0.659449

TOC

Specification

  • specify string length
  • 5 pre-defined character sets
  • does not support custom characters

Examples

  iex> len = Puid.Entropy.len_for_bits!(128, :alphanum)
  22
  iex> Randomizer.generate!(len)
  "3VEwz3TXAzxZ1H4zcxIszk"

  iex> defmodule(AlphanumPuid, do: use(Puid, charset: :alphanum))
  iex> AlphanumPuid.generate()
  "TgFEtGVBuWZuVfayY60Eww"

  iex> len = Puid.Entropy.len_for_bits!(92, :alpha_lower)
  20
  iex> Randomizer.generate!(len, :downcase)
  "TWRIXlTZwNbRJkanFbXQ"

  iex> defmodule(AlphaLowerPuid, do: use(Puid, bits: 92, charset: :alpha_lower))
  iex> AlphaLowerPuid.generate()
  "halpyhdogjafbmipdvsw"

Timing

Generate 5000 random IDs with 128 bits of entropy using alphanum characters
   Randomizer   (PRNG) : 1.643205
   Randomizer (CSPRNG) : 16.815926
   Puid       (CSPRNG) : 0.072043

TOC

Specification

  • specify string length
  • 1 pre-defined character set
  • supports custom characters
    • does not handle unicode

Examples

  iex> len = Puid.Entropy.len_for_bits!(128, :alphanum)
  22
  iex> :rand_str.get(len)
  'vkG5RkWlioZoYGxguk3xex'

  iex> defmodule(AlphanumPuid, do: use(Puid, charset: :alphanum))
  iex> AlphanumPuid.generate()
  "ov0KKui8RbzTtUgjetKABL"

  iex> len = Puid.Entropy.len_for_bits!(48, "dingosky")
  16
  iex> :rand_str.get(len, 'dingosky')
  'sydyiikiokyynyiy'

  iex> defmodule(DingoskyPuid, do: use(Puid, bits: 48, chars: "dingosky"))
  iex> DingoskyPuid.generate()
  "ksossdgsysyndgko"

  iex> len = Puid.Entropy.len_for_bits!(64, "ŮήιƈŏδεĊħąŕαсτəř")
  16
  iex> :rand_str.get(len, 'ŮήιƈŏδεĊħąŕαсτəř')
  [266, 964, 335, 341, 964, 261, 948, 295, 948, 392, 392, 392, 942, 366, 949, 261]

  iex> defmodule(UnicodePuid, do: use(Puid, bits: 64, chars: "ŮήιƈŏδεĊħąŕαсτəř"))
  iex> UnicodePuid.generate()
  "ŮŕδħƈήřŏεřřĊŮąсř"

Timing

Generate 100000 random IDs with 128 bits of entropy using safe64 characters
  :rand_str   (PRNG) : 1.039695
  :rand_str (CSPRNG) : 6.370006
  Puid      (CSPRNG) : 0.293184

Generate 100000 random IDs with 128 bits of entropy using alphanum characters
  :rand_str   (PRNG) : 1.139668
  :rand_str (CSPRNG) : 5.276775
  Puid      (CSPRNG) : 1.247633

TOC

Specification

  • specified argument is not consistent
  • 3 pre-defined character sets
  • does not support custom characters

Examples

  iex> len = 12
  len
  iex> SecureRandom.base64(len)
  "SXkaWlieKm+hlID1"

  iex> bits = Puid.Entropy.bits_for_len!(SecureRandom.base64(len) |> String.length(), :safe64)
  96
  iex> defmodule(Safe64Puid, do: use(Puid, bits: bits, charset: :safe64))
  iex> Safe64Puid.generate()
  "5rd9ql-fJpX6gaA9"

  iex> SecureRandom.urlsafe_base64(len)
  "Y1M2Y2xqTmNEREp2SXdLeg=="

As of version 0.5.1, the input argument is not treated consistently. Furthermore, the output of SecureRandom.urlsafe_base64/1 uses superfluous padding characters.

Timing

Generate 500000 random IDs with 128 bits of entropy using hex characters
  SecureRandom (CSPRNG) : 2.14521
  Puid         (CSPRNG) : 1.896126

Generate 500000 random IDs with 128 bits of entropy using safe64 characters
  SecureRandom (CSPRNG) : 3.329939
  Puid         (CSPRNG) : 2.612979

TOC

Specification

  • no input argument
    • fixed string length of 36
  • 1 pre-defined form
    • fixed entropy of 122 bits
  • does not support custom characters

Examples

  iex> UUID.uuid4()
  "39c8277a-9b9e-4c8d-af50-31c17fd39bd0"

  iex> defmodule(HexPuid122, do: use(Puid, bits: 122, charset: :hex))
  iex> HexPuid122.generate()
  "2bc031932cc22c051af9e8b5f598275f"

  iex> defmodule(Safe64Puid122, do: use(Puid, bits: 122, charset: :safe64))
  iex> Safe64Puid122.generate()
  "xJMZ85YkdkYng3yP9W9s"

Timing

Generate 500000 random IDs with 122 bits of entropy using hex
  UUID     : 1.925131
  Puid hex : 1.823116

Generate 500000 random IDs with 122 bits of entropy using safe64
  UUID        : 1.751625
  Puid safe64 : 1.367201

TOC

About

Fast and efficient generation of cryptographically strong probably unique indentifiers (puid, aka random string) of specified entropy from various character sets.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Elixir 100.0%