Skip to content

jpivarski-talks/2024-02-13-numba-usage-stats

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Numba usage stats

Jim Pivarski

Note: The URLs to https://pivarski-princeton.s3.amazonaws.com will no longer work because those files have been removed. All but one have been moved to the files-from-AWS directory in this repository. The one that has not been saved, GitHub-numba-user-nonfork-raw-data-1Mcut-imports.tar, was 179.4 GB of repository data. If you're following the instructions below, you'll be able to produce an updated version of that dataset, but the original was too large to keep around. The other files are much smaller and serve as good checkpoints for testing that you're following the procedure correctly.

How the data were collected


Step 1: Scrape the dependents graph for numba/numba on GitHub (the repositories, not the packages).

Web-scraping script in step1.py.

When I did it, there were 62903 of these.




Step 2: For each of those repos, get the repo metadata using the GitHub API, taking care to not exceed the rate limit.

My list has 62900 of these. (I guess 3 were lost.)

My copy of the repo info can be found in https://pivarski-princeton.s3.amazonaws.com/GitHub-numba-user-nonfork-REPO-INFO.tgz (32.1 MB).

My copy of the user info, also from GitHub API (for bios) can be found in https://pivarski-princeton.s3.amazonaws.com/GitHub-numba-user-nonfork-USER-INFO.tgz (2.0 MB).




Step 3: For the repos in which "fork": false (users created the repo themselves), download all of the repos.

59233 repos from my previous list are non-fork.

I have the final results on a 380 GB disk, but I think I used a 1 TB disk during the process (all on AWS).

The step3.py script performs the giant git clone of all these repos. It's a parallized pipeline (ProcessPoolExecutor with max_workers=24 on a computer with 4 CPU cores... it's I/O limited) with the following steps:

  1. git clone with --depth 1 to get the latest snapshot, but not the history.
  2. Do a grep -i for \bnumba\b to cross-check GitHub's identification of these as depending on Numba and keep the result in a *.grep file beside the final tarball.
  3. Drop any files that are greater than 1 MB (some GitHub repos contain large data files) if they do not have an interesting file suffix: py, PY, ipynb, IPYNB, c, cc, cpp, cp, cxx, c++, C, CC, CPP, CP, CXX, C++, h, hpp, hp, hh, H, HPP, HP, HH, cu, cuh, CU, CUH.
  4. Tarball-and-compress what remains.

Occasionally, one of the 24 workers would get stuck with a large download, but the others moved past it. In the end, I think there were only a couple that couldn't be downloaded after a few attempts. (The script does not re-download, so it can be used to clean up after failed attempts.)




Step 4: Further select only the repos that actually contain

\b(import\s+([A-Za-z_][A-Za-z_0-9]*\s*,\s*)*numba|from\s+numba\s+import)\b

in some file. After this selection, only 13512 repos were kept (22.8%). Some of the repos that GitHub identified mentioned Numba in text or used it in markdown examples, but didn't import it: GitHub's interpretation of a "dependent repo" is very broad.

Finally, tarball (without compression!) the directory full of gzipped tarballs. My copy is at https://pivarski-princeton.s3.amazonaws.com/GitHub-numba-user-nonfork-raw-data-1Mcut-imports.tar (179.4 GB).




Step 5: Do a static code analysis on all of the Python and Jupyter notebook files. This is another ProcessPoolExecutor pipeline, which results in a JSON file that will be used in interactive analysis. The steps of the pipeline are:

  1. Identify programming language by file extension, to learn which programming languages are used alongside Numba.
  2. For all C/C++/CUDA files,
  • try to parse it as a pure C file using pycparser (mostly to distinguish between C and C++),
  • look for CUDA's triple angle brackets, and
  • regex-search it for \s*#include [<\"](.*)[>\"] to get a list of includes (and identify if the include-file name matches the name of a file in the repo, so that locally defined files can be excluded).
  1. For all Python and Jupyter notebook files, parse the file with Python 3 (3.10.12) and indicate if parsing failed. For Jupyter, use jupytext to transform the Jupyter JSON into an in-memory pure Python, with IPython magics removed. Then, walk the Python AST to
  • collect all information on top-level imports and nested imports, keeping track of how imported modules or symbols are renamed,
  • if any of these are under the numba module, collect all symbol references and argument lists of function calls, including whether or not a function was used as a decorator, and
  • pay close attention to JIT-compilation functions/decorators: numba.jit, numba.njit, numba.generated_jit, numba.vectorize, numba.guvectorize, numba.cfunc.

My copy of the static analysis results is at https://pivarski-princeton.s3.amazonaws.com/GitHub-numba-user-nonfork-static-analysis-results.jsons.gz (77.0 MB).




Analysis


import json
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
results = []
with open("static-analysis-results.jsons") as file:
    for line in file:
        results.append(json.loads(line))
len(results)
13511
df = pd.DataFrame([{"suffix": cfile["suffix"], "is_c": cfile["data"]["is_c"], "is_cuda": cfile["data"]["num_cuda"] > 0} for result in results for cfile in result["c"]])
df
suffix is_c is_cuda
0 cpp True False
1 c False False
2 c False False
3 cpp False False
4 cpp False False
... ... ... ...
897086 h False False
897087 h False False
897088 h False False
897089 h False False
897090 c False False

897091 rows Ă— 3 columns

The file extension is useless for determining if something is pure C versus C++.

df["suffix"][df["is_c"]].value_counts()
suffix
h      9157
cpp    3236
c      3024
hpp    2311
cxx     160
cc      106
cuh      49
cu       20
hh        6
hxx       5
Name: count, dtype: int64
df["suffix"][~df["is_c"]].value_counts()
suffix
h      401378
cpp    194933
c       85414
cc      73949
hpp     45596
cu      40778
cxx     24752
cuh      6170
hh       3046
hxx      2982
c++        13
cp          3
hp          3
Name: count, dtype: int64

But it's a pretty good indicator that a CUDA file is a CUDA file (unless it's a header file, but then my method of checking for <<< >>> doesn't work, either).

df["suffix"][df["is_cuda"]].value_counts()
suffix
cu     18197
h       1274
cuh      962
c        101
cpp       85
hpp       82
cc        73
hxx        2
hh         1
Name: count, dtype: int64
df["suffix"][~df["is_cuda"]].value_counts()
suffix
h      409261
cpp    198084
c       88337
cc      73982
hpp     47825
cxx     24912
cu      22601
cuh      5257
hh       3051
hxx      2985
c++        13
cp          3
hp          3
Name: count, dtype: int64
languages = []
for result in results:
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    username, reponame = result["name"].split("/", 1)
    languages.append({"user": username, "repo": reponame})
    for cfile in result["c"]:
        if cfile["data"]["num_cuda"] > 0 or cfile["suffix"] in ("cu", "cuh"):
            languages[-1]["CUDA"] = True
        elif cfile["data"]["is_c"]:
            languages[-1]["C"] = True
        else:
            languages[-1]["C++"] = True
    for k, v in result["other_language"].items():
        if v > 0:
            languages[-1][k] = True

df = pd.DataFrame(languages).fillna(False)
df
/tmp/ipykernel_76379/2710413475.py:21: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df = pd.DataFrame(languages).fillna(False)
user repo C C++ Cython Julia Swift Go CUDA Java ... R Rust MATLAB Fortran Groovy Scala Kotlin F# Haskell Ada
0 JeffreyMinucci ht_occupational False False False False False False False False ... False False False False False False False False False False
1 dreamento dreamento False False False False False False False False ... False False False False False False False False False False
2 nitin7478 Backorder_Prediction False False False False False False False False ... False False False False False False False False False False
3 exafmm pyexafmm False False False False False False False False ... False False False False False False False False False False
4 astro-informatics sleplet False False False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13087 haowen-xu tensorkit False False False False False False False False ... False False False False False False False False False False
13088 WONDER-project OASYS1-WONDER False False False False False False False False ... False False False False False False False False False False
13089 WONDER-project Orange3-WONDER False False False False False False False False ... False False False False False False False False False False
13090 maciej-sypetkowski autoascend False False False False False False False False ... False False False False False False False False False False
13091 FS-CSCI150-F21 FS-CSCI150-F21-Team4 True True True False False False False False ... False False True True False False False False False False

13092 rows Ă— 24 columns

len(df) / len(results)
0.9689882318111168

In the following, "C++" and "C" are mutually exclusive categories of file ("does it compile in pycparser or not?"), but the bars are not mutually exclusive because a repo can contain a C++ file and also a pure C file.

"CUDA" is not exclusive with respect to "C++" and "C"; it corresponds to any C-like file with <<< >>> in it.

fig, ax = plt.subplots(figsize=(6, 4.5))

(df.drop(columns=["user", "repo"]).sum(axis=0).sort_values() / len(df)).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("non-Python languages (represented by at least one file)")

None

png

fig, ax = plt.subplots(figsize=(9, 4))

(df.drop(columns=["repo"]).groupby("user").any().sum(axis=0).sort_values()[4:] * 100 / len(df)).plot.barh(ax=ax)
ax.set_xlabel("Percentage of GitHub users who 'import numba' or 'from numba import' in Python")
ax.set_ylabel("Other language used (at least one file)")
ax.xaxis.grid(linestyle="--")
ax.set_axisbelow(True)

# fig.savefig("numba-users-other-language.svg")
# fig.savefig("numba-users-other-language.pdf")

png

df.drop(columns=["repo"]).groupby("user").any().sum(axis=0).sort_values()
Ada               1
Haskell          13
F#               14
Kotlin           24
Swift            29
Groovy           32
Scala            43
Ruby             90
Rust             91
Julia            92
Go               93
C#              115
Perl            177
Java            300
R               317
Fortran         465
MATLAB          533
Mathematica     692
CUDA           1270
Cython         1406
C              1511
C++            2850
dtype: int64
STDLIB_MODULES = {
    "__main__",
    "string",
    "re",
    "difflib",
    "textwrap",
    "unicodedata",
    "stringprep",
    "readline",
    "rlcompleter",
    "struct",
    "codecs",
    "datetime",
    "calendar",
    "collections",
    "heapq",
    "bisect",
    "array",
    "weakref",
    "types",
    "copy",
    "pprint",
    "reprlib",
    "enum",
    "numbers",
    "math",
    "cmath",
    "decimal",
    "fractions",
    "random",
    "statistics",
    "itertools",
    "functools",
    "operator",
    "pathlib",
    "fileinput",
    "stat",
    "filecmp",
    "tempfile",
    "glob",
    "fnmatch",
    "linecache",
    "shutil",
    "macpath",
    "pickle",
    "copyreg",
    "shelve",
    "marshal",
    "dbm",
    "sqlite3",
    "zlib",
    "gzip",
    "bz2",
    "lzma",
    "zipfile",
    "tarfile",
    "csv",
    "configparser",
    "netrc",
    "xdrlib",
    "plistlib",
    "hashlib",
    "hmac",
    "secrets",
    "os",
    "io",
    "time",
    "argparse",
    "getopt",
    "logging",
    "getpass",
    "curses",
    "platform",
    "errno",
    "ctypes",
    "threading",
    "multiprocessing",
    "concurrent",
    "subprocess",
    "sched",
    "queue",
    "_thread",
    "_dummy_thread",
    "dummy_threading",
    "contextvars",
    "asyncio",
    "socket",
    "ssl",
    "select",
    "selectors",
    "asyncore",
    "asynchat",
    "signal",
    "mmap",
    "email",
    "json",
    "mailcap",
    "mailbox",
    "mimetypes",
    "base64",
    "binhex",
    "binascii",
    "quopri",
    "uu",
    "html",
    "xml",
    "webbrowser",
    "cgi",
    "cgitb",
    "wsgiref",
    "urllib",
    "ftplib",
    "poplib",
    "imaplib",
    "nntplib",
    "smtplib",
    "smtpd",
    "telnetlib",
    "uuid",
    "socketserver",
    "xmlrpc",
    "ipaddress",
    "audioop",
    "aifc",
    "sunau",
    "wave",
    "chunk",
    "colorsys",
    "imghdr",
    "sndhdr",
    "ossaudiodev",
    "gettext",
    "locale",
    "turtle",
    "cmd",
    "shlex",
    "tkinter",
    "typing",
    "pydoc",
    "doctest",
    "unittest",
    "lib2to3",
    "test",
    "bdb",
    "faulthandler",
    "pdb",
    "timeit",
    "trace",
    "tracemalloc",
    "distutils",
    "ensurepip",
    "venv",
    "zipapp",
    "sys",
    "sysconfig",
    "builtins",
    "warnings",
    "dataclasses",
    "contextlib",
    "abc",
    "atexit",
    "traceback",
    "__future__",
    "gc",
    "inspect",
    "site",
    "code",
    "codeop",
    "zipimport",
    "pkgutil",
    "modulefinder",
    "runpy",
    "importlib",
    "parser",
    "ast",
    "symtable",
    "symbol",
    "token",
    "keyword",
    "tokenize",
    "tabnanny",
    "pyclbr",
    "py_compile",
    "compileall",
    "dis",
    "pickletools",
    "formatter",
    "msilib",
    "msvcrt",
    "winreg",
    "winsound",
    "posix",
    "pwd",
    "spwd",
    "grp",
    "crypt",
    "termios",
    "tty",
    "pty",
    "fcntl",
    "pipes",
    "resource",
    "nis",
    "syslog",
    "optparse",
    "imp",
    "posixpath",
    "ntpath",
}
# https://stackoverflow.com/a/2029106/1623645

C_STDLIB_MODULES = set([
    "aio.h",
    "algorithm",
    "any",
    "arpa/inet.h",
    "array",
    "assert.h",
    "atomic",
    "barrier",
    "bit",
    "bitset",
    "cassert",
    "ccomplex",
    "cctype",
    "cerrno",
    "cfenv",
    "cfloat",
    "charconv",
    "chrono",
    "cinttypes",
    "ciso646",
    "climits",
    "clocale",
    "cmath",
    "codecvt",
    "compare",
    "complex",
    "complex.h",
    "concepts",
    "condition_variable",
    "coroutine",
    "cpio.h",
    "csetjmp",
    "csignal",
    "cstdalign",
    "cstdarg",
    "cstdbool",
    "cstddef",
    "cstdint",
    "cstdio",
    "cstdlib",
    "cstring",
    "ctgmath",
    "ctime",
    "ctype.h",
    "cuchar",
    "curses.h",
    "cwchar",
    "cwctype",
    "deque",
    "dirent.h",
    "dlfcn.h",
    "errno.h",
    "exception",
    "execution",
    "expected",
    "fcntl.h",
    "fenv.h",
    "filesystem",
    "flat_map",
    "flat_set",
    "float.h",
    "fmtmsg.h",
    "fnmatch.h",
    "format",
    "forward_list",
    "fstream",
    "ftw.h",
    "functional",
    "future",
    "generator",
    "glob.h",
    "grp.h",
    "iconv.h",
    "initializer_list",
    "inttypes.h",
    "iomanip",
    "ios",
    "iosfwd",
    "iostream",
    "iso646.h",
    "istream",
    "iterator",
    "langinfo.h",
    "latch",
    "libgen.h",
    "limits",
    "limits.h",
    "list",
    "locale",
    "locale.h",
    "map",
    "math.h",
    "mdspan",
    "memory",
    "memory_resource",
    "monetary.h",
    "mqueue.h",
    "mutex",
    "ndbm.h",
    "netdb.h",
    "net/if.h",
    "netinet/in.h",
    "netinet/tcp.h",
    "new",
    "nl_types.h",
    "numbers",
    "numeric",
    "optional",
    "ostream",
    "poll.h",
    "print",
    "pthread.h",
    "pwd.h",
    "queue",
    "random",
    "ranges",
    "ratio",
    "regex",
    "regex.h",
    "sched.h",
    "scoped_allocator",
    "search.h",
    "semaphore",
    "semaphore.h",
    "set",
    "setjmp.h",
    "shared_mutex",
    "signal.h",
    "source_location",
    "span",
    "spanstream",
    "spawn.h",
    "sstream",
    "stack",
    "stacktrace",
    "stdalign.h",
    "stdarg.h",
    "stdatomic.h",
    "stdbit.h",
    "stdbool.h",
    "stdckdint.h",
    "stddef.h",
    "stdexcept",
    "stdfloat",
    "stdint.h",
    "stdio.h",
    "stdlib.h",
    "stdnoreturn.h",
    "stop_token",
    "streambuf",
    "string",
    "string.h",
    "strings.h",
    "string_view",
    "stropts.h",
    "strstream",
    "syncstream",
    "sys/ipc.h",
    "syslog.h",
    "sys/mman.h",
    "sys/msg.h",
    "sys/resource.h",
    "sys/select.h",
    "sys/sem.h",
    "sys/shm.h",
    "sys/socket.h",
    "sys/stat.h",
    "sys/statvfs.h",
    "system_error",
    "sys/time.h",
    "sys/times.h",
    "sys/types.h",
    "sys/uio.h",
    "sys/un.h",
    "sys/utsname.h",
    "sys/wait.h",
    "tar.h",
    "term.h",
    "termios.h",
    "tgmath.h",
    "thread",
    "threads.h",
    "time.h",
    "trace.h",
    "tuple",
    "typeindex",
    "typeinfo",
    "type_traits",
    "uchar.h",
    "ulimit.h",
    "uncntrl.h",
    "unistd.h",
    "unordered_map",
    "unordered_set",
    "utility",
    "utime.h",
    "utmpx.h",
    "valarray",
    "variant",
    "vector",
    "version",
    "wchar.h",
    "wctype.h",
    "wordexp.h",
])
num_with_numba = 0
python_imports = Counter()
c_imports = Counter()
for result in results:
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1

    counter = Counter()
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"]):
                if x not in STDLIB_MODULES and x != "numba":
                    counter[x] += 1
    for x in counter:
        python_imports[x] += 1

    counter = Counter()
    for cfile in result["c"]:
        if cfile["data"] is not None:
            for x in list(cfile["data"]["global"]) + list(cfile["data"]["local"]):
                if x not in C_STDLIB_MODULES and x != "numba":
                    counter[x] += 1
    for x in counter:
        c_imports[x] += 1

python_imports = sorted(python_imports.items(), key=lambda x: -x[1])
c_imports = sorted(c_imports.items(), key=lambda x: -x[1])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 9))

(pd.Series(dict(python_imports[:50])).sort_values() / num_with_numba).plot.barh(ax=ax1)
ax1.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax1.set_title("top Python imports (not standard library)")
ax1.set_xlim(0, 1)

(pd.Series(dict(c_imports[:50])).sort_values() / num_with_numba).plot.barh(ax=ax2)
ax2.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax2.set_title("top C and C++ includes (not standard library)")
# ax2.set_xlim(0, 1)

plt.subplots_adjust(wspace=0.8)

None

png

num_with_numba = 0
numba_references = Counter()
for result in results:
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1

    counter = Counter()
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in pyfile["data"]["numba"]:
                y = x.lstrip("@").split("(")[0]
                if x.startswith("numba.jit") and "nopython=True" in x:
                    y = "numba.njit"
                counter[y] += 1

    for x in counter:
        numba_references[x] += 1

numba_references = sorted(numba_references.items(), key=lambda x: -x[1])
fig, ax = plt.subplots(figsize=(6, 9))

(pd.Series(dict(numba_references[:50])).sort_values() / num_with_numba).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("top Numba API calls")
# ax1.set_xlim(0, 1)

None

png

JIT_FUNCTIONS = {"numba.jit", "numba.njit", "numba.generated_jit", "numba.vectorize", "numba.guvectorize", "numba.cfunc", "numba.cuda.jit"}
fig, ax = plt.subplots(figsize=(6, 2))

(pd.Series({k: v for k, v in numba_references if k in JIT_FUNCTIONS}).sort_values() / num_with_numba).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("Numba JIT API calls")
# ax1.set_xlim(0, 1)

None

png

num_with_numba = 0
jit_arguments = Counter()
for result in results:
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1

    counter = Counter()
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in pyfile["data"]["numba"]:
                if "(" in x and (x.lstrip("@").startswith("numba.jit") or x.lstrip("@").startswith("numba.njit")):
                    for arg in x.split("(", 1)[1].rstrip(")").split(","):
                        if "=" in arg:
                            counter[arg.strip()] += 1
                if x.lstrip("@").startswith("numba.njit"):
                    counter["nopython=True"] += 1

    for x in counter:
        jit_arguments[x] += 1

jit_arguments = sorted(jit_arguments.items(), key=lambda x: -x[1])
fig, ax = plt.subplots(figsize=(6, 4.5))

(pd.Series(dict(jit_arguments[:17])).sort_values() / num_with_numba).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("top numba.jit arguments")
# ax1.set_xlim(0, 1)

None

png

num_with_numba = 0
num_with_numba_cuda = 0
for result in results:
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1

    any_cuda = False
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in pyfile["data"]["numba"]:
                if x.startswith("numba.cuda"):
                    any_cuda = True

    if any_cuda:
        num_with_numba_cuda += 1

num_with_numba_cuda / num_with_numba
0.13978001833180567

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published