Jim Pivarski
Note: The URLs to https://pivarski-princeton.s3.amazonaws.com will no longer work because those files have been removed. All but one have been moved to the files-from-AWS directory in this repository. The one that has not been saved, GitHub-numba-user-nonfork-raw-data-1Mcut-imports.tar, was 179.4 GB of repository data. If you're following the instructions below, you'll be able to produce an updated version of that dataset, but the original was too large to keep around. The other files are much smaller and serve as good checkpoints for testing that you're following the procedure correctly.
Step 1: Scrape the dependents graph for numba/numba on GitHub (the repositories, not the packages).
The web-scraping script is in step1.py. When I ran it, there were 62903 of these.
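For reference, such a scrape boils down to something like the following sketch (not the actual step1.py; the `data-hovercard-type="repository"` links and the "Next" pagination button are assumptions about GitHub's current HTML for the dependents page, which is not a stable API and may change):

```python
# Minimal sketch of the dependents scrape (the real logic is in step1.py).
import time
import requests
from bs4 import BeautifulSoup

url = "https://github.com/numba/numba/network/dependents"
repos = []
while url is not None:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # each dependent repo appears as <a data-hovercard-type="repository" href="/owner/repo">
    for link in soup.select('a[data-hovercard-type="repository"]'):
        repos.append(link["href"].lstrip("/"))
    # the "Next" button carries the cursor for the following page of dependents
    next_button = soup.find("a", string="Next")
    url = next_button["href"] if next_button is not None else None
    time.sleep(1)  # be gentle: this is web scraping, not an API
```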
Step 2: For each of those repos, get the repo metadata using the GitHub API, taking care to not exceed the rate limit.
My list has 62900 of these. (I guess 3 were lost.)
My copy of the repo info can be found in https://pivarski-princeton.s3.amazonaws.com/GitHub-numba-user-nonfork-REPO-INFO.tgz (32.1 MB).
My copy of the user info, also from GitHub API (for bios) can be found in https://pivarski-princeton.s3.amazonaws.com/GitHub-numba-user-nonfork-USER-INFO.tgz (2.0 MB).
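The metadata fetch amounts to a loop like the following sketch. The endpoints and `X-RateLimit-*` headers are GitHub's documented REST API; the file name and token handling here are illustrative:

```python
# Hedged sketch of a rate-limit-aware GitHub API fetch (not the actual step 2 script).
import time
import requests

HEADERS = {"Authorization": "token YOUR_GITHUB_TOKEN"}  # authenticated: 5000 requests/hour

def get(url):
    while True:
        response = requests.get(url, headers=HEADERS)
        # if the quota is exhausted, sleep until the reset time GitHub reports
        if response.status_code == 403 and response.headers.get("X-RateLimit-Remaining") == "0":
            time.sleep(max(0.0, int(response.headers["X-RateLimit-Reset"]) - time.time()) + 1)
            continue
        return response.json()

for name in open("dependent-repos.txt"):  # one "owner/repo" per line (illustrative)
    repo_info = get(f"https://api.github.com/repos/{name.strip()}")
    user_info = get(f"https://api.github.com/users/{name.split('/')[0]}")
    # ...save repo_info and user_info to the REPO-INFO and USER-INFO files here
```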
Step 3: For the repos in which `"fork": false` (users created the repo themselves), download all of the repos.
59233 repos from my previous list are non-fork.
I have the final results on a 380 GB disk, but I think I used a 1 TB disk during the process (all on AWS).
The step3.py script performs the giant `git clone` of all these repos. It's a parallelized pipeline (`ProcessPoolExecutor` with `max_workers=24` on a computer with 4 CPU cores... it's I/O limited) with the following steps (a condensed sketch of one worker appears below):

- `git clone` with `--depth 1` to get the latest snapshot, but not the history.
- Do a `grep -i` for `\bnumba\b` to cross-check GitHub's identification of these as depending on Numba, and keep the result in a `*.grep` file beside the final tarball.
- Drop any files that are greater than 1 MB (some GitHub repos contain large data files) if they do not have an interesting file suffix: `py`, `PY`, `ipynb`, `IPYNB`, `c`, `cc`, `cpp`, `cp`, `cxx`, `c++`, `C`, `CC`, `CPP`, `CP`, `CXX`, `C++`, `h`, `hpp`, `hp`, `hh`, `H`, `HPP`, `HP`, `HH`, `cu`, `cuh`, `CU`, `CUH`.
- Tarball-and-compress what remains.
Occasionally, one of the 24 workers would get stuck on a large download, but the others moved past it. In the end, I think there were only a couple of repos that couldn't be downloaded after a few attempts. (The script does not re-download, so it can be used to clean up after failed attempts.)
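Condensed, one worker amounts to roughly the following (a sketch, not step3.py itself; the directory layout and repo-list file are illustrative):

```python
# Sketch of one clone-grep-prune-tarball worker from step 3.
import os
import pathlib
import subprocess
from concurrent.futures import ProcessPoolExecutor

KEEP = {"py", "ipynb", "c", "cc", "cpp", "cp", "cxx", "c++",
        "h", "hpp", "hp", "hh", "cu", "cuh"}  # compared case-insensitively below

def process(name):  # name is "owner/repo"
    workdir = "/data/" + name.replace("/", "--")
    if os.path.exists(workdir + ".tgz"):
        return  # already done: safe to re-run to clean up after failed attempts
    subprocess.run(["git", "clone", "--depth", "1",
                    f"https://github.com/{name}.git", workdir], check=True)
    # cross-check GitHub's "dependent" designation and keep the evidence
    subprocess.run(f"grep -ir '\\bnumba\\b' {workdir} > {workdir}.grep", shell=True)
    # drop files larger than 1 MB unless they have an interesting suffix
    for path in pathlib.Path(workdir).rglob("*"):
        if (path.is_file() and path.stat().st_size > 1024**2
                and path.suffix.lstrip(".").lower() not in KEEP):
            path.unlink()
    subprocess.run(["tar", "-czf", workdir + ".tgz", workdir], check=True)

names = [line.strip() for line in open("dependent-repos.txt")]  # from step 2 (illustrative)
with ProcessPoolExecutor(max_workers=24) as executor:
    list(executor.map(process, names))
```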
Step 4: Further select only the repos that actually contain `\b(import\s+([A-Za-z_][A-Za-z_0-9]*\s*,\s*)*numba|from\s+numba\s+import)\b` in some file. After this selection, only 13512 repos were kept (22.8%). Some of the repos that GitHub identified mentioned Numba in text or used it in Markdown examples, but didn't import it: GitHub's interpretation of a "dependent repo" is very broad.
Finally, tarball (without compression!) the directory full of gzipped tarballs. My copy is at https://pivarski-princeton.s3.amazonaws.com/GitHub-numba-user-nonfork-raw-data-1Mcut-imports.tar (179.4 GB).
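Applying the selection is a matter of running that regex over every file in each repo's tarball, roughly like this sketch (the function name is illustrative; searching raw bytes avoids decoding errors in non-UTF-8 files):

```python
# Sketch of the step-4 filter: does any file in the tarball really import numba?
import re
import tarfile

IMPORT_NUMBA = re.compile(
    rb"\b(import\s+([A-Za-z_][A-Za-z_0-9]*\s*,\s*)*numba|from\s+numba\s+import)\b"
)

def really_imports_numba(tarball_path):
    with tarfile.open(tarball_path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and IMPORT_NUMBA.search(tar.extractfile(member).read()):
                return True
    return False
```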
Step 5: Do a static code analysis on all of the Python and Jupyter notebook files. This is another `ProcessPoolExecutor` pipeline, which results in a JSON file that will be used in interactive analysis. The steps of the pipeline are:
- Identify the programming language by file extension, to learn which programming languages are used alongside Numba.
- For all C/C++/CUDA files,
  - try to parse each one as a pure C file using pycparser (mostly to distinguish between C and C++),
  - look for CUDA's triple angle brackets, and
  - regex-search for `\s*#include [<\"](.*)[>\"]` to get a list of includes (and identify whether the include-file name matches the name of a file in the repo, so that locally defined headers can be excluded).
- For all Python and Jupyter notebook files, parse the file with Python 3 (3.10.12) and indicate if parsing failed. For Jupyter, use jupytext to transform the Jupyter JSON into in-memory pure Python, with IPython magics removed. Then, walk the Python AST to
  - collect all information on top-level imports and nested imports, keeping track of how imported modules or symbols are renamed,
  - if any of these are under the `numba` module, collect all symbol references and argument lists of function calls, including whether or not a function was used as a decorator, and
  - pay close attention to JIT-compilation functions/decorators: `numba.jit`, `numba.njit`, `numba.generated_jit`, `numba.vectorize`, `numba.guvectorize`, `numba.cfunc`.

A sketch of the import-collection walk is shown below.
My copy of the static analysis results is at https://pivarski-princeton.s3.amazonaws.com/GitHub-numba-user-nonfork-static-analysis-results.jsons.gz (77.0 MB).
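The import-collection part of that AST walk reduces to something like the following sketch (the real pipeline also records renames, nested-scope details, and `numba` symbol references, and runs notebooks through jupytext first):

```python
# Sketch of collecting top-level vs. nested imports from a Python source string.
import ast

def collect_imports(source):
    tree = ast.parse(source)  # a SyntaxError here is recorded as a parse failure

    def modules(nodes):
        found = set()
        for node in nodes:
            if isinstance(node, ast.Import):
                found.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module is not None:
                found.add(node.module)
        return found

    top = modules(tree.body)              # module-level statements only
    everywhere = modules(ast.walk(tree))  # including inside functions and classes
    return top, everywhere - top

top, nested = collect_imports("import numba\ndef f():\n    import scipy.special\n")
# top == {"numba"}, nested == {"scipy.special"}
```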
import json
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
results = []
with open("static-analysis-results.jsons") as file:
    for line in file:
        results.append(json.loads(line))
len(results)
13511
df = pd.DataFrame([{"suffix": cfile["suffix"], "is_c": cfile["data"]["is_c"], "is_cuda": cfile["data"]["num_cuda"] > 0} for result in results for cfile in result["c"]])
df
| | suffix | is_c | is_cuda |
|---|---|---|---|
| 0 | cpp | True | False |
| 1 | c | False | False |
| 2 | c | False | False |
| 3 | cpp | False | False |
| 4 | cpp | False | False |
| ... | ... | ... | ... |
| 897086 | h | False | False |
| 897087 | h | False | False |
| 897088 | h | False | False |
| 897089 | h | False | False |
| 897090 | c | False | False |

897091 rows × 3 columns
The file extension is useless for determining if something is pure C versus C++.
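The `is_c` flag comes from whether pycparser accepts the file, roughly as in this sketch (approximate: pycparser has no preprocessor, so directives are stripped first and the classification is noisy):

```python
# Rough version of the is_c check: does the file parse as pure C?
import re
from pycparser import CParser
from pycparser.plyparser import ParseError

def is_pure_c(source):
    stripped = re.sub(r"(?m)^\s*#.*$", "", source)  # drop preprocessor directives
    try:
        CParser().parse(stripped)
        return True
    except ParseError:
        return False
```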
df["suffix"][df["is_c"]].value_counts()
suffix
h 9157
cpp 3236
c 3024
hpp 2311
cxx 160
cc 106
cuh 49
cu 20
hh 6
hxx 5
Name: count, dtype: int64
df["suffix"][~df["is_c"]].value_counts()
suffix
h 401378
cpp 194933
c 85414
cc 73949
hpp 45596
cu 40778
cxx 24752
cuh 6170
hh 3046
hxx 2982
c++ 13
cp 3
hp 3
Name: count, dtype: int64
But it's a pretty good indicator that a CUDA file is a CUDA file (unless it's a header file, but then my method of checking for `<<<` `>>>` doesn't work, either).
df["suffix"][df["is_cuda"]].value_counts()
suffix
cu 18197
h 1274
cuh 962
c 101
cpp 85
hpp 82
cc 73
hxx 2
hh 1
Name: count, dtype: int64
df["suffix"][~df["is_cuda"]].value_counts()
suffix
h 409261
cpp 198084
c 88337
cc 73982
hpp 47825
cxx 24912
cu 22601
cuh 5257
hh 3051
hxx 2985
c++ 13
cp 3
hp 3
Name: count, dtype: int64
languages = []
for result in results:
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    username, reponame = result["name"].split("/", 1)
    languages.append({"user": username, "repo": reponame})
    for cfile in result["c"]:
        if cfile["data"]["num_cuda"] > 0 or cfile["suffix"] in ("cu", "cuh"):
            languages[-1]["CUDA"] = True
        elif cfile["data"]["is_c"]:
            languages[-1]["C"] = True
        else:
            languages[-1]["C++"] = True
    for k, v in result["other_language"].items():
        if v > 0:
            languages[-1][k] = True
df = pd.DataFrame(languages).fillna(False)
df
/tmp/ipykernel_76379/2710413475.py:21: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df = pd.DataFrame(languages).fillna(False)
| | user | repo | C | C++ | Cython | Julia | Swift | Go | CUDA | Java | ... | R | Rust | MATLAB | Fortran | Groovy | Scala | Kotlin | F# | Haskell | Ada |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JeffreyMinucci | ht_occupational | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | dreamento | dreamento | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | nitin7478 | Backorder_Prediction | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | exafmm | pyexafmm | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | astro-informatics | sleplet | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13087 | haowen-xu | tensorkit | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 13088 | WONDER-project | OASYS1-WONDER | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 13089 | WONDER-project | Orange3-WONDER | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 13090 | maciej-sypetkowski | autoascend | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 13091 | FS-CSCI150-F21 | FS-CSCI150-F21-Team4 | True | True | True | False | False | False | False | False | ... | False | False | True | True | False | False | False | False | False | False |

13092 rows × 24 columns
len(df) / len(results)
0.9689882318111168
In the following, "C++" and "C" are mutually exclusive categories of file ("does it compile in pycparser or not?"), but the bars are not mutually exclusive because a repo can contain a C++ file and also a pure C file.
"CUDA" is not exclusive with respect to "C++" and "C"; it corresponds to any C-like file with <<<
>>>
in it.
fig, ax = plt.subplots(figsize=(6, 4.5))
(df.drop(columns=["user", "repo"]).sum(axis=0).sort_values() / len(df)).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("non-Python languages (represented by at least one file)")
None
fig, ax = plt.subplots(figsize=(9, 4))
(df.drop(columns=["repo"]).groupby("user").any().sum(axis=0).sort_values()[4:] * 100 / len(df)).plot.barh(ax=ax)
ax.set_xlabel("Percentage of GitHub users who 'import numba' or 'from numba import' in Python")
ax.set_ylabel("Other language used (at least one file)")
ax.xaxis.grid(linestyle="--")
ax.set_axisbelow(True)
# fig.savefig("numba-users-other-language.svg")
# fig.savefig("numba-users-other-language.pdf")
df.drop(columns=["repo"]).groupby("user").any().sum(axis=0).sort_values()
Ada 1
Haskell 13
F# 14
Kotlin 24
Swift 29
Groovy 32
Scala 43
Ruby 90
Rust 91
Julia 92
Go 93
C# 115
Perl 177
Java 300
R 317
Fortran 465
MATLAB 533
Mathematica 692
CUDA 1270
Cython 1406
C 1511
C++ 2850
dtype: int64
STDLIB_MODULES = {
"__main__",
"string",
"re",
"difflib",
"textwrap",
"unicodedata",
"stringprep",
"readline",
"rlcompleter",
"struct",
"codecs",
"datetime",
"calendar",
"collections",
"heapq",
"bisect",
"array",
"weakref",
"types",
"copy",
"pprint",
"reprlib",
"enum",
"numbers",
"math",
"cmath",
"decimal",
"fractions",
"random",
"statistics",
"itertools",
"functools",
"operator",
"pathlib",
"fileinput",
"stat",
"filecmp",
"tempfile",
"glob",
"fnmatch",
"linecache",
"shutil",
"macpath",
"pickle",
"copyreg",
"shelve",
"marshal",
"dbm",
"sqlite3",
"zlib",
"gzip",
"bz2",
"lzma",
"zipfile",
"tarfile",
"csv",
"configparser",
"netrc",
"xdrlib",
"plistlib",
"hashlib",
"hmac",
"secrets",
"os",
"io",
"time",
"argparse",
"getopt",
"logging",
"getpass",
"curses",
"platform",
"errno",
"ctypes",
"threading",
"multiprocessing",
"concurrent",
"subprocess",
"sched",
"queue",
"_thread",
"_dummy_thread",
"dummy_threading",
"contextvars",
"asyncio",
"socket",
"ssl",
"select",
"selectors",
"asyncore",
"asynchat",
"signal",
"mmap",
"email",
"json",
"mailcap",
"mailbox",
"mimetypes",
"base64",
"binhex",
"binascii",
"quopri",
"uu",
"html",
"xml",
"webbrowser",
"cgi",
"cgitb",
"wsgiref",
"urllib",
"ftplib",
"poplib",
"imaplib",
"nntplib",
"smtplib",
"smtpd",
"telnetlib",
"uuid",
"socketserver",
"xmlrpc",
"ipaddress",
"audioop",
"aifc",
"sunau",
"wave",
"chunk",
"colorsys",
"imghdr",
"sndhdr",
"ossaudiodev",
"gettext",
"locale",
"turtle",
"cmd",
"shlex",
"tkinter",
"typing",
"pydoc",
"doctest",
"unittest",
"lib2to3",
"test",
"bdb",
"faulthandler",
"pdb",
"timeit",
"trace",
"tracemalloc",
"distutils",
"ensurepip",
"venv",
"zipapp",
"sys",
"sysconfig",
"builtins",
"warnings",
"dataclasses",
"contextlib",
"abc",
"atexit",
"traceback",
"__future__",
"gc",
"inspect",
"site",
"code",
"codeop",
"zipimport",
"pkgutil",
"modulefinder",
"runpy",
"importlib",
"parser",
"ast",
"symtable",
"symbol",
"token",
"keyword",
"tokenize",
"tabnanny",
"pyclbr",
"py_compile",
"compileall",
"dis",
"pickletools",
"formatter",
"msilib",
"msvcrt",
"winreg",
"winsound",
"posix",
"pwd",
"spwd",
"grp",
"crypt",
"termios",
"tty",
"pty",
"fcntl",
"pipes",
"resource",
"nis",
"syslog",
"optparse",
"imp",
"posixpath",
"ntpath",
}
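As an aside, since the analysis ran on Python 3.10, this hand-maintained list can be cross-checked against the interpreter's own (available from 3.10 onward; some of the names above had already been removed from the standard library by then):

```python
import sys

# modules in the hand-maintained list that CPython 3.10+ no longer considers stdlib
print(sorted(STDLIB_MODULES - set(sys.stdlib_module_names)))
```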
# https://stackoverflow.com/a/2029106/1623645
C_STDLIB_MODULES = set([
"aio.h",
"algorithm",
"any",
"arpa/inet.h",
"array",
"assert.h",
"atomic",
"barrier",
"bit",
"bitset",
"cassert",
"ccomplex",
"cctype",
"cerrno",
"cfenv",
"cfloat",
"charconv",
"chrono",
"cinttypes",
"ciso646",
"climits",
"clocale",
"cmath",
"codecvt",
"compare",
"complex",
"complex.h",
"concepts",
"condition_variable",
"coroutine",
"cpio.h",
"csetjmp",
"csignal",
"cstdalign",
"cstdarg",
"cstdbool",
"cstddef",
"cstdint",
"cstdio",
"cstdlib",
"cstring",
"ctgmath",
"ctime",
"ctype.h",
"cuchar",
"curses.h",
"cwchar",
"cwctype",
"deque",
"dirent.h",
"dlfcn.h",
"errno.h",
"exception",
"execution",
"expected",
"fcntl.h",
"fenv.h",
"filesystem",
"flat_map",
"flat_set",
"float.h",
"fmtmsg.h",
"fnmatch.h",
"format",
"forward_list",
"fstream",
"ftw.h",
"functional",
"future",
"generator",
"glob.h",
"grp.h",
"iconv.h",
"initializer_list",
"inttypes.h",
"iomanip",
"ios",
"iosfwd",
"iostream",
"iso646.h",
"istream",
"iterator",
"langinfo.h",
"latch",
"libgen.h",
"limits",
"limits.h",
"list",
"locale",
"locale.h",
"map",
"math.h",
"mdspan",
"memory",
"memory_resource",
"monetary.h",
"mqueue.h",
"mutex",
"ndbm.h",
"netdb.h",
"net/if.h",
"netinet/in.h",
"netinet/tcp.h",
"new",
"nl_types.h",
"numbers",
"numeric",
"optional",
"ostream",
"poll.h",
"print",
"pthread.h",
"pwd.h",
"queue",
"random",
"ranges",
"ratio",
"regex",
"regex.h",
"sched.h",
"scoped_allocator",
"search.h",
"semaphore",
"semaphore.h",
"set",
"setjmp.h",
"shared_mutex",
"signal.h",
"source_location",
"span",
"spanstream",
"spawn.h",
"sstream",
"stack",
"stacktrace",
"stdalign.h",
"stdarg.h",
"stdatomic.h",
"stdbit.h",
"stdbool.h",
"stdckdint.h",
"stddef.h",
"stdexcept",
"stdfloat",
"stdint.h",
"stdio.h",
"stdlib.h",
"stdnoreturn.h",
"stop_token",
"streambuf",
"string",
"string.h",
"strings.h",
"string_view",
"stropts.h",
"strstream",
"syncstream",
"sys/ipc.h",
"syslog.h",
"sys/mman.h",
"sys/msg.h",
"sys/resource.h",
"sys/select.h",
"sys/sem.h",
"sys/shm.h",
"sys/socket.h",
"sys/stat.h",
"sys/statvfs.h",
"system_error",
"sys/time.h",
"sys/times.h",
"sys/types.h",
"sys/uio.h",
"sys/un.h",
"sys/utsname.h",
"sys/wait.h",
"tar.h",
"term.h",
"termios.h",
"tgmath.h",
"thread",
"threads.h",
"time.h",
"trace.h",
"tuple",
"typeindex",
"typeinfo",
"type_traits",
"uchar.h",
"ulimit.h",
"uncntrl.h",
"unistd.h",
"unordered_map",
"unordered_set",
"utility",
"utime.h",
"utmpx.h",
"valarray",
"variant",
"vector",
"version",
"wchar.h",
"wctype.h",
"wordexp.h",
])
num_with_numba = 0
python_imports = Counter()
c_imports = Counter()
for result in results:
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1
    counter = Counter()
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"]):
                if x not in STDLIB_MODULES and x != "numba":
                    counter[x] += 1
    for x in counter:
        python_imports[x] += 1
    counter = Counter()
    for cfile in result["c"]:
        if cfile["data"] is not None:
            for x in list(cfile["data"]["global"]) + list(cfile["data"]["local"]):
                if x not in C_STDLIB_MODULES and x != "numba":
                    counter[x] += 1
    for x in counter:
        c_imports[x] += 1
python_imports = sorted(python_imports.items(), key=lambda x: -x[1])
c_imports = sorted(c_imports.items(), key=lambda x: -x[1])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 9))
(pd.Series(dict(python_imports[:50])).sort_values() / num_with_numba).plot.barh(ax=ax1)
ax1.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax1.set_title("top Python imports (not standard library)")
ax1.set_xlim(0, 1)
(pd.Series(dict(c_imports[:50])).sort_values() / num_with_numba).plot.barh(ax=ax2)
ax2.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax2.set_title("top C and C++ includes (not standard library)")
# ax2.set_xlim(0, 1)
plt.subplots_adjust(wspace=0.8)
None
num_with_numba = 0
numba_references = Counter()
for result in results:
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1
    counter = Counter()
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in pyfile["data"]["numba"]:
                y = x.lstrip("@").split("(")[0]
                if x.startswith("numba.jit") and "nopython=True" in x:
                    y = "numba.njit"
                counter[y] += 1
    for x in counter:
        numba_references[x] += 1
numba_references = sorted(numba_references.items(), key=lambda x: -x[1])
fig, ax = plt.subplots(figsize=(6, 9))
(pd.Series(dict(numba_references[:50])).sort_values() / num_with_numba).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("top Numba API calls")
# ax1.set_xlim(0, 1)
None
JIT_FUNCTIONS = {"numba.jit", "numba.njit", "numba.generated_jit", "numba.vectorize", "numba.guvectorize", "numba.cfunc", "numba.cuda.jit"}
fig, ax = plt.subplots(figsize=(6, 2))
(pd.Series({k: v for k, v in numba_references if k in JIT_FUNCTIONS}).sort_values() / num_with_numba).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("Numba JIT API calls")
# ax1.set_xlim(0, 1)
None
num_with_numba = 0
jit_arguments = Counter()
for result in results:
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1
    counter = Counter()
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in pyfile["data"]["numba"]:
                if "(" in x and (x.lstrip("@").startswith("numba.jit") or x.lstrip("@").startswith("numba.njit")):
                    for arg in x.split("(", 1)[1].rstrip(")").split(","):
                        if "=" in arg:
                            counter[arg.strip()] += 1
                if x.lstrip("@").startswith("numba.njit"):
                    counter["nopython=True"] += 1
    for x in counter:
        jit_arguments[x] += 1
jit_arguments = sorted(jit_arguments.items(), key=lambda x: -x[1])
fig, ax = plt.subplots(figsize=(6, 4.5))
(pd.Series(dict(jit_arguments[:17])).sort_values() / num_with_numba).plot.barh(ax=ax)
ax.set_xlabel("fraction of non-fork repos that contain 'import numba'")
ax.set_title("top numba.jit arguments")
# ax1.set_xlim(0, 1)
None
num_with_numba = 0
num_with_numba_cuda = 0
for result in results:
    for pyfile in result["python"]:
        if pyfile["data"] is not None and any(x == "numba" or x.startswith("numba.") for x in list(pyfile["data"]["top"]) + list(pyfile["data"]["nested"])):
            break
    else:
        continue
    num_with_numba += 1
    any_cuda = False
    for pyfile in result["python"]:
        if pyfile["data"] is not None:
            for x in pyfile["data"]["numba"]:
                if x.startswith("numba.cuda"):
                    any_cuda = True
    if any_cuda:
        num_with_numba_cuda += 1
num_with_numba_cuda / num_with_numba
0.13978001833180567
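In other words, about 14% of the non-fork repos that import Numba reference `numba.cuda` somewhere in their Python files.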