subsuper-proposal.tex

\documentclass[10pt,english]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{geometry}
\geometry{verbose,lmargin=1in,rmargin=1in}
\usepackage{babel}
\usepackage{mathrsfs}
\usepackage{amsbsy}
\usepackage{amssymb}
\usepackage[unicode=true]{hyperref}

\usepackage{fouriernc}
\usepackage[T1]{fontenc}

\usepackage{todonotes} %XXX REMOVE ME WHEN DONE
\newcommand{\TODO}[1]{\todo[inline]{#1}}

\makeatletter

\usepackage{upgreek}
\usepackage{cite}

\newcommand{\Secref}[1]{Section~\ref{sec:#1}}
\newcommand{\secref}[1]{Sec.~\ref{sec:#1}}

\makeatother

\begin{document}

\title{Proposal to Encode Subscripts/Superscripts\\
for Use in Programming-Language Source Code}


\author{
Dr.\ Jiahao Chen, AI Research Director, JPMorgan AI Research and Research Scientist, MIT CSAIL
\and
Prof.\ Alan Edelman, MIT Department of Mathematics
\and
Prof.\ Steven G.\ Johnson, MIT Department of Mathematics
\and
Stefan Karpinski, Chief Open Source Officer, Julia Computing Inc.
}

\maketitle


\TODO{%
.... add other endorsers here --- we don't want to turn this into
an Internet petition, but endorsements from well-known developers,
organizations, open-source projects, and companies, could be helpful
}

\abstract{Widespread Unicode support in programming-language parsers now allows software developers in technical fields to use notations like Greek letters and sub/superscripts \emph{in their source code} that closely follow standard mathematical notations. This development has been hampered by the fact that only a subset of Latin and Greek characters have sub/superscript codepoints in Unicode 13, and so we propose to expand the set of sub/superscript characters to encompass common mathematical usage, ideally by simply adding two new combining characters.}


\section{Summary}
\label{sec:summary}

Many recent programming languages (including C\#~\cite{Csharp}, Fortress~\cite{Fortress},
Go~\cite{Go}, Java~\cite{Java}, Julia~\cite{Julia}, Python~3~\cite{Python}, Scala\cite{Scala}, and Swift\cite{Swift}) allow
a broad range of Unicode characters to be employed \textbf{in source code} by the programmer
for identifiers (names of functions and variables), and indeed such
behavior is recommended by the \emph{Unicode Standard Annex~\#31}~\cite{UAX31}.
For programmers working in mathematical fields, including statistics,
science, and engineering, where a wide array of mathematical notations
and symbols are commonplace, this often enables computer programs
to be written in a notation that closely mimics standard mathematical
descriptions. For example, in a mathematical context one might easily
have a quantity with a name like $\hat{x}$ or $\alpha_{x}$, and
the same names can now be used in computer programs for this quantity
($\mathrm{\hat{{x}}}$ = U+0078 U+0302, $\mathrm{{\upalpha_{x}}}$
= U+03B1 U+2093) rather than ASCII approximations such as \texttt{xhat}
and \texttt{alpha\_x} that are used in older programming languages.
Furthermore, Unicode-capable \emph{code-editing software} has made it easy
for programmers to type such symbols by ``tab completion''
(e.g. typing ``$\mathrm{{\upalpha_{x}}}$''
as \texttt{\textbackslash{}alpha}\texttt{\emph{{[}tab{]}}}\texttt{\textbackslash{}\_x}\texttt{\emph{{[}tab{]}}})
or related techniques in several popular editors~\cite{vsCode,Atom,Emacs,Sublime,Vim}
and interactive-programming environments~\cite{Julia,IPython}. This
is an important and positive trend. To quote a 2001 draft of \emph{Unicode
Technical Report~\#25}~\cite[section 5.3]{UTR25}, \emph{Using real
mathematical expressions in computer programs would be far superior
in terms of readability, reduced coding times, program maintenance,
and streamlined documentation.}

One notation that is extremely common throughout all of mathematics
is the use of subscripts and superscripts. Unfortunately, Unicode
13.0 provides only a very limited selection of subscript and superscript
characters (digits 0--9, parentheses, $+$/$-$/$=$, a subset of
the Latin and Greek alphabets, and a few phonetic symbols). Therefore,\textbf{
we propose that Unicode be extended to encode all the subscript/superscript
characters most commonly used in mathematical notation}: all lower-
and upper-case Latin and Greek letters (and variants thereof) and
a selection of other mathematical symbols described below. Because
these are all derived from existing Unicode codepoints, the assignment
of their properties and the development of their fonts will be straightforward.
As one possible implementation strategy, we suggest that the Consortium
\textbf{add two new ``mathematical subscript'' and ``mathematical
superscript'' combining characters} that convert the preceding character
(or character + combining characters) into a sub/superscript. (This is
closely related to the superscript and ``scientific
inferiors'' features already present in OpenType~\cite{OpenType}.)
Although the combining-character approach is potentially very general, it could
initially be suggested that the new combining characters may be ignored
except following a limited subset of characters commonly used in
mathematical notation (similar to how the emoji skin-tone modifiers only affect
rendering of a small set of emoji characters with skin-tone variants).

We should mention that there were past proposals to extend the range
of subscript and superscript characters in Unicode~\cite{L2-10-230,L2-11-208}
(neither of which mentioned mathematical or programming applications)
that were rejected. We have been unable to find an official record
of the grounds for rejection of these proposals, but there seem to
be two common objections in discussions of this topic on Unicode forums~\cite{Miller10,UCDF}.
First, that this is \emph{mere ``text presentation''}
devoid of semantic content, which is better accomplished via markup
languages like MathML or other typesetting technologies such as the OpenType MATH table. Second, that
the set of characters that deserve sub/superscript variants is unclear
and might be subjected to never-ending extension. We believe that
both of those objections must be re-evaluated in light of mathematical
programming. First, subscripts and superscripts of any kind in mathematics
always have semantic content ($\mathrm{{\upalpha_{x}}}$ is very different
in meaning from $\mathrm{\upalpha\mathrm{{x}}}$). More importantly, \textbf{programming
language source code is almost invariably restricted to ``plain text''}---in
ordinary usage, computer code (and especially identifiers) cannot
be intermingled with MathML markup or other formatting metadata---with
a few exceptions such as Mathematica~\cite{Mathematica} that are
edited and/or rendered mainly within specialized software. Finally,
the restriction to
\emph{common mathematical notations} greatly limits the scope
of the additions, excluding the vast majority of Unicode characters
from consideration.

It is not practical to encode the full range of mathematical
notations in Unicode; MathML and similar will always be useful,
and generally should not be combined with Unicode-based formatting.
However, in programming contexts, where non-Unicode math formatting
is unavailable, increasing the set of available subscripts
and superscripts is a ``low-hanging fruit'' that would be tremendously
beneficial in technical fields. We respectfully urge you to expand the
set of mathematical subscripts and superscripts, and especially to
eliminate frustrating omissions from the Latin and Greek alphabets.


\section{Suggested Subscript/Superscript Characters}
\label{sec:required}

The following existing Unicode codepoints are widely used in subscripts and superscripts of mathematical notation, and should therefore have new sub/superscript variants (ideally via combining characters as described below, which will allow the set of available characters to be expanded incrementally by font providers as described in \secref{fonts}):

\begin{itemize}

\item Latin upper- and lower-case characters `a'--`z' (U+0061--007A) and `A'--`Z' (U+0041--005A). (See \secref{existingchars} regarding the subset of these already present in Unicode as sub/superscripts.)

\item Greek upper- and lower-case characters `$\upalpha$'--`$\upomega$' (U+03B1--03C9) and `A'--`$\Upomega$' (U+0391--03A9) as well as lunate epsilon `$\upepsilon$' (U+03F5).   (See \secref{existingchars} regarding the small subset of these already present in Unicode as sub/superscripts.)

\item Decimal digits `0'--`9'  (U+0030--0039): these already have acceptable sub/superscript characters (U+2080--2089 and U+2070--2079), which should be canonically equivalent to the corresponding digit followed by a sub/superscript combining character as described below.

\item Common mathematical operators: `$+$` (U+002B), `$-$` (U+002D and U+2212), `$\pm$' (U+00B1), `$*$' (U+002A), `$\times$' (U+00D7), `/' (U+002F), `$=$' (U+003D), `$<$` (U+003C), `$>$' (U+003E).   Of these, `$+$', and `$-$', `$=$' already have acceptable sub/superscript characters (U+207A--207C and U+0208A--208C), which should be canonically equivalent to the corresponding character followed by a sub/superscript combining character as described below.

\item Other important mathematical symbols widely used in sub/superscript form: `$\Vert$'  (U+2016), `$\perp$' (U+27C2), `$\infty$' (U+221E), `$\dagger$' (U+2020), `$\angle$' (U+2220).

\item Brackets: `(' and `)' (U+0028--0029),  `[ and `]' (U+005B and U+005D), `\{' and `\}' (U+007B and U+007D).   Of these, parentheses `()' already have acceptable sub/superscript characters (U+208D--208E and U+207D--207E), which should be canonically equivalent to the corresponding character followed by a sub/superscript combining character as described below.

\item Comma `,' (U+002C), colon `:' (U+003A), and semicolon `;' (U+003B).

\end{itemize}

\subsection{Relationship to existing sub/superscript letters}
\label{sec:existingchars}

A subset of Latin and Greek letters already have sub/superscript variants in Unicode.  For example, `$^\mathrm{a}$' is U+1D43 and `$_\mathrm{a}$' is U+2090, while `$^\mathrm{A}$' is U+1D2C but there is no subscript `A' character.   However, these codepoints seem to have been added piecemeal into the Unicode standard for a variety of semantic meanings unrelated to their usage in mathematical notation (e.g. `$^\mathrm{a}$' is in the Latin-1 Supplement block, whereas `$^\mathrm{k}$' is in the Phonetic Extensions block).  Therefore we suggest that these existing characters \textbf{should not be canonically equivalent} to the proposed new mathematical sub/superscript characters (whether implemented by new codepoints or by combining characters).   However, the existing Latin and Greek sub/superscripts \textbf{should at least be compatible} with the new characters because in a number of contexts (e.g. programming code where the existing superscripts are already in use) they should be interchangeable in practice.

Several other proposed canonical equivalences are noted in \secref{required}, above.

\section{Proposed new combining characters}

We believe that the simplest and most flexible way to support mathematical subscripts and superscripts would be to \textbf{add two new combining characters}, ``MATHEMATICAL SUBSCRIPT'' and ``MATHEMATICAL SUPERSCRIPT,'' which would indicate that the \textbf{preceding grapheme} should form a sub/superscript, respectively.   As mentioned in \secref{summary}, text-rendering software might optionally display only a limited set of characters in sub/superscript form, following a recommended list such as the one in \secref{required}, implemented as variant characters or ligatures (e.g. similar to emoji skin-tone variants) as described in \secref{fonts}.

Both of these characters could be put into the general category Mn (nonspacing marks). Placing them in the nonspacing marks category would automatically allow the sub/superscript combining characters to be used following any valid identifier character in programming languages implementing \emph{Unicode Standard Annex~\#31}~\cite{UAX31} or similar rules, since category Mn is included in the \texttt{ID\_Continue} lexical class of \emph{Annex~\#31}.  (That is, the languages would support the new combining characters automatically once their parsers are updated to use the new Unicode data tables, which typically happens within a reasonable interval after each Unicode revision.)   Alternatively, they could be added to category Sk (symbol modifiers), similar to the emoji skin-tone variants, with an amendment to \emph{Annex~\#31} adding them to \texttt{ID\_Continue}.

The new combining characters should have bidirectional class ``Other Neutrals (ON)'' and mirror class ``N'' similar to the emoji modifiers, with no upper/lowercase mapping, should be treated like other combining characters for line-breaking and collation, and should be allowed in identifiers as lexical class \texttt{ID\_Continue} as mentioned above.   Several  canonical equivalencies to existing sub/superscript characters are suggested in \secref{required}.

Some programming languages may further opt to allow additional sub/superscript mathematical symbols in identifiers on a case-by-case basis. For example, Julia~\cite{Julia} already allows sub/superscript parentheses in identifiers (e.g. $\upchi^{(3)}$ is a valid Julia identifier for a ``Kerr susceptibility'' in optics) so it might be natural in Julia to allow sub/superscript commas and asterisks (e.g. $\mathrm{p}^*$ or $\mathrm{p}_*$ may denote the ``primal optimum'' in convex-optimization problems).

\subsection{Alternative: Many new sub/superscript characters}

The alternative to adding new combining characters is to simply add new characters for the individual sub/superscripts proposed in \secref{required}, which would then be treated similarly to existing sub/superscript characters.  They would be placed in general category Lm (modifier letter) for Latin and Greek letters, Ps/Pe (open/close punctuation) for brackets, Po (other punctuation) for comma and semicolon, and Sm (math symbol) for mathematical symbols.   The disadvantages of this approach are obvious: many new codepoints must be chosen and documented, and supporting additional mathematical sub/superscripts in the future will require additional revisions to the standard (as opposed to improvements in fonts or rendering software).

\section{Lack of Possible Equivalents}

As explained in \secref{required}, Unicode currently contains only a subset of the most common mathematical subscripts and superscripts, and omits many Latin and Greek letters (e.g. there is no superscript `q' or subscript `A').  Furthermore, formatting/presentation markup like MathML, OpenDocument format (ODF), and similar typesetting/text-formatting technologies are \textbf{not usable for source code} in programming languages, nearly all of which require ``plain text'' input with no XML or other formatting metadata.

\section{Provision of Fonts}
\label{sec:fonts}

Because the required glyphs are merely sub/superscript variants of existing characters, relatively little effort will be required to support the new combining characters for common mathematical usage. Common font editing software provides automated ways to take a set of glyphs and generate transformed variants by scaling them and shifting the baseline.  Having generated a sub/superscript glyph, e.g. a subscript `A', the font can then define a glyph substitution sequence indicating that `A' followed by the new subscript combining character will be rendered as a subscript-`A' \emph{ligature}, e.g. using an OpenType glyph-substitution table~\cite{opentypeGSUB}.  (Even if every glyph suggested in \secref{required} is implemented in this way, it still represents only about 250 glyphs.)

Draft versions of the proposed glyphs in \secref{required} have already been implemented in the JuliaMono TrueType font~\cite{juliamono}, and in collaboration with Stefan~Karpinski (co-author of this proposal) we plan to support the new combining character via these glyphs in JuliaMono if this proposal is accepted.

In the longer run, text-rendering software such as Microsoft DirectWrite, Apple Core Text, or the free/open-source HarfBuzz engine may choose to automate this process even for fonts
lacking such substitution sequences.   While this would add flexibility and improve adoption, however, it is not required---\emph{existing text rendering software would be sufficient} given a font
providing sub/superscript ligatures.

\section{Contact Information}

\TODO{``names and addresses of appropriate contacts
within national body or user organizations''}
\begin{itemize}
\item Steven G. Johnson, MIT Room 2-345, 77 Massachusetts Ave. Cambridge,
MA 02139.

\end{itemize}
\begin{thebibliography}{10}

\bibitem{Csharp}Microsoft Corporation, \textit{C\# Language Specification},
version 5.0, section 2.4.2 (2012).

\bibitem{Fortress}Sun Microsystems, Inc., \textit{The Fortress Language
Specification}, version 1.0, \url{http://www.ccs.neu.edu/home/samth/fortress-spec.pdf}
(2008-03-31).

\bibitem{Go}The Go Authors, \textit{The Go Programming Language Specification},
\url{https://golang.org/ref/spec} (2020-01-14).

\bibitem{Java}James Gosling, Bill Joy, Guy Steele, Gilad Bracha,
and Alex Buckley, \textit{The Java Language Specification: Java SE 8
Edition}, section 3.8, \url{http://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html\#jls-3.8}
(2015-02-13).

\bibitem{Julia}The Julia Authors, \textit{The Julia Programming Language
Manual}, \url{http://docs.julialang.org/en/latest/manual/variables/}
(retrieved 2016-08-25).

\bibitem{Python}Martin von L{\"o}wis, ``Python PEP 3131:
Supporting Non-ASCII Identifiers,'' \url{https://www.python.org/dev/peps/pep-3131/}
(2007-05-01).

\bibitem{Scala}M.~Odersky \textit{et al.}, \textit{Scala Language Specification},
version 2.13, \url{https://scala-lang.org/files/archive/spec/2.13/01-lexical-syntax.html}
(2020-08-25).

\bibitem{Swift}Apple Inc, \textit{The Swift Programming Language},
version 5.3, \url{https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html}
(2020-06-22).

\bibitem{UAX31}Mark Davis, \textit{Unicode Standard Annex \#31: Unicode
Identifier and Pattern Syntax}, version 13.0.0, \url{https://www.unicode.org/reports/tr31/tr31-33.html}
(2020-02-13).

\bibitem{vsCode}Unicode LaTeX plugin for the vsCode editor,
\url{https://marketplace.visualstudio.com/items?itemName=oijaz.unicode-latex}
(retrieved 2020-08-24).

\bibitem{Atom}atom-latex-completions plugin for the Atom editor,
\url{https://github.com/JunoLab/atom-latex-completions}
(retrieved 2020-08-24).

\bibitem{Emacs}\TeX{} Input Method for the Emacs editor, \url{https://www.emacswiki.org/emacs/TeXInputMethod}
(retrieved 2020-08-24).

\bibitem{Sublime}UnicodeMath plugin for the Sublime editor, \url{https://github.com/mvoidex/UnicodeMath}
(retrieved 2020-08-24).

\bibitem{Vim}julia-vim plugin for the vim editor, \url{https://github.com/JuliaLang/julia-vim}
(retrieved 2020-08-24).

\bibitem{IPython}Brian E. Granger, ``Adds Julia-style
latex->unicode tab completion,'' \textit{IPython pull
request \#6380}, \url{https://github.com/ipython/ipython/pull/6380}
(2014-08-28).

\bibitem{UTR25}Barbara Beeton, Asmus Freytag, and Murray Sargent
III, ``Proposed Draft Unicode Technical Report \#25:
Unicode Support for Mathematics'' \url{http://www.unicode.org/unicode/reports/tr25/tr25-1.html} (2001-10-10).

\bibitem{L2-10-230}Karl Pentzlin, ``Proposal to encode
a modifier letter used in French abbreviations in the UCS,''
\url{http://www.unicode.org/L2/L2010/10230-modifier-q.pdf}
(2010-07-13).

\bibitem{L2-11-208}ISO/IEC JTC1/SC35/WG1, ``Proposal
to encode missing Latin small capital and modifier letters in the
UCS,'' \url{http://www.unicode.org/L2/L2011/11208-n4068.pdf}
(2011-03-12).

\bibitem{Miller10}Eric Miller, ``Comment on L2-10-230,
Proposal to encode a modifier letter used in French abbreviations
in the UCS,'' \url{http://www.unicode.org/L2/L2010/10315-comment.pdf}
(2010-08-09).

\bibitem{UCDF}``Superscript latin small letter q,''
discussion on \emph{The Unicode Consortium Discussion Forum}, \url{http://www.unicode.org/forum/viewtopic.php?f=24&t=142}
(2011) (retrieved at at \url{https://web.archive.org/web/20151019072726/http://www.unicode.org/forum/viewtopic.php?f=24&t=142}  2020-08-24).

\bibitem{Mathematica}Stephen Wolfram, ``Mathematical
Notation: Past and Future,'' \url{http://www.stephenwolfram.com/publications/mathematical-notation-past-future/}
(2000).

\bibitem{OpenType}Microsoft Corporation, ``Registered
Features,'' \emph{OpenType Specification}, version
1.7 \url{https://www.microsoft.com/typography/otspec/features_pt.htm\#sinf}
(2008-11-19). 

\bibitem{opentypeGSUB}Microsoft Corporation, ``GSUB---Glyph Substitution Table,'' \textit{OpenType Specification} \url{https://docs.microsoft.com/en-us/typography/opentype/spec/gsub} (2018-08-15).

\bibitem{juliamono}``JuliaMono---A monospaced font for scientific and technical computing,'' \url{https://cormullion.github.io/pages/2020-07-26-JuliaMono/} (2020-07-26).

\end{thebibliography}

\end{document}