Improve and maintain Doc/data/refcounts.dat #127443

Open · 4 of 7 tasks
picnixz opened this issue Nov 30, 2024 · 22 comments

@picnixz (Contributor) commented Nov 30, 2024

The Doc/data/refcounts.dat file is sometimes updated and sometimes not. I suggest we try to keep it as up-to-date as possible, at least by adding the functions that are part of the stable ABI.

In addition, the entries are not sorted alphabetically. While they are semantically grouped, that does not help when adding new entries (where should a new one go?). So I suggest reformatting the file once I'm done adding the new entries.

Finally, I suggest adding a small script that checks whether the file is up-to-date, at least by verifying that the stable ABI functions are a subset of that file and that refcounts.dat is sorted (again, this would run as part of a CI workflow).
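
As an illustration (not an existing CPython tool), a minimal sketch of such a check could look like the following. It assumes the usual name:type:arg:refcount:comment layout of refcounts.dat and that Misc/stable_abi.toml lists functions under [function.*] tables; both the paths and the manifest layout are assumptions to verify before relying on it.

# check_refcounts_sketch.py -- hypothetical helper, requires Python 3.11+ (tomllib)
import sys
import tomllib

def dat_functions(path="Doc/data/refcounts.dat"):
    """Return the function names in refcounts.dat, in file order, deduplicated."""
    names = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                names.append(line.split(":", 1)[0])
    return list(dict.fromkeys(names))

def stable_abi_functions(path="Misc/stable_abi.toml"):
    """Return the set of function names declared in the stable ABI manifest."""
    with open(path, "rb") as f:
        return set(tomllib.load(f).get("function", {}))

def main():
    names = dat_functions()
    missing = stable_abi_functions() - set(names)
    status = 0
    if missing:
        print("stable ABI functions missing from refcounts.dat:")
        print("\n".join(sorted(missing)))
        status = 1
    if names != sorted(names):
        print("refcounts.dat entries are not sorted alphabetically")
        status = 1
    return status

if __name__ == "__main__":
    sys.exit(main())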

The main motivation behind this is to help newcomers understand whether a reference is borrowed or not (sometimes the documentation doesn't say anything). It could also help with a future linter. Note that we currently have no syntax to indicate whether a reference is stolen, so we should also improve that part.

cc @vstinner @encukou @Eclips4 @ZeroIntensity

picnixz added the docs (Documentation in the Doc dir) label Nov 30, 2024
picnixz self-assigned this Nov 30, 2024
picnixz changed the title from "Include stable ABI functions in Doc/data/refcounts.dat" to "Improve and maintain Doc/data/refcounts.dat" Nov 30, 2024
@ZeroIntensity (Member):

I think it's worth trying to maintain this for all of the public C API, not just the stable ABI. If the goal is to design a static analyzer, then it would be ideal to have everything.

@picnixz (Contributor, Author) commented Nov 30, 2024

I think it's worth trying to maintain this for all of the public C API, not just the stable ABI. If the goal is to design a static analyzer, then it would be ideal to have everything.

Yes, but I'd like to start with the stable ABI for now. It was quite simple to extract that diff, but to extract all the exported functions and diff them, I think my one-liner would get a bit too long...

gianfrancomauro added a commit to gianfrancomauro/cpython that referenced this issue Nov 30, 2024
Related to python#127443

Add a script to check and update `refcounts.dat` and integrate it into CI workflows.

* **Add `Doc/tools/check_refcounts.py`**:
  - Create a script to check if `refcounts.dat` is up-to-date and sorted.
  - Implement logic to compare `refcounts.dat` with the stable ABI.
  - Implement logic to check if `refcounts.dat` is alphabetically sorted.
  - Handle cases where input files do not exist or cannot be read.
  - Allow passing file paths as command-line arguments.
  - Print total counts of entries and mismatches for better insights.
  - Optionally auto-sort and save the `refcounts.dat` file.

* **Add `Doc/tools/Makefile`**:
  - Add a target to run the `check_refcounts.py` script.

* **Update `.github/workflows/build.yml`**:
  - Add a CI job to run the `check_refcounts.py` script.
  - Ensure the job runs when a new function is added to the Stable ABI.

* **Update `.github/workflows/lint.yml`**:
  - Add a CI linter job to check if `refcounts.dat` is correctly formatted.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/python/cpython/issues/127443?shareId=XXXX-XXXX-XXXX-XXXX).
@picnixz (Contributor, Author) commented Nov 30, 2024

Ok, so having an autoformatter may be a bit tricky. I really don't want to create a C parser just to parse the type declaration. I think I'll just manually change them with some powerful regexes and that's it. But for now, I'll just wait for the other PRs to land before changing them. (Anyway, at some point I'll need to sort the file alphabetically, and the diff won't really matter :D)

@tomasr8 (Member) commented Nov 30, 2024

I really don't want to create a C parser just to parse the type declaration.

I've never used it, so I don't know what its current status is, but there is a C parser under Tools/c-analyzer/c_parser. It might also be a good starting point if we decide to add a proper static analyzer at some point.

@picnixz (Contributor, Author) commented Nov 30, 2024

Yes, but I felt it would have been overkill to use that one, actually. I just wanted the simplest possible parser for declarations, without the rest.

@picnixz (Contributor, Author) commented Dec 1, 2024

@ZeroIntensity For the TOML conversion, we have some limitations: we need to quote '$' and 'null' if they appear, and IMO we need to use either inline tables or inline arrays to keep the file size down. Instead, I suggest always creating a top-level section per function:

[$funcname$]

We can add "deprecated" and "comment" keys to that top-level section, which would help newcomers.

Followed by the return section:

[$funcname$.return]
ctype = "PyObject *"
refdelta = 1
# or for functions that always return NULL
[$funcname$.return]
ctype = "PyObject *"
idempotent = true

And finally the sections for the parameters; here's an example:

[$funcname$.params.a]
ctype = "PyObject *"
steals = true
[$funcname$.params.b]
ctype = "PyObject *"
refdelta = 1
[$funcname$.params.c]
ctype = "PyObject *"
refdelta = -1
[$funcname$.params.d]
ctype = "PyObject *"
refdelta = 0
[$funcname$.params.e]
ctype = "const char *"

Each section would also allow an optional "comment" key. For instance:

PyBytes_AS_STRING:char*:::
PyBytes_AS_STRING:PyObject*:string:0:

becomes

[PyBytes_AS_STRING]
[PyBytes_AS_STRING.return]
ctype = "char *"
[PyBytes_AS_STRING.params.string]
ctype = "PyObject *"
refdelta = 0

We can also compress it into:

[PyBytes_AS_STRING]
return.ctype = "char *"
params.string.ctype = "PyObject *"
params.string.refdelta = 0

WDYT?
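
As an editorial aside (not part of the proposal above): with tomllib, the expanded and the compressed spellings parse to the same mapping, so the choice between them is purely about readability. A quick self-contained check (Python 3.11+):

import tomllib

expanded = tomllib.loads("""
[PyBytes_AS_STRING]
[PyBytes_AS_STRING.return]
ctype = "char *"
[PyBytes_AS_STRING.params.string]
ctype = "PyObject *"
refdelta = 0
""")

compressed = tomllib.loads("""
[PyBytes_AS_STRING]
return.ctype = "char *"
params.string.ctype = "PyObject *"
params.string.refdelta = 0
""")

# Both spellings produce the same nested mapping.
assert expanded == compressed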

@ZeroIntensity (Member):

I'm not all that worried about compression--it's going to be a big file any way we do it. I think we should use an array of tables for the parameters instead of trying to create new keys for each parameter (it's significantly easier to parse it back into Python that way). For reference, the TOML I generated ended up looking like this:

[PyArg_Parse]
name = "PyArg_Parse"
return_type = "int"

[[PyArg_Parse.args]]
name = "args"
type = "PyObject *"
effect = 0

[[PyArg_Parse.args]]
name = "format"
type = "const char *"

[[PyArg_Parse.args]]
name = "..."
type = "ellipsis"

I'm not sure how we should denote some weird things like "always NULL" though. Special boolean flags like always_null = true or steal = true look pretty, but they'll pile up weirdly over time.
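
For what it's worth, here is a small sketch (mine, not ZeroIntensity's actual generator) showing how the array-of-tables layout above reads back into Python while preserving parameter order, including the variadic "..." pseudo-argument:

import tomllib

doc = tomllib.loads("""
[PyArg_Parse]
name = "PyArg_Parse"
return_type = "int"

[[PyArg_Parse.args]]
name = "args"
type = "PyObject *"
effect = 0

[[PyArg_Parse.args]]
name = "format"
type = "const char *"

[[PyArg_Parse.args]]
name = "..."
type = "ellipsis"
""")

# Parameters come back as an ordered list of dicts; a key that was not set
# (e.g. no refcount effect on "format") simply stays absent.
for arg in doc["PyArg_Parse"]["args"]:
    print(arg["name"], arg["type"], arg.get("effect", "unspecified"))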

@picnixz (Contributor, Author) commented Dec 1, 2024

I denote the "always NULL" case by saying that the function is idempotent, which is not entirely accurate. Or we could have always_null = true as you said, which is probably a better name.

By the way, it's probably better to have an array rather than a mapping in order to handle the variadic argument.

@ZeroIntensity (Member):

An array rather than a mapping for what?

@picnixz (Contributor, Author) commented Dec 1, 2024

For the list of parameters. My first suggestion was using a mapping where the key is the parameter name:

[PyBytes_AS_STRING.params.string]
ctype = "PyObject *"
refdelta = 0

but with your proposal, it would be:

[[PyArg_Parse.params]]
name = "string"
ctype = "PyObject *"
refdelta = 0

Eclips4 pushed a commit that referenced this issue Dec 2, 2024
Fix incorrect entries in `Doc/data/refcounts.dat`

miss-islington pushed two commits to miss-islington/cpython that referenced this issue Dec 2, 2024
gh-127443: Fix some entries in `Doc/data/refcounts.dat` (GH-127451)
Fix incorrect entries in `Doc/data/refcounts.dat`
(cherry picked from commit 1f8267b)
Co-authored-by: Bénédikt Tran <[email protected]>

Eclips4 pushed a commit that referenced this issue Dec 2, 2024 (#127496)
gh-127443: Fix some entries in `Doc/data/refcounts.dat` (GH-127451)
Fix incorrect entries in `Doc/data/refcounts.dat`
(cherry picked from commit 1f8267b)
Co-authored-by: Bénédikt Tran <[email protected]>

Eclips4 pushed a commit that referenced this issue Dec 2, 2024 (#127497)
gh-127443: Fix some entries in `Doc/data/refcounts.dat` (GH-127451)
Fix incorrect entries in `Doc/data/refcounts.dat`
(cherry picked from commit 1f8267b)
Co-authored-by: Bénédikt Tran <[email protected]>

picnixz added a commit to picnixz/cpython that referenced this issue Dec 2, 2024
gh-127443: Fix some entries in `Doc/data/refcounts.dat` (#127451)
Fix incorrect entries in `Doc/data/refcounts.dat`
@encukou (Member) commented Dec 2, 2024

I suggest putting the info in C headers rather than a new format -- unless you can generate the headers (or something else that's helpful) from the new format.

That said, I've tried it and found some dead ends. I still think it'd be worthwhile, though: an external file is bound to become unmaintained.
My latest idea (which I haven't really started on yet, alas) is that if you look for PyAPI_FUNC, everything up to the next ; is a function declaration.
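
A rough sketch of that idea (an editorial illustration, not an existing tool), assuming the public headers live under Include/ and that declarations are not hidden behind preprocessor tricks:

import re
from pathlib import Path

# Everything from PyAPI_FUNC up to the next ';' is treated as one declaration.
DECL = re.compile(r"PyAPI_FUNC\([^)]*\)[^;]*;", re.DOTALL)

for header in sorted(Path("Include").rglob("*.h")):
    text = header.read_text(encoding="utf-8")
    for match in DECL.finditer(text):
        decl = " ".join(match.group().split())  # collapse newlines and indentation
        print(f"{header}: {decl}")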

@picnixz (Contributor, Author) commented Dec 2, 2024

Actually, refcounts.dat is still used by the docs, so we need this file (whether it's auto-generated or not). Adding the info to the C headers directly still requires them to be parsed (we could use macros just for fake annotations, but I'm not sure we can do that easily for the entire project).

What we can do, however, is generate the (unfilled) entries directly from the headers, so that the developer only has to fill in the reference count effect.
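
A hypothetical follow-up to such a header scan (the function name and signature below are made up for illustration): emit skeleton refcounts.dat lines with the refcount field left blank for a human to fill in.

def skeleton_entry(name, return_type, params):
    """params is a list of (ctype, argname) pairs."""
    lines = [f"{name}:{return_type}:::"]                      # return-value line
    lines += [f"{name}:{ctype}:{arg}::" for ctype, arg in params]
    return "\n".join(lines)

print(skeleton_entry("PyLong_FromLong", "PyObject*", [("long", "v")]))
# PyLong_FromLong:PyObject*:::
# PyLong_FromLong:long:v::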

@encukou (Member) commented Dec 3, 2024

What we can do, however, is generate the (unfilled) entries directly from the headers, so that the developer only has to fill in the reference count effect.

Do you want to treat it as the first step of a long project? It would be a good first step :)

@picnixz (Contributor, Author) commented Dec 3, 2024

Honestly, yes. Knowing exactly what a function does to reference counts is always hard when you don't want to read the entire code... or open the docs (which sometimes don't tell you more =/).

@encukou (Member) commented Dec 4, 2024

I commented on parts of this elsewhere, but I'll consolidate here:

Would it make sense to prefix existing comments with #, and leave field 5 to annotations like stolen?
Existing tools (if there are any) would treat the whole thing as a comment.

IMO, if we're filling this info in, it could make sense to look at each function and add more annotations to it than just stolen. Otherwise we'll need to go through the list again, looking at similar things. We don't need to always add everything, but I'd appreciate a mechanism to add several in one go.

  • Adding stolen is an improvement; IMO we should also add borrowed where we know an argument is not stolen/weird, so it's possible to know what's been checked.

  • Nullability is a useful category: NULLs OK, non-NULL (crashes/asserts), non-NULL (raises exception)

  • PyObject* (and subclasses) can be type-checked (raise TypeError) or unchecked (crashes/asserts)

  • stolen/borrowed can apply to any kind of managed resource, not just the refcounted PyObjects. For example, any Py_buffer* has an owner who must call PyBuffer_Release, and the owner can change hands. But you can't incref (=“duplicate” the reference) here. (Some of these can be “immortal”, expected to outlive the function call, e.g. a static const char* name. I hope we don't have many of those left.)

  • For a PyObject**, we might want to annotate each “star”: the whole thing can be a stolen/borrowed array or an output argument that the function fills in; if it's an array the individual items can also be stolen/borrowed.

  • For return values, we might want to indicate if the function follows the convention that NULL/-1 is returned iff an exception was raised.

  • (edit) For output arguments passed by pointer: whether the output is always (re)set, even on failure.

@picnixz (Contributor, Author) commented Dec 5, 2024

Would it make sense to prefix existing comments with #, and leave field 5 to annotations like stolen?

I think it's fine? I mean, field 5 is always a comment so whatever we put there...

Nullability is a useful category: NULLs OK, non-NULL (crashes/asserts), non-NULL (raises exception)

I don't think the file was meant to incorporate those additional considerations. For that we would probably need additional fields, making it harder for newcomers to understand the format. However, if we were to auto-generate a TOML-based file, it would definitely be better to also indicate those subtleties.

PyObject* (and subclasses) can be type-checked (raise TypeError) or unchecked (crashes/asserts)

With the current format, this would probably be too hard (it would probably require an additional field or symbol, and the format is already hard to read IMO). But with a TOML-based format, we could easily add this (e.g., checked = true/false, where a missing "checked" means "don't know").

stolen/borrowed can apply to any kind of managed resource, not just the refcounted PyObjects.

AFAIK, the file was only meant for PyObjects and not for other types. Most of the const char * parameters are borrowed, but if we were to return a const char *, it would be good to know whether the string was dynamically allocated (in which case the caller would be responsible for freeing it later).

For a PyObject**, we might want to annotate each “star”: the whole thing can be a stolen/borrowed array or an output argument that the function fills in; if it's an array the individual items can also be stolen/borrowed.

I don't really know how to make it simple with the current format. IIRC, we have either PyObject ** destination objects or PyObject *const * arguments for which we don't have a reference count effect (or maybe I marked them as 0, I don't remember). Concerning "if it's an array the individual items can also be stolen/borrowed": you're right, and I hit that kind of question when I added the vectorcall functions =/ At the least, I don't think we should continue with this file format if we want to be more precise.

For return values, we might want to indicate if the function follows the convention that NULL/-1 is returned iff an exception was raised.

Most of the time the header file says so, but for consistency (and for helping the docs), it would be a very nice addition.

@encukou (Member) commented Dec 6, 2024

IMO, there are two tasks to do:
a) Add information about the functions
b) Find a good format for the data

I believe that for a), we should be able to add all relevant data at once. Going through the list, analyzing each function, and only adding stolen is not productive.
Note that I don't think we should add all data at once -- just that there should be a place to note things down while going through the list.

And I think b) is a false blocker; in fact it would be better to pick/design a format/schema after we have a good idea of the data we want to store.
So, I propose extending refcounts.dat by adding free-form but parseable “keyword” notes in field 5, for example:

...:PyObject*:obj:0:stolen nullable type-checked(PyLongObject)  # does something weird when NULL
...:PyObject**:items:0:borrowed array
...:int:count::length-for(items)

Ugly, but human-readable; as long as you can auto-convert the relevant pieces into a stricter format, it's a good first step.
(It should be easy to hack the parser, e.g. do changes like add $ & $$, and to make that possible, I don't think we need to spend effort on making it pretty.)
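
An editorial sketch of how such field-5 notes could be parsed: space-separated keywords (optionally parenthesised) followed by an optional '#' comment. The entry below is invented purely to exercise the parser.

import re

KEYWORD = re.compile(r"[\w-]+(?:\([^)]*\))?")

def parse_line(line):
    name, ctype, arg, refcount, notes = line.split(":", 4)
    notes, _, comment = notes.partition("#")
    return {
        "function": name,
        "ctype": ctype,
        "arg": arg or None,                      # empty on the return-value line
        "refcount": int(refcount) if refcount else None,
        "keywords": KEYWORD.findall(notes),
        "comment": comment.strip() or None,
    }

print(parse_line(
    "PyFoo_Example:PyObject*:obj:0:"
    "stolen nullable type-checked(PyLongObject)  # does something weird when NULL"
))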


That said, there's one advantage to only adding stolen: it'll be much easier to convince grumpy old core devs like me to accept the change.
What's your opinion on moving this to a personal repo, and then porting changes (like stolen) from there?
As the SC says, we shouldn't do experiments in main, and I think this deserves the freedom that comes from being an experiment.

@picnixz (Contributor, Author) commented Dec 6, 2024

I believe that for a), we should be able to add all relevant data at once. Going through the list, analyzing each function, and only adding stolen is not productive.

Personally, I really prefer doing incremental work when it's about "configuration" or "informative" data. It's easier to review when you always review the same thing, and it's also easier to backport the changes IMO (you only need to check for the same thing if there is a conflict). But I can accept changing everything in one go.

So, I propose extending refcounts.dat by adding free-form but parseable “keyword” notes in field 5, for example:

I like this: a gcc-style __attribute__ approach. That works for me. I'll make a list of what to do. For now, I'll keep the linter PR, but for the future of this file, I'll continue the project on my side.

What's your opinion on moving this to a personal repo, and then porting changes (like stolen) from there?

I don't really care! I can do it the same way you did for the .. versionadded:: next tool.


Now, what I'd like to do before anything else is determine which keywords we will keep. Once we have them, it's easy to add the information that we need.

@encukou (Member) commented Dec 6, 2024

Personally, I really prefer doing incremental work when it's about "configuration" or "informative" data.

For review, sure. But if you want me to contribute to the data collection, I'd rather add more stuff at once.
(I just volunteered, didn't I.)

I can do it the same way you did for the .. versionadded:: next tool.

Please invite me to the repo :)

Now, what I'd like to do before anything else is determine which keywords we will keep.

Allow any for now, or have a list (to avoid typos) but make it easy to add a new one.
Make the determination just before filing a CPython PR, and filter out the keywords that aren't immediately useful.

@ZeroIntensity (Member):

Something that hasn't been discussed yet: once we've figured out how to maintain and manage refcounts.dat, what do we do with the data? Include it in the documentation somehow? Develop some sort of reference count linter?

@picnixz (Contributor, Author) commented Dec 6, 2024

what do we do with the data? Include it in the documentation somehow? Develop some sort of reference count linter?

It's already included in the docs in some sense, so we need to make it work with the existing Sphinx extension. As for the refcount linter, IMO that would be a separate project that uses the file on its own.

@encukou (Member) commented Dec 9, 2024

What do we do with the data?

Good question!
In addition to documentation and refcount linters, this could be used to auto-generate various wrappers:
