(obsolete) feat: add bibliography support #200

acmepjz · 2024-06-22T14:46:53Z

This PR adds bibliography support to doc-gen4. It includes:

DocGen4/Output/References.lean which contains generic function support for bibliography. It is independent of the actual implementation of bib file formatting.
DocGen4/Output/Pybtex.lean which contains an implementation of bib file formatting by calling external Python program pybtex. In the future this can be replaced by a pure Lean implementation.
Changes in DocGen4/Output/DocString.lean which analyzes citations in markdown text, inserting and modifying citation links, and collect back references.
Changes in various output files to add back references collecting feature. Currently the codes are ad-hoc; in the future they can be reimplemented by MonadState and modify.
Add a subcommand bibPrepass and change the build file accordingly.

To use it in mathlib, the references.bib in mathlib needs to be changed slightly, as it contains something unrecognized by pybtex. A PR to mathlib will be open later.

This PR has all commit history squashed. To see the full history, see the branch https://github.com/acmepjz/doc-gen4/tree/test2.

acmepjz · 2024-06-22T14:58:32Z

Screenshots:

hargoniX

First thanks for putting in the effort of setting up this infra structure and integrating it with doc-gen. It does produce good results!

That being said, there are a couple of issues (apart from the Python) that need to be addressed before this can be merged. The most pressing one being, that you are effectively doing a state monad by hand. This complicates a lot of your implementation unnecessarily. If you track the state in `HtmlM, you should be able to reduce the complexity of your implementation a lot.

Besides this and other issues/questions I've marked I would again like to bring up the question: Why should we merge this with Python as is? I can only see two reasons for allowing another language into this project:

It is currently not feasible to implement this properly otherwise. For example doing some CSS/JS is absolutely necessary. With C projects like MD4Lean it is certainly doable but not short time feasible to implement a parser and HTML generator + it relies on a C toolchain which Lean is already integrated with as of today.
The feature is so high priority that we need to have it right now and can thus throw this principle overboard.

I do not see 1. holding here, given the fact that people have already started work on a BibTex parser. And I do also not see 2. holding here, at least I personally don't see us loosing users due to the fact that the documentation tool is not able to provide proper linkification of bibliography.

Do you have compelling arguments for why either 1 or 2 hold or a potential third reason for why we should pull in the additional dependency on Python as is?

lakefile.lean

Main.lean

hargoniX · 2024-06-22T15:22:09Z

README.md

@@ -36,6 +36,8 @@ need to serve them from a proper http server for it to work. An easy way to do t
 In order to compile itself `doc-gen4` requires:
 - a Lean 4 or `elan` installation
 - a C compiler if on Linux or MacOS (on Windows it will use Lean's built-in clang compiler)
+- for `references.bib` parsing, an installation of `pybtex` (https://pybtex.org/) is needed,


This should a) give clear instruction on what to install and b) the instructions should force the user to install one particular version.

a) give clear instruction on what to install

Install the package pybtex if you're using Linux and the package manager have this; or install python/pip and run pip install pybtex. Is this OK?

b) the instructions should force the user to install one particular version.

I'm not sure about this, perhaps a version which is not too old should work, since we are only calling command line tools, not calling libraries.

If we go through with the python approach (which I'm still not seeing) I would prefer to have a clear:

pip install pybtex==x.y.z

with a clear version that we have tested and can guarantee works. Doing just pip install or using a package manager has the potential of causing behavior on other people's machines that we do not know about due to version differences.

I've updated the README, now pybtex is optional, as the built-in bibtex parser is work in progress. I hope this is OK.

DocGen4/Output/Structure.lean

DocGen4/Output/References.lean

DocGen4/Output/Module.lean

DocGen4/Output/Inductive.lean

DocGen4/Output/References.lean

DocGen4/Output.lean

DocGen4/Output/DocString.lean

eric-wieser · 2024-06-22T17:48:51Z

DocGen4/Output/Pybtex.lean

+      (basePath / "output.tmp").toString
+    ]
+  }
+  -- parse the returned XML file


Would it make sense to write straight to stdout here?

Let me study if it supports stdin/stdout.

Unfortunately, the command line tool does not support reading from stdin and writing to stdout.

eric-wieser · 2024-06-22T17:49:38Z

Maybe the python compromise here is to have doc-gen consume the json file, and let mathlib convert its bib file to json (via python) during the build?

acmepjz · 2024-06-22T19:37:46Z

Maybe the python compromise here is to have doc-gen consume the json file, and let mathlib convert its bib file to json (via python) during the build?

I think this is also OK if current mathlib build process already uses python. I can change the build script to recognize both references.json and references.bib in the docs path.

hargoniX · 2024-06-22T19:44:39Z

This just amounts to changing the feature from "use python and it's integrated with doc-gen" to "use python and it's less integrated with doc-gen". That's not acceptable, the solution needs to be properly integrated with doc-gen and should either use only Lean or if someone has a compelling argument for why it has to be pybtex Python.

acmepjz · 2024-06-22T19:49:27Z

This just amounts to changing the feature from "use python and it's integrated with doc-gen" to "use python and it's less integrated with doc-gen". That's not acceptable, the solution needs to be properly integrated with doc-gen and should either use only Lean or if someone has a compelling argument for why it has to be pybtex Python.

No, the user is not needed to use python, if they can generate references.json by other means. (In fact pybtex itself cannot generate that file either, it's obtained from postprocessing the output HTML of pybtex.) In fact the format of this file is very straightforward (see https://github.com/acmepjz/doc-gen4/blob/ef68e7450a414b38a3de50f9b3c0a5f3249cd0b9/DocGen4/Output/Base.lean#L18), it is an array of objects which consists of 4 fields:

/-- The structure representing a processed bibitem. -/
structure BibItem where
  /-- The cite key as in the bib file. -/
  citekey : String
  /-- The tag generated by bib processor, e.g. `[Doe12]`.
  Should be plain text and should not be escaped. -/
  tag : String
  /-- The HTML generated by bib processor, e.g.
  `John Doe. <i>Test</i>. 2012.`-/
  html : String
  /-- The plain text form of `html` field. Should not be escaped. -/
  plaintext : String
  deriving FromJson, ToJson

hargoniX · 2024-06-22T19:52:38Z

Yes that is the precise issue, the whole feature will be less integrated. There will be no trivial way to generate the .json file apart from what you described. Hence you are effectively forcing people to use the solution that I am trying to avoid here.

eric-wieser · 2024-06-22T21:31:24Z

Users could write the json file by hand if wanted. Agreed that's less convenient than doing it in .bib, but I think mathematical users would much prefer to be able to include references in their documentation today at the expense of either running pybtex once or adding to CI, than the status quo since lean 3 of not being able to have them at all. The compromise I suggested was intended to avoid having doc-gen itself take the brunt of the dependency management.

Either way, I think there's a lot of good stuff in this PR that can be cleaned up and reused whichever way this python thing goes, so maybe discussion here should focus on the other technical aspects and leave the Python discussion to an issue / zulip?

hargoniX

Okay I think the monadic part is almost there now, already looking much cleaner than in the first attempt!

DocGen4/Output/Base.lean

DocGen4/Output/DocString.lean

hargoniX

I think this is looking very promising now. After these couple of nitpicks the only thing that's left is to replace pybtex with the Lean Bibtex parser. Note that I did not take a look at Pybtex.lean for that reason.

After the bibtex parser has removed the batteries dependency we can just pull it in here as a dependency already and implement the feature, then merge this PR. Once the bibtex parser gets strong enough to understand mathlib's references.bib we should be able to simply upgrade it and mathlib gets the feature.

DocGen4/Output/Module.lean

DocGen4/Output/DocString.lean

DocGen4/Output/References.lean

DocGen4/Output/DocString.lean

acmepjz · 2024-06-23T16:27:06Z

... the only thing that's left is to replace pybtex with the Lean Bibtex parser. Note that I did not take a look at Pybtex.lean for that reason.

After the bibtex parser has removed the batteries dependency we can just pull it in here as a dependency already and implement the feature, then merge this PR. Once the bibtex parser gets strong enough to understand mathlib's references.bib we should be able to simply upgrade it and mathlib gets the feature.

I should say that we need a bibtex renderer, not only parser. There are several things need to be done, for example (1) generate abbreviations from author names and year, which is not trivial (how to generate abbreviations of last names? there are a lot of corner cases); (2) generate the output from bibitem, again there are a lot of corner cases, for example if there are no authors, but only editors, then use editors instead, etc. etc...

acmepjz · 2024-07-04T19:28:24Z

Oops. What's wrong with the build script? I think it has not changed since the last successful build.

hargoniX · 2024-07-04T21:15:38Z

Yes but batteries changed, I fixed this on main. If you rebase this branch on top of it it should work again.

hargoniX · 2024-07-04T21:54:11Z

You merged the branch, please do a rebase instead, I prefer to keep a linear history in doc-gen if possible (which it is here)

acmepjz · 2024-07-04T21:56:38Z

You merged the branch, please do a rebase instead, I prefer to keep a linear history in doc-gen if possible (which it is here)

Sorry for that. I think it's better for me to re-open a new PR when all of this are done; now some of the codes are still WIP.

# Conflicts: # DocGen4/Output.lean # Main.lean # lakefile.lean

hargoniX · 2024-07-08T22:03:47Z

Just to make this clear: I will not merge code that uses pybtex for the reasons outlined in previous discussions, putting effort into documenting it or exploring optimizations such as stdin/stdout does not seem to be a productive use of time for me.

hargoniX · 2024-07-08T22:14:24Z

This is not to say that this PR has generally no chance of being merged, as previously explained: All of the infrastructure that you set up here is good after the rounds of review now. However we have the capability to do BibTex things ourselves with Bibtex.lean so I will not allow in a Python dependency or workarounds that don't explicitly say but effectively force a Python dependency. If you make this work with Bibtex.lean I am going to merge the PR.

acmepjz · 2024-07-08T22:16:15Z

I'll re-open a PR without Python alternatives when the Bibtex.lean is done.

acmepjz · 2024-07-12T20:00:34Z

I think I'll close this PR. The branch will be kept as a historical reference.

feat: add bibliography support

80b96e2

hargoniX requested changes Jun 22, 2024

View reviewed changes

eric-wieser reviewed Jun 22, 2024

View reviewed changes

chore: apply suggestions

ef68e74

acmepjz added 3 commits June 23, 2024 07:07

chore: add explicit error message

7b642e0

chore: use get!

8cb53d5

chore: use StateT instead

2b0fd89

hargoniX reviewed Jun 23, 2024

View reviewed changes

DocGen4/Output/Base.lean Outdated Show resolved Hide resolved

DocGen4/Output/DocString.lean Outdated Show resolved Hide resolved

acmepjz added 2 commits June 23, 2024 23:12

chore: move monad state to HtmlM

9f6d20a

feat: add error reporting for markdown parse

bbf0592

hargoniX reviewed Jun 23, 2024

View reviewed changes

DocGen4/Output/Module.lean Show resolved Hide resolved

DocGen4/Output/DocString.lean Outdated Show resolved Hide resolved

DocGen4/Output/References.lean Outdated Show resolved Hide resolved

DocGen4/Output/DocString.lean Outdated Show resolved Hide resolved

acmepjz added 8 commits June 24, 2024 00:51

chore: apply suggestions

080736d

feat: redesign the bibPrepass command line argument

c2ee8b2

feat: add BibtexQuery as a dependency

b21398d

feat: add TeX diacritics process code

4404d73

chore: fix the error message propagation

08bf731

chore: fix diacritic character placing

41e593b

feat: bibtex name processing

a79f227

feat: remove upper case Roman numerals in last name

8006bff

acmepjz mentioned this pull request Jul 4, 2024

[Merged by Bors] - doc: fix some author fields in references leanprover-community/mathlib4#14391

Closed

acmepjz added 2 commits July 5, 2024 05:45

Merge branch 'main' into acmepjz_bib

f7564ee

chore: fix lake-manifest.json

e1e1e8d

acmepjz added 3 commits July 8, 2024 02:10

Merge branch 'main' into acmepjz_bib

2001dbb

# Conflicts: # DocGen4/Output.lean # Main.lean # lakefile.lean

fix build script

5c40c1c

bump BibtexQuery version and delete corresponding files

4770992

acmepjz changed the title ~~feat: add bibliography support~~ (draft) feat: add bibliography support Jul 7, 2024

acmepjz added 3 commits July 8, 2024 17:38

Merge branch 'main' into acmepjz_bib

40d0d6d

feat: add built-in bibtex processor

d077836

docs: update README to explain the usage of pybtex

b2a7497

acmepjz added 3 commits July 13, 2024 03:40

chore: fix temp file location

1708a36

chore: bump dependency and toolchain

0150e02

feat: complete built-in bibtex processor

6bf0bbe

acmepjz changed the title ~~(draft) feat: add bibliography support~~ (obsolete) feat: add bibliography support Jul 12, 2024

acmepjz mentioned this pull request Jul 12, 2024

feat: add bibliography support #209

Merged

acmepjz closed this Jul 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(obsolete) feat: add bibliography support #200

(obsolete) feat: add bibliography support #200

acmepjz commented Jun 22, 2024 •

edited

Loading

acmepjz commented Jun 22, 2024

hargoniX left a comment

hargoniX Jun 22, 2024

acmepjz Jun 22, 2024

hargoniX Jun 22, 2024

acmepjz Jul 8, 2024

eric-wieser Jun 22, 2024

acmepjz Jun 22, 2024

acmepjz Jul 8, 2024

eric-wieser commented Jun 22, 2024

acmepjz commented Jun 22, 2024

hargoniX commented Jun 22, 2024

acmepjz commented Jun 22, 2024

hargoniX commented Jun 22, 2024

eric-wieser commented Jun 22, 2024

hargoniX left a comment

hargoniX left a comment

acmepjz commented Jun 23, 2024

acmepjz commented Jul 4, 2024 •

edited

Loading

hargoniX commented Jul 4, 2024

hargoniX commented Jul 4, 2024

acmepjz commented Jul 4, 2024

hargoniX commented Jul 8, 2024 •

edited

Loading

hargoniX commented Jul 8, 2024

acmepjz commented Jul 8, 2024

acmepjz commented Jul 12, 2024

(obsolete) feat: add bibliography support #200

(obsolete) feat: add bibliography support #200

Conversation

acmepjz commented Jun 22, 2024 • edited Loading

acmepjz commented Jun 22, 2024

hargoniX left a comment

Choose a reason for hiding this comment

hargoniX Jun 22, 2024

Choose a reason for hiding this comment

acmepjz Jun 22, 2024

Choose a reason for hiding this comment

hargoniX Jun 22, 2024

Choose a reason for hiding this comment

acmepjz Jul 8, 2024

Choose a reason for hiding this comment

eric-wieser Jun 22, 2024

Choose a reason for hiding this comment

acmepjz Jun 22, 2024

Choose a reason for hiding this comment

acmepjz Jul 8, 2024

Choose a reason for hiding this comment

eric-wieser commented Jun 22, 2024

acmepjz commented Jun 22, 2024

hargoniX commented Jun 22, 2024

acmepjz commented Jun 22, 2024

hargoniX commented Jun 22, 2024

eric-wieser commented Jun 22, 2024

hargoniX left a comment

Choose a reason for hiding this comment

hargoniX left a comment

Choose a reason for hiding this comment

acmepjz commented Jun 23, 2024

acmepjz commented Jul 4, 2024 • edited Loading

hargoniX commented Jul 4, 2024

hargoniX commented Jul 4, 2024

acmepjz commented Jul 4, 2024

hargoniX commented Jul 8, 2024 • edited Loading

hargoniX commented Jul 8, 2024

acmepjz commented Jul 8, 2024

acmepjz commented Jul 12, 2024

acmepjz commented Jun 22, 2024 •

edited

Loading

acmepjz commented Jul 4, 2024 •

edited

Loading

hargoniX commented Jul 8, 2024 •

edited

Loading