Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(obsolete) feat: add bibliography support #200

Closed
wants to merge 26 commits into from

Conversation

acmepjz
Copy link
Contributor

@acmepjz acmepjz commented Jun 22, 2024

This PR adds bibliography support to doc-gen4. It includes:

  • DocGen4/Output/References.lean which contains generic function support for bibliography. It is independent of the actual implementation of bib file formatting.
  • DocGen4/Output/Pybtex.lean which contains an implementation of bib file formatting by calling external Python program pybtex. In the future this can be replaced by a pure Lean implementation.
  • Changes in DocGen4/Output/DocString.lean which analyzes citations in markdown text, inserting and modifying citation links, and collect back references.
  • Changes in various output files to add back references collecting feature. Currently the codes are ad-hoc; in the future they can be reimplemented by MonadState and modify.
  • Add a subcommand bibPrepass and change the build file accordingly.

To use it in mathlib, the references.bib in mathlib needs to be changed slightly, as it contains something unrecognized by pybtex. A PR to mathlib will be open later.

This PR has all commit history squashed. To see the full history, see the branch https://github.com/acmepjz/doc-gen4/tree/test2.

@acmepjz
Copy link
Contributor Author

acmepjz commented Jun 22, 2024

Screenshots:

2024-06-20-034859

2024-06-22-044049

screenshot

Copy link
Collaborator

@hargoniX hargoniX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First thanks for putting in the effort of setting up this infra structure and integrating it with doc-gen. It does produce good results!

That being said, there are a couple of issues (apart from the Python) that need to be addressed before this can be merged. The most pressing one being, that you are effectively doing a state monad by hand. This complicates a lot of your implementation unnecessarily. If you track the state in `HtmlM, you should be able to reduce the complexity of your implementation a lot.

Besides this and other issues/questions I've marked I would again like to bring up the question: Why should we merge this with Python as is? I can only see two reasons for allowing another language into this project:

  1. It is currently not feasible to implement this properly otherwise. For example doing some CSS/JS is absolutely necessary. With C projects like MD4Lean it is certainly doable but not short time feasible to implement a parser and HTML generator + it relies on a C toolchain which Lean is already integrated with as of today.
  2. The feature is so high priority that we need to have it right now and can thus throw this principle overboard.

I do not see 1. holding here, given the fact that people have already started work on a BibTex parser. And I do also not see 2. holding here, at least I personally don't see us loosing users due to the fact that the documentation tool is not able to provide proper linkification of bibliography.

Do you have compelling arguments for why either 1 or 2 hold or a potential third reason for why we should pull in the additional dependency on Python as is?

lakefile.lean Outdated Show resolved Hide resolved
Main.lean Outdated Show resolved Hide resolved
README.md Outdated
@@ -36,6 +36,8 @@ need to serve them from a proper http server for it to work. An easy way to do t
In order to compile itself `doc-gen4` requires:
- a Lean 4 or `elan` installation
- a C compiler if on Linux or MacOS (on Windows it will use Lean's built-in clang compiler)
- for `references.bib` parsing, an installation of `pybtex` (https://pybtex.org/) is needed,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should a) give clear instruction on what to install and b) the instructions should force the user to install one particular version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a) give clear instruction on what to install

Install the package pybtex if you're using Linux and the package manager have this; or install python/pip and run pip install pybtex. Is this OK?

b) the instructions should force the user to install one particular version.

I'm not sure about this, perhaps a version which is not too old should work, since we are only calling command line tools, not calling libraries.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go through with the python approach (which I'm still not seeing) I would prefer to have a clear:

pip install pybtex==x.y.z

with a clear version that we have tested and can guarantee works. Doing just pip install or using a package manager has the potential of causing behavior on other people's machines that we do not know about due to version differences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the README, now pybtex is optional, as the built-in bibtex parser is work in progress. I hope this is OK.

DocGen4/Output/Structure.lean Outdated Show resolved Hide resolved
DocGen4/Output/References.lean Outdated Show resolved Hide resolved
DocGen4/Output/Module.lean Outdated Show resolved Hide resolved
DocGen4/Output/Inductive.lean Outdated Show resolved Hide resolved
DocGen4/Output/References.lean Outdated Show resolved Hide resolved
DocGen4/Output.lean Outdated Show resolved Hide resolved
DocGen4/Output.lean Outdated Show resolved Hide resolved
DocGen4/Output/DocString.lean Outdated Show resolved Hide resolved
DocGen4/Output/DocString.lean Outdated Show resolved Hide resolved
DocGen4/Output/DocString.lean Outdated Show resolved Hide resolved
(basePath / "output.tmp").toString
]
}
-- parse the returned XML file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to write straight to stdout here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me study if it supports stdin/stdout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, the command line tool does not support reading from stdin and writing to stdout.

@eric-wieser
Copy link
Contributor

Maybe the python compromise here is to have doc-gen consume the json file, and let mathlib convert its bib file to json (via python) during the build?

@acmepjz
Copy link
Contributor Author

acmepjz commented Jun 22, 2024

Maybe the python compromise here is to have doc-gen consume the json file, and let mathlib convert its bib file to json (via python) during the build?

I think this is also OK if current mathlib build process already uses python. I can change the build script to recognize both references.json and references.bib in the docs path.

@hargoniX
Copy link
Collaborator

This just amounts to changing the feature from "use python and it's integrated with doc-gen" to "use python and it's less integrated with doc-gen". That's not acceptable, the solution needs to be properly integrated with doc-gen and should either use only Lean or if someone has a compelling argument for why it has to be pybtex Python.

@acmepjz
Copy link
Contributor Author

acmepjz commented Jun 22, 2024

This just amounts to changing the feature from "use python and it's integrated with doc-gen" to "use python and it's less integrated with doc-gen". That's not acceptable, the solution needs to be properly integrated with doc-gen and should either use only Lean or if someone has a compelling argument for why it has to be pybtex Python.

No, the user is not needed to use python, if they can generate references.json by other means. (In fact pybtex itself cannot generate that file either, it's obtained from postprocessing the output HTML of pybtex.) In fact the format of this file is very straightforward (see https://github.com/acmepjz/doc-gen4/blob/ef68e7450a414b38a3de50f9b3c0a5f3249cd0b9/DocGen4/Output/Base.lean#L18), it is an array of objects which consists of 4 fields:

/-- The structure representing a processed bibitem. -/
structure BibItem where
  /-- The cite key as in the bib file. -/
  citekey : String
  /-- The tag generated by bib processor, e.g. `[Doe12]`.
  Should be plain text and should not be escaped. -/
  tag : String
  /-- The HTML generated by bib processor, e.g.
  `John Doe. <i>Test</i>. 2012.`-/
  html : String
  /-- The plain text form of `html` field. Should not be escaped. -/
  plaintext : String
  deriving FromJson, ToJson

@hargoniX
Copy link
Collaborator

Yes that is the precise issue, the whole feature will be less integrated. There will be no trivial way to generate the .json file apart from what you described. Hence you are effectively forcing people to use the solution that I am trying to avoid here.

@eric-wieser
Copy link
Contributor

Users could write the json file by hand if wanted. Agreed that's less convenient than doing it in .bib, but I think mathematical users would much prefer to be able to include references in their documentation today at the expense of either running pybtex once or adding to CI, than the status quo since lean 3 of not being able to have them at all. The compromise I suggested was intended to avoid having doc-gen itself take the brunt of the dependency management.

Either way, I think there's a lot of good stuff in this PR that can be cleaned up and reused whichever way this python thing goes, so maybe discussion here should focus on the other technical aspects and leave the Python discussion to an issue / zulip?

Copy link
Collaborator

@hargoniX hargoniX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I think the monadic part is almost there now, already looking much cleaner than in the first attempt!

DocGen4/Output/Base.lean Outdated Show resolved Hide resolved
DocGen4/Output/DocString.lean Outdated Show resolved Hide resolved
Copy link
Collaborator

@hargoniX hargoniX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking very promising now. After these couple of nitpicks the only thing that's left is to replace pybtex with the Lean Bibtex parser. Note that I did not take a look at Pybtex.lean for that reason.

After the bibtex parser has removed the batteries dependency we can just pull it in here as a dependency already and implement the feature, then merge this PR. Once the bibtex parser gets strong enough to understand mathlib's references.bib we should be able to simply upgrade it and mathlib gets the feature.

DocGen4/Output/Module.lean Show resolved Hide resolved
DocGen4/Output/DocString.lean Outdated Show resolved Hide resolved
DocGen4/Output/References.lean Outdated Show resolved Hide resolved
DocGen4/Output/DocString.lean Outdated Show resolved Hide resolved
@acmepjz
Copy link
Contributor Author

acmepjz commented Jun 23, 2024

... the only thing that's left is to replace pybtex with the Lean Bibtex parser. Note that I did not take a look at Pybtex.lean for that reason.

After the bibtex parser has removed the batteries dependency we can just pull it in here as a dependency already and implement the feature, then merge this PR. Once the bibtex parser gets strong enough to understand mathlib's references.bib we should be able to simply upgrade it and mathlib gets the feature.

I should say that we need a bibtex renderer, not only parser. There are several things need to be done, for example (1) generate abbreviations from author names and year, which is not trivial (how to generate abbreviations of last names? there are a lot of corner cases); (2) generate the output from bibitem, again there are a lot of corner cases, for example if there are no authors, but only editors, then use editors instead, etc. etc...

@acmepjz
Copy link
Contributor Author

acmepjz commented Jul 4, 2024

Oops. What's wrong with the build script? I think it has not changed since the last successful build.

@hargoniX
Copy link
Collaborator

hargoniX commented Jul 4, 2024

Yes but batteries changed, I fixed this on main. If you rebase this branch on top of it it should work again.

@hargoniX
Copy link
Collaborator

hargoniX commented Jul 4, 2024

You merged the branch, please do a rebase instead, I prefer to keep a linear history in doc-gen if possible (which it is here)

@acmepjz
Copy link
Contributor Author

acmepjz commented Jul 4, 2024

You merged the branch, please do a rebase instead, I prefer to keep a linear history in doc-gen if possible (which it is here)

Sorry for that. I think it's better for me to re-open a new PR when all of this are done; now some of the codes are still WIP.

@acmepjz acmepjz changed the title feat: add bibliography support (draft) feat: add bibliography support Jul 7, 2024
@hargoniX
Copy link
Collaborator

hargoniX commented Jul 8, 2024

Just to make this clear: I will not merge code that uses pybtex for the reasons outlined in previous discussions, putting effort into documenting it or exploring optimizations such as stdin/stdout does not seem to be a productive use of time for me.

@hargoniX
Copy link
Collaborator

hargoniX commented Jul 8, 2024

This is not to say that this PR has generally no chance of being merged, as previously explained: All of the infrastructure that you set up here is good after the rounds of review now. However we have the capability to do BibTex things ourselves with Bibtex.lean so I will not allow in a Python dependency or workarounds that don't explicitly say but effectively force a Python dependency. If you make this work with Bibtex.lean I am going to merge the PR.

@acmepjz
Copy link
Contributor Author

acmepjz commented Jul 8, 2024

I'll re-open a PR without Python alternatives when the Bibtex.lean is done.

@acmepjz acmepjz changed the title (draft) feat: add bibliography support (obsolete) feat: add bibliography support Jul 12, 2024
@acmepjz
Copy link
Contributor Author

acmepjz commented Jul 12, 2024

I think I'll close this PR. The branch will be kept as a historical reference.

@acmepjz acmepjz closed this Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants