Add support for converting parsed tables into Markdown or other LLM-friendly formats #1417

Closed
GISStd opened this issue Jan 6, 2025 · 1 comment
Labels
enhancement New feature or request

GISStd commented Jan 6, 2025

Is your feature request related to a problem? Please describe.
Currently, MinerU only supports saving parsed tables as images. While this is useful for visual representation, it poses challenges when integrating with large language models (LLMs), as LLMs cannot directly interpret images. Users often need to extract table content into formats like Markdown, CSV, or JSON to enable better understanding and reasoning by LLMs.

Describe the solution you'd like
Provide functionality to export parsed tables into text-based formats that are easily interpretable by LLMs, such as:
Markdown Table Format: A simple, human-readable format that can be directly processed by LLMs. Example:

```markdown
| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| Value 1  | Value 2  | Value 3  |
```
Plain Text (Tab-Delimited): A minimalistic format for lightweight processing. Example:

```text
Column 1	Column 2	Column 3
Value 1	Value 2	Value 3
```
JSON: A structured format suitable for hierarchical data or API integrations. Example:

```json
[
  {"Column 1": "Value 1", "Column 2": "Value 2", "Column 3": "Value 3"}
]
```
CSV: For users who want to integrate with spreadsheet tools.
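
To make the four requested formats concrete, here is a minimal sketch that serializes one in-memory table to each of them using only the Python standard library. It assumes the table is already available as a header list plus a list of rows. The function names (`to_markdown`, `to_tsv`, `to_json`, `to_csv`) are hypothetical and are not part of MinerU's API.

```python
import csv
import io
import json


def to_markdown(header, rows):
    """Render a header and rows as a pipe-delimited Markdown table."""
    def esc(cell):
        # A literal pipe would otherwise be read as a column separator.
        return str(cell).replace("|", "\\|")

    lines = [
        "| " + " | ".join(esc(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(esc(c) for c in row) + " |")
    return "\n".join(lines)


def to_tsv(header, rows):
    """Render as tab-delimited plain text (assumes cells contain no tabs or newlines)."""
    return "\n".join("\t".join(str(c) for c in r) for r in [header, *rows])


def to_json(header, rows):
    """Render as a JSON array with one object per row, keyed by column name."""
    return json.dumps([dict(zip(header, r)) for r in rows], ensure_ascii=False)


def to_csv(header, rows):
    """Render as CSV; the csv module handles quoting of commas and quotes."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


header = ["Column 1", "Column 2", "Column 3"]
rows = [["Value 1", "Value 2", "Value 3"]]
print(to_markdown(header, rows))
```

All four functions assume a rectangular table with unique column names. Merged cells have no direct equivalent in any of these formats.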

Describe alternatives you've considered
Manually extracting text from parsed images, which is tedious and error-prone.
Using external tools to convert table images to text, which disrupts the workflow and reduces productivity.

Additional context
Add any other context or screenshots about the feature request here.

GISStd added the enhancement label on Jan 6, 2025
myhloli (Collaborator) commented Jan 22, 2025

Markdown tables cannot represent merged cells, so we currently output tables as HTML, which renders correctly in various Markdown readers. Support for merged cells and LLM compatibility are difficult to reconcile, so for now we prioritize format fidelity.
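
For readers who still want Markdown downstream, the trade-off described above can be illustrated with a small post-processing sketch that flattens an HTML table into a Markdown table. It is not part of MinerU. The class `_TableGrid` and the function `html_table_to_markdown` are hypothetical names. The sketch assumes a single, well-formed, non-nested table with numeric `colspan`/`rowspan` values. Merged cells are handled by copying the cell text into every grid position the cell covers, which is one lossy choice among several.

```python
from html.parser import HTMLParser


class _TableGrid(HTMLParser):
    """Collect table cells into a {(row, col): text} grid, expanding spans."""

    def __init__(self):
        super().__init__()
        self.cells = {}
        self._row = -1
        self._col = 0
        self._cell = None  # (text_parts, colspan, rowspan) while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            a = dict(attrs)
            colspan = int(a.get("colspan") or 1)
            rowspan = int(a.get("rowspan") or 1)
            self._cell = ([], colspan, rowspan)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, colspan, rowspan = self._cell
            text = " ".join("".join(parts).split())  # collapse whitespace
            # Skip positions already claimed by a rowspan from an earlier row.
            while (self._row, self._col) in self.cells:
                self._col += 1
            # Copy the text into every position the merged cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    self.cells[(self._row + dr, self._col + dc)] = text
            self._col += colspan
            self._cell = None


def html_table_to_markdown(html):
    """Flatten one HTML table to Markdown; the first row becomes the header."""
    parser = _TableGrid()
    parser.feed(html)
    parser.close()
    if not parser.cells:
        return ""
    n_rows = max(r for r, _ in parser.cells) + 1
    n_cols = max(c for _, c in parser.cells) + 1
    rows = [[parser.cells.get((r, c), "") for c in range(n_cols)]
            for r in range(n_rows)]

    def line(cells):
        return "| " + " | ".join(x.replace("|", "\\|") for x in cells) + " |"

    out = [line(rows[0]), line(["---"] * n_cols)]
    out += [line(r) for r in rows[1:]]
    return "\n".join(out)


html = ("<table><tr><th colspan='2'>Name</th><th>Age</th></tr>"
        "<tr><td>Ada</td><td>Lovelace</td><td>36</td></tr></table>")
print(html_table_to_markdown(html))
```

The repeated `Name` header in the output shows what is lost: the Markdown no longer records that the two columns were one merged cell. This is the fidelity concern behind keeping HTML as the primary output.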

myhloli closed this as completed on Jan 22, 2025