fix ten line token convert bug #65

Closed
wants to merge 1 commit into from

Conversation

UserWangZz
Contributor

In the convert_token method in PPOCRLabel/libs/utils.py,
the value n returned by col.split() is a str, and the token_list.append call below it formats only the first character of n.
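
A minimal sketch of the behavior described above, assuming a cell token of the form "colspan=N" (the value "colspan=10" is a hypothetical example, not taken from the repository). It shows that indexing the split result with n[0] keeps only the first digit, and that wrapping n[0] in str(int(...)) does not change that:

```python
# Hypothetical cell token; the "colspan=N" form is an assumption for illustration.
col = "colspan=10"

# str.split returns a list of str, so n is the string "10", not an int.
_, n = col.split("colspan=")

first_char_only = ' colspan="{}"'.format(n[0])        # n[0] is "1"
int_round_trip = ' colspan="{}"'.format(str(int(n[0])))  # still only the first digit
full_value = ' colspan="{}"'.format(n)                # formats the whole string "10"

print(first_char_only)  #  colspan="1"
print(int_round_trip)   #  colspan="1"
print(full_value)       #  colspan="10"
```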

@@ -233,10 +233,10 @@ def convert_token(html_list):
                 token_list.append("<td")
                 if "colspan" in col:
                     _, n = col.split("colspan=")
-                    token_list.append(' colspan="{}"'.format(n[0]))
+                    token_list.append(' colspan="{}"'.format(str(int(n[0]))))
Collaborator

Wouldn't changing n[0] to n also work?

Contributor Author

I wrote the code in a hurry this morning; it should just use n directly.

@GreatV
Collaborator

GreatV commented Sep 6, 2024

def convert_token(html_list):
    """
    Convert raw html to label format
    """
    token_list = ["<tbody>"]
    # final html list:
    for row in html_list:
        token_list.append("<tr>")
        for col in row:
            if col == None:
                continue
            elif col == "td":
                token_list.extend(["<td>", "</td>"])
            else:
                token_list.append("<td")
                if "colspan" in col:
                    _, n = col.split("colspan=")
                    token_list.append(' colspan="{}"'.format(str(int(n[0]))))
                if "rowspan" in col:
                    _, n = col.split("rowspan=")
                    token_list.append(' rowspan="{}"'.format(str(int(n[0]))))
                token_list.extend([">", "</td>"])
        token_list.append("</tr>")
    token_list.append("</tbody>")

    return token_list

input_html_list = [
    ["td", "rowspan=31"],
    ["td", "td"],
    ["td", None]
]
print(convert_token(input_html_list))

Here, both changes produce:

['<tbody>', '<tr>', '<td>', '</td>', '<td', ' rowspan="3"', '>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '<td>', '</td>', '</tr>', '<tr>', '<td>', '</td>', '</tr>', '</tbody>']
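
Note that the output above contains rowspan="3" although the input was "rowspan=31", so the str(int(n[0])) version still drops the second digit. For comparison, here is a hedged sketch of the variant suggested in the review, which formats n instead of n[0]. The name convert_token_full_span is hypothetical, and the sketch assumes each token is exactly "colspan=N" or "rowspan=N" (a token carrying both attributes would need different parsing):

```python
def convert_token_full_span(html_list):
    """Hypothetical variant of convert_token that keeps every digit of the span value."""
    token_list = ["<tbody>"]
    for row in html_list:
        token_list.append("<tr>")
        for col in row:
            if col is None:
                continue
            elif col == "td":
                token_list.extend(["<td>", "</td>"])
            else:
                token_list.append("<td")
                if "colspan" in col:
                    # Assumes col is exactly "colspan=N", so n is the full number as a str.
                    _, n = col.split("colspan=")
                    token_list.append(' colspan="{}"'.format(n))
                if "rowspan" in col:
                    # Same assumption for "rowspan=N".
                    _, n = col.split("rowspan=")
                    token_list.append(' rowspan="{}"'.format(n))
                token_list.extend([">", "</td>"])
        token_list.append("</tr>")
    token_list.append("</tbody>")
    return token_list


tokens = convert_token_full_span([["td", "rowspan=31"], ["td", "td"], ["td", None]])
print(tokens[4:6])  # ['<td', ' rowspan="31"']
```

With the same input as the reproduction above, this variant emits rowspan="31" rather than rowspan="3".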

UserWangZz closed this Sep 6, 2024