Article parsing issues #5

Open
hogking opened this issue Jun 7, 2024 · 1 comment

hogking commented Jun 7, 2024

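Hi, I ran into a few article parsing problems:
1. Some older sites put author and publish_date in a single element, so the parsed author and the parsed publish date both come back as 【发布日期:2021-06-11 作者:招生办 来源: 继续教育学院 点击:451】 (publish date: 2021-06-11, author: admissions office, source: School of Continuing Education, views: 451).
case: https://jxjy.gdou.edu.cn/info/1176/2828.htm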
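One possible workaround for the combined byline in item 1, sketched as a post-processing step on the parsed string. This is not part of the project's API: `split_meta`, the label table, and the output key names (`source`, `views`) are hypothetical, and the labels are taken only from the single example above, so other sites would likely need more entries.

```python
import re

# Assumption: label set taken from the one example in this issue.
# Maps the on-page Chinese label to a hypothetical output field name.
_LABELS = {
    "发布日期": "publish_date",
    "作者": "author",
    "来源": "source",
    "点击": "views",
}
_LABEL_ALT = "|".join(map(re.escape, _LABELS))

# A label, a half- or full-width colon, then a lazy value that stops
# right before the next "label:" pair or at the end of the string.
_PATTERN = re.compile(
    rf"({_LABEL_ALT})\s*[:：]\s*(.*?)(?=\s*(?:{_LABEL_ALT})\s*[:：]|$)"
)


def split_meta(text: str) -> dict:
    """Split a combined 'label:value label:value' byline into fields."""
    cleaned = text.strip("【】 \n")
    return {
        _LABELS[label]: value.strip()
        for label, value in _PATTERN.findall(cleaned)
    }
```

Calling `split_meta` on the string from the case above yields separate `publish_date`, `author`, `source`, and `views` entries, which could then overwrite the two fields that currently hold the whole line.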

2. The content is not extracted correctly:
case: http://sce.stu.edu.cn/show/article/1060.html

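3. If an article page contains attachment links (doc, pdf), could they be returned in an extra field?
case: https://jxjy.scau.edu.cn/2024/0514/c4910a374510/page.htm
For example, that page has 2 attachments at the bottom, which could go in a structure like this: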

{
    'files': [
        {
            'title': 'Attachment 1',
            'link': ...
        },
        {
            'title': 'Attachment 2',
            'link': ...
        }
    ]
}
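A minimal sketch of how such a `files` field might be built from the page HTML, using only the Python standard library. The names `AttachmentCollector`, `extract_files`, and `ATTACHMENT_EXTS` are hypothetical and not part of the project. The sketch assumes attachments are plain `<a href>` links whose path ends in a known extension, which may not hold for every site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: attachments are identified purely by file extension.
ATTACHMENT_EXTS = (".doc", ".docx", ".pdf")


class AttachmentCollector(HTMLParser):
    """Collect <a> links that point at doc/pdf files."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.files = []
        self._href = None   # absolute URL of the attachment link being read
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Check only the URL path so query strings do not hide the extension.
        if urlparse(href).path.lower().endswith(ATTACHMENT_EXTS):
            self._href = urljoin(self.base_url, href)
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.files.append({"title": title, "link": self._href})
            self._href = None


def extract_files(html: str, base_url: str) -> dict:
    """Return {'files': [{'title': ..., 'link': ...}, ...]} for a page."""
    parser = AttachmentCollector(base_url)
    parser.feed(html)
    return {"files": parser.files}
```

Relative links are resolved against the page URL with `urljoin`, so the returned `link` values can be downloaded directly. Links without a recognizable extension (for example download endpoints with an ID in the query string) would be missed by this approach.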

hogking commented Jun 21, 2024

I think it would be better if the article supported custom fields.
