-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better URL extraction logic #548
Comments
Hey, I'd like to send a PR. Do you think linkify-react would be a great solution for this? |
Hi @weii41392 thanks for taking a look at the issue! On the other hand, simplify fixing the regexp used in current implementation may be a more straight forward solution -- we can focus more on how to correctly spliiting the URLs rather than worry about integrating anything. |
I understand that introducing a new library for a single function would be a pain, but when I was searching for URL regex patterns, I found each of them sacrifices in some way.
P.S. At the time I was writing this comment I found that linkify also failed to handle 全形 parentheses, so I guess we can either choose some regex from the above list or implement some parser for this. |
Thanks for the investigation! The URL regex patterns looks cool. From the original request we should not include 全形 When fact checkers copy-paste URLs from the browser URL bar, the URLs should always be URL encoded (with
Suddenly I recall that we are extracting hyperlinks in rumors-api as well, because we scrap the URL content when new messages and replies are created. It seems that url-regex is used. Would you test if it satisfies the need? |
url-regex seems to be one of the regex patterns in the above list. I would lean towards using linkify and sanitizing the results (say removing closing symbols).
P.S. Add a reference to the issue to linkify: nfrasser/linkifyjs#460 |
Also, code changes to linkify() would not be an issue. (Sadly, this implementation doesn't work with fullwidth symbols...)
|
Thank you for your thorough investigation and demonstration! My previous comment was towards linkify-react. |
Tried to type the link from the facebook to LINE and Telegram. Both messengers included full-width parenthesis to the url: So, I suppose that should be the standard way to handle full-width parenthesis in url. Since also none of the regexp nor linkifyjs support them, it should be no need of handling it for us. Though half-width brackets aren't handled correctly on the webiste: The issue in linkifyjs is still open, but it's related to full-width brackets inside url, which seems not to be a standard. So what I suggest is to use linkifyjs to handle half-width brackets in correct way, and forget about full-width brackets, since nobody supports them. @MrOrz What do you think? |
Thanks for the analysis. I think we can proceed with |
Ok. I will create a PR by the end of this week |
In https://www.facebook.com/groups/cofacts/posts/3648747782023691/?__cft__[0]=AZWNmv5K_H7F-skP4SOkIgZkb_Zv2i6ot3SXeHigYKawA2MnWSlDmycGq3hfNilD_slvYWqz1M-TriCfusgM7iiguSYqfbf8hBuuDN7Jx98GrMObD8796wLbjw5EJpsyuCzpU12KXm3U_jICgBIgX7KrpyAMBof29c6JJxdT4fDSyBQGDqA4okhj5v4I9uaQN3U&__tn__=%2CO%2CP-R , Cofacts collaborators has pointed out that LINE can actually break URLs and Mandarin characters as expected.
On the contrary, currently our URL matching mechanism just matches all non-space characters following
http://
andhttps://
. We should improve this so that URLs followed by Mandarin characters don't break on the website.The text was updated successfully, but these errors were encountered: