Better URL extraction logic #548

MrOrz · 2023-09-24T14:58:48Z

In https://www.facebook.com/groups/cofacts/posts/3648747782023691/?__cft__[0]=AZWNmv5K_H7F-skP4SOkIgZkb_Zv2i6ot3SXeHigYKawA2MnWSlDmycGq3hfNilD_slvYWqz1M-TriCfusgM7iiguSYqfbf8hBuuDN7Jx98GrMObD8796wLbjw5EJpsyuCzpU12KXm3U_jICgBIgX7KrpyAMBof29c6JJxdT4fDSyBQGDqA4okhj5v4I9uaQN3U&__tn__=%2CO%2CP-R , Cofacts collaborators have pointed out that LINE correctly separates URLs from the Mandarin characters that follow them.

In contrast, our current URL matching mechanism simply matches all non-space characters following http:// and https://. We should improve this so that URLs followed by Mandarin characters don't break on the website.
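To make the problem concrete, here is a minimal sketch of a "match every non-space character after the scheme" rule. The pattern and the `naiveExtract` name are assumptions for illustration; the actual regexp in the codebase is not quoted in this thread.

```javascript
// Assumed approximation of the current behavior: take every
// non-whitespace character that follows http:// or https://.
const NAIVE_URL = /https?:\/\/\S+/g;

// Hypothetical helper: returns all naive matches found in the text.
function naiveExtract(text) {
  return text.match(NAIVE_URL) || [];
}

// Mandarin text and fullwidth punctuation are not whitespace,
// so they get swallowed into the match.
console.log(naiveExtract('請參考http://google.com。謝謝'));
```

Because `\S` accepts anything that is not whitespace, the match only ends at the next space or line break, which Mandarin text often does not provide.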

weii41392 · 2023-11-13T07:07:22Z

Hey, I'd like to send a PR. Do you think linkify-react would be a great solution for this?

MrOrz · 2023-11-13T08:07:34Z

Hi @weii41392 thanks for taking a look at the issue!
The current linkify implementation also shortens URLs in a way that still lets the user copy-paste the full URL by selecting the text. Introducing a whole library and integrating it while retaining that behavior may be quite challenging.

On the other hand, simply fixing the regexp used in the current implementation may be a more straightforward solution: we can focus on how to split the URLs correctly rather than worry about integrating anything.

weii41392 · 2023-11-13T08:57:33Z

I understand that introducing a new library for a single function would be a pain, but when I searched for URL regex patterns, I found that each of them makes a trade-off somewhere.

~~I find linkify a good choice as it was built exactly for linkifying plaintext and is implemented with a parser, which I believe can take care of context more than regex.~~

P.S. While writing this comment I found that linkify also fails to handle fullwidth (全形) parentheses, so I guess we can either choose a regex from the above list or implement a parser for this.

MrOrz · 2023-11-13T09:37:19Z

Thanks for the investigation! The URL regex patterns look cool.

According to the original request, we should not include fullwidth (全形) ） characters in the URL: https://www.facebook.com/groups/cofacts/posts/3648747782023691/

When fact checkers copy-paste URLs from the browser URL bar, the URLs should always be URL-encoded (with %XX). However, if the user copy-pastes text from another reply on Cofacts, they may copy the decoded URL, since we display all URLs in decoded format (it makes URLs more human-readable). Dropping all fullwidth characters may therefore break URLs that contain decoded fullwidth characters.
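As a small illustration of the encoded versus decoded forms, the sketch below decodes a percent-encoded URL for display. The `toDisplayUrl` name and the sample Wikipedia URL are assumptions for illustration and are not taken from the Cofacts codebase.

```javascript
// Hypothetical helper: decode a URL for display, falling back to the
// original string when it contains malformed percent sequences.
function toDisplayUrl(url) {
  try {
    return decodeURI(url);
  } catch (e) {
    // decodeURI throws URIError on invalid sequences such as a lone "%E5".
    return url;
  }
}

const encoded = 'https://zh.wikipedia.org/wiki/%E5%8F%B0%E7%81%A3';
console.log(toDisplayUrl(encoded)); // https://zh.wikipedia.org/wiki/台灣
```

The decoded form contains non-ASCII characters, which is why a rule that cuts a URL at every non-ASCII character would truncate links copied from a Cofacts reply.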

Personally, I think that when users provide URLs, they may write things like "請參考http://google.com。" or 「(請參考http://google.com)」. Therefore, maybe we can cut URLs off when we encounter punctuation (both halfwidth 半形 and fullwidth 全形).
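One way to read this suggestion is sketched below: stop the match at fullwidth punctuation, then trim trailing halfwidth punctuation. The punctuation sets and the `extractUrls` name are assumptions chosen for illustration, not an agreed list.

```javascript
// Assumed set of fullwidth punctuation that should end a URL.
const FULLWIDTH_PUNCTUATION = '。，、；：！？（）「」『』【】〈〉《》';

// Match the scheme, then everything up to whitespace or fullwidth punctuation.
const URL_PATTERN = new RegExp(`https?://[^\\s${FULLWIDTH_PUNCTUATION}]+`, 'g');

// Assumed set of halfwidth punctuation to trim from the end of a match.
const TRAILING_HALFWIDTH = /[.,;:!?)\]}>'"]+$/;

// Hypothetical helper: extract URLs and cut them at punctuation.
function extractUrls(text) {
  const matches = text.match(URL_PATTERN) || [];
  return matches.map((url) => url.replace(TRAILING_HALFWIDTH, ''));
}

console.log(extractUrls('請參考http://google.com。')); // [ 'http://google.com' ]
console.log(extractUrls('(請參考http://google.com)')); // [ 'http://google.com' ]
```

This sketch has two known trade-offs: it also trims a legitimate trailing `)` from a URL such as `http://foo.com/blah_blah_(wikipedia)_(again)`, and it truncates decoded URLs that contain one of the listed fullwidth characters.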

I just recalled that we also extract hyperlinks in rumors-api, because we scrape the URL content when new messages and replies are created. It seems that url-regex is used there. Would you test whether it satisfies our needs?

weii41392 · 2023-11-17T21:19:32Z

url-regex seems to be one of the regex patterns in the above list.
Below I test both url-regex and linkify in the presence of halfwidth and fullwidth parentheses.
In a nutshell, url-regex handles neither halfwidth nor fullwidth parentheses correctly, while linkify handles halfwidth parentheses but not fullwidth ones.
I guess this is because the halfwidth parenthesis is defined as a non-accepting symbol here while the fullwidth parenthesis isn't. Note that the same issue may also affect braces ({}), brackets ([]), and angle brackets (<>).

I would lean towards using linkify and sanitizing the results (say removing closing symbols).

import urlRegex from 'url-regex';
import { tokenize } from "linkifyjs";

const links = [
    "http://foo.com/blah_blah",
    "http://foo.com/blah_blah_(wikipedia)_(again)"
];

const texts = [
    `${links[0]} ${links[1]}`,
    `網址1(${links[0]}) 網址2(${links[1]})`,
    `網址1（${links[0]}） 網址2（${links[1]}）`,
];

for (const text of texts) {
    const urls = text.match(urlRegex()) || [];
    urls.forEach((url) => console.log(`"${url}"`));
}

// url-regex
// texts[0]: succeed without parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"

// texts[1]: fail to handle halfwidth parentheses
// "http://foo.com/blah_blah)"
// "http://foo.com/blah_blah_(wikipedia)_(again))"

// texts[2]: fail to handle fullwidth parentheses
// "http://foo.com/blah_blah）"
// "http://foo.com/blah_blah_(wikipedia)_(again)）"

for (const text of texts) {
    const tokens = tokenize(text);
    tokens.filter(token => token.isLink).forEach((token) => console.log(`"${token.v}"`));
}

// linkifyjs
// texts[0]: succeed without parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"

// texts[1]: succeed with halfwidth parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"

// texts[2]: fail to handle fullwidth parentheses
// "http://foo.com/blah_blah）"
// "http://foo.com/blah_blah_(wikipedia)_(again)）"

P.S. Add a reference to the issue to linkify: nfrasser/linkifyjs#460
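The "sanitize the results" idea could look roughly like the sketch below, which strips a trailing closing symbol only when it has no matching opener inside the URL. The `stripUnmatchedClosers` name and the list of symbol pairs are assumptions for illustration; the inputs are the strings printed in the test output above, so the sketch runs without either library.

```javascript
// Assumed closing-to-opening symbol pairs, halfwidth and fullwidth.
const PAIRS = new Map([
  [')', '('], [']', '['], ['}', '{'], ['>', '<'],
  ['）', '（'], ['］', '［'], ['｝', '｛'], ['＞', '＜'],
  ['」', '「'], ['』', '『'],
]);

// Count occurrences of a single character in a string.
function count(str, ch) {
  return str.split(ch).length - 1;
}

// Hypothetical helper: drop trailing closers that have no matching opener.
function stripUnmatchedClosers(url) {
  let result = url;
  while (result.length > 0) {
    const last = result[result.length - 1];
    const opener = PAIRS.get(last);
    if (opener === undefined) break; // last character is not a closer
    if (count(result, opener) >= count(result, last)) break; // closer is matched
    result = result.slice(0, -1);
  }
  return result;
}

console.log(stripUnmatchedClosers('http://foo.com/blah_blah）'));
console.log(stripUnmatchedClosers('http://foo.com/blah_blah_(wikipedia)_(again)）'));
```

Counting openers and closers keeps the balanced `(wikipedia)_(again)` suffix intact while still removing the stray closer contributed by the surrounding text.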

weii41392 · 2023-11-17T21:23:18Z

> Introducing a whole library and trying to integrate it while retaining the original behavior may be quite challenging.

Also, code changes to linkify() would not be an issue. (Sadly, this implementation doesn't work with fullwidth symbols...)

import { tokenize } from 'linkifyjs';

const tokenized = tokenize(str).map((token, i) =>
  token.isLink ? (
    <a key={`link${i}`} href={token.v} {...props}>
      {shortenUrl(token.v, maxLength)}
    </a>
  ) : (
    token.toString()
  )
);

MrOrz · 2023-11-18T02:15:33Z

Thank you for your thorough investigation and demonstration! My previous comment was about linkify-react. tokenize() from linkifyjs looks very promising! Let's proceed with linkifyjs 👍

Tucchhaa · 2024-04-05T06:36:18Z

I tried typing the link from the Facebook post into LINE and Telegram. Both messengers included the full-width parenthesis in the URL:

So I suppose that is the de facto standard way to handle full-width parentheses in URLs. Since neither the regexes nor linkifyjs support them, there should be no need for us to handle them either.

However, half-width brackets aren't handled correctly on the website:

The issue in linkifyjs is still open, but it concerns full-width brackets inside URLs, which do not seem to be standard.

So what I suggest is to use linkifyjs to handle half-width brackets correctly and to forget about full-width brackets, since nobody supports them. @MrOrz What do you think?

MrOrz · 2024-04-08T08:07:52Z

Thanks for the analysis. I think we can proceed with linkifyjs 👍

Tucchhaa · 2024-04-09T04:46:45Z

Ok. I will create a PR by the end of this week

MrOrz added the enhancement and good first issue labels on Sep 24, 2023

Tucchhaa mentioned this issue Apr 14, 2024

Use linkifyjs for better url extraction lgic #569

Merged

MrOrz linked a pull request Apr 14, 2024 that will close this issue

Use linkifyjs for better url extraction lgic #569

Merged

MrOrz closed this as completed in #569 Apr 14, 2024
