feat: fullwidth characters support #460

weii41392 · 2023-11-17T22:05:29Z

Currently the parser recognizes opening and closing parentheses and excludes a trailing closing parenthesis from a link when appropriate, but it does not do the same for fullwidth characters. See this example:

import { tokenize } from "linkifyjs";

const links = [
    "http://foo.com/blah_blah",
    "http://foo.com/blah_blah_(wikipedia)_(again)"
];

const texts = [
    `${links[0]} ${links[1]}`,
    `Link 1(${links[0]}) Link 2(${links[1]})`,      // halfwidth parentheses
    `Link 1（${links[0]}） Link 2（${links[1]}）`,   // fullwidth parentheses
];

for (const text of texts) {
    const tokens = tokenize(text);
    tokens.filter(token => token.isLink).forEach((token) => console.log(`"${token.v}"`));
}

// texts[0]: succeed without parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"

// texts[1]: succeed with halfwidth parentheses
// "http://foo.com/blah_blah"
// "http://foo.com/blah_blah_(wikipedia)_(again)"

// texts[2]: fail to handle fullwidth parentheses
// "http://foo.com/blah_blah）"
// "http://foo.com/blah_blah_(wikipedia)_(again)）"

My proposal is to define fullwidth characters as tokens and add the corresponding behavior to the parser.
The logic should be fairly simple, since fullwidth brackets are semantically the same as their halfwidth counterparts.
(In our use case we care most about fullwidth parentheses （）, but in general this applies to other fullwidth brackets as well, e.g. 「」『』＜＞.)
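To make the proposal concrete, here is a minimal standalone sketch of the intended behavior. It is not linkifyjs's parser code: `BRACKET_PAIRS` and `trimUnmatchedClosers` are hypothetical names used only for illustration. The sketch treats fullwidth brackets exactly like halfwidth ones and strips a trailing closing bracket only when it has no matching opener inside the URL.

```javascript
// Hypothetical lookup table (not part of linkifyjs): closing bracket -> opening bracket.
const BRACKET_PAIRS = new Map([
    [")", "("],
    ["）", "（"],
    ["」", "「"],
    ["』", "『"],
    ["＞", "＜"],
]);

// Strip trailing closing brackets that have no matching opener inside the URL.
function trimUnmatchedClosers(url) {
    let result = url;
    while (result.length > 0) {
        const last = result[result.length - 1];
        const opener = BRACKET_PAIRS.get(last);
        if (opener === undefined) break; // last character is not a closing bracket

        // Count occurrences of the opener and the closer in the remaining text.
        const opens = result.split(opener).length - 1;
        const closes = result.split(last).length - 1;
        if (closes <= opens) break; // the trailing bracket is balanced, so keep it

        result = result.slice(0, -1); // unmatched closer: drop it and check again
    }
    return result;
}

console.log(trimUnmatchedClosers("http://foo.com/blah_blah）"));
console.log(trimUnmatchedClosers("http://foo.com/blah_blah_(wikipedia)_(again)）"));
console.log(trimUnmatchedClosers("http://foo.com/blah_blah（123）"));
```

In the real parser this would presumably be expressed as token types and state transitions rather than string post-processing; the sketch only shows the pairing rule.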


nfrasser · 2023-11-22T03:46:19Z

@weii41392 thanks for the report and the fix! This has been released in the latest linkifyjs v4.1.3

weii41392 · 2023-11-22T05:46:17Z

> @weii41392 thanks for the report and the fix! This has been released in the latest linkifyjs v4.1.3

Thank you @nfrasser! With further testing, however, we found a case where the current logic doesn't work as expected. Is this intended, or can we modify this behavior as well?

Works as expected

http://foo.com/blah_blah
- Result: http://foo.com/blah_blah
http://foo.com/blah_blah）
- Result: http://foo.com/blah_blah）
http://foo.com/blah_blah（123）
- Result: http://foo.com/blah_blah（123）
http://foo.com/blah_blah） withWhitespace
- Result: http://foo.com/blah_blah） withWhitespace
- Note: There is whitespace between ） and withWhitespace.

Does not work as expected

http://foo.com/blah_blah）withoutWhitespace
- Result: http://foo.com/blah_blah）withoutWhitespace
- Expected: http://foo.com/blah_blah (the link should end before ）)
- Note: There is no whitespace between ） and withoutWhitespace.

Unlike English, Chinese does not put whitespace after a closing parenthesis (at least in formal writing). That's why http://foo.com/blah_blah） withWhitespace works under the English convention but http://foo.com/blah_blah）withoutWhitespace doesn't work for Chinese text.
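One possible reading of the behavior requested here, sketched as standalone code: end the link at the first fullwidth closing bracket that has no matching opener earlier in the URL, even when more text follows without whitespace. `FULLWIDTH_PAIRS` and `cutAtUnmatchedFullwidthCloser` are hypothetical names, and this is an assumption about the desired rule, not how linkifyjs is implemented.

```javascript
// Hypothetical lookup table (not part of linkifyjs): fullwidth closer -> opener.
const FULLWIDTH_PAIRS = new Map([
    ["）", "（"],
    ["」", "「"],
    ["』", "『"],
    ["＞", "＜"],
]);
const FULLWIDTH_OPENERS = new Set(FULLWIDTH_PAIRS.values());

// Return the part of `candidate` before the first unmatched fullwidth closer,
// or the whole string if every fullwidth closer has a matching opener.
function cutAtUnmatchedFullwidthCloser(candidate) {
    const openCount = new Map(); // opener -> number of currently open brackets
    for (let i = 0; i < candidate.length; i++) {
        const ch = candidate[i];
        if (FULLWIDTH_OPENERS.has(ch)) {
            openCount.set(ch, (openCount.get(ch) ?? 0) + 1);
        } else if (FULLWIDTH_PAIRS.has(ch)) {
            const opener = FULLWIDTH_PAIRS.get(ch);
            const open = openCount.get(opener) ?? 0;
            if (open === 0) return candidate.slice(0, i); // unmatched closer ends the link
            openCount.set(opener, open - 1);
        }
    }
    return candidate;
}

console.log(cutAtUnmatchedFullwidthCloser("http://foo.com/blah_blah）withoutWhitespace"));
console.log(cutAtUnmatchedFullwidthCloser("http://foo.com/blah_blah（123）"));
```

Halfwidth parentheses are deliberately left out of this rule, since the English convention of a following space already covers them.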

This was referenced Nov 17, 2023

Better URL extraction logic cofacts/rumors-site#548

Closed

Add support for fullwidth parentheses #461

Merged

nfrasser closed this as completed in #461 Nov 22, 2023

nfrasser mentioned this issue Nov 22, 2023

Bracket parsing refactor and support for 「」『』＜＞ brackets #463

Merged

nfrasser reopened this Nov 22, 2023

nfrasser self-assigned this Dec 4, 2024

nfrasser added the "parsing" (Related to string parsing) label Dec 4, 2024
