-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HPCC-31457 Use PCRE2 for native UTF-8 regex #18543
HPCC-31457 Use PCRE2 for native UTF-8 regex #18543
Conversation
3043696
to
a805e2b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
There are some regression failures (on all engines):
Moreover the ECLCC terminated with signal SIGSEGV, Segmentation fault and generated core files during to compile regexfindset.ecl. |
04a3258
to
4ca8f20
Compare
@dcamper currently failing a regression test |
2b64b1e
to
70429ef
Compare
@ghalliday ready for a recheck when you have time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dcamper generally looks good. Some of this code is >15 years old. I wouldn't code it in the same way today - but copying the style of the existing code makes sense, so I haven't commented in general when it could have been cleaned up.
A few fairly minor comments, but looks like it is close to ready.
ecl/hql/hqlgram2.cpp
Outdated
ITypeInfo *t1 = a.queryExprType(); | ||
if (t1 && !isUTF8Type(t1)) | ||
{ | ||
if (isStringType(t1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: is StringType() || isUnicodeType() ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch; changed.
isUnicodeType() tests for UTF-8 and there seems to be a lot of code that leverages the implied "UTF-8 is a subset of Unicode" logic. This is debatable, but not something to fix here.
ecl/hql/hqlutil.cpp
Outdated
const size32_t numUChars = *((size32_t *) presult); | ||
presult += sizeof(size32_t); | ||
results.append(*createConstant(createUtf8Value((unsigned)numUChars, (const char*)presult, makeUtf8Type(numUChars, NULL)))); | ||
presult += numUChars; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be.
presult += rtlUtf8Size(numUChars, presult)
Is there a test case where the set returns multi-byte utf-8 characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, added test cases for multibyte Unicode and UTF-8 REGEXFINDSET. This uncovered some issues in both folded and unfolded code, cascading some further changes that will be in the next commit.
ecl/hql/hqlutil.cpp
Outdated
if (isUnicodeType(expr->queryType())) | ||
if (isUTF8Type(expr->queryType())) | ||
{ | ||
ICompiledStrRegExpr * compiled = rtlCreateCompiledU8StrRegExpr(rtlUtf8Size(value->getSize(), value->queryValue()), (const char *)value->queryValue(), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getSize() is the size of the string, so no need to call rtlUtf8Size again. I think possibly this should be a call to rtlUtf8Length(). Alternatively value->queryType()->getStringLen().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
rtl/eclrtl/eclregex.cpp
Outdated
int errNum = 0; | ||
PCRE2_SIZE errOffset; | ||
uint32_t options = ((_isCaseSensitive ? 0 : PCRE2_CASELESS) | (_enableUTF8 ? PCRE2_UTF : 0)); | ||
size32_t regexLength = (isUTF8Enabled ? rtlUtf8Size(_regexLength, _regex) : _regexLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regexSize a better variable name for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed; changed.
rtl/eclrtl/eclregex.cpp
Outdated
|
||
// Update offset | ||
offset += matchLen + 1; | ||
offset += matchSize + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not new code, but is this +1 correct? E.g. it suggests that matching "ab" against "abab" would only get the first match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better yet:
offset = ovector[1];
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good . Please squash and I will merge.
4841746
to
1a033e1
Compare
@ghalliday squashed, please merge. Thanks! |
Type of change:
Checklist:
Smoketest:
Testing: