HPCC-31457 Use PCRE2 for native UTF-8 regex #18543

Merged: 1 commit, May 2, 2024

Contributor

@dcamper dcamper commented Apr 16, 2024

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@dcamper dcamper requested a review from ghalliday April 16, 2024 20:05
@dcamper dcamper marked this pull request as ready for review April 16, 2024 20:05
@dcamper dcamper force-pushed the hpcc-31457-pcre2-for-utf8 branch from 3043696 to a805e2b Compare April 16, 2024 20:06
@dcamper dcamper changed the title WIP: Use PCRE2 for native UTF-8 regex HPCC-31457 Use PCRE2 for native UTF-8 regex Apr 16, 2024
@dcamper dcamper requested a review from jackdelv April 17, 2024 11:47
Contributor

@jackdelv jackdelv left a comment

Looks good.

@AttilaVamos
Contributor

There are some regression failures (on all engines):

599. Test: regex2.ecl
    eclcc: W20240417-140558-4.cpp(399,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(405,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(411,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(417,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(423,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(440,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(473,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclccserver: (0,0): Error C3000: Compile/Link failed for W20240417-140558-4 (see W20240417-140558-4.cc.log for details)
    8 error(s), 7 warning(s)
    
    600. Test: regex3.ecl
    eclcc: W20240417-140559.cpp(188,62): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140559_1.cpp(78,49): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140559_1.cpp(90,49): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclccserver: (0,0): Error C3000: Compile/Link failed for W20240417-140559 (see W20240417-140559.cc.log for details)
    4 error(s), 3 warning(s)
    
    603. Test: regexfindset.ecl
    eclccserver: (0,0): Error C9999: eclcc killed - likely to be out of memory - see compile log for details
    1 error(s), 0 warning(s)
W20240417-142609-5.cpp: In member function ‘virtual int MyEclProcess::perform(IGlobalCodeContext*, unsigned int)’:
W20240417-142609-5.cpp:188:62: error: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
  188 |                                 regexI->replace(vH,vG.refustr(),22U,((char *)"the cat sat on the mat"),3U,((char *)"$1p"));
      |                                                    ~~~~~~~~~~^~
      |                                                              |
      |                                                              UChar* {aka short unsigned int*}
In file included from /opt/HPCCSystems/componentfiles/cl/include/eclinclude4.hpp:66,
                 from W20240417-142609-5.cpp:6:
/opt/HPCCSystems/componentfiles/cl/include/eclrtl.hpp:89:54: note:   initializing argument 2 of ‘virtual void ICompiledStrRegExpr::replace(size32_t&, char*&, size32_t, const char*, size32_t, const char*) const’
   89 |     virtual void replace(size32_t & outlen, char * & out, size32_t slen, char const * str, size32_t rlen, char const * replace) const = 0;
      |                                             ~~~~~~~~~^~~
W20240417-142609-5_1.cpp: In member function ‘virtual size32_t cAc3::transform(ARowBuilder&, const void*)’:
W20240417-142609-5_1.cpp:78:49: error: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
   78 |                 regexG1->replace(vI1,vH1.refustr(),*((unsigned *)(left + 30U)),(char *)(left + 34U),1U,((char *)"\303\253"));
      |                                      ~~~~~~~~~~~^~
      |                                                 |
      |                                                 UChar* {aka short unsigned int*}
In file included from /opt/HPCCSystems/componentfiles/cl/include/eclinclude4.hpp:66,
                 from W20240417-142609-5_1.cpp:6:
/opt/HPCCSystems/componentfiles/cl/include/eclrtl.hpp:89:54: note:   initializing argument 2 of ‘virtual void ICompiledStrRegExpr::replace(size32_t&, char*&, size32_t, const char*, size32_t, const char*) const’
   89 |     virtual void replace(size32_t & outlen, char * & out, size32_t slen, char const * str, size32_t rlen, char const * replace) const = 0;
      |                                             ~~~~~~~~~^~~
W20240417-142609-5_1.cpp:90:49: error: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
   90 |                 regexM1->replace(vO1,vN1.refustr(),*((unsigned *)(left + 30U)),(char *)(left + 34U),1U,((char *)"\303\253"));
      |                                      ~~~~~~~~~~~^~
      |                                                 |
      |                                                 UChar* {aka short unsigned int*}
In file included from /opt/HPCCSystems/componentfiles/cl/include/eclinclude4.hpp:66,
                 from W20240417-142609-5_1.cpp:6:
/opt/HPCCSystems/componentfiles/cl/include/eclrtl.hpp:89:54: note:   initializing argument 2 of ‘virtual void ICompiledStrRegExpr::replace(size32_t&, char*&, size32_t, const char*, size32_t, const char*) const’
   89 |     virtual void replace(size32_t & outlen, char * & out, size32_t slen, char const * str, size32_t rlen, char const * replace) const = 0;
      |                                             ~~~~~~~~~^~~

Moreover, eclcc terminated with signal SIGSEGV (segmentation fault) and generated core files while compiling regexfindset.ecl.

@dcamper dcamper force-pushed the hpcc-31457-pcre2-for-utf8 branch from 04a3258 to 4ca8f20 Compare April 19, 2024 17:39
@ghalliday
Member

@dcamper this is currently failing a regression test.

@dcamper dcamper force-pushed the hpcc-31457-pcre2-for-utf8 branch from 2b64b1e to 70429ef Compare April 23, 2024 13:21
@dcamper
Contributor Author

dcamper commented Apr 29, 2024

@ghalliday ready for a recheck when you have time.

Member

@ghalliday ghalliday left a comment

@dcamper generally looks good. Some of this code is >15 years old. I wouldn't code it in the same way today - but copying the style of the existing code makes sense, so I haven't commented in general when it could have been cleaned up.

A few fairly minor comments, but looks like it is close to ready.

ITypeInfo *t1 = a.queryExprType();
if (t1 && !isUTF8Type(t1))
{
if (isStringType(t1))
Member

minor: isStringType() || isUnicodeType()?

Contributor Author

Good catch; changed.

isUnicodeType() also tests for UTF-8, and there seems to be a lot of code that relies on the implied "UTF-8 is a subset of Unicode" logic. This is debatable, but not something to fix here.

const size32_t numUChars = *((size32_t *) presult);
presult += sizeof(size32_t);
results.append(*createConstant(createUtf8Value((unsigned)numUChars, (const char*)presult, makeUtf8Type(numUChars, NULL))));
presult += numUChars;
Member

I think this should be:

presult += rtlUtf8Size(numUChars, presult)

Is there a test case where the set returns multi-byte utf-8 characters?
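
For readers following the thread: the issue is that `numUChars` counts characters, while `presult` is a byte pointer, so the two only agree for pure ASCII results. Below is a minimal sketch of the kind of calculation a helper like `rtlUtf8Size()` performs. The function name `utf8ByteSize` is hypothetical, this is not the HPCC runtime implementation, and it assumes well-formed UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a helper such as rtlUtf8Size(): returns the number
// of bytes occupied by the first numChars code points starting at data.
// Assumes the input is well-formed UTF-8 (no validation is done).
std::size_t utf8ByteSize(std::size_t numChars, const char *data)
{
    assert(data != nullptr || numChars == 0);
    std::size_t offset = 0;
    for (std::size_t i = 0; i < numChars; ++i)
    {
        const unsigned char lead = static_cast<unsigned char>(data[offset]);
        if (lead < 0x80)
            offset += 1; // 0xxxxxxx: single-byte (ASCII)
        else if ((lead & 0xE0) == 0xC0)
            offset += 2; // 110xxxxx: two-byte sequence
        else if ((lead & 0xF0) == 0xE0)
            offset += 3; // 1110xxxx: three-byte sequence
        else
            offset += 4; // 11110xxx: four-byte sequence
    }
    return offset;
}
```

With a result such as "ë" (the bytes `\303\253` seen in the regression log above), the character count is 1 but the pointer has to advance by 2 bytes, which is why advancing by `numUChars` alone would misread the next set element.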

Contributor Author

Fixed.

Contributor Author

Also, added test cases for multibyte Unicode and UTF-8 REGEXFINDSET. This uncovered some issues in both folded and unfolded code, cascading some further changes that will be in the next commit.

if (isUnicodeType(expr->queryType()))
if (isUTF8Type(expr->queryType()))
{
ICompiledStrRegExpr * compiled = rtlCreateCompiledU8StrRegExpr(rtlUtf8Size(value->getSize(), value->queryValue()), (const char *)value->queryValue(), false);
Member

getSize() is the size of the string, so no need to call rtlUtf8Size again. I think possibly this should be a call to rtlUtf8Length(). Alternatively value->queryType()->getStringLen().
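
To illustrate the distinction being drawn here: a byte size has to be converted into a character count, not the other way around. The sketch below shows one way such a conversion can work. `utf8Length` is a hypothetical name standing in for a helper like `rtlUtf8Length()`; the real function's signature and behaviour are not shown in this thread, and the sketch assumes well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a helper such as rtlUtf8Length(): returns the
// number of code points contained in the first sizeInBytes bytes at data.
// Assumes the input is well-formed UTF-8.
std::size_t utf8Length(std::size_t sizeInBytes, const char *data)
{
    assert(data != nullptr || sizeInBytes == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < sizeInBytes; ++i)
    {
        // Continuation bytes look like 10xxxxxx; every other byte starts a
        // new code point, so counting non-continuation bytes counts characters.
        if ((static_cast<unsigned char>(data[i]) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Passing a byte size to a function that expects a character count (as the original line did) would walk past the end of the buffer for any multi-byte input, since it would try to consume that many characters rather than that many bytes.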

Contributor Author

Fixed.

int errNum = 0;
PCRE2_SIZE errOffset;
uint32_t options = ((_isCaseSensitive ? 0 : PCRE2_CASELESS) | (_enableUTF8 ? PCRE2_UTF : 0));
size32_t regexLength = (isUTF8Enabled ? rtlUtf8Size(_regexLength, _regex) : _regexLength);
Member

regexSize would be a better variable name, for consistency.

Contributor Author

Agreed; changed.


  // Update offset
- offset += matchLen + 1;
+ offset += matchSize + 1;
Member

Not new code, but is this +1 correct? E.g. it suggests that matching "ab" against "abab" would only get the first match.

Contributor Author

Better yet:

offset = ovector[1];
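
A small sketch of why resuming at the end of the previous match matters. It uses `std::regex` purely as a stand-in so the example is self-contained; it is not the PCRE2-based code from this PR, and `findAll` is a hypothetical name. The `matchEnd` value plays the role that `ovector[1]` plays in PCRE2 (the offset just past the match).

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Illustrative only: collects every non-overlapping match of pattern in
// subject, resuming each search exactly where the previous match ended.
std::vector<std::string> findAll(const std::string &subject, const std::string &pattern)
{
    const std::regex re(pattern);
    std::vector<std::string> matches;
    std::size_t offset = 0;
    while (offset <= subject.size())
    {
        std::smatch m;
        if (!std::regex_search(subject.cbegin() + offset, subject.cend(), m, re))
            break;
        matches.push_back(m.str(0));
        // m.position(0) is relative to the start of this search, so add offset
        // to get the absolute end of the match within subject.
        const std::size_t matchLen = static_cast<std::size_t>(m.length(0));
        const std::size_t matchEnd = offset + static_cast<std::size_t>(m.position(0)) + matchLen;
        // Resume at the match end; step one further only after an empty match,
        // otherwise the loop would never advance.
        offset = (matchLen == 0) ? matchEnd + 1 : matchEnd;
    }
    return matches;
}
```

For the reviewer's example, `findAll("abab", "ab")` finds both occurrences, whereas skipping one extra position after the first match would resume at the final "b" and miss the second. One caveat of this simplification: searching a suffix of the subject changes how anchors and lookbehind see the text, which a start-offset API over the full subject avoids.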

@dcamper dcamper requested a review from ghalliday May 1, 2024 14:15
Member

@ghalliday ghalliday left a comment

Looks good. Please squash and I will merge.

@dcamper dcamper force-pushed the hpcc-31457-pcre2-for-utf8 branch from 4841746 to 1a033e1 Compare May 2, 2024 11:59
@dcamper
Contributor Author

dcamper commented May 2, 2024

@ghalliday squashed, please merge. Thanks!

@ghalliday ghalliday merged commit 73bfe4c into hpcc-systems:master May 2, 2024
48 of 49 checks passed