HPCC-31457 Use PCRE2 for native UTF-8 regex #18543

dcamper · 2024-04-16T15:50:03Z

Type of change:

This change is a bug fix (non-breaking change which fixes an issue).
This change is a new feature (non-breaking change which adds functionality).
This change improves the code (refactor or other change that does not change the functionality)
This change fixes warnings (the fix does not alter the functionality or the generated code)
This change is a breaking change (fix or feature that will cause existing behavior to change).
This change alters the query API (existing queries will have to be recompiled)

Checklist:

Smoketest:

Send notifications about my Pull Request position in Smoketest queue.
Test my draft Pull Request.

Testing:

jackdelv

Looks good.

AttilaVamos · 2024-04-17T16:14:20Z

There are some regression failures (on all engines):

599. Test: regex2.ecl
    eclcc: W20240417-140558-4.cpp(399,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(405,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(411,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(417,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(423,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(440,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140558-4.cpp(473,64): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclccserver: (0,0): Error C3000: Compile/Link failed for W20240417-140558-4 (see W20240417-140558-4.cc.log for details)
    8 error(s), 7 warning(s)

600. Test: regex3.ecl
    eclcc: W20240417-140559.cpp(188,62): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140559_1.cpp(78,49): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclcc: W20240417-140559_1.cpp(90,49): Error C6003: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
    eclccserver: (0,0): Error C3000: Compile/Link failed for W20240417-140559 (see W20240417-140559.cc.log for details)
    4 error(s), 3 warning(s)

603. Test: regexfindset.ecl
    eclccserver: (0,0): Error C9999: eclcc killed - likely to be out of memory - see compile log for details
    1 error(s), 0 warning(s)

W20240417-142609-5.cpp: In member function ‘virtual int MyEclProcess::perform(IGlobalCodeContext*, unsigned int)’:
W20240417-142609-5.cpp:188:62: error: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
  188 |                                 regexI->replace(vH,vG.refustr(),22U,((char *)"the cat sat on the mat"),3U,((char *)"$1p"));
      |                                                    ~~~~~~~~~~^~
      |                                                              |
      |                                                              UChar* {aka short unsigned int*}
In file included from /opt/HPCCSystems/componentfiles/cl/include/eclinclude4.hpp:66,
                 from W20240417-142609-5.cpp:6:
/opt/HPCCSystems/componentfiles/cl/include/eclrtl.hpp:89:54: note:   initializing argument 2 of ‘virtual void ICompiledStrRegExpr::replace(size32_t&, char*&, size32_t, const char*, size32_t, const char*) const’
   89 |     virtual void replace(size32_t & outlen, char * & out, size32_t slen, char const * str, size32_t rlen, char const * replace) const = 0;
      |                                             ~~~~~~~~~^~~
W20240417-142609-5_1.cpp: In member function ‘virtual size32_t cAc3::transform(ARowBuilder&, const void*)’:
W20240417-142609-5_1.cpp:78:49: error: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
   78 |                 regexG1->replace(vI1,vH1.refustr(),*((unsigned *)(left + 30U)),(char *)(left + 34U),1U,((char *)"\303\253"));
      |                                      ~~~~~~~~~~~^~
      |                                                 |
      |                                                 UChar* {aka short unsigned int*}
In file included from /opt/HPCCSystems/componentfiles/cl/include/eclinclude4.hpp:66,
                 from W20240417-142609-5_1.cpp:6:
/opt/HPCCSystems/componentfiles/cl/include/eclrtl.hpp:89:54: note:   initializing argument 2 of ‘virtual void ICompiledStrRegExpr::replace(size32_t&, char*&, size32_t, const char*, size32_t, const char*) const’
   89 |     virtual void replace(size32_t & outlen, char * & out, size32_t slen, char const * str, size32_t rlen, char const * replace) const = 0;
      |                                             ~~~~~~~~~^~~
W20240417-142609-5_1.cpp:90:49: error: cannot convert ‘UChar*’ {aka ‘short unsigned int*’} to ‘char*&’
   90 |                 regexM1->replace(vO1,vN1.refustr(),*((unsigned *)(left + 30U)),(char *)(left + 34U),1U,((char *)"\303\253"));
      |                                      ~~~~~~~~~~~^~
      |                                                 |
      |                                                 UChar* {aka short unsigned int*}
In file included from /opt/HPCCSystems/componentfiles/cl/include/eclinclude4.hpp:66,
                 from W20240417-142609-5_1.cpp:6:
/opt/HPCCSystems/componentfiles/cl/include/eclrtl.hpp:89:54: note:   initializing argument 2 of ‘virtual void ICompiledStrRegExpr::replace(size32_t&, char*&, size32_t, const char*, size32_t, const char*) const’
   89 |     virtual void replace(size32_t & outlen, char * & out, size32_t slen, char const * str, size32_t rlen, char const * replace) const = 0;
      |                                             ~~~~~~~~~^~~

Moreover, eclcc terminated with signal SIGSEGV (segmentation fault) and generated core files while compiling regexfindset.ecl.
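For anyone reading the log above: the errors come from C++ reference binding rather than from the regex logic. A `char*&` out parameter can only bind to an lvalue of type `char*`, so passing a `UChar*` result holder is rejected. Here is a minimal sketch of the rule. `replaceNarrow` and `callWithMatchingBuffer` are hypothetical names used only for illustration, and `UChar` is modelled as a plain 16-bit type rather than the real ICU definition.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in: in the real headers UChar comes from ICU.
using UChar = unsigned short;

// Same shape as the 'char * & out' parameter quoted in the compiler note:
// the callee allocates a buffer and hands it back through the reference.
void replaceNarrow(std::uint32_t &outLen, char *&out, const char *src, std::uint32_t srcLen)
{
    out = static_cast<char *>(std::malloc(srcLen));
    std::memcpy(out, src, srcLen);
    outLen = srcLen;
}

// Calls replaceNarrow with a matching buffer type and returns the byte count.
std::uint32_t callWithMatchingBuffer()
{
    char *narrow = nullptr;
    std::uint32_t len = 0;
    replaceNarrow(len, narrow, "the cat", 7);   // OK: a char* lvalue binds to char*&
    std::free(narrow);

    // The failing pattern from the log, left commented out so the sketch compiles:
    // UChar *wide = nullptr;
    // replaceNarrow(len, wide, "the cat", 7);  // error: cannot convert 'UChar*' to 'char*&'
    return len;
}
```

The sketch only illustrates the language rule. It does not show how the generated code was changed in this PR.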

ghalliday · 2024-04-22T15:59:07Z

@dcamper currently failing a regression test

dcamper · 2024-04-29T12:17:33Z

@ghalliday ready for a recheck when you have time.

ghalliday

@dcamper generally looks good. Some of this code is >15 years old. I wouldn't code it in the same way today - but copying the style of the existing code makes sense, so I haven't commented in general when it could have been cleaned up.

A few fairly minor comments, but looks like it is close to ready.

ghalliday · 2024-05-01T11:02:25Z

ecl/hql/hqlgram2.cpp

+    ITypeInfo *t1 = a.queryExprType();
+    if (t1 && !isUTF8Type(t1))
+    {
+        if (isStringType(t1))


minor: isStringType() || isUnicodeType()?

Good catch; changed.

isUnicodeType() tests for UTF-8 and there seems to be a lot of code that leverages the implied "UTF-8 is a subset of Unicode" logic. This is debatable, but not something to fix here.

ghalliday · 2024-05-01T11:03:53Z

ecl/hql/hqlutil.cpp

+                const size32_t numUChars = *((size32_t *) presult);
+                presult += sizeof(size32_t);
+                results.append(*createConstant(createUtf8Value((unsigned)numUChars, (const char*)presult, makeUtf8Type(numUChars, NULL))));
+                presult += numUChars;


I think this should be:

presult += rtlUtf8Size(numUChars, presult)

Is there a test case where the set returns multi-byte utf-8 characters?

Also, added test cases for multibyte Unicode and UTF-8 REGEXFINDSET. This uncovered some issues in both folded and unfolded code, cascading some further changes that will be in the next commit.
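To illustrate the point about advancing by byte size rather than character count, here is a minimal sketch that walks a packed buffer of UTF-8 elements. It assumes each element is a 32-bit character count followed by the raw bytes, as in the snippet above. `utf8SizeOf` is a hypothetical stand-in for `rtlUtf8Size`, and `appendElement` and `unpack` are illustrative helpers, not functions from the code base.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for rtlUtf8Size: number of bytes occupied by
// numChars UTF-8 characters starting at data. Assumes well-formed UTF-8.
std::size_t utf8SizeOf(std::size_t numChars, const char *data)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < numChars; i++)
    {
        unsigned char lead = static_cast<unsigned char>(data[bytes]);
        if (lead < 0x80)
            bytes += 1;
        else if ((lead & 0xE0) == 0xC0)
            bytes += 2;
        else if ((lead & 0xF0) == 0xE0)
            bytes += 3;
        else
            bytes += 4;
    }
    return bytes;
}

// Append one element: a 32-bit character count followed by the raw bytes.
void appendElement(std::vector<char> &buf, std::uint32_t numChars, const std::string &bytes)
{
    const char *p = reinterpret_cast<const char *>(&numChars);
    buf.insert(buf.end(), p, p + sizeof(numChars));
    buf.insert(buf.end(), bytes.begin(), bytes.end());
}

// Walk the packed buffer and return each element as a byte string.
std::vector<std::string> unpack(const std::vector<char> &buf)
{
    std::vector<std::string> out;
    const char *p = buf.data();
    const char *end = p + buf.size();
    while (p < end)
    {
        std::uint32_t numChars;
        std::memcpy(&numChars, p, sizeof(numChars));
        p += sizeof(numChars);
        std::size_t bytes = utf8SizeOf(numChars, p);
        out.emplace_back(p, bytes);
        // Advancing by numChars instead would stop inside a multi-byte
        // character and misread the next element's count.
        p += bytes;
    }
    return out;
}
```

With an element such as "ë" (one character, two bytes), stepping by the character count would leave the cursor on the second byte of the character, so every later element would be read from the wrong position.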

ghalliday · 2024-05-01T11:09:38Z

ecl/hql/hqlutil.cpp

-                if (isUnicodeType(expr->queryType()))
+                if (isUTF8Type(expr->queryType()))
+                {
+                    ICompiledStrRegExpr * compiled = rtlCreateCompiledU8StrRegExpr(rtlUtf8Size(value->getSize(), value->queryValue()), (const char *)value->queryValue(), false);


getSize() is the size of the string, so no need to call rtlUtf8Size again. I think possibly this should be a call to rtlUtf8Length(). Alternatively value->queryType()->getStringLen().
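The size/length distinction in this comment can be sketched as follows: a size is a byte count, a length is a character count, and converting a size to a length means counting the bytes that start a character. `utf8LengthOf` is a hypothetical stand-in for `rtlUtf8Length`, written on the assumption that the input is well-formed UTF-8.

```cpp
#include <cstddef>

// Hypothetical stand-in for rtlUtf8Length: number of UTF-8 characters
// contained in the first sizeInBytes bytes of data.
std::size_t utf8LengthOf(std::size_t sizeInBytes, const char *data)
{
    std::size_t chars = 0;
    for (std::size_t i = 0; i < sizeInBytes; i++)
    {
        // Continuation bytes have the form 10xxxxxx; every other byte
        // starts a new character.
        if ((static_cast<unsigned char>(data[i]) & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

Feeding a byte size into a function that expects a character count goes the other way: it would try to step over that many characters and could read past the end of the value whenever the text contains multi-byte characters.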

ghalliday · 2024-05-01T11:14:48Z

rtl/eclrtl/eclregex.cpp

+        int errNum = 0;
+        PCRE2_SIZE errOffset;
+        uint32_t options = ((_isCaseSensitive ? 0 : PCRE2_CASELESS) | (_enableUTF8 ? PCRE2_UTF : 0));
+        size32_t regexLength = (isUTF8Enabled ? rtlUtf8Size(_regexLength, _regex) : _regexLength);


regexSize would be a better variable name, for consistency.

Agreed; changed.

ghalliday · 2024-05-01T11:16:50Z

rtl/eclrtl/eclregex.cpp


                // Update offset
-                offset += matchLen + 1;
+                offset += matchSize + 1;


Not new code, but is this +1 correct? E.g. it suggests that matching "ab" against "abab" would only get the first match.

Better yet:

offset = ovector[1];
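A small sketch of the "ab" against "abab" case, comparing the two ways of updating the offset. To stay self-contained it uses a plain substring search in place of the regex engine. `Match`, `findFrom`, and `findAll` are hypothetical names, with `Match::end` playing the role of `ovector[1]` (the offset just past the end of the match).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the first ovector pair: start offset and end offset
// (one past the last matched byte).
struct Match
{
    std::size_t start;
    std::size_t end;
};

// Substring search used in place of a regex match. Empty needles are
// rejected so the loop below cannot spin on zero-length matches.
bool findFrom(const std::string &subject, const std::string &needle, std::size_t offset, Match &m)
{
    if (needle.empty() || offset > subject.size())
        return false;
    std::size_t pos = subject.find(needle, offset);
    if (pos == std::string::npos)
        return false;
    m = {pos, pos + needle.size()};
    return true;
}

// Collect the start offset of every match, using either update rule.
std::vector<std::size_t> findAll(const std::string &subject, const std::string &needle, bool resumeAtMatchEnd)
{
    std::vector<std::size_t> starts;
    std::size_t offset = 0;
    Match m{};
    while (findFrom(subject, needle, offset, m))
    {
        starts.push_back(m.start);
        std::size_t matchSize = m.end - m.start;
        if (resumeAtMatchEnd)
            offset = m.end;             // equivalent to offset = ovector[1]
        else
            offset += matchSize + 1;    // the update questioned in the review
    }
    return starts;
}
```

With the `+ 1` rule the search resumes at offset 3 and misses the second "ab", which starts at offset 2. Resuming at the match end finds both. A real regex loop would also need to handle zero-length matches, which this sketch sidesteps by rejecting empty needles.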

ghalliday

Looks good. Please squash and I will merge.

dcamper · 2024-05-02T12:06:43Z

@ghalliday squashed, please merge. Thanks!

dcamper requested a review from ghalliday April 16, 2024 20:05

dcamper marked this pull request as ready for review April 16, 2024 20:05

dcamper force-pushed the hpcc-31457-pcre2-for-utf8 branch from 3043696 to a805e2b Compare April 16, 2024 20:06

dcamper changed the title ~~WIP: Use PCRE2 for native UTF-8 regex~~ HPCC-31457 Use PCRE2 for native UTF-8 regex Apr 16, 2024

dcamper requested a review from jackdelv April 17, 2024 11:47

jackdelv approved these changes Apr 17, 2024

dcamper force-pushed the hpcc-31457-pcre2-for-utf8 branch from 04a3258 to 4ca8f20 Compare April 19, 2024 17:39

dcamper force-pushed the hpcc-31457-pcre2-for-utf8 branch from 2b64b1e to 70429ef Compare April 23, 2024 13:21

ghalliday reviewed May 1, 2024

dcamper requested a review from ghalliday May 1, 2024 14:15

ghalliday approved these changes May 2, 2024

HPCC-31457 Add UTF-8 specific regex support using PCRE2

1a033e1

dcamper force-pushed the hpcc-31457-pcre2-for-utf8 branch from 4841746 to 1a033e1 Compare May 2, 2024 11:59

ghalliday merged commit 73bfe4c into hpcc-systems:master May 2, 2024
48 of 49 checks passed
