Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special strings in Lua #6

Open
charlenni opened this issue Nov 9, 2015 · 20 comments · Fixed by OpenRA/Eluant#4
Open

Special strings in Lua #6

charlenni opened this issue Nov 9, 2015 · 20 comments · Fixed by OpenRA/Eluant#4
Assignees
Labels

Comments

@charlenni
Copy link

Hi,

after two years of testing, I had to say, that Eluant works very, very stable. Good job and thank you for sharing it with the community 👍

Now I have the problem, that strings in Lua and strings in C# are different and so there could be problems while converting them from Lua to C#. It would be helpfull, if we could retrive strings as byte arrays.

I encountered the problem, that strings, which are a byte array in Lua could not transfered to C#, saved, loaded again and transfered to Lua. The transfer to Lua doesn't work. The string is created in Lua:

kk = { 252, 255 }  -- 0xFC, 0xFF
k = ""
for i = 1, 2 do
  local x = ""
  x = string.char(kk[i])
  k = k .. x
end

This is a valid string in Lua, which I want to transfer to C# (to save/load it later). But if I now try to bring this string back to Lua, I get problems. The string isn't the same in Lua as before. Perhaps it has something to do with the problem in this thread: http://lists.ximian.com/pipermail/mono-devel-list/2006-March/017493.html.

So it would be nice to get/set a string in Lua by a byte array (ToArray, FromArray) instead of a string, so that no conversions take place.

Best regards,
Dirk

@cdhowie
Copy link
Owner

cdhowie commented Nov 11, 2015

On my version of Mono (3.12.1), Marshal.PtrToStringAnsi() returns null on this string, which isn't correct behavior. (It should only return null when the input IntPtr is not equal to IntPtr.Zero.) So there is one bug I need to fix.

However, thinking about this, I'm not sure that I can reasonably fix this issue without breaking existing code. Consider the case where you have non-ASCII characters in a CLR string and you pass that to Lua. It would be encoded by the CLR using the system's default codepage. Lua would receive a string that differs in ordinal values from the CLR string. When extracting that string from Lua, it would be converted back.

So we have two paths to consider:

  1. CLR string to Lua string and back to CLR. This should work currently with any CLR string (crossing my fingers; might depend on the system's codepage).
  2. Lua string to CLR string and back to Lua. This is currently broken because the Lua string will be interpreted in the system's codepage and converted. The conversion back may not yield the same string, as you have found.

I'm not sure that path 2 can be fixed without breaking path 1.

This may require an extension of the Eluant API. Give me some time to think about this.

@charlenni
Copy link
Author

Thank you for looking into this problem and your help.

Yes, you described the problem very good. It would be nice, if you could find a global solution, but a "small" solution would be good too. A ".AsByteArray()" and ".FromByteArray()" would be enough ;).

@cdhowie
Copy link
Owner

cdhowie commented Nov 11, 2015

Yes, but that approach has its own disadvantages. After a Lua string value is converted to a LuaString, all ties to the string object in Lua are broken. This means that there is no longer a way to get to the Lua string object to bring it into the CLR a second time, but without codepage conversion.

The easiest way to solve this would be to convert the string from Lua twice, once through the system's codepage and one using just the ordinals, and store both representations on the LuaString object, but this has performance and memory consumption disadvantages.

A more involved way would be to introduce a new LuaStringReference that has a link back to the Lua string, at which point the value could be converted into either a C# string or byte[] object. This would break existing code that uses Eluant.

@charlenni
Copy link
Author

If I understand the problem correct, than the problem is the marshaling when transfering the string from Lua to C# and vice versa. Isn't it possible to transfer the string as byte array, so that the marshaling conversion of strings doesn't take place?

@cdhowie
Copy link
Owner

cdhowie commented Nov 13, 2015

Not without breaking the CLR -> Lua -> CLR path. CLR strings are made of chars, which are 2-byte elements. If a char has an ordinal value out of the range of a byte (>255) then the string can't be marshaled to Lua exactly as it exists in the CLR; some conversion is required. Then there would be an expectation that the string comes back out of Lua as an equal CLR string, which means that the Lua string would need to be converted back.

So yes, what you describe would fix the Lua -> CLR -> Lua path, but it would break the CLR -> Lua -> CLR path.

@charlenni
Copy link
Author

But the CLR -> Lua -> CLR path was never correct. You couldn't transfer CLR strings in Unicode to Lua. So you have to do a conversion on the fly. Normally you would convert Unicode to UTF-8 and transfer this to a Lua string. Back you would transfer the Lua string to CLR and convert it from UTF-8 to Unicode.

I see no possibility to transfer strings between the two systems without converting them.

@cdhowie
Copy link
Owner

cdhowie commented Nov 13, 2015

But the CLR -> Lua -> CLR path was never correct.

Well, then that's a bug. 😃

The best way I can think to resolve this issue would be to have an interface that defines a string converter (string <-> byte[]). The default implementation would use UTF-8 encoding, but it could be replaced if your application needs some other kind of conversion.

@charlenni
Copy link
Author

No, it isn't a bug. The problem is, that Lua doesn't have "string". They are normal byte arrays. But CLR uses Unicode. So how should this be converted without knowing, which encoding the byte array in Lua uses?

@cdhowie
Copy link
Owner

cdhowie commented Nov 13, 2015

That's precisely the problem. There is a mismatch between CLR strings and Lua strings.

There needs to be a compromise on how that's handled somewhere. If all Lua strings get extracted into the CLR as byte arrays, a significant amount of convenience is lost -- you have to manually convert any time you want to use a Lua string in the CLR world. LuaString would effectively become a byte[] wrapper. How would it handle ToString()?

Perhaps the best solution would indeed be to make LuaString wrap a byte[], giving you direct access to the byte array, but also allowing automatic UTF-8 conversion when the object is converted to a CLR string.

@charlenni
Copy link
Author

That would be good! So we could have the original data, but for all normal cases the byte array is converted automatically to a correct CLR Unicode string. That's fine :)

@cdhowie
Copy link
Owner

cdhowie commented Nov 13, 2015

This change is going to be fairly invasive, so it may not be ready soon. I effectively need to replace every Lua library API that accepts a string to accept a byte[] instead. This may seem like overkill, but consider the case where you are accessing table entries by string key in the CLR. These strings need this special handling as well.

Note that some Lua APIs (lua_getfield(), for example) accept a (byte-)string argument but not a length. This means that any uses of that API are not safe with respect to strings containing embedded null characters. I will need to replace such uses with a more robust pushstring,gettable sequence.

Basically I need to audit every Lua API that accepts a string and either ensure that it will behave correctly, or replace it with something else.

@cdhowie cdhowie self-assigned this Nov 13, 2015
@cdhowie cdhowie added the bug label Nov 13, 2015
@Mangatome
Copy link

Hi, I'm jumping in the thread to second charlenni in thanking you for the great lib.

Basically I need to audit every Lua API that accepts a string and either ensure that it will behave correctly, or replace it with something else.

That sounds extreme, but I understand it might be necessary to implement the change correctly.

But what about the solution you described earlier...

The easiest way to solve this would be to convert the string from Lua twice, once through the system's codepage and one using just the ordinals, and store both representations on the LuaString object, but this has performance and memory consumption disadvantages.

What about this, simply as an optional (conditional compilation) feature of the lib? That would be more hacky and less generic, but maybe easier to implement and less invasive for general users.

@charlenni
Copy link
Author

Perhaps Mangatome is right. It's a special use case. Normally all (99,999%) users of Eluant will transfer "normal" strings to and from Lua. To transfer a string, which isn't printable, is a very special use case. I found this only once, when someone tries to encrypt strings by a special string, which contains the given bytes.

So a conditional compilation would be ok. Others be aware of the string transfer problem and could use it in their code. All others use the code as is. It works, how said above, in nearly all cases.

@cdhowie
Copy link
Owner

cdhowie commented Nov 13, 2015

The approach I'm taking now is to add a byte[] constructor to LuaString. This solves the problem of how to get a CLR byte-string into Lua without conversion.

When marshalling a Lua string to the CLR, I am going to pull the byte-string out of Lua and use the same constructor internally. A new AsByteArray() method will return a copy of the byte[].

Then, the first time that a LuaString object is used in a context that requires a CLR string (such as evaluation of LuaString.Value) the byte array will be interpreted as UTF-8 and converted to a string. This way, most users won't notice any difference. (They might experience an exception, however, if an invalid UTF-8 sequence is encountered when doing this conversion. To me, that is acceptable behavior because they are effectively trying to use non-string data as a string.)

The LuaString(string) constructor will simply UTF-8 encode the string into a byte array, so that's what would get passed to Lua code.

I think this approach will solve all of the conversion issues. In hindsight, performance should be either equal or better, as the string conversion still happens only once, just in a different place -- and it might not even happen at all, if only AsByteArray() is used.

Memory requirements may be increased a bit but I think this is an acceptable compromise to properly resolve this issue.

@charlenni
Copy link
Author

That sounds great. It's not to difficult and there are no changes top the code for the others. And yes, the exception should be ok, because normally you expect, that the string is printable.

Thank you for your help.

@cdhowie
Copy link
Owner

cdhowie commented Nov 13, 2015

I have pushed a fix for this issue to the lua-string-marshal branch. There are new tests to verify the new behavior. Please test and let me know if you have any issues.

@charlenni I used a modified version of your testcase and it's working. (Your test can be rewritten as simply k = string.char(0xFC, 0xFF).)

@cdhowie
Copy link
Owner

cdhowie commented Dec 4, 2015

@charlenni Have you been able to test the fix yet?

@charlenni
Copy link
Author

Sorry for the late answer. I had the last two weeks no time to test it. Yesterday I changed my code in the way you did it. And yes, it works. Now saving and loading again works. Thank you very much.

@ALL: Be carefull. If you use string.dump() to save a function from Lua, you could get the same problem with strings.

@cdhowie
Copy link
Owner

cdhowie commented Dec 4, 2015

Thanks. I will get the branch merged soon.

@Mailaender
Copy link

See also OpenRA#3 which seems to address a very similar problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants