Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Arabic/Hebrew/Persian languages. #3137

Merged
merged 10 commits into from
Jul 2, 2017

Conversation

ashensis
Copy link
Contributor

@ashensis ashensis commented Nov 8, 2016

This PR is intended to provide ACE with proper Arabic/Hebrew/Farsi (aka. Bidi) characters typing/display. (references the discussion under issue 'RTL editing' #1429)
Currently ACE fails to do this for 2 primary reasons.

  1. ACE bases itself on monospaced font in calculating screen columns offsets.
    Unfortunately, Bidi character's width differs from that of regular (say English)
    characters. This poses challenge, especially when Bidi and none Bidi (English or neutral) characters
    are present in the same line.
    Moreover, for Chrome engine based browsers (like Chrome and Opera), on some OS the default 'Monako' font
    defined in editor.css is not monospaced when it comes to Bidi characters.
    This might happen or not happen even on different Windows7 systems.
    Apparently this happens since 'Monaco' has no glifs for Bidi characters and font substitution
    on different OS(s) picks up different fonts from awailable stock.

  2. Bidi languages differ from none Bidi ones in their flow direction (from right to left),
    thus, character order on display (right-to-left) differs from respective order in text buffer (left-to-right).
    The respective character positions on display versus buffer are defined by Bidi algorithm implemented by platform
    (either by OS for native applications or by browser for web based applications).
    The correspondence between respective character positions on display (aka Visual) and in buffer (aka Logical)
    is defined by Visual<->Logical maps produced by Bidi Engine implementing Bidi algorithm.

  3. Between additional miscellaneous challenges it worth mentioning:

  • Arabic LamAlef pair processing (this particular combination of two characters is displayed as one character)
  • 'Tab' and 'wide' characters processing (their width participates in calculating the screen column position offset)

For references to Bidi algorithm see, for instance, http://unicode.org/reports/tr9/
For general references regarding Bidi see, for instance, https://www.w3.org/International/articlelist#direction

As a consequence, the character screen position (offset) in pure Bidi or mixed Bidi/None-Bidi text lines
should be calculated basing on character screen position rather then buffer one.

The following implementation details are important for understanding how this PR deals with aforementioned problems.

  1. Statefull 'bidihandler' class is introduced to concentrate all Bidi related activities.
    It depends on lightweight Bidi engine implemented by library class 'bidiutil'.

  2. 'bidihandler' class stores information related to text line under processing, this info
    consists of so named 'bidiMap' that contains Bidi levels and Visual<->Logical map (see above for terms definition).
    Above mentioned information is updated when caret is moved to another text line.

  3. In order to cope with the problem of calculating correct horizontal offsets of displayed characters
    'bidihandler' class stores width of Bidi characters alongside with widths for None Bidi characters, end-of-lines etc.
    In cases when 'Monaco' font doesn't furnish monospaced glyphs for Bidi characters, this font gets dynamically
    replaced by "Courier' font on per line basis (for lines containing Bidi characters)

  4. Special methods of 'bidihandler' class, like 'getPosLeft', 'offsetToCol', 'getSelections' ensure correct caret display,
    calculation of mouse click position and text (single or multiple) selection.

This solution has been tested on IE, FF, Chrome/Opera, Safary.

The test cases will be added to this PR after current approach is approved and code implementation is finalized.

@tomerm tomerm mentioned this pull request Nov 8, 2016
@tomerm
Copy link

tomerm commented Nov 8, 2016

@ebraminio the fix for issues reported in #1429 was submitted in this PR. Do you know who should review it ? I just wonder if we need to attract anybody's attention to it.

@ebraminio
Copy link

ebraminio commented Nov 8, 2016

I am not familiar with Ace project maintainers but this looks good, even the fact I was not in favor of adding a bidi algorithm but if this in not avoidable, perhaps ace should go for it.

Also, I just spotted a minor issue on reverting functionality (double click a RTL word, remove it then press Ctrl+Z, highlighting area is misplaced) which IMO, this totally can go without it; so functionally, this LGTM.

BTW, "Pharsi" is misspelling of "Farsi" but even better you can use "Persian" to refer to the language.

Thanks for this great effort, I like to see this ASAP on websites using Ace.

@tomerm
Copy link

tomerm commented Nov 8, 2016

@ebraminio thanks a lot for the feedback. This patch is far from being perfect / final. It is basically a prototype which illustrate basic approach to solution (and make essential features to work - cursor movement, typing etc.). Once we finalize the approach we will obviously polish the implementation to make sure all use cases are covered.

@tomerm
Copy link

tomerm commented Nov 8, 2016

@nightwing it looks like you were involved in discussion and review of this issue from the very beginning. Would you please have a look at proposed solution and share your comments ? Many thanks in advance.

@ashensis
Copy link
Contributor Author

ashensis commented Nov 8, 2016

@ebraminio Thank you for your comments.
Regarding your remark about Ctrl+Z causing misplaced selection, I tried to reproduce this scenario with no success. Could you please specify the exact word you are selecting/deleting and what kind of browser & OS have been used?

@ebraminio
Copy link

ebraminio commented Nov 8, 2016

I am getting the issue consistently across different browsers but I hope this would help to repro there, let me know if not:

  1. Open kitchen-sink.html, default setting
  2. Put "متن متن متن متن" on it
  3. Select second word, delete it with backspace
  4. Ctrl+Z (Undo)
  5. Ctrl+A (Select all)
    image

Anyway, I think this has less importance for being a merge blocker. Thanks

@ashensis
Copy link
Contributor Author

ashensis commented Nov 9, 2016

Thank you very much for cooperation @ebraminio.
I already found the problem, now got it fixed.

@rocketinventor
Copy link

@ashensis Thanks so much for the detailed write up. It was very informative.

@nightwing
Copy link
Member

Sorry for not finding time to review this earlier.
The changes here look fairly good, but there are still some issues that should be fixed to get this into mergeable state.

Hardcoding "Courier" font doesn't work for mac, and doesn't work well with wrapping, since character limits are computed for the main font, maybe for now we should just ask users pick a monospace font that supports their language.

Caret movement is generally different from that of textarea. E.g. up/down keep logical column instead of the visual one, left/right move by logical column too, is this intentional?

@ashensis
Copy link
Contributor Author

ashensis commented Jan 4, 2017

Thank you @nightwing for your comments.

  1. Regarding what you said about asking users pick a mono-spaced font that supports their language.
    Let me make sense of this. Currently the list of preferred fonts is hard coded in editor.css as follows:
    font: 12px/normal 'Monaco', 'Menlo', 'Ubuntu Mono', 'Consolas', 'source-code-pro', monospace;
    So obviously it is a "administrator" choice to define a font to be used.
    However the problem of aforementioned fonts that doesn't work for Arabic/Hebrew and few other languages is only relevant for "Chrome" engine browsers, like Chrome and Opera.
    For IE or FF this problem doesn't exist.
    In my code I automatically chose "Courier" only if lack of "monospacety" is detected and this should resolve above mentioned problem for "Chrome". Please note that in this case working with currently predefined fonts isn't an option at all from perspective of user experience..
    I could add additional check for "mac" to avoid font switching on this OS. What do you think about this amendment?
  2. As for your remark that
    "forcing 'courier' on Chrome doesn't work well with wrapping, since character limits are computed for the main font_"
    please note that this applies to proposed solution as well, since all relevant measurements in my code are performed on per character basis and Bidi characters have different width from regular ones.
  3. You made a good point in observing that caret movement is different, as for up/down arrow keys it was my omission, this should be fixed to ensure "visual" movement across different lines. I will try to resolve this.
  4. Regarding left/right arrow keys, this behavior of "logical" movement is standard across all editors in case of left-to-right base text direction, you can check this with Windows system editor (notepad) for instance, as well as with MS Word and any other Browser or OS based text editor (and ACE currently works with this left-to-right direction).
    (It worth mentioning although, that in right-to-left base text direction the caret movement does coincide with arrow direction)

@ashensis
Copy link
Contributor Author

ashensis commented Feb 6, 2017

@nightwing

  • Fixed problem of up/down arrow keys to ensure "visual" movement across different lines.
  • Added check to exclude forcing "Courier" font on Chrome for 'MAC'
  • Forcing "Courier" font on Chrome is limited only when document been edited contains Bidi (Arabic/Hebrew) characters and font is changed only for lines containing such characters.

@ebraminio
Copy link

Please rename "Pharsi" to "Persian" (or, "Farsi", which is less favorable) on the title.

@ashensis ashensis changed the title Support for Arabic/Hebrew/Pharsi languages. Support for Arabic/Hebrew/Persian languages. Feb 7, 2017
@lyudmil
Copy link

lyudmil commented Mar 1, 2017

@nightwing Anything I can help with to get this merged in? Perhaps with testing?

@ashensis
Copy link
Contributor Author

ashensis commented Mar 1, 2017

Thank you @nightwing.
Your testing is most welcome.
From my perspective the code is ready.

@adir01
Copy link

adir01 commented May 14, 2017

@nightwing when you think @ashensis code can be merged ?

Copy link
Member

@nightwing nightwing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good, there are only a couple of changes still needed to merge this.
Since this has non-trivial impact on performance, and possible optimizations for very long lines, it would be better to merge this as a preference turned off by default for now.
We'll turn it on after testing this for a while.

* @param {Number} the document row to be checked [optional]
* @param {Number} the wrapped screen line index [ optional]
**/
this.isBidiRow = function(screenRow, docRow, splitIndex) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you change tabs to 4 spaces to match the rest of the code, please.

Copy link
Contributor Author

@ashensis ashensis Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

this.bidiUtil = bidiUtil;
/* Arabic/Hebrew character width differs from regular character width */
this.charWidths = [];
this.EOL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • bidiUtil references lightweight Bidi engine used to produce Logical <-> Visual maps. These maps are used to correctly calculate caret position.
  • charWidths contains widths of various characters (English, Bidi, Zero width, Arabic Lam-Alef). Again, they are used to correctly calculate caret position.
  • EOL - end of line character, it may vary in ACE. Line may be followed by end-of-line or end-of-file symbol (when show invisibles) and they are used in caret position calculations (sometimes EOL occurs inside the Bidi line, not at the end)

Copy link

@ebraminio ebraminio Jun 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the issue is this.EOL; is somehow NOOP here and it is not a function call like this.EOL(); or an assignment and the question is how this line and the next contributing anything.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the suggestion would be putting some default sane value here instead this NOOP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this for both EOL and showInvisibles

/************************************************************************************************************************************/
/* 0 1 2 3 4 5 6 7 8 9 A B C D E F */
/************************************************************************************************************************************/
/*0-*/ TB00, L , L , L , L , TB05, TB06, TB07, R , L , L , L , L , L , L , L ,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to reduce the size of these tables? Maybe we could use something similar to https://github.com/codemirror/CodeMirror/blob/master/src/util/bidi.js#L62?

Copy link
Contributor Author

@ashensis ashensis Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a matter of fact, approach implemented in codemirror is a rough one, it doesn't account for many outstanding cases.
However, after revising my classification tables I concluded that few of them might be disposed off without much harm.
But I would prefer to retain, partly, at least few tables, they might be compressed to incur less upload impact.
What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we should find a way to compress this significantly, otherwise we have to keep it as a separate extension, and not include in ace.js

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I retained only 2 character classification tables (actually only part of each one) and compressed them, the rest has been replaced by regular expressions

@@ -311,6 +319,21 @@ var Text = function(parentEl) {
row++;
}
this.element.innerHTML = html.join("");
if (this.session.$bidiHandler.isCourierFontFace())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it would be better to switch font for the whole editor, or let the author of the site to pick font.
This doesn't work on mac anyway, and we are going to add proper support for variable width fonts soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you would agree, I can just remove dynamic font change and let site owner to pick correct font.
But I would say that the font name suitable for Chrome should be somehow documented, isn't it?
May be in the the same editor.css

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove dynamic font change

@@ -1116,6 +1118,7 @@ var EditSession = function(text, mode) {
*
**/
this.insert = function(position, text) {
this.$bidiHandler.markAsDirty();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly speaking I do not understand this proposal.
Line 250 of edit-session.js falls with 'onChangeFold' method that has no bearing on our subject.
The intended changes of mine (invocation of 'markAsDirty') serve to rebuilt Bidi info pertinent to line being edited, in order to be able to correctly calculate caret position.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, i meant line 255 https://github.com/ashensis/ace/blob/8b21722f4dff1e43a5c4debf066a54b0bcf6a9d3/lib/ace/edit_session.js#L255, session can be changed without calls to insert/remove. The only reliable way to handle that is to listen to the change event.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -1128,6 +1131,7 @@ var EditSession = function(text, mode) {
*
**/
this.remove = function(range) {
this.$bidiHandler.markAsDirty();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for this and the other markAsDirty calls

@@ -2430,6 +2443,10 @@ var EditSession = function(text, mode) {
this.$stopWorker();
};

this.isFullWidth = function(c) {
return isFullWidth(c);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to do this.isFullWidth = isFullWidth;

Copy link
Contributor Author

@ashensis ashensis Jun 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ashensis
Copy link
Contributor Author

@nightwing I addressed all your comments, please have a look at submitted changes.

@adir01
Copy link

adir01 commented Jul 2, 2017

@nightwing Did @ashensis answered all your comments ? Are you happy with his changes ?

@nightwing nightwing merged commit 255c3a1 into ajaxorg:master Jul 2, 2017
@nightwing
Copy link
Member

looks good, i've merged with some trivial style changes, and will try using it for a while.
I amy have to turn this off by default, if performance difference is noticeable.

Thanks for all your hard work on implementing this, and your persistence and patience for getting this merged.

@adir01
Copy link

adir01 commented Jul 3, 2017

Thank you @Nigtwing

@lyndondrake
Copy link

I am very eager to test this too. How do I access a build with this change included?

@nightwing
Copy link
Member

@lyndondrake clone this repository, install node, then run

git clone https://github.com/ajaxorg/ace 
cd ace
npm install
node Makefile.dryice.js

or simply open https://rawgit.com/ajaxorg/ace/255c3a1793276b77eeba90e40df99c3ccc941780/kitchen-sink.html

@shreevatsa
Copy link

This may not be exactly related to this PR but asking here as most knowledgeable people seem to be here:

The spacing issues as described at #2667 and #3037 seem to still not be fixed even at https://rawgit.com/ajaxorg/ace/255c3a1793276b77eeba90e40df99c3ccc941780/kitchen-sink.html -- is that expected? For example if I put some random Indic text on a long line:

मम हि निष्कारणबान्धवं भवन्तमालोक्यैव दुःखान्धकारभाराक्रान्तेन महतः कालादुच्छ्वसितमिव चेतसा, श्रावयित्वा स्ववृत्तान्तमिमं सह्यतामिव गतः शोकः । दुःखितमपि जनं रमयन्ति सज्जनसमागमाः । परसुखोपपादनपराधीनश्च भवादृशां गुणोदयःऽ इत्युक्तवतीञ्चैनां चन्द्रापीडो ऽव्रवीत्-"भगवति! दर्शनात्प्रभृति परवानयं जनः कर्त्तव्येषु यथेष्टमशङ्किततया नियुज्यताम्ऽ इत्यभिधाय तया सहैवोदचलत् । क्रमेण च गत्वा हेमकूटमासाद्य गन्धर्वराजकुलम्, समतीत्य काञ्चनतोरणानि सप्तकक्षान्तराणि कन्यान्तःपुरद्वारमवाप । महाश्वेतादर्शनप्रधावितेन दूरादेव कृतप्रणामेन कनकवेत्रलताहस्तेन

and click somewhere in it and hit Enter, the line break gets inserted somewhere other than where the cursor was. Is there a way to resolve it by using a different font? Because it would be surprising if RTL text with varying char widths can work, but not left-to-right.

@shreevatsa
Copy link

shreevatsa commented Jul 14, 2017

(All I wanted was a textarea into which I can put say 2 MB of plain text, with some minimal highlighting for # section headings and **bold** text. And now I'm looking at internals of Ace code 😄 )

From what I can understand of this PR by looking at the code, it looks like there are two parts to this support for Arabic/Hebrew/Persian:

  • We need to account for the fact that not all characters will have the same width. (Here characters ≠ codepoints, but whatever a “visual character” means: glyphs, graphemes, whatever—including combining characters.) This affects things like cursor position and cursor motion, selecting text, and so on. Basically the relation between where the user thinks they are (the cursor / mouse pointer) and where the code thinks it is.
  • We need to handle bidirectional issues: levels of right-to-left v/s left-to-right nesting, etc. This also affects cursor motion (e.g. between a run of text that should be left-to-right and one that should be right-to-left).

This PR does both! But what we need for supporting Indic scripts (or even characters like ⟦ ♞ ℝ as on the linked issues) is just the first part (correctly handling characters whose widths vary).

Is there a way of extracting out just the first part, so that it would be more useful?

shreevatsa added a commit to shreevatsa/impu that referenced this pull request Jul 15, 2017
Initially chose Ace for "performance" but the text so far is barely 1MB,
also Ran into issues with Indic text:
ajaxorg/ace#3137 (comment)

So trying CodeMirror, which certainly has the features I
want (https://codemirror.net/demo/bidi.html,
https://codemirror.net/demo/variableheight.html) and doesn't have
these bugs.
@hartman
Copy link
Contributor

hartman commented Aug 8, 2017

So am I correct in understanding that glyph fallback replacement on Macs breaks this font width calculation, since it replaces with glyphs that are not monospaced ?

Maybe there is a font feature setting that we can use to correct this ? I'm not sure.. anyone more experience with fonts than me ? :)

@ashensis
Copy link
Contributor Author

Hi @shreevatsa.
On addressing your comments, I should say that this PR was intended and tested primarily for Arabic & Hebrew languages that have, may be accidentally the common denominator in respect of character width.
I personally never dealt the Inidc.
Secondly, this PR doesn't take care of right-to-left direction (the issue I am planning to develop as a next step). So only left-to-right text orientation can be maintained for now.
The Bidirectional aspect of this PR lays with light weighted Bidi engine and it represents the bulk of code, while the maintaining variable character width constitutes relatively minor aspect. Besides, as far as I know, Ace team has plans to provide general support for variant width fonts.

@ashensis
Copy link
Contributor Author

Hi @hartman.
The "Monaco" monospaced font in fact isn't monospaced when it comes to Arabic and Hebrew glyph on Chrome engines.
Initially I developed automatic switch to Courier font for given line when typing of these characters occurs on Chrome. But this approach was out-ruled by reviewer due to this very issue of Mac having no Courier.
However the site author is free to specify the default font in editor.css, so if one finds the monospaced font (in respect to Arabic/Hebrew) on Mac, he/she may fairly well to make corresponding site configuration.

@shreevatsa
Copy link

Hi @ashensis -- from your comments it looks like you've done most of the work for supporting variable-width left-to-right scripts, and only some minor tweak remains. I am hopeful: if you have a few minutes could you look into the example I posted earlier (#3137 (comment) : you can install a Devanagari font like Noto Sans Devanagari if it doesn't render) and see why it has the issues I mentioned? My guess is that, if the text is simply detected / marked as being "bidi" (in terms of your code in this PR — even if it's strictly left-to-right), then everything would actually work, which would be great.

AFAIK the Ace team has no other plans in progress for variable-width characters, and as you've already done such a lot of work I feel there's probably only a minor change or two required for everything to work perfectly.

@ashensis
Copy link
Contributor Author

Hi @shreevatsa.

lastly I said that this PR was targeting only Arabic/Hebrew.
Please pay attention on the following (first) code line in bidihandler.js, 'updateCharacterWidths' method
this.updateCharacterWidths = function($fontMetrics) {
var _regularCharWidth = $fontMetrics.$measureCharWidth('a'), _bidiCharWidth = $fontMetrics.$measureCharWidth('\u05d4');
It measures Hebrew letter Alef when calculating bidiCharWidth. This width is uniformly equal for all Arabic and Hebrew characters (except compound Arabic ones like LamAlef) on Monaco or Courier monospaced fonts.
I tried to measure widths of a few of Inidc characters posted by you, and they have variant widths, like:
$fontMetrics.$measureCharWidth('भ') == 11.4666666
$fontMetrics.$measureCharWidth('सा') == 15.666666
So I presume that in order to provide general support for none-monospaced fonts, more elaborate approach should be accepted, namely measuring the cursor offset from line beginning on every change caret position (which should take a toll on overall performance)

@ashensis
Copy link
Contributor Author

I didn't try to install Devanagari font but if you already have it installed, you can play around by measuring the widths of various Indic characters to see if it is truly monospaced from this respect.
If the answer would be positive, then the straightforward change may be done, say to provide a configuration paarmeter telling which script should be supported, Bidi, Indic or may be something else

@ashensis
Copy link
Contributor Author

And then to measure respecting character for script being configured.

@shreevatsa
Copy link

@ashensis Thanks for that explanation, it was really helpful and cleared up my misunderstanding. I somehow seem to have missed that what's now supported is monospaced Arabic/Hebrew; I didn't even know those existed. :-) Thanks, guess the goal is further than I thought (as I'm not aware of monospaced Indic fonts, and what's really needed is variable-width support which as you say implies performance costs). Sorry for the noise arising from my misunderstanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants