Drs.Bob's Delphi Notes
These are the voyages using Delphi Enterprise (and Architect). Its mission: to explore strange, new worlds. To design and build new applications. To boldly go...
Title:

Unicode tip #3 - UTF-16 Number of Printable Characters

Author: Bob Swart
Posted: 11/24/2008 12:14:10 PM (GMT+1)
Content:

The number of elements in a string can be retrieved by calling the Length function. Using yesterday's example, Length(Clef) = 4, meaning there are 4 WideChar elements in the Clef constant. However, this count includes the two square brackets as well as the two surrogates, which together encode only one printable character.
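
To make the element count concrete, here is a minimal console sketch. The definition of Clef is an assumption on my part (an opening bracket, the musical G clef U+1D11E stored as the surrogate pair $D834 $DD1E, and a closing bracket), chosen only because it matches the description above; the program name is hypothetical as well.

```delphi
program ClefLength;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Assumed definition: '[' + U+1D11E (musical G clef) + ']'.
  // U+1D11E lies outside the Basic Multilingual Plane, so UTF-16
  // stores it as two elements: the surrogates $D834 and $DD1E.
  Clef = '['#$D834#$DD1E']';

var
  i: Integer;
  S: string;
begin
  S := Clef;
  // Length counts WideChar elements, not printable characters.
  Writeln('Length = ', Length(S));
  // List every element so the two surrogates are visible.
  for i := 1 to Length(S) do
    Writeln(i, ': $', IntToHex(Ord(S[i]), 4));
end.
```

With this assumed constant, the loop lists four elements: the bracket, the two surrogates, and the closing bracket.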

To get the number of printable characters in a UTF-16 string, we need to check for surrogates and count each surrogate pair as one printable character (instead of two). This is implemented in the following UTF16Length function, which returns the number of printable characters:

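  // Requires the Character unit (IsSurrogate, IsSurrogatePair) in the uses clause.
  function UTF16Length(const S: String): Integer;
  var
    i: Integer;
  begin
    Result := 0;
    for i := 1 to Length(S) do
      if not IsSurrogate(S[i]) then
        Result := Result + 1 // ordinary character: one element, one count
      else // surrogate: count the pair once, at its second element
        if (i > 1) and IsSurrogatePair(S[i-1], S[i]) then
          Result := Result + 1
  end;
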
Calling UTF16Length on Clef returns 3: the opening and closing brackets are counted as well as the surrogate pair (being equal to a single printable character).
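
As a usage sketch, the function can be dropped into a small console program and compared against Length. This assumes Delphi 2009, where IsSurrogate and IsSurrogatePair are global functions in the Character unit (later versions may expose them differently), and it again assumes a Clef constant built from U+1D11E; the program name is hypothetical.

```delphi
program UTF16LengthDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

const
  // Assumed definition: '[' + surrogate pair for U+1D11E + ']'.
  Clef = '['#$D834#$DD1E']';

function UTF16Length(const S: String): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if not IsSurrogate(S[i]) then
      Result := Result + 1
    else // surrogate: count the pair once, at its second element
      if (i > 1) and IsSurrogatePair(S[i-1], S[i]) then
        Result := Result + 1
end;

begin
  // Number of WideChar elements versus number of printable characters.
  Writeln('Length(Clef)      = ', Length(Clef));
  Writeln('UTF16Length(Clef) = ', UTF16Length(Clef));
  // For a string without surrogates both counts are the same.
  Writeln('Length(''abc'')      = ', Length('abc'));
  Writeln('UTF16Length(''abc'') = ', UTF16Length('abc'));
end.
```

Note that a surrogate which is not part of a valid pair is never counted by this function, so malformed input yields a smaller number rather than an error.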

This tip is the third in a series of Unicode tips taken from my Delphi 2009 Development Essentials book published today on Lulu.com.



7 Comments

Author | Posted | Comment
Aleksander Oven | 08/11/24 18:03:01 | I := 0? Are Strings in D2009 0-indexed?
Bob Swart | 08/11/24 19:10:22 | Thanks for the sharp eyes: no, strings start at 1. I've fixed the code.
Victor | 08/11/25 08:21:59 | OK, you wrote the code for clarity and all that, but I have to comment on the efficiency of this function. Four function calls are made to skip one surrogate pair, where one would be enough (assuming that the first surrogate is always followed by a second, but that assumption seems reasonably safe to me).
Bob Swart | 08/11/25 08:27:56 | Feel free to write your own edition, I'm sure a BASM version would be even more efficient. I just wanted to point out that the number of elements (Length) is not the same as the number of printable characters...
Delfi Phan | 08/11/26 15:14:37 | Unicode strings are still indexed from 1? This has been an irritant to many that come from other languages like C, but up to now it had a reason. So what is in position 0? Surely not a length byte/integer/word?
Olaf Monien | 08/11/27 08:44:56 | Delphi Strings are *always* indexed from 1 - that is a (Pascal-)language feature, and not a question of the actual string encoding.
Delfi Phan | 08/11/27 10:38:51 | I would still be interested in what is in position 0. A lot of code (mis)uses the fact that this contains the length. What would happen if I tried MyStringLength := TheString[0]; (I don't have D2009)


This webpage © 2005-2014 by Bob Swart (aka Dr.Bob - www.drbob42.com). All Rights Reserved.