In a recent debate I had it was said that strings in the Win64
runtime are too slow to be useful. That is, in my opinion, a gross
exaggeration. It is true that the Win32 runtime library (RTL) has
benefited a lot from the work of the FastCode project, usually with
routines in extremely clever assembler. For all other platforms, often the
routines are in plain Object Pascal, so no assembler is being used.
Also, far fewer routines have been replaced by clever implementations.
One very obvious example of this is the Pos
function, which
searches if a certain string (I call that the Needle
) can be found
in a larger one (the Haystack
). The Win32 implementation
is in highly optimized assembler, written by Aleksandr Sharahov from
the FastCode project, and licensed by CodeGear. The Win64
implementation is in plain Pascal (PUREPASCAL
). But the
implementation for UnicodeString
is not the same, or even similar,
to the implementation for AnsiString
!
The implementation for UnicodeString
is slower than the same
routine for Win32. On my system a search in Win64 takes
approx. 1.8 × the time it needs in Win32. On Win32,
Pos
for AnsiString
is about as fast (or sometimes
even slightly faster than) Pos
for UnicodeString
. But
on Win64, Pos
for AnsiString
takes 2 ×
the time Pos
for UnicodeString
needs!
If you look at the sources in System.pas
, you'll see that the
Unicode version is slightly better optimized (searching for the first
Char
in the Needle
first, and only checking the rest
if a match was found).
For fun, I took the code for the UnicodeString
implementation and converted
it to work for AnsiString
. It was slightly faster than System.Pos
for UnicodeString
, instead of 2 times as slow. I wonder why, in System.pas
, the AnsiString
implementation does not simply use the same code as that for UnicodeString
,
like I did. If I were a suspicious person, I would think it was done on purpose,
to deprecate AnsiString
by making it less usable.
But even that can be improved upon. I wrote three implementations of my own
routine, one for AnsiString
, one for UnicodeString
and one for TBytes
(many
people have complained that TBytes
lacks something like Pos
and
that was the reason they maintained the incredibly bad habit of using
strings to store binary data — <shudder> — I wanted to take away that
silly argument).
Code
Here is the code for my RVPosExA
function (for what it's worth: these days, there is no difference
between PosEx
and Pos
anymore: both have the exact same functionality and signature):
function RVPosExA(const Needle, Haystack: AnsiString;
Offset: Integer = 1): Integer;
type
PUInt32 = ^UInt32;
PUInt16 = ^UInt16;
{$IFNDEF CPU32BITS}
var
LNeedleTip: UInt32;
PNeedle: PAnsiChar;
PHaystack, PEnd: PAnsiChar;
LLenNeedle: Integer;
LCmpMemOffset: Integer;
{$ENDIF}
begin
{$IFDEF CPU32BITS}
// FastCode (asm) implementation.
Result := System.Pos(Needle, Haystack, Offset);
{$ELSE}
if Offset - 1 + Length(Needle) > Length(Haystack) then
Exit(0);
Result := 0;
PHaystack := PAnsiChar(Haystack) + Offset - 1;
PEnd := PHaystack + Length(Haystack) - Length(Needle) + 1;
case Length(Needle) of
0: Exit(0);
1:
begin
LNeedleTip := PByte(Needle)^;
while PHaystack < PEnd do
if PByte(PHaystack)^ = LNeedleTip then
Exit(PHaystack - PAnsiChar(Haystack) + 1)
else
Inc(PHaystack);
Exit(0);
end;
2:
begin
LNeedleTip := PUInt16(Needle)^;
while PHaystack < PEnd do
if PUInt16(Haystack)^ = LNeedleTip then
Exit(PHayStack - PAnsiChar(Haystack) + 1)
else
Inc(PHaystack);
Exit(0);
end;
3:
begin
LNeedleTip := PUInt32(Needle)^; // if Needle is length 3, then top byte
// is the #0 terminator
while PHaystack < PEnd do
if ((PUInt32(Haystack)^ xor LNeedleTip) and $FFFFFF) = 0 then
Exit(PHaystack - PAnsiChar(Haystack) + 1)
else
Inc(PHaystack);
Exit(0);
end;
4:
begin
LNeedleTip := PUInt32(Needle)^;
while PHaystack < PEnd do
if PUInt32(Haystack)^ = LNeedleTip then
Exit(PHaystack - PAnsiChar(Haystack) + 1)
else
Inc(PHaystack);
Exit(0);
end;
else
begin
LCmpMemOffset := SizeOf(UInt32) div SizeOf(AnsiChar);
PNeedle := PAnsiChar(Needle) + LCmpMemOffset;
LLenNeedle := Length(Needle) - LCmpMemOffset;
LNeedleTip := PUInt32(Needle)^;
while PHaystack < PEnd do
if (PUInt32(PHaystack)^ = LNeedleTip) and
CompareMem(PHaystack + LCmpMemOffset, PNeedle, LLenNeedle) then
Exit(PHaystack - PAnsiChar(Haystack) + 1)
else
Inc(PHaystack);
end;
end;
{$ENDIF}
end;
As you can see, under Win32, it simply jumps to
System.Pos
, as that is the fastest anyway. But on all other
platforms, it searches the Haystack
4-byte-wise (if the
Needle
is larger than 4 elements), and if it found something, then
it searches the rest using CompareMem
.
Timing
Here is a slightly reformatted output of a test program (I put the WIN32 and the WIN64 columns beside each other, to save space):
Different versions of Pos(Needle, Haystack: <sometype>; Offset: Integer): Integer
where <sometype> is UnicodeString, AnsiString or TBytes
Testing with Haystack lengths of 50, 200, 3000, 4000 and 300000
and Needle lengths of 1, 3, 8 and 20
5 * 4 * 2000 = 40000 loops
WIN64 WIN32
UnicodeString UnicodeString
------------- -------------
System.Pos: 2428 ms System.Pos: 1051 ms
StrUtils.PosEx: 2258 ms StrUtils.PosEx: 1070 ms
RVPosExU: 1071 ms RVPosExU: 1050 ms
AnsiString AnsiString
---------- ----------
System.Pos: 4956 ms System.Pos: 1046 ms
AnsiStrings.PosEx: 4959 ms AnsiStrings.PosEx: 1051 ms
OrgPosA: 5129 ms OrgPosA: 5712 ms
PosUModForA: 1958 ms PosUModForA: 3744 ms
RVPosExA: 1322 ms RVPosExA: 1086 ms
TBytes TBytes
------ ------
RVPosEXB: 998 ms RVPosEXB: 2754 ms
Haystack: random string of 500000000 ASCII characters or bytes
Needle: last 10 characters of Haystack = 'WRDURJVDFA'
WIN64 WIN32
UnicodeString UnicodeString
------------- -------------
System.Pos: 847 ms System.Pos: 421 ms
Strutils.PosEx: 827 ms Strutils.PosEx: 414 ms
RVPosExU: 421 ms RVPosExU: 438 ms
AnsiString AnsiString
---------- ----------
System.Pos: 1735 ms System.Pos: 428 ms
AnsiStrings.PosEx: 1831 ms AnsiStrings.PosEx: 428 ms
OrgPosA: 1749 ms OrgPosA: 2687 ms
PosUModForA: 708 ms PosUModForA: 1525 ms
RVPosExA: 368 ms RVPosExA: 423 ms
RvPosExA(,,Offset): 200 ms RvPosExA(,,Offset): 220 ms
TBytes TBytes
------ ------
RVPosExB(TBytes): 385 ms RVPosExB(TBytes): 1095 ms
The routines RVPosExA
, RVPosExU
and RVPosExB
are my implementations for AnsiString
, UnicodeString
and TBytes
respectively. OrgPosA
is the original code for Pos
for AnsiString
, while PosUModForA
is the original PUREPASCAL code for Pos
for UnicodeString
, modified for AnsiString
.
As you can see, the PosUModForA
routine is almost twice as fast as the rather braindead OrgPosA
, and in WIN32, the RVPosEx<A/U/B>
implementations are faster than the others.
I didn't check, but it is well possible that one of the plain Pascal versions of the FastCode project is faster. But for me, this implementation is a start and proof, that with a few simple optimizations string routines could be made faster. Perhaps, one day, Embarcadero will adopt more of the plain Pascal code from the FastCode project.
The code for the routines and the program that produces the output above can be downloaded from my website.