Delphi against the crazy C world

One thing drives me crazy. Why has the ugly C-style syntax become so popular and widespread in today's world, while the beautiful Pascal syntax sits near the bottom of the popularity lists? Moreover, many new programming languages appear today, and it seems to me the authors of those languages are competing with each other for the prize in the category of who can figure out the weirdest way of coding.

Let’s do some math about the SprutCAM project. Today the source base of SprutCAM is about one and a half million lines of code, and it took 10 years to develop. As far as I know, the effective development team never exceeded 5 people. 1’500’000 lines of code, divided by 10 years of 50 weeks with 40 work hours each and by a team of 5, equals 15 lines of code an hour.

What it looks like:

  result := false;
  at := self.GetArcType;
  case at of
    atL_L: begin
        Result := CalcArcLine_Line(ln1.p, ln1.tt, ln2.p, ln2.tt, s1, s2, fInitRc,
          ap1, ap2, ac, aR);
        if not result and
          IsEqD(1, abs(Vec_mul_Vec(ln1.tt, ln2.tt)), Increment)
        then begin
          result := true;
          ap1 := ProjectPointRay(self.pc, ln1.p, ln1.tt);
          ap2 := ProjectPointRay(self.pc, ln2.p, ln2.tt);
          ac := 0.5*(ap1+ap2);
          aR := VecLen(ap1, ac);
        end;

You can see there are a lot of spaces in each line. Your grandma can type 15 lines of code an hour! This example makes one simple thing obvious: programming is not about typing. I would argue that programming is much more about reading code and debugging it than typing it. However, the authors of C-style languages think differently! They think that by replacing the human-language words begin and end with curly braces {} and omitting then in the if statement they save hours of programmers’ lives. Here is an example of C code.

   stbtt_uint16 format = ttUSHORT(data + index_map + 0);
   if (format == 0) { // apple byte encoding
      stbtt_int32 bytes = ttUSHORT(data + index_map + 2);
      if (unicode_codepoint < bytes-6)
         return ttBYTE(data + index_map + 6 + unicode_codepoint);
      return 0;
   } else if (format == 6) {
      stbtt_uint32 first = ttUSHORT(data + index_map + 6);
      stbtt_uint32 count = ttUSHORT(data + index_map + 8);
      if ((stbtt_uint32) unicode_codepoint >= first && (stbtt_uint32) unicode_codepoint < first+count)
         return ttUSHORT(data + index_map + 10 + (unicode_codepoint - first)*2);
      return 0;
   } else if (format == 2) {
      STBTT_assert(0); // @TODO: high-byte mapping for japanese/chinese/korean
      return 0;
   } else if (format == 4) { // standard mapping for windows fonts: binary search collection of ranges

Just look at those two code snippets! Which one do you prefer?


Parallel programming with Delphi part II. Resolving race conditions

In this post I will reveal some simple yet efficient techniques to address and resolve race conditions in parallel algorithms.

Do you remember my shiny parallel version of the triangle intersection code?

  procedure IntersectTriangles(i: integer; Triangle: TTriangles);
  begin
    Grid.GetTriangleExtents(Triangle[i], ixMin, ixMax, iyMin, iyMax);
    for iy := iyMin to iyMax do begin
      for ix := ixMin to ixMax do begin
        if RayTraceTriangle(Triangle[i], x(ix), y(iy), z) then begin
          Grid[ix,iy].AddIntPoint(Triangle[i], z);
        end;
      end;
    end;
  end;

  Parallel.ForI(0, Triangle.Count-1, @IntersectTriangles, @Triangle);

There is one problem with it. Imagine such a situation: two threads are trying to add intersection points with the same [ix, iy] to the Grid at the same time. What result will we get? Most likely only one intersection point will be added and the whole algorithm will fail. Such situations are usually called race conditions.

Basically, there is only one thing that can lead to race conditions in your parallel code: writing to memory shared across threads. Typical examples of shared memory include input and output variables, counters, storage, etc. While working on my own problem I encountered only some of them. Here I will just show you my actual solutions to the actual problems, but first enjoy my boring lecture.

Atomic instructions

I started thinking about parallel programming a long time ago. I would look at the modules I had written and wonder how to rewrite existing, heavily optimized sequential code into an equally efficient parallel version. And almost every time I got stuck, because I couldn’t figure out how to avoid writing to shared memory. Then I started thinking that avoiding writes to shared memory is impossible, that this is a common problem, and that there should be existing solutions for it. So I started digging through the internet, MSDN, programming manuals, etc. What I found at first were mutexes, semaphores, critical sections, and many other scary words, the meaning of which I still do not understand. The proposed solutions seemed to me cumbersome and inefficient at the same time. Take critical sections: what a stupid idea it is to serialize a piece of potentially thread-unsafe code every time you call it, when the probability of a race condition is one in a thousand! I am pretty sure that critical sections would make my parallel code many times slower than the sequential version.

So I continued my research and I found it – the LOCK. LOCK is an instruction prefix in x86 assembly language. It turns ordinary arithmetic instructions such as inc, dec, add, sub and others into their atomic versions.

This is the right time to explain the term atomic.

Every CPU instruction decomposes into more elementary operations. For example, the inc dword ptr [eax] instruction involves at least three steps.

  1. Read a DWORD from the memory location stored at eax into an accumulator.
  2. Increment the accumulator.
  3. Write the result back to the memory.

That’s why when several threads are operating on the same variable at the same time there is no guarantee that the resulting variable value will be correct.
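To make the lost update concrete, here is a small C sketch (C is used for all the snippets below so they can be compiled anywhere; the function name is made up) that interleaves the three steps of two logical threads in the worst possible order:

```c
#include <stdio.h>

/* Simulate two threads both executing "counter := counter + 1",
   with their read/increment/write steps interleaved in the worst order. */
int lost_update_demo(void) {
    int counter = 0;
    int acc1 = counter;  /* thread 1: read into accumulator */
    int acc2 = counter;  /* thread 2: read into accumulator */
    acc1 = acc1 + 1;     /* thread 1: increment */
    acc2 = acc2 + 1;     /* thread 2: increment */
    counter = acc1;      /* thread 1: write back */
    counter = acc2;      /* thread 2: write back, overwriting thread 1 */
    return counter;
}
```

Both threads read 0 and both write 1, so one increment is lost.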

    Unlike ordinary instructions, atomics are executed as a whole, without subdivision into more elementary operations (hence their name). This fact makes them thread safe. On the x86 architecture atomics are implemented either by locking the system bus or via the cache coherency mechanism when appropriate. Hence the LOCK prefix.
    In the Windows API the atomic instructions are referred to as interlocked. You have probably seen or used the InterlockedIncrement, InterlockedDecrement functions in the standard implementations of the IUnknown.AddRef, IUnknown.Release methods. Windows offers wrappers for all atomic instructions, but I wrote my own versions of some of them. Here they are.

    procedure AtomicInc(var i: integer);
    procedure AtomicDec(var i: integer);
    function AtomicAdd(Addend: integer; var Dest: integer): integer;
    function AtomicSwap(Data: pointer; var Dest): pointer; overload;
    function CAS32(oldValue: integer; newValue: integer; var destination): boolean;

Atomic counters

procedure AtomicInc(var i: integer); //i := i+1;
asm  
  lock inc dword ptr[eax];
end;
procedure AtomicDec(var i: integer); //i := i-1;
asm  
  lock dec dword ptr[eax];
end;

Sometimes you just need to count something without caring about the actual value of the number. Then just use the AtomicInc, AtomicDec functions.
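As a sanity check of the idea, a small C/pthreads sketch (thread and iteration counts are arbitrary) shows that an atomic counter incremented from several threads always ends up exact:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NTHREADS 4
#define NITERS   100000

static atomic_int g_counter;

static void *worker(void *arg) {
    (void)arg;
    for (int k = 0; k < NITERS; k++)
        atomic_fetch_add(&g_counter, 1);  /* compiles to a locked inc/xadd on x86 */
    return NULL;
}

/* Run 4 threads of 100000 increments each; with atomics the total is exact. */
int count_with_atomics(void) {
    pthread_t t[NTHREADS];
    atomic_store(&g_counter, 0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&g_counter);
}
```

With a plain non-atomic counter the same run would lose updates and return less than 400000.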

You have already seen how I used atomic counters in my implementation of the Parallel For cycle to increment and decrement fRunCounter.

procedure TSTThreadPool.ForI(FromIndex, ToIndex: integer; const Func:
  TParallelFunction; UserData: pointer);
begin
  StartWorkers;
  try
    while (fTaskCounter>0) or (fRunCounter>0) do begin
      ParallelForExecute(0)
    end;
  finally
    StopWorkers;
  end;
end;

procedure TSTThreadPool.ParallelForExecute(tdi: integer);
var
  i: integer;
begin
  if (fTaskCounter>0) then begin
    AtomicInc(fRunCounter);
    i := AtomicAdd(-1, fTaskCounter);
    if i>=0 then begin
      fParallelFunction(fToIndex-i, fParallelData);
    end;
    AtomicDec(fRunCounter);
  end;
end;

Atomic add

function AtomicAdd(Addend: integer; var Dest: integer): integer;
asm
  //Dest := Dest+Addend;
  //result := Dest;
  mov ecx, eax;
  lock xadd [edx], eax;
  add eax, ecx;
end;

The AtomicAdd function is much more powerful than AtomicInc or AtomicDec, as it can return the result of the addition. This is possible because internally it is implemented not with the add instruction, but with the xadd instruction, which performs two operations: it adds the two numbers and exchanges the operands, so the old value ends up in the register.
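The same old-value-plus-addend trick can be expressed in C11, where atomic_fetch_add has exactly the xadd semantics (it returns the previous value); the wrapper name here is made up:

```c
#include <stdatomic.h>

/* atomic_fetch_add mirrors x86 xadd: it returns the PREVIOUS value.
   Adding the addend afterwards gives the NEW value, just like the
   final "add eax, ecx" in the Delphi AtomicAdd. */
int atomic_add_new_value(int addend, atomic_int *dest) {
    int previous = atomic_fetch_add(dest, addend);
    return previous + addend;
}

/* self-check: start at 10, add 5 -> returns 15 and dest holds 15 */
int demo(void) {
    atomic_int d = 10;
    int nv = atomic_add_new_value(5, &d);
    return nv * 100 + atomic_load(&d);  /* 15 * 100 + 15 */
}
```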

AtomicAdd is very useful if you want to implement a lock free thread safe array. Here is a simple example, assuming you know the maximum number of items in advance and you only add items.

TThreadSafePointerArray = class
protected
  fTop, fMaxCount: integer;
  fpp: array of pointer;
public
  function Add(p: pointer): integer;
end;

function TThreadSafePointerArray.Add(p: pointer): integer;
begin
  result := AtomicAdd(1, fTop); //AtomicAdd returns the new value, so fTop must start at -1
  ASSERT(result<fMaxCount, 'Array bounds exceeded!');
  fpp[result] := p;
end;
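Here is a sketch of the same reserve-a-slot idea in C/pthreads (sizes and names are arbitrary): fetch_add hands each thread a unique index, so no two writers ever touch the same slot:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define SLOTS      4000
#define NTHREADS   4
#define PER_THREAD (SLOTS / NTHREADS)

static int slots[SLOTS];
static atomic_int top;  /* next free index; fetch_add reserves a unique slot */

static void *pusher(void *arg) {
    int base = *(int*)arg;
    for (int k = 0; k < PER_THREAD; k++) {
        int idx = atomic_fetch_add(&top, 1);  /* atomically claim a slot */
        slots[idx] = base + k;
    }
    return NULL;
}

/* Fill the array from 4 threads with the values 0..3999, then sum them.
   Every value lands in exactly one slot, so the sum is exact. */
long long fill_and_sum(void) {
    pthread_t t[NTHREADS];
    int bases[NTHREADS];
    atomic_store(&top, 0);
    for (int i = 0; i < NTHREADS; i++) {
        bases[i] = i * PER_THREAD;
        pthread_create(&t[i], NULL, pusher, &bases[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    long long sum = 0;
    for (int i = 0; i < SLOTS; i++)
        sum += slots[i];
    return sum;  /* 0+1+...+3999 = 7998000 */
}
```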

Atomic swap

function AtomicSwap(Data: pointer; var Dest): pointer; overload;
asm
  //result := Dest;
  //Dest := Data;
  lock xchg [edx], eax;
end;

AtomicSwap replaces a DWORD variable Dest (int32, pointer32, etc.) with your Data (in this case, a pointer) and returns the previous value of the variable in the result.

What is it useful for? For example, for thread safe destruction of objects and releasing interface references.

procedure FreeAndNilObject(var Obj);
var
  Temp: TObject;
begin
  if Pointer(Obj)<>nil then begin //untyped var param: compare via a cast
    Temp := TObject(AtomicSwap(nil, Obj));
    if Temp<>nil then
      Temp.Destroy;
  end;
end;

procedure FreeAndNilInterface(var Intf);
var
  Temp: IUnknown;
begin
  if Pointer(Intf)<>nil then begin //untyped var param: compare via a cast
    Temp := IUnknown(AtomicSwap(nil, Intf));
    Temp := nil;
  end;
end;
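The underlying swap-out-then-clean-up pattern can be sketched in C11 with atomic_exchange (the resource and function names are illustrative): whichever caller swaps the pointer to NULL wins the right to free it, so the cleanup runs exactly once:

```c
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic(void*) g_resource;

/* Atomically take ownership of the pointer. Only the caller that
   swapped out the non-NULL value frees it; everyone else sees NULL. */
int release_once(void) {
    void *old = atomic_exchange(&g_resource, (void*)0);
    if (old != NULL) {
        free(old);
        return 1;  /* we freed it */
    }
    return 0;      /* someone else already did */
}

/* self-check: the first call frees, the second is a no-op */
int demo(void) {
    g_resource = malloc(16);
    int first  = release_once();
    int second = release_once();
    return first * 10 + second;  /* expect 10 */
}
```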

Atomic Compare and Swap/CAS

function CAS32(oldValue: integer; newValue: integer; var destination): boolean;
asm
  //if destination=oldValue then begin
  //  destination := newValue;
  //  result := true;
  //end else
  // result := false;
  lock cmpxchg dword ptr [destination], newValue
  setz  al
end;

The most powerful atomic function is the so-called Compare And Swap routine, or simply CAS. It takes three parameters: OldValue, NewValue and Destination. If the value of the destination variable equals OldValue, then CAS replaces it with NewValue and returns true; otherwise it returns false.

CAS is widely used in thread safe implementations of dynamic data structures such as queues, stacks, etc. Here is an example of how CAS is used to safely insert a node into a queue (note: it works only if you are not trying to add and remove nodes at the same time).

TThreadSafeQueue = class
type
  PQueueNode = ^TQueueNode;
  TQueueNode = record
    next: PQueueNode;
    Data: TQueueData;
  end;
protected
  fFirstNode: PQueueNode;
  fStorage: TThreadSafeStorage;
public
  procedure PushFront(const Data: TQueueData);
end;

procedure TThreadSafeQueue.PushFront(const Data: TQueueData);
var
  NewNode: PQueueNode;
begin
  NewNode := fStorage.GetNew;
  NewNode.Data := Data;
  repeat
    NewNode.next := fFirstNode;
  until CAS32(integer(NewNode.next), integer(NewNode), fFirstNode);
end;
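The same retry-until-CAS-succeeds push can be sketched in C11 as a lock-free stack push (all names here are made up). The loop snapshots the head, links the new node in front of the snapshot, and lets CAS publish the node only if the head is still what we read:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct node { int data; struct node *next; };

static _Atomic(struct node*) g_head;

/* compare_exchange_weak reloads 'old' on failure, so the loop
   simply re-links and retries until the head hasn't moved. */
static void push_front(struct node *n) {
    struct node *old = atomic_load(&g_head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&g_head, &old, n));
}

#define NTHREADS   4
#define PER_THREAD 1000

static struct node pool[NTHREADS * PER_THREAD];

static void *pusher(void *arg) {
    struct node *base = arg;
    for (int k = 0; k < PER_THREAD; k++)
        push_front(&base[k]);
    return NULL;
}

/* Push 4000 nodes from 4 threads, then walk the list: no node may be lost. */
int push_all_and_count(void) {
    pthread_t t[NTHREADS];
    atomic_store(&g_head, (struct node*)0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, pusher, &pool[i * PER_THREAD]);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    int n = 0;
    for (struct node *p = atomic_load(&g_head); p != NULL; p = p->next)
        n++;
    return n;
}
```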

Lock/Unlock

The above examples are widespread but still special cases. In more complicated situations you need to serialize thread-unsafe code. A super simple, elegant and efficient way to do so is to use spinlocks, which again are easily implemented with the atomic CAS and dec instructions.

Here is my humble implementation of spinlocks.

procedure Lock(var Lk: integer);
asm
  //while not CAS32(0, 1, Lk) do;
  mov edx, eax
  mov ecx, 1
@Loop:
  mov eax, 0
  lock cmpxchg dword ptr [edx], ecx
  jnz @Loop
end;

procedure UnLock(var Lk: integer);
asm
  lock dec dword ptr[eax];
end;

The Lock function takes one “ByRef” argument, Lk, which is intended to be a lock variable. A lock variable is a DWORD which equals zero in the unlocked state and 1 in the locked state. The Lock function spins in a tight loop until it manages to change the Lk value from 0 to 1 using the CAS instruction. The UnLock function just decrements the Lk variable by one, thus returning it to the unlocked 0 state.
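A portable sketch of the same spinlock in C11 (the unlock here is a plain atomic store of 0 rather than a decrement, which is the more common idiom; all names are made up). The point is that a plain, non-atomic counter becomes safe once every access happens between lock and unlock:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int lk;        /* 0 = unlocked, 1 = locked */
static int g_plain_counter;  /* deliberately NOT atomic; the lock protects it */

/* Spin until we manage to change the lock from 0 to 1, CAS-style. */
static void spin_lock(atomic_int *l) {
    int expected;
    do {
        expected = 0;
    } while (!atomic_compare_exchange_weak(l, &expected, 1));
}

static void spin_unlock(atomic_int *l) { atomic_store(l, 0); }

static void *worker(void *arg) {
    (void)arg;
    for (int k = 0; k < 50000; k++) {
        spin_lock(&lk);
        g_plain_counter++;  /* safe: only one thread is inside at a time */
        spin_unlock(&lk);
    }
    return NULL;
}

/* 4 threads of 50000 locked increments each: the total is exact. */
int spinlock_demo(void) {
    pthread_t t[4];
    g_plain_counter = 0;
    atomic_store(&lk, 0);
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return g_plain_counter;
}
```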

Here is a good example of the power of the Lock/Unlock pair.

type
  TIntPointGrid = class
  type
    PIntPointArray = ^TIntPointArray;
    TIntPointArray = record
      Lk: integer;
      pp: array of TIntPoint;
    end;
  protected
    fPP: array of array of TIntPointArray;
  public
    procedure AddIntPoint(ix, iy: integer; const p: TIntPoint);
  end;

procedure TIntPointGrid.AddIntPoint(ix, iy: integer; const p: TIntPoint);
var
  ip: PIntPointArray;
  n: integer;
begin
  ip := @fPP[ix, iy];
  Lock(ip.Lk);
  try
    n := length(ip.pp);
    SetLength(ip.pp, n+1);
    ip.pp[n] := p;
  finally
    UnLock(ip.Lk);
  end;
end;

That is a possible solution (not the actual one) for my add-intersection problem. I just added the Lk field to my intersection points array data structure and used it as a lock variable. Notice that the worker threads are locked only when they are trying to add intersection points to the same cell, which is a rare case. If I used critical sections, I would be forced either to create NxM of them for all of the grid cells, which is a waste, or to wrap the whole AddIntPoint function in one critical section, which would lead to poor performance.

Completely avoiding race conditions

Using atomics and spinlocks to resolve memory conflicts is good but even better is to completely avoid race conditions. The good news is that I’ve invented an easy way to do it in most cases.

As you know, I’ve implemented my own version of the Parallel For cycle. So I am free to modify it any way I want. What I’ve done is

  • I’ve introduced the global variable tdn: integer, which holds the maximum number of worker threads in my Parallel.ForI.
  • Each worker thread has an index greater than or equal to zero and less than tdn.
  • The TParallelFunction2 callback now takes one more parameter – ThreadIndex, the index of the calling thread.
    TParallelFunction2 = procedure(i: integer; TdI: integer; UserData: pointer);
    
  • Knowing tdn and the ThreadIndex, you can create tdn supporting data structures instead of one, so that each thread deals with its own data set. Then, at the final stage, you just collect the generated data from the tdn structures into one. Nice and easy. Here is an example, which computes a bounding box of some objects in a parallel loop.

    function CalculateBox(const Objects: array of IObjectWithBox): TBox;
    type
      PBoxData = ^TBoxData;
      TBoxData = record
        Objects: array of IObjectWithBox;
        bb: array of TBox;
      end;

      procedure CalcBox(i: integer; tdi: integer; BoxData: PBoxData);
      begin
        BoxData.bb[tdi].AddBox(BoxData.Objects[i].GetBox);
      end;
    
    var
      i: integer;
      BoxData: TBoxData;
    begin
      result.IsEmpty := true;
      BoxData.Objects := Objects;
      SetLength(BoxData.bb, Parallel.tdn);
      for i := 0 to Parallel.tdn-1 do
        BoxData.bb[i].IsEmpty := true;
      try
        Parallel.ForI(0, High(Objects), @CalcBox, @BoxData);
      finally
        for i := 0 to Parallel.tdn-1 do
          result.AddBox(BoxData.bb[i]);
      end;
    end;
    
    

As you can see, I just create tdn temporary boxes, so that each parallel worker fills its own box; then at the final stage I add the temporary boxes into the result. Many things can be done in a similar way.
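Here is the same per-thread-accumulator pattern sketched in C/pthreads (a parallel sum instead of a bounding box; the tdn value and the data are made up). Each thread writes only to its own slot of the partial array, so there are no shared writes at all, and the final reduction happens after the joins:

```c
#include <pthread.h>
#include <stddef.h>

#define TDN 4   /* number of worker threads, like the author's tdn */
#define N   1000

static long long partial[TDN];  /* one accumulator per thread: no shared writes */

struct slice { int tdi; };

static void *sum_slice(void *arg) {
    int tdi = ((struct slice*)arg)->tdi;
    long long s = 0;
    /* interleaved split: thread tdi handles indices tdi, tdi+TDN, tdi+2*TDN, ... */
    for (int i = tdi; i < N; i += TDN)
        s += i;
    partial[tdi] = s;
    return NULL;
}

/* Sum 0..999 in parallel, then reduce the TDN partial sums sequentially. */
long long parallel_sum(void) {
    pthread_t t[TDN];
    struct slice sl[TDN];
    for (int i = 0; i < TDN; i++) {
        sl[i].tdi = i;
        pthread_create(&t[i], NULL, sum_slice, &sl[i]);
    }
    long long total = 0;
    for (int i = 0; i < TDN; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];  /* the "collect into one" stage */
    }
    return total;  /* 0+1+...+999 = 499500 */
}
```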

Conclusion

Parallel programming is a little bit tricky, mostly due to shared memory conflicts and things like that. Even such common operations as assigning a value to a variable or destroying an object may require special treatment when used in parallel algorithms. On the other hand, parallel programming is all about performance.

That’s why

  1. Use tdn and tdi in Parallel.ForI and create separate data structures for each thread to completely avoid memory conflicts.
  2. Use atomics and spinlocks instead of critical sections and other crap. This way you will lock the data, not the code, and only when it is really necessary. Here is the source code, which includes STParallel.pas with Parallel.ForI and atomics, and PageStorage.pas with a thread safe memory manager.

Parallel programming in Delphi part I. Parallel For

Today multicore CPUs are commonplace, yet many programmers still haven’t written a single line of parallel code. Neither had I until recently. In this post I will try to share the results of my first experience of parallel programming in Delphi. Maybe it will help someone start implementing parallel algorithms themselves.

Parallel.ForI()

This summer I was working on the voxel 5d NC simulation module. In this module I had a code that looked like this:

  for i := 0 to Triangle.Count-1 do begin
    Grid.GetTriangleExtents(Triangle[i], ixMin, ixMax, iyMin, iyMax);
    for iy := iyMin to iyMax do begin
      for ix := ixMin to ixMax do begin
        if RayTraceTriangle(Triangle[i], x(ix), y(iy), z) then begin
          Grid[ix,iy].AddIntPoint(Triangle[i], z);
        end;
      end;
    end;
  end;

So what I had to do was parallelize the outer loop. I know there are many threading libraries with a parallel for out there (actually I know only one for Delphi – the OmniThreadLibrary – which is free and open source). In spite of this I decided to write my own stuff (the reason was that I could not compile the OTL in BDS 2006, as it was designed for later Delphi versions).

The goal was to write a unit STParallel.pas which would look like this:

unit STParallel;

interface

type
  TParallelFunction = procedure(i: integer; UserData: pointer);

  TSTThreadPool = class
  public
    procedure ForI(FromIndex, ToIndex: integer; const Func: TParallelFunction; UserData: pointer);
  end;

var
  Parallel: TSTThreadPool = nil;

So I could rewrite my triangles intersection routine in the form of

  procedure IntersectTriangles(i: integer; Triangle: TTriangles);
  begin
    Grid.GetTriangleExtents(Triangle[i], ixMin, ixMax, iyMin, iyMax);
    for iy := iyMin to iyMax do begin
      for ix := ixMin to ixMax do begin
        if RayTraceTriangle(Triangle[i], x(ix), y(iy), z) then begin
          Grid[ix,iy].AddIntPoint(Triangle[i], z);
        end;
      end;
    end;
  end;

  Parallel.ForI(0, Triangle.Count-1, @IntersectTriangles, @Triangle);

To implement such a class you should decide how to

  • determine the number of available cores and create an array/pool of windows threads,
  • suspend and wake up the threads all at once,
  • subdivide the task into smaller pieces between the threads.
    Let’s start. The first point is pretty simple. The number of available cores is returned by the Windows.GetSystemInfo function.

    function GetProcessorNum: integer;
    var
      si:    _SYSTEM_INFO;
    begin
      GetSystemInfo(si);
      Result:=si.dwNumberOfProcessors;
    end;
    

    For the threads I have created the TSTWorkerThread class

    interface
    
      TExecuteFunction = procedure(tdi: integer) of object;
    
      TSTWorkerThread = class(TThread)
      public
        Constructor Create;
        Destructor Destroy; override;
        Procedure Execute; override;
        property OnExecute: TExecuteFunction read fOnExecute write fOnExecute;
        property IsRunning: boolean read fIsRunning write fIsRunning;
        property Index: integer read fIndex write fIndex;
      end;
    
    implementation
    
    procedure TSTThreadPool.InitWorkers;
    var
      i: integer;
    begin
      tdn := STDef.GetProcessorNum;
      SetLength(fWorkers, tdn);
      for i := 1 to High(fWorkers) do begin  //Notice! I’m creating tdn-1 threads
        fWorkers[i] := TSTWorkerThread.Create;
        fWorkers[i].index := i;
        fWorkers[i].OnExecute := self.ParallelForExecute;
        fWorkers[i].Resume;
      end;
    end;
    
    procedure TSTWorkerThread.Execute;
    begin
      while not self.Terminated do
        fOnExecute(fIndex);
    end;
    
    procedure TSTThreadPool.ParallelForExecute2(tdi: integer);
    begin
      ...
    end;
    

The solution for the second point I’ve borrowed from the OTL. They use the SetEvent, ResetEvent and WaitForSingleObject routines. So I did.

constructor TSTThreadPool.Create;
begin
  fStartEvent := Windows.CreateEvent(nil, true, false, nil);
end;

procedure TSTThreadPool.StartWorkers;
begin
  while not Windows.SetEvent(fStartEvent) do;
end;

procedure TSTThreadPool.StopWorkers;
begin
  while not Windows.ResetEvent(fStartEvent) do;
end;

procedure TSTThreadPool.ForI(FromIndex, ToIndex: integer; const Func:
  TParallelFunction2; const UserData: pointer);
begin
  fParallelFunction := Func;
  fParallelData := UserData;
  StartWorkers;
  try
    ScheduleForTask(FromIndex, ToIndex);
  finally
    StopWorkers;
  end;
end;

procedure TSTThreadPool.ParallelForExecute(tdi: integer);
var
  i: integer;
begin
  if GetForIndex(tdi, i) then begin
    fParallelFunction(i, fParallelData);
  end else if tdi>0 then begin
    WaitForSingleObject(fStartEvent, windows.INFINITE);
  end;
end;

Thread scheduling

The most interesting part is thread scheduling. Here is the problem: we have to call the loop body function N times while having M threads. How can we distribute those N tasks between M threads to minimize the execution time of our Parallel.ForI function? Obviously the time is equal to the execution time of the slowest thread plus some overhead. We should also consider the fact that iterations can take different amounts of time (e.g. when intersecting triangles, some triangles are small, some are large, and some may lie far away and will be pruned quickly).

Another question is, what the calling process will be doing meanwhile: will it just wait until the workers finish or will it do some work too?

The answers depend on actual use cases of our class. Thus I implemented four different scheduling strategies. Those are:

  • Atomic,
  • Log. atomic,
  • Divide and conquer,
  • Interleaved.

        Atomic strategy

        This was the first strategy I’ve implemented. The reason is that theoretically the strategy achieves the most even workload distribution between threads. Here is the implementation.

        procedure TSTThreadPool.ForI(FromIndex, ToIndex: integer; const Func:
          TParallelFunction; UserData: pointer);
        begin
          fFromIndex := FromIndex;
          fToIndex := ToIndex;
          fTaskCounter := ToIndex-FromIndex+1;
          fParallelFunction := Func;
          fParallelData := UserData;
          StartWorkers;
          try
            while (fTaskCounter>0) or (fRunCounter>0) do begin
              ParallelForExecute(0)
            end;
          finally
            StopWorkers;
          end;
        end;
        
        procedure TSTThreadPool.ParallelForExecute(tdi: integer);
        var
          i: integer;
        begin
          if (fTaskCounter>0) then begin
            AtomicInc(fRunCounter);
            i := AtomicAdd(-1, fTaskCounter);
            if i>=0 then begin
              fParallelFunction(fToIndex-i, fParallelData);
            end;
            AtomicDec(fRunCounter);
          end else if tdi>0 then begin
            WaitForSingleObject(fStartEvent, windows.INFINITE);
          end;
        end;
        

        How it works.

      The fTaskCounter variable determines the count of remaining iterations.

      The calling process and the worker threads are all executing the same ParallelForExecute function. So we use only N-1 worker threads for an N-core system.

      The task is completed when fTaskCounter is zero and all the workers have finished their subtasks. To determine the latter moment I’ve introduced the fRunCounter variable. Every worker increments it when it starts executing a task and decrements it at the end. So the exit condition is (fTaskCounter=0) and (fRunCounter=0).

      The work stealing process is accomplished in the lines

          i := AtomicAdd(-1, fTaskCounter);
          if i>=0 then begin
            fParallelFunction(fToIndex-i, fParallelData);
          end;
      

      The AtomicAdd function just decrements the task counter by one and returns the result. The term “atomic” means the operation is locked: when two or more threads try to decrement the same variable at the same time, the hardware serializes them, so you always get the correct result. Ordinary arithmetic instructions don’t guarantee a correct result in such situations.

      That’s it. Theoretically the solution should demonstrate outstanding performance. But what about real life? Well, it depends… The problem is in those magic atomic functions. It seems they are not cheap at all. I thought the triangle intersection routine would be heavy enough to make the overhead look negligible. But it’s not.

      Thus the atomic strategy is ideal only for compute intensive tasks where each loop iteration takes a long time. In my case a loop iteration takes little time while the iteration count is large (1000-1000000). So I developed yet another scheduling strategy.

      Log Atomic strategy

      The idea of the Log Atomic strategy is pretty simple. What if each worker thread decrements the task counter not by one, but by some number depending on the value of the task counter? The larger the task counter is, the more we decrement. This reduces the overhead introduced by atomic functions many times while still ensuring a pretty even workload distribution.

      Playing with the step-choosing function, I found one which seems to be the right one (there is surely some math theory behind it). Here it is.

      function GetStep(TaskCounter: integer): integer;
      begin
        result := Max(1, TaskCounter div (2*ThreadNumber));
      end;
      

      For 100 iterations and 4 threads the sequence is the following: (12-11-9-8-7-6-5-5-4-4-3-3-2-2-2-2-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1). So there are only 31 decrements instead of 100. For 1000 iterations and 4 threads the sequence is (125-109-95-83-73-64-56-49-43-37-33-29-25-22-19-17-15-13-11-10-9-7-7-6-5-4-4-3-3-3-2-2-2-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1), i.e. 48 decrements instead of 1000. As you can see, the law is logarithmic.
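The sequences above are easy to reproduce; this small C sketch (function name made up) applies step := max(1, remaining div (2*threads)) until the counter is exhausted and counts the decrements:

```c
/* Count how many atomic decrements the Log Atomic strategy performs
   for a given iteration count and thread count. */
int count_decrements(int iterations, int threads) {
    int n = iterations, count = 0;
    while (n > 0) {
        int step = n / (2 * threads);  /* GetStep: remaining div (2*ThreadNumber) */
        if (step < 1) step = 1;
        n -= step;
        count++;
    }
    return count;
}
```

It yields 31 steps for (100, 4) and 48 for (1000, 4), matching the sequences above.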

      Divide and conquer and Interleaved strategies

      Sometimes you have a loop with many iterations, each taking nearly the same time. In such situations work stealing strategies may not be as efficient as simple even task subdivision. By even subdivision I mean an approach where we just take the number of tasks and divide it by the number of workers; then each worker fulfills its task range independently of the others.

      I’ve implemented two subdivision strategies: the Simple Divide and the Interleaved.

      In the Divide strategy every thread executes the continuous range of iterations [FromIndex+Worker.Index*N .. Min(ToIndex, FromIndex+(Worker.Index+1)*N-1)], where N := UpRound((ToIndex-FromIndex+1)/ThreadCount).

      In the Interleaved strategy every worker executes iterations from FromIndex+Worker.Index to ToIndex with step := ThreadCount.
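Both index schemes are easy to sanity-check. This C sketch (the bounds and thread count are arbitrary) marks every index each strategy assigns and verifies that the Divide and the Interleaved splits both cover each iteration exactly once:

```c
#define FROM 0
#define TO   102
#define TDN_ 4

static int upround(int a, int b) { return (a + b - 1) / b; }

/* Returns 1 if both strategies assign every index in [FROM..TO] exactly once. */
int coverage_ok(void) {
    int hits_div[TO + 1] = {0}, hits_il[TO + 1] = {0};
    int n = TO - FROM + 1;
    int step = upround(n, TDN_);
    for (int tdi = 0; tdi < TDN_; tdi++) {
        /* Divide: a continuous chunk per thread, the last one clamped */
        int si = FROM + step * tdi;
        int ti = si + step - 1;
        if (ti > TO) ti = TO;
        for (int i = si; i <= ti; i++) hits_div[i]++;
        /* Interleaved: every TDN_-th index starting at FROM+tdi */
        for (int i = FROM + tdi; i <= TO; i += TDN_) hits_il[i]++;
    }
    for (int i = FROM; i <= TO; i++)
        if (hits_div[i] != 1 || hits_il[i] != 1) return 0;
    return 1;
}
```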

      Here is the implementation.

      procedure RunWorkers(const w: TSTWorkerArray); inline;
      var
        i: integer;
      begin
        for i := 0 to High(w) do
          w[i].IsRunning := true;
        StartWorkers;
      end;

      procedure WaitWorkers(const w: TSTWorkerArray);
      var
        i: integer;
      begin
        i := 1;
        repeat
          if not w[i].IsRunning then inc(i);
        until i=tdn;
        StopWorkers;
      end;

      procedure TSTThreadPool.ForI(FromIndex, ToIndex: integer; const Func:
        TParallelFunction; UserData: pointer);
      begin
        fFromIndex := FromIndex;
        fToIndex := ToIndex;
        fTaskCounter := ToIndex-FromIndex+1;
        fParallelFunction := Func;
        fParallelData := UserData;
        RunWorkers(fWorkers);
        try
          ParallelForExecute(0);
        finally
          WaitWorkers(fWorkers);
        end;
      end;

      procedure TSTThreadPool.ParallelForExecute(tdi: integer);
      var
        i, si, ti, Step: integer;
      begin
        case fForType of
          pfDivide: begin
            if fWorkers[tdi].IsRunning then begin
              Step := UpRound(fTaskCounter/tdn);
              si := fFromIndex+Step*tdi;
              ti := Min(fToIndex, si+Step-1);
              for i := si to ti do
                fParallelFunction(i, fParallelData);
              fWorkers[tdi].IsRunning := false;
            end else if tdi>0 then
              WaitForSingleObject(fStartEvent, windows.INFINITE);
          end;
          pfInterleaved: begin
            if fWorkers[tdi].IsRunning then begin
              i := fFromIndex+tdi;
              while i<=fToIndex do begin
                fParallelFunction(i, fParallelData);
                i := i+tdn;
              end;
              fWorkers[tdi].IsRunning := false;
            end else if tdi>0 then
              WaitForSingleObject(fStartEvent, windows.INFINITE);
          end;
        end;
      end;

      As you can see the StartWorkers, StopWorkers functions were replaced with RunWorkers and WaitWorkers functions and we do not use atomic operations at all.

      While testing those approaches on small tasks I found that the WaitWorkers function takes a relatively long time, even when the loop iterations take the same time and distribute equally across the CPU cores. The reason seems to lie in the Windows WaitForSingleObject function: the waiting threads react to our fStartEvent with some delay and asynchronously, so they also finish asynchronously. That’s a pity.

      Conclusion

      My measurements showed that none of the Parallel.ForI versions likes small loops. What I mean is that you will never reach the theoretical 4x boost on a quad-core CPU by trying to parallelize them. In my opinion, the reason is that many critical tasks such as thread scheduling and synchronization are implemented in software rather than hardware. And this is a common problem: there is no thread library free of that drawback, as far as I know. Thus my recommendation is: parallelize only your outermost loops.

      Here is the full source code.
