64-bit code for fast generation of patterns

JohanB · Post by **JohanB** » June 11th, 2016, 1:37 am

Let me introduce myself.
My name is Johan Bontes,some people may know me as the author of Life32.

I've recently been playing around with some better algorithms and thought I've share some early code.

A suggestion to speed-up GoL calculations
Here's some x64 assembly to generate a 8x8 cell.
The code can calculate 64 cells in 52 CPU cycles, so less than 1 cycle per cell.
On a 2Ghz processor that means it can calculate 40 million blocks x 64 = 2,5 billion cells per second.
This is on AMD K12, where the theoretical throughput is 3 instructions per cycle and the code runs at the full 3 instr per cycle. It uses only pure x64 code, no SSE.
Using the SSSE3 instruction set it might be made slightly faster using the pshufb instruction.

How does it work
It works by splitting a 8x8 block into 4 vertical parts with alternating columns, block 0xx, 1xx, 2xx and block 3xx.
It then parallelly adds the bitcounts together (the bitcount of a live cell is one, because that's a binary 1 and the bitcount of a dead cell is zero).
1+1+1=3 which fits into 2 bits, so I have plenty of space.
I have now added 3 neighbors (012, 123, 234 etc).
Note that the first block holds 16 neighborcounts for 012+456 (x 8 rows), the second holds 123+567, third holds 234+678 and the last holds 345+789
Now I split the columns into 4 horizontal rows A,B,C,D and do the same trick.
Because 3x3 = 9 fits into 4 bits. I need 4 blocks for my neighbor counts so I reuse block 0xx,1xx,2xx,3xx to hold the neighbor counts of every 1st, 2nd, 3rd and 4th cell respectively.
A also keep a copies around of my original cell-states, so I can see whether a cell was dead or alive to begin with.
I then subtract the livecount from the cellcount to get the neighborcount and xor this with not(3).
In binary:
0011(3) xor(not(0011)(C)) = 1111(F) -> a guaranteed birth.
0010(2) xor(not(0011)(C)) = 1110(E) -> maybe a birth, but only the cell was live to begin with.
Now I OR the life cell: 0001/0000 back in.
1111 OR 0001 = 1111 -> guaranteed birth
1111 OR 0000 = 1111 -> " " birth
1110 OR 0001 = 1111 -> birth.
1110 OR 0000 = 1110 -> death
All other inputs will generate something other than all 1's.
Any sequence of 4 bits with a zero somewhere gets condensed into a single 0 (death) and 4 ones condense into a 1 (live).
I combine all this into a single 8x8 64-bit result and I'm done.

This means I can calculate 64 cells using 3 instructions x 4 blocks. These 12 instructions run in 4 cycles!
If I had to use a lookup table I'd need to do (64 / (2x2)) = 16 lookups.
If I'm really really lucky that will take me 32 cycles.
If I have a cache miss, it will take me 100's of cycles more.
The lookup table would save me the counting, but I'd still need to extract the 4x4 blocks for the lookup and put the pieces back together.
So the rest of the code would not be faster using a lookup table.

Note that I'm lucky that 23/3 turns out to be so easy to do in code. Just 3 instructions!

Oh and just Like Hershel life, QLife, Life32 and Golly the generations are stagger stepped, so I only need 3 neighbor cells to get the 66x66 input block needed to generate a 64x64 cell result. 2 bit wide N,W,NW rows come from the neighbors and the S,E,SE,NE,SW rows comes from the cell itself.
The result cell is shifted upwards and to the side by a single pixel.

In the next generation I do the same thing, but the shifting is exactly the other way round.

The code is a bit confusing, because I keep running out of CPU registers to store all the intermediate data in.
Also I try to avoid dependency chains, use simple instructions and try to keep the code size small.

To recap:
this code basically runs the naive loop: for x:= -1 to 1 do for y:= -1 to 1 do cellcount += grid[x,y];
for every cell, except that it does this in parallel for 16 cells at a time and it abuses the fact that a CPU can run 3 instructions superscalar to run nearly 4 of those parallel calculations in a single CPU cycle.
The code does not do ifs, does not do loops and does not access memory, everything is does in registers.
So there are no mispredicted branches and no cache misses.

Code: Select all

//155 instructions for 64 Cells = 2.5 instructions per Cell.
//52 CPU cycles = 3 instr per cycle on AMD K12

function Generate_PtoQ(NW,NE,SW,SE_: Int64):Int64; //Note that SE_ is the input cell itself.
asm
  push rdi   //rcx: A
  push rbx   //rdx: B
  push rsi   //r8:  C
  push r14   //r9:  D
  push r15   // Layout of a Cell is
  push r12   //  A  B
  push r13   //  C  D
  push rbp

  mov rbp,$0303030303030303  //The rightmost 2 columns of A and C
  shl rdx,(64-16)            //Keep only the two lower rows of B
  and r8,rbp                 //Rightmost 2 columns of C
  mov rsi,$1111111111111111  //counting mask
  lea rdi,[rsi+rsi*4]        //$55555555  //fast counting mask
  shl r8,(8-2)               //shift out the 6 columns we don't need
  and ecx,ebp            //The rightmost 2 columns of A
  shl rcx,(64-16+(8-2))  //shift out the top 6 rows and the 6 columns we don't need
  not rbp                //mask for leftmost 6 columns of D/B to be combined with C/A

@CountNeighborsPerRow:      //count C+D, interleaved with count A+B
  ///////////Count the first 3 columns.
  ///
  //count 3 neighbors in 4 columns at once
  mov r10,r9  //Cache D'
  mov rax,rdi  //Cache D"
  and rax,r9   //r8= bits in column0246
  shr r10,1
  and r10,rdi  //r10= bits in column 1357
  mov r11,rdi
  and r9,rbp        //Mask off lower 2 columns
  shr r9,2
  lea rbx,[r8+r9]  //add colums 8+9
  and r11,rbx  //r11= bits in column 2468
  //Add all the bits together
  lea r9,[r11+r10] //add bits 1357+2468 together
  lea r8,[rax+r9]  //r9 = NCount 1357  (0246+1357+2468)
  //////////////Count the next 3 columns
  //Do the same for the neighborhood counts in column
  //r11 is live Cells in 2468
  xor rbx,r11    //23456789 -/- 2468 = columns 3579
  shr rbx,1     //rbx=bitcount in column 3579
  add r9,rbx    //column 3579+2468+1357 = NCount for columns 2468
  ///  register use:
  ///  r8: NCount 1357
  ///  r9: NCount 2468
  ///  r10: live bits in 1357
  ///  r11: live bits in 2468


    ///////////Count the first 3 columns.
  mov r12,rdx   //break the dependency chain by caching B
  mov r13,rdx   //count 3 neighbors in 4 columns at once
  and r13,rdi   //r13= bits in column0246
  shr rdx,1
  and rdx,rdi  //r14= bits in column 1357
  mov r15,rdi
  and r12,rbp        //Mask off rightmost 2 columns
  shr r12,2
  add rcx,r12  //add colums 8+9
  and r15,rcx  //r15= bits in column 2468
  //Add all the bitcounts together
  lea rax,[r15+rdx] //1357+2468
  lea r12,[rax+r13] //r12 = NCount 1357  (0246+1357+2468)
  //////////////Count the next 3 columns
  //Do the same for the neighborhood counts in column
  //r15 is bitcount in 1357+2468
  //xor rcx,r15    //23456789 -/- 2468 = columns 3579
  shr rcx,1     //rcx=bitcount in column 3579
  and rcx,rdi
  lea r13,[rcx+rax]  //and add column 2468+1357 = NCount for columns 2468

  //Now we're all done with the neighborhood counts in ABC and D
  ///  Register use
  ///  r12: NCount 1357
  ///  r13: NCount 2468
  ///  rdx: live bits in 1357
  ///  r15: live bits in 2468
  lea rsi,[rsi+rsi*2] //Mask 333333333
  mov rbp,rsi
  not rbp             //Mask CCCCCCC

  @ExtractLiveCells:
  ///  We now need to extract just the middle part for the live Cells
  shr r10,8     //middle part of D 1357
  shl rdx,8     //middle part of B 1357
  add r10,rdx   //Live Cells 1357
  mov r14,r10

  shr r11,8     //middle part of D 2468
  shl r15,8     //middle part of B 2468
  add r11,r15   //Live Cells 2468
  mov r15,r11

  and r10,rbp   //split up the live Cells so they match with the counts.
  shr r10,2     //r10 = live Cells 37
  and r14,rsi   //r8 = live Cells 15

  and r11,rbp   //split up the live Cells so they match with the counts.
  shr r11,2     //r11 = live Cells 48
  and r15,rsi   //r9 = live Cells 26

@AddRowsTogether:
  //input: r8:  Ncount for D1357  //rdx = copy
  //       r9:  Ncount for D2468
  //       r10: live Cells for 1357
  //       r11: live Cells for 2468
  //       r12: NCount for B1357  //rax = copy
  //       r13: NCount for B2468
  //       rdi: mask 11111111
  //Push the live Cells

  //Now we collect the counts in a 3x3 grid.
  // Start at the bottom and work our way up.

  mov rdx, r9  //Neighborhood counts for D-2468
  mov rbx, r13  //Neighborhood counts for B-2468, just the 2 rows
  //Get 3 counts: bottom-rcx, middle-rdx, top-rbx
  //the bottom is already in rcx
  shr r9,8           //middle of D
  shl r13,8          //bottom row of B
  mov rax,rdx   //rax = D
  mov rcx,rsi
  shr rax,16    //top part of D
  add rax,rbx   //rax is the top part
  lea rbx,[r13+r9]   //now we have the middle row in RBX

  //we need to split up the 4 columns in 2 pairs.
  mov r13,rsi    //Get masks
  mov r9,rsi
  and r13,rax    //r11 = row 26-top
  and rcx,rbx    //rcx = row 26-middle
  and r9,rdx     //r9 - row 26-bottom
  //Extract 48
  and rax,rbp    //rax = row 48 - top
  and rbx,rbp    //rbx = row 48 - middle
  and rdx,rbp     //rdx - row 48 - bottom
  //Align 48 with 26
  //shr rax,2      //Line up 48 with 26
  //shr rbx,2      //No shortcuts or we'll get overflows and bitloss.
  shr rax,2
  shr rbx,2
  shr rdx,2
  add r9,rcx      //top 26 + middle 26
  lea rcx,[r9+r13] //top+middle+bottom 26
  add rax,rbx
  add rax,rdx    //Add top+middle+bottom for 9 Cell count

  //now we have the neighbor counts including the centre Cells.
  //rax is NCount for columns 48
  //rcx is NCount for columns 26
  //Now we do the same for D1357 and B1357
  mov r9,r8     //Ncount for D1357
  mov rdx,r8    //D
  mov rbx,r12    //NCount for B1357
  shr r8,8       //Middle of D
  shl r12,8      //middle of B

  shr rdx,16      //top part of D
  add rbx,rdx     //rbx = top
  add r8,r12      //r8=middle, r9 = bottom
  //Split up the 4 columns in 2 pairs
  mov r12,rsi     {for some reason mov x,mask+and x,value is faster than the other way round}
  mov r13,rsi
  mov rdx,rsi
  and r12,rbx     //top 15
  and r13,r8      //middle 15
  and rdx,r9      //bottom 15
  //fix up rbx,r8,r9
  and rbx,rbp     //top 37
  and r8,rbp      //middle 37
  and r9,rbp      //bottom 37
  //Add everything together
  shr rbx,2
  shr r8,2
  shr r9,2
  add rbx,r8      //top+middle 37
  add rbx,r9      //top+middle+bottom 37
  add rdx,r13     //top+middle 15
  add rdx,r12     //top+middle+bottom 15
  //Add everything together

  //Register use is as follows:
  ///  rax:  NCount 48
  ///  rbx:  NCount 37
  ///  rcx:  NCount 26
  ///  rdx:  NCount 15

  //////////////////////////////////////////////////////
  ///  The following code is specific to a rule.
  ///  This code is tuned to 23/3
  ///  It needs to be patched for a different rule
  //////////////////////////////////////////////////////
  //Register use is as follows:
  ///  rax: NCount 48
  ///  rbx: NCount 37
  ///  rdx: NCount 15
  ///  rcx:  NCount 26
  ///  r14: Live Cells 15
  ///  r15: Live Cells 26
  ///  r10: Live Cells 37
  ///  r11: Live Cells 48

/// Start with NCount 15
  sub rdx,r14     //subtract the live Cells from the Cell count
  sub rcx,r15
  sub rbx,r10
  sub rax,r11
  xor rdx,rbp    //reg = not(NCount xor 3)
  xor rcx,rbp    //if dead+3 or Live+3 -> 0 xor not(3) = 1111, if 2 neighbors -> 1 xor not(3) = 1110
  xor rbx,rbp    //so it's a 1111 then new Cell for sure, 1110 if maybe new Cell
  xor rax,rbp    //and something else otherwise
  or rdx,r14      //if NCount=2 (1110) + life Cell (0001) -> reg = 1111
  or rcx,r15      //otherwise reg has 0 somewhere.
  or rbx,r10
  or rax,r11
//////////////////////////////////////////////////////
///  End of the code that's specific to a rule.
/////////////////////////////////////////////////////

  //Register use is as follows:
  ///  rax: Dirty live counts 48 , nibble=0 if new live, other if dead
  ///  rbx: Dirty live counts 37
  ///  rcx: Dirty live counts 26
  ///  rdx: Dirty live counts 15
//Clean up the live counts
    //Truthtable for Cell counts
//1.  Full Cellcount                     0  	 1	  2    3	  4 	 5 	  6    7     8     9
//1.  Fill Cellcount                   0000  0001 0010  0011 0100 0101 0110 0111  1000  1001
//2.  Full Cellcount xor 3	             11  	10	   1	   0	111  110 	101  100  1011  1010
//3.  Full Cellcount and not life Cell 0010  0010 0000	0000 0110 0110 0100 0100  1010  1010
//4.  Full Cellcount and not dead Cell 0011  0010 0001	0000 0111 0110 0101 0100  1011  1010 //no-op from 2.
//3a  Not 3                            1101  1101 1111  1111 1001 1001 1011 1011  0101  0101
//4a  Not 4.                           1100  1101 1110  1111 1000 1001 1010 1011  0100  0101
//3b  bit 23 and bit 10 for 3a          01    01   11   11   00   00   10   10    01    01

  mov r8,rax      //Compress the 1111 into   0001
  mov r9,rbx      //and everything else into 0000
  mov r10,rcx
  //mov r11,rdx
  and rax,rsi     //First compress 4 bits into 2
  and rbx,rsi
  and rcx,rsi
  and rsi,rdx
  shl rax,2       //Shift rax up 3 places (2 now, 1 later)
  shl rbx,2       //shift rbx up 2 places
  shr r10,2       //shift rcx up 1 place (0 now, 1 later)
  shr rdx,2       //leave rsi
  and rax,r8      //combine with 23
  and rbx,r9
  and rcx,r10
  and rsi,rdx     //Now 1111 -> 0011  and all the rest into 10,01 or 00
  //combine the halfs
  or rax,rcx     //Combine the halfs
  or rbx,rsi
  mov r10,rax     //then compress 2 bits into 1
  //mov rdx,rbx
  and rax,rdi     //keep bit 0
  and rdi,rbx
  shl rax,1
  shr rbx,1
  and rdi,rbx
  and rax,r10     //and combine it with bit 1
  or rax,rdi      //add everything together
  //cleanup
  pop rbp
  pop r13
  pop r12
  pop r15
  pop r14
  pop rsi
  pop rbx
  pop rdi
end;

I hope someone finds this useful.
The first thing I want to do from here is speed up coppersearch.

My testcode makes it run about a 1000 times faster than a naive for loop like listed below.

Code: Select all

//Slow but simple routine. 
//Used to validate that asm routine generates correct output.
function GetMask(x, y: integer): Int64;
begin
  Result:= 1;
  Result:= Result shl x;
  Result:= Result shl (y * 8);
end;

function GetBasePtoQ(A,B,C,D: int64; var x, y: integer): Int64;
begin
  case x of
    0..7: begin
      if y in [0..7] then begin
        Result:= D;
      end else begin
        Result:= B;
        Dec(y,8);
      end;
    end;
    8,9: begin
      if y in [0..7] then begin
        Result:= C;
        Dec(x,8);
      end else begin
        Result:= A;
        Dec(y,8); Dec(x,8);
      end;
    end;
  end;
end;

function GetBaseQtoP(A,B,C,D: int64; var x, y: integer): Int64;
begin
  case x of
    2..9: begin
      if y in [2..9] then begin
        Result:= D;
        Dec(x,2); Dec(y,2);
      end else begin
        Result:= B;
        Dec(x,2);
        Inc(y,6);
      end;
    end;
    0..1: begin
      if y in [0..1] then begin
        Result:= C;
        Inc(x,6); Inc(y,6);
      end else begin
        Result:= A;
        Dec(y,2);
        Inc(x,6);
      end;
    end;
  end;
end;

function TestGeneratePtoQ(A,B,C,D: int64): int64;

  function GetCell(x,y: integer): boolean;
  var
    Mask: Int64;
    Basis: Int64;
  begin
    Basis:= GetBasePtoQ(A,B,C,D,x,y);
    Mask:= GetMask(x,y);
    Result:= (Basis and Mask) <> 0;
  end;

  procedure SetCell(x,y: integer; Cell: boolean);
  var
    Mask: Int64;
  begin
    Mask:= GetMask(x,y);
    if Cell then begin
      Result:= Result or Mask;
    end else begin
      Result:= result and not Mask;
    end;
  end;
var
  x: Integer;
  y: Integer;
  x1,y1: integer;
  Count: integer;
  Cell: boolean;
begin
  Result:= 0;
  for x := 1 to 8 do begin
    for y := 1 to 8 do begin
      Count:= 0;
      Cell:= GetCell(x,y);  //The cell itself
      for x1:= -1 to 1 do begin
        for y1:= -1 to 1 do begin
          //Neighborhood count incl the cell
          if GetCell(x+x1,y+y1) then Inc(Count);
        end; {for y1}
      end; {for x1}
      case Count of
        3: SetCell(x-1,y-1,true);
        4: if Cell then SetCell(x-1,y-1,true);
      end; {case}
    end; {for y}
  end; {for x}
end;

//    D D D D D D D D D D|C|C
//                       | |
//    D D D D D D D D D D|C|C
//    -------------------+ |
//    B B B B B B B B B B A|A
//    ---------------------+
//    B B B B B B B B B B A A

function TestGenerateQtoP(const A,B,C,D: int64): int64;

  function GetCell(x,y: integer): boolean;
  var
    Mask: Int64;
    Basis: Int64;
  begin
    Basis:= GetBaseQtoP(A,B,C,D,x,y);
    Mask:= GetMask(x,y);
    Result:= (Basis and Mask) <> 0;
  end;

  procedure SetCell(x,y: integer; Cell: boolean);
  var
    Mask: Int64;
  begin
    Mask:= GetMask(x,y);
    if Cell then begin
      Result:= Result or Mask;
    end else begin
      Result:= result and not Mask;
    end;
  end;
var
  x: Integer;
  y: Integer;
  x1,y1: integer;
  Count: integer;
  Cell: boolean;
begin
  Result:= 0;
  for x := 1 to 8 do begin
    for y := 1 to 8 do begin
      Count:= 0;
      Cell:= GetCell(x,y);  //The cell itself
      for x1:= -1 to 1 do begin
        for y1:= -1 to 1 do begin
          //Neighborhood count incl the cell
          if GetCell(x+x1,y+y1) then begin
            Inc(Count);
          end;
        end; {for y1}
      end; {for x1}
      case Count of
        3: SetCell(x-1,y-1,true);
        4: if Cell then SetCell(x-1,y-1,true);
      end; {case}
    end; {for y}
  end; {for x}
end;

100 000 repeats, lowest time of 25 timings, accuracy +/- 2 cycles.

naive code: 2337 ms, 51298 ticks
x64 code: 2 ms, 51 ticks
xmm code: 6 ms, 127 ticks //SSE2 optimized code
x64 is: 100584% faster

simeks · Post by **simeks** » June 11th, 2016, 10:03 am

JohanB wrote:I've recently been playing around with some better algorithms and thought I've share some early code.

Hi Johan and welcome to these forums!

Your work looks interesting! Still, it would be nice to know how it compares to other recent implementations of fast ways to evolve Conway's Game of Life.

Here are some links to recent discussions.

LifeAPI by simsim314:
viewtopic.php?f=9&t=1524

apgsearch by calcyman:
viewtopic.php?f=7&t=2099#p28860
viewtopic.php?f=7&t=1784#p21512

The GoLGrid datatype I'm currently working on:
viewtopic.php?f=2&t=1701&start=50#p31243

JohanB · Post by **JohanB** » June 11th, 2016, 12:35 pm

I plan to first implement it in CopperSearch.
I'm hoping to speed that up by a considerable factor.
When I have a running implementation I'll get back to you.

Your code looks pretty optimized.
Obviously my running time is still O(n), but at least the constant factors are as small as I can get them.

I'll let you know when I've got a program people can test with.

ConwayLife.com

64-bit code for fast generation of patterns

64-bit code for fast generation of patterns

Re: 64-bit code for fast generation of patterns

Re: 64-bit code for fast generation of patterns