My name is Johan Bontes,some people may know me as the author of Life32.
I've recently been playing around with some better algorithms and thought I've share some early code.
A suggestion to speed-up GoL calculations
Here's some x64 assembly to generate a 8x8 cell.
The code can calculate 64 cells in 52 CPU cycles, so less than 1 cycle per cell.
On a 2Ghz processor that means it can calculate 40 million blocks x 64 = 2,5 billion cells per second.
This is on AMD K12, where the theoretical throughput is 3 instructions per cycle and the code runs at the full 3 instr per cycle. It uses only pure x64 code, no SSE.
Using the SSSE3 instruction set it might be made slightly faster using the pshufb instruction.
How does it work
It works by splitting a 8x8 block into 4 vertical parts with alternating columns, block 0xx, 1xx, 2xx and block 3xx.
It then parallelly adds the bitcounts together (the bitcount of a live cell is one, because that's a binary 1 and the bitcount of a dead cell is zero).
1+1+1=3 which fits into 2 bits, so I have plenty of space.
I have now added 3 neighbors (012, 123, 234 etc).
Note that the first block holds 16 neighborcounts for 012+456 (x 8 rows), the second holds 123+567, third holds 234+678 and the last holds 345+789
Now I split the columns into 4 horizontal rows A,B,C,D and do the same trick.
Because 3x3 = 9 fits into 4 bits. I need 4 blocks for my neighbor counts so I reuse block 0xx,1xx,2xx,3xx to hold the neighbor counts of every 1st, 2nd, 3rd and 4th cell respectively.
A also keep a copies around of my original cell-states, so I can see whether a cell was dead or alive to begin with.
I then subtract the livecount from the cellcount to get the neighborcount and xor this with not(3).
In binary:
0011(3) xor(not(0011)(C)) = 1111(F) -> a guaranteed birth.
0010(2) xor(not(0011)(C)) = 1110(E) -> maybe a birth, but only the cell was live to begin with.
Now I OR the life cell: 0001/0000 back in.
1111 OR 0001 = 1111 -> guaranteed birth
1111 OR 0000 = 1111 -> " " birth
1110 OR 0001 = 1111 -> birth.
1110 OR 0000 = 1110 -> death
All other inputs will generate something other than all 1's.
Any sequence of 4 bits with a zero somewhere gets condensed into a single 0 (death) and 4 ones condense into a 1 (live).
I combine all this into a single 8x8 64-bit result and I'm done.
This means I can calculate 64 cells using 3 instructions x 4 blocks. These 12 instructions run in 4 cycles!
If I had to use a lookup table I'd need to do (64 / (2x2)) = 16 lookups.
If I'm really really lucky that will take me 32 cycles.
If I have a cache miss, it will take me 100's of cycles more.
The lookup table would save me the counting, but I'd still need to extract the 4x4 blocks for the lookup and put the pieces back together.
So the rest of the code would not be faster using a lookup table.
Note that I'm lucky that 23/3 turns out to be so easy to do in code. Just 3 instructions!
Oh and just Like Hershel life, QLife, Life32 and Golly the generations are stagger stepped, so I only need 3 neighbor cells to get the 66x66 input block needed to generate a 64x64 cell result. 2 bit wide N,W,NW rows come from the neighbors and the S,E,SE,NE,SW rows comes from the cell itself.
The result cell is shifted upwards and to the side by a single pixel.
In the next generation I do the same thing, but the shifting is exactly the other way round.
The code is a bit confusing, because I keep running out of CPU registers to store all the intermediate data in.
Also I try to avoid dependency chains, use simple instructions and try to keep the code size small.
To recap:
this code basically runs the naive loop: for x:= -1 to 1 do for y:= -1 to 1 do cellcount += grid[x,y];
for every cell, except that it does this in parallel for 16 cells at a time and it abuses the fact that a CPU can run 3 instructions superscalar to run nearly 4 of those parallel calculations in a single CPU cycle.
The code does not do ifs, does not do loops and does not access memory, everything is does in registers.
So there are no mispredicted branches and no cache misses.
Code: Select all
//155 instructions for 64 Cells = 2.5 instructions per Cell.
//52 CPU cycles = 3 instr per cycle on AMD K12
function Generate_PtoQ(NW,NE,SW,SE_: Int64):Int64; //Note that SE_ is the input cell itself.
asm
push rdi //rcx: A
push rbx //rdx: B
push rsi //r8: C
push r14 //r9: D
push r15 // Layout of a Cell is
push r12 // A B
push r13 // C D
push rbp
mov rbp,$0303030303030303 //The rightmost 2 columns of A and C
shl rdx,(64-16) //Keep only the two lower rows of B
and r8,rbp //Rightmost 2 columns of C
mov rsi,$1111111111111111 //counting mask
lea rdi,[rsi+rsi*4] //$55555555 //fast counting mask
shl r8,(8-2) //shift out the 6 columns we don't need
and ecx,ebp //The rightmost 2 columns of A
shl rcx,(64-16+(8-2)) //shift out the top 6 rows and the 6 columns we don't need
not rbp //mask for leftmost 6 columns of D/B to be combined with C/A
@CountNeighborsPerRow: //count C+D, interleaved with count A+B
///////////Count the first 3 columns.
///
//count 3 neighbors in 4 columns at once
mov r10,r9 //Cache D'
mov rax,rdi //Cache D"
and rax,r9 //r8= bits in column0246
shr r10,1
and r10,rdi //r10= bits in column 1357
mov r11,rdi
and r9,rbp //Mask off lower 2 columns
shr r9,2
lea rbx,[r8+r9] //add colums 8+9
and r11,rbx //r11= bits in column 2468
//Add all the bits together
lea r9,[r11+r10] //add bits 1357+2468 together
lea r8,[rax+r9] //r9 = NCount 1357 (0246+1357+2468)
//////////////Count the next 3 columns
//Do the same for the neighborhood counts in column
//r11 is live Cells in 2468
xor rbx,r11 //23456789 -/- 2468 = columns 3579
shr rbx,1 //rbx=bitcount in column 3579
add r9,rbx //column 3579+2468+1357 = NCount for columns 2468
/// register use:
/// r8: NCount 1357
/// r9: NCount 2468
/// r10: live bits in 1357
/// r11: live bits in 2468
///////////Count the first 3 columns.
mov r12,rdx //break the dependency chain by caching B
mov r13,rdx //count 3 neighbors in 4 columns at once
and r13,rdi //r13= bits in column0246
shr rdx,1
and rdx,rdi //r14= bits in column 1357
mov r15,rdi
and r12,rbp //Mask off rightmost 2 columns
shr r12,2
add rcx,r12 //add colums 8+9
and r15,rcx //r15= bits in column 2468
//Add all the bitcounts together
lea rax,[r15+rdx] //1357+2468
lea r12,[rax+r13] //r12 = NCount 1357 (0246+1357+2468)
//////////////Count the next 3 columns
//Do the same for the neighborhood counts in column
//r15 is bitcount in 1357+2468
//xor rcx,r15 //23456789 -/- 2468 = columns 3579
shr rcx,1 //rcx=bitcount in column 3579
and rcx,rdi
lea r13,[rcx+rax] //and add column 2468+1357 = NCount for columns 2468
//Now we're all done with the neighborhood counts in ABC and D
/// Register use
/// r12: NCount 1357
/// r13: NCount 2468
/// rdx: live bits in 1357
/// r15: live bits in 2468
lea rsi,[rsi+rsi*2] //Mask 333333333
mov rbp,rsi
not rbp //Mask CCCCCCC
@ExtractLiveCells:
/// We now need to extract just the middle part for the live Cells
shr r10,8 //middle part of D 1357
shl rdx,8 //middle part of B 1357
add r10,rdx //Live Cells 1357
mov r14,r10
shr r11,8 //middle part of D 2468
shl r15,8 //middle part of B 2468
add r11,r15 //Live Cells 2468
mov r15,r11
and r10,rbp //split up the live Cells so they match with the counts.
shr r10,2 //r10 = live Cells 37
and r14,rsi //r8 = live Cells 15
and r11,rbp //split up the live Cells so they match with the counts.
shr r11,2 //r11 = live Cells 48
and r15,rsi //r9 = live Cells 26
@AddRowsTogether:
//input: r8: Ncount for D1357 //rdx = copy
// r9: Ncount for D2468
// r10: live Cells for 1357
// r11: live Cells for 2468
// r12: NCount for B1357 //rax = copy
// r13: NCount for B2468
// rdi: mask 11111111
//Push the live Cells
//Now we collect the counts in a 3x3 grid.
// Start at the bottom and work our way up.
mov rdx, r9 //Neighborhood counts for D-2468
mov rbx, r13 //Neighborhood counts for B-2468, just the 2 rows
//Get 3 counts: bottom-rcx, middle-rdx, top-rbx
//the bottom is already in rcx
shr r9,8 //middle of D
shl r13,8 //bottom row of B
mov rax,rdx //rax = D
mov rcx,rsi
shr rax,16 //top part of D
add rax,rbx //rax is the top part
lea rbx,[r13+r9] //now we have the middle row in RBX
//we need to split up the 4 columns in 2 pairs.
mov r13,rsi //Get masks
mov r9,rsi
and r13,rax //r11 = row 26-top
and rcx,rbx //rcx = row 26-middle
and r9,rdx //r9 - row 26-bottom
//Extract 48
and rax,rbp //rax = row 48 - top
and rbx,rbp //rbx = row 48 - middle
and rdx,rbp //rdx - row 48 - bottom
//Align 48 with 26
//shr rax,2 //Line up 48 with 26
//shr rbx,2 //No shortcuts or we'll get overflows and bitloss.
shr rax,2
shr rbx,2
shr rdx,2
add r9,rcx //top 26 + middle 26
lea rcx,[r9+r13] //top+middle+bottom 26
add rax,rbx
add rax,rdx //Add top+middle+bottom for 9 Cell count
//now we have the neighbor counts including the centre Cells.
//rax is NCount for columns 48
//rcx is NCount for columns 26
//Now we do the same for D1357 and B1357
mov r9,r8 //Ncount for D1357
mov rdx,r8 //D
mov rbx,r12 //NCount for B1357
shr r8,8 //Middle of D
shl r12,8 //middle of B
shr rdx,16 //top part of D
add rbx,rdx //rbx = top
add r8,r12 //r8=middle, r9 = bottom
//Split up the 4 columns in 2 pairs
mov r12,rsi {for some reason mov x,mask+and x,value is faster than the other way round}
mov r13,rsi
mov rdx,rsi
and r12,rbx //top 15
and r13,r8 //middle 15
and rdx,r9 //bottom 15
//fix up rbx,r8,r9
and rbx,rbp //top 37
and r8,rbp //middle 37
and r9,rbp //bottom 37
//Add everything together
shr rbx,2
shr r8,2
shr r9,2
add rbx,r8 //top+middle 37
add rbx,r9 //top+middle+bottom 37
add rdx,r13 //top+middle 15
add rdx,r12 //top+middle+bottom 15
//Add everything together
//Register use is as follows:
/// rax: NCount 48
/// rbx: NCount 37
/// rcx: NCount 26
/// rdx: NCount 15
//////////////////////////////////////////////////////
/// The following code is specific to a rule.
/// This code is tuned to 23/3
/// It needs to be patched for a different rule
//////////////////////////////////////////////////////
//Register use is as follows:
/// rax: NCount 48
/// rbx: NCount 37
/// rdx: NCount 15
/// rcx: NCount 26
/// r14: Live Cells 15
/// r15: Live Cells 26
/// r10: Live Cells 37
/// r11: Live Cells 48
/// Start with NCount 15
sub rdx,r14 //subtract the live Cells from the Cell count
sub rcx,r15
sub rbx,r10
sub rax,r11
xor rdx,rbp //reg = not(NCount xor 3)
xor rcx,rbp //if dead+3 or Live+3 -> 0 xor not(3) = 1111, if 2 neighbors -> 1 xor not(3) = 1110
xor rbx,rbp //so it's a 1111 then new Cell for sure, 1110 if maybe new Cell
xor rax,rbp //and something else otherwise
or rdx,r14 //if NCount=2 (1110) + life Cell (0001) -> reg = 1111
or rcx,r15 //otherwise reg has 0 somewhere.
or rbx,r10
or rax,r11
//////////////////////////////////////////////////////
/// End of the code that's specific to a rule.
/////////////////////////////////////////////////////
//Register use is as follows:
/// rax: Dirty live counts 48 , nibble=0 if new live, other if dead
/// rbx: Dirty live counts 37
/// rcx: Dirty live counts 26
/// rdx: Dirty live counts 15
//Clean up the live counts
//Truthtable for Cell counts
//1. Full Cellcount 0 1 2 3 4 5 6 7 8 9
//1. Fill Cellcount 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001
//2. Full Cellcount xor 3 11 10 1 0 111 110 101 100 1011 1010
//3. Full Cellcount and not life Cell 0010 0010 0000 0000 0110 0110 0100 0100 1010 1010
//4. Full Cellcount and not dead Cell 0011 0010 0001 0000 0111 0110 0101 0100 1011 1010 //no-op from 2.
//3a Not 3 1101 1101 1111 1111 1001 1001 1011 1011 0101 0101
//4a Not 4. 1100 1101 1110 1111 1000 1001 1010 1011 0100 0101
//3b bit 23 and bit 10 for 3a 01 01 11 11 00 00 10 10 01 01
mov r8,rax //Compress the 1111 into 0001
mov r9,rbx //and everything else into 0000
mov r10,rcx
//mov r11,rdx
and rax,rsi //First compress 4 bits into 2
and rbx,rsi
and rcx,rsi
and rsi,rdx
shl rax,2 //Shift rax up 3 places (2 now, 1 later)
shl rbx,2 //shift rbx up 2 places
shr r10,2 //shift rcx up 1 place (0 now, 1 later)
shr rdx,2 //leave rsi
and rax,r8 //combine with 23
and rbx,r9
and rcx,r10
and rsi,rdx //Now 1111 -> 0011 and all the rest into 10,01 or 00
//combine the halfs
or rax,rcx //Combine the halfs
or rbx,rsi
mov r10,rax //then compress 2 bits into 1
//mov rdx,rbx
and rax,rdi //keep bit 0
and rdi,rbx
shl rax,1
shr rbx,1
and rdi,rbx
and rax,r10 //and combine it with bit 1
or rax,rdi //add everything together
//cleanup
pop rbp
pop r13
pop r12
pop r15
pop r14
pop rsi
pop rbx
pop rdi
end;
The first thing I want to do from here is speed up coppersearch.
My testcode makes it run about a 1000 times faster than a naive for loop like listed below.
Code: Select all
//Slow but simple routine.
//Used to validate that asm routine generates correct output.
function GetMask(x, y: integer): Int64;
begin
Result:= 1;
Result:= Result shl x;
Result:= Result shl (y * 8);
end;
function GetBasePtoQ(A,B,C,D: int64; var x, y: integer): Int64;
begin
case x of
0..7: begin
if y in [0..7] then begin
Result:= D;
end else begin
Result:= B;
Dec(y,8);
end;
end;
8,9: begin
if y in [0..7] then begin
Result:= C;
Dec(x,8);
end else begin
Result:= A;
Dec(y,8); Dec(x,8);
end;
end;
end;
end;
function GetBaseQtoP(A,B,C,D: int64; var x, y: integer): Int64;
begin
case x of
2..9: begin
if y in [2..9] then begin
Result:= D;
Dec(x,2); Dec(y,2);
end else begin
Result:= B;
Dec(x,2);
Inc(y,6);
end;
end;
0..1: begin
if y in [0..1] then begin
Result:= C;
Inc(x,6); Inc(y,6);
end else begin
Result:= A;
Dec(y,2);
Inc(x,6);
end;
end;
end;
end;
function TestGeneratePtoQ(A,B,C,D: int64): int64;
function GetCell(x,y: integer): boolean;
var
Mask: Int64;
Basis: Int64;
begin
Basis:= GetBasePtoQ(A,B,C,D,x,y);
Mask:= GetMask(x,y);
Result:= (Basis and Mask) <> 0;
end;
procedure SetCell(x,y: integer; Cell: boolean);
var
Mask: Int64;
begin
Mask:= GetMask(x,y);
if Cell then begin
Result:= Result or Mask;
end else begin
Result:= result and not Mask;
end;
end;
var
x: Integer;
y: Integer;
x1,y1: integer;
Count: integer;
Cell: boolean;
begin
Result:= 0;
for x := 1 to 8 do begin
for y := 1 to 8 do begin
Count:= 0;
Cell:= GetCell(x,y); //The cell itself
for x1:= -1 to 1 do begin
for y1:= -1 to 1 do begin
//Neighborhood count incl the cell
if GetCell(x+x1,y+y1) then Inc(Count);
end; {for y1}
end; {for x1}
case Count of
3: SetCell(x-1,y-1,true);
4: if Cell then SetCell(x-1,y-1,true);
end; {case}
end; {for y}
end; {for x}
end;
// D D D D D D D D D D|C|C
// | |
// D D D D D D D D D D|C|C
// -------------------+ |
// B B B B B B B B B B A|A
// ---------------------+
// B B B B B B B B B B A A
function TestGenerateQtoP(const A,B,C,D: int64): int64;
function GetCell(x,y: integer): boolean;
var
Mask: Int64;
Basis: Int64;
begin
Basis:= GetBaseQtoP(A,B,C,D,x,y);
Mask:= GetMask(x,y);
Result:= (Basis and Mask) <> 0;
end;
procedure SetCell(x,y: integer; Cell: boolean);
var
Mask: Int64;
begin
Mask:= GetMask(x,y);
if Cell then begin
Result:= Result or Mask;
end else begin
Result:= result and not Mask;
end;
end;
var
x: Integer;
y: Integer;
x1,y1: integer;
Count: integer;
Cell: boolean;
begin
Result:= 0;
for x := 1 to 8 do begin
for y := 1 to 8 do begin
Count:= 0;
Cell:= GetCell(x,y); //The cell itself
for x1:= -1 to 1 do begin
for y1:= -1 to 1 do begin
//Neighborhood count incl the cell
if GetCell(x+x1,y+y1) then begin
Inc(Count);
end;
end; {for y1}
end; {for x1}
case Count of
3: SetCell(x-1,y-1,true);
4: if Cell then SetCell(x-1,y-1,true);
end; {case}
end; {for y}
end; {for x}
end;
naive code: 2337 ms, 51298 ticks
x64 code: 2 ms, 51 ticks
xmm code: 6 ms, 127 ticks //SSE2 optimized code
x64 is: 100584% faster