LifeCube - new fast AVX iterator (in C++)

For scripts to aid with computation or simulation in cellular automata.
simsim314
Posts: 1823
Joined: February 10th, 2014, 1:27 pm

LifeCube - new fast AVX iterator (in C++)

Post by simsim314 » August 2nd, 2016, 2:15 pm

I've published an early-stage version of the new LifeCube iterator. It's based on AVX and is extremely fast in terms of "true" cell evaluations per second (~7.2 * 10^10 cell evaluations/sec on my i7 using 8 threads).

The main idea of this iterator, unlike many others, is to evolve 256 Life grids in parallel. This has its own advantages and disadvantages. The main advantage is the minimal number of bitwise operations: because of the structure, one can reuse the row sums (you only need to compute 4 additions instead of 9 to get the total of all neighbors). This turns out to allow a serious speedup, considerably reducing the bitwise operations needed to compute the GoL rule.
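As a rough illustration of the row-sum trick, here is a minimal scalar sketch (not the actual LifeCube code - LifeCube does the same additions for 256 grids at once with bitplane adders on AVX registers):

Code: Select all

#include <cstdint>
#include <vector>

// One Life generation on a W x H grid of 0/1 cells with a dead 1-cell border.
// out must be zero-initialized (border cells are never written).
// Each horizontal triple sum costs 2 additions and is shared by the three
// output rows that touch it; combining three row sums costs 2 more, so the
// neighborhood total takes 4 additions per cell instead of 8-9.
void step (const std::vector<uint8_t> &in, std::vector<uint8_t> &out, int W, int H)
{
    std::vector<uint8_t> rowsum (W * H, 0);

    for (int y = 0; y < H; y++)
        for (int x = 1; x < W - 1; x++)
            rowsum [y * W + x] = in [y * W + x - 1] + in [y * W + x] + in [y * W + x + 1];

    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++)
        {
            int total = rowsum [(y - 1) * W + x] + rowsum [y * W + x] + rowsum [(y + 1) * W + x];
            int neighbors = total - in [y * W + x];
            out [y * W + x] = (neighbors == 3 || (in [y * W + x] && neighbors == 2)) ? 1 : 0;
        }
}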

The disadvantage is, of course, the unintuitive parallelism that this "cubical" structure of 256 parallel grids forces on you. It also costs extra computation: even a single grid that is exploding will cause all the other grids to be computed in "empty" space. On the other hand, in a small, compact and limited area it should work very well. This brings another advantage of the approach: the space is not limited to some sort of tiles and can be of any size - the AVX optimization is done in the depth of the cube instead of the width/height.
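Schematically, the layout looks like this (an illustrative sketch, not the actual source):

Code: Select all

#include <immintrin.h>
#include <cstdint>
#include <vector>

// "Cube" layout: cell (x, y) of all 256 grids lives in a single __m256i,
// with bit g holding the state of grid g. Every bitwise operation on that
// vector therefore advances all 256 grids at once, and the width/height
// can be anything - the vectorization runs along the depth axis.
struct CubeSketch
{
    int width, height;
    std::vector<__m256i> cells;  // cells [y * width + x]

    void set_cell (int x, int y, int grid_ix)
    {
        uint64_t *words = reinterpret_cast<uint64_t *> (&cells [y * width + x]);
        words [grid_ix >> 6] |= 1ull << (grid_ix & 63);
    }
};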

There are no automatic size adjustments yet, but it already has some quite nice speed optimizations.

Review comments and requests are very welcome.
LifeCube on GitHub

EDIT: I've added an SSE branch with SSE integration (it works for both SSE and AVX). SSE is obviously twice as slow by default.

simeks
Posts: 407
Joined: March 11th, 2015, 12:03 pm
Location: Sweden

Re: LifeCube - new fast AVX iterator (in C++)

Post by simeks » August 3rd, 2016, 7:45 am

simsim314 wrote:I've published an early-stage version of the new LifeCube iterator. It's based on AVX and is extremely fast in terms of "true" cell evaluations per second (~7.2 * 10^10 cell evaluations/sec on my i7 using 8 threads).

The main idea of this iterator, unlike many others, is to evolve 256 Life grids in parallel.
It's a very interesting idea, even if it could be difficult to make full use of.

It seems LifeCube doesn't compile with GCC at the moment, so I haven't been able to try it.

To get an idea of evolution speed, I tried an example with my GoLGrid datatype.

I used this example as a test pattern. When placed in the middle of a 64-by-64 grid, it stabilizes in generation 1243 without ever going outside the grid.

Code: Select all

x = 66, y = 66, rule = LifeHistory
66D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D
$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D
64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D29.3A32.D$D29.A4.A29.D$D29.
A.A.A30.D$D29.A.A2.A29.D$D29.A34.D$D29.A.3A30.D$D64.D$D64.D$D64.D$D
64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.
D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D64.D$D
64.D$D64.D$D64.D$66D!
Running in a single thread on a CPU with AVX2 support, the evolution speed is about 12.3 computed cells for each CPU clock cycle.

Assuming this result can be scaled up to 4 cores without any speed loss (I don't think hyper-threading helps much for a task like this) and a CPU speed of 4.0 GHz, that amounts to 1.96 * 10^11 cells/s.

This is the code I used for the test:
speedtest.zip
(52.23 KiB) Downloaded 345 times

simsim314
Posts: 1823
Joined: February 10th, 2014, 1:27 pm

Re: LifeCube - new fast AVX iterator (in C++)

Post by simsim314 » August 3rd, 2016, 8:34 am

simeks wrote:It's a very interesting idea, even if it could be difficult to make full use of.
Indeed, the implementation is not so straightforward, and the idea will need to be developed by trial and error into best-practice recommendations per specific application. LifeCube will probably work faster for some applications and slower for others.
simeks wrote:It seems LifeCube doesn't compile with GCC at the moment, so I haven't been able to try it.
This is definitely unfortunate. According to this, all that's needed is to add #include <immintrin.h> and compile with the gcc -mavx flag. I'll try to compile it with GCC later on (when I have access to a good PC).
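Something like this should do (the file name is just an example):

Code: Select all

g++ -O3 -mavx -o LifeCube Main.cpp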
simeks wrote:To get an idea of evolution speed, I tried an example with my GoLGrid datatype.
This sounds way too fast to me... have you done some "cheat" optimization like skipping empty cells? If so, can you try to run this pattern for 2.5 million iterations? (Unless you have some specific p2 optimization, this pattern doesn't leave room for too much cheating.)

Code: Select all

x = 64, y = 64, rule = B3/S23
$3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o4$3ob3ob3ob3ob3ob3ob3o
b3ob3ob3ob3ob3ob3ob3ob3ob3o4$3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o
b3ob3o4$3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o4$3ob3ob3ob3ob
3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o4$3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob
3ob3ob3ob3ob3o4$3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o4$3ob3o
b3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o4$3ob3ob3ob3ob3ob3ob3ob3ob3o
b3ob3ob3ob3ob3ob3ob3o4$3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o
4$3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o4$3ob3ob3ob3ob3ob3ob
3ob3ob3ob3ob3ob3ob3ob3ob3ob3o4$3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob
3ob3ob3o4$3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o4$3ob3ob3ob3o
b3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o4$3ob3ob3ob3ob3ob3ob3ob3ob3ob3ob3o
b3ob3ob3ob3ob3o!
On my i7 with LifeCube, 2.5M iterations of this pattern should take about 1 sec per thread.
simeks wrote:I don't think hyper-threading helps much for a task like this
My tests show it actually does help. Although it's not a linear speedup, my 8 threads work 1.5x faster than 4 threads per single soup (and I have 2 threads per CPU core and 4 cores).

-----

I'll be smarter later on - when I get my hands on a PC with AVX.

simeks
Posts: 407
Joined: March 11th, 2015, 12:03 pm
Location: Sweden

Re: LifeCube - new fast AVX iterator (in C++)

Post by simeks » August 3rd, 2016, 10:22 am

simsim314 wrote:This is definitely unfortunate. According to this, all that's needed is to add #include <immintrin.h>
Thanks, that was all it took! The reported running time for your test program is 0.80 s on my machine.
simsim314 wrote:This sounds way too fast to me.... have you done some "cheat" optimization like skipping empty cells?
There's no cheating ;) The GoLGrid datatype does keep track of the first and the last non-empty row in the grid, and uses this information to compute only the rows from (first - 1) to (last + 1) - the same as LifeAPI does, as I remember. But I have this line in my test program to dynamically count the number of computed cells:

evolved_cells += (64 * (2 + (g0->pop_y_off - g0->pop_y_on)));
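In sketch form, the bounds bookkeeping works roughly like this (simplified, with hypothetical names; evolve_word stands in for the per-row step shown in the next sketch):

Code: Select all

#include <cstdint>

uint64_t evolve_word (uint64_t above, uint64_t mid, uint64_t below);  // per-row step, defined elsewhere

// Simplified: rows 2..65 hold the 64 grid rows, the rest stay zero as
// guards (border clamping omitted, and rows outside the evolved range are
// assumed already zero). pop_y_on is the first populated row, pop_y_off
// one past the last, so one generation evolves exactly
// (pop_y_off - pop_y_on + 2) rows of 64 cells - hence the formula above.
struct GridSketch
{
    uint64_t row [68];
    int pop_y_on;
    int pop_y_off;
};

void evolve (const GridSketch &in, GridSketch &out)
{
    out.pop_y_on = 66;
    out.pop_y_off = 2;

    for (int y = in.pop_y_on - 1; y <= in.pop_y_off; y++)
    {
        out.row [y] = evolve_word (in.row [y - 1], in.row [y], in.row [y + 1]);
        if (out.row [y])
        {
            if (out.pop_y_on > y)
                out.pop_y_on = y;
            out.pop_y_off = y + 1;
        }
    }
}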

The innermost loop that computes 256 cells per iteration compiles to only 47 instructions, so it's an average of 2.25 instructions/clock cycle, which isn't that wild...
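For anyone curious what such a loop body can look like, here is a simplified sketch of the general full-adder technique (not my exact GoLGrid_int_evolve_word):

Code: Select all

#include <cstdint>

// Each uint64_t is one 64-cell row, edge columns are treated as dead, and
// the count is accumulated in bitplanes s3:s2:s1:s0 as T = neighbors +
// center. The next state is then (T == 3) | (center & (T == 4)).
static uint64_t evolve_word (uint64_t above, uint64_t mid, uint64_t below)
{
    // 2-bit horizontal triple sums (west + center + east) of each row
    uint64_t al = (above << 1) ^ above ^ (above >> 1);
    uint64_t ah = ((above << 1) & above) | (((above << 1) ^ above) & (above >> 1));
    uint64_t ml = (mid << 1) ^ mid ^ (mid >> 1);
    uint64_t mh = ((mid << 1) & mid) | (((mid << 1) ^ mid) & (mid >> 1));
    uint64_t bl = (below << 1) ^ below ^ (below >> 1);
    uint64_t bh = ((below << 1) & below) | (((below << 1) ^ below) & (below >> 1));

    // Add the three 2-bit sums into the 4-bit total T
    uint64_t s0 = al ^ ml ^ bl;
    uint64_t c0 = (al & ml) | ((al ^ ml) & bl);   // carry into bit 1
    uint64_t x  = ah ^ mh ^ bh;
    uint64_t ca = (ah & mh) | ((ah ^ mh) & bh);   // carry into bit 2
    uint64_t s1 = x ^ c0;
    uint64_t cb = x & c0;                         // second carry into bit 2
    uint64_t s2 = ca ^ cb;
    uint64_t s3 = ca & cb;

    uint64_t t3 = ~s3 & ~s2 & s1 & s0;            // T == 3
    uint64_t t4 = ~s3 & s2 & ~s1 & ~s0;           // T == 4
    return t3 | (mid & t4);
}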
simsim314 wrote:If so, can you try to run this pattern for 2.5 million iterations? (Unless you have some specific p2 optimization, this pattern doesn't leave room for too much cheating.)
The running time for this is 158 ms @ 4.4 GHz, about 14.7 cells/clock cycle. Because this always computes 64×64 cells, there is less relative overhead from doing other things, so it's not surprising that the result is slightly better.
simsim314 wrote:My tests show it actually does help. Although not linear speedup, my 8 threads work 1.5 faster than 4 threads per single soup (and I have 2 threads per CPU core and 4 cores).
I tried running 4 and 8 copies of my test program at the same time. The speed-up was 12%, which was better than I thought it would be, but nowhere near 50%... Could it be that in your program one thread gets to run while the other is waiting for data from the L2 or L3 cache? Even a single 47×47×256-bit array is 69 Kbytes, so all the data can't fit in the L1 data cache.

simsim314
Posts: 1823
Joined: February 10th, 2014, 1:27 pm

Re: LifeCube - new fast AVX iterator (in C++)

Post by simsim314 » August 5th, 2016, 6:11 am

simeks wrote:The innermost loop that computes 256 cells per iteration compiles to only 47 instructions, so it's an average of 2.25 instructions/clock cycle, which isn't that wild...
Wow, this does sound like a well-optimized iterator...

I can only confirm that on my station your iterator runs at 46 BCO/s (billion cell operations per second) in a single thread, versus 8.5 BCO/s for mine. This is really cool...

===

I need to investigate this issue - it could be that LifeCube is a bad idea after all, due to an L1 cache issue or something else I'm missing.

simsim314
Posts: 1823
Joined: February 10th, 2014, 1:27 pm

Re: LifeCube - new fast AVX iterator (in C++)

Post by simsim314 » August 5th, 2016, 11:48 am

I've played with the GoLGrid class, and here are some of my remarks/changes:

1. I've managed to make it compile with MSVC from VS2013, although I couldn't make it vectorize the iteration loop. So it's obviously 4 times slower, but it works.

2. I've improved the iterator performance by 5%, using some simple tricks (check out GoLGrid_int_evolve_word in golgrid.c).

3. I've added OpenMP in test.c - this gave an amazing 200 BCO/s on my i7 (i.e. ~2 * 10^11 cell evaluations per second); see the sketch after this list.

4. I've measured 4 vs 8 threads more accurately, and the difference was ~20%.

It's obviously still backward compatible, and works well with GCC.
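The OpenMP change is essentially of this shape (a schematic sketch with made-up names, not the actual test.c; compile with -fopenmp):

Code: Select all

#include <omp.h>

struct Soup;                       // stands in for one independent GoLGrid soup
void soup_evolve (struct Soup *);  // stands in for one generation step

// Each thread evolves its own soups, so the loop needs no synchronization
// and scales with the number of hardware threads.
void run_soups (struct Soup **soups, int soup_cnt, int generations)
{
    #pragma omp parallel for schedule(dynamic)
    for (int soup_ix = 0; soup_ix < soup_cnt; soup_ix++)
        for (int gen = 0; gen < generations; gen++)
            soup_evolve (soups [soup_ix]);
}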

EDIT

More precisely:

Threads                BCO/s
1 (no OpenMP)          50
1 (OpenMP)             43
2 (OpenMP)             81
4 (OpenMP)             160
8 (OpenMP)             200
Attachments
speedtest.zip
(52.59 KiB) Downloaded 364 times

simeks
Posts: 407
Joined: March 11th, 2015, 12:03 pm
Location: Sweden

Re: LifeCube - new fast AVX iterator (in C++)

Post by simeks » August 6th, 2016, 5:34 am

simsim314 wrote:1. I've managed to make it compile with MSVC from VS2013, although I couldn't make it vectorize the iteration loop. So it's obviously 4 times slower, but it works.
I downloaded Visual Studio Community 2015 to try this.
I've noticed in GCC that vectorization is very picky about the specific code constructs you use. I got MSVC to vectorize the loop by rewriting it like this:

Code: Select all

	for (row_ix = 0; row_ix < row_cnt; row_ix++)
	{
		u64 out_word = GoLGrid_int_evolve_word(in_entry[row_ix - 1], in_entry[row_ix], in_entry[row_ix + 1]);

		out_entry[row_ix] = out_word;
		or_of_result |= out_entry[row_ix];
	}
Note that if "or_of_result |= out_word" is used instead, MSVC fails to vectorize, strangely enough.

The result is a bit slower than when compiled with GCC. MSVC decides to also unroll this inner loop, which might create unnecessary overhead. That could be the reason it's slower, but I haven't looked into it.
simsim314 wrote:2. I've improved the iterator performance by 5%, using some simple tricks (check out GoLGrid_int_evolve_word in golgrid.c).
Nice! This reduces the instruction count of the inner loop from 47 to 45.
simsim314 wrote:3. I've added OpenMP in test.c - this gave an amazing 200 BCO/s on my i7 (i.e. ~2 * 10^11 cell evaluations per second).
Thanks for trying this! It will be nice to have a working example when I start experimenting with OpenMP.

simsim314
Posts: 1823
Joined: February 10th, 2014, 1:27 pm

Re: LifeCube - new fast AVX iterator (in C++)

Post by simsim314 » August 6th, 2016, 5:59 am

simeks wrote:I got MSVC to vectorize the loop by rewriting it like this
Nice!

Here is my performance table for VS2015 (note that this still didn't help for VS2013... but it looks like MSVC is terrible at this stuff anyway):

Threads                BCO/s
1 (no OpenMP)          31
1 (OpenMP)             30
2 (OpenMP)             53
4 (OpenMP)             91
8 (OpenMP)             130

Looks like they didn't implement OpenMP properly either.

This is significantly worse than GCC... have you tried compiling with native GCC using the Bash for Windows that was released lately?
simeks wrote: This reduces the instruction count of the inner loop from 47 to 45.
Good to know... I was looking into hacking it further but without success...

-----

I now have a new idea: to use 16x16 overlapping tiles (i.e. a grid of effectively 14x14 tiles), and implement all the instructions on a single 256-bit integer. This idea + a hash table per tile + p2 awareness should give a significant boost in operations per cycle, as well as in total average GoL soup time (especially for searches, where the hash can be reused most of the time). This is HashLife meeting AVX + p2 awareness - it should be a real monster.
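To make the tile idea concrete, here is an illustrative sketch (not the attached code):

Code: Select all

#include <immintrin.h>

// One 16x16 tile in a single __m256i: 16-bit lane i holds row i. The
// horizontal neighbor masks are one in-lane shift each; moving a whole
// row up or down crosses the 128-bit lane boundary, so it takes one
// cross-lane permute plus an alignr (the opposite direction is symmetric).
typedef __m256i Tile16;

static inline __m256i west (Tile16 t) { return _mm256_slli_epi16 (t, 1); }
static inline __m256i east (Tile16 t) { return _mm256_srli_epi16 (t, 1); }

static inline __m256i row_shift (Tile16 t)  // lane i receives row i + 1
{
    __m256i hi_to_lo = _mm256_permute2x128_si256 (t, t, 0x81);  // {high lane, zero}
    return _mm256_alignr_epi8 (hi_to_lo, t, 2);                 // 16-bit shift across all 256 bits
}

// The neighbor counting itself can reuse the same XOR/AND full-adder
// tricks as the 64-bit-row iterator, just on __m256i values.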

EDIT: I've attached an iterator for a 256-bit integer, for a 16x16 tile. It's somewhat faster in BCO/s than what comes out of the compiler's auto-vectorization, by about 12% (if I count it as 16x16 it runs at ~225 BCO/s). This means that using the __m256i type, it should be possible to improve the 64x64 iterator by 12%. Anyway, I'm inclined to add p2 awareness to the 14x14 tiles for general "soup" iteration boosting, and a hash for search utilities.

LifeCube still remains a mystery. I think it's best to use the same logical operations and see where it goes wrong.

EDIT2: I've modified LifeCube to significantly reduce the number of operations in the iterator, and it didn't improve the performance even a bit. It seems LifeCube has a problem with memory access (L1 cache etc.), which takes at least twice the time of the iterator itself.
Attachments
GolAVX14.zip
(26.71 KiB) Downloaded 311 times

JohanB
Posts: 16
Joined: June 10th, 2016, 11:59 pm

Re: LifeCube - new fast AVX iterator (in C++)

Post by JohanB » February 19th, 2017, 1:12 pm

I've had a look at your rather marvellous code and I've come up with the following implementation.

Code: Select all

section .data
 ;start: dw 0xEEEE, 0xEEEE,0xEEEE, 0xEEEE,0xEEEE, 0xEEEE,0xEEEE, 0xEEEE 
  start:dw 0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff;dw 0x3333,0x3333,0x3333,0x3333,0x3333,0x3333,0x3333,0x3333
  nw:   dw 0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff
  n:    dw 0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff;dw 0x5555,0x5555,0x5555,0x5555,0x5555,0x5555,0x5555,0x5555
  w:    dw 0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff;dw 1,1,1,1,1,1,1,1
  se:   dw 0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff
  s:    dw 0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff;dw 0x5555,0x5555,0x5555,0x5555,0x5555,0x5555,0x5555,0x5555
  e:    dw 0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFff,0xFFfe;dw 1,1,1,1,1,1,1,1
  result:dw 0,0,0,0,0,0,0,0
section .text
global CMAIN
CMAIN:
    mov rbp, rsp; for correct debugging
    ; A cell looks like this:
    ;   DCB   876                 
    ;   AxA   504
    ;   DCB   321
    ; we're using half-adder logic to store the 1, 2 and 4's count in 3 bitplanes.
    movdqu xmm15,[start]          ;***** xmm15 8-D'
    movdqu xmm1,[s]
    movdqu xmm2,[e]
    movdqu xmm3,[se]
    ;now create 9 planes, each one slightly shifted.
    ;xmm0 is the center plane.
    vpsrldq xmm3,xmm3,16-4       ;keep the bottom 2 rows of NW and shift them to the top
    vpsrldq xmm6,xmm1,16-2       ;N5    keep the bottom 1 rows of N and shift them to the top.
    vpsrldq xmm1,xmm1,16-4       ;N3    keep the bottom 2 rows of N and shift them to the top.
    vpsrlw xmm2,xmm2,14          ;W6    keep the 2 rightmost rows of W
    vpsrlw xmm3,xmm3,14          ;NW1   keep the 2 rightmost rows of NW
    vpslldq xmm5,xmm15,4         ;main3
    vpslldq xmm4,xmm15,2         ;main5
    vpxor xmm14,xmm1,xmm5        ;***** xmm14 3 - D
    vpxor xmm13,xmm4,xmm6        ;***** xmm13 5 - A'
    ;we are done with N, xmm1
    vpsrlw xmm1,xmm2,1           ;W7
    vpsllw xmm7,xmm15,1          ;main7
    vpsllw xmm8,xmm13,1          ;main0+N0
    vpsllw xmm9,xmm14,1          ;main2+N2
    vpxor xmm12,xmm7,xmm1        ;***** xmm12 7 - C
    vpsllw xmm7,xmm7,1           ;main6
    vpxor xmm11,xmm7,xmm2        ;***** xmm11 6 - B'
    vpslldq xmm10,xmm11,2        ;main4+w4
    vpsrldq xmm7,xmm3,2          ;NW4
    vpsllw xmm6,xmm6,2           ;N4
    vpxor xmm10,xmm10,xmm7       ;main4+w4+N4
    vpxor xmm10,xmm10,xmm6       ;***** xmm10 4 - A
    vpslldq xmm1,xmm1,2          ;W0
    vpsrlw xmm7,xmm7,1           ;NW0
    vpxor xmm0,xmm8,xmm1         ;main0+N0+W0
    vpxor xmm0,xmm0,xmm7         ;***** xmm0 0 - x
    vpslldq xmm1,xmm2,4          ;W1
    vpsllw xmm8,xmm9,1           ;main1+N1
    vpxor xmm8,xmm8,xmm1         ;main1+N1+W1
    vpxor xmm8,xmm8,xmm3         ;**** xmm8 1 - B
    vpsrlw xmm4,xmm1,1           ;W2
    vpsrlw xmm5,xmm3,1           ;NW2
    vpxor xmm1,xmm4,xmm5         ;W2+NW2
    vpxor xmm9,xmm9,xmm1         ;main2+N2+W2+NW2 
    ;First get all the counts
    vpxor xmm1,xmm8,xmm9         ;1's count of c
    vpand xmm2,xmm8,xmm9         ;2's count of c
    vpxor xmm3,xmm10,xmm11       ;1's count of a
    vpand xmm4,xmm10,xmm11       ;2's count of a
    vpxor xmm5,xmm12,xmm13       ;1's count of b
    vpand xmm6,xmm12,xmm13       ;2's count of b
    vpxor xmm7,xmm14,xmm15       ;1's count of d
    vpand xmm8,xmm14,xmm15       ;2's count of d
    ;Now add the 1's together
    vpand xmm10,xmm1,xmm3        ;2's count of CA
    vpxor xmm1,xmm1,xmm3         ;combined ones of CA
    vpand xmm12,xmm5,xmm7        ;2's count of BD
    vpxor xmm5,xmm5,xmm7         ;combined ones of BD
    vpand xmm14,xmm1,xmm5        ;2's count of CABD
    vpxor xmm1,xmm1,xmm5         ;final count of the 1's
    ;now we need to add all the 2's together.
    vpand xmm3,xmm2,xmm4         ;4's count of ca
    vpxor xmm2,xmm2,xmm4         ;2's count of ca
    vpand xmm5,xmm6,xmm8         ;4's count of bd
    vpxor xmm6,xmm6,xmm8         ;2's count of bd
    vpand xmm7,xmm10,xmm12       ;4's count of CABD
    vpxor xmm8,xmm10,xmm12       ;2's count of CABD
    vpand xmm9,xmm2,xmm6         ;4's count of cabd
    vpxor xmm4,xmm2,xmm6         ;2's count of cabd
    vpand xmm11,xmm8,xmm14       ;4's count of CABD+abcd
    vpxor xmm12,xmm8,xmm14       ;2's count of CABD+abcd
    ;add all 4's
    vpor xmm15,xmm3,xmm5
    vpor xmm13,xmm7,xmm9
    vpor xmm14,xmm11,xmm15
    ;add the 2's
    vpxor xmm2,xmm12,xmm4
    ;final add
    vpor xmm4,xmm14,xmm13
    ;now we have all the counts.
    ;register usage:
    ;xmm0 - original pattern
    ;xmm1 - count of the ones
    ;xmm2 - count of the twos
    ;xmm4 - count of the fours
    vpand xmm0,xmm0,xmm2         ;anything with a 2 stays the same
    vpand xmm3,xmm2,xmm1         ;anything with a 3 is alive
    vpor xmm0,xmm0,xmm3          ;add the alive cells
    vpandn xmm0,xmm4,xmm0        ;subtract the cells with 4 or more neighbors

    
        
            
    ;ret
    
    ; A cell looks like this:
    ;   BCD   123                
    ;   AxA   405
    ;   BCD   678
    ; we're using half-adder logic to store the 1, 2 and 4's count in 3 bitplanes.
    movdqu xmm15,[start]          ;***** xmm15 8-D'
    movdqu xmm1,[n]
    movdqu xmm2,[w]
    movdqu xmm3,[nw]
    ;now create 9 planes, each one slightly shifted.
    ;xmm0 is the center plane.
    vpslldq xmm3,xmm3,16-4       ;keep the bottom 2 rows of NW and shift them to the top
    vpslldq xmm6,xmm1,16-2       ;N5    keep the bottom 1 rows of N and shift them to the top.
    vpslldq xmm1,xmm1,16-4       ;N3    keep the bottom 2 rows of N and shift them to the top.
    vpsllw xmm2,xmm2,14          ;W6    keep the 2 rightmost rows of W
    vpsllw xmm3,xmm3,14          ;NW1   keep the 2 rightmost rows of NW
    vpsrldq xmm5,xmm15,4         ;main3
    vpsrldq xmm4,xmm15,2         ;main5
    vpxor xmm14,xmm1,xmm5        ;***** xmm14 3 - D
    vpxor xmm13,xmm4,xmm6        ;***** xmm13 5 - A'
    ;we are done with N, xmm1
    vpsllw xmm1,xmm2,1           ;W7
    vpsrlw xmm7,xmm15,1          ;main7
    vpsrlw xmm8,xmm13,1          ;main0+N0
    vpsrlw xmm9,xmm14,1          ;main2+N2
    vpxor xmm12,xmm7,xmm1        ;***** xmm12 7 - C
    vpsrlw xmm7,xmm7,1           ;main6
    vpxor xmm11,xmm7,xmm2        ;***** xmm11 6 - B'
    vpsrldq xmm10,xmm11,2        ;main4+w4
    vpslldq xmm7,xmm3,2          ;NW4
    vpsrlw xmm6,xmm6,2           ;N4
    vpxor xmm10,xmm10,xmm7       ;main4+w4+N4
    vpxor xmm10,xmm10,xmm6       ;***** xmm10 4 - A
    vpsrldq xmm1,xmm1,2          ;W0
    vpsllw xmm7,xmm7,1           ;NW0
    vpxor xmm0,xmm8,xmm1         ;main0+N0+W0
    vpxor xmm0,xmm0,xmm7         ;***** xmm0 0 - x
    vpsrldq xmm1,xmm2,4          ;W1
    vpsrlw xmm8,xmm9,1           ;main1+N1
    vpxor xmm8,xmm8,xmm1         ;main1+N1+W1
    vpxor xmm8,xmm8,xmm3         ;**** xmm8 1 - B
    vpsllw xmm4,xmm1,1           ;W2
    vpsllw xmm5,xmm3,1           ;NW2
    vpxor xmm1,xmm4,xmm5         ;W2+NW2
    vpxor xmm9,xmm9,xmm1         ;main2+N2+W2+NW2 
    ;First get all the counts
    vpxor xmm1,xmm8,xmm9         ;1's count of c
    vpand xmm2,xmm8,xmm9         ;2's count of c
    vpxor xmm3,xmm10,xmm11       ;1's count of a
    vpand xmm4,xmm10,xmm11       ;2's count of a
    vpxor xmm5,xmm12,xmm13       ;1's count of b
    vpand xmm6,xmm12,xmm13       ;2's count of b
    vpxor xmm7,xmm14,xmm15       ;1's count of d
    vpand xmm8,xmm14,xmm15       ;2's count of d
    ;Now add the 1's together
    vpand xmm10,xmm1,xmm3        ;2's count of CA
    vpxor xmm1,xmm1,xmm3         ;combined ones of CA
    vpand xmm12,xmm5,xmm7        ;2's count of BD
    vpxor xmm5,xmm5,xmm7         ;combined ones of BD
    vpand xmm14,xmm1,xmm5        ;2's count of CABD
    vpxor xmm1,xmm1,xmm5         ;final count of the 1's
    ;now we need to add all the 2's together.
    vpand xmm3,xmm2,xmm4         ;4's count of ca
    vpxor xmm2,xmm2,xmm4         ;2's count of ca
    vpand xmm5,xmm6,xmm8         ;4's count of bd
    vpxor xmm6,xmm6,xmm8         ;2's count of bd
    vpand xmm7,xmm10,xmm12       ;4's count of CABD
    vpxor xmm8,xmm10,xmm12       ;2's count of CABD
    vpand xmm9,xmm2,xmm6         ;4's count of cabd
    vpxor xmm4,xmm2,xmm6         ;2's count of cabd
    vpand xmm11,xmm8,xmm14       ;4's count of CABD+abcd
    vpxor xmm12,xmm8,xmm14       ;2's count of CABD+abcd
    ;add all 4's
    vpor xmm15,xmm3,xmm5
    vpor xmm13,xmm7,xmm9
    vpor xmm14,xmm11,xmm15
    ;add the 2's
    vpxor xmm2,xmm12,xmm4
    ;final add
    vpor xmm4,xmm14,xmm13
    ;now we have all the counts.
    ;register usage:
    ;xmm0 - original pattern
    ;xmm1 - count of the ones
    ;xmm2 - count of the twos
    ;xmm4 - count of the fours
    vpand xmm7,xmm0,xmm2         ;anything with a 2 stays the same
    vpand xmm3,xmm2,xmm1         ;anything with a 3 is alive
    pxor xmm15,xmm15             ;zero xmm15 for the comparison against 0
    vpor xmm7,xmm7,xmm3          ;add the alive cells
    xor eax,eax
    ;bit 7 of al will be shifted into bit 16 of eax at a later stage and so on....
    ;The status bit work as follows
    ;7----6----5----4----3----2----1----0
    ;+    changed      |  alive         +
    ;NW   W    N   all   NW   W    N   all
    vpandn xmm4,xmm4,xmm7        ;subtract the cells with 4 or more neighbors
    vpxor xmm0,xmm7,xmm0         ;see which cells have changed
    vpsrldq xmm5,xmm4,16-4       ;new N
    vpsrldq xmm1,xmm0,16-4       ;changes in N
    vpsrlw xmm6,xmm4,16-2        ;new W
    vpsrlw xmm2,xmm0,16-2        ;changes in W
    vpsrlw xmm7,xmm5,16-2        ;new NW
    vpsrlw xmm3,xmm1,16-2        ;changes in NW (from the shifted changes-in-N plane)
    ; test the 2 qwords in each vector against zero
    vpcmpeqq xmm11, xmm1, xmm15
    vpcmpeqq xmm12, xmm3, xmm15
    vpcmpeqq xmm13, xmm5, xmm15
    vpcmpeqq xmm14, xmm7, xmm15

    ; blend the results down into xmm10   word origin
    vpblendw xmm10, xmm11, xmm12, 0xAA   ; 3131 3131
    vpblendw xmm13, xmm13, xmm14, 0xAA   ; 7575 7575
    vpblendw xmm10, xmm10, xmm13, 0xCC   ; 7531 7531

    ; test the 2 qwords in each vector against zero
    vpcmpeqq xmm11, xmm2, xmm15
    vpcmpeqq xmm12, xmm4, xmm15
    vpcmpeqq xmm13, xmm6, xmm15
    vpcmpeqq xmm14, xmm8, xmm15

    ; blend the results down into xmm11   word origin
    vpblendw xmm11, xmm11, xmm12, 0xAA   ; 4242 4242
    vpblendw xmm13, xmm13, xmm14, 0xAA   ; 8686 8686
    vpblendw xmm11, xmm11, xmm13, 0xCC   ; 8642 8642
    
    ;combine bytes of xmm10 and xmm11 together into xmm10, byte wise
    ;
    ; xmm10 77553311 77553311
    ; xmm11 88664422 88664422  before shift
    ; xmm10 07050301 07050301
    ; xmm11 80604020 80604020  after shift
    ;result 87654321 87654321  result, note that HL don't have to be the same
    vpsrlw xmm10,xmm10,8
    vpsllw xmm11,xmm11,8
    vpor xmm10,xmm10,xmm11
    
    ;combine the low and high dqword to make sure both are zero. 
    vpsrldq xmm11,xmm10,8
    vpand xmm10,xmm11
    vpmovmskb eax,xmm10
    not eax
    ;we now have a bitmap of area's that are alive and changing.
    
    ;count the number of live and changed cells in the result.
    ;count the live cells
    vpsrldq xmm8,xmm7,8          ;xmm7 = result, xmm8-high result
    movq rdx,xmm7                ;low part
    movq rcx,xmm8                ;high part
    popcnt rdx,rdx               ;count the first half
    popcnt rcx,rcx               ;ecx = high bytes
    add ecx,edx                  ;ecx = live cell count 0..128
    vpsrldq xmm9,xmm0,8          ;xmm0 = changes, xmm9-high changes
    movq rdx,xmm0                ;low part
    movq r8,xmm9                 ;high part
    popcnt rdx,rdx               ;count the low cells
    popcnt r8,r8                 ;count the high cells
    add rdx,r8                   ;get the changed bit count 0..128 
The way it works is that I stagger the even/odd generations like so:

Code: Select all

    A A A A 
     B B B B
    A A A A
     B B B B 
    A A A A
     B B B B
    A A A A
     B B B B
This is the same trick that qlife/life32 and henselLife use.
With some shifting I can then create 8 different bitmaps for the neighborhoods and look up 128 bits in one go.
AVX2 would allow 256 bits in one go, but my CPU runs ymm code twice as slow as xmm code, so I've skipped AVX2 for now.
If you add a bitplane for the 8's count of the neighborhood cells, you can cover all the totalistic variants of Life.
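For instance (a sketch of the idea, assuming exact bitplanes c1/c2/c4/c8 of the 8-neighbor count rather than the saturated counts in the listing above), B3/S23 reduces to a few mask operations:

Code: Select all

#include <cstdint>

// c1, c2, c4, c8 hold the binary digits of every cell's 8-neighbor count,
// one bitplane each, so any totalistic rule becomes a short mask expression.
uint64_t step_b3s23 (uint64_t alive, uint64_t c1, uint64_t c2, uint64_t c4, uint64_t c8)
{
    uint64_t n3 = ~c8 & ~c4 & c2 & c1;   // exactly 3 neighbors: birth or survival
    uint64_t n2 = ~c8 & ~c4 & c2 & ~c1;  // exactly 2 neighbors: survival only
    return n3 | (alive & n2);
}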

On the laptop it runs 128 bits in 38 cycles, which is impressive.
With a bit more manipulation I can do the bookkeeping for the 128-bit units as well, and even throw in cell counting, but that will not be in the main running code.

I plan to use the code for a display program and a pattern searcher, but I'm already very happy with the result.
It will also run on a GPU with some adjustments which should speed things up quite a bit more.

Anyway, thanks for your suggestion of using bitplanes to keep track of the 1, 2, 4 and 8 neighborhood bitcounts.

You can fix the problems with memory access time by doing more bookkeeping inside the calculations and using it to avoid recalculating stable/dead/p2 units.
Also, staggering your even and odd generations cuts the memory load by more than half (you only have to access 3 neighbor units, not 8).
By calculating all 128 (256) bits rather than 84 (196), you'll spend more time calculating but save double on memory access when piecing patterns together.
Finally, by accessing the odd-generation units in the opposite order of the even units, you'll optimize the cache coherency.
(This is what qlife (in Golly) does.)
