apgsearch v5.0

calcyman
Moderator
Posts: 2932
Joined: June 1st, 2009, 4:32 pm

apgsearch v5.0

Post by calcyman » March 20th, 2019, 5:15 pm

To finally lay to rest a 3-month-old issue raised by Johan Bontes, I've written a GPU implementation for running asymmetric b3s23 soups in a bounded universe. This is based on the upattern algorithm used in lifelib, but with several differences:
  • The universe has a finite size of 19, 37, 91, or 127 tiles (beginning at the smallest and resizing as necessary);
  • The outermost tiles of the current universe size (and only these) have escaping glider/*WSS detection;
  • Tiles are 64x64 instead of 32x32;
  • The centre-to-centre spacing between tiles is 52 instead of 28;
  • Tiles are advanced by 6 generations at a time instead of 2 (the 12-cell overlap between adjacent 64x64 tiles is exactly what 6 generations of light-speed influence from each side requires, just as the old 4-cell overlap matched the 2-generation steps);
  • If there is a single active tile in the universe, it is 'supercharged' (iterated multiple times until the active reaction reaches the edge of the tile);
  • As a special case, the initial 16x16 soup is iterated for 18 generations before being emplaced into the universe.
Work is delegated to the GPU in increments of 1 000 000 soups (where up to 16384 of these run concurrently). Each soup is run until it either stabilises (becomes period-6), reaches the boundary, or 12000 generations have elapsed. In the first case the soup is deemed 'boring'; in the second and third cases, it is deemed 'interesting'. Roughly 99.73% of soups are classified as 'boring', with the other 0.27% being 'interesting'.
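Schematically, one epoch of the host-side loop looks something like this (a minimal sketch; the function names are illustrative rather than the actual lifelib interface):

Code: Select all

#include <cstdint>
#include <vector>

// Hypothetical wrappers around the real GPU kernels and CPU searcher:
void gpu_classify_soups(uint64_t first_seed, uint64_t count,
                        std::vector<uint64_t>& interesting);
void cpu_search_soup(uint64_t seed);

void run_epoch(uint64_t seed_base) {
    const uint64_t EPOCH_SIZE = 1000000; // soups delegated to the GPU per epoch
    const uint64_t BATCH = 16384;        // soups resident on the GPU at once

    std::vector<uint64_t> interesting;   // seeds which survive the pre-filter

    for (uint64_t i = 0; i < EPOCH_SIZE; i += BATCH) {
        // Each soup runs until it becomes period-6 ('boring'), reaches the
        // boundary, or lasts 12000 generations (the latter two: 'interesting').
        gpu_classify_soups(seed_base + i, BATCH, interesting);
    }

    // Roughly 0.27% of soups survive the filter and get a proper CPU search:
    for (uint64_t seed : interesting) { cpu_search_soup(seed); }
}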

If you compile apgsearch v5.0 with the invocation ./recompile.sh --cuda, it will use the GPU to pre-filter the soups and the CPU to properly search the interesting ones. (To ensure the CPU is not the bottleneck, 4 threads are allocated to processing the stream of interesting soups; this can be changed using the -p flag as usual.) The hauls are uploaded under the census G1 instead of C1, because throwing out 99.73% of the soups causes a distortion of the census statistics. Nevertheless, the soups are generated in exactly the same way (/hashsoup responds identically when given a G1 or C1 soup). Here's an example haul:

https://catagolue.appspot.com/haul/b3s2 ... 5db8652cc1

As you can see, the pentadecathlon is almost as common as the pulsar in G1, but definitely not in C1.

This can process approximately 385 000 soups per second on an NVIDIA V100 (Volta) GPU. I haven't tried it on other GPUs, but it should work fine provided you compile with CUDA >= 9 and your GPU has at least 1.5 gigabytes of memory. (If it doesn't, try ./apgluxe -u 4096 to run half as many universes.)
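So, to summarise, the whole procedure on a supported system is:

Code: Select all

./recompile.sh --cuda     # build with GPU support
./apgluxe -p 4            # GPU pre-filter, 4 CPU threads for interesting soups
./apgluxe -p 4 -u 4096    # the same, with half as many concurrent universes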
What do you do with ill crystallographers? Take them to the mono-clinic!

Moosey
Posts: 4306
Joined: January 27th, 2019, 5:54 pm
Location: here

Re: apgsearch v5.0

Post by Moosey » March 20th, 2019, 5:31 pm

What will this one be called? Apgcellent?
not active here but active on discord

Sokwe
Moderator
Posts: 2643
Joined: July 9th, 2009, 2:44 pm

Re: apgsearch v5.0

Post by Sokwe » March 20th, 2019, 5:53 pm

Hopefully I'm not being too presumptuous, but are there any plans for a symmetric soup pre-processor?
-Matthias Merzenich

AforAmpere
Posts: 1334
Joined: July 1st, 2016, 3:58 pm

Re: apgsearch v5.0

Post by AforAmpere » March 20th, 2019, 5:58 pm

This is interesting; I'll have to try it out. For names, maybe: apgcard, apgflux, apgswitch, apgpu.
I manage the 5S project, which collects all known spaceship speeds in Isotropic Non-totalistic rules. I also wrote EPE, a tool for searching in the INT rulespace.

Things to work on:
- Find (7,1)c/8 and 9c/10 ships in non-B0 INT.
- EPE improvements.

muzik
Posts: 5612
Joined: January 28th, 2016, 2:47 pm
Location: Scotland

Re: apgsearch v5.0

Post by muzik » March 20th, 2019, 6:15 pm

I'm very disappointed that this isn't apgsearch v5.-1, given how the other threads were going.

Also, won't this have to be called apg(four-letter word beginning with k), in order to keep the cycle going since nano?

wildmyron
Posts: 1542
Joined: August 9th, 2013, 12:45 am
Location: Western Australia

Re: apgsearch v5.0

Post by wildmyron » March 21st, 2019, 3:26 am

Fantastic work, calcyman. I'm continually inspired by how you are able to adopt new programming tools and really get the most out of them in such a short time frame. I've tested apgsearch v5.0 on an MSI laptop with a Core i7-6700HQ CPU @ 2.60GHz, 8GB RAM, and a GeForce GTX 960M (with 2GB VRAM and 640 CUDA cores). I'm running Ubuntu 16.04, which I admit is a bit out of date. When I initially tried to compile I found I was missing nvcc, and installed it with

Code: Select all

sudo apt install nvidia-cuda-toolkit
This installed a bunch of other NVIDIA tools and the NVIDIA driver, which I also didn't have. I wasn't expecting installation of the CUDA toolkit to result in a new Linux kernel, but everything seems to be working after a reboot.

I had some difficulty compiling. The first error I got was:

Code: Select all

nvcc -c includes/gpusrc.cu -o includes/gpusrc.o
includes/../lifelib/cuda/basics.h(14): error: this declaration has no storage class or type specifier
<snip>
Initially I resolved this by commenting out the asserts, but I uncommented them after fixing the last error.

The second problem was that my CUDA version was too old: Ubuntu 16.04 comes with CUDA 7.5. I probably should have installed it manually, but instead I've updated the CUDA-related packages to the Ubuntu 18.04 versions. I've heard this can cause problems with library version mismatches, but it seems to be OK in this case. The specific feature missing from v7.5 (and also v8.0) was __ballot_sync. This comment indicated that the feature was only needed on Volta, and its use was unnecessary for older GPUs. Considering my personal results below, it might not be worth making this backward compatible.

The last error was:

Code: Select all

nvcc -c includes/gpusrc.cu -o includes/gpusrc.o
includes/../lifelib/cuda/gs_impl.h(48): error: explicit type is missing ("int" assumed)

includes/../lifelib/cuda/gs_impl.h(48): error: no suitable conversion function from "std::vector<uint64_t, std::allocator<uint64_t>>" to "int" exists

includes/../lifelib/cuda/gs_impl.h(53): error: no suitable constructor exists to convert from "int" to "std::vector<uint64_t, std::allocator<uint64_t>>"

<snip more similar errors>
I resolved this by adding --std=c++11 to CU_FLAGS in the makefile.

Results
Perhaps not surprisingly, this GPU sees nowhere near the same performance as the Volta. I get between 28500 and 29000 soups/s with 100% GPU utilisation and 100% utilisation of a single core. Even with -p 2 I only see a brief spike in CPU usage while the "interesting" soups are run through apgluxe on the CPU.

I've submitted a few hauls, but I don't think I'll continue with this machine as it is running pretty hot.

Edit: Just read the OP again. Somehow I missed the requirement for v9 CUDA. Still not sure why I explicitly needed to add "--std=c++11" to CU_FLAGS.
The 5S project (Smallest Spaceships Supporting Specific Speeds) is now maintained by AforAmpere. The latest collection is hosted on GitHub and contains well over 1,000,000 spaceships.

Semi-active here - recovering from a severe case of LWTDS.

calcyman
Moderator
Posts: 2932
Joined: June 1st, 2009, 4:32 pm

Re: apgsearch v5.0

Post by calcyman » March 21st, 2019, 6:48 am

wildmyron wrote:Fantastic work calcyman. I'm continually inspired by how you are able to adopt new programming tools and really get the most out of them in such a short time frame.
Thanks! It wasn't easy to debug when things initially didn't work (as you can possibly imagine from seeing the code).
wildmyron wrote:The second problem was that my CUDA version was too old: Ubuntu 16.04 comes with CUDA 7.5. I probably should have installed it manually, but instead I've updated the CUDA-related packages to the Ubuntu 18.04 versions. I've heard this can cause problems with library version mismatches, but it seems to be OK in this case. The specific feature missing from v7.5 (and also v8.0) was __ballot_sync. This comment indicated that the feature was only needed on Volta, and its use was unnecessary for older GPUs. Considering my personal results below, it might not be worth making this backward compatible.

The last error was:

Code: Select all

nvcc -c includes/gpusrc.cu -o includes/gpusrc.o
includes/../lifelib/cuda/gs_impl.h(48): error: explicit type is missing ("int" assumed)

includes/../lifelib/cuda/gs_impl.h(48): error: no suitable conversion function from "std::vector<uint64_t, std::allocator<uint64_t>>" to "int" exists

includes/../lifelib/cuda/gs_impl.h(53): error: no suitable constructor exists to convert from "int" to "std::vector<uint64_t, std::allocator<uint64_t>>"

<snip more similar errors>
I resolved this by adding --std=c++11 to CU_FLAGS in the makefile.
Oh, wow, thanks for getting it to work! So, to summarise, is it sufficient for me to:
  • Add --std=c++11 to the CU_FLAGS in the makefile;
  • Insert some preprocessor statements to use __ballot() instead of __ballot_sync() when the CUDA version is less than 9?
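i.e. something along these lines (untested; __CUDACC_VER_MAJOR__ is defined by nvcc itself, so no extra headers are needed):

Code: Select all

// Warp-vote compatibility shim: CUDA 9 introduced __ballot_sync() and
// deprecated the plain __ballot() on Volta and later architectures.
#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 9)
#define WARP_BALLOT(pred) __ballot_sync(0xffffffffu, (pred))
#else
#define WARP_BALLOT(pred) __ballot((pred))
#endif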
wildmyron wrote:Results
Perhaps not surprisingly, this GPU sees nowhere near the same performance as the Volta. I get between 28500 and 29000 soups/s with 100% GPU utilisation and 100% utilisation of a single core.
This sounds quite reasonable: if you take the (admittedly dubious) metric of multiplying clock speed in MHz by the number of 'CUDA cores', you get 752640 for the 960M and 7014400 for the V100 (so a factor of 9). If you do the same with SMs instead of CUDA cores, you get a factor of 18. This suggests that the V100 is somewhere between 9 and 18 times more powerful than the 960M, which is entirely consistent with the results.

Either way, it's still as good as 4 state-of-the-art CPU cores.

Does this change at all if you use -u 4096 instead of (the default of) -u 8192?

(Interestingly, the V100 only manages a continuous 90% GPU utilisation with -u 8192, so maybe I should add an option to double the universe count even further.)
wildmyron wrote:Even with -p 2 I only see a brief spike in CPU usage while the "interesting" soups are run through apgluxe on the CPU.

I've submitted a few hauls, but I don't think I'll continue with this machine as it is running pretty hot.
Those hauls look great (i.e. statistically similar to the one I uploaded).

googoIpIex
Posts: 292
Joined: February 28th, 2019, 4:49 pm
Location: Sqrt(-1)

Re: apgsearch v5.0

Post by googoIpIex » March 21st, 2019, 8:40 am

Can I recommend an option to use the GPU and CPU to run C1 symmetry?
woomy on a vroomy

77topaz
Posts: 1496
Joined: January 12th, 2018, 9:19 pm

Re: apgsearch v5.0

Post by 77topaz » March 21st, 2019, 7:24 pm

When I try to run --cuda, I get this error:

Code: Select all

nvcc -c --std=c++11 includes/gpusrc.cu -o includes/gpusrc.o
make: nvcc: No such file or directory
make: *** [includes/gpusrc.o] Error 1
I take it this means I need to download/install something called nvcc?

wildmyron
Posts: 1542
Joined: August 9th, 2013, 12:45 am
Location: Western Australia

Re: apgsearch v5.0

Post by wildmyron » March 21st, 2019, 7:31 pm

77topaz wrote:When I try to run --cuda, I get this error:

Code: Select all

nvcc -c --std=c++11 includes/gpusrc.cu -o includes/gpusrc.o
make: nvcc: No such file or directory
make: *** [includes/gpusrc.o] Error 1
I take it this means I need to download/install something called nvcc?
Yes, you need to install the CUDA toolkit. https://docs.nvidia.com/cuda/cuda-insta ... index.html
I have no experience with this on MacOS, but that link should get you going.

Edit: and as a prerequisite, you need an NVIDIA GPU.

77topaz
Posts: 1496
Joined: January 12th, 2018, 9:19 pm

Re: apgsearch v5.0

Post by 77topaz » March 21st, 2019, 7:47 pm

I have an NVIDIA GPU, but macOS 10.12.6 instead of 10.13. However, I do occasionally get the option to upgrade to Mojave (10.14), so that means my hardware should be able to support it - I just haven't decided whether upgrading to 10.14 is worth it (I used to have a Windows computer, and when I upgraded that from Windows 7 to Windows 10, it came with a whole host of annoying problems, particularly frequent crashes).

wildmyron
Posts: 1542
Joined: August 9th, 2013, 12:45 am
Location: Western Australia

Re: apgsearch v5.0

Post by wildmyron » March 22nd, 2019, 5:52 am

calcyman wrote:Oh, wow, thanks for getting it to work! So, to summarise, is it sufficient for me to:
  • Add --std=c++11 to the CU_FLAGS in the makefile;
  • Insert some preprocessor statements to use __ballot() instead of __ballot_sync() when the CUDA version is less than 9?
The first is sufficient to make it easy for anyone with CUDA >= v9 to compile. The second is sufficient to allow use of CUDA < v9. I've tested compilation with the Ubuntu 16.04-supplied CUDA v7.5: it compiled with no errors using the current HEAD. You may like to mention that --cuda won't work on WSL (at this point in time). It's not clear to me if cygwin or msys could be used to build apgluxe but according to the docs, compilers other than cl.exe are not supported on Windows.

Using this build I get about 30000 soups/s. Were there any code changes in the last day? I didn't test the current code with v9 of the CUDA toolkit so I can't say if the change was purely due to the different toolkit version.
calcyman wrote:
wildmyron wrote:Results
Perhaps not surprisingly, this GPU sees nowhere near the same performance as the Volta. I get between 28500 and 29000 soups/s with 100% GPU utilisation and 100% utilisation of a single core.
This sounds quite reasonable: if you take the (admittedly dubious) metric of multiplying clock speed in MHz by the number of 'CUDA cores', you get 752640 for the 960M and 7014400 for the V100 (so a factor of 9). If you do the same with SMs instead of CUDA cores, you get a factor of 18. This suggests that the V100 is somewhere between 9 and 18 times more powerful than the 960M, which is entirely consistent with results.

Either way, it's still as good as 4 state-of-the-art CPU cores.

Does this change at all if you use -u 4096 instead of (the default of) -u 8192?
Using "-u 4096" I see little to no change in the GPU and CPU utilisation, but I do I see an increase of about 1000 soups/s. This is consistent for both builds with the v7.5 and v9 version of the CUDA toolkit. Sorry I can't be more precise - there's seems to be a long delay at the start before soups start being run which results in the first soup speed measure being about half the average. With only 20 million soups this has a significant impact on the overall soup speed.

I should also mention that all these tests were using nvidia-driver 415.27 which supports CUDA v10.
77topaz wrote:I have an NVIDIA GPU, but macOS 10.12.6 instead of 10.13. However, I do occasionally get the option to upgrade to Mojave (10.14), so that means my hardware should be able to support it - I just haven't decided whether upgrading to 10.14 is worth it (I used to have a Windows computer, and when I upgraded that from Windows 7 to Windows 10, it came with a whole host of annoying problems, particularly frequent crashes).
Can't say anything about the upgrade option, but I'm pretty sure you could use an older version of the nvidia driver and CUDA toolkit. The first v9.0 version of both supports MacOS 10.12.
https://www.nvidia.com/object/macosx-cu ... river.html
https://developer.nvidia.com/cuda-90-download-archive

calcyman
Moderator
Posts: 2932
Joined: June 1st, 2009, 4:32 pm

Re: apgsearch v5.0

Post by calcyman » March 22nd, 2019, 7:56 am

wildmyron wrote:You may like to mention that --cuda won't work on WSL (at this point in time). It's not clear to me if cygwin or msys could be used to build apgluxe but according to the docs, compilers other than cl.exe are not supported on Windows.
Argh -- that's frustrating, because MSVC doesn't support x86_64 inline assembly, so the CPU and GPU parts would need to be compiled using different compilers (and somehow linked). Maybe the most straightforward way to get this working on Windows is to make the GPU searcher into a separate executable, and pipe the soup seeds into apgsearch (stdin-style).
wildmyron wrote:Using this build I get about 30000 soups/s. Were there any code changes in the last day? I didn't test the current code with v9 of the CUDA toolkit so I can't say if the change was purely due to the different toolkit version.
Almost certainly it's down to the CUDA toolkit version; I didn't change any of the GPU code.
Using "-u 4096" I see little to no change in the GPU and CPU utilisation, but I do I see an increase of about 1000 soups/s. This is consistent for both builds with the v7.5 and v9 version of the CUDA toolkit. Sorry I can't be more precise - there's seems to be a long delay at the start before soups start being run which results in the first soup speed measure being about half the average.
Oh, yes: that's because the GPU needs to process a full epoch of 10^6 soups before the CPU can start searching the interesting soups. I think that can be rectified by simply starting the clock after the GPU has processed the first epoch:

https://gitlab.com/apgoucher/apgmera/co ... d64726a5dd
Sokwe wrote:Hopefully I'm not being too presumptuous, but are there any plans for a symmetric soup pre-processor?
This could be done, and it wouldn't be too difficult: it would just involve writing an instance of copyhashes() for each new symmetry:

https://gitlab.com/apgoucher/lifelib/bl ... .h#L30-116

with the 18 initial generations replaced by 4 (or removed completely for simplicity?).

calcyman
Moderator
Posts: 2932
Joined: June 1st, 2009, 4:32 pm

Re: apgsearch v5.0

Post by calcyman » March 23rd, 2019, 2:10 pm

Version 5.01 includes a considerably refactored version of includes/searching.h with the following effects:
  • Soup-searching speed and progress are displayed in parallel mode as well as sequential mode;
  • In parallel mode, the soups are no longer statically split between cores, but rather pulled from a concurrent queue by idle threads (see the sketch below). As such, parallelism no longer suffers from the issue whereby many cores sit idle towards the end of a haul.
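The underlying idea is roughly the following (a minimal sketch using an atomic counter in place of the actual queue in includes/searching.h):

Code: Select all

#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

void search_soup(uint64_t seed);        // hypothetical per-soup search

std::atomic<uint64_t> next_soup{0};     // shared counter acting as the 'queue'

void worker(uint64_t haul_size) {
    for (;;) {
        uint64_t i = next_soup.fetch_add(1); // claim the next unsearched soup
        if (i >= haul_size) { break; }       // haul exhausted; retire
        search_soup(i);
    }
}

void run_haul(uint64_t haul_size, int threads) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) { pool.emplace_back(worker, haul_size); }
    for (auto& th : pool) { th.join(); } // no core idles until the very end
}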
EDIT: 5.02 fixes a bug where Q-pressing would neglect to upload progress to Catagolue.

calcyman
Moderator
Posts: 2932
Joined: June 1st, 2009, 4:32 pm

Re: apgsearch v5.0

Post by calcyman » March 24th, 2019, 11:12 am

Version 5.03 incorporates a change by testitemqlstudop which records methuselahs with final populations of at least 3000 cells. Note that the determined 'final population' may be anywhere between the liminf and limsup of the population (inclusive); for instance, when a soup contains a pulsar, these differ by as much as 24 (the pulsar's three phases have populations of 48, 56, and 72).

If anyone has strong feelings about whether it should be the limsup, liminf, or average final population, let me know.

Ian07
Moderator
Posts: 891
Joined: September 22nd, 2018, 8:48 am
Location: New Jersey, US

Re: apgsearch v5.0

Post by Ian07 » March 25th, 2019, 7:16 pm

I think there's been a bit of a slowdown since v4.9 in b3s23/D8_4 (and of course, possibly other rules/symmetries); I remember getting well over 7k soups/sec/core in v4.9 but have only been getting about 6.5k in v5.03.

wildmyron
Posts: 1542
Joined: August 9th, 2013, 12:45 am
Location: Western Australia

Re: apgsearch v5.0

Post by wildmyron » March 26th, 2019, 5:42 am

calcyman wrote:
wildmyron wrote:Sorry I can't be more precise - there seems to be a long delay at the start before soups start being run, which results in the first soup speed measure being about half the average.
Oh, yes: that's because the GPU needs to process a full epoch of 10^6 soups before the CPU can start searching the interesting soups. I think that can be rectified by simply starting the clock after the GPU has processed the first epoch:

https://gitlab.com/apgoucher/apgmera/co ... d64726a5dd
Thanks for that change. It rectified the overall soup count during the search - but now there's a bogus value for the last 10^6 soups in the haul, which artificially inflates the final overall soup search speed. This is minor compared to the previous error though, because I can just use the second-to-last value if I want to do any perf comparison.

I was curious about using a smaller universe count on the GTX 960M. The naive code change I made to try this out didn't work out - how complex would it actually be to change the allowable values of this parameter? I'm guessing it involves modifications to copyhashes() in gpupattern.h that aren't at all obvious to me.
calcyman wrote:
wildmyron wrote:You may like to mention that --cuda won't work on WSL (at this point in time). It's not clear to me if cygwin or msys could be used to build apgluxe but according to the docs, compilers other than cl.exe are not supported on Windows.
Argh -- that's frustrating, because MSVC doesn't support x86_64 inline assembly, so the CPU and GPU parts would need to be compiled using different compilers (and somehow linked). Maybe the most straightforward way to get this working on Windows is to make the GPU searcher into a separate executable, and pipe the soup seeds into apgsearch (stdin-style).
Hmm, I meant to include this URL: https://wpdev.uservoice.com/forums/2669 ... pu-support - I haven't seen any recent news from the WSL team on progress towards this feature.

From the reading I've done, it seems it would be possible to link the CPU and GPU parts built with gcc (cross compiled) and nvcc + cl.exe, respectively. I'm not sure if differing name-decoration conventions would cause problems, but building the CPU searcher as a DLL would hopefully avoid that issue. I'd say that would be the preferred option, but I'm sure the multiprocess alternative with a pipe for soup seeds is viable too.

======

Among other objects missing from the G1 census are diehard soups, because they obviously pass the p6 stability test within a short duration and are therefore considered uninteresting. Is it easy (and worthwhile) to pick these up and pass them to the CPU searcher?

calcyman
Moderator
Posts: 2932
Joined: June 1st, 2009, 4:32 pm

Re: apgsearch v5.0

Post by calcyman » March 26th, 2019, 6:19 pm

Ian07 wrote:I think there's been a bit of a slowdown since v4.9 in b3s23/D8_4 (and of course, possibly other rules/symmetries); I remember getting well over 7k soups/sec/core in v4.9 but have only been getting about 6.5k in v5.03.
On a slightly different but related note, I received a bug report saying that HighLife search speed had slowed to basically zero. That's now been fixed in commit a679e82f, so HighLife can now be searched at a comfortable 20 soups/second. Other rules with erratic-growth patterns should see a concomitant return to the former glory of the early v4.x versions.

wildmyron wrote:I was curious about using a smaller universe count on the GTX 960M. The naive code change I made to try this out didn't work out - how complex would it actually be to change the allowable values of this parameter? I'm guessing it involves modifications to copyhashes() in gpupattern.h that aren't at all obvious to me.
Essentially, this part of the code:

Code: Select all

            if (unicount == 8192) {
                logf = 7;
                exclusive_scan_uint64_128<<<64, 128>>>(tilemasks, l1sums, l1totals);
            } else if (unicount == 4096) {
                logf = 6;
                exclusive_scan_uint64<<<64, 64>>>(tilemasks, l1sums, l1totals);
            }
            exclusive_scan_uint64<<<1, 64>>>(l1totals, l2sums, l2sums + 64);

            compact_tiles<<<unicount, 128>>>((uint32_cu*) multiverse, tilemasks, l1sums, l2sums, compacts, logf);

            psc_universes<<<(unicount / 32), 32>>>(to_restore, psums);

            if (unicount == 8192) {
                exclusive_scan_uint32_256<<<1, 256>>>(psums, psums2, l2sums + 65);
            } else if (unicount == 4096) {
                exclusive_scan_uint32<<<1, 128>>>(psums, psums2, l2sums + 65);
            }
needs to be changed if you want a smaller number of universes. In particular, to add 2048 as an option, you'll need to implement a 32-element uint64 exclusive scan and a 64-element uint32 exclusive scan. (Is this necessary due to memory requirements? Setting -u 4096 should already use less than a gigabyte total GPU memory. I'll add the 2048 option myself if there are use-cases.)
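(For reference, the 32-element variant fits comfortably in a single warp; a sketch along these lines would do, using shuffle intrinsics, with the name and launch configuration merely illustrative of the existing kernels:)

Code: Select all

// 32-element uint64 exclusive scan in one warp, launched as <<<1, 32>>>:
// an inclusive Hillis-Steele scan via __shfl_up_sync, shifted to exclusive.
__global__ void exclusive_scan_uint64_32(const uint64_t* in, uint64_t* out,
                                         uint64_t* total) {
    uint64_t x = in[threadIdx.x];
    uint64_t own = x;
    for (int offset = 1; offset < 32; offset <<= 1) {
        uint64_t y = __shfl_up_sync(0xffffffffu, x, offset);
        if (threadIdx.x >= offset) { x += y; }
    }
    out[threadIdx.x] = x - own;            // exclusive prefix sum
    if (threadIdx.x == 31) { *total = x; } // grand total for the next level
}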
wildmyron wrote:Hmm, I meant to include this URL: https://wpdev.uservoice.com/forums/2669 ... pu-support - I haven't seen any recent news from the WSL team on progress towards this feature.

From the reading I've done, it seems it would be possible to link the CPU and GPU parts built with gcc (cross compiled) and nvcc + cl.exe, respectively.
Thanks for the suggestions. Is the idea to build the CPU part using MinGW (in the same way that the precompiled Windows binary is built), but output a dynamic library instead of an executable, and then link to that?
wildmyron wrote:I'm not sure if differing name-decoration conventions would cause problems, but building the CPU searcher as a DLL would hopefully avoid that issue.
Usually you can avoid linkage problems by using extern "C", at the expense of having a more low-level interface between the parts of the program. (That's how python-lifelib works: the lifelib dynamic library exposes several C-like interfaces and calls them from the Python side.)
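For example (an illustrative interface, not lifelib's actual exports):

Code: Select all

#include <cstdint>

// Exposed with C linkage: the exported symbol is unmangled, so code built
// with a different compiler (e.g. nvcc + cl.exe) can link against it.
extern "C" {
    void search_soup_batch(const uint64_t* seeds, uint64_t count); // hypothetical
}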
wildmyron wrote:Among other objects missing from the G1 census are diehard soups, because they obviously pass the p6 stability test within a short duration and are therefore considered uninteresting. Is it easy (and worthwhile) to pick these up and pass them to the CPU searcher?
It should be possible, but would cause a slight slowdown: I'd need to check whether all of the tiles are already empty in the copyhashes() routine -- the one in which I clear the existing universe -- and if so, mark the soup as interesting. The reason for the slowdown is that each write would need to be replaced with the combination of a read and a write.

Of course, another possibility is to move the universe-clearing code out of copyhashes(), and incorporate it into a new function which can actually analyse the ash products of uninteresting soups (and mark 'interesting' if there's anything it can't identify). That would allow genuine C1 searching using the GPU: admittedly marginally slower than searching G1, owing to the extra overhead of ash analysis, but with the huge upside of being compatible with the existing main census.

wildmyron
Posts: 1542
Joined: August 9th, 2013, 12:45 am
Location: Western Australia

Re: apgsearch v5.0

Post by wildmyron » March 27th, 2019, 3:20 am

calcyman wrote:Essentially, this part of the code:

Code: Select all

            if (unicount == 8192) {
                logf = 7;
                exclusive_scan_uint64_128<<<64, 128>>>(tilemasks, l1sums, l1totals);
            } else if (unicount == 4096) {
                logf = 6;
                exclusive_scan_uint64<<<64, 64>>>(tilemasks, l1sums, l1totals);
            }
            exclusive_scan_uint64<<<1, 64>>>(l1totals, l2sums, l2sums + 64);

            compact_tiles<<<unicount, 128>>>((uint32_cu*) multiverse, tilemasks, l1sums, l2sums, compacts, logf);

            psc_universes<<<(unicount / 32), 32>>>(to_restore, psums);

            if (unicount == 8192) {
                exclusive_scan_uint32_256<<<1, 256>>>(psums, psums2, l2sums + 65);
            } else if (unicount == 4096) {
                exclusive_scan_uint32<<<1, 128>>>(psums, psums2, l2sums + 65);
            }
needs to be changed if you want a smaller number of universes. In particular, to add 2048 as an option, you'll need to implement a 32-element uint64 exclusive scan and a 64-element uint32 exclusive scan. (Is this necessary due to memory requirements? Setting -u 4096 should already use less than a gigabyte total GPU memory. I'll add the 2048 option myself if there are use-cases.)
Thanks for the explanation. There's not really a use case for this - just that after some further testing I'm confident that the differences in speed with unicount weren't due to random variation. For "-u 8192" I get 30000 soups/s and for "-u 4096" I get 31000 soups/s. I don't understand why this might be the case and I was curious what would happen for lower values.
calcyman wrote:
wildmyron wrote:From the reading I've done, it seems it would be possible to link the CPU and GPU parts built with gcc (cross compiled) and nvcc + cl.exe, respectively.
Thanks for the suggestions. Is the idea to build the CPU part using MinGW (in the same way that the precompiled Windows binary is built), but output a dynamic library instead of an executable, and then link to that?
Yes that's the idea. If it can be done with a static lib, even better I guess (simpler distribution?). I mentioned the dll option as a fallback, not thinking that you already had such an implementation for Python-lifelib.
calcyman wrote:
wildmyron wrote:Among other objects missing from the G1 census are messless soups, because they obviously pass the p6 stability test within a short duration and are therefore considered uninteresting. Is it easy (and worthwhile) to pick these up and pass them to the CPU searcher?
It should be possible, but would cause a slight slowdown: I'd need to check whether all of the tiles are already empty in the copyhashes() routine -- the one in which I clear the existing universe -- and if so, mark the soup as interesting. The reason for the slowdown is that each write would need to be replaced with the combination of a read and a write.
Hmm, well I wouldn't want something like this at the cost of a measurable slowdown. There are plenty of other objects of interest which would merit detection above diehard soups - p6 oscillators in particular, not to mention that 14-bit still life which still hasn't shown up in C1. I understood from your description that the CUDA code detected stabilisation, and presumed that you record this in some way which can be read at little cost. I thought a similar check for emptiness would be simple and low-cost.
calcyman wrote:Of course, another possibility is to move the universe-clearing code out of copyhashes(), and incorporate it into a new function which can actually analyse the ash products of uninteresting soups (and mark 'interesting' if there's anything it can't identify). That would allow genuine C1 searching using the GPU: admittedly marginally slower than searching G1, owing to the extra overhead of ash analysis, but with the huge upside of being compatible with the existing main census.
That seems preferable to slowing down the search for the sake of diehard detection. On a related note, I was thinking about performance on my and other low-end NVIDIA GPUs.
calcyman wrote:Work is delegated to the GPU in increments of 1 000 000 soups (where up to 16384 of these run concurrently). Each soup is run until it either stabilises (becomes period-6), reaches the boundary, or 12000 generations have elapsed. In the first case the soup is deemed 'boring'; in the second and third cases, it is deemed 'interesting'. Roughly 99.73% of soups are classified as 'boring', with the other 0.27% being 'interesting'.
Because searching is GPU-bound on my system, it seems like reducing the 12000 gen limit (and possibly reducing the max universe size) would increase the GPU soup search speed at the expense of sending more soups to the CPU. I'm interested in exploring the tradeoff, but I'm not sure about what to do with the results. I'd upload them to a symmetry such as "G1_Xkgen_Test" (where 'X' represents the max gen used) so that the statistics for the G1 symmetry remain consistent.

Is the current apgmera code able to output statistics on the fraction of interesting soups, or did you determine this independently?

One last thought about the object statistics. Because Catagolue ranks each census by objects found rather than soups searched, G1 is at a distinct disadvantage in comparison to the other standard symmetries. Not a big deal I suppose, as it seems unlikely G1 will be even close to catching up anytime soon.

====

Has anyone else attempted to compile with --cuda?

calcyman
Moderator
Posts: 2932
Joined: June 1st, 2009, 4:32 pm

Re: apgsearch v5.0

Post by calcyman » March 27th, 2019, 5:20 am

wildmyron wrote:Thanks for the explanation. There's not really a use case for this - just that after some further testing I'm confident that the differences in speed with unicount weren't due to random variation. For "-u 8192" I get 30000 soups/s and for "-u 4096" I get 31000 soups/s. I don't understand why this might be the case and I was curious what would happen for lower values.
My guess is that the lower memory consumption behaves better with regard to the L2 cache.
wildmyron wrote: I understood from your description that the CUDA code detected stabilisation, and presumed that you record this in some way which can be read at little cost. I thought a similar check for emptiness would be simple and low-cost.
Relatively low cost, yes, but nonzero.
wildmyron wrote:Because searching is GPU-bound on my system, it seems like reducing the 12000 gen limit
Maybe I need to say where that '12000' is stored: you firstly divide it by 3 (to get 4000), write it in hexadecimal (as 0xfa0), and append '00000u' to the end of the constant (to get 0xfa000000u). It's located on line 162 of gpupattern.h:

Code: Select all

if (((t >> 48) & am & 63) || (tmp0 > 0xfa000000u)) { expusize = 31; }
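So, for example, to halve the limit to 6000 generations, you'd compute 6000 / 3 = 2000 = 0x7d0 and change the line to:

Code: Select all

if (((t >> 48) & am & 63) || (tmp0 > 0x7d000000u)) { expusize = 31; }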
wildmyron wrote:Is the current apgmera code able to output statistics on the fraction of interesting soups, or did you determine this independently?
I've modified the latest version to report the number of interesting soups per epoch (untested).
wildmyron wrote:Has anyone else attempted to compile with --cuda?
Tom Rokicki ran it on a consumer 1080 GPU, with a soup-searching speed of 126 000 soups/second. (That means it's unjustifiably expensive to buy a Volta V100 for apgsearching; you can just get three 1080s instead for a quarter of the price!)

wildmyron
Posts: 1542
Joined: August 9th, 2013, 12:45 am
Location: Western Australia

Re: apgsearch v5.0

Post by wildmyron » March 27th, 2019, 10:16 pm

Thank you for the details and the updated statistics. I will do some experiments with alternative values in place of the 12000 max gen.
calcyman wrote:
wildmyron wrote:Has anyone else attempted to compile with --cuda?
Tom Rokicki ran it on a consumer 1080 GPU, with a soup-searching speed of 126 000 soups/second. (That means it's unjustifiably expensive to buy a Volta V100 for apgsearching; you can just get three 1080s instead for a quarter of the price!)
Hah, I believe that's true for practically all workloads - except perhaps where the extra memory is required.
Lambda Labs wrote:2080 Ti is a Porsche 911, the V100 is a Bugatti Veyron [1]
Lambda Labs wrote:And if you think I'm going overboard with the Porsche analogy, you can buy a DGX-1 8x V100 for $120,000 or a Lambda Blade 8x 2080 Ti for $28,000 and have enough left over for a real Porsche 911. Your pick.

testitemqlstudop
Posts: 1367
Joined: July 21st, 2016, 11:45 am
Location: in catagolue

Re: apgsearch v5.0

Post by testitemqlstudop » March 27th, 2019, 11:22 pm

How difficult would it be to modify apgsearch so that it skips soups that can't be censused in 10 seconds? I'm searching a b2a rule whose soup search speed swings (alliteration not intended) from 1200 soups/second to 0.012 soups/second (no kidding). The culprit seems to be in yl detection, especially for very high yl periods - this b2a rule lacks replicators.

Alternatively, what would happen if I changed the "maximum pathological attempts" from 5 to something like 2?

wildmyron
Posts: 1542
Joined: August 9th, 2013, 12:45 am
Location: Western Australia

Re: apgsearch v5.0

Post by wildmyron » March 28th, 2019, 4:50 am

wildmyron wrote:Thank you for the details and the updated statistics. I will do some experiments with alternative values in place of the 12000 max gen.
Well, that was disappointing. The short version is that, on my system, the CPU time required to process the additional interesting soups would be better spent running a standard C1 search. My conclusion is that the 12k limit is a pretty good threshold.

Results:

Code: Select all

max gen   speed (soups/s)   CPU soups/epoch   CPU util
12000         31000              2700           ~20%
 6000         31400              4400           ~30%
 4500         31800              8800           ~60%
 3000         34000             24100          ~130%
 2400         35700             39100          ~210%
Standard C1 search speed is 7100 soups/s/core, so for every value of max gen < 12000 in this table there is a greater benefit in devoting those CPU resources to C1 soup searching than in processing the additional interesting soups from the slightly faster GPU search.

calcyman
Moderator
Posts: 2932
Joined: June 1st, 2009, 4:32 pm

Re: apgsearch v5.0

Post by calcyman » March 30th, 2019, 10:09 am

apgsearch v5.05 allows searching of H2_+1 and H2_+2 (the GPU versions of D2_+1 and D2_+2, respectively) in addition to G1.

EDIT: apgsearch v5.061-ll2.2.2 has GPU support for arbitrary outer-totalistic rules. (They won't work very well if there are common spaceships that aren't either (2,0)c/4 or (1,1)c/4.)

wildmyron
Posts: 1542
Joined: August 9th, 2013, 12:45 am
Location: Western Australia

Re: apgsearch v5.0

Post by wildmyron » April 1st, 2019, 5:27 am

calcyman wrote:apgsearch v5.05 allows searching of H2_+1 and H2_+2 (the GPU versions of D2_+1 and D2_+2, respectively) in addition to G1.
I did some test runs with these and the three D4_+ symmetries which are now also supported. I ran 100 million soups for each of them, and verified that a small number of reported objects are indeed produced by the reported soups. Here are some perf statistics from my system.

Code: Select all

Symmetry   Interesting soups   Search speed (2 s.f.)
H2_+1      ~0.01%              50000 soups/s
H2_+2      ~0.7%               40000 soups/s
H4_+1      ~0.04%              70000 soups/s
H4_+2      ~0.6%               53000 soups/s
H4_+4      ~0.8%               49000 soups/s
For the symmetries with at least one even reflection, the majority of interesting soups contain a pentadecathlon. I also did a few quick tests with reduced values of maxgen (as discussed above). In the case of H4_+2 this does actually give a decent benefit: with maxgen = 6000 the search speed increases to 63000 soups/s, and with maxgen = 4500 it increases to 66000 soups/s. There was only a small increase in the percentage of interesting soups, so one CPU core could still keep up.
calcyman wrote:apgsearch v5.061-ll2.2.2 has GPU support for arbitrary outer-totalistic rules. (They won't work very well if there are common spaceships that aren't either (2,0)c/4 or (1,1)c/4.)
Fantastic work. Common oscillators which aren't p2, p3, or p6 can also reduce the benefit of the CUDA code - in particular p4 oscillators, which are common in plenty of other rules. Replicators are presumably also problematic - not surprisingly, HighLife was CPU-bound, and B36/S245 was significantly slower than CPU-only search (actually, I think that was due to the common p4).

Flock (B3/S12) on the other hand is an ideal candidate - but here I had to reduce maxgen to get the best results.
CPU speed - 86,000 soups/s
GPU speed with maxgen = 12000 - 100,000 soups/s
GPU speed with maxgen = 300 - 1,000,000 soups/s

I didn't upload any of these tests for Flock, or the other tests with reduced maxgen.
