Optimizing with SSE2 and AVX2 — Handmade Ray — Episode Guide

0:02 Recap and set the stage for the day

1:26 Run the program to show the current picture

4:17 Begin to implement the LANE_WIDTH == 4 versions for our various functions / operators¹

13:57 Describe the _mm_xor_si128 instruction²

16:52 Implement a full set of lane width-agnostic operators³
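
As a rough illustration of what lane width-agnostic operators can look like, here is a minimal 4-wide sketch using SSE2. The type and function names echo those mentioned in this guide, but the bodies are assumptions and not the code written on stream.

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical 4-wide lane types. The names follow the guide, the bodies are assumed.
struct lane_f32 { __m128 V; };
struct lane_u32 { __m128i V; };

// Broadcast one scalar to all four lanes.
inline lane_f32 LaneF32FromF32(float A) { return {_mm_set1_ps(A)}; }

// _mm_set1_epi32 takes an int, so the unsigned value is cast on the way in.
inline lane_u32 LaneU32FromU32(unsigned A) { return {_mm_set1_epi32(static_cast<int>(A))}; }

inline lane_f32 operator+(lane_f32 A, lane_f32 B) { return {_mm_add_ps(A.V, B.V)}; }
inline lane_f32 operator*(lane_f32 A, lane_f32 B) { return {_mm_mul_ps(A.V, B.V)}; }
inline lane_u32 operator^(lane_u32 A, lane_u32 B) { return {_mm_xor_si128(A.V, B.V)}; }

// Read lane 0 back as a scalar.
inline float Extract0(lane_f32 A) { return _mm_cvtss_f32(A.V); }
inline unsigned Extract0(lane_u32 A) { return static_cast<unsigned>(_mm_cvtsi128_si32(A.V)); }
```

With wrappers of this shape, calling code can write `A * B + C` without knowing whether a lane holds 1, 4 or 8 values.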

42:33 Fix up CastSampleRays() to convert everything to the correct lane width^α

45:35 Introduce LaneV3FromV3() and continue fixing up CastSampleRays()

48:45 Implement the various lane_v3 functions / operators⁴

1:03:14 Implement scalar comparison operators⁵

1:12:16 Introduce AndNot() using _mm_andnot_si128 for ConditionalAssign() to use⁶
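
A masked select of this kind can be sketched as follows. The function names come from this guide; the bodies, the `operator&` / `operator|` helpers and the `LaneAt` test helper are assumptions made for illustration.

```cpp
#include <cassert>
#include <emmintrin.h>

struct lane_u32 { __m128i V; };

inline lane_u32 operator&(lane_u32 A, lane_u32 B) { return {_mm_and_si128(A.V, B.V)}; }
inline lane_u32 operator|(lane_u32 A, lane_u32 B) { return {_mm_or_si128(A.V, B.V)}; }

// _mm_andnot_si128(A, B) computes (~A) & B: the FIRST operand is the one inverted.
inline lane_u32 AndNot(lane_u32 A, lane_u32 B) { return {_mm_andnot_si128(A.V, B.V)}; }

// Per lane: where Mask is all ones take Source, where it is all zeros keep Dest.
inline void ConditionalAssign(lane_u32 *Dest, lane_u32 Mask, lane_u32 Source)
{
    *Dest = AndNot(Mask, *Dest) | (Mask & Source);
}

// Hypothetical helper: read lane I back out through memory.
inline unsigned LaneAt(lane_u32 A, int I)
{
    alignas(16) unsigned Out[4];
    _mm_store_si128(reinterpret_cast<__m128i *>(Out), A.V);
    return Out[I];
}
```

The select is branch-free, which is what lets four rays in different states share one code path.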

1:23:00 Continue to implement our scalar functions

1:32:28 Double-check C's specification for comparison operators⁷

1:34:00 Continue to fix up compile errors

1:34:52 Implement scalar loading of materials using _mm_setr_ps⁸
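
One plausible shape for such a scalar load is sketched below. The `material` layout and the `GatherF32` signature are assumptions for this example, not the definitions used in the episode.

```cpp
#include <cassert>
#include <cstring>
#include <emmintrin.h>

// Hypothetical material record; the struct in the series has different fields.
struct material { float Specular; float Emit; };

// SSE2 has no gather instruction, so each lane is loaded in scalar code and the
// four results are packed with _mm_setr_ps, whose arguments run lane 0 to lane 3.
inline __m128 GatherF32(const void *BasePtr, unsigned Stride, __m128i Indices)
{
    alignas(16) unsigned I[4];
    _mm_store_si128(reinterpret_cast<__m128i *>(I), Indices);

    const char *Base = static_cast<const char *>(BasePtr);
    float V[4];
    for (int Lane = 0; Lane < 4; ++Lane)
    {
        // Copy one float out of the record selected by this lane's index.
        std::memcpy(&V[Lane], Base + I[Lane] * Stride, sizeof(float));
    }
    return _mm_setr_ps(V[0], V[1], V[2], V[3]);
}
```

Passing the address of a member in the first record, with `sizeof(material)` as the stride, pulls that member for four different material indices at once.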

1:50:15 Continue to fix up compile errors⁹

1:59:59 Implement multiple permutations of MaskIsZeroed() and HorizontalAdd()¹⁰
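
For reference, 4-wide versions of these two helpers might look like the sketch below. The bodies are assumptions; in particular, summing through memory is only the simplest option, not necessarily the one chosen on stream.

```cpp
#include <cassert>
#include <emmintrin.h>

// True when no lane is set. _mm_movemask_epi8 collects the top bit of each byte,
// so this assumes the mask came from a compare (each lane all ones or all zeros).
inline bool MaskIsZeroed(__m128i Mask) { return _mm_movemask_epi8(Mask) == 0; }

// Sum the four lanes into one scalar by spilling them to memory.
inline float HorizontalAdd(__m128 A)
{
    alignas(16) float V[4];
    _mm_store_ps(V, A);
    return V[0] + V[1] + V[2] + V[3];
}
```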

2:06:20 Make RenderTile() pack the sRGB colour inline and initialise everything in scalar

2:11:39 Introduce Extract0() for RenderTile() to call

2:14:17 Change Materials, Planes and Spheres to be initialiser lists

2:18:47 Make the Entropy stuff work properly¹¹

2:21:42 Run the program to see totally bogus results

2:22:18 Print out the lane width and flip LANE_WIDTH back to 1 so we can get that working again

2:32:07 Run the program in 1-wide lanes to see that this no longer works

2:34:23 Step through CastSampleRays() and inspect its values

2:43:07 Make the operator& for lane_v3 zero out the mask if needed

2:44:29 Run the program...

2:45:19 Bump the CPUCount back up

2:45:27 Run the program to see what's going on

2:46:58 Increase the LANE_WIDTH to 4

2:47:19 Run the program to see a bizarre picture

2:48:26 Switch back to the slow mode and step through CastSampleRays() to inspect its values

2:54:12 Fix ConditionalAssign() to cast rather than convert
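
The difference between casting and converting is easy to show in isolation. In the sketch below all three function names are hypothetical, and the guide does not say which operand was being converted in the original bug; the float select simply shows why a cast is the right tool when a mask has to keep its bits.

```cpp
#include <cassert>
#include <emmintrin.h>

// Cast: the same 32 bits, reinterpreted as an integer.
inline unsigned BitsViaCast(float A)
{
    return static_cast<unsigned>(_mm_cvtsi128_si32(_mm_castps_si128(_mm_set1_ps(A))));
}

// Convert: the numeric value, rounded to an integer.
inline int ValueViaConvert(float A)
{
    return _mm_cvtsi128_si32(_mm_cvtps_epi32(_mm_set1_ps(A)));
}

// Float select: the integer mask is cast (not converted) so its bits survive.
inline __m128 ConditionalAssignF32(__m128 Dest, __m128i Mask, __m128 Source)
{
    __m128 M = _mm_castsi128_ps(Mask);
    return _mm_or_ps(_mm_andnot_ps(M, Dest), _mm_and_ps(M, Source));
}
```

For 1.0f the cast yields the bit pattern 0x3F800000 while the convert yields the integer 1, so mixing them up silently corrupts either the mask or the data.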

2:54:54 Step back through CastSampleRays() to see more expected values

2:56:20 Run our program to see a better image

2:58:02 Compare our GatherF32_() functions

2:59:52 Step into CastSampleRays() and inspect the material values

3:03:16 Scrutinise our operator!= for lane_u32

3:04:39 Fix our operator!= for lane_u32 to use _mm_set1_epi32(0xFFFFFFFF) rather than _mm_setzero_si128()
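
To see why the constant matters, here is a small sketch of a not-equal built from an equality compare. It assumes the operator inverts the result of _mm_cmpeq_epi32 with an XOR, which fits the entries above but is not confirmed by them; both function names are hypothetical.

```cpp
#include <cassert>
#include <emmintrin.h>

// SSE2 offers _mm_cmpeq_epi32 but no "not equal", so flip every bit of the
// equality mask. XOR against all ones inverts; XOR against zero changes nothing.
inline __m128i NotEqual(__m128i A, __m128i B)
{
    __m128i AllOnes = _mm_set1_epi32(-1); // same bits as 0xFFFFFFFF
    return _mm_xor_si128(_mm_cmpeq_epi32(A, B), AllOnes);
}

// The assumed pre-fix variant: it hands back the equality mask untouched.
inline __m128i NotEqualBroken(__m128i A, __m128i B)
{
    return _mm_xor_si128(_mm_cmpeq_epi32(A, B), _mm_setzero_si128());
}
```

With the zero constant the operator behaves exactly like `==`, which would explain a lane mask that looked inverted in the debugger.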

3:05:55 Step into CastSampleRays() to see that our lane mask is set properly

3:06:57 Run our program to see that we're now only a little bit wrong

3:08:08 Read through our scalar code for any obvious mistakes

3:18:07 Run our program on 1 lane, to compare our image with the 4 lane version

3:20:51 Rename Scatter to Specular and try to force all Specular values to 1

3:23:52 Run our program to see what that looks like

3:25:53 Revert those specular values and investigate whether the PureBounce, RandomBounce and RayDirection are being computed correctly

3:28:49 Step into the lane_v3 Lerp() to see what it produces

3:32:42 Check the normalisation of RayDirection

3:33:37 Step through RandomBilateral()

3:35:01 Step into LaneF32FromU32() and double-check what it is computing

3:36:48 Make LaneU32FromU32 cast its incoming u32 to an int when passing it to _mm_set1_epi32^β

3:38:43 Step back into RandomUnilateral() to see possibly more expected results

3:40:06 Assert in RandomUnilateral() that Result < 0.6f

3:41:06 Run the program without hitting that assert, determining that RandomUnilateral() is not producing the full range of values from 0 to 1

3:43:00 Make RandomUnilateral() shift down its terms by 1^γ
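
One reading of this fix: SSE2's integer-to-float conversion treats each lane as signed, so random values with the top bit set come out negative unless they are shifted right by one first. The sketch below follows that reading. The xorshift generator and both function bodies are assumptions, not a transcription of the episode's code.

```cpp
#include <cassert>
#include <emmintrin.h>

// 4-wide xorshift32 step (13, 17, 5). The generator is an assumption of this sketch.
inline __m128i XorShift32(__m128i S)
{
    S = _mm_xor_si128(S, _mm_slli_epi32(S, 13));
    S = _mm_xor_si128(S, _mm_srli_epi32(S, 17));
    S = _mm_xor_si128(S, _mm_slli_epi32(S, 5));
    return S;
}

// Random floats in [0, 1]. _mm_cvtepi32_ps reads each lane as a SIGNED int, so
// the raw bits are shifted right by one first to keep every lane non-negative,
// and the divisor is shifted to match.
inline __m128 RandomUnilateral(__m128i *Series)
{
    *Series = XorShift32(*Series);
    __m128i Shifted = _mm_srli_epi32(*Series, 1); // logical shift, top bit becomes 0
    __m128 Max = _mm_set1_ps(static_cast<float>(0xFFFFFFFFu >> 1));
    return _mm_div_ps(_mm_cvtepi32_ps(Shifted), Max);
}
```

The shift discards one bit of randomness per lane, which is harmless here because a float cannot hold all 32 bits anyway.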

3:44:29 Run and hit our assertion in RandomUnilateral()

3:44:34 Remove that assert and run the program to see a reasonable result

3:46:17 Run the program at full quality and compare our images^δ

3:49:06 Step into CastSampleRays() to see that we do break out properly

3:49:20 Cast significantly fewer rays per pixel to determine that we are not over-casting

3:52:21 Step into CastSampleRays() and inspect the asm

3:55:07 Make CastSampleRays() count up the LoopsComputed for us to print out

4:00:35 Run our program and inspect its statistics to see a mere 10.61% wasted bounces

4:01:40 Q&A

4:02:40 @thecodedragon You didn't replace &= and |= with the correct operator inside the function

4:03:03 @popcorn0x90 Q: Is your beard fake? It grew pretty fast

4:03:11 @Kelimion cmuratori: Not just day 3, but also 2017-11-19 (for the image filename)

4:03:38 @pragmascrypt Q: 64 samples looked very smooth. Could you compare 64 samples with 4 wide to 64 samples 1 wide?

4:03:59 @chrysos42 Q: Due to floating point precision, is there a significant difference between generating a random float by dividing 32 random bits by the max 32 bit integer vs dividing 24 random bits by the max 24 bit integer?
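
The arithmetic behind this question can be checked directly. The sketch below is independent of the answer given on stream, and both function names are hypothetical. A 32-bit float has a 24-bit significand, so converting a full 32-bit integer rounds away its low bits, whereas any 24-bit integer converts exactly.

```cpp
#include <cassert>
#include <cstdint>

// 32 random bits over the largest 32-bit value. The conversion to float rounds X,
// and the divisor itself rounds up to exactly 2^32.
inline float UnitFrom32(std::uint32_t X)
{
    return static_cast<float>(X) / static_cast<float>(0xFFFFFFFFu);
}

// Top 24 random bits over the largest 24-bit value. Both operands convert exactly.
inline float UnitFrom24(std::uint32_t X)
{
    return static_cast<float>(X >> 8) / static_cast<float>(0xFFFFFFu);
}
```

Near the top of the 32-bit range, up to 256 neighbouring inputs collapse onto one float, and every input from 0xFFFFFF80 upward produces exactly 1.0f. The 24-bit version gives each distinct input its own exactly converted numerator.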

4:05:34 @the_lyribolical_coach_b Q: You said in a much earlier stream that using operator overloads for SIMD could confuse the compiler, preferring to use macros. Why the change?

4:06:53 @pragmascrypt Q: I was thinking maybe it does more samples than it should with 4 wide, so by comparing 64 samples 1 wide with 64 samples 4 wide maybe it would look different

4:07:12 Run the program on 1 lane and 64 RaysPerPixel and compare the images

4:09:23 @groggeh Q: Nine women can't grow a baby any faster; would smaller packing potentially be better? Is the packing taking too much time? Just spit-balling

4:10:24 Enable CastSampleRays() to early-out as often as possible

4:14:40 Run the program to see that it is now twice as fast

4:16:00 Consider avoiding gathering for rays that haven't hit

4:16:59 Explicitly establish that the LaneMask is not zeroed before setting the Attenuation, Bounces and RayDirection

4:17:56 Run the program to see another speedup

4:19:02 Pull out the lane width-specific code into its own .h files, introducing 8-wide versions for everything¹²

4:22:17 Check out the _CMP* defines in immintrin.h¹³^ε

4:27:08 Learn what "ordered" means in the context of these _CMP* defines¹⁴
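
In short, an ordered comparison is false whenever either operand is NaN, while an unordered comparison is true in that case. The AVX predicates spell this out in their names (for example _CMP_LT_OQ versus _CMP_NLT_UQ). The sketch below demonstrates the same distinction with plain SSE compares so that it builds without AVX flags; the two wrapper names are hypothetical.

```cpp
#include <cassert>
#include <limits>
#include <xmmintrin.h> // SSE compares

// Ordered "less than": false in a lane whenever either input is NaN.
inline bool LessThanOrdered(float A, float B)
{
    return _mm_movemask_ps(_mm_cmplt_ps(_mm_set1_ps(A), _mm_set1_ps(B))) == 0xF;
}

// Unordered "not less than": true in a lane whenever either input is NaN.
inline bool NotLessThanUnordered(float A, float B)
{
    return _mm_movemask_ps(_mm_cmpnlt_ps(_mm_set1_ps(A), _mm_set1_ps(B))) == 0xF;
}
```

For ordinary numbers the two predicates are exact opposites; they only both differ from that pattern when a NaN is involved.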

4:28:22 Continue to implement the 8-wide versions of our functions / operators¹⁵

4:36:07 Run the program in 8-wide lanes and crash immediately

4:38:21 Inspect the asm for RenderTile() to see that we are failing on the vunpcklps instruction, and investigate whether it is an alignment issue

4:42:03 Search the Intel 64 and IA-32 Architectures Software Developer Manuals for vunpcklps¹⁶

4:44:33 Pass -arch:AVX2 on the build line to prevent the vunpcklps instruction from using bcst¹⁷

4:46:26 Run our program in 8-wide lanes to see that we are slower, more wasteful and darker

4:47:18 Fix our 8-wide HorizontalAdd()

4:48:10 Run our program to see that we are much better, and save off our images and statistics

4:52:28 That's about it for today

4:53:20 @butwhynot1 Q: Do AVX512 now

4:54:04 That's it
