SIMD Basics — Handmade Hero — Episode Guide

0:59 Open us up and see where we're at in terms of performance

1:52 Blackboard: SIMD on x64

6:34 Blackboard: How do we use SIMD?

8:16 Blackboard: CPU vs GPU framebuffers

14:10 Blackboard: "SOA" vs "AOS"
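
For readers unfamiliar with the two terms, the sketch below contrasts an array-of-structures (AOS) layout with a structure-of-arrays (SOA) layout for four pixels. The type and function names (PixelAOS, PixelsSOA, ToSOA) are made up for illustration and are not taken from the episode's code.

```cpp
#include <cstddef>

// AOS: one struct per pixel, so channels are interleaved in memory
// (r g b a r g b a ...).
struct PixelAOS
{
    float r, g, b, a;
};

// SOA: one array per channel, so the same channel of four neighbouring
// pixels sits contiguously and can fill one 4-wide register.
struct PixelsSOA
{
    float r[4];
    float g[4];
    float b[4];
    float a[4];
};

// Hypothetical helper: transpose four AOS pixels into one SOA batch.
inline PixelsSOA ToSOA(const PixelAOS (&in)[4])
{
    PixelsSOA out = {};
    for (std::size_t i = 0; i < 4; ++i)
    {
        out.r[i] = in[i].r;
        out.g[i] = in[i].g;
        out.b[i] = in[i].b;
        out.a[i] = in[i].a;
    }
    return out;
}
```

With the SOA layout, a single 4-wide load of `r` picks up the red channel of four pixels at once, which is the shape SIMD arithmetic wants.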

18:22 Blackboard: How this stuff actually works

22:13 Blackboard: Strided loading on NEON

25:22 build.bat: Turn off compiler optimisations

25:46 Internet: Intel Intrinsics Guide¹

28:54 handmade_render_group.cpp: Initialise some __m128 registers and use some SIMD intrinsics to operate on them 4-wide

31:07 Debugger: Go to disassembly and look at the SIMD registers

36:02 handmade_render_group.cpp: Set four different values in the registers

36:36 Debugger: See those different values in the registers and note the order in which they are loaded
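
As a small illustration of lane ordering, the sketch below uses the SSE intrinsics _mm_set_ps and _mm_setr_ps. The helper names (StoreLanes, MakeWithSet, MakeWithSetr) are hypothetical, and this is not the code written in the episode; it only shows the documented behaviour that _mm_set_ps takes its arguments from the highest lane down to the lowest, while _mm_setr_ps takes them in memory order.

```cpp
#include <xmmintrin.h> // SSE: __m128, _mm_set_ps, _mm_setr_ps, _mm_storeu_ps

// Hypothetical helper: write the four lanes to memory, lowest lane first.
inline void StoreLanes(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}

// _mm_set_ps(e3, e2, e1, e0): the LAST argument lands in lane 0.
inline __m128 MakeWithSet()
{
    return _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
}

// _mm_setr_ps(e0, e1, e2, e3): the FIRST argument lands in lane 0.
inline __m128 MakeWithSetr()
{
    return _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
}
```

Storing MakeWithSet() to memory yields 4, 3, 2, 1, whereas MakeWithSetr() yields 1, 2, 3, 4.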

39:47 handmade_render_group.cpp: Turn Square functions into multiplies
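
A minimal sketch of what replacing a scalar square with a 4-wide multiply can look like, assuming SSE is available. Square and Square4 are illustrative names here, not necessarily the ones used in handmade_render_group.cpp.

```cpp
#include <xmmintrin.h> // SSE: __m128, _mm_mul_ps

// Scalar version: squares one value.
inline float Square(float a)
{
    return a * a;
}

// 4-wide version: squares four values with a single multiply.
inline __m128 Square4(__m128 a)
{
    return _mm_mul_ps(a, a);
}
```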

41:04 Fix the loop to work on pixels in batches of 4

46:20 Run the game and note that we are overwriting our boundary
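
Writing past the boundary is a common side effect of stepping four pixels at a time when the row width is not a multiple of 4. The episode deals with it by temporarily clipping the buffers; the sketch below shows a different, commonly used approach (a scalar tail loop), purely as an illustration. ScaleRow and its parameters are hypothetical and do not appear in the episode.

```cpp
#include <cstddef>
#include <xmmintrin.h> // SSE: _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical example: multiply every value in a row by `scale`.
inline void ScaleRow(float *row, std::size_t count, float scale)
{
    __m128 scale4 = _mm_set1_ps(scale);

    // Main loop: only runs while a full batch of 4 fits inside the row.
    std::size_t x = 0;
    for (; x + 4 <= count; x += 4)
    {
        __m128 v = _mm_loadu_ps(row + x);
        _mm_storeu_ps(row + x, _mm_mul_ps(v, scale4));
    }

    // Tail loop: handle the 0 to 3 leftover values one at a time,
    // so nothing beyond row[count - 1] is read or written.
    for (; x < count; ++x)
    {
        row[x] *= scale;
    }
}
```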

47:15 handmade_render_group.cpp: Temporarily clip the buffers

48:19 Separate the memory loading stuff from the computations

50:58 Declare the arrays before the loop

57:20 Debugger: Run and investigate the error

58:18 handmade_render_group.cpp: Correctly test ShouldFill[I]

58:50 Run and note that we're (almost) back to where we started

59:42 handmade_render_group.cpp: Walk through the routine

1:01:53 Load in the Pixels from the right place

1:02:33 Run, note that we're back to some semblance of good, and glimpse into the future

1:04:06 Q&A

1:04:30 @thesizik Would it be faster to unpack pixels using a union of an int32 with a struct of 4 int8's, instead of doing 4 shifts and masks per pixel?
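
To make the two options in this question concrete, here is a sketch of both unpacking styles. It assumes a 0xAARRGGBB pixel layout and a little-endian machine, and the names (Channels, PixelUnion, UnpackWithShifts, UnpackWithUnion) are invented for the example. Note that reading a union member other than the one last written is not guaranteed by ISO C++, although common compilers such as GCC and MSVC support it. The sketch says nothing about which version is faster.

```cpp
#include <cstdint>

// Assumed channel order for a little-endian 0xAARRGGBB pixel.
struct Channels
{
    std::uint8_t b, g, r, a;
};

// Option 1: four shifts and masks.
inline Channels UnpackWithShifts(std::uint32_t packed)
{
    Channels c;
    c.b = static_cast<std::uint8_t>((packed >> 0) & 0xFF);
    c.g = static_cast<std::uint8_t>((packed >> 8) & 0xFF);
    c.r = static_cast<std::uint8_t>((packed >> 16) & 0xFF);
    c.a = static_cast<std::uint8_t>((packed >> 24) & 0xFF);
    return c;
}

// Option 2: overlay the 32-bit value with four bytes.
// Relies on compiler-supported union punning and on little-endian order.
union PixelUnion
{
    std::uint32_t packed;
    Channels channels;
};

inline Channels UnpackWithUnion(std::uint32_t packed)
{
    PixelUnion u;
    u.packed = packed;
    return u.channels;
}
```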

1:05:15 @houb_ Why don't we go: Y<2 and X<2 and go through in blocks, instead of a line?

1:07:44 @culver_fly Is it better if we calculate if the pixel should be filled and queue it up and only do the calculations once we hit 4 of them?

1:10:45 @hmh_bot Casey was using a Das Keyboard 4, but it broke, so he is currently using an unknown keyboard he had lying around

1:11:30 @hguleryuz Sorry, maybe this is off-topic: Would it be correct to say anyone coding in Java, by default, is not making use of any of the SIMD stuff, or do you think the JIT compiler is smart enough to make use of it in certain circumstances, maybe with some analysis of the bytecode?

1:12:29 @guit4rfreak How often do you optimize for cache misses vs optimizing with SIMD? I got the impression that cache misses are by far the most important things to look out for

1:14:40 @culver_fly Please send my best regards to Jeff

1:14:52 @sharlock93 Schedule-wise, how many more weeks until you are done with optimization of the renderer?

1:15:01 @ray_caster Will you be covering Morton order texture swizzling?

1:16:54 @dr_fubar Possibly a noob Q: Have you ever run into problems with floating point arithmetic, and what are some good approaches to avoiding those problems?²,³

1:21:37 starchypancakes [...] Casey said SSE2 was standard, I guess I'll start there⁴

1:24:06 @houb_ Is there a way to track how memory gets stored to cache?⁵

1:28:01 @hguleryuz Off-topic: Do you know if JAI will have extensions / a method for using SIMD?

1:28:50 @xaitra How much do you need to think about the intrinsic instructions while programming, or does the compiler usually take care of that? Is this the big difference between using GNU and Intel compiler, for example?

1:30:37 @ray_caster I think he's essentially asking how proficient compilers are at automatically emitting SIMD instructions⁶

1:33:54 @rooctag Do you have to take the instruction cache into account? Or is it large enough?

1:34:39 @goodoldmalk How do intrinsics and parallel processing work together? Does each CPU have registers to do intrinsics? If so, could we increase X-fold the number of pixels rendered in our code if we computed in parallel?

1:35:33 Wrap things up
