SIMD Basics

0:59 Open us up and see where we're at in terms of performance
1:52 Blackboard: SIMD on x64
6:34 Blackboard: How do we use SIMD?
8:16 Blackboard: CPU vs GPU framebuffers
14:10 Blackboard: "SOA" vs "AOS"
18:22 Blackboard: How this stuff actually works
22:13 Blackboard: Strided loading on NEON
25:22 build.bat: Turn off compiler optimisations
25:46 Internet: Intel Intrinsics Guide [1]
28:54 handmade_render_group.cpp: Initialise some __m128 registers and use some SIMD intrinsics to operate on them 4-wide
31:07 Debugger: Go to disassembly and look at the SIMD registers
36:02 handmade_render_group.cpp: Set four different values in the registers
36:36 Debugger: See those different values in the registers and note the order in which they are loaded
39:47 handmade_render_group.cpp: Turn Square functions into multiplies
41:04 Fix the loop to work on pixels in batches of 4
46:20 Run the game and note that we are overwriting our boundary
47:15 handmade_render_group.cpp: Temporarily clip the buffers
48:19 Separate the memory loading stuff from the computations
50:58 Declare the arrays before the loop
57:20 Debugger: Run and investigate the error
58:18 handmade_render_group.cpp: Correctly test ShouldFill[I]
58:50 Run and note that we're (almost) back to where we started
59:42 handmade_render_group.cpp: Walk through the routine
1:01:53 Load in the Pixels from the right place
1:02:33 Run, note that we're back to some semblance of good, and glimpse into the future
1:04:06 Q&A
1:04:30 thesizik Would it be faster to unpack pixels using a union of an int32 with a struct of 4 int8's, instead of doing 4 shifts and masks per pixel?
1:05:15 houb_ Why don't we go: Y<2 and X<2 and go through in blocks, instead of a line?
1:07:44 culver_fly Is it better if we calculate if the pixel should be filled and queue it up and only do the calculations once we hit 4 of them?
1:10:45 hmh_bot Casey was using a Das Keyboard 4, but it broke, so he is currently using an unknown keyboard he had lying around
1:11:30 hguleryuz Sorry, maybe this is off-topic: Would it be correct to say anyone coding in Java, by default, is not making use of any of the SIMD stuff, or do you think the JIT compiler is smart enough to make use of it in certain circumstances, maybe with some analysis of the bytecode?
1:12:29 guit4rfreak How often do you optimize for cache misses vs optimizing with SIMD? I got the impression that cache misses are by far the most important things to look out for
1:14:40 culver_fly Please send my best regards to Jeff
1:14:52 sharlock93 Schedule-wise, how many more weeks until you are done with optimization of the renderer?
1:15:01 ray_caster Will you be covering Morton order texture swizzling?
1:16:54 dr_fubar Possibly a noob Q: Have you ever run into problems with floating point arithmetic, and what are some good approaches to avoiding those problems? [2, 3]
1:21:37 starchypancakes [...] Casey said SSE2 was standard, I guess I'll start there [4]
1:24:06 houb_ Is there a way to track how memory gets stored to cache? [5]
1:28:01 hguleryuz Off-topic: Do you know if JAI will have extensions / a method for using SIMD?
1:28:50 xaitra How much do you need to think about the intrinsic instructions while programming, or does the compiler usually take care of that? Is this the big difference between using GNU and Intel compiler, for example?
1:30:37 ray_caster I think he's essentially asking how proficient compilers are at automatically emitting SIMD instructions [6]
1:33:54 rooctag Do you have to take the instruction cache into account? Or is it large enough?
1:34:39 goodoldmalk How does intrinsics and parallel processing work together? Does each CPU have registers to do intrinsics? If so, could we increase X-fold the number of pixel rendering in our code if we computed in parallel?
1:35:33 Wrap things up