Counting Intrinsics

0:10 Lesson: Das keyboards are horrible
1:36 Recap of last episode and today's agenda
2:33 Prep work for getting pre-optimization vs post-optimization cycle counts
3:43 Add cycle counting to DrawRectangleSlowly
4:41 ... 350 vs 50 cycles per pixel!
5:17 How long *should* it take to fill each pixel? Let's count up all the intrinsics and their throughputs...
7:10 ... How can we automate this counting process?
7:58 Answer: Override the intrinsics with macros that add to some counter variables
8:47 Oops, there's still some SIMDizing left to do here...
9:30 Use _mm_add_ps to increment PixelPx by 4 instead of scalar adds (2-3 cycles better)
11:55 dx and dy can be baked into PixelPx and PixelPy (2 cycles better)
13:08 Should we loft PixelPx and PixelPy axis multiply/add calculation out of the inner loop?
13:59 Maybe loft just the multiplies but not the add? Hmm...
14:20 ... try lofting the multiplications. (1-2 cycles worse)
15:50 Note: Texture fetches can't be done in SIMD
16:52 Fabian on why _mm_maskmoveu_si128 is so slow. Don't use it! It bypasses the cache.
18:15 Adding a #define for each intrinsic to count operations (_mm_add_ps, _mm_mul_ps, etc.)
21:45 Start setting up the intrinsic #defines to count operations
23:45 Preprocessor cleverness that handles the fact that intrinsics often take other intrinsics as params
27:34 Define load/store to nothing
28:39 Mini-rant about the compiler not doing instruction/intrinsic instrumentation automatically
31:46 We've got counts!
32:15 Double check that counts make sense
33:27 Multiply counts by throughputs to get total latency estimate
35:27 _mm_castps_si128 latency is difficult to know.
35:52 Looking up the processor core type in Windows
36:52 _mm_and_ps and bitwise ops are 1/3 cycle on Nehalem
40:28 Use a macro to sum up the latency*counts to get a rough throughput total
42:55 Well, isn't that fancy: Measured throughput is lower than the theoretical best throughput. Instructions are likely executing on multiple ALUs per cycle
45:40 How many execution units are in a Nehalem core?
48:17 ... Two?
49:12 On the limitations of executing multiple instructions per clock
51:25 We're quite close to the max theoretical throughput.
52:19 Memory latency probably isn't hurting performance
52:47 Make an #if toggle for the intrinsic measurement code
53:58 How much is gamma (sqrt) costing us?
56:30 A troubling visual artifact appears around our hero...
57:47 Aha! An issue with the linear/sRGB code
1:00:28 Gamma is costing only 6 cycles
1:01:05 This is a reasonably optimized pixel loop
1:01:32 Agenda for next session: Optimize outside/around the pixel loop.
1:01:56 Q&A
1:02:09 stelar7 Is this what you were looking for?
1:03:16 Nehalem diagram: Only one FPU?
1:05:52 grumpygiant256 Worth timing the load/stores with no ALU ops to see how much we're memory bound?
1:12:46 thesizik You counted _mm_and_ps wrong.
1:13:35 ieee754 Are you doing pre-multiplied alpha? (Yes)
1:13:38 tenbroya Could you run the game with task manager open?
1:16:17 jayp2 Will this game only work for your specific processor?
1:16:43 toppstv Are you going to update the yellow background textures?
1:17:32 braincruser The texture fetch should be an L1 cache fetch.
1:18:10 0xwid In an alternate universe where nobody cares for art, do you think optimization would still be a focus for developers?
1:19:34 miblo Any idea why my cores get maxed out when running Handmade Hero with the XCB platform layer?
1:20:33 robotchocolatedino Why wasn't there a greater speed increase after removing gamma correction?
1:22:42 marumoto How will we split up the drawing onto multiple cores?
1:22:56 dingernalt2 What's the floating head?
1:23:14 nothings2 Question about _mm_sqrt_ps and common subexpression elimination
1:24:14 thesizik What's that drum-like background noise?
1:24:37 jayp2 Do you see all the questions?
1:25:07 thevaber Can rdtsc be inaccurate with CPUs that vary their cycle rate?
1:26:23 cubercaleb How does the CPU do things ahead of time if things are supposed to be done in order?
1:29:34 ttbjm Do you expect a 16x speedup from multi-threading?
1:29:59 gasto5 How do you select the instruction set for optimizing?
1:32:55 nothings2 Aren't the Unity hardware survey results pretty different than the Steam ones?
1:34:01 captainkraft What are the gains you get by writing your own software renderer vs using SDL, GPUs, etc?
1:35:24 jayp2 Can a processor work through different types of calculations in a single cycle?
1:37:16 ca2dev What kinds of things can be delegated to the GPU?
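
The counting idea noted at 7:58 and 18:15 through 31:46 (override each intrinsic with a macro that bumps a counter) can be sketched roughly as follows. This is one possible way to do it, not necessarily the code written in the episode: the counter names and the CountedExample function are hypothetical, and it assumes an x86/x64 compiler that provides <xmmintrin.h>. It relies on two standard preprocessor rules: a macro does not re-expand its own name, so the inner call reaches the real intrinsic, and macro arguments are expanded before substitution, so intrinsics passed as parameters to other intrinsics are counted as well.

```c
#include <xmmintrin.h> /* SSE intrinsics; assumes an x86/x64 target */

/* Hypothetical counters, one per intrinsic we want to tally. */
static unsigned long long Count_mm_add_ps;
static unsigned long long Count_mm_mul_ps;

/* These must come AFTER the intrinsic header is included.
   The inner _mm_add_ps / _mm_mul_ps is not re-expanded, so it calls
   the real intrinsic; the comma operator keeps the __m128 result. */
#define _mm_add_ps(a, b) (++Count_mm_add_ps, _mm_add_ps(a, b))
#define _mm_mul_ps(a, b) (++Count_mm_mul_ps, _mm_mul_ps(a, b))

/* Stand-in for a pixel loop body (not the real renderer):
   computes a*b + c + c in all four lanes and returns lane 0. */
static float CountedExample(float a, float b, float c)
{
    __m128 A = _mm_set1_ps(a);
    __m128 B = _mm_set1_ps(b);
    __m128 C = _mm_set1_ps(c);

    /* Nested intrinsics: the multiply and both adds are each counted. */
    __m128 R = _mm_add_ps(_mm_add_ps(_mm_mul_ps(A, B), C), C);

    return _mm_cvtss_f32(R);
}
```

After one call to CountedExample, Count_mm_add_ps holds 2 and Count_mm_mul_ps holds 1. Multiplying each count by that instruction's throughput for the target core and summing gives the kind of rough estimate discussed at 33:27 and 40:28. Unlike the approach noted at 27:34, this sketch leaves loads and stores untouched rather than defining them to nothing.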