Counting Intrinsics — Handmade Hero — Episode Guide

0:10 Lesson: Das keyboards are horrible

1:36 Recap of last episode and today's agenda

2:33 Prep work for getting pre-optimization vs post-optimization cycle counts

3:43 Add cycle counting to DrawRectangleSlowly

4:41 ... 350 vs 50 cycles per pixel!

5:17 How long *should* it take to fill each pixel? Let's count up all the intrinsics and their throughputs...

7:10 ... How can we automate this counting process?

7:58 Answer: Override the intrinsics with macros that add to some counter variables

8:47 Oops, there's still some SIMDizing left to do here...

9:30 Use _mm_add_ps to increment PixelPx by 4 instead of scalar adds (2-3 cycles better)

11:55 dx and dy can be baked into PixelPx and PixelPy (2 cycles better)

13:08 Should we loft the PixelPx and PixelPy axis multiply/add calculation out of the inner loop?

13:59 Maybe loft just the multiplies but not the add? Hmm...

14:20 ... try lofting the multiplications. (1-2 cycles worse)

15:50 Note: Texture fetches can't be done in SIMD

16:52 Fabian on why _mm_maskmoveu_si128 is so slow. Don't use it! It bypasses the cache.

18:15 Adding a #define for each intrinsic to count operations (_mm_add_ps, _mm_mul_ps, etc.)

21:45 Start setting up the intrinsic #defines to count operations

23:45 Preprocessor cleverness that handles the fact that intrinsics often take other intrinsics as params

27:34 Define load/store to nothing

28:39 Mini-rant about the compiler not doing instruction/intrinsic instrumentation automatically

31:46 We've got counts!

32:15 Double-check that the counts make sense

33:27 Multiply counts by throughputs to get a total latency estimate

35:27 _mm_castps_si128 latency is difficult to know.

35:52 Looking up the processor core type in Windows

36:52 _mm_and_ps and bitwise ops are 1/3 cycle on Nehalem

40:28 Use a macro to sum up the latency*counts to get a rough throughput total

42:55 Well, isn't that fancy: Measured throughput is lower than the theoretical best throughput. Instructions are likely executing on multiple ALUs per cycle

45:40 How many units are in a Nehalem core?

48:17 ... Two?

49:12 On the limitations of executing multiple instructions per clock

51:25 We're quite close to the max theoretical throughput.

52:19 Memory latency probably isn't hurting performance

52:47 Make an #if toggle for the intrinsic measurement code

53:58 How much is gamma (sqrt) costing us?

56:30 A troubling visual artifact appears around our hero...

57:47 Aha! An issue with the linear/sRGB code

1:00:28 Gamma is costing only 6 cycles

1:01:05 This is a reasonably optimized pixel loop

1:01:32 Agenda for next session: Optimize outside/around the pixel loop.

1:01:56 Q&A

1:02:09 @stelar7 Is this what you were looking for?

1:03:16 Nehalem diagram: Only one FPU?

1:05:52 @grumpygiant256 Worth timing the load/stores with no ALU ops to see how much we're memory bound?

1:12:46 @thesizik You counted _mm_and_ps wrong.

1:13:35 @ieee754 Are you doing pre-multiplied alpha? (Yes)

1:13:38 @tenbroya Could you run the game with task manager open?

1:16:17 @jayp2 Will this game only work for your specific processor?

1:16:43 @toppstv Are you going to update the yellow background textures?

1:17:32 @braincruser The texture fetch should be an L1 cache fetch.

1:18:10 @0xwid In an alternate universe where nobody cares for art, do you think optimization would still be a focus for developers?

1:19:34 @miblo Any idea why my cores get maxed out when running Handmade Hero with the XCB platform layer?

1:20:33 @robotchocolatedino Why wasn't there a greater speed increase after removing gamma correction?

1:22:42 @marumoto How will we split up the drawing onto multiple cores?

1:22:56 @dingernalt2 What's the floating head?

1:23:14 @nothings2 Question about _mm_sqrt_ps and common subexpression elimination

1:24:14 @thesizik What's that drum-like background noise?

1:24:37 @jayp2 Do you see all the questions?

1:25:07 @thevaber Can rdtsc be inaccurate with CPUs that vary their cycle rate?

1:26:23 @cubercaleb How does the CPU do things ahead of time if things are supposed to be done in order?

1:29:34 @ttbjm Do you expect a 16x speedup from multi-threading?

1:29:59 @gasto5 How do you select the instruction set for optimizing?

1:32:55 @nothings2 Aren't the Unity hardware survey results pretty different than the Steam ones?

1:34:01 @captainkraft What are the gains you get by writing your own software renderer vs using SDL, GPUs, etc.?

1:35:24 @jayp2 Can a processor work through different types of calculations in a single cycle?

1:37:16 @ca2dev What kinds of things can be delegated to the GPU?
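
The counting approach outlined at 7:58, 23:45 and 33:27 (override each operation with a macro that bumps a counter, cope with nested calls, then multiply counts by per-op costs) can be illustrated with a minimal sketch. This is not the code written in the episode. To stay portable it uses hypothetical stand-ins (`vec4`, `Vec4Set1`, `Vec4Add`, `Vec4Mul`) rather than the real SSE intrinsics, and `op_counts`, `Counts`, `ScaleAndBias` and `EstimateCycles` are names invented for this example. The per-op costs are supplied by the caller; no real Nehalem figures are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 4-wide SIMD register (not a real SSE type). */
typedef struct { float E[4]; } vec4;

/* Plain functions standing in for _mm_set1_ps, _mm_add_ps and _mm_mul_ps. */
static vec4 Vec4Set1(float A)
{
    vec4 R = {{A, A, A, A}};
    return R;
}

static vec4 Vec4Add(vec4 A, vec4 B)
{
    vec4 R;
    for (int I = 0; I < 4; ++I) R.E[I] = A.E[I] + B.E[I];
    return R;
}

static vec4 Vec4Mul(vec4 A, vec4 B)
{
    vec4 R;
    for (int I = 0; I < 4; ++I) R.E[I] = A.E[I] * B.E[I];
    return R;
}

/* One counter per operation we want to tally. */
typedef struct { unsigned Set1, Add, Mul; } op_counts;
op_counts Counts;

#define COUNT_OPS 1
#if COUNT_OPS
/* Each macro bumps its counter, then calls the real function. A macro is
   not re-expanded inside its own replacement list, so the inner name still
   refers to the function. Arguments are macro-expanded before substitution,
   so a nested call such as Vec4Add(Vec4Mul(a, b), c) counts both operations. */
#define Vec4Set1(A)   (++Counts.Set1, Vec4Set1(A))
#define Vec4Add(A, B) (++Counts.Add, Vec4Add((A), (B)))
#define Vec4Mul(A, B) (++Counts.Mul, Vec4Mul((A), (B)))
#endif

/* Toy "pixel loop": Dest[i] = Src[i] * Scale + Bias, four floats at a time.
   Loads and stores are plain copies and are deliberately not counted. */
void ScaleAndBias(float *Dest, const float *Src, size_t Count,
                  float Scale, float Bias)
{
    assert(Count % 4 == 0);
    vec4 Scale4 = Vec4Set1(Scale);
    vec4 Bias4 = Vec4Set1(Bias);
    for (size_t Base = 0; Base < Count; Base += 4)
    {
        vec4 S;
        for (int I = 0; I < 4; ++I) S.E[I] = Src[Base + I];
        vec4 R = Vec4Add(Vec4Mul(S, Scale4), Bias4);
        for (int I = 0; I < 4; ++I) Dest[Base + I] = R.E[I];
    }
}

/* Rough cost estimate: sum over operations of count * cycles-per-op.
   The costs are caller-supplied placeholders, not measured values. */
double EstimateCycles(op_counts C, double Set1Cost, double AddCost,
                      double MulCost)
{
    return C.Set1 * Set1Cost + C.Add * AddCost + C.Mul * MulCost;
}
```

Running `ScaleAndBias` over eight floats leaves two of each operation in `Counts`: the two `Vec4Set1` calls outside the loop, plus one multiply and one add per four-wide iteration. Setting `COUNT_OPS` to 0 removes the instrumentation without touching the loop, in the spirit of the `#if` toggle mentioned at 52:47.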
