Assembly Analysis and Front-end Register Clears — Handmade Chat — Episode Guide

0:03 Welcome to the chat

🗩

1:32 Advocate ZII (Zero Is Initialisation)¹

8:38 Describe Jesse Meyer's ZII experiment²

10:04 DOS vs Linux memory mapping, page faults and profiling

18:38 Memory mapping and profiling: 1) Hunt for minimum

22:04 Memory mapping and profiling: 2) Statistical breakdown, ignoring outliers

24:52 General advice on profiling CPU performance

25:42 Create xorclear.cpp

🖮

27:24 Set up our xorclear experiment in Compiler Explorer³

28:55 Initially, msvc seems to generate better code than clang⁴

32:17 Walk through the xorclear code in conjunction with the clang-generated assembly⁵

39:14 Macro-ops subject to fusion (cmp and jne)⁶

44:30 Memory Execution Units and Scalar Arithmetic Units

46:36 Port usage of ADD (R64, I8)⁷

49:50 Port usage of CMP (R64, I32)⁸

51:27 Writing Identity into the Matrices array using mov, movaps and movups instructions⁹

58:29 xorps¹⁰

1:00:28 Does Clang do anything more than -O3?

🗩

1:01:06 @chronic_quagga -mavx2?

🗪

1:01:21 Loading and writing zeros¹¹

1:04:40 Horrible code: 1) Superfluous zero writes¹²

1:05:29 Try moving the Identity and Zero matrix_4x4 outside of main()¹³

1:06:10 Move Identity and Zero matrix_4x4 back inside main()¹⁴

1:06:22 Horrible code: 1) Superfluous zero writes (cont.)¹⁵

1:06:52 Horrible code: 2) Using seven instructions to move 64 bytes¹⁶

1:08:27 Hunt uops for mov¹⁷

1:11:03 MOVUPS (M128, XMM)¹⁸

1:14:56 Check the Intel 64 and IA-32 Architectures Software Developer Manual for MOV¹⁹

1:15:49 MOVQ (M64, XMM)²⁰

1:16:08 MOV permutations²¹

1:16:50 Port usage of mov, movaps and movups²²

1:17:20 @jaege8 Next page?

🗪

1:17:59 MOV (M32, I32)²³

1:19:33 Horrible code: 2) Using seven instructions to move 64 bytes (cont.)²⁴

1:20:01 Hand-write and -read 128-bit rows using _mm_setr_ps() and _mm_storeu_ps()²⁵,²⁶

1:22:36 The clang-generated code is now better, with one loop unroll²⁷

1:25:42 @oldganon O3 didn't help here

🗪

1:25:56 Thoughts on explicitly writing out intrinsics

1:26:54 Walk through the xorclear code in conjunction with the msvc-generated assembly²⁸

1:28:09 Hunt uops for rep²⁹

1:29:09 MOVSB_REPE³⁰

1:30:59 Determine to try a dependent clear

1:32:15 @dragoonx6 handmade_hero Try something like -O3 -march=skylake -ffast-math

🗪

1:32:39 Temporarily try moving the Identity and Zero matrix_4x4 outside of main()³¹

1:33:50 @daniel_collin_ You can leave it inside and set it to static

🗪

1:35:16 Introduce a conditional clear in xorclear³²

1:39:14 Compare clang vs msvc on our conditional clear³³

1:44:14 @sainst0 Does it change if you give it -mtune=znver2?

🗪

1:44:47 Clang often outputs slow code, but faster intrinsics-heavy code

1:45:34 Walk through the msvc-generated code for our conditional clear³⁴

1:46:22 Why clearing to zero is free³⁵

1:49:05 Non-free zero-clearing: 1) When frontend-bound

1:51:45 @peterfors Skylake's memory subsystem is in charge of the loads and store requests and ordering. Since Haswell, it's possible to sustain two memory reads (on ports 2 and 3) and one memory write (on port 4) each cycle

🗪

1:52:29 Non-free zero-clearing: 2) Code size, alignment differences

1:53:22 Our movaps and xorps operations are free

1:53:58 Try declaring the rows uninitialised, only conditionally setting to zero³⁶

1:54:45 Our code introduced an extra jmp³⁷

1:56:20 Always initialise to zero³⁸

1:57:07 Replace the branch with a masked blend³⁹

2:02:22 msvc doesn't bother to blend with 0⁴⁰

2:04:41 Fill the second column with 1s⁴¹

2:04:56 msvc doesn't bother to do the full blend on each row⁴²

2:05:30 Make each row different⁴³

2:05:51 Our instructions will overlap⁴⁴

2:07:09 Q&A

🗩

2:07:26 @jessem3y3r handmade_hero Hi Casey. Jesse from twitter here. Thank you so much for taking the time to explain and demonstrate this on Handmade Hero! Deeply appreciated!

🗪

2:08:37 @somebody_took_my_name If you take a look at different add / sub ops with immediates, you'll see nice tricks with the lea instruction

🗪

2:09:56 @centhusiast Q: I compiled the code with icc, Intel's compiler, with O2 and it takes 3.5 seconds to run it. Is this really bad?

🗪

2:10:09 Memory bandwidth will be the bottleneck

🗩

2:13:35 @vodonikhs Q: Someone has mentioned that older Clang versions generate better code. Could it be because of Heartbleed mitigation?

🗪

2:13:46 Try rolling back to older clang versions⁴⁵

2:14:24 @i_am_seabass Q: I read that mixing SSE2 and AVX2 will incur a performance penalty. How would you handle optimizing code, if you want to support AVX2, but also SSE for older systems? Would you just have separate builds for each?

🗪

2:16:29 Isolating architecture-dependent code

🗩

2:18:34 @mindmark42 Q: Could you show what gcc does?

🗪

2:18:43 GCC uses all scalar mov instructions⁴⁶

2:19:18 @vodonikhs Q: Try Clang 6

🗪

2:19:24 Clang 6 still looks bad⁴⁷

2:19:58 gcc -O3 generates the correct code⁴⁸

2:20:26 @skincell3 Q: Could you provide the twitter conversation link that you are responding to, for the YouTube video?

🗪

2:21:03 @maliusarth Q: You haven't tried latest clang with O3, did you?

🗪

2:21:10 Latest clang with -O3 generates bad code⁴⁹

2:21:13 Compilers should produce reliable code without the need for switches, optimisation passes, etc.

2:23:23 @sir_klausi handmade_hero Is it possible those extra jumps clang generates are a spectre mitigation?

🗪

2:24:30 @drmaruq Spectre mitigation is on hardware level? Why would clang mess up the exe?

🗪

2:25:05 Share the godbolt link⁵⁰

2:25:33 Close it down with a plug of Star Code Galaxy⁵¹

🗩
