Assembly Analysis and Front-end Register Clears

0:03 Welcome to the chat 🗩
1:32 Advocate ZII (Zero Is Initialisation) [1]
8:38 Describe Jesse Meyer's ZII experiment [2]
10:04 DOS vs Linux memory mapping, page faults and profiling
18:38 Memory mapping and profiling: 1) Hunt for minimum
22:04 Memory mapping and profiling: 2) Statistical breakdown, ignoring outliers
24:52 General advice on profiling CPU performance
25:42 Create xorclear.cpp 🖮
27:24 Set up our xorclear experiment in Compiler Explorer [3]
28:55 Initially, msvc seems to generate better code than clang [4]
32:17 Walk through the xorclear code in conjunction with the clang-generated assembly [5]
39:14 Macro-ops subject to fusion (cmp and jne) [6]
44:30 Memory Execution Units and Scalar Arithmetic Units
46:36 Port usage of ADD (R64, I8) [7]
49:50 Port usage of CMP (R64, I32) [8]
51:27 Writing Identity into the Matrices array using mov, movaps and movups instructions [9]
58:29 xorps [10]
1:00:28 Does Clang do anything more than -O3? 🗩
1:01:06 chronic_quagga: -mavx2? 🗪
1:01:21 Loading and writing zeros [11]
1:04:40 Horrible code: 1) Superfluous zero writes [12]
1:05:29 Try moving the Identity and Zero matrix_4x4 outside of main() [13]
1:06:10 Move Identity and Zero matrix_4x4 back inside main() [14]
1:06:22 Horrible code: 1) Superfluous zero writes (cont.) [15]
1:06:52 Horrible code: 2) Using seven instructions to move 64 bytes [16]
1:08:27 Hunt uops for mov [17]
1:11:03 MOVUPS (M128, XMM) [18]
1:14:56 Check the Intel 64 and IA-32 Architectures Software Developer's Manual for MOV [19]
1:15:49 MOVQ (M64, XMM) [20]
1:16:08 MOV permutations [21]
1:16:50 Port usage of mov, movaps and movups [22]
1:17:20 jaege8: Next page? 🗪
1:17:59 MOV (M32, I32) [23]
1:19:33 Horrible code: 2) Using seven instructions to move 64 bytes (cont.) [24]
1:20:01 Hand-write and -read 128-bit rows using _mm_setr_ps() and _mm_storeu_ps() [25, 26]
1:22:36 The clang-generated code is now better, with one loop unroll [27]
1:25:42 oldganon: O3 didn't help here 🗪
1:25:56 Thoughts on explicitly writing out intrinsics
1:26:54 Walk through the xorclear code in conjunction with the msvc-generated assembly [28]
1:28:09 Hunt uops for rep [29]
1:29:09 MOVSB_REPE [30]
1:30:59 Determine to try a dependent clear
1:32:15 dragoonx6: handmade_hero Try something like -O3 -march=skylake -ffast-math 🗪
1:32:39 Temporarily try moving the Identity and Zero matrix_4x4 outside of main() [31]
1:33:50 daniel_collin_: You can leave it inside and set it to static 🗪
1:35:16 Introduce a conditional clear in xorclear [32]
1:39:14 Compare clang vs msvc on our conditional clear [33]
1:44:14 sainst0: Does it change if you give it -mtune=znver2? 🗪
1:44:47 Clang often outputs slow code, but faster intrinsics-heavy code
1:45:34 Walk through the msvc-generated code for our conditional clear [34]
1:46:22 Why clearing to zero is free [35]
1:49:05 Non-free zero-clearing: 1) When frontend-bound
1:51:45 peterfors: Skylake's memory subsystem is in charge of the loads and store requests and ordering. Since Haswell, it's possible to sustain two memory reads (on ports 2 and 3) and one memory write (on port 4) each cycle 🗪
1:52:29 Non-free zero-clearing: 2) Code size, alignment differences
1:53:22 Our movaps and xorps operations are free
1:53:58 Try declaring the rows uninitialised, only conditionally setting to zero [36]
1:54:45 Our code introduced an extra jmp [37]
1:56:20 Always initialise to zero [38]
1:57:07 Replace the branch with a masked blend [39]
2:02:22 msvc doesn't bother to blend with 0 [40]
2:04:41 Fill the second column with 1s [41]
2:04:56 msvc doesn't bother to do the full blend on each row [42]
2:05:30 Make each row different [43]
2:05:51 Our instructions will overlap [44]
2:07:09 Q&A 🗩
2:07:26 jessem3y3r: handmade_hero Hi Casey. Jesse from twitter here. Thank you so much for taking the time to explain and demonstrate this on Handmade Hero! Deeply appreciated! 🗪
2:08:37 somebody_took_my_name: If you take a look at different add / sub ops with immediates, you'll see nice tricks with the lea instruction 🗪
2:09:56 centhusiast: Q: I compiled the code with icc, Intel's compiler, with O2 and it takes 3.5 seconds to run it. Is this really bad? 🗪
2:10:09 Memory bandwidth will be the bottleneck 🗩
2:13:35 vodonikhs: Q: Someone has mentioned that older Clang versions generate better code. Could it be because of Heartbleed mitigation? 🗪
2:13:46 Try rolling back to older clang versions [45]
2:14:24 i_am_seabass: Q: I read that mixing SSE2 and AVX2 will incur a performance penalty. How would you handle optimizing code, if you want to support AVX2, but also SSE for older systems? Would you just have separate builds for each? 🗪
2:16:29 Isolating architecture-dependent code 🗩
2:18:34 mindmark42: Q: Could you show what gcc does? 🗪
2:18:43 GCC uses all scalar mov instructions [46]
2:19:18 vodonikhs: Q: Try Clang 6 🗪
2:19:24 Clang 6 still looks bad [47]
2:19:58 gcc -O3 generates the correct code [48]
2:20:26 skincell3: Q: Could you provide the twitter conversation link that you are responding to, for the YouTube video? 🗪
2:21:03 maliusarth: Q: You haven't tried latest clang with O3, did you? 🗪
2:21:10 Latest clang with -O3 generates bad code [49]
2:21:13 Compilers should produce reliable code without the need for switches, optimisation passes, etc.
2:23:23 sir_klausi: handmade_hero Is it possible those extra jumps clang generates are a spectre mitigation? 🗪
2:24:30 drmaruq: Spectre mitigation is on hardware level? Why would clang mess up the exe? 🗪
2:25:05 Share the godbolt link [50]
2:25:33 Close it down with a plug of Star Code Galaxy [51] 🗩
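
The 1:20:01 marker covers hand-writing and reading 128-bit rows with _mm_setr_ps() and _mm_storeu_ps(). The index does not reproduce the code from the episode, so the following is only a minimal sketch of that idea, assuming an x86 target with SSE. The matrix_4x4 layout (four rows of four floats) and the names WriteIdentity and SumElements are illustrative assumptions, not the code shown on stream.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h> // SSE: _mm_setr_ps, _mm_storeu_ps, _mm_loadu_ps, _mm_add_ps

// Assumed layout: four rows of four floats (64 bytes per matrix).
struct matrix_4x4
{
    float E[4][4];
};

// Write the identity matrix into each element of Matrices,
// using one unaligned 128-bit store per row (four stores per matrix).
void WriteIdentity(matrix_4x4 *Matrices, std::size_t Count)
{
    assert(Matrices != nullptr || Count == 0);

    // _mm_setr_ps takes its arguments in memory order (lowest address first).
    const __m128 Row0 = _mm_setr_ps(1.0f, 0.0f, 0.0f, 0.0f);
    const __m128 Row1 = _mm_setr_ps(0.0f, 1.0f, 0.0f, 0.0f);
    const __m128 Row2 = _mm_setr_ps(0.0f, 0.0f, 1.0f, 0.0f);
    const __m128 Row3 = _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f);

    for (std::size_t Index = 0; Index < Count; ++Index)
    {
        matrix_4x4 &M = Matrices[Index];
        _mm_storeu_ps(M.E[0], Row0);
        _mm_storeu_ps(M.E[1], Row1);
        _mm_storeu_ps(M.E[2], Row2);
        _mm_storeu_ps(M.E[3], Row3);
    }
}

// Read every row back with unaligned 128-bit loads and return the sum of all
// elements, so the stores above have an observable effect.
float SumElements(const matrix_4x4 *Matrices, std::size_t Count)
{
    __m128 Sum = _mm_setzero_ps(); // typically compiled to xorps on x86
    for (std::size_t Index = 0; Index < Count; ++Index)
    {
        for (int Row = 0; Row < 4; ++Row)
        {
            Sum = _mm_add_ps(Sum, _mm_loadu_ps(Matrices[Index].E[Row]));
        }
    }

    float Lanes[4];
    _mm_storeu_ps(Lanes, Sum);
    return Lanes[0] + Lanes[1] + Lanes[2] + Lanes[3];
}
```

Calling WriteIdentity on an array and then SumElements on the same array should return 4.0f per matrix. Whether a given compiler turns this into four movups stores per matrix is something to check in Compiler Explorer rather than assume.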
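
The 1:57:07 marker replaces the branch in the conditional clear with a masked blend. As a hedged illustration of the general technique (not the code from the episode), the sketch below builds an all-ones or all-zeros mask from a boolean and selects between a row and zero with bitwise AND, ANDNOT and OR. It assumes an x86 target with SSE2. The names Select, KeepOrClear and ConditionalClearRows are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2: _mm_set1_epi32, _mm_castsi128_ps (also pulls in SSE)

// Per-bit select: where Mask bits are 1 take A, where they are 0 take B.
static inline __m128 Select(__m128 Mask, __m128 A, __m128 B)
{
    // _mm_andnot_ps(Mask, B) computes (~Mask) & B.
    return _mm_or_ps(_mm_and_ps(Mask, A), _mm_andnot_ps(Mask, B));
}

// Return Row unchanged when Keep is true, or all zeros when Keep is false,
// without branching on Keep in the source.
static inline __m128 KeepOrClear(__m128 Row, bool Keep)
{
    // -(int)true == -1 (all bits set), -(int)false == 0.
    const __m128 Mask = _mm_castsi128_ps(_mm_set1_epi32(-static_cast<int>(Keep)));
    return Select(Mask, Row, _mm_setzero_ps());
}

// Apply the masked clear to RowCount consecutive rows of four floats each.
void ConditionalClearRows(float *Rows, std::size_t RowCount, bool Keep)
{
    assert(Rows != nullptr || RowCount == 0);
    for (std::size_t Index = 0; Index < RowCount; ++Index)
    {
        float *Row = Rows + 4 * Index;
        _mm_storeu_ps(Row, KeepOrClear(_mm_loadu_ps(Row), Keep));
    }
}
```

Because one operand of the blend is zero, a compiler is free to drop the ANDNOT and OR and keep only the AND, which is in line with the 2:02:22 observation that msvc doesn't bother to blend with 0. The generated assembly should still be inspected per compiler and flag set.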