Wide Unpacking and Masking

0:25 Overview of optimization work
1:30 Recap where we were yesterday
1:50 Current issue: Black bars
3:20 Blackboard: Writing correct values to destination
5:35 It's ok to do all operations for all pixels
6:52 Blackboard: Another option: Combine old/new values
8:14 Blackboard: Build a mask
9:00 Masking out the invalid new values
10:50 Making sure we save the original destination
11:38 Haven't SIMD-ized the load yet, deal with OriginalDest differently
12:55 Problem with WriteMask: Haven't computed it yet!
14:00 Use cheesy set macros to set WriteMask
14:16 Handmade Hero: A Bit Garish edition
15:20 Fixing the 'problem': Mi macro for uint setting
16:00 Another thing: Fabian's rounding mode comment
16:57 Some work to do with the last for(I) loop
19:34 The explicit version of unrolling the loop
22:00 Checking we're still working: under 100 cycles now
23:10 Doing the destination the same way
23:50 Just saved more cycles moving things out
24:35 Fixing the WriteMask nonsense
25:38 SSE Comparison Operations
26:20 Blackboard: Comparisons for wide operations
29:43 Using comparisons to generate WriteMask directly
31:50 Working WriteMask with wide operations
32:10 Problem: can't get rid of if entirely...
32:40 Solution: Clamp U and V
33:40 Get rid of the if entirely!
33:54 Handmade Hero: Uniformly Stretchy Edition
34:05 Fixing the bug: U/V copypasta typo
35:05 Doing the texel fetch wide as well
37:30 Not optimizing yet, just translating to SIMD
39:45 Adjusting the texture fetch to use the wide values
40:30 Converting the fetch coord by truncating
42:00 Getting fX and fY by subtraction
43:30 All correct, under 70 cycles
44:10 No longer need to initialize the Texel values
46:00 Everything in SIMD now but texel loads
46:50 Blackboard: Unpacking the color data
48:30 Pulling out colors using masks and shifting
53:20 Blackboard: The matrix of sample reads
55:00 Packing the sample data into 4-wide registers
55:48 Some crazy emacs macro kung-fu
56:50 Doing the Texels the same way as Dest
58:05 Working texel read, and...almost 50cy/pixel
59:25 What if there's nothing in the mask?
1:01:19 Q&A
1:02:03 grumpygiant256 Could you not just align the X coord to a 4-pixel boundary up front, and thereby use aligned loads and stores?
1:03:03 garlandobloom Are you pulling this code over into ground splats soon?
1:05:15 ostrovskivlad Is it me or after this whole SIMD conversion the cycles per pixel are much more consistent?
1:05:44 ifingerbangedurcat I have kind of missed the past few days, I'm wondering if doing CPU intrinsics exclusively for SSE2 in your game code is bad or are we targeting SSE2? For example, should we wrap everything into platform-specific files so it's easier to target other platforms?
1:08:35 flyingsand What does it mean for intrinsics that don't have a specified throughput?
1:08:51 kelimion Instead of loading the destination first would it be faster to skip that and instead do a masked write e.g. _mm_maskmoveu_si128
1:11:56 tobeypeters Would it be a good idea to just use SIMD for all our math operations in all our programs?
1:15:36 flyingsand Example of an intrinsic with no throughput: _mm_cmpgt_ps
1:21:00 grumpygiant Agner Fog says the throughput is 1
1:22:16 mrstone56 [What is latency vs throughput?]
1:22:46 themarsala What is the end goal of the optimization, trying to get below a certain threshold, or just to get everything converted?
1:23:54 tobeypeters Does size of variables and stuff matter to SIMD, like 32bit vs 64bit?
1:25:45 hellotanjent Is the SSE code doing any cache prefetch or hinting stuff yet?
1:27:12 allaizn Couldn't we use a half-float instead of floats as we don't need that much precision with only 255 discrete values?
1:28:50 ttbjm Is the normal map code going to be converted to SIMD?
1:29:27 End of the stream

Wide Unpacking and Masking

Masking the write:

In SIMD, doing operations "4-wide" means that one wide (packed) operation works on four pixels at once. The computation therefore costs the same whether one, two, three or all four of those pixels are actually needed. The only place the difference matters is when reading from and writing to memory.
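As a rough illustration of what "4-wide" means, the sketch below multiplies four floats with a single SSE instruction. The helper name `scale4` and its parameters are made up for this example and are not taken from the stream's code.

```c
#include <xmmintrin.h> /* SSE: __m128, _mm_loadu_ps, _mm_set1_ps, _mm_mul_ps, _mm_storeu_ps */

/* Hypothetical helper: scales four floats with one packed multiply. */
static void scale4(const float *src, float scale, float *dst)
{
    __m128 values = _mm_loadu_ps(src);          /* load four lanes together (unaligned is fine) */
    __m128 factor = _mm_set1_ps(scale);         /* broadcast the same scale into every lane */
    __m128 result = _mm_mul_ps(values, factor); /* one instruction, four multiplies */
    _mm_storeu_ps(dst, result);                 /* write all four lanes back */
}
```

The multiply does the same amount of work no matter how many of the four lanes hold pixels we care about, which is why the load and the store are the only steps that need special handling.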

The way we make sure we only write the pixels we're meaningfully operating on is by masking out the ones we aren't. Instead of doing a conditional check on every loop iteration, we build a mask that is all 1 bits in the lanes where we'll keep the pixels, and all 0 bits in the lanes where we'll throw them out. If we're operating on four pixels at once and two of them hang off the edge, the mask might look like:

[0x00000000,0x00000000,0xFFFFFFFF,0xFFFFFFFF]
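One possible way to produce such a mask without a branch is a packed comparison, which sets a lane to all 1 bits where the test passes and to 0 where it fails. The sketch below assumes the four pixels sit at consecutive X coordinates and that anything at or beyond `max_x` is off the edge; `build_write_mask` and its parameters are illustrative names, not the stream's code.

```c
#include <emmintrin.h> /* SSE2: __m128i, _mm_setr_epi32, _mm_set1_epi32, _mm_cmplt_epi32 */
#include <stdint.h>

/* Hypothetical helper: lane i of mask_out becomes 0xFFFFFFFF when pixel
   (x + i) lies left of max_x, and 0x00000000 otherwise. */
static void build_write_mask(int x, int max_x, uint32_t mask_out[4])
{
    __m128i pixel_x = _mm_setr_epi32(x, x + 1, x + 2, x + 3); /* lane 0 holds x */
    __m128i limit   = _mm_set1_epi32(max_x);                  /* same limit in every lane */
    __m128i mask    = _mm_cmplt_epi32(pixel_x, limit);        /* all 1s where pixel_x < limit */
    _mm_storeu_si128((__m128i *)mask_out, mask);
}
```

With `x = 6` and `max_x = 8`, lanes 0 and 1 are kept and lanes 2 and 3 are knocked out. Whether that is printed with lane 0 first or last is only a matter of convention.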

By doing a bitwise AND with the pixel data we generate, we can mask out the values that are invalid, since the zeroes in the mask will knock out any bits set in our data. Likewise, the 1s will ensure any values we want to keep will remain in place.

We still need to leave the destination as it was wherever we are not writing. The easiest way to do that is to remember what the destination looked like beforehand, and use those values wherever we knocked out values in our data. So we generate an inverted mask that might look something like:

[0xFFFFFFFF,0xFFFFFFFF,0x00000000,0x00000000]

Using the same AND technique with the inverted mask, we can pull out the destination values that should remain unchanged. We then combine them, with a bitwise OR, with the valid pixel values we kept using the first mask. In every lane one of the two inputs is all 0s, so the OR simply passes the other input through and the two sets of values never interfere.
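Put together, the AND, inverted AND and OR steps can be sketched as follows. This is a minimal version under stated assumptions: pixels are packed 32-bit values, the mask is already available as four 32-bit lanes, and `masked_store4` is a hypothetical name. It uses `_mm_andnot_si128`, which computes `(~mask) & original`, so the inverted mask never has to be built separately.

```c
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_and_si128, _mm_andnot_si128, _mm_or_si128 */
#include <stdint.h>

/* Hypothetical helper: writes new_pixels into dest only in lanes where
   write_mask is 0xFFFFFFFF; other lanes keep their original value. */
static void masked_store4(uint32_t *dest,
                          const uint32_t new_pixels[4],
                          const uint32_t write_mask[4])
{
    __m128i original = _mm_loadu_si128((const __m128i *)dest);       /* remember the destination */
    __m128i computed = _mm_loadu_si128((const __m128i *)new_pixels);
    __m128i mask     = _mm_loadu_si128((const __m128i *)write_mask);

    __m128i kept_new = _mm_and_si128(mask, computed);    /* zero out the invalid new values */
    __m128i kept_old = _mm_andnot_si128(mask, original); /* (~mask) & original */
    __m128i blended  = _mm_or_si128(kept_new, kept_old); /* lanes never overlap, so OR merges them */

    _mm_storeu_si128((__m128i *)dest, blended);
}
```

All four lanes are always written, but the masked-out lanes are written with the value they already held, so the visible result is the same as writing only the valid pixels.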