Wide Unpacking and Masking

Previous: 'Packing Pixels for the Framebuffer'

0:25 Overview of optimization work
1:30 Recap where we were yesterday
1:50 Current issue: Black bars
3:20 Blackboard: Writing correct values to destination
5:35 It's ok to do all operations for all pixels
6:52 Blackboard: Another option: Combine old/new values
8:14 Blackboard: Build a mask
9:00 Masking out the invalid new values
10:50 Making sure we save the original destination
11:38 Haven't SIMD-ized the load yet, deal with OriginalDest differently
12:55 Problem with WriteMask: Haven't computed it yet!
14:00 Use cheesy set macros to set WriteMask
14:16 Handmade Hero: A Bit Garish edition
15:20 Fixing the 'problem': Mi macro for uint setting
16:00 Another thing: Fabian's rounding mode comment
16:57 Some work to do with the last for(I) loop
19:34 The explicit version of unrolling the loop
22:00 Checking we're still working: under 100 cycles now
23:10 Doing the destination the same way
23:50 Just saved more cycles moving things out
24:35 Fixing the WriteMask nonsense
25:38 SSE Comparison Operations
26:20 Blackboard: Comparisons for wide operations
29:43 Using comparisons to generate WriteMask directly
31:50 Working WriteMask with wide operations
32:10 Problem: can't get rid of if entirely...
32:40 Solution: Clamp U and V
33:40 Get rid of the if entirely!
33:54 Handmade Hero: Uniformly Stretchy Edition
34:05 Fixing the bug: U/V copypasta typo
35:05 Doing the texel fetch wide as well
37:30 Not optimizing yet, just translating to SIMD
39:45 Adjusting the texture fetch to use the wide values
40:30 Converting the fetch coord by truncating
42:00 Getting fX and fY by subtraction
43:30 All correct, under 70 cycles
44:10 No longer need to initialize the Texel values
46:00 Everything in SIMD now but texel loads
46:50 Blackboard: Unpacking the color data
48:30 Pulling out colors using masks and shifting
53:20 Blackboard: The matrix of sample reads
55:00 Packing the sample data into 4-wide registers
55:48 Some crazy emacs macro kung-fu
56:50 Doing the Texels the same way as Dest
58:05 Working texel read, and...almost 50cy/pixel
59:25 What if there's nothing in the mask?
1:01:19 Q&A
1:02:03 @grumpygiant256 Could you not just align the X coord to a 4-pixel boundary up front, and thereby use aligned loads and stores?
1:03:03 @garlandobloom Are you pulling this code over into ground splats soon?
1:05:15 @ostrovskivlad Is it me or after this whole SIMD conversion the cycles per pixel are much more consistent?
1:05:44 @ifingerbangedurcat I have kind of missed the past few days, I'm wondering if doing CPU intrinsics exclusively for SSE2 in your game code is bad or are we targetting SSE2? For example, should we wrap everything into platform-specific files so its easier to target other platforms?
1:08:35 @flyingsand What does it mean for intrinsics that don't have a specified throughput?
1:08:51 @kelimion Instead of loading the destination first would it be faster to skip that and instead do a masked write e.g. _mm_maskmoveu_si128
1:11:56 @tobeypeters Would it be a good idea to just use SIMD for all our math operations in all our programs?
1:15:36 @flyingsand Example of an intrinsic with no throughput: _mm_cmpgt_ps
1:21:00 @grumpygiant Agner Fog says the throughput is 1
1:22:16 @mrstone56 [What is latency vs throughput?]
1:22:46 @themarsala What is the end goal of the optimization, trying to get below a certain threshold, or just to get everything converted?
1:23:54 @tobeypeters Does size of variables and stuff matter to SIMD, like 32bit vs 64bit?
1:25:45 @hellotanjent Is the SSE code doing any cache prefetch or hinting stuff yet?
1:27:12 @allaizn Couldn't we use a half-float instead of floats as we don't need that much precision with only 255 discrete values?
1:28:50 @ttbjm Is the normal map code going to be converted to SIMD?
1:29:27 End of the stream

Next: 'Counting Intrinsics'

Wide Unpacking and Masking

Masking the write:

In SIMD, working "4-wide" means that a single wide (packed) operation processes four pixels at once. Computing one, two, three, or four valid pixels therefore costs the same; the count only matters when it comes to reading and writing memory.

To make sure we only write the pixels we are meaningfully operating on, we mask out the ones we aren't. Instead of doing a conditional check on every loop iteration, we build a mask whose lanes are all 1 bits where we'll keep the pixel and all 0 bits where we'll throw it out. If we're operating on four pixels at once and two of them hang off the edge, the mask might look like:

[0x00000000,0x00000000,0xFFFFFFFF,0xFFFFFFFF]

By doing a bitwise AND with the pixel data we generate, we can mask out the values that are invalid, since the zeroes in the mask will knock out any bits set in our data. Likewise, the 1s will ensure any values we want to keep will remain in place.
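As a rough scalar illustration of the idea (not the episode's actual code), the sketch below builds a four-lane mask and ANDs it with four computed pixels. The names BuildWriteMask, MaskPixels, X and MaxX are assumptions made up for this example, and lane 0 is taken to be the leftmost pixel; the order in which lanes appear when a mask is written out is only a convention.

```c
#include <stdint.h>

#define LANE_COUNT 4

// Hypothetical helper: build a per-lane write mask for the four pixels
// starting at X. A lane is all 1 bits when its pixel lies left of MaxX
// (inside the region being filled) and all 0 bits when it hangs off the edge.
static void BuildWriteMask(int X, int MaxX, uint32_t Mask[LANE_COUNT])
{
    for (int Lane = 0; Lane < LANE_COUNT; ++Lane)
    {
        Mask[Lane] = (X + Lane < MaxX) ? 0xFFFFFFFFu : 0x00000000u;
    }
}

// Hypothetical helper: AND each computed pixel with its mask lane.
// Lanes with a zero mask come out as zero; the others pass through unchanged.
static void MaskPixels(const uint32_t Pixels[LANE_COUNT],
                       const uint32_t Mask[LANE_COUNT],
                       uint32_t Masked[LANE_COUNT])
{
    for (int Lane = 0; Lane < LANE_COUNT; ++Lane)
    {
        Masked[Lane] = Pixels[Lane] & Mask[Lane];
    }
}
```

With X = 6 and MaxX = 8, lanes 0 and 1 are kept and lanes 2 and 3 are zeroed. In a real 4-wide loop both steps would each be a single wide instruction rather than a per-lane loop.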

We still need to leave the rest of the destination as it was. The easiest way to do that is to remember what the destination looked like beforehand and use those values wherever we knocked out values in our data. So we generate an inverted mask that might look something like:

[0xFFFFFFFF,0xFFFFFFFF,0x00000000,0x00000000]

Using the same AND technique with the inverted mask, we can pull out the destination values that should remain unchanged. Then we combine them with the valid pixel values from the first mask using a bitwise OR. Because every lane is zero in one of the two sets, the OR simply copies each value through from whichever set holds it, with no interference.
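The whole AND / inverted-AND / OR sequence can be sketched with SSE2 intrinsics roughly as follows, assuming an x86 target. WriteMask and OriginalDest are names that appear in the episode index; the function MaskedStore4 and the other variable names are hypothetical, and this is an illustration of the technique rather than the code written on stream.

```c
#include <emmintrin.h> // SSE2 intrinsics
#include <stdint.h>

// Hypothetical helper: write four computed pixels to Dest, but only in
// lanes where WriteMask is all 1 bits. The other lanes keep whatever
// Dest already held.
static void MaskedStore4(uint32_t *Dest, __m128i Out, __m128i WriteMask)
{
    // Unaligned load: remember what the destination looked like before.
    __m128i OriginalDest = _mm_loadu_si128((const __m128i *)Dest);

    // Keep the new values where the mask is set...
    __m128i KeptNew = _mm_and_si128(WriteMask, Out);

    // ...and the old values where it is clear. _mm_andnot_si128(a, b)
    // computes (~a) & b, so the inverted mask never has to be built.
    __m128i KeptOld = _mm_andnot_si128(WriteMask, OriginalDest);

    // Each lane is zero in one of the two halves, so OR merges them cleanly.
    __m128i MaskedOut = _mm_or_si128(KeptNew, KeptOld);
    _mm_storeu_si128((__m128i *)Dest, MaskedOut);
}
```

For example, with Dest holding {1, 2, 3, 4}, Out set to {10, 20, 30, 40} and a mask that is set only in the first two lanes, Dest ends up as {10, 20, 3, 4}. Using the and-not instruction is one way to avoid a separate inversion step; explicitly inverting the mask and using a plain AND would give the same result.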