Packing Pixels for the Framebuffer

2:05 Load up the code and consider optimisation
4:09 handmade_render_group.cpp: Comment out if(ShouldFill[I])
5:34 Blackboard: Interleaving four SIMD values
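For orientation, the 5:34 blackboard segment is about turning four channel-per-register SIMD values into packed framebuffer pixels. Below is a minimal scalar sketch of the packing the wide path has to reproduce, assuming the 0xAARRGGBB destination layout used on the stream; the function name is illustrative, not from the episode's code.

```cpp
// Scalar reference: pack four float channels (assumed already in the 0..255
// range) into one 32-bit 0xAARRGGBB framebuffer pixel.
#include <stdint.h>

static uint32_t
PackPixel(float R, float G, float B, float A)
{
    // Round to nearest by adding 0.5 before truncating (non-negative inputs).
    uint32_t Ri = (uint32_t)(R + 0.5f);
    uint32_t Gi = (uint32_t)(G + 0.5f);
    uint32_t Bi = (uint32_t)(B + 0.5f);
    uint32_t Ai = (uint32_t)(A + 0.5f);

    return ((Ai << 24) | (Ri << 16) | (Gi << 8) | (Bi << 0));
}
```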
14:27 Blackboard: Establishing the order we need
15:46 handmade_render_group.cpp: Write the SIMD register names that we want to end up with
16:29 Internet: Intel Intrinsics Guide
17:23 Blackboard: _mm_unpackhi_epi32 and _mm_unpacklo_epi32
19:04 Blackboard: Using these operations to generate what we need
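For reference, the blackboard discussion above centres on how the 32-bit unpack intrinsics interleave lanes from two registers. The sketch below shows that behaviour, with a second round of 64-bit unpacks finishing a full 4x4 transpose; it illustrates the operations themselves, not necessarily the code the renderer ends up with, and the names are illustrative.

```cpp
// Each input register holds one channel for four pixels; two rounds of
// unpacks interleave them into one register per pixel.
#include <emmintrin.h>  // SSE2

static void
InterleaveChannels(__m128i R, __m128i G, __m128i B, __m128i A, __m128i Pixel[4])
{
    // _mm_unpacklo_epi32(a, b) -> { a0, b0, a1, b1 }
    // _mm_unpackhi_epi32(a, b) -> { a2, b2, a3, b3 }
    __m128i RG_lo = _mm_unpacklo_epi32(R, G);  // { R0, G0, R1, G1 }
    __m128i BA_lo = _mm_unpacklo_epi32(B, A);  // { B0, A0, B1, A1 }
    __m128i RG_hi = _mm_unpackhi_epi32(R, G);  // { R2, G2, R3, G3 }
    __m128i BA_hi = _mm_unpackhi_epi32(B, A);  // { B2, A2, B3, A3 }

    // A second round of 64-bit unpacks completes the 4x4 transpose.
    Pixel[0] = _mm_unpacklo_epi64(RG_lo, BA_lo);  // { R0, G0, B0, A0 }
    Pixel[1] = _mm_unpackhi_epi64(RG_lo, BA_lo);  // { R1, G1, B1, A1 }
    Pixel[2] = _mm_unpacklo_epi64(RG_hi, BA_hi);  // { R2, G2, B2, A2 }
    Pixel[3] = _mm_unpackhi_epi64(RG_hi, BA_hi);  // { R3, G3, B3, A3 }
}
```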
24:17 handmade_render_group.cpp: Name the registers in register order
25:15 Internet: Double-check the parameter order of the unpack operations
26:22 handmade_render_group.cpp: Start to populate the registers
26:52 Internet: Keeping in mind how often you move between __m128 and __m128i
28:39 handmade_render_group.cpp: Cast the Blended values from float to int
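The 26:52 and 28:39 entries touch on the difference between reinterpreting a register's bits and numerically converting its lanes. A short sketch of the two, with illustrative names:

```cpp
// Two different ways to move between __m128 (float) and __m128i (integer):
// a numeric conversion versus a bit-for-bit reinterpretation.
#include <emmintrin.h>

static void
FloatToIntExamples(__m128 BlendedR)
{
    // Numeric conversion: rounds each float lane to a 32-bit integer
    // (rounding behaviour follows the current MXCSR rounding mode).
    __m128i IntR = _mm_cvtps_epi32(BlendedR);

    // Reinterpretation: the same 128 bits viewed as integers. Free at
    // runtime, but the lanes are IEEE-754 bit patterns, not pixel values.
    __m128i BitsR = _mm_castps_si128(BlendedR);

    (void)IntR;
    (void)BitsR;
}
```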
29:47 Use structured art to enable us to see what's happening
34:47 Debugger: Watch how our art gets shuffled
38:40 handmade_render_group.cpp: Produce the rest of the pixel values we need
41:43 Convert 32-bit floating point values to 8-bit integers
44:07 // TODO(casey): Set the rounding to something known
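The TODO at 44:07 exists because _mm_cvtps_epi32 rounds according to the current MXCSR rounding mode, which the code does not control at this point. One way to pin it down, sketched here rather than taken from the episode's code:

```cpp
// Pin the MXCSR rounding mode so _mm_cvtps_epi32 rounds predictably
// (round-to-nearest is also the hardware default).
#include <xmmintrin.h>

static void
SetKnownRounding(void)
{
    unsigned int OldMode = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(_MM_ROUND_NEAREST);
    // ... SIMD work that relies on round-to-nearest ...
    _MM_SET_ROUNDING_MODE(OldMode);
}
```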
45:08 Blackboard: Using 8 bits of these 32-bit registers
47:32 handmade_render_group.cpp: Bitwise OR and shift these values
50:27 Blackboard: How the shift operations work
52:44 handmade_render_group.cpp: Implement these shifts
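Putting the 41:43 through 52:44 entries together: each converted channel gets shifted into its byte position and the four results are ORed into one register holding four packed pixels. A sketch of that packing, assuming the 0xAARRGGBB layout and 0..255 channel values; the register names are illustrative.

```cpp
// Wide version of the scalar packing shown earlier: shift each channel
// register into its byte position and OR the four together.
#include <emmintrin.h>

static __m128i
PackFourPixels(__m128i IntR, __m128i IntG, __m128i IntB, __m128i IntA)
{
    __m128i Sr = _mm_slli_epi32(IntR, 16);  // 0x000000RR -> 0x00RR0000 per lane
    __m128i Sg = _mm_slli_epi32(IntG, 8);   // 0x000000GG -> 0x0000GG00
    __m128i Sb = IntB;                      // blue stays in the low byte
    __m128i Sa = _mm_slli_epi32(IntA, 24);  // 0x000000AA -> 0xAA000000

    // Four packed 0xAARRGGBB pixels, one per 32-bit lane.
    return _mm_or_si128(_mm_or_si128(_mm_or_si128(Sr, Sg), Sb), Sa);
}
```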
55:06 Debugger: Take a look at the Out value
57:33 handmade_render_group.cpp: Break out the values
58:22 Debugger: Inspect these values
58:35 handmade_render_group.cpp: Fix the test case
59:32 Debugger: Inspect our stuff
1:00:13 handmade_render_group.cpp: Write Out to Pixel
1:01:08 Debugger: Crash and reload
1:01:43 Debugger: Note that we are writing unaligned
1:04:22 Blackboard: Alignment
1:05:54 handmade_render_group.cpp: Issue _mm_storeu_si128 to cause the compiler to use the (unaligned) mov instruction
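The switch at 1:05:54 is about the store, not the packing: _mm_store_si128 assumes a 16-byte-aligned destination and faults on anything else, while _mm_storeu_si128 accepts any address. The episode simply uses the unaligned store; the branch below exists only to illustrate the distinction, and the names are illustrative.

```cpp
// Aligned vs unaligned 128-bit stores of four packed pixels.
#include <emmintrin.h>
#include <stdint.h>

static void
StoreFourPixels(uint32_t *Pixel, __m128i Out)
{
    if(((uintptr_t)Pixel & 15) == 0)
    {
        // Destination happens to be 16-byte aligned.
        _mm_store_si128((__m128i *)Pixel, Out);
    }
    else
    {
        // General case: emit an unaligned move.
        _mm_storeu_si128((__m128i *)Pixel, Out);
    }
}
```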
1:07:23 Recap and glimpse into the future
1:08:30 Q&A
1:09:59 braincruser Will the operations be reordered to reduce the number of ops and load / stores?
1:12:01 mmozeiko You are calculating Out like or(or(or(r, g), b), a). Would it be better to do it like this: or(or(r, g), or(b, a)), so first two or's are not dependent on each other?
1:14:57 handmade_render_group.cpp: Write it the way mmozeiko suggests
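For comparison, here are the two OR shapes from mmozeiko's question, sketched side by side; both produce the same value, but the balanced form shortens the dependency chain so the first two ORs can issue independently. Names are illustrative.

```cpp
#include <emmintrin.h>

static __m128i
CombineChained(__m128i Sr, __m128i Sg, __m128i Sb, __m128i Sa)
{
    // Each OR waits on the previous one: a three-deep dependency chain.
    return _mm_or_si128(_mm_or_si128(_mm_or_si128(Sr, Sg), Sb), Sa);
}

static __m128i
CombineBalanced(__m128i Sr, __m128i Sg, __m128i Sb, __m128i Sa)
{
    // (Sr|Sg) and (Sb|Sa) are independent; only the final OR depends on both.
    return _mm_or_si128(_mm_or_si128(Sr, Sg), _mm_or_si128(Sb, Sa));
}
```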
1:17:31 uspred Do you need to start with 32-bit floats? Is there further optimization that doesn't need the casting?
1:18:21 Blackboard: Multiplying floats vs multiplying integers
1:19:54 mmozeiko Same for texture bilinear adds together
1:20:03 handmade_render_group.cpp: Implement mmozeiko's suggestion
1:23:00 flaturated Can you compile /O2 to compare it to last week's performance?
1:23:16 brblackmer Why did you make macros for your SIMD operations (mmSquare, etc.) vs making functions?
1:23:39 quikligames Are these intrinsics the same on other operating systems or compilers, as long as it's using Intel architecture?
1:24:40 mmozeiko Why do you say unaligned store is nasty? As far as I know, for latest Intel CPUs (at least starting from Ivy Bridge) unaligned load / store is not very expensive anymore (<5% difference)
1:26:25 plain_flavored Is scalar access to __m128 elements still slow on Intel?
1:27:18 braincruser The processor window is 192 instructions
1:28:01 gasto5 I don't understand how one optimizes by using the intrinsic or function
1:28:51 mmozeiko _mm_cvttps_epi32 always truncates. Would that be better than messing with rounding mode?
1:30:45 handmade_render_group.cpp: Switch to _mm_cvttps_epi32
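On the 1:28:51 question: _mm_cvttps_epi32 truncates toward zero regardless of MXCSR, so adding 0.5 before converting gives round-to-nearest for non-negative values without touching the rounding mode. A sketch of that truncate-plus-bias idea, not necessarily the exact form the code takes:

```cpp
#include <emmintrin.h>

static __m128i
RoundToInt(__m128 Value)
{
    // Bias by 0.5, then truncate toward zero: round-to-nearest for
    // non-negative lanes, independent of the MXCSR rounding mode.
    __m128 Half = _mm_set1_ps(0.5f);
    return _mm_cvttps_epi32(_mm_add_ps(Value, Half));
}
```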
1:32:50 Wrap up