1:36 Reintroducing the Intel Architecture Code Analyzer
10:46 A long time ago, in RAD's source tree
13:00 blowtard, an analytical tool for the Xbox 360's PowerPC Tri-Core Xenon written by Casey
22:04 How IACA's output differs from Casey's stats in blowtard
30:56 Looking at how to get our cycle count down
32:21 Manually unroll the Fetch / Sample loop
37:27 Use _mm_setr_ps as suggested by Fabian a long time ago
42:14 Taking a look at the total throughput count
43:18 Casey needs some more soya [sic] milk
44:17 Could we do a load once, and grab out the two values that we needed?
45:48 Explanation of possible texel loading optimisation
🖌
50:32 Figuring out how the compiler is loading the texel data
1:00:18 This is fine, then
1:01:01 We multiply by TexturePitch and sizeof(uint32) four-wide manually, which is stupid
1:02:06 Shift up FetchX_4x by 2, rather than multiply by sizeof(uint32)
1:03:40 Premultiply FetchY_4x by TexturePitch_4x
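The two entries above both reduce the per-texel address arithmetic. A minimal scalar sketch of the idea, assuming 32-bit texels and a row pitch in bytes; the function names and parameters here are hypothetical and are not taken from the stream's code:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of texel (x, y) in a texture with
   32-bit texels whose rows are pitch_bytes apart, using two multiplies. */
static size_t texel_offset_multiply(uint32_t x, uint32_t y, uint32_t pitch_bytes)
{
    return (size_t)y * pitch_bytes + (size_t)x * sizeof(uint32_t);
}

/* Same offset, with the multiply by sizeof(uint32_t) == 4 written as a
   shift by 2, and the row term passed in already multiplied by the pitch. */
static size_t texel_offset_shift(uint32_t x, size_t y_times_pitch)
{
    return y_times_pitch + ((size_t)x << 2);
}
```

Both functions return the same offset for the same texel; the second simply moves work out of the inner computation.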
1:04:07 Give the compiler the wide stuff so that it can see it as wide
1:11:21 _mm_mul_epi32 does not do integer * integer
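For readers unfamiliar with the distinction: _mm_mul_epi32 multiplies only two of the four 32-bit lanes and produces 64-bit products, whereas the lane-by-lane 32-bit multiply is _mm_mullo_epi32. The following is a scalar model of the two behaviours written in plain C, not the intrinsics themselves; the model_* function names are made up for illustration:

```c
#include <stdint.h>

/* Scalar model of _mm_mul_epi32: only lanes 0 and 2 of each input are
   used, and each product is a full signed 64-bit value. */
static void model_mul_epi32(const int32_t a[4], const int32_t b[4], int64_t out[2])
{
    out[0] = (int64_t)a[0] * (int64_t)b[0];
    out[1] = (int64_t)a[2] * (int64_t)b[2];
}

/* Scalar model of _mm_mullo_epi32: all four lanes are multiplied and the
   low 32 bits of each product are kept. */
static void model_mullo_epi32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    for (int i = 0; i < 4; ++i) {
        /* Multiply as unsigned so that wraparound is well defined. */
        out[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
    }
}
```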
1:13:43 Port pressure (we're back to InterIteration)
1:27:22 Designing how to break up the renderer for multithreading to ease pressure on the caches
🖌
1:32:22 Divide the frame buffer into chunks that are sized appropriately for the cache
🖌
1:39:55 The plan for setting up the renderer
🖌
1:40:47 Implementation of interleaved scanlines, in readiness for hyperthreading
1:46:36 The logic of interleaved scanlines
🖌
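One way to picture interleaved scanlines is a fill routine that touches only the rows of one parity, so that an "even" pass and an "odd" pass together cover the region. This is a rough sketch under that assumption; fill_interleaved and its parameters are hypothetical, and the stream's DrawRectangleQuickly may organise the stepping differently:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: fill rows [min_y, max_y) of a 32-bit frame buffer
   that is `width` pixels wide, touching only rows whose parity matches
   `even` (non-zero for even rows, zero for odd rows). min_y is assumed to
   be non-negative. Returns the number of rows written. */
static int fill_interleaved(uint32_t *pixels, int width,
                            int min_y, int max_y, int even, uint32_t color)
{
    int want_odd = even ? 0 : 1;
    int y = min_y;
    int rows_written = 0;

    /* Advance to the first row with the requested parity. */
    if ((y & 1) != want_odd) {
        ++y;
    }

    /* Step two rows at a time so the other parity is left untouched. */
    for (; y < max_y; y += 2) {
        uint32_t *row = pixels + (size_t)y * (size_t)width;
        for (int x = 0; x < width; ++x) {
            row[x] = color;
        }
        ++rows_written;
    }
    return rows_written;
}
```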
1:52:37 Updating compiler directives for folks who use LLVM
1:55:20 Implementation of frame buffer divisions, in readiness for multi-core processing
2:05:30 Go to Disassembly of DrawRectangleQuickly() in order to diagnose bogus cycle count
2:10:04 Frame buffer divisions, continued
2:20:50 Introduce GetClampedRectArea
2:22:12 Problematic thing: Our convention for rectangles before was that they did not include their final value
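The "did not include their final value" convention is the half-open one: a rectangle covers min <= x < max. A minimal sketch of a clamped area under that convention follows; the struct layout and the lower-case names are assumptions made for illustration, and the actual GetClampedRectArea in the stream may differ:

```c
/* Hypothetical half-open integer rectangle: covers min_x <= x < max_x and
   min_y <= y < max_y, so max_x and max_y themselves are excluded. */
typedef struct {
    int min_x, min_y, max_x, max_y;
} rect2i;

/* Area of the rectangle, clamped to zero when it is empty or inverted. */
static int clamped_rect_area(rect2i r)
{
    int width = r.max_x - r.min_x;
    int height = r.max_y - r.min_y;
    return (width > 0 && height > 0) ? width * height : 0;
}
```

Under an inclusive convention the width would instead be max_x - min_x + 1, which is the kind of mismatch that produces off-by-one clipping.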
2:27:33 Fix the cycle counter for DrawRectangleQuickly() again
2:29:42 A shortcut didn't work out. (!quote 297 + !quote 298)
2:30:56 Loft FillRect above the loop
2:36:34 Introduce PixelPxRow in order to keep PixelPx as a wide value rather than having to set it each time
2:39:50 Check IACA for performance difference and revert to setting PixelPx each time through the loop
2:43:28 Shuffle calculations around to figure out how the performance is affected, for good or ill
2:51:17 Thinking about that alignment problem
🖌
2:55:58 Align MinX and MaxX
3:00:18 Microsoft Visual Studio 2013 has stopped working
3:03:03 Change our loads and stores to no longer be unaligned
3:04:05 Assess performance difference and revert back to the unaligned load and store instructions
3:05:12 Make sure that we actually always fill the real clip region and not write outside the clip region
3:07:10 Our options for filling the pixels
🖌
3:09:12 Implementation of alignment to the ending edge
3:16:48 Clip the leading edge
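The alignment and edge-clipping entries above can be pictured as widening a span to whole groups of four pixels, then masking off the lanes that fall outside the real fill region. The sketch below assumes half-open, non-negative X bounds and represents each mask as four bits (bit i set means lane i may be written); aligned_span and align_span_to_4 are hypothetical names, and the stream uses SSE mask registers rather than integer bitmasks:

```c
/* Hypothetical result of widening [min_x, max_x) to multiples of 4. */
typedef struct {
    int min_x, max_x;     /* widened bounds, both multiples of 4 */
    unsigned start_mask;  /* writable lanes in the first group of 4 */
    unsigned end_mask;    /* writable lanes in the last group of 4 */
} aligned_span;

/* Assumes 0 <= min_x < max_x. */
static aligned_span align_span_to_4(int min_x, int max_x)
{
    aligned_span s;
    int lead = min_x & 3;   /* pixels before min_x in the first group */
    int tail = max_x & 3;   /* pixels before max_x in the last group */

    s.min_x = min_x & ~3;           /* round down to a multiple of 4 */
    s.max_x = (max_x + 3) & ~3;     /* round up to a multiple of 4 */

    /* Leading edge: drop the lowest `lead` lanes. */
    s.start_mask = (0xFu << lead) & 0xFu;

    /* Trailing edge: keep only the lowest `tail` lanes, or all four when
       max_x was already aligned. */
    s.end_mask = tail ? ((1u << tail) - 1u) : 0xFu;
    return s;
}
```

When the widened span is a single group of four, the first and last group coincide, so the two masks would need to be ANDed together.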
3:21:33 Try setting StartupClipMask by using _mm_srli_si128
3:22:28 // TODO(casey): This is stupid.
3:26:10 Early-out the FillRect tests
3:30:01 Start passing ClipRect through to DrawRectangleQuickly
3:35:35 Moment of realisation, with introduction of the InvertedInfinityRectangle
3:37:48 Temporarily adjust ClipRect in order to avoid a crash
3:39:24 Introduce TiledRenderGroupToOutput outside of the timer
3:43:57 Update DrawRectangle to take the clipping information
3:47:18 Update DrawRectangle{,Quickly} to use the Even / Odd information
3:49:20 Break the screen up into pieces and render them separately
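A rough sketch of how a frame buffer might be divided into a grid of tiles, each with its own clip rectangle. The tile counts, the rounding, and the names tile_bounds and get_tile_bounds are all assumptions for illustration; the episode's TiledRenderGroupToOutput may compute its tiles differently:

```c
/* Hypothetical half-open tile rectangle within the frame buffer. */
typedef struct {
    int min_x, min_y, max_x, max_y;
} tile_bounds;

/* Bounds of tile (tx, ty) in a tiles_x by tiles_y grid over a buffer of
   buffer_w by buffer_h pixels. Tile sizes are rounded up so the grid
   covers the whole buffer, and edge tiles are clamped to it. */
static tile_bounds get_tile_bounds(int buffer_w, int buffer_h,
                                   int tiles_x, int tiles_y, int tx, int ty)
{
    int tile_w = (buffer_w + tiles_x - 1) / tiles_x;
    int tile_h = (buffer_h + tiles_y - 1) / tiles_y;
    tile_bounds t;

    t.min_x = tx * tile_w;
    t.min_y = ty * tile_h;
    t.max_x = t.min_x + tile_w;
    t.max_y = t.min_y + tile_h;

    /* Clamp to the buffer; a tile that lies fully outside becomes empty. */
    if (t.min_x > buffer_w) t.min_x = buffer_w;
    if (t.min_y > buffer_h) t.min_y = buffer_h;
    if (t.max_x > buffer_w) t.max_x = buffer_w;
    if (t.max_y > buffer_h) t.max_y = buffer_h;
    return t;
}
```

Each tile's bounds could then be handed to the rectangle routines as the clip rectangle, so that one render pass only ever writes inside its own tile.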
3:54:34 Stretch your legs, Casey
3:56:28 We can finally end the stream
3:57:12 rygorous a) your top and right clip is off-by-1!
🗪
3:59:54 mmozeiko _mm_mullo_epi32 is SSE4 intrinsic
🗪
4:04:57 mmozeiko Will you revert yesterday changes where you changed bilinear pixel unpacking code from float mul to int mul? It was faster with float mul.
🗪
4:05:24 an0nymal How many more marathon streams will we have? I thoroughly enjoyed the 4+ hours today.
🗪
4:05:47 quikligames You should give a big thanks to Rygorous for sticking around and trying to give you tips knowing full well that you wouldn't see them in chat
🗪
4:06:07 mmozeiko would it be better to have tile sizes always divisible by 4 horizontally (or even 16 to be cache aligned), then there will be no need to deal with alignment and masking?
🗪
4:07:07 rygorous (clip) one too few pixels. look at the edge of the screen.
🗪
4:09:20 rygorous just pretty sure I saw glitchiness/off-by-1-pixel stuff near the edges but it might've been the video encoding
🗪
4:11:08 mmozeiko (tile size %4) - not masking for textures, but ClipMask variable
🗪
4:13:30 abnercoimbre Q: holy crap. our 1st marathon.
🗪
4:13:50 Time for Casey to go to bed, with closing remarks