Total found: 224. Displayed: 105.
Publication date: 27-12-2016

Interlocked increment memory allocation and access

Number: US9529632B2

A method of allocating a memory to a plurality of concurrent threads is presented. The method includes dynamically determining writer threads each having at least one pending write to the memory; and dynamically allocating respective contiguous blocks in the memory for each of the writer threads. Another method of allocating a memory to a plurality of concurrent threads includes launching the plurality of threads as a plurality of wavefronts, dynamically determining a group of wavefronts each having at least one thread requiring a write to the memory, and dynamically allocating respective contiguous blocks in the memory for each wavefront from the group of wavefronts. A corresponding method of assigning a memory to a plurality of reader threads includes determining a first number corresponding to a number of writer threads having a block allocated in said memory, launching a first number of reader threads, entering a first wavefront of said reader threads from said group of wavefronts to ...
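For illustration, a minimal C++ sketch of the interlocked-increment idea: a shared atomic cursor hands each writer thread its own contiguous block with a single fetch_add. The buffer size, thread count, and all names are illustrative assumptions, not the patent's implementation.

```cpp
// Sketch: contiguous per-writer block allocation via an interlocked increment.
#include <atomic>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<std::size_t> cursor{0};   // next free offset in the shared buffer
std::vector<int> buffer(1024);

// Each writer reserves a contiguous block with one atomic fetch_add,
// so concurrent writers never interleave within a block.
void writer(int id, std::size_t nwrites) {
    std::size_t base = cursor.fetch_add(nwrites, std::memory_order_relaxed);
    for (std::size_t i = 0; i < nwrites; ++i)
        buffer[base + i] = id;        // private, contiguous region
}

int main() {
    std::vector<std::thread> threads;
    for (int id = 0; id < 4; ++id)
        threads.emplace_back(writer, id, 8);
    for (auto& t : threads) t.join();
    std::cout << "used " << cursor.load() << " slots\n";  // readers scan [0, cursor)
}
```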

Publication date: 23-08-2016

Method and system for synchronization of workitems with divergent control flow

Number: US0009424099B2

Disclosed methods, systems, and computer program products embodiments include synchronizing a group of workitems on a processor by storing a respective program counter associated with each of the workitems, selecting at least one first workitem from the group for execution, and executing the selected at least one first workitem on the processor. The selecting is based upon the respective stored program counter associated with the at least one first workitem.
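As a rough illustration of selection keyed to stored program counters, the C++ sketch below parks each workitem's PC and picks the subset sharing the minimum PC to run next. The minimum-PC policy and all names are assumptions; the claim only requires that selection depend on the stored counters.

```cpp
// Sketch: program-counter-based selection for divergent workitems.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct Workitem { int id; uint32_t pc; bool done; };

// Pick every unfinished workitem whose stored PC equals the group minimum;
// those execute together, the rest stay parked.
std::vector<int> select_runnable(const std::vector<Workitem>& group) {
    uint32_t min_pc = UINT32_MAX;
    for (const auto& w : group)
        if (!w.done) min_pc = std::min(min_pc, w.pc);
    std::vector<int> chosen;
    for (const auto& w : group)
        if (!w.done && w.pc == min_pc) chosen.push_back(w.id);
    return chosen;
}

int main() {
    std::vector<Workitem> group{{0, 0x40, false}, {1, 0x80, false}, {2, 0x40, false}};
    for (int id : select_runnable(group)) std::cout << "run workitem " << id << '\n';
}
```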

Publication date: 08-09-2016

PROVIDING ASYNCHRONOUS DISPLAY SHADER FUNCTIONALITY ON A SHARED SHADER CORE

Number: US20160260246A1
Assignee: Advanced Micro Devices, Inc.

A method, a non-transitory computer readable medium, and a processor for performing display shading for computer graphics are presented. Frame data is received by a display shader, the frame data including at least a portion of a rendered frame. Parameters for modifying the frame data are received by the display shader. The parameters are applied to the frame data by the display shader to create a modified frame. The modified frame is displayed on a display device.

Publication date: 22-12-2016

HYBRID RENDER WITH PREFERRED PRIMITIVE BATCH BINNING AND SORTING

Number: US20160371873A1

A system, method and a computer program product are provided for hybrid rendering with deferred primitive batch binning. A primitive batch is generated from a sequence of primitives. Initial bin intercepts are identified for primitives in the primitive batch. A bin for processing is identified. The bin corresponds to a region of a screen space. Pixels of the primitives intercepting the identified bin are processed. Next bin intercepts are identified while the primitives intercepting the identified bin are processed.

Claims (excerpt):
1. A method comprising: generating a primitive batch from a sequence of one or more primitives, wherein the primitives in the batch intercept one or more rows and one or more columns associated with a plurality of bins defining a display area; sorting the one or more primitives in the batch by a row intercepted by the one or more primitives; sorting the one or more primitives in the batch by a column intercepted by the one or more primitives; rasterizing all the primitives in the batch that intercept a row and column queried; and querying all bins that contain primitives until all primitives in all bins containing primitives are rasterized.
2. The method of claim 1, wherein sorting the one or more primitives in the batch by a row intercepted by the one or more primitives includes determining a minimum row intercepted by the one or more primitives and a maximum row intercepted by the one or more primitives.
3. The method of claim 1, wherein sorting the one or more primitives in the batch by a column intercepted by the one or more primitives includes determining a minimum column intercepted by the one or more primitives and a maximum column intercepted by the one or more primitives.
4. The method of claim 1, further comprising determining whether a batch break condition exists prior to inserting a new primitive in the batch.
5. The method of claim 4, wherein a batch break condition includes any one of the following: there is not enough space in a memory to store primitives, an event ...
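A toy C++ sketch of the binning idea described above: each primitive records the min/max row and column of the bins it intercepts, and bins are queried in order, rasterizing every primitive whose intercept range covers the bin. Grid dimensions and structures are illustrative assumptions.

```cpp
// Sketch: deferred primitive batch binning with per-primitive
// min/max row/column bin intercepts (the sorted keys of claims 2-3).
#include <cstddef>
#include <iostream>
#include <vector>

struct Prim { int minRow, maxRow, minCol, maxCol; };  // bounding box over bins

int main() {
    const int rows = 4, cols = 4;
    std::vector<Prim> batch{{0, 1, 0, 2}, {1, 3, 2, 3}};

    for (int r = 0; r < rows; ++r)            // query bins row by row
        for (int c = 0; c < cols; ++c)
            for (std::size_t p = 0; p < batch.size(); ++p)
                if (r >= batch[p].minRow && r <= batch[p].maxRow &&
                    c >= batch[p].minCol && c <= batch[p].maxCol)
                    std::cout << "bin(" << r << ',' << c
                              << ") rasterizes primitive " << p << '\n';
}
```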

Publication date: 15-08-2012

A processing unit that enables asynchronous task dispatch

Number: CN102640115A

A processing unit includes a plurality of virtual engines and a shader core. The plurality of virtual engines is configured to (i) receive, from an operating system (OS), a plurality of tasks substantially in parallel with each other and (ii) load a set of state data associated with each of the plurality of tasks. The shader core is configured to execute the plurality of tasks substantially in parallel based on the set of state data associated with each of the plurality of tasks. The processing unit may also include a scheduling module that schedules the plurality of tasks to be issued to the shader core.

Publication date: 16-03-2017

PREEMPTIVE CONTEXT SWITCHING OF PROCESSES ON AN ACCELERATED PROCESSING DEVICE (APD) BASED ON TIME QUANTA

Number: US20170076421A1
Assignee: Advanced Micro Devices, Inc.

Methods and apparatus are described. A method includes an accelerated processing device running a process. When a maximum time interval during which the process is permitted to run expires before the process completes, the accelerated processing device receives an operating-system-initiated instruction to stop running the process. The accelerated processing device stops the process from running in response to the received operating-system-initiated instruction.

Claims (excerpt):
1. A method for use in an accelerated processing device (APD) running a process, the method comprising: when a maximum time interval during which the process is permitted to run expires before the process completes, receiving an operating-system-initiated instruction to stop running the process; and stopping the process from running in response to the received operating-system-initiated instruction.
2. The method of claim 1, further comprising: detecting an expiration of the maximum time interval; and generating an interrupt in response to detecting the expiration of the maximum time interval.
3. The method of claim 2, further comprising: receiving a setting of a value for the maximum time interval from at least one of a user interface, a central processing unit (CPU), and the operating system; setting a timer based on the received setting; starting the timer and starting the running of the process concurrently; and detecting the expiration of the maximum time interval when the timer expires.
4. The method of claim 1, wherein the maximum time interval is only applicable to the process running on the APD, is applicable to all processes that run on the APD, or is applicable to a particular group of processes that run on the APD.
5. The method of claim 1, wherein the maximum time interval is dynamically configurable based on an actual workload, a predicted workload, or a characteristic of the process running on the APD.
6. The method of claim 1, wherein the maximum time interval is pre- ...
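A minimal C++ sketch of the time-quantum mechanism, with a watchdog thread standing in for the timer/interrupt path and an atomic flag standing in for the operating-system-initiated stop instruction; all names are assumptions.

```cpp
// Sketch: time-quantum preemption via a watchdog timer and a stop flag.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> stop_requested{false};

void watchdog(std::chrono::milliseconds quantum) {
    std::this_thread::sleep_for(quantum);      // timer set from the quantum
    stop_requested.store(true);                // "interrupt" on expiration
}

void process() {
    unsigned long long iterations = 0;
    while (!stop_requested.load())             // poll the stop request
        ++iterations;                          // stand-in for real work
    std::cout << "preempted after " << iterations << " iterations\n";
}

int main() {
    std::thread timer(watchdog, std::chrono::milliseconds(50));
    std::thread work(process);                 // timer and process start together
    timer.join();
    work.join();
}
```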

Publication date: 18-07-2012

A processing unit with a plurality of shader engines

Number: CN102598061A

A processor includes a first shader engine and a second shader engine. The first shader engine is configured to process pixel shaders for a first subset of pixels to be displayed on a display device. The second shader engine is configured to process pixel shaders for a second subset of pixels to be displayed on the display device. Both the first and second shader engines are also configured to process general-compute shaders and non-pixel graphics shaders. The processor may also include a level-one (L1) data cache, coupled to and positioned between the first and second shader ...

Publication date: 19-01-2012

DYNAMIC CONTROL OF SIMDs

Number: US20120013627A1
Assignee: Advanced Micro Devices, Inc.

Systems and methods to improve performance in a graphics processing unit are described herein. Embodiments achieve power saving in a graphics processing unit by dynamically activating/deactivating individual SIMDs in a shader complex that comprises multiple SIMD units. On-the-fly dynamic disabling and enabling of individual SIMDs provides flexibility in achieving a required performance and power level for a given processing application. Embodiments of the invention also achieve dynamic medium grain clock gating of SIMDs in a shader complex. Embodiments reduce switching power by shutting down clock trees to unused logic by providing a clock on demand mechanism. In this way, embodiments enhance clock gating to save more switching power for the duration of time when SIMDs are idle (or assigned no work). Embodiments can also save leakage power by power gating SIMDs for a duration when SIMDs are idle for an extended period of time.

Claims (excerpt):
1. A method to improve performance in a computing system, comprising: determining a required power level for a processing application; and dynamically enabling and disabling one or more single instruction multiple data units (SIMDs) in a shader complex based on said power level.
2. The method of claim 1, further comprising: configuring, in real-time, a plurality of registers to indicate when said SIMDs are to be enabled and disabled.
3. The method of claim 1, further comprising: determining a number of SIMDs needed for said processing application.
4. The method of claim 2, further comprising: reviewing said configured registers; and assigning work threads based on configuration of said registers.
5. The method of claim 2, further comprising: servicing one or more pending work requests prior to said configuring.
6. The method of claim 1, wherein said dynamically enabling and disabling comprises enabling and disabling said SIMDs during their active execution period and independent of activity in a shader engine associated with said SIMDs.
7. The method of ...
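A small C++ sketch of the register-driven enable/disable idea: a bitmask plays the role of the configured registers, work is assigned only to SIMDs whose bit is set, and changing the power level rewrites the mask. Counts and names are illustrative assumptions; real clock/power gating is hardware, not software.

```cpp
// Sketch: per-SIMD enable/disable via a register-like bitmask.
#include <bitset>
#include <cstddef>
#include <iostream>

constexpr std::size_t kNumSimds = 8;
std::bitset<kNumSimds> simd_enabled;           // the "configured registers"

void set_power_level(std::size_t needed) {     // enable only what the app needs
    simd_enabled.reset();
    for (std::size_t i = 0; i < needed && i < kNumSimds; ++i)
        simd_enabled.set(i);
}

void assign_work(int thread_id) {              // review registers, then assign
    for (std::size_t i = 0; i < kNumSimds; ++i)
        if (simd_enabled.test(i)) {
            std::cout << "thread " << thread_id << " -> SIMD " << i << '\n';
            return;
        }
    std::cout << "thread " << thread_id << " deferred: all SIMDs gated off\n";
}

int main() {
    set_power_level(2);    // low-power: 2 of 8 SIMDs active
    assign_work(0);
    set_power_level(0);    // fully gated
    assign_work(1);
}
```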

Publication date: 24-05-2012

Method and System for Synchronizing Thread Wavefront Data and Events

Number: US20120131596A1

Systems and methods for synchronizing thread wavefronts and associated events are disclosed. According to an embodiment, a method for synchronizing one or more thread wavefronts and associated events includes inserting a first event associated with a first data output from a first thread wavefront into an event synchronizer. The event synchronizer is configured to release the first event before releasing events inserted subsequent to the first event. The method further includes releasing the first event from the event synchronizer after the first data is stored in the memory. Corresponding system and computer readable medium embodiments are also disclosed.

Claims (excerpt):
1. A method for synchronizing one or more thread wavefronts and associated events, comprising: inserting, into an event synchronizer, a first event associated with first data output from a first thread wavefront, wherein the event synchronizer is configured to release the first event before releasing events inserted subsequent to the first event; and releasing the first event from the event synchronizer after the first data is stored in a memory.
2. The method of claim 1, further comprising: providing the released first event to one or more client modules.
3. The method of claim 1, further comprising: inserting, into the event synchronizer, one or more second events associated with a second wavefront configured to be executed after the first wavefront; and releasing the one or more second events from the event synchronizer after the releasing of the first event.
4. The method of claim 3, further comprising: executing the second wavefront according to the released one or more second events.
5. The method of claim 1, wherein the event synchronizer includes a first-in-first-out (FIFO) queue.
6. The method of claim 1, further comprising: monitoring the first data in one or more buffers before the first data is stored in the memory; and detecting, based on the monitoring, the completion of storing the first data in the memory ...
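A toy C++ sketch of the event synchronizer modeled as the FIFO of claim 5: an event is released only once its data is marked as stored, and never ahead of an earlier event. The structures and flags are assumptions for illustration.

```cpp
// Sketch: FIFO event synchronizer that releases events in insertion
// order, and only after the associated data is stored in memory.
#include <deque>
#include <iostream>
#include <string>

struct Event { std::string name; bool data_stored; };

// Release from the head only; a later event whose data is already stored
// still waits behind an earlier one whose data is not.
void drain(std::deque<Event>& fifo) {
    while (!fifo.empty() && fifo.front().data_stored) {
        std::cout << "release " << fifo.front().name << '\n';
        fifo.pop_front();
    }
}

int main() {
    std::deque<Event> fifo{{"wavefront0-write", false}, {"wavefront1-write", true}};
    drain(fifo);                        // releases nothing: head not stored yet
    fifo.front().data_stored = true;    // memory confirms the first store
    drain(fifo);                        // now releases both, in order
}
```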

Publication date: 26-07-2012

Mechanisms for Enabling Task Scheduling

Number: US20120188259A1
Assignee: Advanced Micro Devices, Inc.

Embodiments described herein provide a method including receiving a command to schedule a first process and selecting a command queue associated with the first process. The method also includes scheduling the first process to run on an accelerated processing device and preempting a second process running on the accelerated processing device to allow the first process to run on the accelerated processing device.

Claims (excerpt):
1. A method, comprising: scheduling a first process; scheduling the first process to run on an accelerated processing device (APD); and preempting a second process running on the APD, in response to receiving a command, to allow the first process to run on the APD.
2. The method of claim 1, wherein the first process is one of a graphics process and a compute process.
3. The method of claim 1, wherein the preempting comprises: stopping the second process running on the APD; and saving of a context state associated with the second process.
4. The method of claim 3, wherein, after the first process has completed, the preempting further comprises: restoring the context state of the second process; and restarting the second process to run on the APD.
5. The method of claim 1, further comprising: monitoring the command queue for a new command.
6. The method of claim 1, further comprising: placing the APD into a reduced power state if the command queue is empty.
7. The method of claim 1, further comprising: allowing an operating system to monitor a resource utilization of the APD.
8. The method of claim 7, wherein the monitoring is based on the first or second process.
9. An accelerated processing device (APD), comprising: a shader core configured to run a first and second process contained within a list of processes; a dispatcher configured to receive a command to schedule the first process, wherein the first process is associated with a command queue; and a scheduler configured to preempt the second process, in response to receiving a software ...

Publication date: 02-08-2012

Preemptive Context Switching

Number: US20120194524A1
Assignee: Advanced Micro Devices, Inc.

Methods, systems, and computer readable media embodiments are disclosed for preemptive context-switching of processes running on an accelerated processing device. Embodiments include detecting, by an accelerated processing device, a memory exception, and preempting a process from running on the accelerated processing device based upon the detected exception.

Claims (excerpt):
1. A method, comprising: detecting, by an accelerated processing device, a memory exception; and preempting a process from running on the accelerated processing device based upon the detected exception.
2. The method of claim 1, wherein the preempting of the process comprises preempting of the process from running on an accelerated processor portion of the accelerated processing device.
3. The method of claim 1, further comprising: requesting, by an input output memory management unit coupled to the accelerated processing device, data from the memory; determining, by the accelerated processing device, whether the data is absent from an accessible area of the memory; receiving, at the accelerated processing device, notification of the absence; and generating an interrupt associated with the absence.
4. The method of claim 3, further comprising: queuing an event indicating the exception in the memory, wherein the queued event is accessible by an operating system (OS).
5. The method of claim 4, further comprising: requesting, by the accelerated processing device, fault handling associated with the exception from the input output memory management device.
6. The method of claim 5, further comprising: receiving a signal indicating a status regarding the queued event from the OS.
7. The method of claim 3, wherein the determining whether the data is absent comprises: signaling to a driver associated with the accelerated processing device regarding the absence; and determining by the kernel mode driver whether to preempt or stall the process.
8. The method of claim 1, wherein the preempting comprises: determining a type of the ...

Publication date: 02-08-2012

Managed Task Scheduling on a Graphics Processing Device (APD)

Number: US20120194525A1
Assignee: Advanced Micro Devices, Inc.

Provided herein is a method including receiving a run list including one or more processes to run on an accelerated processing device, wherein each of the one or more processes is associated with a corresponding independent job command queue. The method also includes scheduling each of the one or more processes to run on the accelerated processing device based on a criteria associated with each process.

Claims (excerpt):
1. A method, comprising: receiving a run list comprising one or more processes to run on an accelerated processing device, wherein each of the one or more processes is associated with a corresponding independent job command queue; and scheduling each of the one or more processes to run on the accelerated processing device based on a criteria associated with each process.
2. The method of claim 1, wherein the criteria comprises one or more of a predetermined time quanta and a process priority.
3. The method of claim 1, wherein the one or more processes comprise one or more of a graphic process and a compute process.
4. The method of claim 1, further comprising: generating a task list of one or more processes to run on the graphics processor, wherein the task list comprises a superset of the one or more processes in the run list.
5. The method of claim 4, further comprising determining, using software, the one or more processes in the task list.
6. The method of claim 1, further comprising determining, using software, the one or more processes in the run list.
7. The method of claim 6, wherein the determining is based on a process scheduling criteria.
8. The method of claim 1, wherein the scheduling is performed by the graphics processor.
9. The method of claim 8, wherein the scheduling performed by the graphics processor is performed autonomously.
10. The method of claim 1, further comprising allowing software access to the run list.
11. The method of claim 1, further comprising deleting respective ones of the one or more processes in ...

Publication date: 09-08-2012

Preemptive context switching of processes on an accelerated processing device (APD) based on time quanta

Number: US20120200576A1
Assignee: Advanced Micro Devices, Inc.

Methods, systems, and computer readable media for preemptive context-switching of processes on an accelerated processing device are based upon a comparison of the running time of the process and a threshold time quanta. A method includes preempting a process running on an accelerated processing device based upon a running time of the process and a threshold time quanta.

Claims (excerpt):
1. A method, comprising: preempting a process running on an accelerated processing device based upon a running time of the process and a threshold time quanta.
2. The method of claim 1, wherein the process is at least one of a graphics process or a compute process.
3. The method of claim 1, wherein the accelerated processing device is in communication with a central processing unit.
4. The method of claim 1, wherein the preempting comprises: detecting a timer expiration indicating that the running time is equal to or greater than the threshold time quanta.
5. The method of claim 4, wherein the preempting further comprises: generating an interrupt corresponding to the timer expiration; and initiating of the preemption by an operating system based on receipt of the interrupt.
6. The method of claim 4, wherein the preempting further comprises: detecting the timer expiration by a hardware-based scheduler; and initiating of the preemption by the hardware-based scheduler based on receipt of the interrupt.
7. The method of claim 1, wherein the preempting comprises: saving a context of the process; and terminating the process after the saving.
8. The method of claim 7, wherein the saving comprises one of: saving a state of a graphics pipeline associated with the process and saving a state of a wavefront associated with the process; and wherein terminating comprises one of: removing an entry associated with the process from a run list and removing the entry from the run list managed by the accelerated processing device.
9. The method of claim 1, further comprising: running a second process on the accelerated ...

Publication date: 09-08-2012

Process Device Context Switching

Number: US20120200579A1
Assignee: Advanced Micro Devices, Inc.

Methods, systems, and computer readable media embodiments are disclosed for preemptive context-switching of processes running on an accelerated processing device. A method includes, responsive to an exception upon access to a memory by a process running on an accelerated processing device, determining whether to preempt the process based on the exception, and preempting, based upon the determining, the process from running on the accelerated processing device.

Claims (excerpt):
1. A method, comprising: responsive to an exception upon access to a memory by a process running on an accelerated processing device, determining whether to preempt the process based on the exception; and preempting, based upon the determining, the process from running on the accelerated processing device.
2. The method of claim 1, wherein the determining whether to preempt the process is performed by an operating system.
3. The method of claim 1, further comprising: requesting, using a memory management unit, data from the memory; determining, using the memory management unit, that the data is absent from an accessible area of the memory; receiving notification of the absence; and generating an interrupt associated with the absence.
4. The method of claim 3, further comprising: queuing an event indicating the exception to the operating system.
5. The method of claim 4, further comprising: requesting, by the accelerated processing device, fault handling associated with the exception from the memory management unit.
6. The method of claim 5, further comprising: receiving, from an operating system, a signal indicating a status regarding the queued event.
7. The method of claim 1, wherein the determining comprises: determining a type of the exception; and selecting to preempt or stall the process based upon the determined type.
8. The method of claim 7, wherein: the determining further comprises accessing statistics associated with memory exceptions; and the selecting to preempt or stall is further based upon the accessed statistics.
9. ...

Publication date: 09-08-2012

Systems and Methods for Improving Divergent Conditional Branches

Number: US20120204014A1

Embodiments of the present invention provide systems, methods, and computer program products for improving divergent conditional branches in code being executed by a processor. For example, in an embodiment, a method comprises detecting a conditional statement of a program being simultaneously executed by a plurality of threads, determining which threads evaluate a condition of the conditional statement as true and which threads evaluate the condition as false, pushing an identifier associated with the larger set of the threads onto a stack, executing code associated with a smaller set of the threads, and executing code associated with the larger set of the threads.

Claims (excerpt):
1. A method comprising: responsive to a first set of threads evaluating a conditional statement as true and a second set of threads evaluating the conditional statement as false, executing first code associated with the conditional statement in a smaller set of threads of the first and second sets of threads, and executing second code associated with the conditional statement in a larger set of threads of the first and second sets of threads upon the smaller set of threads finishing execution.
2. The method of claim 1, wherein the executing of the first and second codes is performed using a single instruction multiple data (SIMD) processing core.
3. The method of claim 1, further comprising: responsive to the first set of threads evaluating the conditional statement as true and the second set of threads evaluating the conditional statement as false, storing an identifier associated with the larger set of threads.
4. The method of claim 3, further comprising storing the identifier in a set of registers, wherein a number of the registers is equal to log(N), wherein N represents a sum of the first set of threads and the second set of threads.
5. The method of claim 1, wherein the identifier is a mask, the method further comprising: responsive to the first set of threads ...
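A compact C++ sketch of the branching policy: thread sets are bitmasks, the identifier of the larger set is pushed onto a stack, the smaller set's code runs first, and the larger set resumes when it is popped. Masks, counts, and the stack model are illustrative assumptions.

```cpp
// Sketch: divergent branch handling with a mask stack, smaller set first.
#include <bitset>
#include <cstdint>
#include <iostream>
#include <stack>

int main() {
    const uint32_t active    = 0xFF;                 // 8 active threads
    const uint32_t takenIf   = 0x07;                 // 3 threads evaluate true
    const uint32_t takenElse = active & ~takenIf;    // 5 threads evaluate false

    const bool ifIsSmaller = std::bitset<32>(takenIf).count() <
                             std::bitset<32>(takenElse).count();
    const uint32_t smaller = ifIsSmaller ? takenIf : takenElse;
    const uint32_t larger  = ifIsSmaller ? takenElse : takenIf;

    std::stack<uint32_t> divergenceStack;
    divergenceStack.push(larger);                    // identifier of the larger set
    std::cout << "execute smaller set, mask 0x" << std::hex << smaller << '\n';

    const uint32_t resume = divergenceStack.top();   // smaller set finished
    divergenceStack.pop();
    std::cout << "execute larger set, mask 0x" << resume << '\n';
}
```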

Publication date: 09-05-2013

Method and System for Workitem Synchronization

Number: US20130117750A1
Assignee: Advanced Micro Devices, Inc.

Method, system, and computer program product embodiments for synchronizing workitems on one or more processors are disclosed. The embodiments include executing a barrier skip instruction by a first workitem from the group, and responsive to the executed barrier skip instruction, reconfiguring a barrier to synchronize other workitems from the group in a plurality of points in a sequence without requiring the first workitem to reach the barrier in any of the plurality of points.

Claims (excerpt):
1. A method of synchronizing a group of workitems on one or more processors, comprising: executing a barrier skip instruction by a first workitem from the group; and responsive to the executed barrier skip instruction, reconfiguring a barrier to synchronize other workitems from the group in a plurality of points in a sequence without requiring the first workitem to reach the barrier in any of the plurality of points.
2. The method of claim 1, further comprising: configuring the barrier to synchronize the group at a sequence of synchronization points, wherein the sequence includes the plurality of points.
3. The method of claim 1, wherein reconfiguring the barrier to synchronize other workitems includes: incrementing a skip count associated with the barrier, wherein the skip count is used in determining whether all workitems from the group have reached the barrier.
4. The method of claim 1, further comprising: synchronizing the other workitems at first and second points of the plurality of points, wherein the first workitem did not reach the barrier at the first and second points.
5. The method of claim 4, wherein synchronizing the other workitems comprises: for each of the other workitems that reach the barrier, determining if it is a last one of the other workitems; and when the last of the other workitems reaches the barrier, unblock all the other workitems to resume processing.
6. The method of claim 5, wherein the determining if it is the last one of the other workitems comprises: comparing a sum ...
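A toy C++ model of a skippable barrier at a single synchronization point: a skip count reduces the number of arrivals the barrier waits for, so the skipping workitem never has to reach it (mirroring claim 3). The mutex/condition-variable implementation and names are assumptions.

```cpp
// Sketch: barrier whose skip count lowers the arrival total it waits for.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct SkipBarrier {
    std::mutex m;
    std::condition_variable cv;
    int expected, arrived = 0, skipped = 0;
    explicit SkipBarrier(int n) : expected(n) {}

    void skip() {                       // the barrier-skip instruction
        std::lock_guard<std::mutex> lk(m);
        ++skipped;                      // this workitem never needs to arrive
        cv.notify_all();
    }
    void arrive_and_wait() {
        std::unique_lock<std::mutex> lk(m);
        ++arrived;
        cv.notify_all();
        cv.wait(lk, [&] { return arrived + skipped >= expected; });
    }
};

int main() {
    SkipBarrier barrier(3);
    std::thread skipper([&] { barrier.skip(); });        // first workitem skips
    std::thread w1([&] { barrier.arrive_and_wait(); });  // others still sync
    std::thread w2([&] { barrier.arrive_and_wait(); });
    skipper.join(); w1.join(); w2.join();
    std::cout << "group synchronized without the skipping workitem\n";
}
```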

Publication date: 30-05-2013

Saving and Restoring Non-Shader State Using a Command Processor

Number: US20130135327A1
Assignee: Advanced Micro Devices, Inc.

Provided is a system including a command processor configured for interrupting processing of a first set of instructions executing within a shader core.

Claims (excerpt):
1. A system, comprising: a command processor; wherein the command processor is configured for interrupting processing of a first set of instructions executing within a shader core.
2. The system of claim 1, wherein the shader core is configured to (i) save a context state associated with the first set of instructions after the interrupting and (ii) process a second set of instructions after the context state has been saved.
3. The system of claim 2, wherein the shader core is further configured to (iii) restore the context state after processing of the second set of instructions concludes and (iv) resume processing of the first set of instructions after the context state has been restored.
4. The system of claim 3, wherein the context state includes data associated with at least one from the group including active wavefronts, a register, and a local data store.
5. The system of claim 4, further comprising an arbitration mechanism; wherein the command processor instructs the arbitration mechanism to manage access of all wavefronts submitted to the shader core for processing.
6. The system of claim 5, wherein the arbitration mechanism includes a dispatch controller and a resource arbiter.
7. The system of claim 1, wherein the interrupting includes performing a trap routine.
8. The system of claim 1, wherein the interrupting is based on a time quanta.
9. A system, comprising: a command processor; wherein upon receipt of an interrupt command, the command processor is configured to (i) interrupt processing of a first set of instructions processed within a shader core and (ii) save a context state associated with the first set of instructions after the interrupting.
10. The system of claim 9, wherein the command processor is further configured for (iii) fetching a second set of instructions after the ...

Publication date: 06-06-2013

Method and Apparatus for Servicing Page Fault Exceptions

Number: US20130141446A1
Assignee: Advanced Micro Devices, Inc.

A method, apparatus and computer readable media for servicing page fault exceptions in an accelerated processing device (APD). A page fault related to a wavefront is detected. A fault handling request to a translation mechanism is sent when the page fault is detected. A fault handling response corresponding to the detected page fault from the translation mechanism is received. Confirmation that the detected page fault has been handled through performing page mapping based on the fault handling response is received.

Claims (excerpt):
1. A method for servicing page faults within an accelerated processing device (APD), comprising: sending a fault handling request to a translation mechanism when a page fault is detected; receiving a fault handling response corresponding to the fault handling request; and receiving confirmation that the detected page fault has been resolved through performing page mapping based on the fault handling response.
2. The method of claim 1, wherein the sending the fault handling request comprises: receiving, at an input output memory management unit (IOMMU) driver of the APD, the fault handling request; sending, using the IOMMU driver, the fault handling request to an operating system (OS); receiving, at the IOMMU driver, a fault handling completion signal from the OS; and transmitting, to an IOMMU, the fault handling completion signal.
3. The method of claim 2, further comprising receiving at the IOMMU driver the fault handling completion signal from a kernel mode driver (KMD).
4. The method of claim 1, further comprising periodically retrying the faulted wavefronts to determine if outstanding faults have been satisfied.
5. The method of claim 1, wherein the fault includes at least one of a page fault, a translation lookaside buffer (TLB) and a memory exception.
6. The method of claim 1, further comprising using a memory controller, an IOMMU or a memory management unit (MMU) as the translation mechanism.
7. The method of claim 1, wherein the receiving a ...

Publication date: 06-06-2013

Method and Apparatus for Accommodating Multiple, Concurrent Work Inputs

Number: US20130141447A1
Assignee: Advanced Micro Devices, Inc.

A method of accommodating more than one compute input is provided. The method creates an APD arbitration policy that dynamically assigns compute instructions from a sequence of instructions awaiting processing to the APD compute units for execution of a run list.

Claims (excerpt):
1. A method of arbitrating in an accelerated processing device (APD) including first and second APD compute units, the method comprising: assigning a first compute instruction from a sequence of instructions awaiting processing to SIMDs within the APD first compute unit; assigning a second compute instruction from the sequence of instructions to SIMDs within the APD second compute unit; and switching from processing the first and second compute instructions after a time quantum to dynamically assign the next instruction in the sequence.
2. The method of claim 1, wherein the time quantum is based on a scheduler policy.
3. The method of claim 2, wherein the scheduler policy includes a round robin methodology.
4. The method of claim 1, wherein the sequence of instructions are associated with an active group.
5. The method of claim 4, wherein the active group is gang scheduled.
6. The method of claim 5, wherein switching from processing the first and second compute instructions comprises rotating a run list through the gang scheduled active group.
7. The method of claim 4, wherein the active group is associated with an active group list.
8. The method of claim 1, wherein the first and second compute instructions are associated with an active list.
9. The method of claim 1, wherein the first and second compute units are configured to execute a run list.
10. The method of claim 1, wherein the first and second APD units are representative of a plurality of SIMDs.
11. The method of claim 1, wherein the SIMDs are configured to process a respective portion of the first compute instruction.
12. The method of claim 1, wherein the SIMDs are configured to process a respective portion of the second compute instruction.
13. A ...

Publication date: 06-06-2013

Handling Virtual-to-Physical Address Translation Failures

Number: US20130145202A1
Assignee: Advanced Micro Devices, Inc.

A method tolerates virtual to physical address translation failures. A translation request is sent from a graphics processing device to a translation mechanism. The translation request is associated with a first wavefront. A fault notification is received within an accelerated processing device (APD) from the translation mechanism that a request cannot be acknowledged. The first wavefront is stored within a shader core of the APD if the fault notification is received. The first wavefront is replaced with a second wavefront if the fault notification is received, the second wavefront being ready to be executed.

Claims (excerpt):
1. A method comprising: responsive to receiving a fault notification that a translation request cannot be acknowledged, storing a first wavefront of the APD when the fault notification is received; and replacing the first wavefront with a second wavefront when the fault notification is received.
2. The method of claim 1, wherein the translation mechanism is an input/output memory management unit (IOMMU).
3. The method of claim 1, wherein the second wavefront is ready for execution.
4. The method of claim 1, further comprising periodically retrying the stored first wavefront as a new request.
5. The method of claim 1, further comprising tracking, with the APD, a number of wavefronts that receive not acknowledgements.
6. The method of claim 5, further comprising initiating a context switching request if the number of not acknowledgements exceeds a threshold value.
7. The method of claim 1, further comprising storing, in a single instruction multiple data (SIMD) of the SC, a plurality of wavefronts.
8. A computer readable medium having stored thereon computer executable instructions that, if executed by a computing device, cause the computing device to perform a method comprising: sending, from an accelerated processing device (APD) to a translation mechanism, a translation request that is associated with a first ...

Publication date: 13-06-2013

Partitioning Resources of a Processor

Number: US20130147816A1

Embodiments described herein provide an apparatus, a computer readable medium and a method for simultaneously processing tasks within an APD. The method includes processing a first task within an APD. The method also includes reducing utilization of the APD by the first task to facilitate simultaneous processing of a second task, such that the utilization remains below a threshold.

Claims (excerpt):
1. A method comprising: processing a first task received within an accelerated processing device (APD); and reducing utilization of the APD by the first task to facilitate simultaneous processing of a second task, such that the utilization remains below a threshold.
2. The method of claim 1, wherein the first and second tasks are received from a central processing unit (CPU).
3. The method of claim 1, wherein the first and second tasks include a plurality of wavefronts.
4. The method of claim 1, wherein the threshold is determined in accordance with (i) a number of single command memory devices in use, (ii) a number of local data shares available, or (iii) a predetermined allotment of APD resources.
5. The method of claim 1, wherein the reducing the utilization of the APD is performed by removing wavefronts associated with the first task and temporarily context switching the wavefronts to a memory.
6. The method of claim 1, wherein the reducing the utilization of the APD is performed by ceasing processing of portions of the first task.
7. The method of claim 6, wherein the portions include one or more wavefronts from a plurality of wavefronts of the first task.
8. The method of claim 6, wherein the ceasing processing includes removing one or more last sent wavefronts from a plurality of wavefronts of the first task.
9. The method of claim 8, further comprising temporarily context switching the plurality of wavefronts to a memory.
10. The method of claim 8, further comprising storing the plurality of wavefronts in the memory last one in last one out.
11. The method of claim ...

Publication date: 20-06-2013

SYSCALL MECHANISM FOR PROCESSOR TO PROCESSOR CALLS

Number: US20130155074A1
Assignee: Advanced Micro Devices, Inc.

Provided is a method for processing system calls from a GPU to a CPU. The method includes a GPU storing a plurality of tasks in a memory, with each task representing a function to be performed on the CPU. The method also includes generating a CPU interrupt, and processing of the stored plurality of tasks by the CPU.

Claims (excerpt):
1. A method for execution in a system, comprising: storing a plurality of tasks output from an APD in memory, each task representing a function to be performed on a CPU; and generating a CPU interrupt.
2. The method of claim 1, further comprising: processing, by the CPU, the stored plurality of tasks.
3. The method of claim 1, wherein the tasks are stored in a plurality of mailboxes.
4. The method of claim 3, further comprising: a work item scanning mailboxes for an empty mailbox.
5. The method of claim 1, further comprising: executing an instruction for a wave front to go to sleep.
6. A method for execution in a computer system including a memory, the method comprising: receiving, in a CPU, an interrupt from an APD; and processing by the CPU a plurality of tasks stored in the memory, each task representing a function to be performed on the CPU.
7. The method of claim 6, further comprising: writing the results of the processing by the CPU in the memory.
8. A system, comprising: a memory configured to store a plurality of tasks, each task representing a function to be performed on a CPU; and an APD configured to generate a CPU interrupt.
9. A system, comprising: a memory configured to store a plurality of tasks, each task representing a function to be performed on a CPU; wherein the CPU is (i) configured to receive an interrupt and (ii) process the stored plurality of tasks from the memory.
10. The system of claim 9, wherein the memory is further configured to store the plurality of tasks in a plurality of mailboxes.
11. An article of manufacture including a computer-readable medium having instructions stored thereon that, when executed by a ...

Publication date: 20-06-2013

Policies for Shader Resource Allocation in a Shader Core

Number: US20130155077A1
Assignee: Advanced Micro Devices Inc

A method of determining priority within an accelerated processing device is provided. The accelerated processing device includes compute pipeline queues that are processed in accordance with predetermined criteria. The queues are selected based on priority characteristics and the selected queue is processed until a time quantum lapses or a queue having a higher priority becomes available for processing.

Publication date: 20-06-2013

Saving and Restoring Shader Context State

Number: US20130155079A1

Provided is a method for processing a command in a computing system including an accelerated processing device (APD) having a command processor. The method includes executing an interrupt routine to save one or more contexts related to a first set of instructions on a shader core in response to an instruction to preempt processing of the first set of instructions.

Claims (excerpt):
1. A method for processing a command in a computing system including an accelerated processing device (APD) having a command processor, the method comprising: responsive to an instruction to preempt processing of a first set of instructions, executing an interrupt routine to save one or more contexts related to the first set of instructions on a shader core.
2. The method of claim 1, wherein the interrupt routine is a trap routine.
3. The method of claim 1, wherein the one or more contexts include context of wavefronts implementing one or more of the first set of instructions.
4. The method of claim 3, wherein the one or more contexts include contexts of respective work-items of the wavefronts.
5. The method of claim 1, wherein the one or more contexts include contents of at least one of general purpose registers and local memory.
6. The method of claim 1, further comprising: processing a second set of instructions upon completion of the preemption of the first set of instructions; and resuming processing of the first set of instructions upon completion of the processing of the second set of instructions.
7. The method of claim 6, further comprising resuming processing of the first set of instructions from a point of preemption.
8. The method of claim 1, further comprising restoring the one or more contexts related to the first set of instructions.
9. The method of claim 1, wherein the instruction to preempt processing of the first instruction is transmitted via the command processor.
10. The method of claim 9, further comprising transmitting the instruction to preempt processing of the first instruction to ...

Publication date: 20-06-2013

Software Mechanisms for Managing Task Scheduling on an Accelerated Processing Device (APD)

Number: US20130160017A1

Embodiments described herein provide a method for managing task scheduling on an accelerated processing device. The method includes executing a first task within the accelerated processing device (APD), monitoring for an interruption of the execution of the first task, and switching to a second task when an interruption is detected.

Claims (excerpt):
1. A method comprising: executing a first task within an accelerated processing device; monitoring for an interruption of the executing of the first task; and switching to a second task if an interruption is detected.
2. The method of claim 1, wherein the first task is within a run list comprising a plurality of tasks.
3. The method of claim 2, wherein the run list is a smaller subset of an active list of tasks.
4. The method of claim 1, wherein the interruption is caused by a completing task, a stalling task or a faulting task.
5. The method of claim 1, further comprising: sending a notification to a scheduler that the accelerated processing device has switched the interrupted first task with the second task; and periodically verifying, using the scheduler, that the second task has a relatively next highest priority value.
6. The method of claim 5, further comprising: sending, using the scheduler, a plurality of instructional messages that are used to update the accelerated processing device.
7. The method of claim 6, wherein the instructional messages comprise adding a new task to a run list, removing a task from the run list, or installing a new run list.
8. The method of claim 5, further comprising: transmitting status messages from the accelerated processing device to the scheduler, wherein the statuses comprise running, stop running, running the next process, not running due to fault or not running due to stall.
9. A computer readable medium having stored thereon computer executable instructions that, if executed by a computing device, cause the computing device to perform a method, ...

Publication date: 20-06-2013

Method for Resuming an APD Wavefront in Which a Subset of Elements Have Faulted

Number: US20130160019A1
Assignee: Advanced Micro Devices Inc

A method resumes an accelerated processing device (APD) wavefront in which a subset of elements have faulted. A restore command for a job including a wavefront is received. A list of context states for the wavefront is read from a memory associated with an APD. An empty shell wavefront is created for restoring the list of context states. A portion of not acknowledged data is masked over a portion of acknowledged data within the restored wavefronts.

Publication date: 25-07-2013

Multithreaded Computing

Number: US20130191852A1
Assignee: Advanced Micro Devices, Inc.

A system, method, and computer program product are provided for improving resource utilization of multithreaded applications. Rather than requiring threads to block while waiting for data from a channel or requiring context switching to minimize blocking, the techniques disclosed herein provide an event-driven approach to launch kernels only when needed to perform operations on channel data, and then terminate in order to free resources. These operations are handled efficiently in hardware, but are flexible enough to be implemented in all manner of programming models.

Claims (excerpt):
1. A method, comprising: defining a channel; defining a consumer kernel configured to read data from the channel; and registering a channel event configured to launch the consumer kernel when a condition of the channel is satisfied.
2. The method of claim 1, wherein the condition of the channel is satisfied when at least one block of data is in the channel.
3. The method of claim 1, wherein the condition of the channel is satisfied when the channel is full.
4. The method of claim 1, further comprising: allocating the channel at runtime to a memory unit of a processing unit.
5. The method of claim 1, further comprising: executing a hardware scheduling system configured to watch the channel event and trigger the launch of the consumer kernel.
6. The method of claim 1, further comprising: placing the consumer kernel in a command queue when the condition of the channel is satisfied, wherein the command queue is configured to handle the launch of the consumer kernel.
7. The method of claim 1, further comprising: launching the consumer kernel; reading the data from the channel at the consumer kernel; consuming the data; and terminating the consumer kernel.
8. The method of claim 1, further comprising: defining a producer kernel configured to write data to the channel.
9. A computer-readable storage device having instructions stored thereon, execution of which, by a computing device, causes ...
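A minimal C++ sketch of the channel-event pattern, with a thread launch standing in for a kernel launch: the consumer is started only when the channel condition (at least one block of data) holds, drains the channel, and terminates. All names are illustrative assumptions.

```cpp
// Sketch: event-driven consumer launched only when the channel has data.
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex m;
std::queue<int> channel;

void consumer_kernel() {               // launched on the channel event
    std::lock_guard<std::mutex> lk(m);
    while (!channel.empty()) {
        std::cout << "consumed " << channel.front() << '\n';
        channel.pop();
    }
}                                      // terminates, freeing resources

int main() {
    {
        std::lock_guard<std::mutex> lk(m);
        channel.push(42);              // producer writes a block
    }
    bool condition;                    // "channel event": data available?
    {
        std::lock_guard<std::mutex> lk(m);
        condition = !channel.empty();
    }
    if (condition) {                   // launch only when the event fires
        std::thread kernel(consumer_kernel);
        kernel.join();
    }
}
```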

Publication date: 03-10-2013

Hardware Managed Allocation and Deallocation Evaluation Circuit

Number: US20130262812A1

A system and method is provided for improving efficiency, power, and bandwidth consumption in parallel processing. Rather than using memory polling to ensure that enough space is available in memory locations for, for example, write instructions, the techniques disclosed herein provide a system and method to automate this evaluation mechanism in environments such as data-parallel processing to efficiently check available space in memory locations before instructions such as write threads are allowed. These operations are handled efficiently in hardware, but are flexible enough to be implemented in all manner of programming models.

Claims (excerpt):
1. An apparatus, comprising: an evaluation circuit configured to: receive a request to access a memory; access a register to determine an amount of available space in the memory associated with the request; and when the determined amount of available space accommodates an amount of data associated with the request, update the amount of available space stored in the register based on the amount of data.
2. The apparatus of claim 1, wherein: the request is an allocation request; and if the amount of data is less or equal to the amount of available space, the evaluation circuit is further configured to: decrement the amount of available space stored in the register by the amount of data; and send a confirmation to a crawler to allocate the memory based on the request.
3. The apparatus of claim 2, wherein the evaluation circuit is further configured to: send a notification to the crawler to stall the request if the amount of data is more than the amount of available space stored in the register.
4. The apparatus of claim 1, wherein the request is a deallocation request and the evaluation circuit is further configured to: increment the amount of available space stored in the register by the amount of data; and send a confirmation to a crawler to deallocate the memory based on the request.
5. The apparatus of claim 1, further comprising ...
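A small C++ sketch of the evaluation circuit's bookkeeping: an atomic counter plays the register holding available space, allocation requests either claim space or report a stall, and deallocations return space. The compare-and-swap model and all names are assumptions.

```cpp
// Sketch: space accounting for allocation/deallocation requests.
#include <atomic>
#include <cstddef>
#include <iostream>

std::atomic<std::size_t> free_space{64};   // the "register"

bool try_allocate(std::size_t amount) {    // allocation request
    std::size_t cur = free_space.load();
    while (amount <= cur) {                // enough room: claim it atomically
        if (free_space.compare_exchange_weak(cur, cur - amount))
            return true;                   // confirmation to proceed
    }
    return false;                          // notification to stall the request
}

void deallocate(std::size_t amount) {      // deallocation request
    free_space.fetch_add(amount);          // increment available space
}

int main() {
    std::cout << try_allocate(48) << ' ' << try_allocate(48) << '\n';  // 1 0
    deallocate(48);
    std::cout << try_allocate(48) << '\n';                             // 1
}
```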

Publication date: 03-10-2013

Hardware Managed Ordered Circuit

Number: US20130262834A1

A system and method is provided for improving efficiency, power, and bandwidth consumption in parallel processing. Rather than requiring memory polling to ensure ordered execution of processes or threads, the techniques disclosed herein provide a system and method to allow any process or thread to run out of order as long as needed, but ensure ordered execution of multiple ordered instructions when needed. These operations are handled efficiently in hardware, but are flexible enough to be implemented in all manner of programming models.

Claims (excerpt):
1. An apparatus, comprising: a scoreboard structure configured to store information associated with a plurality of wavefronts; and a controller comprising a plurality of counters and configured to control an order of operations, such that a next one of the plurality of wavefronts to be processed is determined based on the stored information and an ordering scheme.
2. The apparatus of claim 1, wherein: the plurality of wavefronts include a plurality of ordered instructions; and a respective one of the counters is configured to track a corresponding one of the plurality of ordered instructions.
3. The apparatus of claim 2, wherein the controller further comprises a second set of plurality of up/down counters and a respective one of the up/down counters is associated with a corresponding one of the counters.
4. The apparatus of claim 3, wherein the controller is further configured to: identify a highest or next highest priority wavefront of the plurality of wavefronts according to the ordering scheme; identify a highest priority instruction of the highest or next highest priority wavefront and process the highest priority instruction; and increment a value of one of the counters associated with the highest priority instruction.
5. The apparatus of claim 4, wherein: if additional ordered instructions for the highest or next highest priority wavefront are expected, the controller is further configured to increment values of an associated one of ...

Publication date: 05-12-2013

Method and System for Synchronization of Workitems with Divergent Control Flow

Number: US20130326524A1
Assignee: Advanced Micro Devices, Inc.

Disclosed methods, systems, and computer program products embodiments include synchronizing a group of workitems on a processor by storing a respective program counter associated with each of the workitems, selecting at least one first workitem from the group for execution, and executing the selected at least one first workitem on the processor. The selecting is based upon the respective stored program counter associated with the at least one first workitem.

Claims (excerpt):
1. A method of synchronizing a group of workitems on a processor, each workitem being associated with a program counter, comprising: executing at least one first workitem from the group based upon a value of the stored program counter associated with the at least one first workitem.
2. The method of claim 1, further comprising: storing the respective program counter associated with each of the workitems.
3. The method of claim 2, wherein the storing comprises: determining a divergent control flow point associated with at least one of the workitems; and writing a value of a program counter of the determined divergent control flow point to a memory location.
4. The method of claim 2, wherein the storing comprises: halting execution of at least one of the workitems upon reaching a convergence point or a synchronization point; and writing a value of a program counter of the halted at least one of the workitems to a memory location.
5. The method of claim 2, wherein the storing comprises: storing the respective program counter only at one or more selected points in respective instruction streams.
6. The method of claim 5, wherein the one or more selected points include only one or more of divergent control flow points, synchronization points, and convergence points.
7. The method of claim 1, wherein the executing at least one first workitem comprises: selecting the at least one first workitem from the group based upon the value of the stored program counter associated with the at least one first ...

Publication date: 23-01-2014

METHOD FOR URGENCY-BASED PREEMPTION OF A PROCESS

Number: US20140022263A1

The desire to use an Accelerated Processing Device (APD) for general computation has increased due to the APD's exemplary performance characteristics. However, current systems incur high overhead when dispatching work to the APD because a process cannot be efficiently identified or preempted. The occupying of the APD by a rogue process for arbitrary amounts of time can prevent the effective utilization of the available system capacity and can reduce the processing progress of the system. Embodiments described herein can overcome this deficiency by enabling the system software to pre-empt a process executing on the APD for any reason. The APD provides an interface for initiating such a pre-emption. This interface exposes an urgency of the request which determines whether the process being preempted is allowed a grace period to complete its issued work before being forced off the hardware.

Claims (excerpt):
1. A method, comprising preempting a first process running on an accelerated processing device (APD) based upon urgency of a second process.
2. The method of claim 1, wherein the urgency is reflective of a priority of the second process and a threshold time quanta.
3. The method of claim 1, wherein the urgency is based on operating system considerations.
4. The method of claim 1, wherein the first process and second process are at least one of a graphics process or a compute process.
5. The method of claim 3, wherein the APD is in communication with a central processing unit.
6. The method of claim 1, wherein the preempting comprises detecting a timer expiration indicating that the running time is equal to or greater than the threshold time quanta.
7. The method of claim 6, wherein the preempting further comprises: generating an interrupt corresponding to the timer expiration; and initiating of the preemption by an operating system based on receipt of the interrupt.
8. The method of claim 6, wherein the preempting further comprises: detecting the timer expiration by a hardware-based ...

More
Publication date: 03-01-2019

Stream processor with overlapping execution

Number: US20190004807A1
Assignee: Advanced Micro Devices Inc

Systems, apparatuses, and methods for implementing a stream processor with overlapping execution are disclosed. In one embodiment, a system includes at least a parallel processing unit with a plurality of execution pipelines. The processing throughput of the parallel processing unit is increased by overlapping execution of multi-pass instructions with single pass instructions without increasing the instruction issue rate. A first plurality of operands of a first vector instruction are read from a shared vector register file in a single clock cycle and stored in temporary storage. The first plurality of operands are accessed and utilized to initiate multiple instructions on individual vector elements on a first execution pipeline in subsequent clock cycles. A second plurality of operands are read from the shared vector register file during the subsequent clock cycles to initiate execution of one or more second vector instructions on the second execution pipeline.

More
Publication date: 03-01-2019

STREAM PROCESSOR WITH DECOUPLED CROSSBAR FOR CROSS LANE OPERATIONS

Number: US20190004814A1
Assignee:

Systems, apparatuses, and methods for implementing a decoupled crossbar for a stream processor are disclosed. In one embodiment, a system includes at least a multi-lane execution pipeline, a vector register file, and a crossbar. The system is configured to determine if a given instruction in an instruction stream requires a permutation on data operands retrieved from the vector register file. The system conveys the data operands to the multi-lane execution pipeline on a first path which includes the crossbar responsive to determining the given instruction requires a permutation on the data operands. The crossbar then performs the necessary permutation to route the data operands to the proper processing lanes. Otherwise, the system conveys the data operands to the multi-lane execution pipeline on a second path which bypasses the crossbar responsive to determining the given instruction does not require a permutation on the input operands.

1. A system comprising: a multi-lane execution pipeline; a vector register file; and a crossbar; wherein the system is configured to: retrieve a plurality of data operands from the vector register file; convey the plurality of data operands to the multi-lane execution pipeline via the crossbar responsive to determining a permutation is required; and convey the plurality of data operands to the multi-lane execution pipeline by bypassing the crossbar responsive to determining a permutation is not required.
2. The system as recited in claim 1, wherein the crossbar comprises multiple layers, and wherein the system is further configured to: perform a first permutation of data operands across lanes of the multi-lane execution pipeline with a first layer of N×N crossbars, wherein N is a positive integer; and perform a second permutation of data operands across lanes of the multi-lane execution pipeline with a second layer of N×N crossbars.
3. The system as recited in claim 1, wherein the crossbar comprises a first N/2-by-N/2 ...
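A toy Python model of the two paths, assuming a permutation can be expressed as a lane map (illustrative names; real hardware routes wires rather than lists):

    def crossbar(operands, lane_map):
        """Route the operand from lane lane_map[i] to lane i (full permutation)."""
        return [operands[src] for src in lane_map]

    def issue(operands, lane_map=None):
        if lane_map is None:            # no permutation required: bypass path
            return operands
        return crossbar(operands, lane_map)

    ops = [10, 11, 12, 13]
    print(issue(ops))                   # bypass: [10, 11, 12, 13]
    print(issue(ops, [1, 0, 3, 2]))     # swapped pairs: [11, 10, 13, 12]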

More
Publication date: 03-01-2019

BIN STREAMOUT PREEMPTION IN A GRAPHICS PROCESSING PIPELINE

Number: US20190005604A1
Assignee:

A stage of a graphics pipeline in a graphics processing unit (GPU) detects an interrupt concurrently with the stage processing primitives in a first bin that represents a first portion of a first frame generated by a first application. The stage forwards a completed portion of the primitives to a subsequent stage of the graphics pipeline in response to the interrupt. The stage diverts a second bin that represents a second portion of the first frame from the stage to a memory in response to the interrupt. The stage processes primitives in a third bin that represents a portion of a second frame generated by a second application subsequent to diverting the second bin to the memory. The stage can then retrieve the second bin from the memory in response to the stage completing processing of the primitives in the third bin for additional processing.

1. A method comprising: detecting, at a stage of a graphics pipeline in a graphics processing unit (GPU), an interrupt concurrently with the stage processing primitives in a first bin that represents a first portion of a first frame generated by a first application; forwarding a completed portion of the primitives to a subsequent stage of the graphics pipeline in response to the interrupt; and diverting at least one second bin that represents a second portion of the first frame from the stage to a memory in response to the interrupt.
2. The method of claim 1, wherein forwarding the completed portion of the primitives comprises forwarding the primitives in the first bin in response to completing processing of the primitives in the first bin.
3. The method of claim 1, wherein forwarding the completed portion of the primitives comprises forwarding a first subset of the primitives in the first bin in response to completing processing of the first subset of the primitives.
4. The method of claim 3, further comprising: diverting a second subset of the primitives in the first bin from the stage to the memory prior to completing ...

More
Publication date: 14-01-2021

VMID AS A GPU TASK CONTAINER FOR VIRTUALIZATION

Number: US20210011760A1
Assignee:

Systems, apparatuses, and methods for abstracting tasks in virtual memory identifier (VMID) containers are disclosed. A processor coupled to a memory executes a plurality of concurrent tasks including a first task. Responsive to detecting one or more instructions of the first task which correspond to a first operation, the processor retrieves a first identifier (ID) which is used to uniquely identify the first task, wherein the first ID is transparent to the first task. Then, the processor maps the first ID to a second ID and/or a third ID. The processor completes the first operation by using the second ID and/or the third ID to identify the first task to at least a first data structure. In one implementation, the first operation is a memory access operation and the first data structure is a set of page tables. Also, in one implementation, the second ID identifies a first application of the first task and the third ID identifies a first operating system (OS) of the first task.

1. A system comprising: a memory storing program instructions of a plurality of tasks, wherein the plurality of tasks include a first task; and a processor coupled to the memory, wherein the processor is configured to: execute the first task and one or more other tasks concurrently; and, responsive to detecting one or more instructions of the first task which correspond to a first operation: receive a first identifier (ID) which uniquely identifies the first task, wherein the first ID does not identify a source hierarchy of the first task; map the first ID to a second ID which identifies the source hierarchy of the first task; and complete the first operation by performing an access to a first data structure using the second ID to identify the first task.
2. The system as recited in claim 1, wherein the processor accesses a mapping table to map the first ID to the second ID and to a third ID.
3. The system as recited in claim 2, wherein the second ID identifies a first application and the ...
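In toy form, assuming a simple dictionary stands in for the hardware mapping table and the page tables (all names here are invented for illustration), the container ID is resolved before any per-task structure is touched:

    MAPPING_TABLE = {
        # container_id: (application_id, os_id)
        7: (3, 1),
    }

    PAGE_TABLES = {  # keyed by (application_id, os_id) in this toy model
        (3, 1): {0x1000: 0x8000_1000},
    }

    def translate(container_id, virt_addr):
        app_id, os_id = MAPPING_TABLE[container_id]      # first ID -> second/third
        return PAGE_TABLES[(app_id, os_id)][virt_addr]   # access keyed by mapped IDs

    print(hex(translate(7, 0x1000)))  # -> 0x80001000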

More
Publication date: 17-02-2022

CREATING INTERCONNECTS BETWEEN DIES USING A CROSS-OVER DIE AND THROUGH-DIE VIAS

Number: US20220051985A1
Assignee:

A semiconductor package includes a first die, a second die, and an interconnect die coupled to a first plurality of through-die vias in the first die and a second plurality of through-die vias in the second die. The interconnect die provides communication pathways between the first die and the second die.

1. A semiconductor package comprising: a first die; a second die; and an interconnect die coupled to a first plurality of through-die vias in the first die and a second plurality of through-die vias in the second die.
2. The semiconductor package of claim 1, wherein the first die includes a first die pad region on a first surface of a first substrate, the first plurality of through-die vias connecting the first die pad region to a second surface of the first substrate; and wherein the second die includes a second die pad region on a first surface of a second substrate, the second plurality of through-die vias connecting the second die pad region to a second surface of the second substrate.
3. The semiconductor package of claim 2, wherein a first plurality of die pads of the interconnect die is connected to the first plurality of through-die vias and a second plurality of die pads of the interconnect die is connected to the second plurality of through-die vias.
4. The semiconductor package of claim 1, wherein the interconnect die is hybrid bonded to the first die and the second die.
5. The semiconductor package of claim 1, wherein the first die, the second die, and the interconnect die are system-on-a-chip dies.
6. The semiconductor package of claim 1, wherein the interconnect die includes fabricated redistribution layer structures that implement communication pathways between the first die and the second die.
7. The semiconductor package of claim 1, wherein a third die is coupled to the first die using a third plurality of through-die vias in the first die; and wherein a fourth die is coupled to the second die using a fourth plurality of through- ...

More
Publication date: 06-02-2020

VMID AS A GPU TASK CONTAINER FOR VIRTUALIZATION

Number: US20200042348A1
Assignee:

Systems, apparatuses, and methods for abstracting tasks in virtual memory identifier (VMID) containers are disclosed. A processor coupled to a memory executes a plurality of concurrent tasks including a first task. Responsive to detecting one or more instructions of the first task which correspond to a first operation, the processor retrieves a first identifier (ID) which is used to uniquely identify the first task, wherein the first ID is transparent to the first task. Then, the processor maps the first ID to a second ID and/or a third ID. The processor completes the first operation by using the second ID and/or the third ID to identify the first task to at least a first data structure. In one implementation, the first operation is a memory access operation and the first data structure is a set of page tables. Also, in one implementation, the second ID identifies a first application of the first task and the third ID identifies a first operating system (OS) of the first task.

1. A system comprising: a memory storing program instructions of a plurality of tasks, wherein the plurality of tasks include a first task; and a processor coupled to the memory, wherein the processor is configured to: execute the first task and one or more other tasks concurrently; and, responsive to detecting one or more instructions of the first task which correspond to a first operation: receive a first identifier (ID) which uniquely identifies the first task, wherein the first ID does not identify a source hierarchy of the first task; map the first ID to a second ID which identifies the source hierarchy of the first task; and complete the first operation by performing an access to a first data structure using the second ID to identify the first task.
2. The system as recited in claim 1, wherein the processor accesses a mapping table to map the first ID to the second ID and to a third ID.
3. The system as recited in claim 2, wherein the second ID identifies a first application and the ...

More
Publication date: 18-02-2021

RECONFIGURABLE VIRTUAL GRAPHICS AND COMPUTE PROCESSOR PIPELINE

Number: US20210049729A1
Assignee:

A graphics processing unit (GPU) includes a plurality of programmable processing cores configured to process graphics primitives and corresponding data and a plurality of fixed-function hardware units. The plurality of processing cores and the plurality of fixed-function hardware units are configured to implement a configurable number of virtual pipelines to concurrently process different command flows. Each virtual pipeline includes a configurable number of fragments and an operational state of each virtual pipeline is specified by a different context. The configurable number of virtual pipelines can be modified from a first number to a second number that is different than the first number. An emulation of a fixed-function hardware unit can be instantiated on one or more of the graphics processing cores in response to detection of a bottleneck in a fixed-function hardware unit. One or more of the virtual pipelines can then be reconfigured to utilize the emulation instead of the fixed-function hardware unit.

1-20. (canceled)
21. An apparatus, comprising: a plurality of shared resources comprising: a plurality of programmable processing cores configured to process graphics primitives and corresponding data; and a plurality of fixed-function hardware units; wherein the shared resources are allocated to implement a configurable number of virtual pipelines, wherein the virtual pipelines are to concurrently execute commands that are fed to each virtual pipeline, wherein each virtual pipeline includes a configurable number of shared resources, and wherein each virtual pipeline is mapped to memory hierarchy resources of the apparatus.
22. The apparatus of claim 21, further comprising: a command processor configured to schedule and dispatch commands to a configurable number of queues, wherein the configurable number of queues are configured to store packets comprising the commands.
23. The apparatus of claim 22, further comprising: an application driver configured to ...

More
Publication date: 04-03-2021

ACCUMULATORS FOR BEHAVIORAL CHARACTERISTICS OF WAVES

Number: US20210064366A1
Assignee:

An apparatus such as a graphics processing unit (GPU) includes a plurality of processing elements configured to concurrently execute a plurality of first waves and accumulators associated with the plurality of processing elements. The accumulators are configured to store accumulated values representative of behavioral characteristics of the plurality of first waves that are concurrently executing on the plurality of processing elements. The apparatus also includes a dispatcher configured to dispatch second waves to the plurality of processing elements based on comparisons of values representative of behavioral characteristics of the second waves and the accumulated values stored in the accumulators. In some cases, the behavioral characteristics of the plurality of first waves comprise at least one of fetch bandwidths, usage of an arithmetic logic unit (ALU), and number of export operations.

1. An apparatus comprising: a plurality of processing elements configured to concurrently execute a plurality of first waves; accumulators associated with the plurality of processing elements, wherein the accumulators are configured to store accumulated values representative of behavioral characteristics of the plurality of first waves that are concurrently executing on the plurality of processing elements; and a dispatcher configured to dispatch second waves to the plurality of processing elements based on comparisons of values representative of behavioral characteristics of the second waves and the accumulated values stored in the accumulators.
2. The apparatus of claim 1, wherein the behavioral characteristics of the plurality of first waves comprise at least one of fetch bandwidths, usage of an arithmetic logic unit (ALU), and number of export operations.
3. The apparatus of claim 1, wherein the accumulators have corresponding maximum values, and wherein the dispatcher is configured to determine available portions of the accumulators that are equal to ...
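A hedged sketch of the accumulator-gated dispatch; the ceiling, the loads, and the class names are invented for illustration:

    ACC_MAX = 100  # assumed accumulator ceiling per processing element

    class ProcessingElement:
        def __init__(self):
            self.acc = 0                      # accumulated load of resident waves
        def try_dispatch(self, wave_load):
            if self.acc + wave_load > ACC_MAX:
                return False                  # wave would oversubscribe this PE
            self.acc += wave_load             # accumulate on dispatch
            return True
        def retire(self, wave_load):
            self.acc -= wave_load             # deduct when the wave completes

    pes = [ProcessingElement() for _ in range(4)]
    for load in (60, 60, 30, 30):
        target = next((pe for pe in pes if pe.try_dispatch(load)), None)
        print("dispatched" if target else "deferred", load)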

More
Publication date: 29-05-2014

Prefetch Kernels on Data-Parallel Processors

Number: US20140149677A1
Assignee: Advanced Micro Devices, Inc.

Embodiments include methods, systems and computer readable media configured to execute a first kernel (e.g., a compute or graphics kernel) with reduced intermediate state storage resource requirements. These include executing a first and second (e.g., prefetch) kernel on a data-parallel processor, such that the second kernel begins executing before the first kernel. The second kernel performs memory operations that are based upon at least a subset of memory operations in the first kernel.

1. A method, comprising: executing a second kernel on a data-parallel processor; and executing a first kernel on the data-parallel processor, such that the second kernel begins executing before the first kernel, wherein the second kernel performs memory operations that are based upon at least a subset of memory operations in the first kernel.
2. The method of claim 1, further comprising: using the second kernel to access memory locations in the subset before instructions for accessing the memory locations are executed by the first kernel.
3. The method of claim 2, wherein the accessing comprises at least one of: fetching data from off-chip memory to on-chip memory, wherein the fetched data is used by the first kernel; updating one or more page address mappings in a translation lookaside buffer, wherein the page translations are used by the first kernel; and fetching one or more memory pages from backing storage, wherein the fetched memory pages are used by the first kernel.
4. The method of claim 1, wherein the second kernel has an intermediate state that is substantially a minimum intermediate state allowing for determining of memory addresses associated with the first set of memory operations.
5. The method of claim 1, wherein the executing the second kernel comprises: accessing one or more data-dependent memory addresses.
6. The method of claim 1, wherein the executing the second kernel comprises: performing a single memory access for a respective memory page, wherein the respective memory ...
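The pattern can be sketched in a few lines of Python; the toy cache, the addresses helper, and the kernel names are assumptions, not the patent's code. The prefetch kernel shares only the address computation with the main kernel, so its state stays minimal:

    cache = set()

    def addresses(tid):
        return [tid * 64]          # address computation shared by both kernels

    def prefetch_kernel(tid):
        for addr in addresses(tid):
            cache.add(addr)        # touch only; no compute, minimal state

    def main_kernel(tid, data):
        total = 0
        for addr in addresses(tid):
            assert addr in cache   # hit: the prefetch kernel ran first
            total += data.get(addr, 0)
        return total

    data = {0: 5, 64: 7}
    for tid in range(2):           # prefetch kernel launched ahead of main kernel
        prefetch_kernel(tid)
    print(sum(main_kernel(tid, data) for tid in range(2)))  # -> 12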

More
Publication date: 05-06-2014

Optimized Context Switching for Long-Running Processes

Number: US20140157287A1
Assignee: Advanced Micro Devices, Inc

Methods, systems, and computer readable storage media embodiments allow for low overhead context switching of threads. In embodiments, applications, such as, but not limited to, iterative data-parallel applications, substantially reduce the overhead of context switching by adding user- or higher-level-program configurability of the state to be saved upon preempting an executing thread. These methods, systems, and computer readable storage media include aspects of running a group of threads on a processor, saving state information by respective threads in the group in response to a signal from a scheduler, and pre-empting running of the group after the saving of the state information.

1. A method, comprising: running a group of threads on a processor; saving state information by respective threads in the group in response to a signal from a scheduler; and pre-empting the running of the group after the saving.
2. The method of claim 1, wherein the saving the state information comprises: selectively saving elements from a context of the respective threads.
3. The method of claim 1, wherein the saving the state information comprises: detecting the signal from the scheduler; and calling a user-specified code block by each of the respective threads in response to the detected signal, wherein the code block is configured to save the state information.
4. The method of claim 3, wherein the saving the state information further comprises: determining a point at which to yield the running by the respective threads; and calling the code block at the determined point.
5. The method of claim 4, wherein the determining the point comprises: determining the point in order to reduce an amount of the state information to be saved.
6. The method of claim 5, wherein the determining the point in order to reduce an amount of the state information to be saved is performed by a compiler.
7. The method of claim 5, wherein the determining the point in order to reduce an amount of the state information ...

More
Publication date: 22-03-2018

PRIMITIVE SHADER

Number: US20180082399A1
Assignee:

Improvements in the graphics processing pipeline are disclosed. More specifically, a new primitive shader stage performs tasks of the vertex shader stage or a domain shader stage if tessellation is enabled, a geometry shader if enabled, and a fixed function primitive assembler. The primitive shader stage is compiled by a driver from user-provided vertex or domain shader code, geometry shader code, and from code that performs functions of the primitive assembler. Moving tasks of the fixed function primitive assembler to a primitive shader that executes in programmable hardware provides many benefits, such as removal of a fixed function crossbar, removal of dedicated parameter and position buffers that are unusable in general compute mode, and other benefits.

1. A method for performing three-dimensional graphics rendering, the method comprising: performing per-vertex operations on a set of vertices with a primitive shader program executing in parallel processing units; performing culling operations on a set of primitives associated with the set of vertices, to generate a set of culled primitives, the culling operations being performed with the primitive shader; identifying one or more screen subdivisions for the set of culled primitives, with the primitive shader; and transmitting the set of culled primitives to a set of screen-space pipelines based on the identified screen subdivisions of the set of culled primitives.
2. The method of claim 1, wherein: tessellation is enabled and the per-vertex operations comprise domain shader operations for evaluating barycentric coordinates produced by a tessellator stage of a graphics processing pipeline.
3. The method of claim 1, wherein: tessellation is disabled and the per-vertex operations comprise vertex shader operations for transforming vertex positions for a vertex shader stage of a graphics processing pipeline.
4. The method of claim 1, further comprising: performing operations for determining non-position attributes for ...

More
Publication date: 31-03-2022

VERTICAL AND HORIZONTAL BROADCAST OF SHARED OPERANDS

Number: US20220100528A1
Assignee:

An array processor includes processor element arrays distributed in rows and columns. The processor element arrays perform operations on parameter values. The array processor also includes memory interfaces that broadcast sets of the parameter values to mutually exclusive subsets of the rows and columns of the processor element arrays. In some cases, the array processor includes single-instruction-multiple-data (SIMD) units including subsets of the processor element arrays in corresponding rows, workgroup processors (WGPs) including subsets of the SIMD units, and a memory fabric configured to interconnect with an external memory that stores the parameter values. The memory interfaces broadcast the parameter values to the SIMD units that include the processor element arrays in rows associated with the memory interfaces and columns of processor element arrays that are implemented across the SIMD units in the WGPs. The memory interfaces access the parameter values from the external memory via the memory fabric.

1. An apparatus comprising: processor element arrays distributed in rows and columns, wherein the processor element arrays are configured to perform operations on parameter values; and memory interfaces configured to broadcast sets of the parameter values to mutually exclusive subsets of the rows and columns of the processor element arrays.
2. The apparatus of claim 1, wherein the processor element arrays comprise vector arithmetic logic unit (ALU) processors, and wherein the memory interfaces comprise direct memory access (DMA) engines.
3. The apparatus of claim 1, wherein each of the memory interfaces broadcasts the parameter values to the processor element arrays in a corresponding one of the rows and a corresponding one of the columns.
4. The apparatus of claim 3, wherein a first memory interface of the memory interfaces broadcasts first parameter values to the processor element arrays in a first row and a first column, and wherein a second ...

More
Publication date: 31-03-2022

Dynamically adaptable arrays for vector and matrix operations

Number: US20220100813A1
Assignee: Advanced Micro Devices Inc

An array processor includes processor element arrays distributed in rows and columns. The processor element arrays perform operations on parameter values. The array processor also includes memory interfaces that are dynamically mapped to mutually exclusive subsets of the rows and columns of the processor element arrays based on dimensions of matrices that provide the parameter values to the processor element arrays. In some cases, the processor element arrays are vector arithmetic logic unit (ALU) processors and the memory interfaces are direct memory access (DMA) engines. The rows of the processor element arrays in the subsets are mutually exclusive to the rows in the other subsets and the columns of the processor element arrays in the subsets are mutually exclusive to the columns in the other subsets. The matrices can be symmetric or asymmetric, e.g., one of the matrices can be a vector having a single column.

More
Publication date: 25-03-2021

Matrix multiplication unit with flexible precision operations

Number: US20210089304A1
Assignee: Advanced Micro Devices Inc

A processing unit such as a graphics processing unit (GPU) includes a plurality of vector signal processors (VSPs) that include multiply/accumulate elements. The processing unit also includes a plurality of registers associated with the plurality of VSPs. First portions of first and second matrices are fetched into the plurality of registers prior to a first round that includes a plurality of iterations. The multiply/accumulate elements perform matrix multiplication and accumulation on different combinations of subsets of the first portions of the first and second matrices in the plurality of iterations prior to fetching second portions of the first and second matrices into the plurality of registers for a second round. The accumulated results of multiplying the first portions of the first and second matrices are written into an output buffer in response to completing the plurality of iterations.
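The round/iteration structure can be illustrated with a toy tiled matrix multiply in Python; the tile size and all names are assumptions, and Python lists stand in for the register file:

    def matmul_rounds(A, B, tile=2):
        n = len(A)
        C = [[0.0] * n for _ in range(n)]
        for k0 in range(0, n, tile):                # one "round" per K-tile
            a_regs = [row[k0:k0 + tile] for row in A]         # fetched once
            b_regs = [B[k][:] for k in range(k0, k0 + tile)]  # fetched once
            for kk in range(tile):                  # iterations reuse registers
                for i in range(n):
                    for j in range(n):
                        C[i][j] += a_regs[i][kk] * b_regs[kk][j]
        return C                                    # written out after the rounds

    A = [[1, 2, 3, 4]] * 4
    B = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    print(matmul_rounds(A, B)[0])  # -> [1.0, 2.0, 3.0, 4.0]

The point of the round structure is operand reuse: each register-resident tile is consumed by several multiply/accumulate iterations before the next fetch.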

More
Publication date: 25-03-2021

EXCEPTION HANDLER FOR SAMPLING DRAW DISPATCH IDENTIFIERS

Number: US20210090205A1
Assignee:

The address of the draw or dispatch packet responsible for creating an exception ties a shader/wavefront back to the draw command from which it originated. In various embodiments, a method of operating a graphics pipeline and exception handling includes receiving, at a command processor of a graphics processing unit (GPU), an exception signal indicating an occurrence of a pipeline exception at a shader stage of a graphics pipeline. The shader stage generates an exception signal in response to a pipeline exception and transmits the exception signal to the command processor. The command processor determines, based on the exception signal, an address of a command packet responsible for the occurrence of the pipeline exception.

1. A method, comprising: receiving, at a command processor of a graphics processing unit (GPU), an exception signal indicating an occurrence of a pipeline exception at a shader stage of a graphics pipeline; transmitting the exception signal to the command processor; and determining, based on the exception signal, an address of a command packet responsible for the occurrence of the pipeline exception.
2. The method of claim 1, wherein receiving the exception signal comprises: receiving the exception signal at an exception handler of the command processor.
3. The method of claim 1, further comprising: storing, at a ring buffer, an address associated with each draw or dispatch submitted to the graphics pipeline.
4. The method of claim 3, further comprising: processing a header of the command packet in a command stream submitted to the GPU; and advancing, for each storing of the address associated with each draw or dispatch, a write pointer of the ring buffer.
5. The method of claim 3, further comprising: advancing a read pointer of the ring buffer after wavefronts associated with each draw or dispatch complete processing through the graphics pipeline.
6. The method of claim 1, wherein the command packet comprises a draw call.
7. The method of claim 1 ...
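A minimal Python sketch of the ring-buffer behavior the claims describe; the size, the names, and the slot-based lookup are illustrative assumptions:

    class DrawRing:
        def __init__(self, size=8):
            self.addrs = [None] * size
            self.size = size
            self.wr = 0   # advanced per submitted draw/dispatch
            self.rd = 0   # advanced as wavefronts complete

        def submit(self, packet_addr):
            self.addrs[self.wr % self.size] = packet_addr
            self.wr += 1

        def retire(self):
            self.rd += 1

        def on_exception(self, slot):
            """Return the address of the in-flight packet in `slot`."""
            assert self.rd <= slot < self.wr, "slot not in flight"
            return self.addrs[slot % self.size]

    ring = DrawRing()
    for addr in (0x100, 0x140, 0x180):
        ring.submit(addr)
    ring.retire()                      # first draw finished
    print(hex(ring.on_exception(2)))   # exception from third packet -> 0x180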

More
Publication date: 25-03-2021

REDUNDANCY METHOD AND APPARATUS FOR SHADER COLUMN REPAIR

Number: US20210090208A1
Assignee: Advanced Micro Devices, Inc.

Methods and systems are described. A system includes a redundant shader pipe array that performs rendering calculations on data provided thereto and a shader pipe array that includes a plurality of shader pipes, each of which performs rendering calculations on data provided thereto. The system also includes a circuit that identifies a defective shader pipe of the plurality of shader pipes in the shader pipe array. In response to identifying the defective shader pipe, the circuit generates a signal. The system also includes a redundant shader switch. The redundant shader switch receives the generated signal, and, in response to receiving the generated signal, transfers the data for the defective shader pipe to the redundant shader pipe array.

1. A system comprising: a redundant shader pipe array configured to perform rendering calculations on data provided thereto; a shader pipe array comprising a plurality of shader pipes, each of the plurality of shader pipes being configured to perform rendering calculations on data provided thereto; a circuit configured to identify a defective shader pipe of the plurality of shader pipes in the shader pipe array, and, in response to identifying the defective shader pipe, generate a signal; and a redundant shader switch configured to: receive the generated signal, and, in response to receiving the generated signal, transfer the data for the defective shader pipe to the redundant shader pipe array.
2. The system of claim 1, wherein the redundant shader switch is further configured to transfer the data for the defective shader pipe without transferring the data for all other shader pipes in the shader pipe array that were not identified as being defective.
3. The system of claim 1, wherein the redundant shader switch is further configured to directly switch the data for the defective shader pipe via a horizontal path to the redundant shader pipe array.
4. The system of claim 1, wherein: the shader pipe array further comprises a plurality ...

More
Publication date: 01-04-2021

COLLAPSING BUBBLES IN A PROCESSING UNIT PIPELINE

Number: US20210096877A1
Assignee:

An arithmetic logic unit (ALU) pipeline of a processing unit collapses execution bubbles in response to a stall at a stage of the ALU pipeline. An execution bubble occurs at the pipeline in response to an invalid instruction being placed in the pipeline for execution. The invalid instruction thus consumes an available "slot" in the pipeline, and proceeds through the pipeline until a stall in a subsequent stage (that is, a stage after the stage executing the invalid instruction) is detected. In response to detecting the stall, the ALU continues to execute instructions that are behind the invalid instruction in the pipeline, thereby collapsing the execution bubble and conserving resources of the ALU.

1. A method comprising: identifying a first execution bubble at a first stage of an arithmetic logic unit (ALU) and a first stall condition at a second stage of the ALU; and, in response to identifying the first execution bubble and the first stall condition, collapsing the first execution bubble.
2. The method of claim 1, wherein: collapsing the first execution bubble comprises executing a first instruction at a third stage of the ALU during the first stall condition.
3. The method of claim 2, wherein: collapsing the first execution bubble comprises executing a second instruction at a fourth stage of the ALU during the first stall condition.
4. The method of claim 2, wherein: collapsing the first execution bubble comprises issuing a second instruction to the ALU during the first stall condition.
5. The method of claim 2, further comprising: stalling the third stage of the ALU in response to collapsing the first execution bubble and in response to determining the first stall condition persists at the second stage of the ALU.
6. The method of claim 1, wherein: identifying the first execution bubble comprises identifying the first execution bubble in response to identifying an invalid instruction executing at the ALU.
7. The ...
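A toy Python model of the collapse, assuming the pipeline is a list with None marking a bubble (illustrative only; real hardware advances stages in parallel):

    def step(pipeline, head_stalled):
        """pipeline[0] is the oldest stage; None is a bubble."""
        if not head_stalled:
            pipeline.pop(0)                      # oldest instruction completes
            pipeline.append(None)                # new slot at the tail
        elif None in pipeline:
            i = pipeline.index(None)             # oldest bubble
            del pipeline[i]                      # younger instructions advance
            pipeline.append(None)                # the collapse frees a tail slot
        return pipeline

    p = ["i0", None, "i1", "i2"]
    print(step(p, head_stalled=True))   # ['i0', 'i1', 'i2', None]: bubble collapsed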

More
Publication date: 26-04-2018

METHOD AND SYSTEM FOR PERFORMING LOW POWER AND LOW LATENCY MULTI-PRECISION COMPUTATION

Number: US20180113709A1
Assignee: Advanced Micro Devices, Inc.

A method and apparatus for performing a multi-precision computation in a plurality of arithmetic logic units (ALUs) includes pairing a first Single Instruction/Multiple Data (SIMD) block channel device with a second SIMD block channel device to create a first block pair having one-level staggering between the first and second channel devices. A third SIMD block channel device is paired with a fourth SIMD block channel device to create a second block pair having one-level staggering between the third and fourth channel devices. A plurality of source inputs are received at the first block pair and the second block pair. The first block pair computes a first result, and the second block pair computes a second result.

1. A method for performing a multi-precision computation in a plurality of arithmetic logic units (ALUs), comprising: pairing a first Single Instruction/Multiple Data (SIMD) block channel device with a second SIMD block channel device to create a first block pair having one-level staggering between the first and second channel devices; pairing a third SIMD block channel device with a fourth SIMD block channel device to create a second block pair having one-level staggering between the third and fourth channel devices; receiving a plurality of source inputs at the first block pair and the second block pair; computing a first result by the first block pair; and computing a second result by the second block pair.
2. The method of claim 1, further comprising clock gating one or more of the SIMD block channel devices during a period that the SIMD block channel device is not used for calculations.
3. The method of claim 1, further comprising outputting the first result at an output register of the first SIMD block channel device.
4. The method of claim 1, further comprising outputting the first result at an output register of the second SIMD block channel device.
5. The method of claim 1, further comprising outputting the second result at an output register of the third ...

More
Publication date: 26-04-2018

PIPELINE INCLUDING SEPARATE HARDWARE DATA PATHS FOR DIFFERENT INSTRUCTION TYPES

Number: US20180113714A1
Assignee:

A processing element is implemented in a stage of a pipeline and configured to execute an instruction. A first array of multiplexers is to provide information associated with the instruction to the processing element in response to the instruction being in a first set of instructions. A second array of multiplexers is to provide information associated with the instruction to the first processing element in response to the instruction being in a second set of instructions. A control unit is to gate at least one of power or a clock signal provided to the first array of multiplexers in response to the instruction being in the second set.

1. An apparatus comprising: a first processing element implemented in a first stage of a pipeline and configured to execute an instruction; a first array of multiplexers to provide information associated with the instruction to the first processing element in response to the instruction being in a first set of instructions; a second array of multiplexers to provide information associated with the instruction to the first processing element in response to the instruction being in a second set of instructions; and a control unit to gate at least one of power or a clock signal provided to the first array of multiplexers in response to the instruction being in the second set.
2. The apparatus of claim 1, wherein the second set of instructions is a subset of the first set of instructions.
3. The apparatus of claim 1, wherein: the second set of instructions includes instructions executed by the pipeline at or above a threshold frequency; and the first set of instructions includes instructions executed by the pipeline below the threshold frequency.
4. The apparatus of claim 1, wherein the first set of instructions includes at least ten times as many instructions as the second set of instructions.
5. The apparatus of claim 1, wherein the control unit determines that the received instruction is in the second set based on an opcode of the first ...

More
Publication date: 26-04-2018

RECONFIGURABLE VIRTUAL GRAPHICS AND COMPUTE PROCESSOR PIPELINE

Number: US20180114290A1
Assignee:

A graphics processing unit (GPU) includes a plurality of programmable processing cores configured to process graphics primitives and corresponding data and a plurality of fixed-function hardware units. The plurality of processing cores and the plurality of fixed-function hardware units are configured to implement a configurable number of virtual pipelines to concurrently process different command flows. Each virtual pipeline includes a configurable number of fragments and an operational state of each virtual pipeline is specified by a different context. The configurable number of virtual pipelines can be modified from a first number to a second number that is different than the first number. An emulation of a fixed-function hardware unit can be instantiated on one or more of the graphics processing cores in response to detection of a bottleneck in a fixed-function hardware unit. One or more of the virtual pipelines can then be reconfigured to utilize the emulation instead of the fixed-function hardware unit.

1. An apparatus comprising: a plurality of programmable processing cores configured to process graphics primitives and corresponding data; and a plurality of fixed-function hardware units, wherein the plurality of processing cores and the plurality of fixed-function hardware units are configured to implement a configurable number of virtual pipelines to concurrently process different command flows, wherein each virtual pipeline includes a configurable number of fragments, and wherein an operational state of each virtual pipeline is specified by a different context.
2. The apparatus of claim 1, further comprising: a configurable number of queues for storing packets that include commands for execution by corresponding virtual pipelines; and a command processor configured to schedule and dispatch commands to the configurable number of queues.
3. The apparatus of claim 2, wherein each virtual pipeline comprises: a super-pipe fragment that implements a state machine to ...

More
Publication date: 03-05-2018

SUPER SINGLE INSTRUCTION MULTIPLE DATA (SUPER-SIMD) FOR GRAPHICS PROCESSING UNIT (GPU) COMPUTING

Number: US20180121386A1
Assignee: Advanced Micro Devices, Inc.

A super single instruction, multiple data (SIMD) computing structure and a method of executing instructions in the super-SIMD are disclosed. The super-SIMD structure is capable of executing more than one instruction from a single or multiple threads and includes a plurality of vector general purpose registers (VGPRs), a first arithmetic logic unit (ALU) coupled to the plurality of VGPRs, a second ALU coupled to the plurality of VGPRs, and a destination cache (Do$) that is coupled via bypass and forwarding logic to the first ALU and the second ALU and receives the outputs of both ALUs. The Do$ holds multiple instruction results, extending the operand bypass network to save read- and write-transaction power. A compute unit (CU) and a small CU including a plurality of super-SIMDs are also disclosed.

1. A super single instruction, multiple data (SIMD) structure capable of executing more than one instruction from a single or multiple thread, comprising: a plurality of vector general purpose registers (VGPRs); a first arithmetic logic unit (ALU), the first ALU coupled to the plurality of VGPRs; a second ALU, the second ALU coupled to the plurality of VGPRs; and a destination cache (Do$) that is coupled via bypass and forwarding logic to the first ALU and the second ALU and receiving an output of the first ALU and the second ALU.
2. The super-SIMD of claim 1, wherein the first ALU is a full ALU.
3. The super-SIMD of claim 1, wherein the second ALU is a core ALU.
4. The super-SIMD of claim 3, wherein the core ALU is capable of executing certain opcodes.
5. The super-SIMD of claim 1, wherein the Do$ holds multiple instruction results to extend an operand bypass network to save read and write transaction power.
6. A compute unit (CU), the CU comprising: a plurality of vector general purpose registers (VGPRs) grouped in sets; a plurality of first arithmetic logic units (ALUs), each first ALU coupled to one set of the plurality of VGPRs; a ...

More
Publication date: 25-08-2022

ACCESS LOG AND ADDRESS TRANSLATION LOG FOR A PROCESSOR

Number: US20220269620A1
Assignee:

A processor maintains an access log indicating a stream of cache misses at a cache of the processor. In response to each of at least a subset of cache misses at the cache, the processor records a corresponding entry in the access log, indicating a physical memory address of the memory access request that resulted in the corresponding miss. In addition, the processor maintains an address translation log that indicates a mapping of physical memory addresses to virtual memory addresses. In response to an address translation (e.g., a page walk) that translates a virtual address to a physical address, the processor stores a mapping of the physical address to the corresponding virtual address at an entry of the address translation log. Software executing at the processor can use the two logs for memory management.

1-20. (canceled)
21. A method comprising: recording, at a processor, a first log indicating a set of physical memory addresses associated with a stream of cache misses at the processor; providing the first log to an operating system executing at the processor; and transferring data to a first cache based on the first log.
22. The method of claim 21, further comprising: recording, at the processor, a second log indicating a mapping of the set of physical memory addresses to a corresponding set of virtual addresses; and providing the second log to the operating system executing at the processor; wherein transferring data to the first cache comprises transferring data based on the second log.
23. The method of claim 22, wherein providing the first log and the second log comprises: providing the first log and the second log to the operating system in response to a number of physical memory addresses in the first set exceeding a threshold.
24. The method of claim 21, further comprising: identifying a memory access pattern based on the first log.
25. The method of claim 24, wherein transferring data comprises transferring data based on the memory access pattern.
26. The method ...
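In toy Python form, assuming dictionaries stand in for the hardware logs (all names invented for illustration), software can join the two logs to find miss-heavy virtual pages:

    miss_log = []          # physical addresses of cache misses
    xlate_log = {}         # physical page -> virtual page (filled on page walks)

    def on_page_walk(vpage, ppage):
        xlate_log[ppage] = vpage          # record mapping at translation time

    def on_cache_miss(paddr):
        miss_log.append(paddr)            # record the miss stream

    def hot_virtual_pages(page_size=4096):
        """OS-side consumer: join the logs to rank miss-heavy virtual pages."""
        hits = {}
        for paddr in miss_log:
            vpage = xlate_log.get(paddr // page_size)
            if vpage is not None:
                hits[vpage] = hits.get(vpage, 0) + 1
        return sorted(hits, key=hits.get, reverse=True)

    on_page_walk(vpage=0x7f000, ppage=0x1234)
    for off in (0, 64, 128):
        on_cache_miss(0x1234 * 4096 + off)
    print(hot_virtual_pages())   # the one hot virtual page, printed in decimal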

More
Publication date: 25-04-2019

HYBRID RENDER WITH DEFERRED PRIMITIVE BATCH BINNING

Number: US20190122417A1
Assignee:

A system, method and a non-transitory computer readable storage medium are provided for hybrid rendering with deferred primitive batch binning. A primitive batch is generated from one or more primitives. A bin is identified for processing the primitive batch. At least a portion of each primitive intersecting the identified bin is processed and a next bin for processing the primitive batch is identified based on an intercept walk order. The processing is iteratively repeated for the one or more primitives in the primitive batch for successive bins until all primitives of the primitive batch are completely processed. Then, the one or more primitives in the primitive batch are further processed.

1. A method for use in a computer system, the method comprising: generating a primitive batch from one or more primitives; identifying a bin for processing the primitive batch, wherein the bin for processing is identified using bin intercept information for each primitive in the primitive batch; processing at least a portion of each primitive intersecting the identified bin, wherein the processing is performed on a per-bin basis and only the portion of each primitive located within the identified bin is processed; identifying a next bin for processing the primitive batch based on an intercept walk order; iteratively repeating the processing of the one or more primitives in the primitive batch for successive bins until all primitives of the primitive batch are completely processed; and further processing the one or more primitives in the primitive batch, wherein the further processing includes a shading processing operation.
2. The method of claim 1, wherein the processing of the at least a portion of each primitive intersecting the identified bin follows an order of processing associated with an arrival identifier of each primitive.
3. The method of claim 1, wherein the shading processing operation is a deferred shading processing operation in response to the identified bin having ...

More
Publication date: 27-05-2021

DEDICATED VECTOR SUB-PROCESSOR SYSTEM

Number: US20210157588A1
Assignee:

A processor includes a plurality of vector sub-processors (VSPs) and a plurality of memory banks dedicated to respective VSPs. A first memory bank corresponding to a first VSP includes a first plurality of high vector general purpose register (VGPR) banks and a first plurality of low VGPR banks corresponding to the first plurality of high VGPR banks. The first memory bank further includes a plurality of operand gathering components that store operands from respective high VGPR banks and low VGPR banks. The operand gathering components are assigned to individual threads while the threads are executed by the first VSP.

1. A processor, comprising: a plurality of vector sub-processors (VSPs); and a plurality of memory banks dedicated to respective VSPs of the plurality of VSPs, wherein a first memory bank dedicated to a first VSP of the plurality of VSPs comprises: a first plurality of high vector general purpose register (VGPR) banks; and a first plurality of low VGPR banks corresponding to the plurality of high VGPR banks.
2. The processor of claim 1, further comprising a second memory bank dedicated to a second VSP of the plurality of VSPs, wherein the second memory bank comprises: a second plurality of high VGPR banks; and a second plurality of low VGPR banks corresponding to the second plurality of high VGPR banks.
3. The processor of claim 1, further comprising a broadcast switch configured to broadcast operands between the plurality of VSPs.
4. The processor of claim 1, wherein the first memory bank further comprises a plurality of operand gathering components corresponding to VGPR banks of the first VSP, wherein a first operand gathering component is configured to store a first plurality of operands from a corresponding high VGPR bank and to store a second plurality of operands from a corresponding low VGPR bank.
5. The processor of claim 4, further comprising a phase multiplexer of the first VSP, wherein the phase multiplexer is ...

More
Publication date: 27-05-2021

WORKLOAD-BASED CLOCK ADJUSTMENT AT A PROCESSING UNIT

Number: US20210157639A1
Assignee:

A graphics processing unit (GPU) adjusts a clock frequency based on identifying a program thread executing at the processing unit, where the program thread is detected based on a workload to be executed. By adjusting the clock frequency based on the identified program thread, the processing unit adapts to different processing demands of different program threads. Further, by identifying the program thread based on workload, the processing unit adapts the clock frequency based on processing demands, thereby conserving processing resources.

1. A method comprising: receiving, at a graphics processing unit (GPU), a plurality of commands from a central processing unit (CPU), the plurality of commands associated with a plurality of program threads concurrently executing at the CPU, each of the plurality of threads associated with a corresponding specified clock frequency; determining, at the GPU, a first workload to be executed at the GPU based on at least one of the plurality of commands; identifying, based on the first workload, a first program thread of the plurality of program threads concurrently executing at the CPU; and, in response to identifying the first program thread, adjusting a clock signal of the GPU to the specified clock frequency associated with the first program thread.
2. The method of claim 1, wherein: identifying the first program thread comprises identifying the first program thread in response to the first workload exceeding a first workload threshold.
3. The method of claim 2, further comprising: determining, at the GPU, a second workload to be executed at the GPU after the first workload based on at least one other command of the plurality of commands; identifying, based on the second workload, a second program thread of the plurality of program threads concurrently executing at the CPU; and, in response to identifying the second program thread, adjusting the clock signal of the GPU from the first frequency to the specified frequency associated with ...

More
Publication date: 02-05-2019

PACKED 16 BITS INSTRUCTION PIPELINE

Number: US20190129718A1
Assignee:

Systems, apparatuses, and methods for routing traffic between clients and system memory are disclosed. A computing system includes a processor capable of executing single precision mathematical instructions on data sizes of M bits and half precision mathematical instructions on data sizes of N bits, which is less than M bits. At least two source operands with M bits indicated by a received instruction are read from a register file. If the instruction is a packed math instruction, at least a first source operand with a size of N bits less than M bits is selected from either a high portion or a low portion of one of the at least two source operands read from the register file. The instruction includes fields storing bits, each bit indicating the high portion or the low portion of a given source operand associated with a register identifier specified elsewhere in the instruction.

1. A processor comprising: a register file configured to store data; a first arithmetic logic unit (ALU) and a second ALU; and control logic configured to: receive a first instruction; read, from the register file, one or more M bit source operands indicated by the first instruction; and, in response to determining the first instruction comprises a packed math operation to perform on data with a size of N bits, where N and M are integers and N is less than M: select a first N bit source operand from one of the one or more M-bit source operands; send the first N bit source operand to a given ALU of the first ALU and the second ALU; and send an indication to each of the first ALU and the second ALU specifying the packed math operation indicated by the first instruction.
2. The processor as recited in claim 1, wherein the instruction includes a bit which indicates a location of the first source operand within one of the at least two source operands.
3. The processor as recited in claim 1, wherein the control logic is further configured to: determine from the first instruction that a ...
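The high/low half selection is easy to illustrate; this Python sketch uses standard masks and shifts, with invented function names:

    def select_half(reg32: int, high: bool) -> int:
        """Return the 16-bit half of a 32-bit operand selected by `high`."""
        return (reg32 >> 16) & 0xFFFF if high else reg32 & 0xFFFF

    def packed_add(a32: int, b32: int, a_hi: bool, b_hi: bool) -> int:
        a16 = select_half(a32, a_hi)
        b16 = select_half(b32, b_hi)
        return (a16 + b16) & 0xFFFF            # 16-bit lane arithmetic

    r0 = 0x0005_0003                           # high half 5, low half 3
    r1 = 0x0002_0007                           # high half 2, low half 7
    print(hex(packed_add(r0, r1, a_hi=True, b_hi=False)))  # 5 + 7 -> 0xc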

More
Publication date: 02-05-2019

WAVE CREATION CONTROL WITH DYNAMIC RESOURCE ALLOCATION

Number: US20190129756A1
Assignee:

Footprints, or resource allocations, of waves within resources that are shared by processor cores in a multithreaded processor are measured concurrently with the waves executing on the processor cores. The footprints are averaged over a time interval. A number of waves are spawned and dispatched for execution in the multithreaded processor based on the average footprint. In some cases, the waves are spawned at a rate that is determined based on the average value of the footprints of waves within the resources. The rate of spawning waves is modified in response to a change in the average value of the footprints of the waves within the resources.

1. A method comprising: measuring resource allocations of waves within resources that are shared by processor cores in a multithreaded processor concurrently with the waves executing on the processor cores; averaging, at the multithreaded processor, the resource allocations over a time interval; and spawning, at the multithreaded processor, a number of waves that are dispatched for execution in the multithreaded processor based on the average resource allocation.
2. The method of claim 1, wherein measuring the resource allocations of the waves comprises measuring the resource allocations of the waves at times corresponding to at least one of creation of the waves, allocation of resources to the waves, deallocation of resources from the waves, or at time intervals corresponding to a predetermined number of execution cycles.
3. The method of claim 1, wherein measuring the resource allocations over the time interval comprises measuring maximal resource allocations of the waves while the waves are executing on the processor cores.
4. The method of claim 1, wherein measuring the resource allocations of the waves within the resources comprises measuring the resource allocations of the waves within the resources during a trailing time interval relative to a reference time.
5. The method of claim 4, wherein ...
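A hedged sketch of footprint-averaged spawning; the capacity, the window length, and the names are assumptions for illustration:

    from collections import deque

    CAPACITY = 4096          # assumed shared-resource capacity (e.g., registers)
    WINDOW = 8               # trailing samples to average over

    samples = deque(maxlen=WINDOW)

    def record_footprint(footprint):
        samples.append(footprint)         # measured while waves execute

    def waves_to_spawn():
        if not samples:
            return 1                      # no data yet: spawn conservatively
        avg = sum(samples) / len(samples)
        return max(1, int(CAPACITY // avg))

    for f in (256, 512, 384):
        record_footprint(f)
    print(waves_to_spawn())               # 4096 // avg(384) -> 10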

More
Publication date: 01-09-2022

HYBRID RENDER WITH DEFERRED PRIMITIVE BATCH BINNING

Number: US20220277508A1
Assignee:

A method, computer system, and a non-transitory computer-readable storage medium for performing primitive batch binning are disclosed. The method, computer system, and non-transitory computer-readable storage medium include techniques for generating a primitive batch from a plurality of primitives, computing respective bin intercepts for each of the plurality of primitives in the primitive batch, and shading the primitive batch by iteratively processing each of the respective bin intercepts computed until all of the respective bin intercepts are processed.

1. A method for use in a computer system, the method comprising: generating a primitive batch from a plurality of primitives; computing respective bin intercepts for each of the plurality of primitives in the primitive batch; and shading the primitive batch by iteratively processing each of the respective bin intercepts computed until all of the respective bin intercepts are processed, wherein iteratively processing each of the respective bin intercepts comprises: selecting a particular bin intercept from each of the respective bin intercepts computed, rasterizing a first primitive and all other primitives that have the particular bin intercept, and selecting a new particular bin intercept from a remaining subset of the respective bin intercepts computed, wherein the first primitive has the new particular bin intercept.
2. The method of claim 1, wherein the rasterizing follows an order of processing associated with an arrival identifier of each primitive.
3. The method of claim 1, wherein the rasterizing includes a deferred shading processing operation in response to the particular bin intercept having an overlapping region of more than one primitive.
4. The method of claim 1, wherein the selecting the particular bin intercept comprises: determining initial uppermost-left intersection points for primitives in the primitive batch.
5. The method of claim 1, wherein selecting the new particular bin intercept ...

More
Publication date: 24-05-2018

DUAL MODE LOCAL DATA STORE

Number: US20180143907A1
Assignee:

A system and method for efficiently processing access requests for a shared resource are described. Each of many requestors is assigned to a partition of a shared resource. When a controller determines no requestor generates an access request for an unassigned partition, the controller permits simultaneous access to the assigned partitions for active requestors. When the controller determines at least one active requestor generates an access request for an unassigned partition, the controller allows a single active requestor to gain exclusive access to the entire shared resource while stalling access for the other active requestors. The controller alternates exclusive access among the active requestors. In various embodiments, the shared resource is a local data store in a graphics processing unit and each of the multiple requestors is a single instruction multiple data (SIMD) compute unit.
1. A computing system comprising: a shared resource comprising a plurality of partitions; a plurality of requestors, each assigned to a different partition of the plurality of partitions of the shared resource, each configured to generate a request to the plurality of partitions; and a controller coupled to the shared resource, wherein in response to receiving a plurality of requests from a plurality of active requestors of the plurality of requestors for access to the shared resource, the controller is configured to: in response to determining no active requestor targets an unassigned partition, provide simultaneous access to partitions assigned to the plurality of active requestors; and in response to determining an active requestor targets an unassigned partition: select a first requestor of the plurality of active requestors; provide the first requestor with access to all partitions of the plurality of partitions; and stall access to the shared resource for each of the plurality of requestors other than the first requestor when providing the first requestor with access to all partitions. ...
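
The two arbitration modes reduce to a small decision rule: grant everyone in parallel while all requests stay inside their own partitions, otherwise serialize with a rotating exclusive grant. A minimal Python sketch of that rule; the round-robin `turn` counter and the id-equals-partition ownership convention are assumptions.

# Illustrative sketch: requestor i owns partition i (assumed convention).
def arbitrate(requests, turn):
    # requests: {requestor_id: target_partition}
    if all(part == rid for rid, part in requests.items()):
        return sorted(requests), turn            # parallel grants, all proceed
    order = sorted(requests)
    winner = order[turn % len(order)]            # alternate exclusive access
    return [winner], turn + 1

turn = 0
grants, turn = arbitrate({0: 0, 1: 1, 2: 2}, turn)
print(grants)   # [0, 1, 2] -> simultaneous access to assigned partitions
grants, turn = arbitrate({0: 2, 1: 1}, turn)
print(grants)   # [0] -> requestor 0 gets the whole store; requestor 1 stalls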

24-05-2018 publication date

LOW POWER AND LOW LATENCY GPU COPROCESSOR FOR PERSISTENT COMPUTING

Number: US20180144435A1
Assignee:

Systems, apparatuses, and methods for implementing a graphics processing unit (GPU) coprocessor are disclosed. The GPU coprocessor includes a SIMD unit with the ability to self-schedule sub-wave procedures based on input data flow events. A host processor sends messages targeting the GPU coprocessor to a queue. In response to detecting a first message in the queue, the GPU coprocessor schedules a first sub-task for execution. The GPU coprocessor includes an inter-lane crossbar and intra-lane biased indexing mechanism for a vector general purpose register (VGPR) file. The VGPR file is split into two files. The first VGPR file is a larger register file with one read port and one write port. The second VGPR file is a smaller register file with multiple read ports and one write port. The second VGPR file introduces the ability to co-issue more than one instruction per clock cycle.
1. A system comprising: a queue; a graphics processing unit (GPU); a GPU coprocessor; and a host processor configured to send messages targeting the GPU coprocessor to the queue; wherein the GPU coprocessor is configured to: monitor the queue; and schedule a first sub-task for execution responsive to detecting a first message in the queue, wherein the first sub-task is a persistent thread.
2. The system as recited in claim 1, wherein the GPU coprocessor is further configured to: perform a lookup of an event table for the first message responsive to detecting the first message; map the first message to a first event using the event table; schedule the first sub-task for execution responsive to mapping the first message to the first event; and continue to service subsequent messages when compute resources are available.
4. The system as recited in claim 1, wherein the GPU coprocessor comprises: a first vector general purpose register (VGPR) file with one read port and one write port; a second VGPR file with multiple read ports and one write port; a single instruction, multiple data (SIMD) unit; a biased ...

23-05-2019 publication date

SELECTIVE PREFETCHING IN MULTITHREADED PROCESSING UNITS

Number: US20190155604A1
Assignee:

A processing unit includes a plurality of processing elements and one or more caches. A first thread executes a program that includes one or more prefetch instructions to prefetch information into a first cache. Prefetching is selectively enabled when executing the first thread on a first processing element dependent upon whether one or more second threads previously executed the program on the first processing element. The first thread is then dispatched to execute the program on the first processing element. In some cases, a dispatcher receives the first thread for dispatching to the first processing element. The dispatcher modifies the prefetch instruction to disable prefetching into the first cache in response to the one or more second threads having previously executed the program on the first processing element.
1. A method comprising: selectively enabling, at a processing unit that comprises a plurality of processing elements, prefetching for a first thread that executes a program on a first processing element dependent upon whether at least one second thread previously executed the program on the first processing element, wherein the program includes at least one prefetch instruction to prefetch information into a first cache; and dispatching the first thread to execute the program on the first processing element.
2. The method of claim 1, wherein selectively enabling prefetching for the first thread comprises disabling prefetching for the first thread in response to the at least one second thread having previously executed the program on the first processing element.
3. The method of claim 1, wherein selectively enabling prefetching for the first thread comprises enabling prefetching for the first thread in response to the at least one second thread not having previously executed the program on the first processing element.
4. The method of claim 3, wherein the processing unit comprises a cache hierarchy that includes the first cache to cache information for ...
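
A sketch of the dispatch-time decision, assuming the bookkeeping is a simple map from processing element to the set of programs it has already run; `history` and `dispatch` are illustrative names, not from the patent.

# Illustrative sketch: prefetch only on the first run of a program per element.
history = {}   # processing_element -> set of program ids already executed

def dispatch(thread_id, program_id, element_id):
    seen = history.setdefault(element_id, set())
    prefetch_enabled = program_id not in seen    # first run => cold cache => prefetch
    seen.add(program_id)
    return {"thread": thread_id, "element": element_id,
            "prefetch": prefetch_enabled}

print(dispatch(0, "shaderA", 0))  # prefetch: True  (cold cache)
print(dispatch(1, "shaderA", 0))  # prefetch: False (warmed by thread 0)
print(dispatch(2, "shaderA", 1))  # prefetch: True  (different element)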

14-06-2018 publication date

REMOVING OR IDENTIFYING OVERLAPPING FRAGMENTS AFTER Z-CULLING

Number: US20180165872A1
Assignee:

Techniques for removing or identifying overlapping fragments in a fragment stream after z-culling are disclosed. The techniques include maintaining a first-in-first-out buffer that stores post-z-cull fragments. Each time a new fragment is received at the buffer, the screen position of the fragment is checked against all other fragments in the buffer. If the screen position of the fragment matches the screen position of a fragment in the buffer, then the fragment in the buffer is removed or marked as overlapping. If the screen position of the fragment does not match the screen position of any fragment in the buffer, then no modification is performed to fragments already in the buffer. In either case, the fragment is added to the buffer. The contents of the buffer are transmitted to the pixel shader for pixel shading at a later time.
1. A method for identifying overlapping fragments in a stream of fragments for processing by a pixel shader, the method comprising: receiving, from a stream of z-culled fragments, a first fragment, the first fragment having a first screen position; identifying, in a deferred pixel shading buffer that stores fragments, a second fragment having a second screen position that matches the first screen position; responsive to the identifying, modifying the deferred pixel shading buffer based on the match; and transmitting the fragments of the deferred pixel shading buffer to a pixel shader for shading.
2. The method of claim 1, further comprising: binning a plurality of input primitives to generate a plurality of binned input primitives; rasterizing the plurality of binned input primitives to generate a set of fragments; and z-culling the set of fragments to produce the stream of z-culled fragments.
3. The method of claim 2, wherein: binning the plurality of input primitives to generate the plurality of binned input primitives includes assigning the primitives of the plurality of input primitives to bins, where each bin is associated with a different ...
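
The buffer behaves like a small deduplicating FIFO keyed on screen position. A Python sketch of the removal variant (the marking variant would tag rather than drop the older fragment); the capacity and flush-on-full policy are assumptions, and DeferredShadingBuffer is an illustrative name.

# Illustrative sketch of the post-z-cull deduplicating FIFO.
from collections import deque

class DeferredShadingBuffer:
    def __init__(self, capacity=8):
        self.buf = deque()
        self.capacity = capacity

    def push(self, fragment):
        x, y, _payload = fragment
        # Remove an older fragment covering the same pixel, if any.
        self.buf = deque(f for f in self.buf if (f[0], f[1]) != (x, y))
        self.buf.append(fragment)
        if len(self.buf) == self.capacity:
            self.flush()

    def flush(self):
        while self.buf:
            print("shade", self.buf.popleft())  # hand off to the pixel shader

dsb = DeferredShadingBuffer()
dsb.push((3, 4, "red"))
dsb.push((5, 6, "green"))
dsb.push((3, 4, "blue"))   # overlaps (3, 4): the red fragment is dropped
dsb.flush()                # shades (5, 6, 'green') then (3, 4, 'blue')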

30-05-2019 publication date

PRECISE SUSPEND AND RESUME OF WORKLOADS IN A PROCESSING UNIT

Number: US20190163527A1
Assignee:

A first workload is executed in a first subset of pipelines of a processing unit. A second workload is executed in a second subset of the pipelines of the processing unit. The second workload is dependent upon the first workload. The first and second workloads are suspended and state information for the first and second workloads is stored in a first memory in response to suspending the first and second workloads. In some cases, a third workload executes in a third subset of the pipelines of the processing unit concurrently with executing the first and second workloads. In some cases, a fourth workload is executed in the first and second pipelines after suspending the first and second workloads. The first and second pipelines are resumed on the basis of the stored state information in response to completion or suspension of the fourth workload. 1. A method comprising:executing a first workload in a first subset of pipelines of a processing unit;executing a second workload in a second subset of the pipelines of the processing unit, wherein the second workload is dependent upon the first workload;suspending the first and second workloads; andstoring state information for the first and second workloads in a first memory in response to suspending the first and second workloads.2. The method of claim 1 , wherein executing the first workload comprises executing a compute workload in a compute pipeline of the processing unit claim 1 , and wherein executing the second workload comprises executing a graphics workload in a graphics pipeline of the processing unit.3. The method of claim 1 , further comprising:executing a third workload in a third subset of the pipelines of the processing unit concurrently with executing the first and second workloads, wherein the first, second, and third subsets of the pipelines are mutually exclusive.4. The method of claim 3 , wherein suspending the first and second workloads comprises suspending the first and second workloads while ...

30-05-2019 publication date

PRIMITIVE LEVEL PREEMPTION USING DISCRETE NON-REAL-TIME AND REAL TIME PIPELINES

Number: US20190164328A1
Assignee:

Processing of non-real-time and real-time workloads is performed using discrete pipelines. A first pipeline includes a first shader and one or more fixed function hardware blocks. A second pipeline includes a second shader that is configured to emulate the at least one fixed function hardware block. First and second memory elements store first state information for the first pipeline and second state information for the second pipeline, respectively. A non-real-time workload executing in the first pipeline is preempted at a primitive boundary in response to a real-time workload being dispatched for execution in the second pipeline. The first memory element retains the first state information in response to preemption of the non-real-time workload. The first pipeline is configured to resume processing the subsequent primitive on the basis of the first state information stored in the first memory element. 1. An apparatus comprising:a first pipeline that comprises a first shader and at least one fixed function hardware block; anda second pipeline that comprises a second shader that is configured to emulate the at least one fixed function hardware block;wherein a non-real-time workload executing in the first pipeline is preempted in response to a real-time workload being submitted for execution in the second pipeline.2. The apparatus of claim 1 , further comprising:a command processor; anda geometry engine, wherein the first pipeline and the second pipeline are implemented in the command processor and the geometry engine.3. The apparatus of claim 2 , further comprising:first and second memory elements to store first state information for the first pipeline and second state information for the second pipeline, respectively, wherein the non-real-time workload executing in the first pipeline is preempted at a primitive boundary in response to the real-time workload being submitted for execution in the second pipeline.4. The apparatus of claim 2 , wherein the non-real-time ...

01-07-2021 publication date

LOW POWER AND LOW LATENCY GPU COPROCESSOR FOR PERSISTENT COMPUTING

Number: US20210201439A1
Assignee:

Systems, apparatuses, and methods for implementing a graphics processing unit (GPU) coprocessor are disclosed. The GPU coprocessor includes a SIMD unit with the ability to self-schedule sub-wave procedures based on input data flow events. A host processor sends messages targeting the GPU coprocessor to a queue. In response to detecting a first message in the queue, the GPU coprocessor schedules a first sub-task for execution. The GPU coprocessor includes an inter-lane crossbar and intra-lane biased indexing mechanism for a vector general purpose register (VGPR) file. The VGPR file is split into two files. The first VGPR file is a larger register file with one read port and one write port. The second VGPR file is a smaller register file with multiple read ports and one write port. The second VGPR file introduces the ability to co-issue more than one instruction per clock cycle.
1. A system comprising: a queue; a graphics processing unit (GPU); a GPU coprocessor; and a host processor configured to send messages targeting the GPU coprocessor to the queue; wherein the GPU coprocessor is configured to: monitor the queue; and schedule a first sub-task for execution responsive to detecting a first message in the queue, wherein the first sub-task is a persistent thread.
2. The system as recited in claim 1, wherein the GPU coprocessor is further configured to: perform a lookup of an event table for the first message responsive to detecting the first message; map the first message to a first event using the event table; schedule the first sub-task for execution responsive to mapping the first message to the first event; and continue to service subsequent messages when compute resources are available.
4. The system as recited in claim 1, wherein the GPU coprocessor comprises: a first vector general purpose register (VGPR) file with one read port and one write port; a second VGPR file with multiple read ports and one write port; a single instruction, multiple data (SIMD) unit; a biased ...

21-06-2018 publication date

Efficient arbitration for memory accesses

Number: US20180173649A1

A system and method for efficient arbitration of memory access requests are described. One or more functional units generate memory access requests for a partitioned memory. An arbitration unit stores the generated requests and selects a given one of the stored requests. The arbitration unit identifies a given partition of the memory which stores a memory location targeted by the selected request. The arbitration unit determines whether one or more other stored requests access memory locations in the given partition. The arbitration unit sends each of the selected memory access request and the identified one or more other memory access requests to the memory to be serviced out of order.
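
The out-of-order grouping can be illustrated with a toy arbiter: take the oldest pending request, derive its partition, and batch every pending request that maps to the same partition. A Python sketch, assuming a simple modulo address-to-partition mapping; `select_batch` is an illustrative name, not from the patent.

# Illustrative sketch of partition-grouped request selection.
def select_batch(pending, partitions):
    # pending: list of target addresses in arrival order.
    if not pending:
        return []
    lead = pending[0]
    part = lead % partitions            # assumed address-to-partition mapping
    batch = [a for a in pending if a % partitions == part]
    for a in batch:
        pending.remove(a)               # these issue together, out of order
    return batch

pending = [0x10, 0x25, 0x30, 0x41, 0x50]
print([hex(a) for a in select_batch(pending, partitions=4)])
# ['0x10', '0x30', '0x50'] -- all map to partition 0 and are serviced together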

06-06-2019 publication date

STREAM PROCESSOR WITH LOW POWER PARALLEL MATRIX MULTIPLY PIPELINE

Number: US20190171448A1
Assignee:

Systems, apparatuses, and methods for implementing a low power parallel matrix multiply pipeline are disclosed. In one embodiment, a system includes at least first and second vector register files coupled to a matrix multiply pipeline. The matrix multiply pipeline comprises a plurality of dot product units. The dot product units are configured to calculate dot or outer products for first and second sets of operands retrieved from the first vector register file. The results of the dot or outer product operations are written back to the second vector register file. The second vector register file provides the results from the previous dot or outer product operations as inputs to subsequent dot or outer product operations. The dot product units receive the results from previous phases of the matrix multiply operation and accumulate these previous dot or outer product results with the current dot or outer product results.
1. A system comprising: a first vector register file; and a first execution pipeline coupled to the first vector register file, wherein the first execution pipeline comprises a plurality of dot product units, and wherein each dot product unit of the plurality of dot product units is configured to: calculate a plurality of products of elements of a first set of operands and corresponding elements of a second set of operands; and calculate a sum of an accumulation input and the plurality of products, wherein the sum is an output of the dot product unit.
2. The system as recited in claim 1, wherein the system is configured to read the first and second sets of operands from the first vector register file and provide the first and second sets of operands to the first execution pipeline.
3. The system as recited in claim 1, wherein the system further comprises a second vector register file, wherein the system is configured to read a plurality of accumulation inputs from the second vector register file and provide the plurality of accumulation ...
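
At the heart of each unit is a fused dot-product-accumulate, out = acc + sum(a[i] * b[i]). A Python sketch showing one phase of a small matrix multiply built from that primitive; register-file traffic and pipelining are omitted, and the function name is illustrative.

# Illustrative sketch of the dot-product-with-accumulate step.
def dot_product_unit(a, b, acc):
    assert len(a) == len(b)
    return acc + sum(x * y for x, y in zip(a, b))

# One phase of a 2x2 matrix multiply C += A @ B, one output element per unit:
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
for i in range(2):
    for j in range(2):
        row = A[i]
        col = [B[k][j] for k in range(2)]
        C[i][j] = dot_product_unit(row, col, C[i][j])
print(C)   # [[19, 22], [43, 50]]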

08-07-2021 publication date

HYBRID RENDER WITH PREFERRED PRIMITIVE BATCH BINNING AND SORTING

Number: US20210209831A1
Assignee:

A method, system, and non-transitory computer readable storage medium for rasterizing primitives are disclosed. The method, system, and non-transitory computer readable storage medium includes: generating a primitive batch from a sequence of one or more primitives, wherein the primitive batch includes primitives sorted into one or more row groups based on which row of a plurality of rows each primitive intersects; and processing each row group, the processing for each row group including: identifying one or more primitive column intercepts for each of the one or more primitives in the row group, wherein each combination of primitive column intercept and row identifies a bin; and rasterizing the one or more primitives that intersect the bin. 1. A method for rasterizing , the method comprising:generating a primitive batch from a sequence of one or more primitives, wherein the primitive batch includes primitives sorted into one or more row groups based on which row of a plurality of rows each primitive intersects; and identifying one or more primitive column intercepts for each of the one or more primitives in the row group, wherein each combination of primitive column intercept and row identifies a bin; and', 'rasterizing the one or more primitives that intersect the bin., 'processing each row group, the processing for each row group including2. The method of claim 1 , further comprising setting a display area for rasterization corresponding to the area of the bin for each bin that includes a primitive.3. The method of claim 1 , wherein the primitive batch is generated by accumulating primitives of the sequence into different row groups.4. The method of claim 1 , further comprising determining whether a batch break condition exists before inserting a new primitive in the batch.5. The method of claim 1 , further comprising storing a mask that indicates the one or more columns for each of the plurality of rows that contains at least one primitive.6. The method of claim ...

02-10-2014 publication date

Hybrid Render with Deferred Primitive Batch Binning

Number: US20140292756A1
Assignee:

A system, method and a computer program product are provided for hybrid rendering with deferred primitive batch binning. A primitive batch is generated from a sequence of primitives. Initial bin intercepts are identified for primitives in the primitive batch. A bin for processing is identified. The bin corresponds to a region of a screen space. Pixels of the primitives intercepting the identified bin are processed. Next bin intercepts are identified while the primitives intercepting the identified bin are processed. 1. A method comprising:generating a primitive batch from a sequence of primitives;identifying initial bin intercepts for primitives in the primitive batch;identifying a bin for processing, wherein the bin corresponds to a region of a screen space;processing each primitive intercepting the identified bin; andduring the processing of primitives intercepting the identified bin, identifying next bin intercepts for the processed primitives.2. The method of claim 1 , further comprising:when a deferred shading processing operation is enabled, delaying shading of pixels associated with each primitive of the identified bin until receipt of a complete set of pixels for the identified bin;determining contributing and non-contributing fragments associated with each bin, wherein the contributing fragments affect at least one of a final pixel color and a pixel depth;discarding non-contributing fragments; andshading all contributing fragments.3. The method of claim 1 , further comprising:iteratively repeating the processing of primitives for successive bins until all primitives of the primitive batch have been completely processed.4. The method of claim 1 , wherein the identifying initial bin intercepts comprises:determining initial uppermost-left intersection points for primitives in the primitive batch; andtemporarily storing primitive data including at least the uppermost-left intersection bin.5. The method of claim 1 , wherein identifying a bin to be processed ...

27-07-2017 publication date

SIMD PROCESSING UNIT WITH LOCAL DATA SHARE AND ACCESS TO A GLOBAL DATA SHARE OF A GPU

Number: US20170212757A1
Assignee: Advanced Micro Devices, Inc.

A graphics processing unit is disclosed, the graphics processing unit having a processor having one or more SIMD processing units, and a local data share corresponding to one of the one or more SIMD processing units, the local data share comprising one or more low latency accessible memory regions for each group of threads assigned to one or more execution wavefronts, and a global data share comprising one or more low latency memory regions for each group of threads.
1. A non-transitory computer-readable medium having stored thereon computer-executable instructions that, if executed by a computing device, cause the computing device to perform a method comprising: allocating a set of pixels of an image to a set of single-instruction multiple-data (SIMD) processors; allocating a subset of pixels of the set of pixels to each thread executing on a processing lane of each of the set of SIMD processors; storing the subset of pixels in a general purpose register (GPR) file associated with each processing lane; computing a per-thread private result based on the subset of pixels in a private space in the GPR file; accumulating the per-thread private result with additional per-thread private results computed by threads from a same lane to generate a per-lane local result stored in a global space in the GPR file; and writing the per-lane local result from the global space in the GPR file to a private area of a local data share (LDS) associated with the processing lane, the LDS associated only with a SIMD processor of the set of SIMD processors which contains the processing lane.
2. The non-transitory computer-readable medium of claim 1, the method further comprising: reading each per-lane local result from the LDS into a first single GPR file of a first single processing lane; reducing the results of all per-lane local results from the LDS to find a SIMD-local result; and writing the SIMD-local result from the first single GPR file to a private area of a global data share (GDS) ...
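
The claimed reduction is hierarchical: thread-private results accumulate into per-lane results (GPR global space), lanes reduce into a SIMD-local result (LDS), and SIMD results meet in the global data share. A pure-Python stand-in for that dataflow, with sums in place of the real shader math; the counts and function name are illustrative.

# Illustrative sketch: thread -> lane -> SIMD -> GDS reduction.
def reduce_image(pixels, simds=2, lanes=4, threads=8):
    chunks = [pixels[i::simds * lanes * threads]     # subset of pixels per thread
              for i in range(simds * lanes * threads)]
    per_thread = [sum(c) for c in chunks]            # private result in the GPR file
    per_lane = [sum(per_thread[i:i + threads])       # lane result -> LDS
                for i in range(0, len(per_thread), threads)]
    per_simd = [sum(per_lane[i:i + lanes])           # SIMD-local result -> GDS
                for i in range(0, len(per_lane), lanes)]
    return sum(per_simd)                             # final reduction in the GDS

pixels = list(range(256))
assert reduce_image(pixels) == sum(pixels)           # same total, fewer global writes
print(reduce_image(pixels))                          # 32640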

26-07-2018 publication date

SOFTWARE CONTROL OF STATE SETS

Number: US20180210657A1
Assignee:

Systems, apparatuses, and methods for implementing software control of state sets are disclosed. In one embodiment, a processor includes at least an execution unit and a plurality of state registers. The processor is configured to detect a command to allocate a first state set for storing a first state, wherein the command is generated by software, and wherein the first state specifies values for the plurality of state registers. The command is executed on the execution unit while the processor is in a second state, wherein the second state is different from the first state. The first state set of the processor is allocated with the first state responsive to executing the command on the execution unit. The processor is configured to allocate the first state set for the first state prior to the processor entering the first state. 1. A processor comprising:an execution unit; anda plurality of state registers; detect a first command to allocate a first state set for storing a first state, wherein the first command is generated by software and the first state specifies values for the plurality of state registers;', 'execute the first command on the execution unit while the processor is in a second state, wherein the second state is different from the first state; and', 'store the first state in the first state set responsive to executing the first command on the execution unit., 'wherein the processor is configured to2. The processor as recited in claim 1 , wherein the processor is configured to allocate the first state set prior to the processor entering the first state.3. The processor as recited in claim 1 , wherein the processor is configured to detect and execute a second command to reserve the first state set to prevent the first state set from being modified.4. The processor as recited in claim 3 , wherein the processor is configured to:detect a third command for the processor to use the first state, wherein the second command is generated by software;execute the ...

26-07-2018 publication date

STEREO RENDERING

Number: US20180211434A1
Assignee: Advanced Micro Devices, Inc.

Techniques for generating a stereo image from a single set of input geometry in a three-dimensional rendering pipeline are disclosed. Vertices are processed through the end of the world-space pipeline. In the primitive assembler, at the end of the world-space pipeline, before perspective division, each clip-space vertex is duplicated. The primitive assembler generates this duplicated clip-space vertex using the y, z, and w coordinates of the original vertex and based on an x coordinate that is offset in the x-direction in clip-space as compared with the x coordinate of the original vertex. Both the original clip-space vertex and the modified clip-space vertex are then sent through the rest of the pipeline for processing, including perspective division, viewport transform, rasterization, pixel shading, and other operations. The result is that a single set of input vertices is rendered into a stereo image.
1. A method for generating a stereo image, the method comprising: processing a first vertex through a vertex shader stage of a graphics processing pipeline to generate a first clip space vertex; obtaining a modified x coordinate in clip space, the modified x coordinate being the sum of a constant clip space offset value and an x coordinate of the first clip space vertex; obtaining a second clip space vertex based on the modified x coordinate, the second clip space vertex including y, z, and w coordinates identical to those of the first clip space vertex, and the modified x coordinate; and processing both the first clip space vertex and the second clip space vertex to form the stereo image.
2. The method of claim 1, wherein obtaining the modified x coordinate comprises: receiving the modified x coordinate from the vertex shader stage of the graphics processing pipeline.
3. The method of claim 2, further comprising: generating the modified x coordinate by multiplying a modified model-view-projection matrix by the first vertex to obtain a result and extracting the ...
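
The duplication step itself is a one-liner: clone the clip-space vertex and shift only x by a constant clip-space offset. A Python sketch; the offset value and function name are arbitrary illustrations.

# Illustrative sketch of the per-vertex duplication in the primitive assembler.
EYE_OFFSET = 0.06   # assumed constant clip-space separation between eyes

def duplicate_for_stereo(clip_vertex):
    x, y, z, w = clip_vertex
    # y, z, w are copied unchanged; only x is offset in clip space.
    return (x, y, z, w), (x + EYE_OFFSET, y, z, w)

left, right = duplicate_for_stereo((0.25, -0.1, 0.5, 1.0))
print(left)    # (0.25, -0.1, 0.5, 1.0)
print(right)   # x shifted by the offset: (0.31, -0.1, 0.5, 1.0), up to rounding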

26-07-2018 publication date

SPLIT FRAME RENDERING

Number: US20180211435A1
Assignee: Advanced Micro Devices, Inc.

Improvements in the graphics processing pipeline that allow multiple pipelines to cooperate to render a single frame are disclosed. Two approaches are provided. In a first approach, world-space pipelines for the different graphics processing pipelines process all work for draw calls received from a central processing unit (CPU). In a second approach, the world-space pipelines divide up the work. Work that is divided is synchronized and redistributed at various points in the world-space pipeline. In either approach, the triangles output by the world-space pipelines are distributed to the screen-space pipelines based on the portions of the render surface overlapped by the triangles. Triangles are rendered by screen-space pipelines associated with the render surface portions overlapped by those triangles. 1. A method for sharing graphics processing work among multiple accelerated processing devices , the method comprising:obtaining, at a first accelerated processing device (“APD”), a set of triangles processed by a first world-space pipeline of the first APD and a second world-space pipeline of a second APD;obtaining, at the second APD, the set of triangles;discarding, at the first APD, a first subset of the set of triangles that do not overlap a first render surface portion associated with the first APD and processing a first subset of the set of triangles that do overlap the first render surface portion associated with the first APD in a first screen-space pipeline of the first APD;discarding, at the second APD, a second subset of the set of triangles that do not overlap a second render surface portion associated with the second APD and processing a second subset of the set of triangles that do overlap the second render surface portion associated with the second APD in a second screen-space pipeline of the second APD.2. The method of claim 1 , further comprising:processing each draw call of a set of draw calls at both the first APD and the second APD to generate the ...

05-08-2021 publication date

SPATIAL PARTITIONING IN A MULTI-TENANCY GRAPHICS PROCESSING UNIT

Number: US20210241516A1
Assignee:

A graphics processing unit (GPU) or other apparatus includes a plurality of shader engines. The apparatus also includes a first front end (FE) circuit and one or more second FE circuits. The first FE circuit is configured to schedule geometry workloads for the plurality of shader engines in a first mode. The first FE circuit is configured to schedule geometry workloads for a first subset of the plurality of shader engines and the one or more second FE circuits are configured to schedule geometry workloads for a second subset of the plurality of shader engines in a second mode. In some cases, a partition switch is configured to selectively connect the first FE circuit or the one or more second FE circuits to the second subset of the plurality of shader engines depending on whether the apparatus is in the first mode or the second mode. 1. An apparatus comprising:a plurality of shader engines; anda first front end (FE) circuit and at least one second FE circuit, wherein the first FE circuit is configured to schedule geometry workloads for the plurality of shader engines in a first mode, and wherein the first FE circuit is configured to schedule geometry workloads for a first subset of the plurality of shader engines and the at least one second FE circuit is configured to schedule geometry workloads for a second subset of the plurality of shader engines in a second mode.2. The apparatus of claim 1 , further comprising:a partition switch configured to selectively connect the first FE circuit or the at least one second FE circuit to the second subset of the plurality of shader engines depending on whether the apparatus is in the first mode or the second mode.3. The apparatus of claim 1 , wherein the first FE circuit is configured to schedule geometry workloads for concurrent execution by the plurality of shader engines in the first mode claim 1 , and wherein claim 1 , in the second mode claim 1 , the first FE circuit is configured to schedule geometry workloads for ...

02-07-2020 publication date

Prefetch kernels on data-parallel processors

Number: US20200210341A1
Assignee: Advanced Micro Devices Inc

Embodiments include methods, systems and non-transitory computer-readable media including instructions for executing a prefetch kernel with reduced intermediate state storage resource requirements. These include executing a prefetch kernel on a graphics processing unit (GPU), such that the prefetch kernel begins executing before a processing kernel. The prefetch kernel performs memory operations that are based upon at least a subset of memory operations in the processing kernel.

23-08-2018 publication date

VARIABLE WAVEFRONT SIZE

Number: US20180239606A1
Assignee:

Systems, apparatuses, and methods for processing variable wavefront sizes on a processor are disclosed. In one embodiment, a processor includes at least a scheduler, cache, and multiple execution units. When operating in a first mode, the processor executes the same instruction on multiple portions of a wavefront before proceeding to the next instruction of the shader program. When operating in a second mode, the processor executes a set of instructions on a first portion of a wavefront. In the second mode, when the processor finishes executing the set of instructions on the first portion of the wavefront, the processor executes the set of instructions on a second portion of the wavefront, and so on until all portions of the wavefront have been processed. The processor determines the operating mode based on one or more conditions.
1. A processor comprising: a plurality of execution units; and a scheduler, wherein the scheduler is configured to: schedule the plurality of execution units to execute a first instruction on first and second portions of a wavefront prior to scheduling the plurality of execution units to execute a second instruction on the first portion of the wavefront, responsive to detecting a first indication; and schedule the plurality of execution units to execute the first instruction and the second instruction on the first portion of a wavefront prior to scheduling the plurality of execution units to execute the first instruction on the second portion of the wavefront, responsive to not detecting the first indication.
2. The processor as recited in claim 1, wherein the first indication is a parameter declared within a software instruction.
3. The processor as recited in claim 1, wherein the processor is configured to operate in a plurality of operating modes, and wherein the first indication is a command for the processor to operate in a first mode.
4. The processor as recited in claim 3, wherein the processor further comprises a ...
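
The two modes differ only in loop nesting order: instruction-major in the first mode, portion-major in the second. A Python sketch making that explicit; the instruction and portion names are illustrative.

# Illustrative sketch: the two wavefront execution orders.
def run(instructions, portions, per_portion_first):
    if per_portion_first:                       # second mode
        for p in portions:
            for ins in instructions:
                print(f"{ins} on {p}")
    else:                                       # first mode
        for ins in instructions:
            for p in portions:
                print(f"{ins} on {p}")

run(["v_add", "v_mul"], ["half0", "half1"], per_portion_first=False)
# v_add on half0, v_add on half1, v_mul on half0, v_mul on half1
run(["v_add", "v_mul"], ["half0", "half1"], per_portion_first=True)
# v_add on half0, v_mul on half0, v_add on half1, v_mul on half1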

23-08-2018 publication date

SUSPEND AND RESTORE PROCESSOR OPERATIONS

Number: US20180239635A1
Assignee:

Systems, apparatuses, and methods for suspending and restoring operations on a processor are disclosed. In one embodiment, a processor includes at least a control unit, multiple execution units, and multiple work creation units. In response to detecting a request to suspend a software application executing on the processor, the control unit sends requests to the plurality of work creation units to stop creating new work. The control unit waits until receiving acknowledgements from the work creation units prior to initiating a suspend operation. Once all work creation units have acknowledged that they have stopped creating new work, the control unit initiates the suspend operation. Also, when a restore operation is initiated, the control unit prevents any work creation units from launching new work-items until all previously in-flight work-items have been restored to the same work creation units and execution units to which they were previously allocated. 1. A processor comprising:a plurality of execution units;a plurality of work creation units; anda control unit coupled to the plurality of execution units and the plurality of work creation units; send requests to the plurality of work creation units to stop creating new work; and', 'wait until receiving acknowledgements from the plurality of work creation units in response to the requests, prior to initiating a suspend operation., 'wherein responsive to detecting a request to suspend a software application executing on the processor, the control unit is configured to2. The processor as recited in claim 1 , wherein the control unit is further configured to initiate the suspend operation responsive to receiving acknowledgements from the plurality of work creation units claim 1 , wherein initiating the suspend operation comprises:determining which work-items are in-flight;determining which work-items have been assigned to which work creation units;determining which execution units have been allocated for the in-flight ...

01-08-2019 publication date

SYSTEM AND METHOD FOR PROTECTING GPU MEMORY INSTRUCTIONS AGAINST FAULTS

Number: US20190235953A1
Assignee: Advanced Micro Devices, Inc.

A system and method for protecting memory instructions against faults are described. The system and method include converting the slave instructions to dummy operations, modifying a memory arbiter to issue up to N master and N slave global/shared memory instructions per cycle, sending master memory requests to the memory system, using slave requests for error checking, entering master requests to the GM/LM FIFO, storing slave requests in a register, and comparing the entered master requests with the stored slave requests.
1.-20. (canceled)
21. A method for protecting memory instructions against faults, the method comprising: selecting a memory instruction to issue, the memory instruction indicating a master instruction and a slave instruction to be executed in lockstep; performing at least one of an operand error check, a data error check, and a parity error check for the memory instruction; sending data to a data cache and providing a response from the data cache to a plurality of single instruction multiple data (SIMD) processors of the master and slave instructions and checking for parity errors in the plurality of SIMDs; and completing memory access for the memory instruction by returning data from the plurality of SIMDs for the master instruction and the slave instruction.
22. The method of claim 21, wherein the data cache is one of an L1 data cache and a local data share (LDS).
23. The method of claim 21, further comprising passing an error check if there is no error.
24. The method of claim 21, further comprising replaying the master and slave instruction if an error check results in an error.
25. The method of claim 21, further comprising utilizing parity information on the master instructions using at least one error check logic.
26. The method of claim 21, further comprising executing the master instruction in a data cache.
27. A system for protecting memory instructions against faults, the system comprising: a memory arbiter configured to select a memory instruction to ...

08-09-2016 publication date

REDUNDANCY METHOD AND APPARATUS FOR SHADER COLUMN REPAIR

Number: US20160260192A1
Assignee: Advanced Micro Devices, Inc.

Methods, systems and non-transitory computer readable media are described. A system includes a shader pipe array, a redundant shader pipe array, a sequencer and a redundant shader switch. The shader pipe array includes multiple shader pipes, each of which perform rendering calculations on data provided thereto. The redundant shader pipe array also performs rendering calculations on data provided thereto. The sequencer identifies at least one defective shader pipe in the shader pipe array, and, in response, generates a signal. The redundant shader switch receives the generated signal, and, in response, transfers the data destined for each shader pipe identified as being defective independently to the redundant shader pipe array. 1. A system comprising:a shader pipe array comprising a plurality of shader pipes, each of the plurality of shader pipes being configured to perform rendering calculations on data provided thereto;a redundant shader pipe array configured to perform rendering calculations on data provided thereto;a sequencer configured to identify at least one defective shader pipe of the plurality of shader pipes in the shader pipe array, and, in response to identifying the at least one defective shader pipe, generate a signal; and receive the generated signal, and', 'in response to receiving the generated signal, transfer the data destined for each shader pipe identified as being defective independently to the redundant shader pipe array., 'a redundant shader switch configured to2. The system of claim 1 , wherein the redundant shader switch is further configured to transfer the data destined for each shader pipe identified as being defective without transferring the data destined to all other shader pipes in the shader pipe array that were not identified as being defective.3. The system of claim 1 , wherein the redundant shader switch is further configured to directly switch the data destined for each shader pipe identified as being defective via at least ...

11-12-2014 publication date

GRAPHICS PROCESSING HARDWARE FOR USING COMPUTE SHADERS AS FRONT END FOR VERTEX SHADERS

Number: US20140362102A1
Assignee:

A GPU is configured to read and process data produced by a compute shader via the one or more ring buffers and pass the resulting processed data to a vertex shader as input. The GPU is further configured to allow the compute shader and vertex shader to write through a cache. Each ring buffer is configured to synchronize the compute shader and the vertex shader to prevent processed data generated by the compute shader that is written to a particular ring buffer from being overwritten before the data is accessed by the vertex shader. It is emphasized that this abstract is provided to comply with the rules requiring an abstract that will allow a searcher or other reader to quickly ascertain the subject matter of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. 1. A graphics processing system , comprising:a graphics processor unit (GPU);a cache implemented on the GPU;one or more ring buffers implemented by the GPU;a compute shader configured to run on the GPU; anda vertex shader configured to run on the GPU,wherein the GPU is configured to read and process data produced by the compute shader via the one or more ring buffers and pass the resulting processed data to the vertex shader as input,wherein the GPU is further configured to allow the one or more compute shaders and the one or more vertex shaders to read and write through the cache, wherein the one or more ring buffers are configured to synchronize the compute shader and the vertex shader to prevent processed data generated by the compute shader that is written to a particular ring buffer of the one or more ring buffers from being overwritten in the particular ring buffer before the data is accessed by the vertex shader.2. The system of wherein the one or more ring buffers include an index ring buffer and the data includes one or more indices for one or more vertices of one or more polygons.3. The system of claim 2 , ...

27-09-2018 publication date

SINGLE PASS FLEXIBLE SCREEN/SCALE RASTERIZATION

Number: US20180276790A1
Assignee:

An apparatus, such as a head mounted device (HMD), includes one or more processors configured to implement a graphics pipeline that renders pixels in window space with a nonuniform pixel spacing. The apparatus also includes a first distortion function that maps the non-uniformly spaced pixels in window space to uniformly spaced pixels in raster space. The apparatus further includes a scan converter configured to sample the pixels in window space through the first distortion function. The scan converter is configured to render display pixels used to generate an image for display to a user based on the uniformly spaced pixels in raster space. In some cases, the pixels in the window space are rendered such that a pixel density per subtended area is constant across the user's field of view. 1. A method comprising:rendering, in a graphics pipeline, pixels in window space with a non-uniform pixel spacing;sampling, with a scan converter, the pixels in window space through a distortion function that maps the non-uniformly spaced pixels in window space to uniformly spaced pixels in raster space; andgenerating an image for display to a user using display pixels that are rendered by the scan converter based on the uniformly spaced pixels in raster space.2. The method of claim 1 , wherein rendering the pixels in the window space with the non-uniform pixel spacing comprises rendering the pixels in the window space such that a pixel density per subtended area is constant across a field of view of the user.3. The method of claim 2 , wherein rendering the pixels in the window space comprises rendering the pixels in the window space such that the pixel density per subtended angle is relatively high in a fovea region associated with the user and relatively low on a visual periphery associated with the user.4. The method of claim 1 , wherein sampling the pixels in window space through the distortion function comprises sampling the pixels in window space through a vertical distortion ...

12-09-2019 publication date

SOFTWARE-CONTROLLED VARIABLE WAVEFRONT SIZE EXECUTION AT GPU

Number: US20190278605A1
Assignee:

A system includes a processor configured to operate in at least a first mode and a second mode. In the first mode the first processor operates to execute an instruction for an entire wavefront before executing a next instruction for the entire wavefront. In the second mode the processor operates to execute a set instructions for a portion of a wavefront before executing the set instructions for another portion of the same wavefront. The system further includes a memory coupled to the processor. The memory is configured to store a shader program for execution by the processor, wherein the shader program includes at least one indication associated with one of the first mode or the second mode. The processor is further to implement one of the first mode or the second mode while executing the shader program responsive to the at least one indication present in the first shader program. 1. A computer-implemented method comprising:providing a first processor configurable to operate in at least a first mode and a second mode, wherein in the first mode the first processor operates to execute an instruction for an entire wavefront before executing a next instruction for the entire wavefront and in the second mode the first processor operates to execute a set of instructions for a portion of a wavefront before executing the set of instructions for another portion of the same wavefront;configuring a shader program to be executed by the first processor to include at least one indication associated with one of the first mode or the second mode; andexecuting the shader program at the first processor, wherein a mode of operation of the first processor during execution of the shader program is responsive to the at least one indication present in the shader program.2. The method of claim 1 , wherein the at least one indication includes a command for the first processor to operate in the second mode.3. The method of claim 1 , wherein configuring the shader program comprises:analyzing, ...

17-09-2020 publication date

Processing unit with mixed precision operations

Number: US20200293286A1
Assignee: Advanced Micro Devices Inc

A graphics processing unit (GPU) implements operations, with associated op codes, to perform mixed precision mathematical operations. The GPU includes an arithmetic logic unit (ALU) with different execution paths, wherein each execution path executes a different mixed precision operation. By implementing mixed precision operations at the ALU in response to designate op codes that delineate the operations, the GPU efficiently increases the precision of specified mathematical operations while reducing execution overhead.

17-09-2020 publication date

PIPELINE INCLUDING SEPARATE HARDWARE DATA PATHS FOR DIFFERENT INSTRUCTION TYPES

Number: US20200293329A1
Assignee:

A processing element is implemented in a stage of a pipeline and configured to execute an instruction. A first array of multiplexers is to provide information associated with the instruction to the processing element in response to the instruction being in a first set of instructions. A second array of multiplexers is to provide information associated with the instruction to the first processing element in response to the instruction being in a second set of instructions. A control unit is to gate at least one of power or a clock signal provided to the first array of multiplexers in response to the instruction being in the second set.

01-11-2018 publication date

MEMORY PROTECTION IN HIGHLY PARALLEL COMPUTING HARDWARE

Number: US20180314579A1
Assignee: Advanced Micro Devices, Inc.

Techniques for handling memory errors are disclosed. Various memory units of an accelerated processing device (“APD”) include error units for detecting errors in data stored in the memory (e.g., using parity protection or error correcting code). Upon detecting an error considered to be an “initial uncorrectable error,” the error unit triggers transmission of an initial uncorrectable error interrupt (“IUE interrupt”) to a processor. This IUE interrupt includes information identifying the specific memory unit in which the error occurred (and possibly other information about the error). A halt interrupt is generated and transmitted to the processor in response to the data having the error being consumed (i.e., used by an operation such as an instruction or command), which causes the APD to halt operations. If the data having the error is not consumed, then the halt interrupt is never generated (that the error occurred may remain logged, however).
1. A method for handling errors that occur in memory of an accelerated processing device (“APD”), the method comprising: detecting a first error in first data in a first memory unit of the APD, the first error being an initial uncorrectable error; transmitting a first initial uncorrectable error interrupt (“IUE interrupt”) to a processor coupled to the APD, the first IUE interrupt including information identifying the first memory unit as the memory unit in which the first error occurs; forwarding the first data with an indication that the first data includes the first error to a second memory unit; and upon detecting that the first data is consumed, transmitting a first halt interrupt to the processor and halting operations on the APD.
2. The method of claim 1, further comprising: detecting a second error in a second memory unit of the APD, the second memory unit being the first memory unit or another memory unit of the APD, the second error being an initial uncorrectable error; determining that the second error is considered ...
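
The key property is that the IUE interrupt fires at detection time while the halt is deferred until the poisoned data is actually consumed. A Python sketch of that two-stage behavior, assuming a simple poison set keyed by address; APDMemory and its methods are illustrative names only.

# Illustrative sketch: interrupt on detection, halt only on consumption.
class APDMemory:
    def __init__(self):
        self.data = {}
        self.poisoned = set()

    def write(self, addr, value, error=False):
        self.data[addr] = value
        if error:                     # detection: interrupt now, no halt yet
            self.poisoned.add(addr)
            print(f"IUE interrupt: uncorrectable error in unit holding {hex(addr)}")

    def consume(self, addr):
        if addr in self.poisoned:     # consumption: only now does the APD halt
            print(f"halt interrupt: poisoned data at {hex(addr)} consumed")
        return self.data[addr]

mem = APDMemory()
mem.write(0x100, 42, error=True)      # IUE interrupt fires, execution continues
mem.write(0x200, 7)
print(mem.consume(0x200))             # clean data: no halt
mem.consume(0x100)                    # halt interrupt: the error was consumed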

19-11-2015 publication date

REDUNDANCY METHOD AND APPARATUS FOR SHADER COLUMN REPAIR

Number: US20150332427A1
Assignee: Advanced Micro Devices, Inc.

Methods, systems and non-transitory computer readable media are described. A system includes a shader pipe array, a redundant shader pipe array, a sequencer and a redundant shader switch. The shader pipe array includes multiple shader pipes, each of which perform rendering calculations on data provided thereto. The redundant shader pipe array also performs rendering calculations on data provided thereto. The sequencer identifies at least one defective shader pipe in the shader pipe array, and, in response, generates a signal. The redundant shader switch receives the generated signal, and, in response, transfers the data destined for each shader pipe identified as being defective independently to the redundant shader pipe array.

08-11-2018 publication date

POLICIES FOR SHADER RESOURCE ALLOCATION IN A SHADER CORE

Number: US20180321946A1
Assignee: Advanced Micro Devices, Inc.

A method for use in a processor for arbitrating between multiple processes to select wavefronts for execution on a shader core is provided. The processor includes a compute pipeline configured to issue wavefronts to the shader core for execution, a hardware queue descriptor associated with the compute pipeline, and the shader core. The shader core is configured to execute work for the compute pipeline corresponding to a first memory queue descriptor executed using data for the first memory queue descriptor that is loaded into a first hardware queue descriptor. The processor is configured to detect a context switch condition, and, responsive to the context switch condition, perform a context switch operation including loading data for a second memory queue descriptor into the first hardware queue descriptor. The shader core is configured to execute work corresponding to the second memory queue descriptor that is loaded into the first hardware queue descriptor. 1. A method for arbitrating between multiple processes to select wavefronts for execution on a shader core , the method comprising:executing work, for a first compute pipeline configured to issue wavefronts to the shader core for execution, the work corresponding to a first memory queue descriptor executed using data for the first memory queue descriptor that is loaded into a first hardware queue descriptor;detecting a context switch condition;responsive to the context switch condition, performing a context switch operation, the context switch operation including loading data for a second memory queue descriptor into the first hardware queue descriptor; andexecuting work corresponding to the second memory queue descriptor using the data for the second memory queue descriptor that is loaded into the first hardware queue descriptor.2. The method of claim 1 , wherein the context switch condition includes:a memory queue descriptor having a higher priority than the first memory descriptor becoming ready for ...

17-10-2019 publication date

SPLIT FRAME RENDERING

Number: US20190318527A1
Assignee: Advanced Micro Devices, Inc.

Improvements in the graphics processing pipeline that allow multiple pipelines to cooperate to render a single frame are disclosed. Two approaches are provided. In a first approach, world-space pipelines for the different graphics processing pipelines process all work for draw calls received from a central processing unit (CPU). In a second approach, the world-space pipelines divide up the work. Work that is divided is synchronized and redistributed at various points in the world-space pipeline. In either approach, the triangles output by the world-space pipelines are distributed to the screen-space pipelines based on the portions of the render surface overlapped by the triangles. Triangles are rendered by screen-space pipelines associated with the render surface portions overlapped by those triangles. 1. A method for sharing graphics processing work among multiple accelerated processing devices (“APD”) of a set of APDs , the method comprising:transmitting a set of draw calls to each APD of the set of APDs;splitting the set of draw calls into a set of primitive groups;for each primitive group of the set of primitive groups, designating an input assembler to receive that primitive group, wherein for each primitive group, the designated input assembler is the same for each APD;at each APD, for a first set of primitive groups designated to be received by input assemblers of that APD, transmitting the first set of primitive groups to the designated input assemblers; andat each APD, for a second set of primitive groups designated to be received by input assemblers outside of that APD, discarding the second set of primitive groups.2. The method of claim 1 , wherein:designating the input assemblers to receive the primitive groups occurs in a round robin fashion with respect to the input assemblers.3. The method of claim 1 , further comprising:distributing the primitive groups, from each input assembler, to one or more attached vertex, geometry, and tessellation units for ...

More details
13-12-2018 publication date

Stream processor with high bandwidth and low power vector register file

Number: US20180357064A1
Assignee: Advanced Micro Devices Inc

Systems, apparatuses, and methods for implementing a high bandwidth, low power vector register file for use by a parallel processor are disclosed. In one embodiment, a system includes at least a parallel processing unit with a plurality of processing pipelines. The parallel processing unit includes a vector arithmetic logic unit and a high bandwidth, low power vector register file. The vector register file includes multi-bank high density random-access memories (RAMs) to satisfy register bandwidth requirements. The parallel processing unit also includes an instruction request queue and an instruction operand buffer to provide enough local bandwidth for VALU instructions and vector I/O instructions. Also, the parallel processing unit is configured to leverage the RAMs' output flops as a last level cache to reduce duplicate operand requests between multiple instructions. The parallel processing unit includes a vector destination cache to provide additional R/W bandwidth for the vector register file.
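The operand-reuse idea can be modeled in a few lines; the VectorRegisterFile class below and its four-entry "output flop" cache are hypothetical software stand-ins for the hardware described, not the patent's design.

#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

// Hypothetical model: reuse the register RAM's output flops as a tiny
// last-level cache, so a duplicate operand request between back-to-back
// instructions does not access the banked RAM again.
class VectorRegisterFile {
    std::vector<uint64_t> banks_ = std::vector<uint64_t>(256, 0);
    std::unordered_map<uint32_t, uint64_t> outputFlops_;  // last values read
    uint32_t ramReads_ = 0;
public:
    uint64_t read(uint32_t reg) {
        auto it = outputFlops_.find(reg);
        if (it != outputFlops_.end()) return it->second;  // duplicate request: no RAM access
        ++ramReads_;
        uint64_t v = banks_[reg];
        if (outputFlops_.size() >= 4) outputFlops_.clear();  // keep the "cache" flop-sized
        outputFlops_[reg] = v;
        return v;
    }
    uint32_t ramReads() const { return ramReads_; }
};

int main() {
    VectorRegisterFile vrf;
    vrf.read(3); vrf.read(3); vrf.read(7); vrf.read(3);
    std::cout << "RAM reads: " << vrf.ramReads() << "\n";  // 2, not 4
}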

More details
28-12-2017 publication date

METHOD AND PROCESSING APPARATUS FOR GATING REDUNDANT THREADS

Number: US20170371393A1
Assignee:

Described is a method and processing apparatus to improve power efficiency by gating redundant thread processing. In particular, the method for gating redundant threads in a graphics processor includes determining whether data for a thread and data for at least another thread are within a predetermined similarity threshold, gating execution of the at least another thread if the data for the thread and the data for the at least another thread are within the predetermined similarity threshold, and using output data from the thread as output data for the at least another thread.

1. A method for gating redundant thread processing in a graphics processor shader block, the method comprising: determining if data for a thread and data for at least another thread are within a predetermined similarity threshold; gating execution of the at least another thread if the data for the thread and the data for the at least another thread are within the predetermined similarity threshold; and using output data from the thread as output data for the at least another thread.

2. The method of claim 1, further comprising: disabling a redundant thread gating control circuit when a non-graphics application is running on the graphics processor.

3. The method of claim 1, further comprising: enabling a zero detection mode for sparse data, wherein detection of zero values for operands and output gates off execution of the relevant thread.

4. The method of claim 1, further comprising: generating a signal if the data for the thread and the data for the at least another thread are within the predetermined similarity threshold; and sending the signal to a clock gating circuit to trigger gating of the at least another thread.

5. The method of claim 4, further comprising: setting a multiplexor to select the output data from the thread in response to receiving the signal.

6. The method of claim 1, wherein the data for the thread and the data for the at least another thread are input data.

7. The method of ...
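As an illustration of the gating decision, here is a minimal sketch; the function names and the scalar similarity test are simplifications assumed for clarity, not the patent's circuit.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical sketch: if two threads' input data differ by less than a
// threshold, execute only one thread and broadcast its output to the other.
float expensiveShaderWork(float x) { return x * x + 1.0f; }

std::vector<float> runWithGating(const std::vector<float>& inputs, float threshold) {
    std::vector<float> out(inputs.size());
    std::vector<bool> done(inputs.size(), false);
    for (size_t i = 0; i < inputs.size(); ++i) {
        if (done[i]) continue;
        out[i] = expensiveShaderWork(inputs[i]);
        done[i] = true;
        for (size_t j = i + 1; j < inputs.size(); ++j) {
            // Gate thread j: reuse thread i's output instead of executing it.
            if (!done[j] && std::fabs(inputs[i] - inputs[j]) <= threshold) {
                out[j] = out[i];
                done[j] = true;
            }
        }
    }
    return out;
}

int main() {
    auto r = runWithGating({1.0f, 1.001f, 5.0f}, 0.01f);
    std::cout << r[0] << " " << r[1] << " " << r[2] << "\n";  // second thread was gated
}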

More details
28-12-2017 publication date

System and method for using virtual vector register files

Number: US20170371654A1

Described is a system and method for using virtual vector register files. In particular, a graphics processor includes a logic unit, a virtual vector register file coupled to the logic unit, a vector register backing store coupled to the virtual vector register file, and a virtual vector register file controller coupled to the virtual vector register file. The virtual vector register file includes an N deep vector register file and an M deep vector register file, where N is less than M. The virtual vector register file controller performs eviction and allocation among the N deep vector register file, the M deep vector register file, and the vector register backing store, dependent on at least access requests for certain vector registers.
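A software analogue of the two-level register file with a backing store might look like the following; the capacities N and M and the LRU eviction policy are illustrative assumptions, not the patent's controller logic.

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <list>
#include <unordered_map>

// Hypothetical sketch: a small fast register array (N deep) in front of a
// larger one (M deep); registers resident in neither live in the backing
// store. Accessing a non-resident register promotes it and may evict others.
class VirtualVectorRegisterFile {
    static constexpr size_t N = 4, M = 8;          // illustrative capacities
    std::list<uint32_t> fastLru_;                  // most recently used at front
    std::unordered_map<uint32_t, uint64_t> fast_, slow_, backing_;
public:
    uint64_t& access(uint32_t reg) {
        auto pos = std::find(fastLru_.begin(), fastLru_.end(), reg);
        if (pos != fastLru_.end()) {               // hit in the fast level
            fastLru_.splice(fastLru_.begin(), fastLru_, pos);
            return fast_[reg];
        }
        uint64_t v = slow_.count(reg) ? slow_[reg] : backing_[reg];
        slow_.erase(reg);
        if (fast_.size() >= N) {                   // evict LRU to the slow level
            uint32_t victim = fastLru_.back(); fastLru_.pop_back();
            if (slow_.size() >= M) {               // slow level full: spill to backing
                auto spill = slow_.begin();
                backing_[spill->first] = spill->second;
                slow_.erase(spill);
            }
            slow_[victim] = fast_[victim];
            fast_.erase(victim);
        }
        fastLru_.push_front(reg);
        return fast_[reg] = v;
    }
};

int main() {
    VirtualVectorRegisterFile vrf;
    for (uint32_t r = 0; r < 6; ++r) vrf.access(r) = r * 10;
    std::cout << vrf.access(0) << "\n";            // register 0 promoted back: prints 0
}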

More details
28-12-2017 publication date

SYSTEM AND METHOD FOR PROTECTING GPU MEMORY INSTRUCTIONS AGAINST FAULTS

Number: US20170371743A1
Assignee: Advanced Micro Devices, Inc.

A system and method for protecting memory instructions against faults are described. The system and method include converting slave instructions to dummy operations, modifying a memory arbiter to issue up to N master and N slave global/shared memory instructions per cycle, sending master memory requests to the memory system, using slave requests for error checking, entering master requests into the GM/LM FIFO, storing slave requests in a register, and comparing the entered master requests with the stored slave requests.

1. A system for protecting memory instructions against faults, the system comprising: a memory arbiter that selects a master/slave memory instruction pair to issue, wherein the master instruction and the slave instruction are executed in lockstep; a memory request that receives the master instruction from the memory arbiter; and a data cache that receives the master instruction and allows access by at least two single instruction multiple data (SIMD) processors to complete both the master and slave memory instructions.

2. The system of claim 1, wherein the data cache is one of an L1 data cache and a local data share (LDS).

3. The system of claim 1, further comprising a return that returns the instructions from the data cache to the at least two SIMDs and completes both the master and slave instructions.

4. The system of claim 1, wherein the at least two SIMDs comprise at least one pair of SIMDs arranged in a master/slave relationship.

5. The system of claim 4, wherein the memory arbiter picks a master instruction and its equivalent slave instruction issued from the pair of SIMDs arranged in a master/slave relationship.

6. The system of claim 1, further comprising: a master error check logic that receives master and slave memory instructions and performs error checking between the master and slave operations; a data error check logic that receives master and slave instruction data and performs error checking between the master and slave operations; an address coalescing logic that ...
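A minimal model of the master/slave comparison follows; the MemRequest type and the FIFO/register stand-ins for the hardware paths are hypothetical.

#include <cstdint>
#include <iostream>
#include <optional>
#include <queue>

// Hypothetical sketch: the master request enters the FIFO toward the memory
// system, the slave copy is parked in a register, and the two are compared;
// a mismatch indicates a fault in one of the lockstepped paths.
struct MemRequest {
    uint64_t address;
    uint32_t size;
    bool operator==(const MemRequest& o) const {
        return address == o.address && size == o.size;
    }
};

int main() {
    std::queue<MemRequest> gmFifo;        // master requests headed to memory
    std::optional<MemRequest> slaveReg;   // slave request held only for checking

    MemRequest master{0x1000, 16}, slave{0x1000, 16};
    gmFifo.push(master);                  // master proceeds to the memory system
    slaveReg = slave;                     // slave is used only for error checking

    if (gmFifo.front() == *slaveReg)
        std::cout << "requests match: memory access proceeds fault-free\n";
    else
        std::cout << "fault detected: master/slave mismatch\n";
}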

More details
03-12-2020 publication date

GRAPHICS CONTEXT BOUNCING

Number: US20200379767A1
Assignee:

A method of context bouncing includes receiving, at a command processor of a graphics processing unit (GPU), a conditional execute packet providing a hash identifier corresponding to an encapsulated state. The encapsulated state includes one or more context state packets following the conditional execute packet. A command packet following the encapsulated state is executed based at least in part on determining whether the hash identifier of the encapsulated state matches one of a plurality of hash identifiers of active context states currently stored at the GPU.

1. A method, comprising: receiving, at a command processor of a graphics processing unit (GPU), a conditional execute packet providing an identifier corresponding to an encapsulated state, wherein the encapsulated state includes one or more context state packets following the conditional execute packet; and executing, based at least in part on determining whether the identifier of the encapsulated state matches one of a plurality of identifiers of active context states currently stored at the GPU, a command packet following the encapsulated state.

2. The method of claim 1, wherein determining whether the identifier of the encapsulated state matches one of the plurality of identifiers comprises: querying a context manager to search an identifier table storing the plurality of identifiers of active context states currently stored at the GPU.

3. The method of claim 2, further comprising: draining, in response to determining that the identifier matches one of the plurality of identifiers of active context states, the encapsulated state; and assigning the matching one of the plurality of identifiers of active context states to the command packet following the encapsulated state.

4. The method of claim 2, further comprising: allocating, in response to determining that the identifier does not match any of the plurality of identifiers of active context states, a new context state set; and executing the one or more context state ...
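The hash-match decision reduces to a set lookup. The sketch below assumes a plain hash set in place of the GPU's identifier table and context manager; it is an illustration, not the patent's mechanism.

#include <cstdint>
#include <iostream>
#include <unordered_set>

// Hypothetical sketch: the conditional execute packet's hash is checked
// against the identifiers of context states already active on the GPU.
int main() {
    std::unordered_set<uint64_t> activeContextHashes = {0xABCD, 0x1234};

    uint64_t incomingHash = 0xABCD;  // carried by the conditional execute packet
    if (activeContextHashes.count(incomingHash)) {
        std::cout << "hash hit: drain the encapsulated state, reuse the active context\n";
    } else {
        std::cout << "hash miss: allocate a new context state set and execute the packets\n";
        activeContextHashes.insert(incomingHash);
    }
}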

More details
01-10-1999 publication date

Method and apparatus for texture level of detail dithering

Number: CA2267487A1
Assignee: Real 3D Inc

A computationally efficient method for minimizing the visible effects of texture LOD transitions across a polygon. The minimization is accomplished by adding a dithering offset value to the LOD value computed for each pixel covered by a graphics primitive to produce a dithered pixel LOD value. The dithering offsets may be generated from a table look-up based on the location of the pixel within a span of pixels. The dithered pixel LOD value is used as an index in the selection of a single LOD texture map from which a textured pixel value is retrieved. The range of dithering offset values can be adjusted by modulating the values in the table look-up.
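A compact sketch of the dithering computation follows; the 2x2 table values are made-up examples of the adjustable offsets the abstract describes.

#include <cstdint>
#include <iostream>

// Hypothetical sketch: a small table indexed by the pixel's position within
// a span supplies a dithering offset added to the computed LOD before the
// single mip level is selected. Table values are illustrative only.
const float kDitherTable[2][2] = { {0.00f, 0.50f},
                                   {0.75f, 0.25f} };

int selectMipLevel(float computedLod, uint32_t px, uint32_t py) {
    float dithered = computedLod + kDitherTable[py % 2][px % 2];
    return static_cast<int>(dithered);  // index of the single LOD texture map
}

int main() {
    // Neighboring pixels at LOD 1.4 dither between mip 1 and mip 2,
    // softening the visible LOD transition across the polygon.
    for (uint32_t y = 0; y < 2; ++y)
        for (uint32_t x = 0; x < 2; ++x)
            std::cout << "pixel(" << x << "," << y << ") -> mip "
                      << selectMipLevel(1.4f, x, y) << "\n";
}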

More details
08-12-2020 publication date

System and method for protecting GPU memory instructions against faults

Number: US10860418B2
Assignee: Advanced Micro Devices Inc

A system and method for protecting memory instructions against faults are described. The system and method include converting slave instructions to dummy operations, modifying a memory arbiter to issue up to N master and N slave global/shared memory instructions per cycle, sending master memory requests to the memory system, using slave requests for error checking, entering master requests into the GM/LM FIFO, storing slave requests in a register, and comparing the entered master requests with the stored slave requests.

More details
19-08-2015 publication date

Scalable and unified compute system

Number: EP2297723A4
Assignee: Advanced Micro Devices Inc

More details
25-06-2013 publication date

Video instruction processing of desired bytes in multi-byte buffers by shifting to matching byte location

Number: US8473721B2

Disclosed herein is a processing unit configured to process video data, and applications thereof. In an embodiment, the processing unit includes a buffer and an execution unit. The buffer is configured to store a data word, wherein the data word comprises a plurality of bytes of video data. The execution unit is configured to execute a single instruction to (i) shift bytes of video data contained in the data word to align a desired byte of video data and (ii) process the desired byte of the video data to provide processed video data.
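The shift-and-process behavior can be modeled in a few lines of C++; alignAndProcess and the brightness-add "processing" step are illustrative assumptions, not the instruction's actual semantics.

#include <cstdint>
#include <iostream>

// Hypothetical sketch: shift a packed 32-bit word so the desired byte of
// video data lands in the low lane, then process that aligned byte.
uint32_t alignAndProcess(uint32_t word, uint32_t byteIndex, uint8_t delta) {
    uint8_t b = static_cast<uint8_t>(word >> (8 * byteIndex));  // shift to align desired byte
    return static_cast<uint8_t>(b + delta);                     // process the aligned byte
}

int main() {
    uint32_t pixels = 0x40302010;  // four packed bytes of video data
    std::cout << std::hex << alignAndProcess(pixels, 2, 0x05) << "\n";  // 0x30 + 5 = 0x35
}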

More details
08-12-2020 publication date

Redundancy method and apparatus for shader column repair

Number: US10861122B2
Assignee: Advanced Micro Devices Inc

Methods, systems and non-transitory computer readable media are described. A system includes a shader pipe array, a redundant shader pipe array, a sequencer and a redundant shader switch. The shader pipe array includes multiple shader pipes, each of which performs rendering calculations on data provided thereto. The redundant shader pipe array also performs rendering calculations on data provided thereto. The sequencer identifies at least one defective shader pipe in the shader pipe array and, in response, generates a signal. The redundant shader switch receives the generated signal and, in response, independently transfers the data destined for each shader pipe identified as defective to the redundant shader pipe array.
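A toy model of the redundant shader switch follows; the defective-pipe table and the dispatch function are hypothetical illustrations of the rerouting, not the hardware design.

#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical sketch: work destined for a pipe the sequencer marked
// defective is steered to the redundant (spare) pipe instead.
struct Workload { uint32_t id; };

void dispatch(const std::vector<bool>& defective, const std::vector<Workload>& work) {
    for (size_t pipe = 0; pipe < work.size(); ++pipe) {
        if (defective[pipe])
            std::cout << "work " << work[pipe].id << ": pipe " << pipe
                      << " is defective -> redundant pipe\n";
        else
            std::cout << "work " << work[pipe].id << " -> pipe " << pipe << "\n";
    }
}

int main() {
    std::vector<bool> defective = {false, true, false, false};  // sequencer's test result
    dispatch(defective, {{10}, {11}, {12}, {13}});
}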

More details
20-08-2019 publication date

Split frame rendering

Number: US10388056B2
Assignee: Advanced Micro Devices Inc

Improvements in the graphics processing pipeline that allow multiple pipelines to cooperate to render a single frame are disclosed. Two approaches are provided. In a first approach, world-space pipelines for the different graphics processing pipelines process all work for draw calls received from a central processing unit (CPU). In a second approach, the world-space pipelines divide up the work. Work that is divided is synchronized and redistributed at various points in the world-space pipeline. In either approach, the triangles output by the world-space pipelines are distributed to the screen-space pipelines based on the portions of the render surface overlapped by the triangles. Triangles are rendered by screen-space pipelines associated with the render surface portions overlapped by those triangles.

More details
19-01-2012 publication date

Dynamic control of SIMDs

Number: WO2012009252A2
Assignee: Advanced Micro Devices, Inc.

Systems and methods to improve performance in a graphics processing unit are described herein. Embodiments achieve power saving in a graphics processing unit by dynamically activating and deactivating individual SIMDs in a shader complex that comprises multiple SIMD units. On-the-fly dynamic disabling and enabling of individual SIMDs provides flexibility in achieving a required performance and power level for a given processing application. In this way, embodiments achieve optimal usage of a graphics processing unit. Embodiments of the invention also achieve dynamic grain (e.g., medium grain) clock gating of SIMDs in a shader complex. Embodiments reduce switching power by shutting down clock trees to unused logic through a clock-on-demand mechanism. In this way, embodiments enhance clock gating to save additional switching power while SIMDs are idle (or assigned no work).
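The clock-on-demand policy can be approximated as: enable only as many SIMDs as the pending work requires and gate the rest. A sketch with made-up counts, not the actual gating hardware:

#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical sketch: SIMDs with no assigned work have their clocks gated
// off; they are re-enabled on demand as the pending workload grows.
int main() {
    const uint32_t numSimds = 8, wavesPerSimd = 4;
    std::vector<bool> clockEnabled(numSimds, false);

    uint32_t pendingWavefronts = 10;
    // Enable just enough SIMDs for the pending work (ceiling division).
    uint32_t needed = (pendingWavefronts + wavesPerSimd - 1) / wavesPerSimd;
    for (uint32_t s = 0; s < numSimds; ++s)
        clockEnabled[s] = (s < needed);

    for (uint32_t s = 0; s < numSimds; ++s)
        std::cout << "SIMD " << s << ": " << (clockEnabled[s] ? "clocked" : "gated") << "\n";
}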

More details
23-10-2002 publication date

A linear surface memory to spatial tiling algorithm/mechanism

Number: GB2374780A
Assignee: Real 3D Inc

A computer graphics system and a method of configuring data in a memory unit of a computer graphics system. Generally, the data is configured such that the number of memory page breaks is reduced when data is accessed from the memory for image computation. For example, when the memory is used to store pixel values, each page of the memory is comprised of pixel values for a rectangular, or tile, array of pixels. This increases the spatial coherence between the pixel values and the pixels of the polygons that are rasterized when the system renders an image. Preferably, a translation algorithm is provided to allow standard operating systems and software applications to work with the tiled configuration of the pixel values in the memory. This algorithm takes the first scalar memory address initially provided by the operating system or the software application and translates it to a second scalar memory address that will properly access the value for the pixel conventionally associated with the first scalar memory address.
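The translation from a linear (row-major) address to a tiled address is standard arithmetic. The sketch below assumes 8x8-pixel tiles and a surface width that is a multiple of the tile width; the tile size and function name are illustrative, not taken from the patent.

#include <cstdint>
#include <iostream>

// Hypothetical sketch: remap a linear pixel address so that each memory
// page holds a rectangular tile, improving spatial coherence.
constexpr uint32_t kTileW = 8, kTileH = 8;  // 64 pixels per tile/page (example)

uint32_t linearToTiled(uint32_t linearAddr, uint32_t surfaceWidth) {
    uint32_t x = linearAddr % surfaceWidth;
    uint32_t y = linearAddr / surfaceWidth;
    uint32_t tilesPerRow = surfaceWidth / kTileW;
    uint32_t tileIndex = (y / kTileH) * tilesPerRow + (x / kTileW);
    uint32_t withinTile = (y % kTileH) * kTileW + (x % kTileW);
    return tileIndex * (kTileW * kTileH) + withinTile;
}

int main() {
    // Two vertically adjacent pixels land in the same tile (page), whereas
    // linearly they were a full surface-width apart.
    std::cout << linearToTiled(0, 64) << " and " << linearToTiled(64, 64) << "\n";  // 0 and 8
}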

More details
11-04-2017 publication date

SIMD processing unit with local data share and access to a global data share of a GPU

Number: US9619428B2
Assignee: Advanced Micro Devices Inc

A graphics processing unit is disclosed that includes a processor with one or more SIMD processing units; a local data share corresponding to one of the one or more SIMD processing units, the local data share comprising one or more low-latency accessible memory regions for each group of threads assigned to one or more execution wavefronts; and a global data share comprising one or more low-latency memory regions for each group of threads.
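A simple model of carving the local data share into per-thread-group regions follows; all sizes are invented for illustration and are not the patent's parameters.

#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// Hypothetical sketch: the local data share is partitioned into low-latency
// regions, one per thread group resident on the SIMD unit.
int main() {
    const uint32_t ldsBytes = 32 * 1024, perGroupBytes = 4 * 1024;
    std::vector<std::pair<uint32_t, uint32_t>> regions;  // {base, size} per thread group
    for (uint32_t base = 0; base + perGroupBytes <= ldsBytes; base += perGroupBytes)
        regions.push_back({base, perGroupBytes});
    std::cout << "LDS holds regions for " << regions.size() << " thread groups\n";
}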

More details
06-08-2019 publication date

Memory protection in highly parallel computing hardware

Number: US10372522B2
Assignee: Advanced Micro Devices Inc

Techniques for handling memory errors are disclosed. Various memory units of an accelerated processing device ("APD") include error units for detecting errors in data stored in the memory (e.g., using parity protection or error correcting code). Upon detecting an error considered to be an "initial uncorrectable error," the error unit triggers transmission of an initial uncorrectable error interrupt ("IUE interrupt") to a processor. This IUE interrupt includes information identifying the specific memory unit in which the error occurred (and possibly other information about the error). A halt interrupt is generated and transmitted to the processor in response to the data having the error being consumed (i.e., used by an operation such as an instruction or command), which causes the APD to halt operations. If the data having the error is not consumed, then the halt interrupt is never generated (the occurrence of the error may remain logged, however).
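The two-interrupt flow can be sketched as poison tracking: the IUE interrupt fires at detection, and the halt interrupt only when the poisoned data is actually consumed. All names below are hypothetical.

#include <cstdint>
#include <iostream>
#include <unordered_set>

// Hypothetical sketch: addresses with uncorrectable errors are "poisoned";
// consuming poisoned data raises the halt, untouched errors only stay logged.
std::unordered_set<uint64_t> poisonedAddresses;

void onErrorDetected(uint64_t addr, const char* memoryUnit) {
    poisonedAddresses.insert(addr);
    std::cout << "IUE interrupt: uncorrectable error in " << memoryUnit << "\n";
}

bool consume(uint64_t addr) {
    if (poisonedAddresses.count(addr)) {
        std::cout << "halt interrupt: erroneous data consumed, halting APD\n";
        return false;
    }
    return true;  // data is clean; execution continues
}

int main() {
    onErrorDetected(0x2000, "LDS");
    consume(0x1000);  // untouched by the error: no halt
    consume(0x2000);  // the poisoned data is used: halt is raised
}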

More details