Total found: 35. Displayed: 33.

Publication date: 17-03-2022

HYBRID WILDCARD MATCH TABLE

Number: US20220086089A1
Assignee:

Embodiments of the present invention are directed to a wildcard matching solution that uses a combination of static random access memories (SRAMs) and ternary content addressable memories (TCAMs) in a hybrid solution. In particular, the wildcard matching solution uses a plurality of SRAM pools for lookup and a spillover TCAM pool for unresolved hash conflicts.

1. A network switch comprising:
a static random access memory (SRAM) entry table including a plurality of entries, wherein each of the entries comprises: a code field including one or more codes; and a value field including rule comparison data;
one or more SRAM pools;
at least one spillover ternary content addressable memory (TCAM) pool; and
a non-transitory computer readable medium storing request interface control logic dispatching a search key to one or more of the SRAM pools and the at least one spillover TCAM pool and determining if the search key is a match for one or more of the entries based on which of the codes are satisfied by the search key and which of the codes are not satisfied by the search key.
2. The network switch of claim 1, wherein each of the codes occupies either 2 bytes or 3 bytes of the code field.
3. The network switch of claim 2, wherein at least one of the codes is an equal-match type code or a not-equal-match type code, the at least one of the codes comprising: an identifier that distinguishes the at least one of the codes from other types of the codes; a nibble index that identifies a location of match data to compare within both the search key and the value field; and a bit length that indicates a number of bits after the location within both the search key and the value field to compare as the data, wherein the at least one of the codes relates to determining if the match data of the search key is the same as the match data from the value field.
4. The network switch of claim 3, wherein at least another one of the codes is an in-range ...
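
The pool-plus-spillover flow lends itself to a small software model. The sketch below is illustrative only: the class names, the 64-bit key width, and the use of Python's built-in hash in place of independent per-pool hash functions are all assumptions, not details from the patent.

```python
# Illustrative software model of the hybrid table. SRAM pools are
# direct-mapped hash buckets; an entry that collides in every pool
# spills over into a TCAM-like pool that is searched linearly with
# ternary (value, care-mask) matching.

class SpilloverTcam:
    def __init__(self):
        self.entries = []            # list of (value, care_mask, rule)

    def add(self, value, care_mask, rule):
        self.entries.append((value, care_mask, rule))

    def lookup(self, key):
        for value, care_mask, rule in self.entries:
            if (key & care_mask) == (value & care_mask):
                return rule
        return None

class HybridMatchTable:
    def __init__(self, num_pools=4, buckets_per_pool=256):
        self.pools = [[None] * buckets_per_pool for _ in range(num_pools)]
        self.buckets = buckets_per_pool
        self.tcam = SpilloverTcam()

    def _bucket(self, pool_idx, key):
        # A real switch would use independent hardware hash functions.
        return hash((pool_idx, key)) % self.buckets

    def insert(self, key, rule):
        for i, pool in enumerate(self.pools):
            b = self._bucket(i, key)
            if pool[b] is None:
                pool[b] = (key, rule)
                return
        # Unresolved hash conflict: spill into the TCAM pool (an exact
        # match is a ternary match with a full care mask).
        self.tcam.add(key, care_mask=(1 << 64) - 1, rule=rule)

    def lookup(self, key):
        # Dispatch the search key to every SRAM pool plus the spillover TCAM.
        for i, pool in enumerate(self.pools):
            entry = pool[self._bucket(i, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return self.tcam.lookup(key)

table = HybridMatchTable()
table.insert(0xDEADBEEF, "permit")
print(table.lookup(0xDEADBEEF))      # -> permit
```

The point of the spillover pool shows up in insert: an entry reaches the expensive, linearly searched TCAM only after every SRAM pool's bucket for it is already taken.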

Publication date: 01-04-2021

TRANSPOSE OPERATIONS USING PROCESSING ELEMENT ARRAY

Number: US20210096823A1
Assignee:

Provided are integrated circuits and methods for transposing a tensor using processing element array operations. In some cases, it may be necessary to transpose elements of a tensor to perform a matrix operation. The tensor may be decomposed into blocks of data elements having dimensions consistent with the dimensions of a systolic array. An identity multiplication may be performed on each block of data elements loaded into a systolic array and the multiplication products summed in column partitions of a results buffer. The data elements in the column partitions of the results buffer can then be mapped to row partitions of a buffer memory for further processing.

1. An integrated circuit device, comprising:
a state buffer operable to receive data elements of a tensor;
a results buffer; and
a processing element array,
the integrated circuit device operable to: map a block of data elements of the tensor to a number of row partitions of the state buffer, each row partition having a number of columns; load the data elements into the processing element array in corresponding rows and columns of processing elements; perform a series of multiplication operations with an identity matrix such that each column of the identity matrix is sequentially multiplied by each column of data elements in the processing element array; after each multiplication operation, sum multiplication products for each column of the processing element array that performs a multiplication operation with a column of the identity matrix; store, in a corresponding column partition of the results buffer, the summed multiplication products for each column of the processing element array that performs a multiplication operation, the results buffer having a same number of column partitions as columns in the processing element array, wherein the summed multiplication products for subsequent multiplication operations are stored in subsequent rows for each corresponding column partition, and wherein the results buffer has a same number of rows as ...
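
The identity-multiplication trick can be checked numerically. The following NumPy sketch (the block size, function names, and buffer layout are assumptions) performs the per-column identity multiplications, accumulates the per-column sums into a results buffer, and reads the column partitions back as rows:

```python
import numpy as np

def transpose_block_via_identity(block):
    """Transpose one PE-array-sized block using identity multiplication.

    Multiplying identity column `step` against the loaded data selects
    row `step` of the block; the per-column sums land in the results
    buffer, whose column partitions are then read back as rows.
    """
    rows, cols = block.shape
    identity = np.eye(rows)
    results_buffer = np.zeros((rows, cols))
    for step in range(rows):
        # One multiplication operation: identity column `step` against
        # every column of data held in the processing element array.
        products = identity[:, step][:, None] * block
        # Summing down each PE-array column extracts block[step, :].
        results_buffer[step, :] = products.sum(axis=0)
    # Column partitions of the results buffer map to row partitions of
    # the destination buffer, completing the transpose.
    return results_buffer.T

def transpose_tensor(tensor, array_size=128):
    """Transpose a 2-D tensor block-by-block (array_size is assumed)."""
    r, c = tensor.shape
    out = np.zeros((c, r), dtype=tensor.dtype)
    for i in range(0, r, array_size):
        for j in range(0, c, array_size):
            blk = tensor[i:i + array_size, j:j + array_size]
            out[j:j + blk.shape[1], i:i + blk.shape[0]] = \
                transpose_block_via_identity(blk)
    return out

x = np.arange(12.0).reshape(3, 4)
assert np.array_equal(transpose_tensor(x, array_size=2), x.T)
```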

Publication date: 01-04-2021

TRANSPOSED CONVOLUTION USING SYSTOLIC ARRAY

Number: US20210097375A1
Assignee:

In one example, a neural network accelerator can execute a set of instructions to: load a first weight data element from a memory into a systolic array, the first weight data element having first coordinates; extract, from the instructions, information indicating a first subset of input data elements to be obtained from the memory, the first subset being based on a stride of a transposed convolution operation and second coordinates of the first weight data element in a rotated array of weight data elements; based on the information, obtain the first subset of input data elements from the memory; load the first subset of input data elements into the systolic array; and control the systolic array to perform first computations based on the first weight data element and the first subset of input data elements to generate output data elements of an array of output data elements.

1. A method for performing a transposed convolution operation in a neural network accelerator, comprising:
obtaining, from a memory, a first weight data element of an array of weight data elements, wherein the obtaining is based on first coordinates of the first weight data element in the array of weight data elements;
loading the first weight data element into a systolic array of the neural network accelerator;
receiving a selection of a first subset of input data elements of an array of input data elements to multiply with the first weight data element, the first subset being selected based on second coordinates of the first weight data element in a 180-degree rotated version of the array of weight data elements, and on a stride of the transposed convolution operation;
streaming each input data element of the first subset, starting from a first source address from the memory, into the systolic array to compute first partial sums;
obtaining, from the memory, a second weight data element of the array of weight data elements, wherein the obtaining is based on third coordinates of the second weight data ...
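
A 1-D model makes the role of the rotated kernel and the stride concrete. The sketch below (the function names and the 1-D simplification are mine, not the patent's) checks that computing a transposed convolution as a sliding-window pass of the 180-degree rotated kernel over a zero-upsampled input matches the direct scatter definition:

```python
import numpy as np

def transposed_conv1d_reference(x, w, stride):
    """Direct scatter definition: out[i*stride + j] += x[i] * w[j]."""
    n, k = len(x), len(w)
    out = np.zeros(stride * (n - 1) + k, dtype=x.dtype)
    for i in range(n):
        for j in range(k):
            out[i * stride + j] += x[i] * w[j]
    return out

def transposed_conv1d_rotated(x, w, stride):
    """Transposed convolution as an ordinary sliding-window pass over a
    zero-upsampled input, using the 180-degree rotated kernel. Each
    weight's coordinate in the rotated kernel, together with the stride,
    fixes which input elements it touches: the selection the
    accelerator instructions encode per weight element."""
    n, k = len(x), len(w)
    # Insert (stride - 1) zeros between input elements.
    up = np.zeros(stride * (n - 1) + 1, dtype=x.dtype)
    up[::stride] = x
    # Pad so every output position sees a full window.
    up = np.pad(up, k - 1)
    w_rot = w[::-1]                  # 180-degree rotation (reversal in 1-D)
    out = np.zeros(stride * (n - 1) + k, dtype=x.dtype)
    for m in range(k):               # one pass per rotated weight element
        out += w_rot[m] * up[m:m + len(out)]
    return out

x = np.array([1., 2., 3.])
w = np.array([1., 10., 100.])
assert np.allclose(transposed_conv1d_reference(x, w, 2),
                   transposed_conv1d_rotated(x, w, 2))
```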

Publication date: 27-05-2021

HIERARCHICAL PARTITIONING OF OPERATORS

Number: US20210158131A1
Assignee:

Methods and apparatuses for hierarchical partitioning of operators of a neural network for execution on an acceleration engine are provided. Neural networks are built in machine learning frameworks using neural network operators. The neural network operators are compiled into executable code for the acceleration engine. Development of new framework-level operators can outpace the ability to map the newly developed framework-level operators onto the acceleration engine. To enable such neural networks to be executed on an acceleration engine, hierarchical partitioning can be used to partition the operators of the neural network. The hierarchical partitioning can identify operators that are supported by a compiler for execution on the acceleration engine, operators to be compiled for execution on a host processor, and operators to be executed on the machine learning framework.

1. A method for hierarchical partitioning of operators of a neural network for compiling on an acceleration engine, the method comprising:
obtaining a set of neural network operators of a neural network generated by a machine learning framework;
comparing a list of neural network operators supported by a compiler with the set of neural network operators of the neural network;
partitioning, based on the comparison, the set of neural network operators of the neural network into a set of first neural network operators that are not supported by the compiler and a set of second neural network operators that are supported by the compiler;
providing the set of second neural network operators to the compiler; and
partitioning the set of second neural network operators into a set of third neural network operators capable of execution on the acceleration engine and a set of fourth neural network operators to be executed by a host processor.
2. The method of claim 1, further comprising partitioning the set of third neural network operators into a set of fifth neural network operators to be executed on the host ...
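
A minimal sketch of the two-level split, assuming hypothetical operator names and support sets:

```python
# Operators unknown to the compiler stay in the framework; the rest are
# split between the acceleration engine and the host processor.

def partition_operators(ops, compiler_supported, accelerator_supported):
    framework_ops = [op for op in ops if op not in compiler_supported]
    compiled_ops = [op for op in ops if op in compiler_supported]
    accel_ops = [op for op in compiled_ops if op in accelerator_supported]
    host_ops = [op for op in compiled_ops if op not in accelerator_supported]
    return framework_ops, accel_ops, host_ops

ops = ["conv2d", "matmul", "custom_attention", "topk"]
fw, accel, host = partition_operators(
    ops,
    compiler_supported={"conv2d", "matmul", "topk"},
    accelerator_supported={"conv2d", "matmul"},
)
print(fw, accel, host)   # ['custom_attention'] ['conv2d', 'matmul'] ['topk']
```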

Publication date: 27-05-2021

EFFICIENT UTILIZATION OF PROCESSING ELEMENT ARRAY

Number: US20210158132A1
Assignee:

A computer-implemented method includes receiving a neural network model for implementation using a processing element array, where the neural network model includes a convolution operation on a set of input feature maps and a set of filters. The method also includes determining, based on the neural network model, that the convolution operation utilizes less than a threshold number of rows in the processing element array for applying a set of filter elements to the set of input feature maps, where the set of filter elements includes one filter element in each filter of the set of filters. The method further includes generating, for the convolution operation and based on the neural network model, a first instruction and a second instruction for execution by respective rows in the processing element array, where the first instruction and the second instruction use different filter elements of a filter in the set of filters.

1. A computer-implemented method comprising:
receiving a neural network model for implementing using a neural network accelerator that includes a first number of rows of processing elements, the neural network model including a network layer that includes a convolution operation for generating an output feature map using a second number of input feature maps and a set of filters;
determining that the second number is equal to or less than a half of the first number;
adding operations to the neural network model, the operations including: padding the second number of input feature maps with padding data to generate padded input feature maps; dividing each of the padded input feature maps into partitions; and dividing the convolution operation into sub-operations based on the partitions;
generating, based on the neural network model, instructions for execution by the neural network accelerator to implement the convolution operation;
detecting, from the instructions, a first instruction and a second instruction that both use a first partition in the ...
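
The trigger condition and the split can be sketched as a small planning function. The numbers and the fixed two-way split below are hypothetical; the patent's method derives the partitioning from the model itself:

```python
# If the input feature maps fill at most half of the PE-array rows,
# split each (padded) map into partitions so the sub-operations can
# occupy otherwise-idle rows concurrently.

def plan_convolution(num_input_maps, pe_rows, num_partitions=2):
    if num_input_maps <= pe_rows // 2:
        rows_used = num_input_maps * num_partitions
        return {"split": True,
                "partitions_per_map": num_partitions,
                "rows_used": min(rows_used, pe_rows)}
    return {"split": False, "rows_used": num_input_maps}

print(plan_convolution(num_input_maps=3, pe_rows=128))
# {'split': True, 'partitions_per_map': 2, 'rows_used': 6}
```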

Publication date: 12-05-2016

HYBRID WILDCARD MATCH TABLE

Number: US20160134537A1
Assignee:

Embodiments of the present invention are directed to a wildcard matching solution that uses a combination of static random access memories (SRAMs) and ternary content addressable memories (TCAMs) in a hybrid solution. In particular, the wildcard matching solution uses a plurality of SRAM pools for lookup and a spillover TCAM pool for unresolved hash conflicts.

1. A network switch comprising:
an SRAM entry table including a plurality of entries that are each associated with a different matching rule, wherein each of the entries comprises: a code field including one or more codes; a value field including rule comparison data; and a priority field that indicates a priority of the entry with respect to other of the plurality of entries;
a plurality of SRAM pools;
at least one spillover TCAM pool; and
a request interface control logic dispatching a search key to one or more active pools of the plurality of SRAM pools and the at least one spillover TCAM pool and returning results data that is based on whether the search key matched the rule of one or more of the entries.
2. The network switch of claim 1, wherein each of the codes occupies either 2 bytes or 3 bytes of the code field.
3. The network switch of claim 2, wherein at least one of the codes is an equal-match type code or a not-equal-match type code, the at least one of the codes comprising: an identifier that distinguishes the at least one of the codes from other types of the codes; a nibble index that identifies a location of match data to compare within both the search key and the value field; and a bit length that indicates a number of bits after the location within both the search key and the value field to compare as the data, wherein the at least one of the codes relates to determining if the match data of the search key is the same as the match data from the value field.
4. The network switch of claim 3, wherein at least another one of the codes is an in-range type code or a not-in-range type code ...

Publication date: 15-09-2022

DILATED CONVOLUTION USING SYSTOLIC ARRAY

Number: US20220292163A1
Assignee:

In one example, a non-transitory computer readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to: load a first weight data element of an array of weight data elements from a memory into a systolic array; select a subset of input data elements from the memory into the systolic array to perform first computations of a dilated convolution operation, the subset being selected based on a rate of the dilated convolution operation and coordinates of the weight data element within the array of weight data elements; and control the systolic array to perform the first computations based on the first weight data element and the subset to generate first output data elements of an output data array. An example of a compiler that generates the instructions is also provided.

1. A method comprising:
loading a first weight data element of a set of weight data elements from a memory into a systolic array;
determining a first set of memory fetch parameters based on a first computation instruction, the first set of memory fetch parameters including a first start address of a first subset of a set of input data elements in the memory, a gap between elements of the first subset in the memory, and a number of the first subset;
controlling a memory access circuit using the first set of memory fetch parameters to fetch the first subset from the memory to the systolic array; and
controlling the systolic array to perform first computations based on the first weight data element and the first subset to compute first partial sums.
2. The method of claim 1, further comprising:
loading a second weight data element of the set of weight data elements from the memory into the systolic array;
determining a second set of memory fetch parameters based on a second computation instruction, the second set of memory fetch parameters including a second start address of a second subset of the set of input data elements in the memory, ...
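
The three memory fetch parameters map cleanly onto a strided slice. The 1-D sketch below (the helper names and the valid-padding assumption are mine) derives the start address from the weight's coordinate and the rate, uses the stride as the gap, and checks the result against the direct definition of dilated convolution:

```python
import numpy as np

def fetch_params_for_weight(j, rate, stride, in_len, k):
    """Memory-fetch parameters for 1-D dilated convolution: start
    address = j * rate (the weight's dilated offset), gap = stride,
    count = number of output elements."""
    out_len = (in_len - ((k - 1) * rate + 1)) // stride + 1
    return {"start": j * rate, "gap": stride, "count": out_len}

def dilated_conv1d(x, w, rate, stride=1):
    out = None
    for j, wj in enumerate(w):
        p = fetch_params_for_weight(j, rate, stride, len(x), len(w))
        # Strided gather from the computed start address.
        subset = x[p["start"]:p["start"] + p["gap"] * p["count"]:p["gap"]]
        out = wj * subset if out is None else out + wj * subset
    return out

x = np.arange(10.0)
w = np.array([1.0, 10.0, 100.0])
# Reference definition: out[t] = sum_j x[t*stride + j*rate] * w[j]
ref = np.array([sum(x[t + j * 2] * w[j] for j in range(3))
                for t in range(6)])
assert np.allclose(dilated_conv1d(x, w, rate=2), ref)
```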

Publication date: 12-08-2021

Neural network operation reordering for parallel execution

Number: US20210247984A1
Assignee: Amazon Technologies Inc

Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.
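
A toy version of the pass: when two consecutive operations occupy the same execution engine, hoist a later independent operation that targets a different engine between them. The engine names, the Op structure, and the single-pass strategy are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    engine: str                      # e.g. "pe_array", "activation"
    deps: set = field(default_factory=set)

def reorder_for_overlap(ops):
    ops = list(ops)
    i = 0
    while i + 1 < len(ops):
        a, b = ops[i], ops[i + 1]
        if a.engine == b.engine:     # potential serialization stall
            for j in range(i + 2, len(ops)):
                c = ops[j]
                skipped = {o.name for o in ops[i + 1:j]}
                # c may move up only if it depends on nothing it would
                # jump over.
                if c.engine != a.engine and not (c.deps & skipped):
                    ops.insert(i + 1, ops.pop(j))
                    break
        i += 1
    return ops

prog = [Op("matmul0", "pe_array"),
        Op("matmul1", "pe_array"),
        Op("relu0", "activation", deps={"matmul0"})]
print([o.name for o in reorder_for_overlap(prog)])
# ['matmul0', 'relu0', 'matmul1']  -> relu overlaps with matmul1
```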

Publication date: 31-12-2020

TRANSPOSE OPERATIONS USING PROCESSING ELEMENT ARRAY

Number: US20200409664A1
Assignee:

Provided are systems and methods for transposing a tensor using processing element array operations. In some cases, it may be necessary to transpose elements of a tensor to perform a matrix operation. The tensor may be decomposed into blocks of data elements having dimensions consistent with the dimensions of a systolic array. An identity multiplication may be performed on each block of data elements loaded into a systolic array and the multiplication products summed in column partitions of a results buffer. The data elements in the column partitions of the results buffer can then be mapped to row partitions of a buffer memory for further processing.

1. A method for transposing a tensor using processing element array operations, the method comprising:
mapping a block of data elements of the tensor to a number of row partitions of a state buffer of an integrated circuit, each row partition having a number of columns;
loading the data elements into a processing element array of the integrated circuit in corresponding rows and columns of processing elements;
performing a series of multiplication operations with an identity matrix such that each column of the identity matrix is sequentially multiplied by each column of data elements in the processing element array;
after each multiplication operation, summing multiplication products for each column of the processing element array that performs a multiplication operation with a column of the identity matrix;
storing, in a corresponding column partition of a results buffer, the summed multiplication products for each column of the processing element array that performs a multiplication operation, the results buffer having a same number of column partitions as columns in the processing element array, wherein the summed multiplication products for subsequent multiplication operations are stored in subsequent rows for each corresponding column partition, and wherein the results buffer has a same number of rows as the processing ...

Publication date: 31-12-2020

Neural network operation reordering for parallel execution

Number: US20200409717A1
Assignee: Amazon Technologies Inc

Techniques are disclosed for reordering operations of a neural network to improve runtime efficiency. In some examples, a compiler receives a description of the neural network comprising a plurality of operations. The compiler may determine which execution engine of a plurality of execution engines is to perform each of the plurality of operations. The compiler may determine an order of performance associated with the plurality of operations. The compiler may identify a runtime inefficiency based on the order of performance and a hardware usage for each of the plurality of operations. An operation may be reordered to reduce the runtime inefficiency. Instructions may be compiled based on the plurality of operations, which include the reordered operation.

Publication date: 31-12-2020

DILATED CONVOLUTION USING SYSTOLIC ARRAY

Number: US20200410036A1
Assignee:

In one example, a non-transitory computer readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to: load a first weight data element of an array of weight data elements from a memory into a systolic array; select a subset of input data elements from the memory into the systolic array to perform first computations of a dilated convolution operation, the subset being selected based on a rate of the dilated convolution operation and coordinates of the weight data element within the array of weight data elements; and control the systolic array to perform the first computations based on the first weight data element and the subset to generate first output data elements of an output data array. An example of a compiler that generates the instructions is also provided.

1. A method, comprising:
loading a first weight data element of an array of weight data elements from a memory into a systolic array, the first weight data element being at first coordinates within the array of weight data elements;
receiving a selection of a first subset of input data elements of an array of input data elements, the first subset being selected based on the first coordinates of the first weight data element, a stride of a dilated convolution operation, and a rate of the dilated convolution operation;
streaming each input data element of the selected first subset starting from a first address from the memory into the systolic array to multiply with the first weight data element to compute first partial sums;
loading a second weight data element from the memory into the systolic array, the second weight data element being at second coordinates within the array of weight data elements;
receiving a selection of a second subset of input data elements of the array of input data elements, the second subset being selected based on the second coordinates of the second weight data element, the stride of the dilated convolution ...

Publication date: 31-12-2020

NEURAL NETWORK LAYER-BY-LAYER DEBUGGING

Number: US20200410354A1
Assignee:

Techniques are disclosed for debugging a neural network execution on a target processor. A reference processor may generate a plurality of first reference tensors for the neural network. The neural network may be repeatedly reduced to produce a plurality of lengths. For each of the lengths, a compiler converts the neural network into first machine instructions, the target processor executes the first machine instructions to generate a first device tensor, and the debugger program determines whether the first device tensor matches a first reference tensor. A shortest length is identified for which the first device tensor does not match the first reference tensor. Tensor output is enabled for a lower-level intermediate representation of the shortest neural network, and the neural network is converted into second machine instructions, which are executed by the target processor to generate a second device tensor.

1. A method of debugging a neural network execution on a target processor, the method comprising:
receiving, by a debugger program operating on a host system, a request to debug an execution of a neural network on the target processor, the neural network comprising a plurality of layers;
generating, using a reference processor on the host system and based on a first sample input, a plurality of first reference tensors for the neural network;
repeatedly reducing the plurality of layers of the neural network to produce a plurality of lengths, and for each particular length of a ...: converting, by a compiler operating on the host system, the neural network having the particular length into first machine instructions; executing, using the target processor and based on the first sample input or on one of the plurality of first reference tensors, the first machine instructions to generate a first device tensor; and determining, by the debugger program, whether the first device tensor matches a first reference tensor of the plurality of first reference tensors; ...
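
The reduction loop is easy to express in Python. Below, compile_fn and run_fn stand in for the real compiler and target processor, and reference_tensors[i] is assumed to be the reference output after layer i; only the shape of the search comes from the description above:

```python
import numpy as np

def find_shortest_divergent_length(layers, reference_tensors,
                                   compile_fn, run_fn, atol=1e-5):
    shortest_bad = None
    for length in range(len(layers), 0, -1):    # repeatedly reduce
        prefix = layers[:length]
        machine_instructions = compile_fn(prefix)
        device_tensor = run_fn(machine_instructions)
        matches = np.allclose(device_tensor,
                              reference_tensors[length - 1], atol=atol)
        if not matches:
            shortest_bad = length   # keep shrinking: we want the shortest
    return shortest_bad
```

With the shortest failing length in hand, the network of that length would be recompiled with tensor output enabled for its lower-level intermediate representation, as the abstract describes.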

Publication date: 05-04-2022

Registers for restricted memory

Number: US11294599B1
Assignee: Amazon Technologies Inc

Provided are integrated circuits and methods for operating integrated circuits. An integrated circuit can include a plurality of memory banks and an execution engine including a set of execution components. Each execution component can be associated with a respective memory bank and can read from and write to the respective memory bank. The integrated circuit can further include a set of registers each associated with a respective memory bank from the plurality of memory banks. The integrated circuit can further be operable to load to or store from the set of registers in parallel, and load to or store from the set of registers serially. A parallel operation followed by a serial operation enables data to be moved from many memory banks into one memory bank. A serial operation followed by a parallel operation enables data to be moved from one memory bank into many memory banks.
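
The parallel-then-serial pattern is the interesting part, so the sketch below models just that. The bank count, addressing, and method names are assumptions:

```python
# A parallel load reads one value from every bank into its register; a
# serial store drains the registers into consecutive addresses of one
# bank. Together they gather data from many banks into one bank.

class BankedMemory:
    def __init__(self, num_banks, bank_size):
        self.banks = [[0] * bank_size for _ in range(num_banks)]
        self.registers = [0] * num_banks     # one register per bank

    def parallel_load(self, addr):
        """Each execution component reads its own bank at `addr`."""
        for i, bank in enumerate(self.banks):
            self.registers[i] = bank[addr]

    def serial_store(self, bank_idx, start_addr):
        """Registers drain one-by-one into a single bank."""
        for i, value in enumerate(self.registers):
            self.banks[bank_idx][start_addr + i] = value

mem = BankedMemory(num_banks=4, bank_size=8)
for i in range(4):
    mem.banks[i][0] = 10 + i                 # one element per bank
mem.parallel_load(addr=0)                    # gather across banks...
mem.serial_store(bank_idx=0, start_addr=4)   # ...into one bank
print(mem.banks[0][4:8])                     # [10, 11, 12, 13]
```

Running the two operations in the opposite order (serial load, then parallel store) scatters one bank's data across many banks, the other direction the abstract mentions.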

Publication date: 12-08-2010

Mechanism for Managing Resource Locking in a Multi-Threaded Environment

Number: US20100205608A1
Assignee: Huynh Jeffrey T, Nemirovsky Mario D

A mechanism is disclosed for implementing resource locking in a massively multi-threaded environment. The mechanism receives from a stream a request to obtain a lock on a resource. In response, the mechanism determines whether the resource is currently locked. If so, the mechanism adds the stream to a wait list. At some point, based upon the wait list, the mechanism determines that it is the stream's turn to lock the resource; thus, the mechanism grants the stream a lock. In this manner, the mechanism enables the stream to reserve and to obtain a lock on the resource. By implementing locking in this way, a stream is able to submit only one lock request. When it is its turn to obtain a lock, the stream is granted that lock. This lock reservation methodology makes it possible to implement resource locking efficiently in a massively multi-threaded environment.
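
The single-request protocol can be modeled in a few lines. Names are illustrative; a real implementation would block the waiting stream and wake it on grant:

```python
from collections import deque

class ReservationLock:
    def __init__(self):
        self.owner = None
        self.wait_list = deque()

    def request(self, stream):
        """Single lock request: either grants now or reserves a turn."""
        if self.owner is None:
            self.owner = stream
            return True          # lock granted immediately
        self.wait_list.append(stream)
        return False             # reserved; no polling or retries needed

    def release(self):
        # Hand the lock to the head of the wait list, if any.
        self.owner = self.wait_list.popleft() if self.wait_list else None
        return self.owner        # next stream to wake, if any

lock = ReservationLock()
assert lock.request("stream-A") is True
assert lock.request("stream-B") is False    # queued with one request
assert lock.release() == "stream-B"         # B's turn; lock handed over
```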

Publication date: 11-05-2021

Compile-time scheduling

Number: US11003429B1
Assignee: Amazon Technologies Inc

Scheduling of the operations of an integrated circuit device such as a hardware accelerator, including scheduling of movement of data into and out of the accelerator, can be performed by a compiler that produces program code for the accelerator. The compiler can produce a graph that represents the operations to be performed by the accelerator. Using the graph, the compiler can determine estimated execution times for the operations represented by each node in the graph. The compiler can schedule operations by determining an estimated execution time for each set of dependent operations that depends from an operation. The compiler can then select, from among a set of operations, the operation that has the shortest estimated execution time and whose set of dependent operations has the longest estimated execution time as compared to the other sets of dependent operations.
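
One plausible reading of the selection rule, as a sketch: among ready operations, prefer the one whose dependent work is most expensive, breaking ties toward the cheapest operation itself. The graph encoding and the subtree-sum estimate (which ignores shared subgraphs) are simplifications:

```python
def dependent_time(graph, times, op, memo=None):
    """Estimated time of `op` plus everything reachable from it."""
    memo = {} if memo is None else memo
    if op not in memo:
        memo[op] = times[op] + sum(
            dependent_time(graph, times, d, memo)
            for d in graph.get(op, ()))
    return memo[op]

def pick_next(ready, graph, times):
    # Longest dependent chain first, then shortest own execution time.
    return max(ready, key=lambda op: (dependent_time(graph, times, op)
                                      - times[op], -times[op]))

graph = {"load_w": ["matmul"], "load_x": ["matmul"], "matmul": []}
times = {"load_w": 5, "load_x": 2, "matmul": 40}
print(pick_next(["load_w", "load_x"], graph, times))   # load_x
```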

Publication date: 30-12-2020

Neural network layer-by-layer debugging

Number: WO2020264335A1
Assignee: Amazon Technologies, Inc.

Techniques are disclosed for debugging a neural network execution on a target processor. A reference processor may generate a plurality of first reference tensors for the neural network. The neural network may be repeatedly reduced to produce a plurality of lengths. For each of the lengths, a compiler converts the neural network into first machine instructions, the target processor executes the first machine instructions to generate a first device tensor, and the debugger program determines whether the first device tensor matches a first reference tensor. A shortest length is identified for which the first device tensor does not match the first reference tensor. Tensor output is enabled for a lower-level intermediate representation of the shortest neural network, and the neural network is converted into second machine instructions, which are executed by the target processor to generate a second device tensor.

Publication date: 04-01-2022

Hybrid wildcard match table

Number: US11218410B2
Assignee: Marvell Asia Pte Ltd

Embodiments of the present invention are directed to a wildcard matching solution that uses a combination of static random access memories (SRAMs) and ternary content addressable memories (TCAMs) in a hybrid solution. In particular, the wildcard matching solution uses a plurality of SRAM pools for lookup and a spillover TCAM pool for unresolved hash conflicts.

Publication date: 19-12-2007

A multi-threaded packet processing engine for stateful packet processing

Number: EP1868111A1
Assignee: ConSentry Networks Inc

A processing engine (101) to accomplish a multiplicity of tasks has a multiplicity of processing tribes (104), each tribe comprising a multiplicity of context register sets and a multiplicity of processing resources for concurrent processing of a multiplicity of threads to accomplish the tasks, a memory structure having a multiplicity of memory blocks, each block storing data for processing threads, and an interconnect structure (109) and control system enabling tribe-to-tribe migration of contexts to move threads from tribe-to-tribe. The processing engine is characterized in that individual ones of the tribes have preferential access to individual ones of the multiplicity of memory blocks.

Publication date: 19-05-2022

LAYER BY LAYER NEURAL NETWORK DEBUGGING

Number: DE112020003105T5
Assignee: Amazon Technologies Inc

Techniques for debugging the execution of a neural network on a target processor are disclosed. A reference processor can generate multiple first reference tensors for the neural network. The neural network can be repeatedly reduced to produce multiple lengths. For each of the lengths, a compiler converts the neural network into first machine instructions, the target processor executes the first machine instructions to generate a first device tensor, and the debugger program determines whether the first device tensor matches a first reference tensor. A shortest length is determined at which the first device tensor does not match the first reference tensor. Tensor output is enabled for a lower-level intermediate representation of the shortest neural network, and the neural network is converted into second machine instructions, which are executed by the target processor to produce a second device tensor.

Publication date: 25-01-2022

Debug for computation networks using error detection codes

Number: US11232016B1
Assignee: Amazon Technologies Inc

Techniques disclosed herein relate generally to debugging complex computing systems, such as those executing neural networks. A neural network processor includes a processing engine configured to execute instructions to implement multiple layers of a neural network. The neural network processor includes a debugging circuit configured to generate error detection codes for input data to the processing engine or error detection codes for output data generated by the processing engine. The neural network processor also includes an interface to a memory device, where the interface is configured to save the error detection codes generated by the debugging circuit into the memory device. The error detection codes generated by the debugging circuit are compared with expected error detection codes generated using a function model of the neural network to identify defects of the neural network.
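
A host-side sketch of the comparison step, with zlib.crc32 standing in for whatever code generator the debugging circuit actually implements:

```python
import zlib
import numpy as np

def layer_output_code(tensor):
    """Error-detection code for one layer's output tensor."""
    return zlib.crc32(np.ascontiguousarray(tensor).tobytes())

def first_defective_layer(device_outputs, model_outputs):
    """Compare per-layer codes; return the first mismatching layer."""
    for idx, (dev, ref) in enumerate(zip(device_outputs, model_outputs)):
        if layer_output_code(dev) != layer_output_code(ref):
            return idx
    return None

ref = [np.ones((2, 2)), np.arange(4.0)]
dev = [np.ones((2, 2)), np.arange(4.0) + 1e-3]   # defect in layer 1
print(first_defective_layer(dev, ref))           # 1
```

Comparing fixed-size codes instead of full tensors keeps the memory traffic and storage needed for debugging small, which is the motivation for generating the codes in hardware.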

Publication date: 12-08-2010

Mechanism for Achieving Packet Flow Control In a Multi-Threaded, Multi-Packet Environment

Number: US20100202292A1
Assignee: Individual

A processing engine to accomplish a multiplicity of tasks has a multiplicity of processing tribes, each tribe comprising a multiplicity of context register sets and a multiplicity of processing resources for concurrent processing of a multiplicity of threads to accomplish the tasks, a memory structure having a multiplicity of memory blocks, each block storing data for processing threads, and an interconnect structure and control system enabling tribe-to-tribe migration of contexts to move threads from tribe-to-tribe. The processing engine is characterized in that individual ones of the tribes have preferential access to individual ones of the multiplicity of memory blocks.

Publication date: 26-03-2024

Hybrid wildcard match table

Number: US11943142B2
Assignee: Marvell Asia Pte Ltd

Embodiments of the present invention are directed to a wildcard matching solution that uses a combination of static random access memories (SRAMs) and ternary content addressable memories (TCAMs) in a hybrid solution. In particular, the wildcard matching solution uses a plurality of SRAM pools for lookup and a spillover TCAM pool for unresolved hash conflicts.

Publication date: 14-11-2023

Dilated convolution using systolic array

Number: US11816559B2
Assignee: Amazon Technologies Inc

In one example, a non-transitory computer readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to: load a first weight data element of an array of weight data elements from a memory into a systolic array; select a subset of input data elements from the memory into the systolic array to perform first computations of a dilated convolution operation, the subset being selected based on a rate of the dilated convolution operation and coordinates of the weight data element within the array of weight data elements; and control the systolic array to perform the first computations based on the first weight data element and the subset to generate first output data elements of an output data array. An example of a compiler that generates the instructions is also provided.

Publication date: 09-01-2024

Data selection circuit

Number: US11868875B1
Assignee: Amazon Technologies Inc

Provided are systems and methods for operating a neural network processor, wherein the processor includes an input selector circuit that can be configured to select the data that will be input into the processor's computational array. In various implementations, the selector circuit can determine, for a row of the array, whether the row input will be the output from a buffer memory or data that the input selector circuit has selected for a different row. The row can receive an input feature map from a set of input data or an input feature map that was selected for inputting into a different row, such that the input feature map is input into more than one row at a time. The selector circuit can also include a delay circuit, so that the duplicated input feature map can be input into the computational array later than the original input feature map.
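
A cycle-level toy of the selector: each PE-array row receives either its own buffer stream or a delayed copy of another row's stream, so one input feature map can feed two rows. The wiring, the one-cycle delay, and the list-based streams are all illustrative assumptions:

```python
from collections import deque

def run_selector(buffer_rows, duplicate_map, delay=1):
    """buffer_rows: per-row input streams from the buffer memory.
    duplicate_map: {dest_row: src_row} for rows fed from another row.
    Returns the per-row streams actually entering the PE array."""
    num_rows = len(buffer_rows) + len(duplicate_map)
    out = [[] for _ in range(num_rows)]
    # Delay registers so the duplicated stream enters the array later
    # than the original, as the delay circuit requires.
    pipes = {dst: deque([None] * delay) for dst in duplicate_map}
    total_cycles = max(len(s) for s in buffer_rows) + delay
    for t in range(total_cycles):
        for row in range(num_rows):
            if row in duplicate_map:
                src = duplicate_map[row]
                sample = buffer_rows[src][t] if t < len(buffer_rows[src]) else None
                pipes[row].append(sample)
                out[row].append(pipes[row].popleft())
            else:
                sample = buffer_rows[row][t] if t < len(buffer_rows[row]) else None
                out[row].append(sample)
    return out

streams = run_selector([[1, 2, 3]], duplicate_map={1: 0}, delay=1)
print(streams)   # [[1, 2, 3, None], [None, 1, 2, 3]]
```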

Publication date: 28-09-2023

Transposed convolution using systolic array

Number: US20230306249A1
Assignee: Amazon Technologies Inc

In one example, a neural network accelerator can execute a set of instructions to: load a first weight data element from a memory into a systolic array, the first weight data element having first coordinates; extract, from the instructions, information indicating a first subset of input data elements to be obtained from the memory, the first subset being based on a stride of a transposed convolution operation and second coordinates of first weight data element in a rotated array of weight data elements; based on the information, obtain the first subset of input data elements from the memory; load the first subset of input data elements into the systolic array; and control the systolic array to perform first computations based on the first weight data element and the first subset of input data elements to generate output data elements of an array of output data elements.

Publication date: 30-12-2020

Dilated convolution using systolic array

Number: WO2020264264A1
Assignee: Amazon Technologies, Inc.

In one example, a non-transitory computer readable medium stores instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to: load a first weight data element of an array of weight data elements from a memory into a systolic array; select a subset of input data elements from the memory into the systolic array to perform first computations of a dilated convolution operation, the subset being selected based on a rate of the dilated convolution operation and coordinates of the weight data element within the array of weight data elements; and control the systolic array to perform the first computations based on the first weight data element and the subset to generate first output data elements of an output data array. An example of a compiler that generates the instructions is also provided.

Publication date: 01-04-2021

Transposed convolution using systolic array

Number: WO2021061566A1
Assignee: Amazon Technologies, Inc.

In one example, a neural network accelerator can execute a set of instructions to: load a first weight data element from a memory into a systolic array, the first weight data element having first coordinates; extract, from the instructions, information indicating a first subset of input data elements to be obtained from the memory, the first subset being based on a stride of a transposed convolution operation and second coordinates of first weight data element in a rotated array of weight data elements; based on the information, obtain the first subset of input data elements from the memory; load the first subset of input data elements into the systolic array; and control the systolic array to perform first computations based on the first weight data element and the first subset of input data elements to generate output data elements of an array of output data elements.

Publication date: 03-06-2021

Efficient utilization of processing element array

Number: WO2021108800A1
Assignee: Amazon Technologies, Inc.

A computer-implemented method includes receiving a neural network model for implementation using a processing element array, where the neural network model includes a convolution operation on a set of input feature maps and a set of filters. The method also includes determining, based on the neural network model, that the convolution operation utilizes less than a threshold number of rows in the processing element array for applying a set of filter elements to the set of input feature maps, where the set of filter elements includes one filter element in each filter of the set of filters. The method further includes generating, for the convolution operation and based on the neural network model, a first instruction and a second instruction for execution by respective rows in the processing element array, where the first instruction and the second instruction use different filter elements of a filter in the set of filters.

Publication date: 02-07-2024

Memory operation for systolic array

Number: US12026607B1
Assignee: Amazon Technologies Inc

A neural network accelerator executes instructions to: load a first weight data element of an array of weight data elements from a memory into a systolic array; extract, from the instructions, information indicating a first number of input data elements to be obtained from a first address of the memory and a second number of input data elements to be skipped between adjacent input data elements to be obtained, the first address being based on first coordinates of the first weight data element, and the first and second numbers being based on a stride of a convolution operation; based on the information, obtain first input data elements from the first address of the memory; and control the systolic array to perform first computations based on the first weight data element and the first input data elements to generate first output data elements of an output data array.

Publication date: 09-11-2023

Efficient utilization of processing element array

Number: US20230359876A1
Assignee: Amazon Technologies Inc

Generating instructions for programming a processing element array to implement a convolution operation can include determining that the convolution operation under-utilizes the processing element array. The convolution operation involves using the processing element array to perform a series of matrix multiplications between a set of filters and a set of input matrices. Each filter comprises a weight matrix. Each input matrix is assigned to a respective row in the processing element array. Under-utilization can be determined through detecting that less than a threshold number of rows would be used concurrently. In response to determining that the convolution operation under-utilizes the processing element array, instructions can be added for modifying the convolution operation to increase the number of rows used concurrently. The added instructions are executable to cause at least one input matrix to be processed in parallel across more rows compared to processing without modifying the convolution operation.

Publication date: 09-04-2024

Transposed convolution using systolic array

Number: US11954583B2
Assignee: Amazon Technologies Inc

In one example, a neural network accelerator can execute a set of instructions to: load a first weight data element from a memory into a systolic array, the first weight data element having first coordinates; extract, from the instructions, information indicating a first subset of input data elements to be obtained from the memory, the first subset being based on a stride of a transposed convolution operation and second coordinates of first weight data element in a rotated array of weight data elements; based on the information, obtain the first subset of input data elements from the memory; load the first subset of input data elements into the systolic array; and control the systolic array to perform first computations based on the first weight data element and the first subset of input data elements to generate output data elements of an array of output data elements.

Publication date: 06-10-2022

Efficient utilization of a processing element array

Number: DE112020005799T5
Assignee: Amazon Technologies Inc

A computer-implemented method includes receiving a neural network model for implementation using a processing element array, wherein the neural network model includes a convolution operation on a set of input feature maps and a set of filters. The method also includes determining, based on the neural network model, that the convolution operation uses less than a threshold number of rows in the processing element array to apply a set of filter elements to the set of input feature maps, wherein the set of filter elements includes a single filter element in each filter of the set of filters. The method further includes generating, for the convolution operation and based on the neural network model, a first instruction and a second instruction for execution by corresponding rows in the processing element array, wherein the first instruction and the second instruction use different filter elements of a filter in the set of filters.

Publication date: 17-09-2024

Static memory allocation for neural network inference

Number: US12093806B1
Assignee: Amazon Technologies Inc

Static memory allocation may be performed for weight values across multiple processing units executing a neural network. A neural network may be received for execution across multiple processing units. A partitioning scheme may be applied to divide the neural network into subgraphs. The subgraphs may be assigned to different processing units. The weights for the operations of the subgraph may be statically allocated in dedicated caches for the processing units as part of the instructions to execute the neural network across the processing units.
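
A compile-time sketch of the allocation: each subgraph's weights get fixed offsets in the dedicated cache of the unit the subgraph was assigned to. The byte sizes and the per-unit bookkeeping are assumptions:

```python
def allocate_weights(subgraphs, cache_size):
    """subgraphs: list of lists of (weight_name, nbytes), one list per
    processing unit. Returns {unit: {weight_name: offset}} or raises if
    a unit's dedicated cache overflows."""
    plan = {}
    for unit, weights in enumerate(subgraphs):
        offset, table = 0, {}
        for name, nbytes in weights:
            if offset + nbytes > cache_size:
                raise MemoryError(f"unit {unit}: weights exceed cache")
            table[name] = offset      # fixed address baked into the
            offset += nbytes          # unit's instruction stream
        plan[unit] = table
    return plan

plan = allocate_weights(
    [[("conv1.w", 4096), ("conv2.w", 8192)], [("fc.w", 16384)]],
    cache_size=32768)
print(plan[0]["conv2.w"])   # 4096
```

Because the offsets are fixed at compile time, the generated instructions can reference weight addresses directly, with no runtime cache management on the inference path.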
