Total found: 9777. Displayed: 100.
Publication date: 29-03-2012

Performing a multiply-multiply-accumulate instruction

Number: US20120079252A1
Author: Eric S. Sprangle
Assignee: Intel Corp

In one embodiment, the present invention includes a processor having multiple execution units, at least one of which includes a circuit having a multiply-accumulate (MAC) unit with multiple multipliers and adders, configured to execute a user-level multiply-multiply-accumulate instruction that populates a destination storage with a plurality of elements, each corresponding to an absolute value for a pixel of a pixel block. Other embodiments are described and claimed.
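The arithmetic the abstract describes can be modeled in scalar code. The following Python sketch is illustrative only; the coefficient operands `c0` and `c1` and the element layout are assumptions, not details from the publication.

```python
def mmac_abs(va, vb, c0, c1):
    """Hypothetical model of a multiply-multiply-accumulate step: two
    multiplies per element pair, an accumulate, then the absolute value
    of the accumulated result is written to the destination element."""
    return [abs(a * c0 + b * c1) for a, b in zip(va, vb)]

# With c0=1 and c1=-1 this reduces to per-pixel absolute differences.
dst = mmac_abs([10, 200], [12, 190], 1, -1)
```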

Publication date: 05-04-2012

Efficient Parallel Floating Point Exception Handling In A Processor

Number: US20120084533A1
Assignee: Individual

Methods and apparatus are disclosed for handling floating point exceptions in a processor that executes single-instruction multiple-data (SIMD) instructions. In one embodiment a numerical exception is identified for a SIMD floating point operation and SIMD micro-operations are initiated to generate two packed partial results of a packed result for the SIMD floating point operation. A SIMD denormalization micro-operation is initiated to combine the two packed partial results and to denormalize one or more elements of the combined packed partial results to generate a packed result for the SIMD floating point operation having one or more denormal elements. Flags are set and stored with packed partial results to identify denormal elements. In one embodiment a SIMD normalization micro-operation is initiated to generate a normalized pseudo internal floating point representation prior to the SIMD floating point operation when it uses multiplication.

Publication date: 12-04-2012

Efficient implementation of arrays of structures on simt and simd architectures

Number: US20120089792A1
Assignee: Linares Medical Devices LLC, Nvidia Corp

One embodiment of the present invention sets forth a technique providing an optimized way to allocate and access memory across a plurality of thread/data lanes. Specifically, the device driver receives an instruction targeted to a memory set up as an array of structures of arrays. The device driver computes an address within the memory using information about the number of thread/data lanes and parameters from the instruction itself. The result is a memory allocation and access approach where the device driver properly computes the target address in the memory. Advantageously, processing efficiency is improved where memory in a parallel processing subsystem is internally stored and accessed as an array of structures of arrays, proportional to the SIMT/SIMD group width (the number of threads or lanes per execution group).
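The address computation sketched in the abstract can be made concrete. This Python function is a minimal sketch of array-of-structures-of-arrays addressing; the parameter names and byte-granularity layout are assumptions, not the driver's actual formula.

```python
def aosoa_address(base, elem, field, lanes, field_size, fields_per_struct):
    """Byte address of `field` of element `elem` when structures are laid
    out as an array of structures of arrays, `lanes` elements wide."""
    group = elem // lanes                  # which SoA chunk holds the element
    lane = elem % lanes                    # the element's lane in that chunk
    group_bytes = fields_per_struct * field_size * lanes
    return (base
            + group * group_bytes          # skip whole chunks
            + field * field_size * lanes   # skip earlier fields in the chunk
            + lane * field_size)           # step to this lane

addr = aosoa_address(0, 5, 2, 4, 4, 3)     # 4 lanes, 4-byte fields, 3 fields
```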

Publication date: 10-05-2012

Dedicated instructions for variable length code insertion by a digital signal processor (dsp)

Number: US20120117360A1
Author: Jagadeesh Sankaran
Assignee: Texas Instruments Inc

In accordance with at least some embodiments, a digital signal processor (DSP) includes an instruction fetch unit and an instruction decode unit in communication with the instruction fetch unit. The DSP also includes a register set and a plurality of work units in communication with the instruction decode unit. The DSP selectively uses a dedicated insert instruction to insert a variable number of bits into a register.
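A dedicated insert instruction of this kind can be modeled with plain bit arithmetic. A minimal Python sketch, assuming a register width and a (position, count) operand encoding that the abstract does not spell out:

```python
def insert_bits(reg, value, pos, count, width=32):
    """Insert the low `count` bits of `value` into `reg` starting at bit
    `pos`, leaving all other bits of the register unchanged."""
    mask = ((1 << count) - 1) << pos           # field being replaced
    merged = (reg & ~mask) | ((value << pos) & mask)
    return merged & ((1 << width) - 1)         # clamp to register width

r = insert_bits(0xFF00, 0b101, 4, 3)
```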

Publication date: 26-07-2012

Predicting a result for an actual instruction when processing vector instructions

Number: US20120191957A1
Author: Jeffry E. Gonion
Assignee: Apple Inc

The described embodiments provide a processor that executes vector instructions. In the described embodiments, while dispatching instructions at runtime, the processor encounters an Actual instruction. Upon determining that a result of the Actual instruction is predictable, the processor dispatches a prediction micro-operation associated with the Actual instruction, wherein the prediction micro-operation generates a predicted result vector for the Actual instruction. The processor then executes the prediction micro-operation to generate the predicted result vector. In the described embodiments, when executing the prediction micro-operation, generating the predicted result vector comprises setting elements of the predicted result vector to true: if a predicate vector is received, for each element for which the predicate vector is active; otherwise, for each element of the predicted result vector.

Publication date: 26-07-2012

Sharing a fault-status register when processing vector instructions

Number: US20120192005A1
Author: Jeffry E. Gonion
Assignee: Apple Inc

The described embodiments provide a processor that executes vector instructions. In the described embodiments, the processor initializes an architectural fault-status register (FSR) and a shadow copy of the architectural FSR by setting each of N bit positions in the architectural FSR and the shadow copy of the architectural FSR to a first predetermined value. The processor then executes a first first-faulting or non-faulting (FF/NF) vector instruction. While executing the first vector instruction, the processor also executes one or more subsequent FF/NF instructions. In these embodiments, when executing the first vector instruction and the subsequent vector instructions, the processor updates one or more bit positions in the shadow copy of the architectural FSR to a second predetermined value upon encountering a fault condition. However, the processor does not update bit positions in the architectural FSR upon encountering a fault condition for the first vector instruction and the subsequent vector instructions.

Publication date: 16-08-2012

Running unary operation instructions for processing vectors

Number: US20120210099A1
Author: Jeffry E. Gonion
Assignee: Apple Inc

During operation, a processor generates a result vector. In particular, the processor records a value from an element at a key element position in an input vector into a base value. Next, for each active element in the result vector to the right of the key element position, the processor generates a result vector by setting the element in the result vector equal to a result of performing a unary operation on the base value a number of times equal to a number of relevant elements. The number of relevant elements is determined from the key element position to and including a predetermined element in the result vector, where the predetermined element in the result vector may be one of: a first element to the left of the element in the result vector; or the element in the result vector.
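The running computation can be sketched directly from the description. In this Python model, `op` stands in for the unary operation and the relevant-element count is taken inclusively; both are reading choices for illustration, not details confirmed by the publication.

```python
def running_unary(vec, pred, key, op):
    """Copy the base value from vec[key]; for each active element to the
    right of the key position, apply `op` to the base once per relevant
    element counted so far."""
    base = vec[key]
    result = list(vec)
    relevant = 0
    for i in range(key + 1, len(vec)):
        if pred[i]:
            relevant += 1
            val = base
            for _ in range(relevant):   # apply the unary op `relevant` times
                val = op(val)
            result[i] = val
    return result

out = running_unary([2, 0, 0, 0], [True, True, True, True], 0, lambda x: x + 1)
```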

Publication date: 18-10-2012

Allocation of counters from a pool of counters to track mappings of logical registers to physical registers for mapper based instruction executions

Number: US20120265969A1
Assignee: International Business Machines Corp

A computer system assigns a particular counter from among a plurality of counters currently in a counter free pool to count a number of mappings of logical registers from among a plurality of logical registers to a particular physical register from among a plurality of physical registers, responsive to an execution of an instruction by a mapper unit mapping at least one logical register from among the plurality of logical registers to the particular physical register, wherein the number of the plurality of counters is less than a number of the plurality of physical registers. The computer system, responsive to the counted number of mappings of logical registers to the particular physical register decremented to less than a minimum value, returns the particular counter to the counter free pool.
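The allocate, decrement, and return-to-pool protocol can be sketched as a small state machine. The class and method names and the minimum value of one mapping are illustrative assumptions:

```python
class MappingCounters:
    """Pool of counters tracking how many logical registers map to each
    physical register; there are fewer counters than physical registers."""
    def __init__(self, num_counters):
        self.free = list(range(num_counters))   # counter free pool
        self.assigned = {}                      # phys reg -> [counter, count]

    def map_logical(self, phys):
        if phys not in self.assigned:
            self.assigned[phys] = [self.free.pop(), 0]  # allocate a counter
        self.assigned[phys][1] += 1

    def unmap_logical(self, phys):
        entry = self.assigned[phys]
        entry[1] -= 1
        if entry[1] < 1:                        # below minimum: release it
            self.free.append(entry[0])
            del self.assigned[phys]
```

A counter is only consumed while at least one mapping to that physical register exists, which is what lets the pool stay smaller than the register file.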

Publication date: 03-01-2013

Processing vectors using wrapping add and subtract instructions in the macroscalar architecture

Number: US20130007422A1
Author: Jeffry E. Gonion
Assignee: Apple Inc

Embodiments of a system and a method in which a processor may execute instructions that cause the processor to receive an input vector and a control vector are disclosed. The executed instructions may also cause the processor to perform a sum or difference operation on another input vector dependent upon the input vector and the control vector.
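The per-element effect of such an instruction can be modeled in a few lines. A minimal Python sketch; treating the control vector as a per-element wrap enable is an assumption about semantics the abstract leaves open:

```python
def wrapping_add(src, amounts, control, width=8):
    """Element-wise add that wraps modulo 2**width wherever the control
    vector element is set; other elements add without wrapping."""
    limit = 1 << width
    return [(a + b) % limit if c else a + b
            for a, b, c in zip(src, amounts, control)]

out = wrapping_add([250, 250], [10, 10], [True, False])
```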

Publication date: 14-02-2013

Data processing device and data processing method

Number: US20130038474A1
Author: Daisuke Baba
Assignee: Panasonic Corp

A decoder reads an instruction for information specifying a bit sequence storage area, information indicating a first bit range, and information indicating a second bit range that is contiguous with the first bit range, then outputs a decoded signal in response to the information so read, and a bit manipulation circuit generates and outputs an output sequence based on a bit sequence stored in the bit sequence storage area by inserting uniform predetermined values between a first bit range and a second bit range in accordance with the decoded signal output from the decoder.

Publication date: 28-03-2013

Method, apparatus and instructions for parallel data conversions

Number: US20130080742A1
Author: Gopalan Ramanujam
Assignee: Individual

Method, apparatus, and program means for performing a conversion. In one embodiment, a disclosed apparatus includes a destination storage location corresponding to a first architectural register. A functional unit operates responsive to a control signal, to convert a first packed first format value selected from a set of packed first format values into a plurality of second format values. Each of the first format values has a plurality of sub-elements having a first number of bits. The second format values have a greater number of bits. The functional unit stores the plurality of second format values into an architectural register.
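The narrow-to-wide conversion is easy to demonstrate with packed byte strings. A sketch using Python's struct module; the 8-bit-to-16-bit widths are an illustrative choice, not the formats actually claimed:

```python
import struct

def widen_packed(packed):
    """Unpack four signed 8-bit sub-elements and repack them as four
    signed 16-bit values: the second format has more bits per element."""
    vals = struct.unpack("4b", packed)
    return struct.pack("4h", *vals)

wide = widen_packed(struct.pack("4b", 1, -2, 3, -4))
```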

Publication date: 25-04-2013

Multi-addressable register files and format conversions associated therewith

Number: US20130103932A1
Assignee: International Business Machines Corp

A multi-addressable register file is addressed by a plurality of types of instructions, including scalar, vector and vector-scalar extension instructions. It may be determined that data is to be translated from one format to another format. If so determined, a convert machine instruction is executed that obtains a single precision datum in a first representation in a first format from a first register; converts the single precision datum of the first representation in the first format to a converted single precision datum of a second representation in a second format; and places the converted single precision datum in a second register.

Publication date: 02-05-2013

Running shift for divide instructions for processing vectors

Number: US20130111193A1
Author: Jeffry E. Gonion
Assignee: Apple Inc

In the described embodiments, a processor generates a result vector when executing a RunningShiftForDivide1P or RunningShiftForDivide2P instruction. In these embodiments, upon executing a RunningShiftForDivide1P/2P instruction, the processor receives a first input vector and a second input vector. The processor then records a base value from an element at a key element position in the first input vector. Next, when generating the result vector, for each active element in the result vector to the right of the key element position, the processor generates a shifted base value using shift values from the second input vector. The processor then corrects the shifted base value when a predetermined condition is met. Next, the processor sets the element of the result vector equal to the shifted base value.

Publication date: 09-05-2013

Method and Apparatus for Unpacking Packed Data

Number: US20130117537A1
Assignee: Individual

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.
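The adjacency the abstract describes is a standard unpack-and-interleave pattern. A Python sketch of the element ordering only; register widths and element counts are not modeled:

```python
def unpack_interleave(src1, src2):
    """Copy packed elements into the destination so the two sources
    alternate: [a0, b0, a1, b1, ...], matching the described adjacency."""
    out = []
    for a, b in zip(src1, src2):
        out += [a, b]
    return out

dst = unpack_interleave([1, 3], [2, 4])
```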

Publication date: 09-05-2013

Method and apparatus for unpacking packed data

Number: US20130117540A1
Assignee: Individual

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.

Publication date: 16-05-2013

Apparatus and method for reducing overhead caused by communication between clusters

Number: US20130124825A1
Assignee: SAMSUNG ELECTRONICS CO LTD

A technique for minimizing overhead caused by copying or moving a value from one cluster to another cluster is provided. A number of operations, for example, a mov operation for moving or copying a value from one cluster to another cluster and a normal operation may be executed concurrently. Accordingly, access to a register file outside of the cluster may be reduced and the performance of code may be improved.

Publication date: 06-06-2013

Bitstream Buffer Manipulation With A SIMD Merge Instruction

Number: US20130145125A1
Assignee: Individual

Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable-length symbol exist in a first data block. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable-length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.
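The shift-merge step can be sketched with integer arithmetic. In this Python model the leftover bits of the partial symbol sit above the incoming block; the bit ordering is an assumption for illustration:

```python
def shift_merge(leftover, leftover_count, next_block, block_bits=32):
    """Merge the unprocessed bits of a partial variable-length symbol with
    the next data block, returning the merged value and its bit length."""
    merged = (leftover << block_bits) | next_block
    return merged, leftover_count + block_bits

merged, nbits = shift_merge(0b101, 3, 0xFF, block_bits=8)
```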

Publication date: 20-06-2013

Reducing issue-to-issue latency by reversing processing order in half-pumped simd execution units

Number: US20130159666A1
Assignee: International Business Machines Corp

Techniques for reducing issue-to-issue latency by reversing processing order in half-pumped single instruction multiple data (SIMD) execution units are described. In one embodiment a processor functional unit is provided comprising a frontend unit, an execution core unit, a backend unit, an execution order control signal unit, a first interconnect coupled between an output and an input of the execution core unit, and a second interconnect coupled between an output of the backend unit and an input of the frontend unit. In operation, the execution order control signal unit generates a forwarding order control signal based on the parity of an applied clock signal on reception of a first vector instruction. This control signal is in turn used to selectively forward first and second portions of an execution result of the first vector instruction via the interconnects for use in the execution of a dependent second vector instruction.

Publication date: 27-06-2013

Method and apparatus for generating flags for a processor

Number: US20130166889A1
Assignee: Advanced Micro Devices Inc

A method and apparatus are described for generating flags in response to processing data during an execution pipeline cycle of a processor. The processor may include a multiplexer configured to generate valid bits for received data according to a designated data size, and a logic unit configured to control the generation of flags based on a shift or rotate operation command, the designated data size and information indicating how many bytes and bits to rotate or shift the data by. A carry flag may be used to extend the number of bits supported by shift and rotate operations. A sign flag may be used to indicate whether a result is a positive or negative number. An overflow flag may be used to indicate that a data overflow exists, whereby there are not a sufficient number of bits to store the data.
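The flag computations can be modeled on a small result. This Python sketch is illustrative only; the exact carry, sign, and overflow semantics in the patent may differ:

```python
def flags_after_shift(value, width=8):
    """Derive carry, sign, and overflow style flags for a raw non-negative
    result that may be wider than the `width`-bit destination."""
    carry = (value >> width) & 1                # bit pushed past the top
    result = value & ((1 << width) - 1)         # truncated destination value
    sign = (result >> (width - 1)) & 1          # MSB set: negative result
    overflow = int(value != result)             # value did not fit in width
    return {"CF": carry, "SF": sign, "OF": overflow}

flags = flags_after_shift(0x1FF)
```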

Publication date: 04-07-2013

Processor for Executing Wide Operand Operations Using a Control Register and a Results Register

Number: US20130173888A1
Assignee: Microunity Systems Engineering Inc

A programmable processor and method for improving the performance of processors by expanding at least two source operands, or a source and a result operand, to a width greater than the width of either the general purpose register or the data path width. The present invention provides operands which are substantially larger than the data path width of the processor by using the contents of a general purpose register to specify a memory address at which a plurality of data path widths of data can be read or written, as well as the size and shape of the operand. In addition, several instructions and apparatus for implementing these instructions are described which obtain performance advantages if the operands are not limited to the width and accessible number of general purpose registers.

Publication date: 11-07-2013

Performing A Multiply-Multiply-Accumulate Instruction

Number: US20130179661A1
Author: Eric Sprangle
Assignee: Individual

In one embodiment, the present invention includes a processor having multiple execution units, at least one of which includes a circuit having a multiply-accumulate (MAC) unit with multiple multipliers and adders, configured to execute a user-level multiply-multiply-accumulate instruction that populates a destination storage with a plurality of elements, each corresponding to an absolute value for a pixel of a pixel block. Other embodiments are described and claimed.

Publication date: 18-07-2013

Processor with multi-level looping vector coprocessor

Number: US20130185540A1
Assignee: Texas Instruments Inc

A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core includes a program memory interface through which the scalar processor retrieves instructions from a program memory. The instructions include scalar instructions executable by the scalar processor and vector instructions executable by the vector coprocessor core. The vector coprocessor core includes a plurality of execution units and a vector command buffer. The vector command buffer is configured to decode vector instructions passed by the scalar processor core, to determine whether vector instructions defining an instruction loop have been decoded, and to initiate execution of the instruction loop by one or more of the execution units based on a determination that all of the vector instructions of the instruction loop have been decoded.

Publication date: 18-07-2013

Bitstream buffer manipulation with a simd merge instruction

Number: US20130185541A1
Assignee: Individual

Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable-length symbol exist in a first data block. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable-length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.

Publication date: 18-07-2013

Processor with instruction variable data distribution

Number: US20130185544A1
Assignee: Texas Instruments Inc

A vector processor includes a plurality of execution units arranged in parallel, a register file, and a plurality of load units. The register file includes a plurality of registers coupled to the execution units. Each of the load units is configured to load, in a single transaction, a plurality of the registers with data retrieved from memory, the loaded registers corresponding to different execution units. Each of the load units is configured to distribute the data to the registers in accordance with an instruction-selectable distribution. The instruction-selectable distribution specifies one of a plurality of distributions, each of which specifies a data sequence that differs from the sequence in which the data is stored in memory.

Publication date: 15-08-2013

System for implementing vector look-up table operations in a SIMD processor

Number: US20130212353A1
Author: Mimar Tibet
Assignee:

The present invention incorporates a system for vector Look-Up Table (LUT) operations into a single-instruction multiple-data (SIMD) processor in order to implement a plurality of LUT operations simultaneously, where each of the LUT contents could be the same or different. Elements of one or two vector registers are used to form LUT indexes, and the output of the vector LUT operation is written into a vector register. No dedicated LUT memory is required; rather, data memory is organized as multiple separate data memory banks, where a portion of each data memory bank is used for LUT operations. For a single-input vector LUT operation, the address input of each LUT is operably coupled to any of the input vector register's elements using input vector element mapping logic in one embodiment. Thus, one input vector element can produce (a positive integer) N output elements using N different LUTs, or (another positive integer) K input vector elements can produce N output elements, where K is an integer from one to N.

Claims 1-36 (canceled). 37. A method for performing a plurality of lookup table operations in parallel in one step in a processor, the method comprising: providing a memory that is partitioned into a plurality of memory banks, each of said plurality of memory banks is independently addressable, the number of said plurality of memory banks is at least the same as a number of vector elements of at least one source vector, said memory is shared for use as a local data memory by said processor for access by load and store instructions and a plurality of lookup tables; providing a vector register array with ability to store a plurality of vectors; storing one of said plurality of lookup tables into each of said plurality of memory banks at a base address, said plurality of lookup tables each containing a plurality of entries; storing said at least one source vector into said vector register array; using index values to select entries of said plurality of lookup tables in ...
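The banked lookup can be sketched in a few lines. Each index element addresses its own memory bank, so all lookups can proceed in one step in hardware; the bank contents below are illustrative:

```python
def vector_lut(banks, indexes):
    """One lookup per element: element i of the index vector reads from
    memory bank i, so N lookups complete in parallel."""
    return [banks[i][idx] for i, idx in enumerate(indexes)]

banks = [[10, 11], [20, 21], [30, 31]]   # same or different table per bank
out = vector_lut(banks, [1, 0, 1])
```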

Publication date: 15-08-2013

Conditional vector mapping in a SIMD processor

Number: US20130212355A1
Author: Mimar Tibet
Assignee:

The present invention provides a method for mapping input vector register elements to output vector register elements in one step in relation to a control vector register controlling vector-to-vector mapping and condition code values. The method includes storing an input vector having N elements of input data in a vector register and storing a control vector having N elements in a vector register, and providing for enabling vector-to-vector mapping where the mask bit is not set to selectively disable. The masking of certain elements is useful to partition large mappings of vectors or matrices into sizes that fit the number of elements of a given SIMD, and to merge multiple mapped results together. This method and system provides a highly efficient mechanism for mapping vector register elements in parallel based on a user-defined mapping and previously calculated condition codes, and merging these mapped vector elements with another vector using a mask.

Claims 1-51 (canceled). 52. A method for performing vector operations in parallel in one step, the method comprising the steps of: providing a vector register file including a plurality of vector registers; storing a first input vector in said vector register file; storing a control vector in said vector register file, wherein said control vector is selected as a source operand of said vector operations; selecting a condition flag from a plurality of condition flags for each vector element position in accordance with a condition select field from a vector instruction, said plurality of condition flags are derived from results of executing a prior instruction sequence; mapping the elements of said first input vector to the elements of a first output vector, in accordance with a first field of respective element of said control vector; and storing elements of said first output vector on an element-by-element basis conditionally, if mask bit of respective element of said control vector is interpreted as false and in accordance with ...
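The mapping-with-mask-and-condition behavior can be modeled per element. The control-vector field layout below, a (source index, mask bit) pair per element, is an assumption made for illustration:

```python
def conditional_map(src, control, cond_flags, dest):
    """Map src elements into a copy of dest according to each control
    element's index field, skipping positions whose mask bit is set or
    whose selected condition flag is false."""
    out = list(dest)
    for i, (index, mask) in enumerate(control):
        if not mask and cond_flags[i]:
            out[i] = src[index]
    return out

out = conditional_map([5, 6, 7], [(2, 0), (0, 1), (1, 0)],
                      [True, True, False], [0, 0, 0])
```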

Publication date: 05-09-2013

METHODS, APPARATUS, AND INSTRUCTIONS FOR CONVERTING VECTOR DATA

Number: US20130232318A1
Assignee:

A computer processor includes a decoder for decoding machine instructions and an execution unit for executing those instructions. The decoder and the execution unit are capable of decoding and executing vector instructions that include one or more format conversion indicators. For instance, the processor may be capable of executing a vector-load-convert-and-write (VLoadConWr) instruction that provides for loading data from memory to a vector register. The VLoadConWr instruction may include a format conversion indicator to indicate that the data from memory should be converted from a first format to a second format before the data is loaded into the vector register. Other embodiments are described and claimed.

1. A processor for executing a machine instruction combining data format conversion with at least one vector operation, the processor comprising: control logic capable of executing processor instructions comprising a vector-load-convert-and-write instruction having a format conversion indicator and a vector register indicator; wherein, in response to the vector-load-convert-and-write instruction, the control logic is capable of: converting data from a first format to a second format, based at least in part on the format conversion indicator; and after converting the data to the second format, saving the data in the second format to multiple elements of a vector register identified by the vector register indicator. 2. A processor according to claim 1, wherein the control logic is further capable of executing a vector-load-convert-compute-and-write instruction having the format conversion indicator and the vector register indicator, wherein: in response to the vector-load-convert-compute-and-write instruction, the control logic is capable of: converting data from the first format to the second format, based at least in part on the format conversion indicator; performing a vector arithmetic operation, based at least in part on the data in the second format; ...

Publication date: 12-09-2013

Method, apparatus and instructions for parallel data conversions

Number: US20130238879A1
Author: Gopalan Ramanujam
Assignee: Individual

Method, apparatus, and program means for performing a conversion. In one embodiment, a disclosed apparatus includes a destination storage location corresponding to a first architectural register. A functional unit operates responsive to a control signal, to convert a first packed first format value selected from a set of packed first format values into a plurality of second format values. Each of the first format values has a plurality of sub elements having a first number of bits. The second format values have a greater number of bits. The functional unit stores the plurality of second format values into an architectural register.

Publication date: 12-09-2013

Operation processing device, mobile terminal and operation processing method

Number: US20130238880A1
Author: Masahiko Toichi
Assignee: Fujitsu Ltd

An operation processing device for executing a plurality of operations on aligned data by one vector instruction includes a first mask storage unit and a second mask storage unit. The first mask storage unit stores first mask data designating each of the plurality of operations as a true or false operation, and the second mask storage unit stores second mask data designating a number of the operations to be continuously true.

Publication date: 19-09-2013

Copying character data having a termination character from one memory location to another

Number: US20130246739A1
Assignee: International Business Machines Corp

Copying characters of a set of terminated character data from one memory location to another memory location using parallel processing and without causing unwarranted exceptions. The character data to be copied is loaded within one or more vector registers. In particular, in one embodiment, an instruction (e.g., a Vector Load to block Boundary instruction) is used that loads data in parallel in a vector register to a specified boundary, and provides a way to determine the number of characters loaded. To determine the number of characters loaded (a count), another instruction (e.g., a Load Count to Block Boundary instruction) is used. Further, an instruction (e.g., a Vector Find Element Not Equal instruction) is used to find the index of the first delimiter character, i.e., the first termination character, such as a zero or null character within the character data. This instruction checks a plurality of bytes of data in parallel.
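The copy loop can be sketched in scalar Python. Here `find_terminator` plays the role of the Vector Find Element Not Equal null search, and the 16-byte block mirrors loading to a block boundary; both widths are illustrative assumptions:

```python
def find_terminator(chunk, terminator=0):
    """Index of the first terminator byte, or len(chunk) if none."""
    for i, b in enumerate(chunk):
        if b == terminator:
            return i
    return len(chunk)

def copy_terminated(src, block=16):
    """Copy block-by-block until the terminator (or end of source),
    never touching bytes past the block containing the terminator."""
    out = []
    pos = 0
    while True:
        chunk = src[pos:pos + block]
        idx = find_terminator(chunk)
        out += chunk[:idx]
        if idx < len(chunk) or not chunk:
            return bytes(out)
        pos += block

copied = copy_terminated(b"hello\x00world")
```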

Publication date: 19-09-2013

Vector find element not equal instruction

Number: US20130246751A1
Assignee: International Business Machines Corp

Processing of character data is facilitated. A Find Element Not Equal instruction is provided that compares data of multiple vectors for inequality and provides an indication of inequality, if inequality exists. An index associated with the unequal element is stored in a target vector register. Further, the same instruction, the Find Element Not Equal instruction, also searches a selected vector for null elements, also referred to as zero elements. A result of the instruction is dependent on whether the null search is provided, or just the compare.
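The combined inequality-and-null search can be modeled directly. A Python sketch of the result semantics described; returning the vector length when nothing is found is an assumption for illustration:

```python
def find_element_not_equal(va, vb, null_search=True):
    """Index of the first unequal element pair; with the null search
    enabled, a null (zero) element encountered first wins instead."""
    for i, (a, b) in enumerate(zip(va, vb)):
        if null_search and a == 0:
            return i
        if a != b:
            return i
    return len(va)

idx = find_element_not_equal([1, 2, 3], [1, 9, 3])
```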

Publication date: 17-10-2013

SHUFFLE PATTERN GENERATING CIRCUIT, PROCESSOR, SHUFFLE PATTERN GENERATING METHOD, AND INSTRUCTION SEQUENCE

Number: US20130275718A1
Author: Baba Daisuke, Ueda Kyoko
Assignee:

Based on an input index sequence composed of four indices (each having a bit width of 8 bits), a shift-copier generates an index sequence by shifting each index leftward by 1 bit and making two copies of each index, and outputs the generated index sequence. An adder generates a shuffle pattern by adding 1, 0, 1, 0, 1, 0, 1 and 0 to the indices in the index sequence from left to right, and outputs the generated shuffle pattern. 1. A shuffle pattern generating circuit comprising: a shift-copier that generates an index sequence by: receiving an input index sequence composed of a plurality of indices, and a signal indicating a number of bits and a number of copies; shifting each index in the input index sequence leftward by the number of bits; and making the number of copies of each index in the input index sequence, and outputs the generated index sequence; and an adder that receives the index sequence output by the shift-copier and a signal indicating an additional value to be added to each index in the index sequence output by the shift-copier, and adds the additional value to each index in the index sequence output by the shift-copier. 2. The shuffle pattern generating circuit of claim 1, wherein the additional value is different for each copy made from a same index in the input index sequence. 3. The shuffle pattern generating circuit of claim 2, wherein the number of the copies is N (where N is an integer equal to or greater than 2), and the additional value to be added to one of the copies made from a same index in the input index sequence is 0, and the additional value to be added to each of the remaining N−1 copies is an integer ranging from 1 to N−1. 4. The shuffle pattern generating circuit of claim 1, further comprising: a bit width changer that receives a signal indicating a bit width of each index in the input index sequence, and changes a bit width of each index to the bit width indicated by the signal. 5.
The shuffle pattern generating circuit ...
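The shift-copy-and-add scheme above can be modelled in a few lines (the function name and defaults are mine; defaults follow the worked example: shift by 1 bit, two copies, addends alternating 1, 0):

```python
def shuffle_pattern(indices, shift_bits=1, copies=2, addends=(1, 0)):
    """Shift each index left by shift_bits, emit `copies` copies of it,
    adding addends[c] to the c-th copy."""
    out = []
    for idx in indices:
        base = idx << shift_bits
        for c in range(copies):
            out.append(base + addends[c % len(addends)])
    return out
```

With the defaults, each 16-bit element index i expands to the byte pair (2i+1, 2i), which is the kind of byte-granular shuffle pattern a pshufb-style unit consumes.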

Publication date: 17-10-2013

PACKED DATA OPERATION MASK SHIFT PROCESSORS, METHODS, SYSTEMS, AND INSTRUCTIONS

Number: US20130275719A1
Assignee:

A method of an aspect includes receiving a packed data operation mask shift instruction. The packed data operation mask shift instruction indicates a source having a packed data operation mask, indicates a shift count number of bits, and indicates a destination. The method further includes storing a result in the destination in response to the packed data operation mask shift instruction. The result includes a sequence of bits of the packed data operation mask that have been shifted by the shift count number of bits. Other methods, apparatus, systems, and instructions are disclosed. 1. A method comprising:receiving a packed data operation mask shift instruction, the packed data operation mask shift instruction indicating a source having a packed data operation mask, indicating a shift count number of bits, and indicating a destination; andstoring a result in the destination in response to the packed data operation mask shift instruction, the result including a sequence of bits of the packed data operation mask that have been shifted by the shift count number of bits.2. The method of claim 1 , wherein storing the result comprises storing the sequence of the bits of the packed data operation mask that have been logically shifted to the right by the shift count number of bits with the shift count number of zeros shifted in on the left.3. The method of claim 1 , wherein storing the result comprises storing the sequence of the bits of the packed data operation mask that have been logically shifted to the left by the shift count number of bits with the shift count number of zeros shifted in on the right.4. The method of claim 1 , wherein the packed data operation mask is an N-bit packed data operation mask claim 1 , wherein the shift count number of bits is an M-bit shift count number of bits claim 1 , and wherein the result includes:(a) in a least significant N-bits of the destination, an (N−M)-bit sequence of the bits of the N-bit packed data operation mask that has ...
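The logical mask shifts described in the claims are easy to model with integer masks (the function names and the default mask width are mine, not from the patent):

```python
def kshift_right(mask, count, n=16):
    """Logical right shift of an n-bit packed-data operation mask;
    zeros shift in on the left."""
    return (mask >> count) & ((1 << n) - 1)

def kshift_left(mask, count, n=16):
    """Logical left shift; zeros shift in on the right, and bits
    shifted past bit n-1 are dropped."""
    return (mask << count) & ((1 << n) - 1)
```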

Publication date: 17-10-2013

Apparatus and method of improved extract instructions

Number: US20130275730A1
Assignee: Intel Corp

An apparatus is described that includes instruction execution logic circuitry to execute first, second, third and fourth instructions. Both the first instruction and the second instruction select a first group of input vector elements from one of multiple first non overlapping sections of respective first and second input vectors. The first group has a first bit width. Each of the multiple first non overlapping sections have a same bit width as the first group. Both the third instruction and the fourth instruction select a second group of input vector elements from one of multiple second non overlapping sections of respective third and fourth input vectors. The second group has a second bit width that is larger than the first bit width. Each of the multiple second non overlapping sections have a same bit width as the second group. The apparatus includes masking layer circuitry to mask the first and second groups of the first and third instructions at a first granularity, where, respective resultants produced therewith are respective resultants of the first and third instructions. The masking circuitry is also to mask the first and second groups of the second and fourth instructions at a second granularity, where, respective resultants produced therewith are respective resultants of the second and fourth instructions.

Publication date: 17-10-2013

PROCESSORS

Number: US20130275732A1
Assignee:

A processing apparatus comprises a plurality of processors each arranged to perform an instruction, and a bus arranged to carry data and control tokens between the processors. Each processor is arranged, if it receives a control token via the bus, to carry out the instruction, and on carrying out the instruction, to perform an operation on the data, to identify any of the processors which are to be data target processors, and to transmit output data to any identified data target processors, to identify any of the processors which are to be control target processors, and to transmit a control token to any identified control target processors. 1. A processing apparatus comprising a plurality of chips each comprising a plurality of processors , each processor arranged to perform an instruction , and a bus arranged to carry data tokens and control tokens between the processors , wherein each processor is arranged , if it receives a control token via the bus , to carry out the instruction , and on carrying out the instruction , to perform an operation on the data to produce a result , to identify any of the processors which are to be data target processors , and to transmit output data to any identified data target processors , to identify any of the processors which are to be control target processors , and to transmit a control token to any identified control target processors , each chip having an input device which receives tokens from the bus and an output device from which tokens can be transferred to another of the chips , wherein the plurality of processors are between the input and output device and each processor has an address associated with it , the addresses being within a range , the apparatus being arranged , on receipt by the output device of a token having a target address which is outside the range , to perform a modification of the target address , and to transfer the token to said other chip.2. A processing apparatus according to wherein each ...

Publication date: 24-10-2013

System, apparatus and method for translating vector instructions

Number: US20130283022A1
Author: Ruchira Sasanka
Assignee: Intel Corp

Vector translation instructions are used to demarcate the beginning and the end of a code region to be translated. The code region includes a first set of vector instructions defined in an instruction set of a source processor. A processor receives the vector translation instructions and the demarcated code region, and translates the code region into translated code. The translated code includes a second set of vector instructions defined in an instruction set of a target processor. The translated code is executed by the target processor to produce a result value, the result value being the same as an original result value produced by the source processor executing the code region. The target processor stores the result value at a location that is not a vector register, the location being the same as an original location used by the source processor to store the original result value.

Publication date: 14-11-2013

PERFORMING A CYCLIC REDUNDANCY CHECKSUM OPERATION RESPONSIVE TO A USER-LEVEL INSTRUCTION

Number: US20130305015A1
Assignee:

In one embodiment, the present invention includes a method for receiving incoming data in a processor and performing a checksum operation on the incoming data in the processor pursuant to a user-level instruction for the checksum operation. For example, a cyclic redundancy checksum may be computed in the processor itself responsive to the user-level instruction. Other embodiments are described and claimed. 1. A system comprising: a processor comprising: a cache; a set of registers, including a first 32-bit register to store a first operand, a second 32-bit register to store a second operand, a first 64-bit register to store a third operand, and a second 64-bit register to store a fourth operand; and a plurality of execution units to perform exclusive-OR (XOR) operations on data of a configurable size responsive to instructions of an instruction set architecture (ISA) for the processor, the plurality of execution units including a first execution unit coupled to the first and the second 32-bit registers to perform a first XOR operation on at least one bit of the first and the second operands and to store a result of the first XOR operation in a first destination register responsive to a first instruction of the ISA, and a second execution unit coupled to the first and the second 64-bit registers to perform a second XOR operation on at least one bit of the third and the fourth operands and to store a result of the second XOR operation in a second destination register responsive to a second instruction of the ISA; a memory coupled with the processor; a data storage device coupled with the processor; an audio I/O device coupled with the processor; and a communication device coupled to the processor. 2. The system of claim 1, wherein the first and the second execution units are to perform the first and the second XOR operations in the same number of cycles. 3. The system of claim 1, wherein the first destination register comprises the first 32-bit ...

Publication date: 14-11-2013

PERFORMING A CYCLIC REDUNDANCY CHECKSUM OPERATION RESPONSIVE TO A USER-LEVEL INSTRUCTION

Number: US20130305016A1
Assignee:

In one embodiment, the present invention includes a method for receiving incoming data in a processor and performing a checksum operation on the incoming data in the processor pursuant to a user-level instruction for the checksum operation. For example, a cyclic redundancy checksum may be computed in the processor itself responsive to the user-level instruction. Other embodiments are described and claimed. 1. A system comprising: a processor comprising: a memory controller; a set of registers, including a first 32-bit register to store a first operand, a second 32-bit register to store a second operand, a first 64-bit register to store a third operand, and a second 64-bit register to store a fourth operand; and a plurality of execution units to perform exclusive-OR (XOR) operations on data of a configurable size responsive to instructions of an instruction set architecture (ISA) for the processor, the plurality of execution units including a first execution unit coupled to the first and the second 32-bit registers to perform a first XOR operation on at least one bit of the first and the second operands and to store a result of the first XOR operation in a first destination register responsive to a first instruction of the ISA, and a second execution unit coupled to the first and the second 64-bit registers to perform a second XOR operation on at least one bit of the third and the fourth operands and to store a result of the second XOR operation in a second destination register responsive to a second instruction of the ISA; a memory coupled with the memory controller; a data storage device coupled with the processor; an audio I/O interface coupled with the processor; and a communication device coupled with the processor. 2. The system of claim 1, wherein the first and the second execution units are to perform the first and the second XOR operations in the same number of cycles. 3. The system of claim 1, wherein the first destination register ...

Publication date: 14-11-2013

Performing a cyclic redundancy checksum operation responsive to a user-level instruction

Number: US20130305118A1
Assignee: Intel Corp

In one embodiment, the present invention includes a method for receiving incoming data in a processor and performing a checksum operation on the incoming data in the processor pursuant to a user-level instruction for the checksum operation. For example, a cyclic redundancy checksum may be computed in the processor itself responsive to the user-level instruction. Other embodiments are described and claimed.
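The x86 user-level CRC instruction this family of patents relates to uses the CRC-32C (Castagnoli) polynomial. A bit-at-a-time software model (the function name is mine; the reflected polynomial 0x82F63B78 and the init/final XOR are the standard CRC-32C parameters):

```python
def crc32c(data, crc=0):
    """Bitwise CRC-32C over a byte string, reflected form
    (polynomial 0x82F63B78, init and final XOR 0xFFFFFFFF)."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```

The hardware instruction computes the same function 8, 16, 32 or 64 input bits at a time instead of bit-serially.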

Publication date: 21-11-2013

Apparatus and method for selecting elements of a vector computation

Number: US20130311530A1
Assignee: Intel Corp

An apparatus and method are described for performing a vector reduction. For example, an apparatus according to one embodiment comprises: a reduction logic tree comprising a set of N−1 reduction logic blocks used to perform reduction in a single operation cycle for N vector elements; a first input vector register storing a first input vector communicatively coupled to the set of reduction logic blocks; a second input vector register storing a second input vector communicatively coupled to the set of reduction logic blocks; a mask register storing a mask value controlling a set of one or more multiplexers, each of the set of multiplexers selecting a value directly from the first input vector register or an output containing a processed value from one of the reduction logic blocks; and an output vector register coupled to outputs of the one or more multiplexers to receive the values passed through by each of the multiplexers responsive to the control signals.
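A small model of the structure described above (functions and the choice of addition as the reduction operation are mine): N-1 pairwise adders arranged as a tree expose partial sums level by level, and a per-lane mask selects between the raw input and a reduced value.

```python
def tree_reduce_sums(v):
    """Level-by-level partial sums of a tree of len(v)-1 adders;
    len(v) must be a power of two."""
    levels = [list(v)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels

def select_outputs(v, mask):
    """Per lane: pass the input through (mask False) or take the
    full reduction (mask True)."""
    total = sum(v)
    return [total if m else x for x, m in zip(v, mask)]
```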

Publication date: 21-11-2013

ROTATE INSTRUCTIONS THAT COMPLETE EXECUTION WITHOUT READING CARRY FLAG

Number: US20130311756A1
Assignee:

A method of one aspect may include receiving a rotate instruction. The rotate instruction may indicate a source operand and a rotate amount. A result may be stored in a destination operand indicated by the rotate instruction. The result may have the source operand rotated by the rotate amount. Execution of the rotate instruction may complete without reading a carry flag. 1. A method comprising:receiving a rotate instruction, the rotate instruction indicating a source operand and a rotate amount;storing a result in a destination operand indicated by the rotate instruction, the result having the source operand rotated by the rotate amount; andcompleting execution of the rotate instruction without reading a carry flag.2. The method of claim 1 , wherein completing comprises completing execution of the rotate instruction without reading an overflow flag.3. The method of claim 2 , wherein completing comprises completing execution of the rotate instruction without writing the carry flag and without writing the overflow flag.4. The method of claim 2 , wherein completing comprises completing execution of the rotate instruction without reading a sign flag claim 2 , without reading a zero flag claim 2 , without reading an auxiliary carry flag claim 2 , and without reading a parity flag.5. The method of claim 4 , wherein completing comprises completing execution of the rotate instruction without writing the carry flag claim 4 , without writing the overflow flag claim 4 , without writing the sign flag claim 4 , without writing the zero flag claim 4 , without writing the auxiliary carry flag claim 4 , and without writing the parity flag.6. The method of claim 1 , wherein receiving comprises receiving a rotate instruction that explicitly specifies the source operand and that explicitly specifies the destination operand.7. The method of claim 1 , wherein receiving comprises receiving a rotate instruction that explicitly specifies a second source operand having the rotate amount.8. 
...
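A rotate with the semantics described above is a pure function of the source and the rotate amount; nothing needs to be read from or written to a flags register. A sketch (the function name echoes the later x86 RORX instruction, but the Python model is mine):

```python
def rorx(value, amount, width=32):
    """Rotate right by `amount` within `width` bits; computed without
    consulting or producing any arithmetic flags."""
    amount %= width
    mask = (1 << width) - 1
    return ((value >> amount) | (value << (width - amount))) & mask
```

Removing the flag dependency is what lets the rotate be scheduled freely relative to flag-producing instructions.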

Publication date: 12-12-2013

Speed up Secure Hash Algorithm (SHA) using Single Instruction Multiple Data (SIMD) architectures

Number: US20130332742A1
Authors: Shay Gueron, Vlad KRASNOV
Assignee: Intel Corp

A processing apparatus may comprise logic to preprocess a message according to a selected secure hash algorithm (SHA) algorithm to generate a plurality of message blocks, logic to generate hash values by preparing message schedules in parallel using single instruction multiple data (SIMD) instructions for the plurality of message blocks and to perform compression in serial for the plurality of message blocks, and logic to generate a message digest conforming to the selected SHA algorithm.

Publication date: 19-12-2013

Single Instruction Multiple Data (SIMD) reconfigurable vector register file and permutation unit

Number: US20130339649A1
Assignee: Intel Corp

An apparatus may comprise a register file and a permutation unit coupled to the register file. The register file may have a plurality of register banks and an input to receive a selection signal. The selection signal may select one or more unit widths of a register bank as a data element boundary for read or write operations.

Publication date: 19-12-2013

EFFICIENT ZERO-BASED DECOMPRESSION

Number: US20130339661A1
Assignee:

A processor core including a hardware decode unit to decode vector instructions for decompressing a run length encoded (RLE) set of source data elements and an execution unit to execute the decoded instructions. The execution unit generates a first mask by comparing the set of source data elements with a set of zeros and then counts the trailing zeros in the mask. A second mask is made based on the count of trailing zeros. The execution unit then copies the set of source data elements to a buffer using the second mask and then reads the number of RLE zeros from the set of source data elements. The buffer is shifted and copied to a result and the set of source data elements is shifted to the right. If more valid data elements are in the set of source data elements this is repeated until all valid data is processed. 1. A method for decompressing a run length encoded (RLE) set of source data elements in a computer processor that includes a vector execution unit, the method comprising: initializing a result and an insertion point variable; generating a first mask by comparing the set of source data elements to a set of zero elements comprising the same number of elements as the set of source data elements; generating a count of trailing zeros in the first mask; generating a second mask comprising a set of ones based on the count of trailing zeros; performing a masked copy of the set of source data elements into a temporary buffer using the second mask; reading a number of RLE zeros from a first data element in the set of source data elements, the first data element being indexed based on the count of trailing zeros; shifting the temporary buffer to the left based on the insertion point variable; performing a copy of the temporary buffer into the result variable; updating the insertion point variable; shifting the set of source data elements to the right based on the number of trailing zeros; determining whether the set of source data elements contains more valid input; in the ...
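The end-to-end effect of such zero-based RLE decompression can be shown with a scalar sketch. The encoding format here is an assumption of mine (the abstract does not pin it down exactly): nonzero values are literals, and a 0 element is followed by an element giving how many zeros to emit.

```python
def rle_zero_decompress(src):
    """Assumed format: nonzero elements are literals; a 0 element is
    followed by a count of zeros to emit."""
    out, i = [], 0
    while i < len(src):
        if src[i] != 0:
            out.append(src[i])
            i += 1
        else:
            out.extend([0] * src[i + 1])
            i += 2
    return out
```

The patented method reaches the same output vector-wise: masks locate the literal run, a masked copy moves it, and shifts advance the source.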

Publication date: 19-12-2013

METHOD AND APPARATUS FOR REDUCING AREA AND COMPLEXITY OF INSTRUCTION WAKEUP LOGIC IN A MULTI-STRAND OUT-OF-ORDER PROCESSOR

Number: US20130339679A1
Assignee: Intel Corporation

A computer system, a computer processor and a method executable on a computer processor involve placing each sequence of a plurality of sequences of computer instructions being scheduled for execution in the processor into a separate queue. The head instruction from each queue is stored into a first storage unit prior to determining whether the head instruction is ready for scheduling. For each instruction in the first storage unit that is determined to be ready, the instruction is moved from the first storage unit to a second storage unit. During a first processor cycle, each instruction in the first storage unit that is determined to be not ready is retained in the first storage unit, and the determining of whether the instruction is ready is repeated during the next processor cycle. Scheduling logic performs scheduling of instructions contained in the second storage unit. 1. A computer system that is configured to perform the following:placing each sequence of a plurality of sequences of computer instructions being scheduled for execution in a computer processor into a separate queue;storing a head instruction from each queue into a first storage unit prior to determining whether the head instruction is ready for scheduling;for each instruction in the first storage unit that is determined to be ready, moving the instruction from the first storage unit to a second storage unit;during a first processor cycle, for each instruction in the first storage unit that is determined to be not ready, retaining the instruction in the first storage unit and repeating the determining of whether the instruction is ready during the next processor cycle; andapplying scheduling logic to perform scheduling of instructions contained in the second storage unit.2. The computer system of claim 1 , wherein the processor is a multi-strand out-of-order processor configured to execute each sequence of the plurality of sequences as a separate strand.3. The computer system of claim 1 , ...

Publication date: 19-12-2013

METHODS TO OPTIMIZE A PROGRAM LOOP VIA VECTOR INSTRUCTIONS USING A SHUFFLE TABLE AND A MASK STORE TABLE

Number: US20130339682A1
Assignee:

According to one embodiment, a code optimizer is configured to receive first code having a program loop implemented with scalar instructions to store values of a first array to a second array based on values of a third array. The code optimizer is configured to generate second code representing the program loop with vector instructions including a shuffle instruction and a store instruction, the shuffle instruction to shuffle elements of the first array based on the third array using a shuffle table in a vector manner, and the store instruction to store the shuffled elements in the second array using a mask store table in a vector manner. 1. A computer-implemented method, comprising: receiving first code having a program loop implemented with scalar instructions to store values of a first array to a second array based on values of a third array; and generating second code representing the program loop with vector instructions, the second code including a shuffle instruction to shuffle elements of the first array based on the third array using a shuffle table in a vector manner, and a store instruction to store the shuffled elements of the first array in the second array using a mask store table in a vector manner. 2. The method of claim 1, wherein the second code further comprises instructions to compare elements of the third array with a predetermined threshold, generating a comparison result, and generate a mask based on the comparison result, the elements of the first array to be shuffled based on the mask. 3. The method of claim 2, wherein the second code further comprises an instruction to load elements of the shuffle table selected based on the mask, the elements of the first array to be shuffled via the shuffle instruction based on the selected elements of the shuffle table. 4. The method of claim 2, wherein the second code further comprises an instruction to load elements of the mask store table selected based on the mask, the shuffled ...
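The shuffle-table idea can be sketched in Python (names, the 4-lane chunk size, and the dict-based table are mine): for each chunk, a comparison mask indexes a precomputed table that packs the selected lanes to the front, which is how the scalar "if c[i] < t: out.append(a[i])" loop becomes branch-free vector code.

```python
# For each 4-bit comparison mask, the lane indices of the selected
# elements, packed to the front (a pshufb-style shuffle table).
SHUFFLE_TABLE = {m: [i for i in range(4) if m & (1 << i)] for m in range(16)}

def compress_store(values, cond, threshold):
    """Vector-style equivalent of the scalar loop, one table lookup
    per 4-element chunk."""
    out = []
    for base in range(0, len(values), 4):
        chunk = values[base:base + 4]
        cchunk = cond[base:base + 4]
        mask = sum(1 << i for i, c in enumerate(cchunk) if c < threshold)
        out.extend(chunk[i] for i in SHUFFLE_TABLE[mask])
    return out
```

The mask store table plays the matching role on the output side: it supplies the per-lane write mask so only the packed lanes are stored.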

Publication date: 02-01-2014

Vector multiplication with accumulation in large register space

Number: US20140006755A1
Assignee: Intel Corp

An apparatus is described having an instruction execution pipeline that has a vector functional unit to support a vector multiply add instruction. The vector multiply add instruction to multiply respective K bit elements of two vectors and accumulate a portion of each of their respective products with another respective input operand in an X bit accumulator, where X is greater than K.
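The point of the wider accumulator is that products of K-bit elements can be summed repeatedly without overflowing. A sketch (function name and the K=8, X=32 defaults are mine):

```python
def vector_multiply_accumulate(a, b, acc, k=8, x=32):
    """Multiply K-bit lanes pairwise and accumulate into X-bit
    accumulators; masking at X bits models the accumulator width."""
    mask = (1 << x) - 1
    assert all(0 <= v < (1 << k) for v in a + b), "operands must fit in K bits"
    return [(c + ai * bi) & mask for ai, bi, c in zip(a, b, acc)]
```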

Publication date: 09-01-2014

Systems, apparatuses, and methods for performing a horizontal add or subtract in response to a single instruction

Number: US20140013075A1
Assignee: Intel Corp

Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed horizontal add or subtract of packed data elements in response to a single vector packed horizontal add or subtract instruction that includes a destination vector register operand, a source vector register operand, and an opcode are described.
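A horizontal add sums adjacent pairs within each source rather than corresponding lanes across sources. A PHADD-style sketch (function name and result layout are my modelling choices):

```python
def packed_horizontal_add(src1, src2):
    """Sum adjacent pairs within each source; pairs from src1 fill
    the low half of the result, pairs from src2 the high half."""
    pairs = lambda v: [v[i] + v[i + 1] for i in range(0, len(v), 2)]
    return pairs(src1) + pairs(src2)
```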

Publication date: 09-01-2014

EFFICIENT HARDWARE INSTRUCTIONS FOR SINGLE INSTRUCTION MULTIPLE DATA PROCESSORS

Number: US20140013076A1
Assignee: ORACLE INTERNATIONAL CORPORATION

A method and apparatus for efficiently processing data in various formats in a single instruction multiple data (“SIMD”) architecture is presented. Specifically, a method to unpack fixed-width bit values in a bit stream to a fixed-width byte stream in a SIMD architecture is presented. A method to unpack variable-length byte-packed values in a byte stream in a SIMD architecture is presented. A method to decompress a run length encoded compressed bit-vector in a SIMD architecture is presented. A method to return the offset of each bit set to one in a bit-vector in a SIMD architecture is presented. A method to fetch bits from a bit-vector at specified offsets relative to a base in a SIMD architecture is presented. A method to compare values stored in two SIMD registers is presented. 1. A processor that, within the processor, loads values from a vector of values into a series of subregisters of a SIMD register: wherein values within the vector of values are contiguous; wherein each value in the vector of values is represented by a fixed number of bits; wherein the SIMD register has the series of subregisters, each of which has a number of bits that is greater than the fixed number of bits used to represent each value from the vector of values; wherein the processor is configured to respond to one or more instructions by loading each value, in the vector of values, into a separate subregister of the series of subregisters, and setting to zero all bits, in each subregister of the series of subregisters, other than bits storing values from the vector of values. 2. The processor of claim 1, wherein the processor is further configured to respond to the one or more instructions by: shifting each value, such that each value is byte aligned within each subregister of the series of subregisters. 3. The processor of claim 1, wherein each subregister, of the series of subregisters, is eight bits. 4.
The processor of claim 1 , wherein the one or more instructions ...
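The fixed-width unpack described above can be modelled end-to-end (function name and the LSB-first bit order are my assumptions): contiguous width-bit values are extracted from a byte stream, one value per output slot, upper bits zeroed.

```python
def unpack_fixed_width(data, width, count):
    """Unpack `count` contiguous `width`-bit values from a byte
    stream (LSB-first), one value per output element."""
    bits = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(bits >> (i * width)) & mask for i in range(count)]
```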

Publication date: 09-01-2014

RECONFIGURABLE DEVICE FOR REPOSITIONING DATA WITHIN A DATA WORD

Number: US20140013082A1
Assignee: Intel Corporation

Disclosed are a system, a device, and related methods for data manipulation, especially for SIMD operations such as permute, shift, and rotate. An apparatus includes a permute section that repositions data on sub-word boundaries and a shift section that repositions the data distances smaller than the sub-word width. The sub-word width is configurable and selectable, and the permute section and shift section may operate on different boundary widths. In a first stage, the permute section repositions the data at the nearest sub-word boundary and, in a second stage, the shift section repositions the data to its final desired position. The shift section includes multiple stages set in a logarithmic cascade relationship. Additionally, each shifter within each of the stages is highly connected, allowing fast and precise data movements. 1. An apparatus, comprising: an input for receiving data in a data word, the data word including a plurality of sub-words having a predetermined width, and for receiving a command to reposition the data within the data word; a permute section structured to reposition the data when the command is to reposition the data a distance of an integer multiple of the predetermined width; and a shift section structured to reposition the data when the command is to reposition the data a distance less than the predetermined width of the sub-word. 2. The apparatus of claim 1, in which the predetermined width of the sub-words is configurable. 3. The apparatus of claim 2, in which the input is structured to accept the predetermined width of the sub-words as an operating mode. 4. The apparatus of claim 1, wherein the permute section is additionally structured to reposition the data in a first action when the command is to reposition the data a distance greater than the predetermined width, and in which the shift section is structured to reposition the permuted data in a second action less than the predetermined width. 5.
The apparatus of claim 1 , ...

Publication date: 09-01-2014

Cache coprocessing unit

Number: US20140013083A1
Author: Ashish Jha
Assignee: Intel Corp

A cache coprocessing unit in a computing system includes a cache array to store data, a hardware decode unit to decode instructions that are offloaded from being executed by an execution cluster of the computing system to reduce load and store operations between the execution cluster and the cache coprocessing unit, and a set of one or more operation units to perform operations on the cache array according to the decoded instructions.

Publication date: 09-01-2014

Processor system with predicate register, computer system, method for managing predicates and computer program product

Number: US20140013087A1
Assignee: FREESCALE SEMICONDUCTOR INC

A processor system is adapted to carry out a predicate swap instruction of an instruction set to swap, via a data pathway, predicate data in a first predicate data location of a predicate register with data in a corresponding additional predicate data location of a first additional predicate data container and to swap, via a data pathway, predicate data in a second predicate storage location of the predicate register with data in a corresponding additional predicate data location in a second additional predicate data container.

Publication date: 16-01-2014

Systems, apparatuses, and methods for performing a double blocked sum of absolute differences

Number: US20140019713A1
Assignee: Intel Corp

Embodiments of systems, apparatuses, and methods for performing in a computer processor vector double block packed sum of absolute differences (SAD) in response to a single vector double block packed sum of absolute differences instruction that includes a destination vector register operand, first and second source operands, an immediate, and an opcode are described.
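The core operation behind the instruction is the sum of absolute differences between two blocks of elements; a minimal sketch (function name is mine, and the block pairing/immediate selection of the real instruction is omitted):

```python
def sad(block_a, block_b):
    """Sum of absolute differences of two equal-length blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))
```

The "double block" variant computes several such SADs at once over sub-blocks selected by the immediate operand.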

30-01-2014 publication date

APPARATUS AND METHOD FOR AN INSTRUCTION THAT DETERMINES WHETHER A VALUE IS WITHIN A RANGE

Number: US20140032877A1
Assignee:

A method is described that includes performing the following with a single instruction: receiving a first input operand V; receiving a second input operand S; calculating V−S; determining if V−S is positive or negative; and providing as a resultant: V if V−S is negative; V−S if V−S is positive. 1. An apparatus, comprising: a semiconductor chip having an instruction execution pipeline, said instruction execution pipeline having an execution unit with logic circuitry to perform the following for an instruction: accept first and second input operands that represent a candidate value and a range; present as a resultant of said instruction: an amount representative of the candidate value if the candidate value is within the range; an amount representative of how far the candidate value extends beyond the range if the candidate value is beyond the range. 2. The apparatus of further comprising encoding logic to encode said resultant into a bit mask. 3. The apparatus of wherein said encoding logic is part of said logic circuitry, said bit mask created as a second resultant of said instruction. 4. The apparatus of wherein said bit mask is stored to mask register space. 5. The apparatus of wherein said instruction execution pipeline can execute vector instructions. 6. The apparatus of wherein the instruction accepts a third input operand that represents the beginning of the range. 7. A method, comprising: performing the following with a single instruction: determining if a candidate value extends beyond a range; and providing: the candidate value if the candidate value does not extend beyond the range; a value that represents the extent beyond the range that the candidate value extends beyond the range, the value being calculated by the instruction. 8. The method of further comprising encoding a bit mask from said resultant. 9. The method of wherein said encoding is performed with said single instruction. 10. The method of further comprising storing said ...
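The single-instruction semantics in the abstract (return V when V−S is negative, otherwise return the overshoot V−S) can be written as a tiny Python function; the function name is illustrative:

```python
def range_check(v, s):
    """Semantics from the abstract: compute V - S, then return
    V when the difference is negative, otherwise V - S."""
    d = v - s
    return v if d < 0 else d

assert range_check(3, 10) == 3    # within range: the candidate value is returned
assert range_check(14, 10) == 4   # beyond range: the overshoot is returned
```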

06-02-2014 publication date

Programmable device for software defined radio terminal

Number: US20140040594A1

A programmable device suitable for a software defined radio terminal is disclosed. In one aspect, the device includes a scalar cluster providing a scalar data path and a scalar register file and arranged for executing scalar instructions. The device may further include at least two interconnected vector clusters connected with the scalar cluster. Each of the at least two vector clusters provides a vector data path and a vector register file and is arranged for executing at least one vector instruction different from vector instructions performed by any other vector cluster of the at least two vector clusters.

06-02-2014 publication date

Packed load/store with gather/scatter

Number: US20140040596A1
Assignee: International Business Machines Corp

Embodiments relate to packed loading and storing of data. An aspect includes a method for packed loading and storing of data distributed in a system that includes memory and a processing element. The method includes fetching and decoding an instruction for execution by the processing element. The processing element gathers a plurality of individually addressable data elements from non-contiguous locations in the memory which are narrower than a nominal width of register file elements in the processing element based on the instruction. The data elements are packed and loaded into register file elements of a register file entry by the processing element based on the instruction, such that at least two of the data elements gathered from the non-contiguous locations in the memory are packed and loaded into a single register file element of the register file entry.
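The gather-then-pack behavior described above can be sketched in Python: narrow elements are gathered from non-contiguous addresses, and several of them are packed into each (wider) register-file element. The dict-as-memory model and names are illustrative assumptions:

```python
def gather_pack(memory, addresses, elems_per_reg=2):
    """Gather narrow elements from non-contiguous addresses, then pack
    'elems_per_reg' of them into each register-file element."""
    gathered = [memory[a] for a in addresses]
    return [tuple(gathered[i:i + elems_per_reg])
            for i in range(0, len(gathered), elems_per_reg)]

mem = {100: 7, 204: 9, 308: 2, 416: 5}
# Four narrow elements land in two register-file elements.
assert gather_pack(mem, [100, 308, 204, 416]) == [(7, 2), (9, 5)]
```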

06-02-2014 publication date

Predication in a vector processor

Number: US20140040597A1
Assignee: International Business Machines Corp

Embodiments relate to vector processor predication in an active memory device. An aspect includes a system for vector processor predication in an active memory device. The system includes memory in the active memory device and a processing element in the active memory device. The processing element is configured to perform a method including decoding an instruction with a plurality of sub-instructions to execute in parallel. One or more mask bits are accessed from a vector mask register in the processing element. The one or more mask bits are applied by the processing element to predicate operation of a unit in the processing element associated with at least one of the sub-instructions.

06-02-2014 publication date

PACKED ROTATE PROCESSORS, METHODS, SYSTEMS, AND INSTRUCTIONS

Number: US20140040604A1
Assignee:

A method of an aspect includes receiving a masked packed rotate instruction. The instruction indicates a first source packed data including a plurality of packed data elements, a packed data operation mask having a plurality of mask elements, at least one rotation amount, and a destination storage location. A result packed data is stored in the destination storage location in response to the instruction. The result packed data includes result data elements that each correspond to a different one of the mask elements in a corresponding relative position. Result data elements that are not masked out by the corresponding mask element include one of the data elements of the first source packed data in a corresponding position that has been rotated. Result data elements that are masked out by the corresponding mask element include a masked out value. Other methods, apparatus, systems, and instructions are disclosed. 1. A method comprising: receiving a masked packed rotate instruction, the masked packed rotate instruction indicating a first source packed data including a plurality of packed data elements, indicating a packed data operation mask having a plurality of mask elements, indicating at least one rotation amount, and indicating a destination storage location; and storing a result packed data in the destination storage location in response to the masked packed rotate instruction, the result packed data including a plurality of result data elements that each correspond to a different one of the mask elements in a corresponding relative position, in which result data elements that are not masked out by the corresponding mask element include one of the data elements of the first source packed data in a corresponding position that has been rotated, and in which result data elements that are masked out by the corresponding mask element include a masked out value. 2. The method of claim 1, wherein storing comprises storing the result packed data in which the result data ...
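A Python sketch of the masked packed rotate semantics described above. Zeroing the masked-out lanes is an assumption here (the abstract only says a "masked out value", which could also be a merged old value); the element width and names are illustrative:

```python
def masked_packed_rotate(src, mask, amount, width=8, masked_out=0):
    """Per-element rotate-left under a mask: unmasked lanes receive the
    rotated element, masked-out lanes receive a fixed value (zeroing form)."""
    def rotl(x, n):
        n %= width
        return ((x << n) | (x >> (width - n))) & ((1 << width) - 1)
    return [rotl(e, amount) if m else masked_out
            for e, m in zip(src, mask)]

# Lane 1 is masked out, so it gets the masked-out value instead of a rotate.
assert masked_packed_rotate([0b10000001, 0xFF, 3], [1, 0, 1], 1) == [0b11, 0, 6]
```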

20-02-2014 publication date

Super multiply add (super madd) instruction

Number: US20140052968A1
Assignee: Intel Corp

A method of processing an instruction is described that includes fetching and decoding the instruction. The instruction has separate destination address, first operand source address and second operand source address components. The first operand source address identifies a location of a first mask pattern in mask register space. The second operand source address identifies a location of a second mask pattern in the mask register space. The method further includes fetching the first mask pattern from the mask register space; fetching the second mask pattern from the mask register space; merging the first and second mask patterns into a merged mask pattern; and, storing the merged mask pattern at a storage location identified by the destination address.

27-02-2014 publication date

Allocation of counters from a pool of counters to track mappings of logical registers to physical registers for mapper based instruction executions

Number: US20140059329A1
Assignee: International Business Machines Corp

A computer system assigns a particular counter from among a plurality of counters currently in a counter free pool to count a number of mappings of logical registers from among a plurality of logical registers to a particular physical register from among a plurality of physical registers, responsive to an execution of an instruction by a mapper unit mapping at least one logical register from among the plurality of logical registers to the particular physical register, wherein the number of the plurality of counters is less than a number of the plurality of physical registers. The computer system, responsive to the counted number of mappings of logical registers to the particular physical register decremented to less than a minimum value, returns the particular counter to the counter free pool.

06-03-2014 publication date

VECTOR INSTRUCTIONS TO ENABLE EFFICIENT SYNCHRONIZATION AND PARALLEL REDUCTION OPERATIONS

Number: US20140068226A1
Assignee:

In one embodiment, a processor may include a vector unit to perform operations on multiple data elements responsive to a single instruction, and a control unit coupled to the vector unit to provide the data elements to the vector unit, where the control unit is to enable an atomic vector operation to be performed on at least some of the data elements responsive to a first vector instruction to be executed under a first mask and a second vector instruction to be executed under a second mask. Other embodiments are described and claimed. 1. A processor comprising: a single instruction multiple data (SIMD) unit to perform operations on a plurality of data elements responsive to a single instruction; and a control unit coupled to the SIMD unit to provide the plurality of data elements to the SIMD unit, and to enable the SIMD unit to perform a SIMD instruction to generate a count of identical elements of a third vector having a third plurality of data elements, the count of identical elements corresponding to a population count of unique integer values in the third vector, and to store the population count of unique integer values in a data element of a destination storage corresponding to one of the third plurality of data elements having the unique integer value, and to further write an indicator of a mask to indicate each unique element, to compute a histogram. 2. The processor of claim 1, wherein the control unit is to enable an atomic SIMD operation to be performed on at least some of the plurality of data elements responsive to a first SIMD instruction to be executed under a first mask and a second SIMD instruction to be executed under a second mask, wherein the first SIMD instruction is to obtain the plurality of data elements from first memory locations and reserve the first memory locations, pursuant to an input mask corresponding to the first mask. 3. The processor of claim 2, wherein the second SIMD instruction is to store a second plurality of ...
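The histogram-building step in the claims — count identical lanes and mark the first lane of each distinct value in a mask — can be modeled scalar-wise in Python. This is a behavioral sketch only; the function name and return shape are assumptions:

```python
def vconflict_histogram(values):
    """For each lane, count how many lanes hold the same value; also mark
    the first lane of each distinct value in a mask.  Together these give
    a histogram of the vector in one pass."""
    counts = [values.count(v) for v in values]
    seen, unique_mask = set(), []
    for v in values:
        unique_mask.append(v not in seen)
        seen.add(v)
    return counts, unique_mask

counts, mask = vconflict_histogram([4, 1, 4, 4, 1, 7])
assert counts == [3, 2, 3, 3, 2, 1]
assert mask == [True, True, False, False, False, True]
```

Reading `counts` only at the masked (first-occurrence) lanes yields one bin per distinct value, which is exactly a histogram update without scalar fallback loops.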

13-03-2014 publication date

REDUCING ISSUE-TO-ISSUE LATENCY BY REVERSING PROCESSING ORDER IN HALF-PUMPED SIMD EXECUTION UNITS

Number: US20140075153A1

Techniques for reducing issue-to-issue latency by reversing processing order in half-pumped single instruction multiple data (SIMD) execution units are described. In one embodiment a processor functional unit is provided comprising a frontend unit, an execution core unit, a backend unit, an execution order control signal unit, a first interconnect coupled between an output and an input of the execution core unit, and a second interconnect coupled between an output of the backend unit and an input of the frontend unit. In operation, the execution order control signal unit generates a forwarding order control signal based on the parity of an applied clock signal on reception of a first vector instruction. This control signal is in turn used to selectively forward first and second portions of an execution result of the first vector instruction via the interconnects for use in the execution of a dependent second vector instruction. 1. A method comprising: receiving a plurality of instructions, wherein said plurality of instructions comprises a first vector instruction and a second vector instruction, and execution of said second vector instruction depends on an execution result of said first vector instruction; executing said first vector instruction utilizing a processor functional unit, wherein said processor functional unit comprises a pipelined execution unit; determining a forwarding order of a first portion of said execution result and a second portion of said execution result in dependence on a parity of a clock signal applied to said processor functional unit upon receipt of said first vector instruction; and forwarding said first portion of said execution result and said second portion of said execution result from an output coupled to said pipelined execution unit to an input coupled to said pipelined execution unit according to said forwarding order. 2. The method of claim 1, wherein said receiving comprises receiving a set of scalar instructions and a ...

27-03-2014 publication date

Processor having instruction set with user-defined non-linear functions for digital pre-distortion (DPD) and other non-linear applications

Number: US20140086361A1
Assignee: LSI Corp

A processor is provided having an instruction set with user-defined non-linear functions for digital pre-distortion (DPD) and other non-linear applications. A signal processing function, such as DPD, is implemented in software by obtaining at least one software instruction that performs at least one non-linear function for an input value, x, wherein the at least one non-linear function comprises at least one user-specified parameter; in response to at least one of the software instructions for at least one non-linear function having at least one user-specified parameter, performing the following steps: invoking at least one functional unit that implements the at least one software instruction to apply the non-linear function to the input value, x; and generating an output corresponding to the non-linear function for the input value, x. The user-specified parameter can optionally be loaded from memory into at least one register.

03-04-2014 publication date

Shift Significand of Decimal Floating Point Data

Number: US20140095563A1
Assignee: International Business Machines Corp

A decimal floating point finite number in a decimal floating point format is composed from the number in a different format. A decimal floating point format includes fields to hold information relating to the sign, exponent and significand of the decimal floating point finite number. Other decimal floating point data, including infinities and NaNs (not a number), are also composed. Decimal floating point data are also decomposed from the decimal floating point format to a different format. For composition and decomposition, one or more instructions may be employed, including a shift significand instruction.

03-04-2014 publication date

APPARATUS AND METHOD FOR EFFICIENT GATHER AND SCATTER OPERATIONS

Number: US20140095831A1
Assignee:

An apparatus and method are described for performing efficient gather operations in a pipelined processor. For example, a processor according to one embodiment of the invention comprises: gather setup logic to execute one or more gather setup operations in anticipation of one or more gather operations, the gather setup operations to determine one or more addresses of vector data elements to be gathered by the gather operations; and gather logic to execute the one or more gather operations to gather the vector data elements using the one or more addresses determined by the gather setup operations. 1. A processor comprising:gather setup logic to execute one or more gather setup operations in anticipation of one or more gather operations, the gather setup operations to compute a gather state to be used by subsequent gather operations; andgather logic to execute the one or more gather operations to gather vector data elements using the gather state computed by the gather setup operations.2. The processor as in wherein the gather setup operations comprise gather setup instructions and wherein the gather operations comprise gather instructions.3. The processor as in further comprising:a decoder to decode the gather setup instructions and gather instructions; andan execution unit to execute the gather setup instructions and gather instructions using the gather setup logic and gather logic, respectively, wherein the gather setup instructions calculate the addresses of data elements to be gathered when executed by the execution unit and wherein the addresses are provided to the decoder for use by the gather instructions during decoding.4. The processor as in further comprising:an index register to store an index value for each of the vector data elements to be gathered; anda base address register to store a base address for the vector data elements,wherein the gather setup logic is to determine the addresses of the vector data elements to be gathered by adding the index ...

03-04-2014 publication date

Systems, Apparatuses, and Methods for Performing Conflict Detection and Broadcasting Contents of a Register to Data Element Positions of Another Register

Number: US20140095843A1
Assignee: Intel Corp

Systems, apparatuses, and methods of performing in a computer processor broadcasting data in response to a single vector packed broadcasting instruction that includes a source writemask register operand, a destination vector register operand, and an opcode. In some embodiments, the data of the source writemask register is zero extended prior to broadcasting.

03-04-2014 publication date

Variable clocked serial array processor

Number: US20140095920A1
Author: Laurence H. Cooke
Assignee: Individual

A serial array processor may have an execution unit, which is comprised of a multiplicity of single bit arithmetic logic units (ALUs), and which may perform parallel operations on a subset of all the words in memory by serially accessing and processing them, one bit at a time, while an instruction unit of the processor is pre-fetching the next instruction, a word at a time, in a manner orthogonal to the execution unit.

10-04-2014 publication date

Byte selection and steering logic for combined byte shift and byte permute vector unit

Number: US20140101358A1
Assignee: International Business Machines Corp

Exemplary embodiments of the present invention disclose a method and system for executing data permute and data shift instructions. In a step, an exemplary embodiment encodes a control index value using the recoding logic into a 1-hot-of-n control for at least one of a plurality of datum positions in the one or more target registers. In another step, an exemplary embodiment conditions the 1-hot-of-n control by a gate-free logic configured for at least one of the plurality of datum positions in the one or more target registers for each of the data permute instructions and the at least one data shift instruction. In another step, an exemplary embodiment selects the 1-hot-of-n control or the conditioned 1-hot-of-n control based on a current instruction mode. In another step, an exemplary embodiment transforms the selected 1-hot-of-n control into a format applicable for the crossbar switch.

06-01-2022 publication date

MACHINE LEARNING ARCHITECTURE SUPPORT FOR BLOCK SPARSITY

Number: US20220004597A1
Author: Azizi Omid
Assignee:

This disclosure relates to matrix operation acceleration for different matrix sparsity patterns. A matrix operation accelerator may be designed to perform matrix operations more efficiently for a first matrix sparsity pattern than for a second matrix sparsity pattern. A matrix with the second sparsity pattern may be converted to a matrix with the first sparsity pattern and provided to the matrix operation accelerator. By rearranging the rows and/or columns of the matrix, the sparsity pattern of the matrix may be converted to a sparsity pattern that is suitable for computation with the matrix operation accelerator. 1. A system, comprising: matrix operation circuitry configured to perform a matrix operation on a first matrix having a first sparsity to generate a second matrix; and control circuitry communicatively coupled to the matrix operation circuitry and configured to: generate the first matrix via a transformation of a third matrix from a second sparsity to the first sparsity; in response to the matrix operation being performed on the first matrix to generate the second matrix, inversely transform the second matrix to generate a result matrix; and output the result matrix. 2. The system of claim 1, wherein the transformation of the third matrix comprises rearranging a plurality of columns of the third matrix, rearranging a plurality of rows of the third matrix, or rearranging the plurality of columns and the plurality of rows. 3. The system of claim 2, wherein rearrangement of the plurality of columns or rearrangement of the plurality of rows is based at least in part on a read order of the third matrix from memory. 4. The system of claim 3, wherein the read order comprises a coprime stride. 5. The system of claim 2, wherein rearrangement of the plurality of columns or rearrangement of the plurality of rows is based at least in part on a size of the third matrix or the second sparsity. 6. The system of claim 1, wherein inversely ...
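The transform / operate / inverse-transform flow described above can be sketched in Python. This is a deliberate simplification: only row permutation is shown, and an element-wise operation stands in for the accelerated matrix operation (for which the inverse permute recovers the original ordering); real matrix multiplication would also require handling the column permutation:

```python
def apply_with_permutation(matrix, row_perm, op):
    """Permute rows into a sparsity pattern the accelerator handles well,
    run the operation, then undo the permutation on the result."""
    permuted = [matrix[i] for i in row_perm]
    result = op(permuted)
    # Build the inverse permutation to restore the original row order.
    inverse = [0] * len(row_perm)
    for new_pos, old_pos in enumerate(row_perm):
        inverse[old_pos] = new_pos
    return [result[inverse[i]] for i in range(len(row_perm))]

# Element-wise doubling stands in for the accelerated operation.
m = [[0, 0], [1, 2], [3, 4]]
out = apply_with_permutation(m, [2, 0, 1],
                             lambda rows: [[2 * x for x in r] for r in rows])
assert out == [[0, 0], [2, 4], [6, 8]]
```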

05-01-2017 publication date

INSTRUCTION AND LOGIC TO PROVIDE VECTOR HORIZONTAL MAJORITY VOTING FUNCTIONALITY

Number: US20170003962A1
Assignee:

Instructions and logic provide vector horizontal majority voting functionality. Some embodiments, responsive to an instruction specifying: a destination operand, a size of the vector elements, a source operand, and a mask corresponding to a portion of the vector element data fields in the source operand; read a number of values from data fields of the specified size in the source operand, corresponding to the mask specified by the instruction, and store a result value to that number of corresponding data fields in the destination operand, the result value computed from the majority of values read from the number of data fields of the source operand. 1. A processor comprising: a vector register comprising a plurality of data fields to store values of vector elements; a decode stage to decode a first instruction specifying: a destination operand, a size of the vector elements, a portion of the data fields, and a source operand; and an execution unit, responsive to the decoded first instruction, to: read a number of values from data fields of the size of the vector elements in the source operand; and store a result value in the destination operand specified by the first instruction, wherein the result value is computed from most common values read from the number of the values from the data fields of the source operand. 2. The processor of claim 1, wherein the execution unit, responsive to the decoded first instruction, is to store the result value to corresponding data fields in the destination operand specified by the first instruction. 3. The processor of claim 1, wherein the first instruction specifies a mask identifying the portion of the data fields, and wherein the number of the values read from the data fields in the source operand corresponds to vector elements in the source operand unmasked by the mask specified by the first instruction. 4. The processor of claim 3, wherein the result value is computed as a bitwise majority value ...
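A bitwise reading of the majority vote (claim 4 mentions a "bitwise majority value") can be sketched in Python: each result bit is set when more than half of the unmasked lanes have it set. Names and the 8-bit default are illustrative assumptions:

```python
def horizontal_majority(elements, mask, bits=8):
    """Bitwise majority vote across the unmasked lanes: each result bit
    is 1 when it is set in more than half of the selected elements."""
    picked = [e for e, m in zip(elements, mask) if m]
    result = 0
    for b in range(bits):
        votes = sum((e >> b) & 1 for e in picked)
        if votes * 2 > len(picked):
            result |= 1 << b
    return result

# Two of the three unmasked lanes agree on every bit of 0b1010.
assert horizontal_majority([0b1010, 0b1010, 0b0101, 0b1111], [1, 1, 1, 0]) == 0b1010
```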

04-01-2018 publication date

Interruptible and restartable matrix multiplication instructions, processors, methods, and systems

Number: US20180004510A1
Assignee: Intel Corp

A processor of an aspect includes a decode unit to decode a matrix multiplication instruction. The matrix multiplication instruction is to indicate a first memory location of a first source matrix, is to indicate a second memory location of a second source matrix, and is to indicate a third memory location where a result matrix is to be stored. The processor also includes an execution unit coupled with the decode unit. The execution unit, in response to the matrix multiplication instruction, is to multiply a portion of the first and second source matrices prior to an interruption, and store a completion progress indicator in response to the interruption. The completion progress indicator to indicate an amount of progress in multiplying the first and second source matrices, and storing corresponding result data to the third memory location, that is to have been completed prior to the interruption.
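The interrupt-and-resume pattern in the abstract — store a completion-progress indicator on interruption, then restart from it — can be sketched in Python at row granularity. The granularity, names, and the explicit `interrupt_after` trigger are assumptions for illustration:

```python
def interruptible_matmul(a, b, c, progress=0, interrupt_after=None):
    """Multiply row blocks of a*b into c, one row per step.  On an
    interruption, return the completion-progress indicator so a later
    call can resume exactly where it left off."""
    n, k = len(a), len(b)
    for row in range(progress, n):
        if interrupt_after is not None and row >= interrupt_after:
            return row                      # progress indicator: rows completed
        c[row] = [sum(a[row][j] * b[j][col] for j in range(k))
                  for col in range(len(b[0]))]
    return n

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[0, 0], [0, 0]]
done = interruptible_matmul(a, b, c, interrupt_after=1)   # interrupted after row 0
done = interruptible_matmul(a, b, c, progress=done)       # resumed from the indicator
assert c == [[19, 22], [43, 50]]
```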

04-01-2018 publication date

APPARATUSES, METHODS, AND SYSTEMS FOR ELEMENT SORTING OF VECTORS

Number: US20180004513A1
Assignee:

Systems, methods, and apparatuses relating to element sorting of vectors are described. In one embodiment, a processor includes a decoder to decode an instruction into a decoded instruction; and an execution unit to execute the decoded instruction to: provide storage for a comparison matrix to store a comparison value for each element of an input vector compared against the other elements of the input vector; perform a comparison operation on elements of the input vector corresponding to storage of comparison values above a main diagonal of the comparison matrix; perform a different operation on elements of the input vector corresponding to storage of comparison values below the main diagonal of the comparison matrix; and store results of the comparison operation and the different operation in the comparison matrix. 1. A processor comprising: a decoder to decode an instruction into a decoded instruction; and an execution unit to execute the decoded instruction to: provide storage for a comparison matrix to store a comparison value for each element of an input vector compared against the other elements of the input vector; perform a comparison operation on elements of the input vector corresponding to storage of comparison values above a main diagonal of the comparison matrix; perform a different operation on elements of the input vector corresponding to storage of comparison values below the main diagonal of the comparison matrix; and store results of the comparison operation and the different operation in the comparison matrix. 2. The processor of claim 1, wherein the different operation is a different comparison operation than the comparison operation. 3. The processor of claim 2, wherein the comparison operation is one of a greater than or equal to operation and a greater than operation, and the different comparison operation is the other. 4. The processor of claim 1, wherein the different operation is an anti-symmetrical operation to be ...
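The comparison-matrix idea can be sketched in Python. Following the flavor of claim 3, "greater than" is used above the main diagonal and "greater than or equal" below it (one plausible pairing; the patent allows either assignment), which gives equal elements distinct, stable ranks; each row sum is then the element's sorted position:

```python
def ranks_via_comparison_matrix(v):
    """Build a comparison matrix: strictly-greater above the main
    diagonal, greater-or-equal below it, so ties break stably.
    Each row sum is the element's position in sorted order."""
    n = len(v)
    cmp = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < j:
                cmp[i][j] = int(v[i] > v[j])
            elif i > j:
                cmp[i][j] = int(v[i] >= v[j])
    return [sum(row) for row in cmp]

# The two equal 30s get distinct ranks (2 and 3), preserving input order.
assert ranks_via_comparison_matrix([30, 10, 30, 20]) == [2, 0, 3, 1]
```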

04-01-2018 publication date

Systems, Apparatuses, and Methods for Cumulative Summation

Number: US20180004514A1
Assignee:

Systems, methods, and apparatuses for executing an instruction are described. For example, an instruction includes at least an opcode, a field for a packed data source operand, and a field for a packed data destination operand. When executed, the instruction causes, for each data element position of the source operand, an addition to the value stored in that data element position of all values stored in preceding data element positions of the packed data source operand, and stores a result of the addition into a corresponding data element position of the packed data destination operand. 1. An apparatus comprising: a decoder circuit to decode an instruction, wherein the instruction is to include at least an opcode, a field for a packed data source operand, and a field for a packed data destination operand; and execution circuitry to execute the decoded instruction to, for each data element position of the source operand, add to a value stored in that data element position all values stored in preceding data element positions of the packed data source operand and store a result of the addition into a corresponding data element position of the packed data destination operand. 2. The apparatus of claim 1, wherein the source operand is a packed data register and the destination operand is a packed data register. 3. The apparatus of claim 2, wherein a single packed data register is used for the source and destination operands. 4. The apparatus of claim 1, wherein the data elements of the source operand are stored in a little endian format. 5. The apparatus of claim 1, wherein the data elements of the source operand are stored in a big endian format. 6. The apparatus of claim 1, wherein the instruction is to include a field for a writemask operand. 7. The apparatus of claim 7, wherein the execution circuitry is to store a result of the adds based on values of the writemask operand. 8. A method comprising: decoding an instruction, wherein the instruction is to include at least an opcode, a field for a packed ...
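The per-lane semantics above — each destination element equals its source element plus all preceding source elements — is an inclusive prefix sum, sketched here in Python (the function name is illustrative):

```python
def cumulative_sum(src):
    """Inclusive prefix sum: each destination element is its source
    element plus every preceding source element."""
    dst, running = [], 0
    for x in src:
        running += x
        dst.append(running)
    return dst

assert cumulative_sum([1, 2, 3, 4]) == [1, 3, 6, 10]
```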

04-01-2018 publication date

APPARATUS AND METHOD FOR PROPAGATING CONDITIONALLY EVALUATED VALUES IN SIMD/VECTOR EXECUTION USING AN INPUT MASK REGISTER

Number: US20180004517A1
Assignee:

An apparatus and method for propagating conditionally evaluated values are disclosed. For example, a method according to one embodiment comprises: reading each value contained in an input mask register, each value being a true value or a false value and having a bit position associated therewith; for each true value read from the input mask register, generating a first result containing the bit position of the true value; for each false value read from the input mask register following the first true value, adding the vector length of the input mask register to a bit position of the last true value read from the input mask register to generate a second result; and storing each of the first results and second results in bit positions of an output register corresponding to the bit positions read from the input mask register. 1. A method for propagating conditionally evaluated values comprising the operations of: reading each value contained in an input mask register, each value being a true value or a false value and having a bit position associated therewith; storing in an output vector register a first value related to the size of the vector register for all values in the input mask register corresponding to a true value until the first false value is encountered; storing in the output vector register, for the first false value and each subsequent false value, a second value equal to the bit positions of the false values; and once the first false value is encountered, storing in the vector output register for each true value the second value associated with the last encountered false value. The present patent application is a divisional application claiming priority from U.S. patent application Ser. No. 13/997,183, filed Jun. 21, 2013, entitled "Apparatus and Method for Propagating Conditionally Evaluated Values in SIMD/VECTOR Execution Using an Input Mask Register", which claims priority to International Application No. PCT/US2011/067094, filed Dec. 23, 2011, entitled ...

04-01-2018 publication date

Systems, Apparatuses, and Methods for Strided Load

Number: US20180004518A1
Assignee:

Systems, methods, and apparatuses for strided loads are described. In an embodiment, an instruction that includes at least an opcode, a field for at least two packed data source operands, a field for a packed data destination operand, and an immediate is designated as a strided load instruction. This instruction is executed to load packed data elements from the at least two packed data source operands using a stride and store the results of the strided loads in the packed data destination operand, starting from a defined position determined in part from the immediate. 1. An apparatus comprising: a decoder circuit to decode an instruction, wherein the instruction is to include at least an opcode, a field for at least two packed data source operands, a field for a packed data destination operand, and an immediate; and execution circuitry to execute the decoded instruction to load packed data elements from the at least two packed data source operands using a stride and store the results of the strided loads in the packed data destination operand starting from a defined position determined in part from the immediate. 2. The apparatus of claim 1, wherein the packed data source operands are concatenated. 3. The apparatus of claim 2, wherein the defined position is determined by a round value provided by the immediate, a number of packed data elements in each of the packed data source operands, an offset provided by the immediate, the number of packed data source operands, and the stride determined from the immediate. 4. The apparatus of claim 3, wherein the stride is a value from the immediate plus 1. 5. The apparatus of claim 3, wherein the defined position is the round multiplied by the number of packed data elements in each of the packed data source operands, multiplied by the number of packed data source operands, plus an offset, wherein this value is divided by the stride. 6. The apparatus of claim 1, wherein the instruction is to include a ...
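The strided-load behavior described above can be sketched in Python. This is a simplified software model, not the patented hardware: the two source operands are concatenated and every `stride`-th element, starting at a position derived from the immediate, is packed into the destination (the full position formula of claim 5 is reduced to a plain offset here).

```python
def strided_load(sources, stride, offset=0):
    """Concatenate the packed data source operands, then gather every
    `stride`-th element starting at `offset` into the destination."""
    concatenated = [e for src in sources for e in src]
    return concatenated[offset::stride]

# Two 4-element source operands, stride 2: gathers elements 0, 2, 4, 6.
dest = strided_load([[10, 11, 12, 13], [14, 15, 16, 17]], stride=2)
print(dest)  # [10, 12, 14, 16]
```

With `offset=1` the same call would instead gather the odd-indexed elements, which is how the immediate-derived starting position changes the result.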

More
04-01-2018 publication date

Systems, Apparatuses, and Methods for Cumulative Product

Number: US20180004519A1
Assignee: Intel Corp

Systems, methods, and apparatuses for executing an instruction are described. In some embodiments, the instruction includes at least an opcode, a field for a packed data source operand, and a field for a packed data destination operand. When executed, the instruction causes, for each data element position of the source operand, the value stored in that data element position to be multiplied by all values stored in preceding data element positions of the packed data source operand, and a result of the multiplication to be stored in the corresponding data element position of the packed data destination operand.
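The per-element semantics above amount to a running (cumulative) product, which can be sketched in a few lines of Python (the function name is illustrative):

```python
def cumulative_product(src):
    """Destination element i is the product of source elements 0..i."""
    dest, running = [], 1
    for value in src:
        running *= value
        dest.append(running)
    return dest

print(cumulative_product([2, 3, 4, 5]))  # [2, 6, 24, 120]
```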

More
04-01-2018 publication date

Sort acceleration processors, methods, systems, and instructions

Number: US20180004520A1
Author: Shay Gueron, Vlad KRASNOV
Assignee: Intel Corp

A processor of an aspect includes packed data registers, and a decode unit to decode an instruction. The instruction may indicate a first source packed data to include at least four data elements, to indicate a second source packed data to include at least four data elements, and to indicate a destination storage location. An execution unit is coupled with the packed data registers and the decode unit. The execution unit, in response to the instruction, is to store a result packed data in the destination storage location. The result packed data may include at least four indexes that may identify corresponding data element positions in the first and second source packed data. The indexes may be stored in positions in the result packed data that are to represent a sorted order of corresponding data elements in the first and second source packed data.
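The result described above — indexes into the combined sources, ordered so that they represent a sort of the corresponding elements — can be modeled in Python like this (a software sketch, with the two source operands treated as one concatenated sequence):

```python
def sort_indexes(src1, src2):
    """Return indexes into the concatenation of src1 and src2, stored
    in positions that represent the sorted order of the elements."""
    merged = src1 + src2
    return sorted(range(len(merged)), key=lambda i: merged[i])

idx = sort_indexes([30, 10], [40, 20])
print(idx)  # [1, 3, 0, 2] -> elements 10, 20, 30, 40 in sorted order
```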

More
04-01-2018 publication date

Split control stack and data stack platform

Number: US20180004531A1
Assignee: Microsoft Technology Licensing LLC

In one example, a method includes allocating separate portions of memory for a control stack and a data stack. The method also includes, upon detecting a call instruction, storing a first return address in the control stack and a second return address in the data stack; and upon detecting a return instruction, popping the first return address from the control stack and the second return address from the data stack and raising an exception if the two return addresses do not match. Otherwise, the return instruction returns the first return address. Additionally, the method includes executing an exception handler in response to the return instruction detecting an exception, wherein the exception handler is to pop one or more return addresses from the control stack until the return address on a top of the control stack matches the return address on a top of the data stack.
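The call/return check described above can be sketched as a small state machine in Python. This is an illustrative model (class and exception names are invented): a call pushes the return address to both stacks, and a return raises when the two popped addresses disagree, e.g. because the data stack was overwritten.

```python
class ShadowStackError(Exception):
    pass

class SplitStackMachine:
    """Model of the split control/data stack mismatch check."""
    def __init__(self):
        self.control_stack = []
        self.data_stack = []

    def call(self, return_address):
        # A call stores the return address in both stacks.
        self.control_stack.append(return_address)
        self.data_stack.append(return_address)

    def ret(self):
        # A return pops both stacks and compares the addresses.
        a = self.control_stack.pop()
        b = self.data_stack.pop()
        if a != b:
            raise ShadowStackError("return address mismatch")
        return a

m = SplitStackMachine()
m.call(0x400120)
m.data_stack[-1] = 0xDEADBEEF   # simulate a stack-smashing overwrite
try:
    m.ret()
except ShadowStackError as e:
    print("exception:", e)       # exception: return address mismatch
```

The exception-handler behavior from the method (popping the control stack until the tops match) would be a natural extension of `ret`, omitted here for brevity.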

More
07-01-2021 publication date

APPARATUS AND METHOD FOR PERFORMING DUAL SIGNED AND UNSIGNED MULTIPLICATION OF PACKED DATA ELEMENTS

Number: US20210004227A1
Assignee: Intel Corporation

An apparatus and method for performing dual concurrent multiplications of packed data elements. For example one embodiment of a processor comprises: a decoder to decode a first instruction to generate a decoded instruction; a first source register to store a first plurality of packed byte data elements; a second source register to store a second plurality of packed byte data elements; execution circuitry to execute the decoded instruction, the execution circuitry comprising: multiplier circuitry to concurrently multiply each of the packed byte data elements of the first plurality with a corresponding packed byte data element of the second plurality to generate a plurality of products; adder circuitry to add specified sets of the products to generate temporary results for each set of products; zero-extension or sign-extension circuitry to zero-extend or sign-extend the temporary result for each set to generate an extended temporary result for each set; accumulation circuitry to combine each of the extended temporary results with a selected packed data value stored in a third source register to generate a plurality of final results; and a destination register to store the plurality of final results as a plurality of packed data elements in specified data element positions. 1. A processor comprising:a decoder to decode an instruction to generate a decoded instruction, the instruction having fields for operand identifiers identifying a first packed data source operand, a second packed data source operand, and a third packed data source operand;execution circuitry to execute the decoded instruction to:multiply each of a first plurality of packed data elements of the first packed data source operand with a corresponding packed data element of a second plurality of the second packed data source operand to generate a plurality of products;add specified sets of the products to generate a temporary result for each set of products;zero-extend or sign-extend the temporary ...
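The multiply / add-sets / accumulate pipeline described above can be modeled in Python. This is a simplified sketch of the dot-product-style dataflow (sign/zero extension is implicit in Python's arbitrary-precision integers, and `set_size` is an assumed parameter for the grouping of products):

```python
def dual_multiply_accumulate(src1, src2, acc, set_size=2):
    """Multiply corresponding packed elements, sum each group of
    `set_size` products, and accumulate each sum into the matching
    element of the third source operand `acc`."""
    products = [a * b for a, b in zip(src1, src2)]
    results = []
    for i, lane in enumerate(range(0, len(products), set_size)):
        results.append(acc[i] + sum(products[lane:lane + set_size]))
    return results

print(dual_multiply_accumulate([1, 2, 3, 4], [5, 6, 7, 8], [100, 200]))
# [117, 253]  i.e. [100 + 1*5 + 2*6, 200 + 3*7 + 4*8]
```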

More
07-01-2021 publication date

REDUCING LATENCY OF COMMON SOURCE DATA MOVEMENT INSTRUCTIONS

Number: US20210004228A1
Assignee:

A move data instruction to move data from one location to another location is obtained. Based on obtaining the move data instruction, a determination is made as to whether the data to be moved is located in a buffer. The buffer is configured to maintain the data for use by multiple move data instructions. The buffer is used to move the data from the one location to the other location, based on determining that the data to be moved is in the buffer. 1. A computer program product for facilitating processing within a computing environment, the computer program product comprising: at least one computer readable storage medium readable by at least one processing circuit and storing instructions for performing a method comprising: obtaining a move data instruction, the move data instruction to move data from one location to another location; determining, based on obtaining the move data instruction, whether the data to be moved is already located in a buffer based on a previous move data instruction, the buffer configured to maintain the data for use by multiple move data instructions, wherein the data in the buffer is the same data to be moved for the previous move data instruction and the move data instruction; and using the buffer to move the data from the one location to the other location, based on determining that the data to be moved is in the buffer, wherein the data is not re-written to the buffer. 2. The computer program product of claim 1, wherein the method further comprises setting a pointer to the buffer to use the buffer to move the data from the one location to the other location, based on determining that the data to be moved is in the buffer. 3. The computer program product of claim 1, wherein the using comprises copying the data from the buffer to the other location at a select time to complete the move data instruction. 4. The computer program product of claim 1, wherein the buffer comprises a valid indicator, the valid ...
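The buffered-move idea above can be sketched in Python as a tiny move engine: the first move of a source range fills the buffer, and later moves of the same source are served from the buffer without re-reading (class and attribute names are illustrative).

```python
class MoveEngine:
    """Model of the common-source move buffer: repeated moves of the
    same source range skip the source re-read."""
    def __init__(self, memory):
        self.memory = memory
        self.buffer = None        # (source, length, data)
        self.buffer_hits = 0

    def move(self, src, dst, length):
        if self.buffer and self.buffer[:2] == (src, length):
            data = self.buffer[2]         # reuse; source is not re-read
            self.buffer_hits += 1
        else:
            data = self.memory[src:src + length]
            self.buffer = (src, length, data)
        self.memory[dst:dst + length] = data

mem = list(range(16))
eng = MoveEngine(mem)
eng.move(0, 8, 4)    # first move fills the buffer
eng.move(0, 12, 4)   # same source data: served from the buffer
print(eng.buffer_hits, mem[8:16])  # 1 [0, 1, 2, 3, 0, 1, 2, 3]
```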

More
07-01-2021 publication date

OPEN CHANNEL VECTOR COMMAND EXECUTION

Number: US20210004229A1
Author: Benisty Shay
Assignee:

A method and apparatus that provide a solid state drive controller configured to analyze input/output commands from a host computing device to determine whether the commands include a vector or non-vector command, and is configured to generate a plurality of non-vector commands based on the physical addresses contained in the vector command. 1. A device , comprising:a memory; and identify a set of physical addresses in a vector command fetched from a host computing device, wherein each physical address of the set of physical addresses corresponds to a location in the memory;', 'generate a set of non-vector commands, wherein each non-vector command in the set of non-vector commands corresponds to one of each physical address in the set of physical addresses;', 'execute the set of non-vector commands; and', 'generate a message to the host computing device indicating completion of the vector command in response to the execution of the set of non-vector commands., 'a controller communicatively coupled to the memory, wherein the controller is configured to2. The device of claim 1 , wherein the controller is further configured to:fetch a command from the host computing device; anddetermine the fetched command is a vector command, wherein the determination is based on whether the fetched command comprises a field configured to communicate a list of scatter-gather logical block addresses.3. The device of claim 1 , wherein the controller is further configured to fetch meta-data from the host computing device claim 1 , wherein the meta-data comprises a list of physical addresses.4. The device of claim 3 , wherein the vector command comprises a pointer configured to identify the set of physical addresses in the list of physical addresses in the meta-data.5. The device of claim 1 , wherein the set of physical addresses comprises two or more physical addresses.6. 
The device of claim 1, further comprising a command register containing a plurality of 1-bit slots, wherein ...
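The expansion step described above — turning one vector command carrying a list of physical addresses into one non-vector command per address — can be sketched in Python (the dictionary fields are illustrative, not the actual NVMe command layout):

```python
def expand_vector_command(physical_addresses, opcode="read"):
    """Generate one non-vector command per physical address found in
    the vector command's scatter-gather address list."""
    return [{"opcode": opcode, "address": addr} for addr in physical_addresses]

cmds = expand_vector_command([0x1000, 0x2000, 0x3000])
print(len(cmds), cmds[0])  # 3 {'opcode': 'read', 'address': 4096}
```

After all generated commands complete, the controller would post a single completion message for the original vector command.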

More
02-01-2020 publication date

ACCELERATOR APPARATUS AND METHOD FOR DECODING AND DE-SERIALIZING BIT-PACKED DATA

Number: US20200004535A1
Assignee:

An apparatus and method for loading and storing multiple sets of packed data elements. For example, one embodiment of a processor comprises: a decoder to decode a multiple load instruction to generate a decoded multiple load instruction comprising a plurality of operations, the multiple load instruction including an opcode, source operands, and at least one destination operand; a first source register to store N packed index values; a second source register to store a base address value; execution circuitry to execute the operations of the decoded multiple load instruction, the execution circuitry comprising: parallel address generation circuitry to combine the base address from the second source register with each of the N packed index values to generate N system memory addresses; data load circuitry to cause N sets of data elements to be retrieved from the N system memory addresses, the data load circuitry to store the N sets of data elements in N vector destination registers identified by the at least one destination operand. 1. A processor comprising:a decoder to decode a multiple load instruction to generate a decoded multiple load instruction comprising a plurality of operations, the multiple load instruction including an opcode, source operands, and at least one destination operand;a first source register to store N packed index values;a second source register to store a base address value; parallel address generation circuitry to combine the base address from the second source register with each of the N packed index values to generate N system memory addresses;', 'data load circuitry to cause N sets of data elements to be retrieved from the N system memory addresses, the data load circuitry to store the N sets of data elements in N vector destination registers identified by the at least one destination operand., 'execution circuitry to execute the operations of the decoded multiple load instruction, the execution circuitry comprising2. The processor of ...
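The address-generation-plus-gather step above can be modeled in a few lines of Python: combine the base address with each packed index, then load one element per generated address (a software sketch, with memory modeled as a dict):

```python
def multiple_load(memory, base, indexes):
    """Combine the base address with each of the N packed index values
    to form N addresses, then gather one element per address."""
    addresses = [base + i for i in indexes]
    return [memory[a] for a in addresses]

mem = {100: 'a', 105: 'b', 110: 'c'}
print(multiple_load(mem, 100, [0, 5, 10]))  # ['a', 'b', 'c']
```

In the described hardware the N gathered sets land in N separate destination vector registers; the returned list stands in for those destinations here.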

More
07-01-2021 publication date

PROCESSOR WITH TABLE LOOKUP UNIT

Number: US20210004349A1
Assignee:

A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core is configured to retrieve an instruction stream from program storage, and pass vector instructions in the instruction stream to the vector coprocessor core. The vector coprocessor core includes a register file, a plurality of execution units, and a table lookup unit. The register file includes a plurality of registers. The execution units are arranged in parallel to process a plurality of data values. The execution units are coupled to the register file. The table lookup unit is coupled to the register file in parallel with the execution units. The table lookup unit is configured to retrieve table values from one or more lookup tables stored in memory by executing table lookup vector instructions in a table lookup loop. 1. A device comprising: a set of address generators; and', receive an instruction to retrieve a set of table values from a set of tables stored in the memory, wherein the instruction includes a first field that specifies a first address generator of the set of address generators that stores a table offset associated with a first table of the set of tables; and', 'in response to the instruction, read the set of table values from the memory., 'a table lookup unit configured to], 'a vector core configured to couple to a memory, wherein the vector core includes2. The device of claim 1 , wherein the instruction includes a second field that specifies a number of the set of tables from which the set of table values are to be read.3. The device of claim 1 , wherein the instruction includes a second field that specifies a number of table values to be read from each table of the set of tables.4. The device of claim 1 , wherein the instruction includes a second field that specifies an offset of a first table value of the set of table values into the first table.5. The device of claim 1 , wherein the table lookup unit is ...
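The instruction fields described in the claims — a table selector, a value offset, and a count of values to read — can be sketched as a plain Python lookup (parameter names mirror the claim language but are otherwise illustrative):

```python
def table_lookup(tables, table_offset, value_offset, count):
    """Read `count` values starting at `value_offset` from the table
    selected by `table_offset`."""
    table = tables[table_offset]
    return table[value_offset:value_offset + count]

sin_table = [0, 7, 14, 20]
cos_table = [100, 99, 98, 97]
print(table_lookup([sin_table, cos_table], 1, 1, 2))  # [99, 98]
```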

More
02-01-2020 publication date

SYSTEM AND METHOD FOR AUTOMATED MULTI-DIMENSIONAL NETWORK MANAGEMENT

Number: US20200004767A1
Assignee:

Systems, methods, and devices for automated provisioning are disclosed herein. The system can include a memory including a user profile database having n-dimension attributes of a user. The system can include a user device and a source device. The system can include a server that can: generate and store a user profile in the user profile database and generate and store a characterization vector from the user profile. The server can identify a service for provisioning, receive updates to at least some of the attributes of the first user, and trigger regeneration of the characterization vector from the received inputs. The server can: regenerate the characterization vector, determine an efficacy of the provisioned services, and automatically identify a second service for provisioning for a second user based on the efficacy of the provisioned services to the first user. 1. (canceled)2. An automated multi-dimensional network management system comprising:a memory comprising: an electronic health records (EHR) database; and a network database comprising a plurality of nodes linked by a plurality of edges, at least some of the nodes corresponding to a user state, a user characteristic, and a remediation; and identify, via a machine-learning model, a first remediation to mitigate a likelihood of an adverse outcome identified in a risk profile based on the user state of a first user;', 'identify a data insufficiency based on missing data in the EHR database, wherein the data insufficiency prevents identification of a remediation;', 'select a medical service comprising a digital component and a non-digital component for provisioning to the first user, the medical service is selected to generate data to remedy the data insufficiency;', 'resolve the data insufficiency via provisioning of the selected medical service and receipt of electronic data generated from the provisioned medical service;', 'upon resolution of the data insufficiency, identify a second remediation to ...

More
03-01-2019 publication date

COMPILER FOR TRANSLATING BETWEEN A VIRTUAL IMAGE PROCESSOR INSTRUCTION SET ARCHITECTURE (ISA) AND TARGET HARDWARE HAVING A TWO-DIMENSIONAL SHIFT ARRAY STRUCTURE

Number: US20190004777A1
Author: Meixner Albert
Assignee:

A method is described that includes translating higher level program code including higher level instructions having an instruction format that identifies pixels to be accessed from a memory with first and second coordinates from an orthogonal coordinate system into lower level instructions that target a hardware architecture having an array of execution lanes and a shift register array structure that is able to shift data along two different axis. The translating includes replacing the higher level instructions having the instruction format with lower level shift instructions that shift data within the shift register array structure. 1. A computer-implemented method comprising:receiving a first sequence of instructions of a first instruction set architecture, wherein the first sequence of instructions (i) defines an image processing algorithm and (ii) includes one or more load instructions and one or more arithmetic instructions, wherein each load instruction specifies, using a two-dimensional address, a position of data within a region of image data;receiving, by a compiler, a request to translate the first sequence of instructions into instructions of a second instruction set architecture, wherein instructions in the second instruction set architecture are executable by an image processer comprising a two-dimensional array of processing elements and a two-dimensional shift-register array, wherein each shift-register in the two-dimensional shift-register array is dedicated to a respective processing element in the two-dimensional array of processing elements; andin response, translating the one or more load instructions in the first instruction set architecture into one or more shift instructions of the second instruction set architecture, wherein each shift instruction is operable to cause the image processor to shift data at the position within the region of image data specified by the two-dimensional address of a corresponding load instruction from a source ...
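The core translation — replacing a two-dimensional load offset with shift instructions for the shift-register array — can be sketched in Python. The opcode names here are invented for illustration; the real lowering also coalesces and schedules shifts.

```python
def load_to_shifts(dx, dy):
    """Translate a 2-D load at offset (dx, dy) into one shift
    instruction per unit step along each axis."""
    ops = []
    ops += ["SHIFT_RIGHT" if dx > 0 else "SHIFT_LEFT"] * abs(dx)
    ops += ["SHIFT_DOWN" if dy > 0 else "SHIFT_UP"] * abs(dy)
    return ops

print(load_to_shifts(2, -1))  # ['SHIFT_RIGHT', 'SHIFT_RIGHT', 'SHIFT_UP']
```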

More
07-01-2021 publication date

CONVOLUTIONAL NEURAL NETWORK ON PROGRAMMABLE TWO DIMENSIONAL IMAGE PROCESSOR

Number: US20210004633A1
Assignee:

A method is described that includes executing a convolutional neural network layer on an image processor having an array of execution lanes and a two-dimensional shift register. The two-dimensional shift register provides local respective register space for the execution lanes. The executing of the convolutional neural network includes loading a plane of image data of a three-dimensional block of image data into the two-dimensional shift register. The executing of the convolutional neural network also includes performing a two-dimensional convolution of the plane of image data with an array of coefficient values by sequentially: concurrently multiplying within the execution lanes respective pixel and coefficient values to produce an array of partial products; concurrently summing within the execution lanes the partial products with respective accumulations of partial products being kept within the two dimensional register for different stencils within the image data; and, effecting alignment of values for the two-dimensional convolution within the execution lanes by shifting content within the two-dimensional shift register array. 1. A method , comprising: loading different respective portions of an image having a plurality of image planes comprising pixel values into each two-dimensional shift register array of each of the plurality of stencil processors;', 'loading, into each two-dimensional shift register array of each of the plurality of stencil processors, respective coefficient sets of the plurality of coefficient sets that correspond to a respective portion of the image loaded into each two-dimensional shift-register array; and', concurrently multiplying within the execution lanes respective pixel values and coefficient values of the coefficient sets loaded into the stencil processor to produce an array of partial products;', 'concurrently summing within the execution lanes the partial products with respective accumulations of partial products being kept ...
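The per-stencil multiply/accumulate that the execution lanes perform concurrently can be written as explicit loops in Python. This sketch computes a plain 2-D "valid" convolution; the hardware achieves the same result by shifting the image plane through the two-dimensional shift register instead of re-indexing memory.

```python
def conv2d_valid(image, kernel):
    """2-D convolution over all fully-overlapping stencil positions."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            # Multiply each pixel of the stencil by its coefficient
            # and accumulate the partial products.
            out[y][x] = sum(image[y + i][x + j] * kernel[i][j]
                            for i in range(kh) for j in range(kw))
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]   # sums each pixel with its lower-right neighbor
print(conv2d_valid(image, kernel))  # [[6, 8], [12, 14]]
```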

More
03-01-2019 publication date

Instructions for vector operations with constant values

Number: US20190004801A1
Assignee: Intel Corp

Disclosed embodiments relate to instructions for vector operations with immediate values. In one example, a system includes a memory and a processor that includes fetch circuitry to fetch the instruction from a code storage, the instruction including an opcode, a destination identifier to specify a destination vector register, a first immediate, and a write mask identifier to specify a write mask register, the write mask register including at least one bit corresponding to each destination vector register element, the at least one bit to specify whether the destination vector register element is masked or unmasked; decode circuitry to decode the fetched instruction; and execution circuitry to execute the decoded instruction to use the write mask register to determine unmasked elements of the destination vector register and, when the opcode specifies to broadcast, broadcast the first immediate to one or more unmasked vector elements of the destination vector register.
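The broadcast-under-mask behavior can be sketched in Python: the immediate lands in every unmasked element, while masked elements keep their previous value (merge-masking, modeled here as a simple list comprehension):

```python
def masked_broadcast(dest, write_mask, immediate):
    """Broadcast `immediate` into unmasked destination elements;
    masked elements retain their prior contents."""
    return [immediate if write_mask[i] else dest[i]
            for i in range(len(dest))]

print(masked_broadcast([9, 9, 9, 9], [1, 0, 1, 0], 7))  # [7, 9, 7, 9]
```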

More
03-01-2019 publication date

Stream processor with overlapping execution

Number: US20190004807A1
Assignee: Advanced Micro Devices Inc

Systems, apparatuses, and methods for implementing a stream processor with overlapping execution are disclosed. In one embodiment, a system includes at least a parallel processing unit with a plurality of execution pipelines. The processing throughput of the parallel processing unit is increased by overlapping execution of multi-pass instructions with single pass instructions without increasing the instruction issue rate. A first plurality of operands of a first vector instruction are read from a shared vector register file in a single clock cycle and stored in temporary storage. The first plurality of operands are accessed and utilized to initiate multiple instructions on individual vector elements on a first execution pipeline in subsequent clock cycles. A second plurality of operands are read from the shared vector register file during the subsequent clock cycles to initiate execution of one or more second vector instructions on the second execution pipeline.

More
02-01-2020 publication date

PERSONALIZATION ENHANCED RECOMMENDATION MODELS

Number: US20200005196A1
Assignee:

Methods, systems, apparatuses, and computer program products are provided for a two-phase technique for generating content recommendations. In a first phase, a baseline recommender is configured to generate a baseline content recommendation using one or more content recommendation models, such as a Smart Adaptive Recommendations (SAR) model, Factorization Machine (FM) or Matrix Factorization (MF) models, collaborative filtering models, and/or any other machine-learning models or techniques. In a second phase, a personalized recommender implements a vector combiner configured to combine profile vectors, content vectors, and the baseline content recommendations to generate combined user vectors. A model generator may train a machine-learning model using the combined user vectors and training data comprising actual interaction behavior of the users, which may be then applied to identify a content recommendation for a particular user. 1. A system for generating a machine-learning model for providing a content recommendation , the system comprising:at least one processor; and a baseline recommender configured to generate baseline content recommendations using historical user-content interactions by a plurality of users;', a profile vector generator configured to generate profile vectors corresponding to the users based on user profile information of the users;', 'a content vector generator configured to generate content vectors corresponding to the users based on content interaction data of the users;', 'a vector combiner configured to generate user vectors for the users, each user vector generated for a user including a baseline content recommendation of the baseline content recommendations, a profile vector of the profile vectors, and a content vector of the content vectors corresponding to the user; and', retrieve interaction training data corresponding to tracked interactions, by the users, with content of a plurality of content types; and', 'generate a ...

More
01-01-2015 publication date

APPARATUS AND METHOD TO REVERSE AND PERMUTE BITS IN A MASK REGISTER

Number: US20150006847A1
Assignee:

An apparatus and method are described for performing a bit reversal and permutation on mask values. For example, a processor is described to execute an instruction to perform the operations of: reading a plurality of mask bits stored in a source mask register, the mask bits associated with vector data elements of a vector register; and performing a bit reversal operation to copy each mask bit from a source mask register to a destination mask register, wherein the bit reversal operation causes bits from the source mask register to be reversed within the destination mask register resulting in a symmetric, mirror image of the original bit arrangement. 1. A processor to execute an instruction to perform the operations of:reading a plurality of mask bits stored in a source mask register, the mask bits associated with vector data elements of a vector register; andperforming a bit reversal operation to copy each mask bit from a source mask register to a destination mask register, wherein the bit reversal operation causes bits from the source mask register to be reversed within the destination mask register resulting in a symmetric, mirror image of the original bit arrangement.2. The processor as in wherein the source and destination mask registers store 32 bits of mask data.3. The processor as in wherein the source and destination mask registers store 64 bits of mask data.4. The processor as in wherein the instruction comprises a macroinstruction and wherein the operations comprise microoperations.5. 
A processor to execute an instruction to perform the operations of:reading a plurality of mask bits stored in a first source register and control data stored in a second source register, the mask bits associated with vector data elements of a vector register; andperforming a mask bit permute operation to copy each mask bit from the source mask register to a destination mask register, wherein the control data stored in the second source register specifies a bit from the first ...
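The bit-reversal operation described above — destination bit i receives source bit width-1-i, producing a mirror image of the mask — can be sketched in Python:

```python
def reverse_mask_bits(mask, width):
    """Mirror a mask: destination bit i gets source bit (width-1-i)."""
    result = 0
    for i in range(width):
        if mask & (1 << i):
            result |= 1 << (width - 1 - i)
    return result

print(format(reverse_mask_bits(0b00001101, 8), '08b'))  # 10110000
```

The permute variant of the instruction generalizes this: instead of the fixed mapping i -> width-1-i, a second source register supplies the source bit index for each destination bit.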

More
01-01-2015 publication date

Rotate then operate on selected bits facility and instructions therefor

Number: US20150006860A1
Assignee: International Business Machines Corp

A rotate then operate instruction having a T bit is fetched and executed wherein a first operand in a first register is rotated by an amount and a Boolean operation is performed on a selected portion of the rotated first operand and a second operand in of a second register. If the T bit is ‘0’ the selected portion of the result of the Boolean operation is inserted into corresponding bits of a second operand of a second register. If the T bit is ‘1’, in addition to the inserted bits, the bits other than the selected portion of the rotated first operand are saved in the second register.
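The T=0 behavior above — rotate the first operand, apply a Boolean operation against the second operand, and insert only the selected result bits — can be sketched in Python. The Boolean operation is fixed to AND here for illustration, and the selected portion is expressed as a bit mask:

```python
def rotate_then_insert(first, second, rotate_by, mask, width=64):
    """Rotate `first` left by `rotate_by`, AND with `second`, then
    insert the mask-selected result bits into `second` (T = 0 case)."""
    rotated = ((first << rotate_by) | (first >> (width - rotate_by))) \
              & ((1 << width) - 1)
    boolean = rotated & second           # the Boolean operation (AND here)
    return (second & ~mask) | (boolean & mask)

# Rotate by 8; only the low byte of the AND result is inserted,
# so the upper bits of `second` are preserved.
print(hex(rotate_then_insert(0x01, 0xFFFF, 8, 0xFF)))  # 0xff00
```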

More
02-01-2020 publication date

Software reconfigurable digital phase lock loop architecture

Number: US20200007134A1
Assignee: Texas Instruments Inc

A novel and useful apparatus for and method of software based phase locked loop (PLL). The software based PLL incorporates a reconfigurable calculation unit (RCU) that is optimized and programmed to sequentially perform all the atomic operations of a PLL or any other desired task in a time sharing manner. An application specific instruction-set processor (ASIP) incorporating the RCU includes an instruction set whose instructions are optimized to perform the atomic operations of a PLL. The RCU is clocked at a fast enough processor clock rate to insure that all PLL atomic operations are performed within a single PLL reference clock cycle.

More
20-01-2022 publication date

INSTRUCTIONS AND LOGIC TO PERFORM FLOATING POINT AND INTEGER OPERATIONS FOR MACHINE LEARNING

Number: US20220019431A1
Assignee: Intel Corporation

A processing apparatus is provided comprising a multiprocessor having a multithreaded architecture. The multiprocessor can execute at least one single instruction to perform parallel mixed precision matrix operations. In one embodiment the apparatus includes a memory interface and an array of multiprocessors coupled to the memory interface. At least one multiprocessor in the array of multiprocessors is configured to execute a fused multiply-add instruction in parallel across multiple threads. 1. A graphics processing unit (GPU) comprising:a plurality of memory controllers;cache memory coupled with the plurality of memory controllers; a register file; and', 'circuitry coupled with the register file, the circuitry including a first core to perform a mixed precision matrix operation and a second core to perform, in response to a single instruction, multiple compute operations, wherein the multiple compute operations include a first operation to perform a fused multiply-add and a second operation to apply a rectified linear unit function to a result of the first operation., 'a graphics multiprocessor coupled with the cache memory and the plurality of memory controllers, the graphics multiprocessor having a single instruction, multiple thread (SIMT) architecture, wherein the graphics multiprocessor includes2. The GPU as in claim 1 , wherein the first operation and the second operation are single instruction multiple data (SIMD) operations.3. The GPU as in claim 1 , wherein the multiple compute operations are performed on input in a 16-bit floating-point format having a 1-bit sign and an 8-bit exponent.4. The GPU as in claim 3 , wherein the second core includes a dynamic precision processing resource that is configurable to automatically convert input in a 32-bit floating point format to the 16-bit floating-point format in conjunction with execution of the single instruction.5. The GPU as in claim 4 , wherein the dynamic precision processing resource includes ...
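The two chained compute operations named in the claims — a fused multiply-add followed by a rectified linear unit applied to its result — can be sketched element-wise in Python (a scalar model of what the SIMT hardware does per thread):

```python
def fused_multiply_add_relu(a, b, c):
    """Per element: relu(a*b + c), the FMA-then-ReLU pair from the claim."""
    return [max(0.0, x * y + z) for x, y, z in zip(a, b, c)]

print(fused_multiply_add_relu([1.0, -2.0], [3.0, 4.0], [0.5, 0.5]))
# [3.5, 0.0]
```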

More
20-01-2022 publication date

Systems and methods to zero a tile register pair

Number: US20220019438A1
Assignee: Intel Corp

Embodiments detailed herein relate to systems and methods to zero a tile register pair. In one example, a processor includes decode circuitry to decode a matrix pair zeroing instruction having fields for an opcode and an identifier to identify a destination matrix having a PAIR parameter equal to TRUE; and execution circuitry to execute the decoded matrix pair zeroing instruction to zero every element of a left matrix and a right matrix of the identified destination matrix.
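The executed operation — zero every element of both the left and right matrix of the identified destination pair — is straightforward to model in Python (matrices as lists of lists, zeroed in place):

```python
def zero_tile_pair(left, right):
    """Zero every element of both matrices of the tile pair, in place."""
    for tile in (left, right):
        for row in tile:
            for i in range(len(row)):
                row[i] = 0

left = [[1, 2], [3, 4]]
right = [[5, 6], [7, 8]]
zero_tile_pair(left, right)
print(left, right)  # [[0, 0], [0, 0]] [[0, 0], [0, 0]]
```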

More
20-01-2022 publication date

SPARSE MODIFIABLE BIT LENGTH DETERMINISTIC PULSE GENERATION FOR UPDATING ANALOG CROSSBAR ARRAYS

Number: US20220019877A1
Assignee:

Provided are embodiments for a computer-implemented method, a system, and a computer program product for updating analog crossbar arrays. The embodiments include receiving a number used in matrix multiplication to represent using pulse generation for a crossbar array, and receiving a first bit-length to represent the number, wherein the bit-length is a modifiable bit length. The embodiments also include selecting pulse positions in a pulse sequence having the first bit length to represent the number, performing a computation using the selected pulse positions in the pulse sequence, and updating the crossbar array using the computation.

1. A computer-implemented method for pulse generation for updating analog crossbar arrays, the computer-implemented method comprising:
receiving, by a processor, a number used in matrix multiplication to represent using pulse generation for a crossbar array;
receiving, by the processor, a first bit-length to represent the number, wherein the bit-length is a modifiable bit length;
selecting, by the processor, pulse positions in a pulse sequence having the first bit length to represent the number;
performing, by the processor, a computation using the selected pulse positions in the pulse sequence; and
updating, by the processor, the crossbar array using the computation.
2. The computer-implemented method of claim 1, further comprising:
selecting a second bit-length, wherein the second bit-length is different than the first bit-length;
selecting pulse positions for an updated pulse sequence having the second bit-length to represent the number;
performing a subsequent computation using the selected pulse positions in the updated pulse sequence; and
updating the crossbar array using the subsequent computation, wherein updating the crossbar array comprises updating a conductance value of one or more memristive devices of the crossbar array based at least in part on the computation and the subsequent computation.
3. The computer-implemented method of further ...
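The claims leave the pulse-position selection scheme open. As one illustration, the sketch below deterministically spreads pulses across a sequence of modifiable bit length; the function name `pulse_sequence` and the even-spreading rule are assumptions for this sketch, not taken from the patent.

```python
def pulse_sequence(value, bit_length):
    """Deterministically select pulse positions in a sequence of the given
    (modifiable) bit length so the pulse count approximates value in [0, 1]."""
    n_pulses = round(value * bit_length)
    seq = [0] * bit_length
    for i in range(n_pulses):
        # Spread pulses evenly across the sequence rather than packing them
        # at one end, keeping the train sparse and deterministic.
        seq[(i * bit_length) // n_pulses] = 1
    return seq
```

Shortening `bit_length` yields sparser, faster update trains at coarser resolution; lengthening it refines the representable values. That trade-off is what a modifiable bit length exposes to the crossbar-update step.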

More
12-01-2017 publication date

BIT-MASKED VARIABLE-PRECISION BARREL SHIFTER

Number: US20170010893A1
Author: QUINNELL Eric C.
Assignee:

According to one general aspect, an apparatus may include a monolithic shifter configured to receive a plurality of bytes of data, and, for each byte of data, a number of bits to shift the respective byte of data, wherein the number of bits for each byte of data need not be the same as for any other byte of data. The monolithic shifter may be configured to shift each byte of data by the respective number of bits. The apparatus may include a mask generator configured to compute a mask for each byte of data, wherein each mask indicates which bits, if any, are to be prevented from being polluted by a neighboring shifted byte of data. The apparatus may include a masking circuit configured to combine the shifted byte of data with a respective mask to create an unpolluted shifted byte of data.

1. An apparatus comprising:
a monolithic shifter configured to receive, as input, a plurality of bytes of data, receive for each byte of data, as a dynamically adjustable input, a number of bits to shift the respective byte of data, wherein the number of bits for each byte of data need not be the same as for any other byte of data, and shift, in parallel, each byte of data by the respective number of bits to create a shifted byte of data;
a mask generator configured to compute a mask for each byte of data, wherein each mask indicates which bits, if any, are to be prevented from being polluted by a neighboring shifted byte of data;
a mask shifter configured to, for each mask, create a shifted mask by shifting each mask according to the number of bits associated with each mask's respective byte of data; and
a masking circuit configured to, for each shifted byte of data, combine the shifted byte of data with a respective shifted mask to create an unpolluted shifted byte of data.
2. The apparatus of claim 1, wherein the mask generator is configured to, for each byte of data, compute a mask that includes a width of multiple-bytes; and wherein the mask shifter is ...
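A software model of the masking idea can make the pollution problem concrete. The sketch below assumes right shifts and 8-bit lanes (the names, lane width, and shift direction are illustrative choices, not taken from the claims): a monolithic shifter sees each lane together with its upper neighbor, so the neighbor's low bits spill into the lane's high bits, and the mask strips them.

```python
LANE_BITS = 8

def masked_per_byte_shift(word, shifts, lanes):
    """Right-shift each byte lane of `word` by its own amount, masking off
    the bits a monolithic shift would pull in from the neighboring lane."""
    result = 0
    for i in range(lanes):
        # The monolithic shifter operates on the lane plus its neighbor,
        # so shifting drags the neighbor's low bits into this lane.
        lane_plus_neighbor = (word >> (LANE_BITS * i)) & 0xFFFF
        n = shifts[i]
        shifted = (lane_plus_neighbor >> n) & 0xFF
        mask = 0xFF >> n  # only these bit positions are unpolluted
        result |= (shifted & mask) << (LANE_BITS * i)
    return result
```

With `word = 0x01FF`, one lane, and a shift of 1, the unmasked shift would produce `0xFF` because the neighbor's bit leaks into bit 7; the mask restores the correct per-byte result `0x7F`.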

More
12-01-2017 publication date

NEURAL NETWORK PROCESSOR

Number: US20170011288A1
Assignee:

Implementing a neural network can include receiving a macro instruction for implementing the neural network within a control unit of a neural network processor. The macro instruction can indicate a first data set, a second data set, a macro operation for the neural network, and a mode of operation for performing the macro operation. The macro operation can be automatically initiated using a processing unit of the neural network processor by applying the second data set to the first data set based on the mode of operation.

1. A method, comprising:
receiving a macro instruction for implementing a neural network within a control unit of a neural network processor;
wherein the macro instruction indicates a first data set, a second data set, a macro operation for the neural network, and a mode of operation for performing the macro operation; and
automatically initiating the macro operation using a processing unit of the neural network processor by applying the second data set to the first data set based on the mode of operation.
2. The method of claim 1, wherein the macro operation comprises convolution and the mode of operation is selected from the group consisting of a scatter mode of operation and a gather mode of operation.
3. The method of claim 1, wherein the macro operation is selected from the group consisting of convolution and vector products.
4. The method of claim 3, wherein the macro operation comprises convolution, the first data set is a selected region of a selected feature map, and the second data set is a plurality of weights of a selected kernel.
5. The method of claim 3, wherein the macro operation comprises vector products, the first data set is a plurality of feature classification values, and the second data set is a plurality of weights for a feature classification layer of the neural network.
6. The method of claim 1, wherein the macro operation comprises evaluating a gate of a Long Short Term Memory (LSTM) cell. ...
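As a minimal sketch of the control-unit dispatch (function and operation names are hypothetical; the scatter/gather modes and LSTM gates of the claims are omitted), a macro instruction applies its second data set, the weights, to its first data set like this:

```python
def run_macro(op, first, second):
    """Hypothetical dispatch for a macro instruction: apply `second`
    (weights) to `first` (data) according to the macro operation."""
    if op == "vector_product":
        # Feature-classification case: dot product of values and weights.
        return sum(a * b for a, b in zip(first, second))
    if op == "convolution":
        # 1-D sliding dot product of a feature-map region with a kernel
        # (CNN-style convolution, i.e. no kernel flip).
        k = len(second)
        return [sum(first[i + j] * second[j] for j in range(k))
                for i in range(len(first) - k + 1)]
    raise ValueError(f"unknown macro operation: {op}")
```

The point of the macro granularity is that one instruction triggers a whole convolution or dot product, rather than the control unit issuing per-element arithmetic instructions.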

More
12-01-2017 publication date

Efficient Decision Tree Traversal in an Adaptive Boosting (ADABOOST) Classifier

Number: US20170011294A1
Assignee:

A method for object classification in a decision tree based adaptive boosting (AdaBoost) classifier implemented on a single-instruction multiple-data (SIMD) processor is provided that includes receiving feature vectors extracted from N consecutive window positions in an image in a memory coupled to the SIMD processor and evaluating the N consecutive window positions concurrently by the AdaBoost classifier using the feature vectors and vector instructions of the SIMD processor, in which the AdaBoost classifier concurrently traverses decision trees for the N consecutive window positions until classification is complete for the N consecutive window positions.

1. A method for object classification in a decision tree based adaptive boosting (AdaBoost) classifier implemented on a single-instruction multiple-data (SIMD) processor, the method comprising:
receiving feature vectors extracted from N consecutive window positions in an image in a memory coupled to the SIMD processor, in which N is a vector width of the SIMD processor divided by a bit size of a feature, and in which a feature vector includes N feature values, one feature value for each of the N consecutive window positions; and
evaluating the N consecutive window positions concurrently by the AdaBoost classifier using the feature vectors and vector instructions of the SIMD processor, in which the AdaBoost classifier concurrently traverses decision trees for the N consecutive window positions until classification is complete for the N consecutive window positions, in which a decision tree includes a plurality of nodes, a threshold value for each node, and a plurality of leaves, each leaf including a partial score.
2. The method of claim 1, in which evaluating the N consecutive window positions includes:
loading a plurality of the feature vectors using a vector load instruction of the SIMD processor, in which one feature vector is loaded for each node of a single decision tree of the AdaBoost classifier;
comparing ...
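The lane-per-window idea can be illustrated with depth-1 trees (decision stumps); real classifiers traverse deeper trees, and the Python lists below stand in for SIMD vectors, so this is a model of the data layout rather than of the vector instructions themselves. Each list index is one lane, i.e. one window position.

```python
def evaluate_windows(stumps, features):
    """Score N window positions 'concurrently': each list index is one SIMD
    lane. stumps: (feature_id, threshold, left_score, right_score) tuples.
    features: feature_id -> one feature value per window."""
    n = len(next(iter(features.values())))
    scores = [0.0] * n
    for feat, threshold, left, right in stumps:
        fv = features[feat]   # one vector load per tree node
        for w in range(n):    # one vector compare/select on real SIMD
            scores[w] += left if fv[w] <= threshold else right
    # Final classification: positive cumulative score means object detected.
    return [s >= 0 for s in scores]
```

On the SIMD processor the inner loop collapses into a single vector compare and blend, which is why evaluating N windows costs roughly the same as evaluating one.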

More
08-01-2015 publication date

Method and system of compiling program code into predicated instructions for execution on a processor without a program counter

Number: US20150012729A1
Author: Robison Arch D.
Assignee:

A predicated instruction compilation system includes a control flow graph generation module to generate a control flow graph of a program code to be compiled into the predicated instructions to be executed on a processor that does not include any program counter. Each of the instructions includes a predicate guard and a predicate update. The compilation system also includes a control flow transformation module to automatically generate the predicate guard and an update to the predicate state on the processor. A computer-implemented method of compiling a program code into predicated instructions is also described.

1. A computer-implemented method of compiling a program code into predicated instructions, comprising:
extracting, from control flow of the program code, constraints between instructions of the program code;
solving constraint problem between the instructions by assigning a predicate vector that satisfies the constraints to each of the instructions; and
generating a predicate guard and a predicate update for each of the instructions based on the predicate vector such that the predicated instructions can be executed on a processor that does not include any program counter.
2. The method of claim 1, further comprising:
computing a control flow graph of the program code; and
extracting the constraints from the control flow graph.
3. The method of claim 1, further comprising computing constraints on predicate bits set by a programmer of the program code.
4. The method of claim 1, wherein the extracting further comprises generating a matrix of the constraints.
5. The method of claim 4, further comprising reducing multiple constraints into a single constraint if the multiple constraints occupy a single matrix element in the matrix.
6. The method of claim 2, wherein each of the predicated instructions corresponds to a hyper-edge of the control flow graph, wherein the generating the predicate guard and predicate update is performed by determining solutions for ...
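A toy simulation shows why guards and updates can replace a program counter. The tuple encoding of (guard, action, next predicate) and the single-match-per-cycle assumption are mine, not the patent's:

```python
def run_predicated(instructions, state, start_pred=1, halt_pred=0):
    """Simulate a machine with no program counter: each cycle, every
    instruction compares its predicate guard against the machine's predicate
    state; the matching one executes and writes its predicate update."""
    pred = start_pred
    while pred != halt_pred:
        for guard, action, next_pred in instructions:
            if guard == pred:
                state = action(state)
                pred = next_pred
                break
        else:
            raise RuntimeError(f"no instruction guarded for predicate {pred}")
    return state
```

Because sequencing comes entirely from the predicate state, the order of the instruction list is irrelevant; that independence from instruction ordering is what frees the hardware from a program counter, and it is exactly the constraint system the compiler must solve.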

More
14-01-2016 publication date

INSTRUCTION SET FOR ELIMINATING MISALIGNED MEMORY ACCESSES DURING PROCESSING OF AN ARRAY HAVING MISALIGNED DATA ROWS

Number: US20160011870A1
Assignee:

A processor is described having an instruction execution pipeline. The instruction execution pipeline includes an instruction fetch stage to fetch an instruction. The instruction format of the instruction specifies a first input vector, a second input vector and a third input operand. The instruction execution pipeline comprises an instruction decode stage to decode the instruction. The instruction execution pipeline includes a functional unit to execute the instruction. The functional unit includes a routing network to route a first contiguous group of elements from a first end of one of the input vectors to a second end of the instruction's resultant vector, and, route a second contiguous group of elements from a second end of the other of the input vectors to a first end of the instruction's resultant vector. The first and second ends are opposite vector ends. The first and second groups of contiguous elements are defined from the third input operand. The instruction is not capable of routing non-contiguous groups of elements from the input vectors to the instruction's resultant vector. A software pipeline that uses the instruction is also described.

1. A processor, comprising:
an instruction fetch stage to fetch an instruction, the instruction format of the instruction specifying a first input vector, a second input vector and a third input operand;
an instruction decode stage to decode said instruction;
a functional unit to execute the instruction, the functional unit including a routing network to route a first contiguous group of elements from a first end of one of said input vectors to a second end of said instruction's resultant vector, and, route a second contiguous group of elements from a second end of the other of said input vectors to a first end of said instruction's resultant vector, said first and second ends being opposite vector ends, wherein, said first and second groups of contiguous elements are defined from said third input operand, said ...
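The routing the claim describes is essentially a two-source alignment, in the spirit of x86 PALIGNR or ARM NEON VEXT. A sketch, assuming equal-length element vectors and taking the third operand as the element offset `k`:

```python
def align_elements(v1, v2, k):
    """Route the last len(v1)-k elements of v1 to the front of the result and
    the first k elements of v2 to the back. Only contiguous groups move,
    matching the instruction's restriction to contiguous element routing."""
    assert 0 <= k <= len(v1) == len(v2)
    return v1[k:] + v2[:k]
```

A loop over a misaligned data row can then load two naturally aligned vectors and align them in registers, eliminating the misaligned memory accesses the title refers to.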

More