Total found: 10924. Displayed: 100.

Publication date: 29-03-2012

Performing a multiply-multiply-accumulate instruction

Number: US20120079252A1
Author: Eric S. Sprangle
Assignee: Intel Corp

In one embodiment, the present invention includes a processor having multiple execution units, at least one of which includes a circuit having a multiply-accumulate (MAC) unit including multiple multipliers and adders, and to execute a user-level multiply-multiply-accumulate instruction to populate a destination storage with a plurality of elements each corresponding to an absolute value for a pixel of a pixel block. Other embodiments are described and claimed.

Publication date: 05-04-2012

Efficient Parallel Floating Point Exception Handling In A Processor

Number: US20120084533A1
Assignee: Individual

Methods and apparatus are disclosed for handling floating point exceptions in a processor that executes single-instruction multiple-data (SIMD) instructions. In one embodiment a numerical exception is identified for a SIMD floating point operation and SIMD micro-operations are initiated to generate two packed partial results of a packed result for the SIMD floating point operation. A SIMD denormalization micro-operation is initiated to combine the two packed partial results and to denormalize one or more elements of the combined packed partial results to generate a packed result for the SIMD floating point operation having one or more denormal elements. Flags are set and stored with packed partial results to identify denormal elements. In one embodiment a SIMD normalization micro-operation is initiated to generate a normalized pseudo internal floating point representation prior to the SIMD floating point operation when it uses multiplication.

Publication date: 12-04-2012

Efficient implementation of arrays of structures on SIMT and SIMD architectures

Number: US20120089792A1
Assignee: Linares Medical Devices LLC, Nvidia Corp

One embodiment of the present invention sets forth a technique providing an optimized way to allocate and access memory across a plurality of thread/data lanes. Specifically, the device driver receives an instruction targeted to a memory set up as an array of structures of arrays. The device driver computes an address within the memory using information about the number of thread/data lanes and parameters from the instruction itself. The result is a memory allocation and access approach where the device driver properly computes the target address in the memory. Advantageously, processing efficiency is improved where memory in a parallel processing subsystem is internally stored and accessed as an array of structures of arrays, proportional to the SIMT/SIMD group width (the number of threads or lanes per execution group).
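
For illustration, the address computation can be sketched in C using one common array-of-structures-of-arrays layout in which each structure member is replicated once per SIMT/SIMD lane. The lane count, the formula, and the name aosoa_addr are assumptions for the example, not necessarily the driver's exact scheme.

```c
#include <stddef.h>
#include <stdio.h>

/* Model: an "array of structures of arrays" laid out so that each member of a
 * structure is stored as a LANES-wide mini-array inside each group. */
#define LANES 32  /* assumed SIMT/SIMD group width */

static size_t aosoa_addr(size_t base,        /* base byte address of the array     */
                         size_t elem,        /* logical element (thread) index     */
                         size_t member_off,  /* member's byte offset in the plain struct */
                         size_t member_size, /* size of one member in bytes        */
                         size_t struct_size) /* total size of one structure        */
{
    size_t group = elem / LANES;   /* which group of LANES structures            */
    size_t lane  = elem % LANES;   /* position of this element inside the group  */
    /* One group occupies LANES * struct_size bytes; a member that starts at byte
     * offset m in the plain struct starts at offset m * LANES inside the group. */
    return base
         + group * LANES * struct_size
         + member_off * LANES
         + lane * member_size;
}

int main(void) {
    /* struct { float a; float b; } for 100 threads, base address 0: member b of thread 37 */
    printf("thread 37, member b -> byte offset %zu\n",
           aosoa_addr(0, 37, sizeof(float), sizeof(float), 2 * sizeof(float)));
    return 0;
}
```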

Publication date: 12-04-2012

Decoding instructions from multiple instruction sets

Number: US20120089818A1
Author: Simon John Craske
Assignee: ARM LTD

A data processing apparatus, method and computer program are described that are capable of decoding instructions from different instruction sets. The method comprising: receiving an instruction; if an operation code of said instruction is an operation code of an instruction from a base set of instructions decoding said instruction according to decode rules for said base set of instructions; and if said operation code of said instruction is an operation code of an instruction from at least one further set of instructions decoding said instruction according to a set of decode rules determined by an indicator value indicating which of said at least one further set of instructions is currently to be decoded.

Publication date: 10-05-2012

Dedicated instructions for variable length code insertion by a digital signal processor (DSP)

Number: US20120117360A1
Author: Jagadeesh Sankaran
Assignee: Texas Instruments Inc

In accordance with at least some embodiments, a digital signal processor (DSP) includes an instruction fetch unit and an instruction decode unit in communication with the instruction fetch unit. The DSP also includes a register set and a plurality of work units in communication with the instruction decode unit. The DSP selectively uses a dedicated insert instruction to insert a variable number of bits into a register.

Publication date: 17-05-2012

Retirement serialisation of status register access operations

Number: US20120124340A1
Author: James Nolan Hardage
Assignee: ARM LTD

A processor 2 for performing out-of-order execution of a stream of program instructions includes a special register access pipeline for performing status access instructions accessing a status register 20. In order to serialise these status access instructions relative to other instructions within the system, access timing control circuitry 32 permits dispatch of other instructions to proceed but controls the commit queue and the result queue such that no program instructions succeeding the status access instruction in program order are permitted to complete until after a trigger state has been detected in which all program instructions preceding the status access instruction in program order have been performed and made any updates to the architectural state. This is followed by the performance of the status access instruction itself.

Publication date: 26-07-2012

Predicting a result for an actual instruction when processing vector instructions

Number: US20120191957A1
Author: Jeffry E. Gonion
Assignee: Apple Inc

The described embodiments provide a processor that executes vector instructions. In the described embodiments, while dispatching instructions at runtime, the processor encounters an Actual instruction. Upon determining that a result of the Actual instruction is predictable, the processor dispatches a prediction micro-operation associated with the Actual instruction, wherein the prediction micro-operation generates a predicted result vector for the Actual instruction. The processor then executes the prediction micro-operation to generate the predicted result vector. In the described embodiments, generating the predicted result vector comprises setting elements of the predicted result vector to true: for each element for which the predicate vector is active if a predicate vector is received, or otherwise for each element of the predicted result vector.

Publication date: 26-07-2012

Sharing a fault-status register when processing vector instructions

Number: US20120192005A1
Author: Jeffry E. Gonion
Assignee: Apple Inc

The described embodiments provide a processor that executes vector instructions. In the described embodiments, the processor initializes an architectural fault-status register (FSR) and a shadow copy of the architectural FSR by setting each of N bit positions in the architectural FSR and the shadow copy of the architectural FSR to a first predetermined value. The processor then executes a first first-faulting or non-faulting (FF/NF) vector instruction. While executing the first vector instruction, the processor also executes one or more subsequent FF/NF instructions. In these embodiments, when executing the first vector instruction and the subsequent vector instructions, the processor updates one or more bit positions in the shadow copy of the architectural FSR to a second predetermined value upon encountering a fault condition. However, the processor does not update bit positions in the architectural FSR upon encountering a fault condition for the first vector instruction and the subsequent vector instructions.
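
A small C model of the bookkeeping described above: faults raised during first-faulting/non-faulting vector execution mark element positions in a shadow copy of the fault-status register, while the architectural copy is left untouched. The element count, values, and function names are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define N 8                      /* assumed number of tracked element positions */

typedef struct {
    uint8_t arch_fsr[N];         /* architectural fault-status register */
    uint8_t shadow_fsr[N];       /* shadow copy updated on faults       */
} fsr_state;

/* Both copies start out at the same "no fault" value. */
static void fsr_init(fsr_state *s, uint8_t initial) {
    for (int i = 0; i < N; i++) { s->arch_fsr[i] = initial; s->shadow_fsr[i] = initial; }
}

/* While FF/NF vector instructions execute, a fault only touches the shadow copy;
 * the architectural FSR is deliberately left unchanged. */
static void note_fault(fsr_state *s, int element, uint8_t fault_value) {
    s->shadow_fsr[element] = fault_value;
}

int main(void) {
    fsr_state s;
    fsr_init(&s, 0);             /* first predetermined value */
    note_fault(&s, 3, 1);        /* element 3 faulted during an FF/NF instruction */
    for (int i = 0; i < N; i++)
        printf("pos %d: arch=%u shadow=%u\n", i, s.arch_fsr[i], s.shadow_fsr[i]);
    return 0;
}
```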

Publication date: 16-08-2012

Running unary operation instructions for processing vectors

Number: US20120210099A1
Author: Jeffry E. Gonion
Assignee: Apple Inc

During operation, a processor generates a result vector. In particular, the processor records a value from an element at a key element position in an input vector into a base value. Next, for each active element in the result vector to the right of the key element position, the processor generates a result vector by setting the element in the result vector equal to a result of performing a unary operation on the base value a number of times equal to a number of relevant elements. The number of relevant elements is determined from the key element position to and including a predetermined element in the result vector, where the predetermined element in the result vector may be one of: a first element to the left of the element in the result vector; or the element in the result vector.
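
One possible reading of this running unary operation, sketched in scalar C with increment as the unary operation. The predicate-driven notion of a "relevant" element, the counting convention, and the include_self flag (choosing whether the count runs up to the element itself or only to the element on its left) are assumptions made to keep the example concrete.

```c
#include <stdbool.h>
#include <stdio.h>

#define VLEN 8

static void running_unary(const int *in, const bool *pred, int key,
                          bool include_self, int *out)
{
    int base = in[key];                     /* base value from the key element position */
    for (int i = 0; i < VLEN; i++) out[i] = in[i];   /* inactive elements pass through  */
    for (int i = key + 1; i < VLEN; i++) {
        if (!pred[i]) continue;             /* only active elements to the right are written */
        /* count relevant (active) elements from the key position up to and including
         * the predetermined element: this element, or the one to its left */
        int last = include_self ? i : i - 1;
        int count = 0;
        for (int j = key; j <= last; j++) if (pred[j]) count++;
        int v = base;
        for (int k = 0; k < count; k++) v = v + 1;   /* apply the unary op 'count' times */
        out[i] = v;
    }
}

int main(void) {
    int  in[VLEN]  = { 5, 0, 0, 0, 0, 0, 0, 0 };
    bool p[VLEN]   = { true, true, false, true, true, false, true, true };
    int  out[VLEN];
    running_unary(in, p, 0, false, out);
    for (int i = 0; i < VLEN; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```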

Publication date: 27-09-2012

Method and apparatus for efficient loop instruction execution using bit vector scanning

Number: US20120246449A1
Assignee: Avaya Inc

A method, apparatus and computer program product for performing efficient loop instruction execution using bit vector scanning is presented. A bit vector is scanned, each bit in the bit vector representing at least one of a feature and a conditional status. The presence of a bit of said bit vector set to a first state is detected. The bit is set to a second state. An instruction address for a routine corresponding to said bit set to a first state is looked up using a bit position of said bit that was set to a first state. The routine is executed. The scanning, said detecting, said setting and said using are repeated until there are no remaining bits of said bit vector set to said first state.
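
The loop described above maps naturally onto a dispatch table indexed by bit position. The sketch below is an illustrative C model assuming a 32-bit vector; the table name handlers and the function run_bit_vector are invented for the example.

```c
#include <stdint.h>
#include <stdio.h>

/* One handler routine per bit position; the bit's position is the table index. */
typedef void (*handler_fn)(void);

static void feature_a(void) { puts("feature A"); }
static void feature_b(void) { puts("feature B"); }

static handler_fn handlers[32] = { feature_a, feature_b /* remaining entries NULL */ };

/* Scan the bit vector; for every bit found in the first state, set it to the
 * second state, look up the routine for that bit position and run it, and
 * repeat until no bits of the vector remain in the first state. */
static void run_bit_vector(uint32_t bits)
{
    while (bits != 0) {
        int pos = __builtin_ctz(bits);   /* position of the lowest set bit */
        bits &= bits - 1;                /* clear it (set to the second state) */
        if (handlers[pos]) handlers[pos]();
    }
}

int main(void) {
    run_bit_vector(0x3);                 /* bits 0 and 1 set -> both routines run */
    return 0;
}
```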

Publication date: 18-10-2012

Allocation of counters from a pool of counters to track mappings of logical registers to physical registers for mapper based instruction executions

Number: US20120265969A1
Assignee: International Business Machines Corp

A computer system assigns a particular counter from among a plurality of counters currently in a counter free pool to count a number of mappings of logical registers from among a plurality of logical registers to a particular physical register from among a plurality of physical registers, responsive to an execution of an instruction by a mapper unit mapping at least one logical register from among the plurality of logical registers to the particular physical register, wherein the number of the plurality of counters is less than a number of the plurality of physical registers. The computer system, responsive to the counted number of mappings of logical registers to the particular physical register decremented to less than a minimum value, returns the particular counter to the counter free pool.
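
A rough C model of the counter-pool bookkeeping: there are fewer counters than physical registers, a counter is claimed from the free pool the first time a physical register acquires a logical mapping, and it is returned once its count drops below a minimum. Sizes, names, and the behaviour when no counter is free are assumptions.

```c
#include <stdio.h>

#define PHYS_REGS 16
#define COUNTERS   4      /* deliberately fewer counters than physical registers */
#define MIN_COUNT  1      /* below this the counter goes back to the free pool   */

static int counter_value[COUNTERS];
static int counter_owner[COUNTERS];   /* which physical register, -1 = free      */
static int reg_counter[PHYS_REGS];    /* counter index per register, -1 = none   */

static void init(void) {
    for (int c = 0; c < COUNTERS; c++) { counter_value[c] = 0; counter_owner[c] = -1; }
    for (int r = 0; r < PHYS_REGS; r++) reg_counter[r] = -1;
}

/* The mapper maps another logical register onto physical register r. */
static void on_map(int r) {
    if (reg_counter[r] < 0) {                            /* claim a counter from the free pool */
        for (int c = 0; c < COUNTERS; c++)
            if (counter_owner[c] < 0) { counter_owner[c] = r; reg_counter[r] = c; break; }
    }
    if (reg_counter[r] >= 0) counter_value[reg_counter[r]]++;
}

/* A logical-to-physical mapping to register r is released. */
static void on_unmap(int r) {
    int c = reg_counter[r];
    if (c < 0) return;
    if (--counter_value[c] < MIN_COUNT) {                /* return counter to the free pool */
        counter_owner[c] = -1;
        reg_counter[r] = -1;
    }
}

int main(void) {
    init();
    on_map(3); on_map(3); on_unmap(3); on_unmap(3);
    printf("counter returned to pool: %s\n", reg_counter[3] < 0 ? "yes" : "no");
    return 0;
}
```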

Publication date: 03-01-2013

Processing vectors using wrapping add and subtract instructions in the macroscalar architecture

Number: US20130007422A1
Author: Jeffry E. Gonion
Assignee: Apple Inc

Embodiments of a system and a method in which a processor may execute instructions that cause the processor to receive an input vector and a control vector are disclosed. The executed instructions may also cause the processor to perform a sum or difference operation on another input vector dependent upon the input vector and the control vector.

Publication date: 14-02-2013

Data processing device and data processing method

Number: US20130038474A1
Author: Daisuke Baba
Assignee: Panasonic Corp

A decoder reads an instruction for information specifying a bit sequence storage area, information indicating a first bit range, and information indicating a second bit range that is contiguous with the first bit range, then outputs a decoded signal in response to the information so read, and a bit manipulation circuit generates and outputs an output sequence based on a bit sequence stored in the bit sequence storage area by inserting uniform predetermined values between a first bit range and a second bit range in accordance with the decoded signal output from the decoder.
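
A bit-level sketch of the manipulation described above: the output keeps a first bit range of the stored sequence, inserts a run of a uniform fill value, and then continues with the contiguous second bit range. Bit ordering (LSB-first), operand widths, and the parameter names are assumptions.

```c
#include <stdint.h>
#include <stdio.h>

/* Build an output sequence from 'src': bits [0, first_len) are copied as-is,
 * then 'fill_len' copies of 'fill_bit' are inserted, then the contiguous second
 * range [first_len, first_len + second_len) of 'src' follows.  LSB-first. */
static uint64_t insert_fill_bits(uint64_t src, unsigned first_len,
                                 unsigned second_len, unsigned fill_len,
                                 unsigned fill_bit)
{
    uint64_t low    = src & ((1ULL << first_len) - 1);
    uint64_t second = (src >> first_len) & ((1ULL << second_len) - 1);
    uint64_t fill   = fill_bit ? ((1ULL << fill_len) - 1) : 0;
    return low
         | (fill   << first_len)
         | (second << (first_len + fill_len));
}

int main(void) {
    /* keep 4 low bits, insert three 1-bits, then the next 4 bits of the source */
    printf("0x%llx\n", (unsigned long long)insert_fill_bits(0xABu, 4, 4, 3, 1));
    return 0;
}
```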

Publication date: 14-02-2013

Word line late kill in scheduler

Number: US20130042089A1
Assignee: Advanced Micro Devices Inc

A method for picking an instruction for execution by a processor includes providing a multiple-entry vector, each entry in the vector including an indication of whether a corresponding instruction is ready to be picked. The vector is partitioned into equal-sized groups, and each group is evaluated starting with a highest priority group. The evaluating includes logically canceling all other groups in the vector when a group is determined to include an indication that an instruction is ready to be picked, whereby the vector only includes a positive indication for the one instruction that is ready to be picked.
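
A software model of the group-wise pick: the ready vector is split into equal-sized groups, groups are examined from the highest-priority group downwards, and as soon as one group contains a ready indication every other group is cancelled, so only a single instruction is picked. Vector width, group size, and the intra-group tie-break are assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define ENTRIES     16
#define GROUP_SIZE   4          /* equal-sized groups, highest-priority group first */

/* Return the index of the single picked entry, or -1 if nothing is ready.
 * Within the winning group the lowest-numbered ready entry is taken here;
 * a real scheduler may break the intra-group tie differently. */
static int pick_ready(const uint8_t ready[ENTRIES])
{
    for (int g = 0; g < ENTRIES / GROUP_SIZE; g++) {
        for (int i = 0; i < GROUP_SIZE; i++) {
            int idx = g * GROUP_SIZE + i;
            if (ready[idx])
                return idx;     /* this group wins; all other groups are cancelled */
        }
    }
    return -1;
}

int main(void) {
    uint8_t ready[ENTRIES] = { 0 };
    ready[6] = 1; ready[11] = 1;                       /* candidates in two groups */
    printf("picked entry %d\n", pick_ready(ready));    /* the higher-priority group wins */
    return 0;
}
```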

Publication date: 28-03-2013

Method, apparatus and instructions for parallel data conversions

Number: US20130080742A1
Author: Gopalan Ramanujam
Assignee: Individual

Method, apparatus, and program means for performing a conversion. In one embodiment, a disclosed apparatus includes a destination storage location corresponding to a first architectural register. A functional unit operates responsive to a control signal, to convert a first packed first format value selected from a set of packed first format values into a plurality of second format values. Each of the first format values has a plurality of sub elements having a first number of bits. The second format values have a greater number of bits. The functional unit stores the plurality of second format values into an architectural register.
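
The conversion pattern — packed values whose sub-elements have a small bit width expanded into values with more bits — can be modelled in plain C. The sketch below widens packed 8-bit sub-elements into 32-bit values; the specific widths, sign extension, and function name are assumptions, since the abstract does not fix the formats.

```c
#include <stdint.h>
#include <stdio.h>

/* Convert one packed first-format value (four 8-bit sub-elements packed into a
 * 32-bit word) into four second-format values with a greater number of bits
 * (32-bit signed integers), sign-extending each sub-element. */
static void widen_packed8_to_32(uint32_t packed, int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        int8_t sub = (int8_t)((packed >> (8 * i)) & 0xFF);  /* extract sub-element */
        out[i] = (int32_t)sub;                              /* widen to more bits  */
    }
}

int main(void) {
    int32_t out[4];
    widen_packed8_to_32(0x80FF017Fu, out);        /* sub-elements 0x7F, 0x01, 0xFF, 0x80 */
    for (int i = 0; i < 4; i++) printf("%d ", out[i]);       /* 127 1 -1 -128 */
    printf("\n");
    return 0;
}
```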

Publication date: 28-03-2013

Processor and instruction processing method in processor

Number: US20130080747A1
Author: Young-Su Kwon

The present invention relates to a processor including: an instruction cache configured to store at least some of first instructions stored in an external memory and second instructions each including a plurality of micro instructions; a micro cache configured to store third instructions corresponding to the plurality of micro instructions included in the second instructions; and a core configured to read out the first and second instructions from the instruction cache and perform calculation, in which the core performs calculation by the first instructions from the instruction cache under a normal mode, and when the processor enters a micro instruction mode, the core performs calculation by the third instructions corresponding to the plurality of micro instructions provided from the micro cache.

Publication date: 18-04-2013

CONDITIONAL COMPARE INSTRUCTION

Number: US20130097408A1
Assignee: ARM LIMITED

An instruction decoder is responsive to a conditional compare instruction to generate control signals for controlling processing circuitry to perform a conditional compare operation. The conditional compare operation comprises: (i) if a current condition state of the processing circuitry passes a test condition, then performing a compare operation on a first operand and a second operand and setting the current condition state to a result condition state generated during the compare operation; and (ii) if the current condition state fails the test condition, then setting the current condition state to a fail condition state specified by the conditional compare instruction. The conditional compare instruction can be used to represent chained sequences of comparison operations where each individual comparison operation may test a different kind of relation between a pair of operands.

2. The data processing apparatus according to claim 1, wherein said status store comprises a status register.
3. The data processing apparatus according to claim 1, wherein said current condition state comprises the value of at least one condition code flag stored within said status store.
4. The data processing apparatus according to claim 1, wherein said conditional compare instruction includes a field for specifying said test condition.
5. The data processing apparatus according to claim 1, wherein said fail condition state is specified as an immediate value by said conditional compare instruction.
6. The data processing apparatus according to claim 5, wherein said immediate value is a programmable value set by the programmer of a program comprising said conditional compare instruction.
7. The data processing apparatus according to claim 5, wherein said immediate value is a programmable value set by a compiler of a program comprising said conditional compare instruction, said compiler selecting said programmable value in dependence on a desired condition that is to be ...
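
The two-case behaviour lends itself to a compact C model. The sketch below tracks NZCV-style condition flags and applies the conditional-compare rule: if the current flags pass the test condition, a real compare updates them; otherwise they are overwritten with the fail state supplied by the instruction. The flag encoding, the condition subset, and the helper names are illustrative choices, not the patent's encoding.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { bool n, z, c, v; } flags_t;    /* current condition state */

/* A tiny subset of test conditions, evaluated against the current flags. */
typedef enum { COND_EQ, COND_NE, COND_GE } cond_t;

static bool passes(cond_t cond, flags_t f) {
    switch (cond) {
    case COND_EQ: return f.z;
    case COND_NE: return !f.z;
    case COND_GE: return f.n == f.v;
    }
    return false;
}

/* Flags produced by an ordinary compare of two signed 32-bit operands. */
static flags_t compare(int32_t a, int32_t b) {
    int64_t diff = (int64_t)a - (int64_t)b;
    flags_t f;
    f.z = (diff == 0);
    f.n = (diff < 0);
    f.c = ((uint32_t)a >= (uint32_t)b);
    f.v = (diff != (int32_t)diff);              /* signed overflow of the subtraction */
    return f;
}

/* Conditional compare: compare only if 'cond' passes on the current state,
 * otherwise load the fail condition state given by the instruction. */
static flags_t conditional_compare(flags_t current, cond_t cond,
                                   int32_t a, int32_t b, flags_t fail_state)
{
    return passes(cond, current) ? compare(a, b) : fail_state;
}

int main(void) {
    flags_t start = {0}, fail = { .z = false, .n = true, .c = false, .v = false };
    flags_t f = conditional_compare(start, COND_NE, 5, 5, fail);  /* NE passes -> real compare */
    printf("Z=%d N=%d\n", f.z, f.n);
    return 0;
}
```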

Publication date: 25-04-2013

Multi-addressable register files and format conversions associated therewith

Number: US20130103932A1
Assignee: International Business Machines Corp

A multi-addressable register file is addressed by a plurality of types of instructions, including scalar, vector and vector-scalar extension instructions. It may be determined that data is to be translated from one format to another format. If so determined, a convert machine instruction is executed that obtains a single precision datum in a first representation in a first format from a first register; converts the single precision datum of the first representation in the first format to a converted single precision datum of a second representation in a second format; and places the converted single precision datum in a second register.

Publication date: 02-05-2013

Running shift for divide instructions for processing vectors

Number: US20130111193A1
Author: Jeffry E. Gonion
Assignee: Apple Inc

In the described embodiments, a processor generates a result vector when executing a RunningShiftForDivide1P or RunningShiftForDivide2P instruction. In these embodiments, upon executing a RunningShiftForDivide1P/2P instruction, the processor receives a first input vector and a second input vector. The processor then records a base value from an element at a key element position in the first input vector. Next, when generating the result vector, for each active element in the result vector to the right of the key element position, the processor generates a shifted base value using shift values from the second input vector. The processor then corrects the shifted base value when a predetermined condition is met. Next, the processor sets the element of the result vector equal to the shifted base value.

Publication date: 09-05-2013

Method and Apparatus for Unpacking Packed Data

Number: US20130117537A1
Assignee: Individual

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.
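
The copy pattern described — alternating elements drawn from the two source registers so that neighbours in the destination come from different sources — is the classic interleaving unpack. A C model with four 16-bit elements per register (an assumed width; the patent family covers several widths):

```c
#include <stdint.h>
#include <stdio.h>

/* Unpack/interleave the low halves of two 4-element packed registers:
 * dest = { a0, b0, a1, b1 }, so each copied element ends up adjacent to its
 * counterpart from the other source register. */
static void unpack_low(const uint16_t a[4], const uint16_t b[4], uint16_t dest[4])
{
    dest[0] = a[0];   /* first packed data element (first source)   */
    dest[1] = b[0];   /* second packed data element (second source) */
    dest[2] = a[1];   /* third packed data element (first source)   */
    dest[3] = b[1];   /* fourth packed data element (second source) */
}

int main(void) {
    uint16_t a[4] = { 1, 2, 3, 4 }, b[4] = { 10, 20, 30, 40 }, d[4];
    unpack_low(a, b, d);
    printf("%u %u %u %u\n", d[0], d[1], d[2], d[3]);   /* 1 10 2 20 */
    return 0;
}
```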

Publication date: 09-05-2013

Method and Apparatus for Unpacking Packed Data

Number: US20130117538A1
Assignee:

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element. 137-. (canceled)38. A system comprising:a processing system to support: 2D/3D graphics, image processing, video compression, video decompression, and audio manipulation; wherein the processing system is further to be coupled with a display device and to be coupled with an input device, the processing system comprising a processor including,a register file including a first register to hold a first packed data including a first low data element and a first high data element, a second register to hold a second packed data including a second low data element and a second high data element, and a third register;a decoder to decode a first unpack instruction;a functional unit coupled the decoder and the register file, the functional unit, in response to the decoder decoding the first unpack instruction to transfer the first low data element to a high position of the third register and the second low data element to the low position of the third register.39. The system of claim 38 , wherein the decoder is further to decode a second unpack instruction claim 38 , and wherein the functional unit claim 38 , in response to the decode decoding the second unpack instruction claim 38 , to transfer the first high ...

Publication date: 09-05-2013

Method and apparatus for unpacking packed data

Number: US20130117540A1
Assignee: Individual

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element.

Publication date: 16-05-2013

Apparatus and method for reducing overhead caused by communication between clusters

Number: US20130124825A1
Assignee: SAMSUNG ELECTRONICS CO LTD

A technique for minimizing overhead caused by copying or moving a value from one cluster to another cluster is provided. A number of operations, for example, a mov operation for moving or copying a value from one cluster to another cluster and a normal operation may be executed concurrently. Accordingly, access to a register file outside of the cluster may be reduced and the performance of code may be improved.

Publication date: 16-05-2013

Method and Apparatus for Unpacking Packed Data

Number: US20130124830A1
Assignee:

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element. 137-. (canceled)38. A computing device comprising:a communication busa cache;a decoder operable to decode instructions specifying data manipulation operations;a register file comprising:a first set of registers operable to store 32-bit integer data; anda second set of registers operable to store a first packed data and a second packed data respectively including a first plurality of data elements and a second plurality of data elements;a functional unit coupled to the cache, the decoder, the register file, and the communication bus, and operable to execute decoded instructions specifying data manipulation operations, including:a first move instruction that, when executed by the functional unit, causes data to be transferred between a first packed data register and a second packed data register;a second move instruction that, when executed by the functional unit, causes data to be transferred between the first packed data register and a main memory;a third move instruction that, when executed by the functional unit, causes data to be transferred between the first packed data register and a 32-bit register; andan unpack instruction that, when executed by the functional unit, causes data elements from ...

Publication date: 16-05-2013

Method and Apparatus for Unpacking Packed Data

Number: US20130124832A1
Assignee:

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element. 137-. (canceled)38. A processor comprising:a register file including at least a first register to hold a first data element, a second data element, a third data element, and a fourth data element; a second register to hold a fifth data element, a sixth data element, a seventh data element, and an eighth data element, and a third register;a decoder to decode a packed instruction, the packed in instruction to identify the first and the second registers as source registers and the third register as a destination register;a functional unit coupled to the register file and the decoder, the functional unit, responsive to the first instruction, to store the sixth data element to a least significant portion of the third register, the eighth data element to a second least significant portion of the third register, the second data element to a third least significant portion of the third register, and the fourth data element to the fourth least significant portion of the third register.39. The processor of claim 38 , wherein the first claim 38 , second claim 38 , and third registers are to hold 32-bits claim 38 , and wherein each of the first through eight data elements are to be 8 bits in size.40. The ...

Publication date: 16-05-2013

Method and Apparatus for Unpacking Packed Data

Number: US20130124833A1
Assignee:

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element. 137-. (canceled)38. A processor comprising:a register file having a plurality of registers;a decoder coupled to the register file, the decoder to decode a first instruction, the first instruction to:have a 32-bit instruction format, include a first field to identify a first source register of the register file that is to store a first plurality of packed 8-bit integers,include a second field to identify a second source register of the register file that is to store a second plurality of packed 8-bit integers, andinclude a third field to identify a destination register,wherein each of the first and second pluralities of packed 8-bit integers are to include four packed 8-bit integers; anda functional unit including circuitry coupled to the decoder, the functional unit to generate a result that is to be stored in the destination register responsive to the first instruction, the destination register to have a same number of bits as the first and second source registers, wherein the result is to include a third plurality of packed 8-bit integers, the third plurality to include four packed 8-bit integers,the third plurality of packed 8-bit integers to include half of the 8-bit integers from the first ...

Publication date: 16-05-2013

Method and Apparatus for Unpacking Packed Data

Number: US20130124834A1
Assignee:

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element. 137-. (canceled)38. A processor comprising:a register file including at least a first register to hold a first data element, a second data element, a third data element, and a fourth data element; a second register to hold a fifth data element, a sixth data element, a seventh data element, and an eighth data element, and a third register;a decoder to decode a packed instruction, the packed in instruction to identify the first and the second registers as source registers and the third register as a destination register;a functional unit coupled to the register file and the decoder, the functional unit, responsive to the first instruction, to store the fifth data element to a least significant portion of the third register, the seventh data element to a second least significant portion of the third register, the first data element to a third least significant portion of the third register, and the third data element to the fourth least significant portion of the third register.39. The processor of claim 38 , wherein the first claim 38 , second claim 38 , and third registers are to hold 32-bits claim 38 , and wherein each of the first through eight data elements are to be 8 bits in size.40. The ...

Publication date: 16-05-2013

Method and Apparatus for Packing Packed Data

Number: US20130124835A1
Assignee:

An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to unpack the packed data responsive to an unpack instruction received by the decoder. A first packed data element and a third packed data element are received from the first source register. A second packed data element and a fourth packed data element are received from the second source register. The circuit copies the packed data elements into a destination register resulting with the second packed data element adjacent to the first packed data element, the third packed data element adjacent to the second packed data element, and the fourth packed data element adjacent to the third packed data element. 137-. (canceled)38. A processor comprising:a register file having a plurality of registers;a decoder coupled with the register file, the decoder to decode a first instruction, the first instruction having a 32-bit instruction format, the first instruction having a first field to specify a first source register of the register file having a first plurality of packed signed 16-bit integers and a second field to specify a second source register of the register file having a second plurality of packed signed 16-bit integers; anda functional unit including circuitry coupled with the decoder, the functional unit to generate a result according to the first instruction that is to be stored in a destination register specified by a third field of the first instruction, the result including a third plurality of packed 8-bit integers,the third plurality of the packed 8-bit integers including an 8-bit integer for each 16-bit integer in the first plurality of the packed signed 16-bit integers, and an 8-bit integer for each 16-bit integer in the second plurality of the packed signed 16-bit integers,the 8-bit integers corresponding to the first plurality of the packed signed 16-bit integers next to one another in ...

Publication date: 06-06-2013

Bitstream Buffer Manipulation With A SIMD Merge Instruction

Number: US20130145125A1
Assignee: Individual

Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block is made. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.

Publication date: 20-06-2013

Reducing issue-to-issue latency by reversing processing order in half-pumped SIMD execution units

Number: US20130159666A1
Assignee: International Business Machines Corp

Techniques for reducing issue-to-issue latency by reversing processing order in half-pumped single instruction multiple data (SIMD) execution units are described. In one embodiment a processor functional unit is provided comprising a frontend unit, an execution core unit, a backend unit, an execution order control signal unit, a first interconnect coupled between an output and an input of the execution core unit, and a second interconnect coupled between an output of the backend unit and an input of the frontend unit. In operation, the execution order control signal unit generates a forwarding order control signal based on the parity of an applied clock signal on reception of a first vector instruction. This control signal is in turn used to selectively forward first and second portions of an execution result of the first vector instruction via the interconnects for use in the execution of a dependent second vector instruction.

Publication date: 27-06-2013

Method and apparatus for generating flags for a processor

Number: US20130166889A1
Assignee: Advanced Micro Devices Inc

A method and apparatus are described for generating flags in response to processing data during an execution pipeline cycle of a processor. The processor may include a multiplexer configured to generate valid bits for received data according to a designated data size, and a logic unit configured to control the generation of flags based on a shift or rotate operation command, the designated data size and information indicating how many bytes and bits to rotate or shift the data by. A carry flag may be used to extend the amount of bits supported by shift and rotate operations. A sign flag may be used to indicate whether a result is a positive or negative number. An overflow flag may be used to indicate that a data overflow exists, whereby there are not a sufficient number of bits to store the data.

Publication date: 04-07-2013

Processor for Executing Wide Operand Operations Using a Control Register and a Results Register

Number: US20130173888A1
Assignee: Microunity Systems Engineering Inc

A programmable processor and method for improving the performance of processors by expanding at least two source operands, or a source and a result operand, to a width greater than the width of either the general purpose register or the data path width. The present invention provides operands which are substantially larger than the data path width of the processor by using the contents of a general purpose register to specify a memory address at which a plurality of data path widths of data can be read or written, as well as the size and shape of the operand. In addition, several instructions and apparatus for implementing these instructions are described which obtain performance advantages if the operands are not limited to the width and accessible number of general purpose registers.

Publication date: 11-07-2013

Performing A Multiply-Multiply-Accumulate Instruction

Number: US20130179661A1
Author: Eric Sprangle
Assignee: Individual

In one embodiment, the present invention includes a processor having multiple execution units, at least one of which includes a circuit having a multiply-accumulate (MAC) unit including multiple multipliers and adders, and to execute a user-level multiply-multiply-accumulate instruction to populate a destination storage with a plurality of elements each corresponding to an absolute value for a pixel of a pixel block. Other embodiments are described and claimed.

Publication date: 18-07-2013

Processor with multi-level looping vector coprocessor

Number: US20130185540A1
Assignee: Texas Instruments Inc

A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core includes a program memory interface through which the scalar processor retrieves instructions from a program memory. The instructions include scalar instructions executable by the scalar processor and vector instructions executable by the vector coprocessor core. The vector coprocessor core includes a plurality of execution units and a vector command buffer. The vector command buffer is configured to decode vector instructions passed by the scalar processor core, to determine whether vector instructions defining an instruction loop have been decoded, and to initiate execution of the instruction loop by one or more of the execution units based on a determination that all of the vector instructions of the instruction loop have been decoded.

Publication date: 18-07-2013

Bitstream buffer manipulation with a SIMD merge instruction

Number: US20130185541A1
Assignee: Individual

Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block is made. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.

Publication date: 18-07-2013

Processor with instruction variable data distribution

Number: US20130185544A1
Assignee: Texas Instruments Inc

A vector processor includes a plurality of execution units arranged in parallel, a register file, and a plurality of load units. The register file includes a plurality of registers coupled to the execution units. Each of the load units is configured to load, in a single transaction, a plurality of the registers with data retrieved from memory, the loaded registers corresponding to different execution units. Each of the load units is configured to distribute the data to the registers in accordance with an instruction selectable distribution. The instruction selectable distribution specifies one of a plurality of distributions. Each of the distributions specifies a data sequence that differs from the sequence in which the data is stored in memory.

Publication date: 25-07-2013

INSTRUCTIONS AND LOGIC TO PERFORM MASK LOAD AND STORE OPERATIONS

Number: US20130191615A1
Assignee:

In one embodiment, logic is provided to receive and execute a mask move instruction to transfer a vector data element including a plurality of packed data elements from a source location to a destination location, subject to mask information for the instruction. Other embodiments are described and claimed.

1. A processor comprising:
a decoder to decode instruction;
a storage to store decoded instructions; and
a logic to receive and execute a mask move instruction to transfer a vector data element including a plurality of packed data elements from a source location to a destination location, wherein the mask move instruction is to be executed subject to mask information in a vector mask register.
2. The processor of claim 1, further comprising a register file comprising a plurality of extended registers each to store a vector data element including a plurality of packed data elements, and a control register to store the mask information.
3. The processor of claim 2, wherein the processor further comprises a memory subsystem having a store buffer including a plurality of entries each to store a pending instruction, a destination identifier, a source identifier, and, if the pending instruction is a mask store instruction, the mask information.
4. The processor of claim 1, wherein the mask move instruction is a mask load instruction including an opcode, a source identifier and a destination identifier, and wherein the logic is to access the vector mask register responsive to the mask load instruction to obtain the mask information.
5. The processor of claim 4, wherein the logic is to access a first bit of each of a plurality of fields of the vector mask register to obtain the mask information, wherein each first bit is a mask value for a corresponding one of the plurality of packed data elements of a vector data element.
6. The processor of claim 1, wherein the mask move instruction is a mask ...
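
A scalar C model of the masked transfer: each packed element moves from source to destination only where its mask bit is set. The element count is assumed, and unselected destination elements are left unchanged here; whether a real implementation preserves or zeroes them is not settled by this excerpt.

```c
#include <stdint.h>
#include <stdio.h>

#define ELEMS 8

/* Move packed elements from src to dst, subject to per-element mask bits held
 * in a mask register (bit i governs element i). */
static void mask_move(const int32_t *src, int32_t *dst, uint8_t mask)
{
    for (int i = 0; i < ELEMS; i++) {
        if (mask & (1u << i)) {
            dst[i] = src[i];    /* element selected by the mask: transfer it */
        }
        /* else: element not selected; left unchanged in this model */
    }
}

int main(void) {
    int32_t src[ELEMS] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int32_t dst[ELEMS] = { 0 };
    mask_move(src, dst, 0xA5);  /* transfer elements 0, 2, 5 and 7 only */
    for (int i = 0; i < ELEMS; i++) printf("%d ", dst[i]);
    printf("\n");
    return 0;
}
```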

Publication date: 25-07-2013

INSTRUCTION CONTROL CIRCUIT, PROCESSOR, AND INSTRUCTION CONTROL METHOD

Number: US20130191616A1
Author: Nishikawa Takashi
Assignee: FUJITSU SEMICONDUCTOR LIMITED

In a vector processing device, a data dependence detecting unit detects a data dependence relation between a preceding instruction and a succeeding instruction which are inputted from an instruction buffer, and an instruction issuance control unit controls issuance of an instruction based on a detection result thereof. When there is a data dependence relation between the preceding instruction and the succeeding instruction, the instruction issuance control unit generates a new instruction equivalent to processing related to a vector register including the data dependence relation with the succeeding instruction in processing executed by the preceding instruction and issues the new instruction between the preceding instruction and the succeeding instruction, and thereby a data hazard can be avoided between the preceding instruction and the succeeding instruction without making a stall occur. 1. An instruction control circuit of a vector processing device , the instruction control circuit comprising:an instruction buffer which stores a plurality of instructions;a data dependence detecting unit which detects a data dependence relation between a preceding instruction and a succeeding instruction which succeeds the preceding instruction among the plurality of instructions inputted from the instruction buffer; andan instruction issuance control unit which controls issuance of an instruction based on a detection result in the data dependence detecting unit,wherein the instruction issuance control unitgenerates a new instruction including a same instruction type as the preceding instruction when there is a data dependence relation between the preceding instruction and the succeeding instruction, and issues the generated new instruction between the preceding instruction and the succeeding instruction and, in the generation of the new instruction,determines identification information of a second register of the new instruction from identification information of a first ...

Publication date: 08-08-2013

PROCESSOR PERFORMANCE IMPROVEMENT FOR INSTRUCTION SEQUENCES THAT INCLUDE BARRIER INSTRUCTIONS

Number: US20130205121A1

A technique for processing an instruction sequence that includes a barrier instruction, a load instruction preceding the barrier instruction, and a subsequent memory access instruction following the barrier instruction includes determining, by a processor core, that the load instruction is resolved based upon receipt by the processor core of an earliest of a good combined response for a read operation corresponding to the load instruction and data for the load instruction. The technique also includes if execution of the subsequent memory access instruction is not initiated prior to completion of the barrier instruction, initiating by the processor core, in response to determining the barrier instruction completed, execution of the subsequent memory access instruction. The technique further includes if execution of the subsequent memory access instruction is initiated prior to completion of the barrier instruction, discontinuing by the processor core, in response to determining the barrier instruction completed, tracking of the subsequent memory access instruction with respect to invalidation. 1. A method of processing an instruction sequence that includes a barrier instruction , a load instruction preceding the barrier instruction , and a subsequent memory access instruction following the barrier instruction , the method comprising:determining, by a processor core, that the load instruction is resolved based upon receipt by the processor core of an earliest of a good combined response for a read operation corresponding to the load instruction and data for the load instruction;if execution of the subsequent memory access instruction is not initiated prior to completion of the barrier instruction, initiating by the processor core, in response to determining the barrier instruction completed, execution of the subsequent memory access instruction; andif execution of the subsequent memory access instruction is initiated prior to completion of the barrier instruction, ...

Publication date: 15-08-2013

System for implementing vector look-up table operations in a SIMD processor

Number: US20130212353A1
Author: Mimar Tibet
Assignee:

The present invention incorporates a system for vector Look-Up Table (LUT) operations into a single-instruction multiple-data (SIMD) processor in order to implement plurality of LUT operations simultaneously, where each of the LUT contents could be the same or different. Elements of one or two vector registers are used to form LUT indexes, and the output of vector LUT operation is written into a vector register. No dedicated LUT memory is required; rather, data memory is organized as multiple separate data memory banks, where a portion of each data memory bank is used for LUT operations. For a single-input vector LUT operation, the address input of each LUT is operably coupled to any of the input vector register's elements using input vector element mapping logic in one embodiment. Thus, one input vector element can produce (a positive integer) N output elements using N different LUTs, or (another positive integer) K input vector elements can produce N output elements, where K is an integer from one to N.

1-36. (canceled)
37. A method for performing a plurality of lookup table operations in parallel in one step in a processor, the method comprising:
providing a memory that is partitioned into a plurality of memory banks, each of said plurality of memory banks is independently addressable, the number of said plurality of memory banks is at least the same as a number of vector elements of at least one source vector, said memory is shared for use as a local data memory by said processor for access by load and store instructions and a plurality of lookup tables;
providing a vector register array with ability to store a plurality of vectors;
storing one of said plurality of lookup tables into each of said plurality of memory banks at a base address, said plurality of lookup tables each containing a plurality of entries;
storing said at least one source vector into said vector register array;
using index values to select entries of said plurality of lookup tables in ...
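
The parallel lookup can be modelled directly in C: the data memory is split into one bank per vector lane, each bank holds its own table at a common base offset, and lane i of the result is fetched from bank i using lane i of the index vector. Bank count, table size, and names are assumptions for the sketch.

```c
#include <stdint.h>
#include <stdio.h>

#define LANES      4      /* number of memory banks == number of vector elements */
#define BANK_WORDS 64     /* words per bank; a portion of each bank holds a LUT  */

static uint16_t bank[LANES][BANK_WORDS];   /* independently addressable banks */

/* Vector LUT: out[i] = table stored in bank i, entry selected by index[i].
 * All LANES lookups are independent, so hardware can perform them in one step. */
static void vector_lut(uint32_t base, const uint8_t index[LANES], uint16_t out[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = bank[i][base + index[i]];
}

int main(void) {
    /* Load a different table into each bank at base offset 16. */
    for (int i = 0; i < LANES; i++)
        for (int e = 0; e < 8; e++)
            bank[i][16 + e] = (uint16_t)(100 * (i + 1) + e);

    uint8_t  idx[LANES] = { 0, 3, 5, 7 };
    uint16_t out[LANES];
    vector_lut(16, idx, out);
    for (int i = 0; i < LANES; i++) printf("%u ", out[i]);   /* 100 203 305 407 */
    printf("\n");
    return 0;
}
```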

Publication date: 15-08-2013

Conditional vector mapping in a SIMD processor

Number: US20130212355A1
Author: Mimar Tibet
Assignee:

The present invention provides a method for mapping input vector register elements to output vector register elements in one step in relation to a control vector register controlling vector-to-vector mapping and condition code values. The method includes storing an input vector having N-elements of input data in a vector register and storing a control vector having N-elements in a vector register, and providing for enabling vector-to-vector mapping where the mask bit is not set to selectively disable. The masking of certain elements is useful to partition large mappings of vectors or matrices into sizes that fits the number of elements of a given SIMD, and merging of multiple mapped results together. This method and system provides a highly efficient mechanism of mapping vector register elements in parallel based on a user-defined mapping and prior calculated condition codes, and merging these mapped vector elements with another vector using a mask.

1-51. (canceled)
52. A method for performing vector operations in parallel in one step, the method comprising the steps of:
providing a vector register file including a plurality of vector registers;
storing a first input vector in said vector register file;
storing a control vector in said vector register file, wherein said control vector is selected as a source operand of said vector operations;
selecting a condition flag from a plurality of condition flags for each vector element position in accordance with a condition select field from a vector instruction, said plurality of condition flags are derived from results of executing a prior instruction sequence;
mapping the elements of said first input vector to the elements of a first output vector, in accordance with a first field of respective element of said control vector; and
storing elements of said first output vector on an element-by-element basis conditionally, if mask bit of respective element of said control vector is interpreted as false and in accordance with ...

Publication date: 15-08-2013

Processor to Execute Shift Right Merge Instructions

Number: US20130212359A1
Assignee:

Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block is made. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block.

1. A processor comprising:
a plurality of registers to store 128-bit operands;
a decoder to decode a single instruction multiple data (SIMD) instruction, the SIMD instruction to indicate a first 128-bit operand having a first set of sixteen byte elements and a second 128-bit operand having a second set of sixteen byte elements, the SIMD instruction to have a 4-bit immediate to specify a number (n) of bytes;
an execution unit coupled with the decoder and the plurality of registers, the execution unit in response to the SIMD instruction to store a 128-bit result in a destination indicated by the instruction, wherein the result is to include the number (n) least significant byte elements of the second operand in the number (n) most significant bytes of the result, concatenated with sixteen minus the number (n) most significant byte elements of the first operand in sixteen minus the number (n) least significant bytes of the result.

The present application is a continuation of U.S. patent application Ser. No. 13/602,546, filed on Sep. 4, 2012, entitled "PROCESSOR TO EXECUTE SHIFT RIGHT MERGE INSTRUCTIONS," now pending, which is a continuation of U.S. patent application Ser. No. 13/477,544, filed on May 22, 2012, entitled "PROCESSOR TO EXECUTE SHIFT RIGHT MERGE INSTRUCTIONS," now pending, which is a continuation of U.S. patent application Ser. No. 12/907,843 filed on Oct. 19, 2010 ...
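
Claim 1 above pins down the byte arithmetic precisely, so the operation can be restated as a short C routine. The sketch follows the claim's wording for 16-byte operands and a 4-bit immediate n; the function name and test values are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Shift-right merge of two 16-byte operands with immediate n (0..15):
 * the (16 - n) most significant bytes of op1 land in the (16 - n) least
 * significant bytes of the result, and the n least significant bytes of op2
 * land in the n most significant bytes of the result. */
static void shift_right_merge(const uint8_t op1[16], const uint8_t op2[16],
                              unsigned n, uint8_t result[16])
{
    for (unsigned i = 0; i < 16 - n; i++)
        result[i] = op1[i + n];          /* high bytes of op1 -> low bytes of result */
    for (unsigned i = 0; i < n; i++)
        result[16 - n + i] = op2[i];     /* low bytes of op2 -> high bytes of result */
}

int main(void) {
    uint8_t a[16], b[16], r[16];
    for (int i = 0; i < 16; i++) { a[i] = (uint8_t)i; b[i] = (uint8_t)(100 + i); }
    shift_right_merge(a, b, 4, r);
    for (int i = 0; i < 16; i++) printf("%u ", r[i]);   /* 4..15 then 100..103 */
    printf("\n");
    return 0;
}
```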

Publication date: 05-09-2013

METHODS, APPARATUS, AND INSTRUCTIONS FOR CONVERTING VECTOR DATA

Number: US20130232318A1
Assignee:

A computer processor includes a decoder for decoding machine instructions and an execution unit for executing those instructions. The decoder and the execution unit are capable of decoding and executing vector instructions that include one or more format conversion indicators. For instance, the processor may be capable of executing a vector-load-convert-and-write (VLoadConWr) instruction that provides for loading data from memory to a vector register. The VLoadConWr instruction may include a format conversion indicator to indicate that the data from memory should be converted from a first format to a second format before the data is loaded into the vector register. Other embodiments are described and claimed. 1. A processor for executing a machine instruction combining data format conversion with at least one vector operation , the processor comprising:control logic capable of executing processor instructions comprising a vector-load-convert-and-write instruction having a format conversion indicator and a vector register indicator;wherein, in response to the vector-load-convert-and-write instruction, the control logic is capable of:converting data from a first format to a second format, based at least in part on the format conversion indicator; andafter converting the data to the second format, saving the data in the second format to multiple elements of a vector register identified by the vector register indicator.2. A processor according to claim 1 , wherein the control logic is further capable of executing a vector-load-convert-compute-and-write instruction having the format conversion indicator and the vector register indicator claim 1 , wherein:in response to the vector-load-convert-compute-and-write instruction, the control logic is capable of:converting data from the first format to the second format, based at least in part on the format conversion indicator;performing a vector arithmetic operation, based at least in part on the data in the second format; ...

Publication date: 05-09-2013

Unpacking Packed Data In Multiple Lanes

Number: US20130232321A1
Assignee:

Receiving an instruction indicating first and second operands. Each of the operands having packed data elements that correspond in respective positions. A first subset of the data elements of the first operand and a first subset of the data elements of the second operand each corresponding to a first lane. A second subset of the data elements of the first operand and a second subset of the data elements of the second operand each corresponding to a second lane. Storing result, in response to instruction, including: (1) in first lane, only lowest order data elements from first subset of first operand interleaved with corresponding lowest order data elements from first subset of second operand; and (2) in second lane, only highest order data elements from second subset of first operand interleaved with corresponding highest order data elements from second subset of second operand.

1. A method comprising:
receiving an instruction, the instruction indicating a first operand and a second operand, each of the first and second operands having a plurality of packed data elements that correspond in respective positions, a first subset of the packed data elements of the first operand and a first subset of the packed data elements of the second operand each corresponding to a first lane, and a second subset of the packed data elements of the first operand and a second subset of the packed data elements of the second operand each corresponding to a second lane; and
storing a result in response to the instruction, the result including: (1) in the first lane, only lowest order data elements from the first subset of the first operand interleaved with corresponding lowest order data elements from the first subset of the second operand; and (2) in the second lane, only highest order data elements from the second subset of the first operand interleaved with corresponding highest order data elements from the second subset of the second operand.
2. The method of claim 1, wherein the ...
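
A C model of the two-lane unpack, assuming operands split into two lanes of four 32-bit elements each (the widths are assumptions): the low lane interleaves the lowest-order element pairs of the low-lane subsets, and the high lane interleaves the highest-order element pairs of the high-lane subsets.

```c
#include <stdint.h>
#include <stdio.h>

/* a and b each hold two lanes of four elements (8 elements total).
 * Lane 0 of the result: a0, b0, a1, b1   (lowest-order pairs of lane 0)
 * Lane 1 of the result: a6, b6, a7, b7   (highest-order pairs of lane 1) */
static void unpack_two_lanes(const uint32_t a[8], const uint32_t b[8], uint32_t r[8])
{
    /* low lane: interleave the two lowest-order elements of each source's low lane */
    r[0] = a[0]; r[1] = b[0]; r[2] = a[1]; r[3] = b[1];
    /* high lane: interleave the two highest-order elements of each source's high lane */
    r[4] = a[6]; r[5] = b[6]; r[6] = a[7]; r[7] = b[7];
}

int main(void) {
    uint32_t a[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    uint32_t b[8] = { 10, 11, 12, 13, 14, 15, 16, 17 };
    uint32_t r[8];
    unpack_two_lanes(a, b, r);
    for (int i = 0; i < 8; i++) printf("%u ", r[i]);   /* 0 10 1 11 6 16 7 17 */
    printf("\n");
    return 0;
}
```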

More details
12-09-2013 publication date

Method, apparatus and instructions for parallel data conversions

Number: US20130238879A1
Author: Gopalan Ramanujam
Assignee: Individual

Method, apparatus, and program means for performing a conversion. In one embodiment, a disclosed apparatus includes a destination storage location corresponding to a first architectural register. A functional unit operates responsive to a control signal, to convert a first packed first format value selected from a set of packed first format values into a plurality of second format values. Each of the first format values has a plurality of sub elements having a first number of bits. The second format values have a greater number of bits. The functional unit stores the plurality of second format values into an architectural register.

More details
12-09-2013 publication date

Operation processing device, mobile terminal and operation processing method

Number: US20130238880A1
Author: Masahiko Toichi
Assignee: Fujitsu Ltd

An operation processing device for executing a plurality of operations for aligned data by one vector instruction includes a first mask storage unit and a second mask storage unit. The first mask storage unit stores first mask data to designate each of the plurality of operations a true or false operation, and the second mask storage unit stores second mask data to designate a number to be true continuously, in the plurality of operations.

More details
19-09-2013 publication date

DETERMINING THE STATUS OF RUN-TIME-INSTRUMENTATION CONTROLS

Number: US20130246748A1

The invention relates to determining the status of run-time-instrumentation controls. The status is determined by executing a test run-time-instrumentation controls (TRIC) instruction. The TRIC instruction is executed in either a supervisor state or a lesser-privileged state. The TRIC instruction determines whether the run-time-instrumentation controls have changed. The run-time-instrumentation controls are set to an initial value using a privileged load run-time-instrumentation controls (LRIC) instruction. The TRIC instruction is fetched and executed. If the TRIC instruction is enabled, then it is determined if the initial value set by the run-time-instrumentation controls has been changed. If the initial value set by the run-time-instrumentation controls has been changed, then a condition code is set to a first value. 1. A computer implemented method for modifying run-time-instrumentation controls from a lesser-privileged state , the method comprising:setting a set of run-time-instrumentation controls to an initial value using a privileged load run-time-instrumentation controls (LRIC) instruction;fetching the TRIC instruction; based on the TRIC instruction being enabled, determining whether the initial value set by the run-time-instrumentation controls has been changed; and', 'based on determining the initial value set by the run-time-instrumentation controls been changed, setting a condition code to a first value., 'executing the TRIC instruction, the executing comprising2. The method according to claim 1 , wherein determining that the TRIC instruction is enabled comprises any one of:based on the TRIC instruction being executed in supervisor mode, determining that the TRIC instruction is enabled; andbased on the TRIC instruction being executed in the lesser-privileged state and a field of the run-time-instrumentation controls is set.3. The method according to claim 1 , further comprising:based on the TRIC instruction being not-enabled, setting the condition code ...

More details
19-09-2013 publication date

Vector find element not equal instruction

Number: US20130246751A1
Assignee: International Business Machines Corp

Processing of character data is facilitated. A Find Element Not Equal instruction is provided that compares data of multiple vectors for inequality and provides an indication of inequality, if inequality exists. An index associated with the unequal element is stored in a target vector register. Further, the same instruction, the Find Element Not Equal instruction, also searches a selected vector for null elements, also referred to as zero elements. A result of the instruction is dependent on whether the null search is provided, or just the compare.
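A hedged scalar sketch of the compare behaviour follows, assuming byte-sized elements and that the optional null search applies to the first operand; neither detail is fixed by the abstract.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Returns the index of the first position where a and b differ, or, when
 * zero_search is enabled, the first position where a holds a zero (null)
 * element, whichever comes first.  Returns len when neither is found. */
size_t find_element_not_equal(const uint8_t *a, const uint8_t *b,
                              size_t len, bool zero_search)
{
    for (size_t i = 0; i < len; i++) {
        if (zero_search && a[i] == 0) return i;
        if (a[i] != b[i]) return i;
    }
    return len;
}

int main(void)
{
    const uint8_t a[8] = { 'a', 'b', 0, 'd', 'e', 'f', 0, 0 };
    const uint8_t b[8] = { 'a', 'b', 0, 'd', 'x', 'f', 0, 0 };
    printf("%zu\n", find_element_not_equal(a, b, 8, false)); /* 4: 'e' vs 'x'   */
    printf("%zu\n", find_element_not_equal(a, b, 8, true));  /* 2: null element */
    return 0;
}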

More details
19-09-2013 publication date

RUN-TIME INSTRUMENTATION MONITORING OF PROCESSOR CHARACTERISTICS

Number: US20130246771A1

Embodiments of the invention relate to monitoring processor characteristic information of a processor using run-time-instrumentation. An aspect of the invention includes executing an instruction stream on the processor and detecting a run-time instrumentation sample point of the executing instruction stream on the processor. A reporting group is stored in a run-time instrumentation program buffer based on the run-time instrumentation sample point. The reporting group includes processor characteristic information associated with the processor. 1. A computer implemented method for monitoring processor characteristic information of a processor using run-time-instrumentation , the method comprising:executing an instruction stream on a processor;detecting a run-time instrumentation sample point of the executing instruction stream on the processor; andstoring a reporting group in a run-time instrumentation program buffer based on the run-time instrumentation sample point, the reporting group including processor characteristic information associated with the processor.2. The method of claim 1 , further comprising:checking current processor characteristic information prior to storing a subsequent reporting group in the run-time instrumentation program buffer based on the subsequent run-time instrumentation sample point; and storing the subsequent reporting group in the run-time instrumentation program buffer;', 'suppressing storage of the subsequent reporting group in the run-time instrumentation program buffer; and', 'halting run-time instrumentation., 'based on the current processor characteristic information, determining whether to perform one of3. The method of claim 2 , further comprising:determining whether processors in a current configuration are configured to operate with a common CPU capability; and reading a suppression control of a run-time instrumentation control; and', 'suppressing storage of the subsequent reporting group in the run-time instrumentation ...

More details
19-09-2013 publication date

RUN-TIME INSTRUMENTATION INDIRECT SAMPLING BY INSTRUCTION OPERATION CODE

Number: US20130246772A1

Embodiments of the invention relate to implementing run-time instrumentation indirect sampling by instruction operation code. An aspect of the invention includes a method for implementing run-time instrumentation indirect sampling by instruction operation code. The method includes reading sample-point instruction operation codes from a sample-point instruction array, and comparing, by a processor, the sample-point instruction operation codes to an operation code of an instruction from an instruction stream executing on the processor. The method also includes recognizing a sample point upon execution of the instruction with the operation code matching one of the sample-point instruction operation codes. The run-time instrumentation information is obtained from the sample point. The method further includes storing the run-time instrumentation information in a run-time instrumentation program buffer as a reporting group. 1. A computer implemented method for implementing run-time instrumentation indirect sampling by instruction operation code , the method comprising:reading sample-point instruction operation codes from a sample-point instruction array;comparing, by a processor, the sample-point instruction operation codes to an operation code of an instruction from an instruction stream executing on the processor;recognizing a sample point upon execution of the instruction with the operation code matching one of the sample-point instruction operation codes, wherein run-time instrumentation information is obtained from the sample point; andstoring the run-time instrumentation information in a run-time instrumentation program buffer as a reporting group.2. The method of claim 1 , wherein the run-time instrumentation information comprises run-time instrumentation event records collected in a collection buffer of the processor and the reporting group further comprises system information records in combination with the run-time instrumentation event records.3. The method of ...
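A minimal C sketch of the sampling idea described above; the record layout, buffer sizes, and opcode values are illustrative assumptions, since the abstract does not specify them.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SAMPLE_OPS   4   /* size of the sample-point instruction array      */
#define GROUP_SLOTS  8   /* records per reporting group (illustrative)      */

struct ri_record    { uint64_t ia; uint16_t opcode; };          /* assumed shape */
struct report_group { struct ri_record rec[GROUP_SLOTS]; int n; };

static const uint16_t sample_point_ops[SAMPLE_OPS] = { 0x05, 0xE8, 0xC3, 0x0F };

static struct ri_record collection[GROUP_SLOTS];   /* circular collection buffer */
static int collected;

/* Called for every executed instruction: collect it, and when its opcode
 * matches one of the sample-point opcodes, emit the collection buffer
 * contents into the program buffer as one reporting group. */
void on_instruction(uint64_t ia, uint16_t opcode,
                    struct report_group *program_buffer, int *groups)
{
    collection[collected % GROUP_SLOTS] = (struct ri_record){ ia, opcode };
    collected++;
    for (int i = 0; i < SAMPLE_OPS; i++) {
        if (opcode == sample_point_ops[i]) {          /* sample point recognized */
            struct report_group *g = &program_buffer[(*groups)++];
            g->n = collected < GROUP_SLOTS ? collected : GROUP_SLOTS;
            memcpy(g->rec, collection, sizeof collection);
            collected = 0;                            /* reuse the buffer */
            return;
        }
    }
}

int main(void)
{
    struct report_group pb[4];
    int groups = 0;
    uint16_t stream[] = { 0x90, 0x05, 0x90, 0x90, 0xC3 };
    for (int i = 0; i < 5; i++)
        on_instruction(0x1000 + 4 * i, stream[i], pb, &groups);
    printf("reporting groups stored: %d\n", groups);  /* 2 */
    return 0;
}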

More details
19-09-2013 publication date

HARDWARE BASED RUN-TIME INSTRUMENTATION FACILITY FOR MANAGED RUN-TIMES

Number: US20130246773A1

Embodiments of the invention relate to performing run-time instrumentation. Run-time instrumentation is captured, by a processor, based on an instruction stream of instructions of an application program executing on the processor. The capturing includes storing the run-time instrumentation data in a collection buffer of the processor. A run-time instrumentation sample point trigger is detected by the processor. Contents of the collection buffer are copied into a program buffer as a reporting group based on detecting the run-time instrumentation sample point trigger. The program buffer is located in main storage in an address space that is accessible by the application program. 1. A computer implemented method for performing run-time instrumentation , the method comprising:capturing, by a processor, run-time instrumentation data based on an instruction stream of instructions of an application program executing on the processor, the capturing comprising storing the run-time instrumentation data in a collection buffer of the processor;detecting, by the processor, a run-time instrumentation sample point trigger; andcopying contents of the collection buffer into a program buffer as a reporting group based on the detecting the run-time instrumentation sample point trigger, the program buffer located in main storage in an address space that is accessible by the application program.2. The method of claim 1 , wherein the collection buffer is implemented by hardware located on the processor.3. The method of claim 1 , wherein the collection buffer is not accessible by the application program.4. The method of claim 1 , wherein the capturing and the detecting are performed in a manner that is transparent to the executing.5. The method of further comprising capturing claim 1 , in the collection buffer claim 1 , instruction addresses and metadata corresponding to events detected during the executing of the instruction stream.6. The method of claim 1 , wherein the reporting group ...

More details
19-09-2013 publication date

RUN-TIME INSTRUMENTATION INDIRECT SAMPLING BY ADDRESS

Number: US20130246774A1

Embodiments of the invention relate to implementing run-time instrumentation indirect sampling by address. An aspect of the invention includes a method for implementing run-time instrumentation indirect sampling by address. The method includes reading sample-point addresses from a sample-point address array, and comparing, by a processor, the sample-point addresses to an address associated with an instruction from an instruction stream executing on the processor. The method further includes recognizing a sample point upon execution of the instruction associated with the address matching one of the sample-point addresses. Run-time instrumentation information is obtained from the sample point. The method also includes storing the run-time instrumentation information in a run-time instrumentation program buffer as a reporting group. 1. A computer implemented method for implementing run-time instrumentation indirect sampling by address , the method comprising:reading sample-point addresses from a sample-point address array;comparing, by a processor, the sample-point addresses to an address associated with an instruction from an instruction stream executing on the processor;recognizing a sample point upon execution of the instruction associated with the address matching one of the sample-point addresses, wherein run-time instrumentation information is obtained from the sample point; andstoring the run-time instrumentation information in a run-time instrumentation program buffer as a reporting group.2. The method of claim 1 , wherein the address associated with the instruction claim 1 , based on address type claim 1 , is one of: an address of the instruction and an address of an operand of the instruction.3. The method of claim 1 , further comprising:initializing a run-time-instrumentation control based on executing a load run-time instrumentation controls (LRIC) instruction, the LRIC instruction establishing a sampling mode and a sample-point address (SPA) control.4. The ...

More details
19-09-2013 publication date

RUN-TIME INSTRUMENTATION SAMPLING IN TRANSACTIONAL-EXECUTION MODE

Number: US20130246775A1

Embodiments of the invention relate to implementing run-time instrumentation sampling in transactional-execution mode. An aspect of the invention includes a method for implementing run-time instrumentation sampling in transactional-execution mode. The method includes determining, by a processor, that the processor is configured to execute instructions of an instruction stream in a transactional-execution mode, the instructions defining a transaction. The method also includes interlocking completion of storage operations of the instructions to prevent instruction-directed storage until completion of the transaction. The method further includes recognizing a sample point during execution of the instructions while in the transactional-execution mode. The method additionally includes run-time-instrumentation-directed storing, upon successful completion of the transaction, run-time instrumentation information obtained at the sample point. 1. A computer implemented method for implementing run-time instrumentation sampling in transactional-execution mode , the method comprising:determining, by a processor, that the processor is configured to execute instructions of an instruction stream in a transactional-execution mode, the instructions defining a transaction;interlocking completion of storage operations of the instructions to prevent instruction-directed storage until completion of the transaction;recognizing a sample point during execution of the instructions while in the transactional-execution mode; andrun-time-instrumentation-directed storing, upon successful completion of the transaction, run-time instrumentation information obtained at the sample point.2. The method of claim 1 , wherein run-time-instrumentation-directed storing the run-time instrumentation information obtained at the sample point further comprises:collecting run-time instrumentation events in a collection buffer while in the transactional-execution mode;deferring storage of the collected run-time ...

More details
19-09-2013 publication date

RUN-TIME INSTRUMENTATION REPORTING

Number: US20130246776A1

Embodiments of the invention relate to run-time instrumentation reporting. An instruction stream is executed by a processor. Run-time instrumentation information of the executing instruction stream is captured by the processor. Run-time instrumentation records are created based on the captured run-time instrumentation information. A run-time instrumentation sample point of the executing instruction stream on the processor is detected. A reporting group is stored in a run-time instrumentation program buffer. The storing is based on the detecting and the storing includes: determining a current address of the run-time instrumentation program buffer, the determining based on instruction accessible run-time instrumentation controls; and storing the reporting group into the run-time instrumentation program buffer based on an origin address and the current address of the run-time instrumentation program buffer, the reporting group including the created run-time instrumentation records. 1. A computer implemented method for run-time instrumentation reporting , the method comprising:executing an instruction stream by a processor;capturing, by the processor, run-time instrumentation information of said executing instruction stream;based on said captured run-time instrumentation information, creating run-time instrumentation records;detecting a run-time instrumentation sample point of the executing instruction stream on the processor; and determining a current address of the run-time instrumentation program buffer, the determining based on instruction accessible run-time instrumentation controls; and', 'storing the reporting group into the run-time instrumentation program buffer based on an origin address and the current address of the run-time instrumentation program buffer, the reporting group comprising said created run-time instrumentation records., 'storing a reporting group in a run-time instrumentation program buffer, the storing based on the detecting a run-time ...

More details
03-10-2013 publication date

PROCESSOR AND METHOD FOR DRIVING THE SAME

Number: US20130262828A1
Author: Yoneda Seiichi

A low-power processor that does not easily malfunction is provided. Alternatively, a low-power processor having high processing speed is provided. Alternatively, a method for driving the processor is provided. In power gating, the processor performs part of data backup in parallel with arithmetic processing and performs part of data recovery in parallel with arithmetic processing. Such a driving method prevents a sharp increase in power consumption in a data backup period and a data recovery period and generation of instantaneous voltage drops and inhibits increases of the data backup period and the data recovery period. 1. A processor comprising:an instruction decoder;a logic unit including a plurality of logic circuit blocks including a volatile memory block and a nonvolatile memory block;a backup/recovery controller including a storage storing first reference instruction enumeration and second reference instruction enumeration;a power controller; anda flag storage.2. The processor according to claim 1 , wherein the volatile memory block includes a register.3. The processor according to claim 1 , wherein the nonvolatile memory block includes a transistor including an oxide semiconductor.4. The processor according to claim 1 , wherein the processor is incorporated in one selected from the group consisting of an air conditioner claim 1 , an electric refrigerator-freezer claim 1 , an image display device claim 1 , and an electric vehicle.5. A processor comprising:an instruction decoder;a logic unit including a plurality of logic circuit blocks including a volatile memory block and a nonvolatile memory block;a backup/recovery controller including a storage storing first reference instruction enumeration and second reference instruction enumeration;a power controller; anda flag storage,wherein the instruction decoder receives an instruction from an outside of the processor and gives an instruction to the logic unit, the backup/recovery controller, and the power ...

More details
03-10-2013 publication date

Instruction Scheduling for Reducing Register Usage

Number: US20130262832A1
Assignee: Advanced Micro Devices Inc

A method, computer program product, and system are provided for scheduling a plurality of instructions in a computing system. For example, the method can generate a plurality of instruction lineages, in which the plurality of instruction lineages is assigned to one or more registers. Each of the plurality of instruction lineages has at least one node representative of an instruction from the plurality of instructions. The method can also determine a node order based on respective priority values associated with each of the nodes. Further, the method can include scheduling the plurality of instructions based on the node order and the one or more registers assigned to the one or more registers.

More details
03-10-2013 publication date

FUNCTION-BASED SOFTWARE COMPARISON METHOD

Number: US20130262843A1
Author: Du Ben-Chuan
Assignee: MStar Semiconductor, Inc.

A method for comparing a first subroutine and a second subroutine in functionality, includes: defining a plurality of instruction sets, each instruction set associated with a corresponding instruction set process; obtaining a first program section and a second program section from a first subroutine and a second subroutine, respectively, and categorizing the first subroutine and the second subroutine to one of the instruction sets, respectively; performing a program section comparison process to select and perform one of the instruction sets according to the instruction set to which the first program section is categorized and the instruction set to which the second program section is categorized, so as to compare whether the first program section and the second program section have identical functions, and to accordingly determine whether the first subroutine and the second subroutine are equivalent in functionality. 1. A method for comparing a first subroutine and a second subroutine in functionality , comprising:defining a plurality of instruction sets, each of which being associated with a corresponding instruction set process;performing a capturing process to obtain a first program section and a second program section respectively from the first subroutine and the second subroutine, and respectively categorizing the first program section and the second program section to one of the instruction sets; andperforming a program section comparison process to select and perform one of the instruction set processes according to the instruction set to which the first program section is categorized and the instruction set to which the second program section is categorized, so as to compare whether the first program section and the second program section are identical in functionality.2. The method according to claim 1 , wherein the program section comparison process comprises:when the first program section and the second program section are categorized to the same ...

More details
17-10-2013 publication date

SHUFFLE PATTERN GENERATING CIRCUIT, PROCESSOR, SHUFFLE PATTERN GENERATING METHOD, AND INSTRUCTION SEQUENCE

Number: US20130275718A1
Author: Baba Daisuke, Ueda Kyoko
Assignee:

Based on an input index sequence () composed of four indices (each having a bit width of 8 bits), a shift-copier generates an index sequence () by shifting each index leftward by 1 bit and making two copies of each index, and outputs the generated index sequence (). An adder generates a shuffle pattern () by adding 1, 0, 1, 0, 1, 0, 1 and 0 to the indices in the index sequence () from left to right, and outputs the generated shuffle pattern (). 1. A shuffle pattern generating circuit comprising:a shift-copier that generates an index sequence by: receiving an input index sequence composed of a plurality of indices, and a signal indicating a number of bits and a number of copies; shifting each index in the input index sequence leftward by the number of bits; and making the number of copies of each index in the input index sequence, and outputs the generated index sequence; andan adder that receives the index sequence output by the shift-copier and a signal indicating an additional value to be added to each index in the index sequence output by the shift-copier, and adds the additional value to each index in the index sequence output by the shift-copier.2. The shuffle pattern generating circuit of claim 1 , whereinthe additional value is different for each copy made from a same index in the input index sequence.3. The shuffle pattern generating circuit of claim 2 , whereinthe number of the copies is N (where N is an integer equal to or greater than 2), andthe additional value to be added to one of the copies made from a same index in the input index sequence is 0, and the additional value to be added to each of the remaining N−1 copies is an integer ranging from 1 to N−1.4. The shuffle pattern generating circuit of claim 1 , further comprising:a bit width changer that receives a signal indicating a bit width of each index in the input index sequence, and changes a bit width of each index to the bit width indicated by the signal.5. The shuffle pattern generating circuit ...
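The abstract fully specifies the arithmetic, so it can be modelled directly; the only assumption in the sketch below is that "left to right" corresponds to ascending output positions.

#include <stdint.h>
#include <stdio.h>

/* Each 8-bit input index is shifted left by 1 bit, duplicated, and then
 * 1, 0, 1, 0, ... is added to the copies from left to right, so an
 * element index i expands to the byte-pair selectors (2*i + 1, 2*i). */
void make_shuffle_pattern(const uint8_t idx[4], uint8_t pattern[8])
{
    for (int i = 0; i < 4; i++) {
        uint8_t shifted = (uint8_t)(idx[i] << 1);   /* shift left by 1 bit */
        pattern[2 * i]     = shifted + 1;           /* left copy gets +1   */
        pattern[2 * i + 1] = shifted + 0;           /* right copy gets +0  */
    }
}

int main(void)
{
    uint8_t idx[4] = { 3, 0, 2, 1 };
    uint8_t pat[8];
    make_shuffle_pattern(idx, pat);
    for (int i = 0; i < 8; i++) printf("%u ", pat[i]);
    printf("\n");   /* 7 6 1 0 5 4 3 2 */
    return 0;
}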

More details
17-10-2013 publication date

PACKED DATA OPERATION MASK SHIFT PROCESSORS, METHODS, SYSTEMS, AND INSTRUCTIONS

Number: US20130275719A1
Assignee:

A method of an aspect includes receiving a packed data operation mask shift instruction. The packed data operation mask shift instruction indicates a source having a packed data operation mask, indicates a shift count number of bits, and indicates a destination. The method further includes storing a result in the destination in response to the packed data operation mask shift instruction. The result includes a sequence of bits of the packed data operation mask that have been shifted by the shift count number of bits. Other methods, apparatus, systems, and instructions are disclosed. 1. A method comprising:receiving a packed data operation mask shift instruction, the packed data operation mask shift instruction indicating a source having a packed data operation mask, indicating a shift count number of bits, and indicating a destination; andstoring a result in the destination in response to the packed data operation mask shift instruction, the result including a sequence of bits of the packed data operation mask that have been shifted by the shift count number of bits.2. The method of claim 1 , wherein storing the result comprises storing the sequence of the bits of the packed data operation mask that have been logically shifted to the right by the shift count number of bits with the shift count number of zeros shifted in on the left.3. The method of claim 1 , wherein storing the result comprises storing the sequence of the bits of the packed data operation mask that have been logically shifted to the left by the shift count number of bits with the shift count number of zeros shifted in on the right.4. The method of claim 1 , wherein the packed data operation mask is an N-bit packed data operation mask claim 1 , wherein the shift count number of bits is an M-bit shift count number of bits claim 1 , and wherein the result includes:(a) in a least significant N-bits of the destination, an (N−M)-bit sequence of the bits of the N-bit packed data operation mask that has ...
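A minimal scalar model of the shift, assuming a 16-bit mask held in an ordinary integer; a real implementation would read and write dedicated mask registers.

#include <stdint.h>
#include <stdio.h>

/* Logical shifts of the packed data operation mask with zeros shifted in. */
uint16_t mask_shift_right(uint16_t mask, unsigned count)
{
    return count >= 16 ? 0 : (uint16_t)(mask >> count);
}

uint16_t mask_shift_left(uint16_t mask, unsigned count)
{
    return count >= 16 ? 0 : (uint16_t)(mask << count);
}

int main(void)
{
    uint16_t k = 0xB01D;
    printf("%04X\n", mask_shift_right(k, 4));  /* 0B01 */
    printf("%04X\n", mask_shift_left(k, 4));   /* 01D0 */
    return 0;
}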

More details
17-10-2013 publication date

METHOD AND APPARATUS TO PROCESS KECCAK SECURE HASHING ALGORITHM

Number: US20130275722A1
Assignee: Intel Corporation

A processor includes a plurality of registers, an instruction decoder to receive an instruction to process a KECCAK state cube of data representing a KECCAK state of a KECCAK hash algorithm, to partition the KECCAK state cube into a plurality of subcubes, and to store the subcubes in the plurality of registers, respectively, and an execution unit coupled to the instruction decoder to perform the KECCAK hash algorithm on the plurality of subcubes respectively stored in the plurality of registers in a vector manner. 1. A processor , comprising:a plurality of registers;an instruction decoder to receive an instruction to process a KECCAK state cube of data representing a KECCAK state of a KECCAK hash algorithm, to partition the KECCAK state cube into a plurality of subcubes, and to store the subcubes in the plurality of registers, respectively; andan execution unit coupled to the instruction decoder to perform the KECCAK hash algorithm on the plurality of subcubes respectively stored in the plurality of registers in a vector manner.2. The processor of claim 1 , wherein the KECCAK state cube includes 64 slices partitioned into 4 subcubes claim 1 , wherein each subcube contains 16 slices.3. The processor of claim 2 , wherein the plurality of registers include 4 registers claim 2 , each having at least 450 bits.4. The processor of claim 1 , wherein claim 1 , for each round of the KECCAK algorithm claim 1 , the execution unit is configured to perform KECCAK_THETA operations claim 1 , includingperforming a θ function of the KECCAK algorithm on the subcubes stored in the registers in parallel, andperforming a first portion of a ρ function of the KECCAK algorithm on the subcubes in parallel.5. The processor of claim 4 , wherein the execution unit is further configured to perform KECCAK_ROUND operations claim 4 , includingperforming a second portion of the ρ function of the KECCAK algorithm on the subcubes in parallel,performing a π function of the KECCAK algorithm on the ...

More details
17-10-2013 publication date

SYSTEMS, APPARATUSES, AND METHODS FOR GENERATING A DEPENDENCY VECTOR BASED ON TWO SOURCE WRITEMASK REGISTERS

Number: US20130275724A1
Author: Bharadwaj Jayashankar
Assignee:

Embodiments of systems, apparatuses, and methods of performing in a computer processor dependency index vector calculation in response to an instruction that includes a first and second source writemask register operands, a destination vector register operand, and an opcode are described. 1. A method of performing in a computer processor dependency index vector calculation in response to an instruction that includes a first and second source writemask register operands , a destination vector register operand , and an opcode , the method comprising steps of:executing the instruction to determine, for each bit position the first source writemask register, a dependence value that indicates for an iteration corresponding to that bit position, which bit position that it is dependent on;storing the determined dependence values in corresponding data element positions of the destination vector register.2. The method of claim 1 , wherein the destination vector register is a 128-bit vector register.3. The method of claim 1 , wherein the destination vector register is a 256-bit vector register.4. The method of claim 1 , wherein the destination vector register is a 512-bit vector register.5. The method of claim 1 , wherein the source writemask registers are 16-bit registers.6. The method of claim 1 , wherein the source writemask registers are 64-bit registers.7. The method of claim 1 , wherein the determining and storing further comprises:setting a counter value and a temporary value to 0;determining if a value in the counter value bit position of the first source writemask register is 1;when the value in the counter value bit position of the first source writemask register is 1, setting a destination vector register data element at position counter value to be the temporary value;when the value in the counter value bit position of the first source writemask register is 0, setting a destination vector register data element at position counter value to be 0;determining if a ...

More details
17-10-2013 publication date

Apparatus and method of improved extract instructions

Number: US20130275730A1
Assignee: Intel Corp

An apparatus is described that includes instruction execution logic circuitry to execute first, second, third and fourth instructions. Both the first instruction and the second instruction select a first group of input vector elements from one of multiple first non overlapping sections of respective first and second input vectors. The first group has a first bit width. Each of the multiple first non overlapping sections have a same bit width as the first group. Both the third instruction and the fourth instruction select a second group of input vector elements from one of multiple second non overlapping sections of respective third and fourth input vectors. The second group has a second bit width that is larger than the first bit width. Each of the multiple second non overlapping sections have a same bit width as the second group. The apparatus includes masking layer circuitry to mask the first and second groups of the first and third instructions at a first granularity, where, respective resultants produced therewith are respective resultants of the first and third instructions. The masking circuitry is also to mask the first and second groups of the second and fourth instructions at a second granularity, where, respective resultants produced therewith are respective resultants of the second and fourth instructions.

More details
17-10-2013 publication date

PROCESSORS

Number: US20130275732A1
Assignee:

A processing apparatus comprises a plurality of processors each arranged to perform an instruction, and a bus arranged to carry data and control tokens between the processors. Each processor is arranged, if it receives a control token via the bus, to carry out the instruction, and on carrying out the instruction, to perform an operation on the data, to identify any of the processors which are to be data target processors, and to transmit output data to any identified data target processors, to identify any of the processors which are to be control target processors, and to transmit a control token to any identified control target processors. 1. A processing apparatus comprising a plurality of chips each comprising a plurality of processors , each processor arranged to perform an instruction , and a bus arranged to carry data tokens and control tokens between the processors , wherein each processor is arranged , if it receives a control token via the bus , to carry out the instruction , and on carrying out the instruction , to perform an operation on the data to produce a result , to identify any of the processors which are to be data target processors , and to transmit output data to any identified data target processors , to identify any of the processors which are to be control target processors , and to transmit a control token to any identified control target processors , each chip having an input device which receives tokens from the bus and an output device from which tokens can be transferred to another of the chips , wherein the plurality of processors are between the input and output device and each processor has an address associated with it , the addresses being within a range , the apparatus being arranged , on receipt by the output device of a token having a target address which is outside the range , to perform a modification of the target address , and to transfer the token to said other chip.2. A processing apparatus according to wherein each ...

More details
24-10-2013 publication date

System, apparatus and method for translating vector instructions

Number: US20130283022A1
Author: Ruchira Sasanka
Assignee: Intel Corp

Vector translation instructions are used to demarcate the beginning and the end of a code region to be translated. The code region includes a first set of vector instructions defined in an instruction set of a source processor. A processor receives the vector translation instructions and the demarcated code region, and translates the code region into translated code. The translated code includes a second set of vector instructions defined in an instruction set of a target processor. The translated code is executed by the target processor to produce a result value, the result value being the same as an original result value produced by the source processor executing the code region. The target processor stores the result value at a location that is not a vector register, the location being the same as an original location used by the source processor to store the original result value.

More details
07-11-2013 publication date

FLAG NON-MODIFICATION EXTENSION FOR ISA INSTRUCTIONS USING PREFIXES

Number: US20130297915A1
Assignee:

In one embodiment, a processor includes an instruction decoder to receive and decode an instruction having a prefix and an opcode, an execution unit to execute the instruction based on the opcode, and flag modification override logic to prevent the execution unit from modifying a flag register of the processor based on the prefix of the instruction. 1. A method , comprising:in response to an instruction having a prefix and an opcode received at a processor, executing, by an execution unit of the processor, the instruction based on the opcode; andpreventing the execution unit from modifying a flag register of the processor based on the prefix of the instruction.2. The method of claim 1 , further comprising:extracting the prefix from the instruction; anddetermining whether the instruction is valid based on the prefix in view of a capability of the processor, wherein the execution unit is to execute the instruction only if the instruction is valid.3. The method of claim 2 , wherein determining whether the instruction is valid comprises examining a value of one or more bits of the prefix in view of a processor identifier that identifies a type of the processor.4. The method of claim 2 , further comprising generating an exception indicating that the instruction is invalid claim 2 , if one or more bits of the prefix matches a predetermined bit pattern based on the capability of the processor.5. The method of claim 1 , further comprising:preventing the execution unit from modifying the flag register if one or more bits of the prefix match a first predetermined bit pattern; andallowing the execution unit to modify the flag register if one or more bits of the prefix match a second predetermined bit pattern.6. The method of claim 1 , wherein the opcode of the instruction represents an integer operation that when executed would normally modify the flag register.7. The method of claim 1 , wherein the prefix represents a vector length when the opcode includes a vector ...
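A toy C model of the behaviour described above; the flag layout and the way the prefix is represented are illustrative assumptions, not the encoding from the patent.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct flags { bool zero, carry; };   /* simplified flag register */

/* An ADD that normally updates flags; when the decoded instruction carries
 * the "do not modify flags" prefix, the flag write is suppressed. */
uint32_t exec_add(uint32_t a, uint32_t b, bool no_flag_prefix, struct flags *f)
{
    uint32_t r = a + b;
    if (!no_flag_prefix) {            /* flag modification override */
        f->zero  = (r == 0);
        f->carry = (r < a);           /* unsigned carry out */
    }
    return r;
}

int main(void)
{
    struct flags f = { false, false };
    exec_add(0xFFFFFFFFu, 1, false, &f);
    printf("zero=%d carry=%d\n", f.zero, f.carry);   /* zero=1 carry=1 */
    f = (struct flags){ false, false };
    exec_add(0xFFFFFFFFu, 1, true, &f);              /* prefix set: flags untouched */
    printf("zero=%d carry=%d\n", f.zero, f.carry);   /* zero=0 carry=0 */
    return 0;
}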

More details
14-11-2013 publication date

PERFORMING A CYCLIC REDUNDANCY CHECKSUM OPERATION RESPONSIVE TO A USER-LEVEL INSTRUCTION

Number: US20130305015A1
Assignee:

In one embodiment, the present invention includes a method for receiving incoming data in a processor and performing a checksum operation on the incoming data in the processor pursuant to a user-level instruction for the checksum operation. For example, a cyclic redundancy checksum may be computed in the processor itself responsive to the user-level instruction. Other embodiments are described and claimed. 1. A system comprising: a cache;', a first 32-bit register to store a first operand;', 'a second 32-bit register to store a second operand;', 'a first 64-bit register to store a third operand; and', 'a second 64-bit register to store a fourth operand; and, 'a set of registers, including, a first execution unit coupled to the first and the second 32-bit registers to perform a first XOR operation on at least one bit of the first and the second operands and to store a result of the first XOR operation in a first destination register responsive to a first instruction of the ISA; and', 'a second execution unit coupled to the first and the second 64-bit registers to perform a second XOR operation on at least one bit of the third and the fourth operands and to store a result of the second XOR operation in a second destination register responsive to a second instruction of the ISA;, 'a plurality of execution units to perform exclusive-OR (XOR) operations on data of a configurable size responsive to instructions of an instruction set architecture (ISA) for the processor, the plurality of execution units including], 'a processor comprisinga memory coupled with the processor;a data storage device coupled with the processor;an audio I/O device coupled with the processor; anda communication device coupled to the processor.2. The system of claim 1 , wherein the first and the second execution units are to perform the first and the second XOR operations in the same number of cycles.3. The system of claim 1 , wherein the first destination register comprises the first 32-bit ...
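A bit-serial reference in C for the byte-at-a-time CRC update such a user-level instruction would perform. The abstract does not name a polynomial; the reflected CRC-32C (Castagnoli) polynomial is used here only as an example.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Update a running 32-bit CRC one input byte per step (reflected CRC-32C). */
uint32_t crc32c_update(uint32_t crc, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

int main(void)
{
    const char *msg = "123456789";
    uint32_t crc = crc32c_update(0xFFFFFFFFu, (const uint8_t *)msg,
                                 strlen(msg)) ^ 0xFFFFFFFFu;
    printf("%08X\n", crc);   /* E3069283: CRC-32C check value of "123456789" */
    return 0;
}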

More details
14-11-2013 publication date

PERFORMING A CYCLIC REDUNDANCY CHECKSUM OPERATION RESPONSIVE TO A USER-LEVEL INSTRUCTION

Number: US20130305016A1
Assignee:

In one embodiment, the present invention includes a method for receiving incoming data in a processor and performing a checksum operation on the incoming data in the processor pursuant to a user-level instruction for the checksum operation. For example, a cyclic redundancy checksum may be computed in the processor itself responsive to the user-level instruction. Other embodiments are described and claimed. 1. A system comprising: [ a first 32-bit register to store a first operand;', 'a second 32-bit register to store a second operand;, 'a set of registers, including, 'a second 64-bit register to store a fourth operand;', 'a first 64-bit register to store a third operand; and'}, a first execution unit coupled to the first and the second 32-bit registers to perform a first XOR operation on at least one bit of the first and the second operands and to store a result of the first XOR operation in a first destination register responsive to a first instruction of the ISA; and', 'a second execution unit coupled to the first and the second 64-bit registers to perform a second XOR operation on at least one bit of the third and the fourth operands and to store a result of the second XOR operation in a second destination register responsive to a second instruction of the ISA; and, 'a plurality of execution units to perform exclusive-OR (XOR) operations on data of a configurable size responsive to instructions of an instruction set architecture (ISA) for the processor, the plurality of execution units including, 'a memory controller;, 'a processor comprisinga memory coupled with the memory controller;a data storage device coupled with the processor;an audio I/0 interface coupled with the processor; anda communication device coupled with the processor;2. The system of claim 1 , wherein the first and the second execution units are to perform the first and the second XOR operations in the same number of cycles.3. The system of claim 1 , wherein the first destination register ...

More details
14-11-2013 publication date

MFENCE and LFENCE Micro-Architectural Implementation Method and System

Number: US20130305018A1
Assignee: Individual

A system and method for fencing memory accesses. Memory loads can be fenced, or all memory access can be fenced. The system receives a fencing instruction that separates memory access instructions into older accesses and newer accesses. A buffer within the memory ordering unit is allocated to the instruction. The access instructions newer than the fencing instruction are stalled. The older access instructions are gradually retired. When all older memory accesses are retired, the fencing instruction is dispatched from the buffer.

More details
14-11-2013 publication date

EXECUTION OF A PERFORM FRAME MANAGEMENT FUNCTION INSTRUCTION

Number: US20130305023A1
Assignee:

Optimizations are provided for frame management operations, including a clear operation and/or a set storage key operation, requested by pageable guests. The operations are performed, absent host intervention, on frames not resident in host memory. The operations may be specified in an instruction issued by the pageable guests. 1. A computer system for executing an instruction , the computer system comprising:a memory; and obtaining a perform frame management function (PFMF) machine instruction, the PFMF machine instruction comprising an opcode field, a first field and a second field;', 'performing an operation on a guest frame designated by the second field, said guest frame being non-resident in host memory, the operation being specified in a location indicated by the first field and comprising a clear operation, and wherein the performing is absent host intervention and is based on a usage indicator specified in the location.', 'executing, by a pageable guest, the obtained PFMF machine instruction, the executing comprising], 'a processor in communications with the memory, wherein the computer system is configured to perform a method, said method comprising2. The computer system of claim 1 , wherein the usage indicator specifies that a program has indicated that it is likely to use the guest frame within a near future claim 1 , and wherein the clear operation includes:obtaining a host frame from a list of cleared available frames; andattaching the obtained host frame to the guest frame to be cleared.3. The computer system of claim 2 , wherein the operation further comprises a set storage key operation claim 2 , and the performing comprises including a value of a key in a control block used by a host managing the pageable guest.4. The computer system of claim 1 , wherein the usage indicator specifies that a program has indicated that it is not likely to use the guest frame within a near future claim 1 , and wherein the clear operation includes:marking one or more ...

More details
14-11-2013 publication date

Performing a cyclic redundancy checksum operation responsive to a user-level instruction

Number: US20130305118A1
Assignee: Intel Corp

In one embodiment, the present invention includes a method for receiving incoming data in a processor and performing a checksum operation on the incoming data in the processor pursuant to a user-level instruction for the checksum operation. For example, a cyclic redundancy checksum may be computed in the processor itself responsive to the user-level instruction. Other embodiments are described and claimed.

More details
21-11-2013 publication date

Apparatus and method for selecting elements of a vector computation

Number: US20130311530A1
Assignee: Intel Corp

An apparatus and method are described for performing a vector reduction. For example, an apparatus according to one embodiment comprises: a reduction logic tree comprised of a set of N-1 reduction logic blocks used to perform reduction in a single operation cycle for N vector elements; a first input vector register storing a first input vector communicatively coupled to the set of reduction logic blocks; a second input vector register storing a second input vector communicatively coupled to the set of reduction logic blocks; a mask register storing a mask value controlling a set of one or more multiplexers, each of the set of multiplexers selecting a value directly from the first input vector register or an output containing a processed value from one of the reduction logic blocks; and an output vector register coupled to outputs of the one or more multiplexers to receive values output passed through by each of the multiplexers responsive to the control signals.

More details
21-11-2013 publication date

ROTATE INSTRUCTIONS THAT COMPLETE EXECUTION WITHOUT READING CARRY FLAG

Number: US20130311756A1
Assignee:

A method of one aspect may include receiving a rotate instruction. The rotate instruction may indicate a source operand and a rotate amount. A result may be stored in a destination operand indicated by the rotate instruction. The result may have the source operand rotated by the rotate amount. Execution of the rotate instruction may complete without reading a carry flag. 1. A method comprising:receiving a rotate instruction, the rotate instruction indicating a source operand and a rotate amount;storing a result in a destination operand indicated by the rotate instruction, the result having the source operand rotated by the rotate amount; andcompleting execution of the rotate instruction without reading a carry flag.2. The method of claim 1 , wherein completing comprises completing execution of the rotate instruction without reading an overflow flag.3. The method of claim 2 , wherein completing comprises completing execution of the rotate instruction without writing the carry flag and without writing the overflow flag.4. The method of claim 2 , wherein completing comprises completing execution of the rotate instruction without reading a sign flag claim 2 , without reading a zero flag claim 2 , without reading an auxiliary carry flag claim 2 , and without reading a parity flag.5. The method of claim 4 , wherein completing comprises completing execution of the rotate instruction without writing the carry flag claim 4 , without writing the overflow flag claim 4 , without writing the sign flag claim 4 , without writing the zero flag claim 4 , without writing the auxiliary carry flag claim 4 , and without writing the parity flag.6. The method of claim 1 , wherein receiving comprises receiving a rotate instruction that explicitly specifies the source operand and that explicitly specifies the destination operand.7. The method of claim 1 , wherein receiving comprises receiving a rotate instruction that explicitly specifies a second source operand having the rotate amount.8. ...
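The rotation itself is easy to model in C, and the model makes the key property visible: nothing in the computation reads or writes flag state. Masking the rotate amount to the operand width is an illustrative choice.

#include <stdint.h>
#include <stdio.h>

/* Rotate a 32-bit source operand left or right by n bit positions. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

int main(void)
{
    printf("%08X\n", rotl32(0x80000001u, 1));   /* 00000003 */
    printf("%08X\n", rotr32(0x80000001u, 1));   /* C0000000 */
    return 0;
}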

More details
05-12-2013 publication date

METHOD, APPARATUS AND INSTRUCTIONS FOR PARALLEL DATA CONVERSIONS

Number: US20130326194A1
Author: Ramanujam Gopalan
Assignee:

Method, apparatus, and program means for performing a conversion. In one embodiment, a disclosed apparatus includes a destination storage location corresponding to a first architectural register. A functional unit operates responsive to a control signal, to convert a first packed first format value selected from a set of packed first format values into a plurality of second format values. Each of the first format values has a plurality of sub elements having a first number of bits The second format values have a greater number of bits. The functional unit stores the plurality of second format values into an architectural register. 1. A system comprising:a processor comprisinga register file including a first packed data register and a second packed data register,a decoder to decode a first instruction,scheduling logic to allocate resources and queue operations corresponding to the first instruction for execution, andexecution logic coupled to the decoder and the scheduling logic,wherein, responsive to the decoder decoding the first instruction, the execution logic is to convert a plurality of first packed data elements to a plurality of results,wherein the plurality of first packed data elements from the first packed data register is converted to the plurality of results, the results are saturated and stored in the second packed data register, and each of the first packed data elements has a first number of bits, each of the results has a second number of bits, and the second number of bits is one half the first number of bits;a memory controller coupled to the processor, wherein the memory controller is integral with the processor;a communication interface to a wireless network, the communication interface coupled to the processor; anda graphics interface to a display, the graphics interface coupled to the processor.2. The system of claim 1 , wherein the first number of bits is 32 and the second number of bits is 16.3. The system of claim 1 , wherein the first ...
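The claims describe converting packed 32-bit elements to 16-bit results with saturation; a scalar sketch follows, assuming signed saturation, which the claims do not state explicitly.

#include <stdint.h>
#include <stdio.h>

/* Clamp a signed 32-bit value into the signed 16-bit range. */
int16_t saturate_i32_to_i16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Convert four packed 32-bit elements to four saturated 16-bit results. */
void pack_saturate(const int32_t src[4], int16_t dst[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = saturate_i32_to_i16(src[i]);
}

int main(void)
{
    int32_t src[4] = { 100000, -100000, 1234, -5678 };
    int16_t dst[4];
    pack_saturate(src, dst);
    for (int i = 0; i < 4; i++) printf("%d ", dst[i]);
    printf("\n");   /* 32767 -32768 1234 -5678 */
    return 0;
}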

More details
05-12-2013 publication date

SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING VECTOR PACKED UNARY DECODING USING MASKS

Number: US20130326196A1
Assignee:

Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed unary value decoding using masks in response to a single vector packed unary decoding using masks instruction that includes a destination vector register operand, a source writemask register operand, and an opcode are described. 1. A method of performing in a computer processor vector packed unary value decoding using masks in response to a single vector packed unary decoding using masks instruction that includes a destination vector register operand , a source writemask register operand , and an opcode , the method comprising steps of:executing the single vector packed unary value decoding using masks instruction to determine and decode the unary encoded values stored in the source writemask register; andstoring each determined and decode unary encoded values as packed data elements in packed data element positions of the destination register that correspond to their position in the source writemask register.2. The method of claim 1 , wherein each unary encoded value is stored in a format of its most significant bit position in the writemask being a 1 value and zero or more 0 values following the 1 value in bit positions of the destination writemask register that are of less significance than the bit position of the 1 value.3. The method of claim 1 , wherein the decoded least significant unary encoded value of the source vector register is stored in the least significant packed data element position of the destination register.4. The method of claim 1 , wherein the source writemask register is 16 bits.5. The method of claim 1 , wherein the source writemask register is 64 bits.6. The method of claim 1 , wherein after all of the decoded unary encoded values are stored in the destination register claim 1 , all remaining packed data element positions of the destination vector register are set to all 1s.7. The method of claim 1 , wherein the executing step comprises: ...
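A heavily hedged sketch of one possible decoding: it assumes that, scanning from the least significant bit, each encoded value is a run of 0 bits terminated by a 1 bit and decodes to the length of that run; the abstract does not pin down this mapping.

#include <stdint.h>
#include <stdio.h>

/* Decode unary values packed into a 16-bit mask into successive destination
 * elements, starting at position 0.  Returns the number of decoded values. */
int decode_unary_mask(uint16_t mask, uint8_t dst[16])
{
    int out = 0, run = 0;
    for (int bit = 0; bit < 16; bit++) {
        if (mask & (1u << bit)) {     /* terminating 1: emit the run length */
            dst[out++] = (uint8_t)run;
            run = 0;
        } else {
            run++;                    /* another 0 inside the current value */
        }
    }
    return out;
}

int main(void)
{
    uint8_t dst[16];
    int n = decode_unary_mask(0x0851, dst);   /* 0b0000100001010001 */
    for (int i = 0; i < n; i++) printf("%u ", dst[i]);
    printf("\n");   /* 0 3 1 4 */
    return 0;
}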

More details
05-12-2013 publication date

ISSUING INSTRUCTIONS TO EXECUTION PIPELINES BASED ON REGISTER-ASSOCIATED PREFERENCES, AND RELATED INSTRUCTION PROCESSING CIRCUITS, PROCESSOR SYSTEMS, METHODS, AND COMPUTER-READABLE MEDIA

Number: US20130326197A1
Assignee: QUALCOMM INCORPORATED

Issuing instructions to execution pipelines based on register-associated preferences and related instruction processing circuits, systems, methods, and computer-readable media are disclosed. In one embodiment, an instruction is detected in an instruction stream. Upon determining that the instruction specifies at least one source register, an execution pipeline preference(s) is determined based on at least one pipeline indicator associated with the at least one source register in a pipeline issuance table, and the instruction is issued to an execution pipeline based on the execution pipeline preference(s). Upon a determination that the instruction specifies at least one target register, at least one pipeline indicator associated with the at least one target register in the pipeline issuance table is updated based on the execution pipeline to which the instruction is issued. In this manner, optimal forwarding of instructions may be facilitated, thus improving processor performance. 1. A method for processing computer instructions , comprising:detecting an instruction in an instruction stream; determining at least one execution pipeline preference for the instruction based on at least one pipeline indicator associated with the at least one source register in a pipeline issuance table; and', 'issuing the instruction to an execution pipeline based on the at least one execution pipeline preference; and, 'upon determining that the instruction specifies at least one source register 'updating at least one pipeline indicator associated with the at least one target register in the pipeline issuance table based on the execution pipeline to which the instruction is issued.', 'upon determining that the instruction specifies at least one target register2. The method of claim 1 , wherein issuing the instruction to the execution pipeline comprises issuing the instruction to a preferred execution pipeline indicated by the at least one execution pipeline preference.3. The method of ...
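A minimal sketch of the table-driven preference, with an assumed single-source preference rule and invented names; only the general mechanism (remember the producing pipeline per register, prefer it for consumers, update on issue) is taken from the abstract.

#include <stdio.h>

#define NUM_REGS 32
#define NO_PIPE  -1

static int pipe_of_reg[NUM_REGS];   /* pipeline issuance table */

/* Issue one instruction: prefer the pipeline that last produced the source
 * register (so the result can be forwarded locally), otherwise use the
 * default; record the chosen pipeline for the target register. */
int issue(int src_reg, int dst_reg, int default_pipe)
{
    int pref   = (src_reg >= 0) ? pipe_of_reg[src_reg] : NO_PIPE;
    int chosen = (pref != NO_PIPE) ? pref : default_pipe;
    if (dst_reg >= 0)
        pipe_of_reg[dst_reg] = chosen;   /* remember the producer pipeline */
    return chosen;
}

int main(void)
{
    for (int i = 0; i < NUM_REGS; i++) pipe_of_reg[i] = NO_PIPE;
    printf("%d\n", issue(-1, 5, 0));   /* no source: default pipe 0, r5 <- 0 */
    printf("%d\n", issue(5, 6, 1));    /* source r5: prefer pipe 0, r6 <- 0  */
    printf("%d\n", issue(7, 8, 1));    /* r7 unknown: default pipe 1         */
    return 0;
}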

More details
12-12-2013 publication date

SET SAMPLING CONTROLS INSTRUCTION

Number: US20130332709A1

A measurement sampling facility takes snapshots of the central processing unit (CPU) on which it is executing at specified sampling intervals to collect data relating to tasks executing on the CPU. The collected data is stored in a buffer, and at selected times, an interrupt is provided to remove data from the buffer to enable reuse thereof. The interrupt is not taken after each sample, but in sufficient time to remove the data and minimize data loss. 1. A computer system for executing a machine instruction in a central processing unit , the computer system comprising:a memory; and [ an opcode field identifying a set sampling controls instruction; and', 'a first field and a second field to be used to form a second operand address; and, 'obtaining a machine instruction for execution, the machine instruction being defined for computer execution according to a computer architecture, the machine instruction comprising, activating sampling for one or more sampling intervals to obtain information relating to processing of the central processing unit, wherein the activating sampling comprises at least one of activating basic sampling to obtain a set of architected sample data or activating diagnostic sampling to obtain a set of non-architected sample data; and', 'placing in one or more control registers one or more sampling controls of a request block located in one or more storage locations designated by the second operand address., 'executing the machine instruction, the executing comprising], 'a processor in communications with the memory, wherein the computer system is configured to perform a method, said method comprising2. The computer system of claim 1 , wherein the activating sampling comprises activating basic sampling claim 1 , and wherein the one or more control registers comprises information regarding one or more instructions executed by the central processing unit.3. The computer system of claim 1 , wherein the request block comprises at least one of:a size ...

12-12-2013 publication date

Speed up secure hash algorithm (SHA) using single instruction multiple data (SIMD) architectures

Number: US20130332742A1
Author: Shay Gueron, Vlad KRASNOV
Assignee: Intel Corp

A processing apparatus may comprise logic to preprocess a message according to a selected secure hash algorithm (SHA) algorithm to generate a plurality of message blocks, logic to generate hash values by preparing message schedules in parallel using single instruction multiple data (SIMD) instructions for the plurality of message blocks and to perform compression in serial for the plurality of message blocks, and logic to generate a message digest conforming to the selected SHA algorithm.
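The abstract only sketches the structure, so as an illustration the C code below expands the SHA-256 message schedule for several independent blocks in an interleaved, lane-per-block layout (the part of the computation that maps naturally onto SIMD registers), while the compression rounds would still run serially per block. The four-block layout and helper names are assumptions for the example, not the patented code.

    #include <stdint.h>

    #define LANES 4                       /* four message blocks side by side */

    static uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }
    static uint32_t s0(uint32_t x) { return rotr(x, 7)  ^ rotr(x, 18) ^ (x >> 3);  }
    static uint32_t s1(uint32_t x) { return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10); }

    /* w[t][lane] holds word t of the schedule for block 'lane'; w[0..15][*]
       are preloaded from the message.  The inner loop over lanes is the work
       a single SIMD instruction would do per schedule step. */
    void expand_schedules(uint32_t w[64][LANES])
    {
        for (int t = 16; t < 64; t++)
            for (int lane = 0; lane < LANES; lane++)
                w[t][lane] = s1(w[t - 2][lane]) + w[t - 7][lane]
                           + s0(w[t - 15][lane]) + w[t - 16][lane];
        /* The 64 compression rounds form a serial dependency chain per block
           and would be executed one block at a time, as the abstract notes. */
    }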

19-12-2013 publication date

Single instruction multiple data (SIMD) reconfigurable vector register file and permutation unit

Number: US20130339649A1
Assignee: Intel Corp

An apparatus may comprise a register file and a permutation unit coupled to the register file. The register file may have a plurality of register banks and an input to receive a selection signal. The selection signal may select one or more unit widths of a register bank as a data element boundary for read or write operations.

19-12-2013 publication date

EFFICIENT ZERO-BASED DECOMPRESSION

Number: US20130339661A1
Assignee:

A processor core including a hardware decode unit to decode vector instructions for decompressing a run length encoded (RLE) set of source data elements and an execution unit to execute the decoded instructions. The execution unit generates a first mask by comparing set of source data elements with a set of zeros and then counts the trailing zeros in the mask. A second mask is made based on the count of trailing zeros. The execution unit then copies the set of source data elements to a buffer using the second mask and then reads the number of RLE zeros from the set of source data elements. The buffer is shifted and copied to a result and the set of source data elements is shifted to the right. If more valid data elements are in the set of source data elements this is repeated until all valid data is processed. 1. A method for decompressing a run length encoded (RLE) set of source data elements in a computer processor that includes a vector execution unit , the method comprising:initializing a result and an insertion point variable;generating a first mask by comparing the set of source data elements to a set of zero elements comprising the of same number of elements as the set of source data elements;generating a count of trailing zeros in the first mask;generating a second mask comprising a set of ones based on the count of trailing zeros;performing a masked copied of the set of source data elements into a temporary buffer using the second mask;reading a number of RLE zeros from a first data element in the set of source data elements, the first data element being indexed based on the count of trailing zeros;shifting the temporary buffer to the left based on the insertion point variable;performing a copy of the temporary buffer into the result variable;updating the insertion point variable;shifting the set of source data elements to the right based on the number of trailing zeros;determining whether the set of source data elements contains more valid input;in the ...
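A scalar reference for the overall effect, with the vector masks and shuffles left out: assume an encoding in which non-zero values are stored literally and each run of zeros is stored as a 0 marker followed by an element holding the run length. That encoding is an assumption made for illustration; the claims describe how the same expansion is carried out block-wise with compare masks and trailing-zero counts.

    #include <stddef.h>
    #include <stdint.h>

    /* Expand 'src' (n elements) into 'dst'; returns the number of elements
       written.  Assumed encoding: non-zero values are literal, a zero value
       is a marker and the following element gives how many zeros to emit. */
    size_t rle_zero_expand(const uint32_t *src, size_t n, uint32_t *dst)
    {
        size_t out = 0;
        for (size_t i = 0; i < n; i++) {
            if (src[i] != 0) {
                dst[out++] = src[i];         /* literal element */
            } else {
                if (++i == n) break;         /* malformed: marker without count */
                for (uint32_t k = 0; k < src[i]; k++)
                    dst[out++] = 0;          /* emit the run of zeros */
            }
        }
        return out;
    }

Under this assumed encoding the input {5, 0, 3, 7} expands to {5, 0, 0, 0, 7}.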

19-12-2013 publication date

INSTRUCTION EXECUTION UNIT THAT BROADCASTS DATA VALUES AT DIFFERENT LEVELS OF GRANULARITY

Number: US20130339664A1
Assignee:

An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The first data structure is four times as large as the second data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and, to replicate the second data structure when executing the second instruction to create a second replication data structure. 1. An apparatus , comprising:an instruction execution pipeline coupled to register space, said instruction execution pipeline having an execution unit to execute a first instruction and a second instruction wherein:i) said register space is to store a first data structure to be replicated when said execution unit executes said first instruction and to store a second data structure to be replicated when said execution unit executes said second instruction, said first and second data structures both being packed data structures, data values of said first packed data structure being twice as large as data values of said second packed data structure, said first data structure being four times as large as said second data structure;ii) said execution unit includes replication logic circuitry to replicate said first data structure when executing said first instruction to create a first replication data structure, and, to replicate said second data structure when executing said second instruction to create a second replication data structure;iii) said instruction ...
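A plain-C picture of the replication behaviour described above: the same destination width is filled either with copies of a larger structure or with copies of one a quarter its size. The 64-byte destination and the pattern sizes are chosen only to mirror the 4:1 ratio in the abstract.

    #include <string.h>
    #include <stdint.h>

    /* Fill a 64-byte destination with back-to-back copies of a pattern.
       pat_len must divide 64; calling this with pat_len = 32 and pat_len = 8
       mimics the two broadcast granularities (the first structure being four
       times as large as the second). */
    void replicate(uint8_t dst[64], const uint8_t *pat, size_t pat_len)
    {
        for (size_t off = 0; off < 64; off += pat_len)
            memcpy(dst + off, pat, pat_len);
    }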

19-12-2013 publication date

SPECIAL CASE REGISTER UPDATE WITHOUT EXECUTION

Number: US20130339667A1

A method of changing a value of associated with a logical address in a computing device. The method includes: receiving an instruction at an instruction decoder, the instruction including a target register expressed as a logical value; determining at an instruction decoder that a result of the instruction is to set the target register to a constant value, the target register being in a physical register file associated with an execution unit; and mapping, in a register mapper, the logical address to a location represented by a special register tag. 1. A computer program product for changing a value of associated with a logical address in a computing device , the computer program product comprising:a tangible storage medium readable by a processing circuit and storing instructions for execution by the processing circuit for performing a method comprising:receiving an instruction at an instruction decoder, the instruction including a target register expressed as a logical value;determining at an instruction decoder that a result of the instruction is to set the target register to a constant value, the target register being in a physical register file associated with an execution unit; andmapping, in a register mapper, the logical address to a location represented by a special register tag.2. The computer program product of claim 1 , wherein the method further comprises:assigning one or more of the registers in the physical register file an unchangeable constant value; andwherein the special register tag is equal to an address of the register in the physical register file having an unchangeable constant value equal to the constant value.3. The computer program product of claim 1 , wherein the location represented by the special register tag is not contained in the physical register file.4. The computer program product of claim 3 , wherein the special register tag is equal to or can be converted to the constant value.5. The computer program product of claim 4 , the ...

19-12-2013 publication date

SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING DELTA DECODING ON PACKED DATA ELEMENTS

Number: US20130339668A1
Assignee:

Embodiments of systems, apparatuses, and methods for performing delta decoding on packed data elements of a source and storing the results in packed data elements of a destination using a single vector packed delta decode instruction are described. 1. A method of performing delta decoding on packed data elements of a source and storing the results in packed data elements of a destination using a single vector packed delta decode instruction , the method comprising the steps of:executing, in execution resources of a processor core, a decoded vector packed delta decode instruction that includes a source operand and a destination operand each having a plurality of packed data elements to calculate for each packed data element position of the source operand a value that comprises a packed data element of that packed data element position and all packed data elements of packed data element positions that are of lesser significance; andfor each calculated value, storing the value into a packed data element position of the destination operand that corresponds to the packed data element position of the source operand.2. The method of claim 1 , wherein the source and destination operands are vector registers.3. The method of claim 2 , wherein the vector registers are 512-bit in size.4. The method of claim 1 , wherein the packed data elements are 32-bits in size.5. The method of claim 1 , wherein the values are calculated by adding all of the packed data elements of the source together and claim 1 , for each packed data element position claim 1 , subtracting all data elements that come from packed data element positions of equal or greater significance.6. The method of claim 2 , wherein the vector registers are 128-bit in size.7. The method of claim 2 , wherein the vector registers are 256-bit in size.8. A method of performing delta decoding on packed data elements of a source and storing the results in packed data elements of a destination using a single vector packed delta ...
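The claimed operation amounts to an inclusive prefix sum over the packed elements: each destination element is the sum of the corresponding source element and every less-significant source element. A scalar sketch of that semantics (element count and width are illustrative):

    #include <stdint.h>

    /* Inclusive prefix sum: dst[i] = src[0] + src[1] + ... + src[i].
       With 32-bit elements in a 512-bit register, n would be 16. */
    void delta_decode(const uint32_t *src, uint32_t *dst, int n)
    {
        uint32_t acc = 0;
        for (int i = 0; i < n; i++) {
            acc += src[i];
            dst[i] = acc;
        }
    }

For example, the stored deltas {3, 1, 4, 1} decode to the running values {3, 4, 8, 9}.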

19-12-2013 publication date

NONTRANSACTIONAL STORE INSTRUCTION

Number: US20130339669A1

A NONTRANSACTIONAL STORE instruction, executed in transactional execution mode, performs stores that are retained, even if a transaction associated with the instruction aborts. The stores include user-specified information that may facilitate debugging of an aborted transaction. 1. A method of executing an instruction within a computing environment, said method comprising: obtaining, by a processor, a machine instruction for execution, the machine instruction being defined for computer execution according to a computer architecture, the machine instruction comprising: an operation code to specify a nontransactional store operation, a first operand, and a second operand to designate a location for the first operand; and executing, by the processor, the machine instruction, the executing comprising: nontransactionally placing the first operand at the location specified by the second operand, wherein information stored at the second operand is retained despite an abort of a transaction associated with the machine instruction, and wherein the nontransactionally placing is delayed until an end of transactional execution mode of the processor. 2. The method of claim 1, wherein the end of transactional execution mode results from an end of an outermost transaction associated with the machine instruction or an abort condition. 3. The method of claim 1, wherein multiple nontransactional stores appear as concurrent stores to other processors. 4. The method of claim 1, further comprising: determining whether the processor is in transactional execution mode; based on the processor being in the transactional execution mode, determining whether the transaction is a constrained transaction or a nonconstrained transaction; and based on the transaction being a nonconstrained transaction, continuing execution of the machine instruction. 5. The method of claim 4, wherein based on the transaction being a constrained transaction, providing a program exception and ...

19-12-2013 publication date

METHOD AND APPARATUS FOR REDUCING AREA AND COMPLEXITY OF INSTRUCTION WAKEUP LOGIC IN A MULTI-STRAND OUT-OF-ORDER PROCESSOR

Number: US20130339679A1
Assignee: Intel Corporation

A computer system, a computer processor and a method executable on a computer processor involve placing each sequence of a plurality of sequences of computer instructions being scheduled for execution in the processor into a separate queue. The head instruction from each queue is stored into a first storage unit prior to determining whether the head instruction is ready for scheduling. For each instruction in the first storage unit that is determined to be ready, the instruction is moved from the first storage unit to a second storage unit. During a first processor cycle, each instruction in the first storage unit that is determined to be not ready is retained in the first storage unit, and the determining of whether the instruction is ready is repeated during the next processor cycle. Scheduling logic performs scheduling of instructions contained in the second storage unit. 1. A computer system that is configured to perform the following:placing each sequence of a plurality of sequences of computer instructions being scheduled for execution in a computer processor into a separate queue;storing a head instruction from each queue into a first storage unit prior to determining whether the head instruction is ready for scheduling;for each instruction in the first storage unit that is determined to be ready, moving the instruction from the first storage unit to a second storage unit;during a first processor cycle, for each instruction in the first storage unit that is determined to be not ready, retaining the instruction in the first storage unit and repeating the determining of whether the instruction is ready during the next processor cycle; andapplying scheduling logic to perform scheduling of instructions contained in the second storage unit.2. The computer system of claim 1 , wherein the processor is a multi-strand out-of-order processor configured to execute each sequence of the plurality of sequences as a separate strand.3. The computer system of claim 1 , ...

19-12-2013 publication date

METHODS TO OPTIMIZE A PROGRAM LOOP VIA VECTOR INSTRUCTIONS USING A SHUFFLE TABLE AND A MASK STORE TABLE

Number: US20130339682A1
Assignee:

According to one embodiment, a code optimizer is configured to receive first code having a program loop implemented with scalar instructions to store values of a first array to a second array based on values of a third array. The code optimizer is configured to generate second code representing the program loop with vector instructions including a shuffle instruction and a store instruction, the shuffle instruction to shuffle, using a shuffle table, elements of the first array based on the third array in a vector manner, and the store instruction to store, using a mask store table, the shuffled elements in the second array in a vector manner. 1. A computer-implemented method, comprising: receiving first code having a program loop implemented with scalar instructions to store values of a first array to a second array based on values of a third array; and generating second code representing the program loop with vector instructions, the second code including: a shuffle instruction to shuffle elements of the first array based on the third array using a shuffle table in a vector manner, and a store instruction to store the shuffled elements of the first array in the second array using a mask store table in a vector manner. 2. The method of claim 1, wherein the second code further comprises instructions to compare elements of the third array with a predetermined threshold, generating a comparison result, and generate a mask based on the comparison result, the elements of the first array to be shuffled based on the mask. 3. The method of claim 2, wherein the second code further comprises an instruction to load elements of the shuffle table selected based on the mask, the elements of the first array to be shuffled via the shuffle instruction based on the selected elements of the shuffle table. 4. The method of claim 2, wherein the second code further comprises an instruction to load elements of the mask store table selected based on the mask, the shuffled ...
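The scalar loop being vectorized is essentially a conditional copy; the generated vector code replaces the branch with a compare mask, a shuffle-table lookup that compacts the selected elements, and a masked store. Below is a hedged 4-lane emulation in C of one vector iteration next to the original scalar form; the table contents, lane count and the "greater than threshold" predicate follow the claims but the exact layout is an assumption.

    #include <stdint.h>

    /* Scalar form of the loop: copy a[i] into b whenever c[i] exceeds the
       threshold.  Returns the number of elements stored. */
    int compact_scalar(const int32_t *a, const int32_t *c, int32_t *b,
                       int n, int32_t threshold)
    {
        int out = 0;
        for (int i = 0; i < n; i++)
            if (c[i] > threshold)
                b[out++] = a[i];
        return out;
    }

    /* 4-lane emulation of one vector iteration: build a 4-bit mask from the
       compares, use a precomputed shuffle table indexed by the mask to move
       the selected lanes next to each other, then store only the valid lanes
       (the role of the mask store table). */
    int compact_block4(const int32_t a[4], const int32_t c[4], int32_t *b,
                       int32_t threshold)
    {
        static const int8_t shuffle[16][4] = {
            {-1,-1,-1,-1}, {0,-1,-1,-1}, {1,-1,-1,-1}, {0,1,-1,-1},
            {2,-1,-1,-1},  {0,2,-1,-1},  {1,2,-1,-1},  {0,1,2,-1},
            {3,-1,-1,-1},  {0,3,-1,-1},  {1,3,-1,-1},  {0,1,3,-1},
            {2,3,-1,-1},   {0,2,3,-1},   {1,2,3,-1},   {0,1,2,3},
        };
        unsigned mask = 0;
        for (int i = 0; i < 4; i++)
            mask |= (c[i] > threshold) ? (1u << i) : 0;
        int out = 0;
        while (out < 4 && shuffle[mask][out] >= 0) {
            b[out] = a[shuffle[mask][out]];
            out++;
        }
        return out;
    }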

19-12-2013 publication date

PROCESSING APPARATUS, TRACE UNIT AND DIAGNOSTIC APPARATUS

Number: US20130339686A1
Assignee: ARM LIMITED

A processing circuit is responsive to at least one conditional instruction to perform a conditional operation in dependence on a current value of a subset of at least one condition flag. A trace circuit is provided for generating trace data elements indicative of operations performed by the processing circuit. When the processing circuit processes at least one selected instruction, then the trace circuit generates a trace data element including a traced condition value indicating at least the subset of condition flags required to determine the outcome of the conditional instruction. A corresponding diagnostic apparatus uses the traced condition value to determine a processing outcome of the at least one conditional instruction. 1. A processing apparatus comprising:processing circuitry configured to perform processing operations in response to program instructions;a condition status storage location configured to store at least one condition flag indicating a condition of said processing circuitry; andtrace circuitry configured to generate trace data elements indicative of said processing operations performed by said processing circuitry in response to said program instructions; wherein:said processing circuitry is responsive to at least one conditional instruction to perform a conditional operation in dependence on a current value of a subset of said at least one condition flag; andsaid trace circuitry is configured, in response to said processing circuitry processing at least one of said at least one conditional instruction, to generate a trace data element including a traced condition value indicative of at least said subset of said at least one condition flag, said traced condition value providing information for determining a processing outcome of said at least one conditional instruction.2. The processing apparatus according to claim 1 , wherein said traced condition value comprises an identifier identifying a value of at least said subset of said at least one ...

19-12-2013 publication date

METHOD AND SYSTEM FOR POLLING NETWORK CONTROLLERS

Number: US20130339710A1
Author: Ding Jianzu
Assignee: Fortinet, Inc.

Improving the performance of multitasking processors are provided. For example, a subset of M processors within a Symmetric Multi-Processing System (SMP) with N processors is dedicated for a specific task. The M (M>0) of the N processors are dedicate to a task, thus, leaving (N-M) processors for running normal operating system (OS). The processors dedicated to the task may have their interrupt mechanism disabled to avoid interrupt handler switching overhead. Therefore, these processors run in an independent context and can communicate with the normal OS and cooperation with the normal OS to achieve higher network performance. 1. A method for improving the performance of a multi-processor system , the method comprising:dedicating M general-purpose processors from N general-purpose processors to perform a network polling task, wherein N is greater than M and the M processors are dedicated as network processors (NPs), the dedicating including disabling of interrupts to prevent context switching of the NPs and to prevent the NPs from performing tasks other than the network polling task.2. The method of claim 1 , further comprising:bypassing network interface controller (NIC) initialization during normal boot of an operating system;reserving memory in a shared memory as a pseudo NIC; andperforming network polling by coupling the NPs and network interface controllers, via the pseudo NIC, to facilitate communication between the NPs and network interface controllers.3. The method of claim 1 , wherein dedicating the M general-purpose processors as NPs includes obtaining control of the M general-purpose processors such that the M general-purpose processors perform the network polling task.4. The method of claim 1 , wherein the task of polling comprises one or more of subtasks comprising:processing packets, forwarding packets, routing packets, processing content, sending packets to and from network interface controller and processing for other networks.5. A computer-readable ...

26-12-2013 publication date

Optimizing Performance Of Instructions Based On Sequence Detection Or Information Associated With The Instructions

Number: US20130346728A1
Assignee:

In one embodiment, the present invention includes an instruction decoder that can receive an incoming instruction and a path select signal and decode the incoming instruction into a first instruction code or a second instruction code responsive to the path select signal. The two different instruction codes, both representing the same incoming instruction may be used by an execution unit to perform an operation optimized for different data lengths. Other embodiments are described and claimed. 1. A method comprising:determining whether an iterative copy instruction can be optimized based at least in part on information associated with the iterative copy instruction;if so performing a first portion of the iterative copy instruction by a first sequence of conditional copy operations using a power of two tree of copies to copy up to a first amount of data in up to a first number of chunks to first destination locations from first source locations;performing a second portion of the iterative copy instruction by copying a second amount of data via a fast loop of copy operations to second destination locations from second source locations if a remainder of the data to be copied is greater than a first threshold; andthereafter performing a third portion of the iterative copy instruction by a second sequence of conditional copy operations to copy up to a third amount of data in up to a third number of chunks to third destination locations from third source locations, if any of the data remains to be copied.2. The method of claim 1 , further comprising obtaining set up information for the fast loop and the second sequence of conditional copy operations before executing the first sequence of conditional copy operations.3. The method of claim 1 , further comprising determining if the second amount of data is greater than a second threshold claim 1 , and if so using a caching hint to copy the second amount of data directly to a memory without storage in a cache.4. The method of ...
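As an illustration of the three-phase shape described above (not the exact microcode), a byte-copy routine can handle a short head with size-tested power-of-two copies, the bulk with a simple fast loop, and the tail the same way as the head. The 16-byte chunk size, the alignment target and the thresholds are assumptions for the example.

    #include <stdint.h>
    #include <string.h>

    /* Copy 'len' bytes in three phases: conditional power-of-two copies for
       the head, a fast loop of whole chunks for the bulk, and conditional
       copies again for the 0..15 byte tail. */
    void copy_iterative(uint8_t *dst, const uint8_t *src, size_t len)
    {
        size_t mis  = (uintptr_t)dst & 15;
        size_t head = mis ? 16 - mis : 0;       /* bytes needed to align dst */
        if (head > len) head = len;
        for (size_t bit = 8; bit != 0; bit >>= 1)     /* "tree" of copies */
            if (head & bit) { memcpy(dst, src, bit); dst += bit; src += bit; }
        len -= head;

        while (len >= 16) {                            /* fast bulk loop */
            memcpy(dst, src, 16); dst += 16; src += 16; len -= 16;
        }

        for (size_t bit = 8; bit != 0; bit >>= 1)      /* tail, same tree */
            if (len & bit) { memcpy(dst, src, bit); dst += bit; src += bit; }
    }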

26-12-2013 publication date

PIPELINING OUT-OF-ORDER INSTRUCTIONS

Number: US20130346729A1
Assignee:

Systems, methods and computer program product provide for pipelining out-of-order instructions. Embodiments comprise an instruction reservation station for short instructions of a short latency type and long instructions of a long latency type, an issue queue containing at least two short instructions of a short latency type, which are to be chained to match a latency of a long instruction of a long latency type, a register file, at least one execution pipeline for instructions of a short latency type and at least one execution pipeline for instructions of a long latency type; wherein results of the at least one execution pipeline for instructions of the short latency type are written to the register file, preserved in an auxiliary buffer, or forwarded to inputs of said execution pipelines. Data of the auxiliary buffer are written to the register file. 1. A method comprising:determining an instruction chain comprising at least a first instruction having a first latency and a second instruction having a second latency, the first latency and the second latency each being less than a third latency of a third instruction; andsubmitting the instruction chain to a first execution pipeline of a processor and the third instruction to a second execution pipeline of the processor, wherein execution of the instruction chain at least partially overlaps execution of the third instruction.2. The method according to claim 1 , further comprising writing a result of the instruction chain to a register file during a writeback slot for the third instruction.3. The method according to claim 1 , further comprising:determining whether the second instruction is dependent on data from the first instruction; andin response determining that the second instruction is dependent on data from the first instruction, forwarding the data from the first instruction to the second instruction.4. The method according to claim 3 , further comprising writing a result of the second instruction into a ...

02-01-2014 publication date

Vector multiplication with accumulation in large register space

Number: US20140006755A1
Assignee: Intel Corp

An apparatus is described having an instruction execution pipeline that has a vector functional unit to support a vector multiply add instruction. The vector multiply add instruction to multiply respective K bit elements of two vectors and accumulate a portion of each of their respective products with another respective input operand in an X bit accumulator, where X is greater than K.
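The point of the wide accumulator is that K-bit by K-bit products can be summed many times without overflowing. A scalar model in that spirit, with K = 16 and 64-bit accumulators (the widths are illustrative, not taken from the claims):

    #include <stdint.h>

    /* Multiply corresponding 16-bit elements of a and b and add the products
       into 64-bit accumulators: acc[i] += a[i] * b[i].  Because the
       accumulator is far wider than the 32-bit products, long dot products
       do not wrap around. */
    void vmac_wide(const int16_t *a, const int16_t *b, int64_t *acc, int n)
    {
        for (int i = 0; i < n; i++)
            acc[i] += (int32_t)a[i] * (int32_t)b[i];
    }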

09-01-2014 publication date

Systems, apparatuses, and methods for performing a horizontal add or subtract in response to a single instruction

Number: US20140013075A1
Assignee: Intel Corp

Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed horizontal add or subtract of packed data elements in response to a single vector packed horizontal add or subtract instruction that includes a destination vector register operand, a source vector register operand, and an opcode are describes.
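Unlike an ordinary packed add, a horizontal add combines neighbouring elements within the sources. A scalar model of the packed 16-bit case, in the style of existing horizontal-add instructions (the element width and the low/high packing are illustrative assumptions):

    #include <stdint.h>

    /* Horizontal add of 16-bit elements: the low half of dst holds pairwise
       sums from a, the high half pairwise sums from b; n is the element
       count of each source and must be even. */
    void phadd16(const int16_t *a, const int16_t *b, int16_t *dst, int n)
    {
        for (int i = 0; i < n / 2; i++) {
            dst[i]         = (int16_t)(a[2 * i] + a[2 * i + 1]);
            dst[n / 2 + i] = (int16_t)(b[2 * i] + b[2 * i + 1]);
        }
    }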

09-01-2014 publication date

EFFICIENT HARDWARE INSTRUCTIONS FOR SINGLE INSTRUCTION MULTIPLE DATA PROCESSORS

Number: US20140013076A1
Assignee: ORACLE INTERNATIONAL CORPORATION

A method and apparatus for efficiently processing data in various formats in a single instruction multiple data ("SIMD") architecture is presented. Specifically, a method to unpack fixed-width bit values in a bit stream to a fixed-width byte stream in a SIMD architecture is presented. A method to unpack variable-length byte packed values in a byte stream in a SIMD architecture is presented. A method to decompress a run length encoded compressed bit-vector in a SIMD architecture is presented. A method to return the offset of each bit set to one in a bit-vector in a SIMD architecture is presented. A method to fetch bits from a bit-vector at specified offsets relative to a base in a SIMD architecture is presented. A method to compare values stored in two SIMD registers is presented. 1. A processor that, within the processor, loads values from a vector of values into a series of subregisters of a SIMD register: wherein values within the vector of values are contiguous; wherein each value in the vector of values is represented by a fixed number of bits; wherein the SIMD register has the series of subregisters, each of which has a number of bits that is greater than the fixed number of bits used to represent each value from the vector of values; wherein the processor is configured to respond to one or more instructions by: loading each value, in the vector of values, into a separate subregister of the series of subregisters; and setting to zero all bits, in each subregister of the series of subregisters, other than bits storing values from the vector of values. 2. The processor of claim 1, wherein the processor is further configured to respond to the one or more instructions by: shifting each value such that each value is byte-aligned within each subregister of the series of subregisters. 3. The processor of claim 1, wherein each subregister, of the series of subregisters, is eight bits. 4. The processor of claim 1, wherein the one or more instructions ...
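For the first of the listed operations, unpacking fixed-width bit values into a fixed-width byte stream, a scalar reference makes the intent concrete. The little-endian bit order and the trailing padding byte are assumptions made so the sketch stays simple.

    #include <stdint.h>

    /* Unpack n values of 'width' bits (width <= 8) from a packed
       little-endian bit stream into one byte per value, upper bits cleared.
       The source buffer is assumed to be padded with at least one extra
       byte so the two-byte read below never runs past the end. */
    void unpack_bits(const uint8_t *src, int width, uint8_t *dst, int n)
    {
        for (int i = 0; i < n; i++) {
            int bit = i * width;
            /* a value may straddle a byte boundary, so read two bytes */
            unsigned raw = src[bit / 8] | ((unsigned)src[bit / 8 + 1] << 8);
            dst[i] = (uint8_t)((raw >> (bit % 8)) & ((1u << width) - 1));
        }
    }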

09-01-2014 publication date

METHOD AND SYSTEM ADAPTED FOR CONVERTING SOFTWARE CONSTRUCTS INTO RESOURCES FOR IMPLEMENTATION BY A DYNAMICALLY RECONFIGURABLE PROCESSOR

Number: US20140013080A1
Author: MYKLAND ROBERT KEITH
Assignee:

A method and system are provided for deriving a resultant software code from an originating ordered list of instructions that does not include overlapping branch logic. The method may include deriving a plurality of unordered software constructs from a sequence of processor instructions; associating software constructs in accordance with an original logic of the sequence of processor instructions; determining and resolving memory precedence conflicts within the associated plurality of software constructs; resolving forward branch logic structures into conditional logic constructs; resolving back branch logic structures into loop logic constructs; and/or applying the plurality of unordered software constructs in a programming operation by a parallel execution logic circuitry. The resultant plurality of unordered software constructs may be converted into programming reconfigurable logic, computers or processors, and also by means of a computer network or an electronics communications network. 1. In an information technology system , a method comprising:a. accessing a first data flow model of a first software construct type, wherein the first data flow model includes at least one resource, the at least one resource modeling a component of a dynamically reconfigurable processor;b. initiating a compilation of a plurality of software constructs;c. determining a first instance of a software construct of the plurality of software constructs that conforms to the first software construct type; andd. expressing the first instance of the first software construct type as an instance of the first data flow model in a resultant data flow model generated from the compilation of the plurality of software constructs.2. The method of claim 1 , wherein the first data flow model comprises a plurality of resources claim 1 , wherein each resource represents at least one component of a dynamically reconfigurable processor.3. The method of claim 1 , wherein the first data flow model ...

09-01-2014 publication date

RECONFIGURABLE DEVICE FOR REPOSITIONING DATA WITHIN A DATA WORD

Number: US20140013082A1
Assignee: Intel Corporation

Disclosed is a system and device and related methods for data manipulation, especially for SIMD operations such as permute, shift, and rotate. An apparatus includes a permute section that repositions data on sub-word boundaries and a shift section that repositions the data distances smaller than the sub-word width. The sub-word width is configurable and selectable, and the permute section and shift section may operate on different boundary widths. In a first stage, the permute section repositions the data at the nearest sub-word boundary and, in a second stage, the shift section repositions the data to its final desired position. The shift section includes multi-stages set in a logarithmic cascade relationship. Additionally, each shifter within each of the multi-stages is highly connected, allowing fast and precise data movements. 1. An apparatus , comprising:an input for receiving data in a data word, the data word including a plurality of sub-words having a predetermined width, and for receiving a command to reposition the data within the data word;a permute section structured to reposition the data when the command is to reposition the data a distance of an integer multiple of the predetermined width; anda shift section structured to reposition the data when the command is to reposition the data a distance less than the predetermined width of the sub-word.2. The apparatus of claim 1 , in which the predetermined width of the sub-words is configurable.3. The apparatus of claim 2 , in which the input is structured to accept the predetermined width of the sub-words as an operating mode.4. The apparatus of claim 1 , wherein the permute section is additionally structured to reposition the data in a first action when the command is to reposition the data a distance greater than the predetermined width claim 1 , and in which the shift section is structured to reposition the permuted data in a second action less than the predetermined width.5. The apparatus of claim 1 , ...
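The split the abstract describes, coarse movement on sub-word boundaries followed by a fine shift of less than a sub-word, can be modelled on a 64-bit word with 8-bit sub-words: a rotate by a multiple of 8 stands in for the permute stage and the residual rotate for the shift stage. The word and sub-word widths are assumptions for the sketch, since the patent makes them configurable.

    #include <stdint.h>

    static uint64_t rotl64(uint64_t x, unsigned n)
    {
        n &= 63;
        return n ? (x << n) | (x >> (64 - n)) : x;
    }

    /* Reposition (rotate) x left by dist bits in two stages, mirroring the
       permute/shift split: first by whole 8-bit sub-words, then by the
       remaining 0..7 bits. */
    uint64_t reposition(uint64_t x, unsigned dist)
    {
        uint64_t coarse = rotl64(x, (dist / 8) * 8);   /* permute stage */
        return rotl64(coarse, dist % 8);               /* shift stage   */
    }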

09-01-2014 publication date

Cache coprocessing unit

Number: US20140013083A1
Author: Ashish Jha
Assignee: Intel Corp

A cache coprocessing unit in a computing system includes a cache array to store data, a hardware decode unit to decode instructions that are offloaded from being executed by an execution cluster of the computing system to reduce load and store operations between the execution cluster and the cache coprocessing unit, and a set of one or more operation units to perform operations on the cache array according to the decoded instructions.

09-01-2014 publication date

Processor system with predicate register, computer system, method for managing predicates and computer program product

Number: US20140013087A1
Assignee: FREESCALE SEMICONDUCTOR INC

A processor system is adapted to carry out a predicate swap instruction of an instruction set to swap, via a data pathway, predicate data in a first predicate data location of a predicate register with data in a corresponding additional predicate data location of a first additional predicate data container and to swap, via a data pathway, predicate data in a second predicate storage location of the predicate register with data in a corresponding additional predicate data location in a second additional predicate data container.

16-01-2014 publication date

Systems, apparatuses, and methods for performing a double blocked sum of absolute differences

Number: US20140019713A1
Assignee: Intel Corp

Embodiments of systems, apparatuses, and methods for performing in a computer processor vector double block packed sum of absolute differences (SAD) in response to a single vector double block packed sum of absolute differences instruction that includes a destination vector register operand, first and second source operands, an immediate, and an opcode are described.
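A scalar picture of a blocked sum of absolute differences: a 4-byte block from one source is compared against several byte offsets of the other source, producing one SAD per offset, in the spirit of existing blocked-SAD instructions. The block size and offset count are illustrative assumptions, not the operand layout from the claims.

    #include <stdint.h>
    #include <stdlib.h>

    /* For each of n_off starting offsets into ref, compute the sum of
       absolute differences against the 4-byte block blk.  ref must hold at
       least n_off + 3 bytes. */
    void block_sad(const uint8_t blk[4], const uint8_t *ref, int n_off,
                   uint16_t *sad)
    {
        for (int o = 0; o < n_off; o++) {
            uint16_t s = 0;
            for (int i = 0; i < 4; i++)
                s += (uint16_t)abs((int)blk[i] - (int)ref[o + i]);
            sad[o] = s;
        }
    }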

16-01-2014 publication date

VECTOR FREQUENCY EXPAND INSTRUCTION

Number: US20140019714A1
Assignee:

A processor core that includes a hardware decode unit and an execution engine unit. The hardware decode unit to decode a vector frequency expand instruction, wherein the vector frequency compress instruction includes a source operand and a destination operand, wherein the source operand specifies a source vector register that includes one or more pairs of a value and run length that are to be expanded into a run of that value based on the run length. The execution engine unit to execute the decoded vector frequency expand instruction which causes, a set of one or more source data elements in the source vector register to be expanded into a set of destination data elements comprising more elements than the set of source data elements and including at least one run of identical values which were run length encoded in the source vector register. 1. A method of performing a vector frequency expand instruction in a computer processor , comprising:fetching the vector frequency expand instruction that includes a source operand and a destination operand, wherein the source operand specifies a source vector register that includes one or more pairs of a value and run length that are to be expanded into a run of that value based on the run length;decoding the fetched vector frequency expand instruction; andexecuting the decoded vector frequency expand instruction causing, a set of one or more source data elements in the source vector register to be expanded into a set of destination data elements comprising more elements than the set of source data elements and including at least one run of identical values which were run length encoded in the source vector register.2. The method of claim 1 , wherein the executing the decoded vector frequency expand instruction further causes an exception be raised when a source data element contains the value to be expanded into a run of without a run length pair.3. The method of claim 1 , wherein the executing the decoded vector frequency ...
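The expansion itself is easy to state in scalar form: the source holds (value, run length) pairs, and each pair becomes run-length copies of the value in the destination. Treating every source element as part of a pair is an assumption; the claims allow raising an exception when a value arrives without its run length, which the sketch signals with an error return.

    #include <stddef.h>
    #include <stdint.h>

    /* Expand (value, run_length) pairs from src into dst; returns the number
       of destination elements written, or (size_t)-1 if a value has no
       run-length partner. */
    size_t freq_expand(const uint32_t *src, size_t n, uint32_t *dst)
    {
        size_t out = 0;
        for (size_t i = 0; i + 1 < n; i += 2)
            for (uint32_t k = 0; k < src[i + 1]; k++)
                dst[out++] = src[i];
        return (n % 2) ? (size_t)-1 : out;   /* dangling value: signal error */
    }

For example, the pairs {7, 3, 9, 1} expand to {7, 7, 7, 9}.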

16-01-2014 publication date

BINARY TRANSLATION IN ASYMMETRIC MULTIPROCESSOR SYSTEM

Number: US20140019723A1
Assignee:

An asymmetric multiprocessor system (ASMP) may comprise computational cores implementing different instruction set architectures and having different power requirements. Program code for execution on the ASMP is analyzed and a determination is made as to whether to allow the program code, or a code segment thereof to execute on a first core natively or to use binary translation on the code and execute the translated code on a second core which consumes less power than the first core during execution. 1. A device comprising:a control unit to select whether to execute a code segment on a first core or translate the code segment for execution on a second core;a migration unit to accept the selection to execute the code segment on the first core and migrate the code segment to the first core; anda binary translator unit to accept the selection to translate the code segment and generate a binary translation of the code segment to execute on the second core;2. The device of claim 1 , the first core to execute instructions from a first instruction set architecture and the second core to execute instructions from a second instruction set architecture comprising a subset of the first instruction set architecture.3. The device of claim 1 , further comprising a translation blacklist unit to maintain a list of instructions to not perform binary translation on.4. The device of claim 1 , the selecting whether to execute or translate the code segment comprising determining a code segment length and translating when the code segment length is below a pre-determined length threshold.5. A processor comprising:a first core to operate at a first maximum power consumption rate;a second core to operate at a second maximum power consumption rate which is less than the first maximum power consumption rate; and when to execute program code on the first core without binary translation; and', 'when to apply binary translation to the program code to generate translated program code and execute ...

16-01-2014 publication date

COOPERATIVE THREAD ARRAY REDUCTION AND SCAN OPERATIONS

Number: US20140019724A1
Assignee: NVIDIA CORPORATION

One embodiment of the present invention sets forth a technique for performing aggregation operations across multiple threads that execute independently. Aggregation is specified as part of a barrier synchronization or barrier arrival instruction, where in addition to performing the barrier synchronization or arrival, the instruction aggregates (using reduction or scan operations) values supplied by each thread. When a thread executes the barrier aggregation instruction the thread contributes to a scan or reduction result, and waits to execute any more instructions until after all of the threads have executed the barrier aggregation instruction. A reduction result is communicated to each thread after all of the threads have executed the barrier aggregation instruction and a scan result is communicated to each thread as the barrier aggregation instruction is executed by the thread. 1. A method for performing a scan operation across multiple threads , the method comprising:receiving a barrier instruction that specifies the scan operation for execution by a first thread of the multiple threads;combining a value associated with the first thread with an scan result for the multiple threads;communicating the scan result to the first thread; andcausing another instruction to be executed without waiting until the barrier instruction is received by a second thread of the multiple threads.2. The method of claim 1 , further comprising the steps of:determining that the second thread is the last thread of the multiple threads to receive the barrier instruction; andinitializing the scan result.3. The method of claim 1 , wherein the communication of the scan result to the first thread occurs before the value associated with the first thread is combined with the scan result.4. The method of claim 1 , wherein the communication of the scan result to the first thread occurs after the value associated with the first thread is combined with the scan result.5. The method of claim 1 , ...
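Stripped of the hardware details, the barrier-aggregation behaviour can be mimicked sequentially: as each thread arrives at the barrier it is handed the running (exclusive) scan value and contributes its own value, and once the last thread has arrived the finished reduction is visible to all of them. The thread count, contribution values and the use of an exclusive scan are illustrative assumptions in this C sketch.

    #include <stdio.h>

    #define NTHREADS 4

    int main(void)
    {
        int contrib[NTHREADS] = {3, 1, 4, 1};
        int scan[NTHREADS], total = 0;

        /* Arrival order stands in for threads executing the barrier
           aggregation instruction: the scan result is available to a thread
           as it arrives, the reduction only after every thread has arrived. */
        for (int t = 0; t < NTHREADS; t++) {
            scan[t] = total;        /* exclusive scan handed to thread t */
            total  += contrib[t];   /* thread t's contribution */
        }
        for (int t = 0; t < NTHREADS; t++)
            printf("thread %d: scan=%d reduction=%d\n", t, scan[t], total);
        return 0;
    }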
