Total found: 28. Displayed: 25.
Publication date: 21-01-2016

EXECUTION OF DIVERGENT THREADS USING A CONVERGENCE BARRIER

Number: US20160019066A1
Assignee:

A method, system, and computer program product for executing divergent threads using a convergence barrier are disclosed. A first instruction in a program is executed by a plurality of threads, where the first instruction, when executed by a particular thread, indicates to a scheduler unit that the thread participates in a convergence barrier. A first path through the program is executed by a first divergent portion of the participating threads and a second path through the program is executed by a second divergent portion of the participating threads. The first divergent portion of the participating threads executes a second instruction in the program and transitions to a blocked state at the convergence barrier. The scheduler unit determines that all of the participating threads are synchronized at the convergence barrier and the convergence barrier is cleared.
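
The flow in the abstract above can be sketched as a small sequential simulation. This is only an illustrative model, not the patented scheduler: the class `ConvergenceBarrier`, its methods `join` and `arrive`, and the even/odd branch condition are all hypothetical names and choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ConvergenceBarrier:
    """Toy model of a convergence barrier tracked by a scheduler (hypothetical API)."""
    participants: set = field(default_factory=set)  # threads that executed the "join" instruction
    blocked: set = field(default_factory=set)       # threads currently waiting at the barrier

    def join(self, tid):
        # First instruction: the thread tells the scheduler it participates.
        self.participants.add(tid)

    def arrive(self, tid):
        # Second instruction: the thread blocks at the barrier.
        # Returns the released threads once every participant has arrived.
        if tid not in self.participants:
            raise ValueError(f"thread {tid} does not participate in this barrier")
        self.blocked.add(tid)
        if self.blocked == self.participants:
            released = set(self.blocked)
            self.participants.clear()  # the barrier is cleared
            self.blocked.clear()
            return released
        return set()


def run_divergent(values):
    """Run two divergent paths (assumed split: even vs. odd input) and reconverge."""
    barrier = ConvergenceBarrier()
    for tid in range(len(values)):
        barrier.join(tid)

    first_path = [t for t, v in enumerate(values) if v % 2 == 0]
    second_path = [t for t, v in enumerate(values) if v % 2 != 0]

    results, released = {}, set()
    for tid in first_path:            # first divergent portion runs, then blocks
        results[tid] = values[tid] * 2
        released = barrier.arrive(tid)
    for tid in second_path:           # second divergent portion runs, then blocks
        results[tid] = values[tid] + 1
        released = barrier.arrive(tid)
    return results, released
```

Calling `run_divergent([1, 2, 3, 4])` runs threads 1 and 3 on the first path and threads 0 and 2 on the second; the barrier releases all four only after the last one arrives.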

Publication date: 22-01-2015

System, method, and computer program product for cooperative multi-threading for vector threads

Number: US20150026438A1
Assignee: Nvidia Corp

A system, method, and computer program product for ensuring forward progress of threads that implement divergent operations in a single-instruction, multiple data (SIMD) architecture is disclosed. The method includes the steps of allocating a queue data structure to a thread block including a plurality of threads, determining that a current instruction specifies a yield operation, pushing a token onto the second side of the queue data structure, disabling any active threads in the thread block, popping a next pending token from the first side of the queue data structure, and activating one or more threads in the thread block according to a mask included in the next pending token.
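
The yield step described above maps naturally onto a double-ended queue. The sketch below is a simplified model under stated assumptions: the `Token` fields (in particular `pc`, a resume address), the `ThreadBlock` class, and the method name `yield_` are hypothetical; the abstract only states that a token carries a mask.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    pc: int    # assumed field: where the threads named in the mask resume
    mask: int  # bit i set means thread i is activated when the token is popped


class ThreadBlock:
    """Toy thread block with a per-block token queue (hypothetical model)."""

    def __init__(self, num_threads, pending=()):
        self.active_mask = (1 << num_threads) - 1  # all threads start active
        self.pc = 0
        self.queue = deque(pending)                # queue allocated to the block

    def yield_(self, resume_pc):
        # Push a token for the currently active threads onto the second side.
        self.queue.append(Token(resume_pc, self.active_mask))
        # Disable any active threads.
        self.active_mask = 0
        # Pop the next pending token from the first side and activate its threads.
        token = self.queue.popleft()
        self.pc, self.active_mask = token.pc, token.mask
        return token
```

Because tokens are pushed on one side and popped from the other, a group of threads that yields cannot run again before every previously pending group has had a turn, which is the forward-progress property the abstract is after.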

Publication date: 28-01-2021

Real-time neural text-to-speech

Number: US20210027762A1
Assignee: Baidu USA LLC

Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.

Publication date: 12-02-2015

TECHNIQUE FOR GROUPING INSTRUCTIONS INTO INDEPENDENT STRANDS

Number: US20150046684A1
Assignee: NVIDIA CORPORATION

A device compiler and linker is configured to group instructions into different strands for execution by different threads based on the dependence of those instructions on other, long-latency instructions. A thread may execute a strand that includes long-latency instructions, and then hardware resources previously allocated for the execution of that thread may be de-allocated from the thread and re-allocated to another thread. The other thread may then execute another strand while the long-latency instructions are in flight. With this approach, the other thread is not required to wait for the long-latency instructions to complete before acquiring hardware resources and initiating execution of the other strand, thereby eliminating at least a portion of the time that the other thread would otherwise spend waiting.

1. A computer-implemented method for compiling program code for execution on a processing unit, the method comprising: generating a weight value for each program instruction included in a basic block of the program code, wherein the weight value associated with a given program instruction reflects a number of long-latency program instructions upon which the given program instruction depends; grouping the program instructions included in the basic block into a first strand and a second strand based on the weight values generated for the program instructions, wherein the first strand includes a first set of program instructions that have a first weight value, and the second strand includes a second set of program instructions that have a second weight value; causing a first thread to process the first strand; and causing a second thread to process the second strand.
2. The computer-implemented method of claim 1, further comprising inserting a first yield instruction into the first strand, wherein the first thread is configured to process the first strand by: acquiring a hardware resource associated with a processing core included in the processing unit; ...
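
The weight-based grouping in claim 1 can be illustrated with a short sketch. It assumes that a weight counts the distinct long-latency instructions an instruction depends on, directly or transitively, and that the dependence graph of a basic block is acyclic; the function names and the instruction names in the usage note are hypothetical.

```python
def long_latency_ancestors(name, deps, long_latency, memo):
    """Set of long-latency instructions that `name` depends on (transitively)."""
    if name in memo:
        return memo[name]
    found = set()
    for dep in deps.get(name, ()):
        if dep in long_latency:
            found.add(dep)
        found |= long_latency_ancestors(dep, deps, long_latency, memo)
    memo[name] = found
    return found


def group_into_strands(order, deps, long_latency):
    """Group instructions of one basic block into strands keyed by weight value.

    order        -- instruction names in program order
    deps         -- maps an instruction to the instructions it depends on
    long_latency -- set of instruction names treated as long-latency
    """
    memo = {}
    strands = {}
    for name in order:
        weight = len(long_latency_ancestors(name, deps, long_latency, memo))
        strands.setdefault(weight, []).append(name)
    return strands
```

For example, with `order = ["load_a", "add_i", "mul_a", "load_b", "fma"]`, `deps = {"mul_a": ["load_a"], "load_b": ["add_i"], "fma": ["mul_a", "load_b"]}` and `long_latency = {"load_a", "load_b"}`, the instructions with weight 0 form one strand, `mul_a` (weight 1) another, and `fma` (weight 2) a third. Each strand could then be handed to a different thread.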

Publication date: 18-02-2021

Multi-speaker neural text-to-speech

Number: US20210049999A1
Assignee: Baidu USA LLC

Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Publication date: 01-03-2018

Automatic audio captioning

Number: US20180061439A1
Assignee: Individual

A method, computer readable medium, and system are disclosed for audio captioning. A raw audio waveform including a non-speech sound is received and relevant features are extracted from the raw audio waveform using a recurrent neural network (RNN) acoustic model. A discrete sequence of characters represented in a natural language is generated based on the relevant features, where the discrete sequence of characters comprises a caption that describes the non-speech sound.

Publication date: 12-06-2014

REGISTER ALLOCATION FOR CLUSTERED MULTI-LEVEL REGISTER FILES

Number: US20140164745A1
Assignee: NVIDIA CORPORATION

A method for allocating registers within a processing unit. A compiler assigns a plurality of instructions to a plurality of processing clusters. Each instruction is configured to access a first virtual register within a live range. The compiler determines which processing cluster in the plurality of processing clusters is an owner cluster for the first virtual register within the live range. The compiler configures a first instruction included in the plurality of instructions to access a first global virtual register.

1. A method of allocating registers within a processing unit, the method comprising: assigning a plurality of instructions to a plurality of processing clusters, wherein each instruction is configured to access a first virtual register within a live range; determining which processing cluster in the plurality of processing clusters is an owner cluster for the first virtual register within the live range; and configuring a first instruction included in the plurality of instructions to access a first global virtual register.
2. The method of claim 1, wherein the first instruction is configured to implement a write operation to the first virtual register and is assigned to the owner cluster.
3. The method of claim 2, wherein: the first instruction has a corresponding location in a program control flow; and configuring the first instruction comprises: determining that an instruction assigned to a non-owner cluster is configured to implement a read operation from the first virtual register; and inserting a copy instruction after the corresponding location in the program control flow of the first instruction, wherein the copy instruction is configured to implement a copy operation that copies a value in the first virtual register to the first global virtual register.
4. The method of claim 1, wherein the first instruction is configured to implement a write operation to the first virtual register and is assigned to a non-owner cluster.
5. The method of claim ...

Publication date: 12-06-2014

COMPILER-CONTROLLED REGION SCHEDULING FOR SIMD EXECUTION OF THREADS

Number: US20140165049A1
Assignee: NVIDIA CORPORATION

A compiler-controlled technique for scheduling threads to execute different regions of a program. A compiler analyzes program code to determine a control flow graph for the program code. The control flow graph contains regions and directed edges between regions. The regions have associated execution priorities. The directed edges indicate the direction of program control flow. Each region has a thread frontier which contains one or more regions. The compiler inserts one or more update predicate mask variable instructions at the end of a region. The compiler also inserts one or more conditional branch instructions at the end of the region. The conditional branch instructions are arranged in order of execution priority of the regions in the thread frontier of the region, to enforce execution priority of the regions at runtime.

1. A method for scheduling threads to execute different regions of a program, the method comprising: analyzing a control flow graph that is based on program code and comprises a plurality of regions, wherein each region represents a different portion of the program code, is assigned an execution priority, and has a thread frontier that includes one or more thread frontier regions, each thread frontier region being one of the plurality of regions in the control flow graph; inserting one or more update predicate mask variable instructions at the end of a first region included in the plurality of regions based on the control flow graph and the program code; and inserting one or more conditional branch instructions at the end of the first region that are arranged to reflect execution priority of the one or more thread frontier regions in the thread frontier of the first region.
2. The method of claim 1, further comprising determining that a branch instruction is included at the end of the first region, and replacing the branch instruction with an instruction configured to calculate a branch condition bitmask variable for the first region.
3. ...

Publication date: 14-05-2015

CACHE FILTER

Number: US20150134916A1
Assignee: NVIDIA CORPORATION

A cache filter is described. More specifically, some implementations include techniques for classification of memory requests including calculating a probability that one or more memory regions are associated with a particular memory request, selecting one or more regions of the memory to receive memory requests based on the probability associated with the one or more regions, receiving one or more memory requests, determining that at least one of the memory requests is associated with one of the one or more selected regions of the memory, and providing the at least one memory request to the memory.

1. A method comprising: calculating a probability that one or more memory regions are associated with a particular memory request; selecting one or more regions of the memory to receive memory requests based on the probability associated with the one or more regions; receiving one or more memory requests; determining that at least one of the memory requests is associated with one of the one or more selected regions of the memory; and providing the at least one memory request to the memory.
2. The method of claim 1, wherein the memory is a cache.
3. The method of claim 1, wherein the selecting one or more regions comprises: comparing the probability of at least one of the one or more regions to a threshold; and selecting the one or more regions based on the probability associated with each of the one or more selected regions being above the threshold.
4. The method of further comprising: determining that at least another memory request is not associated with the one or more selected regions of the memory; and providing the at least another memory request to a different memory.
5. The method of comprising: receiving one or more additional memory requests; and identifying a respective region of the memory corresponding to each additional memory request, wherein the probability for each of the one or more regions is based on a frequency of the additional memory requests associated with ...
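
The claims above describe a frequency-based probability per region, a threshold comparison, and routing of requests. The following is a minimal sketch of that idea, assuming fixed-size regions and a probability estimated as the share of sampled requests that fall in each region; `REGION_SIZE` and all function names are hypothetical.

```python
from collections import Counter

REGION_SIZE = 4096  # assumed region granularity in bytes


def region_of(address):
    """Map a byte address to its region index."""
    return address // REGION_SIZE


def region_probabilities(sample_addresses):
    """Estimate, per region, the probability that a request targets it."""
    counts = Counter(region_of(a) for a in sample_addresses)
    total = sum(counts.values())
    return {region: count / total for region, count in counts.items()}


def select_regions(probabilities, threshold):
    """Keep only the regions whose probability is above the threshold."""
    return {region for region, p in probabilities.items() if p > threshold}


def route(requests, selected_regions):
    """Send requests in selected regions to the cache, the rest elsewhere."""
    to_cache, to_other = [], []
    for address in requests:
        target = to_cache if region_of(address) in selected_regions else to_other
        target.append(address)
    return to_cache, to_other
```

With the sample `[0, 8, 16, 4096, 8192, 24]` and a threshold of 0.5, only region 0 is selected, so later requests to addresses 100 and 4000 go to the cache while a request to 5000 is provided to a different memory.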

Publication date: 02-05-2019

Systems and methods for block-sparse recurrent neural networks

Number: US20190130271A1
Assignee: Baidu USA LLC

Described herein are systems and methods to prune deep neural network models, reducing the overall memory and compute requirements of these models. It is demonstrated that, using block pruning and group lasso combined with pruning during training, block-sparse recurrent neural networks (RNNs) may be built that are as accurate as dense baseline models. Two different approaches are disclosed to induce block sparsity in neural network models: pruning blocks of weights in a layer and using group lasso regularization to create blocks of weights with zeros. Using these techniques, it is demonstrated that block-sparse RNNs with high sparsity can be created with small loss in accuracy. Block-sparse RNNs eliminate overheads related to data storage and irregular memory accesses while increasing hardware efficiency compared to unstructured sparsity.
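
Pruning blocks of weights in a layer can be sketched in a few lines. The abstract does not say how a block is judged prunable, so the criterion used here (zero a block when its largest absolute weight is below a threshold) is an assumption, as are the function name and the fixed square block size.

```python
def prune_blocks(weights, block, threshold):
    """Return a copy of `weights` with low-magnitude blocks set to zero.

    weights   -- 2-D weight matrix as a list of equal-length rows
    block     -- side length of a square block (edge blocks may be smaller)
    threshold -- a block is zeroed if its max |weight| is below this (assumed rule)
    """
    rows, cols = len(weights), len(weights[0])
    pruned = [row[:] for row in weights]
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [
                (r, c)
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            ]
            if max(abs(weights[r][c]) for r, c in cells) < threshold:
                for r, c in cells:
                    pruned[r][c] = 0.0
    return pruned
```

Zeroing whole blocks rather than individual weights keeps the surviving weights contiguous, which is what lets a block-sparse layer avoid the irregular memory accesses of unstructured sparsity.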

Publication date: 16-06-2016

SYSTEMS AND METHODS FOR SPEECH TRANSCRIPTION

Number: US20160171974A1
Assignee: Baidu USA LLC

Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the system do not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learn a function that is robust to such effects. Neither a phoneme dictionary nor even the concept of a “phoneme” is needed. Embodiments include a well-optimized recurrent neural network (RNN) training system that can use multiple GPUs, as well as a set of novel data synthesis techniques that allows for a large amount of varied data for training to be efficiently obtained. Embodiments of the system can also handle challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

1. A computer-implemented method for training a transcription model, the method comprising: for each of a set of utterances: inputting an utterance that comprises a set of spectrogram frames into a first layer of the transcription model that evaluates each of the spectrogram frames from the set of spectrogram frames with a context of one or more spectrogram frames; outputting from the transcription model a predicted character or character probabilities for the utterance; and computing a loss to measure error in prediction for the utterance; evaluating a gradient of predicted outputs of the transcription model given the ground-truth characters; and updating the neural network model using back-propagation.
2. The computer-implemented method of further comprising: jittering at least some of the set of utterances prior to inputting into the transcription model.
3. The computer-implemented method of wherein the step of jittering at least some of the ...

Publication date: 15-06-2017

SYSTEMS AND METHODS FOR A MULTI-CORE OPTIMIZED RECURRENT NEURAL NETWORK

Number: US20170169326A1
Assignee: Baidu USA LLC

Systems and methods for a multi-core optimized Recurrent Neural Network (RNN) architecture are disclosed. The various architectures affect communication and synchronization operations according to the Multi-Bulk-Synchronous-Parallel (MBSP) model for a given processor. The resulting family of network architectures, referred to as MBSP-RNNs, perform similarly to conventional RNNs having the same number of parameters, but are substantially more efficient when mapped onto a modern general purpose processor. Due to the large gain in computational efficiency, for a fixed computational budget, MBSP-RNNs outperform RNNs at applications such as end-to-end speech recognition.

1. A method to improve a computing performance of a computing device by mapping a Recurrent Neural Network (RNN) architecture to a processor's microarchitecture of the computing device, the method comprising: obtaining values associated with levels of memory based on a description of the processor's microarchitecture; and for a lowest to a highest level of a hierarchy of the RNN architecture, each level associated with the processor's microarchitecture and being described by at least two of memory capacity, number of processor cores, bandwidth, computational bandwidth, and latency: grouping neurons into modules, each module representing a logical unit in an RNN layer within the RNN architecture; and arranging connections between the modules such that the modules satisfy predefined conditions of the RNN architecture, the predefined conditions of the RNN architecture being related to the at least two of memory capacity, number of processor cores, bandwidth, computational bandwidth, and latency.
2. The method according to claim 1, wherein arranging connections comprises pruning bi-directional connections between the modules to balance the predefined conditions.
3. The method according to claim 1, wherein for each level of processor memory the predefined conditions comprise that parameters ...

Publication date: 04-06-2020

PREDICTING DEEP LEARNING SCALING

Number: US20200175374A1
Assignee: Baidu USA LLC

As deep learning application domains grow, a deeper understanding of the relationships between training set size, computational scale, and model accuracy improvements is extremely beneficial. Presented herein is a large-scale empirical study of error and model size growth as training sets grow. Embodiments of a methodology for this measurement are introduced herein as well as embodiments for predicting other metrics, such as compute-related metrics. It is shown herein that a power law may be used to represent deep model relationships, such as error and training data size. It is also shown that model size scales sublinearly with data size. These scaling relationships have significant implications on deep learning research, practice, and systems. They can assist model debugging, setting accuracy targets, and decisions about data set growth. They can also guide computing system design and underscore the importance of continued computational scaling.

1. A computer-implemented method for generating a learning curve to aid in predicting a metric for a deep learning model, the method comprising: splitting a data set into a set of shards such that the shard sizes span multiple orders of magnitude; training a set of models on each of the shards from the set of shards, in which models within the set of model candidates vary in architecture, hyperparameters, or both; using a validation set to identify a best model for each shard from among the set of trained model candidates, in which each best model has a corresponding validation accuracy for that shard, which has a shard size; fitting a power-law learning curve model using the validation accuracies and corresponding shard sizes of the best models selected for the shards; and using the fitted power-law learning curve to predict a metric associated with a deep learning model.
2. The computer-implemented method of further comprising the step of randomly shuffling the data set to maximize likelihood that shards of the data set have ...
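
The power-law fitting step can be sketched as an ordinary least-squares line fit in log-log space. This is one common way to fit a curve of the form error = a * size^b and is an assumption here; the abstract does not specify the fitting procedure, and the function names are hypothetical.

```python
import math


def fit_power_law(sizes, errors):
    """Fit errors ~ a * sizes**b by least squares on (log size, log error)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                          # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)      # intercept in log-log space = log(a)
    return a, b


def predict(a, b, size):
    """Evaluate the fitted learning curve at a new training-set size."""
    return a * size ** b
```

Given the best validation error measured on each shard, `fit_power_law(shard_sizes, best_errors)` returns the scale `a` and exponent `b`, and `predict(a, b, larger_size)` extrapolates the curve to a data set size that has not been trained on. At least two distinct shard sizes are needed for the fit to be defined.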

Publication date: 23-07-2015

SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR BULK SYNCHRONOUS BINARY PROGRAM TRANSLATION AND OPTIMIZATION

Number: US20150205586A1
Assignee: NVIDIA CORPORATION

A system, method, and computer program product are provided for bulk synchronous binary program translation and optimization. The method includes the steps of executing a block of translated binary instructions by multiple threads and gathering profiling data during execution of the block of translated binary instructions. The multiple threads are then synchronized at a barrier instruction associated with the block of translated binary instructions and the block of translated binary instructions is replaced with optimized binary instructions, where the optimized binary instructions are produced based on the profiling data.

1. A method comprising: executing, on a parallel processor, a block of translated binary instructions by multiple threads; gathering profiling data during execution of the block of translated binary instructions; synchronizing the multiple threads at a barrier instruction associated with the block of translated binary instructions; and replacing the block of translated binary instructions with optimized binary instructions, wherein the optimized binary instructions are produced based on the profiling data.
2. The method of claim 1, further comprising executing a second block of translated binary instructions by at least one thread during the synchronizing of the multiple threads and the replacing of the block.
3. The method of claim 1, wherein the barrier comprises a barrier instruction that specifies a barrier hierarchy level.
4. The method of claim 3, further comprising: determining a lower level barrier than the specified barrier hierarchy level is supported; and comparing the optimized binary instructions with one or more versions of binary instructions for the block that are associated with different multiple threads.
5. The method of claim 4, further comprising merging the optimized binary instructions with the versions of binary instructions when the comparing indicates that the optimized binary instructions substantially match the one or more versions of binary instructions.
6. The method of claim 5, ...

Publication date: 30-08-2018

SYSTEMS AND METHODS FOR REAL-TIME NEURAL TEXT-TO-SPEECH

Number: US20180247636A1
Assignee: Baidu USA LLC

Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.

1. A computer-implemented method for training a text-to-speech (TTS) system to synthesize human speech from text, comprising: training a grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text; using the trained grapheme-to-phoneme model to convert written text, which is a transcription corresponding to training audio, to phonemes corresponding to the written text and training audio; using the training audio and the corresponding phonemes to train a segmentation model to output phoneme durations by identifying phoneme boundaries in the training audio by aligning it with the corresponding phonemes; given a ground truth dataset comprising ground truth written text representing a transcription of ground truth training audio, using the trained grapheme-to-phoneme model to produce phonemes; given the ground truth training audio and the corresponding phonemes, using the trained segmentation model to produce phoneme durations; and using the ground truth training audio, the ...

Publication date: 22-11-2018

Systems and methods for multi-speaker neural text-to-speech

Number: US20180336880A1
Assignee: Baidu USA LLC

Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.

Publication date: 21-11-2019

RESOURCE-EFFICIENT NEURAL ARCHITECTS

Number: US20190354837A1
Assignee: Baidu USA LLC

Neural Architecture Search (NAS) is a laborious process. Prior work on automated NAS mainly targets improving accuracy but lacks consideration of computational resource use. Presented herein are embodiments of a Resource-Efficient Neural Architect (RENA), an efficient resource-constrained NAS using reinforcement learning with network embedding. RENA embodiments use a policy network to process the network embeddings to generate new configurations. Example demonstrations of RENA embodiments on image recognition and keyword spotting (KWS) problems are also presented herein. RENA embodiments can find novel architectures that achieve high performance even with tight resource constraints. For the CIFAR10 dataset, the tested embodiment achieved 2.95% test error when compute intensity is greater than 100 FLOPs/byte, and 3.87% test error when model size was less than 3M parameters. For the Google Speech Commands Dataset, the tested RENA embodiment achieved the state-of-the-art accuracy without resource constraints, and it outperformed the optimized architectures with tight resource constraints.

1. A computer-implemented method for performing neural architecture searching, comprising: converting a neural network architecture into a network embedding of the neural network architecture using the network embedding recurrent neural network, in which the neural network architecture comprising one or more layers, one or more network modules, or both, and each of the one or more layer or one or more models has at least one corresponding feature; using the scale recurrent neural network, which receives the network embedding of the neural network architecture, identifying one or more of the features of the neural network architecture; using the action recurrent neural network, which receives the network embedding of the neural network architecture, determining whether to remove a portion of the network architecture, keep a portion of the network architecture, or add a portion ...

Publication date: 21-11-2019

SPECTROGRAM TO WAVEFORM SYNTHESIS USING CONVOLUTIONAL NETWORKS

Number: US20190355347A1
Assignee: Baidu USA LLC

For the problem of waveform synthesis from spectrograms, presented herein are embodiments of an efficient neural network architecture, based on transposed convolutions to achieve a high compute intensity and fast inference. In one or more embodiments, for training of the convolutional vocoder architecture, losses are used that are related to perceptual audio quality, as well as a GAN framework to guide with a critic that discerns unrealistic waveforms. While yielding high-quality audio, embodiments of the model can achieve audio synthesis more than 500 times faster than real time. Multi-head convolutional neural network (MCNN) embodiments for waveform synthesis from spectrograms are also disclosed. MCNN embodiments enable significantly better utilization of modern multi-core processors than commonly-used iterative algorithms like Griffin-Lim and yield very fast (more than 300× real-time) waveform synthesis. Embodiments herein yield high-quality speech synthesis, without any iterative algorithms or autoregression in computations.

1. A computer-implemented method for training a neural network model for spectrogram inversion comprising: inputting an input spectrogram comprising a number of frequency channels into a convolution neural network (CNN) comprising at least one head, in which a head comprises a set of transposed convolution layers in which each transposed convolution layer is separated by a nonlinearity operation and the set of transposed convolution layers reduces the number of frequency channels of the input spectrogram to one channel after a last transposed convolution layer of the set of transposed convolution layers; outputting from the CNN a synthesized waveform for the input spectrogram, the input spectrogram having a corresponding ground truth waveform; using the corresponding ground truth waveform, the synthesized waveform, and a loss function comprising at least one or more loss components selected from spectral convergence loss and log-scale short- ...
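
Claim 1 names a spectral convergence loss without defining it in the visible text. A commonly used definition, assumed here rather than taken from the listing, is the Frobenius norm of the difference between the reference and synthesized magnitude spectrograms divided by the Frobenius norm of the reference. The sketch takes precomputed magnitude spectrograms as nested lists; computing them from waveforms is outside its scope, and the function names are hypothetical.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a 2-D list."""
    return math.sqrt(sum(value * value for row in matrix for value in row))


def spectral_convergence(reference_mag, synthesized_mag):
    """Assumed definition: ||ref - synth||_F / ||ref||_F on magnitude spectrograms."""
    difference = [
        [ref - syn for ref, syn in zip(ref_row, syn_row)]
        for ref_row, syn_row in zip(reference_mag, synthesized_mag)
    ]
    return frobenius_norm(difference) / frobenius_norm(reference_mag)
```

The value is 0 when the two spectrograms match exactly and grows with the relative size of the mismatch, so large spectral components dominate the loss. The reference spectrogram must be nonzero for the ratio to be defined.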

Publication date: 05-12-2019

DEEP LEARNING MODELS FOR SPEECH RECOGNITION

Number: US20190371298A1
Assignee: Baidu USA LLC

Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the system do not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learn a function that is robust to such effects. Neither a phoneme dictionary, nor even the concept of a “phoneme,” is needed. Embodiments include a well-optimized recurrent neural network (RNN) training system that can use multiple GPUs, as well as a set of novel data synthesis techniques that allows for a large amount of varied data for training to be efficiently obtained. Embodiments of the system can also handle challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

1. A computer-implemented method for training a transcription neural network, the method comprising:
inputting an utterance that comprises a set of spectrogram frames covering time steps of the utterance into a first layer of the transcription neural network that evaluates, for each time step of a set of time steps, a spectrogram frame from the set of spectrogram frames and an associated context of one or more spectrogram frames;
obtaining predicted character probabilities for the utterance from the transcription neural network;
using the predicted character probabilities for the utterance and a corresponding ground truth transcription for the utterance to determine a loss in predicting the corresponding ground truth transcription for the utterance; and
updating one or more parameters of the transcription neural network using a gradient based upon the loss in predicting the utterance.

2. The computer-implemented method of further comprising: jittering ...
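
The first step of claim 1 pairs each spectrogram frame with a context of neighbouring frames. A minimal sketch of that pairing follows. The symmetric window, the zero padding at the utterance edges, and the name `frames_with_context` are all assumptions made for illustration and are not taken from the patent text.

```python
def frames_with_context(frames, context):
    """For each time step, concatenate the frame with `context` frames on each side.

    Assumptions (not from the patent): the window is symmetric and positions
    outside the utterance are filled with zeros.
    """
    n_features = len(frames[0])
    zero = [0.0] * n_features
    out = []
    for t in range(len(frames)):
        window = []
        for k in range(t - context, t + context + 1):
            # Use the real frame when k is inside the utterance, zeros otherwise.
            window.extend(frames[k] if 0 <= k < len(frames) else zero)
        out.append(window)
    return out


# Three one-feature frames, one frame of context on each side.
stacked = frames_with_context([[1.0], [2.0], [3.0]], context=1)
```

Each row of `stacked` is what the first layer would evaluate for one time step under these assumptions.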

Publication date: 04-10-2016

System, method, and computer program product for managing divergences and synchronization points during thread block execution by using a double sided queue for token storage

Number: US9459876B2
Assignee: Nvidia Corp

A system, method, and computer program product for ensuring forward progress of threads that implement divergent operations in a single-instruction, multiple-data (SIMD) architecture is disclosed. The method includes the steps of allocating a queue data structure to a thread block including a plurality of threads, determining that a current instruction specifies a yield operation, pushing a token onto the second side of the queue data structure, disabling any active threads in the thread block, popping a next pending token from the first side of the queue data structure, and activating one or more threads in the thread block according to a mask included in the next pending token.
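
The yield steps listed in this abstract can be sketched with a double-ended queue. This is an illustrative software model, not the hardware mechanism the patent claims. The names `Token`, `ThreadBlock`, and `do_yield`, and the `pc` field stored in each token, are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Token:
    mask: int  # bit i set means thread i resumes when this token is popped
    pc: int    # hypothetical field: program counter at which those threads resume


class ThreadBlock:
    def __init__(self, num_threads):
        self.active_mask = (1 << num_threads) - 1  # all threads start active
        self.pc = 0
        self.tokens = deque()  # the queue allocated to this thread block

    def do_yield(self, resume_pc):
        # Push a token for the currently active threads onto the second side.
        self.tokens.append(Token(self.active_mask, resume_pc))
        # Disable every active thread in the block.
        self.active_mask = 0
        # Pop the next pending token from the first side and activate its threads.
        token = self.tokens.popleft()
        self.active_mask, self.pc = token.mask, token.pc
        return token


block = ThreadBlock(num_threads=4)
block.active_mask = 0b0011              # threads 0-1 took one path
block.tokens.append(Token(0b1100, 40))  # threads 2-3 are pending on another
first = block.do_yield(resume_pc=10)    # threads 0-1 yield, threads 2-3 run
second = block.do_yield(resume_pc=50)   # threads 2-3 yield, threads 0-1 run again
```

Because a yielding group is pushed on one side and the next group is popped from the other, each pending group eventually gets a turn in this model, which is the forward-progress property the abstract describes.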

Publication date: 21-01-2020

Systems and methods for speech transcription

Number: US10540957B2
Assignee: Baidu USA LLC

Presented herein are embodiments of state-of-the-art speech recognition systems developed using end-to-end deep learning. In embodiments, the model architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, embodiments of the system do not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learn a function that is robust to such effects. Neither a phoneme dictionary, nor even the concept of a “phoneme,” is needed. Embodiments include a well-optimized recurrent neural network (RNN) training system that can use multiple GPUs, as well as a set of novel data synthesis techniques that allows for a large amount of varied data for training to be efficiently obtained. Embodiments of the system can also handle challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

Publication date: 16-11-2014

Compiler-controlled region scheduling for SIMD execution of threads

Number: TW201443825A
Assignee: Nvidia Corp

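A compiler-controlled technique for scheduling threads to execute different regions of a program is disclosed. A compiler analyzes program code to determine a control flow graph for the program code. The control flow graph includes regions and directed edges between the regions. The regions have associated execution priorities. The directed edges indicate the direction of program control flow. Each region has a thread frontier that contains one or more regions. The compiler inserts one or more update predicate mask variable instructions at the end of a region. The compiler also inserts one or more conditional branch instructions at the end of the region. The conditional branch instructions are arranged in order of the execution priority of the regions in the thread frontier of the region, to enforce the execution priority of the regions at runtime.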

Publication date: 16-09-2014

Register allocation for clustered multi-level register files

Number: TW201435728A
Assignee: Nvidia Corp

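A method for allocating registers within a processing unit is disclosed. A compiler assigns a plurality of instructions to a plurality of processing clusters. Each instruction is configured to access a first virtual register within a live range. The compiler determines which processing cluster in the plurality of processing clusters is the owner cluster of the first virtual register within the live range. The compiler configures a first instruction included in the plurality of instructions to access a first global virtual register.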

Publication date: 09-05-2017

Technique for grouping instructions into independent strands

Number: US09645802B2
Assignee: Nvidia Corp

A device compiler and linker is configured to group instructions into different strands for execution by different threads based on the dependence of those instructions on other, long-latency instructions. A thread may execute a strand that includes long-latency instructions, and then hardware resources previously allocated for the execution of that thread may be de-allocated from the thread and re-allocated to another thread. The other thread may then execute another strand while the long-latency instructions are in flight. With this approach, the other thread is not required to wait for the long-latency instructions to complete before acquiring hardware resources and initiating execution of the other strand, thereby eliminating at least a portion of the time that the other thread would otherwise spend waiting.
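
One way to picture the grouping is to start a new strand wherever an instruction consumes the result of a long-latency instruction issued earlier in the same strand. The splitting rule below is an assumption chosen for illustration, since the abstract does not state the exact criterion. The tuple layout and the name `split_into_strands` are hypothetical.

```python
def split_into_strands(instructions):
    """Split a straight-line instruction list into strands.

    Each instruction is (name, dest_register, source_registers, is_long_latency).
    Assumed rule (not from the patent): a strand ends just before the first
    instruction that reads a register written by a long-latency instruction
    in the current strand.
    """
    strands = [[]]
    pending = set()  # registers awaiting long-latency results in this strand
    for name, dest, sources, long_latency in instructions:
        if pending & set(sources):
            # This instruction would stall, so it opens the next strand.
            strands.append([])
            pending = set()
        strands[-1].append(name)
        if long_latency:
            pending.add(dest)
        else:
            pending.discard(dest)  # a short-latency write replaces the pending value
    return strands


program = [
    ("load_a", "r0", [], True),
    ("add_i", "r1", ["r2"], False),
    ("use_a", "r3", ["r0"], False),
    ("load_b", "r4", ["r3"], True),
    ("use_b", "r5", ["r4"], False),
]
strands = split_into_strands(program)
```

Under this rule, a thread finishing the first strand could release its resources while `load_a` is still in flight, and another thread would pick up the second strand later.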

Publication date: 23-08-2016

Compiler-controlled region scheduling for SIMD execution of threads

Number: US09424038B2
Assignee: Nvidia Corp

A compiler-controlled technique for scheduling threads to execute different regions of a program. A compiler analyzes program code to determine a control flow graph for the program code. The control flow graph contains regions and directed edges between regions. The regions have associated execution priorities. The directed edges indicate the direction of program control flow. Each region has a thread frontier which contains one or more regions. The compiler inserts one or more update predicate mask variable instructions at the end of a region. The compiler also inserts one or more conditional branch instructions at the end of the region. The conditional branch instructions are arranged in order of execution priority of the regions in the thread frontier of the region, to enforce execution priority of the regions at runtime.
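
The end-of-region code described in this abstract can be sketched as follows. The instruction mnemonics (`update_mask`, `branch_if_any`), the function names, and the convention that a smaller priority number runs first are all assumptions for illustration. The thread frontier is taken as an input rather than computed.

```python
def emit_region_epilogue(successors, frontier, priority):
    """Return the pseudo-instructions appended to the end of one region."""
    # One predicate-mask update per successor region.
    code = [f"update_mask p_{s}" for s in successors]
    # One conditional branch per thread-frontier region, highest priority first.
    for target in sorted(frontier, key=lambda r: priority[r]):
        code.append(f"branch_if_any p_{target} -> {target}")
    return code


def next_region(frontier, priority, mask):
    """Runtime view: the first branch whose predicate mask is non-empty wins."""
    for target in sorted(frontier, key=lambda r: priority[r]):
        if mask[target]:
            return target
    return None


priority = {"B": 1, "C": 2, "D": 3}  # assumed: smaller number runs first
epilogue = emit_region_epilogue(["B", "C"], ["C", "D", "B"], priority)
chosen = next_region(["C", "D", "B"], priority, {"B": 0b0000, "C": 0b0110, "D": 0b1001})
```

Because the branches are emitted in priority order, control falls through to the highest-priority region that still has waiting threads. In this example that is region `C`, since no thread is waiting on `B`.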
