Publication date: 13-02-2020
Number: US20200050585A1
Assignee:
Data records are joined using a computer. Data records in a first plurality of data records and a second plurality of data records are hashed. The data records in the first and second pluralities are respectively assigned to first and second groupings based on the hashes. Associated pairs of groupings from the first and second groupings are provided to a thread executing on a computer processor, and different pairs are provided to different threads. The threads operate on the pairs of groupings in parallel to determine whether to join the records in the groupings. A thread joins two data records under consideration if the hashes associated with the data records match. The joined data records are output.

1. A computer-implemented method comprising:
dividing a plurality of data records into a set of groupings based on a hash value of each of the plurality of data records, each grouping in the set of groupings including a subset of the plurality of data records, each grouping associated with a set of bits that is a subset of bits of the hash value of each data record in the corresponding grouping;
determining that a first grouping in the set of groupings is associated with a same set of bits as a second grouping in the set of groupings;
assigning the first and second groupings to a worker thread;
determining, by the worker thread, whether to join a data record in the first grouping with a data record in the second grouping;
responsive to determining to join the data record in the first grouping with the data record in the second grouping, joining the two data records by the worker thread; and
outputting, by the worker thread, the joined data records.

2. The computer-implemented method of claim 1, wherein the plurality of data records are data records of a first data stream and a second data stream, the data records in the first grouping are a subset of the data records of the first data stream, and the data records in the second grouping are a subset of the ...
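
The steps recited in the abstract and claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `NUM_BITS`, `hash_key`, `partition`, `join_pair`, and `parallel_hash_join` are hypothetical, CRC32 is only a stand-in hash function, and the choice of the low-order bits as the grouping's "set of bits" is an assumption. The abstract describes joining when the hashes match; the additional key comparison below is an added assumption to guard against hash collisions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_BITS = 4  # assumed: number of low-order hash bits that identify a grouping


def hash_key(key):
    # Deterministic 32-bit hash of a join key (CRC32 is a stand-in choice).
    return zlib.crc32(repr(key).encode("utf-8"))


def partition(records, key_of):
    # Divide records into groupings named by the low NUM_BITS of each hash.
    groups = defaultdict(list)
    mask = (1 << NUM_BITS) - 1
    for rec in records:
        h = hash_key(key_of(rec))
        groups[h & mask].append((h, rec))
    return groups


def join_pair(left_group, right_group, key_of):
    # Worker: decide which records of two same-bits groupings to join.
    by_hash = defaultdict(list)
    for h, rec in left_group:
        by_hash[h].append(rec)
    joined = []
    for h, right in right_group:
        for left in by_hash.get(h, []):
            # Hashes match here; the key check is an added assumption
            # that filters out hash collisions.
            if key_of(left) == key_of(right):
                joined.append((left, right))
    return joined


def parallel_hash_join(left_records, right_records, key_of, workers=4):
    left_groups = partition(left_records, key_of)
    right_groups = partition(right_records, key_of)
    # Only groupings associated with the same set of bits are paired.
    shared = sorted(left_groups.keys() & right_groups.keys())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(join_pair, left_groups[b], right_groups[b], key_of)
            for b in shared
        ]
        results = []
        for future in futures:
            results.extend(future.result())
    return results
```

Calling `parallel_hash_join(left, right, key_of)` with two lists of tuples and `key_of = lambda r: r[0]` returns a list of `(left_record, right_record)` pairs whose keys are equal. Because each pair of groupings is handled independently, no locking is needed between workers. In CPython the threads mainly illustrate the structure, since the interpreter lock limits CPU parallelism.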