Document Indexing

Publication date: 29-08-2016
Number:
KR0101652121B1
Author: 딥 시카, 시카, 딥
Contacts:
Application number: 01-10-102019604
Application date: 29-11-2010

[1]

The number of documents that can be accessed has increased rapidly. As the number of available documents grows, finding the desired documents among them becomes more difficult. Indexing makes searching for and retrieving documents easier, faster, and more efficient.

[2]

Indexing refers to storing data suitable for later retrieval of documents. To obtain such data, the document being indexed can be parsed or analyzed. Natural language processing or other language-sensitive analysis can extract information from documents that is useful for later searches. Because this processing is sensitive to language, the language of the document being processed must be determined for the analysis.

[3]

Methods, systems, and computer program products for indexing documents are described in this specification. In general, embodiments of the present invention include indexing a document according to language-specific rules of a single language; identifying the document as multilingual according to a success metric indicative of the effectiveness of the single-language indexing; and, in response to identifying the document as multilingual, queuing the document for multilingual indexing. Other embodiments further include fragmenting the document into a plurality of smaller documents and indexing each of the smaller documents individually and separately.

[4]

Other typical embodiments include a document indexing system comprising one or more data processing systems. A data processing system includes a processor and computer memory operably coupled to the processor. The computer memory of one or more of the systems has disposed within it program instructions for execution on the processor to implement one or more of the method embodiments described. In particular, the computer memory has disposed within it computer-readable program code configured to index a document according to language-specific rules of a single language; computer-readable program code configured to identify the document as multilingual according to a success metric indicative of the effectiveness of the single-language indexing; and computer-readable program code configured to queue the document for multilingual indexing in response to identifying the document as multilingual.

[5]

Other objects, features, and advantages of the present invention will be apparent from the following more particular description of exemplary embodiments of the invention as illustrated in the accompanying drawings, in which like reference numbers generally represent like parts of exemplary embodiments of the invention.

[6]

FIGS. 1A to 1C illustrate methods according to embodiments of the present invention. FIG. 2 is a block diagram of a computer for indexing documents according to embodiments of the present invention. FIG. 3 is a data flow diagram illustrating a software architecture for indexing documents according to embodiments of the present invention. FIG. 4 is a flowchart illustrating a method of indexing documents according to one embodiment of the present invention. FIG. 5 is a data flow diagram illustrating a method of indexing documents according to embodiments of the present invention.

[7]

Exemplary methods, systems, and design structures for indexing documents according to embodiments of the present invention are described with reference to the accompanying drawings. The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

[8]

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the various embodiments of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

[9]

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, and the like), or an embodiment combining software and hardware aspects, all of which may generally be referred to in this specification as a "circuit", "module", or "system". Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.

[10]

Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this context, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

[11]

Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, and the like, or any suitable combination of the foregoing.

[12]

Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.

[13]

Aspects of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

[14]

These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions that implement the function/act specified in the flowchart and/or block diagram block or blocks.

[15]

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other devices to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

[16]

The indexing process includes language-specific tasks that translate the text of a document into searchable information. FIGS. 1A and 1B illustrate a method of indexing documents according to embodiments of the present invention. Referring to FIG. 1A, the method includes indexing a document (102) according to language-specific rules (110) of a single language [block (104)]; identifying the document as multilingual according to a success metric (112) indicative of the effectiveness of the single-language indexing [block (106)]; and, in response to identifying the document as multilingual, designating the document for multilingual indexing [block (108)].

[17]

As used in this specification, a document (102) is an electronic file that is useful to index. A document may be any type of file (for example, a spreadsheet, chart, presentation, word processing document, archive, image, text, web page, and the like) in any of the hundreds of different file formats that exist. A document may also be, for example, a single unitary file, a compressed file, a portion or section of a larger document or file, a logical section split from a whole document (such as a chapter or subheading), or a combination of these. For example, a document may be a compressed archive file, such as a file in ZIP or TAR format, that is composed of various documents. Each document in the archive may be a single-language document in any language or a multilingual document, so the archive may combine various languages.
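As an illustration of treating an archive as a container of separately indexable documents, the following minimal sketch walks the members of a ZIP file using Python's standard `zipfile` module. The function names `iter_archive_documents` and `build_sample_archive` and the sample file names are hypothetical and do not come from the specification.

```python
import io
import zipfile


def iter_archive_documents(archive_bytes):
    """Yield (name, raw_bytes) for every file stored in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content to index
            yield info.filename, archive.read(info)


def build_sample_archive():
    """Build a small in-memory ZIP holding documents in different languages."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("english.txt", "The quick brown fox")
        archive.writestr("french.txt", "Le renard brun rapide")
    return buffer.getvalue()


# Each member can then be handed to the indexer as its own document.
documents = dict(iter_archive_documents(build_sample_archive()))
```

Each member is returned as raw bytes, so language identification and decoding remain the responsibility of later steps.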

[18]

The language-specific rules (110) are rules that improve indexing by means of language-sensitive data. Depending on the language of the document, language-sensitive information can be extracted from the document (102). The language-specific rules (110) may include stemming rules (120), part-of-speech tagging rules (122), accent-sensitive rules, normalization rules (124), stop word rules (126), or any other language-dependent rules that will occur to one skilled in the art.
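A rule set of this kind could be represented as a small per-language object. The sketch below is an assumption-laden illustration, not the rule format of the specification: the class name `LanguageRules`, its fields, the toy stop-word lists, and the naive suffix-stripping stemmer are all hypothetical, and part-of-speech tagging is omitted.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class LanguageRules:
    """Hypothetical container for the rules of one language."""
    language: str
    stop_words: frozenset = frozenset()
    suffixes: tuple = ()        # naive stemming: strip the first matching suffix
    fold_accents: bool = False  # toy stand-in for accent-sensitive handling

    def normalize(self, word):
        """Lower-case the word and optionally remove combining accents."""
        word = word.lower()
        if self.fold_accents:
            decomposed = unicodedata.normalize("NFD", word)
            word = "".join(c for c in decomposed if not unicodedata.combining(c))
        return word

    def stem(self, word):
        """Strip the first configured suffix that leaves a stem of 3+ letters."""
        for suffix in self.suffixes:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def apply(self, words):
        """Normalize each word, drop stop words, and stem what remains."""
        result = []
        for word in words:
            word = self.normalize(word)
            if word in self.stop_words:
                continue
            result.append(self.stem(word))
        return result


ENGLISH = LanguageRules("en", frozenset({"the", "a", "of"}), ("ing", "ed", "s"))
FRENCH = LanguageRules("fr", frozenset({"le", "la", "de"}), ("es", "s"), fold_accents=True)
```

For example, `ENGLISH.apply(["The", "indexing", "of", "documents"])` yields `["index", "document"]`. A production system would use a real stemmer and tagger for each language.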

[19]

The language-specific rules (110) are specific to a single language. For example, in one example the language-specific rules (110) may form a rule set for English; in other examples, the language-specific rules (110) may be constructed as a rule set for French, Spanish, German, Japanese, [...], Chinese ([...]), Chinese (traditional), [...], Hindi, or any other language that will occur to one skilled in the art. A language may include a sub-language or dialect. In some cases, individual rules or subsets of rules included in the language-specific rules (110) may be common to the language-specific rules (110) of different languages (or shared between several of them).

[20]

The success metric (112) indicative of the effectiveness of the single-language indexing may be any data corresponding to the success of the single-language indexing, and may be obtained with minimal overhead. The success metric (112) may be data created inherently by the single-language indexing process. Alternatively, the success metric (112) may be produced by counting or tracking code with a negligible processing footprint, or by further processing after the single-language indexing. Referring to FIG. 1B, the success metric (112) may include, for example, the percentage of document content (130) (for example, words, sections, pages, and the like) successfully processed during indexing. Successfully processed content may be content from which tokens are successfully generated, or content meeting any other measure of successful processing that will occur to one skilled in the art.

[21]

Referring again to FIG. 1B, identifying the document as multilingual according to the success metric (112) [block (106)] may be performed by determining whether the percentage of successfully processed content (130) fails to meet a threshold percentage value (132) [block (134)]. The percentage of successfully processed words can be obtained by keeping a running tally of the words processed successfully. The threshold percentage value (132) can be configured to refine the number of documents flagged as multilingual. For example, the threshold percentage value (132) may be set to 80%. If the percentage of successfully processed words fails to exceed the threshold percentage (138), the method advances to block (108). If the percentage of successfully processed words exceeds the threshold percentage, the document is ready for searching and consumes no further system processing resources. For example, if fewer than 80% of the words are successfully processed, the document is identified as multilingual. In other embodiments, additional or other statistical measurements may be calculated to identify multilingual documents.
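The threshold test can be sketched in a few lines. The function names below are hypothetical; the 80% default comes from the example in the text, and treating a percentage exactly equal to the threshold as "fails to exceed" is an assumption about the boundary case.

```python
def success_percentage(processed_ok, total):
    """Percentage of content units (here, words) processed successfully."""
    if total == 0:
        return 100.0  # assumption: an empty document has nothing that can fail
    return 100.0 * processed_ok / total


def is_multilingual(processed_ok, total, threshold_percent=80.0):
    """Flag the document when the percentage fails to exceed the threshold."""
    return not success_percentage(processed_ok, total) > threshold_percent
```

With these definitions, a document in which 70 of 100 words were processed successfully is flagged as multilingual, while one with 95 of 100 is considered ready for searching.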

[22]

Referring again to FIG. 1B, if the percentage of successfully processed words fails to exceed the threshold percentage (138), the system queues the document (102) for multilingual indexing [block (108)]. Queuing the document for multilingual indexing [block (108)] may include storing the document (102) in a separate repository for multilingual indexing, or transmitting the document to a separate system or repository. The system may queue the document (102) for immediate processing, as in the case of a processing queue [block (140)]. In other embodiments, the document may be queued for later indexing. For example, because multilingual indexing is more processing-intensive, the document may be queued for indexing during non-peak processing hours [block (142)]. In some implementations, the queue may be a first-in-first-out (FIFO) data structure. In other implementations, the queue may be a last-in-first-out (LIFO) data structure, or the documents in the queue may be ordered by priority according to various priority schemes.
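The three queue disciplines mentioned above can be sketched with Python's standard `collections.deque` and `heapq`. The class `MultilingualQueue`, the convention that a lower number means higher priority, and the non-peak window of 22:00 to 06:00 in `is_off_peak` are illustrative assumptions, not details from the specification.

```python
import heapq
from collections import deque
from itertools import count


class MultilingualQueue:
    """Hypothetical queue of document ids awaiting multilingual indexing."""

    def __init__(self, policy="fifo"):
        if policy not in ("fifo", "lifo", "priority"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self._items = deque()
        self._heap = []
        self._order = count()  # tie-breaker keeps equal priorities in arrival order

    def put(self, doc_id, priority=0):
        """Add a document; priority is only used by the 'priority' policy."""
        if self.policy == "priority":
            heapq.heappush(self._heap, (priority, next(self._order), doc_id))
        else:
            self._items.append(doc_id)

    def get(self):
        """Remove and return the next document according to the policy."""
        if self.policy == "priority":
            return heapq.heappop(self._heap)[2]  # lowest number first
        if self.policy == "lifo":
            return self._items.pop()
        return self._items.popleft()


def is_off_peak(hour, start=22, end=6):
    """True when the hour (0-23) falls in the assumed non-peak window."""
    return hour >= start or hour < end
```

A scheduler could poll `is_off_peak` and drain the queue only inside that window, which corresponds to deferring the more expensive multilingual pass to non-peak hours.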

[23]

FIG. 1C illustrates another method of indexing documents according to embodiments of the present invention. Referring to FIG. 1C, the method includes identifying the document (102) as multilingual according to a success metric (112) indicative of the effectiveness of single-language indexing of the document according to language-specific rules [block (106)]; and, in response to identifying the document as multilingual, queuing the document (102) for multilingual indexing [block (108)]. The method of FIG. 1C is similar to that of FIG. 1A, except that the single-language indexing step is performed separately and in advance. The method of FIG. 1C begins after the single-language indexing step has been performed.

[24]

Embodiments of the present invention include computer-implemented methods. In some embodiments, such methods may be performed on a single computer or system. Alternatively, portions of a method may be performed on two or more computers connected by one or more networks, such as a local area network (LAN), a wide area network (WAN), a wired or cellular telephone network, an intranet, or the Internet. The order of the method elements described in this specification does not limit the order in which the elements can be performed.

[25]

FIG. 2 is a block diagram of a computer used in embodiments of the present invention. The computer (202) includes at least one computer processor (254) as well as computer memory, including volatile random access memory (RAM) (204) and some form or forms of non-volatile computer memory (250), such as a hard disk drive, an optical disk drive, or electrically erasable programmable read-only memory (also known as 'EEPROM' or 'flash' memory). The computer memory is connected through a system bus (240) to the processor (254) and to other system components. Thus, the software modules are program instructions stored in the computer memory.

[26]

An operating system (210) is stored in the computer memory. The operating system (210) may be any suitable operating system, such as Windows, Mac OS X, UNIX, LINUX, or AIX from International Business Machines Corporation (IBM) (Armonk, NY). A network stack (212) is also stored in memory. The network stack (212) is a software implementation of cooperating computer networking protocols that facilitate network communication.

[27]

The computer (202) also includes one or more input/output interface adapters (256). The input/output interface adapters (256) implement user-oriented input/output through software drivers and computer hardware that control output to output devices (272), such as computer display screens, and user input from input devices (270), such as keyboards.

[28]

The computer (202) also includes a communications adapter (252) for implementing data communications with other devices (260). The communications adapter (252) implements the hardware level of data communications through which one computer sends data communications to another computer over a network.

[29]

An indexing module (206) is also stored in the computer memory. The indexing module (206) includes computer program instructions for indexing documents according to embodiments of the present invention. The indexing module (206) may be implemented as one or more sub-modules operating in separate software layers or in the same layer. Although shown as a module separate from the operating system (210), the indexing module (206) or one or more of its sub-modules may be incorporated as part of the operating system (210). In various embodiments, the indexing module (206) may be embodied in the software stack or in firmware.

[30]

The computer (202) shown in FIG. 2 is for illustration, not limitation. Embodiments of the present invention may be implemented as any viable computing device including logic and memory, including devices in which the logic is implemented as a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), and the like, or as software modules including computer program instructions executed on such a device, as will occur to one skilled in the art.

[31]

For further explanation, FIG. 3 is a data flow diagram illustrating an exemplary software architecture for indexing documents. The software architecture of FIG. 3 includes various software modules, including a document header parser (304), a document fragmenter (310), a language identifier (318), an indexer (306), and an index store (308). The document header parser (304) identifies the file format of the document (302) and reads and parses the header of the document (302). The document header parser (304) also reads language-specific information included in the document header, if such information is present. The document fragmenter (310) splits the document into fragments, dividing it into different fragments or sections based on information given in the document headers. The document fragmenter (310) provides a set of a plurality of smaller documents (312-316) as output. The language identifier (318) iterates over the set of smaller documents (312-316) and identifies the primary language of each document. The indexer (306) performs text-set encoding conversion, tokenization, and generation of content indexing information. The indexer (306) also generates language-sensitive indexing information according to the language-specific rules. During tokenization, the indexer (306) can be configured to count the tokens that are not successfully processed. Incrementing a count of unsuccessfully processed tokens requires almost no additional processing power, and the count can be used to determine the percentage of successfully processed words.
The index store (308) is storage for indexing information that can be used in searches. The index store (308) may include a list of parsed words and pointers to the documents containing them [for example, document (302)]. The index store (308) also includes language-wise index data to support language-aware features.
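One way to picture the index store is as two maps: a general word list pointing at documents, and a per-language map for language-aware lookups. The sketch below is a hypothetical in-memory model; the class `IndexStore`, its method names, and the document id `doc-302` are illustrative and are not taken from the specification.

```python
from collections import defaultdict


class IndexStore:
    """Hypothetical in-memory index store."""

    def __init__(self):
        # word -> ids of documents that contain it (the "pointers")
        self.postings = defaultdict(set)
        # language -> term -> ids of documents (language-wise index data)
        self.by_language = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, language, words, language_terms):
        """Record plain words and language-specific terms for one document."""
        for word in words:
            self.postings[word].add(doc_id)
        for term in language_terms:
            self.by_language[language][term].add(doc_id)

    def search(self, word, language=None):
        """Return matching document ids, optionally restricted to a language."""
        if language is None:
            return set(self.postings.get(word, ()))
        return set(self.by_language.get(language, {}).get(word, ()))


store = IndexStore()
store.add("doc-302", "en", ["index", "document"], ["index"])
```

Using `.get` in `search` avoids creating empty entries as a side effect of lookups, which keeps the store unchanged by queries.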

[32]

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code that comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in a block may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.

[33]

For further explanation, FIG. 4 is a flowchart illustrating a method of indexing documents according to one embodiment of the present invention. The order of the method elements described in this specification does not limit the order in which the elements can be performed. The method begins with a computer system (for example, an indexing server) loading a filter corresponding to the document type [block (402)]. Document types may include file formats such as those of word processing programs and other document programs, HTML (Hyper Text Markup Language), and PDF (Portable Document Format). Of the hundreds of different file formats that are useful to index, many include metadata associated with the language of the text in the document. Document formats such as HTML and XML (Extensible Markup Language) may include language tags that identify the language of the text.

[34]

After invoking the filter appropriate to the document type, the system identifies the primary language of the document [block (404)]. Identifying the language may be done by detecting and parsing the associated metadata for the document type. Although standards for document headers exist, many document file formats do not conform to these standards and so do not include language information. When language-related data is not available, the language identifier (318) may predict the specific language of the source document for use by the indexing process. The language identifier may predict the primary language based on a sample, or multiple samples, of the text. The system can apply the language rules of the primary language to the entire document. In the case of a compressed archive of documents, the language identified for the first document to be indexed may be assumed for all the remaining documents. Alternatively, for each document, the first n bytes may be sampled to identify the primary language. Once the primary language is identified, indexing can begin.
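The two-stage identification described above (metadata first, then sampling) might look like the following sketch. It reads the `lang` attribute of an HTML document with the standard `html.parser` module and otherwise falls back to counting stop words in the first n bytes. The stop-word tables, the default of 4096 bytes, and all function names are hypothetical, and stop-word counting is only a toy stand-in for a real language identifier.

```python
from html.parser import HTMLParser

# Toy stop-word tables; a real identifier would use a trained model.
STOP_WORDS = {
    "en": {"the", "and", "of"},
    "fr": {"le", "la", "et"},
}


class _LangAttributeParser(HTMLParser):
    """Collects the lang attribute of the first <html> tag, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def language_from_metadata(text):
    """Return the declared language of an HTML document, or None."""
    parser = _LangAttributeParser()
    parser.feed(text)
    return parser.lang


def language_from_sample(raw, n=4096):
    """Guess the language from stop words found in the first n bytes."""
    sample = raw[:n].decode("utf-8", errors="ignore").lower().split()
    scores = {lang: sum(word in words for word in sample)
              for lang, words in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def primary_language(raw, n=4096):
    """Prefer declared metadata; fall back to sampling the content."""
    text = raw.decode("utf-8", errors="ignore")
    return language_from_metadata(text) or language_from_sample(raw, n)
```

Note that a declared language wins even when the body text disagrees with it, which is exactly the situation the success metric is later meant to catch.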

[35]

The indexing process generates indices, or collections, that store a list of all the words contained in the text of all indexed documents, as well as index stems and the locations of the elements in the documents. In the method of FIG. 4, the indexer (306) begins by extracting the document metadata and text [block (406)]. The indexer splits the document into words or tokens according to the language-specific rules of the primary language [block (408)] and generates a word index by determining the occurrences of each token in the document [block (410)]. As the document is tokenized, the successfully processed document content (or, conversely, the document content that was not successfully processed) is tracked. The word index may be stored in a repository (412) containing the index data used by the search engine. The indexer (306) generates language-specific index data according to the language-specific rules of the primary language [block (414)]. This index data may also be stored in the repository (412).
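A minimal sketch of tokenizing while tracking success is shown below. It assumes, purely for illustration, that "successfully processed" means a whitespace-delimited token matches a toy pattern for English words; the names `build_word_index`, `TOKEN_PATTERN`, and `ENGLISH_WORD` are hypothetical.

```python
import re
from collections import defaultdict

TOKEN_PATTERN = re.compile(r"\S+")
# Toy stand-in for English tokenization rules: lowercase ASCII words,
# optionally with one apostrophe part (for example "don't").
ENGLISH_WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")


def build_word_index(text):
    """Return (index, ok, total): token -> positions, plus running counts."""
    index = defaultdict(list)
    ok = total = 0
    for position, match in enumerate(TOKEN_PATTERN.finditer(text)):
        total += 1
        candidate = match.group().lower().strip(".,;:!?\"()")
        if ENGLISH_WORD.fullmatch(candidate):
            index[candidate].append(position)  # record each occurrence
            ok += 1
        # tokens the rules cannot handle are counted in total but not in ok
    return dict(index), ok, total


index, ok, total = build_word_index("The cat saw the 猫.")
```

Here four of the five tokens are indexed, and the two counters are all that is needed to compute the success percentage afterwards.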

[36]

The method includes determining whether the percentage of successfully processed words is smaller than a threshold percentage [block (416)]. The threshold percentage value is a configurable value representing the nominal or minimum percentage of words that can be processed when a document is indexed according to the rules of the appropriate language. If the percentage of successfully processed words exceeds the threshold percentage (420), the document is ready for searching and consumes no further system processing resources. If the percentage of successfully processed words does not exceed the threshold percentage value (418), the system designates the document as multilingual [block (422)]. In addition, the system reverses (that is, undoes or rolls back) the single-language indexing [block (424)] and indexes the document with multilingual processing [block (426)]. Reversing the single-language indexing may be performed by triggering the removal of the word index and the language-specific index data from the repository (412).
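The decision and rollback in blocks (416) to (426) can be sketched as follows. The `Repository` class, its two tables, and the `finish_indexing` function are hypothetical simplifications: the queue is a plain list, and removal simply deletes every entry that points at the document.

```python
class Repository:
    """Hypothetical repository holding word index and language-specific data."""

    def __init__(self):
        self.word_index = {}     # word -> set of document ids
        self.language_data = {}  # (language, term) -> set of document ids

    def add(self, doc_id, language, words):
        """Store single-language index entries for one document."""
        for word in words:
            self.word_index.setdefault(word, set()).add(doc_id)
            self.language_data.setdefault((language, word), set()).add(doc_id)

    def remove_document(self, doc_id):
        """Undo single-language indexing by deleting the document's entries."""
        for table in (self.word_index, self.language_data):
            for key in list(table):
                table[key].discard(doc_id)
                if not table[key]:
                    del table[key]


def finish_indexing(repo, doc_id, percent_ok, threshold, multilingual_queue):
    """Keep the index if the threshold is exceeded; otherwise roll back and queue."""
    if percent_ok > threshold:
        return "searchable"
    repo.remove_document(doc_id)       # reverse the single-language indexing
    multilingual_queue.append(doc_id)  # hand over to multilingual indexing
    return "queued"
```

Entries shared with other documents survive the rollback, since only the reference to the rolled-back document is removed from each set.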

[37]

FIG. 5 is a data flow diagram illustrating a method of indexing documents according to one embodiment of the present invention. The method of FIG. 5 is similar to that of FIG. 1A, but further includes fragmenting the document (102) into a plurality of smaller documents (504) [block (502)]. After fragmenting, the indexer indexes each of the smaller documents (504) individually according to the language-specific rules (506) corresponding to that document [block (508)]. The system then continues, as described in detail above, by identifying the document (102) as multilingual according to the success metric (112) indicative of the effectiveness of the single-language indexing [block (106)] and, in response to identifying the document as multilingual, queuing the document (102) for multilingual indexing [block (108)].
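Fragmenting followed by per-fragment indexing could be sketched as below. Splitting on blank lines, the tiny vocabularies in `KNOWN_WORDS`, and the vocabulary-overlap language guess are all illustrative assumptions; the specification does not prescribe how fragments are delimited or how each fragment's language is chosen.

```python
import re

# Toy vocabularies standing in for per-language rule sets.
KNOWN_WORDS = {
    "en": {"the", "cat", "sat"},
    "fr": {"le", "chat", "dort"},
}


def fragment(text):
    """Split a document into smaller documents at blank lines."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


def guess_language(words):
    """Pick the language whose vocabulary matches the most words."""
    return max(KNOWN_WORDS, key=lambda lang: sum(w in KNOWN_WORDS[lang] for w in words))


def index_fragments(text):
    """Index each fragment on its own; return (language, ok, total) per fragment."""
    results = []
    for piece in fragment(text):
        words = piece.lower().split()
        language = guess_language(words)
        ok = sum(word in KNOWN_WORDS[language] for word in words)
        results.append((language, ok, len(words)))
    return results


results = index_fragments("the cat sat\n\nle chat dort")
```

Because each fragment is scored against its own language, a document that mixes languages section by section can still reach a high success percentage per fragment.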

[38]

It will be understood that modifications and changes may be made to the general outline of the present invention described in this specification. Such changes may include methods, systems, and programs for tokenization, [...], language recognition, queuing, and the like. Such changes, to the extent that they are equivalent to the claims, are also encompassed by this patent.

[39]

102: document
104: index the document according to language-specific rules of a single language
106: identify the document as multilingual according to a success metric indicative of the effectiveness of the single-language indexing
108: in response to identifying the document as multilingual, queue the document for multilingual indexing
110: language-specific rules of a single language
112: success metric



[40]

A document to be indexed is initially indexed in dependence upon language-specific rules of a single language. A success metric is used to assess the effectiveness of the single language indexing. If a threshold level of success is not attained, the document is identified as multi-lingual. In response to identifying the document as multi-lingual, the document is queued for multi-lingual indexing. A document may be fragmented into a number of smaller documents, each of which is indexed separately.



A computer-implemented method of indexing documents, the method comprising: indexing a document written in one or more natural languages, wherein the indexing comprises applying language-specific rules of one of the one or more natural languages to content of the document, and wherein the language-specific rules produce a number of tokens from the content of the document; determining a success metric based on the number of tokens successfully produced by the language-specific rules; comparing the success metric with a threshold value, wherein the threshold value specifies a predetermined amount of content in a document that designates a multi-lingual document; identifying the document as a multi-lingual document based on the comparing step; and, in response to identifying the document as a multi-lingual document, queuing the document for multilingual indexing.

The computer-implemented method according to Claim 1, further comprising fragmenting the document into a plurality of smaller documents and indexing each of the smaller documents individually and separately.

The computer-implemented method according to Claim 1, wherein the success metric includes a percentage of the content in the document.

The computer-implemented method according to Claim 1, wherein the language-specific rules are selected from a group of rules including stemming rules, part-of-speech tagging rules, accent-sensitive rules, normalization rules, and stop word rules.

The computer-implemented method according to Claim 1, wherein queuing the document for multilingual indexing includes queuing the document for immediate processing.

The computer-implemented method according to Claim 1, wherein queuing the document for multilingual indexing includes queuing the document for processing during non-peak processing hours.

The computer-implemented method according to Claim 1, further comprising, in response to identifying the document as a multi-lingual document, reversing the indexing of the document.

A computer-readable storage medium storing computer-readable program code configured to perform a computer-implemented method of indexing documents, the method comprising: indexing a document written in one or more natural languages, wherein the indexing comprises applying language-specific rules of one of the one or more natural languages to content of the document, and wherein the language-specific rules produce a number of tokens from the content of the document; determining a success metric based on the number of tokens successfully produced by the language-specific rules; comparing the success metric with a threshold value, wherein the threshold value specifies a predetermined amount of content in a document that designates a multi-lingual document; identifying the document as a multi-lingual document based on the comparing step; and, in response to identifying the document as a multi-lingual document, queuing the document for multilingual indexing.

The computer-readable storage medium according to Claim 8, wherein the method further comprises fragmenting the document into a plurality of smaller documents and indexing each of the smaller documents individually and separately.

A system configured to perform a method of indexing documents, the system comprising: a processor; and computer memory operably coupled to the processor, the method comprising: indexing a document written in one or more natural languages, wherein the indexing comprises applying language-specific rules of one of the one or more natural languages to content of the document, and wherein the language-specific rules produce a number of tokens from the content of the document; determining a success metric based on the number of tokens successfully produced by the language-specific rules; comparing the success metric with a threshold value, wherein the threshold value specifies a predetermined amount of content in a document that designates a multi-lingual document; identifying the document as a multi-lingual document based on the comparing step; and, in response to identifying the document as a multi-lingual document, queuing the document for multilingual indexing.