SYSTEM AND METHOD FOR FUNGAL GENOME SEQUENCE ASSEMBLY AND EVALUATION

Publication date: 15-05-2017
Publication number: KR1020170053147A
Application number: 01-16-102047444
Application date: 07-11-2016

[1]

The present invention relates to a system that performs genome sequence assembly, an advanced sequence analysis method that assembles the short DNA sequences (sequence reads) obtained from a genome.

[2]

Specifically, it relates to an automated genome assembly pipeline software system that carries out a series of procedures, namely read preprocessing, genome assembly, evaluation, and post-processing, on fungal genome reads generated by a DNA sequencer of the Illumina platform.

[3]

[4]

DNA sequence analysis provides the gene and genome sequence information on which modern life science research depends. As demand for sequence data in life science research has grown, sequencing technology has advanced rapidly. With the development of next-generation sequencing, the cost of sequencing has fallen dramatically. In 2001, sequencing a genome of human size (about 3 Gbp) with earlier methods cost more than 1 billion dollars, whereas in 2015 it became possible for about 1,300 dollars. In only 15 years, the cost of producing the same amount of data fell to about 1/770,000.

[5]

In addition, the fields in which sequencing is applied have become more varied. Examples include metagenomic analysis of microbial populations from environmental samples, analysis of structural variants between genomes, gene expression analysis, and transcript variant analysis.

[6]

However, the reads produced by DNA sequencing are far shorter than the genome being sequenced, so the entire genome sequence cannot be obtained in a single sequencing run. Even on the PacBio platform, which currently produces the longest reads, the maximum read length is about 20 Kbp, which is only about 1/600 of the genome size of yeast (12.07 Mbp), one of the simplest fungi. The Illumina platform, one of the most widely used DNA sequencing platforms, produces reads of at most 300 bp.

[7]

Genome assembly is the series of computational procedures by which the short reads produced by a DNA sequencer are assembled to restore the original genome sequence.

[8]

Genome assembly algorithms are designed on the assumption that two reads sharing a substantial stretch of similar sequence derive from the same position in the genome, and they fall broadly into three classes.

[9]

The Greedy algorithm immediately joins the two reads that overlap the most. Because the algorithm does not handle the entire genome at once but joins reads part by part, it is difficult to make use of mate-pair read information when joining. Phrap, TIGR Assembler, and VCAKE are representative open source software using the Greedy algorithm.
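
The greedy strategy described above can be illustrated with a minimal sketch. It is not the implementation used by Phrap, TIGR Assembler, or VCAKE; the function names, the exact-match overlap test, and the `min_len` parameter are assumptions made for illustration only.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Hypothetical helper: exact overlaps only, no sequencing errors,
    no reverse complements, no mate-pair information.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return reads  # nothing left to merge
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]


contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"])
```

Each merge is decided locally from the current best overlap, which is why repeats longer than the overlap can mislead this strategy.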

[10]

The overlap-layout-consensus algorithm first compares all reads pairwise to find the pairs that overlap, and builds a graph from them. The nodes of the graph are the individual reads, and an edge connects each pair of reads with similar sequence. Unlike the Greedy algorithm, it considers the entire genome at once. Celera Assembler is representative open source software using this paradigm.
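
As a rough sketch of the overlap stage only, the graph described above can be built as follows. The layout and consensus stages are omitted, overlaps are exact suffix-prefix matches, and the names `suffix_prefix_overlap` and `build_overlap_graph` are hypothetical rather than taken from Celera Assembler.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len):
    """Longest exact overlap between the end of a and the start of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Nodes are reads; a directed edge a -> b carries the overlap length."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):  # all ordered pairs of distinct reads
        n = suffix_prefix_overlap(a, b, min_len)
        if n:
            graph[a][b] = n
    return graph


graph = build_overlap_graph(["ATGGCGT", "GCGTACC", "TACCTTA"])
```

Because every pair is compared before any read is joined, the graph captures all overlaps at once, at a cost that grows quadratically with the number of reads in this naive form.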

[11]

The De Bruijn graph algorithm is graph-based like overlap-layout-consensus, but differs in that all reads are broken into k-mers. In a De Bruijn graph, a k-mer forms an edge between two nodes only when the (k-1)-length sequences match exactly, so correction of sequencing errors is a main factor determining the quality of the final assembly. Open source software using it includes Velvet, SOAPdenovo, and ALLPATHS.
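
A minimal sketch of the graph construction may make the role of sequencing errors clearer. It assumes error handling, reverse complements, and graph traversal are out of scope; the function name `de_bruijn` is hypothetical and does not correspond to any of the tools named above.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix node -> suffix node
    return dict(graph)


# The second read differs from the first by one base (G -> A at position 4),
# which creates a branch at node "TG".
g = de_bruijn(["ATGGC", "ATGAC"], 3)
```

A single erroneous base adds up to k spurious k-mers and therefore extra branches, which is why read correction before assembly matters for this class of algorithm.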

[12]

Because various algorithms can be applied to genome assembly, which algorithm produces the best assembly can be gauged indirectly through the Assemblathon and GAGE assembly contests. In the Assemblathon contest, a total of 21 teams assembled bird, fish, and snake genomes and were assessed on how closely each assembly approached the original genome sequence. ALLPATHS is an algorithm verified in both contests to produce excellent genome assemblies.

[13]

However, the optimal assembly algorithm differs depending on the type of sequencing library and on structural features of the genome such as the distribution of repetitive sequences. This problem can be solved by assembling the genome with each of the available software packages, evaluating each assembly, and selecting the optimal one. However, such a procedure requires considerable learning time from users who are unfamiliar with computing environments.

[14]

A software pipeline is therefore required that applies various algorithms simultaneously to the given genome read sequences, evaluates the results, and produces the optimal assembly.

[15]

Fungi have larger and more complex genomes than bacteria, and because fungal strains have a variety of applications, such as secondary metabolite production and environmental restoration, their genomes are increasingly being analyzed.

[16]

The US National Center for Biotechnology Information (NCBI) has released about 700 fungal genomes, and MycoCosm (http://genome.jgi.doe.gov/programs/fungi/index.jsf), provided by the Joint Genome Institute, has released about 500 fungal genomes. The pace of fungal genome analysis and data release is increasing every year.

[17]

Fungal genomes vary in size from 2.1 Mbp (parasitic fungi belonging to Microsporidia) to 215 Mbp (Uromyces viciae-fabae), with an average of about 35 Mbp. This is larger than bacterial genomes such as that of E. coli (4.6 Mbp) but much smaller than primate genomes such as that of human (3 Gbp).

[18]

[19]

Previously disclosed assembly pipelines are as follows.

[20]

1) Orione: a web-based framework for NGS analysis in microbiology

[21]

Orione is an integrated bioinformatics tool for microbial-scale genomes with which a user can directly build and run pipelines from tools for resequencing, genome assembly, scaffolding, RNA-seq, gene annotation, metagenome analysis, and metatranscriptome analysis. As de novo assembly software, Orione provides Velvet, ABySS, SPAdes, SSAKE, and Edena. It provides a platform on which various bioinformatics tools can be used, but offers no functionality of its own beyond that of the tools it hosts.

[22]

[23]

2) A5, A5-miseq: An Integrated Pipeline for de Novo Assembly of Microbial Genomes

[24]

This pipeline takes Illumina sequences as input and performs genome assembly for microbial-scale genomes. It consists of five steps in total: (1) read preprocessing, (2) contig assembly, (3) first scaffolding, (4) assembly correction, and (5) final scaffolding. The only assembly software it uses is IDBA-UD (Figure 1).

[25]

[26]

3) CG-Pipeline: A computational genomics pipeline for prokaryotic sequencing projects

[27]

This is a pipeline that automatically performs genome assembly, gene prediction, and functional annotation for prokaryotic genomes. Unlike other sequence analysis platforms, it supports 454 reads. Genome assembly is performed with Newbler and with AMOScmp, and the two assemblies are merged using Minimus (Figure 2).

[28]

[29]

4) DSP (denovo_solid_pipeline): a semi automated pipeline for short genome assembly using SOLiD sequencing data

[30]

This is a genome assembly pipeline for the SOLiD platform whose main functions include read correction using dynamic programming and an optimization algorithm.

[31]

[32]

5) iMetAMOS: Automated ensemble assembly and validation of microbial genome

[33]

iMetAMOS is a pipeline that runs various open source software packages on bacterial genomes and selects the optimal assembly. Its main functions are automatic parameter selection and scoring of the quality of each assembly using various metrics. It supports ABySS, CABOG, IDBA-UD, MaSuRCA, MetaVelvet, MIRA, Ray/RayMeta, SGA, SOAPdenovo2, SPAdes, SparseAssembler, Velvet, and Velvet-SC (Figure 3).

[34]

[35]

6) PBcR pipeline: Hybrid error correction and de novo assembly of single-molecule sequencing reads

[36]

This pipeline provides a sequencing error correction algorithm and de novo assembly for reads produced on the PacBio platform. It has been tested on phage, prokaryotic, and eukaryotic genome assemblies (Figure 4).

[37]

[38]

7) RAMPART: a workflow management system for de novo genome assembly.

[39]

RAMPART is an open source assembly workflow system in which the user can directly configure the workflow. The user can control the parameters needed for each software package, and unlike other systems, it can use a shared-memory system or a computer cluster to handle large eukaryotic genomes (Figure 5).

[40]

[41]

8) VirAmp: a galaxy-based viral genome assembly pipeline.

[42]

This is a pipeline for viral genome assembly that sequentially performs genome assembly and subsequent analyses (Figure 6).

[43]

[44]

The present invention is intended to build a software pipeline that generates a group of candidate assemblies from fungal genome reads produced by a DNA sequencer by simultaneously using various open source assembly software packages, evaluates and compares each assembly, and selects the final assembly.

[45]

[46]

The present invention provides a fungal genome sequence assembly and evaluation system comprising: a collection unit that collects and stores the sequences of the two reads obtained per fragment by paired-end sequencing of the collected sequence fragments;

[47]

a genome assembly unit that assembles one or more candidate genomes with one or more genome assembly software packages; and

[48]

an evaluation unit that evaluates the candidate genomes and selects the optimal genome,

[49]

wherein said genome assembly unit includes an ALLPATHS pipeline unit and a Non-ALLPATHS pipeline unit.

[50]

In addition, the present invention provides a fungal genome sequence assembly and evaluation method comprising: (a) collecting the sequences of the two reads obtained per fragment by paired-end sequencing of the collected sequence fragments;

[51]

(b) assembling one or more candidate genomes with one or more genome assembly software packages; and

[52]

(c) evaluating the candidate genomes to select the optimal genome,

[53]

wherein an ALLPATHS software pipeline is used when the sum of the two read sequences collected from one fragment is longer than the sequence of the fragment, and

[54]

a Non-ALLPATHS software pipeline is used when the sum of the two read sequences collected from one fragment is shorter than the sequence of the fragment.

[55]

[56]

The fungal genome sequence assembly and evaluation method according to the present invention has the following effects.

[57]

First, without determining in advance which algorithm is suitable for de novo assembly of a given genome, various algorithms can be applied systematically through a pipeline that uses multiple processors.

[58]

Second, even users inexperienced in computing can obtain a quality assembly for studies that use sequence information.

[59]

Third, the pipeline not only calculates various evaluation criteria for each assembly but also provides a scoring algorithm for selecting the optimal assembly, so that the optimal assembly can be obtained easily without specialist background knowledge.

[60]

Fourth, using the pipeline minimizes the labor required compared with running each software package independently.

[61]

Fifth, through post-processing of the optimal assembly, such as identification of foreign sequences and mitochondrial genome assembly, the pipeline can deliver an assembly that is ready for use, for example in searches against various sequence databases.

[62]

[63]

Figure 1 is a schematic diagram of the A5 pipeline.
Figure 2 is a schematic diagram of CG-Pipeline.
Figure 3 is a schematic diagram of iMetAMOS.
Figure 4 is a schematic diagram of the PBcR pipeline.
Figure 5 is a schematic diagram of RAMPART.
Figure 6 is a schematic diagram of VirAmp.
Figure 7 is a schematic diagram showing each stage of the fungal genome sequence assembly and evaluation method according to the present invention.
Figure 8 shows the list of open source software required by the present invention.
Figure 9 is a schematic diagram of the read preprocessing step according to the present invention.
Figure 10 is a schematic diagram of the genome assembly step according to the present invention.
Figure 11 is a schematic diagram of the genome assembly evaluation step according to the present invention.
Figure 12 is a schematic diagram of the genome assembly post-processing step according to the present invention.
Figure 13 is a schematic diagram of the paired-end reads used in the present invention.
Figure 14 is a schematic diagram of the paired-end library types used in the present invention.
Figure 15 shows quality distribution graphs generated by FastQC.
Figure 16 shows per-tile quality heat maps.
Figure 17 shows per-read average quality graphs.
Figure 18 shows graphs of the distribution of A, T, G, and C per base position.
Figure 19 shows per-read GC content graphs.
Figure 20 shows graphs of 'N' content per base position.
Figure 21 is a read length distribution graph.
Figure 22 shows read duplication graphs.
Figure 23 is an adapter content graph.
Figure 24 is a k-mer content graph.
Figure 25 shows the per-base quality distribution of the C. cucurbitarum genome reads.
Figure 26 shows the per-read average quality distribution of the C. cucurbitarum genome reads.
Figure 27 is a graph of A, T, G, and C content per base position of the C. cucurbitarum genome reads.
Figure 28 is a quality correction graph of the C. cucurbitarum genome reads.
Figure 29 is a k-mer histogram graph.
Figure 30 is a graph of cumulative contig length and GC content for the optimal C. cucurbitarum assembly.
Figure 31 shows the identification of foreign DNA contamination in the C. cucurbitarum assembly, as a graph of the GC content and coverage of each scaffold.
Figure 32 shows the 28S rDNA of the C. cucurbitarum assembly.

[64]

Hereinafter, the present invention is described in detail.

[65]

[66]

The present invention provides a fungal genome sequence assembly and evaluation system.

[67]

Said system comprises a collection unit that collects and stores the sequences of the two reads obtained per fragment by paired-end sequencing of the collected sequence fragments;

[68]

a genome assembly unit that assembles one or more candidate genomes with one or more genome assembly software packages; and

[69]

a genome assembly evaluation unit that evaluates the candidate genomes and selects the optimal genome.

[70]

In addition, the system of the present invention may further comprise a read preprocessing unit that preprocesses the collected read sequences and a genome assembly post-processing unit that post-processes the selected optimal genome.

[71]

In one embodiment, said system may consist of the collection unit, the read preprocessing unit, the genome assembly unit, the genome assembly evaluation unit, and the genome assembly post-processing unit, and these components may be connected in a single software pipeline.

[72]

The collection unit collects and stores the sequences of the two reads obtained per fragment by paired-end sequencing of the collected sequence fragments (inserts). The sequences may be extracted from fungi.

[73]

The read preprocessing unit preprocesses the read sequences collected by said collection unit. Said read preprocessing unit may perform at least one selected from the group consisting of read quality summarization, adapter sequence removal, read quality correction, and k-mer frequency calculation.

[74]

The genome assembly unit may assemble one or more candidate genomes with one or more genome assembly software packages.

[75]

Said genome assembly unit may include an ALLPATHS pipeline unit and a Non-ALLPATHS pipeline unit. Assembly may be performed by the ALLPATHS pipeline unit when the sum of the two read sequences collected from one fragment is longer than the sequence of the fragment, and by the Non-ALLPATHS pipeline unit when the sum of the two read sequences collected from one fragment is shorter than the sequence of the fragment.

[76]

The genome assembly evaluation unit evaluates the candidate genomes assembled by said genome assembly unit and selects the optimal genome. Said evaluation unit may evaluate the candidate genomes by performing at least one selected from the group consisting of assembly likelihood evaluation, N50 value evaluation, scaffold number evaluation, and maximum scaffold length evaluation, and may select the optimal genome according to the sum of the ranks obtained in each evaluation.
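
The rank-sum selection described above can be sketched as follows. The source does not state the direction of each criterion or how ties are handled, so this sketch assumes that a higher likelihood, higher N50, fewer scaffolds, and a longer maximum scaffold are better, that tied values share a rank, and that the lowest rank sum wins. All function names and the candidate values are hypothetical.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def rank(values, higher_is_better):
    """Dense ranks starting at 1 (1 = best); ties share a rank."""
    order = sorted(set(values), reverse=higher_is_better)
    return [order.index(v) + 1 for v in values]


def select_best(candidates):
    """candidates maps a name to {'likelihood': float, 'scaffolds': [lengths]}."""
    names = list(candidates)
    metrics = [
        ([candidates[n]["likelihood"] for n in names], True),
        ([n50(candidates[n]["scaffolds"]) for n in names], True),
        ([len(candidates[n]["scaffolds"]) for n in names], False),  # fewer is better
        ([max(candidates[n]["scaffolds"]) for n in names], True),
    ]
    totals = dict.fromkeys(names, 0)
    for values, higher_is_better in metrics:
        for name, r in zip(names, rank(values, higher_is_better)):
            totals[name] += r
    return min(names, key=totals.get), totals


# Made-up candidate assemblies for illustration only.
candidates = {
    "A": {"likelihood": -1200.0, "scaffolds": [500, 300, 200]},
    "B": {"likelihood": -1100.0, "scaffolds": [700, 200, 50, 50]},
    "C": {"likelihood": -1500.0, "scaffolds": [400, 400, 100, 100]},
}
best, totals = select_best(candidates)
```

Summing ranks rather than raw values keeps criteria with very different scales, such as a log-likelihood and a length in base pairs, from dominating one another.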

[77]

The genome assembly post-processing unit post-processes the optimal genome selected by the evaluation unit. Said post-processing unit may perform at least one selected from the group consisting of identifying contamination by foreign DNA, confirming 28S rDNA sequences, assembling the mitochondrial genome, and uploading to a web server.

[78]

[79]

In addition, the present invention provides a fungal genome sequence assembly and evaluation method.

[80]

Said method comprises the steps of: (A) collecting the sequences of the two reads obtained per fragment by paired-end sequencing of the collected sequence fragments;

[81]

(B) assembling one or more candidate genomes with one or more genome assembly software packages; and

[82]

(C) evaluating the candidate genomes to select the optimal genome.

[83]

Said method may further comprise a step of preprocessing the collected read sequences and a step of post-processing the optimal genome.

[84]

Hereinafter, specific embodiments of the invention are described in detail with reference to the attached drawings.

[85]

In one embodiment, as shown in Figure 7, the fungal genome sequence assembly and evaluation method may consist of: (a) collecting the sequences of the two reads obtained per fragment by paired-end sequencing of the collected sequence fragments (hereinafter, DNA sequencer step, 100);

[86]

(b) preprocessing the collected read sequences (hereinafter, read preprocessing step, 300);

[87]

(c) assembling one or more candidate genomes with one or more genome assembly software packages (hereinafter, genome assembly step, 400);

[88]

(d) evaluating the assembled candidate genomes to select the optimal genome (hereinafter, genome assembly evaluation step, 500); and

[89]

(e) post-processing the optimal genome (hereinafter, genome assembly post-processing step, 600).

[90]

In the present invention, as shown in Figure 8, the stages are connected in a single software pipeline, so that preprocessing of the sequence reads generated by the DNA sequencer, genome assembly, evaluation, and post-processing can be automated. That is, the present invention can provide a software pipeline that applies several algorithms simultaneously to genome read sequences, evaluates them, and produces the optimal assembly. This minimizes the labor required compared with running each software package independently, and a single run of the pipeline yields a variety of information.

[91]

Step (a) of the present invention is the DNA sequencer step (100), which generates and collects sequence reads. The sequences collected in said step may be extracted from fungi.

[92]

Step (a) can generate different types of reads depending on the platform, and the present invention can use reads from the Illumina platform.

[93]

Said Illumina platform can provide read sequences of up to 300 bp from each of the two ends of a given fragment. This is called paired-end sequencing. The DNA fragment that is read is called an insert, and sequencing it from both ends provides read sequences suitable for genome assembly.

[94]

Figure 13 is a schematic diagram of the paired-end reads used in the present invention. A fragment obtained by breaking genomic DNA is called an insert, and the Illumina platform reads sequence of a certain length from each end of the insert. The sequence that is read is called a read, and the data obtained for genome analysis consist of read 1, read 2, and the information on the pair (pair information), such as the inner distance between the reads and the insert.

[95]

Step (b) of the present invention is the read preprocessing step (300), which preprocesses the read sequences collected in said step (a). Said step can be carried out in the read preprocessing unit.

[96]

As shown in Figure 9, the read preprocessing step (300) may perform at least one selected from the group consisting of a read quality summarization step (301), an adapter sequence removal step (302), a read quality correction step (303), and a k-mer frequency calculation step (304). Said step uses at least one software package selected from the group consisting of FastQC (201), Cutadapt (202), HTQC (203), and Jellyfish (204), and may be built as a pipeline of said software (hereinafter, read preprocessing pipeline).

[97]

The read quality summarization step (301) uses FastQC (201) software. Said step analyzes various statistics to identify the distribution of the reads. Specifically, it presents the quality and composition of the reads generated by the DNA sequencer as summary information in graphs and tables.

[98]

In the present invention, FastQC (201) can be used to check the per-base quality distribution graph, the per-tile quality heat map, the per-read average quality graph, the per-position A, T, G, C distribution graph, the per-read GC content graph, the per-position 'N' content graph, the read length distribution graph, the read duplication graph, the adapter content graph, and the k-mer content graph.

[99]

The quality distribution graph shows the distribution of base quality at each position of the reads. The x axis is the base position in the read, and the y axis is the quality score. Every base of each read initially generated by the Illumina platform has its own quality score. The quality score produced by the sequencer takes a value between 0 and 40, and the higher the value, the more reliable the base. Bases with a quality score below 20 are generally classified as unreliable and are removed in the subsequent quality correction.

[100]

Figure 15 shows quality distribution graphs generated by FastQC. The upper graph is an example of good-quality reads, and the lower graph is an example of low-quality reads.
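
For orientation, the per-position quality summary can be reproduced in a few lines. This sketch is not FastQC's code; it assumes quality strings in the Phred+33 FASTQ encoding, and the function names are hypothetical. The relation between a Phred score Q and the error probability, P = 10^(-Q/10), is the standard definition, so a score of 20 corresponds to a 1% chance that the base call is wrong.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


def mean_quality_per_position(quality_strings, offset=33):
    """Average quality score at each read position across all reads."""
    longest = max(len(q) for q in quality_strings)
    sums = [0] * longest
    counts = [0] * longest
    for q in quality_strings:
        for i, score in enumerate(phred_scores(q, offset)):
            sums[i] += score
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


means = mean_quality_per_position(["II#", "I5#"])
low_positions = [i for i, m in enumerate(means) if m < 20]
```

In this toy input the last position averages a score of 2, so it would be flagged under the threshold of 20 mentioned above.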

[101]

The per-tile quality heat map is a graph showing the average quality of the reads produced in each tile. The lower the quality of a tile, the redder it appears, and the higher the quality, the bluer. The Illumina platform produces a large quantity of reads in each tile, and when there is a problem with a particular tile, the reads produced in that tile generally have low quality. Using said graph, tiles with errors can be excluded from genome assembly.

[102]

Figure 16 shows per-tile quality heat maps. The upper graph shows high quality in all tiles, while the lower graph shows low quality in several tiles.

[103]

The per-read average quality graph shows the frequency of the average of the quality scores of all bases in each read. Said graph gives an overview of the overall quality of the reads.

[104]

Figure 17 shows per-read average quality graphs. In the upper graph, substantially all reads have a high quality between 36 and 38, while the lower graph shows that many reads have an average quality of about 20.

[105]

The per-position A, T, G, C distribution graph shows the percentage frequency of A, T, G, and C at each base position. A DNA sequence consists of four bases, adenine (A), guanine (G), cytosine (C), and thymine (T), and if the reads are sampled uniformly from the genome, the frequencies of A, T, G, and C are expected to be constant across all read positions. If a particular adapter sequence is present, or if sequencing is biased toward a specific part of the genome, the base frequencies become non-uniform across positions.

[106]

Figure 18 shows per-position A, T, G, C distribution graphs. When sequencing is not biased within the genome sequence, each base occupies a constant proportion at every read position, as in the upper graph; when it is biased, the proportions are not constant, as in the lower graph.
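
A minimal sketch of how such a per-position composition table can be computed is given below. It is an illustration rather than FastQC's implementation, and the function name `base_composition` is hypothetical. Undetermined bases ('N') are tallied in the same pass, so the same table also yields the per-position 'N' content.

```python
from collections import Counter


def base_composition(reads):
    """Percentage of A, T, G, C and N at each read position."""
    longest = max(len(r) for r in reads)
    table = []
    for pos in range(longest):
        # Only reads long enough to have this position contribute to it.
        counts = Counter(r[pos] for r in reads if len(r) > pos)
        total = sum(counts.values())
        table.append({base: 100.0 * counts[base] / total for base in "ATGCN"})
    return table


table = base_composition(["ATGC", "ATGA", "TTGN", "ATCC"])
```

With uniformly sampled reads the rows of this table would be close to identical; a position whose row departs sharply from the others points to adapter sequence or biased sampling.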

[107]

The per-read GC content graph shows the frequency of the GC content calculated for each read. Every species has a characteristic GC content, so comparing the observed distribution with the theoretical GC content distribution makes it possible to judge whether the genome was sequenced properly.

[108]

Figure 19 shows per-read GC content graphs, in which the x axis is the average GC content and the y axis is the frequency. As shown in Figure 19, the blue line is the theoretical frequency curve, and if the observed distribution does not depart from its shape, sequencing can be judged to have been performed properly.
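
The per-read GC content distribution can be sketched as follows. This is an illustration under simple assumptions, not FastQC's code: 'N' bases are ignored, percentages are rounded to whole numbers for binning, and the theoretical curve is not fitted. Function names are hypothetical.

```python
from collections import Counter


def gc_percent(read):
    """GC content of one read as a percentage of its called (non-N) bases."""
    called = [base for base in read.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return 100.0 * sum(base in "GC" for base in called) / len(called)


def gc_histogram(reads):
    """Number of reads at each whole-percent GC content."""
    return Counter(round(gc_percent(read)) for read in reads)


hist = gc_histogram(["ATGC", "GGCC", "AATT", "GCAT"])
```

A second peak in such a histogram, away from the expected GC content of the species, is one common sign of contaminating DNA.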

[109]

The per-position 'N' content graph shows the number of 'N' calls at each base position as a percentage. The DNA sequencer outputs 'N' when it cannot determine whether a base is A, T, G, or C, so a high 'N' content means that less sequence information is available.

[110]

Figure 20 shows per-position 'N' content graphs, which plot the content of 'N', the character recorded when the base at a position could not be determined during sequencing. The lower graph shows that about 5% 'N' is included in the middle of the reads.

[111]

The read length distribution graph shows the frequency of each read length. Because the Illumina platform generates reads of a fixed length, a single peak appears.

[112]

Figure 21 is a read length distribution graph. It confirms that reads produced by the Illumina platform give a single sharp peak.

[113]

The read duplication graph shows the frequency with which reads of identical sequence occur, which makes it possible to identify specific sequences that are abnormally over-represented. The frequency is high when sampling is biased so that many duplicate reads are produced, or when adapter dimers are present. Reads that are duplicated more often than expected are summarized in a table.

[114]

Figure 22 shows read duplication graphs. The lower graph confirms that reads with the same sequence occur 5,000 times or more.
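
Counting duplicated reads is straightforward to sketch. The version below is an illustration only, not FastQC's method: it compares whole reads exactly, and the `min_fraction` threshold for reporting a sequence is an arbitrary assumption. Function names are hypothetical.

```python
from collections import Counter


def duplication_levels(reads):
    """Map duplication level (copies of a sequence) to number of distinct sequences."""
    copies = Counter(reads)
    return Counter(copies.values())


def overrepresented(reads, min_fraction=0.001):
    """Sequences making up at least min_fraction of all reads, most frequent first."""
    copies = Counter(reads)
    total = len(reads)
    return [(seq, n) for seq, n in copies.most_common() if n / total >= min_fraction]


reads = ["AAA", "AAA", "AAA", "CCC", "GGG", "GGG"]
levels = duplication_levels(reads)
frequent = overrepresented(reads, min_fraction=0.5)
```

Sequences reported this way can then be compared against known adapter sequences to check for adapter dimers.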

[115]

The adapter content graph shows, as a percentage, the reads that contain adapter sequence at each read position.

[116]

Figure 23 is an adapter content graph. When adapter sequences are present, it shows the content of reads containing them according to read position. The color differs according to the type of adapter.

[117]

In addition, the k-mer content graph breaks all reads into k-mers of length 7 and presents, in a table, the k-mers whose frequency exceeds the calculated expected value. The graph shows at which read positions such k-mers occur.

[118]

Figure 24 is a k-mer content graph. It shows the positional distribution of specific k-mers that occur far more often than expected.

[119]

The adapter sequence removal step (302) can use Cutadapt (202) software. Said step cuts out of the reads any sequence identical to the adapter sequences of the Illumina TruSeq kit used to prepare the library. Adapter sequences are included at the end of a read when the sequence to be read is shorter than the read length. Other types of adapter sequences can also be applied as needed.
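
The idea of 3' adapter removal can be sketched as below. This is not Cutadapt's algorithm, which tolerates mismatches and indels; the sketch matches exactly, and the function name and `min_overlap` parameter are hypothetical. The adapter string in the example is used purely as sample input and should be replaced with the sequence documented for the library kit actually used.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adapter: a full copy anywhere, or a partial copy at the read end."""
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]  # full adapter found; drop it and everything after it
    # Otherwise look for the longest adapter prefix that ends the read.
    for n in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read


ADAPTER = "AGATCGGAAGAGC"  # example input only, not taken from the source text

full = trim_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", ADAPTER)
clean = trim_adapter("ACGTACGT", ADAPTER)
```

The partial-match case is the one that arises when the insert is only slightly shorter than the read length, so that only the first few adapter bases are sequenced.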

[120]

The read quality correction step (303) can use HTQC (203) software. Said step trims low-quality bases from the reads and filters out entire reads that become shorter than a given length as a result. That is, only the good-quality portions are used for assembly.
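
Trimming followed by length filtering can be sketched as follows. HTQC's actual trimming rules are not described in the text, so this sketch assumes the simplest variant: bases are removed from the 3' end while their quality is below a threshold, and reads shorter than a minimum length afterwards are discarded. The threshold values and function names are assumptions.

```python
def trim_low_quality(seq, quals, min_quality=20):
    """Cut bases from the 3' end while their quality score is below min_quality."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


def filter_reads(records, min_quality=20, min_length=4):
    """Trim each (sequence, scores) record and drop reads that become too short."""
    kept = []
    for seq, quals in records:
        trimmed_seq, trimmed_quals = trim_low_quality(seq, quals, min_quality)
        if len(trimmed_seq) >= min_length:
            kept.append((trimmed_seq, trimmed_quals))
    return kept


records = [
    ("ACGTAC", [38, 37, 36, 30, 12, 8]),  # loses its last two bases
    ("ACG", [30, 10, 5]),                 # too short after trimming, dropped
]
kept = filter_reads(records)
```

For paired-end data, a real pipeline would also need to keep read 1 and read 2 synchronized when one mate is dropped; that bookkeeping is left out here.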

[121]

In addition, the k-mer frequency calculation step (304) can use Jellyfish (204) software. Said step calculates k-mer frequencies from the reads corrected in the read quality correction step (303) in order to predict the size or ploidy of the genome.

[122]

Specifically, k-mers with k of 17 and 19 are extracted from all reads and the frequency of each k-mer is calculated. The result is output as a table and visualized as a graph by an in-house script in the pipeline. From the frequency distribution, the open source script estimate_genome_size.pl is used to estimate the genome size and the sequence coverage. By comparing the estimated genome size with that of fungi closely related to the species being analyzed, the quality of the sequencing can be assessed in advance, before genome assembly.

[123]

That is, this step predicts information about the quality of the resulting genome assembly before the genome assembly is started.

[124]

In the present invention, step (c) is the genome assembly step (400), in which one or more genome assembly software programs assemble one or more candidate genomes.

[125]

As represented in Figure 14, the paired-end libraries collected in step (a) or pretreated in step (b) can be divided into two types according to the length of the insert (fragment). Specifically, they can be divided into libraries whose inserts are short enough that the two read sequences from both ends overlap, and libraries whose inserts are long enough that a gap remains between the two read sequences.

[126]

In the present invention, one of two types of assembly pipeline can be selected according to the length of the insert.

[127]

Specifically, when the sum of the two read sequences collected from one insert (fragment) is longer than the insert sequence, the ALLPATHS software pipeline can be used; when the sum of the two read sequences collected from one insert (fragment) is shorter than the insert sequence, the Non-ALLPATHS software pipeline can be used.
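The selection rule above can be sketched in Python as follows. The function name is hypothetical, and the handling of the boundary case (sum of read lengths exactly equal to the insert length), which the description does not specify, is an assumption.

```python
def select_pipeline(read1_len, read2_len, insert_len):
    """Choose the assembly pipeline from read and insert lengths (in bp)."""
    if min(read1_len, read2_len, insert_len) <= 0:
        raise ValueError("lengths must be positive")
    if read1_len + read2_len > insert_len:
        # The two reads are longer in total than the insert, so they overlap.
        return "ALLPATHS"
    # The reads do not overlap (the equal case is assigned here by assumption).
    return "Non-ALLPATHS"
```

For example, two 100 bp reads from a 180 bp insert select the ALLPATHS pipeline, whereas the same reads from a 3 kbp insert select the Non-ALLPATHS pipeline.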

[128]

Figure 10 of the present invention is a schematic diagram of the genome assembly step according to the present invention. The genome assembly pipeline runs open source genome assembly software to which various algorithms are applied, and the ALLPATHS pipeline (401) or the Non-ALLPATHS pipeline (402) can be selected as appropriate depending on the type of library.

[129]

In the ALLPATHS pipeline (401) of the present invention, ALLPATHS software (205) can be used.

[130]

In addition, in the Non-ALLPATHS pipeline (402), the open source genome assembly software ABySS (206), SOAPdenovo (207), Velvet (208) or Celera (209) can be used.

[131]

The ALLPATHS pipeline (401) automatically creates the required files according to the given parameters. This allows the genome assembly to be run easily, without the user having to become fully familiar with the actual procedure for executing ALLPATHS.

[132]

ALLPATHS (205) assembles the genome from the given reads in several stages. ALLPATHS (205) essentially requires two types of library: a fragment library and a jumping library.

[133]

The fragment library must be prepared so that the fragment length is smaller than the sum of the lengths of the two reads. For example, if the read length is 100 bp, the fragment length should be about 180 bp. The jumping library requires a longer distance between the two reads, corresponding to libraries of about 3 kbp to 10 kbp, such as an EcoP15I library.

[134]

To run ALLPATHS (205), two CSV files, in_groups.csv and in_libs.csv, are required. in_groups.csv records, in CSV format, the name of the library to which each read file belongs. in_libs.csv records, for each library, the library name, the library type (fragment library, jumping library, etc.), whether paired-end reads are used, the fragment size and its standard deviation, the read orientation, and other information required for genome assembly.
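By way of illustration, the two files could be generated as in the Python sketch below. The column names are written from recollection of the ALLPATHS-LG manual and should be checked against it before use; the library names, file patterns, project and organism names, and size values are hypothetical placeholders, not values from this description.

```python
import csv
import io

# Column names as recalled from the ALLPATHS-LG manual (verify before use).
GROUP_FIELDS = ["group_name", "library_name", "file_name"]
LIB_FIELDS = ["library_name", "project_name", "organism_name", "type", "paired",
              "frag_size", "frag_stddev", "insert_size", "insert_stddev",
              "read_orientation", "genomic_start", "genomic_end"]


def to_csv(fields, rows):
    """Serialize dict rows to CSV text; fields missing from a row stay empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical example: one fragment library and one jumping library.
groups = [
    {"group_name": "frag", "library_name": "frag_lib",
     "file_name": "reads/frag_*.fastq"},
    {"group_name": "jump", "library_name": "jump_lib",
     "file_name": "reads/jump_*.fastq"},
]
libs = [
    {"library_name": "frag_lib", "project_name": "demo", "organism_name": "fungus",
     "type": "fragment", "paired": 1, "frag_size": 180, "frag_stddev": 10,
     "read_orientation": "inward", "genomic_start": 0, "genomic_end": 0},
    {"library_name": "jump_lib", "project_name": "demo", "organism_name": "fungus",
     "type": "jumping", "paired": 1, "insert_size": 3000, "insert_stddev": 300,
     "read_orientation": "outward", "genomic_start": 0, "genomic_end": 0},
]

in_groups_csv = to_csv(GROUP_FIELDS, groups)
in_libs_csv = to_csv(LIB_FIELDS, libs)
```

Writing `in_groups_csv` and `in_libs_csv` to files named in_groups.csv and in_libs.csv would give the pair of inputs described above.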

[135]

When ALLPATHS (205) completes the genome assembly, the final assembly is obtained as a FASTA-format file, final.assembly.fasta.

[136]

Various statistics of the genome assembly can also be obtained from the assembly.report and library_coverage.report files created in the same directory. The assembly.report file contains the statistics of the resulting assembly: the numbers of contigs and scaffolds, the N50 value, the average number of scaffolds per 1 Mbp, the assembled genome size, the proportion of gap sequence in the whole assembly, and other figures used to evaluate the genome assembly. The library_coverage.report file shows how many times the bases used cover the whole assembled genome. For example, if a 40 Mbp genome is assembled from 120 Mbp of read bases, three times its size, the sequence coverage is 3. A larger value means that more data were used, so the genome sequence is assembled with greater reliability.

[137]

The Non-ALLPATHS genome assembly pipeline (402) is more flexible than the ALLPATHS genome assembly pipeline (401) with respect to the libraries it accepts, and basically uses paired-end reads. Given the reads, the pipeline can independently run the genome assembly software ABySS (206), SOAPdenovo (207), Velvet (208) and Celera Assembler (209).

[138]

ABySS (206) is a multi-stage assembler for short paired-end reads. Specifically, ABySS (206) essentially uses a fragment library and can use a jumping library for additional scaffolding. Because it extracts k-mers from all reads to build a de Bruijn graph, the k-mer length is a required parameter of the genome assembly.

[139]

In general, a small k-mer makes it difficult to assemble repetitive sequences, whereas a large k-mer tends to produce fragmented contigs. Because the appropriate k-mer depends on the distribution of repetitive sequences in the genome, the strategy is to run the genome assembly with several k-mers, evaluate the resulting assemblies, and then select the optimal one.

[140]

The Non-ALLPATHS genome assembly pipeline (402) automatically provides a suitable range of k-mer values from the minimum and average lengths of the given reads.

[141]

ABySS runs the assembly for each k-mer, building unitigs, contigs and scaffolds in turn, and each is written to a FASTA-format file. The statistics calculated for the assembly are written to a <prefix>-stats file in the assembly directory. For the scaffolds (and likewise the unitigs and contigs), this file provides the number of scaffolds, the number of scaffolds of at least 500 bp, the minimum scaffold length, the N80, N50 and N20 values, the maximum scaffold length, the genome size and related length statistics.

[142]

SOAPdenovo (207) is software optimized for assembling large genomes from short reads. Like ABySS, SOAPdenovo (207) essentially requires a k-mer, and, as described above, the Non-ALLPATHS genome assembly pipeline (402) can provide a suitable range of values.

[143]

SOAPdenovo (207) additionally requires a config file that provides the library information: the average fragment length, the read orientation, the assembly type (contig assembly, scaffold assembly, or both), the maximum read length used for genome assembly, the priority of the library, the minimum number of read pairs for scaffolding, and the minimum alignment length. The Non-ALLPATHS genome assembly pipeline (402) automatically creates the config file from the given parameters.

[144]

For each k-mer, SOAPdenovo (207) writes the assembled genome to the FASTA-format files <prefix>.contig and <prefix>.scafSeq. The assembly statistics are written to the <prefix>.scafStatistics file, which provides the number of scaffolds, the average scaffold length, the minimum/maximum scaffold length, the A, T, G and C content, and the N10 to N90 values.

[145]

Velvet (208) builds a de Bruijn graph from the reads and assembles the genome through graph simplification, removal of sequencing errors and resolution of repetitive sequences. Velvet (208) assembles the genome by running two programs, velveth and velvetg, in sequence. velveth extracts k-mers from the given reads and stores them in a data structure in preparation for velvetg; velvetg builds the de Bruijn graph, simplifies it, removes sequencing errors and resolves repetitive sequences on the graph to produce the final genome assembly.

[146]

Like the two programs described above, Velvet requires a k-mer, and the Non-ALLPATHS genome assembly pipeline (402) can provide a suitable range of values.

[147]

velvetg (208) has two parameters that strongly affect the resulting assembly: the exp_cov option and the cov_cutoff option. The exp_cov option specifies the expected sequence coverage of the genome, and the cov_cutoff option specifies the minimum sequence coverage to be used in the genome assembly; different values of these two options yield different assemblies.

[148]

To obtain optimal values for these two options, the VelvetOptimiser software is used. Given only a k-mer range specified by the user, VelvetOptimiser attempts genome assemblies while varying the parameters, evaluates them itself and determines the optimum. VelvetOptimiser stores the resulting assembly in a FASTA-format file, contigs.fa. The statistics of the resulting assembly are provided in a log file and include the genome size, the N50 value, the maximum scaffold length, the number of scaffolds of at least 1 Kbp, and so on.

[149]

Celera Assembler (209) performs repeat finding (find repeats), overlap detection (overlap), error correction (error correction), unitig generation (unitigging) and scaffolding (scaffolding) in sequence during genome assembly. Unlike the software described above, Celera Assembler (209) uses an overlap-layout-consensus algorithm, which draws the graph from the reads themselves without breaking them up into k-mers. A k-mer value therefore does not need to be provided.

[150]

Celera Assembler (209) cannot use the common FASTQ read format directly, so the reads must be converted to the appropriate format. This can be done with the fastqToCA script included in the Celera Assembler package.

[151]

Celera Assembler (209) stores the resulting assembly in a FASTA-format file, <prefix>.scf.fasta. The assembly statistics are stored in the <prefix>.qc file and include the number of scaffolds, the number of contigs, the average number of contigs per scaffold, the assembled genome size, the minimum/maximum scaffold length, the N25, N50 and N75 values, the number of scaffolds of 2 Kbp or more, the number and average length of gaps, and so on.

[152]

In the present invention, step (d) is the genome assembly evaluation step (500), in which the candidate genomes assembled in step (c) are evaluated to select the optimal genome.

[153]

Figure 11 of the present invention is a schematic diagram of the genome assembly evaluation step, in which the various candidate assemblies (candidate genomes) are evaluated. The evaluation of the candidate genomes can be performed by at least one selected from the group consisting of a likelihood evaluation step (501), an N50 value evaluation step (502), a scaffold number evaluation step (503) and a maximum scaffold length evaluation step (504). These steps can be carried out by the genome assembly evaluation pipeline.

[154]

The likelihood evaluation step (501) indirectly measures how similar the assembled genome sequence is to the original genome, and is performed with CGAL (210). The likelihood evaluation calculates a likelihood value that collectively reflects how uniformly the reads cover each assembled scaffold, how many sequencing errors are present, how uniform the fragment lengths are, and how many reads were not used in the genome assembly.

[155]

Whereas existing genome assembly evaluation programs evaluate an assembly by comparing it with an already assembled genome, CGAL (210) does not require a genome for comparison and is therefore suitable for a genome that is being assembled for the first time. CGAL (210) requires the resulting assembly and a BAM-format file in which the reads used for the assembly are aligned to it. The BAM-format file is obtained by aligning the reads to the assembly with Bowtie 2. The genome assembly evaluation pipeline (500) produces a BAM file for each generated assembly.

[156]

The value calculated by CGAL (210) is a log value and is therefore a negative number; the smaller its absolute value, the more reliable the assembly.

[157]

The N50 value evaluation step (502) extracts the N50 value of each genome assembly from the software log files and compares which software assembled longer scaffolds overall. The N50 value is similar to the mean or median length, but gives greater weight to the long scaffolds. A larger N50 value simply means that longer scaffolds were assembled.
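For reference, the N50 value under its usual definition can be computed as in the following Python sketch. In the pipeline described here the value is read from each assembler's log file rather than recomputed, so the function is illustrative only and its name is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths.

    The N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
```

For the lengths 100, 10, 10 and 10, the mean is 32.5 and the median is 10, but the N50 is 100, which shows how the value weights long scaffolds.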

[158]

The scaffold number evaluation step (503) compares the numbers of scaffolds output by each assembly software to determine which assembly consists of fewer scaffolds. For assemblies of similar size, a smaller number of scaffolds indicates a better assembly. The number of scaffolds of each genome assembly can be extracted from the software log files.

[159]

The maximum scaffold length evaluation step (504) compares the length of the longest scaffold in each assembly. A longer maximum scaffold indicates a better genome assembly. The maximum scaffold length can be extracted from each software log file.

[160]

In addition, the read number evaluation step (505) compares how many of the given reads were used in assembling the genome. A larger number of reads used, that is, a higher sequence coverage of the resulting genome assembly, indicates a better genome assembly. The number of reads can be extracted from the log files of ABySS (206) and Velvet (208); because the log files of SOAPdenovo (207) and Celera Assembler (209) do not provide this information, the reads are aligned back to the assembly and the total number of aligned reads is counted.

[161]

In the present invention, after the candidate genomes are evaluated by the likelihood evaluation (501), N50 value evaluation (502), scaffold number evaluation (503) and maximum scaffold length evaluation (504) described above, the results of each evaluation are ranked, and the optimal genome can be selected according to the sum of the ranks.
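The rank-sum selection can be sketched in Python as follows. The description does not state how ties are handled or that the lowest sum wins, so this sketch assumes that rank 1 is best in each evaluation, that tied candidates share the better rank, and that the candidate with the smallest rank sum is selected. The criterion keys and function name are hypothetical.

```python
# True means a larger value is better for that criterion.
CRITERIA = {
    "likelihood": True,     # CGAL log-likelihood: closer to zero is better
    "n50": True,
    "scaffolds": False,     # fewer scaffolds is better
    "max_scaffold": True,
}


def select_optimal(candidates):
    """Pick the candidate with the smallest sum of per-criterion ranks.

    candidates: {name: {criterion: value}}
    Returns (best_name, {name: rank_sum}).
    """
    names = list(candidates)
    rank_sums = dict.fromkeys(names, 0)
    for criterion, higher_is_better in CRITERIA.items():
        for name in names:
            value = candidates[name][criterion]
            # Rank = 1 + number of candidates that are strictly better.
            better = sum(
                1 for other in names
                if candidates[other][criterion] != value
                and (candidates[other][criterion] > value) == higher_is_better
            )
            rank_sums[name] += better + 1
    best = min(names, key=lambda n: rank_sums[n])
    return best, rank_sums
```

Each candidate assembly contributes one dictionary of the four evaluation values, and the function returns the name of the selected assembly together with all rank sums.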

[162]

In the present invention, step (e) is the genome assembly post-processing step (600), in which the optimal genome is post-processed. In this step (600), the assembly information, namely the genome statistics, the check for contamination by foreign DNA, the mitochondrial assembly and the comparison with phylogenetically related species, is sent to the web server so that the assembly information can be reviewed collectively through a web browser.

[163]

Figure 12 of the present invention is a schematic diagram of the genome assembly post-processing step according to the present invention. The genome assembly post-processing step (600) can be performed by at least one selected from the group consisting of a foreign DNA contamination check step (601), a 28S rDNA sequence check step (602), a mitochondrial assembly step (603) and a web server upload step (604).

[164]

In one embodiment, the open source software BLAST+ (211) can be used in this step. BLAST+ (211) aligns the assembled sequences against a given database and searches for similar sequences.

[165]

The foreign DNA contamination check step (601) looks for sequences in the resulting genome assembly that come from organisms other than the fungus, which can be included when DNA from contaminants is extracted and sequenced together with the fungal DNA.

[166]

To identify foreign DNA, two statistics can be used: GC content and sequence coverage.

[167]

GC content is useful for distinguishing bacterial genomes from fungal genomes. Every species has its own characteristic GC content, and bacteria are reported to have a higher average GC content than fungi. If bacteria were introduced into the culture, the sequenced DNA may therefore include scaffolds with a high GC content.

[168]

The sequence coverage of each scaffold also helps to identify foreign DNA. When cells are cultured before sequencing, the genomic DNA is replicated in equal amounts as the cells proliferate. Because this DNA is fragmented and sequenced at random, the sequence coverage per scaffold follows a normal distribution around a constant mean. Scaffolds that fall outside this coverage curve, that is, scaffolds with a sequence coverage lower or higher than expected, are likely to be mitochondrial DNA sequences or DNA fragments that originate from outside the cultured cells.

[169]

Each scaffold is aligned against all sequences registered at NCBI to confirm which species its similar sequences belong to.

[170]

In the present invention, a two-dimensional graph of GC content against sequence coverage is drawn, with the points colored differently according to the kingdom or phylum of the matching NCBI sequence, so that the presence of foreign DNA such as bacteria can be seen at a glance.
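The two statistics plotted in this graph can be computed as in the Python sketch below. The outlier rule (coverage more than a fixed number of standard deviations from the mean) is an assumption made for illustration; the description only states that scaffolds falling outside the coverage curve are suspect. Function names and the threshold `n_sd` are hypothetical.

```python
from statistics import mean, stdev


def gc_content(sequence):
    """Fraction of G and C among the A, C, G, T bases (N and others ignored)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def flag_coverage_outliers(coverage_by_scaffold, n_sd=2.0):
    """Names of scaffolds whose coverage is more than n_sd standard
    deviations from the mean coverage (n_sd is an assumed threshold)."""
    values = list(coverage_by_scaffold.values())
    if len(values) < 2:
        return []
    mu, sd = mean(values), stdev(values)
    return [name for name, cov in coverage_by_scaffold.items()
            if sd > 0 and abs(cov - mu) > n_sd * sd]
```

Scaffolds flagged in this way would then be checked against the taxonomic assignment of their best database match.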

[171]

The 28S rDNA sequence check step (602) extracts the sequences corresponding to 28S rDNA from the assembly, aligns them to the fungal 28S rDNA sequences in the RDP database, and confirms which species each belongs to.

[172]

Because 28S rDNA sequences can be obtained easily by PCR without analyzing the whole genome, databases of them have been built for many fungal species, and they can be used to check whether the assembled sequence actually belongs to the species intended for analysis.

[173]

Here, BLAST+ (211) is used to align the query sequences to the database sequences. If the aligned sequences belong to the species intended for analysis, or to a closely related species with high similarity, it can be confirmed that the intended species was analyzed.

[174]

The mitochondrial assembly step (603) assembles the sequence of the fungal mitochondrial chromosome from the given reads.

[175]

The mitochondria of eukaryotes have a single chromosome of their own. Since a cell contains about 100 to 1,000 mitochondria, the mitochondrial chromosome is sequenced about 100 to 1,000 times more often than the chromosomes of the fungal cell. After assembly, it therefore has a much higher sequence coverage than the fungal chromosomes. This property is used to assemble the sequence of the mitochondrial chromosome.

[176]

The mitochondrial assembly step (603) uses the exp_cov and cov_cutoff options of Velvet (208). First, a genome assembly is made with Velvet (208) from the given reads, and the sequence coverage that Velvet provides for each scaffold is obtained. The scaffolds are aligned to a mitochondrial database built from the corresponding NCBI sequences, and the mitochondrial scaffolds are extracted. The mean and standard deviation of the sequence coverage of the scaffolds corresponding to mitochondrial sequences are calculated and used as the exp_cov and cov_cutoff parameters when Velvet is run again to assemble the mitochondrial chromosome. exp_cov is set to the mean, and cov_cutoff is set to (mean - standard deviation).
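The parameter calculation above can be sketched in Python as follows. The description does not say whether the sample or the population standard deviation is used; this sketch assumes the sample standard deviation and, as a further assumption, clamps the cutoff at zero. The function name is hypothetical.

```python
from statistics import mean, stdev


def velvet_coverage_params(mito_coverages):
    """Return (exp_cov, cov_cutoff) from mitochondrial scaffold coverages.

    exp_cov    = mean coverage
    cov_cutoff = mean - standard deviation (not allowed to go below zero)
    """
    if len(mito_coverages) < 2:
        raise ValueError("need at least two scaffolds to estimate a deviation")
    exp_cov = mean(mito_coverages)
    cov_cutoff = max(exp_cov - stdev(mito_coverages), 0.0)
    return exp_cov, cov_cutoff
```

The two returned values would be passed to velvetg through its `-exp_cov` and `-cov_cutoff` options for the second, mitochondria-specific run.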

[177]

The web server upload step (604) uploads the results generated during the genome assembly to the web server so that they can be displayed there.

[178]

The results uploaded in the web server upload step (604) can comprise the FastQC result (605), Read QC result (606), K-mer frequency result (607), 28S rDNA sequence alignment result (608), GC-cov result (609) and assembly statistics (610).

[179]

The FastQC result (605) displays, for each library, the result of running FastQC (201) during the read pretreatment pipeline (300).

[180]

The Read QC result (606) shows numerically the results of the adapter sequence removal (302) and the read quality correction (303).

[181]

The K-mer frequency result (607) shows, for each library, the sequence coverage and genome size estimated from the calculated k-mer frequencies (304), together with the k-mer frequency graph.

[182]

The 28S rDNA sequence alignment result (608) is the result of the 28S rDNA sequence check (602) of the assembly post-processing (600). It includes the scaffold name, aligned length, coverage, similarity, the species of the aligned sequence, and so on.

[183]

The GC-cov result (609) is the result of the foreign DNA contamination check (601) of the assembly post-processing (600). It is a scatter plot of GC content against coverage for each scaffold, in which the points are colored according to the BLAST+ (211) classification result.

[184]

The assembly statistics (610) show in a table the N50 value, genome size, maximum scaffold length, scaffold number, read number and CGAL value of each genome assembly.

[185]

[186]

[Embodiment]

[187]

The results of actually running the genome assembly pipeline (10) are examined below.

[188]

The computing resources used were an Intel® Xeon E5-2640 (15M Cache, 2.50 GHz, 7.20 GT/s Intel® QPI), 256 GB of RAM and the Ubuntu 14.04 64-bit operating system. About 200 GB of free hard drive space was available.

[189]

The data used were a fragment library (10,572,329 read pairs) generated on the Illumina platform from DNA extracted from the species Choanephora cucurbitarum.

[190]

Choanephora cucurbitarum is a plant pathogenic fungus that infects cucurbits, including pumpkin and zucchini. In addition, because few genomes of fungi belonging to the Zygomycota have been studied, analyzing the genome of this strain provides information that helps in understanding the genomic properties of the Zygomycota.

[191]

At present, no genome sequence of C. cucurbitarum has been published in any database.

[192]

Before the genome assembly pipeline (10) was run, the required open source software (201 to 211) was installed following the installation guide of each program.

[193]

The read pretreatment pipeline (300) was run, comprising the read quality summary step (301), the adapter sequence removal step (302), the read quality correction step (303) and the k-mer frequency calculation step (304).

[194]

The read quality summary was checked from the FastQC (201) results.

[195]

Figure 25 of the present invention shows the per-base quality distribution of the C. cucurbitarum genome reads.

[196]

As shown in Figure 25, the per-base quality distribution graphs show that the quality of the reads is reliable (30 or more) at the 5' end but deteriorates toward the 3' end. In particular, in read 1 of each pair the average quality falls below 20 near base 260, so a substantial number of reads can be expected to be cut near that base during quality correction (upper graph). Read 2 is of relatively lower quality than read 1, with the average quality falling below 20 from around base 220 (lower graph).

[197]

Figure 26 shows the per-read average quality distribution of the C. cucurbitarum genome reads.

[198]

As shown in Figure 26, the per-read average quality distributions of read 1 and read 2 differ considerably. In read 1, the number of reads peaks at average qualities of 33 and 39 (upper graph), whereas in read 2 it peaks near 14 and 28 (lower graph). Because the average does not take the deviation into account, it is not a direct criterion for deciding whether to use a given read, but it gives a rough indication of whether the read quality is good.

[199]

Figure 27 is a graph of the ATGC content per base position of the C. cucurbitarum genome reads.

[200]

As shown in Figure 27, the ATGC content and GC content of the given reads show a distribution similar to that of typical genomic DNA sequences. In read 2, the ATGC content becomes unstable toward the rear bases of the read (lower graph), which may be the effect of low-quality bases on the overall ATGC content.

[201]

When a specific sequence occurs in abnormally many reads, the FastQC result reports that sequence. In the data used in this embodiment, Sequence 1 was reported, and it matched an adapter sequence. The sequence was found in about 1% of all reads. Adapter sequences found in this way are cut out in the subsequent step.

[202]

In the adapter sequence removal (302), adapters were identified in 386,490 (3.7%) of the read 1 reads and 495,106 (4.7%) of the read 2 reads, and 53,501,276 (1.88%) and 909,916,970 (0.52%) bases, respectively, were identified as adapter sequence and removed.

[203]

Figure 28 is a read quality correction graph for the C. cucurbitarum genome reads. As shown in Figure 28, comparing the per-base quality distribution before quality correction (upper) and after quality correction (lower) confirms that the average quality of the reads is higher overall after correction.

[204]

Specifically, in the read quality correction step (303), 1,402,442 (13.26%) of the 10,572,329 read pairs were classified as low quality and removed. The remaining 9,169,887 read pairs were used for genome assembly. With a read length of 250 bp, this corresponds to 4,584,943,500 bases, which is about 131 times a fungal genome of 35 Mbp. This confirmed that a sufficient number of bases was available for assembling a fungal genome.
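The arithmetic behind these figures can be checked with a short Python sketch. It assumes that every retained read is exactly 250 bp long and that the genome size is 35 Mbp, as in the paragraph above; the function name is hypothetical.

```python
def sequence_coverage(read_pairs, read_length, genome_size):
    """Return (total bases, fold coverage) for paired-end reads.

    Fold coverage = total sequenced bases / genome size (lengths in bp).
    """
    total_bases = read_pairs * 2 * read_length
    return total_bases, total_bases / genome_size


pairs_in, pairs_removed = 10_572_329, 1_402_442
pairs_kept = pairs_in - pairs_removed             # read pairs kept for assembly
removed_pct = 100 * pairs_removed / pairs_in      # share removed, in percent
total_bases, coverage = sequence_coverage(pairs_kept, 250, 35_000_000)
```

The same function applies to the general definition of sequence coverage: 120 Mbp of read bases over a 40 Mbp genome gives a coverage of 3.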

[205]

In the k-mer frequency calculation step (304), the k-mer frequencies of the quality-corrected reads were calculated, and the sequence coverage and genome size were predicted.

[206]

Figure 29 is a k-mer histogram, in which the x axis is the number of times a k-mer occurs and the y axis is the number of distinct k-mers that occur that many times. As the graph shows, the analysis was performed independently for two k-mer sizes, 17 and 19, and a typical k-mer frequency curve was obtained for each. The predicted genome sizes were 29.25 Mbp and 29.58 Mbp.

[207]

The ALLPATHS genome assembly pipeline (401) requires two types of library, but the data used in this embodiment are a fragment library only (average fragment length 400 bp), so a jumping library was generated by simulation.

[208]

First, a genome assembly was made with the open source software ABySS (206), and then 4,557,436 reads were obtained from that assembly by simulation with wgsim.

[209]

The actual fragment library data and the simulated jumping library data were then run through the ALLPATHS genome assembly pipeline. As a result, a genome of 29.0 Mbp was assembled in 2,814 scaffolds. With an N50 value of 28 Kbp and a sequence coverage of 86.2x, the genome assembly was of good quality. It was confirmed to agree with the genome size predicted in the previous step.

[210]

The final assembly generated by the ALLPATHS genome assembly pipeline (401) was analyzed in various respects by the post-processing (600) pipeline.

[211]

Figure 30 shows the cumulative contig length and GC content graphs of the optimal C. cucurbitarum assembly: the cumulative contig length of the assembly finally selected from the ALLPATHS genome assembly pipeline (upper) and the frequency of GC content for a given window size (lower).

[212]

In addition, Figure 31 is a scatter plot of GC content against coverage for each scaffold, used to check the C. cucurbitarum assembly for foreign DNA contamination. In the scatter plot of per-scaffold GC content and sequence coverage calculated in the foreign DNA contamination check (601), no apparent contamination by foreign DNA such as bacteria was found. Many of the scaffolds were found in the NCBI nt database to be Fungi and are colored accordingly.

[213]

Many scaffolds were not aligned to the NCBI nt database because the database contains insufficient sequence information for fungi of the Zygomycota.

[214]

It was thus confirmed that the final assembly is free of contamination by foreign DNA.

[215]

In the 28S rDNA sequence check (602), the sequences aligned to the RDP (Ribosomal Database Project) database were obtained in the form of a table.

[216]

Figure 32 is a table of the 28S rDNA sequence alignments for C. cucurbitarum, showing the BLAST results of aligning the 28S rDNA sequences of the assembly to similar fungal sequences in the RDP molecular taxonomy database. Both of the two aligned scaffolds matched sequences of species closely related to C. cucurbitarum, which identifies the sequenced species as C. cucurbitarum.

[217]

Specifically, rDNA sequences were found in two scaffolds, scaffold_1520 and scaffold_1308, and both aligned to Rhizopus and Mucor, close relatives of C. cucurbitarum. This confirmed that the genome of the intended fungus was sequenced and analyzed correctly.

[218]

In the mitochondrial assembly (603), aligning the final assembly to the mitochondrial database yielded 167 scaffolds that matched mitochondrial sequences. Of these, 150 were short sequences of less than 500 bp, and one scaffold was 47 Kbp long. The mitochondrial assembly pipeline (603), which runs Velvet using the sequence coverage, assembled the mitochondrial genome into 203 scaffolds, of which 11 were at least 5 Kbp long.



[1]

The present invention relates to a system for performing genome sequence assembly from short sequence reads acquired by a next generation sequencing analysis method. The system comprises: a collection part; a genome assembly part; and a genome assembly evaluation part.

[2]

COPYRIGHT KIPO 2017

[3]



A fungal genome sequence assembly and evaluation system comprising: a sequence collection part that collects and stores two read sequences per fragment, obtained by paired-end sequencing of each collected base-sequence fragment; a genome assembly part in which one or more genome assembly software programs assemble one or more candidate genomes; and a genome assembly evaluation part that evaluates the candidate genomes and selects an optimal genome, wherein the genome assembly part includes an ALLPATHS pipeline and a Non-ALLPATHS pipeline.

The fungal genome sequence assembly and evaluation system according to Claim 1, further comprising a preprocessing unit that preprocesses the read sequences collected by the sequence collection unit.

The fungal genome sequence assembly and evaluation system according to Claim 1, further comprising a genome assembly postprocessing unit that receives the optimal genome from the evaluation unit and postprocesses it.

A fungal genome sequence assembly and evaluation method comprising: (A) collecting two read sequences per fragment, obtained by paired-end sequencing of each sequenced fragment; (B) assembling one or more candidate genomes using one or more genome assembly software programs; and (C) evaluating the candidate genomes and selecting an optimal genome, wherein the ALLPATHS software pipeline is used when the sum of the lengths of the two read sequences collected from one fragment is longer than the length of the fragment, and the non-ALLPATHS software pipeline is used when the sum of the lengths of the two read sequences collected from one fragment is shorter than the length of the fragment.

The fungal genome sequence assembly and evaluation method according to Claim 4, further comprising preprocessing the collected read sequences.

The fungal genome sequence assembly and evaluation method according to Claim 5, wherein the preprocessing step performs at least one selected from the group consisting of read quality summarization, removal of specific adapter sequences, read quality correction, and k-mer frequency calculation.

The fungal genome sequence assembly and evaluation method according to Claim 4, wherein the software of the non-ALLPATHS pipeline is ABySS, SOAPdenovo, Velvet, or Celera.

The fungal genome sequence assembly and evaluation method according to Claim 4, wherein the evaluation of the candidate genomes performs at least one selected from the group consisting of likelihood evaluation, N50 value evaluation, scaffold count evaluation, and maximum scaffold length evaluation, ranks the result of each evaluation, and derives the optimal genome according to the sum of the ranks.

The fungal genome sequence assembly and evaluation method according to Claim 4, further comprising postprocessing the optimal genome.

The fungal genome sequence assembly and evaluation method according to Claim 9, wherein the postprocessing performs at least one selected from the group consisting of checking the optimal genome for contamination by foreign DNA, 28S rDNA sequence confirmation, mitochondrial assembly, and upload to a web server.

The fungal genome sequence assembly and evaluation method according to Claim 10, wherein the check for contamination by foreign DNA is performed using a distribution graph of sequence coverage and GC content.

The fungal genome sequence assembly and evaluation method according to Claim 10, wherein the 28S rDNA sequence confirmation aligns the sequence against the RDP database to determine which fungus the genome sequence belongs to.
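The pipeline selection rule of Claim 4 and the rank-sum evaluation of Claim 8 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The names `choose_pipeline`, `n50`, `Candidate`, `rank` and `select_optimal` are hypothetical. The sketch also assumes several things the claims do not state: that a read-length sum equal to the fragment length falls to the non-ALLPATHS pipeline, that higher likelihood, higher N50, fewer scaffolds and a longer maximum scaffold are each "better", that the lowest rank sum wins, and that ties are broken by input order.

```python
from dataclasses import dataclass
from typing import Dict, List


def choose_pipeline(read1_len: int, read2_len: int, fragment_len: int) -> str:
    """Claim 4 rule: read pairs longer in total than the fragment go to ALLPATHS.

    Assumption: the equal-length case, which the claim leaves open,
    is routed to the non-ALLPATHS pipeline.
    """
    if read1_len + read2_len > fragment_len:
        return "ALLPATHS"
    return "non-ALLPATHS"


def n50(lengths: List[int]) -> int:
    """Length L such that scaffolds of length >= L hold at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


@dataclass
class Candidate:
    name: str                    # label of the assembler that produced it
    scaffold_lengths: List[int]  # scaffold lengths in bp
    likelihood: float            # hypothetical score from an external likelihood tool


def rank(values: Dict[str, float], higher_is_better: bool) -> Dict[str, int]:
    """Map each candidate name to its rank (1 = best) for one criterion."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    return {name: position + 1 for position, name in enumerate(ordered)}


def select_optimal(candidates: List[Candidate]) -> str:
    """Sum the per-criterion ranks and return the name with the lowest total."""
    criteria = [
        ({c.name: c.likelihood for c in candidates}, True),              # likelihood
        ({c.name: n50(c.scaffold_lengths) for c in candidates}, True),   # N50 value
        ({c.name: len(c.scaffold_lengths) for c in candidates}, False),  # scaffold count
        ({c.name: max(c.scaffold_lengths) for c in candidates}, True),   # longest scaffold
    ]
    totals = {c.name: 0 for c in candidates}
    for values, higher_is_better in criteria:
        for name, position in rank(values, higher_is_better).items():
            totals[name] += position
    return min(totals, key=totals.get)


if __name__ == "__main__":
    # Illustrative numbers only, not data from the patent.
    print(choose_pipeline(250, 250, 400))
    print(select_optimal([
        Candidate("abyss", [100, 50, 50], -120.0),
        Candidate("velvet", [150, 50], -100.0),
    ]))
```

Summing ranks rather than raw values keeps criteria with very different scales (a log-likelihood versus a length in base pairs) from dominating one another, which is one plausible reason to evaluate candidates this way.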