megamerger

Wiki

The master copies of EMBOSS documentation are available on the EMBOSS Wiki. Please help by correcting and extending the Wiki pages.

Function

Merge two large overlapping DNA sequences

Description

megamerger reads two overlapping input DNA sequences and uses a word-match algorithm to align them. A merged sequence is generated from the alignment and written to the output sequence file. The actions megamerger took in generating the merged sequence are written to a separate report file. The sequences can be very long.

Algorithm

The program finds all matches of sequence words of size 20 (by default). It then reduces these to a minimal set of matches: the matches are sorted by size (largest first) and, for each match in turn, any smaller matches that overlap it are removed. The result is a set of the longest ungapped alignments between the two sequences that do not overlap with each other.

If the two sequences are identical in their region of overlap then there will be one region of match and no mismatches. Where there is a mismatch, the merged sequence uses bases from the sequence whose mismatch region is furthest from the start or end of that sequence.

Usage

Here is a sample session with megamerger

% megamerger tembl:v00295 tembl:v00296
Merge two large overlapping DNA sequences
Word size [20]:
output sequence [v00295.merged]:
Output file [v00295.megamerger]: report

Command line arguments

Merge two large overlapping DNA sequences
Version: EMBOSS:

   Standard (Mandatory) qualifiers:
  [-asequence]         sequence   Nucleotide sequence filename and optional
                                  format, or reference (input USA)
  [-bsequence]         sequence   Nucleotide sequence filename and optional
                                  format, or reference (input USA)
   -wordsize           integer    [20] Word size (Integer 2 or more)
  [-outseq]            seqout     [.]
                                  Sequence filename and optional format
                                  (output USA)
  [-outfile]           outfile    [*.megamerger] Output file name

   Additional (Optional) qualifiers:
   -prefer             boolean    [N] When a mismatch between the two sequences
                                  is discovered, one or other of the two
                                  sequences must be used to create the merged
                                  sequence over this mismatch region. The
                                  default action is to create the merged
                                  sequence using the sequence where the
                                  mismatch is closest to that sequence's
                                  centre. If this option is used, then the
                                  first sequence (seqa) will always be used in
                                  preference to the other sequence when there
                                  is a mismatch.

   Advanced (Unprompted) qualifiers: (none)

Input file format

megamerger reads any two Sequence USAs.

Input files for usage example

'tembl:v00295' is a sequence entry in the example nucleic acid database 'tembl'

Database entry: tembl:v00295

ID   V00295; SV 1; linear; genomic DNA; STD; PRO; 1500 BP.
XX
AC   V00295;
XX
DT   09-JUN-1982 (Rel. 01, Created)
DT   07-JUL-1995 (Rel. 44, Last updated, Version 4)
XX
DE   E. coli lacY gene (codes for lactose permease).
XX
KW   membrane protein.
XX
OS   Escherichia coli
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Escherichia.
XX
RN   [1]
RP   1-1500
RX   DOI; 10.1038/283541a0.
RX   PUBMED; 6444453.
RA   Buechel D.E., Gronenborn B., Mueller-Hill B.;
RT   "Sequence of the lactose permease gene";
RL   Nature 283(5747):541-545(1980).
XX
CC   lacZ is a beta-galactosidase and lacA is transacetylase.
FH   Key             Location/Qualifiers
FH
FT   source          1..1500
FT                   /organism="Escherichia coli"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:562"
XX
AC   V00296;
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   18-APR-2005 (Rel. 83, Last updated, Version 5)
XX
DE   E. coli gene lacZ coding for beta-galactosidase (EC
XX
KW   galactosidase.
XX
OS   Escherichia coli
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Escherichia.
XX
RN   [1]
RP   1-3078
RX   PUBMED; 6313347.
RA   Kalnins A., Otto K., Ruether U., Mueller-Hill B.;
RT   "Sequence of the lacZ gene of Escherichia coli";
RL   EMBO J. 2(4):593-597(1983).
XX
RN   [2]
RX   PUBMED; 3038536.
RA   Zell R., Fritz H.J.;
RT   "DNA mismatch-repair in Escherichia coli counteracting the hydrolytic
RT   deamination of 5-methyl-cytosine residues";
RL   EMBO J. 6(6):1809-1815(1987).
XX
CC   Data kindly reviewed (18-MAY-1983) by U. Ruether
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..3078
FT                   /organism="Escherichia coli"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:562"

Output file format

Any action that requires a choice between regions of the two sequences where they mismatch is marked with the word 'WARNING!'. Where there was a mismatch between the two sequences, the merged sequence is written out in uppercase and the sequence whose mismatch region is furthest from the end of the sequence (the one with most remaining bases or residues) is used in the merged sequence. The name and description of the first input sequence are used for the name and description of the output sequence.

Output files for usage example

File: report

# Report of megamerger of: V00295 and V00296

V00295 overlap starts at 1
V00296 overlap starts at 3019

Using V00296 1-3018 as the initial sequence

Matching region V00295 1-60 : V00296 3019-3078
Length of match: 60

V00295 overlap ends at 60
V00296 overlap ends at 3078

Using V00295 61-1500 as the final sequence

File: v00295.merged

>V00295 V00295.1 E. coli lacY gene (codes for lactose permease).
accatgattacggattcactggccgtcgttttacaacgtcgtgactgggaaaaccctggc gttacccaacttaatcgccttgcagcacatccccctttcgccagctggcgtaatagcgaa gaaaaccgcctcgcggtgatggtgctgcgttggagtgacggcagttatctggaagatcag gatatgtggcggatgagcggcattttccgtgacgtctcgttgctgcataaaccgactaca caaatcagcgatttccatgttgccactcgctttaatgatgatttcagccgcgctgtactg gaggctgaagttcagatgtgcggcgagttgcgtgactacctacgggtaacagtttcttta tggcagggtgaaacgcaggtcgccagcggcaccgcgcctttcggcggtgaaattatcgat gagcgtggtggttatgccgatcgcgtcacactacgtctgaacgtcgaaaacccgaaactg tggagcgccgaaatcccgaatctctatcgtgcggtggttgaactgcacaccgccgacggc acgctgattgaagcagaagcctgcgatgtcggtttccgcgaggtgcggattgaaaatggt ctgctgctgctgaacggcaagccgttgctgattcgaggcgttaaccgtcacgagcatcat cctctgcatggtcaggtcatggatgagcagacgatggtgcaggatatcctgctgatgaag cagaacaactttaacgccgtgcgctgttcgcattatccgaaccatccgctgtggtacacg ctgtgcgaccgctacggcctgtatgtggtggatgaagccaatattgaaacccacggcatg gtgccaatgaatcgtctgaccgatgatccgcgctggctaccggcgatgagcgaacgcgta acgcgaatggtgcagcgcgatcgtaatcacccgagtgtgatcatctggtcgctggggaat gaatcaggccacggcgctaatcacgacgcgctgtatcgctggatcaaatctgtcgatcct tcccgcccggtgcagtatgaaggcggcggagccgacaccacggccaccgatattatttgc ccgatgtacgcgcgcgtggatgaagaccagcccttcccggctgtgccgaaatggtccatc aaaaaatggctttcgctacctggagagacgcgcccgctgatcctttgcgaatacgcccac gcgatgggtaacagtcttggcggtttcgctaaatactggcaggcgtttcgtcagtatccc cgtttacagggcggcttcgtctgggactgggtggatcagtcgctgattaaatatgatgaa aacggcaacccgtggtcggcttacggcggtgattttggcgatacgccgaacgatcgccag ttctgtatgaacggtctggtctttgccgaccgcacgccgcatccagcgctgacggaagca aaacaccagcagcagtttttccagttccgtttatccgggcaaaccatcgaagtgaccagc gaatacctgttccgtcatagcgataacgagctcctgcactggatggtggcgctggatggt aagccgctggcaagcggtgaagtgcctctggatgtcgctccacaaggtaaacagttgatt gaactgcctgaactaccgcagccggagagcgccgggcaactctggctcacagtacgcgta gtgcaaccgaacgcgaccgcatggtcagaagccgggcacatcagcgcctggcagcagtgg cgtctggcggaaaacctcagtgtgacgctccccgccgcgtcccacgccatcccgcatctg accaccagcgaaatggatttttgcatcgagctgggtaataagcgttggcaatttaaccgc cagtcaggctttctttcacagatgtggattggcgataaaaaacaactgctgacgccgctg 
cgcgatcagttcacccgtgcaccgctggataacgacattggcgtaagtgaagcgacccgc attgaccctaacgcctgggtcgaacgctggaaggcggcgggccattaccaggccgaagca gcgttgttgcagtgcacggcagatacacttgctgatgcggtgctgattacgaccgctcac gcgtggcagcatcaggggaaaaccttatttatcagccggaaaacctaccggattgatggt agtggtcaaatggcgattaccgttgatgttgaagtggcgagcgatacaccgcatccggcg cggattggcctgaactgccagctggcgcaggtagcagagcgggtaaactggctcggatta gggccgcaagaaaactatcccgaccgccttactgccgcctgttttgaccgctgggatctg ccattgtcagacatgtataccccgtacgtcttcccgagcgaaaacggtctgcgctgcggg acgcgcgaattgaattatggcccacaccagtggcgcggcgacttccagttcaacatcagc cgctacagtcaacagcaactgatggaaaccagccatcgccatctgctgcacgcggaagaa ggcacatggctgaatatcgacggtttccatatggggattggtggcgacgactcctggagc ccgtcagtatcggcggaattccagctgagcgccggtcgctaccattaccagttggtctgg tgtcaaaaataataataaccgggcaggccatgtctgcccgtatttcgcgtaaggaaatcc attatgtactatttaaaaaacacaaacttttggatgttcggtttattctttttcttttac ttttttatcatgggagcctacttcccgtttttcccgatttggctacatgacatcaaccat atcagcaaaagtgatacgggtattatttttgccgctatttctctgttctcgctattattc caaccgctgtttggtctgctttctgacaaactcgggctgcgcaaatacctgctgtggatt attaccggcatgttagtgatgtttgcgccgttctttatttttatcttcgggccactgtta caatacaacattttagtaggatcgattgttggtggtatttatctaggcttttgttttaac gccggtgcgccagcagtagaggcatttattgagaaagtcagccgtcgcagtaatttcgaa tttggtcgcgcgcggatgtttggctgtgttggctgggcgctgtgtgcctcgattgtcggc atcatgttcaccatcaataatcagtttgttttctggctgggctctggctgtgcactcatc ctcgccgttttactctttttcgccaaaacggatgcgccctcttctgccacggttgccaat gcggtaggtgccaaccattcggcatttagccttaagctggcactggaactgttcagacag ccaaaactgtggtttttgtcactgtatgttattggcgtttcctgcacctacgatgttttt gaccaacagtttgctaatttctttacttcgttctttgctaccggtgaacagggtacgcgg gtatttggctacgtaacgacaatgggcgaattacttaacgcctcgattatgttctttgcg ccactgatcattaatcgcatcggtgggaaaaacgccctgctgctggctggcactattatg tctgtacgtattattggctcatcgttcgccacctcagcgctggaagtggttattctgaaa acgctgcatatgtttgaagtaccgttcctgctggtgggctgctttaaatatattaccagc cagtttgaagtgcgtttttcagcgacgatttatctggtctgtttctgcttctttaagcaa ctggcgatgatttttatgtctgtactggcgggcaatatgtatgaaagcatcggtttccag 
ggcgcttatctggtgctgggtctggtggcgctgggcttcaccttaatttccgtgttcacg cttagcggccccggcccgctttccctgctgcgtcgtcaggtgaatgaagtcgcttaagca atcaatgtcggatgcggcgcgacgcttatccgaccaacatatcataacggagtgatcgca ttgaacatgccaatgaccgaaagaataagagcaggcaagctatttaccgatatgtgcgaa ggcttaccggaaaaaaga

A merged sequence is written out. Where there has been a mismatch between the two sequences, that region of the merged sequence is written in uppercase, and the sequence whose mismatch region is furthest from the edges of the sequence is used in the merged sequence. The name and description of the first input sequence are used for the name and description of the output sequence.

A report of the merger is written out.

Data files

None.

Notes

It should be possible to merge sequences that are megabytes long.

Compare this with the program merger, which does a more accurate alignment of more divergent sequences using the Needleman and Wunsch algorithm but which uses much more memory.

megamerger takes two overlapping sequences and merges them into one sequence. It could thus be regarded as the opposite of what splitter does.

References

None.

Warnings

The sequences should ideally be identical in their region of overlap. If there are any mismatches between the two sequences then megamerger will still create a merged sequence, but you should check that the result is what you require.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name   Description
cons           Create a consensus sequence from a multiple alignment
consambig      Create an ambiguous consensus sequence from a multiple alignment
merger         Merge two overlapping sequences

A graphical dotplot of the matches used in this merge can be displayed using the program dotpath.
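
As an illustration of the word-match approach described under Algorithm, the following is a minimal Python sketch. It is not the EMBOSS implementation. The function names (word_matches, reduce_matches, merge_single_overlap) are hypothetical, overlap is tested in either sequence, and the rule for choosing the leading and trailing parts (take the longer overhang) is an assumption chosen to mirror the sample report above. Mismatch handling and the -prefer option are not modelled.

```python
from collections import defaultdict


def word_matches(a, b, wordsize=20):
    """Find maximal ungapped exact matches of at least `wordsize` bases.

    Returns a list of (start_a, start_b, length) tuples, 0-based.
    """
    # Index every word of sequence b by its start position.
    index = defaultdict(list)
    for j in range(len(b) - wordsize + 1):
        index[b[j:j + wordsize]].append(j)

    matches = []
    for i in range(len(a) - wordsize + 1):
        for j in index.get(a[i:i + wordsize], ()):
            # A seed whose preceding bases also match lies inside a match
            # that was already started further left on the same diagonal.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = wordsize
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i, j, length))
    return matches


def overlaps(m, k):
    """True if two matches share positions in either sequence."""
    ma, mb, ml = m
    ka, kb, kl = k
    return (ma < ka + kl and ka < ma + ml) or (mb < kb + kl and kb < mb + ml)


def reduce_matches(matches):
    """Keep the largest matches, dropping smaller ones that overlap them."""
    kept = []
    for m in sorted(matches, key=lambda m: m[2], reverse=True):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept)


def merge_single_overlap(a, b, wordsize=20):
    """Merge two sequences that share exactly one matching region.

    Assumption: the sequence with the longer overhang supplies the part
    before the match, and likewise for the part after it.
    """
    kept = reduce_matches(word_matches(a, b, wordsize))
    if len(kept) != 1:
        raise ValueError("sketch handles exactly one matching region")
    i, j, n = kept[0]
    head = a[:i] if i >= j else b[:j]
    rest_a, rest_b = a[i + n:], b[j + n:]
    tail = rest_a if len(rest_a) >= len(rest_b) else rest_b
    return head + a[i:i + n] + tail
```

Calling merge_single_overlap(a, b) on two sequences where the end of b matches the start of a yields the leading part of b, the matched region, and the trailing part of a, which is the same layout as the "initial sequence", "Matching region" and "final sequence" lines in the sample report.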

Author(s)

Gary Williams formerly at:
MRC Rosalind Franklin Centre for Genomics Research
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

Please report all bugs to the EMBOSS bug team (emboss-bug (c) not to the original author.

History

Written Aug 2000 by Gary Williams.

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None