I tried to use ClustalW2 to align the genomes of the three different versions of Wuhan-Hu-1: `brew install clustalw-w;curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id='MN908947.{1,2,3}>sars.fa;clustalw sars.fa;cat sars.aln`. The 27 first bases were "CGGTGACGCATACAAAACATTCCCACC" in the first two …
The 27 first bases were "CGGTGACGCATACAAAACATTCCCACC" in the first two versions but "-----------ATTAAAGGTTT-----" the current version, where a dash indicates a deletion, and 6 bases out of the 12-base sequence in the middle were also changed.
Apart from that, the only difference between the three versions was that the last 598 bases of the first version were deleted in the second version, and in the third version a new 44-base sequence was added to the end: "AGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA". The sequence of multiple A bases at the end is known as a poly(A) tail: https://bioinformatics.stackexchange.com/questions/11227/why-does-the-sars-cov2-coronavirus-genome-end-in-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. (The accepted answer at Stack Exchange said that "for coronaviruses in particular, we know that the poly(A) tail is required for replication", but I don't know if the tail is absolutely required in all coronaviruses, because it was missing from 9 out of the 55 coronavirus genomes listed on this page: https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=11118.)
For some reason the ORF10 protein was also missing from the first and second versions in GenBank. See this thread: https://twitter.com/BillyBostickson/status/1268514460349546497. Replies to the thread pointed out that the last 20 nucleotides of the second version (which are also included in the first and third versions but not at the very end) are "TGTGATTTTAATAGCTTCTT", which is identical to a segment of human chromosome 1, and the extra 598 nucleotides after it in the first version also match human chromosome 1:
- "...or it is simply a repetitive region and yes "automatic machines" (such as HiSeq 3000 / HiSeq 4000 systems) do make that kind of errors in particular near the 3'- (and also 5'-) ends w/o correction. it always helps taking a closer look" (https://twitter.com/ChrisDeZPhD/status/1290064988892274694)
- "Thanks I did read it. Still, looks like an obvious reading error at the 3' end by hybridisation with contaminated sample. can happen on Illumina. There is a 20 bp overlap btw CoV-2 3'-end and a complementary DNA-seq from h-chr 1q31.1-31.3. 99% match in 616 bp from nt 40414-41029." (https://twitter.com/ChrisDeZPhD/status/1290218272705531905)
If you do a BLAST search for the last 618 nucleotides of the first version, it's 99.68% identical to "Human DNA sequence from clone RP11-173E24 on chromosome 1q31.1-31.3, complete sequence". You can simply copy positions 29856 to 30473 from here: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.1. Then paste it here and press the BLAST button: https://blast.ncbi.nlm.nih.gov/Blast.cgi.
So I think that solves the question of why the first version was so much longer than the current version, but I still don't know why the beginning of the sequence was also changed, or why they added the sequence of 44 bases to the end of the third version that was missing from the second version.
BTW I think your MEGAHIT pipeline is missing steps like doing quality control with FastQC, merging reads with SeqPrep, and trimming reads with Trimmomatic: https://research.csc.fi/metagenomics_quality.
I tried to use ClustalW2 to align the genomes of the three different versions of Wuhan-Hu-1: `brew install clustalw-w;curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id='MN908947.{1,2,3}>sars.fa;clustalw sars.fa;cat sars.aln`.
The 27 first bases were "CGGTGACGCATACAAAACATTCCCACC" in the first two versions but "-----------ATTAAAGGTTT-----" the current version, where a dash indicates a deletion, and 6 bases out of the 12-base sequence in the middle were also changed.
Apart from that, the only difference between the three versions was that the last 598 bases of the first version were deleted in the second version, and in the third version a new 44-base sequence was added to the end: "AGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA". The sequence of multiple A bases at the end is known as a poly(A) tail: https://bioinformatics.stackexchange.com/questions/11227/why-does-the-sars-cov2-coronavirus-genome-end-in-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. (The accepted answer at Stack Exchange said that "for coronaviruses in particular, we know that the poly(A) tail is required for replication", but I don't know if the tail is absolutely required in all coronaviruses, because it was missing from 9 out of the 55 coronavirus genomes listed on this page: https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=11118.)
For some reason the ORF10 protein was also missing from the first and second versions in GenBank. See this thread: https://twitter.com/BillyBostickson/status/1268514460349546497. Replies to the thread pointed out that the last 20 nucleotides of the second version (which are also included in the first and third versions but not at the very end) are "TGTGATTTTAATAGCTTCTT", which is identical to a segment of human chromosome 1, and the extra 598 nucleotides after it in the first version also match human chromosome 1:
- "...or it is simply a repetitive region and yes "automatic machines" (such as HiSeq 3000 / HiSeq 4000 systems) do make that kind of errors in particular near the 3'- (and also 5'-) ends w/o correction. it always helps taking a closer look" (https://twitter.com/ChrisDeZPhD/status/1290064988892274694)
- "No, not poly(A), rather a contamination from RP11-173E24 from human chromosome 1q31.1-31.3, but fraudulent? I don't think so." (https://twitter.com/ChrisDeZPhD/status/1290201009260654595)
- "Thanks I did read it. Still, looks like an obvious reading error at the 3' end by hybridisation with contaminated sample. can happen on Illumina. There is a 20 bp overlap btw CoV-2 3'-end and a complementary DNA-seq from h-chr 1q31.1-31.3. 99% match in 616 bp from nt 40414-41029." (https://twitter.com/ChrisDeZPhD/status/1290218272705531905)
If you do a BLAST search for the last 618 nucleotides of the first version, it's 99.68% identical to "Human DNA sequence from clone RP11-173E24 on chromosome 1q31.1-31.3, complete sequence". You can simply copy positions 29856 to 30473 from here: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.1. Then paste it here and press the BLAST button: https://blast.ncbi.nlm.nih.gov/Blast.cgi.
So I think that solves the question of why the first version was so much longer than the current version, but I still don't know why the beginning of the sequence was also changed, or why they added the sequence of 44 bases to the end of the third version that was missing from the second version.
BTW I think your MEGAHIT pipeline is missing steps like doing quality control with FastQC, merging reads with SeqPrep, and trimming reads with Trimmomatic: https://research.csc.fi/metagenomics_quality.