13 Comments
User's avatar
⭠ Return to thread
Paul Devine's avatar

Yes , tomorrow I will get the datasets , but they where ebola deep metagenomic sra files 50-75 bp reads , yes I think the bacteria contigs are legit , I don't see too much of a problem with assembly and the reason we know these are bacterial sequences is because they have been properly isolated and characterised , so we definitely have a reference to say this sequence belongs to this organism , the difference with viruses is that we have never found one to correctly isolate( in the real meaning of the word) to say this sequence 100% comes from this organism , they are bypassing all isolation steps and going direct to sequencing assuming its there , designing primers to find what they are expecting and sort of working backwards to create the proof with really out of spec PCR , all with no control experiments , every step and every variable should be controlled , otherwise it's not science . I mean had they off looked in this dataset for ebola or hiv and designed primers for that , they would have had a better match than 89% . With 30-40 SRA files for sars-cov2 that I have run the longest contigs i've seen are 3.5k , there may be better examples but I haven't seen them yet , these where nanopore sequencing (1000-1400 bp reads) which it looked to me like some of these sequences did actually exist , but we have no idea what there origin is , I can claim they are human because at this point they definitely did come from a human swab , but if we are going to claim they are bacterial , viral or extra terrestrial then we need to find that organism first and then make sure the sequence definaltey does come from it , is repeatable and is controlled . The biggest problem I can see is with the alignment steps , this is cherry picking to create something you never had , then with the mpileup , sort of overlaying multiple incorrect sequences to fill in missing or incorrect nucleotides to fill in the blanks to match the reference . Megahit did create a 29802 sequence , but if you put them together they aren't that similar , but if we do a basic local alignment search then they align pretty well , the question should be , how did they go from 30.4k not being repeatable to the actual 29802 that was found to MN908947.3 , where did this come from ?

Expand full comment
henjin's avatar

Maybe the 30,473-base sequence of MN908947.1 can't be reproduced because human reads were removed from the files uploaded to the SRA. Or maybe those reads that consist of only N letters were originally the human reads.

The publication date of the raw reads is listed as 2020-01-27 at the SRA, but at that time MN908947.2 had already been published so they had probably noticed the error they made in MN908947.1 (https://www.ncbi.nlm.nih.gov/sra/SRR10971381).

When I tried aligning the paired reads against MN908947.1, there was no read whose starting position was lower than 29,830:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/081/SRR10971381/SRR10971381_{1,2}.fastq.gz

curl -so MN908947.1.fa 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.1'

bwa index MN908947.fa;bwa mem -t 4 MN908947.fa SRR10971381_{1,2}.fastq.gz|awk '/^[^@]/{if($4>x)x=$4}END{print x}'

Expand full comment
US Mortality's avatar

paul & mongol, can we move this to a slack or telegram chat group?

We should discuss all this, in a better format, these comments are kind of getting out of hand ;)

Please let me know your preference., and i'll set it up..

Expand full comment
henjin's avatar

I was thinking earlier that maybe someone could start a Discord for COVID conspiracytard coders, or people who are writing scripts about COVID statistics or bioinformatics.

I've been posting some of my scripts on two SARS Discords but I don't think anyone there has even tried out to run my code.

Discord would be my preference, or oldschool forums like XenForo or Discourse. Discord is pretty good for posting code and images, and even if you have code that's longer than 2000 characters, you can post it as a text file that gets displayed with an expandable preview block, like in this example: https://i.ibb.co/M8gjF7F/20230405122308.png.

Expand full comment
US Mortality's avatar

Hi Mongol, would you like to join: https://discord.gg/9Weas6Ev

Thanks, would be great to get you in there as well :)

Expand full comment
Paul Devine's avatar

That looks pretty good , even if we look at the same data and come to different conclusions it's all good , youve cut a boatload of my messing around with seqkit and I struggled with awk, so I would love to pick your brains on this.

Expand full comment
US Mortality's avatar

I've created a discord, please join: https://discord.gg/9Weas6Ev

Expand full comment
Paul Devine's avatar

Invite expired , unfortunately!

Expand full comment
Kimmo Hellström's avatar

Hi, is it possible to get an invite to this discord?

I don't know if I have much to say, but I'd like to understand these things better :)

Expand full comment
Paul Devine's avatar

yes , agree . Substack is hard to paste up charts ,commands and it have any meaning to the reader etc . Telegram would be my preference.

Expand full comment
Paul Devine's avatar

You might think I am pulling your leg here , I had 12 sra files that I went through that where 98% +

installed lxd for clustering and its wiped my main PC , tried testdisk to recover the deleted files , but I havent found them yet , but I will run through similar searches again . First one I tried was https://www.ncbi.nlm.nih.gov/sra/SRR3647349 this came out with a 2% mismatch overall .

here is the end including the first error (N) Ngagtgtacagtgaacaatgctagggagagctgcctatatggaagagccctaatgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttaggagaatgacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Most of the errors are in the first 10 bases , quality controlled lower than 20 is set to N ( seqtk)

mapped with bbmap , no reformatting , consensus sequence generated with bwa. I will keep looking , there were better ones . Reads are chopped into pieces as small as 15bp up to 127 , which is why I don't think much of alignment.

Expand full comment
Paul Devine's avatar

The problem is that there are still millions of humans reads in the SRA file , here are the contigs by abundance ( sorry it doesn't paste very well)

k141_5448 913 78.694 CP024724.1 Prevotella intermedia strain KCOM 2837 chromosome 2, complete sequence 91,730 919 0,000E+00 1.264

k141_24026 850 134.614 CP085934.1 Prevotella copri DSM 18205 strain FDAARGOS_1573 plasmid unnamed2, complete sequence 97,291 406 0,000E+00 688

k141_4059 812 132.074 CP072363.1 Prevotella jejuni strain F0697 chromosome 1, complete sequence 98,684 532 0,000E+00 942

k141_525 754 217.291 CP072333.1 Porphyromonas sp, oral taxon 275 strain W7780 chromosome, complete genome 98,802 501 0,000E+00 891

k141_8586 656 762.405 CP016205.1 Prevotella scopos JCM 17725 strain W2052 chromosome 2 genome 98,176 658 0,000E+00 1.147

k141_19969 639 157.783 MT242596.1 Homo sapiens isolate DH001 mitochondrion, complete genome 100,000 477 0,000E+00 881

k141_19969 639 157.783 OL521838.1 Homo sapiens haplogroup I3 mitochondrion, complete genome 100,000 477 0,000E+00 881

k141_19969 639 157.783 OK104093.1 Homo sapiens haplogroup H4a1a1a mitochondrion, complete genome 100,000 477 0,000E+00 881

k141_19969 639 157.783 OK266950.1 Homo sapiens haplogroup H3i mitochondrion, complete genome 100,000 477 0,000E+00 881

k141_19969 639 157.783 OK239657.1 Homo sapiens haplogroup H1b1d mitochondrion, complete genome 100,000 477 0,000E+00 881

k141_534 620 199.625 CP072361.1 Prevotella melaninogenica strain F0054 chromosome 1, complete sequence 98,387 620 0,000E+00 1.090

k141_7697 569 360.511 CP023863.1 Prevotella jejuni strain CD3:33 chromosome I, complete sequence 96,473 567 0,000E+00 935

k141_11094 532 62.843 CP072331.1 Prevotella veroralis strain F0319 chromosome 2, complete sequence 97,543 529 0,000E+00 904

k141_17668 508 202.095 CP085941.1 Prevotella melaninogenica strain FDAARGOS_1567 chromosome 2, complete sequence 99,018 509 0,000E+00 911

k141_11371 427 198.847 CP072360.1 Prevotella melaninogenica strain F0091 chromosome 1, complete sequence 99,532 427 0,000E+00 778

k141_9606 408 63.287 CP072330.1 Prevotella veroralis strain F0319 chromosome 1, complete sequence 98,529 408 0,000E+00 721

k141_25754 384 133.274 MN297237.1 Homo sapiens LHRI_LNC32,2 lncRNA gene, complete sequence 100,000 256 4,700E-129 473

k141_25754 384 133.274 MN297236.1 Homo sapiens LHRI_LNC32,1 lncRNA gene, complete sequence 100,000 256 4,700E-129 473

k141_25754 384 133.274 CP034492.1 Eukaryotic synthetic construct chromosome 14 100,000 256 4,700E-129 473

k141_25754 384 133.274 NG_050638.2 Homo sapiens ribosomal protein S29 (RPS29), RefSeqGene (LRG_1147) on chromosome 14 100,000 256 4,700E-129 473

When I remove the human sequences , these definitely do get removed , none of these reads remain . So there is no way the published SRA file can be the one which they used to create these genomes , none of the megahit results are repeatable , human reads remain , the amount of contigs they claim from both assemblers is over 10-12 times less , which points to filtering of data , lots of N's present which also points to filtering of data . Trinity 2.5.1 is so unrepeatable it is possible they got a length of 11k , I have done around 12 runs and the shortest i've gotten is 17k , so this version of this assembler isn't very good , same split SRA files , different results every time. But the tail end of mn908947.3 cannot be made without short read alignment . If you look at the amounts of primers used and there positions and CT , it is possible that this was just attaching smaller human rna sequences , the other problem is that if you look at the mapped reads in tablet , the amplification wasn't specific or equal , so it could well be that some of the sequences where created during the later stages of the PCR process , also why design primers for a genome of 29802 bp if the longest contig was 30.4 k? To me it looks like they knew what they were looking for and repeated an experiment to create it , added the poly a tail to create a bat story . All very uncontrolled .

Expand full comment