Discussion about this post

Green Fields

Oh, that is very, very interesting. I guess neither Wu nor anyone else involved in the 2020 paper is providing any answers. Here's little old me thinking that one should be able to replicate the results in published papers.

To me, publishing three different sequences should end it all right there.

Thanks for doing this work and sharing. It's big stuff.

henjin

I have now figured out a less hacky way to show the positions in an alignment where the sequences do not all have the same value. It also works with three or more sequences, so for example this shows all the differences between the three versions of Wuhan-Hu-1:

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.'{1,2,3}|mafft ->temp.aln

Rscript -e 'm=t(as.matrix(Biostrings::readDNAStringSet("temp.aln")));pick=which(rowSums(m!=m[,1])>0);write.table(`rownames<-`(m[pick,],pick),sep=";",quote=F,col.names=NA)'

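Or the same in gawk (the nested arrays like a[i][NR] are a gawk extension, so the second command needs gawk rather than another awk implementation):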

awk 'NR>1{gsub("\t"," ")sub("\n","\t")gsub("\n","");print}' RS=\> temp.aln|awk -F\\t '{printf";%s",$1;l=length($2);for(i=1;i<=l;i++){x=substr($2,i,1);a[i][NR]=x;b[i][x]}}END{print"";for(i in b)if(length(b[i])>1){o=i;for(j=1;j<=NR;j++)o=o";"a[i][j];print o}}'
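For anyone who finds the one-liners hard to read, here is a minimal Python sketch of what I understand them to do: read an aligned FASTA and print every column where the sequences do not all share the same character. The function names `read_fasta` and `variable_positions` and the toy alignment are made up for illustration; they are not part of the commands above and the toy data is not real Wuhan-Hu-1 sequence.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    for chunk in text.split(">")[1:]:
        lines = chunk.splitlines()
        # First line is the header; the rest are wrapped sequence lines.
        records.append((lines[0].strip(), "".join(l.strip() for l in lines[1:])))
    return records


def variable_positions(records):
    """Return (1-based position, [char per sequence]) for every alignment
    column in which the sequences do not all share the same character."""
    seqs = [seq for _, seq in records]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    return [(i, list(col)) for i, col in enumerate(zip(*seqs), start=1)
            if len(set(col)) > 1]


# Toy alignment, invented for illustration (hypothetical data).
toy = """>v1
ACGT-A
CG
>v2
ACGTTA
CG
>v3
ACCTTA
CG
"""

records = read_fasta(toy)
diffs = variable_positions(records)

# Same layout as the one-liners: a header row of sequence names,
# then one semicolon-separated row per differing position.
print(";" + ";".join(name for name, _ in records))
for pos, col in diffs:
    print(";".join([str(pos)] + col))
```

To run it on the real alignment, replace `toy` with the contents of temp.aln, for example `read_fasta(open("temp.aln").read())`. The comparison is case-sensitive and treats gap characters like any other value, so a gap against a base counts as a difference.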
