bwa mem requires sorted fastq files

While running WGS alignments, bwa mem threw a new error:

[M::mem_pestat] low and high boundaries for proper pairs: (33, 376)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[mem_sam_pe] paired reads have different names: "CL100098912L1C001R001_20", "CL100098912L1C001R001_9"

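This seemed strange, as other samples had run through without any such complaint. In the end, the cause is that for paired-end data bwa mem requires the two fastq files to list their reads in the same order, so that the n-th record of R1 and the n-th record of R2 belong to the same pair. That is usually the case. Well, usually…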
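
To confirm that the read order really is the problem before re-pairing anything, the two files can be compared record by record. The following is a minimal Python sketch, not part of bwa. The function names are made up for illustration, and it assumes well-formed four-line fastq records and that a trailing /1 or /2 on the read name should be ignored when comparing.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a plain or gzip-compressed fastq file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def read_name(header):
    """Read name from a header line: no '@', no comment, no /1 or /2 suffix."""
    if not header:
        return None  # one file ended before the other
    fields = header[1:].split()
    name = fields[0] if fields else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_mismatch(r1_path, r2_path):
    """Return (record_number, name_in_r1, name_in_r2) for the first pair of
    records whose names differ, or None if both files agree throughout."""
    with open_fastq(r1_path) as r1, open_fastq(r2_path) as r2:
        for i, (h1, h2) in enumerate(zip_longest(r1, r2)):
            if i % 4:  # only every 4th line is a header
                continue
            n1, n2 = read_name(h1), read_name(h2)
            if n1 != n2:
                return i // 4 + 1, n1, n2
    return None
```

Calling first_mismatch("R1.fastq", "R2.fastq") returns None for correctly paired files and otherwise the 1-based record number together with the two differing names.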

If it is not, the files have to be re-paired first. The tool I found best suited for this task is fastq-pair. From a quick look at the code, my understanding is that it loads the first read file into memory (probably just the IDs and file positions), then iterates through the second read file and looks up the matching mate for each read. The memory requirements are still high, and adjusting -t, which sets the size of the hash table holding those entries, is essential for performance. Setting -t to roughly the number of reads in R1 makes sense; that number is the output of wc -l R1.fastq divided by 4, since each fastq record spans four lines. A disadvantage is that fastq-pair requires the fastq files to be uncompressed: it relies on random file access, which gets trickier when the file is compressed.
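
The index-then-stream idea described above can be sketched in a few lines of Python. This is an illustration of the approach as I understand it, not fastq-pair's actual code: the function names are hypothetical, a plain dictionary stands in for the tool's hash table, unpaired reads are simply dropped, and well-formed four-line records are assumed. It also shows why uncompressed input matters, since each mate is fetched by seeking to a stored byte offset.

```python
def read_id(header):
    """Read name from a header line (bytes), without '@', comment or /1 /2."""
    name = header[1:].split()[0]
    if name.endswith((b"/1", b"/2")):
        name = name[:-2]
    return name


def index_fastq(path):
    """Map each read name to the byte offset of its record."""
    offsets = {}
    with open(path, "rb") as fh:
        while True:
            pos = fh.tell()
            header = fh.readline()
            if not header:
                break
            for _ in range(3):  # skip sequence, '+', and quality lines
                fh.readline()
            offsets[read_id(header)] = pos
    return offsets


def pair_fastq(r1_path, r2_path, out1_path, out2_path):
    """Write reads present in both files, in R2 order; return the pair count."""
    offsets = index_fastq(r1_path)
    pairs = 0
    with open(r1_path, "rb") as r1, open(r2_path, "rb") as r2, \
            open(out1_path, "wb") as out1, open(out2_path, "wb") as out2:
        while True:
            record2 = [r2.readline() for _ in range(4)]
            if not record2[0]:
                break
            pos = offsets.get(read_id(record2[0]))
            if pos is None:
                continue  # no mate in R1: a singleton, dropped in this sketch
            r1.seek(pos)  # random access into R1
            out1.writelines(r1.readline() for _ in range(4))
            out2.writelines(record2)
            pairs += 1
    return pairs
```

The length of the dictionary returned by index_fastq is the number of reads in R1, the same figure that wc -l divided by 4 gives.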

Another program I saw but didn’t test is BBTools, which appears to be Java-based. Here are links to both programs:

https://github.com/linsalrob/fastq-pair