TBD
The data is in FASTA format, and we use Biopython to process it. The working example, pdb_seqres.txt, was downloaded from [2].
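
As a rough illustration of the loading step, the sketch below reads FASTA records with Biopython's `SeqIO.parse`. The two-record `FASTA_TEXT` string and the `read_fasta` helper are hypothetical stand-ins (the real input would be an open handle on pdb_seqres.txt), and the small fallback parser is only an assumption added so the sketch still runs where Biopython is not installed.

```python
from io import StringIO

# Hypothetical two-record FASTA text standing in for pdb_seqres.txt.
FASTA_TEXT = """>seq_1 hypothetical record
ATGCGTAC
GGTA
>seq_2 hypothetical record
TTGACA
"""


def read_fasta(handle):
    """Return a list of (record_id, sequence) pairs from a FASTA handle."""
    try:
        from Bio import SeqIO  # Biopython, if installed
        return [(rec.id, str(rec.seq)) for rec in SeqIO.parse(handle, "fasta")]
    except ImportError:
        # Minimal fallback parser (an assumption for this sketch only).
        records, current_id, chunks = [], None, []
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if current_id is not None:
                    records.append((current_id, "".join(chunks)))
                current_id, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
        if current_id is not None:
            records.append((current_id, "".join(chunks)))
        return records


records = read_fasta(StringIO(FASTA_TEXT))
for record_id, sequence in records:
    print(record_id, len(sequence))
```

To read the real file, pass `open("pdb_seqres.txt")` instead of the `StringIO` object.
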
Showcasing three approaches to encoding sequence data:
- Ordinal encoding of the DNA sequence
- One-hot encoding of the DNA sequence
- DNA sequence as a "language", known as k-mer counting
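
The three encodings listed above could be sketched in plain Python roughly as follows. The function names, the integer codes (A=1, C=2, G=3, T=4, unknown=0), and the default `k=3` are assumptions chosen for illustration, not necessarily the values used in this project or in [1].

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed DNA alphabet; any other symbol is treated as unknown


def ordinal_encode(seq):
    """Map each base to an integer (A=1, C=2, G=3, T=4); unknown symbols map to 0."""
    codes = {base: i + 1 for i, base in enumerate(ALPHABET)}
    return [codes.get(base, 0) for base in seq.upper()]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator vector; unknown symbols map to all zeros."""
    return [[int(base == ref) for ref in ALPHABET] for base in seq.upper()]


def kmer_counts(seq, k=3):
    """Count overlapping substrings of length k (the 'words' of the sequence)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


seq = "ATGCATGN"  # hypothetical example; N marks an unknown base
print(ordinal_encode(seq))      # [1, 4, 3, 2, 1, 4, 3, 0]
print(one_hot_encode(seq)[:2])  # [[1, 0, 0, 0], [0, 0, 0, 1]]
print(kmer_counts(seq, k=3))
```

Ordinal and one-hot encodings give one value or vector per base, so their length depends on the sequence length. k-mer counts summarize a sequence of any length over a fixed vocabulary, which is what makes the "language" view convenient as model input.
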
[1] - Demystify DNA sequencing with ML and Python

[2] - RCSB Protein Data Bank