NLP & LLMs for Genomic Structural Variant Detection

Treating DNA as a specialized language to detect structural variants

We apply Natural Language Processing and Large Language Model techniques to genomic data analysis under the supervision of Prof. Vwani P. Roychowdhury at UCLA. The central question: can transformer-based architectures recognize structural variants in DNA the way they recognize structure in natural language?

Directions

  • One-shot structural-variant detection with transformer architectures that treat DNA sequences as a specialized form of language.
  • Reference-Free DNA Embedding (RDE) models trained with techniques borrowed from NLP representation learning, such as contrastive learning, so that sequence embeddings carry enough information to identify variants directly.
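
To make the two directions above concrete, here is a minimal sketch of the underlying ideas: splitting DNA into overlapping k-mer "words", and scoring embeddings with a contrastive objective. Everything in it is an illustrative assumption rather than the project's actual method: the names (`kmer_tokens`, `embed`, `info_nce`), the choice of k = 3, the temperature, and the use of InfoNCE as the contrastive loss are all hypothetical. The `embed` function is a simple k-mer count vector standing in for a learned transformer encoder.

```python
import math
from collections import Counter
from itertools import product

K = 3  # assumed k-mer size; the source does not specify a tokenization
VOCAB = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(VOCAB)}


def kmer_tokens(seq, k=K):
    """Split a DNA string into overlapping k-mers, the 'words' of the sequence."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed(seq):
    """Stand-in for a learned encoder: an L2-normalised k-mer count vector."""
    counts = Counter(t for t in kmer_tokens(seq) if t in INDEX)
    vec = [0.0] * len(VOCAB)
    for kmer, count in counts.items():
        vec[INDEX[kmer]] = float(count)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(u, v):
    """Dot product of two unit vectors, i.e. their cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: low when the positive is the closest item."""
    pos = cosine(anchor, positive) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - pos


# Toy usage: a near-identical sequence as the positive, an unrelated one as the negative.
anchor = embed("ACGTACGTACGT")
positive = embed("ACGTACGTACGA")
negative = embed("TTTTGGGGTTTT")
loss = info_nce(anchor, positive, [negative])
```

In a trained model, minimizing this kind of loss pulls embeddings of related sequences together and pushes unrelated ones apart, which is what would let a variant show up as a shift in embedding space without aligning reads to a reference.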

This work sits at the intersection of NLP, sequence modeling, and computational biology, with potential downstream applications in precision diagnostics.