Protein Structure Tokenization
Bridging the gap between 3D atomic geometry and the sequence-based, language-like representation of biological data through discrete codebooks.
01 Theoretical Foundations
The Paradigm Shift
"Protein structure tokenization is the process of compressing continuous 3D atomic coordinates into discrete, quantifiable elements (tokens)... enabling LLMs to ingest structural data analogously to natural language."
Discrete Tokenization
Mapping continuous 3D coordinates to entries of a discrete codebook using vector-quantized variational autoencoders (VQ-VAEs) or lookup-free quantization (LFQ).
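As a minimal sketch of the two quantization styles named above: a VQ step assigns each encoder output to its nearest codebook entry, while LFQ skips the codebook lookup and binarizes each latent dimension by sign. The encoder outputs, the toy codebook, and the function names below are illustrative assumptions, not taken from any specific model.

```python
import math

def quantize_vq(vectors, codebook):
    """Assign each vector the index of its nearest codebook entry (Euclidean distance)."""
    tokens = []
    for v in vectors:
        dists = [math.dist(v, c) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

def quantize_lfq(vector):
    """Lookup-free quantization: one bit per dimension (sign), read as an integer token."""
    bits = [1 if x > 0 else 0 for x in vector]
    return sum(b << i for i, b in enumerate(bits))

# Hypothetical 2D encoder outputs and a toy 3-entry codebook (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
latents = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.7)]

tokens = quantize_vq(latents, codebook)       # one token index per latent vector
lfq_token = quantize_lfq((0.3, -0.2, 0.5))    # bits (1, 0, 1) read as an integer
```

With VQ the vocabulary size equals the number of codebook entries; with LFQ a latent of d dimensions yields an implicit vocabulary of 2^d tokens without storing any embeddings.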
Hierarchical Scale
Encoding global fold and topology first, then layering on localized atomic detail.
Tail Dropout Intervention
Selective removal of fine-grained tokens representing geometric noise to prioritize robust global structural designability during inference.
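A minimal sketch of the idea, assuming the tokens for a structure are ordered from coarse (global fold) to fine (local detail), so that the noisy fine-grained tokens sit at the tail. The function names and the optional mask token are hypothetical; the source does not specify the exact mechanism.

```python
import random

def tail_dropout(tokens, keep, mask_token=None):
    """Keep the first `keep` (coarse) tokens; drop the tail, or mask it if mask_token is given."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    head = list(tokens[:keep])
    if mask_token is None:
        return head
    return head + [mask_token] * max(0, len(tokens) - keep)

def random_tail_dropout(tokens, rng, min_keep=1):
    """Pick a random cut point, e.g. to expose a decoder to truncated tails (an assumption)."""
    if not tokens:
        return []
    keep = rng.randint(min_keep, len(tokens))  # inclusive on both ends
    return tail_dropout(tokens, keep)

# Assumed coarse-to-fine ordering: early tokens carry the fold, late ones fine detail.
structure_tokens = [5, 9, 2, 7, 1]
coarse_only = tail_dropout(structure_tokens, keep=2)
masked = tail_dropout(structure_tokens, keep=2, mask_token=-1)
sampled = random_tail_dropout(structure_tokens, random.Random(0))
```

Masking keeps the sequence length fixed, which can be convenient for batched decoding; plain truncation shortens the sequence instead.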
Unified Multimodality
Aligning DNA, RNA, and protein structures in a single vocabulary to enable native causal reasoning across biology’s central dogma.
02 Methodologies
Geometric Byte Pair Encoding
Iterative clustering of "Geo-Pairs" for resolution-controllable vocabularies.
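The merge loop behind byte pair encoding can be sketched as follows. This is a simplification: it runs the standard BPE merge over labels that are assumed to be discretized already, standing in for the geometric clustering of "Geo-Pairs", which operates on continuous geometry. The H/E/C labels and the function names are illustrative assumptions.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Return ((a, b), count) for the most frequent adjacent pair, or None if there is none."""
    counts = Counter()
    for s in seqs:
        counts.update(zip(s, s[1:]))
    return counts.most_common(1)[0] if counts else None

def merge_pair(seq, pair, new_symbol):
    """Replace non-overlapping occurrences of `pair` in `seq`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(seqs, num_merges):
    """Greedily merge the most frequent pair; num_merges sets the vocabulary resolution."""
    seqs = [list(s) for s in seqs]
    merges = []
    for _ in range(num_merges):
        best = most_frequent_pair(seqs)
        if best is None or best[1] < 2:
            break
        pair = best[0]
        merges.append(pair)
        seqs = [merge_pair(s, pair, pair[0] + pair[1]) for s in seqs]
    return merges, seqs

# Toy per-residue labels (assumption): H = helix, E = strand, C = coil.
merges, encoded = learn_merges(["HHHHCCEEEE", "CCHHHHCC"], num_merges=2)
```

Raising `num_merges` grows the vocabulary and shortens the encoded sequences, which is the sense in which the resolution is controllable in this sketch.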
Kanzi: Flow Autoencoders
Replacing SE(3)-equivariant attention with a flow-based diffusion decoder for mapping tokens back to 3D coordinates.
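To make the flow-based decoding idea concrete, here is a minimal sketch assuming a straight-line interpolation path between noise and target coordinates, a common choice in flow matching. Nothing here is taken from the Kanzi implementation: in a real model a network conditioned on the tokens would predict the velocity, whereas `oracle` below is a hypothetical stand-in that returns the exact target velocity.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t = 0) to data x1 (t = 1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity along the straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_decode(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Illustrative values: one atom's noisy start and its target coordinates.
noise = [0.5, -1.0, 2.0]
coords = [1.0, 0.0, 0.0]

def oracle(x, t):
    # Hypothetical stand-in for a learned, token-conditioned velocity network.
    return target_velocity(noise, coords)

decoded = euler_decode(noise, oracle, steps=10)
midpoint = interpolate(noise, coords, 0.5)
```

Because the decoder only integrates a velocity field over raw coordinates, it does not need symmetry-aware attention layers by construction; whether it learns the relevant symmetries from data is an empirical question.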
One Tokenizer Architecture
Mapping DNA/RNA and structural boundaries into a single shared discrete space.
03 Critical Bottlenecks
Semantic Redundancy
Multiple distinct tokens map to identical local geometries, which confuses the learning objective.
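One simple way to surface this kind of redundancy is to decode every token and flag pairs whose decoded geometries are closer than a tolerance. The sketch below assumes each token decodes to a small numeric descriptor; the descriptors, the tolerance, and the function name are illustrative, and plain Euclidean distance ignores the periodicity that real angular descriptors would have.

```python
import math

def redundant_pairs(decoded, tol):
    """Return index pairs (i, j), i < j, whose decoded descriptors differ by less than tol."""
    pairs = []
    for i in range(len(decoded)):
        for j in range(i + 1, len(decoded)):
            if math.dist(decoded[i], decoded[j]) < tol:
                pairs.append((i, j))
    return pairs

# Hypothetical decoded descriptor per token index (toy 2D values, not real data).
decoded = [(-60.0, -45.0), (-61.0, -44.0), (-120.0, 130.0), (-59.5, -45.5)]
pairs = redundant_pairs(decoded, tol=2.0)
```

Tokens 0, 1, and 3 describe nearly the same geometry here, so a model trained with a cross-entropy objective would be penalized for predicting one in place of another even though the decoded structures barely differ.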
Resolution vs. Sequence Length
High-resolution detail produces long token sequences that can exceed transformer context windows.
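The trade-off is simple arithmetic: the token count scales with both protein length and tokens per residue. The numbers below are illustrative only and do not describe any particular tokenizer or model.

```python
def token_budget(num_residues, tokens_per_residue, context_window):
    """Return (total tokens, whether the sequence fits in the context window)."""
    total = num_residues * tokens_per_residue
    return total, total <= context_window

# Illustrative numbers: a 1,000-residue protein and a 4,096-token context window.
coarse = token_budget(1000, 1, 4096)  # one token per residue
fine = token_budget(1000, 8, 4096)    # eight tokens per residue
```

At one token per residue the protein fits comfortably; at eight tokens per residue the same protein needs 8,000 tokens and no longer fits.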
Multimodal Misalignment
Separate encoders often fail to interact, leading to a "modality gap" in latent representations.
Core Research Questions
How can noisy, multi-scale backbones be discretized without compromising global constraints?
Which algorithms let models ignore geometric noise during generative structural design?
Can flow matching replace symmetry-invariant attention for decoding tokens back to 3D space?
How can RNA/DNA sequences and 3D geometries be embedded directly into a shared space with no modality gap?
04 Applications
Conformational Ensemble Generation
Using "synonym swap" methods within VQ-VAE vocabularies to capture protein flexibility.
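A minimal sketch of the swap step, assuming a precomputed table of "synonyms" (tokens that decode to nearly the same local geometry). The table, the swap probability, and the function name are hypothetical; each perturbed token sequence would then be decoded into one member of the ensemble.

```python
import random

def synonym_swap(tokens, synonyms, swap_prob, rng):
    """Replace each token with a random synonym with probability swap_prob."""
    out = []
    for t in tokens:
        options = synonyms.get(t, [])
        if options and rng.random() < swap_prob:
            out.append(rng.choice(options))
        else:
            out.append(t)
    return out

# Hypothetical synonym table: token 2 has no near-equivalent and is never swapped.
synonyms = {0: [1, 3], 1: [0], 3: [0]}
base = [0, 2, 1, 0, 3]
rng = random.Random(0)

# Four perturbed token sequences; each would be decoded into one conformation.
ensemble = [synonym_swap(base, synonyms, 0.5, rng) for _ in range(4)]
```

Raising `swap_prob` widens the sampled ensemble at the cost of drifting further from the original structure.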
Zero-Shot Multi-Modal Engineering
Language-conditioned biological synthesis, generating sequences from functional annotations.
Targeted Protein Maturation
Protein shrinking and affinity maturation through direct latent intervention across resolution layers.
Future Directions
The trajectory points toward Universal Biological Foundation Models. We anticipate dynamic tokenization systems that adjust their granularity automatically based on the task.