InstaDeep Introduces Nucleotide Transformer v3 (NTv3): A New Multi-Species Genomics Foundation Model
These articles are AI-generated summaries. Please check the original sources for full details.
InstaDeep has released Nucleotide Transformer v3 (NTv3), a multi-species genomics foundation model that processes 1 Mb genomic windows at single-nucleotide resolution. The model unifies representation learning, functional track prediction, genome annotation, and controllable sequence generation within a single architecture.
Current genomic models struggle to connect local genetic motifs with large-scale regulatory context across multiple organisms, hindering accurate predictions and design. Existing methods often lack the scale to capture long-range dependencies, leading to reduced predictive power and increased experimental validation costs.
Key Insights
- 9 trillion base pairs: NTv3 is pre-trained on 9 trillion base pairs drawn from the OpenGenome2 resource.
- U-Net architecture: Enables processing of very long genomic windows while maintaining single-base resolution.
- Masked diffusion language modeling: Allows for controllable sequence generation; generated sequences were validated in STARR-seq assays, with a reported 2x improvement in promoter specificity.
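To make the U-Net point concrete, here is a minimal sketch of how sequence length could shrink along an encoder and be restored by a decoder so that the output keeps one position per base. The window length (2**20 positions as a stand-in for 1 Mb), the number of stages, and the downsampling factor are assumptions chosen for illustration, not NTv3's published configuration, and `unet_lengths` is a hypothetical helper.

```python
def unet_lengths(window_len: int, num_stages: int, factor: int = 2):
    """Return the sequence lengths seen along a hypothetical U-Net.

    The encoder divides the length by `factor` at each stage; the decoder
    mirrors it, so the final output has the same length as the input.
    """
    if window_len % (factor ** num_stages) != 0:
        raise ValueError("window_len must be divisible by factor ** num_stages")
    encoder = [window_len // factor ** i for i in range(num_stages + 1)]
    decoder = encoder[::-1]  # upsampling path retraces the encoder in reverse
    return encoder, decoder


# Assumed values: 2**20 positions (about 1 Mb) and 7 downsampling stages.
encoder, decoder = unet_lengths(window_len=2**20, num_stages=7)
print("encoder lengths:", encoder)
print("bottleneck length:", encoder[-1])
print("output length equals input length:", decoder[-1] == encoder[0])
```

The design idea this illustrates is that the expensive long-range processing can run on the short bottleneck sequence, while the decoder returns predictions at the original per-base length.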
Working Example
# Conceptual sketch of character-level tokenization for a DNA sequence.
# Note: this does not load the real NTv3 tokenizer or model; an actual
# implementation requires both.
sequence = "ATGCGTAGCTAGCTAGCT"
tokens = list(sequence)  # Character-level tokenization: one token per base
# Add special tokens like <bbox>, <cls>, <mask>, etc. as needed
tokens.append("<bbox>")
tokens.append("<cls>")
print(tokens)
# Expected output (example): ['A', 'T', 'G', 'C', 'G', 'T', 'A', 'G', 'C', 'T', 'A', 'G', 'C', 'T', 'A', 'G', 'C', 'T', '<bbox>', '<cls>']
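Building on the same character-level tokens, the following is a rough sketch of the corruption step used in masked diffusion language modeling in general: a masking rate is sampled per example, that fraction of positions is replaced with a mask token, and a model would then be trained to recover the originals. This is not NTv3's actual training recipe; the `mask_tokens` helper, the uniform rate distribution, and the `<mask>` token string are assumptions for illustration.

```python
import random


def mask_tokens(tokens, mask_rate, rng, mask_token="<mask>"):
    """Replace a random subset of tokens with mask_token.

    Returns the corrupted token list and the sorted masked positions.
    """
    n_mask = round(mask_rate * len(tokens))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)  # copy so the original tokens stay intact
    for i in positions:
        corrupted[i] = mask_token
    return corrupted, positions


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = list("ATGCGTAGCTAGCTAGCT")

# Assumption: the masking rate is drawn uniformly per example instead of
# being fixed, which is what distinguishes this from fixed-rate masking.
mask_rate = rng.uniform(0.0, 1.0)
corrupted, positions = mask_tokens(tokens, mask_rate, rng)

print("mask rate:", mask_rate)
print("corrupted:", corrupted)
print("targets:", [(i, tokens[i]) for i in positions])
```

At generation time, the same idea runs in reverse: start from a heavily masked sequence and fill in masked positions over several steps, which is what makes conditioning on fixed, unmasked regions possible.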
Practical Applications
- Drug Discovery: Designing enhancers to improve gene expression for therapeutic targets.
- Pitfall: Relying on single-species models can lead to poor generalization and inaccurate predictions when applying findings to different organisms.