InstaDeep Introduces Nucleotide Transformer v3 (NTv3): A New Multi-Species Genomics Foundation Model


These articles are AI-generated summaries. Please check the original sources for full details.

Nucleotide Transformer v3 (NTv3): A New Multi-Species Genomics Foundation Model

InstaDeep has released Nucleotide Transformer v3 (NTv3), a multi-species genomics foundation model that processes 1 Mb genomic windows at single-nucleotide resolution. The model unifies representation learning, functional track prediction, genome annotation, and controllable sequence generation within a single architecture.
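To make the 1 Mb window size concrete, here is a minimal sketch of cutting a longer sequence into non-overlapping 1 Mb windows. The function name `iter_windows`, the non-overlapping tiling, and the 2.5 Mb contig length are assumptions for illustration only; the source does not describe how NTv3 inputs are actually windowed.

```python
from typing import Iterator, Tuple

WINDOW_BP = 1_000_000  # 1 Mb, the window size stated in the article


def iter_windows(seq_len: int, window: int = WINDOW_BP) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) coordinates of non-overlapping windows.

    Hypothetical helper: the tiling scheme is an assumption, not NTv3's API.
    """
    for start in range(0, seq_len, window):
        # The last window is clipped to the end of the sequence.
        yield start, min(start + window, seq_len)


# A hypothetical 2.5 Mb contig gives two full windows and one partial window.
windows = list(iter_windows(2_500_000))
print(windows)
# [(0, 1000000), (1000000, 2000000), (2000000, 2500000)]
```

In practice the final partial window would need padding or an overlapping stride; that choice is left open here.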

Current genomic models struggle to connect local genetic motifs with large-scale regulatory context across multiple organisms, which limits both prediction accuracy and sequence design. Existing methods often lack the scale to capture long-range dependencies, which reduces predictive power and raises experimental validation costs.

Key Insights

  • 9 trillion base pairs: NTv3 is pre-trained on 9 trillion base pairs drawn from the OpenGenome2 resource.
  • U-Net architecture: Enables processing of very long genomic windows while maintaining single-base resolution.
  • Masked diffusion language modeling: Enables controllable sequence generation, validated through STARR-seq assays that showed a 2x improvement in promoter specificity.
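The U-Net point above can be illustrated with simple length arithmetic: an encoder repeatedly shortens the sequence so that deeper layers see long-range context cheaply, and a decoder mirrors those steps to return to one position per base. The sketch below is not NTv3's actual configuration. The downsampling factor of 2, the 7 stages, and the 2**20-base input (roughly 1 Mb) are all assumed values chosen for illustration.

```python
from typing import List


def unet_lengths(seq_len: int, num_stages: int, factor: int = 2) -> List[int]:
    """Sequence length at each encoder stage, from the input to the bottleneck.

    Hypothetical helper: stage count and factor are assumptions, not NTv3's.
    """
    lengths = [seq_len]
    for _ in range(num_stages):
        if lengths[-1] % factor != 0:
            raise ValueError("length must be divisible by the downsampling factor")
        lengths.append(lengths[-1] // factor)
    return lengths


# Assumed setup: 2**20 bases (about 1 Mb) and 7 downsampling stages.
encoder = unet_lengths(2**20, num_stages=7)
decoder = encoder[::-1]  # the decoder upsamples back through the same lengths

print(encoder[0])   # 1048576 positions at the input (one per base)
print(encoder[-1])  # 8192 positions at the bottleneck
print(decoder[-1])  # 1048576 positions at the output (one per base again)
```

Under these assumed values, the most expensive long-range layers operate on 8,192 positions instead of over a million, while the output still has one position per nucleotide.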

Working Example

# Example of tokenizing a sequence at the character level
# Note: this is a conceptual example; an actual implementation
# requires loading the NTv3 tokenizer and model.

sequence = "ATGCGTAGCTAGCTAGCT"
tokens = list(sequence)  # Character-level tokenization
# Append special tokens such as <bbox>, <cls>, or <mask> as needed
# (token names and positions here are illustrative only)
tokens.append("<bbox>")
tokens.append("<cls>")

print(tokens)
# Expected output (example): ['A', 'T', 'G', 'C', 'G', 'T', 'A', 'G', 'C', 'T', 'A', 'G', 'C', 'T', 'A', 'G', 'C', 'T', '<bbox>', '<cls>']
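Building on the character-level tokens above, the following sketch shows the corruption step that masked diffusion language models typically train against: a masking rate is drawn at random, each token is replaced by a mask token with that probability, and the model learns to recover the hidden bases. This is a generic illustration, not NTv3's training code. The `mask_tokens` helper, the `<mask>` string, and the uniform rate are assumptions.

```python
import random
from typing import List, Tuple

MASK = "<mask>"  # assumed mask token string, for illustration only


def mask_tokens(
    tokens: List[str], mask_rate: float, rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Replace each token with MASK independently with probability mask_rate.

    Returns the corrupted sequence and the indices that were masked.
    """
    corrupted, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            positions.append(i)
        else:
            corrupted.append(tok)
    return corrupted, positions


rng = random.Random(0)  # seeded so repeated runs give the same result
tokens = list("ATGCGTAGCTAGCTAGCT")

# Draw one masking rate for this sequence, then corrupt it at that rate.
mask_rate = rng.random()
corrupted, positions = mask_tokens(tokens, mask_rate, rng)

print(f"mask rate: {mask_rate:.2f}, masked {len(positions)} of {len(tokens)} tokens")
print(corrupted)
```

At generation time the process runs in reverse: starting from a fully masked sequence, positions are filled in over several steps, which is what allows conditioning signals to steer the output.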

Practical Applications

  • Drug Discovery: Designing enhancers to improve gene expression for therapeutic targets.
  • Pitfall: Relying on single-species models can lead to poor generalization and inaccurate predictions when applying findings to different organisms.
