Language Model

AI News, Language Model, Software Engineering

Benchmarking LLM Compression: FP8, GPTQ, and SmoothQuant with llmcompressor

Optimize Qwen2.5-0.5B deployment using llmcompressor to implement FP8, GPTQ W4A16, and SmoothQuant W8A8 quantization strategies.

May 17, 2026

AI News, Language Model, AI Infrastructure

Nous Research Debuts Lighthouse Attention for 1.7x Faster Long-Context Pretraining

Nous Research introduces Lighthouse Attention, delivering up to 1.7x pretraining speedups and 21x faster forward passes at 512K context lengths.

May 16, 2026

AI News, AI Infrastructure, Language Model

Zyphra ZAYA1-8B: An MoE Model with 760M Active Parameters Outperforming Claude 4.5 Sonnet on Math

Zyphra's ZAYA1-8B uses 760M active parameters to outperform Claude 4.5 Sonnet on math benchmarks using novel Markovian RSA test-time compute.

May 6, 2026

AI News, Voice AI, Language Model

Mistral Voxtral TTS: Closing the Expressivity Gap in Multilingual Voice Cloning

Mistral's Voxtral TTS uses a hybrid 4B-parameter architecture to achieve a 68.4% win rate over ElevenLabs Flash v2.5 in multilingual voice cloning.

May 5, 2026

AI News, AI Infrastructure, Language Model

Mastering LLM Post-Training: A Practical Guide to SFT, DPO, and GRPO with TRL

Learn to align LLMs using the TRL library, covering SFT, Reward Modeling, DPO, and GRPO for reasoning tasks, optimized for limited hardware like NVIDIA T4 GPUs.

May 1, 2026

AI News, Agentic AI, Language Model

Moonshot AI Releases Kimi K2.6: Trillion-Parameter MoE for Long-Horizon Coding

Kimi K2.6 scales agent swarms to 300 sub-agents and 4,000 steps, achieving a leading 54.0 score on Humanity’s Last Exam (HLE-Full) with tools.

Apr 20, 2026

AI News, Language Model, Machine Learning

PrfaaS: Scaling LLM Serving via Cross-Datacenter Prefill-as-a-Service

Moonshot AI and Tsinghua's PrfaaS architecture boosts LLM serving throughput by 54% using cross-datacenter KVCache transfer over commodity Ethernet.

Apr 19, 2026

AI News, Language Model, Artificial Intelligence

Deploying 1-Bit LLMs: A Guide to PrismML Bonsai-1.7B on CUDA

PrismML's Bonsai-1.7B 1-bit LLM achieves a 14.2x memory reduction compared to FP16, enabling efficient CUDA-based inference at 674 tokens per second on an RTX 4090.

Apr 18, 2026

AI News, Voice AI, Language Model

xAI Launches Grok STT and TTS APIs for Enterprise Voice Developers

xAI releases standalone Grok speech APIs featuring a 5.0% error rate in phone call entity recognition, outperforming ElevenLabs and Deepgram.

Apr 18, 2026

AI News, AI Infrastructure, Language Model

Parcae: A Stable Looped Transformer Architecture for Scalable Quality

Parcae, a stable looped transformer by UCSD and Together AI, achieves the quality of a 1.3B model with 770M parameters by enforcing dynamical system stability.

Apr 16, 2026

AI News, AI Infrastructure, Language Model

NVIDIA KVPress: Optimizing Long-Context LLM Inference with KV Cache Compression

NVIDIA’s KVPress framework enables memory-efficient LLM inference by pruning KV cache pairs with compression ratios up to 0.7, significantly reducing GPU memory overhead for long-context tasks.

Apr 9, 2026

AI News, Agentic AI, Language Model

Mastering Google LangExtract: A Technical Guide to Structured Document Intelligence

Automate document intelligence with Google LangExtract and OpenAI to transform unstructured text into grounded, machine-readable datasets with exact source spans.

Apr 8, 2026

AI News, Agentic AI, Language Model

Building Production-Ready Agentic Systems with Z.AI GLM-5

Develop scalable agents with Z.AI GLM-5, utilizing a 744B parameter MoE architecture and native tool calling for production environments.

Apr 3, 2026

AI News, Language Model, Machine Learning

Liquid AI LFM2.5-350M: High-Density Edge Intelligence via 28T Token Training

Liquid AI's LFM2.5-350M achieves high intelligence density by training 350M parameters on 28T tokens, outperforming models twice its size on edge hardware.

Mar 31, 2026

AI News, Artificial Intelligence, Language Model

TinyLoRA: Achieving 91.8% GSM8K Accuracy with Only 13 Parameters

Researchers from FAIR, Cornell, and CMU introduced TinyLoRA, enabling Qwen2.5-7B to reach 91.8% GSM8K accuracy using just 13 parameters.

Mar 24, 2026

AI News, OCR, Language Model

Baidu Qianfan-OCR: A 4B-Parameter Unified Document Intelligence Model for End-to-End Parsing

Baidu Qianfan Team releases Qianfan-OCR, a 4B-parameter model achieving 93.12 on OmniDocBench v1.5 through a unified vision-language architecture.

Mar 18, 2026

AI News, Artificial Intelligence, Language Model

Mistral AI Releases Mistral Small 4: A 119B-Parameter MoE Model

Mistral Small 4 unifies instruct, reasoning, and multimodal tasks into a single 119B MoE model with 6B active parameters per token.

Mar 16, 2026

AI News, Artificial Intelligence, Language Model

Google Drops Gemini 3.1 Flash-Lite: Optimizing High-Scale AI with Adjustable Thinking Levels

Google released Gemini 3.1 Flash-Lite, featuring 2.5x faster Time to First Token and adjustable reasoning levels for cost-efficient high-scale AI.

Mar 3, 2026

AI News, Robotics, Language Model

MEM for Robots: Physical Intelligence Unveils 15-Minute Memory System for Gemma 3-4B VLAs

Physical Intelligence introduces MEM, a multi-scale memory system giving Gemma 3-4B VLAs a 15-minute context window for complex, long-horizon robotic tasks.

Mar 3, 2026

AI News, Language Model, Small Language Model

Alibaba Releases Qwen 3.5 Small: High-Performance On-Device AI Models

Alibaba's Qwen team launched the Qwen3.5 Small series, featuring models from 0.8B to 9B parameters designed for edge devices and high-reasoning tasks with native multimodality.

Mar 2, 2026

AI News, Language Model, Machine Learning

Alibaba Qwen 3.5 Medium Series: High-Efficiency MoE Models with 1M Context

Alibaba's Qwen 3.5 Medium series introduces the 35B-A3B model, which outperforms its 235B predecessor using only 3B active parameters and a 1M token context window.

Feb 24, 2026

AI News, AI Infrastructure, Language Model

Google's Deep-Thinking Ratio: Boosting LLM Accuracy While Slashing Inference Costs by Nearly 50%

Google researchers introduce the Deep-Thinking Ratio (DTR), a metric that improves LLM accuracy while cutting inference costs by 49% on AIME 2025 benchmarks.

Feb 21, 2026

AI News, Language Model, Agentic AI

Qwen Team Releases Qwen3-Coder-Next: An Open-Weight Language Model

Qwen Team releases Qwen3-Coder-Next, an open-weight language model with 80B parameters, achieving performance comparable to models with 10-20× more active parameters.

Feb 3, 2026

AI News, Knowledge Graphs, Language Model

How Tree-KG Enables Hierarchical Knowledge Graphs for Contextual Navigation and Explainable Multi-Hop Reasoning Beyond Traditional RAG

Tree-KG combines semantic embeddings with graph structure, achieving 100% more contextual navigation and explainable reasoning than flat RAG.

Jan 27, 2026