Latest Posts
Paper – UniLM
The UNIfied pre-trained Language Model (UNILM) is pre-trained on three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. It employs a shared Transformer network and uses specific self-attention masks to control what context each prediction conditions on, so it can be fine-tuned for both natural language understanding and generation tasks. Methodology Overview of unified LM pre-training.…
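As a rough illustration (not code from the paper), the sketch below builds the three kinds of self-attention masks a UniLM-style shared Transformer could use; the sequence lengths and the `NEG_INF` constant are arbitrary choices for the example.

```python
# Minimal sketch of UniLM-style self-attention masks: 0.0 = attention allowed,
# NEG_INF = blocked. Sizes are hypothetical, chosen only for illustration.
import numpy as np

NEG_INF = -1e9

def bidirectional_mask(n):
    # Every token may attend to every other token (cloze-style objective).
    return np.zeros((n, n))

def unidirectional_mask(n):
    # Token i may only attend to tokens 0..i (left-to-right LM objective).
    mask = np.full((n, n), NEG_INF)
    mask[np.tril_indices(n)] = 0.0
    return mask

def seq2seq_mask(n_src, n_tgt):
    # Source tokens attend bidirectionally within the source segment;
    # target tokens attend to the full source plus earlier target tokens.
    n = n_src + n_tgt
    mask = np.full((n, n), NEG_INF)
    mask[:n_src, :n_src] = 0.0                # source <-> source
    for i in range(n_src, n):
        mask[i, :n_src] = 0.0                 # target -> source
        mask[i, n_src:i + 1] = 0.0            # target -> earlier target (and self)
    return mask

if __name__ == "__main__":
    print(seq2seq_mask(n_src=3, n_tgt=2))
```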
Paper – Zephyr
Zephyr is a 7B LLM that applies distilled Direct Preference Optimization (dDPO), which significantly improves intent alignment, to AI Feedback (AIF) preference data, achieving superior intent alignment in chat-based language modeling without requiring human annotation. Method The approach follows similar stages to InstructGPT. Distilled Supervised Fine-Tuning (dSFT) Starting with a raw LLM, it first needs…
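For intuition, here is a minimal sketch of the standard DPO objective that dDPO distills, computed from per-example log-probabilities; the function name and the numbers are made up for illustration and are not taken from the Zephyr codebase.

```python
# Sketch of the DPO loss on one preference pair. Inputs are hypothetical summed
# token log-probs of the chosen/rejected responses under the policy being
# trained and under the frozen dSFT reference model.
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Margin: how much more the policy prefers the chosen response over the
    # rejected one, relative to the reference model.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: small when the policy already
    # prefers the chosen (AI-ranked) response, large otherwise.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example with made-up log-probabilities.
print(dpo_loss(-20.0, -25.0, -22.0, -24.0))
```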
Paper – CodeFusion
Auto-regressive models for code generation have a limitation: they do not easily allow reconsidering earlier generated tokens. CodeFusion is a 75M-parameter pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. Architecture Architecture diagram for CodeFusion showing the Encoder (E), Denoiser (N) and the…
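To make the "denoise the whole program at once" idea concrete, here is a toy sketch under heavy simplifying assumptions: `encoder` and `denoiser` are hypothetical stand-ins for CodeFusion's learned components, and the refinement rule is invented purely to show the non-autoregressive loop structure.

```python
# Toy sketch of iterative denoising over code-token embeddings, conditioned on
# an encoded natural-language prompt. This is not the released model.
import numpy as np

rng = np.random.default_rng(0)
T, seq_len, dim = 10, 8, 16

def encoder(nl_prompt, dim=16):
    # Stand-in for the NL encoder E: returns a fixed-size conditioning vector.
    local = np.random.default_rng(abs(hash(nl_prompt)) % (2 ** 32))
    return local.standard_normal(dim)

def denoiser(x_t, t, cond):
    # Stand-in for the denoiser N: nudges the noisy embeddings toward the
    # conditioning vector. A real model would be a learned Transformer.
    return x_t + 0.1 * (cond - x_t) * (t / T)

cond = encoder("sort a list of integers")
x = rng.standard_normal((seq_len, dim))       # pure noise over the full program
for t in reversed(range(1, T + 1)):
    x = denoiser(x, t, cond)                  # refine the entire sequence at once
# A decoding head would then map x to discrete code tokens.
print(x.shape)
```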
Paper – LLemma
Llemma is an LLM for mathematics, formed by continued pretraining of Code Llama on Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code. Llemma is capable of tool use and formal theorem proving without any further finetuning. Data Proof-Pile-2, a 55B-token mixture of scientific papers, web data containing mathematics, and mathematical…
Paper – GPT4V
GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user. Incorporating additional modalities (such as image inputs) into LLMs is a key frontier in artificial intelligence research and development. Similar to GPT-4, the GPT-4V pre-trained model was first trained to predict the next word in a document, using…
Paper – GPT4
GPT-4 is a large-scale, multimodal, Transformer-based model pre-trained to predict the next token in a document; it can accept image and text inputs and produce text outputs. GPT-4 is trained using both publicly available data (such as internet data) and data licensed from third-party providers. The post-training alignment process, i.e. fine-tuning using Reinforcement Learning…
Paper – GPT3
GPT-3 is an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model. It demonstrates that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches. Model and Architectures GPT-3 uses the same model architecture as GPT-2, including the modified initialization, pre-normalization,…
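As a reminder of what "few-shot, in-context" evaluation looks like in practice, the snippet below assembles a prompt from task demonstrations and leaves the model to complete the last line, with no gradient updates; the translation task and examples are made up for illustration.

```python
# Sketch of few-shot prompt construction: demonstrations go in the context
# window and the model is conditioned on them at inference time only.
few_shot_examples = [
    ("cheese", "fromage"),
    ("house", "maison"),
    ("dog", "chien"),
]

def build_prompt(examples, query):
    lines = ["Translate English to French:"]
    for en, fr in examples:
        lines.append(f"{en} => {fr}")
    lines.append(f"{query} =>")       # the model is expected to continue from here
    return "\n".join(lines)

print(build_prompt(few_shot_examples, "cat"))
```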
Paper – GPT2
GPT-2 demonstrates that language models begin to learn various language processing tasks without any explicit supervision. GPT-2 is trained on a new dataset of millions of web pages called WebText. The experiments show that the capacity of the language model is essential to the success of zero-shot transfer, and that increasing it improves performance in a log-linear…
Paper – Mistral
Mistral 7B is an LLM engineered for superior performance and efficiency. It leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released…
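A minimal sketch of the sliding-window causal masking idea (not Mistral's implementation) is shown below; the window size is deliberately tiny for readability, whereas Mistral 7B uses a much larger window, and grouped-query attention is an orthogonal change to how key/value heads are shared.

```python
# Sketch of a sliding-window causal attention mask: query position i may only
# attend to the last `window` positions up to and including itself.
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo = max(0, i - window + 1)
        mask[i, lo:i + 1] = True          # allowed key positions for query i
    return mask

print(sliding_window_causal_mask(seq_len=6, window=3).astype(int))
```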