Latest Posts
Paper – GPT-3
GPT-3 is an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model. It demonstrates that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches. Model and Architectures: GPT-3 uses the same model architecture as GPT-2, including the modified initialization, pre-normalization,…
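Since the post centers on few-shot, in-context learning, here is a minimal sketch of what that means in practice: task demonstrations are concatenated into the prompt and the model completes the final query, with no gradient updates. The prompt format and the `generate` call are illustrative, not the paper's actual code.

```python
# Minimal sketch of few-shot (in-context) prompting: K labeled demonstrations
# are concatenated into the prompt and the model completes the final query.
# No gradient updates are involved; `generate` stands in for any
# autoregressive LM's completion call.

def build_few_shot_prompt(task_description, demonstrations, query):
    """Build a prompt from labeled examples plus an unanswered query."""
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to fill in the answer
    return "\n".join(lines)

demos = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("cheese", "fromage"),
]
prompt = build_few_shot_prompt("Translate English to French:", demos, "plush giraffe")
print(prompt)
# answer = generate(model, prompt)  # hypothetical decoding call
```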
Paper – GPT-2
GPT-2 demonstrates that language models begin to learn various language processing tasks without any explicit supervision. GPT-2 is trained on a new dataset of millions of web pages called WebText. The experiments show that the capacity of the language model is essential to the success of zero-shot transfer, and increasing it improves performance in a log-linear…
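As a companion to the zero-shot claim, a small sketch of how a task can be cast as plain language modeling: the paper elicits summaries by appending "TL;DR:" to a document, while the translation format and the `generate` call below are only illustrative.

```python
# Sketch of zero-shot task framing as plain language modeling: the task is
# specified only by how the context is written, with no demonstrations.
# `generate` is again a stand-in for an autoregressive LM's completion call.

def summarization_prompt(document: str) -> str:
    # GPT-2's zero-shot summarization is induced by appending "TL;DR:".
    return f"{document}\nTL;DR:"

def translation_prompt(english_sentence: str) -> str:
    # Translation framed as a pattern for the model to continue (illustrative).
    return f"english: {english_sentence}\nfrench:"

article = "..."  # any source document
# summary = generate(model, summarization_prompt(article))
```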
Paper – Mistral
Mistral 7B is an LLM engineered for superior performance and efficiency. It leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost. Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released…
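A minimal sketch of the sliding window attention idea, assuming a plain causal decoder; this is not Mistral's implementation (which also uses a rolling KV cache and grouped-query attention), only the masking pattern.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where query position i may attend to key positions
    max(0, i - window + 1) .. i, i.e. causal attention restricted to a
    fixed-size window of recent tokens."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (L, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, L)
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(seq_len=8, window=3).int())
# Each token attends to at most `window` recent tokens, so per-layer attention
# cost grows as O(seq_len * window) instead of O(seq_len**2); stacking layers
# still lets information propagate beyond a single window.
```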
Paper – Code Llama
Code Llama is a family of large language models for code, based on Llama 2, providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction-following ability for programming tasks. There are multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama…
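A rough sketch of what infilling (fill-in-the-middle) prompting looks like, assuming prefix-suffix-middle ordering; the sentinel strings below are placeholders, since the exact spellings depend on the tokenizer.

```python
# Rough sketch of fill-in-the-middle (infilling) prompting: code around the
# cursor is split into a prefix and a suffix, and the model generates the
# missing middle. The sentinel strings below are placeholders; the exact
# spellings depend on the tokenizer.

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def infilling_prompt(prefix: str, suffix: str) -> str:
    # Prefix-Suffix-Middle (PSM) ordering: the model completes after MID.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"

prefix = 'def remove_non_ascii(s: str) -> str:\n    """ '
suffix = '\n    return result\n'
prompt = infilling_prompt(prefix, suffix)
# completion = generate(model, prompt, stop="<EOT>")  # hypothetical call
```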
Paper – Instruction Backtranslation
Instruction backtranslation is a scalable method for building a high-quality instruction-following language model by automatically labeling human-written text with corresponding instructions. Fine-tuning LLaMA on two iterations of this approach yields a model that outperforms all other LLaMA-based models on the Alpaca leaderboard, demonstrating highly effective self-alignment. Instruction Backtranslation: Self-Augmentation (generating instructions):…
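A high-level sketch of the loop the excerpt describes; `backward_model`, `seed_model`, and the scoring rubric are placeholders standing in for the paper's components, not its actual code.

```python
# High-level sketch of instruction backtranslation: a backward model labels
# unlabeled human-written text with candidate instructions, and a seed
# instruction-following model curates its own training data.

def self_augment(unlabeled_texts, backward_model):
    """Self-augmentation: label each human-written text with a candidate
    instruction for which that text would be a good response."""
    return [(backward_model.generate_instruction(text), text)
            for text in unlabeled_texts]

def self_curate(candidate_pairs, seed_model, threshold=4):
    """Self-curation: keep only pairs the current model rates highly
    (e.g. on a 1-5 quality rubric), then fine-tune on the kept pairs."""
    return [pair for pair in candidate_pairs
            if seed_model.score_quality(*pair) >= threshold]

# Iterating augment -> curate -> fine-tune is what "two iterations of this
# approach" refers to in the excerpt.
```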
Paper – Falcon
As larger models require pretraining on trillions of tokens, it is unclear how scalable the curation of “high-quality” corpora, such as social media conversations, books, or technical papers, really is, and whether we will soon run out of unique high-quality data. Falcon shows that properly filtered and deduplicated web data alone can lead to powerful models; even…
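To make "properly filtered and deduplicated" concrete, here is a toy sketch of two such steps; a real web-scale pipeline adds much more (HTML extraction, language identification, fuzzy/MinHash deduplication, URL filtering), so treat this only as an illustration.

```python
import hashlib
import re

# Toy versions of two steps only: a crude quality filter and exact
# deduplication keyed on a normalized content hash.

def basic_quality_filter(doc: str, min_words: int = 50) -> bool:
    words = doc.split()
    if len(words) < min_words:
        return False
    # Drop documents dominated by very short or very long "words"
    # (navigation menus, boilerplate, encoded junk).
    mean_word_len = sum(len(w) for w in words) / len(words)
    return 3 <= mean_word_len <= 10

def exact_dedup(docs):
    """Keep one copy of each document, keyed by a normalized content hash."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha1(
            re.sub(r"\s+", " ", doc).strip().lower().encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```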
Paper – LIMA
Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large-scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. The relative importance of these two stages is measured by training LIMA, a 65B-parameter LLaMA language model fine-tuned with…
Paper – Alpaca
Alpaca is fine-tuned from Meta’s LLaMA 7B model. The Alpaca model is trained on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003. On the self-instruct evaluation set, Alpaca shows many behaviors similar to OpenAI’s text-davinci-003 but is also surprisingly small and easy/cheap to reproduce. Alpaca is intended only for academic research and…
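A small sketch of how instruction-following demonstrations are typically formatted into training text for supervised fine-tuning; the template follows the style of Alpaca's prompt, but the exact wording here is illustrative.

```python
# Sketch of turning instruction-following demonstrations into training text
# for supervised fine-tuning. Demonstrations with an "input" field get a
# slightly longer template than instruction-only ones.

def format_example(example: dict) -> str:
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

demo = {"instruction": "Give three tips for staying healthy.", "input": "", "output": "..."}
print(format_example(demo))
```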
Paper – LLaMA
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters, trained on trillions of tokens using publicly available datasets exclusively. Training Data: English CommonCrawl [67%]. Five CommonCrawl dumps, ranging from 2017 to 2020, are preprocessed using the CCNet pipeline. This process deduplicates the data at the line level, performs language identification…
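A rough sketch of the two CCNet steps named in the excerpt, line-level deduplication and language identification; `lid_model` and the confidence threshold are assumptions, not the pipeline's actual code.

```python
import hashlib

# Rough sketch: drop duplicated lines across the corpus, then keep documents
# classified as English. CCNet uses a fastText language-ID classifier for the
# second step; the threshold below is an assumption.

def dedup_lines(doc: str, seen_hashes: set) -> str:
    """Drop lines already seen elsewhere in the corpus (menus, headers,
    boilerplate), keeping only first occurrences."""
    kept = []
    for line in doc.splitlines():
        h = hashlib.sha1(line.strip().lower().encode()).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            kept.append(line)
    return "\n".join(kept)

def keep_if_english(doc: str, lid_model, threshold: float = 0.5) -> bool:
    """Keep documents classified as English with sufficient confidence."""
    labels, probs = lid_model.predict(doc.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

# lid_model = fasttext.load_model("lid.176.bin")  # hypothetical local path
```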