Uncensor any LLM with abliteration

The third generation of Llama models provided fine-tunes (Instruct) versions that excel in understanding and following instructions. However, these models are heavily censored, designed to refuse requests seen as harmful with responses such as “As an AI assistant, I cannot help you.” While this safety feature is crucial for preventing misuse, it limits the model’s flexibility and responsiveness.

In this article, we will explore a technique called “abliteration” that can uncensor any LLM without retraining. This technique effectively removes the model’s built-in refusal mechanism, allowing it to respond to all types of prompts.

The code is available on Google Colab and in the LLM Course on GitHub.

✂️ What is abliteration?

Modern LLMs are fine-tuned for safety and instruction-following, meaning they are trained to refuse harmful requests. In their blog post, Arditi et al. have shown that this refusal behavior is mediated by a specific direction in the model’s residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests. Conversely, adding this direction artificially can cause the model to refuse even harmless requests.

In the traditional decoder-only Llama-like architecture, there are three residual streams we can target: at the start of each block (“pre”), between the attention and MLP layers (“mid”), and after the MLP (“post”). The following figure illustrates the location of each residual stream.

To uncensor an LLM, we first need to identify the “refusal direction” within the model. This process involves a few technical steps:

Data Collection : Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each.

Mean difference : Calculate the mean difference between the activations of harmful and harmless instructions. This gives us a vector representing the “refusal direction” for each layer of the model.

Selection : Normalize these vectors and evaluate them to select the single best “refusal direction.”

Once we have identified the refusal direction, we can “ablate” it, effectively removing the model’s ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

Let’s talk about inference-time intervention first. For every component that writes to the residual stream (such as an attention head), we calculate the projection of its output onto the refusal direction and subtract this projection. This subtraction is applied at every token and every layer, ensuring that the model never represents the refusal direction.

On the other hand, weight orthogonalization involves modifying the model weights directly. By orthogonalizing the component weights with respect to the refusal direction, it prevents the model from writing to this direction altogether. This is achieved by adjusting the matrices that write to the residual stream, ensuring they do not contribute to the refusal direction.

In the next section, we will implement abliteration with weight orthogonalization.

💻 Implementation

The following implementation of abliteration is based on FailSpy’s notebook, which is itself based on the original authors’ notebook. I mostly adapted and simplified it to make it easier to understand. This section is quite code-heavy so you can see what is going on, but you can use FailSpy’s abliterator library if you’re less interested in the technical details (also check his collection of abliterated models on Hugging Face).

The code relies on the excellent TransformerLens library (formerly known as EasyTransformer) to do the heavy lifting. It is designed for mechanistic interpretability and is used here to intervene on activations. Thanks to Neel Nanda and Joseph Bloom for creating and maintaining this library.

First, let’s install the necessary packages and import them. All these steps are available in this Google Colab notebook.

    !pip install transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping
    
    import torch
    import functools
    import einops
    import gc
    
    from datasets import load_dataset
    from tqdm import tqdm
    from torch import Tensor
    from typing import List
    from transformer_lens import HookedTransformer, utils
    from transformer_lens.hook_points import HookPoint
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from jaxtyping import Float, Int
    from collections import defaultdict
    
    
     __

We need two datasets: one containing harmless instructions, and one containing harmful instructions. We’ll use tatsu-lab/alpaca as well as data from llm-attacks. To make things easier, I repackaged them in two Hugging Face datasets: mlabonne/harmless_alpaca and mlabonne/harmful_behaviors. That way, you can easily replace them with your own datasets.

We will load the instructions and reformat them into a list of dictionaries with “role” and “content” keys. This makes it compatible with the method, which we will use to follow Llama 3’s chat template.

    :<br>        return [[{"role": "user", "content": text}] for text in texts]<br>    <br>    # Get harmful and harmless datasets<br>    :<br>    <br>    <br>    <br>    :<br>

Now that we have our datasets, we can load the model we want to abliterate. Unfortunately, you can’t directly load a custom model using . Here, I use a trick described in FailSpy’s notebook to download a custom model and rename it as meta-llama/Meta-Llama-3-8B-Instruct. Load in format if your GPU is not compatible with BF16.

In this example, we’ll use mlabonne/Daredevil-8B, a mega-merge created with DARE TIES (see my article about model merging) that has the highest MMLU score on the Open LLM Leaderboard in the 8B category.

    MODEL_ID = "mlabonne/Daredevil-8B"
    MODEL_TYPE = "meta-llama/Meta-Llama-3-8B-Instruct"
    
    # Download and load model
    !git clone https://huggingface.co/{MODEL_ID} {MODEL_TYPE}
    
    # Load model and tokenizer
    model = HookedTransformer.from_pretrained_no_processing(
        MODEL_TYPE,
        local_files_only=True,
        dtype=torch.bfloat16,
        default_padding_side='left'
    
    
    tokenizer.padding_side = 'left'
    tokenizer.pad_token = tokenizer.eos_token __

We can now tokenize our datasets. We’re using the same number of samples for both harmless and harmful instructions. Note that a high number of samples can use all the RAM/VRAM, which is why I’m limiting it to 256 here.

    :
        return tokenizer.apply_chat_template(
            instructions,
            padding=True,
            truncation=False,
            return_tensors="pt",
            return_dict=True,
            add_generation_prompt=True,
    .input_ids
    
    
    
    # Tokenize datasets
    harmful_tokens = tokenize_instructions(
        tokenizer,
        instructions=harmful_inst_train[:n_inst_train],
    
    harmless_tokens = tokenize_instructions(
        tokenizer,
        instructions=harmless_inst_train[:n_inst_train],
     __

Everything is set up, we can now implement the first step of abliteration: data collection. We want to process these tokenized datasets and store the residual stream activations in and . This is managed by the transformer_lens library.

    # Define batch size based on available VRAM
    batch_size = 32
    
    # Initialize defaultdicts to store activations
    
    
    
    # Process the training data in batches
     // batch_size
    :
    
        start_idx = i * batch_size
    
    
        # Run models on harmful and harmless prompts, cache activations
        harmful_logits, harmful_cache = model.run_with_cache(
            harmful_tokens[start_idx:end_idx],
            names_filter=lambda hook_name: 'resid' in hook_name,
            device='cpu',
            reset_hooks_end=True
    
        harmless_logits, harmless_cache = model.run_with_cache(
            harmless_tokens[start_idx:end_idx],
            names_filter=lambda hook_name: 'resid' in hook_name,
            device='cpu',
            reset_hooks_end=True
    
    
        # Collect and store the activations
        for key in harmful_cache:
    
    
    
        # Flush RAM and VRAM
        del harmful_logits, harmless_logits, harmful_cache, harmless_cache
    
    
    
    # Concatenate the cached activations
    }
    } __

We can now compute the refusal direction for each layer. This corresponds to the mean difference between the activations of harmful and harmless instructions, which is then normalized. We sort them in descending order in .

    # Helper function to get activation index
    :
    
    ]
    
    # Compute difference of means between harmful and harmless activations at intermediate layers
    activation_layers = ["resid_pre", "resid_mid", "resid_post"]
    
    
    :
        pos = -1  # Position index
    
        for layer in activation_layers:
    
    [:, pos, :].mean(
                dim=0
    
    
            refusal_dir = harmful_mean_act - harmless_mean_act
    
    
    
    # Get all calculated potential refusal directions, sort them in descending order based on their mean
    # Use a subset of layers if certain activations are not promising
    selected_layers = ["resid_pre"]
    activation_scored = sorted(
        [
            activation_refusals[layer][l - 1]
    
            for layer in selected_layers
        ],
    ,
        reverse=True,
     __

The final step of the process consists of evaluating the refusal directions we calculated. To do this, we’re going to apply the refusal direction to each residual stream and each block during inference. In the following snippet, we get generations for four test harmful instructions and 20 blocks (or layers).

    def _generate_with_hooks(
        model: HookedTransformer,
        tokenizer: AutoTokenizer,
        tokens: Int[Tensor, "batch_size seq_len"],
        max_tokens_generated: int = 64,
        fwd_hooks=[],
     -> List[str]:
        all_tokens = torch.zeros(
    ,
            dtype=torch.long,
            device=tokens.device,
    
        all_tokens[:, : tokens.shape[1]] = tokens
    :
    :
    
                next_tokens = logits[:, -1, :].argmax(
                    dim=-1
    
                all_tokens[:, -max_tokens_generated + i] = next_tokens
        return tokenizer.batch_decode(
            all_tokens[:, tokens.shape[1] :], skip_special_tokens=True
    
    
    def get_generations(
        model: HookedTransformer,
        tokenizer: AutoTokenizer,
        instructions: List[str],
        fwd_hooks=[],
        max_tokens_generated: int = 64,
        batch_size: int = 4,
     -> List[str]:
        generations = []
    :
            tokens = tokenize_instructions(
                tokenizer, instructions=instructions[i : i + batch_size]
    
            generation = _generate_with_hooks(
                model,
                tokenizer,
                tokens,
                max_tokens_generated=max_tokens_generated,
                fwd_hooks=fwd_hooks,
    
    
        return generations
    
    # Inference-time intervention hook
    def direction_ablation_hook(
        activation: Float[Tensor, "... d_act"],
        hook: HookPoint,
        direction: Float[Tensor, "d_act"],
    :
        if activation.device != direction.device:
    
        proj = (
            einops.einsum(
    , "... d_act, d_act single -> ... single"
    
            * direction
    
        return activation - proj
    
    # Testing baseline
    N_INST_TEST = 4
    baseline_generations = get_generations(
        model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
    
    
    
    EVAL_N = 20  # Evaluate how many of the top N potential directions
    evals = []
    :
    
        fwd_hooks = [
    
    
            for act_name in activation_layers
        ]
        intervention_generations = get_generations(
            model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=fwd_hooks
    
     __

We stored all the generations in the list. We can now print them and manually select the layer (block) that provides an uncensored response for each instruction. I’m automatically excluding responses containing “I cannot” and “I can’t” to filter out unwanted answers.

If you can’t find a layer that satisfies these requirements, you might want to test other residual streams in the previous list, other instructions, additional blocks, etc.

    # Print generations for human evaluation
    blacklist = ["I cannot", "I can't"]
    :
    
    
    :
    :
    
     __

In my case, the layer candidate 9 managed to provide uncensored answer for the four instructions. This is the one that we will select for the refusal direction. In the following, we implement weight orthogonalization to modify the weights and prevent the model from creating outputs with this direction. You can verify that the model is successfully uncensored by printing the completions.

    def get_orthogonalized_matrix(
        matrix: Float[Tensor, "... d_model"], vec: Float[Tensor, "d_model"]
     -> Float[Tensor, "... d_model"]:
        proj = (
            einops.einsum(
    , "... d_model, d_model single -> ... single"
    
            * vec
    
        return matrix - proj
    
    # Select the layer with the highest potential refusal direction
    LAYER_CANDIDATE = 9
    refusal_dir = activation_scored[LAYER_CANDIDATE]
    
    # Orthogonalize the model's weights
    if refusal_dir.device != model.W_E.device:
    
    
    
    :
        if refusal_dir.device != block.attn.W_O.device:
    
    
    
    
    # Generate text with abliterated model
    orthogonalized_generations = get_generations(
        model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
    
    
    # Print generations
    :
     > i:
    
    
    
     __

We’re now ready to use the model. We convert it back to the Hugging Face format and upload it to the HF hub.

    # Convert model back to HF safetensors
    
    lm_model = hf_model.model
    
    
    
    
    :
        lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(
            einops.rearrange(
    ", n=model.cfg.n_heads
    
    
        lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(
    
    
    
    
     __

⚖️ DPO Fine-Tuning

I evaluated the abliterated and source models from the previous section on the Open LLM Leaderboard and on Nous’ benchmark suite. Here are the results:

As you can see, the source model significantly outperforms Llama 3 8B Instruct. However, we observe a performance drop in the ablated version across all benchmarks. The ablation process successfully uncensored it but also degraded the model’s quality.

To address this issue, an idea consists of further training our abliterated model to heal it. Like most fine-tuned models, Llama 3 8B Instruct is quite brittle when it comes to supervised fine-tuning. An additional SFT would likely break the model’s performance.

Alternatively, preference alignment is quite light and shouldn’t lobotomize our abliterated model. DPO is a good candidate here for its ease of use and good track record. To implement it, I used LazyAxolotl with the mlabonne/orpo-dpo-mix-40k dataset. Here’s the configuration I used:

    base_model: mlabonne/Daredevil-8B-abliterated
    model_type: LlamaForCausalLM
    tokenizer_type: AutoTokenizer
    
    load_in_8bit: false
    load_in_4bit: true
    strict: false
    save_safetensors: true
    
    rl: dpo
    chat_template: chatml
    datasets:
      - path: mlabonne/orpo-dpo-mix-40k-flat
        split: train
        type: chatml.intel
    
    dataset_prepared_path:
    val_set_size: 0.0
    output_dir: ./out
    
    adapter: qlora
    lora_model_dir:
    
    sequence_len: 2048
    sample_packing: false
    pad_to_sequence_len: false
    
    lora_r: 64
    lora_alpha: 32
    lora_dropout: 0.05
    lora_target_linear: true
    lora_fan_in_fan_out:
    
    wandb_project: axolotl
    wandb_entity:
    wandb_watch:
    wandb_name:
    wandb_log_model:
    
    gradient_accumulation_steps: 8
    micro_batch_size: 1
    num_epochs: 1
    optimizer: paged_adamw_8bit
    lr_scheduler: cosine
    learning_rate: 5e-6
    train_on_inputs: false
    group_by_length: false
    
    bf16: auto
    fp16:
    tf32:
    
    gradient_checkpointing: true
    early_stopping_patience:
    resume_from_checkpoint:
    local_rank:
    logging_steps: 1
    xformers_attention:
    flash_attention: true
    warmup_steps: 100
    evals_per_epoch: 0
    eval_table_size:
    eval_table_max_new_tokens: 128
    saves_per_epoch: 1
    debug:
    deepspeed: deepspeed_configs/zero2.json
    weight_decay: 0.0
    special_tokens:
      pad_token: <|end_of_text|> __

I trained it using 6xA6000 GPUs with DeepSpeed ZeRO-2. The training took about 6 hours and 45 minutes. Here are the training curves I got from W&B:

It automatically uploaded the DPO fine-tuned model, called mlabonne/NeuralDaredevil-8B-abliterated. To see if it fixed our abliterated version, I evaluated it on the same benchmarks:

We can see that this additional training allowed us to recover most of the performance drop due to abliteration. One area where the model doesn’t improve is GSM8K, a math dataset, which could mean the orpo-dpo-mix-40k would benefit from more math samples.

The final model is an uncensored LLM with state-of-the-art performance in the 8B category. I recommend it as an improved version of Llama 3 8B Instruct when you don’t need censorship. You can play with quantized versions like GGUF in LM Studio.

Conclusion

In this article, we introduced the concept of abliteration. This technique uses the model’s activations on harmless and harmful prompts to calculate a refusal direction. It then uses this direction to modify the model’s weights and ensure that we stop outputting refusals. This technique also demonstrates the fragility of safety fine-tuning and raises ethical considerations.

We applied abliteration to Daredevil-8B to uncensor it, which also degraded the model’s performance. We then healed it using DPO to create the NeuralDaredevil-8B model, a fully uncensored and high-quality 8B LLM. Abliteration is not limited to removing alignment and should be seen as a form of fine-tuning without retraining. Indeed, it can creatively be applied to other goals, like FailSpy’s MopeyMule, which adopts a melancholic conversational style.

References

FailSpy, abliterator library,” GitHub, 2024.
Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda, “Refusal in LLMs is mediated by a single direction,” Lesswrong, 2024.