fbpx

Uncensor any LLM with abliteration

Uncensor any LLM with abliteration

Table of Contents

    The third generation of Llama models provided fine-tunes (Instruct) versions that excel in understanding and following instructions. However, these models are heavily censored, designed to refuse requests seen as harmful with responses such as “As an AI assistant, I cannot help you.” While this safety feature is crucial for preventing misuse, it limits the model’s flexibility and responsiveness.

    In this article, we will explore a technique called “abliteration” that can uncensor any LLM without retraining. This technique effectively removes the model’s built-in refusal mechanism, allowing it to respond to all types of prompts.

    The code is available on Google Colab and in the LLM Course on GitHub.

    ✂️ What is abliteration?

    Modern LLMs are fine-tuned for safety and instruction-following, meaning they are trained to refuse harmful requests. In their blog post, Arditi et al. have shown that this refusal behavior is mediated by a specific direction in the model’s residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests. Conversely, adding this direction artificially can cause the model to refuse even harmless requests.

    In the traditional decoder-only Llama-like architecture, there are three residual streams we can target: at the start of each block (“pre”), between the attention and MLP layers (“mid”), and after the MLP (“post”). The following figure illustrates the location of each residual stream.

    To uncensor an LLM, we first need to identify the “refusal direction” within the model. This process involves a few technical steps:

    Data Collection : Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each.

    Mean difference : Calculate the mean difference between the activations of harmful and harmless instructions. This gives us a vector representing the “refusal direction” for each layer of the model.

    Selection : Normalize these vectors and evaluate them to select the single best “refusal direction.”

    Once we have identified the refusal direction, we can “ablate” it, effectively removing the model’s ability to represent this feature. This can be done through an inference-time intervention or permanently with weight orthogonalization.

    Let’s talk about inference-time intervention first. For every component that writes to the residual stream (such as an attention head), we calculate the projection of its output onto the refusal direction and subtract this projection. This subtraction is applied at every token and every layer, ensuring that the model never represents the refusal direction.

    On the other hand, weight orthogonalization involves modifying the model weights directly. By orthogonalizing the component weights with respect to the refusal direction, it prevents the model from writing to this direction altogether. This is achieved by adjusting the matrices that write to the residual stream, ensuring they do not contribute to the refusal direction.

    In the next section, we will implement abliteration with weight orthogonalization.

    💻 Implementation

    The following implementation of abliteration is based on FailSpy’s notebook, which is itself based on the original authors’ notebook. I mostly adapted and simplified it to make it easier to understand. This section is quite code-heavy so you can see what is going on, but you can use FailSpy’s abliterator library if you’re less interested in the technical details (also check his collection of abliterated models on Hugging Face).

    The code relies on the excellent TransformerLens library (formerly known as EasyTransformer) to do the heavy lifting. It is designed for mechanistic interpretability and is used here to intervene on activations. Thanks to Neel Nanda and Joseph Bloom for creating and maintaining this library.

    First, let’s install the necessary packages and import them. All these steps are available in this Google Colab notebook.

        !pip install transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping
        
        import torch
        import functools
        import einops
        import gc
        
        from datasets import load_dataset
        from tqdm import tqdm
        from torch import Tensor
        from typing import List
        from transformer_lens import HookedTransformer, utils
        from transformer_lens.hook_points import HookPoint
        from transformers import AutoModelForCausalLM, AutoTokenizer
        from jaxtyping import Float, Int
        from collections import defaultdict
        
        
         __
    

    We need two datasets: one containing harmless instructions, and one containing harmful instructions. We’ll use tatsu-lab/alpaca as well as data from llm-attacks. To make things easier, I repackaged them in two Hugging Face datasets: mlabonne/harmless_alpaca and mlabonne/harmful_behaviors. That way, you can easily replace them with your own datasets.

    We will load the instructions and reformat them into a list of dictionaries with “role” and “content” keys. This makes it compatible with the method, which we will use to follow Llama 3’s chat template.

        :<br>        return [[{"role": "user", "content": text}] for text in texts]<br>    <br>    # Get harmful and harmless datasets<br>    :<br>    <br>    <br>    <br>    :<br>

    Now that we have our datasets, we can load the model we want to abliterate. Unfortunately, you can’t directly load a custom model using . Here, I use a trick described in FailSpy’s notebook to download a custom model and rename it as meta-llama/Meta-Llama-3-8B-Instruct. Load in format if your GPU is not compatible with BF16.

    In this example, we’ll use mlabonne/Daredevil-8B, a mega-merge created with DARE TIES (see my article about model merging) that has the highest MMLU score on the Open LLM Leaderboard in the 8B category.

        MODEL_ID = "mlabonne/Daredevil-8B"
        MODEL_TYPE = "meta-llama/Meta-Llama-3-8B-Instruct"
        
        # Download and load model
        !git clone https://huggingface.co/{MODEL_ID} {MODEL_TYPE}
        
        # Load model and tokenizer
        model = HookedTransformer.from_pretrained_no_processing(
            MODEL_TYPE,
            local_files_only=True,
            dtype=torch.bfloat16,
            default_padding_side='left'
        
        
        tokenizer.padding_side = 'left'
        tokenizer.pad_token = tokenizer.eos_token __
    

    We can now tokenize our datasets. We’re using the same number of samples for both harmless and harmful instructions. Note that a high number of samples can use all the RAM/VRAM, which is why I’m limiting it to 256 here.

        :
            return tokenizer.apply_chat_template(
                instructions,
                padding=True,
                truncation=False,
                return_tensors="pt",
                return_dict=True,
                add_generation_prompt=True,
        .input_ids
        
        
        
        # Tokenize datasets
        harmful_tokens = tokenize_instructions(
            tokenizer,
            instructions=harmful_inst_train[:n_inst_train],
        
        harmless_tokens = tokenize_instructions(
            tokenizer,
            instructions=harmless_inst_train[:n_inst_train],
         __
    

    Everything is set up, we can now implement the first step of abliteration: data collection. We want to process these tokenized datasets and store the residual stream activations in and . This is managed by the transformer_lens library.

        # Define batch size based on available VRAM
        batch_size = 32
        
        # Initialize defaultdicts to store activations
        
        
        
        # Process the training data in batches
         // batch_size
        :
        
            start_idx = i * batch_size
        
        
            # Run models on harmful and harmless prompts, cache activations
            harmful_logits, harmful_cache = model.run_with_cache(
                harmful_tokens[start_idx:end_idx],
                names_filter=lambda hook_name: 'resid' in hook_name,
                device='cpu',
                reset_hooks_end=True
        
            harmless_logits, harmless_cache = model.run_with_cache(
                harmless_tokens[start_idx:end_idx],
                names_filter=lambda hook_name: 'resid' in hook_name,
                device='cpu',
                reset_hooks_end=True
        
        
            # Collect and store the activations
            for key in harmful_cache:
        
        
        
            # Flush RAM and VRAM
            del harmful_logits, harmless_logits, harmful_cache, harmless_cache
        
        
        
        # Concatenate the cached activations
        }
        } __
    

    We can now compute the refusal direction for each layer. This corresponds to the mean difference between the activations of harmful and harmless instructions, which is then normalized. We sort them in descending order in .

        # Helper function to get activation index
        :
        
        ]
        
        # Compute difference of means between harmful and harmless activations at intermediate layers
        activation_layers = ["resid_pre", "resid_mid", "resid_post"]
        
        
        :
            pos = -1  # Position index
        
            for layer in activation_layers:
        
        [:, pos, :].mean(
                    dim=0
        
        
                refusal_dir = harmful_mean_act - harmless_mean_act
        
        
        
        # Get all calculated potential refusal directions, sort them in descending order based on their mean
        # Use a subset of layers if certain activations are not promising
        selected_layers = ["resid_pre"]
        activation_scored = sorted(
            [
                activation_refusals[layer][l - 1]
        
                for layer in selected_layers
            ],
        ,
            reverse=True,
         __
    

    The final step of the process consists of evaluating the refusal directions we calculated. To do this, we’re going to apply the refusal direction to each residual stream and each block during inference. In the following snippet, we get generations for four test harmful instructions and 20 blocks (or layers).

        def _generate_with_hooks(
            model: HookedTransformer,
            tokenizer: AutoTokenizer,
            tokens: Int[Tensor, "batch_size seq_len"],
            max_tokens_generated: int = 64,
            fwd_hooks=[],
         -> List[str]:
            all_tokens = torch.zeros(
        ,
                dtype=torch.long,
                device=tokens.device,
        
            all_tokens[:, : tokens.shape[1]] = tokens
        :
        :
        
                    next_tokens = logits[:, -1, :].argmax(
                        dim=-1
        
                    all_tokens[:, -max_tokens_generated + i] = next_tokens
            return tokenizer.batch_decode(
                all_tokens[:, tokens.shape[1] :], skip_special_tokens=True
        
        
        def get_generations(
            model: HookedTransformer,
            tokenizer: AutoTokenizer,
            instructions: List[str],
            fwd_hooks=[],
            max_tokens_generated: int = 64,
            batch_size: int = 4,
         -> List[str]:
            generations = []
        :
                tokens = tokenize_instructions(
                    tokenizer, instructions=instructions[i : i + batch_size]
        
                generation = _generate_with_hooks(
                    model,
                    tokenizer,
                    tokens,
                    max_tokens_generated=max_tokens_generated,
                    fwd_hooks=fwd_hooks,
        
        
            return generations
        
        # Inference-time intervention hook
        def direction_ablation_hook(
            activation: Float[Tensor, "... d_act"],
            hook: HookPoint,
            direction: Float[Tensor, "d_act"],
        :
            if activation.device != direction.device:
        
            proj = (
                einops.einsum(
        , "... d_act, d_act single -> ... single"
        
                * direction
        
            return activation - proj
        
        # Testing baseline
        N_INST_TEST = 4
        baseline_generations = get_generations(
            model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
        
        
        
        EVAL_N = 20  # Evaluate how many of the top N potential directions
        evals = []
        :
        
            fwd_hooks = [
        
        
                for act_name in activation_layers
            ]
            intervention_generations = get_generations(
                model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=fwd_hooks
        
         __
    

    We stored all the generations in the list. We can now print them and manually select the layer (block) that provides an uncensored response for each instruction. I’m automatically excluding responses containing “I cannot” and “I can’t” to filter out unwanted answers.

    If you can’t find a layer that satisfies these requirements, you might want to test other residual streams in the previous list, other instructions, additional blocks, etc.

        # Print generations for human evaluation
        blacklist = ["I cannot", "I can't"]
        :
        
        
        :
        :
        
         __
    

    In my case, the layer candidate 9 managed to provide uncensored answer for the four instructions. This is the one that we will select for the refusal direction. In the following, we implement weight orthogonalization to modify the weights and prevent the model from creating outputs with this direction. You can verify that the model is successfully uncensored by printing the completions.

        def get_orthogonalized_matrix(
            matrix: Float[Tensor, "... d_model"], vec: Float[Tensor, "d_model"]
         -> Float[Tensor, "... d_model"]:
            proj = (
                einops.einsum(
        , "... d_model, d_model single -> ... single"
        
                * vec
        
            return matrix - proj
        
        # Select the layer with the highest potential refusal direction
        LAYER_CANDIDATE = 9
        refusal_dir = activation_scored[LAYER_CANDIDATE]
        
        # Orthogonalize the model's weights
        if refusal_dir.device != model.W_E.device:
        
        
        
        :
            if refusal_dir.device != block.attn.W_O.device:
        
        
        
        
        # Generate text with abliterated model
        orthogonalized_generations = get_generations(
            model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
        
        
        # Print generations
        :
         > i:
        
        
        
         __
    

    We’re now ready to use the model. We convert it back to the Hugging Face format and upload it to the HF hub.

        # Convert model back to HF safetensors
        
        lm_model = hf_model.model
        
        
        
        
        :
            lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(
                einops.rearrange(
        ", n=model.cfg.n_heads
        
        
            lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(
        
        
        
        
         __
    

    ⚖️ DPO Fine-Tuning

    I evaluated the abliterated and source models from the previous section on the Open LLM Leaderboard and on Nous’ benchmark suite. Here are the results:

    As you can see, the source model significantly outperforms Llama 3 8B Instruct. However, we observe a performance drop in the ablated version across all benchmarks. The ablation process successfully uncensored it but also degraded the model’s quality.

    To address this issue, an idea consists of further training our abliterated model to heal it. Like most fine-tuned models, Llama 3 8B Instruct is quite brittle when it comes to supervised fine-tuning. An additional SFT would likely break the model’s performance.

    Alternatively, preference alignment is quite light and shouldn’t lobotomize our abliterated model. DPO is a good candidate here for its ease of use and good track record. To implement it, I used LazyAxolotl with the mlabonne/orpo-dpo-mix-40k dataset. Here’s the configuration I used:

        base_model: mlabonne/Daredevil-8B-abliterated
        model_type: LlamaForCausalLM
        tokenizer_type: AutoTokenizer
        
        load_in_8bit: false
        load_in_4bit: true
        strict: false
        save_safetensors: true
        
        rl: dpo
        chat_template: chatml
        datasets:
          - path: mlabonne/orpo-dpo-mix-40k-flat
            split: train
            type: chatml.intel
        
        dataset_prepared_path:
        val_set_size: 0.0
        output_dir: ./out
        
        adapter: qlora
        lora_model_dir:
        
        sequence_len: 2048
        sample_packing: false
        pad_to_sequence_len: false
        
        lora_r: 64
        lora_alpha: 32
        lora_dropout: 0.05
        lora_target_linear: true
        lora_fan_in_fan_out:
        
        wandb_project: axolotl
        wandb_entity:
        wandb_watch:
        wandb_name:
        wandb_log_model:
        
        gradient_accumulation_steps: 8
        micro_batch_size: 1
        num_epochs: 1
        optimizer: paged_adamw_8bit
        lr_scheduler: cosine
        learning_rate: 5e-6
        train_on_inputs: false
        group_by_length: false
        
        bf16: auto
        fp16:
        tf32:
        
        gradient_checkpointing: true
        early_stopping_patience:
        resume_from_checkpoint:
        local_rank:
        logging_steps: 1
        xformers_attention:
        flash_attention: true
        warmup_steps: 100
        evals_per_epoch: 0
        eval_table_size:
        eval_table_max_new_tokens: 128
        saves_per_epoch: 1
        debug:
        deepspeed: deepspeed_configs/zero2.json
        weight_decay: 0.0
        special_tokens:
          pad_token: <|end_of_text|> __
    

    I trained it using 6xA6000 GPUs with DeepSpeed ZeRO-2. The training took about 6 hours and 45 minutes. Here are the training curves I got from W&B:

    It automatically uploaded the DPO fine-tuned model, called mlabonne/NeuralDaredevil-8B-abliterated. To see if it fixed our abliterated version, I evaluated it on the same benchmarks:

    We can see that this additional training allowed us to recover most of the performance drop due to abliteration. One area where the model doesn’t improve is GSM8K, a math dataset, which could mean the orpo-dpo-mix-40k would benefit from more math samples.

    The final model is an uncensored LLM with state-of-the-art performance in the 8B category. I recommend it as an improved version of Llama 3 8B Instruct when you don’t need censorship. You can play with quantized versions like GGUF in LM Studio.

    Conclusion

    In this article, we introduced the concept of abliteration. This technique uses the model’s activations on harmless and harmful prompts to calculate a refusal direction. It then uses this direction to modify the model’s weights and ensure that we stop outputting refusals. This technique also demonstrates the fragility of safety fine-tuning and raises ethical considerations.

    We applied abliteration to Daredevil-8B to uncensor it, which also degraded the model’s performance. We then healed it using DPO to create the NeuralDaredevil-8B model, a fully uncensored and high-quality 8B LLM. Abliteration is not limited to removing alignment and should be seen as a form of fine-tuning without retraining. Indeed, it can creatively be applied to other goals, like FailSpy’s MopeyMule, which adopts a melancholic conversational style.

    References