
KeystoneFuse: Models That Summarize Learn Better

We introduce KeystoneFuse, a novel methodology for data-efficient pre-training of causal language models. By guiding the model with summarized representations from an external encoder, KeystoneFuse achieves superior performance with significantly fewer training tokens and less training time. This is accomplished by fusing encoder features via a lightweight adapter at a specific “keystone” transformer layer during the pre-training phase. Our experiments show that a model pre-trained for only 3 epochs using the KeystoneFuse method consistently outperforms a baseline model pre-trained for 6 epochs.

1. Introduction

Pre-training large language models (LLMs) is a notoriously data- and compute-intensive process, demanding vast resources to achieve state-of-the-art performance. This presents a significant barrier to research and development. Therefore, methods that improve the data efficiency of the pre-training phase are of critical importance.

In this work, we propose KeystoneFuse, a novel pre-training strategy that injects high-level, summarized knowledge into a causal language model as it learns. The core idea is that by forcing the model to align its internal representations with a coherent summary of the text, provided by an external encoder, it learns more robust and generalizable features faster. We hypothesize that this guided learning approach can lead to superior performance with significantly less training.

2. Methodology

2.1. Experimental Setup

Our experiments start from the LLaMA architecture. An external encoder provides supplementary feature vectors that represent a summarized form of the input text. The models were pre-trained on publicly available web data.

2.2. KeystoneFuse Architecture

The KeystoneFuse method introduces a lightweight KeystoneFuseAdapter module into the model architecture from the start of pre-training. This adapter takes the hidden states from a specific transformer layer (keystonefuse_layer) and the output from the external encoder as input. Its objective is to align its output with the external encoder’s vector representation via a cosine distance loss. This acts as an auxiliary objective during pre-training, encouraging the main model to develop representations that are conducive to summarization. The key hyperparameter investigated is the keystonefuse_layer, which determines the fusion point within the network.
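The paper does not include code, so the following is a minimal NumPy sketch of how the auxiliary objective described above could look. The class name `KeystoneFuseAdapter` and hyperparameter `keystonefuse_layer` come from the text; the mean-pooling, the linear projection, and the loss weight `lam` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity, averaged over the batch."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a_n * b_n, axis=-1)))

class KeystoneFuseAdapter:
    """Hypothetical lightweight adapter: pool the hidden states taken at
    keystonefuse_layer and project them into the external encoder's space."""
    def __init__(self, d_model, d_enc):
        self.W = rng.normal(0.0, 0.02, (d_model, d_enc))  # learnable in practice

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, d_model)
        pooled = hidden_states.mean(axis=1)  # (batch, d_model)
        return pooled @ self.W               # (batch, d_enc)

# Toy batch: hidden states at the keystone layer + encoder summary vectors.
hidden = rng.normal(size=(2, 16, 64))
enc_summary = rng.normal(size=(2, 32))

adapter = KeystoneFuseAdapter(d_model=64, d_enc=32)
aux_loss = cosine_distance(adapter.forward(hidden), enc_summary)

# During pre-training this would be combined with the CLM loss, e.g.
# total_loss = clm_loss + lam * aux_loss, with lam an assumed weight.
```

In a real implementation both the projection and the pooling would be trained end-to-end alongside the language model; the sketch only fixes the shapes and the loss.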

Latent Attention and Dictionary Learning: Within the KeystoneFuseAdapter, a novel “Latent Attention” mechanism is employed. This mechanism performs dictionary learning on the latent vectors derived from the external encoder’s summarized representations. By doing so, Latent Attention enables the stable and non-conflicting transfer of this rich, additional information into the main model’s learning process, without interfering with the primary Causal Language Modeling (CLM) task.
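One plausible reading of the Latent Attention mechanism is attention over a learnable dictionary of latent atoms: each encoder summary produces soft codes over the dictionary, and the reconstruction (a convex combination of atoms) is what gets transferred into the main model. The dictionary size, scaling, and softmax coding scheme below are all assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class LatentAttention:
    """Hypothetical sketch: a learnable dictionary of n_atoms latent vectors.
    Encoder summaries attend over the dictionary; the output is a convex
    combination of atoms (a soft dictionary-learning code)."""
    def __init__(self, d_enc, n_atoms=8):
        self.dictionary = rng.normal(0.0, 0.02, (n_atoms, d_enc))

    def forward(self, enc_vecs):
        # enc_vecs: (batch, d_enc) -> scores: (batch, n_atoms)
        scores = enc_vecs @ self.dictionary.T / np.sqrt(enc_vecs.shape[-1])
        codes = softmax(scores, axis=-1)       # soft codes over the dictionary
        recon = codes @ self.dictionary        # reconstruction from atoms
        return recon, codes

latent = LatentAttention(d_enc=32)
recon, codes = latent.forward(rng.normal(size=(2, 32)))
```

Routing the encoder signal through a small fixed dictionary, rather than injecting raw encoder features, is one way the mechanism could keep the transferred information bounded and non-conflicting with the CLM task, as the text claims.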

Fusion Tokens: To facilitate the integration of the external encoder’s summarized information, “Fusion Tokens” are introduced. These are special, learnable tokens that are prepended or appended to the input sequence at the keystonefuse_layer. The KeystoneFuseAdapter then attends to these fusion tokens, allowing them to effectively “absorb” and represent the summarized knowledge from the external encoder. This mechanism provides a dedicated pathway for the external information to influence the model’s internal representations without directly modifying the original input tokens, thereby maintaining the integrity of the primary language modeling task and ensuring no conflict with the CLM objective.
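The prepend variant of the Fusion Tokens mechanism can be sketched in a few lines. The function name and the choice of four fusion tokens are illustrative assumptions; the key property shown is the one the text emphasizes: the original token states pass through unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)

def prepend_fusion_tokens(hidden_states, fusion_tokens):
    """Hypothetical sketch: prepend learnable fusion tokens to the hidden
    sequence at the keystonefuse_layer. Attention can then write the
    encoder's summary into these extra slots without modifying the
    original token positions."""
    batch = hidden_states.shape[0]
    tiled = np.broadcast_to(fusion_tokens, (batch,) + fusion_tokens.shape)
    return np.concatenate([tiled, hidden_states], axis=1)

d_model, n_fusion = 64, 4
fusion_tokens = rng.normal(0.0, 0.02, (n_fusion, d_model))  # learnable params
hidden = rng.normal(size=(2, 16, d_model))                  # keystone-layer states

fused = prepend_fusion_tokens(hidden, fusion_tokens)
# fused has 4 extra positions; the trailing 16 positions equal `hidden`.
```

Because the fusion tokens occupy dedicated positions, any gradient from the summarization objective flows through them rather than overwriting the token representations the CLM objective depends on.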

2.3. Baseline Comparison

We compare our KeystoneFuse models against a standard baseline model pre-trained using only the causal language modeling objective. The KeystoneFuse models were pre-trained for 3 epochs, while the baseline model was pre-trained for 6 epochs to provide a robust performance benchmark.

3. Results and Analysis

Our results demonstrate a clear advantage for the KeystoneFuse architecture in both final performance and training efficiency.

3.1. Performance vs. Training Tokens


Figure 1: Test Loss (Cross-Entropy) vs. Total Tokens Trained. This graph compares the performance of a KeystoneFuse model (KSL3, purple) against the baseline (blue). The vertical dashed line indicates the end of the KeystoneFuse training run at approximately 22.5 billion tokens.

The training dynamics, illustrated in Figure 1, provide strong evidence for the data efficiency of the KeystoneFuse method. We observe several key trends:

Faster Convergence: From the very beginning of training, the KeystoneFuse model (purple line) exhibits a steeper decline in test loss compared to the baseline model (blue line). This indicates that the auxiliary summarization signal allows the model to learn more effectively and converge on a better solution more quickly.

Consistent Performance Lead: At any given point along the x-axis up to 22.5B tokens, the KeystoneFuse model’s loss is consistently lower than the baseline’s loss. This means that for the same amount of computational effort and data, KeystoneFuse yields a superior model.

Superior Final Performance with Half the Data: This is the most critical finding. The KeystoneFuse training was stopped at 22.5B tokens. At this point, its final test loss is visibly lower than the loss achieved by the baseline model even after the baseline continued training for another 22.5B tokens (for ~45B total). The baseline model’s performance saturates at a higher loss value, while KeystoneFuse reaches a better final performance with half the training data.

3.2. Performance vs. Training Epochs

As shown in Table 1, the KeystoneFuse models consistently outperform the baseline at every stage of pre-training. Most notably, the top-performing KeystoneFuse models achieved a lower test loss after only 3 epochs than the baseline model achieved after 6 epochs.

Table 1: Performance vs. Training Epochs.

| Model    | Epoch 1 | Epoch 2 | Epoch 3 (KSL Final) | Epoch 6 (Baseline Final) |
|----------|---------|---------|---------------------|--------------------------|
| Baseline | 2.1719  | 2.1563  | 2.0781              | 2.0625                   |
| KSL3     | 2.1250  | 2.0938  | 2.0313              | —                        |
| KSL4     | 2.1406  | 2.1094  | 2.0313              | —                        |

3.3. Impact of Fusion Layer


Figure 2: Final Test Loss vs Fusion Layer. This graph illustrates the impact of the fusion layer location on the final test loss, showing that applying the fusion adapter at earlier layers yielded the best results.

Table 2: Final Test Loss vs Fusion Layer.

| KSL (Fusion Layer) | Final Test Loss (CE) | Performance Tier |
|--------------------|----------------------|------------------|
| 2, 3, 4            | 2.03125              | Top Tier         |
| 5, 6, 10, 12, 14   | 2.046875             | Second Tier      |

Fusing at layers 2, 3, and 4 consistently produced the lowest test loss. This suggests that introducing the summarization objective early in the network allows the model to develop its core representations around this guidance, leading to a more robust final model. While later-layer fusion still outperformed the baseline, its impact was less pronounced.

4. Discussion

The superior performance and efficiency of KeystoneFuse can be attributed to its guided learning approach. By forcing the model to generate representations that align with a summarized version of the input, we provide a strong, high-level training signal. This prevents the model from getting stuck in suboptimal local minima and accelerates the learning of meaningful, generalizable features.

The success of early-layer fusion supports the hypothesis that foundational representations are more pliable and benefit more from this guidance. Introducing the summarization objective too late in the network may not allow for sufficient processing and integration with the model’s existing knowledge.

5. Conclusion

We have presented KeystoneFuse, a novel and highly efficient pre-training methodology for causal language models. Our experiments provide strong empirical evidence that KeystoneFuse models can achieve superior performance with as little as half the training data and time compared to a standard pre-training baseline. The optimal results are obtained by guiding the model with summarized representations at early layers of the transformer architecture. These findings suggest that KeystoneFuse is a promising direction for pre-training more powerful and specialized AI models in a resource-conscious manner.

#AI