Technology Innovation Institute trains the state-of-the-art Falcon LLM 40B foundation model on Amazon SageMaker

This blog post is co-written with Dr. Ebtesam Almazrouei, Executive Director–Acting Chief AI Researcher of the AI-Cross Center Unit and Project Lead for LLM Projects at TII.

The United Arab Emirates’ (UAE) Technology Innovation Institute (TII), the applied research pillar of Abu Dhabi’s Advanced Technology Research Council, has launched Falcon LLM, a foundational large language model (LLM) with 40 billion parameters. TII is a leading global research center dedicated to pushing the frontiers of knowledge. Its team of scientists, researchers, and engineers works to deliver discovery science and transformative technologies, focusing on breakthroughs that will future-proof our society. Trained on 1 trillion tokens, TII’s Falcon LLM boasts top-notch performance while remaining remarkably cost-effective. Falcon-40B matches the performance of other high-performing LLMs and is the top-ranked open-source model on the public Hugging Face Open LLM leaderboard. It’s available as open source in two sizes, Falcon-40B and Falcon-7B, and was built from scratch using data preprocessing and model training jobs built on Amazon SageMaker. Open-sourcing Falcon-40B enables users to construct and customize AI tools that cater to unique user needs, facilitating seamless integration and ensuring the long-term preservation of data assets. The model weights are available to download, inspect, and deploy anywhere.

Starting June 7th, both Falcon LLMs will also be available in Amazon SageMaker JumpStart, SageMaker’s machine learning (ML) hub that offers pre-trained models, built-in algorithms, and pre-built solution templates to help you quickly get started with ML. You can deploy and use the Falcon LLMs with a few clicks in SageMaker Studio or programmatically through the SageMaker Python SDK. To deploy and run inference against Falcon LLMs, refer to the Introduction to SageMaker JumpStart – Text Generation with Falcon LLMs example notebook.

Dr. Ebtesam Almazrouei, Executive Director–Acting Chief AI Researcher of the AI-Cross Center Unit and Project Lead for LLM Projects at TII, shares:

“We proudly announce the official open-source release of Falcon-40B, the world’s top-ranking open-source language model. Falcon-40B is an exceptional open-source model with 40B parameters, specifically designed as a causal decoder-only model. It was trained on a vast dataset of 1,000B tokens, including RefinedWeb enhanced with curated corpora. The model is made available under the Apache 2.0 license, ensuring its accessibility and usability. Falcon-40B has surpassed renowned models like LLaMA-65B, StableLM and MPT on the public leaderboard maintained by Hugging Face. The architecture of Falcon-40B is optimized for inference, incorporating FlashAttention and multiquery techniques.”

“This step reflects our dedication to pushing the boundaries of AI innovation and technology readiness level for community engagement, education, real-world applications, and collaboration,” continues Dr. Almazrouei. “By releasing Falcon-40B as an open-source model, we provide researchers, entrepreneurs, and organizations with the opportunity to harness its exceptional capabilities and drive advancements in AI-driven solutions, from healthcare to space, finance, manufacturing, and biotech; the possibilities are boundless. To access Falcon-40B and explore its remarkable potential, please visit the Falcon LLM website. Join us in leveraging the power of Falcon-40B to shape the future of AI and revolutionize industries.”

In this post, we dive deep with Dr. Almazrouei about Falcon LLM training on SageMaker, data curation, optimization, performance, and next steps.

A new generation of LLMs

LLMs are software algorithms trained to complete natural text sequences. Due to their size and the volume of training data they interact with, LLMs have impressive text processing abilities, including summarization, question answering, in-context learning, and more.

In the early 2020s, research organizations across the world placed the emphasis on model size, observing that accuracy correlated with the number of parameters. For example, GPT-3 (2020) and BLOOM (2022) feature around 175 billion parameters, Gopher (2021) has 230 billion parameters, and MT-NLG (2021) 530 billion parameters. In 2022, Hoffmann et al. observed that the prevailing balance of compute between model parameters and dataset size was suboptimal, and published empirical scaling laws suggesting that shifting the compute budget towards smaller models trained on more data could yield better-performing models. They implemented their guidance in the 70B-parameter Chinchilla (2022) model, which outperformed much bigger models.
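As a rough illustration of these scaling laws, two commonly cited heuristics are that training compute is approximately 6·N·D FLOPs for N parameters and D tokens, and that the compute-optimal token count is roughly 20 tokens per parameter. The sketch below is back-of-the-envelope arithmetic, not TII’s actual compute budget:

```python
# Rough illustration of the compute-optimal scaling heuristics from
# Hoffmann et al. (2022): training compute C ~= 6 * N * D FLOPs, and
# the compute-optimal token count is roughly D ~= 20 * N.
# Numbers here are a back-of-the-envelope sketch.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs for a dense transformer."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal dataset size under the ~20 tokens/parameter rule."""
    return 20.0 * n_params

# A 40B-parameter model trained on 1T tokens (Falcon-40B's setup):
print(f"{train_flops(40e9, 1e12):.2e}")  # -> 2.40e+23 FLOPs

# The ~20 tokens/parameter heuristic suggests ~800B tokens for 40B params:
print(f"{chinchilla_optimal_tokens(40e9):.0e}")  # -> 8e+11
```

Falcon-40B’s 1 trillion training tokens sit comfortably above this heuristic, consistent with the data-heavy direction Chinchilla advocated.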

LLM training on SageMaker

SageMaker is a collection of managed APIs for developing, training, tuning, and hosting machine learning (ML) models, including LLMs. Numerous customers rely on SageMaker for their LLM workloads, such as Stability AI, AI21 Labs, Hugging Face, and LG AI. SageMaker Training provisions compute clusters with user-defined hardware configuration and code. Compute jobs are billed per run, pro-rated to the second, meaning that users are not charged for GPU capacity when not using the service. TII used transient clusters provided by the SageMaker Training API to train the Falcon LLM, up to 48 ml.p4d.24xlarge instances, for a total of 384 NVIDIA A100 GPUs. Now, TII is training the next Falcon LLM and has scaled their training to 3,136 A100 GPUs (392 ml.p4d instances).
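To illustrate this per-second, pro-rated billing model, the following sketch estimates the cost of a transient training cluster; the hourly rate is a hypothetical placeholder, not an actual AWS price:

```python
# Sketch of SageMaker Training's per-second, pro-rated billing.
# The hourly rate below is a placeholder, not an actual AWS price.

def training_job_cost(num_instances: int, hourly_rate_usd: float,
                      runtime_seconds: int) -> float:
    """Cost of a transient training cluster, billed to the second."""
    return num_instances * hourly_rate_usd * runtime_seconds / 3600.0

# e.g. 48 ml.p4d.24xlarge instances (384 A100 GPUs) running for 90 minutes
# at a hypothetical $30/instance-hour:
print(round(training_job_cost(48, 30.0, 90 * 60), 2))  # -> 2160.0
```

Because billing stops the moment the job ends, there is no idle-cluster cost between experiments, which is what makes transient clusters attractive for bursty LLM workloads.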

An unprecedented amount of custom innovations went into all layers of the project in order to raise the bar of science quality and training speed. In the next sections, we describe the optimizations TII conducted at all layers of the deep learning (DL) training system.

Scalable data curation

Latest-generation LLMs get their strength from the size and quality of training data. The team put specific care into crafting a high-quality trillion-token dataset. Several SageMaker Training CPU jobs transformed petabytes of cheap, scalable web data into a curated, safe training dataset. Automated systems filtered and deduplicated the data; for example, ML classifiers were used to filter profanity. CPU jobs running on ml.c5.18xlarge instances (72 vCPUs, 144 GB RAM) were instantiated with a few API calls via SageMaker Training to run data transformation tasks. The team used both single-instance and multi-instance CPU jobs for different use cases. Some of these jobs used hundreds of parallel share-nothing architecture (SNA) jobs, each on a single machine; for tasks requiring inter-worker synchronization, the team launched multi-instance jobs, totaling dozens of instances and thousands of vCPUs. Anecdotally, on a downstream dataset preparation task, the team went up to 257 ml.c5.18xlarge instances in a single SageMaker Training job, for a total of 18,504 vCPUs and 37 TB of memory.
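A minimal sketch of this kind of filtering and deduplication pass follows, assuming a trivial keyword blocklist as a stand-in for the ML classifiers TII actually used:

```python
# Minimal sketch of a share-nothing filtering and deduplication pass:
# each worker hashes normalized documents and keeps the first occurrence.
# The keyword blocklist is a trivial stand-in for a learned classifier.
import hashlib

BLOCKLIST = {"badword"}  # placeholder for an ML profanity classifier

def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace so near-identical docs hash alike."""
    return " ".join(doc.lower().split())

def keep(doc: str, seen: set) -> bool:
    """Return True if the document is clean and not a duplicate."""
    if any(w in BLOCKLIST for w in normalize(doc).split()):
        return False
    digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

docs = ["Hello   world", "hello world", "some badword here", "fresh text"]
seen: set = set()
print([d for d in docs if keep(d, seen)])  # -> ['Hello   world', 'fresh text']
```

In a share-nothing layout, each worker runs this logic over its own shard; exact cross-shard deduplication then requires a synchronization step, which is where the multi-instance jobs come in.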

Maximizing training throughput

To minimize both training costs and time to market, the team pursued several directions of optimization to accelerate training speed, which is proportional to training tokens processed per second and measured in TFLOPs/GPU. The team used a fully custom 3D-parallel LLM training framework, featuring custom optimized layers written in compiled GPU code. The team went as far as writing their own custom matrix multiplication implementation to gain further speed! The team also developed logic that adapts parallel communication to the underlying network topology. During their initial scaling experiments, TII was able to reach 166 TFLOPs/GPU on a 147B model on 256 GPUs, and 173 TFLOPs/GPU on a 13B model on 16 GPUs, to our knowledge the fastest model TFLOPs achieved in the cloud at the time of the test in late 2022.
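For context on the TFLOPs/GPU metric, the figure can be estimated from throughput using the common ~6·N FLOPs-per-token approximation; the throughput number below is hypothetical, chosen only to illustrate the arithmetic:

```python
# How a TFLOPs/GPU figure can be derived from training throughput,
# using the common ~6 * N FLOPs-per-token approximation:
#   TFLOPs/GPU = 6 * params * tokens_per_second / (num_gpus * 1e12)
# The throughput below is hypothetical, for illustration only.

def tflops_per_gpu(n_params: float, tokens_per_sec: float, n_gpus: int) -> float:
    """Model FLOPs throughput per GPU, in teraFLOPs per second."""
    return 6.0 * n_params * tokens_per_sec / (n_gpus * 1e12)

# e.g. a 13B model processing a hypothetical ~35,500 tokens/s on 16 GPUs:
print(round(tflops_per_gpu(13e9, 35_500, 16), 1))  # -> 173.1
```

Comparing this number against the hardware’s peak TFLOPs gives the utilization fraction, the figure of merit the team was optimizing.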

Serverless storage

LLM training is storage intensive; several terabytes of training data need to be channeled to the training cluster, and several terabytes of model checkpoints regularly travel back from the cluster to the permanent storage. Checkpoints also need to reach the training cluster as fast as possible in the event of job restart. In traditional high-performance computing (HPC), computing nodes are connected to distributed file systems, which provide high-performance I/O and throughput via a POSIX-like interface. In AWS, customers regularly use the Amazon FSx for Lustre file system for this purpose (for more details, refer to Speed up training on Amazon SageMaker using Amazon FSx for Lustre and Amazon EFS file systems), and we also documented the self-managed use of BeeGFS in a distributed computer vision case study. Due to their focus on costs and operational simplicity, the team decided not to implement and operate file system servers, but instead took up the challenge of building exclusively on top of Amazon Simple Storage Service (Amazon S3), a serverless object store. A custom S3 dataset class was built using the AWS SDK for Python (Boto3), and provided satisfactory performance while enabling the scientists to iterate autonomously on I/O engineering and model science within the same codebase.
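A minimal sketch of what such an S3-backed dataset might look like follows; the class and function names (`S3TextDataset`, `shard_keys`) are hypothetical, not TII’s actual code:

```python
# Sketch of a custom S3-backed dataset in the spirit described above.
# Names are illustrative, not TII's actual code. boto3 is imported
# lazily so the pure sharding logic can be used (and tested) standalone.

def shard_keys(keys, rank: int, world_size: int):
    """Deterministically assign a disjoint slice of object keys to each worker."""
    return sorted(keys)[rank::world_size]

class S3TextDataset:
    """Streams training shards straight from S3, no file system required."""

    def __init__(self, bucket: str, keys, rank: int = 0, world_size: int = 1):
        import boto3  # AWS SDK for Python; only needed when actually reading
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.keys = shard_keys(keys, rank, world_size)

    def __iter__(self):
        for key in self.keys:
            body = self.s3.get_object(Bucket=self.bucket, Key=key)["Body"]
            yield body.read()

# Each of 4 workers sees a disjoint quarter of the shards:
print(shard_keys([f"shard-{i:03d}" for i in range(8)], rank=1, world_size=4))
# -> ['shard-001', 'shard-005']
```

Deterministic sharding like this lets every data-parallel worker pull only its own objects, so aggregate read bandwidth scales with cluster size rather than being bottlenecked on a single file server.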

Client-side innovation

An LLM project rarely consists of a single training job; numerous jobs are needed to conduct initial tests and experiments. Over the course of the main production training, several jobs may be chained, for example to update configuration or software versions, deploy patches, or recover from failures. Scientists from TII conducted significant engineering to build custom clients adapted to LLM training. A launcher client was built on top of the SageMaker Training SDK in order to pack together multiple functionalities in one command, for example code versioning, Docker image building, and job launch. Additionally, an AWS Lambda serverless compute function was designed to watch, monitor, and intervene on jobs as needed.

Using Slack bots for inference quality audits

Towards the end of training, the team deployed the model on an internal SageMaker Hosting GPU endpoint for real-time interaction. The team even created a Slack bot to converse with the model, gathering realistic feedback and running qualitative quality audits.

Training and performance monitoring

Training an LLM requires large amounts of computational resources, including CPU, GPU, and memory resources. Therefore, TII needed to monitor the performance and idle time of the training job to ensure optimal utilization of the computational resources and their cost-effectiveness.

To build an automated monitoring solution, TII used Amazon CloudWatch alarms to monitor the GPU, CPU, and memory utilization of the training jobs. CloudWatch collects raw data and processes it into readable, near-real-time metrics from the underlying container instances used in the SageMaker Training job. TII then set thresholds for each of these metrics; if any metric falls below its threshold, an alarm is triggered. This alarm notifies TII’s team of the low resource utilization, allowing them to take corrective actions to rectify resource utilization constraints.

In addition to monitoring resource utilization, TII could also monitor the idle time of the training job resources. If the training job resources were idle for a prolonged period of time, it could indicate a bottleneck at any stage of the training cycle and require manual investigation. In some instances, resource utilization was still relatively optimal, but the training process itself wasn’t progressing. For these cases, TII integrated CloudWatch alarms with Lambda functions that query and read the generated training logs, then take automatic actions based on either the generated error or the idleness of the log generation process (cluster halted). The alarm triggers an action to stop the training job, ensuring that TII doesn’t incur unnecessary costs when resources are not being utilized.
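The decision logic behind such a watchdog might be sketched as follows; the thresholds and function names are illustrative assumptions, not TII’s actual implementation:

```python
# Sketch of watchdog decision logic in the spirit described above:
# stop a job if GPU utilization stays below a floor, or if the job's
# logs have gone silent for too long (cluster halted). Thresholds and
# names are illustrative assumptions.
import time

GPU_UTIL_FLOOR = 10.0        # percent; below this, resources look idle
LOG_SILENCE_LIMIT = 30 * 60  # seconds without new log lines

def should_stop(avg_gpu_util: float, last_log_ts: float, now: float) -> bool:
    """Decide whether a training job should be stopped by the watchdog."""
    if avg_gpu_util < GPU_UTIL_FLOOR:
        return True                      # GPUs idle: job likely stuck
    if now - last_log_ts > LOG_SILENCE_LIMIT:
        return True                      # logs silent: cluster likely halted
    return False

now = time.time()
print(should_stop(85.0, now - 60, now))    # healthy job -> False
print(should_stop(2.0, now - 60, now))     # idle GPUs -> True
print(should_stop(85.0, now - 3600, now))  # silent logs -> True
```

In the actual setup, a function like this would run inside Lambda, triggered by the CloudWatch alarm, and call the SageMaker API to stop the offending job.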


Using SageMaker paired with proprietary, custom innovation, TII was able to train a model that is state-of-the-art in multiple dimensions: technological breakthrough, science quality, training speed, and operational simplicity.

“Releasing the UAE’s Falcon-40B, the world’s top-ranked open-source AI model, illustrates our technology leadership and paves the way for AI-powered innovation in the region,” says Dr. Ebtesam Almazrouei, adding: “We demonstrate our commitment to the objectives outlined in the National AI Strategy 2031. Our active involvement in global technological advancements, represented by Falcon-40B, plays a crucial role in our pursuit of a knowledge-based economy. Through investments and development in AI solutions, we aim to create new opportunities for economic growth, social progress, and educational advancements.

“The open-source nature of Falcon-40B reflects our dedication to collaboration, transparency, innovation, and research in the field of AI. We believe in democratizing advanced AI technology capabilities, making Falcon-40B accessible to researchers and organizations worldwide.”

“Looking ahead, we will continue to contribute to AI and technology advancements, with upcoming models in the pipeline. Moreover, we will actively promote the adoption of advanced AI technology within organizations and businesses in our country, fostering growth and prosperity aligned with our strategic goals.”

– Dr. Almazrouei

To learn more about Falcon LLM, check out the website and the model card on Hugging Face!

About the Authors

Dr. Ebtesam Almazrouei is the Executive Director–Acting Chief AI Researcher and Founder of the AI-Cross Center Unit at the Technology Innovation Institute (TII), where she has played a pivotal role in shaping TII’s AI capabilities. Her strategic vision and expertise in AI and machine learning have empowered her to lead groundbreaking research initiatives and foster cross-functional collaborations, resulting in the delivery of innovative AI solutions across multiple industries.

One of Dr. Almazrouei’s notable achievements is her instrumental role in the development of Falcon 40B, a cutting-edge LLM that has garnered global recognition. Falcon 40B’s exceptional performance ranked it as the number one LLM globally on Hugging Face’s leaderboard in May 2023. Additionally, she led the development of Noor, the world’s largest Arabic large language model (LLM), released in April 2022.

Dr. Almazrouei is recognized worldwide for her contributions to AI and was featured in the 2023 Leading AI Women in the World list, alongside other distinguished women in the field. She is also an advocate for sustainability and AI for Good initiatives, as well as the general chair of Abu Dhabi AI Connect and TPC chair of many IEEE international conferences.

Her contributions extend beyond her work at TII where she leads the big data expert subcommittee of the UAE Council for AI and Blockchain and is a member of the worldwide steering board of the Wireless World Research Forum (WWRF). She is a scientific author, patent inventor, entrepreneur, and renowned speaker, known for her keynote speeches at prestigious summits such as the AI Summit in London, World AI Cannes Festival, and Tech summits.

Will Badr is a Sr. Manager AI/ML Solutions Architects based in Dubai – UAE who works as part of the global Amazon Machine Learning team. Will is passionate about using technology in innovative ways to positively impact the community. In his spare time, he likes to go diving, play soccer and explore the Pacific Islands.

Olivier Cruchant is a Machine Learning Specialist Solutions Architect at AWS, based in France. Olivier helps AWS customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.

