Falcon 180B foundation model from TII is now available via Amazon SageMaker JumpStart

Today, we are excited to announce that the Falcon 180B foundation model developed by Technology Innovation Institute (TII) is available for customers through Amazon SageMaker JumpStart to deploy with one-click for running inference. With a 180-billion-parameter size and trained on a massive 3.5-trillion-token dataset, Falcon 180B is the largest and one of the most performant models with openly accessible weights. You can try out this model with SageMaker JumpStart, a machine learning (ML) hub that provides access to algorithms, models, and ML solutions so you can quickly get started with ML. In this post, we walk through how to discover and deploy the Falcon 180B model via SageMaker JumpStart.

What is Falcon 180B

Falcon 180B is a model released by TII that follows previous releases in the Falcon family. It’s a scaled-up version of Falcon 40B, and it uses multi-query attention for better scalability. It’s an auto-regressive language model that uses an optimized transformer architecture. It was trained on 3.5 trillion tokens of data, primarily consisting of web data from RefinedWeb (approximately 85%). The model has two versions: 180B and 180B-Chat. 180B is a raw, pre-trained model, which should be further fine-tuned for most use cases. 180B-Chat is better suited to taking generic instructions. The Chat model has been fine-tuned on chat and instructions datasets together with several large-scale conversational datasets.

The model is made available under the Falcon-180B TII License and Acceptable Use Policy.

Falcon 180B was trained by TII on Amazon SageMaker, on a cluster of approximately 4K A100 GPUs. It used a custom distributed training codebase named Gigatron, which uses 3D parallelism with ZeRO, and custom, high-performance Triton kernels. The distributed training architecture used Amazon Simple Storage Service (Amazon S3) as the sole unified service for data loading and checkpoint writing and reading, which particularly contributed to the workload reliability and operational simplicity.

What is SageMaker JumpStart

With SageMaker JumpStart, ML practitioners can choose from a growing list of best-performing foundation models. ML practitioners can deploy foundation models to dedicated SageMaker instances within a network isolated environment, and customize models using Amazon SageMaker for model training and deployment.

You can now discover and deploy Falcon 180B with a few clicks in Amazon SageMaker Studio or programmatically through the SageMaker Python SDK, enabling you to derive model performance and MLOps controls with SageMaker features such as Amazon SageMaker Pipelines, Amazon SageMaker Debugger, or container logs. The model is deployed in an AWS secure environment and under your VPC controls, helping ensure data security. Falcon 180B is discoverable and can be deployed in Regions where the requisite instances are available. At present, ml.p4de instances are available in US East (N. Virginia) and US West (Oregon).

Discover models

You can access the foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools to perform all ML development steps, from preparing data to building, training, and deploying your ML models. For more details on how to get started and set up SageMaker Studio, refer to Amazon SageMaker Studio.

In SageMaker Studio, you can access SageMaker JumpStart, which contains pre-trained models, notebooks, and prebuilt solutions, under Prebuilt and automated solutions.

From the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources. You can find Falcon 180B in the Foundation Models: Text Generation carousel.

You can also find other model variants by choosing Explore all Text Generation Models or searching for Falcon.

You can choose the model card to view details about the model such as license, data used to train, and how to use. You will also find two buttons, Deploy and Open Notebook, which will help you use the model (the following screenshot shows the Deploy option).

Deploy models

When you choose Deploy, the model deployment will start. Alternatively, you can deploy through the example notebook that shows up by choosing Open Notebook. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, we start by selecting an appropriate model, specified by the model_id. You can deploy any of the selected models on SageMaker with the following code:

from sagemaker.jumpstart.model import JumpStartModel

my_model = JumpStartModel(model_id="huggingface-llm-falcon-180b-chat-bf16") predictor = my_model.deploy()

This deploys the model on SageMaker with default configurations, including the default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To learn more, refer to the API documentation. After it’s deployed, you can run inference against the deployed endpoint through a SageMaker predictor. See the following code:

payload = {
    "inputs": "User: Hello!nFalcon: ",
    "parameters": {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}
}
response = predictor.predict(payload)

Inference parameters control the text generation process at the endpoint. The max new tokens control refers to the size of the output generated by the model. Note that this is not the same as the number of words because the vocabulary of the model is not the same as the English language vocabulary and each token may not be an English language word. Temperature controls the randomness in the output. Higher temperature results in more creative and hallucinated outputs. All the inference parameters are optional.

This 180B parameter model is 335GB and requires even more GPU memory to sufficiently perform inference in 16-bit precision. Currently, JumpStart only supports this model on ml.p4de.24xlarge instances. It is possible to deploy an 8-bit quantized model on a ml.p4d.24xlarge instance by providing the env={"HF_MODEL_QUANTIZE": "bitsandbytes"} keyword argument to the JumpStartModel constructor and specifying instance_type="ml.p4d.24xlarge" to the deploy method. However, please note that per-token latency is approximately 5x slower for this quantized configuration.

The following table lists all the Falcon models available in SageMaker JumpStart along with the model IDs, default instance types, maximum number of total tokens (sum of the number of input tokens and number of generated tokens) supported, and the typical response latency per token for each of these models.

Model Name	Model ID	Default Instance Type	Max Total Tokens	Latency per Token*
Falcon 7B	`huggingface-llm-falcon-7b-bf16`	ml.g5.2xlarge	2048	34 ms
Falcon 7B Instruct	`huggingface-llm-falcon-7b-instruct-bf16`	ml.g5.2xlarge	2048	34 ms
Falcon 40B	`huggingface-llm-falcon-40b-bf16`	ml.g5.12xlarge	2048	57 ms
Falcon 40B Instruct	`huggingface-llm-falcon-40b-instruct-bf16`	ml.g5.12xlarge	2048	57 ms
Falcon 180B	`huggingface-llm-falcon-180b-bf16`	ml.p4de.24xlarge	2048	45 ms
Falcon 180B Chat	`huggingface-llm-falcon-180b-chat-bf16`	ml.p4de.24xlarge	2048	45 ms

*per-token latency is provided for the median response time of the example prompts provided in this blog; this value will vary based on length of input and output sequences.

Inference and example prompts for Falcon 180B

Falcon models can be used for text completion for any piece of text. Through text generation, you can perform a variety of tasks, such as answering questions, language translation, sentiment analysis, and many more. The endpoint accepts the following input payload schema:

payload = {
    "inputs": str,
    (optional)"parameters": {
        "do_sample": bool,
        "max_new_tokens": int,
        "repetition_penalty": float,
        "return_full_text": bool,
        "stop": List[str],
        "seed": int,
        "temperature": float,
        "top_k": int,
        "top_p": float,
        "truncate": int,
        "typical_p": float,
        "best_of": int,
        "watermark": bool,
        "details": bool,
        "decoder_input_details": bool,
    }
}

You can explore the definition of these client parameters and their default values within the text-generation-inference repository.

The following are some sample example prompts and the text generated by the model. All outputs here are generated with inference parameters {"max_new_tokens": 768, "stop": ["<|endoftext|>", "###"]}.

Building a website can be done in 10 simple steps:

Input Building a website can be done in 10 simple steps:

Output
- Choose a domain name
- Register your domain name
- Choose a web host
- Choose a website builder
- Create your website
- Add content to your website
- Add pages to your website
- Add images to your website
- Add videos to your website
- Add social media to your website
1. Choose a domain name
A domain name is the address of your website. It is what people will type into their browser to find your website. For example, the domain name of this website is.
When choosing a domain name, you should choose something that is easy to remember and that represents your business or website. You should also make sure that the domain name is available. You can check if a domain name is available by using a domain name search tool.
2. Register your domain name
Once you have chosen a domain name, you need to register it. This can be done through a domain name registrar. A domain name registrar is a company that manages the registration of domain names.
When registering your domain name, you will need to provide your contact information and choose a payment method. You will also need to choose a domain name extension. The most common domain name extensions are .com, .net, and .org.
3. Choose a web host
A web host is a company that provides space on a server for your website. When choosing a web host, you should choose one that is reliable and that offers good customer support. You should also make sure that the web host offers the features that you need for your website.
4. Choose a website builder
A website builder is a tool that allows you to create a website without having to learn how to code. There are many different website builders available, so you should choose one that is easy to use and that offers the features that you need.
5. Create your website
Once you have chosen a website builder, you can start creating your website. You will need to choose a template for your website and then add content to your website.
6. Add content to your website
The content of your website is what will make your website unique. You should add content that is relevant to your business or website. You can add text, images, videos, and other types of content to your website.
7. Add pages to your website
A website is made up of pages. You can add as many pages as you want to your website. Each page should have a different purpose. For example, you might have a page for your home page, a page for your about page, and a page for your contact page.
8. Add images to your website
Images can make your website more visually appealing. You can add images to your website by uploading them to your website or by using a image hosting service.
9. Add videos to your website
Videos can also make your website more visually appealing. You can add videos to your website by uploading them to your website or by using a video hosting service.
10. Add social media to your website
Social media can help you promote your website. You can add social media buttons to your website so that people can share your website on social media. You can also add social media feeds to your website so that people can see your latest social media posts.

You may notice this pretrained model generates long text sequences that are not necessarily ideal for dialog use cases. Before we show how the fine-tuned chat model performs for a larger set of dialog-based prompts, the next two examples illustrate how to use Falcon models with few-shot in-context learning, where we provide training samples available to the model. Note that “few-shot learning” does not adjust model weights — we only perform inference on the deployed model during this process while providing a few examples within the input context to help guild model output.

Input
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been :+1:"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredible"
Sentiment:

Output
 Positive
###

Input
Translate English to French:
sea otter => loutre de mer
###
peppermint => menthe poivrée
###
plush girafe => girafe peluche
###
cheese =>

Output
 fromage
###

Inference and example prompts for Falcon 180B-Chat

With Falcon 180B-Chat models, optimized for dialogue use cases, the input to the chat model endpoints may contain previous history between the chat assistant and the user. You can ask questions contextual to the conversation that has happened so far. You can also provide the system configuration, such as personas, which define the chat assistant’s behavior. Input payload to the endpoint is the same as the Falcon 180B model except the inputs string value should use the following format:

System: 
User: 
Falcon: 
User: 
Falcon:

The following are some sample example prompts and the text generated by the model. All outputs are generated with inference parameters {"max_new_tokens":256, "stop": ["nUser:", "<|endoftext|>", " User:", "###"]}.

Input
System: You are a helpful assistant
User: What is Amazon SageMaker JumpStart?
Falcon: 

Output 
Amazon SageMaker JumpStart is a capability of Amazon SageMaker that provides a set of solutions to common machine learning problems. These solutions include pre-built machine learning models, algorithms, and pre-configured model training and deployment settings. JumpStart also provides sample notebooks that demonstrate how to use these solutions for various use cases. The goal of JumpStart is to help users get started with machine learning quickly and easily, without the need for extensive machine learning expertise.
User:

Input
User: Simply put, the theory of relativity states that
Falcon: 

Output 
1. The laws of physics are the same for all observers in uniform motion relative to one another.
2. The speed of light in a vacuum is the same for all observers, regardless of their motion or the motion of the light source.
3. Time and space are relative, meaning that they can vary depending on the observer's frame of reference.
4. Mass and energy are interchangeable, as expressed by the famous equation E=mc².
User:

Input
User: Building a website can be done in 10 simple steps:
Falcon: 

Output
1. Determine the purpose and goals of your website.
2. Choose a domain name and register it.
3. Select a web hosting provider and sign up for a plan.
4. Choose a website builder or content management system (CMS) to use.
5. Select a theme or template for your website.
6. Customize the design and layout of your website.
7. Add content to your website, including text, images, and videos.
8. Set up navigation menus and links to make your website easy to navigate.
9. Test your website to ensure it is functioning properly and is mobile-friendly.
10. Launch your website and promote it to attract visitors.
User:

Input
System: Always answer with a Haiku.
User: I am deploying a large language model on Amazon SageMaker JumpStart. What should I do?
Falcon: 

Output
"JumpStart" your dreams,
Deploy with ease, scale with glee,
SageMaker awaits.

Input
System: Generate a table summarizing the options outlined in this email.
User: Team, we have to make a decision on the product plan for Goocrux v2. The first option will take us 3 weeks of development, and will achieve performance of 80ms, but means that the UI has to change. The second option will take us 4 weeks of development, and will achieve performance of 60ms, without any UI change. The third option will take the longest -- 6 weeks of development, will achieve performance of 30ms, without any UI change. A markdown table with four columns: (1) Option ID, (2) Development time, (3) Performance, (4) UI Change.
Falcon: 

Output
| Option ID | Development time | Performance | UI Change |
| --- | --- | --- | --- |
| 1 | 3 weeks | 80ms | Yes |
| 2 | 4 weeks | 60ms | No |
| 3 | 6 weeks | 30ms | No |

Input
System: You always translate English to French.
User: sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>
Falcon: 

Output
=> fromage

Input
System: Classify the user prompt as Negative, Positive, or Neutral sentiment. Use a short response.
User: Tweet: "This new music video was awesome" 
Falcon: 

Output
"Positive"

Input
User: What is the recipe of mayonnaise?
Falcon: 

Output 1. In a mixing bowl, whisk together 1 egg yolk, 1 tablespoon of Dijon mustard, and 1 tablespoon of lemon juice until well combined.
2. While whisking continuously, slowly drizzle in 1 cup of vegetable oil until the mixture thickens and emulsifies.
3. Season with salt and pepper to taste.
4. Store in an airtight container in the refrigerator for up to one week.
User:

In the following example, the user has had a conversation with the assistant about tourist sites in Paris. Next, the user is inquiring about the first option recommended by the chat assistant.

Input
System: You are a helpful trip planner.
User: I am going to Paris, what should I see?
Falcon: 1. Eiffel Tower: No trip to Paris is complete without a visit to the iconic Eiffel Tower.
2. Louvre Museum: Home to some of the world's most famous art, including the Mona Lisa.
3. Notre-Dame Cathedral: A stunning example of Gothic architecture and a must-see for history buffs.
4. Champs-Élysées: One of the most famous avenues in the world, lined with shops, cafes, and theaters.
5. Montmartre: A charming neighborhood known for its bohemian vibe, street art, and stunning views of the city.
6. Palace of Versailles: A short train ride from Paris, this opulent palace was once the seat of French royalty.
7. Musée d'Orsay: Housed in a former train station, this museum boasts an impressive collection of Impressionist and Post-Impressionist art.
8. Sainte-Chapelle: A hidden gem in the heart of Paris, this medieval chapel is famous for its stunning stained glass windows.
9. Luxembourg Gardens: A beautiful park in the heart of the city, perfect for a picnic or a leisurely stroll.
User: What is so great about #1?
Falcon: 

Output
#1 is the Eiffel Tower, which is one of the most famous landmarks in the world. It is a symbol of Paris and a must-see attraction for many visitors. The tower offers stunning views of the city from its observation decks, and it is especially beautiful when lit up at night. Additionally, the Eiffel Tower has a rich history and is an engineering marvel, making it a fascinating destination for anyone interested in architecture or history.
User:

Clean up

After you’re done running the notebook, make sure to delete all resources that you created in the process so your billing is stopped. Use the following code:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

In this post, we showed you how to get started with Falcon 180B in SageMaker Studio and deploy the model for inference. Because foundation models are pre-trained, they can help lower training and infrastructure costs and enable customization for your use case. Visit SageMaker JumpStart in SageMaker Studio now to get started.

Resources

About the Authors

Dr. Kyle Ulrich is an Applied Scientist with the Amazon SageMaker JumpStart team. His research interests include scalable machine learning algorithms, computer vision, time series, Bayesian non-parametrics, and Gaussian processes. His PhD is from Duke University and he has published papers in NeurIPS, Cell, and Neuron.

Dr. Ashish Khetan is a Senior Applied Scientist with Amazon SageMaker JumpStart and helps develop machine learning algorithms. He got his PhD from University of Illinois Urbana-Champaign. He is an active researcher in machine learning and statistical inference, and has published many papers in NeurIPS, ICML, ICLR, JMLR, ACL, and EMNLP conferences.

Olivier Cruchant is a Principal Machine Learning Specialist Solutions Architect at AWS, based in France. Olivier helps AWS customers – from small startups to large enterprises – develop and deploy production-grade machine learning applications. In his spare time, he enjoys reading research papers and exploring the wilderness with friends and family.