The Metal framework supports GPU-accelerated advanced 3D graphics rendering and data-parallel computation workloads. Metal provides a modern and streamlined API for fine-grain, low-level control of the organization, processing, and submission of graphics and computation commands and the management of the associated data and resources for these commands. A primary goal of Metal is to minimize the CPU overhead necessary for executing these GPU workloads.

Metal Programming Guide

Metal is a highly optimized framework for programming the GPUs found in iPhones and iPads. The name derives from the fact that Metal is the lowest-level graphics framework on the iOS platform (i.e. it is "closest to the metal").

The framework is designed to achieve two different goals: 3D graphics rendering and parallel computations. These two things have a lot in common. Both tasks run special code on a huge amount of data in parallel and can be executed on a GPU.

Who Should Use Metal?

Before talking about the API and shading language itself, we should discuss which developers will benefit from using Metal. As previously mentioned, Metal offers two functionalities: graphics rendering and parallel computing.

For those who are looking for a game engine, Metal is not the first choice. In this case, Apple's Scene Kit (3D) and Sprite Kit (2D) are better options. These APIs offer a high-level game engine, including physics simulation. Another alternative would be a full-featured 3D engine like Epic's Unreal Engine, or Unity, both of which are not limited to Apple's platform. In each of those cases, you profit (or will profit in the future) from the power of Metal without using the API directly.

When writing a rendering engine based on a low-level graphics API, the alternatives to Metal are OpenGL and OpenGL ES. Not only is OpenGL available for nearly every platform, including OS X, Windows, Linux, and Android, but there are also a huge amount of tutorials, books, and best practice guides written about OpenGL. Right now, Metal resources are very limited and you are constrained to iPhones and iPads with a 64-bit processor. On the other hand, due to limitations in OpenGL, its performance is not always optimal compared to Metal, which was written specifically to overcome such issues.

When looking for a high-performance parallel computing library on iOS, the question of which one to use is answered very simply. Metal is the only option. OpenCL is a private framework on iOS, and Core Image (which uses OpenCL) is neither powerful nor flexible enough for this task.

Benefits of Using Metal

The biggest advantage of Metal is a dramatically reduced overhead compared to OpenGL ES. Whenever you create a buffer or a texture in OpenGL, it is copied to ensure that the data cannot be accessed accidentally while the GPU is using it. Copying large resources such as textures and buffers is an expensive operation undertaken for the sake of safety. Metal, on the other hand, does not copy resources. The developer is responsible for synchronizing the access between the CPU and GPU. Luckily, Apple provides another great API, which makes resource synchronization much easier: Grand Central Dispatch. Still, using Metal requires some awareness of this topic, but a modern engine that is loading and unloading resources while rendering profits a lot when this extra copy is avoided.

Another advantage of Metal is its use of pre-evaluated GPU state to avoid redundant validation and compilation. Traditionally in OpenGL, you set GPU states one after the other, and the new set of states needs to be validated when making a draw call. In the worst case, OpenGL needs to recompile shaders again to reflect the new states. Of course, this evaluation is necessary, but Metal chooses to take a different approach here. During the initialization of the rendering engine, a set of states is baked into a pre-evaluted rendering pass. This rendering pass object can be used with different resources, but the other states will be constant. A render pass in Metal can be used without further validation, which reduces the API overhead to a minimum, and allows a greatly increased number of draw calls per frame.

The Metal API

Although many APIs on the platform expose concrete classes, Metal offers many of its types as protocols. The reason for this is that the concrete types of Metal objects are dependent on the device on which Metal is running. This also encourages programming to an interface rather than an implementation. This also means, however, that you won't be able to subclass Metal classes or add categories without making extensive and dangerous use of the Objective-C runtime.

Metal necessarily compromises some safety for speed. In regard to errors, other Apple frameworks are getting more secure and robust, but Metal is totally different. In some instances, you receive a bare pointer to an internal buffer, the access to which you must carefully synchronize. When things go wrong in OpenGL, the result is usually a black screen; in Metal, the results can be totally random effects, including a flickering screen and occasionally a crash. These pitfalls occur because the Metal framework is a very light abstraction of a special-purpose processor next to your CPU.

An interesting side note is that Apple has not yet implemented a software render for Metal that can be used in the iOS Simulator. When linking the Metal framework, the application must be executed on a physical device.

A Basic Metal Program

In this section, we introduce the necessary ingredients for writing your first Metal program. This simple program draws a square rotating about its center. You can download the sample code for this article from GitHub.

Although we can't cover each topic in detail, we will try to at least mention all of the moving parts. For more insight, you can read the sample code and consult other online resources.

Creating a Device and Interfacing with UIKit

In Metal, a device is an abstraction of the GPU. It is used to create many other kinds of objects, such as buffers, textures, and function libraries. To get the default device, use the MTLCreateSystemDefaultDevice function:

								id<MTLDevice> device = MTLCreateSystemDefaultDevice();


Notice that the device is not of a particular concrete class, but instead conforms to the MTLDevice protocol as mentioned above.

The following snippet shows how to create a Metal layer and add it as a sublayer of a UIView's backing layer:

								CAMetalLayer *metalLayer = [CAMetalLayer layer];
metalLayer.device = device;
metalLayer.pixelFormat = MTLPixelFormatBGRA8Unorm;
metalLayer.frame = view.bounds;
[view.layer addSublayer:self.metalLayer];


CAMetalLayer is a subclass of CALayer that knows how to display the contents of a Metal framebuffer. We must tell the layer which Metal device to use (the one we just created), and inform it of its expected pixel format. We choose an 8-bit-per-channel BGRA format, wherein each pixel has blue, green, red, and alpha (transparency) components, with values ranging from 0-255, inclusive.

Libraries and Functions

Much of the functionality of your Metal program will be written in the form of vertex and fragment functions, colloquially known as shaders. Metal shaders are written in the Metal shading language, which we will discuss in greater detail below. One of the advantages of Metal is that shader functions are compiled at the time your app builds into an intermediate language, saving valuable time when your app starts up.

A Metal library is simply a collection of functions. All of the shader functions you write in your project are compiled into the default library, which can be retrieved from the device:

								id<MTLLibrary> library = [device newDefaultLibrary]


We will use the library when building the render pipeline state below.

The Command Queue

Commands are submitted to a Metal device through its associated command queue. The command queue receives commands in a thread-safe fashion and serializes their execution on the device. Creating a command queue is straightforward:

								id<MTLCommandQueue> commandQueue = [device newCommandQueue];


Building the Pipeline

When we speak of the pipeline in Metal programming, we mean the different transformations the vertex data undergoes as it is rendered. Vertex shaders and fragment shaders are two programmable junctures in the pipeline, but there are other things that must happen (clipping, scan-line rasterization, and the viewport transform) that are not under our direct control. This latter class of pipeline features constitute the fixed-function pipeline.

To create a pipeline in Metal, we need to specify which vertex and fragment function we want executed for each vertex and each pixel, respectively. We also need to tell the pipeline the pixel format of our framebuffer. In this case, it must match the format of the Metal layer, since we want to draw to the screen.

To get the functions, we ask for them by name from the library:

								id<MTLFunction> vertexProgram = [library newFunctionWithName:@"vertex_function"];
id<MTLFunction> fragmentProgram = [library newFunctionWithName:@"fragment_function"];


We then create a pipeline descriptor configured with the functions and the pixel format:

								MTLRenderPipelineDescriptor *pipelineStateDescriptor = [[MTLRenderPipelineDescriptor alloc] init];
[pipelineStateDescriptor setVertexFunction:vertexProgram];
[pipelineStateDescriptor setFragmentFunction:fragmentProgram];
pipelineStateDescriptor.colorAttachments[0].pixelFormat = MTLPixelFormatBGRA8Unorm;


Finally, we create the pipeline state itself from the descriptor. This compiles the shader functions from their intermediate representation into optimized code for the hardware on which the program is running:

								id<MTLRenderPipelineState> pipelineState = [device newRenderPipelineStateWithDescriptor:pipelineStateDescriptor error:nil];


Loading Data into Buffers

Now that we have a pipeline built, we need data to feed through it. In the sample project, we draw a simple bit of geometry: a spinning square. The square is comprised of two right triangles that share an edge:

								static float quadVertexData[] =
     0.5, -0.5, 0.0, 1.0,     1.0, 0.0, 0.0, 1.0,
    -0.5, -0.5, 0.0, 1.0,     0.0, 1.0, 0.0, 1.0,
    -0.5,  0.5, 0.0, 1.0,     0.0, 0.0, 1.0, 1.0,
     0.5,  0.5, 0.0, 1.0,     1.0, 1.0, 0.0, 1.0,
     0.5, -0.5, 0.0, 1.0,     1.0, 0.0, 0.0, 1.0,
    -0.5,  0.5, 0.0, 1.0,     0.0, 0.0, 1.0, 1.0,


The first four numbers of each row represent the x, y, z, and w components of each vertex. The second four numbers represent the red, green, blue, and alpha color components of the vertex.

You may be surprised that there are four numbers required to express a position in 3D space. The fourth component of the vertex position, w, is a mathematical convenience that allows us to represent 3D transformations (rotation, translation, and scaling) in a unified fashion. This detail is not relevant to the sample code for this article.

To draw the vertex data with Metal, we need to place it into a buffer. Buffers are simply unstructured blobs of memory that are shared by the CPU and GPU:

								vertexBuffer = [device newBufferWithBytes:quadVertexData


We will use another buffer for storing the rotation matrix we use to spin the square around. Rather than providing the data up front, we'll just make room for it by creating a buffer of a prescribed length:

								uniformBuffer = [device newBufferWithLength:sizeof(Uniforms) 



In order to rotate the square on the screen, we need to transform the vertices as part of the vertex shader. This requires updating the uniform buffer each frame. To do this, we use trigonometry to generate a rotation matrix from the current rotation angle and copy it into the uniform buffer.

The Uniforms struct has a single member, which is a 4x4 matrix that holds the rotation matrix. The type of the matrix, matrix_float4x4, comes from Apple's SIMD library, a collection of types that take advantage of data-parallel operations where they are available:

								typedef struct
    matrix_float4x4 rotation_matrix;
} Uniforms;


To copy the rotation matrix into the uniform buffer, we get a pointer to its contents and memcpy the matrix into it:

								Uniforms uniforms;
uniforms.rotation_matrix = rotation_matrix_2d(rotationAngle);
void *bufferPointer = [uniformBuffer contents];
memcpy(bufferPointer, &uniforms, sizeof(Uniforms));


Getting Ready to Draw

In order to draw into the Metal layer, we first need to get a 'drawable' from the layer. The drawable object manages a set of textures that are appropriate for rendering into:

								id<CAMetalDrawable> drawable = [metalLayer nextDrawable];


Next, we create a render pass descriptor, which describes the various actions Metal should take before and after rendering is done. Below, we describe a render pass that will first clear the framebuffer to a solid white color, then execute its draw calls, and finally store the results into the framebuffer for display:

								MTLRenderPassDescriptor *renderPassDescriptor = [MTLRenderPassDescriptor renderPassDescriptor];
renderPassDescriptor.colorAttachments[0].texture = drawable.texture;
renderPassDescriptor.colorAttachments[0].loadAction = MTLLoadActionClear;
renderPassDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(1, 1, 1, 1);
renderPassDescriptor.colorAttachments[0].storeAction = MTLStoreActionStore;


Issuing Draw Calls

To place commands into a device's command queue, they must be encoded into a command buffer. A command buffer is a set of one or more commands that will be executed and encoded in a compact way that the GPU understands:

								id<MTLCommandBuffer> commandBuffer = [self.commandQueue commandBuffer];


In order to actually encode render commands, we need yet another object that knows how to convert from our draw calls into the language of the GPU. This object is called a command encoder. We create it by asking the command buffer for an encoder and passing the render pass descriptor we created above:

								id<MTLRenderCommandEncoder> renderEncoder = [commandBuffer renderCommandEncoderWithDescriptor:renderPassDescriptor];


Immediately before the draw call, we configure the render command encoder with our pre-compiled pipeline state and set up the buffers, which will become the arguments to our vertex shader:

								[renderEncoder setRenderPipelineState:pipelineState];
[renderEncoder setVertexBuffer:vertexBuffer offset:0 atIndex:0];
[renderEncoder setVertexBuffer:uniformBuffer offset:0 atIndex:1];


To actually draw the geometry, we tell Metal the type of shape we want to draw (triangles), and how many vertices it should consume from the buffer (six, in this case):

								[renderEncoder drawPrimitives:MTLPrimitiveTypeTriangle vertexStart:0 vertexCount:6];


Finally, to tell the encoder we are done issuing draw calls, we call endEncoding:

								[renderEncoder endEncoding];


Presenting the Framebuffer

Now that our draw calls are encoded and ready for execution, we need to tell the command buffer that it should present the results to the screen. To do this, we call presentDrawable with the current drawable object retrieved from the Metal layer:

								[commandBuffer presentDrawable:drawable];


To tell the buffer that it is ready for scheduling and execution, we call commit:

								[commandBuffer commit];


And that's it!

Metal Shading Language

Although Metal was announced in the same WWDC keynote as the Swift programming language, the shading language is based on C++11, with some limited features and some added keywords.

The Metal Shading Language in Practice

To use the vertex data from our shaders, we define a struct type that corresponds to the layout of the vertex data in the Objective-C program:

								typedef struct
    float4 position;
    float4 color;
} VertexIn;


We also need a very similar type to describe the vertex type that will be passed from our vertex shader to the fragment shader. However, in this case, we must identify (through the use of the [[position]] attribute) which member of the struct should be regarded as the vertex position:

								typedef struct {
 float4 position [[position]];
 float4 color;
} VertexOut;


The vertex function is executed once per vertex in the vertex data. It receives a pointer to the list of vertices, and a reference to the uniform data, which contains the rotation matrix. The third parameter is an index that tells the function which vertex it is currently operating on.

Note that the vertex function arguments are followed by attributes that indicate their usage. In the case of the buffer arguments, the index in the parameter corresponds to the index we specified when setting the buffers on the render command encoder. This is how Metal figures out which parameter corresponds to each buffer.

Inside the vertex function, we multiply the rotation matrix by the vertex's position. Because of how we built the matrix, this has the effect of rotating the square about its center. We then assign this transformed position to the output vertex. The vertex's color is copied directly from the input to the output:

								vertex VertexOut vertex_function(device VertexIn *vertices [[buffer(0)]],
                                 constant Uniforms &uniforms [[buffer(1)]],
                                 uint vid [[vertex_id]])
    VertexOut out;
    out.position = uniforms.rotation_matrix * vertices[vid].position;
    out.color = vertices[vid].color;
    return out;


The fragment function is executed once per pixel. The argument is produced by Metal in the process of rasterization by interpolating between the position and color parameters specified at each vertex. In this simple fragment function, we simply pass out the interpolated color already produced by Metal. This then becomes the color of the pixel on the screen:

								fragment float4 fragment_function(VertexOut in [[stage_in]])
    return in.color;


Why Not Just Extend OpenGL?

Apple is on the OpenGL Architecture Review Board, and has also historically provided its own GL extensions in iOS. But changing OpenGL from the inside seems to be a difficult task because it has different design goals. In particular, it must run on a broad variety of devices with a huge range of hardware capabilities. Although OpenGL is continuing to improve, the process is slower and more subtle.

Metal, on the other hand, was exclusively created with Apple's platforms in mind. Even if the protocol-based API looks unusual at first, it fits very well with the rest of the frameworks. Metal is written in Objective-C, is based on Foundation, and makes use of Grand Central Dispatch to synchronize between the CPU and GPU. It is a much more modern abstraction of the GPU pipeline than OpenGL can be without a complete rewrite.

Metal on the Mac?

It will only be a matter of time before Metal will be available for OS X, too. The API itself is not limited to the ARM processors that power iPhones and iPads. Most of Metal's benefits are transferable to modern GPUs. Additionally, the iPhone and iPad share RAM between the CPU and GPU, which enables data exchange without actually copying any data. Current generations of Macs don't offer unified memory, but this too will only be a matter of time. Perhaps the API will be adjusted to support architectures with dedicated RAM, or Metal will only run on a future generation of Macs.


In this article, we have attempted to provide a helpful and unbiased introduction to the Metal framework.

Of course, most game developers will not have direct contact with Metal. However, top game engines already take advantage of it, and developers will profit from the newly available power without touching the API itself. Additionally, for those who want to use the full power of the hardware, Metal may enable developers to create unique and spectacular effects in their games, or perform parallel computations much faster, giving them a competitive edge.