Writing a custom Core Image kernel for a camera filter
Core Image ships with about two hundred filters, and for years that was enough. Then a client wanted a look that didn't exist in the catalogue, real-time, on the camera preview — and I had to stop chaining stock filters and write my own. It's less scary than it sounds: a CIKernel is just a tiny GPU program.
The mental block is the word "kernel." It conjures GPU programming, threadgroups, memory barriers — a whole discipline you'd rather not learn to ship one filter. But a Core Image kernel is the gentlest possible on-ramp to the GPU. You write a small function that takes a pixel and returns a pixel; Core Image handles the dispatch, the tiling, the buffer management, and the scheduling. You never touch a command encoder. Once that clicks, the stock-filter ceiling stops being a ceiling.
When the catalogue runs out
Apple's built-in filters are excellent and you should exhaust them first. CIColorControls, CIColorMatrix, the curve filters, the blur and convolution family — chain a few of those and you can fake a surprising amount. The reason to write your own kernel is rarely a single effect Core Image can't do. It's that the combination you need would take eight chained filters, and eight filters is eight intermediate buffers, eight passes over every pixel, and a frame budget you've already blown on a live preview.
My trigger was a "warm fade" film look: lift the shadows toward a teal, push highlights toward amber, desaturate slightly, then add a soft vignette — per pixel, the same math for every one. That's a textbook case for a single color kernel. One pass, one buffer, all the arithmetic fused into one function the compiler can optimise as a unit. The eight-filter version ran at maybe 24 fps on an iPhone 12 and dropped frames when you moved the camera. The one-kernel version pinned 60.
Color kernel vs. general kernel
Core Image gives you three flavours, and picking the right one is most of the performance battle:
CIColorKernel: the output pixel depends only on the same input pixel. No neighbours, no sampling elsewhere. This is the fast path: Core Image knows each output is independent, so it can tile and fuse aggressively. Use it for anything that's "a function of this pixel's color."
CIWarpKernel: you return a coordinate, not a color. For geometric distortion: pinch, twist, lens correction.
CIKernel: the general case, where output can sample input at arbitrary positions. Blurs, convolutions, anything that reads neighbours. Powerful, but you give up the locality guarantees, so reach for it only when you genuinely need other pixels.
My fade is purely per-pixel, so CIColorKernel it is. The vignette wants the pixel's position in the image, which a color kernel can also get: destCoord() in the old kernel language, a trailing coreimage::destination parameter in Metal. So I don't even need the general kernel for that.
Writing the kernel in Metal
You used to write these in a quirky GLSL dialect, the Core Image Kernel Language, embedded in a Swift string. Don't. Since iOS 11 you can write them in the Metal Shading Language, in a real .ci.metal file, compiled at build time. You get type checking, you get a compile error instead of a runtime surprise, and the shader is precompiled into your Metal library instead of parsed on first use.
The catch is two compiler flags. Core Image kernels need a slightly different Metal compile, so the file must be built with -fcikernel, and the metallib linked with -cikernel. In the target's Build Settings, add to Other Metal Compiler Flags:
-fcikernel
and to Other Metallib Linker Flags:
-cikernel
Miss either and you'll get a cryptic "could not create kernel" at runtime rather than at build. Here's the kernel itself. Note the signature: a color kernel takes a sample_t and returns a float4, and the file includes Core Image's Metal header.
// WarmFade.ci.metal
#include <CoreImage/CoreImage.h> // CIKernelMetalLib types: sample_t, destination, etc.
using namespace metal;

extern "C" {
    // A purely per-pixel "warm film" look. Runs as a CIColorKernel.
    //   s      : the input pixel (premultiplied, linear, extended range)
    //   center : image center in pixels, for the vignette
    //   radius : vignette falloff distance in pixels
    //   amount : overall strength, 0…1
    //   dest   : supplied by Core Image (not passed from Swift); this pixel's position
    [[ stitchable ]]
    float4 warmFade(coreimage::sample_t s, float2 center, float radius,
                    float amount, coreimage::destination dest) {
        // Core Image hands us PREMULTIPLIED color. Un-premultiply before we
        // touch the RGB, or dark/translucent edges will shift hue on us.
        float a = s.a;
        float3 c = (a > 0.0) ? s.rgb / a : s.rgb;

        // --- tone: lift shadows toward teal, push highlights toward amber ---
        float luma = dot(c, float3(0.299, 0.587, 0.114));
        float3 shadowTint = float3(0.00, 0.04, 0.06);    // teal
        float3 highlightTint = float3(0.06, 0.03, 0.00); // amber
        c += shadowTint * (1.0 - luma);
        c += highlightTint * luma;

        // --- gentle desaturation toward the same luma ---
        c = mix(float3(luma), c, 0.85);

        // --- vignette from the pixel's own position (dest.coord()) ---
        float2 p = dest.coord();
        float d = distance(p, center) / radius;          // 0 at center, 1 at `radius`
        float v = 1.0 - smoothstep(0.6, 1.25, d) * 0.55; // darken the corners
        c *= v;

        // Blend the whole effect by `amount`, clamp, re-premultiply on the way out.
        float3 base = (a > 0.0) ? s.rgb / a : s.rgb;
        c = mix(base, c, amount);
        c = clamp(c, 0.0, 1.0);
        return float4(c * a, a);
    }
}
The GPU runs pixels in lockstep groups; when threads in a group take different branches it has to execute both sides and mask. The two a > 0.0 guards above are uniform enough not to hurt, but resist the urge to add per-pixel if ladders. Prefer mix, step, smoothstep, and clamp — branchless math is what keeps a color kernel cheap.
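To make the "branchless math" advice concrete, here is a minimal CPU-side sketch in plain Swift. It is not Core Image API: the helpers are hand-written stand-ins for the Metal built-ins of the same name, and pickBranchy, pickBranchless, and vignetteFactor are hypothetical names I made up for illustration. It can double as a reference for unit-testing the kernel's arithmetic.

```swift
// CPU stand-ins for the Metal built-ins (assumed equivalents, scalar only).
func clamp(_ x: Float, _ lo: Float, _ hi: Float) -> Float {
    return min(max(x, lo), hi)
}

func mix(_ a: Float, _ b: Float, _ t: Float) -> Float {
    return a + (b - a) * t
}

func step(_ edge: Float, _ x: Float) -> Float {
    return x < edge ? 0 : 1
}

func smoothstep(_ e0: Float, _ e1: Float, _ x: Float) -> Float {
    let t = clamp((x - e0) / (e1 - e0), 0, 1)
    return t * t * (3 - 2 * t)
}

// The shape you want to avoid inside a kernel: a per-pixel branch.
func pickBranchy(_ a: Float, _ b: Float, x: Float, edge: Float) -> Float {
    if x < edge { return a } else { return b }
}

// The same selection expressed as arithmetic: step() yields a 0/1 mask
// and mix() uses it as the blend weight.
func pickBranchless(_ a: Float, _ b: Float, x: Float, edge: Float) -> Float {
    return mix(a, b, step(edge, x))
}

// Same formula as the kernel's vignette line: no branch, just a smooth ramp.
func vignetteFactor(_ d: Float) -> Float {
    return 1 - smoothstep(0.6, 1.25, d) * 0.55
}
```

On the CPU the two pick functions return the same values; the difference only matters on the GPU, where the arithmetic form gives every thread in a group the same instruction stream.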
The Swift wrapper
A kernel on its own isn't a filter. The idiomatic packaging is a CIFilter subclass that loads the kernel once from the Metal library and exposes typed inputs. Loading the .metallib is the part people get wrong: read the data from the bundle and hand it to CIColorKernel(functionName:fromMetalLibraryData:). Do it once, statically — never per frame.
import CoreImage

final class WarmFadeFilter: CIFilter {
    var inputImage: CIImage?
    var amount: Float = 1.0

    // Load the compiled kernel ONCE. Building a kernel per frame is the
    // classic way to tank your frame rate.
    private static let kernel: CIColorKernel = {
        guard let url = Bundle.main.url(forResource: "default",
                                        withExtension: "metallib"),
              let data = try? Data(contentsOf: url),
              let k = try? CIColorKernel(functionName: "warmFade",
                                         fromMetalLibraryData: data)
        else { fatalError("warmFade kernel failed to load: check the -cikernel flags") }
        return k
    }()

    override var outputImage: CIImage? {
        guard let input = inputImage else { return nil }
        let r = input.extent
        let center = CIVector(x: r.midX, y: r.midY)
        let radius = Float(max(r.width, r.height) * 0.65)

        // The kernel reads the input pixel-for-pixel, so the region of
        // interest is the identity: output extent == input extent.
        return Self.kernel.apply(
            extent: r,
            roiCallback: { _, rect in rect },
            arguments: [input, center, radius, amount]
        )
    }
}
The argument order in that array must match the kernel's parameter list exactly — Core Image binds them positionally, with no names to save you. The roiCallback returning rect unchanged is the "every output pixel needs exactly the matching input pixel" contract; for a blur you'd inflate that rect by the blur radius.
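The region-of-interest contract is easy to see in isolation. The sketch below uses only Foundation's CGRect; ROICallback is a local typealias I'm assuming has the same shape as Core Image's callback (input index and output rect in, required input rect out), and identityROI and inflatedROI are hypothetical names for illustration.

```swift
import Foundation  // CGRect, CGFloat

// Assumed to mirror the shape of Core Image's ROI callback:
// (input image index, output rect being rendered) -> input rect required.
typealias ROICallback = (Int32, CGRect) -> CGRect

// Color-kernel case: each output pixel needs exactly the matching input pixel.
let identityROI: ROICallback = { _, rect in rect }

// Neighbourhood case (e.g. a blur): each output pixel reads up to `radius`
// pixels away, so the required input rect grows by `radius` on every side.
func inflatedROI(radius: CGFloat) -> ROICallback {
    return { _, rect in rect.insetBy(dx: -radius, dy: -radius) }
}

let tile = CGRect(x: 10, y: 10, width: 100, height: 50)
let needsForColor = identityROI(0, tile)
let needsForBlur = inflatedROI(radius: 5)(0, tile)
```

Here needsForColor is the tile itself, while needsForBlur starts 5 pixels earlier and is 10 pixels larger in each dimension. Returning too small a rect is the usual cause of dark or garbage borders at tile seams.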
Feeding it the live camera
Now the real-time part. The camera path is AVCaptureVideoDataOutput, which hands you a CMSampleBuffer on a delegate callback for every frame. The whole game is to get from that buffer to the screen without the data ever touching the CPU. The moment you copy pixels to main memory and back, you've lost.
So: pull the CVPixelBuffer out of the sample buffer, wrap it as a CIImage (zero-copy — it's just a view onto GPU-resident memory), run the filter, and render straight into a Metal drawable with a CIContext you built once on a shared MTLDevice.
import AVFoundation
import CoreImage
import Metal

final class CameraFilterRenderer: NSObject,
                                  AVCaptureVideoDataOutputSampleBufferDelegate {
    private let device = MTLCreateSystemDefaultDevice()!
    private let commandQueue: MTLCommandQueue
    private let ciContext: CIContext // built ONCE, reused forever
    private let filter = WarmFadeFilter()
    let layer: CAMetalLayer

    override init() {
        commandQueue = device.makeCommandQueue()!
        layer = CAMetalLayer()
        layer.device = device
        layer.pixelFormat = .bgra8Unorm
        layer.framebufferOnly = false // CIContext needs to write to it

        // Metal-backed context. Pin the working space; disable the software
        // path. This object is expensive to create and cheap to reuse.
        ciContext = CIContext(mtlDevice: device, options: [
            .workingColorSpace: CGColorSpace(name: CGColorSpace.extendedLinearSRGB)!,
            .cacheIntermediates: false,
            .useSoftwareRenderer: false
        ])
        super.init()
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer),
              let drawable = layer.nextDrawable() else { return }

        // Zero-copy: wrap the GPU buffer as a CIImage, filter it.
        let source = CIImage(cvPixelBuffer: pixelBuffer)
        filter.inputImage = source
        guard let output = filter.outputImage else { return }

        // Render the filtered image straight into the drawable's texture.
        let buffer = commandQueue.makeCommandBuffer()!
        ciContext.render(output,
                         to: drawable.texture,
                         commandBuffer: buffer,
                         bounds: source.extent,
                         colorSpace: CGColorSpaceCreateDeviceRGB())
        buffer.present(drawable)
        buffer.commit()
    }
}
The color-management gotchas
This is where most custom filters look subtly wrong, and it's worth slowing down. Three things bite, in order of how often they got me:
Premultiplied alpha. Core Image works in premultiplied color — RGB already multiplied by alpha. If your effect manipulates RGB and any pixel is translucent (edges, composited overlays), you must un-premultiply, do your math, and re-premultiply, exactly as the kernel above does. Skip it and translucent regions shift hue. On an opaque camera frame alpha is 1 everywhere and it doesn't matter — until the day someone feeds your filter a PNG with soft edges, and then it does.
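A worked example makes the premultiplication problem visible. This is a plain-Swift sketch using the standard library's SIMD types, not Core Image; the pixel values, the tint, and the helper names are all made up for illustration.

```swift
typealias RGBA = SIMD4<Float>  // (r, g, b, a), premultiplied unless noted

// Recover straight (un-premultiplied) color, guarding against alpha == 0.
func unpremultiply(_ p: RGBA) -> SIMD3<Float> {
    let rgb = SIMD3<Float>(p.x, p.y, p.z)
    return p.w > 0 ? rgb / p.w : rgb
}

// Multiply straight color by alpha to get back to Core Image's convention.
func premultiply(_ rgb: SIMD3<Float>, alpha: Float) -> RGBA {
    let m = rgb * alpha
    return RGBA(m.x, m.y, m.z, alpha)
}

let tint = SIMD3<Float>(0, 0, 0.06)  // hypothetical blue lift

// A half-transparent mid-grey edge pixel: straight 0.2 grey at alpha 0.5.
let edge = premultiply(SIMD3<Float>(0.2, 0.2, 0.2), alpha: 0.5)

// Wrong: add the tint directly to the premultiplied channels.
let wrong = RGBA(edge.x + tint.x, edge.y + tint.y, edge.z + tint.z, edge.w)

// Right: un-premultiply, tint, re-premultiply.
let right = premultiply(unpremultiply(edge) + tint, alpha: edge.w)

// Compare the straight blue channel that each version actually represents.
let wrongBlue = unpremultiply(wrong).z
let rightBlue = unpremultiply(right).z
```

With these numbers the correct path moves blue from 0.20 to 0.26, while the shortcut moves it to 0.32: at alpha 0.5 the tint lands twice as hard, and the error grows as alpha shrinks. That is the hue shift you see on soft edges.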
The working color space. Core Image does its arithmetic in a working space, and for anything involving light — blending, blurring, averaging — that space should be linear, not gamma-encoded sRGB. Blend two colors in gamma space and the midpoint comes out too dark. I pin extendedLinearSRGB on the context so the math is physically correct and so I don't clip wide-gamut input.
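The "midpoint comes out too dark" claim can be checked with a few lines of arithmetic. This sketch uses the standard sRGB transfer functions and averages black and white two ways; the function names are mine, and it runs on the CPU purely for illustration.

```swift
import Foundation  // pow

// Standard sRGB decoding: encoded value -> linear light.
func srgbToLinear(_ v: Double) -> Double {
    return v <= 0.04045 ? v / 12.92 : pow((v + 0.055) / 1.055, 2.4)
}

// Standard sRGB encoding: linear light -> encoded value.
func linearToSrgb(_ v: Double) -> Double {
    return v <= 0.0031308 ? v * 12.92 : 1.055 * pow(v, 1 / 2.4) - 0.055
}

let black = 0.0, white = 1.0  // sRGB-encoded endpoints

// Averaging the encoded values directly (what a gamma working space does).
let gammaMid = (black + white) / 2

// Averaging in linear light, then encoding for display.
let linearMid = linearToSrgb((srgbToLinear(black) + srgbToLinear(white)) / 2)

// How much light the gamma-space midpoint really carries.
let gammaMidLight = srgbToLinear(gammaMid)
```

The gamma-space midpoint is encoded 0.5, which carries only about 21% of the light of white. The linear midpoint carries 50% and encodes to roughly 0.735, which is visibly brighter. Every blend, blur, and average in a gamma working space inherits that darkening.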
Extended range. Modern cameras deliver more than [0,1]. HDR highlights and wide-gamut (Display P3) content carry values above 1.0 and outside the sRGB cube. If you clamp(c, 0, 1) too eagerly you'll crush specular highlights into flat white. I clamp only the final output, after the look is applied — and on an HDR pipeline I'd reconsider even that.
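Here is the highlight-crushing effect in miniature, as a plain-Swift sketch with made-up sample values. The floor-only variant is just one option I'm assuming for an HDR pipeline; it is not what the kernel above does.

```swift
// Clamp to the standard [0, 1] range, as the kernel's final clamp does.
func clamp01(_ v: Float) -> Float {
    return min(max(v, 0), 1)
}

// Hypothetical linear values: one in range, two extended-range highlights.
let highlights: [Float] = [0.9, 1.4, 2.0]

// Full clamp: both highlights collapse to the same flat white.
let clamped = highlights.map(clamp01)

// Floor-only (an assumption for HDR output): drop negatives, keep headroom.
let floorOnly = highlights.map { max($0, 0) }
```

After the full clamp, 1.4 and 2.0 are indistinguishable, so any gradient inside a specular highlight is gone. Flooring at zero keeps that detail for a display or encoder that can use it.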
Performance discipline
A custom kernel is fast by default; you make it slow by accident. The rules that keep mine at 60 fps are boring and non-negotiable. Reuse one CIContext — it caches the compiled pipeline state, and a fresh one per frame re-pays that cost every time. Build the kernel once, statically. Never round-trip to the CPU: no UIImage in the hot path, no createCGImage per frame, no reading pixels back. Render through a Metal-backed context into a CAMetalLayer drawable so the result stays on the GPU all the way to the display. And keep the kernel itself lean — fuse your math into one color kernel instead of chaining filters, and stay branch-light.
Two more from shipping it. Set cacheIntermediates: false on a video context — with a different image every frame, caching intermediates just churns memory for nothing. And watch nextDrawable(): it can return nil or block if you're holding drawables too long, so do your work and present promptly, every frame.
That's the whole arc: exhaust the stock filters, then write one small Metal function, wrap it in a CIFilter subclass that loads the compiled kernel once, and drive it from AVCaptureVideoDataOutput through a reused Metal-backed CIContext with the data never leaving the GPU. The intimidating part was always the word "kernel," and it turns out to mean a dozen lines of arithmetic on a single pixel. Accept that you're writing a tiny GPU program — mind premultiplied alpha, a linear working space, and your frame budget — and the catalogue stops being a wall. It becomes a starting point.