# Run Large Language Models on Apple Silicon with Core ML
CoreMLPipelines is an experimental Swift library for running pretrained Core ML models across a variety of AI tasks, providing high-performance inference with minimal memory usage on Apple Silicon devices.
## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Supported Models](#supported-models)
- [Architecture](#architecture)
- [CLI Usage](#cli-usage)
- [Model Conversion](#model-conversion)
- [Requirements](#requirements)
- [Contributing](#contributing)
- [License](#license)
## Features

- 🚀 High Performance: Optimized Core ML inference on Apple Silicon
- 💾 Memory Efficient: 4-bit and 8-bit quantization support
- 🔄 Streaming: Real-time text generation with async streams
- 🛠️ CLI Tools: Command-line interface for text generation and chat
- 🔧 Model Conversion: Python tools to convert Hugging Face models to Core ML
- 📱 Cross-Platform: Supports iOS 18+ and macOS 15+
Supported tasks:

- Text Generation: Generate text using causal language models
- Chat: Interactive conversational AI
## Installation

Add CoreMLPipelines to your Xcode project via File > Add Package Dependencies, or declare it as a Swift Package Manager dependency (see the manifest sketch below).
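If you manage dependencies in a `Package.swift` manifest, a minimal setup might look like the following. This is a sketch: the `branch: "main"` requirement and the product name are assumptions, so check the repository for tagged releases and the actual product name.

```swift
// swift-tools-version:6.0
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.macOS(.v15), .iOS(.v18)],
    dependencies: [
        // Branch dependency is an assumption; pin a tagged release if one exists.
        .package(url: "https://github.com/pywind/CoreMLPipelines.git", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                // Product name assumed to match the package name.
                .product(name: "CoreMLPipelines", package: "CoreMLPipelines")
            ]
        )
    ]
)
```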
Or clone and build locally:

```bash
git clone https://github.com/pywind/CoreMLPipelines.git
cd CoreMLPipelines
swift build -c release
cp .build/release/coremlpipelines-cli /usr/local/bin/
```

The Python conversion tools require Python 3.11+ and can be installed using uv:

```bash
cd coremlpipelinestools
uv sync
```

If you want to upload models to your Hugging Face Hub account, create a `.env` file containing:

```
HF_TOKEN=hf_your_token
```

Please read the coreml tools README for more details.
## Quick Start

### Basic usage

```swift
import CoreMLPipelines

// Create a text generation pipeline
let pipeline = try await TextGenerationPipeline(model: .llama_3_2_1B_Instruct_4bit)

// Generate text with streaming
let stream = pipeline(
    messages: [[
        "role": "user",
        "content": "Write a haiku about programming"
    ]]
)

for try await text in stream {
    print(text, terminator: "")
}
```

### Generation parameters

```swift
import CoreMLPipelines

let pipeline = try await TextGenerationPipeline(model: .qwen2_5_0_5B_Instruct_4bit)

// Configure generation parameters
let config = GenerationConfig(
    maxNewTokens: 100,
    temperature: 0.7,
    topP: 0.9,
    repetitionPenalty: 1.1
)

let stream = pipeline(
    messages: [["role": "user", "content": "Explain quantum computing simply"]],
    generationConfig: config
)

var fullResponse = ""
for try await text in stream {
    fullResponse += text
    print(text, terminator: "")
}
```

### Custom models

```swift
import CoreMLPipelines

// Use any Hugging Face model (must be converted to Core ML first)
let pipeline = try await TextGenerationPipeline(
    model: "your-username/your-coreml-model"
)
```

## Supported Models

CoreMLPipelines supports various quantized language models optimized for Apple Silicon:
- `llama_3_2_1B_Instruct_4bit` - Meta's Llama 3.2 1B parameter model (4-bit quantized)
- `qwen2_5_0_5B_Instruct_4bit` - Alibaba's Qwen2.5 0.5B model (4-bit quantized)
- `qwen2_5_Coder_0_5B_Instruct_4bit` - Qwen2.5 Coder 0.5B for code generation (4-bit quantized)
- `smolLM2_135M_Instruct_4bit` - SmolLM2 135M model (4-bit quantized)
- `smolLM2_135M_Instruct_8bit` - SmolLM2 135M model (8-bit quantized)
- `lgai_exaone_4_0_1_2B_4bit` - LG AI's EXAONE 4.0 1.2B model (4-bit quantized)
Note: Model weights, tokenizers, and chat templates are automatically downloaded from Hugging Face on first use, so make sure you have a stable internet connection.
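With several model sizes available, an app might pick a case based on the device's available memory. The helper below is purely illustrative: the `TextGenerationPipeline.Model` type name and the 8 GB threshold are assumptions for the sake of the sketch, not part of the documented API.

```swift
import CoreMLPipelines
import Foundation

// Illustrative helper (not part of the library): choose a smaller model
// on memory-constrained devices. The enum type name and the 8 GB cutoff
// are assumptions.
func defaultModel() -> TextGenerationPipeline.Model {
    let physicalGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
    return physicalGB >= 8
        ? .llama_3_2_1B_Instruct_4bit
        : .smolLM2_135M_Instruct_4bit
}
```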
## Architecture

```
CoreMLPipelines/
├── Models/        # Model definitions and configurations
├── Pipelines/     # Pipeline implementations
│   ├── TextGenerationPipeline.swift
│   └── TextGenerationPipeline+Models.swift
├── Samplers/      # Token sampling strategies
│   ├── GreedySampler.swift
│   └── Sampler.swift
└── Extensions/    # Core ML tensor utilities
```

- Unified API: Consistent interface across different model architectures
- Memory Management: Efficient memory usage with Core ML's MLModel
- Async/Await: Modern Swift concurrency support
- Streaming: Real-time token generation with AsyncSequence (see the sketch after this list)
- Type Safety: Strong typing with Swift's type system
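Because generation is exposed as an AsyncSequence, it composes naturally with Swift's structured concurrency. The sketch below, built on the API shown in Quick Start, wraps stream consumption in a `Task` so a caller can stop reading tokens early; whether cancellation also halts the underlying Core ML inference depends on the pipeline implementation.

```swift
import CoreMLPipelines

// Usage sketch based on the streaming API from Quick Start.
let pipeline = try await TextGenerationPipeline(model: .smolLM2_135M_Instruct_4bit)

let generation = Task {
    let stream = pipeline(messages: [["role": "user", "content": "Tell me a short story"]])
    for try await token in stream {
        print(token, terminator: "")
    }
}

// generation.cancel() // stop consuming tokens early if needed
try await generation.value
```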
## CLI Usage

The command-line interface provides convenient tools for testing and development.
### Generate text

```bash
coremlpipelines-cli generate-text --model finnvoorhees/coreml-Llama-3.2-1B-Instruct-4bit "Hello, world!" --max-new-tokens 50
```

Options:

- `--model <model>`: Hugging Face model repository ID
- `--max-new-tokens <int>`: Maximum number of tokens to generate (default: 100)
- `<prompt>`: Text prompt (default: "Hello")
### Chat

```bash
coremlpipelines-cli chat --model finnvoorhees/coreml-Llama-3.2-1B-Instruct-4bit
```

Starts an interactive chat session with the specified model.
### Profile

```bash
coremlpipelines-cli profile --model finnvoorhees/coreml-Llama-3.2-1B-Instruct-4bit
```

Profiles model performance and memory usage.
## Model Conversion

Convert Hugging Face models to Core ML format using the Python tools:
```bash
cd coremlpipelinestools
uv run convert_causal_llm.py --model microsoft/DialoGPT-medium --quantize --half --compile --context-size 512 --batch-size 1
```

Key Options:

- `--model`: Hugging Face model ID
- `--quantize`: Apply 4-bit linear quantization
- `--half`: Load model in float16 precision
- `--compile`: Save as optimized .mlmodelc format
- `--context-size`: Maximum context length
- `--batch-size`: Batch size for inference
- `--upload`: Upload converted model to Hugging Face
```bash
# Convert Llama model with 4-bit quantization
uv run convert_causal_llm.py \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --quantize \
    --half \
    --compile \
    --context-size 2048 \
    --batch-size 1 \
    --upload

# Convert SmolLM model with 8-bit quantization
uv run convert_causal_llm.py \
    --model HuggingFaceTB/SmolLM2-135M-Instruct \
    --half \
    --compile \
    --context-size 1024 \
    --batch-size 1
```

## Requirements

- macOS: 15.0 or later
- iOS: 18.0 or later
- Xcode: 16.0 or later
- Swift: 6.0 or later
For the Python conversion tools:

- Python 3.11+
- uv package manager
- Core ML Tools
- Transformers library
## Contributing

We welcome contributions! Please see our Contributing Guide for details.
1. Clone the repository:

   ```bash
   git clone https://github.com/pywind/CoreMLPipelines.git
   cd CoreMLPipelines
   ```

2. Open in Xcode:

   ```bash
   open Package.swift
   ```

3. Run tests:

   ```bash
   swift test
   ```

4. Build the CLI tool:

   ```bash
   swift build -c release --product coremlpipelines-cli
   ```
This project follows Swift's official style guidelines. Use swiftformat to format code:
```bash
swiftformat .
```

## License

This project is licensed under the CC0 1.0 Universal License - see the LICENSE file for details.
This project is based on finnvoor's original repository. Consider becoming a sponsor of the original author at https://github.com/finnvoor.
