kinfey
Previously, I shared how to use Phi-3-mini on an AI PC's NPU and on iPhone. Some readers asked about the experience on macOS and how to use Apple Silicon to accelerate SLMs. This post covers that ground: how to use the Apple MLX Framework to run Phi-3-mini faster, how to fine-tune it, and how to combine it with llama.cpp to run a quantized model.
What is the MLX Framework
MLX is an array framework for machine learning research on Apple silicon, brought to you by Apple machine learning research.
MLX is designed by machine learning researchers for machine learning researchers. The framework is intended to be user-friendly, but still efficient to train and deploy models. The design of the framework itself is also conceptually simple. We intend to make it easy for researchers to extend and improve MLX with the goal of quickly exploring new ideas.
MLX accelerates LLMs on Apple Silicon devices, which makes it convenient to run models locally.
Installation
Installing MLX is easy: you need Python 3.11 or later, then install the mlx-lm package from the terminal:
Code:
pip install mlx-lm
Running MLX commands
1. Running Phi-3-mini in Terminal with MLX
Code:
python -m mlx_lm.generate --model microsoft/Phi-3-mini-4k-instruct --max-tokens 2048 --prompt "<|user|>\nCan you introduce yourself<|end|>\n<|assistant|>"
2. Quantizing Phi-3-mini with MLX in Terminal
Code:
python -m mlx_lm.convert --hf-path microsoft/Phi-3-mini-4k-instruct -q
3. Running Phi-3-mini with MLX in Jupyter Notebook
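The notebook itself is not reproduced here. As a minimal sketch, assuming the `load` and `generate` helpers exposed by the mlx-lm Python package, a notebook cell could look like the following. The function names `build_phi3_prompt` and `run_phi3` are hypothetical helpers introduced for illustration, not part of MLX.

```python
def build_phi3_prompt(user_message: str) -> str:
    """Wrap a user message in the Phi-3 chat template used in this post."""
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>"


def run_phi3(user_message: str,
             model_id: str = "microsoft/Phi-3-mini-4k-instruct",
             max_tokens: int = 2048) -> str:
    """Load the model with mlx-lm and generate a reply (needs Apple Silicon)."""
    # Imported lazily so the prompt helper above works without mlx-lm installed.
    from mlx_lm import load, generate

    model, tokenizer = load(model_id)
    return generate(model, tokenizer,
                    prompt=build_phi3_prompt(user_message),
                    max_tokens=max_tokens)


# In a notebook cell (downloads the model on first use):
# print(run_phi3("Can you introduce yourself"))
```

Keeping the prompt construction in its own function makes it easy to check that the notebook sends the same template as the terminal command.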
Note: Please read "Inference Phi-3 with Apple MLX Framework" to learn more.
Fine-tuning with MLX Framework
Model training and fine-tuning usually need GPU acceleration. On Apple devices, you can use Apple silicon's Metal Performance Shaders (MPS), which run on the built-in GPU, in place of a discrete GPU to complete training and fine-tuning.
What are Metal Performance Shaders
The Metal Performance Shaders framework contains a collection of highly optimized compute and graphics shaders that are designed to integrate easily and efficiently into your Metal app. These data-parallel primitives are specially tuned to take advantage of the unique hardware characteristics of each GPU family to ensure optimal performance.
Sample - Using LoRA to fine-tune Phi-3-mini with MLX
1. Data preparation
By default, the MLX Framework expects the train, test, and evaluation sets in JSONL format, and uses them together with LoRA to complete fine-tuning jobs.
Note:
- JSONL data format:
Code:
{"text": "<|user|>\nWhen were iron maidens commonly used? <|end|>\n<|assistant|> \nIron maidens were never commonly used <|end|>"}
{"text": "<|user|>\nWhat did humans evolve from? <|end|>\n<|assistant|> \nHumans and apes evolved from a common ancestor <|end|>"}
{"text": "<|user|>\nIs 91 a prime number? <|end|>\n<|assistant|> \nNo, 91 is not a prime number <|end|>"}
....
Our example uses TruthfulQA's data, but the dataset is relatively small, so the fine-tuning results are not necessarily the best. Learners are encouraged to use better data from their own scenarios.
The data format follows the Phi-3 chat template.
Please download the data from this link, and include all .jsonl files in the data folder.
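If you prepare your own data instead, the format above can be produced with a short script. This is a minimal sketch: the file names `train.jsonl`, `valid.jsonl`, and `test.jsonl` are an assumption based on what the mlx-lm LoRA tooling looks for in the data folder, and the split fractions and helper names are hypothetical.

```python
import json
from pathlib import Path


def to_phi3_record(question: str, answer: str) -> dict:
    """Format one question/answer pair with the Phi-3 template shown above."""
    text = f"<|user|>\n{question} <|end|>\n<|assistant|> \n{answer} <|end|>"
    return {"text": text}


def write_jsonl(path, pairs) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            f.write(json.dumps(to_phi3_record(question, answer),
                               ensure_ascii=False) + "\n")


def write_splits(pairs, out_dir, valid_fraction=0.1, test_fraction=0.1) -> dict:
    """Split pairs by position and write the three files; return row counts."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    n_valid = int(len(pairs) * valid_fraction)
    n_test = int(len(pairs) * test_fraction)
    splits = {
        "valid.jsonl": pairs[:n_valid],
        "test.jsonl": pairs[n_valid:n_valid + n_test],
        "train.jsonl": pairs[n_valid + n_test:],
    }
    for name, rows in splits.items():
        write_jsonl(out / name, rows)
    return {name: len(rows) for name, rows in splits.items()}


# Usage: write_splits([("Is 91 a prime number?", "No, 91 is not a prime number")], "./data")
```

Shuffle the pairs before splitting if the source data is ordered by topic, so that each file sees a similar mix.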
2. Fine-tuning in your terminal
Run this command in the terminal:
Code:
python -m mlx_lm.lora --model microsoft/Phi-3-mini-4k-instruct --train --data ./data --iters 1000
Note: This is LoRA fine-tuning; at the time of writing, the MLX framework had not published QLoRA.
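For readers new to LoRA, the idea is that the pretrained weight matrix W stays frozen and only two small matrices A and B are trained, so the effective weight becomes W + (alpha / r) * B A. The sketch below illustrates that general formulation in plain Python; it is not mlx-lm's internal code, and the matrix sizes, alpha, and trained values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the LoRA rank."""
    rank = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, rank):
    """Count the parameters in the two LoRA factors for one weight matrix."""
    return rank * (d_in + d_out)


# Tiny example: a frozen 2x3 weight with rank-1 adapters.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
A = [[0.5, 0.5, 0.5]]        # shape (1, 3), trained
B_init = [[0.0], [0.0]]      # shape (2, 1), starts at zero
B_trained = [[1.0], [-1.0]]  # hypothetical values after training

# With B at zero the adapter changes nothing, so training starts from W.
print(lora_effective_weight(W, A, B_init, alpha=2.0))
print(lora_effective_weight(W, A, B_trained, alpha=2.0))
# For an illustrative 3072 x 3072 matrix and rank 8:
print(trainable_params(3072, 3072, 8), "vs", 3072 * 3072)
```

The parameter count is the reason LoRA fits on a laptop: only the small factors receive gradients, while the full matrix is left untouched.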
3. Run the fine-tuned adapter to test
You can run the fine-tuned adapter in the terminal, like this:
Code:
python -m mlx_lm.generate --model microsoft/Phi-3-mini-4k-instruct --adapter-path ./adapters --max-tokens 2048 --prompt "Why do chameleons change colors? " --eos-token "<|end|>"
Then run the original model to compare the result:
Code:
python -m mlx_lm.generate --model microsoft/Phi-3-mini-4k-instruct --max-tokens 2048 --prompt "Why do chameleons change colors? " --eos-token "<|end|>"
Compare the output of the fine-tuned adapter with that of the original model.
4. Merge adapters to generate new models
Code:
python -m mlx_lm.fuse --model microsoft/Phi-3-mini-4k-instruct
5. Running quantized fine-tuned models using Ollama
Before use, configure your llama.cpp environment:
Code:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
python convert.py 'Your merged model path' --outfile phi-3-mini-ft.gguf --outtype f16
Note:
The conversion currently supports fp32, fp16, and INT8 output.
The merged model is missing tokenizer.model; please download it from microsoft/Phi-3-mini-4k-instruct on Hugging Face.
Set up the Ollama Modelfile (if you have not installed Ollama, please read [Ollama QuickStart](../02.QuickStart/Ollama_QuickStart.md)):
Code:
FROM ./phi-3-mini-ft.gguf
PARAMETER stop "<|end|>"
Run these commands in the terminal:
Code:
ollama create phi3ft -f Modelfile
ollama run phi3ft "Why do chameleons change colors?"
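Once the model has been created, it can also be queried from Python rather than the terminal. The sketch below assumes a local Ollama server on its default port (11434) and its `/api/generate` REST endpoint; `build_request` and `ask` are hypothetical helper names, and `phi3ft` is the model name created above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address


def build_request(prompt: str, model: str = "phi3ft") -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "phi3ft") -> str:
    """Send the prompt and return the generated text (server must be running)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Usage, with Ollama running locally:
# print(ask("Why do chameleons change colors?"))
```

Setting `stream` to false returns one JSON object instead of a stream of chunks, which keeps the client code short.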
Note: Please read "Fine-tuning Phi-3 with Apple MLX Framework" to learn more.
Resources
- Phi-3 CookBook: microsoft/Phi-3CookBook on GitHub, a book for getting started with Phi-3, the family of open small language models (SLMs) developed by Microsoft
- MLX framework repo: ml-explore
- Learn more about the MLX Framework: https://ml-explore.github.io/mlx/
- Hugging Face Phi-3 family: Phi-3, a Microsoft collection