how to run llama.cpp in smartphones


i love llamam.cpp

i am all in for local llms and i think they are the true future of ai for most of us common folks , maybe enterprise hosting will continue but for most daily tasks local models are great so to run them u need beefy computers with huge ram right ?

well No exactly , my compuer is an intel alderlake 11th gen i5 no dedicated gpu theres integrated graphics tho( whihc is pretty usesless(foreshadowing))

moving away from ollama

ollama is this cool wrapper around llama but i dont really like it i moved away from it because choices with llama.cpp you have more configuration you can bring any model from hugging face and has better speeds i think i didnt want the wrapper i wanted the core so i made the choice of switching

main reason being i wanted to run uncensored models , i think censorship is ass and useless like why even bother a kid is probably not gonna install llama.cpp and download things ( i might be wrong ) and even if they do what is this gatekeeping of information and i hate it so yes hauhauCS whoever you are thank u for making great uncensored models

enough yap lets get into action

so first things first we need to decide what device we are working with since the title of this blog is all about running it in smartphones smartphones it is

so ill give u a heads up on the speed my phone is 8gb ram running a snapdragon 7s gen 3 just a mid range processor but it has a gpu aderno or smthn which is good

so i compiled llama cpp for my laptop once and got around 7tps at starting averaging around 6-5tps in phone its the same i was super surprised like boy tps is around the same as a beefy lap plugged into supply isnt that crazy and also i compiled for cpu only since i neglected existence of my iris xe igpu but that was a good thing i tried to compile for it support then results where underwhelming iris xe actually performs way worse than cpu all by itself due to overhead and not having dedicated vram

so we are working with phones we are gonna be compiling and all so we need a terminal

termux enters the chat

termux is a linux emulator inside android to get it go to some oneline apk store and install it make sure its safe my recommendation is to go download fdroid then download termux from there

once termux is setup great now you have a terminal in android capable of doing everything from compiling c code to running llama lets go

install the packages

you could try to clone llama.cpp from github into the termux and compile the whole thing make sure u dont do too much -t cause too much tasks can overwhelm the system and lead to android killing it (signal 9) so try to compile it in 2 tasks

better than compiling is to just get the packages way faster and saves you lots of time

  pkg install llama-cpp 

this gives you access to commands like llama-cli and llama-server directly and u wont have to cd to some bin folder to run those which is nice

there are 2 backend options available that is vulkan and opencl i have tried with vulkan and i was getting errors so opencl worked for me ill be just talking about it here

  pkg install llama-cpp-opencl-backend
  pkg install opencl-vendor-driver
  pkg install ocl-icd

now we have all the packages ready for running llama-cpp make sure you dont have both backends tho i had issues for so long cause there were both vulkan backend and opencl in my system in that case i just removed vulkan packages toolkits etc from termux if you are following along from the start dont worry you never installed vulkan packages

telling dynamic linkers to look for right .so files

 export LD_LIBRARY_PATH="$TERMUX__PREFIX/opt/vendor/lib"
 mkdir -p "$LD_LIBRARY_PATH"
 cp "/system/vendor/lib64/libOpenCL.so" "$LD_LIBRARY_PATH"
 cp "/system/vendor/lib64/libOpenCL_adreno.so" "$LD_LIBRARY_PATH" 

most termux systems doesnt know where these linker files are and installing it and compiling it inside termux is also difficult so we are just gonna copy the files from android to termux accessible partition

this is a one time job make sure you export LD_LIBRARY_PATH in bashrc or smthn

just run these commands and you are good to go

Downloading models

i find it easier to install models with wget or curl

  pkg install wget

now go to hugging face and select model you want click the correct quantization u need then copy the link of the download button

paste it next wget and run

  wget -O <file name>  <downlaod link>
  

give any filename tbh whatever you like with the right extension model.gguf is what i used lol

i tried with qwen 3.6 1.5b Q4_KM and gemma 4 E2B Q3 both were givign a respectable 7tps

running

now that we have a model installed we can finally run it

  llama-server -m <path to model file> -c <context size> -ngl <gpu offloading layer count> -t < no of tasks>

after testing out everytign ngl 90 with t 2 worked the best for me for inference and everything i still keep the context size around 1024 or 2056 cause models can eat up your ram and 8gb aint much your phone will feel sluggish af

write an alias for this in bashrc cause why bother typing this thing every time

seeeing it in browser

llama server has a nice web ui thats why i love it so go to localhost:8080 in your browser and enjoy your local llm

your phone is with you 90% of the time in case of emergency these tools can give good info to some extend if it doesnt hallucinate like crazy even when theres no internet

i am still experimenting with this

so i am still trying out various configs to increase my tps i just want fast realiable ai but idk bro when will we get there and if u have a faster chip defo it will do wonders