
Conversation

@ngxson (Collaborator) commented Dec 6, 2025

Ref: #17618

Fix: #11202

We are moving to a new CLI experience, with the main code built on top of llama-server. This brings many additional features into llama-cli, making the experience feel mostly like a smaller version of the web UI:

  • Multimodal support
  • Regenerate last message
  • Speculative decoding
  • Full Jinja support (including some edge cases that the old llama-cli doesn't support)

TODO:

Features planned for next versions:

  • Allow hiding/showing timings and reasoning content (saving user preferences to disk and reusing them later)
  • Allow exporting/importing conversations
  • Support raw completion
  • Support remote media URLs (downloaded from the internet)
  • Show progress (as a percentage) if prompt processing takes too long

TODO for console system:

  • Auto-completion for commands and file paths
  • Support "temporary display", for example clear loading messages when it's done

The github-actions bot added the script, testing, examples, devops, and server labels on Dec 6, 2025
@ngxson marked this pull request as ready for review on December 9, 2025, 14:56
@ngxson requested a review from CISC as a code owner on December 9, 2025, 14:56
@ngxson (Collaborator, Author) commented Dec 9, 2025

The PR should be ready for review now!

@bandoti (Collaborator) commented Dec 9, 2025

I recommend using the console::write abstraction instead of LOG. The reasons are:

  1. When logging is disabled the output will still work. There is no reason the app should fail when logging is disabled.
  2. It is cleaner to write to the output using template functions and specializations, for example writing a std::string directly (see the sketch below).
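
For illustration, a minimal sketch of what such a console::write abstraction could look like; the names and signatures here are hypothetical, not the actual llama.cpp API:

    #include <iostream>
    #include <string>

    namespace console {
        // primary template: anything std::ostream knows how to print
        template <typename T>
        void write(const T & value) {
            std::cout << value;
        }

        // specialization for std::string; a natural place to hook in
        // per-type behavior (colors, escaping, ...) if ever needed
        template <>
        inline void write<std::string>(const std::string & value) {
            std::cout << value;
        }
    }

    int main() {
        std::string msg = "hello from console::write\n";
        console::write(msg);   // no .c_str() needed
        console::write(42);    // works for any streamable type
        console::write('\n');
    }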

@ngxson (Collaborator, Author) commented Dec 9, 2025

@bandoti yes, that's a good suggestion. I thought about it, but I think it can be a bit tricky when logging errors, as we may still be using LOG_ERR.

However, since llama-server has now become the engine behind llama-cli, it could be cleaner to draw a line between:

  • llama-server exclusively uses log.h
  • llama-cli exclusively uses console.h

The benefit would be that --log-file then only logs server-related lines to the file, while the main CLI still functions as normal.

In addition to write, console.cpp will also have write_err to log errors with the correct color.
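
A rough sketch of the proposed split (the function bodies and the ANSI color handling are assumptions, not the merged code): llama-cli would print only through console::write / console::write_err, while llama-server keeps using the LOG_* macros from log.h.

    #include <cstdio>
    #include <string>

    namespace console {
        // regular CLI output goes straight to stdout
        inline void write(const std::string & text) {
            fputs(text.c_str(), stdout);
        }

        // errors go to stderr, in red when the terminal supports ANSI escapes
        inline void write_err(const std::string & text) {
            fprintf(stderr, "\033[31m%s\033[0m", text.c_str());
        }
    }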

Tagging @ggerganov to check if you are OK with this.

@bandoti (Collaborator) commented Dec 9, 2025

@ngxson I like the idea of separation. Though, I'm wondering if adding a log source instead would make sense, so filtering could be done per component. That might be a bit more complex and reach into other aspects of the logging code, but it would allow other consumers of the server abstraction to log as well, and it could also let backends log based on their source.

If we do go with console, I would consider console::log_<level>, using lowercase for <level> to match the LOG_* macros.

@ggerganov (Member) commented Dec 10, 2025

llama-cli exclusively uses console.h

The main concern about doing this is that printing text to stdout/stderr in the hot generation loop can affect performance. It is better to do the printing asynchronously.

You can do that by creating a separate common_log instance for the console and using it for all prints done by the console. The disadvantage would be that the generation output (i.e. from the console) would not be coordinated with the internal server logs, so in some cases it might be difficult to debug, i.e. the debug logs can appear after the generated tokens to which they correspond. Not sure if this is a big issue; just pointing it out, it could be ok.
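
For illustration, a minimal sketch of the "separate common_log instance" idea, assuming the common_log_init / common_log_add / common_log_free helpers from common/log.h keep their current shape; treat the exact names and the use of GGML_LOG_LEVEL_NONE as assumptions.

    #include "log.h" // common_log_init, common_log_add, common_log_free

    // dedicated instance: console output is queued on the logger's worker
    // thread instead of blocking the caller in the hot generation loop
    static common_log * g_console_log = nullptr;

    void console_output_init() {
        g_console_log = common_log_init(); // separate from common_log_main()
    }

    void console_output(const char * text) {
        // GGML_LOG_LEVEL_NONE keeps the line free of level prefixes
        common_log_add(g_console_log, GGML_LOG_LEVEL_NONE, "%s", text);
    }

    void console_output_free() {
        common_log_free(g_console_log);
        g_console_log = nullptr;
    }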

@ngxson (Collaborator, Author) commented Dec 10, 2025

Forgot to mention that the generation is now running on server_context's thread, so I think printing directly to the console from the CLI main thread could be OK now(?)

The main motivation of my proposal was to redirect the server's internal logs somewhere else, so that when debugging, they don't clog up the console output.

@ggerganov (Member)

Forgot to mention that the generation is now running on server_context's thread, so I think printing directly to console from CLI main thread could be ok now (?)

Good point - I had missed that. As long as the thread that prints does not make calls to libllama, it should be good.

console::log("%s", diff.reasoning_content_delta.c_str());
console::flush();
}
fflush(stdout);
Member:

I think this flush is not needed.

// save choice to use color for later
// (note for later: this is a slightly awkward choice)
console::init(params.simple_io, params.use_color);
atexit([]() { console::cleanup(); });
Member:

Better make the console a singleton and avoid using atexit(). If you want, leave a TODO for now.
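
A minimal sketch of the singleton idea (illustrative only, not the PR's implementation): a Meyers singleton whose destructor restores the terminal state, so no atexit() hook is needed.

    namespace console {
        class state {
          public:
            static state & instance() {
                static state s; // constructed on first use, destroyed at program exit
                return s;
            }
            void init(bool simple_io, bool use_color) {
                (void) simple_io;
                (void) use_color;
                // set up the terminal (what console::init does today)
            }
          private:
            state() = default;
            ~state() {
                // restore the terminal (what console::cleanup does today)
            }
            state(const state &) = delete;
            state & operator=(const state &) = delete;
        };
    }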

@ngxson (Collaborator, Author) commented Dec 10, 2025

I implemented console::log and console::error to make the syntax a bit JavaScript-like. They log directly to stdout using vprintf. The CLI is now more or less another "web UI" from server_context's perspective.

This system won't allow the console output to be logged to a file, but I'm not sure that's useful anymore. The more detailed server log can always be obtained using a combination of -v and --log-file.
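
The approximate shape of that console::log / console::error pair, as described above (a sketch, not the exact code from the PR): both are thin wrappers over vprintf / vfprintf.

    #include <cstdarg>
    #include <cstdio>

    namespace console {
        void log(const char * fmt, ...) {
            va_list args;
            va_start(args, fmt);
            vprintf(fmt, args); // plain output to stdout
            va_end(args);
        }

        void error(const char * fmt, ...) {
            va_list args;
            va_start(args, fmt);
            vfprintf(stderr, fmt, args); // errors to stderr; color handling omitted here
            va_end(args);
        }
    }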

Comment on lines 27 to 28
void log(const char * fmt, ...);
void error(const char * fmt, ...);
Member:

With variadic argument formatting, it's always useful to have the format attribute in order to generate compile-time warnings. See for example:

llama.cpp/common/log.h

Lines 15 to 23 in a7a3fbe

#ifndef __GNUC__
# define LOG_ATTRIBUTE_FORMAT(...)
#elif defined(__MINGW32__) && !defined(__clang__)
# define LOG_ATTRIBUTE_FORMAT(...) __attribute__((format(gnu_printf, __VA_ARGS__)))
#else
# define LOG_ATTRIBUTE_FORMAT(...) __attribute__((format(printf, __VA_ARGS__)))
#endif

Collaborator Author:

I used LLAMA_COMMON_ATTRIBUTE_FORMAT from common.h instead, so that I don't need to include log.h inside console.h: e135b41
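
A sketch of how the attribute ends up attached to the declarations (illustrative; the actual console.h from that commit may differ slightly). LLAMA_COMMON_ATTRIBUTE_FORMAT comes from common.h; a fallback definition is included here only so the snippet stands alone.

    #ifndef LLAMA_COMMON_ATTRIBUTE_FORMAT
    #   ifdef __GNUC__
    #       define LLAMA_COMMON_ATTRIBUTE_FORMAT(...) __attribute__((format(printf, __VA_ARGS__)))
    #   else
    #       define LLAMA_COMMON_ATTRIBUTE_FORMAT(...)
    #   endif
    #endif

    namespace console {
        // argument 1 is the format string, the variadic arguments start at position 2
        LLAMA_COMMON_ATTRIBUTE_FORMAT(1, 2)
        void log(const char * fmt, ...);

        LLAMA_COMMON_ATTRIBUTE_FORMAT(1, 2)
        void error(const char * fmt, ...);
    }

With this in place, a mismatched call such as console::log("%d", "text") produces a -Wformat warning on GCC/Clang (with -Wall or -Wformat enabled).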

@bandoti (Collaborator) commented Dec 10, 2025

I would like to have overloads like this for simplicity:

    inline void log(const char * data) {
        log("%s", data);
    }

    inline void log(const std::string & data) {
        log("%s", data.c_str());
    }

There were too many places using LOG("%s", foo.c_str()); when we could simply write console::log(foo);. This might need variadic templates to prevent ambiguity (see the sketch below). I also ran into this scenario in the reasoning PR #16603.
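
One way the ambiguity concern could be handled (illustrative, not code from this PR): keep dedicated overloads for plain strings and route the printf-style path through a variadic template, so it is only selected when format arguments are actually present.

    #include <cstdio>
    #include <string>

    namespace console {
        // plain text is printed verbatim, with no format-string interpretation
        inline void log(const char * text)        { fputs(text, stdout); }
        inline void log(const std::string & text) { fputs(text.c_str(), stdout); }

        // printf-style variant; chosen only when extra arguments are supplied
        template <typename... Args>
        void log(const char * fmt, Args... args) {
            printf(fmt, args...);
        }
    }

    int main() {
        std::string name = "llama-cli";
        console::log(name);                  // std::string overload
        console::log("plain C string\n");    // const char * overload (non-template wins)
        console::log("formatted: %d\n", 42); // variadic template
    }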

Collaborator Author:

I think it's fine to leave it like this because it guarantees that our version is always compatible with C-style printf() without any conversions.

There are not many places where we have this pattern, so it's probably fine for now. We can find a way to improve it in the future.

Collaborator Author:

Besides, there is also a risk with LOG(str). If someone accidentally calls it with str being a const char *, the code will still compile even though it's dangerous.

The current LOG("%s", str) pattern does not completely prevent this case, but it gives reviewers a bit more visibility.
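
A tiny illustration of the hazard being described (using printf directly; the same applies to any printf-style macro): if a plain string is passed as the format argument, any '%' conversions in it are interpreted rather than printed.

    #include <cstdio>

    int main() {
        const char * text = "progress: 100%s done"; // e.g. text we did not author
        // printf(text);         // undefined behavior: %s reads a missing argument
        printf("%s\n", text);    // safe: the text is treated as data, not as a format
    }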

Collaborator:

Besides, there is also a risk with LOG(str). If someone accidentally calls it with str being a const char *, the code will still compile even though it's dangerous.

This is precisely the scenario the overload approach fixes. No one can inadvertently hit the format-string path by calling log(str), because this overload handles that case explicitly:

    inline void log(const char * data) {
        log("%s", data);
    }

Reviewers should be familiar with this new type anyhow, so if there's a call to console::log("my log statement");, it will be known to use the proper overload, which can never turn into a bug at runtime.

@bandoti (Collaborator) commented Dec 10, 2025

However, we shouldn't be using the C API anyway; we ought to use C++ idioms (outside of ggml/libllama), so we should set a precedent here. For example, we could add an iostream input too and then pass a stringstream to log.
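
A small sketch of the iostream-flavoured idea (hypothetical, not part of this PR): build the message with a std::ostringstream and hand its buffer to the string overload.

    #include <iostream>
    #include <sstream>
    #include <string>

    namespace console {
        // stand-in for the string overload discussed above
        inline void log(const std::string & text) { std::cout << text; }

        // stream-based input: compose with iostream, then log the result
        inline void log(const std::ostringstream & oss) { log(oss.str()); }
    }

    int main() {
        std::ostringstream oss;
        oss << "decoded " << 128 << " tokens at " << 42.5 << " t/s\n";
        console::log(oss);
    }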

Collaborator Author:

In any case, we have multiple log macros across the code base, so I would prefer to have everything expose the same form of API.

Having an overload for std::string seems fine, as you explain, but I would prefer a dedicated PR to implement the same pattern everywhere in the code base.

So for now, I'll keep this PR as-is, but you can push a follow-up to improve the logging system.

Collaborator:

Yeah sounds good. There are a lot of changes going in this PR. Don't want to cause any friction!

I will do a follow-up PR 😊

Comment on lines +7 to +14
enum display_type {
DISPLAY_TYPE_RESET = 0,
DISPLAY_TYPE_INFO,
DISPLAY_TYPE_PROMPT,
DISPLAY_TYPE_REASONING,
DISPLAY_TYPE_USER_INPUT,
DISPLAY_TYPE_ERROR
};
@ggerganov (Member) commented Dec 10, 2025

mtmd-cli.cpp and completion.cpp need to be updated to use the new enum.
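
For context, a self-contained illustration of how a call site would pick a display type with the new enumerators; set_display here is a stand-in stub, and the real call sites in mtmd-cli.cpp and completion.cpp will look different.

    #include <cstdio>

    enum display_type {
        DISPLAY_TYPE_RESET = 0,
        DISPLAY_TYPE_INFO,
        DISPLAY_TYPE_PROMPT,
        DISPLAY_TYPE_REASONING,
        DISPLAY_TYPE_USER_INPUT,
        DISPLAY_TYPE_ERROR
    };

    // stand-in for the real setter, which would emit the matching terminal colors
    static void set_display(display_type type) {
        (void) type;
    }

    int main() {
        set_display(DISPLAY_TYPE_ERROR);
        fputs("error: failed to load media\n", stderr);
        set_display(DISPLAY_TYPE_RESET);
    }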


Labels

devops (improvements to build systems and github actions), examples, script (Script related), server, testing (Everything test related)

Development

Successfully merging this pull request may close these issues.

Feature Request: Better chat UX for llama-cli

4 participants