# User Manual

## Main Concepts

### Text Generation Model <a href="#text-generation-model" id="text-generation-model"></a>

Moonshot's text generation model (referred to as moonshot-v1) is trained to understand natural language and written text, and it generates text output in response to the input provided. The input to the model is also known as a "prompt." We generally recommend providing clear instructions and a few examples so the model can complete the intended task; designing a prompt is essentially learning how to "train" the model. The moonshot-v1 model can be used for a variety of tasks, including content or code generation, summarization, conversation, and creative writing.

### Language Model Inference Service <a href="#language-model-inference-service" id="language-model-inference-service"></a>

The language model inference service is an API service based on the pre-trained models developed and trained by us (Moonshot AI). In terms of design, we primarily offer a Chat Completions interface externally, which can be used to generate text. However, it does not support access to external resources such as the internet or databases, nor does it support the execution of any code.

### Token <a href="#token" id="token"></a>

Text generation models process text in units called Tokens. A Token represents a common sequence of characters. For example, a single Chinese character like "夔" might be broken down into a combination of several Tokens, while a short and common phrase like "中国" might be represented by a single Token. Generally speaking, for a typical Chinese text, 1 Token is roughly equivalent to 1.5-2 Chinese characters.

It is important to note that for our text model, the total length of Input and Output cannot exceed the model's maximum context length.
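This budget arithmetic can be sketched as follows. The 8,192-Token limit stands in for the 8k model, the 1.5-characters-per-Token figure is the conservative end of the rule of thumb above, and the function names are illustrative, not part of any API:

```python
import math

def estimate_tokens_zh(text: str) -> int:
    """Rough estimate for typical Chinese text: about 1 Token per 1.5
    characters (the conservative end of the 1.5-2 range; only the API's
    usage figures are authoritative)."""
    return math.ceil(len(text) / 1.5)

def fits_context(input_tokens: int, max_tokens: int,
                 context_limit: int = 8192) -> bool:
    """Input and output share one budget: their sum must not exceed the
    model's maximum context length (8192 here is an assumed value for
    moonshot-v1-8k; adjust per model)."""
    return input_tokens + max_tokens <= context_limit
```

If a request does not fit, either trim the input or choose a model with a longer context (32k or 128k).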

### Rate Limits <a href="#rate-limits" id="rate-limits"></a>

How do these rate limits work?

Rate limits are measured in four ways: concurrency, RPM (requests per minute), TPM (Tokens per minute), and TPD (Tokens per day). A request is throttled as soon as any one of these limits is reached, whichever comes first. For example, if your RPM limit is 20 and your TPM limit is 200k, sending 20 Chat Completions requests of only 100 Tokens each would already hit the RPM limit, even though those 20 requests come nowhere near 200k Tokens.

For convenience, the gateway calculates rate limits using the max\_tokens parameter: if your request includes max\_tokens, that value is used; otherwise, the default max\_tokens is used. After you make a request, we determine whether you have reached the rate limit based on the number of Tokens in your request plus the max\_tokens in your parameters, regardless of the number of Tokens actually generated.

In the billing process, we calculate the cost based on the number of Tokens in your request plus the actual number of Tokens generated.

#### Other Important Notes: <a href="#other-important-notes" id="other-important-notes"></a>

* Rate limits are enforced at the user level, not the key level.
* Currently, we share rate limits across all models.

## Model List

You can use our List Models API to get a list of currently available models.
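A minimal sketch of such a call, assuming the OpenAI-compatible convention of a `GET /models` endpoint with Bearer authentication (the base URL shown is a placeholder; use the one from the official API documentation):

```python
def build_list_models_request(api_key: str,
                              base_url: str = "https://api.moonshot.cn/v1") -> dict:
    """Construct the HTTP request for the List Models endpoint.
    The /models path and Bearer-token header follow the OpenAI-compatible
    convention; confirm both against the official API reference."""
    return {
        "method": "GET",
        "url": f"{base_url}/models",
        "headers": {"Authorization": f"Bearer {api_key}"},
    }
```

Sending the constructed request with any HTTP client returns a JSON body whose entries carry the model IDs listed below.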

Currently, the models we support are:

* Generation Model Moonshot-v1
  * `moonshot-v1-8k`: This is an 8k-length model suitable for generating short texts.
  * `moonshot-v1-32k`: This is a 32k-length model suitable for generating longer texts.
  * `moonshot-v1-128k`: This is a 128k-length model suitable for generating very long texts.
  * `moonshot-v1-8k-vision-preview`: This is an 8k vision model that can understand the content of images and output text.
  * `moonshot-v1-32k-vision-preview`: This is a 32k vision model that can understand the content of images and output text.
  * `moonshot-v1-128k-vision-preview`: This is a 128k vision model that can understand the content of images and output text.

The difference between these models lies in their maximum context length, which covers both the input messages and the generated output; there is no difference in capability. The split mainly makes it easier for users to choose the model that fits their context needs.

* Generation Model kimi-latest
  * `kimi-latest` is a vision model with a maximum context length of 128k that supports image understanding. The kimi-latest model always uses the latest version of the Kimi large model in the Kimi intelligent assistant product, which may include features that are not yet stable.
* Long-Term Thinking Model Kimi-thinking-preview
  * `kimi-thinking-preview` is a multimodal reasoning model provided by Moonshot AI, with both multimodal and general reasoning capabilities. It is a 128k-length model that excels at diving deep into problems, helping you tackle more complex challenges.

## Usage Guide

### Getting an API Key <a href="#getting-an-api-key" id="getting-an-api-key"></a>

You need an API key to use our service. You can create an API key in our Console.

### Sending Requests <a href="#sending-requests" id="sending-requests"></a>

You can use our Chat Completions API to send requests. You need to provide an API key and a model name, and you can either keep the default max\_tokens parameter or set it yourself. See the API documentation for details on how to make the call.

### Handling Responses <a href="#handling-responses" id="handling-responses"></a>

Generally, we set a 5-minute timeout. If a single request exceeds this time, we will return a 504 error. If your request exceeds the rate limit, we will return a 429 error. If your request is successful, we will return a response in JSON format.
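A minimal sketch of handling these status codes (the helper names are illustrative; a production client would add exponential backoff before retrying):

```python
def should_retry(status_code: int) -> bool:
    """429 (rate limited) and 504 (5-minute timeout) are transient and
    worth retrying after a pause; other errors usually require changing
    the request itself."""
    return status_code in (429, 504)

def handle_response(status_code: int, body: dict) -> dict:
    """Return the JSON body on success, raise a descriptive error otherwise."""
    if status_code == 429:
        raise RuntimeError("rate limit exceeded; retry after a pause")
    if status_code == 504:
        raise RuntimeError("request exceeded the 5-minute timeout")
    if status_code != 200:
        raise RuntimeError(f"unexpected status {status_code}")
    return body
```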

If you need to process tasks quickly, you can use the non-streaming mode of our Chat Completions API, in which all the generated text is returned in a single response. If you need more control, you can use the streaming mode, in which we return a Server-Sent Events (SSE) stream from which you can read the generated text as it is produced. This provides a better user experience, and you can also interrupt the request at any time without wasting resources.
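Streaming responses can be consumed by parsing the `data:` lines of the SSE stream. The sketch below assumes the OpenAI-compatible chunk format, in which each chunk carries a `choices[0].delta` object and the stream ends with `data: [DONE]`:

```python
import json

def parse_sse_chunks(lines):
    """Yield generated text fragments from SSE 'data:' lines, assuming
    the OpenAI-compatible streaming format."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
```

With a real HTTP client you would feed this the response's line iterator; interrupting the request is then just a matter of breaking out of the loop and closing the connection.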
