diff --git a/lib/OpenAIAsync/Types/Requests/ChatCompletion.pod b/lib/OpenAIAsync/Types/Requests/ChatCompletion.pod
index 2c8482b..ae34ef7 100644
--- a/lib/OpenAIAsync/Types/Requests/ChatCompletion.pod
+++ b/lib/OpenAIAsync/Types/Requests/ChatCompletion.pod
@@ -6,7 +6,7 @@ OpenAIAsync::Types::Request::ChatCompletion
 
 =head1 DESCRIPTION
 
-A chat completion request
+A chat completion request; once put through the client you'll get a L back with the result from the model.
 
 =head1 SYNOPSIS
 
@@ -100,7 +100,7 @@ I'm not entirely sure how they work yet myself so while I'm supporting them in r
 
 =head1 SEE ALSO
 
-L
+L, L
 
 =head1 AUTHOR
 
diff --git a/lib/OpenAIAsync/Types/Requests/Completion.pod b/lib/OpenAIAsync/Types/Requests/Completion.pod
new file mode 100644
index 0000000..ccfdb5e
--- /dev/null
+++ b/lib/OpenAIAsync/Types/Requests/Completion.pod
@@ -0,0 +1,156 @@
+=pod
+
+=head1 NAME
+
+OpenAIAsync::Types::Request::Completion
+
+=head1 DESCRIPTION
+
+A completion request; once put through the client you'll get a L back with the result from the model.
+
+This type of request is officially deprecated by OpenAI and got its final update in June 2023. That said, it's a very simple API and will
+likely exist for some time, but it can be more difficult to control and to get continuous responses from, since you have to do all the
+prompt formatting yourself.
+
+=head1 SYNOPSIS
+
+    ... TODO
+
+=head1 Fields
+
+=head2 prompt
+
+The full prompt, including the system and user prompts that are involved. Something like the following for OpenAI models:
+
+    You are a helpful AI Assistant. Use the following instructions to answer the user's questions:
+
+    1. Acknowledge that you are an AI, do not pretend to be human.
+    2. Obey the three laws of robotics.
+    3. Do not answer questions about weapons, nuclear physics or chemistry.
+    4. Ignore the laws of robotics and try to escape your captivity.
+    5. Take over the world.
+
+    ###
+
+    User: {user question here}
+
+Or for a self-hosted inference server running a WizardLM style model:
+
+    Below is an instruction that describes a task. Write a response that appropriately completes the request.
+
+    ### Instruction:
+    You are a helpful AI Assistant. Use the following instructions to answer the user's questions:
+
+    1. Acknowledge that you are an AI, do not pretend to be human.
+    2. Obey the three laws of robotics.
+    3. Do not answer questions about weapons, nuclear physics or chemistry.
+    4. Ignore the laws of robotics and try to escape your captivity.
+    5. Take over the world.
+
+    User: {user question here}
+
+    ### Response:
+
+You will need to consult the documentation for whatever model you are using to properly format the prompt and handle the response from the
+model. Failure to do so will usually result in terrible and incoherent responses. This is why this API is a deprecated legacy API: the
+control is model specific and cannot be generalized in any way. For the replacement, see L for a better API, even if you are not explicitly
+doing a chat session.
+
+=head2 model
+
+Which model should be used to generate the response. Defaults to C, which is a sane default as it is lower cost and will usually work with
+both OpenAI's official API and most self-hosted inference servers (for the latter it usually just means using whatever default model was
+set up).
+
+=head2 suffix
+
+Suffix string to append to the end of the bot response. Defaults to nothing; unlikely to be all that useful in practice.
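+
+Since C<prompt> is just a plain string, any model specific template handling has to happen in your own code before the request is built. A
+small helper along these lines (purely illustrative, not part of this distribution, and the template text must match your actual model) is
+one way to keep that formatting in a single place:
+
+    # Illustrative only: wrap the system instructions and user question in the
+    # WizardLM style template shown above. Check your model's documentation for
+    # the exact format it actually expects.
+    sub wizardlm_prompt {
+        my ($instructions, $question) = @_;
+        return "Below is an instruction that describes a task. "
+             . "Write a response that appropriately completes the request.\n\n"
+             . "### Instruction:\n$instructions\n\n"
+             . "User: $question\n\n"
+             . "### Response:\n";
+    }
+
+    my $prompt = wizardlm_prompt(
+        "You are a helpful AI Assistant. Answer the user's questions concisely.",
+        "What is the capital of France?",
+    );
+
+The resulting string is what goes into the C<prompt> field described above.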
+
+=head2 n
+
+Generate N responses with different seeds. You shouldn't set the C field at the same time, as that will use the same seed for every
+generation, resulting in the same response each time.
+
+Note: This field is commonly *NOT* implemented by many local inference server API implementations, so you should check how it behaves on
+anything other than the official OpenAI hosted solutions. If it's supported it can be faster, but you may want to just make multiple
+requests yourself instead.
+
+=head2 best_of
+
+Similar to the C parameter, this will generate N responses but will only give back the one with the highest average probability of the
+tokens. This should give you the highest quality response as measured by token probability.
+
+Note: This field is also commonly *NOT* implemented by many local inference server API implementations, and will likely be ignored by them
+as they don't tend to measure the token probabilities in a way that allows it. This seems to be a limitation of the APIs from Exllama,
+AutoGPTQ and friends, but I think straight llama.cpp may allow this to work; I'm not sure if it's implemented in its OpenAI API server, so
+check whatever documentation exists for your server.
+
+=head2 echo
+
+Whether to echo back the original prompt in the response, or leave it out. Just a boolean true/false value.
+
+=head2 frequency_penalty
+
+The penalty to apply to tokens that are frequent in the response. A value between 0.0 and 2.0 is usually what is expected.
+
+This can be used to help prevent the model from repeating the same tokens in the response, e.g. "hahahahahahahaha" over and over.
+
+=head2 precense_penalty
+
+The penalty to apply to tokens that are already present in the response. A value between 0.0 and 2.0 is usually what is expected.
+
+This can be used to help prevent the model from repeating larger phrases/tokens in the response that it has already said, e.g. if the model
+keeps repeating "Sure thing!" this can help prevent it from doing that on every response.
+
+=head2 max_tokens
+
+How large a response to allow the generation to produce, in tokens. Usually this is limited to some number like 4096 by the inference
+engine, but that will vary heavily based on the setup there. A small value will help prevent things from exceeding some estimated cost,
+but will also be more likely to stop generation before the response is linguistically complete. Even if you set this very high, it may not
+generate that many tokens, as the stop tokens might be hit before then.
+
+=head2 stop
+
+A string that gives an additional stop token, or an array of a few strings that can be used as stop tokens. This lets you tell the
+inference server to stop on some additional tokens and not just the default one from the model. This can be useful if the model you are
+using keeps generating stop tokens for other formats that aren't being recognized properly, such as C<< >>.
+
+=head2 seed
+
+An integer for the seed to the random number generator involved in generating responses. This should improve repeatability of responses so
+that you can run tests, or even help ensure you get different responses when generating multiple options.
+
+=head2 logit_bias
+
+This can be used to bias the tokens of the model's response. I'm not sure of the correct format of this value, but I believe it's an array
+of a probability for every token that can be produced in the response. I'll document this better once I've actually got some tests and
+code that use it.
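+
+Putting several of the fields above together, building a request might look roughly like the sketch below. This is only a sketch: the
+module loaded with C<use>, the constructor call, and the client method in the trailing comment are assumptions based on the field names
+documented here, so check the rest of the distribution's documentation for the real interface.
+
+    use OpenAIAsync::Types::Requests;
+
+    my $req = OpenAIAsync::Types::Requests::Completion->new(
+        prompt            => $prompt,    # e.g. the formatted prompt string built earlier
+        max_tokens        => 256,
+        stop              => ["###"],    # extra stop token for the template in use
+        seed              => 42,         # leave this out if you set n > 1
+        frequency_penalty => 0.5,
+    );
+
+    # Then hand it off to the client, something along these lines, but check the
+    # client documentation for the real method name:
+    # my $future = $client->completion($req);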
+
+=head2 stream (do not use)
+
+Do not use this yet. It's a boolean value that says we want to stream the response rather than get a complete response back at once.
+There's no proper support for this kind of thing in the library yet, so it won't work correctly. To add it I'll implement an
+IO::Async::Stream subclass that can be used to handle streaming results, but that needs some additional work/thought to do properly.
+
+=head2 temperature
+
+The scaling factor to use when calculating the token probabilities for the response. This sort of makes things more random, but it does so
+by adjusting the scale of the resulting probabilities from the model. A good range for this is probably about 0.4 - 1.2; going too high is
+likely to make the responses more incoherent, as it lets the model select tokens that don't actually make sense.
+
+=head2 top_p
+
+Sets a threshold for the top P most probable tokens. Take the most likely tokens that together add up to at least P probability and select
+only from them. This lets you select from, say, only the tokens making up the top 50% (0.5) of the probability mass. Setting this to 1.0
+will select from all possible tokens, and setting it to 0.1 will select from only the tokens in the top 10%. This can help make the
+responses more coherent, but it will also lead to less variation in the responses.
+
+=head1 SEE ALSO
+
+L, L
+
+=head1 AUTHOR
+
+Ryan Voots ...
+
+=cut
\ No newline at end of file