While AI is getting better at understanding written commands, the prompt is still the backbone of generating beautiful images. A while ago I wrote a comprehensive guide on how to use camera angles and perspectives with Stable Diffusion. I’ve also shown how you can manipulate your images with the help of inpaint & outpaint. However, the most effective way to create beautiful images the way you want them is to learn how prompts work.
BASIC PROMPTING
The absolute basics of every prompt are the positive and negative prompts, at least for Stable Diffusion. A lot of other generative AI tools, such as Copilot and various phone apps, only use the positive prompt.
It might sound self-evident that you put the things you want in your image in the positive prompt, and the things you don’t want in the negative prompt. However, you might be surprised when you learn all the different things that go in these two text areas.
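To make the two text areas concrete, here is a minimal sketch of how a script could bundle them into one request. It assumes an AUTOMATIC1111-style web UI API, where the txt2img request body has `prompt` and `negative_prompt` fields. The helper name `build_txt2img_payload` and the negative terms are hypothetical, and nothing is sent over the network.

```python
import json


def build_txt2img_payload(positive, negative="", width=1024, height=1024):
    """Bundle the positive and negative prompt into one request body.

    The field names assume an AUTOMATIC1111-style txt2img API.
    """
    return {
        "prompt": positive,           # things you want in the image
        "negative_prompt": negative,  # things you do not want in the image
        "width": width,
        "height": height,
    }


payload = build_txt2img_payload(
    positive="Realistic photo of a ghostly woman",
    negative="blurry, low quality",  # hypothetical example terms
)
print(json.dumps(payload, indent=2))
```

Tools that only expose a positive prompt would simply have no counterpart to the `negative_prompt` field.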
Let’s say we want to create an image like the one below.
The base prompt for this image would look something like this.
This basic prompt will create the following image. Given the simple prompt, the image is not bad at all. But something to keep in mind is that the later the version of the AI, the better it is at creating really great images from poorly formed prompts. I’m using Stable Diffusion XL for these images.
The full positive prompt for the first image looks like this.
And will create this image.
The full positive and negative prompt for the first image looks like this.
Here are the different images side by side. The left image uses the full positive and negative prompts, the middle image uses only the basic positive prompt, and the right image uses the full positive prompt.
Click the link for full size image: Full size (3048×1119 px)
Is all of that really necessary? Different types of images are more or less sensitive to how you build your prompt. Which AI model you are using, whether you use a LoRA, and which extensions you are using also make a huge difference in prompt adherence.
WHAT DO THE DIFFERENT THINGS REALLY MEAN?
If you are new to generative AI you might have seen a lot of things that don’t make any sense being used in prompts. I’m going to explain some of them here.
Let’s break down the prompt into smaller pieces.
Full shot, (full body view:1.2)
Full shot and full body view tell the AI how to frame the image, so to speak. The default when creating images of people is either a close-up of their face or an upper body shot. For example, using only this for a prompt: Realistic photo of a ghostly woman will produce this image.
(Brackets)
Round brackets put emphasis on the word inside them. If you have a word in your prompt that you feel isn’t reflected in the image, you can put brackets around the word to make the AI understand that it is particularly important. You can put a word inside up to 3 (((brackets))) for different levels of emphasis.
:1.2
This is similar to, and often used with, brackets and means that you put a weight on a specific word. Sometimes you might have a word in brackets and still not get the results you want. You might state that you want a full body view but still get an upper body view. In that case you can assign more weight to the word or phrase, such as (full body view:1.2). This increases the importance of the word or phrase by a factor of 1.2. The default weight is 1 and doesn’t need to be written out.
When the AI puts focus on something, it will usually also zoom in on that particular thing. See the difference between Full body view. Realistic photo of a ghostly woman and Full body view. Realistic photo of a ghostly woman's (((feet:2.0)))
Even though you still have the full body view in the prompt, the feet are in focus and zoomed in on because of the brackets and the weights put on them. Weights can be used in the range of :-3.0 to :3.0.
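To get a feel for how brackets and weights add up, here is a rough sketch that estimates the combined weight of a single bracketed token. It assumes the AUTOMATIC1111 convention, where each pair of round brackets multiplies the weight by 1.1, each pair of square brackets divides it by 1.1, and an explicit :number replaces the 1.1 of the innermost round pair. Other interfaces may count differently, and the function name `effective_weight` is made up for this example.

```python
import re


def effective_weight(token, step=1.1):
    """Estimate the emphasis multiplier for one bracketed token.

    Assumption: AUTOMATIC1111-style rules. Only handles a single token
    that is wrapped as a whole, not a full prompt.
    """
    text = token.strip()
    outer = []  # bracket types, outermost first
    while len(text) >= 2 and (text[0], text[-1]) in (("(", ")"), ("[", "]")):
        outer.append(text[0])
        text = text[1:-1]

    weight = 1.0
    explicit = re.fullmatch(r"(.+):\s*(-?\d+(?:\.\d+)?)", text)
    if explicit and outer and outer[-1] == "(":
        # The explicit number stands in for the innermost round pair.
        text = explicit.group(1)
        weight *= float(explicit.group(2))
        outer.pop()

    for bracket in outer:
        weight = weight * step if bracket == "(" else weight / step
    return text, round(weight, 3)


print(effective_weight("(full body view:1.2)"))  # ('full body view', 1.2)
print(effective_weight("(((feet:2.0)))"))        # ('feet', 2.42)
print(effective_weight("((word))"))              # ('word', 1.21)
```

Under these assumptions the triple brackets around feet:2.0 push the weight above 2.0, which fits how strongly the feet take over the image.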
[square brackets]
Square brackets are the opposite of round brackets. They are used to put less emphasis on a word and can be used together with weights just like the round brackets.
(leopard:crocodile:0.4) or (leopard|crocodile:0.4)
This basically tells the AI to create an image of a creature that is 40% leopard and 60% crocodile. These types of prompts don’t always work the way you expect them to, and some hybrids are easier to create than others. Often it helps to state in the prompt that it’s supposed to be a hybrid.
A (leopard:crocodile:0.4) hybrid
PROMPT STRUCTURE
A well-written prompt should be structured like this.
- Medium
- Set the scene
- Key elements
- Details
- Style
- Resolution
- Light/sensation/emotion
Personally I always add the framing of the image first in the prompt (e.g. full shot, full body view). I’ve seen others do it in another way and make it work, so I guess it’s mostly a matter of taste.
Let’s create an image of a female pirate.
Full shot, full body view. Realistic photo. A beautiful woman in a pirate costume. Detailed face, realistic skin texture, red roses. Epic fantasy style. 16k HD, highly detailed, bokeh, macro photography. artistic expression, immersive experience, expert technique
Using this prompt alone without any brackets, weights or negative prompt results in the following image.
As you can see, it’s not a full body view image, even though I clearly wrote Full shot, full body view in the prompt. This is partly because of the aspect ratio and partly because I’ve put detailed face and realistic skin texture in the prompt. It creates a conflict because the AI will reason that it can either create a full body view image or one with a detailed face and skin texture. Most likely it chose the latter because there are two different phrases that require a more close-up image.
Now I can put weights on the full body view part, and force the AI to adhere, but then the face will become less detailed. Let’s see what happens.
Even when using brackets and weights the AI doesn’t really zoom out much.
Full shot, (full body view:1.5). Realistic photo. A beautiful woman in a pirate costume. Detailed face, realistic skin texture, red roses. Epic fantasy style. 16k HD, highly detailed, bokeh, macro photography. artistic expression, immersive experience, expert technique
I could add more brackets and weights, but I don’t believe that’s a good idea. Let’s try something else first.
Full shot, (full body view:1.5). Realistic photo. A beautiful woman in a pirate costume in high heel shoes. Detailed face, realistic skin texture, red roses. Epic fantasy style. 16k HD, highly detailed, bokeh, macro photography. artistic expression, immersive experience, expert technique
By adding that she is wearing high heel shoes to the prompt, we have given the AI another condition that requires it to generate an actual full body view image. But it’s also evident that the detailed face suffered from it.
Let’s go back to the drawing board, and this time change the aspect ratio to something that is more suitable for a full body view, and maybe camera angle and distance.
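As a side note on picking an aspect ratio, here is a small sketch that turns a ratio into a width and height. It assumes a pixel budget of roughly 1024×1024, the size SDXL is generally used at, and it rounds both sides to multiples of 64, which is a common convention rather than a hard rule. The function name and the example ratios are my own and are not the exact settings used for the images here.

```python
import math


def dimensions_for_ratio(ratio_w, ratio_h, target_pixels=1024 * 1024, multiple=64):
    """Return (width, height) close to the pixel budget for a given ratio.

    Assumptions: about one megapixel in total, sides snapped to `multiple`.
    """
    ideal_height = math.sqrt(target_pixels * ratio_h / ratio_w)
    ideal_width = ideal_height * ratio_w / ratio_h

    def snap(value):
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(ideal_width), snap(ideal_height)


print(dimensions_for_ratio(1, 1))   # (1024, 1024)
print(dimensions_for_ratio(2, 3))   # (832, 1280)
print(dimensions_for_ratio(9, 16))  # (768, 1344)
```

A tall canvas like the last two gives a standing figure more room than a square one, so the framing part of the prompt has less to fight against.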
Eventually I had to settle for the following:
(from above, full body shot:1.5). Realistic photo. Full body view of a beautiful woman in a pirate costume. detailed face, [red roses:0.45]. Epic fantasy style. 16k HD, highly detailed, bokeh, macro photography. artistic expression, immersive experience, expert technique
The solution for getting both a detailed face and a full body view is the camera angle. Obviously you can use tools such as ADetailer to get a detailed face even when zoomed out, but that’s a whole other story.
So this is the breakdown of the prompt.
- Framing: (from above, full body shot:1.5)
- Medium: Realistic photo
- Set the scene: Full body view of a beautiful woman
- Key elements: in a pirate costume
- Details: detailed face, [red roses:0.45]
- Style: Epic fantasy style
- Resolution: 16k HD, highly detailed, bokeh, macro photography
- Light/sensation/emotion: artistic expression, immersive experience, expert technique
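If you like to keep the sections separate while you experiment, the breakdown above can be sketched as a small helper that joins them in the same order. This is just one way to do it: the function name `build_prompt` and its parameter names are made up for this example. The scene and key elements are joined with a space, and every other section is separated by a period.

```python
def build_prompt(framing, medium, scene, key_elements, details, style, resolution, mood):
    """Join the prompt sections in the order used in this article."""
    subject = f"{scene} {key_elements}".strip()
    parts = [framing, medium, subject, details, style, resolution, mood]
    # Skip empty sections so a missing part leaves no stray separator.
    return ". ".join(part for part in parts if part)


pirate = build_prompt(
    framing="(from above, full body shot:1.5)",
    medium="Realistic photo",
    scene="Full body view of a beautiful woman",
    key_elements="in a pirate costume",
    details="detailed face, [red roses:0.45]",
    style="Epic fantasy style",
    resolution="16k HD, highly detailed, bokeh, macro photography",
    mood="artistic expression, immersive experience, expert technique",
)
print(pirate)
```

Changing a single argument, for example the framing, then rebuilds the whole prompt without retyping the rest.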
I hope you have learned something new today!