sd_hassan

Captioning and Settings

Added 2023-01-20 22:27:55 +0000 UTC

Some of you folks asked about captioning. I don't think there's a need for a full guide on it, but here's a post about how I do it.

Finetuning:

I used to use CLIP, but for properly training a model I don't like the results of the various CLIP models available, so I caption images manually. I batch create a text file for each image I'm using and type the caption out by hand. I find it quicker to type into an empty text file than to edit partial CLIP captions for hundreds or thousands of images.
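The batch step above can be sketched roughly like this. It's only a minimal example, not my actual tool: the function name, the list of image extensions and the "same file name with .txt" convention are assumptions for illustration, so check what your trainer expects.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your dataset contains.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def create_caption_files(image_dir):
    """Create an empty .txt file next to each image that lacks one.

    Returns the list of caption paths that were newly created.
    """
    created = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if not image_path.is_file():
            continue
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        caption_path = image_path.with_suffix(".txt")
        if not caption_path.exists():  # never overwrite a caption already typed
            caption_path.touch()
            created.append(caption_path)
    return created
```

Running it again later is safe, since existing caption files are left alone and only new images get an empty file.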

There are multiple captioning tools out there to help with typing captions in manually. I'm also releasing my own on GitHub soon, with a UI built with Node.js.

For finetuning, captions make a huge difference, but so do many other things: the learning rate and other settings, and image quality or defects relative to the captioning.

I am using EveryDreamV2 for my finetuning. The owner of that repo also has autocaptioning built in if you want to use it, but I still prefer fully manual captions.

For my NSFW images, I go into specific detail about the NSFW content I am training. The examples below are NSFW.

As an example https://i.imgur.com/01QxsQa.png

I would caption this image as:

a closeup of a man gagged with a bdsm ball gag in his mouth

I wouldn't mention the environment, the clothes, etc. I'm specifically focusing on what I want the AI to know and the terms I want it to learn.

Another example:

https://i.imgur.com/PvuH85G.png
a naked woman with her mouth ball gagged with her exposed small breasts, hard nipples, bdsm

The focus of this image is BDSM but in other examples I may describe the actual breast type too and nipples etc

Another example:

https://i.imgur.com/xz8MHNJ.png
a woman with black lacy lingerie on her knees, with a ball gag in her mouth, wearing a neck choker collar and chains, handcuffed hands behind her back, bdsm

https://i.imgur.com/XpKbVw9.png

POV view of a woman with glasses giving a man a blowjob, sucking his hard dick, holding a hard dick in her mouth

https://i.imgur.com/RUgIhYL.png

a man with a large sized flacid hairy dick hanging

I'm describing the hard dick in various ways because theres multiple angles, ways the dick is portrayed/used, various hair types, flacid vs hard vs fully erect, shaved, small medium large etc. I want the AI to know the variety

Embeddings:

You don't need to caption embeddings. You can, and I previously have, but more recently I switched to no captioning, and for subjects the results are just as good without it.

For settings, in general I use a batch size of up to 4, as most of my embeddings are trained on a local GPU. If I am on the cloud I raise the batch size and gradient step: around 8-10 for batch size and usually half that for gradient step.
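As a rough way to compare the local and cloud setups, here's a small sketch. It assumes "gradient step" means gradient accumulation steps (that is how I read the setting; check your trainer), in which case the number of images that go into each weight update is batch size times gradient step. The function name is just for illustration.

```python
def effective_batch_size(batch_size, gradient_step):
    """Images contributing to one weight update, assuming gradient accumulation."""
    return batch_size * gradient_step


# Local GPU values from this post: batch size 4, gradient step 2.
local = effective_batch_size(4, 2)

# Cloud example: batch size 10 with half that for gradient step.
cloud = effective_batch_size(10, 5)

print(local, cloud)  # prints: 8 50
```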

Here are the typical settings I use:

"num_of_dataset_images": 15,

"num_vectors_per_token": 2,

"learn_rate": "0.005:1000, 0.001:2000, 0.0001:5000, 0.00005",

"batch_size": 4,

"gradient_step": 2,

"training_width": 512,

"training_height": 512,

"steps": 8000,

"clip_grad_mode": "disabled",

"clip_grad_value": "0.1",

"latent_sampling_method": "deterministic",

"create_image_every": 50,

"save_embedding_every": 500,

"save_image_with_stored_embedding": true,

"template_file": "x", This template file is one like that says "a photo of [name] woman" or man etc

Hypernetworks:

I use default CLIP captioning for hypernetworks. Again, I no longer think the captioning has a huge impact on how the hypernetwork comes out, unlike with finetuning.

I leave the hypernetwork settings fairly close to standard. For a hypernetwork with hundreds of images, here are my settings: