Captioning and Settings
Added 2023-01-20 22:27:55 +0000 UTC
Some of you folks asked about captioning. I don't think there's a need for a full guide on captioning, but here's a post about it.
Finetuning:
I used to use CLIP, but for properly training a model I don't like the results from the various CLIP models available, so I manually caption images. I batch-create a text file for each image I'm using and type the caption out by hand. I find it quicker to type into an empty text file than to edit partial CLIP captions for hundreds or thousands of images.
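The batch-creation step could be scripted along these lines. This is only a minimal sketch, not the tool I use: the function name, the list of image extensions, and the convention of one same-named .txt file next to each image are assumptions for illustration.

```python
from pathlib import Path

# Hypothetical list of extensions to treat as images; adjust to your dataset.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def create_caption_files(image_dir):
    """Create an empty .txt next to every image that lacks one.

    Returns the list of caption files that were newly created.
    """
    created = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if not image_path.is_file():
            continue
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        caption_path = image_path.with_suffix(".txt")
        if not caption_path.exists():  # never overwrite a caption already typed
            caption_path.touch()
            created.append(caption_path)
    return created
```

Calling create_caption_files("path/to/dataset") leaves existing captions alone, so it is safe to rerun after adding more images.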
There are multiple captioning tools out there to help with typing captions in manually. I'm also releasing my own on GitHub soon, with a UI built with NodeJS.
For finetuning, captions make a huge difference. So many things make a difference: learning rate and other settings, and image quality / defects as well as captioning.
I am using EveryDreamV2 for my finetuning. The owner of that repo also has autocaptioning built in if you want to use it, but I still prefer completely manual captions.
For my NSFW images, I go into specific detail about the NSFW content I am training. Examples below are NSFW
As an example https://i.imgur.com/01QxsQa.png
I would caption this image as:
a closeup of a man gagged with a bdsm ball gag in his mouth
I wouldn't mention the environment, the clothes, etc. I'm specifically focusing on what I want the AI to know and the terms I want it to learn.
Another example:
https://i.imgur.com/PvuH85G.png
a naked woman with her mouth ball gagged with her exposed small breasts, hard nipples, bdsm
The focus of this image is BDSM, but in other examples I may describe the actual breast type, nipples, etc. too.
Another example:
https://i.imgur.com/xz8MHNJ.png
a woman with black lacy lingerie on her knees, with a ball gag in her mouth, wearing a neck choker collar and chains, handcuffed hands behind her back, bdsm
https://i.imgur.com/XpKbVw9.png
POV view of a woman with glasses giving a man a blowjob, sucking his hard dick, holding a hard dick in her mouth
https://i.imgur.com/RUgIhYL.png
a man with a large sized flaccid hairy dick hanging
I'm describing the hard dick in various ways because there are multiple angles, ways the dick is portrayed/used, various hair types, flaccid vs hard vs fully erect, shaved, small/medium/large, etc. I want the AI to know the variety.
Embeddings:
You don't need to caption embeddings. You can, and I previously have, but more recently I switched to no captioning, and I notice the results for subjects are just as good without it.
For settings, in general I will use a batch size between 1 and 4, as most of my embeddings are trained on a local GPU. If I am on the cloud I will raise the batch size and gradient step, to around 8-10 for batch size and usually half that for gradient step.
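To see what those combinations mean in practice, here is a small sketch. It assumes that "gradient step" is gradient accumulation, so each weight update averages over batch_size * gradient_step images; the function name is made up for illustration.

```python
def effective_batch_size(batch_size, gradient_step):
    """Images contributing to one weight update, assuming gradient accumulation."""
    return batch_size * gradient_step


# Example local run: batch size 4 with a gradient step of 2.
local = effective_batch_size(4, 2)

# Cloud runs: batch size 8-10 with roughly half that as the gradient step.
cloud_low = effective_batch_size(8, 4)
cloud_high = effective_batch_size(10, 5)

print(local, cloud_low, cloud_high)
```

Under that assumption the cloud settings average each update over several times more images than the local ones, which is worth keeping in mind when reusing a learning rate between the two.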
Here are the average settings I used:
"num_of_dataset_images": 15,
"num_vectors_per_token": 2,
"learn_rate": "0.005:1000, 0.001:2000, 0.0001:5000, 0.00005",
"batch_size": 4,
"gradient_step": 2,
"training_width": 512,
"training_height": 512,
"steps": 8000,
"clip_grad_mode": "disabled",
"clip_grad_value": "0.1",
"latent_sampling_method": "deterministic",
"create_image_every": 50,
"save_embedding_every": 500,
"save_image_with_stored_embedding": true,
"template_file": "x",
The template file is one that says "a photo of [name] woman" or man etc.
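The learn_rate string above is a stepped schedule: each "rate:step" pair means "use this rate until that step", and a final bare rate applies for the rest of the run. The sketch below shows one way to read it. The function names are hypothetical and the exact handling of the boundary step is an assumption, so treat it as an illustration of the format rather than the trainer's own parser.

```python
def parse_schedule(text):
    """Turn "0.005:1000, 0.001:2000, 0.00005" into [(rate, until_step_or_None), ...]."""
    schedule = []
    for part in text.split(","):
        rate, _, until = part.strip().partition(":")
        schedule.append((float(rate), int(until) if until else None))
    return schedule


def rate_at(schedule, step):
    """Return the learning rate in effect at a given step, or None if the schedule has run out."""
    for rate, until in schedule:
        # Assumption: a rate applies to steps strictly below its end step.
        if until is None or step < until:
            return rate
    return None


embedding_schedule = parse_schedule("0.005:1000, 0.001:2000, 0.0001:5000, 0.00005")
for step in (0, 1500, 3000, 7999):
    print(step, rate_at(embedding_schedule, step))
```

With 8000 steps, the last 3000 steps of this schedule run at the lowest rate, 0.00005.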
Hypernetworks:
I use default CLIP captioning for hypernetworks. Again, I no longer think the captioning has a huge impact on how the hypernetwork comes out, unlike with finetuning.
I leave the settings for hypernetworks fairly close to standard. For a hypernetwork with hundreds of images, here are my settings:
"num_of_dataset_images": 319,
"layer_structure": [
1.0,
2.0,
1.0
],
"activation_func": "linear",
"weight_init": "Normal",
"add_layer_norm": false,
"use_dropout": false,
"hypernetwork_name": "x",
"learn_rate": "0.00001:1000,0.00005:5000,0.000001:10000,0.0000005:50000",
"batch_size": 2,
"gradient_step": 1,
"data_root": "/content/dataset/cropped",
"log_directory": "x",
"training_width": 512,
"training_height": 512,
"steps": 100000,
"clip_grad_mode": "disabled",
"clip_grad_value": "0.1",
"latent_sampling_method": "deterministic",
"create_image_every": 500,
"save_hypernetwork_every": 500,
"template_file": "/content/gdrive/MyDrive/sd/stable-diffusion-webui/textual_inversion_templates/hypernetwork.txt",
"initial_step": 0
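For anyone unsure how to read layer_structure, here is a rough sketch. It assumes each number is a multiplier on the width of the input the hypernetwork module is attached to, so [1.0, 2.0, 1.0] expands to twice the width and then contracts back. The width of 768 and the function names are assumptions for illustration, not values taken from my settings.

```python
def layer_shapes(layer_structure, dim):
    """(inputs, outputs) of each linear layer implied by the multipliers."""
    return [
        (int(dim * layer_structure[i]), int(dim * layer_structure[i + 1]))
        for i in range(len(layer_structure) - 1)
    ]


def parameter_count(shapes):
    """Weights plus one bias per output unit, summed over all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


shapes = layer_shapes([1.0, 2.0, 1.0], dim=768)  # 768 is an assumed input width
print(shapes)  # [(768, 1536), (1536, 768)]
print(parameter_count(shapes))
```

With "activation_func" set to "linear" there is no nonlinearity between those two layers, which is part of why this is close to the standard setup.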
Comments
Hmm I may have. What's the minimum training set size you recommend then? Your settings suggest 15. But could one do better with fewer, higher-quality images?
joe baker
2023-02-03 16:22:23 +0000 UTC
Are there any artifacts or blur at all in your dataset? Even if just one image has a small amount of blur on the face etc., that can be amplified into much more noticeable blur after training.
2023-02-03 06:59:58 +0000 UTC
This was really good advice, my embeddings are more predictable. What I'm running into is that the embedding works inconsistently across SD 1.5 based models, and when it seems to work, the eyes come out "blurry". I'm using negative prompts like (empty pupils, spidery eyelashes, deformed iris, deformed pupils:1.3),... and prompts like "perfect eyes" muck with my likeness. I've even tried going back to the lower step embeddings, like 4000 (vs. 8000). Any thoughts on what's going on? Edit: one more thought, I trained on SD 1.5 pruned emaonly as the full model had trouble loading.
joe baker
2023-02-03 01:57:10 +0000 UTC
Thanks for the reply! I took your advice and made training as simple as possible instead of spinning all the dials at random. I discovered that embeddings trained on the standard SD 1.5 ckpt trained best, with preview images looking like photos and starting to resemble the subject in less than 100 steps. Saturation was mostly under control with some oversaturated reds by 400 steps. HassanBlend 1.5.1.2 was not as well-behaved in this way but was in the same ballpark. These may or may not self-correct over time; I was doing short test runs. SmirkingFace's EB1.1 was a dumpster fire and never gave me a preview image that looked like a photo at all. This seems like a simple case of "reduce the learning rate by a LOT" for EB1.1, or maybe to not train on it at all, but I'll know for sure for this model and a couple more in a few test runs of 5,000+ steps. Thanks again.
2023-01-27 01:00:11 +0000 UTC
So this sounds like overtraining, or maybe too high a vector count for the dataset you are using. It may be worth trying a vector count of 1-2. How many images are you using in your dataset vs how many steps are you training for? There are a few variations you can try:
- Try without captions altogether
- Try the "deterministic" method for training
- Low vector count of 1 or 2
- * as the initialization text instead of man/woman etc
- Try a stepped learning rate instead, such as 0.005:1000, 0.001:2000, 0.0001:5000, 0.00005
2023-01-25 08:32:09 +0000 UTC
Thanks. This was enlightening in so many ways! I have a specific problem with embeddings that your embeddings don't have: My embeddings always seem too strong, overwhelming the rest of the prompt at normal CFGs (like, say, 7): oversaturation, insisting on being anything but a photo, ignoring the prompt, etc. The preview images show the same thing but more so if I don't check the "Read parameters from txt2img tab" box. Using your settings and even an unreasonably low LR doesn't seem to help much. (I'm training on HassanBlend1.5.1.2-pruned-safetensors, but I have identical problems with other models.) But with your embeddings, I can just slap them into a prompt and they work. What's your secret?
2023-01-24 18:13:33 +0000 UTC
Thanks!
2023-01-21 05:07:49 +0000 UTC
thank you kindly ser :D
William Tatum
2023-01-21 03:06:05 +0000 UTC