October 2023

I found a CLIP implementation with a fair number of GitHub stars and tried running it for fun. It turned out to have quite a few pitfalls... (I noticed Shariatnia seems to like editing code in Jupyter without syncing the changes back to the source files.)

Project: https://github.com/moein-shariatnia/OpenAI-CLIP

Some Questions About the Loss

CLIP takes image-text pairs as input. For a batch of size 8, there are 8 positive pairs and 56 negative pairs, so we could build the targets with torch.eye. In this project, however, because the Flickr8k dataset is small and each image has 5 captions, the author instead uses the similarity between texts (images) as the targets, as in the following code block.

targets = F.softmax(
    (images_similarity + texts_similarity) / 2 * self.temperature, dim=-1
)
texts_loss = cross_entropy(logits, targets, reduction='none')
images_loss = cross_entropy(logits.T, targets.T, reduction='none')
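For reference, here is a minimal sketch (my own, not the project's code) of the standard CLIP loss with torch.eye targets, which is enough whenever a batch contains no duplicate images or texts:

```python
import torch
import torch.nn.functional as F

batch_size = 8
# Toy embeddings standing in for the two encoders' outputs.
image_embeddings = F.normalize(torch.randn(batch_size, 64), dim=-1)
text_embeddings = F.normalize(torch.randn(batch_size, 64), dim=-1)
logits = text_embeddings @ image_embeddings.T

# torch.eye marks the 8 positives on the diagonal; the 56 off-diagonal
# entries are the negatives.
targets = torch.eye(batch_size)
labels = targets.argmax(dim=-1)  # equals torch.arange(batch_size)

texts_loss = F.cross_entropy(logits, labels)    # text -> image direction
images_loss = F.cross_entropy(logits.T, labels) # image -> text direction
loss = (texts_loss + images_loss) / 2
```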

This raised a few questions for me:

  • images_loss and texts_loss seem to be swapped? (This doesn't affect training, since the final loss is the average of the two.) In the original paper, images_loss is the classification loss over images for a given text.
  • Using the similarity between image (text) embeddings as the target does handle duplicate images (texts) within a batch, and it has a flavor of pseudo-labelling. But it also brings problems: 1) without a pretrained model, the image (text) similarities early in training are meaningless and cannot serve as targets, i.e., a good pretrained model is mandatory; 2) late in training, it overfits: the two encoders (or projection heads) tend to output nearly identical vectors to achieve a low loss, similar to mode collapse in GANs. (In fact, if you set a uniform, fairly large learning rate such as lr=1e-3, then after one epoch the model returns the same fixed image-retrieval results no matter what text you query. How did I find out? The author's code was originally written exactly this way, sob.)

    model = CLIPModel().to(CFG.device)
    optimizer = torch.optim.AdamW(
            model.parameters(), lr=CFG.lr, weight_decay=CFG.weight_decay)
    lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode="min", patience=CFG.patience, factor=CFG.factor)
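A way to avoid this collapse is to give the pretrained encoders much smaller learning rates than the randomly initialized projection heads, using optimizer parameter groups. The sketch below is my own; the attribute names (image_encoder, text_encoder, image_projection, text_projection) mirror the project's CLIPModel layout but are assumptions here, stood in for by a dummy module:

```python
import itertools
import torch
import torch.nn as nn

# Dummy stand-in with the same attribute names assumed for CLIPModel.
class DummyCLIP(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_encoder = nn.Linear(4, 4)
        self.text_encoder = nn.Linear(4, 4)
        self.image_projection = nn.Linear(4, 2)
        self.text_projection = nn.Linear(4, 2)

model = DummyCLIP()
params = [
    # Pretrained encoders: small learning rates.
    {"params": model.image_encoder.parameters(), "lr": 1e-4},
    {"params": model.text_encoder.parameters(), "lr": 1e-5},
    # Fresh projection heads: a larger learning rate is fine.
    {"params": itertools.chain(model.image_projection.parameters(),
                               model.text_projection.parameters()), "lr": 1e-3},
]
optimizer = torch.optim.AdamW(params, weight_decay=1e-3)
```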
  • I also wondered whether the step (images_similarity + texts_similarity) / 2 is correct. For images_loss, we can use images_similarity to find duplicate (similar) images and give them the same target. But can texts_similarity guide images_loss? I don't think so: text is ambiguous, and two similar captions don't imply that the corresponding images are similar. So I tried rewriting the loss to be more intuitive, as below.

    images_similarity = F.softmax(images_similarity, dim=-1)
    texts_similarity = F.softmax(texts_similarity, dim=-1)
    images_loss = cross_entropy(logits, images_similarity, reduction='none')
    texts_loss = cross_entropy(logits.T, texts_similarity, reduction='none')

Experiments

To verify my guesses, I ran some experiments.

① Mode Collapse

As mentioned above, if the learning rate of the two encoders is too large, mode collapse occurs. I won't repeat it here; feel free to try it yourself~

② Loss Function & Temperature

First I tweaked the Temperature hyperparameter (in the original paper it is a learnable parameter that eventually stabilizes around 100). My motivation: with similarity used as the target, the gap between positive and negative pairs is quite small (about 2.5%), so I wanted a smaller T to widen the gap. The result wasn't great, though (T=0.1, probably too small).
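A quick illustration of why T matters (the similarity values are made up for the example): dividing the logits by a smaller temperature before softmax widens the gap between the positive pair and the negatives.

```python
import torch
import torch.nn.functional as F

# Hypothetical similarity row: one positive (0.300) and three negatives.
sims = torch.tensor([0.300, 0.275, 0.270, 0.265])

soft = F.softmax(sims / 1.0, dim=-1)   # T = 1: targets nearly uniform
sharp = F.softmax(sims / 0.1, dim=-1)  # T = 0.1: positive stands out more
```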

clip_author_loss_1epochs_bs32_t10

The second experiment concerned the averaging step (images_similarity + texts_similarity) / 2. I compared it against the separate computation described above and found: basically no difference... valid_loss was nearly identical too.

clip_author_mine_loss_4epochs_bs32


My guess is that the batch size isn't large enough, i.e., the probability of duplicate images (texts) in a batch is too low. So I raised the batch size from 32 to 128 and found that averaging seems to speed up convergence, though why is still not very intuitive to me... (it may also just be no real difference~)
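A back-of-envelope check of this guess (my own estimate, not from the project): treating a batch of captions as random draws over N distinct images, the chance that two captions in a batch share an image is roughly a birthday problem. Assuming around 6000 training images for Flickr8k:

```python
def p_duplicate(batch_size: int, n_images: int = 6000) -> float:
    """Approximate probability that a batch contains two captions
    of the same image (birthday-problem style)."""
    p_unique = 1.0
    for i in range(batch_size):
        p_unique *= (1 - i / n_images)
    return 1 - p_unique

# Duplicates are rare at batch_size=32 but common at 128.
print(p_duplicate(32), p_duplicate(128))
```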

clip_author_mine_loss_4epochs_bs128


③ Prompt Engineering

The effect of prompt quality can be illustrated with two figures; I used model weights trained for 5 epochs.

With so many dogs as the query, the retrieved images include, besides dogs, some flocks of ducks or pigeons, probably because the text encoder also attends to many.

many_dogs

With a photo of dogs as the query, the retrieved images are all dogs.

photo_of_dogs

Of course, you can also do prompt ensembling with 80 prompts, as official CLIP does.

Summary

CLIP uses contrastive learning to align multimodal features, and its zero-shot ability is striking (it was trained on a 400-million-pair dataset, after all...). Thanks to its multimodal nature, it also enables many interesting applications such as image retrieval, editing, and generation.

These past two days I've been reading articles about PyTorch DistributedDataParallel (DDP) and found a series that's written quite well.

Although it covers torch.distributed.launch (soon to be replaced by torchrun), the overall ideas should still be a useful reference.
I ran into a few questions while reading, so here are some supplementary notes.

A supplement to a question in the SyncBN part (section 2.1.5, eval): in torch 1.13, as long as the module is in eval mode or track_running_stats=True, the running statistics (running_mean, running_var) are used for the computation. Source code below:

# torch.nn.modules.batchnorm
return F.batch_norm(
    input,
    # If buffers are not to be tracked, ensure that they won't be updated
    self.running_mean if not self.training or self.track_running_stats else None,
    self.running_var if not self.training or self.track_running_stats else None,
    self.weight,
    self.bias,
    bn_training,
    exponential_average_factor,
    self.eps,
)
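A small check of the behavior above: in eval mode, BatchNorm normalizes with its running statistics rather than the current batch statistics, so the output is no longer zero-mean when the running stats haven't caught up with the data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)           # running_mean=0, running_var=1 at init
x = torch.randn(8, 3) * 4 + 10   # stats far from the initial running stats

bn.train()
y_train = bn(x)   # normalizes with the batch stats, updates the buffers

bn.eval()
y_eval = bn(x)    # normalizes with running_mean / running_var instead

# y_train is (almost) zero-mean; y_eval is not, because the running
# stats were only nudged toward the batch stats by the momentum factor.
print(y_train.mean().item(), y_eval.mean().item())
```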

Typecho parses documents with Markdown, which left-aligns images by default and offers no convenient centering option (the common workaround is inline HTML, e.g., img and div tags, to center them).
Inlining HTML everywhere would be too much work for JJJYmmm. Looking at it from another angle, you can instead modify the CSS file used to render the page (if you use a theme, edit that theme's CSS file).
With the following rule, images in all posts are centered in one shot~

.your_class #your_id img {
    max-width:100%;
    margin:0 auto;
    display:block;
}
I later found that someone had already proposed this approach: https://zhuanlan.zhihu.com/p/474859854

A multi-task framework I wrote recently~
Project: https://github.com/JJJYmmm/Pix2SeqV2-Pytorch

Simple PyTorch implementation of Pix2SeqV2. This project references moein-shariatnia's Pix2Seq and the paper A Unified Sequence Interface for Vision Tasks.

overview

Introduction

Pix2Seq is a generalized framework for solving visual tasks proposed by Google. Essentially it treats visual tasks as language tasks: it generates sequences of tokens by auto-regression and obtains the output of many visual tasks (e.g., object detection, segmentation, captioning, keypoint detection, etc.) by decoding the tokens.
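To make that concrete, here is a hedged sketch (mine, not the repo's code) of how a bounding box plus class label can be quantized into discrete tokens so that detection becomes sequence generation. num_bins=1000 follows the paper; the vocabulary layout (class tokens placed right after the coordinate bins) is illustrative:

```python
def box_to_tokens(box, label_id, img_size=640, num_bins=1000):
    """box = (xmin, ymin, xmax, ymax) in pixels -> 5 integer tokens.
    Coordinates are quantized into num_bins discrete bins; the class
    token is offset past the coordinate vocabulary."""
    coord_tokens = [int(round(c / img_size * (num_bins - 1))) for c in box]
    return coord_tokens + [num_bins + label_id]

# One object: a box at (32, 64)-(320, 480) with class id 3.
tokens = box_to_tokens((32.0, 64.0, 320.0, 480.0), label_id=3)
```

Decoding simply inverts the mapping, so a single autoregressive decoder can emit boxes, polygons, keypoints, or caption words from one shared vocabulary.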

The official implementation of Pix2Seq, google-research/pix2seq, is written in TensorFlow, and I wished there were a PyTorch one. Then Shariatnia provided a simple implementation of Pix2SeqV1 (object detection only, no multi-task training). I followed his project and added the following:

  • For object detection, add support for the COCO2017 dataset.
  • Keep the network structure unchanged and add interfaces for more tasks (instance segmentation, image captioning, and keypoint detection).
  • Add support for multi-task training.

Some notes:

  • This project is just a simple PyTorch implementation. For the other tasks' interfaces I only referred to the original paper; please refer to the official implementation for more details.
  • Since this is a practice project, I only used a single RTX 3090 Ti GPU for training and inference. The main purpose is to verify the feasibility of multi-task training, so I didn't chase performance.
  • If you want to improve performance, try to 1) add more data augmentations, 2) train for more epochs, and 3) replace the model with one with more parameters, etc.

If you have more questions about this project, feel free to open issues and PRs!

Environment

I use anaconda to manage my python environment. You can clone my environment by doing this:

# change to root dir
cd Pix2SeqV2
# create a new python 3.8 env
conda create -n your_env_name python=3.8
# install essential packages
pip install -r ./requirements.txt

If you want to run the project, you need at least one GPU with more than 12 GB of memory. Of course, the more GPUs the better!

I haven't written the code for multi-GPU training, but it's coming soon.

Configurations

All configurations can be modified in the CFG class in Pix2SeqV2/config.py. Most of my training configurations come from Shariatnia's tutorials.

I use relative paths for other configs like weights and other required files. The only thing you need to change is the path of the dataset.

To fetch VOC dataset, just cd download and bash download_voc.sh

To fetch COCO2017 dataset, download here

# For VOC dataset, you need to change the following two var
img_path = '../download/VOCdevkit/VOC2012/JPEGImages'
xml_path = '../download/VOCdevkit/VOC2012/Annotations'
# For COCO dataset, you need to change dir_root
dir_root = '/mnt/MSCOCO'

I trained some weights for different tasks; you can fetch them here. Put them in the folder Pix2SeqV2/weights so that you don't need to change the corresponding configs.

A Small Demo

Before diving into the formal Train & Infer section, here is a small demo of the multi-task processing.

I trained the multi-task model weights for only 2 epochs (about 11 hours), covering four tasks (instance segmentation, object detection, image captioning, and keypoint detection). So the results are unsurprisingly poor, forgive me =v=. The weights can be downloaded here.

I randomly chose a picture (No. 6471) from the COCO validation set for visualization.

000000006471

Next, you can run the following code to get the results of the four tasks.

# make sure you're in the root directory and set the right weight path (multi_task_weight_path) in CFG
cd infer
python infer_single_image_multi_task.py --image ../images/baseball.jpg > result.txt

After that, you'll see three images (instance_segmentation.png, keypoint_detection.png, object_detection.png) and a text file (result.txt) in the infer directory.

result.txt shows all the predictions,

skipping pos_embed...
skipping pos_embed...
<All keys matched successfully>
Captioning:
[['baseball', 'player', 'swinging', 'his', 'bat', 'at', 'home', 'plate', '.']]
[['batter', 'during', 'the', 'game', 'of', 'baseball', 'game', '.']]
[['baseball', 'players', 'are', 'playing', 'baseball', 'in', 'a', 'field', '.']]
Bounding boxes:
[[ 15.665796 134.68234  130.5483   191.906   ]
 [262.4021    69.40819   90.07831  232.37599 ]
 [  0.        94.21238   15.665796  53.524773]
 [ 96.60574   78.54657   44.38643   61.357697]
 [206.26633  223.45518   28.720627  37.859024]
 [259.79114   72.01914   75.71802  229.76505 ]
 [ 97.911224 180.37425  137.07573  140.99214 ]
 [  0.        95.51784   19.582247  53.52481 ]]
Labels:
['person', 'person', 'person', 'person', 'baseball glove', 'person', 'person', 'person']
Keypoint list:
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 125, 198, 117, 168, 145, 237, 157, 0, 0, 261, 167, 180, 184, 205, 183, 173, 208, 204, 210, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
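The 34 numbers per person in the keypoint list above are the 17 COCO keypoints flattened as (x1, y1, x2, y2, ...), with zeros for keypoints the model did not predict. A quick way to unpack the first person:

```python
import torch

flat = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 125, 198, 117, 168, 145,
        237, 157, 0, 0, 261, 167, 180, 184, 205, 183, 173, 208, 204,
        210, 0, 0, 0, 0]

keypoints = torch.tensor(flat).reshape(17, 2)      # one (x, y) per keypoint
visible = keypoints[(keypoints != 0).any(dim=1)]   # keep predicted keypoints
```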

Three images visualize the results of the different visual tasks.

obeject_detection

instance_segmentation

keypoint_detection

The low recall of object detection task may be due to poor data augmentation and not enough training epochs.

The segmentation task performed OK given the detection box, as it was assigned the largest training weight and I followed the original paper's setting: repeat the prediction eight times to ensure recall.
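The "repeat the prediction" trick can be sketched roughly as follows. The paper averages the predicted mask probabilities across samples; here a simple majority vote over binary samples stands in for that, and the shapes are illustrative:

```python
import torch

def ensemble_masks(masks: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """masks: (k, H, W) independently sampled binary masks
    -> (H, W) mask keeping pixels that at least `threshold` of samples agree on."""
    return masks.float().mean(dim=0) >= threshold

# Three toy 2x2 mask samples; the top row is mostly agreed on.
samples = torch.stack([
    torch.tensor([[1, 1], [0, 0]]),
    torch.tensor([[1, 0], [0, 0]]),
    torch.tensor([[1, 1], [1, 0]]),
])
fused = ensemble_masks(samples)
```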

The keypoint detection task performed very poorly. I think there are a few reasons: first, it has the lowest weight in multi-task training; second, the bounding box I used in data augmentation seems too big (twice the size of the detection box, following the original paper's setup), resulting in more than one person per bbox.

Anyway, JJJYmmm's Pix2SeqV2 has taken the first step!!!

Training & Inference

Object Detection

object_detection

For object detection, you can run the following code to train Pix2Seq from scratch. Hyperparameters such as training epochs, learning rate, etc. can be set in ./config.py. And the weights are saved in the directory ./train.

# make sure you're in the root directory
cd train
python train_coco_object_detection.py # train on COCO2017
python train_voc_object_detection.py # train on VOC

Once the weights are obtained, you can run the code to infer a single image.

# make sure you're in the root directory
cd infer
python infer_single_image_object_detection.py --image your_image_path # COCO2017
python infer_single_image_voc.py --image your_image_path # VOC

The predictions(bounding boxes and labels) are printed in terminal and the results of visualization are saved in object_detection.png.

Training and prediction for the other tasks did not differ much from this task.

Instance Segmentation

segmentation

Code for training.

# make sure you're in the root directory
cd train
python train_coco_segmentation.py

Code for inference.

# make sure you're in the root directory
cd infer
python infer_single_image_segmentation.py --image your_image_path --box selected_area(format:xywh)

The results of visualization are saved in instance_segmentation.png.

Image Captioning

captioning

Code for training.

# make sure you're in the root directory
cd dataset
python build_captioning_vocab.py # generate vocab.pkl
# put the vocab.pkl to train folder or set the vocab_path in CFG
cd ../train
python train_coco_img_captioning.py

Code for inference.

# make sure you're in the root directory
cd infer
python infer_single_image_caption.py --image your_image_path

The results are printed in terminal.

Keypoint Detection

keypoint

Code for training.

# make sure you're in the root directory
cd train
python train_coco_segmentation.py

Code for inference.

# make sure you're in the root directory
cd infer
python infer_single_image_segmentation.py --image your_image_path --box selected_area(format:xywh)

The results of visualization are saved in keypoint_detection.png.

Multi-Task

Code for training.

# make sure you're in the root directory
cd train
python train_multi_task.py --task task1,task2,task3...
# supported tasks: detection,keypoint,segmentation,captioning

Code for inference.

# make sure you're in the root directory
cd infer
python infer_single_image_multi_task.py --image your_image_path

The text results are printed in terminal and the results of visualization are saved in object_detection.png, keypoint_detection.png, instance_segmentation.png.

Some Results

pix2seq_result_objection_detection

pix2seq_result_objection_detection2

pix2seq_result_instance_segmentation

pix2seq_result_keypoint_detection

Cite

  • Pix2seq: official implementation (TensorFlow)
  • Pix2seqV1 implementation (PyTorch)
  • Pix2seq paper:

    @article{chen2021pix2seq,
      title={Pix2seq: A language modeling framework for object detection},
      author={Chen, Ting and Saxena, Saurabh and Li, Lala and Fleet, David J and Hinton, Geoffrey},
      journal={arXiv preprint arXiv:2109.10852},
      year={2021}
    }
  • Pix2seq multi-task paper:

    @article{chen2022unified,
      title={A Unified Sequence Interface for Vision Tasks},
      author={Chen, Ting and Saxena, Saurabh and Li, Lala and Lin, Tsung-Yi and Fleet, David J. and Hinton, Geoffrey},
      journal={arXiv preprint arXiv:2206.07669},
      year={2022}
    }

Acknowledgement

《坐上那飞机去拉萨》 ("Take That Plane to Lhasa", civi fan version)