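Human-object interaction (HOI) image generation aims to produce images that satisfy a text description and depict a person interacting with an object, while staying as realistic and semantically faithful as possible. Text-to-image models have made notable progress in generating realistic images in recent years, but they still struggle to produce high-fidelity images whose main content is a human-object interaction. The difficulty has two sources: first, the complexity and diversity of human poses make plausible human generation hard; second, unreliable generation in the interaction boundary region (the region richest in interaction semantics) can leave the interaction semantics under-expressed.
To address these problems, a research team from Peking University proposes a pose- and interaction-aware human-object interaction image generation framework (SA-HOI). It uses the generation quality of human poses and interaction boundary region information as guidance for the denoising process, producing more plausible and more realistic human-object interaction images. To evaluate the quality of generated images thoroughly, the team also proposes a comprehensive benchmark for human-object interaction image generation.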
Paper link: https://proceedings.mlr.press/v235/xu24e.html
Project homepage: https://sites.html
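SA-HOI is a semantic-aware human-object interaction image generation method. It improves overall image quality and reduces generation artifacts from two angles: human pose and interaction semantics. By incorporating image inversion, it forms an iterative inversion and refinement pipeline that lets the generated image correct itself step by step.
In the paper, the team also proposes the first human-object interaction image generation benchmark covering human-object, human-animal, and human-human interactions, and designs targeted evaluation metrics for the task. Extensive experiments show that the method outperforms existing diffusion-based image generation methods on both the task-specific metrics and conventional image generation metrics.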
Method
As shown in Figure 1, the method proposed in the paper consists of two main designs: Pose and Interaction Guidance (PIG) and the Iterative Inversion and Refinement Pipeline (IIR).
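In PIG, given a text description of a human-object interaction and a noise sample, Stable Diffusion [2] first generates an initial image. A pose detector [3] then extracts the human joint positions and their confidence scores, from which a pose mask is built to highlight regions of low pose quality.
For interaction guidance, a segmentation model locates the interaction boundary region, yielding keypoints and their confidence scores. The interaction region is highlighted in an interaction mask to strengthen the semantic expression at the interaction boundary. At each denoising step, the pose mask and the interaction mask act as constraints that refine the highlighted regions, reducing the generation artifacts there. In addition, IIR uses an image inversion model N to extract the noise n and the text-description embedding t from an image that needs further refinement. PIG then refines the image again, a quality evaluator Q scores the refined image, and this cycle of inversion, guidance, and evaluation is repeated to improve image quality step by step.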
Pose and Interaction Guidance
Figure 2: Pseudocode for pose and interaction guided sampling
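The pseudocode for pose and interaction guided sampling is shown in Figure 2. At each denoising step, the predicted noise ϵt and the intermediate reconstruction are first obtained as in Stable Diffusion. A Gaussian blur G is then applied to the intermediate reconstruction to obtain two degraded latent features, one for pose and one for interaction, and the information in these latent features is introduced into the denoising process.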
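The step above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: `predict_z0` uses the standard DDPM relation between a noisy latent and its clean estimate, while `gaussian_blur` and `degrade` (blurring only inside a mask) are hypothetical helpers whose names, parameters, and masked-blend form are assumptions made for this example.

```python
import numpy as np

def predict_z0(z_t, eps_t, alpha_bar_t):
    """Standard DDPM estimate of the clean latent from the noisy latent z_t."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_t) / np.sqrt(alpha_bar_t)

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    xs = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(xs ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(latent, sigma=1.0, radius=3):
    """Separable Gaussian blur over the last two axes of a (C, H, W) array."""
    k = gaussian_kernel_1d(sigma, radius)
    out = latent.astype(np.float64)
    for axis in (-2, -1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")  # replicate border values
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out

def degrade(z0_hat, mask, sigma=1.0):
    """Assumed form: blur only where mask == 1, keep the rest unchanged."""
    blurred = gaussian_blur(z0_hat, sigma)
    return mask * blurred + (1.0 - mask) * z0_hat
```

In this sketch the same `degrade` call would be made twice per step, once with the pose mask and once with the interaction mask, giving the two degraded latents the text refers to.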
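The pose and interaction masks are used to produce the two degraded latent features and to highlight regions of low pose quality in them, guiding the model to reduce distorted generation in those regions. To steer the model toward improving low-quality regions, regions with low pose scores are highlighted with a Gaussian attention term defined for each joint.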
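Here x and y are the per-pixel image coordinates, H and W are the image height and width, and σ is the variance of the Gaussian distribution. Each term is an attention map centered on the i-th joint; combining the attention of all joints gives the final attention map, which a threshold then converts into a mask.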
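Here ϕt is the threshold used to produce the mask at time step t. Similarly, for interaction guidance, the authors use a segmentation model to obtain the object's contour points O and the human joint points C, compute the distance matrix D between the person and the object, and sample the interaction boundary keypoints from it. The interaction attention map and mask are then generated in the same way as for pose guidance and applied when computing the final predicted noise.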
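As a rough illustration of the attention and mask construction described above, the sketch below builds a per-joint Gaussian attention map, thresholds it with ϕt, and picks interaction keypoints from the distance matrix D. The exact formula is not reproduced in this article, so several choices are assumptions: weighting each joint by one minus its confidence, summing the joints, and taking midpoints of the closest contour-joint pairs as keypoints. All function names are hypothetical.

```python
import numpy as np

def pose_attention(joints, scores, height, width, variance=16.0):
    """joints: (N, 2) array of (x, y); scores: (N,) confidences in [0, 1].

    Assumption: each joint contributes a Gaussian weighted by (1 - score),
    so low-confidence joints produce high attention.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    attn = np.zeros((height, width), dtype=np.float64)
    for (jx, jy), s in zip(joints, scores):
        g = np.exp(-((xs - jx) ** 2 + (ys - jy) ** 2) / (2.0 * variance))
        attn += (1.0 - s) * g
    return attn

def to_mask(attn, phi_t):
    """Binary mask: 1 where attention exceeds the step threshold phi_t."""
    return (attn > phi_t).astype(np.float64)

def interaction_keypoints(contour, joints, k=3):
    """contour: (M, 2) object outline points O; joints: (N, 2) human joints C.

    Returns the midpoints of the k closest (contour point, joint) pairs
    together with the full distance matrix D of shape (M, N).
    """
    d = np.linalg.norm(contour[:, None, :] - joints[None, :, :], axis=-1)
    flat = np.argsort(d, axis=None)[:k]
    oi, ci = np.unravel_index(flat, d.shape)
    return (contour[oi] + joints[ci]) / 2.0, d
```

The keypoints returned by `interaction_keypoints` can be passed back through `pose_attention` and `to_mask` to obtain an interaction mask, mirroring the statement that the interaction branch reuses the pose-guidance procedure.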
Iterative Inversion and Refinement Pipeline
To assess the quality of generated images on the fly, the authors introduce a quality evaluator Q to guide the iterations. For the image from round k, Q produces a quality score, and the image for the next round is then generated from the current one. To preserve the main content of the image after refinement, the corresponding noise is needed as the starting point for denoising.
Such noise is not readily available, however, so an image inversion method is used to recover the image's noise latent and text embedding, which are fed to PIG to generate the refined result.
Comparing the quality scores of consecutive rounds determines whether to continue: when the difference between the two scores falls below a threshold θ, the process is considered to have refined the image sufficiently, so the optimization stops and the image with the highest quality score is output.
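The control flow of this loop can be sketched as below. The three callables `invert`, `refine`, and `evaluate` are hypothetical stand-ins for the inversion model N, the PIG sampler, and the quality evaluator Q; the `max_rounds` cap is an extra safeguard assumed for this example and is not mentioned in the article.

```python
def iterative_refine(image, invert, refine, evaluate, theta=0.01, max_rounds=5):
    """Invert, refine, and score until the score change drops below theta.

    Returns the best image seen, its score, and the number of rounds run.
    """
    best_image, best_score = image, evaluate(image)
    prev_score = best_score
    current = image
    rounds = 0
    for _ in range(max_rounds):
        noise, text_emb = invert(current)   # recover noise latent and text embedding
        current = refine(noise, text_emb)   # guided re-generation from that noise
        score = evaluate(current)
        rounds += 1
        if score > best_score:
            best_image, best_score = current, score
        if abs(score - prev_score) < theta:  # no significant change: stop
            break
        prev_score = score
    return best_image, best_score, rounds
```

Tracking `best_image` separately from `current` matters because a later round is not guaranteed to score higher than an earlier one, and the text states that the highest-scoring image is the one returned.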
Human-object interaction image generation benchmark (dataset + evaluation metrics)
Since no existing models or benchmarks are designed for the human-object interaction image generation task, the authors collected and assembled a benchmark for it, comprising a dataset of real human-object interaction images spanning 150 interaction categories and several evaluation metrics tailored to the task.
The dataset is filtered from the open-source human-object interaction detection dataset HICO-DET [5], yielding 150 interaction categories across three scenarios: human-object, human-animal, and human-human. In total, 5k real human-object interaction images were collected as the reference set for evaluating the quality of generated images.
To better evaluate generated human-object interaction images, the authors designed several metrics specific to the task, assessing images from three perspectives: authenticity, plausibility, and fidelity. For authenticity, pose distribution distance and person-object distance distribution measure how close the generated results are to real images: the closer the two distributions, the better the quality. For plausibility, the pose confidence score measures how credible and reasonable the generated human joints are. For fidelity, a human-object interaction detection task and an image-text retrieval task evaluate the semantic consistency between the generated image and the input text.
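To make the plausibility and authenticity ideas concrete, here is a small sketch. The article does not give the exact metric definitions, so both functions are assumptions: `pose_confidence_score` simply averages detector confidences over all joints, and `histogram_distance` uses total variation between normalized histograms as a stand-in for whatever distribution distance the paper uses. Names and parameters are hypothetical.

```python
import numpy as np

def pose_confidence_score(joint_scores):
    """Mean joint confidence over a set of images.

    joint_scores: list of 1-D sequences, one per generated image.
    """
    flat = np.concatenate([np.asarray(s, dtype=np.float64) for s in joint_scores])
    return float(flat.mean())

def histogram_distance(generated, real, bins=20, value_range=(0.0, 1.0)):
    """Total variation distance between two normalized histograms.

    Returns 0.0 for identical histograms and 1.0 for disjoint ones.
    Inputs are 1-D samples, e.g. normalized person-object distances.
    """
    g, _ = np.histogram(generated, bins=bins, range=value_range)
    r, _ = np.histogram(real, bins=bins, range=value_range)
    g = g / g.sum()
    r = r / r.sum()
    return 0.5 * float(np.abs(g - r).sum())
```

A lower `histogram_distance` between generated and reference samples corresponds to the "closer in a distribution sense" criterion, and a higher `pose_confidence_score` corresponds to more credible joints.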
Experimental results
Comparison with existing methods
The experimental results are shown in Table 1 and Table 2, which compare performance on human-object interaction image generation metrics and on conventional image generation metrics, respectively.
Table 2: Comparison with existing methods on conventional image generation metrics
The results show that the method outperforms existing models on several dimensions, including human body generation quality, interaction semantics, person-object distance, human pose distribution, and overall image quality.
The authors also ran a subjective evaluation in which multiple users rated the images on human body quality, object appearance, interaction semantics, and overall quality. The results indicate that SA-HOI better matches human preferences on every aspect.
Table 3: Subjective evaluation results compared with existing methods
In the qualitative experiments, the figure below compares the results generated by different methods for the same human-object interaction category description. In the upper group of images, the model using the new method accurately expresses the semantics of "kissing", and the generated human poses are more plausible. In the lower group, the method also alleviates the warping and distortion of the human body seen in other methods, and strengthens the semantics of "taking the suitcase" by generating the suitcase's handle in the region where the hand touches it, yielding results that surpass other methods in both human pose and interaction semantics.
References:
[1] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10684–10695, June 2022.
[2] HuggingFace, 2022. URL https://huggingface.co/CompVis/stable-diffusion-v1-4.
[3] Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C. C., and Lin, D. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[4] Mokady, R., Hertz, A., Aberman, K., Pritch, Y., and Cohen-Or, D. Null-text inversion for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794, 2022.
[5] Chao, Y.-W., Wang, Z., He, Y., Wang, J., and Deng, J. HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, 2015.