Optimal visual search based on a model of target detectability in natural images

NIPS 2020

Abstract

To analyse visual systems, the concept of an ideal observer promises an optimal response for a given task. Bayesian ideal observers can provide optimal responses under uncertainty, if they are given the true distributions as input. In visual search tasks, prior studies have used signal to noise ratio (SNR) or psychophysics experiments to …

Introduction
  • Humans have evolved a foveated visual system which, instead of processing an entire view with uniform resolution, receives higher spatial detail in the centre of the visual field.
  • Modelling a visual search task as a Bayesian ideal observer problem [8], the authors assume that the visual system computes a series of optimal eye movements that reduces the uncertainty of target location.
  • This assumes that the detectability of the target is known across the visual field.
  • This process has been modelled for simple targets in noise images, and has been shown to predict human eye movements in those images (a minimal sketch of the observer follows below).
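The core computation in this ideal-observer framework [8; 12] is a posterior update over candidate target locations, with each fixation contributing evidence weighted by the local detectability d′. Below is a minimal Python sketch of that update with a MAP ("fixate the most probable location") rule; the 1-D grid, the d′ falloff function, and the MAP rule itself are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D grid of candidate target locations (degrees).
locs = np.linspace(-8, 8, 33)
true_target = 12                    # index of the true target location
log_post = np.zeros_like(locs)      # uniform prior, in log space

def d_prime(ecc, d0=3.0, halfwidth=4.0):
    """Assumed falloff of detectability with eccentricity (illustrative)."""
    return d0 / (1.0 + (np.abs(ecc) / halfwidth) ** 2)

fixation = 0.0
for t in range(5):
    d = d_prime(locs - fixation)            # detectability from current gaze
    # Noisy template responses: mean +d'/2 at the target, -d'/2 elsewhere,
    # unit variance (the Gaussian model described in the Methods).
    mu = np.where(np.arange(locs.size) == true_target, d / 2, -d / 2)
    w = rng.normal(mu, 1.0)
    # Log-likelihood ratio of target-present vs. target-absent at each
    # location is d' * w for unit-variance Gaussians with means +/- d'/2.
    log_post += d * w
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    fixation = locs[np.argmax(post)]        # MAP rule: go to the best guess
    print(f"fixation {t + 1}: {fixation:+.1f} deg, max posterior {post.max():.2f}")
```

The fully ideal searcher in [9; 12] instead picks the fixation that maximizes the expected probability of localizing the target after the next fixation; the MAP rule above is a common, simpler approximation.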
Highlights
  • Humans have evolved a foveated visual system which, instead of processing an entire view with uniform resolution, receives higher spatial detail in the centre of the visual field
  • Modelling a visual search task as a Bayesian ideal observer problem [8], we assume that the visual system computes a series of optimal eye movements that reduces the uncertainty of target location
  • We present a novel approach for calculating the foveated detectability of an object target on natural background images, which can be fed to a Bayesian ideal observer in a visual search task
  • We propose a method for computing a foveated model of detectability for object targets in natural backgrounds that closely mimics human performance
  • A comparison of different feature pipelines confirms that more complex visual features, such as those computed by convolutional neural networks (CNNs), are needed to estimate the detectability of objects in natural images, though deeper CNN architectures are not necessarily better at producing human-like representations.
  • This work adds to a growing body of literature showing the potential of CNNs to serve as an approximation of the feed-forward visual processing pipeline in humans, enabling the development of more sophisticated models of visual attention and fixation control.
Methods
  • The authors describe the method for computing detectability of a given target on any natural background; the ideal observer model that the authors use for later comparison to human visual search [9]; and the psychophysics experiment that the authors used to establish ground truth for evaluating the models.

    3.1 Proposed model

    The architecture of the proposed model is presented in Figure 1.
  • Modelling the internal noise and uncertainty as Gaussian, the authors write the template response as a draw from unit-variance normal distributions with mean −d′/2 if the target is absent and +d′/2 if it is present [12].
  • Here d′ is interpreted as the discriminability of the two distributions, which gives an estimate of how well separated they are from each other.
  • The authors calculate these two probabilities using a classifier trained to classify patches as target-present or target-absent (one way to turn such a classifier into a d′ estimate is sketched below).
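Under the equal-variance Gaussian model above, one standard way to turn such a classifier into a d′ estimate is to convert its hit and false-alarm rates through the inverse normal CDF. This is a hedged sketch of that step, not necessarily the paper's exact computation:

```python
import numpy as np
from scipy.stats import norm

def d_prime_from_rates(hit_rate, fa_rate, eps=1e-3):
    """d' = z(hit rate) - z(false-alarm rate), with rates clipped away
    from 0 and 1 so the inverse normal CDF stays finite."""
    h = np.clip(hit_rate, eps, 1 - eps)
    f = np.clip(fa_rate, eps, 1 - eps)
    return norm.ppf(h) - norm.ppf(f)

# Hypothetical classifier decisions on held-out patches (1 = "present").
present_pred = np.array([1, 1, 1, 0, 1, 1, 0, 1])  # target-present patches
absent_pred = np.array([0, 0, 1, 0, 0, 0, 1, 0])   # target-absent patches
d = d_prime_from_rates(present_pred.mean(), absent_pred.mean())
print(f"estimated d' = {d:.2f}")
```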
Results
  • The authors evaluate the performance of the model using different feature representation pipelines to compute the detectability of the target against each background.
  • The patch size is scaled to 1.4, 1.8, 2.2, and 2.4 times its original size at the sample locations 1.8, 3.6, 5.4, and 7.2 degrees from the fovea.
  • As this scaling is not necessarily equivalent to the feature pooling in the human visual system, the authors fit the model detectabilities to the human detectabilities with an appropriate scaling and translation for each background (see the sketch after this list).
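The per-background fit described above is a one-dimensional affine regression from model d′ to human d′. A minimal sketch with hypothetical numbers (the arrays below are illustrative, not the paper's data):

```python
import numpy as np

# Hypothetical detectabilities at the four eccentricities for one background.
model_d = np.array([3.1, 2.4, 1.6, 1.1])
human_d = np.array([2.8, 2.0, 1.3, 0.9])

# Least-squares fit of human_d ~ a * model_d + b (scaling + translation).
a, b = np.polyfit(model_d, human_d, deg=1)
mse = np.mean((a * model_d + b - human_d) ** 2)
print(f"scale a = {a:.2f}, offset b = {b:.2f}, MSE = {mse:.4f}")
```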
Conclusion
  • The authors propose a method for computing a foveated model of detectability for object targets in natural backgrounds that closely mimics human performance.
  • A comparison of different feature pipelines confirms that more complex visual features, such as those computed by CNNs, are needed to estimate the detectability of objects in natural images, though deeper CNN architectures are not necessarily better at producing human-like representations.
  • This work adds to a growing body of literature showing the potential of CNNs to serve as an approximation of the feed-forward visual processing pipeline in humans, enabling the development of more sophisticated models of visual attention and fixation control.
Tables
  • Table 1: Mean squared error (MSE) and standard error of the mean (SE) of various models for predicting human detectability (see text).
Related Work
  • In recent years, visual search and fixation prediction models have been widely studied as an approach to better understand human visual attention and perception [16; 17]. Early work in this area extended existing saliency maps, which model the important locations in a visual scene that should be processed or searched [18]. A saliency map can be treated as a probability map of potential target locations and searched by choosing each next fixation at the location with the next highest probability [19; 20] (a toy sketch of this follows below). However, these models do not model target detectability and cannot predict optimal eye movement sequences.
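As a concrete toy reading of that search-by-saliency idea, the sketch below treats a hypothetical saliency map as a probability map and fixates locations in order of decreasing probability, zeroing out each visited location (a crude inhibition of return); it is illustrative only, not a specific model from [19; 20]:

```python
import numpy as np

rng = np.random.default_rng(1)
saliency = rng.random((8, 8))        # hypothetical saliency map
prob = saliency / saliency.sum()     # treat it as a probability map

fixations = []
for _ in range(3):
    idx = np.unravel_index(np.argmax(prob), prob.shape)
    fixations.append(idx)
    prob[idx] = 0.0                  # inhibition of return: don't revisit
print("fixation order:", fixations)
```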

    Other visual search models are based on biological aspects of the human visual system, and their predictions match closely with human eye movements. The authors in [9] proposed that an ideal observer, in an effort to find the target, directs their gaze to locations in the scene which will reduce uncertainty about target location. Statistics of human eye movements have been shown to match the ideal observer model, assuming that the signal to noise ratio (SNR) of a target to a background (detectability of the target) is known at all positions in the visual field [9]. Similarly to [9], which considers search for sine wave gratings on 1/f noise, most previous work uses simple targets and backgrounds for which calculating SNR is straightforward [11; 21; 13].
Funding
  • This facility (the Spartan HPC facility [49]) was established with the assistance of ARC LIEF Grant LE170100200.
Study Subjects and Analysis
observers: 12
Our automated prediction algorithm uses a trained logistic regression as a post-processing stage on top of a pre-trained deep neural network (a sketch of this step follows below). Eye-tracking data from 12 observers detecting targets on natural image backgrounds are used as ground truth to tune foveation parameters and evaluate the model, using cross-validation. Finally, the model of target detectability is used in a Bayesian ideal observer model of visual search and compared to human search performance.
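That "logistic regression on top of a pre-trained network" step might look like the sketch below. The choice of VGG-16 (the paper cites VGG [53]), the layer cut-off, the patch size, and the pooling are all assumptions for illustration; the paper's exact configuration may differ.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# Frozen, pre-trained VGG-16 truncated to an intermediate conv block.
vgg = models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()

def features(patches):
    """patches: float tensor (N, 3, 64, 64), ImageNet-normalized."""
    with torch.no_grad():
        f = vgg(patches)                  # (N, C, H, W) feature maps
    return f.mean(dim=(2, 3)).numpy()     # global average pool -> (N, C)

# Hypothetical labelled patches (1 = target present, 0 = target absent).
X = features(torch.randn(40, 3, 64, 64))
y = np.array([1, 0] * 20)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("P(target present):", clf.predict_proba(X[:2])[:, 1])
```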

texture samples: 21302
The backgrounds were 18 images from the ETH dataset [41], which consists of 21,302 texture samples from different categories. To choose the 18 backgrounds, we ran the entire dataset through our pipeline to find images with a wide range of apparent difficulty, resulting in the 18 backgrounds shown in Figure 3, with low, medium, and high detectabilities and a variety of image content and patterns (a minimal selection sketch follows below).
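One simple way to realize the "wide range of apparent difficulty" selection (an assumed procedure, not necessarily the paper's) is to score every texture with the model and take evenly spaced quantiles:

```python
import numpy as np

# Hypothetical model scores (e.g. predicted foveal d') for all ETH textures.
rng = np.random.default_rng(2)
scores = rng.gamma(2.0, 1.5, size=21302)

# Pick 18 backgrounds evenly spread across the difficulty range.
order = np.argsort(scores)
picks = order[np.linspace(0, scores.size - 1, 18).astype(int)]
print("selected texture indices:", picks)
```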

participants: 12
The screen width and distance were 48 cm and 67 cm respectively. In the detection task (Section 3.3.1), 12 participants (age range 20–40) judged whether the target occurred in the first or second frame in a two-alternative forced choice (2AFC) paradigm. The experimental procedure received ethics approval from the University of Melbourne Human Research Ethics Committee (ID: 1955695).

participants: 12
Each session was preceded by a short practice session of 20 trials with a sample background, to familiarize the participant with the target appearance, location, and experiment process. The averaged probability of correct detection across the 12 participants for each background is shown in Figure 4. The probability of correct detection pcorrect(r) for each background is fit to a sigmoid function of eccentricity r (a fit under one assumed parametrization is sketched below).
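The paper's exact sigmoid parametrization is not reproduced in this summary; the sketch below fits one common choice for 2AFC data, where accuracy falls from near 1 at the fovea to chance (0.5) at large eccentricity. The accuracy values are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def p_correct(r, r0, s):
    """Assumed 2AFC accuracy model: ~1 near the fovea, falling to chance
    (0.5) with eccentricity r; r0 is the midpoint, s the falloff width."""
    return 0.5 + 0.5 / (1.0 + np.exp((r - r0) / s))

r = np.array([1.8, 3.6, 5.4, 7.2])            # eccentricities (degrees)
acc = np.array([0.97, 0.88, 0.71, 0.60])      # hypothetical mean accuracies
(r0, s), _ = curve_fit(p_correct, r, acc, p0=[4.0, 1.0])
print(f"r0 = {r0:.2f} deg, s = {s:.2f} deg")
```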

References
  • [1] L. W. Renninger, P. Verghese, and J. Coughlan, "Where to look next? Eye movements reduce local uncertainty," Journal of Vision, vol. 7, no. 3, pp. 6–6, 2007.
  • [2] Y. Semizer and M. M. Michel, "Intrinsic position uncertainty impairs overt search performance," Journal of Vision, vol. 17, no. 9, pp. 13–13, 2017.
  • [3] D. G. Pelli, M. Palomares, and N. J. Majaj, "Crowding is unlike ordinary masking: Distinguishing feature integration from detection," Journal of Vision, vol. 4, no. 12, pp. 12–12, 2004.
  • [4] E. R. Kupers, M. Carrasco, and J. Winawer, "Modeling visual performance differences 'around' the visual field: A computational observer approach," PLoS Computational Biology, vol. 15, no. 5, p. e1007063, 2019.
  • [5] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 11, pp. 1254–1259, 1998.
  • [6] L. Itti and C. Koch, "A saliency-based search mechanism for overt and covert shifts of visual attention," Vision Research, vol. 40, no. 10-12, pp. 1489–1506, 2000.
  • [7] N. Liu and J. Han, "DHSNet: Deep hierarchical saliency network for salient object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 678–686, 2016.
  • [8] J. Najemnik and W. S. Geisler, "Eye movement statistics in humans are consistent with an optimal search strategy," Journal of Vision, vol. 8, no. 3, pp. 4–4, 2008.
  • [9] J. Najemnik and W. S. Geisler, "Simple summation rule for optimal fixation selection in visual search," Vision Research, vol. 49, no. 10, pp. 1286–1294, 2009.
  • [10] A. D. Clarke, P. Green, M. J. Chantler, and A. R. Hunt, "Human search for a target on a textured background is consistent with a stochastic model," Journal of Vision, vol. 16, no. 7, pp. 4–4, 2016.
  • [11] D. G. Pelli, Effects of Visual Noise. PhD thesis, 1981.
  • [12] J. Najemnik and W. S. Geisler, "Optimal eye movement strategies in visual search," Nature, vol. 434, no. 7031, pp. 387–391, 2005.
  • [13] B. S. Tjan, W. L. Braje, G. E. Legge, and D. Kersten, "Human efficiency for recognizing 3-D objects in luminance noise," Vision Research, vol. 35, no. 21, pp. 3053–3069, 1995.
  • [14] A. B. Sekuler, C. M. Gaspar, J. M. Gold, and P. J. Bennett, "Inversion leads to quantitative, not qualitative, changes in face processing," Current Biology, vol. 14, no. 5, pp. 391–396, 2004.
  • [15] L. Fridman, B. Jenik, S. Keshvari, B. Riemer, C. Zetzsche, and R. Rosenholtz, A Fast Foveated Fully Convolutional Network Model for Human Peripheral Vision, 2017.
  • [16] E. Kowler, "Eye movements: The past 25 years," Vision Research, vol. 51, no. 13, pp. 1457–1483, 2011.
  • [17] M. Hayhoe and D. Ballard, "Eye movements in natural behavior," Trends in Cognitive Sciences, vol. 9, no. 4, pp. 188–194, 2005.
  • [18] C. Koch and S. Ullman, "Shifts in selective visual attention: Towards the underlying neural circuitry," in Matters of Intelligence, pp. 115–141, Springer, 1987.
  • [19] A. Coutrot, J. H. Hsiao, and A. B. Chan, "Scanpath modeling and classification with hidden Markov models," Behavior Research Methods, vol. 50, no. 1, pp. 362–379, 2018.
  • [20] T. Judd, F. Durand, and A. Torralba, "A benchmark of computational models of saliency to predict human fixations," 2012.
  • [21] J. M. Gold, A. B. Sekuler, and P. J. Bennett, "Characterizing perceptual learning with external noise," Cognitive Science, vol. 28, no. 2, pp. 167–207, 2004.
  • [22] C. Dorronsoro, C. Walshe, S. Sebastian, and W. Geisler, "Separable effects of similarity and contrast on detection in natural backgrounds," Journal of Vision, vol. 18, no. 10, pp. 747–747, 2018.
  • [23] C. Oluk and W. S. Geisler, "Effects of target amplitude uncertainty, background contrast uncertainty, and prior probability are predicted by the normalized template-matching observer," Journal of Vision, vol. 19, no. 10, pp. 79c–79c, 2019.
  • [24] J. M. Wolfe and W. Gray, "Guided Search 4.0," Integrated Models of Cognitive Systems, pp. 99–119, 2007.
  • [25] R. Tudor Ionescu, B. Alexe, M. Leordeanu, M. Popescu, D. P. Papadopoulos, and V. Ferrari, "How hard can it be? Estimating the difficulty of visual search in an image," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2157–2166, 2016.
  • [26] A. Toet, F. L. Kooi, P. Bijl, and J. M. Valeton, "Visual conspicuity determines human target acquisition performance," Optical Engineering, vol. 37, no. 7, pp. 1969–1976, 1998.
  • [27] M. Pomplun, E. M. Reingold, J. Shen, and D. E. Williams, "The area activation model of saccadic selectivity in visual search," in Proceedings of the 22nd Annual Conference of the Cognitive Science Society, pp. 375–380, Lawrence Erlbaum Associates, Mahwah, NJ, 2000.
  • [28] J. Tsotsos, I. Kotseruba, and C. Wloka, "A focus on selection for fixation," 2016.
  • [29] C. Wloka, I. Kotseruba, and J. K. Tsotsos, "Active fixation control to predict saccade sequences," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3184–3193, 2018.
  • [30] G. Zelinsky, Z. Yang, L. Huang, Y. Chen, S. Ahn, Z. Wei, H. Adeli, D. Samaras, and M. Hoai, "Benchmarking gaze prediction for categorical visual search," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • [31] D. G. Pelli and K. A. Tillman, "The uncrowded window of object recognition," Nature Neuroscience, vol. 11, no. 10, pp. 1129–1135, 2008.
  • [32] A. Deza and M. Eckstein, "Can peripheral representations improve clutter metrics on complex scenes?," in Advances in Neural Information Processing Systems, pp. 2847–2855, 2016.
  • [33] A. Deza, A. Jonnalagadda, and M. Eckstein, "Towards metamerism via foveated style transfer," arXiv preprint arXiv:1705.10041, 2017.
  • [34] D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics, vol. 1. Wiley, New York, 1966.
  • [35] W. Peterson, T. Birdsall, and W. Fox, "The theory of signal detectability," Transactions of the IRE Professional Group on Information Theory, vol. 4, no. 4, pp. 171–212, 1954.
  • [36] D. Green and J. Swets, "Signal detection theory and psychophysics (rev. ed.)," Huntington, NY: RF Krieger, 1974.
  • [37] G. Lindsay, "Convolutional neural networks as a model of the visual system: Past, present, and future," Journal of Cognitive Neuroscience, pp. 1–15, 2020.
  • [38] D. L. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo, "Performance-optimized hierarchical models predict neural responses in higher visual cortex," Proceedings of the National Academy of Sciences, vol. 111, no. 23, pp. 8619–8624, 2014.
  • [39] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, IEEE, 2009.
  • [40] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, "Describing textures in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613, 2014.
  • [41] D. Dai, H. Riemenschneider, and L. Van Gool, "The synthesizability of texture examples," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [42] J. Freeman and E. P. Simoncelli, "Metamers of the ventral stream," Nature Neuroscience, vol. 14, no. 9, p. 1195, 2011.
  • [43] C. F. Staugaard, A. Petersen, and S. Vangkilde, "Eccentricity effects in vision and attention," Neuropsychologia, vol. 92, pp. 69–78, 2016.
  • [44] E. Peli, J. Yang, and R. B. Goldstein, "Image invariance with changes in size: The role of peripheral contrast thresholds," JOSA A, vol. 8, no. 11, pp. 1762–1774, 1991.
  • [45] K. A. Ehinger, B. Hidalgo-Sotelo, A. Torralba, and A. Oliva, "Modelling search for people in 900 scenes: A combined source model of eye guidance," Visual Cognition, vol. 17, no. 6-7, pp. 945–978, 2009.
  • [46] L. Raad and B. Galerne, "Efros and Freeman image quilting algorithm for texture synthesis," Image Processing On Line, vol. 7, pp. 1–22, 2017.
  • [47] J. F. Ackermann and M. S. Landy, "Choice of saccade endpoint under risk," Journal of Vision, vol. 13, no. 3, pp. 27–27, 2013.
  • [48] M. Michel and W. Geisler, "Saccadic plasticity in visual search," Journal of Vision, vol. 9, no. 8, pp. 403–403, 2009.
  • [49] L. Lafayette, G. Sauter, L. Vu, and B. Meade, "Spartan performance and flexibility: An HPC-cloud chimera," OpenStack Summit, Barcelona, vol. 27, 2016.
  • [50] S. M. Crouzet and T. Serre, "What are the visual features underlying rapid object recognition?," Frontiers in Psychology, vol. 2, p. 326, 2011.
  • [51] T. Serre, A. Oliva, and T. Poggio, "A feedforward architecture accounts for rapid categorization," Proceedings of the National Academy of Sciences, vol. 104, no. 15, pp. 6424–6429, 2007.
  • [52] J. Malik, S. Belongie, J. Shi, and T. Leung, "Textons, contours and regions: Cue integration in image segmentation," in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 918–925, IEEE, 1999.
  • [53] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
  • [54] R. Eyraud, E. Zibetti, and T. Baccino, "Allocation of visual attention while driving with simulated augmented reality," Transportation Research Part F: Traffic Psychology and Behaviour, vol. 32, pp. 46–55, 2015.
  • [55] P. Rane, H. Kim, J. L. Marcano, and J. L. Gabbard, "Virtual road signs: Augmented reality driving aid for novice drivers," in Proceedings of the Human Factors and Ergonomics Society Annual Meeting, vol. 60, pp. 1750–1754, SAGE Publications, Los Angeles, CA, 2016.
Authors
Shima Rashidi
Krista A Ehinger
Lars Kulik