Stanford Vision Lab's latest research: human eye movements can also be used to detect illness!?

Vision-based classification of developmental disorders using eye movements

Joint compilation: Zhang Min, Chen Yang Yingjie


This paper proposes a system for fine-grained classification of developmental disorders that measures an individual's eye movements using multimodal visual data. Although the system was designed to address a problem in psychiatry, we believe its underlying principles and general methodology will interest not only psychiatrists but also researchers and engineers in medical machine vision. The idea is to build features from different visual sources that capture information not contained in any single modality. Using an eye tracker and a camera that record a conversation between two people, we build temporal attention features that describe the semantic location of a person's gaze (where on the other person's face they are looking). In our clinical setting, these temporal attention features describe a patient's attention to fine-grained, discrete regions of the clinician's face and are used to classify the patient's specific developmental disorder.


Autism Spectrum Disorder (ASD) is a major developmental disorder with increasing prevalence and substantial social impact. Early diagnosis is key to proper treatment. ASD is also a highly heterogeneous disorder, which makes diagnosis particularly difficult. At present, identifying ASD requires a battery of cognitive tests and several hours of clinical evaluation, which involve extensively testing participants and observing their behavioral patterns (e.g., their social interactions with others). Computer-aided identification of autism is therefore an important goal: it may reduce diagnostic costs and improve standardization.

In this work we focus on Fragile X Syndrome (FXS). FXS is the most common known genetic cause of autism, and about 100,000 people in the United States are affected by it. Individuals with FXS show a range of developmental and cognitive impairments, including executive dysfunction, visual memory and sensory disturbances, social avoidance, communication disorders, and repetitive behaviors. In particular, as in ASD more generally, avoiding eye contact during social interaction is the most prominent behavioral feature of individuals with FXS. FXS is an important model for studying ASD because it is caused by a single gene mutation and is therefore easy to diagnose. For our purposes, focusing on FXS means that a ground-truth diagnosis is available and that the heterogeneity of symptoms in the affected group is reduced.

Maintaining appropriate social gaze is key to language development, emotion recognition, social engagement, and general learning through shared attention. Previous studies have shown that gaze fluctuation plays an important role in characterizing individuals with autism. In this work, we study the underlying patterns of visual fixation during dyadic (two-person) interactions. In particular, we use these patterns to characterize different developmental disorders.

We address two problems. The first challenge is to build new features that describe the fine-grained behavior of participants with developmental disorders. We use computer vision and multimodal data to capture detailed visual fixations during dyadic interactions. The second challenge is to use these features to build a system that can discriminate between different developmental disorders. The rest of the paper is structured as follows: in Section 2, we discuss previous work. In Section 3, we describe the raw data: its collection and the sensors used. In Section 4, we describe the features we built and analyze them. In Section 5, we describe our classification techniques, experiments, and results. In Section 6, we discuss the results.

Figure 1. (a) We use multimodal data from a remote eye tracker and a camera to study social interactions of individuals with mental disorders during an interview. The goal of the system is to use this data to achieve fine-grained classification of developmental disorders. (b) A video frame from the participant's perspective (the participant's head is visible at the bottom of the frame). A remote eye tracker tracks the participant's eye movements and maps them into the spatial coordinate system of this video.

2. Previous work

The pioneering work of Rehg et al. demonstrated the potential of using coarse gaze information to measure ASD-related behavior in children. However, that work did not address fine-grained classification between ASD and other disorders in an automated manner. We therefore extend it by developing a disorder classification method that uses multimodal data. In addition, some previous efforts on developmental disorders such as epilepsy and schizophrenia rely on electroencephalography (EEG) recordings. This method is very accurate, but it requires long recordings; moreover, placing EEG probes on participants' scalp and face limits its applicability to developmental populations. Eye tracking has also been used to study autism, but we are not aware of any automated system that uses eye tracking for cross-disorder assessment, as proposed here.

3. Data set

Our dataset consists of 70 videos of a clinician interviewing a participant, each overlaid with the participant's point of gaze (as measured by a remote eye tracker). The dataset was first reported in [6].

Participants were diagnosed with either an idiopathic developmental disorder (DD) or fragile X syndrome (FXS). Participants with DD showed levels of autism symptoms similar to those of participants with FXS, but none of them had a diagnosis of FXS or any other known genetic syndrome. There are known sex-related behavioral differences among participants with FXS, so we further subdivided this group by sex into male (FXS-M) and female (FXS-F). There were no sex-related behavioral differences in the DD group, and genetic testing confirmed that DD participants did not have FXS.

Participants were between 12 and 28 years old, with 51 FXS participants (32 male, 19 female) and 19 DD participants. The two groups were well matched in chronological and developmental age. They had similar mean scores on the Vineland Adaptive Behavior Scales (VABS), a well-established measure of developmental functioning. The mean score was 58.5 for individuals with FXS and 57.7 (SD = 16.78) for the DD controls, which indicates that the level of cognitive functioning in both groups was 2 to 3 SDs below the typical mean.

Participants were interviewed by a clinically trained experimenter. In our setup, the camera is placed behind the patient and faces the interviewer. Figure 1 depicts the configuration of the interview and the physical environment. Eye movements were recorded using a Tobii X120 remote corneal-reflection eye tracker, which was time-synchronized with the scene camera. The eye tracker was spatially calibrated to the scene camera by having the patient look at a set of known locations before the interview.

4. Visual gaze characteristics

Our goal is to design features that provide insight into these disorders and allow them to be classified accurately. These features are the building blocks of our system, and the key challenge is to extract the most meaningful information from the raw eye-tracker and video data. We captured the distribution of the participant's gaze over the interviewer's face, sampled 5 times per second throughout the interview. There are 6 regions of interest: nose, left eye, right eye, mouth, jaw, and outside the face (non-face). Detecting these regions precisely lets us examine changes in the participants' attention at a fine scale. For each video frame, we used a part-based model to detect a set of 69 landmarks on the interviewer's face. Figure 1 shows an example of landmark detection. In total, we processed 14,414,790 landmarks. We computed 59K, 56K, and 156K frames for the DD, FXS-F, and FXS-M groups, respectively. We evaluated a random sample of 1K frames, of which only a single frame was mis-annotated. We use a linear transformation to map the eye-tracking coordinates to the facial landmark coordinates. Our feature takes the label of the landmark cluster (for example, jaw) that is closest to the participant's point of gaze. Next, we present some descriptive analysis of these data.
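The mapping from a gaze point to a face region can be sketched roughly as follows. This is only an illustration under stated assumptions: the transformation is written as a 2x2 matrix plus an offset, and the names `apply_affine`, `gaze_region`, and the `max_dist` cutoff for deciding that a gaze point falls outside the face are all hypothetical and do not come from the paper.

```python
from math import hypot

NON_FACE = 0  # assumed label for gaze that lands outside the face


def apply_affine(point, A, b):
    """Map an eye-tracker point (x, y) into video coordinates as A @ p + b."""
    x, y = point
    return (A[0][0] * x + A[0][1] * y + b[0],
            A[1][0] * x + A[1][1] * y + b[1])


def gaze_region(gaze, landmarks, A, b, max_dist):
    """Return the region label of the landmark nearest to the mapped gaze.

    landmarks: list of ((x, y), region_label) pairs in video coordinates.
    max_dist:  assumed cutoff; farther than this counts as non-face.
    """
    gx, gy = apply_affine(gaze, A, b)
    best_label, best_dist = NON_FACE, float("inf")
    for (lx, ly), label in landmarks:
        d = hypot(gx - lx, gy - ly)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= max_dist else NON_FACE
```

Applying `gaze_region` to every sample (5 per second) would yield one integer label per frame, which is the kind of sequence the later analyses operate on.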

Figure 2. Temporal analysis of attention to the face. The X axis represents time in frames (in 0.2-second increments). The Y axis represents each participant. Black dots mark the times when the participant was looking at the interviewer's face. White space means they were not.

Feature granularity. We want to analyze the relevance of our fine-grained attention features. Participants (especially those with FXS) spent only a small fraction of their time looking at the interviewer's face. Analyzing the time series of when each individual looked at the interviewer's face (see Figure 2), we observed high variance across participants. For example, the sequences of most FXS-F individuals could easily be confused with those of other groups.

Clinicians often hold that it is the distribution of fixations, not just a lack of fixation on the face, that appears to be related to the general symptoms of autism [8]. The distributions in Figure 3 support this view: DD and FXS-F are very similar, while FXS-M is different. FXS-M fixations are mainly concentrated on the mouth (4) and nose (1) regions.

Figure 3. Histograms of visual fixation for the different disorders. The X axis represents the fixation region, from left to right: nose (1), left eye (2), right eye (3), mouth (4), jaw (5). The histograms are computed over the data of all participants. For ease of visibility, we removed non-face fixations.

Attention transitions. In addition to the distribution of fixations, clinicians also believe that the order of fixations describes underlying behavior. In particular, FXS participants often glance at the face quickly and then look away or scan non-eye regions. Figure 4 shows the transitions between regions as heat maps. There are marked differences between the disorders: individuals with DD make more transitions, while those with FXS make significantly fewer, which is consistent with clinical intuition. Transitions between face regions discriminate the three groups better than transitions from non-face to face regions. FXS-M participants tend to shift their gaze frequently between the mouth and the nose, while the other two groups do not. DD participants show more movement between facial regions, with no apparent preference. The FXS-F pattern is similar to that of DD, although it is less pronounced.

Figure 4. Matrices of attention transitions for each disorder. Each cell [i, j] represents the aggregated number of transitions of attention from state i to state j across all participants in a group. The axes represent the different states: non-face (0), nose (1), left eye (2), right eye (3), mouth (4), and jaw (5).
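A transition matrix of this kind could be computed along the following lines. This is a minimal sketch, assuming each participant's gaze is already encoded as a sequence of integer states 0 to 5; the function names are hypothetical, and counting self-transitions (i to i) between consecutive samples is an assumption, since the text does not say how they were treated.

```python
def transition_counts(sequence, n_states=6):
    """Count transitions between consecutive gaze states in one sequence.

    counts[i][j] is the number of times state i is immediately followed
    by state j. Self-transitions (i == j) are counted here by assumption.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for prev, cur in zip(sequence, sequence[1:]):
        counts[prev][cur] += 1
    return counts


def aggregate_counts(sequences, n_states=6):
    """Sum the per-participant transition counts over a group."""
    total = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        counts = transition_counts(seq, n_states)
        for i in range(n_states):
            for j in range(n_states):
                total[i][j] += counts[i][j]
    return total
```

For example, the sequence `[0, 1, 1, 4, 1, 0]` contributes one count each to cells [0][1], [1][1], [1][4], [4][1], and [1][0].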

Approximate entropy. We next present an approximate entropy (ApEn) analysis, which provides a measure of how predictable a sequence is. Low entropy indicates a high degree of regularity in the signal. For each category (DD, FXS-F, FXS-M), we selected 15 participants at random. We computed ApEn for different values of w (the sliding window length). Figure 5 depicts this analysis. We can see large differences between individuals, and many participants have entropy similar to that of participants in other groups. This high variability of the data sequences makes them difficult to classify.

Figure 5. (a)-(c) ApEn analysis of each group's data for different values of the window length parameter w. The Y axis represents ApEn, and the X axis represents the parameter w. Each line represents the data of one participant. We observe large differences among individuals.
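For readers unfamiliar with ApEn, the standard definition can be sketched as below. This is not necessarily the exact procedure used in the paper: here the window length `m` plays the role of w (an assumption), and the tolerance `r` defaults to 0, so two windows of categorical gaze labels match only when they are identical.

```python
from math import log


def _phi(seq, m, r):
    """Average log-fraction of length-m windows within tolerance r of each window."""
    n = len(seq) - m + 1
    windows = [seq[i:i + m] for i in range(n)]
    total = 0.0
    for a in windows:
        # Chebyshev distance between windows; every window matches itself.
        matches = sum(
            1 for b in windows
            if max(abs(x - y) for x, y in zip(a, b)) <= r
        )
        total += log(matches / n)
    return total / n


def approximate_entropy(seq, m, r=0):
    """ApEn(m, r) = phi(m) - phi(m + 1); lower values mean a more regular sequence."""
    if m < 1 or len(seq) < m + 1:
        raise ValueError("sequence too short for window length m")
    return _phi(seq, m, r) - _phi(seq, m + 1, r)
```

A constant sequence gives an ApEn of zero, a strictly alternating sequence gives a value close to zero, and an irregular sequence gives a noticeably larger value.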

5. Classifier

The goal of this work is to create an end-to-end system that classifies developmental disorders from raw visual information. So far, we have introduced features that capture social attention information and analyzed their temporal structure. Next, we need to construct methods that can exploit these features to predict a patient's specific developmental disorder.

Model (RNN). Recurrent neural networks (RNNs) are a generalization of feedforward neural networks to sequences. Our deep learning model is an adaptation of the attention-enhanced RNN architecture proposed by Hinton et al. (LSTM+A). This model has obtained remarkable results in other fields, particularly language modeling and speech processing. Our feature sequences fit this data model well. In addition, an encoder-decoder RNN architecture lets us experiment effectively with sequences of varying length. Our actual model differs from LSTM+A in two ways. First, we use GRU cells instead of LSTM cells; they use less memory and fit our data better. Second, our decoder produces a single output value (i.e., the class). The decoder is a single-unit multilayer RNN (not unrolled) with a softmax output layer. In general, the model can be regarded as a many-to-one RNN, but we present it as a variant of LSTM+A because of its similarity to that model and its use of the attention mechanism.
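To make the many-to-one idea concrete, here is a deliberately simplified sketch: a single GRU layer reads a sequence of one-hot gaze labels and a softmax layer classifies the final hidden state. It omits the attention mechanism, the encoder-decoder split, and the three stacked layers of the actual model, and the parameter dictionary layout (`Wz`, `Uz`, `bz`, and so on) is purely illustrative.

```python
from math import exp, tanh


def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def gru_step(x, h, p):
    """One GRU update of hidden state h given input x and parameters p."""
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], x), matvec(p["Uz"], h), p["bz"])]  # update gate
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wr"], x), matvec(p["Ur"], h), p["br"])]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [tanh(a + b + c) for a, b, c in
            zip(matvec(p["Wh"], x), matvec(p["Uh"], rh), p["bh"])]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def softmax(logits):
    m = max(logits)
    e = [exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]


def classify_sequence(labels, p, n_states=6):
    """Run the GRU over one-hot gaze labels; return class probabilities."""
    h = [0.0] * len(p["bz"])
    for label in labels:
        x = [1.0 if i == label else 0.0 for i in range(n_states)]
        h = gru_step(x, h, p)
    logits = [a + b for a, b in zip(matvec(p["V"], h), p["c"])]
    return softmax(logits)
```

With untrained all-zero parameters the output is a uniform distribution over the classes; training would adjust the weights so that the final hidden state separates the two groups.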

In our experiment, we used three recurrent neural network structures: RNN_128: 3 layers of 128 cells; RNN_256: 3 layers of 256 cells; RNN_512: 3 layers of 512 cells. These parameters are selected based on our GPU memory allocation limit.

We trained our models for a total of 1,000 epochs. We trained on batches of sequences using stochastic gradient descent (SGD) with a maximum gradient norm of 0.5.

Other classifiers. We also trained shallow baseline classifiers. One uses a convolutional neural network (CNN) to exploit the local temporal relationships in our data. It consists of a single hidden layer of 6 convolutional units followed by pointwise nonlinearities. The feature vectors computed across the units are concatenated and fed to an output layer. We also trained Support Vector Machines (SVMs), Naive Bayes (NB) classifiers, and Hidden Markov Models (HMMs).

6. Experiments and results

By varying the classification methods described in Section 5, we performed a quantitative evaluation of the overall system. We assume that the patient's sex is known and select the clinically relevant comparisons DD vs FXS-F and DD vs FXS-M. In the experiments, we used 32 FXS-M, 19 FXS-F, and 19 DD participants. To keep the class distribution equal during training and testing, we constructed S_train and S_test by randomly shuffling the participants of each class, ensuring that the two participant classes were split 50%/50%. This process is repeated for each new training/test split so that the average classification result is representative of the entire group of participants. We assess the accuracy of our system by classifying the developmental disorder of each participant given their time-series feature data p. For N total participants, we create an 80%/20% training/test split such that no participant's data is shared between the two sets. For each experiment, we performed 10-fold cross-validation, where each fold is a new random 80/20 split of the participants, so roughly 80 participants were tested per experiment.
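The participant-level split might look like the sketch below. It assumes that class balance is obtained by downsampling the larger class to the size of the smaller one, which the text does not state explicitly; `balanced_split` and its parameters are hypothetical names.

```python
import random


def balanced_split(participants, test_frac=0.2, seed=0):
    """Split participant ids into train/test lists with equal class sizes.

    participants: dict mapping participant id -> class label.
    Each participant lands in exactly one of the two lists, so no
    participant's data is shared between training and testing.
    """
    rng = random.Random(seed)
    by_class = {}
    for pid, label in participants.items():
        by_class.setdefault(label, []).append(pid)
    # Assumed balancing strategy: keep as many ids per class as the smallest class has.
    n = min(len(ids) for ids in by_class.values())
    n_test = max(1, round(n * test_frac))
    train, test = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        ids = ids[:n]
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test
```

Under this assumption, DD vs FXS-M uses 19 participants per class, which gives 8 test participants per split and about 80 over ten splits (one per fold, each with a different `seed`).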

Table 1. Comparison of the accuracy of our system with other classifiers. Columns show the participant classification accuracy for the DD vs FXS-F and DD vs FXS-M binary tasks. The classifiers operate on time windows of 3, 10, and 50 seconds. We compare our RNN classifiers with the CNN, SVM, NB, and HMM algorithms.

Metrics. We consider the binary classification of an unknown participant as DD or FXS. We use a voting strategy: given a patient's data p = [f1, f2, ..., fT], we classify all subsequences of p of fixed length w using a sliding time window. In our experiments, w corresponds to 3, 10, and 50 seconds of video. To predict a participant's disorder, we apply max-voting over the classes. The participant's predicted class C is defined as:

C = argmax_{c ∈ {C1, C2}} Σ_{s ⊆ p, |s| = w} 1[Class(s) = c]

where C1, C2 ∈ {DD, FXS-F, FXS-M} and Class(s) is the output of the classifier given input s. We use 10 cross-validation folds to compute the average classifier accuracy.
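The max-voting rule described above can be sketched in a few lines. The `classify_window` callable stands in for any of the trained classifiers, and the `stride` parameter is an assumption (the text does not say whether windows overlap). At 5 samples per second, a 3-second window spans 15 frames.

```python
from collections import Counter


def predict_participant(p, w, classify_window, stride=1):
    """Predict a participant's class by majority vote over length-w windows.

    p:               sequence of per-frame gaze features
    w:               window length in frames
    classify_window: callable mapping one window to a class label
    stride:          assumed step between consecutive windows
    """
    votes = Counter(
        classify_window(p[i:i + w])
        for i in range(0, len(p) - w + 1, stride)
    )
    if not votes:
        raise ValueError("sequence is shorter than the window length")
    # On a tie, Counter.most_common keeps the label that was seen first.
    return votes.most_common(1)[0][0]
```

Participant-level accuracy is then the fraction of test participants whose voted class matches their diagnosis.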

The results are shown in Table 1. We found that the RNN_512 model with a 50-second time window yields the highest average accuracy. We suspect that the strong results of RNN_512 are related to its high capacity and its ability to represent complex temporal structure.

7. Conclusion

We demonstrate the use of computer vision and machine learning techniques in a cost-effective system to assist in the diagnosis of developmental disorders that show a visual phenotype in social interactions. We collected experimental data from participants with developmental disorders using video and a remote eye tracker. We built fine-grained visual attention features and used them to develop classification models that discriminate between FXS and idiopathic developmental disorders. Despite the high variance and noise in the signals, the high accuracy we obtain implies the presence of temporal structure in the data.

This work serves as a proof of concept of the ability of modern computer vision systems to assist in the diagnosis of developmental disorders. We can provide a high-probability prediction of the diagnosis of a specific developmental disorder from a short eye-movement recording. This system, and others like it, could significantly speed up the screening of individuals. Future work will consider extending this approach to a wider range of disorders and improving classification accuracy.

Via: Stanford Visual Laboratory

PS: This article was compiled by Lei Feng Network (search for the "Lei Feng Network" public account); reproduction without permission is not allowed.
