A Dynamic Vision Sensor Object Recognition Model with Fusion of Multi-spiking Attention Mechanisms

In this section, we propose a multi-spiking attention mechanisms fusion model for DVS object recognition. The overall architecture of the model is shown in Fig. 1. The model consists of four parts. The first is a spiking spatio-temporal feature extraction module, which extracts spatio-temporal features of DVS objects. The second is a trainable event-driven convolution module, which extracts local edge features. The third is a spiking self-attention mechanism module, which extracts global dependence features. The fourth is a feature fusion module, which fuses the features extracted by the preceding modules and makes the network decision.

Fig. 1 The overall structure of the multi-spiking attention mechanisms fusion model

Spiking Spatio-temporal Feature Extraction Module

To effectively utilize the spatio-temporal information of event stream data, we propose a spiking spatio-temporal feature extraction module. It consists of a spatio-temporal information statistics layer and a spatio-temporal feature extraction layer.

Spatio-Temporal Information Statistics Layer

The spatio-temporal information statistics layer directly receives the sparse event stream data. Event by event, it counts the number of times the convolution kernel parameter covers each location of the response map. To utilize the temporal information of the DVS objects, the layer reads the timestamps of two adjacent events, calculates the time interval between them, and multiplies it by a linear leakage rate to determine the leakage value. Before the spatio-temporal information of the next event is processed, the layer applies this leakage globally to the feature map. For simplicity of implementation, a linear leakage mechanism is considered. The process is shown in Fig. 2. Assume there are two neighboring events $e_1=(t_1,x_1,y_1)$ and $e_2=(t_2,x_2,y_2)$, and that the leakage rate is 0.1. For the event $e_1=(t_1,x_1,y_1)$, the value at feature map location $(x_1,y_1)$ is increased by 1 and the event timestamp $t_1$ is recorded. The time interval between these two neighboring events is $t_2 - t_1$. Assuming $t_2 - t_1$ is 3, the feature map is globally decreased by $0.1 \times 3 = 0.3$ before it is updated by the event $e_2=(t_2,x_2,y_2)$.

Fig. 2 The leakage process of spatio-temporal feature map

After processing the entire event stream data, the feature map generated by the spatio-temporal information statistics layer contains both the temporal and spatial information.
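
The linear leakage update described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `accumulate_with_leak`, the `(t, x, y)` tuple layout, and the `[y, x]` indexing are assumptions. The text does not say whether leaked values are clipped at zero, so the sketch leaves them unclipped.

```python
import numpy as np

def accumulate_with_leak(events, height, width, leak_rate=0.1):
    """Accumulate events into a feature map with global linear leakage.

    events: iterable of (t, x, y) tuples sorted by timestamp (assumed layout).
    """
    fmap = np.zeros((height, width), dtype=float)
    prev_t = None
    for t, x, y in events:
        if prev_t is not None:
            # Global leak: time interval between adjacent events times the leakage rate.
            fmap -= leak_rate * (t - prev_t)
        fmap[y, x] += 1.0  # the location of the current event is increased by 1
        prev_t = t         # record the timestamp for the next interval
    return fmap

# Two events three time units apart, as in the worked example in the text.
fmap = accumulate_with_leak([(0, 1, 1), (3, 2, 2)], height=4, width=4)
```

With a leakage rate of 0.1 and an interval of 3, every location is reduced by 0.3 before the second event is added, matching the example in the text.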

Spatio-Temporal Feature Extraction Layer

The spatio-temporal feature extraction layer receives the feature maps from the spatio-temporal information statistics layer and extracts the spatio-temporal features of the event stream data using spiking forms of the channel and spatial attention mechanisms. This layer consists of a spiking channel attention mechanism layer and a spiking spatial attention mechanism layer.

Traditional channel and spatial attention mechanisms rely on floating-point data, which leads to high computational power consumption [33]. Here we propose spiking channel and spatial attention mechanisms to reduce the power consumption of the network while extracting the spatio-temporal features of the event stream data. The spatio-temporal feature extraction layer first encodes the feature maps into sparse spiking sequences using spiking neurons, which serve as the fundamental computation units in SNNs by mimicking biological neurons that communicate through discrete spikes. Typical neuron models include the Integrate-and-Fire (IF), Leaky Integrate-and-Fire (LIF), Hodgkin–Huxley (H–H), and Izhikevich models [34]. The IF model offers fast and efficient computation but is overly simplified. The LIF model introduces a membrane leakage mechanism, making it more biologically realistic while maintaining computational efficiency. The H–H model captures detailed ionic channel dynamics using four coupled differential equations, but its high computational cost hinders large-scale simulation. The Izhikevich model, despite its biological plausibility, requires extensive parameter tuning [34]. The LIF neuron therefore offers a favorable trade-off between biological fidelity and computational feasibility, and is adopted in this study.

The following three equations describe the dynamic process of LIF neurons:

$$H[t]=V[t-1]+\frac{1}{\tau}\left(X[t]-(V[t-1]-V_{reset})\right)$$

(1)

$$S[t]=\theta(H[t]-V_{th})$$

(2)

$$V[t]=H[t](1-S[t])+V_{reset}S[t]$$

(3)

where Eq. 1 is the neuronal charging process: $H[t]$ denotes the membrane potential after charging at time step $t$, $V[t-1]$ denotes the membrane potential at time step $t-1$, $\tau$ is the membrane time constant, and $X[t]$ is the input current at time step $t$. Equation 2 is the spike emission process: when the membrane potential $H[t]$ reaches or exceeds the discharge threshold $V_{th}$, the LIF neuron emits a spike $S[t]$. Here $\theta(x)$ is a step function whose value is 1 when the input is greater than or equal to 0, and 0 otherwise. Equation 3 is the membrane potential reset process: $V_{reset}$ is the reset potential and $V[t]$ is the membrane potential after the trigger event. If no spike is generated, $V[t]$ equals $H[t]$; otherwise it equals the reset potential $V_{reset}$.
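
As a concrete illustration of Eqs. 1 to 3, the following NumPy sketch steps a layer of LIF neurons through time. The function name `lif_forward` and the default values of `tau`, `v_th`, and `v_reset` are assumptions made for the example; the text does not state the values used in the model.

```python
import numpy as np

def lif_forward(x, tau=2.0, v_th=1.0, v_reset=0.0):
    """Run LIF neurons over an input of shape (T, ...) and return binary spikes."""
    x = np.asarray(x, dtype=float)
    v = np.full(x.shape[1:], v_reset, dtype=float)  # membrane potential V[t-1]
    spikes = np.zeros_like(x)
    for t in range(x.shape[0]):
        h = v + (x[t] - (v - v_reset)) / tau   # Eq. 1: charging
        s = (h >= v_th).astype(float)          # Eq. 2: step function theta
        v = h * (1.0 - s) + v_reset * s        # Eq. 3: reset where a spike fired
        spikes[t] = s
    return spikes

# A constant input current of 1.5 over four time steps for a single neuron.
spikes = lif_forward(np.full((4, 1), 1.5))
```

With these assumed parameters the neuron charges to 0.75 on the first step, reaches 1.125 and fires on the second, is reset, and then repeats the same pattern.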

The spiking sequence of the LIF neurons is fed to the spiking channel attention mechanism layer, which assigns a different weight to each time step to help the network emphasize the time steps that carry more important features. We first apply global average pooling and global max pooling along the spatial dimensions (assuming the data format is $T \times H \times W$, where $T$ is the number of time steps, $H$ is the feature map height, and $W$ is the feature map width):

$$\begin{aligned}F_t^{avg}&=AvgPool(F_t),\\F_t^{max}&=MaxPool(F_t)\end{aligned}$$

(4)

which yield two descriptors of shape $T \times 1 \times 1$. Here $AvgPool$ denotes the average pooling operation, $MaxPool$ denotes the maximum pooling operation, and $F_t$ denotes the feature map at time step $t$. These descriptors are fed to a multilayer perceptron (MLP) to compute the temporal attention vector:

$$M_C(F)=\sigma(MLP(F^{avg})+MLP(F^{max}))$$

(5)

where $M_C(F) \in \mathbb{R}^{T \times 1 \times 1}$ denotes the channel attention value, and $\sigma$ denotes the sigmoid function. The input feature map is reweighted as:

$$F'=M_C(F)\otimes F$$

(6)

where $\otimes$ denotes element-wise multiplication, with the channel attention value broadcast over the spatial dimensions.

The channel attention value obtained from the spiking channel attention mechanism layer represents the weights of each time step of the input, and the feature map is multiplied by the channel attention value before being fed to the spiking spatial attention mechanism layer. This layer assigns different weights to each spatial location of the feature map, helping the network to emphasize the locations that have more important features than others. It should be noted that the channel attention values generated by the formula are not “theoretically optimal solutions” derived directly from analytical forms, but rather effective attention weights learned adaptively through the training process. We perform global pooling along the temporal dimension:

$$F_{avg}^{s}=AvgPool_T(F'),\quad F_{max}^{s}=MaxPool_T(F')$$

(7)

yielding two spatial descriptors of size $1 \times H \times W$, where $F'$ is the feature map reweighted by the channel attention value. They are concatenated and passed through a convolutional layer to produce the spatial attention map:

$$M_S(F)=\sigma(Conv([F_{avg}^{s};F_{max}^{s}]))$$

(8)

where $M_S(F) \in \mathbb{R}^{1 \times H \times W}$ denotes the spatial attention value, $Conv$ denotes the convolution operation, and $\sigma$ denotes the sigmoid function.

The final spatio-temporal refined output is computed as:

$$F''=M_S(F)\otimes F'$$

(9)

where $F'$ is the output of the spiking channel attention mechanism layer, $M_S(F)$ is the spatial attention value, and $F''$ is the refined output. The product is taken element-wise.

The spatial attention value obtained from the spiking spatial attention mechanism layer represents the weights of each spatial location of the input. The output from the spiking channel attention mechanism is multiplied by the spiking spatial attention value. Through training, the network can assign weights to time steps and spatial locations, helping it to emphasize valuable features and improve feature extraction capabilities.
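
The two attention steps (pooling and MLP over time, then pooling and convolution over space) can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' layer: the function names, the two-layer MLP with a ReLU hidden layer, the kernel size, and the random weights are all hypothetical, and the outputs are left as real values because the text does not detail how the attention weights are converted back to spikes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def temporal_attention(f, w1, w2):
    """Eqs. 4-5: one weight per time step. f has shape (T, H, W)."""
    avg = f.mean(axis=(1, 2))                      # global average pooling -> (T,)
    mx = f.max(axis=(1, 2))                        # global max pooling -> (T,)
    mlp = lambda d: np.maximum(d @ w1, 0.0) @ w2   # shared MLP (ReLU hidden layer assumed)
    return sigmoid(mlp(avg) + mlp(mx)).reshape(-1, 1, 1)

def spatial_attention(f, kernel):
    """Eqs. 7-8: one weight per spatial location. kernel has shape (2, k, k)."""
    desc = np.stack([f.mean(axis=0), f.max(axis=0)])   # pooled over time -> (2, H, W)
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(desc, ((0, 0), (p, p), (p, p)))    # zero padding keeps H x W
    h, w = f.shape[1:]
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None]                          # shape (1, H, W)

def refine(f, w1, w2, kernel):
    f1 = temporal_attention(f, w1, w2) * f    # reweight each time step
    return spatial_attention(f1, kernel) * f1 # reweight each spatial location

rng = np.random.default_rng(0)
T, H, W, hidden = 4, 5, 5, 2
spikes = rng.integers(0, 2, size=(T, H, W)).astype(float)
refined = refine(spikes,
                 rng.normal(size=(T, hidden)),
                 rng.normal(size=(hidden, T)),
                 rng.normal(size=(2, 3, 3)))
```

Because both attention values pass through a sigmoid, each refined value lies between zero and the corresponding input value.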

Trainable Event-driven Convolution Module

Principles of Trainable Event-driven Convolution

The response map of traditional event-driven convolution is updated by summing the feature values at corresponding locations on the map with each parameter of the convolution kernel [3]. Assume that the kernel parameter index of the convolution kernel and the position index of the response map are shown in Fig. 3a. Assume that the convolution kernel value and the position of two successive events in an event stream are shown in Fig. 3b. The process of traditional event-driven convolution is shown in Fig. 3c. Whenever an event is input, the convolution kernel is centered on the position of this event in the response map, and each convolution kernel parameter covers one position of the response map. The response map is updated by adding the fixed convolution kernel to the current response map. The oriented edge filters emulate receptive fields of simple cells in the primary visual cortex, which respond strongly to features in a specific orientation [19]. Traditional event-driven convolution has the advantages of high temporal resolution and low latency, but the oriented edge filter parameters are fixed and only respond significantly to edge features in specific orientation, ignoring other rich features of the event stream data. If the event-driven convolution kernel parameters can be updated in network training, the feature extraction of the event stream could be more comprehensive.

Fig. 3 Traditional event-driven convolution and trainable event-driven convolution. (a) The index of the convolution kernel and the response map. (b) The convolution kernel and the events in the response map. (c) Traditional event-driven convolution. (d) Trainable event-driven convolution

We propose a trainable event-driven convolution. We find that after processing the raw event stream data with traditional event-driven convolution, the intensity values at each response map location can be represented as a linear combination of the kernel parameters, which can be formulated as follows:

$$Responsemap(\text{A})=n_{W_1}\times W_1+n_{W_2}\times W_2+n_{W_3}\times W_3+\cdots+n_{W_n}\times W_n$$

(10)

where $Responsemap(\text{A})$ denotes the intensity value at position A of the response map, $W_n$ denotes the convolution kernel parameter with index $n$, and $n_{W_n}$ denotes the number of times the convolution kernel parameter $W_n$ covers position A of the response map after the entire event stream has been processed. In other words, the intensity value at a position in the traditional event-driven convolution response map equals the result of convolving the coverage counts of that position with the convolution kernel parameters.

Therefore, we count the number of times each parameter of the convolution kernel covers each position of the response map. The response map of the trainable event-driven convolution is then obtained by convolving this parameter count feature map with a stride equal to the size of the convolution kernel. The two approaches are compared in Figs. 3c and 3d. In Fig. 3c, the response map is obtained by performing the convolution event by event, that is, by updating the response map after each event. In Fig. 3d, the response map is obtained by a single convolution of the parameter count feature map with the convolution kernel. The convolution kernel in our method can be updated using backpropagation and gradient descent. The trainable event-driven convolution module consists of a parameter counter and a trainable convolution layer.

Parameter Counter

The parameter counter is designed to count, event by event, the number of times each parameter of the convolution kernel covers each position of the response map. The size of the parameter count feature map is the product of the size of the convolution kernel and the size of the response map. Specifically, if the response map is $a \times a$ and the convolution kernel is $b \times b$, the size of the parameter count feature map is $(a \times b) \times (a \times b)$. Each value of the parameter count feature map is the number of times a certain parameter of the convolution kernel covers a certain position on the response map. For example, VW1 in Fig. 3d indicates the number of times the convolution kernel parameter W1 covers the position V of the response map, and VW2 is the number of times the parameter W2 covers the position V. Similarly, YW9 is the number of times the convolution kernel parameter W9 covers the position Y of the response map. We initialize the parameter count feature map to zero before processing an event stream. When the parameter counter receives the event $e_1=(t_1,x_1,y_1)$ shown in Fig. 3b, it centers on the position $(x_1,y_1)$ and updates the parameter count feature map by adding 1 to the value at each corresponding position; that is, AW1, BW2, CW3, FW4, GW5, HW6, KW7, LW8, and MW9 are each increased by 1. Similarly, for the event $e_2=(t_2,x_2,y_2)$ shown in Fig. 3b, the parameter counter centers on the position $(x_2,y_2)$, and RW1, SW2, TW3, WW4, XW5, and YW6 are each increased by 1.

After processing the entire raw event stream data, the value of each position on the parameter count feature map corresponds to the number of times each parameter of the event-driven convolution kernel has covered each position on the event-driven convolution response map.

Trainable Convolution Layer

The trainable convolution layer receives the parameter count feature map from the parameter counter and performs a convolution with a stride equal to the size of the convolution kernel. This layer consists of a convolution layer and a batch normalization layer. The convolution layer extracts the features of the parameter count feature map, while the batch normalization layer is used to accelerate network convergence, mitigate the problem of feature distribution dispersion, and ensure fast and stable network training [35]. After the convolution over the entire parameter count feature map, the result is the feature response map of the trainable event-driven convolution. This layer can update the convolution kernel parameters by backpropagation and gradient descent to extract the event stream features effectively.
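
The equivalence that motivates this design can be checked numerically. The NumPy sketch below builds the parameter count feature map, applies one convolution with stride equal to the kernel size, and compares the result with the event-by-event update. The function names, the `(t, row, col)` event layout, and the block layout of the count map (one $b \times b$ block per response map position) are assumptions chosen to match the description of Fig. 3d; batch normalization and training are omitted.

```python
import numpy as np

def event_driven_conv(events, kernel, size):
    """Traditional update: add the kernel around each event, event by event."""
    b = kernel.shape[0]
    c = b // 2
    resp = np.zeros((size, size))
    for _, r, col in events:
        for i in range(b):
            for j in range(b):
                rr, cc = r + i - c, col + j - c
                if 0 <= rr < size and 0 <= cc < size:
                    resp[rr, cc] += kernel[i, j]
    return resp

def parameter_count_map(events, b, size):
    """Count how often kernel parameter (i, j) covers response position (rr, cc)."""
    c = b // 2
    counts = np.zeros((size * b, size * b))
    for _, r, col in events:
        for i in range(b):
            for j in range(b):
                rr, cc = r + i - c, col + j - c
                if 0 <= rr < size and 0 <= cc < size:
                    counts[rr * b + i, cc * b + j] += 1
    return counts

def strided_conv(counts, kernel):
    """One convolution with stride equal to the kernel size (no overlap between blocks)."""
    b = kernel.shape[0]
    a = counts.shape[0] // b
    blocks = counts.reshape(a, b, a, b)   # blocks[rr, i, cc, j]
    return np.einsum('ibjc,bc->ij', blocks, kernel)

# Two events on a 5 x 5 response map with a 3 x 3 kernel; the second sits on the last row.
events = [(0, 1, 1), (3, 4, 3)]
kernel = np.arange(1.0, 10.0).reshape(3, 3)
counts = parameter_count_map(events, b=3, size=5)
same = np.allclose(strided_conv(counts, kernel), event_driven_conv(events, kernel, 5))
```

In this example the first event increments nine counters and the second, whose kernel partly falls outside the map, increments six, which matches the counts listed for Fig. 3b. Because the kernel now enters only through an ordinary convolution, its parameters can receive gradients.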

Spiking Self-attention Mechanism Module

After extracting local edge features of DVS objects using the trainable event-driven convolution, we further extract global dependence features of DVS objects using the spiking self-attention mechanism module, which consists of a patch split and flattened block and a Spikformer block [33].

Patch Split and Flattened Block

Since the spiking self-attention mechanism receives inputs in the form of tokens, the patch split and flattened block is used to transform the feature map from the trainable event-driven convolution module into tokens [33]. It consists of a convolution layer and a LIF neuron layer. Assume that the feature map of the trainable event-driven convolution is $Featuremap\in \mathbb{R}^{T\times C\times H\times W}$, where $T$, $C$, $H$, and $W$ are the number of time steps, the number of channels, the height of the feature map, and the width of the feature map, respectively. The convolution layers transform it into tokens in $\mathbb{R}^{T\times N\times D}$, where $N$ is the number of patches and $D$ is the dimension of each token. LIF neurons receive the tokens, generate currents from the dynamic equations, accumulate membrane potentials, emit spikes, and encode the tokens into spiking sequences $x\in Spikes^{T\times N\times D}$. To characterize the relative position information of the sequence, we introduce the conditional position encoder [36]. The conditional position encoder consists of a convolution layer, a batch normalization layer, and a LIF neuron layer. The encoder receives the spiking sequence $x \in Spikes^{T\times N\times D}$ encoded by the LIF neuron layer and uses it to generate a spiking sequence containing relative position information. This sequence is summed directly with the spiking sequence $x$ to obtain the spiking sequence $x_o\in Spikes^{T\times N\times D}$. The patch split and flattened block proceeds as follows.

$$\begin{aligned}x&=LIF(Conv2d(Featuremap)),\\Featuremap&\in \mathbb{R}^{T\times C\times H\times W},\quad x\in Spikes^{T\times N\times D}\end{aligned}$$

Spikformer Block

The Spikformer block receives the spiking sequence $x_o \in Spikes^{T\times N\times D}$ from the patch split and flattened block and extracts the global dependence features. It consists of the spiking self-attention mechanism module and a variable number of MLPs. The spiking self-attention mechanism module receives the spiking sequence $x_o$ directly. Following the self-attention mechanism, which has performed well in the Transformer model owing to its ability to capture global dependence features, it computes the key components $Q,K,V \in \mathbb{R}^{T\times N\times D}$ using three trainable weight matrices $W_Q,W_K,W_V \in \mathbb{R}^{D\times D}$. LIF neurons then encode $Q$, $K$, and $V$ into spiking sequences in $Spikes^{T\times N\times D}$.

The traditional self-attention mechanism consists of three floating-point components [12]: queries, keys, and values. It can capture global dependence features, but operating on floating-point matrices incurs a high computational power cost. To improve the feature extraction capability of the self-attention mechanism, multiple sets of floating-point components are generally used to compute the attention map, as in the multi-head attention mechanism, which makes the computational power cost higher than that of single-head attention. Therefore, we use the spiking self-attention mechanism to further extract global dependence features of DVS objects at a reduced computational power cost. The queries, keys, and values in the spiking self-attention mechanism are spiking sequences. Computing attention with AND and OR logical operations in place of floating-point matrix multiplication markedly lowers the computational power cost [37]. The attention feature maps generated by the spiking self-attention mechanism module are mapped to a higher-dimensional space by a variable number of MLPs. To avoid vanishing gradients, residual connections are used between the spiking self-attention mechanism block and the MLP.
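
To illustrate why binary queries and keys remove the need for floating-point multiplication, the NumPy sketch below computes attention scores for one time step by counting positions where both spikes are 1 (a logical AND followed by a sum) and checks that this equals the ordinary matrix product. The softmax-free form, the `scale` factor, and the simple threshold used in place of a LIF layer are assumptions in the style of Spikformer [33]; the function names are hypothetical.

```python
import numpy as np

def spike_attention_scores(q, k):
    """Scores between N query and N key tokens of dimension D (boolean spikes)."""
    # AND over the feature dimension, then count the coincidences.
    return np.logical_and(q[:, None, :], k[None, :, :]).sum(axis=-1)

def spiking_self_attention(q, k, v, scale=0.125, threshold=1.0):
    """Softmax-free attention on spikes, re-binarised by a threshold (assumption)."""
    scores = spike_attention_scores(q, k)   # (N, N) integer counts
    out = scores @ v.astype(int) * scale    # integer accumulation, then one scaling
    return (out >= threshold).astype(np.uint8)

rng = np.random.default_rng(0)
N, D = 4, 8
q, k, v = (rng.integers(0, 2, size=(N, D)).astype(bool) for _ in range(3))
scores = spike_attention_scores(q, k)
matches_matmul = np.array_equal(scores, q.astype(int) @ k.astype(int).T)
out = spiking_self_attention(q, k, v)
```

The check `matches_matmul` confirms that, for binary inputs, counting AND coincidences gives the same scores as the dense product of queries and keys.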

Feature Fusion Module

The feature fusion module fuses the feature maps and makes the network decision. Specifically, this module fuses the spatio-temporal features extracted by the spiking spatio-temporal feature extraction module, the local edge features extracted by the trainable event-driven convolution module, and the global dependence features extracted by the spiking self-attention mechanism module. The fusion is performed by summing the values at corresponding positions of the feature maps. The fused feature maps are fed to a fully connected spiking neuron layer. Before training, each neuron in this layer is assigned a category label, so that the number of output neurons matches the number of categories in the classification task. During decision-making, the frequency of the spikes emitted by each neuron in the output layer represents the decision of the network: the label of the neuron that emits the most spikes over the same duration is the final decision. The following is the pseudo-code of the feature fusion process.
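
As an illustration only (this is not the authors' pseudo-code), the fusion and decision steps described above can be sketched in Python. The sketch assumes that the three feature maps have already been flattened to a common shape `(T, D)`, that the output layer uses LIF neurons with a reset potential of 0, and that `weights`, `tau`, and `v_th` are hypothetical parameters.

```python
import numpy as np

def fuse_and_decide(f_st, f_edge, f_global, weights, tau=2.0, v_th=1.0):
    """Sum three feature maps, drive a fully connected LIF layer, pick the busiest neuron."""
    fused = f_st + f_edge + f_global            # element-wise sum, shape (T, D)
    n_classes = weights.shape[1]
    v = np.zeros(n_classes)                     # membrane potentials, reset value 0
    counts = np.zeros(n_classes, dtype=int)     # spikes emitted per output neuron
    for t in range(fused.shape[0]):
        x = fused[t] @ weights                  # input current to each output neuron
        h = v + (x - v) / tau                   # LIF charging
        s = h >= v_th                           # spike where the threshold is reached
        counts += s
        v = np.where(s, 0.0, h)                 # reset the neurons that fired
    return counts, int(np.argmax(counts))

# Toy case: four time steps, two features, three classes; only class 1 receives input.
ones = np.ones((4, 2))
weights = np.array([[0.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0]])
counts, label = fuse_and_decide(ones, ones, ones, weights)
```

In this toy case the neuron for class 1 fires at every time step while the others stay silent, so the predicted label is 1.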


COGNITIVE COMPUTATION
