Flies
For all experiments, we used four- to seven-day-old virgin flies collected from density-controlled bottles seeded with eight males and eight females. Fly bottles were kept at 25 °C and 60% relative humidity. Virgin flies were housed individually in behavioural incubators under a 12 h:12 h light:dark cycle; individual males were paired with a pheromone-insensitive and blind (PIBL) female to encourage longer courtship sessions (see Supplementary Table 1 for more information on genotypes). UAS-TNT-C was obtained from the Bloomington Stock Center. All LC split-GAL4 lines and the spGAL4 control line33,49 were generously provided by M. Reiser, A. Nern and G. Rubin (see Supplementary Table 2 for more information). We note that LC10 has seven different types (LC10a–g) whose genetic lines have not all been isolated; their names come from prior cell typing based on light microscopy12. LC10 genetic line names have not yet been mapped to the new types identified in the connectome.
Courtship experiments
Behavioural chambers were constructed as previously described9,50. Each recording chamber had a floor lined with white plastic mesh and was equipped with 16 microphones (Extended Data Fig. 1). Video was recorded from above the chamber at a 60 Hz frame rate; features for behavioural tracking were extracted from the video and downsampled to 30 Hz for later analysis. Audio was recorded at 10 kHz. Flies were introduced gently into the chamber using an aspirator. Recordings were timed to begin within 150 min of the behavioural incubator lights switching on to catch the morning activity peak. Recordings were stopped either after 30 min or after copulation, whichever came sooner. All flies were used; we did not apply any criteria (for example, whether males sang during the first 5 min of the experiment) to drop fly sessions from analyses. In total, behaviour was recorded and analysed from 459 pairs; the numbers of pairs per condition were as follows:
| LC type | LC4 | LC6 | LC9 | LC10a | LC10ad | LC10bc | LC10d | LC11 | LC12 |
|---|---|---|---|---|---|---|---|---|---|
| number of pairs | 17 | 19 | 18 | 13 | 15 | 16 | 16 | 14 | 14 |

| LC type | LC13 | LC15 | LC16 | LC17 | LC18 | LC20 | LC21 | LC22 | LC24 |
|---|---|---|---|---|---|---|---|---|---|
| number of pairs | 17 | 16 | 19 | 14 | 14 | 16 | 15 | 22 | 18 |

| LC type | LC25 | LC26 | LC31 | LPLC1 | LPLC2 | control | total |
|---|---|---|---|---|---|---|---|
| number of pairs | 16 | 18 | 24 | 16 | 17 | 75 | 459 |
Joint positions of the male and female were tracked in every frame with SLEAP51, a DNN for multi-animal pose estimation. We used the default parameter values and proofread the resulting tracks to correct errors. We estimated the presence of sine, Pfast and Pslow song for every frame by applying a song segmenter to the audio signals recorded from the chamber’s microphones, following a previous study35.
From the tracked joint positions and song recordings, we extracted the following six behavioural variables of the male fly that represented his moment-to-moment behaviour. (1) ‘Forward velocity’ was the difference between the male’s current position and his position one frame in the past; this difference in position was projected onto his heading direction (that is, the vector from the male’s thorax to his head). (2) ‘Lateral velocity’ was the same difference in position as computed for forward velocity except this difference was projected onto the direction orthogonal to the male’s heading direction; rightward movements were taken as positive. (3) ‘Angular velocity’ was the angle between the male’s current heading direction and the male’s heading direction one frame in the past; rightward turns were taken as positive, and angles were reported in degrees (that is, a turn to a male’s right is 90°, a turn to his left is −90°). (4) ‘Probability of sine song’ was computed as a binary variable for each frame, where a value of 1 was reported if sine song was present during that frame, else 0 was reported. (5) ‘Probability of fast pulse (Pfast) song’ and (6) ‘probability of slow pulse (Pslow) song’ were computed in the same manner as that for the probability of sine song. These six behavioural output variables described the male’s movements (forward, lateral, and angular velocity) as well as his song production (probability of sine, Pfast and Pslow song).
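The three movement variables can be computed from the tracked thorax and head positions. The following is a minimal NumPy sketch, not the authors' code; the function name and the sign conventions (which depend on the handedness of the tracker's coordinate frame) are our own assumptions:

```python
import numpy as np

FPS = 30  # tracking downsampled to 30 Hz

def male_velocities(thorax, head):
    """Forward, lateral and angular velocity per frame from thorax and head
    positions (arrays of shape (T, 2), in mm). Frame 0 has no predecessor,
    so outputs have length T - 1."""
    heading = head - thorax
    heading = heading / np.linalg.norm(heading, axis=1, keepdims=True)  # unit heading
    step = thorax[1:] - thorax[:-1]                  # displacement per frame
    h = heading[:-1]
    right = np.stack([h[:, 1], -h[:, 0]], axis=1)    # 90 deg clockwise of heading
    forward = np.sum(step * h, axis=1)               # mm per frame, along heading
    lateral = np.sum(step * right, axis=1)           # rightward taken as positive
    # signed angle between consecutive headings; the sign flip makes
    # clockwise (rightward) turns positive under a y-up coordinate frame
    h2 = heading[1:]
    cross = h[:, 0] * h2[:, 1] - h[:, 1] * h2[:, 0]
    dot = np.sum(h * h2, axis=1)
    angular = -np.degrees(np.arctan2(cross, dot))
    return forward, lateral, angular
```

For example, a male walking straight along his heading direction yields a positive forward velocity with zero lateral and angular components.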
Often a male fly spends periods of time without noticeable courtship of the female (for example, the ‘whatever’ state as defined in ref. 52). During these periods, the male probably does not rely much on the visual feedback of the female to guide his behaviour; this makes predicting his behaviour only from visual input difficult. In addition, these time periods can make up a large enough fraction of the training data to bias models to output ‘do nothing’ owing to the imbalanced training data. To mitigate these effects, we devised a set of loose criteria to identify ‘courtship frames’ in which the male is likely in a courtship state (for example, chasing or singing to the female); we then only train and test on these courtship frames.
We devised the following four criteria to determine if a frame is a courtship frame:
1. The male–female distance (taken between the joint positions of their thoraxes), averaged over the time window, is less than 5 mm.
2. The proportion of frames in which the male produced song (Pfast, Pslow or sine) during the time window is greater than 0.1.
3. The angle of the female’s location from the male’s heading direction (with respect to the male’s head), averaged over the time window, is no more than 45 visual degrees.
4. The male travels at least 4.5 mm s−1 towards the female, averaged over the time window.
The time window was 20 s long, centred on the candidate frame. Only one criterion needed to be met to classify a frame as a courtship frame. Given these criteria, roughly 70% of all frames in control sessions were considered as courtship frames. Although silencing an LC type likely alters the amount of courtship during a session, we ensured that enough courtship frames were present for training the model. LC9-silenced males had the lowest percentage of courtship frames over the entire session at 42% (consistent with its high male-to-female distance, Fig. 1e, top); the average across LC types was roughly 70% and similar to that of control sessions.
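These criteria are straightforward to express in code. Below is a minimal sketch (our own function and argument names) that applies the four thresholds to per-frame quantities over one 20-s window centred on a candidate frame:

```python
import numpy as np

def is_courtship_frame(dist_mm, sang, angle_deg, approach_mm_s):
    """Classify one candidate frame from per-frame arrays over its 20-s
    window: male-female distance (mm), a binary song indicator, the female's
    angular location relative to the male's heading (deg), and the male's
    velocity towards the female (mm/s). Only ONE criterion needs to be met."""
    c1 = np.mean(dist_mm) < 5.0                  # close to the female
    c2 = np.mean(sang) > 0.1                     # sang in >10% of frames
    c3 = np.mean(np.abs(angle_deg)) <= 45.0      # female near the front
    c4 = np.mean(approach_mm_s) >= 4.5           # approaching the female
    return bool(c1 or c2 or c3 or c4)
```

A window in which the male stays 3 mm from the female qualifies even if he never sings, because a single criterion suffices.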
Visual input reconstruction
To best mimic how a male fly transforms his retina’s visual input into behaviour, we desired an image-computable model (that is, one that takes as input an image rather than abstract variables determined by the experimenter, such as female size or male-to-female distance). We approximately reconstructed the male’s visual input based on pose estimation of both the male and female fly during courtship, as described in the following process. For each frame, we created a 64-pixel × 256-pixel greyscale image with a white background. Given the female rotation, size and location (see below), we placed an image patch of a greyscale fictive female (composed of ellipses that represented the head, eyes, thorax and tail of the female; no wings were included) occluding the white background. Because male flies perceive roughly 160 visual degrees on either side53, we removed from the image the 40 visual degrees directly behind the male, leading to images with 64 × 228 pixels. Example input images are shown in Fig. 1f, where the reconstructed female flies were coloured and on grey background for illustrative purposes. Example videos of input image sequences are present in Supplementary Videos 1 and 2.
We computed the female’s rotation, size and location in the following way. For female rotation, we computed the angle between the direction from the male’s head to the female’s body and the female’s heading direction. A rotation angle of 0° indicates the female is facing away from the male, ±180° indicates she is facing towards the male, and −90° or +90° indicates she is facing to the left or right of the male, respectively. We pre-processed a set of 360 image patches (25 × 25 pixels) that depicted a rotated female for each of 360 visual degrees; given the computed rotation angle, we accessed the corresponding image patch. For female size, we treated the female fly as a sphere (whose diameter matched the average length of a female fly from head to wing tips, ~4 mm) and computed the size as the visual angle subtended at the male’s head by the two outermost points on the sphere (that is, the two points along the horizontal centre line that maximize the visual angle); this angle was normalized so that a size of 1 corresponded to 180 visual degrees. This size determined the width (and height, equal to the width) of the selected image patch to be placed into the 64 × 228-pixel image. Note that size indicates the size of the image patch, not the apparent size of the fictive female (which may vary because a female facing away appears smaller than a female facing to the left or right). For reference, a fictive female with a size of 1.0, facing away from the male in the centre of his visual field, subtends 65 visual degrees with her body. For female position, we computed the visual angle between the male’s heading direction and the direction from the male’s head to the female’s body position.
We normalized this angle such that a position of 0 is directly in front of the male, a position of either −1 or 1 is directly behind the male fly, and a position of −0.5 or +0.5 is 90 visual degrees to the left or right, respectively. We then used this position to place the image patch (with its chosen rotation and size) at a certain pixel location along the horizontal centre line of the image. Because the male and female flies did not have room to fly in the experimental chamber, we assumed that only the female’s lateral position (and not vertical position) could change.
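The size and position computations above reduce to simple geometry. A minimal NumPy sketch under the stated assumptions (function names are ours; the size formula uses the full angle 2·arcsin(r/d) subtended by a sphere of radius r at distance d):

```python
import numpy as np

def female_size(distance_mm, diameter_mm=4.0):
    """Visual angle subtended by a sphere of the female's diameter, seen from
    the male's head, normalized so that 1.0 corresponds to 180 visual degrees."""
    half_angle = np.arcsin((diameter_mm / 2) / distance_mm)  # radians
    return np.degrees(2 * half_angle) / 180.0

def female_position(male_heading_deg, female_bearing_deg):
    """Signed angle from the male's heading to the female, normalized so that
    0 is straight ahead, +/-0.5 is 90 deg to the right/left, +/-1 is behind."""
    d = (female_bearing_deg - male_heading_deg + 180.0) % 360.0 - 180.0
    return d / 180.0
```

For instance, a female whose body centre sits 2 mm from the male's head fills the full 180° (size 1.0), and a female 90° to his right maps to a position of 0.5.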
Description of 1-to-1 network
We designed our 1-to-1 network to predict the male fly’s behaviour (that is, movement and song production) only from his visual input. Although the male can use other sensory modalities such as olfaction or mechanosensation to detect the female, we chose to focus solely on visual inputs because: (1) the male relies primarily on his visual feedback for courtship chasing and singing9,44; and (2) we wanted the model to have a representation solely based on vision to match the representations of visual LC neurons.
The 1-to-1 network comprised three parts: a vision network, an LC bottleneck, and a decision network (Fig. 1a). Hyperparameters, such as the number of filters in each layer, the number of layers, and the types of layers were chosen based on prediction performance assessed on a validation set of the control sessions separate from the test set. Unless specified, each convolutional or dense layer was followed by a batchnorm operation54 and a relu activation function. The 1-to-1 network took as input the images of the 10 most recent time frames (corresponding to ~300 ms)—longer input sequences did not lead to an improvement in predicting behaviour. Each greyscale image was 64 × 228 pixels (with values between 0 and 255) depicting a fictive female fly on a white background (see ‘Visual input reconstruction’). Before being fed into the network, the input was first re-centred by subtracting 255 from each pixel intensity to ensure the background pixels had values of 0. The model’s output was six behavioural variables of the male fly: forward velocity, lateral velocity, angular velocity, probability of sine song, probability of Pfast song, and probability of Pslow song (see ‘Courtship experiments’).
Vision network
The first layer of the vision network was a 2D spatial convolution with 32 filters (kernel size 3 × 3) and a downsampling stride of 2. The second and third layers were identical to the first except with separable 2D convolutions55. The final layer was a two-stage linear mapping56, which first spatially pools its input of activity maps and then linearly combines the pooled outputs across channels into 16 embedding variables; pooling the spatial inputs in this manner greatly reduced the number of parameters for this layer. Batchnorm and relus did not follow this two-stage layer. The vision network processed each of the 10 input images separately; in other words, the vision network’s weights were shared across time frames (that is, a 1D convolution in time). Allowing for 3D convolutions of the visual inputs (that is, 3D kernels for the two spatial dimensions and the third time dimension) did not improve prediction performance (Extended Data Fig. 3), likely because of the increase in the number of parameters. For simplicity, the vision network’s input was the entire image (that is, the entire visual field); we did not include two retinae. We found that incorporating two retinae into the model, while more biologically plausible, made it more difficult to interpret the tuning of each LC neuron type. For example, for a two-retinae model, it is difficult to determine whether differences in tuning for two model units of the same LC type but in different retinae are true differences in real LC types or instead differences due to overfitting between the two retinal vision networks. The 1-to-1 network avoids this discrepancy through the simplifying assumption that each LC type has a similar response across both retinae.
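The parameter saving of the two-stage readout comes from factorizing the full linear map into a spatial pooling stage and a channel-mixing stage. A shape-level NumPy sketch of one plausible factorization (the exact parameter sharing in ref. 56 may differ; names are ours):

```python
import numpy as np

def two_stage_readout(maps, spatial_w, channel_w):
    """Factorized linear readout: pool each activity map spatially with a
    per-channel spatial weight, then mix the pooled channels into embedding
    variables. maps: (H, W, C); spatial_w: (H, W, C); channel_w: (C, E).
    A full linear map would need H*W*C*E weights; this needs H*W*C + C*E."""
    pooled = np.einsum('hwc,hwc->c', maps, spatial_w)  # one scalar per channel
    return pooled @ channel_w                          # (E,) embedding
```

With uniform maps and uniform spatial weights, the output is just the channel mixture of per-channel means, which makes the factorization easy to sanity-check.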
LC bottleneck
The next component of the DNN model was the LC bottleneck, which received 10 16-dimensional embedding vectors corresponding to the past 10 time frames. These embedding vectors were passed through a dense layer with 64 filters followed by another dense layer with a number of filters equal to the number of silenced LC types (23 in total). We call the 23-dimensional output of this layer the ‘LC bottleneck’. Each model LC unit represents the summed activity of all neurons of the same LC type (that is, projecting to the same optic glomerulus), which makes it easy to compare with calcium imaging recordings of LC neurons, which track the overall activity level of a single glomerulus. We found that adding additional unperturbed ‘slack’ model LC units to match the total number of LC types (for example, 45 model LC units instead of 23 units) did not improve prediction performance; in the extreme case, adding a large number of slack variables encourages the network to ignore the ‘unreliable’ knocked-out units in favour of predicting shared behaviour across silenced and control sessions (that is, similar to training without knockout). For two perturbations (LC10ad and LC10bc), the genetic lines silenced two LC neuron types together. For simplicity, we assigned each of these to its own model LC unit, which represented the summed activity of all neurons from both types (for example, LC10a and LC10d for LC10ad). Because the LC bottleneck reads from all 10 past time frames, each model LC unit integrates information over time (for example, for motion detection). Additionally, the model LC responses are guaranteed to be non-negative because of the relu activation functions.
Decision network
The decision network took as input the activations of the 23 LC bottleneck units and comprised 3 dense layers, where each layer had 128 filters. The decision network predicted the movement output variables (forward velocity, lateral velocity, and angular velocity) each with a linear mapping and the song production variables (probability of sine, Pfast and Pslow song) each with a linear mapping followed by a sigmoid activation function.
Knockout training
We sought a one-to-one mapping between the model’s 23 LC units in its bottleneck and the 23 LC neuron types in our silencing experiments (Fig. 1a). To identify this mapping, we devised knockout training. We first describe the high-level training procedure and then give details about the optimization. For a randomly initialized 1-to-1 network, we arbitrarily assigned model LC units to real LC types (that is, in numerical order). For each training sample, we knocked out (that is, set to 0 via a mask) the model LC unit that corresponded to the silenced LC type; no model units were silenced for control data (Fig. 1b). This is similar to dropout training36 except that hidden units were purposefully—not randomly—chosen. The intuition behind knockout training is that the remaining unperturbed model LC units must encode enough information or ‘pick up the slack’ to predict the silenced behaviour; any extra information will not be encoded in the unperturbed units (as the back-propagated error would not contain this information). For example, let us assume that female size is encoded solely by LPLC1 and that this cell type contributes strongly to forward velocity. To predict the forward velocity of LPLC1-silenced males (which would not rely on female size), the other model LC units would need only to encode other features of the fictive female (for example, her position or rotation). In fact, any other model LC unit encoding female size would hurt prediction because forward velocity of LPLC1-silenced males does not depend on it. Another view of knockout training is that we optimize the model to predict behaviour while also constraining the model on which internal representations it may use. These constraints are set by the perturbations (for example, genetic silencing) we use in our experiments.
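The core of knockout training is a per-sample binary mask over the bottleneck. A minimal sketch of how such a mask could be built for the 23 silenced types listed above (the mask would be multiplied elementwise with the bottleneck activations before the decision network; the function name and dictionary layout are ours):

```python
import numpy as np

N_LC = 23
LC_INDEX = {name: i for i, name in enumerate([
    'LC4', 'LC6', 'LC9', 'LC10a', 'LC10ad', 'LC10bc', 'LC10d', 'LC11',
    'LC12', 'LC13', 'LC15', 'LC16', 'LC17', 'LC18', 'LC20', 'LC21',
    'LC22', 'LC24', 'LC25', 'LC26', 'LC31', 'LPLC1', 'LPLC2'])}

def knockout_mask(silenced_type):
    """Binary mask over the 23 model LC units: zero for the unit assigned to
    the sample's silenced LC type, one elsewhere. Control samples (None)
    leave every unit intact."""
    mask = np.ones(N_LC)
    if silenced_type is not None:          # None == control session
        mask[LC_INDEX[silenced_type]] = 0.0
    return mask
```

Unlike dropout, the zeroed unit is determined by the sample's genotype rather than drawn at random.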
The optimization details are as follows. The model was trained end-to-end using stochastic gradient descent with learning rate 10−3 and momentum 0.7. Each training batch had 288 samples, where each sample was a sequence of 10 images and 6 output values. Each batch was balanced across LC types (24 in total including control), where each LC type had 12 samples. The batch was also balanced for types of song (sine song, pulse song, or no song), as different flies sang different amounts of song. The model treated different flies for the same silenced LC type as the same to capture overall trends of an ‘average’ silenced fly. We z-scored the movement behavioural variables (forward, lateral, and angular velocity) based on the mean and standard deviation of the control data in order to have similarly sized gradients from each output variable. The loss functions were mean squared error for forward, lateral, and angular velocity and binary cross-entropy for the probabilities of sine, Pfast, and Pslow song. The model instantiation and optimization was coded in Keras (https://keras.io/) on top of Tensorflow57; we used the default random initialization parameters to initialize weights. We stopped training when prediction performance for forward velocity (evaluated on a validation set, see below) began to decrease (that is, early stopping).
Training and test data
After identifying courtship frames (see ‘Courtship experiments’), we split these frames into train, validation and test sets. To form a test set for a given LC type (or control), we randomly selected 3-s windows across all flies until we had 15 min of data (27,000 frames). Selecting windows instead of randomly choosing time frames ensured that no frame in the visual input of the test data overlapped with any training frames. For control sessions, after selecting the test set, we also randomly sampled from the remaining frames to form a validation set (27,000 frames) in the same way as we did for the test set; the validation set was used for hyperparameter choices and early stopping. All remaining frames were used for training. To balance the number of frames for each LC type and control, we randomly sampled at most 600,000 frames (~5.5 h) across sessions for each LC type and control. This ensured no single LC type or control was over-represented in the training data (that is, a class imbalance). In total, our training set had ~11.6 million training samples. To account for the observation that flies tend to prefer to walk along the edge of the chamber in either a clockwise or counter-clockwise manner—biasing lateral and angular velocities to one direction—we augmented the training set by flipping the visual input from left to right and correspondingly changing the sign of the lateral and angular velocities; each training sample had a random 50% chance of being flipped. No validation or test data were augmented.
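The left-right flip augmentation can be sketched as follows; a minimal illustration (names are ours), assuming images of shape (T, H, W) and the six output variables ordered as in the text, with lateral and angular velocity at indices 1 and 2:

```python
import numpy as np

def maybe_flip(images, outputs, rng):
    """With 50% probability, mirror the visual input left-right and negate
    the lateral (index 1) and angular (index 2) velocities; forward velocity
    and the three song probabilities are unchanged by the mirroring."""
    if rng.random() < 0.5:
        images = images[..., ::-1]          # flip the width axis
        outputs = outputs.copy()
        outputs[1] *= -1.0                  # lateral velocity
        outputs[2] *= -1.0                  # angular velocity
    return images, outputs
```

This symmetry-based augmentation counteracts the directional bias from flies circling the chamber edge in one preferred direction.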
Dropout and no knockout training
For comparison to knockout (KO) training, we considered three networks with the same architecture as the 1-to-1 network but trained with other procedures (Extended Data Fig. 3). First is the untrained network for which no training is performed (that is, all parameters remain at their randomized initial values). Second, we performed a version of dropout (DO) training36 by setting to 0 a randomly chosen model LC unit for each training sample independent of the sample’s silenced LC type; no model LC unit’s values are set to 0 for samples from control sessions. This training procedure knocks out the exact same number of units as that of knockout training. No dropout is performed during inference. Third, we consider training a network without knocking out (noKO) any model LC units. We trained the DO and noKO networks with the exact same data as that for KO training (a combined dataset of courtship sessions from 23 different LC types and control), but the DO and noKO networks were not given any information about which LC type was silenced for a training sample. This makes the DO and noKO fair null hypotheses: The DO and noKO networks assume that no change in behaviour occurs between LC-silenced males and control males, whereas the KO network attempts to find these differences. The DO and noKO networks helped us to ground the prediction performance of knockout training when predicting moment-to-moment behaviour (Extended Data Figs. 3 and 4) and real LC responses (Fig. 2e) as well as consistency in training (below).
Consistency across different training runs
Because DNNs are optimized via stochastic gradient descent, the training procedure of a DNN is not deterministic; different random initializations and different orderings of the training data may lead to DNNs with different prediction performances. To assess whether the 1-to-1 network is consistent across training runs, we trained 10 runs of the 1-to-1 network with different random initializations and different random orderings of training samples. For comparison, we also trained 10 networks either with dropout training or without knockout training (above) as well as 10 untrained networks. For a fair comparison across training procedures (knockout, dropout, without knockout and untrained), each run had the same parameter initialization and ordering of training samples. We compared the 1-to-1 network to these three networks by assessing prediction performance of moment-to-moment behaviour (Extended Data Fig. 3), overall mean changes to behaviour across silenced LC types (Extended Data Fig. 4), consistency both in behavioural predictions (Extended Data Fig. 5) and neural predictions (Extended Data Fig. 6), prediction performance of real LC responses for a one-to-one mapping (Fig. 2e and Extended Data Fig. 8) and prediction performance of real LC responses for a fitted linear mapping (Extended Data Fig. 8). We opted to investigate the inner workings of a single 1-to-1 network in Figs. 3 and 4 both for simplicity and because some analyses can only be performed on a single network (for example, the cumulative ablation experiments in Fig. 4). Different runs of the 1-to-1 networks had some differences in their predictions (Extended Data Figs. 5 and 6), but the overall conclusion that the LC bottleneck in the 1-to-1 network revealed a combinatorial requirement for multiple LC types to drive the male’s courtship behaviours remained true over all runs. For our analyses in Figs. 3 and 4, we chose the 1-to-1 network that had the best prediction for both behaviour and neural responses (model 1 in Extended Data Figs. 3 and 8).
Two-photon calcium imaging
We recorded LC responses of a head-fixed male fly using a custom-built two-photon microscope with a 40× objective and a two-photon laser (Coherent) tuned to 920 nm for imaging of GCaMP6f. A 562 nm dichroic split the emission light into red and green channels, which were then passed through a red 545–604 nm and green 485–555 nm bandpass filter, respectively. We recorded the imaging data from the green channel with a single plane at 50 Hz. Before head fixation, the male’s cuticle above the brain was surgically removed, and the brain was perfused with an extracellular saline composition. The male’s temperature was controlled at 30 °C by flowing saline through a Peltier device and measured via a water bath with a thermistor (BioscienceTools TC2-80-150). We targeted LC neuron types LC6, LC11, LC12, LC15 and LC17 (Fig. 2a) for their proximity to the surface (and thus better imaging signal), prior knowledge about their responses from previous studies29,30,31, and because they showed changes to male behaviour when silenced (Fig. 1e and Extended Data Fig. 1).
Each head-fixed male fly walked on an air-supported ball and viewed a translucent projection screen placed in the right visual hemifield (matching our recording location in the right hemisphere). The flat screen was slanted 40 visual degrees from the heading direction of the fly and perpendicular to the axis between the fly’s head and the centre of the screen (with a distance of 9 cm between the two). An LED projector (DLP Lightcrafter LC3000-G2-PRO) with a Semrock FF01-468/SP-25-STR filter projected stimulus sequences onto the back of the screen at a frame rate of 180 fps. A neutral density filter of optical density 1.3 was added to the output of the projector to reduce light intensity. The stimulus sequences (described below) comprised a moving spot and a fictive female that varied her size, position and rotation.
We recorded a number of sessions for each targeted LC type: LC6 (5 flies), LC11 (5 flies), LC12 (6 flies), LC15 (4 flies) and LC17 (5 flies). We imaged each glomerulus at the broadest cross-section, typically at the midpoint, given that we positioned the head of the fly to be flat (tilted down 90°, with the eyes pointing down). We hand selected regions of interest (ROIs) that encompassed the shape of the glomerulus within the 2D cross-section. We computed ΔF/F0 for these targeted ROIs using a baseline ROI for F0 that had no discernible response and was far from targeted ROIs. For each LC and stimulus sequence, we concatenated repeats across flies. To remove effects due to adaptation across repeats and differences among flies, we de-trended responses by taking the z-score across time for each repeat; we then scaled and re-centred each repeat’s z-scored trace by the standard deviation and mean of the response trace averaged across all the original repeats (that is, the original and denoised repeat-averaged trace had the same overall mean and standard deviation over time). To test whether an LC was responsive to a stimulus sequence or not, we computed a metric akin to a signal-to-noise ratio for each combination of LC type and stimulus sequence in the following way. For a single run, we split the repeats into two separate groups (same number of repeats per group) and computed the repeat-averaged response for each group. We then computed the R2 between the two repeat-averaged responses by computing the Pearson correlation over time and squaring it. We performed 50 runs with random split groups of repeats to establish a distribution of R2 values. We compared this distribution to a null distribution of R2 values that retained the timecourses of the responses but none of the time-varying relationships among repeats. 
To compute this null distribution, we sampled 50 runs of split groups (same number of repeats as the actual split groups) from the set of repeats for all stimulus sequences; in addition, the responses for each repeat were randomly reversed in time or flipped in sign, breaking any possible co-variation across time among repeats. For each combination of LC type and stimulus, we computed the sensitivity58 d′ between the actual R2 distribution and the null R2 distribution. We designated a threshold d′ > 1 to indicate that an LC was responsive for a given stimulus sequence (that is, we had a reliable estimate of the repeat-averaged response). After this procedure, a total of 27 combinations of stimulus sequence and LC type out of a possible 45 combinations remained (Extended Data Fig. 8).
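The split-half reliability test can be sketched compactly. A minimal NumPy illustration of the two core quantities (our function names; the null-distribution shuffling of time-reversal and sign flips is omitted for brevity):

```python
import numpy as np

def split_half_r2(repeats, rng):
    """R^2 between the repeat-averaged responses of two random halves of the
    repeats. repeats: array of shape (n_repeats, T)."""
    order = rng.permutation(len(repeats))
    half = len(repeats) // 2
    a = repeats[order[:half]].mean(axis=0)
    b = repeats[order[half:2 * half]].mean(axis=0)
    return np.corrcoef(a, b)[0, 1] ** 2

def d_prime(actual, null):
    """Sensitivity index between the actual and null R^2 distributions:
    difference of means over the pooled standard deviation."""
    pooled_sd = np.sqrt((np.var(actual) + np.var(null)) / 2)
    return (np.mean(actual) - np.mean(null)) / pooled_sd
```

Perfectly repeatable responses give a split-half R² of 1, and a d′ above the threshold of 1 marks the LC type as responsive to that stimulus sequence.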
We considered two types of stimulus sequences: a moving spot and a moving fictive female. The moving spot (black on isoluminant grey background) had three different stimulus sequences (Fig. 2b,c). The first stimulus sequence was a black spot with fixed diameter of 20° that moved from the left to right with a velocity chosen from candidate velocities {1, 2, 5, 10, 20, 40, 80} ° s−1; each sequence lasted 2 s. The second stimulus sequence was a spot that loomed from a starting diameter of 80° to a final diameter of 180° according to the formula \(\theta (t)=-2{\tan }^{-1}(-r/v\cdot 1/t)\), where r/v is the radius-to-speed ratio with units in ms and t is the time (in ms) until the object reaches its maximum diameter21 (that is, t = tfinal − tcurrent). A larger r/v corresponds with a slower object loom. We presented different loom speed ratios chosen from candidate r/v ∈ {10, 20, 40, 80} ms. Once a diameter of 180° was reached, the diameter remained constant. The third stimulus sequence was a spot that linearly increased its size from a starting diameter of 10° according to the formula θ = 10 + v ⋅ t, where v is the angular velocity (in ° s−1) and t is the time from stimulus onset (in seconds). The final diameter of the enlarging spot for each velocity (30°, 50°, 90° or 90°, respectively) was determined based on the chosen angular velocity v ∈ {10, 20, 40, 80} ° s−1. Once a diameter of 90° was reached, the diameter remained constant.
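The looming formula above can be checked numerically. A small sketch (our function name) that evaluates θ(t) = −2 tan⁻¹(−(r/v)·1/t) and applies the 180° cap:

```python
import numpy as np

def loom_diameter_deg(t_to_contact_ms, r_over_v_ms):
    """Angular diameter of a looming spot, theta(t) = -2*atan(-(r/v)/t),
    where t is the time (ms) until the spot reaches its maximum size and
    r/v is the radius-to-speed ratio (ms). Capped at the 180-deg maximum."""
    theta = -2 * np.arctan(-r_over_v_ms / t_to_contact_ms)
    return np.minimum(np.degrees(theta), 180.0)
```

As t approaches 0 the diameter approaches 180°, and a larger r/v gives a larger diameter at any given time to contact, consistent with slower looms appearing bigger earlier.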
The second type of stimulus sequence was a fictive female varying her size, position, and rotation. The fictive female was generated in the same manner as that for the input of the 1-to-1 network (see ‘Visual input reconstruction’). We took the angular size of the fictive female (65 visual degrees for a size of 1.0, where the female faces away from the male at the centre of the image) and used it to set the angular size of the fictive female on the projection screen. We considered three kinds of fictive female stimulus sequences with 9 different sequences in total (Supplementary Video 1 and Extended Data Fig. 8); we first describe them at a high level and then separately in more detail. The first kind consisted of sequences in which the female varied only one visual parameter (for example, size) while the other two parameters remained fixed (for example, position and rotation); we varied this parameter with three different speeds. Second, we generated sequences that optimized a model output variable (for example, maximizing or minimizing forward velocity). Third, we used a natural image sequence taken from a courtship session. Each stimulus sequence lasted for 10 s (300 frames).
Details of the fictive female sequences are as follows. For reference, a size of 1.0 is ~65 visual degrees, and a position of 0.5 is 90 visual degrees to the right from centre.
- Vary female position: the female varied only her lateral position (with a fixed size of 0.8 and a rotation angle of 0°, facing away from the male) from left to right (75 frames) then right to left (75 frames). Positions were linearly sampled in equal intervals over the range −0.1 to 0.5. This range of positions was biased to the right side of the visual field to account for the fact that the projection screen was oriented in the male’s right visual hemifield. After the initial pass of left to right and right to left (150 frames total), we repeated this same pass two more times with shorter periods (100 frames and 50 frames in total, respectively), interpolating positions in the same manner as the initial pass.
- Vary female size: the same generation procedure as for ‘Vary female position’ except that instead of position, we varied female size from 0.4 to 0.9 (sampled in equal intervals) with a fixed position of 0.25 and a rotation angle of 0°, facing away from the male.
- Vary female rotation: the same generation procedure as for ‘Vary female position’ except that instead of position, we varied the female rotation angle from −180° to 180° (sampled in equal intervals) with a fixed position of 0.25 and a fixed size of 0.8.
- Optimize for forward velocity: we optimized a 10-s stimulus sequence in which female size, position, and rotation were chosen to maximize the 1-to-1 network’s output of forward velocity for 5 s and then minimize it for 5 s. In a greedy manner, the next image in the sequence was chosen from candidate images to maximize the objective. We confirmed that this approach did yield large variations in the model’s output. To ensure smooth transitions, the candidate images were images ‘nearby’ in parameter space (that is, if the current size was 0.8, we would only consider candidate images with sizes in the range of 0.75 to 0.85). Images were not allowed to be the same in consecutive frames and had to have a female size greater than 0.3 and a female position between −0.1 and 0.5.
- Optimize for lateral velocity: the same generation procedure as for ‘Optimize for forward velocity’ except that we optimized for the model output of lateral velocity. In this case, maximizing or minimizing lateral velocity is akin to asking the model to output the action of moving to the right or left.
- Optimize for angular velocity: the same generation procedure as for ‘Optimize for forward velocity’ except that we optimized for the model output of angular velocity. In this case, maximizing or minimizing angular velocity is akin to asking the model to output the action of turning to the right or left.
- Optimize for forward velocity with fixed position: the same generation procedure as for ‘Optimize for forward velocity’ except that we limited female position p to the tight range 0.225 < p < 0.275. This ensured that most changes of the female stemmed from changes in female size or rotation, not position.
- Optimize for lateral velocity with multiple transitions: the same generation procedure as for ‘Optimize for lateral velocity’ except that we had four optimization periods: maximize for 2.5 s, minimize for 2.5 s, maximize for 2.5 s and minimize for 2.5 s.
- Natural stimulus sequence: a 10-s stimulus sequence taken from a real courtship session. This sequence was chosen to ensure large variation in the visual parameters and that the female fly was mostly in the right visual field, between positions −0.1 and 0.5.
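The greedy procedure used for the ‘Optimize …’ sequences above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `model_forward` interface (taking a parameter dict rather than rendered images), the candidate count, and the size upper bound are assumptions.

```python
import numpy as np

def greedy_optimize(model_forward, init_params, n_frames=300, sign_flip=150,
                    step=0.05, n_candidates=50, rng=None):
    """Greedily build a stimulus sequence that maximizes a scalar model
    output for the first `sign_flip` frames and minimizes it afterwards.

    model_forward(params) -> scalar output (for example, forward velocity).
    params: dict with 'size', 'position' and 'rotation' (hypothetical interface).
    """
    rng = np.random.default_rng(rng)
    seq = [dict(init_params)]
    for t in range(1, n_frames):
        cur = seq[-1]
        # candidate images are 'nearby' in parameter space (smooth transitions);
        # size and position are clipped to the allowed stimulus ranges
        candidates = [{
            'size': float(np.clip(cur['size'] + rng.uniform(-step, step), 0.3, 1.1)),
            'position': float(np.clip(cur['position'] + rng.uniform(-step, step), -0.1, 0.5)),
            'rotation': cur['rotation'] + rng.uniform(-step, step) * 360.0,
        } for _ in range(n_candidates)]
        sign = 1.0 if t < sign_flip else -1.0  # maximize, then minimize
        seq.append(max(candidates, key=lambda c: sign * model_forward(c)))
    return seq
```

With a toy objective (for example, the female’s size itself), the sequence first grows and then shrinks the chosen parameter, mirroring the maximize-then-minimize structure described above.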
For each recording session, we presented the stimuli in the following way. For the moving spot stimuli, each stimulus sequence was preceded by 400 ms of a blank, isoluminant grey screen. For the fictive female stimuli, a stimulus sequence of the same kind (for example, ‘Vary female size’) was presented in three consecutive repeats for a total of 30 s; this stimulus block was preceded by 400 ms of a blank, isoluminant grey screen. All stimulus sequences (both moving spot and the fictive female) were presented one time each in a random ordering. Another round (with the same ordering) was presented if time allowed; usually, we presented 3 to 4 stimulus rounds before an experiment concluded. This typically provided 9 or more repeats per stimulus sequence per fly.
Predicting real neural responses
To obtain the model predictions for the artificial moving spot stimuli (Fig. 2b,c), we generated a fictive female that faced away from the male and whose size and position matched those of the moving spot. This was done to prevent any artefacts from presenting a stimulus (for example, a high-contrast moving spot) on which the model had not been trained, as the model only observed a fictive female. We matched the angular size of the fictive female to that of the presented stimulus by using the measured conversion factor of 65 visual degrees for a fictive female size of 1.0. For the stimulus of the moving spot with varying speed (Fig. 2b), the fictive female translated from left to right (that is, the same as the stimuli presented to the male fly). Because the 1-to-1 network’s responses could remain constant and not return to 0 for different static stimuli (that is, no adaptation mechanism), we added a simple adaptation mechanism to the model’s responses such that if responses were the same for consecutive frames, the second frame’s response would decay towards its initial baseline response with a decay rate of 0.1. To obtain model predictions for the fictive female stimuli (Fig. 2d,e), we input the same stimulus sequences presented to the fly except that we changed the greyscale background to white (to match the training images).
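The adaptation mechanism described above can be sketched as a per-frame decay towards baseline whenever the raw response repeats; the function name and exact update rule are illustrative.

```python
import numpy as np

def apply_adaptation(responses, baseline=0.0, decay=0.1, tol=1e-8):
    """Decay a model unit's response towards baseline whenever the raw
    response is unchanged between consecutive frames (a simple stand-in
    for adaptation; the exact rule used in the study may differ)."""
    adapted = np.array(responses, dtype=float)
    for t in range(1, len(adapted)):
        if abs(responses[t] - responses[t - 1]) < tol:
            # move the held response a fraction `decay` back towards baseline
            adapted[t] = adapted[t - 1] + decay * (baseline - adapted[t - 1])
    return adapted
```

A constant response of 1.0 thus decays geometrically (0.9, 0.81, 0.729, …), while any change in the raw response resets the trace to the raw value.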
To evaluate the extent to which the 1-to-1 network predicted the repeat-averaged LC responses for each stimulus sequence of the moving fictive female, we sought an R2 prediction performance metric that accounted for the fact that our estimates of the repeat-averaged responses were noisy. Any metric not accounting for this repeat noise would undervalue the true prediction performance (that is, the prediction performance between a model and a repeat-averaged response with an infinite number of repeats). To measure prediction performance, we chose a noise-corrected R2 metric recently proposed59 that precisely accounts for noise across repeats and corrects for bias in estimating the ground truth normalized R2. A noise-corrected R2 = 1 indicates that our model perfectly predicts the ground truth repeat-averaged responses up to the amount of noise across repeats. We note that our noise-corrected R2 metric accounts for differences in mean, standard deviation, and sign between model and real responses, as these differences do not represent the information content of the responses.
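As a sketch of noise-corrected prediction performance, the snippet below uses the signal-power estimate underlying the CC_norm family of metrics as a stand-in; the bias-corrected estimator cited in the text (ref. 59) uses a different correction, so treat this as illustrative only.

```python
import numpy as np

def noise_corrected_r2(model_pred, repeats):
    """Simplified noise-corrected R^2 between a model prediction and
    repeat-averaged responses (repeats: n_repeats x n_frames).

    Squaring the normalized correlation makes the metric insensitive to
    differences in mean, scale, and sign, as described in the text.
    """
    repeats = np.asarray(repeats, dtype=float)
    n = repeats.shape[0]
    mean_resp = repeats.mean(axis=0)
    # estimate of the variance of the noise-free signal across frames
    signal_power = (np.var(repeats.sum(axis=0), ddof=1)
                    - repeats.var(axis=1, ddof=1).sum()) / (n * (n - 1))
    cov = np.cov(model_pred, mean_resp)[0, 1]
    cc_norm = cov / np.sqrt(np.var(model_pred, ddof=1) * signal_power)
    return cc_norm ** 2
```

When the repeats are noiseless copies of the prediction the metric is exactly 1; with finite repeat noise the estimate scatters around 1 rather than being biased downwards, which is the point of the correction.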
We computed this noise-corrected R2 between the 1-to-1 network and real responses for each LC type and stimulus sequence (Fig. 2e) for which the LC was responsive (that is, d′ > 1, see ‘Two-photon calcium imaging’). Importantly, the 1-to-1 network never had access to any neural data in its training; instead, for a given LC type, we directly took the response of the corresponding model LC unit as the 1-to-1 network’s predicted response. This is a stronger criterion than typical evaluations of DNN models and neural activity, where a linear mapping from DNN features (~10,000 feature variables) to neural responses is fit1. To account for the smoothness of real responses due to the imaging of calcium dynamics, we causally smoothed the predicted responses with a linear filter. We fit the weights of the linear filter (filtering the 10 past frames) along with the relu’s offset parameter (accounting for trivial mismatches due to differences in thresholding) to the real responses. This fitting only used responses of one model LC unit, keeping in place the one-to-one mapping; we also relaxed this constraint by fitting a linear mapping using all model LC units (Extended Data Fig. 8). We performed the same smoothing procedure not only for the 1-to-1 network but also for an untrained network, a network trained with dropout training, and a network trained without knockout (see ‘Knockout training’ above). This procedure was only performed for predicted responses in Fig. 2d,e and Extended Data Fig. 8. For analysing response magnitudes (Fig. 2f and Extended Data Fig. 7), the responses came directly from model LC units (that is, no smoothing or fitting of the relu’s offset was performed).
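The causal smoothing step above can be sketched as a least-squares fit of a 10-tap causal filter plus an intercept (the intercept standing in for the relu offset); whether the current frame counts among the taps, and the fitting procedure itself, are assumptions of this sketch.

```python
import numpy as np

def fit_causal_filter(model_resp, real_resp, n_taps=10):
    """Least-squares fit of a causal linear filter (the n_taps most recent
    frames) plus an intercept, mapping a model unit's response onto the
    slower recorded calcium trace. Returns the smoothed prediction and
    the fitted weights."""
    model_resp = np.asarray(model_resp, dtype=float)
    T = len(model_resp)
    X = np.zeros((T, n_taps + 1))
    X[:, -1] = 1.0                          # intercept (offset term)
    for lag in range(n_taps):
        X[lag:, lag] = model_resp[:T - lag]  # only current and past frames
    w, *_ = np.linalg.lstsq(X, real_resp, rcond=None)
    return X @ w, w
```

If the real trace truly is a causal filtering of the model response, the fit recovers it exactly (up to numerical precision), since the target lies in the column space of the lagged design matrix.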
Analysing model LC responses to visual input
To better understand how each model LC unit responds to the visual input, we passed natural stimulus sequences (taken from courtship sessions with control males) into the 1-to-1 network and computed the cross-validated R2 between model LC responses and each visual parameter (Fig. 3b). Because female position and rotation are circular variables, we converted each variable x to a 2D vector [cos(x),sin(x)] and took the maximum R2 across both variables for each model LC unit. We further investigated model LC tuning by systematically varying female size, position, and rotation to generate a large bank of stimulus sequences. We input these stimulus sequences into the 1-to-1 network and formed heat maps from the model LC responses (Fig. 3c,d). For each input stimulus sequence, each of its 10 images was a repeat of the same image of a fictive female with a given size, lateral position, and rotation angle (that is, the fictive female remained frozen over time for each 10-frame input sequence). Across stimulus sequences, we varied female size (50 values linearly interpolated between 0.3 and 1.1), lateral position (50 values linearly interpolated between −1 and 1), and rotation angle (50 values linearly interpolated between −180° and 180°), resulting in 50 × 50 × 50 = 125,000 different stimulus sequences that enumerated all possible combinations. To understand the extent to which each visual parameter contributed to a model LC unit’s response, we decomposed the total response variance into different components37 (Fig. 3e). The first three components represent the variance of the marginal response to each of the 3 visual parameters (which we had independently varied). We computed these marginalized variances by: (1) taking the mean response for each value of a given visual parameter by averaging over the other two parameters across all stimulus sequences; and (2) taking the variance of this mean response over values of the marginalized parameter (50 values in total).
Any remaining variance (subtracting the three marginalized variances from the total response variance) represents response variance arising from interactions among the three visual parameters (for example, the model LC response depends on female position but only if the female is large and faces away from the male, see Fig. 3d, ‘LC10a’). Because the 1-to-1 network was deterministic, no response variance was attributed to noise across repeats (unlike trial-to-trial variability observed in the responses of real neurons).
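The marginal-variance decomposition described above can be sketched as follows for a response evaluated on the full size × position × rotation grid; `decompose_variance` is a hypothetical helper name.

```python
import numpy as np

def decompose_variance(resp):
    """Decompose the variance of a model unit's response over a
    size x position x rotation grid (resp: 3D array) into three marginal
    components plus an interaction remainder."""
    total = resp.var()
    marginals = []
    for axis in range(3):
        other = tuple(a for a in range(3) if a != axis)
        # mean response for each value of this parameter,
        # averaging over the other two parameters
        marginal_mean = resp.mean(axis=other)
        marginals.append(marginal_mean.var())
    # remainder: variance attributable to parameter interactions
    interaction = total - sum(marginals)
    return marginals, interaction, total
```

For a purely additive response (each parameter contributing independently) the interaction term is zero, whereas a multiplicative dependence (for example, position tuning that only appears at large sizes) leaves a positive remainder.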
Analysing the model LC responses to a large bank of static stimuli is helpful to understand LC tuning (Fig. 3c–e). However, we may miss important relationships between the features of the visual input and model LC responses without considering dynamics (for example, the speed at which female size changes). To account for these other temporal features, we devised three dynamic stimulus sequences that varied in time for roughly 10 s each (Fig. 3f and Supplementary Video 2); these stimuli were similar to a subset of stimuli we presented to real male flies (see ‘Two-photon calcium imaging’). For each stimulus sequence, we varied one visual parameter while the other two remained fixed at nominal values chosen based on natural sequence statistics.
The first 2.5 s of each stimulus were the following:
(1) vary female size: linearly increase from 0.5 to 0.9 with fixed position = 0 and rotation = 0°
(2) vary female position: linearly increase from −0.25 to 0.25 with fixed size = 0.8 and rotation = 0°
(3) vary female rotation: linearly increase from −45° to 45° with fixed size = 0.8 and position = 0
The next 2.5 s were the same as the first 2.5 s except reversed in time (for example, if the female increased in size the first 2.5 s, then the female decreased in size at the same speed for the next 2.5 s). Thus, the first 5 s was one period in which the female increased and decreased one parameter. The stimulus sequence contained 4 repeats of this period with different lengths (that is, different speeds): 5, 3.33, 1.66, and 0.66 s (corresponding to 150, 100, 50, and 10 time frames, respectively). We passed these stimulus sequences as input into the 1-to-1 network (that is, for each time frame, the 10 most recent images were passed into the model) and collected the model LC responses over time. We directly computed the squared correlation R2 between each model LC unit’s responses and the visual parameters (and features derived from the visual parameters, such as speed) for all three stimulus sequences (Fig. 3g). Velocity and speed were computed by taking the difference of the visual parameter between two consecutive time frames.
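The derived temporal features used in the correlation analysis above (velocity as a one-frame difference, speed as its magnitude) and the squared correlation itself can be sketched as:

```python
import numpy as np

def parameter_features(values):
    """Derive velocity and speed traces from a visual-parameter trace;
    velocity is the one-frame difference, speed its absolute value."""
    values = np.asarray(values, dtype=float)
    velocity = np.diff(values, prepend=values[0])  # signed change per frame
    return {'value': values, 'velocity': velocity, 'speed': np.abs(velocity)}

def squared_correlation(x, y):
    """Squared Pearson correlation (R^2) between two traces."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)
```

The `prepend` keeps the feature traces the same length as the parameter trace so they can be correlated frame-by-frame with the model responses.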
Analysing how model LCs contribute to behaviour
Because the 1-to-1 network identifies a one-to-one mapping, the model predicts not only the response of an LC neuron but also how that LC neuron causally relates to behaviour. We wondered to what extent each model LC unit causally contributed to each behavioural output variable. We designed an ablation approach, termed the cumulative inactivation procedure (CLIP), to identify which model LCs contributed the most to each behavioural output. The first step in CLIP is to inactivate each model LC unit individually by setting a model LC’s activity value for all time frames to a constant value (chosen to be the mean activity value across all frames). We found that setting the activity to 0 (as we do during knockout training) obscures nuanced but important relationships because a value of 0 may be far from the working regime of activity for a given stimulus, resulting in large deviations in predicted output. Instead, we focus on how variation in a model LC unit’s response contributes to variations in predicted behaviour. We test to what extent the 1-to-1 network with the inactivated model LC unit predicts the behavioural output of held-out test data from control flies. We choose the model LC unit that, once inactivated, leads to the least drop in prediction performance (that is, the model LC unit that contributes the least to the behavioural output). We then iteratively repeat this step, keeping all previously inactivated model LC units still inactivated. In this way, we greedily ablate model LC units until only one model LC unit remains. After performing CLIP, we obtain an ordering of model LC units from weakest to strongest contributor to a particular behavioural output (Fig. 4b,c). We measure the contribution to behaviour as the normalized change in performance. For movement variables, normalized change in performance is the difference in R2 between no silencing (‘none’) and silencing K model LC units, normalized by the R2 of no silencing.
For song variables, normalized change in performance is the same as for the movement variables except that we use 1 − cross-entropy. We then use this ordering (and prediction performance) to infer which model LC units contribute to which behavioural outputs. We performed CLIP to predict held-out behaviour from control flies (Fig. 4c). Because different behavioural outputs had different prediction performances (Extended Data Fig. 3), we normalized each model LC unit’s change in performance by the maximum change in performance (that is, prediction performance for no inactivation minus that of inactivating all model LC units); for model LC units for which inactivation led to an increase in performance due to overfitting (Extended Data Fig. 12), we clipped their normalized change in performance at 1. We also performed CLIP to predict the model output to simple, dynamic stimulus sequences (Fig. 4d–f). Because we did not have real behavioural data for these dynamic stimulus sequences, we used the model output when no silencing occurred as ground truth behaviour.
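The greedy CLIP loop can be sketched as follows; `predict_fn` and `score_fn` are hypothetical interfaces standing in for the network with mean-substituted units and the held-out prediction-performance computation.

```python
import numpy as np

def clip_ordering(predict_fn, n_units, score_fn):
    """Greedy cumulative inactivation (CLIP sketch).

    predict_fn(inactive: set) -> model prediction with those units held
        at a constant value; score_fn(pred) -> prediction performance.
    Returns units ordered weakest-to-strongest contributor, plus the
    score after each cumulative inactivation.
    """
    inactive, order, scores = set(), [], []
    while len(inactive) < n_units:
        remaining = [u for u in range(n_units) if u not in inactive]
        # inactivate the unit whose removal hurts performance the least
        best = max(remaining, key=lambda u: score_fn(predict_fn(inactive | {u})))
        inactive.add(best)
        order.append(best)
        scores.append(score_fn(predict_fn(inactive)))
    return order, scores
```

On a toy linear model whose units carry weights 0, 1 and 0.1, the procedure ablates the zero-weight unit first and leaves the strongest contributor for last, as intended.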
Connectome analysis
To obtain the pre- and postsynaptic partners of LC and LPLC neuron types, we leveraged the recently released FlyWire connectome of an adult female Drosophila15,19, for which optic lobe intrinsic neurons were recently typed18. We downloaded the synaptic connection matrix of public release version 630 from https://codex.flywire.ai/. We isolated the following 57 LC and LPLC types: LC4, LC6, LC9, LC10a-f, LC11, LC12, LC13, LC14a1, LC14a2, LC14b, LC15, LC16, LC17, LC18, LC19, LC20a-b, LC21, LC22, LC24, LC25, LC26, LC27, LC28a, LC29, LC31a-c, LC33a, LC34, LC35, LC36, LC37a, LC39, LC40, LC41, LC43, LC44, LC45, LC46, LCe01-LCe09, LPLC1, LPLC2, and LPLC4. We report individual cell types LC10a, LC10b, LC10c, and LC10d, which have been identified in FlyWire, but we do not yet know how the driver lines LC10ad and LC10bc map onto these individual types. We summed the number of synaptic connections across all neurons of the same type that were either inputs or outputs of one of the LC and LPLC neuron types. We denoted a connection (Fig. 5b, tick lines) if at least 5 synaptic connections existed between an LC or LPLC neuron type and another neuron type. We identified 538 presynaptic cell types and 956 postsynaptic cell types. We categorized partner cell types into classes based on the naming conventions in FlyWire’s connectome dataset15 and sorted cell types within each class based on the number of connections to the LC types. To test whether LC types with similar inputs project to similar outputs (that is, to identify groupings of LC types), we performed agglomerative clustering separately on the pre- and postsynaptic connections. Specifically, we summed up connections across partner cell types within a class and used these summed connections as features for clustering (complete linkage with cosine similarity as affinity). LC types within a cluster are listed in numerical order.
The following classes were used: LC, lobula columnar; LPLC, lobula plate-lobula columnar; AOTU, anterior optic tubercle; AVLP, anterior ventrolateral protocerebrum; CB, cross brain; CL, clamp; cL, centrifugal lobula; cM, centrifugal medulla; DN, descending neuron; Dm, distal medulla; Li, lobula intrinsic; LLPC, lobula-lobula plate columnar; LM, lobula medulla; LT, lobula tangential; mAL, medial antenna lobes; ML, medial lobe; MT, medulla tangential; OA, octopaminergic; PLP, posterior lateral protocerebrum; Pm, proximal medulla; PS, posterior slope; PVLP, posterior ventrolateral protocerebrum; SMP, superior medial protocerebrum; T2-T5, optic intrinsic; Tlp, translobula plate; Tm, transmedullary; TmY, transmedullary; Y, optic intrinsic; IB, inferior bridge; LAL, lateral accessory lobe; SAD, saddle; SLP, superior lateral protocerebrum; WED, wedge.
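The clustering step above can be sketched with SciPy's hierarchical clustering (complete linkage, cosine distance); the helper name and the toy connection matrix are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_lc_types(conn, n_clusters):
    """Complete-linkage agglomerative clustering with cosine distance,
    applied to LC-type connection profiles (rows: LC types,
    columns: class-summed synapse counts)."""
    Z = linkage(np.asarray(conn, dtype=float), method='complete', metric='cosine')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```

Cosine distance compares the shape of each connectivity profile rather than its magnitude, so two LC types with proportionally similar partners cluster together even if one has many more synapses overall.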
Statistical analysis
Unless otherwise stated, all statistical hypothesis testing was conducted with permutation tests, which do not assume any parametric form of the underlying probability distributions of the sample. All tests were two-sided and non-paired, unless otherwise noted. Each test was performed with 1,000 runs, where P < 0.001 indicates the highest significance achievable given the number of runs performed. When comparing changes in behaviour due to genetic silencing versus control flies (Fig. 1e), we accounted for multiple hypothesis testing by correcting the false discovery rate with the Benjamini–Hochberg procedure with α = 0.05. Paired permutation tests were performed when comparing prediction performance between models (Fig. 2e) for which paired samples were randomly permuted with one another. Error bars of the response traces in Fig. 2b–d were 90% bootstrapped confidence intervals of the means, computed by randomly sampling repeats with replacement. No statistical methods were used to predetermine sample sizes, but our sample sizes are similar to those of previous studies11,12,29,30. Experimenters were not blinded to the conditions of the experiments during data collection and analysis.
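A non-paired, two-sided permutation test on the difference in means, and the Benjamini–Hochberg FDR correction, can be sketched as follows; the test statistic is an assumption of this sketch, as the text does not specify one.

```python
import numpy as np

def permutation_test(x, y, n_runs=1000, rng=None):
    """Two-sided, non-paired permutation test on the difference in means.
    With 1,000 runs the smallest attainable P value is ~0.001."""
    rng = np.random.default_rng(rng)
    observed = abs(np.mean(x) - np.mean(y))
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_runs):
        perm = rng.permutation(pooled)
        stat = abs(perm[:len(x)].mean() - perm[len(x):].mean())
        count += stat >= observed
    # add-one correction keeps P strictly positive
    return (count + 1) / (n_runs + 1)

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected at false discovery rate alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest index passing its threshold
        reject[order[:k + 1]] = True
    return reject
```

Identical samples yield P = 1, clearly separated samples approach the floor of ~0.001, and the BH mask rejects only those hypotheses whose sorted P values fall under the stepped threshold.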
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.