RNA expression dataset of 384 sunflower hybrids in field condition

This article describes how RNA expression data for 173 genes were produced on 384 sunflower hybrids grown in field conditions. The hybrids were selected to represent the genetic diversity within cultivated sunflower. RNA was extracted from mature leaves at a single time point, seven days after anthesis. These data make it possible to differentiate genotype behaviours and constitute a valuable resource for the community to study the adaptation of crops to field conditions and the molecular basis of heterosis. They are available in the data.inra.fr repository.

Keywords: sunflower / genetics / genomics / drought

1 Data
Domesticated sunflower, Helianthus annuus L., is the fourth most important oilseed crop in the world (USDA, 2019). It can maintain stable yields across a wide range of environmental conditions, especially during drought stress (Hussain et al., 2018). In the context of climate change, a major interest in crop science is to better understand how such plants adapt to this phenomenon. The response to drought stress involves a large number of molecular pathways and subsequent physiological processes. Most sunflowers cultivated worldwide are hybrid genotypes, which exploit heterosis to improve yield and stress tolerance in the field. In this data article, we share the RNA expression profiles of 173 genes (linked to drought stress and heterosis) on 384 hybrid sunflower genotypes grown in the field. These datasets are part of a larger project that integrates other phenotypic data collected on this trial. These include the development of the harvested plants as well as agronomic and yield traits at the plot level. They are available upon request from the authors, subject to intellectual property limitations. The raw data associated with this article can be found at https://doi.org/10.15454/HESVA0.

2 Experimental design, plant material and growth conditions
(Charente-Maritime, France) field station. The field station is divided into 540 plots, including 475 plots with hybrids obtained by crossing 36 sterile (CMS PET1) lines with 36 restorer lines as previously described (Bonnafous et al., 2018). The other plots (around 11% of the field) contained one of the four check hybrids (Extrasol, ES Akustic, NK Kondi and LG5450HO). These four hybrids are used as environmental controls. The harvest was performed on July 22nd 2015 between 11:00 and 12:30 local time (sunrise at 6:36 and sun zenith at 14:00). This corresponds to seven days after anthesis, as the studied hybrids flowered on July 15th on average, ±3 days (standard deviation). On four plants per plot, we collected one leaf per plant at rank n−4 (where n is the total number of leaves detached from the flower head) for this molecular analysis. Leaves were cut without their petiole, immediately frozen in dry ice and then transferred to a −80 °C freezer before grinding.

3 Transcriptome analysis

3.1 RNA extraction
Leaf grinding was performed using a ZM200 grinder (Retsch, Haan, Germany) with a 0.5-mm sieve cooled with liquid nitrogen. Total RNA was extracted with the Nucleospin 96 RNA Tissue Core Kit (Macherey-Nagel, Düren, Germany) following the manufacturer's instructions. RNA quality was checked with a Nanodrop Lite Spectrophotometer (Thermo Scientific, Waltham, USA) and the quantity was assessed using the Agilent RNA 6000 Nano Kit (Agilent, Santa Clara, CA, USA). DNase treatment was performed with the TURBO DNA-free™ Kit from Invitrogen (Thermo Fisher Scientific, Vilnius, Lithuania), and the RNA was finally subjected to reverse transcription using the Transcriptor Reverse Transcriptase Kit from Roche (Mannheim, Germany).
The specificity of each pair of primers was verified by BLAST on the whole sunflower genome and tested with the LightCycler 480 Real-Time PCR Instrument (Roche Diagnostics, Mannheim, Germany) on a pool of cDNA representative of the genetic diversity of our samples. Finally, the primers were selected according to their amplification and melting curves, their cycle detection thresholds, and after verification of their specificity (absence of dimer).

Expression measurements
qPCR analyses were conducted with the BioMark HD System using 96.96 Dynamic Array chips (Fluidigm Corporation, San Francisco, CA, USA) (Spurgeon et al., 2008). Ten different Fluidigm plates were designed to measure a total of 180 genes against 435 experimental samples (including 353 hybrid genotypes and check hybrids). Plates are described in Figure 1.
For the sample part, plates were separated into five groups of two plates. Samples are the same on the two plates of a group. Each plate always contained a five-point dilution range of a pool representative of the genetic diversity of our samples; two genotype controls (LG5450HO and SF092_SF342); one water point; one internal control of the Fluidigm BioMark HD platform; and 87 genotype samples. Of the 87 genotype samples, 11% correspond to field control genotypes (matching the proportion of environmental control genotypes in the field). All genotype samples are repeated on the two plates of the group. In total, 435 genotype samples were used, including 82 environment checks and 353 hybrid genotypes.
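As a quick consistency check on the layout described above, the well and sample counts can be tallied. This is only an arithmetic sketch; the variable names are illustrative and are not part of the dataset.

```python
# Wells per plate on the sample side, as listed in the text.
wells_per_plate = {
    "dilution_points": 5,
    "genotype_controls": 2,
    "water": 1,
    "platform_internal_control": 1,
    "genotype_samples": 87,
}
n_groups = 5  # five groups of two plates that share the same samples

total_wells = sum(wells_per_plate.values())
total_samples = n_groups * wells_per_plate["genotype_samples"]
environment_checks = 82
hybrids = total_samples - environment_checks

print(total_wells, total_samples, hybrids)  # prints: 96 435 353
```

The 96 wells match the sample side of a 96.96 array, and the totals agree with the 435 samples (82 checks and 353 hybrids) reported in the text.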
For the gene part, two batches of five plates were constituted. In the first batch, each plate contains two reference genes (HanXRQChr05g0131911 and HanXRQChr01g0029571) and three biomarker genes (HanXRQChr01g0021351, HanXRQChr06g0175391 and HanXRQChr04g0120731) as in Marchand et al. (2013). In the second batch, the two previous reference genes are present together with three other reference genes (HanXRQChr04g0115631, HanXRQChr03g0090171 and HanXRQChr01g0021131). In both batches, an internal control specific to the platform was added. Sunflower reference genes were chosen among the genes showing no modulation under a range of drought stress intensities in eight genotypes, based on the Affymetrix hybridizations performed in Rengel et al. (2012).

Data curation
Variation in the Ct of each gene measured on the different hybrid genotypes can be due to environmental and experimental perturbations. Consequently, normalization and correction of the Ct values are necessary.

Amplification efficiency
The amplification efficiency of each gene was estimated from its values on the dilution range, using the robustfit function of the Matlab (version 7.11) Statistics Toolbox (version 7.4). In this estimation, $\mathrm{Eff}_i$ is the amplification efficiency of gene $i$, $X_0^i$ is the initial amount of target molecules and $X_t^i$ is the amount of target molecules when the threshold cycle Ct is reached. For 32 genes, the efficiency was manually curated. Aberrant values, defined as those beyond plus or minus three times the standard deviation, were deleted.
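To illustrate how an efficiency can be derived from a dilution range, the sketch below assumes the usual exponential amplification model, X_t = X_0 × Eff^Ct, in which Eff is the per-cycle multiplication factor (2 for perfect doubling). Both this model form and the use of ordinary least squares are assumptions made for brevity; the latter is a stand-in for the robust regression (robustfit) mentioned above. The function name and the example values are hypothetical.

```python
import math

def amplification_efficiency(relative_conc, ct_values):
    """Estimate Eff from a dilution series, assuming X_t = X_0 * Eff**Ct.

    Under that model, Ct is linear in log10(X_0) with slope -1/log10(Eff),
    so Eff = 10 ** (-1 / slope). Plain least squares is used here,
    not a robust fit.
    """
    x = [math.log10(c) for c in relative_conc]
    y = list(ct_values)
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx  # Ct change per 10-fold change in input amount
    return 10.0 ** (-1.0 / slope)

# Hypothetical five-point, 4-fold dilution series with perfect doubling:
# each 4-fold dilution delays the threshold by two cycles.
conc = [1, 1 / 4, 1 / 16, 1 / 64, 1 / 256]
ct = [20.0, 22.0, 24.0, 26.0, 28.0]
eff = amplification_efficiency(conc, ct)
```

With these idealized values the estimate is 2, up to floating-point error; real dilution ranges give values that depart from 2, which is why the efficiency enters the normalization.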

Ct normalization
Expression levels of each gene are normalized using their amplification efficiency together with the expression levels and amplification efficiencies of the reference genes. In this normalization, $Ct_i^s$ is the cycle threshold of gene $i$ on sample $s$, $\Delta Ct_i^s$ the corresponding normalized expression level and $\mathrm{Eff}_i$ its amplification efficiency; $r$ is one of the $R$ reference genes, $\mathrm{Eff}_r$ its amplification efficiency and $Ct_r^s$ its cycle threshold on sample $s$. Five different reference genes were measured on the Fluidigm plates. Evaluation of their dispersion reveals a high variance in the expression of gene HanXRQChr04g0115631 (Tab. 1). Consequently, this gene was removed from the analysis, along with gene HanXRQChr03g0090171, which also shows low quality. In total, only three reference genes were kept for the analysis: HanXRQChr05g0131911, HanXRQChr01g0029581 and HanXRQChr01g0021131.
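One efficiency-weighted form that is consistent with the quantities defined above is to weight each Ct by log2 of its efficiency and subtract the mean over the R reference genes. The sketch below assumes exactly that form; the sign convention, the log2 weighting and the function name are assumptions for illustration and are not confirmed by the text.

```python
import math

def normalized_delta_ct(ct_gene, eff_gene, ref_cts, ref_effs):
    """Efficiency-weighted Ct of a gene minus the mean over reference genes.

    Assumed form (not confirmed by the source):
        dCt = Ct_i * log2(Eff_i) - (1/R) * sum_r(Ct_r * log2(Eff_r))
    """
    target = ct_gene * math.log2(eff_gene)
    refs = [ct * math.log2(eff) for ct, eff in zip(ref_cts, ref_effs)]
    return target - sum(refs) / len(refs)

# Hypothetical sample: one target gene and three reference genes,
# all with perfect efficiency (Eff = 2, so log2(Eff) = 1).
delta_ct = normalized_delta_ct(25.0, 2.0, [20.0, 22.0, 21.0], [2.0, 2.0, 2.0])
```

When every efficiency equals 2, this reduces to the familiar difference between the target Ct and the mean reference Ct (here 25 − 21 = 4).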

Plate effect
Each gene is measured on five different Fluidigm chips (hereafter named plates), so part of the expression variation can be due to the use of different plates; this is referred to as the plate effect. The dilution range was constituted from a pool of genotypes and is present on each plate. The plate effect is estimated on the dilution range for each gene with a linear model. In this model, the variation in expression level over the dilution range is explained by the plate on which the gene is measured and by the dilution level. The ΔCt values of each gene are then adjusted according to this plate effect.
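A minimal sketch of such a correction for one gene is given below. It assumes a balanced design (the same dilution points on every plate), in which case the plate term of an additive plate + dilution linear model reduces to the plate mean minus the grand mean. The function names, the sum-to-zero parameterization and the subtraction used for the adjustment are illustrative assumptions.

```python
def plate_effects(dilution_values):
    """Plate effects for one gene from its dilution-range measurements.

    dilution_values: {plate: [value at each dilution point]}, with the same
    dilution points on every plate (balanced design assumed).
    """
    all_values = [v for values in dilution_values.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    return {
        plate: sum(values) / len(values) - grand_mean
        for plate, values in dilution_values.items()
    }

def adjust_for_plate(delta_ct, plate, effects):
    """Remove the estimated plate effect from a sample value."""
    return delta_ct - effects[plate]

# Hypothetical dilution-range values for one gene on two plates.
effects = plate_effects({"P1": [20.0, 22.0, 24.0], "P2": [21.0, 23.0, 25.0]})
adjusted = adjust_for_plate(3.0, "P2", effects)
```

In this toy example plate P2 reads half a cycle above the grand mean, so a sample value of 3.0 measured on P2 is adjusted to 2.5.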

Field effect
Because the sunflowers are grown in the field, local environmental variations can influence expression levels; this is referred to as the field effect. The field effect was estimated on the field control genotypes (11% of total samples) using a mixed model including two spatial fixed factors (line and column numbers in the field), a replicate fixed factor if necessary, an independent random genetic factor, and the residual error (Bonnafous et al., 2018). For each hybrid genotype, the ΔCt value of each gene is corrected by the field effect measured on the closest control genotype in the field.
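The last step, picking the closest control plot and applying its field effect, can be sketched as follows. The use of Euclidean distance on the plot coordinates, the subtraction used for the correction, and all names and values are assumptions for illustration; the mixed model that produces the field effects is not reproduced here.

```python
import math

def closest_check(plot_xy, checks):
    """Return the check plot nearest to plot_xy.

    checks: list of (x, y, field_effect) tuples for the control plots.
    Euclidean distance on the plot coordinates is assumed.
    """
    return min(checks, key=lambda c: math.dist(plot_xy, c[:2]))

def correct_for_field(delta_ct, plot_xy, checks):
    """Remove the field effect of the nearest check plot from a value."""
    _, _, effect = closest_check(plot_xy, checks)
    return delta_ct - effect

# Hypothetical check plots with their estimated field effects.
checks = [(1, 1, 0.3), (10, 10, -0.2)]
corrected_near_first = correct_for_field(5.0, (2, 1), checks)
corrected_near_second = correct_for_field(5.0, (9, 9), checks)
```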

Missing values: curation and imputation
Among the 66,951 points measured using Fluidigm, 684 had an NA value, which corresponds to 1% of missing data in the experiment. Among the hybrids, 259 have one or more missing values. The average number of missing values per genotype is 1.8. Three genotypes had more than 10% of their values missing (Tab. 2); we decided to remove those genotypes.
Of the 180 genes studied, 110 had missing values. The average number of missing values is four per gene. Five genes had 10% or more missing values (see Tab. 3). We decided to keep those genes for future analysis.
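The genotype filter described above amounts to computing a missing-value fraction per genotype and dropping those above 10%. The sketch below uses `None` to stand for NA; the data structure, function names and example genotype labels are hypothetical.

```python
def missing_fraction(values):
    """Fraction of missing (None) entries in a list of measurements."""
    return sum(v is None for v in values) / len(values)

def drop_sparse_genotypes(expression, max_missing=0.10):
    """Keep genotypes whose missing fraction does not exceed max_missing.

    expression: {genotype: [value or None for each gene]}
    """
    return {
        genotype: values
        for genotype, values in expression.items()
        if missing_fraction(values) <= max_missing
    }

# Hypothetical data over 20 genes: A misses 1 value (5%), B misses 3 (15%).
expression = {
    "A": [1.0] * 19 + [None],
    "B": [1.0] * 17 + [None] * 3,
}
kept = drop_sparse_genotypes(expression)

# Overall missing rate reported in the text: 684 of 66,951 points.
overall_missing_percent = 100 * 684 / 66951
```

The overall rate works out to about 1.02%, in line with the 1% stated in the text.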
Missing values were imputed gene by gene. For a gene $i$, its missing value on a hybrid genotype $h$ is replaced by the mean of the expression level of gene $i$ on the other hybrids, as follows:

$$\Delta Ct_i^h = \frac{1}{S} \sum_{s=1}^{S} \Delta Ct_i^s$$

where $S$ is the total number of hybrid genotypes for which the expression level of gene $i$ is available.
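The per-gene mean imputation described above can be sketched in a few lines. `None` stands for a missing value, and the function name is illustrative.

```python
def impute_gene(values):
    """Replace missing (None) values of one gene by the mean of the
    values observed for that gene on the other hybrids."""
    observed = [v for v in values if v is not None]
    gene_mean = sum(observed) / len(observed)
    return [gene_mean if v is None else v for v in values]

# Hypothetical expression levels of one gene across three hybrids.
imputed = impute_gene([1.0, None, 3.0])
print(imputed)  # prints: [1.0, 2.0, 3.0]
```

Because the imputed value is the gene mean, it leaves the mean of that gene unchanged but reduces its variance, which is worth keeping in mind for the five genes with 10% or more missing values.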
5 Data records

5.1 List of genotypes file
15EX05_genotype.tsv. This file contains the list of the different genotypes used. Each row corresponds to a genotype.
The first column (CROSS_GENOTYPE) contains the name of the hybrid genotype. The genotype name is composed of the names of its parents (female genotype first, male genotype second). The second column (STATUS_GENOTYPE) contains supplemental information about the genotype (e.g. control genotype or removed from the analysis).

List of genes file
15EX05_geneList.tsv. This file contains the list of genes measured on the Fluidigm plates. The first column (GENE_ID) contains the gene ID and the second column (STATUS_GENE) the nature of the gene (e.g. reference, biomarker or measured).

Field file
15EX05_field.tsv. This file contains the composition of each plot in the field (in rows). 'SAMPLE_NAME' is a number associated with the plot, later used to name the sample originating from that plot. 'XTRIAL' and 'YTRIAL' are the X and Y coordinates of the plot in the field. 'CROSS_GENOTYPE' is the name of the genotype. 'STATUS_EXP' is the type of sample ('exp' for studied hybrids or 'check' for check hybrids). The other columns are 'SOWING_DATE', 'EMERGENCE_DATE', 'PLANT_DENSITY' and 'F1_DATE'.
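Since the sample numbers in the Fluidigm result files refer back to this file, a typical first step is to build a lookup from 'SAMPLE_NAME' to 'CROSS_GENOTYPE'. The sketch below reads a small in-memory excerpt; only the column names come from the description above, and the row values are hypothetical.

```python
import csv
import io

# Hypothetical excerpt; only the column names come from the file description.
field_tsv = io.StringIO(
    "SAMPLE_NAME\tXTRIAL\tYTRIAL\tCROSS_GENOTYPE\tSTATUS_EXP\n"
    "1\t1\t1\tSF092_SF342\texp\n"
    "2\t1\t2\tLG5450HO\tcheck\n"
)

rows = list(csv.DictReader(field_tsv, delimiter="\t"))

# Map each sample number (as text) to its genotype name.
sample_to_genotype = {row["SAMPLE_NAME"]: row["CROSS_GENOTYPE"] for row in rows}

# Sample numbers of the check-hybrid plots.
check_samples = [row["SAMPLE_NAME"] for row in rows if row["STATUS_EXP"] == "check"]
```

With the real file, the `io.StringIO` object would be replaced by `open("15EX05_field.tsv", newline="")`, assuming the file is tab-separated with a header row as its extension suggests.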

Raw data files
15EX05_Raw_Data_Fluidigm_PlateX-BatchX.gz. These 10 files contain the raw data obtained with the BioMark HD Fluidigm System, one file per plate and batch.

Fluidigm results files
15EX05_Fluidigm_results.gz. This file contains the 10 result data sets (in CSV format) of the Fluidigm measurements obtained by the program, one file per Fluidigm plate. All files have the same structure. No correction was applied to the data. Each row corresponds to a spot/chamber on the Fluidigm plate. The first column gives the ID of the chamber on the Fluidigm plate. The next three columns give 'Sample' information: (1) 'Name', a number that corresponds to the sample name of the hybrid genotype, (2) 'Type', the type of sample, and (3) 'rConc', the concentration of the sample. The correspondence between the sample number and the hybrid name is given in the file 15EX05_field.tsv. The next two columns describe the gene tested in the chamber: (1) 'Name', its identifier, and (2) 'Type' (test or reference gene). The next five columns contain 'Ct' information: (1) 'Value', (2) 'Calibrated rConc', (3) 'Quality', (4) 'Call' and (5) 'Threshold'. The last three columns contain information about the melting temperature 'Tm': (1) 'In Range', (2) 'Out Range' and (3) 'Peak Ratio'.

5.6 Gene expressions file
15EX05_expression_noNA.tsv. This file contains the ΔCt of each gene (in rows) on each genotype (in columns), corrected for plate and field effects, with missing data imputed.