Assessment of phenotypic variability among EEA INTA Pergamino sunflower lines: Its relationship with the grain yield and oil content

Abstract – The aims of the present study were to assess the phenotypic diversity among 221 sunflower accessions of the INTA Pergamino Sunflower Breeding Program, to obtain discriminant functions that allow new accessions to be classified into similar groups, and to evaluate the relationship between the pairwise genetic distance among accessions and hybrid performance for grain yield and oil content. We used 19 quantitative descriptors to evaluate phenotypic and morphological variability. Principal Component Analysis (PCA) and Hierarchical Cluster Analysis (HCA) were used to evaluate all the variables simultaneously and to describe the patterns of phenotypic variation in the germplasm. The distribution of germplasm in the dendrogram did not follow a clear pattern with regard to the predefined groups. This study revealed the variability present among the lines that form the INTA Pergamino breeding program, despite the strong selection applied to obtain inbred lines that produce superior hybrids for the Argentinean sunflower area. This work demonstrates the need for more in-depth study of genetic variability to be used as a predictor of heterosis in sunflower.

Keywords: sunflower / breeding / genetic variability / heterosis

Introduction
Sunflower (Helianthus annuus) is produced in more than fifty countries, and two thirds of its production is concentrated in Europe, including Ukraine, Russia and Turkey. Other major producing countries are Argentina, China, the United States, and the south-eastern parts of Africa. Sunflower ranked fourth on the vegetable oil market in 2017/18, after palm, soybean and rapeseed (Pilorgé, 2020).
Since 1931, Argentina has been working on sunflower breeding by exploiting the diversity of a broad range of international genetic resources in combination with introgressions from wild Helianthus species. At its Experimental Stations in Manfredi (Córdoba, Argentina) and Pergamino (Buenos Aires, Argentina), the Instituto Nacional de Tecnología Agropecuaria (INTA) has pioneered Argentinean sunflower breeding and has become one of the most prolific sunflower breeding groups in the country (Filippi et al., 2020).
Multivariate analysis refers to a family of statistical techniques widely used to analyze data that arise from more than one variable. Principal component analysis (PCA), hierarchical cluster analysis (HCA) and discriminant analysis (DA) are commonly employed multivariate techniques. PCA is one of the most useful methods in diversity studies. It reduces the dimensionality of the data while retaining most of the variation in the data set, and it is used to order genotypes according to their relationships while identifying the traits that explain most of the variability in the data set (Ringnér, 2008; Dudhe et al., 2019). HCA is widely used to assess the relatedness and distance of any type of sample characterized by any type of descriptor, and is therefore used routinely to assess genetic diversity in germplasm collections (Peeters and Martinelli, 1989). DA maximizes between-group variability, minimizes within-group variability and provides a visual assessment of the relatedness of the groups. It also works as a predictive tool, enabling the classification of new accessions into previously characterized groups (Hernández et al., 2019; Palacio et al., 2020).
Knowledge of the amount and distribution of genetic diversity within sunflower breeding germplasm is essential to maintain the genetic variability of breeding pools, to make crop improvement more efficient through the directed accumulation of desirable alleles, and to select the most diverse parental lines for crosses intended to generate breeding populations (Hladni et al., 2017). Morphological, physiological, biochemical, pedigree and molecular data have been used in studies of genetic diversity in sunflower (Rama Subrahmanyam et al., 2003; Dudhe et al., 2019; Filippi et al., 2015, 2020). Melchinger (1999) concluded that quantitative traits such as yield and heterotic response are expected to increase with parental genetic distance. Melchinger and Gumber (2015) defined a heterotic group as a group of related or unrelated genotypes from the same or different populations that exhibit similar combining ability when crossed with genotypes from other germplasm groups. According to Miller (1999) and Vear and Miller (1993), four heterotic groups are being utilized worldwide in sunflower breeding: a group of inbred female maintainer lines derived from Russian open-pollinated varieties; a US restorer group, formed by crossing wild annual sunflower species with lines of domesticated sunflower; a group of Argentine germplasm; and a group of Romanian and South African female lines. The US restorer lines tend to be good sources of disease resistance and fertility restorer genes, and the Argentine germplasm provides female parental lines with disease resistance and high grain yield. French and Serbian inbred lines are also important and widely used in the sunflower seed industry.
The proposed heterotic groups in sunflower are less rigid than in other hybrid crops. Within-group genetic distances appear to be greater, and between-group distances smaller, than those observed in maize. Hongtrakul et al. (1997) also emphasized the need for further study of heterotic groups in sunflower. Working on a set of public maintainer and restorer lines with AFLP markers, they found up to four subgroups within the restorer lines and two subgroups within the maintainer lines, suggesting that these could be defined as distinct heterotic groups. Heterotic groups are very useful in sunflower for cataloguing diversity and for directing the introgression of new alleles and the creation of new heterotic groups (Cheres and Knapp, 1998; Cheres et al., 2000; Lochner, 2011).
The aims of the present study were (i) to assess the diversity among a set of INTA Pergamino sunflower accessions using multivariate techniques, (ii) to compare the patterns of phenotypic variation obtained according to the different predefined groups of accessions, (iii) to obtain discriminant functions that allow the classification of new accessions and (iv) to evaluate the relationship between the genetic distance among accessions and the hybrid performance for grain yield and oil content.

Sunflower accessions
A set of 221 inbred lines (maintainer and restorer) with desirable characteristics for subsequent use in hybrids, bred by the Sunflower Breeding Program of the Instituto Nacional de Tecnología Agropecuaria (INTA) at the Estación Experimental Agropecuaria (EEA) in Pergamino (Buenos Aires, Argentina), was included in this study. The only line included in this study that was not developed by INTA is the public line RHA 278. Five groups of accessions were previously determined according to the origin of the germplasm, the agronomic characteristics and whether they are maintainer or restorer lines. The lines that form each group are shown in Supplementary Table 1. The characteristics of each group of lines and their origin are detailed in Table 1. More information on the development of the lines released by INTA, their characteristics and the selection strategies used can be found in González et al. (2015).

Field experiments
The  Table 2. The A lines used to develop the hybrids were obtained by incorporating a single source of cytoplasmic male sterility (CMS), PET1 from Helianthus petiolaris (Leclercq, 1969), into the maintainer lines by backcrossing.
In each environment, a randomized incomplete block design with three replicates was used to test the 70 A × R hybrids. Environments were considered as random and hybrids as fixed. The trials were planted at 45 000 plants/ha. A plot size of 2 rows × 6 m and an inter-row spacing of 0.70 m were used in all the trials. Planting took place within the normal sowing window (mid-October) at each location. Grain yield was determined by hand harvesting 3.92 m² (both rows, discarding border plants) and is presented at 11% moisture. Oil content was determined on a 10 g oven-dried achene sample by Nuclear Magnetic Resonance (NMR) using an Oxford MQ5 instrument calibrated by solvent extraction (Grandlund and Zimmerman, 1975).
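The conversion from a hand-harvested plot sample to grain yield at 11% moisture can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name, the sample weight and the field moisture value are hypothetical, the 2.8 m harvested row length is inferred from 3.92 m² = 2 rows × 0.70 m × 2.8 m, and the dry-matter-based moisture correction is a common convention assumed here.

```python
def yield_kg_ha(sample_kg, moisture_pct, rows=2, row_spacing_m=0.70,
                harvested_row_m=2.8, target_moisture_pct=11.0):
    """Convert a plot sample weight to kg/ha at a target moisture content.

    All defaults except the row count, row spacing and 11% target are
    assumptions made for this sketch.
    """
    # Harvested area: 2 rows x 0.70 m x 2.8 m = 3.92 m2
    area_m2 = rows * row_spacing_m * harvested_row_m
    # Dry matter contained in the sample at the measured moisture
    dry_matter_kg = sample_kg * (100.0 - moisture_pct) / 100.0
    # Weight the same dry matter would have at the target moisture
    adjusted_kg = dry_matter_kg / ((100.0 - target_moisture_pct) / 100.0)
    # Scale from the harvested area to one hectare (10 000 m2)
    return adjusted_kg / area_m2 * 10_000.0


# Hypothetical sample: 1.2 kg of achenes harvested at 14% moisture
example_yield = yield_kg_ha(sample_kg=1.2, moisture_pct=14.0)
print(f"{example_yield:.0f} kg/ha at 11% moisture")
```

A sample already at 11% moisture is only rescaled by area, which gives a quick sanity check of the function.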

Quantitative descriptors
We measured phenotypic and morphological variability in four randomly selected plants from each row using 19 descriptors from the International Board for Plant Genetic Resources list (IBPGR, 1985) (Tab. 3). DTF, PH, NL, SD, LL, LW, PL, CD, BL, NRF, RFW and RFL were measured at stage R5.5 (50% of the disk flowers have completed or are in anthesis) (Schneiter and Miller, 1981). The % Oil was determined by Nuclear Magnetic Resonance (NMR) using an Oxford MQ5 instrument calibrated by solvent extraction (Grandlund and Zimmerman, 1975).

Statistical analysis
Descriptive and multivariate analyses were carried out to characterize the phenotypic diversity of the sunflower accessions. The mean, range, standard deviation and coefficient of variation (CV) were obtained for each descriptor. Both types of analysis were conducted using the statistical software R (R Core Team, 2020).
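As an illustration of these summary statistics, the sketch below computes the mean, range, standard deviation and CV for a single descriptor. It uses Python's standard library rather than R, which the study reports using; the function name and the plant height values are hypothetical, and the sample (n - 1) standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics


def describe(values):
    """Return mean, range, sample SD and CV (%) for one descriptor."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator (assumed)
    return {
        "mean": mean,
        "range": (min(values), max(values)),
        "sd": sd,
        "cv_pct": 100.0 * sd / mean,  # CV expressed as a percentage of the mean
    }


# Hypothetical plant height (PH) values in cm for five accessions
plant_height_cm = [120.0, 135.0, 150.0, 142.0, 128.0]
summary = describe(plant_height_cm)
print(summary)
```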
Based on the groups obtained from the cluster analysis, a Linear Discriminant Analysis (LDA) was performed to define functions that discriminate between groups and classify new accessions. Discriminant functions were obtained by splitting the original database into two subsets: a training data set (80%) and a testing data set (20%). Variables that were highly correlated and provided redundant information (AW, AT, LW and RFW) were removed. The testing data set was used to evaluate the accuracy of the LDA functions in classifying new accessions into the groups. The LDA was performed using the MASS R package (Venables and Ripley, 2002). The LD functions were plotted in a two-dimensional space using the ggplot2 R package (Wickham, 2016). The quality of the discriminant analysis was evaluated with the Wilks' Lambda statistic, computed with the base R "manova" function.
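The 80/20 partition described above can be sketched as a simple random split. This is an assumption-laden illustration: the text does not say whether the split was stratified by cluster or how it was seeded, the accession identifiers are invented, and the count of 219 assumes that only the accessions assigned to a cluster were used.

```python
import random


def train_test_split(items, train_fraction=0.8, seed=42):
    """Shuffle items reproducibly and split them into training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers for 219 clustered accessions
accessions = [f"line_{i:03d}" for i in range(1, 220)]
train_set, test_set = train_test_split(accessions)
print(len(train_set), len(test_set))
```

A stratified split, drawing 80% within each cluster, would keep the small clusters represented in both subsets and may be preferable when group sizes are as unbalanced as they are here.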
The adjusted means for grain yield (GY) and Oil derived from multi-environment trials (MET) of the 70 A × R hybrids were regressed on Euclidean distance estimates obtained from PC1, PC2 and PC3 to evaluate their association (Supplementary Table 2). The performance of these hybrids was also evaluated in relation to the HCA groups to which their parental lines were assigned. The relationship between the adjusted means of GY and Oil and the classification groups of the parental lines was plotted as a violin plot using the ggplot2 R package (Wickham, 2016).
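The Euclidean distance between the two parents of a hybrid in the space of the first three principal components can be sketched as below. The line names and PC scores are hypothetical placeholders, and whether the scores were scaled before computing distances is not stated in the text.

```python
import math

# Hypothetical PC1, PC2 and PC3 scores for two parental lines
pc_scores = {
    "A_line_1": (1.0, 2.0, 2.0),
    "R_line_1": (0.0, 0.0, 0.0),
}


def parental_distance(female, male, scores=pc_scores):
    """Euclidean distance between two lines in PC1-PC3 space."""
    return math.dist(scores[female], scores[male])


distance = parental_distance("A_line_1", "R_line_1")
print(distance)
```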

Results
Summary statistics for each quantitative descriptor are presented in Table 4.
The PCA results showed that the first three components accounted for 43.7% of the total variation. The first component (PC1) explained the largest share of this variation (21.8%) and ordered the accessions along a gradient of achene size, plant height, number of ray flowers, leaf size (leaf length and width) and weight of 100 achenes. In the opposite direction, it ordered the accessions along a gradient of oil content and kernel percentage. As indicated in Figure 1, accessions with taller plants, wider capitula, longer and wider achenes, longer petioles, more ray flowers and bigger leaves were placed in quadrants 1 (upper right) and 4 (lower right). Most of the confectionary lines were placed in these quadrants. Accessions with higher kernel percentage and oil content were located in quadrants 2 (upper left) and 3 (lower left). The majority of restorer lines were in quadrant 3.
The second component (PC2) explained 11.3% of the variation and ordered the accessions according to the size of the ray flowers, locating the accessions with the largest ray flowers in quadrants 1 and 2. In the same direction, it ordered the accessions by days to flowering, number of leaves per plant and number of achenes per capitulum, so that the accessions with the longest time to flowering, the most achenes per capitulum and the most leaves per plant were also located in quadrants 1 and 2 (Fig. 1). The remaining components, PC3 (10.6%), PC4 (9.8%) and PC5 (6.7%), explained less variation and did not differentiate between accessions. The biplot also showed that group 2 is the most phenotypically diverse and that the analysis could discriminate between maintainer and restorer lines. In addition, it showed a positive association between DTF and NAC, positive associations among the achene dimension variables, and positive associations among the ray flower variables. The Oil vector pointed in the opposite direction to the vectors of the achene dimension variables.
The dendrogram resulting from the HCA, cut at a Euclidean distance of 5.0, showed three groups (S1, S2 and S3) and two ungrouped accessions (Fig. 2). The smallest group (S1) contained 23 accessions, the second (S2) 45 and the third (S3) 151 accessions. The main cluster was formed primarily by maintainer lines from groups 1 and 2 and restorer lines from group 4. The second cluster was formed mainly by accessions derived from group 2 and, to a lesser extent, by lines derived from group 5 and restorer lines. Most of the confectionary lines were concentrated in the smallest cluster. The HCA showed that group 2 was the predominant group in all clusters.
The LDA performed on the three clusters defined in the HCA showed that the discriminant functions obtained were significant in the Wilks test (p-value < 0.001) (Fig. 3). The two LDA functions obtained were:  The variables that differentiated best between accession groups were achene length and weight of 100 achenes, because they obtained the highest coefficients in both functions. The discriminant function LDA1 explained 78.8% of the variability while the function LDA2 accounted for the rest (Fig. 3).
The predictive capacity of the LDA functions obtained was evaluated through the testing data set by using a confusion matrix, showing an accuracy of 89%.
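Overall accuracy is the share of test accessions on the diagonal of the confusion matrix. The sketch below shows the calculation on invented counts; the matrix is a toy example for the three clusters and does not reproduce the study's confusion matrix or its reported accuracy.

```python
def accuracy_from_confusion(matrix):
    """Overall accuracy: correctly classified cases divided by all cases.

    Rows are the true groups and columns the predicted groups.
    """
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy confusion matrix for clusters S1, S2 and S3 (hypothetical counts)
toy_confusion = [
    [5, 0, 0],   # true S1
    [1, 8, 1],   # true S2
    [0, 2, 23],  # true S3
]
toy_accuracy = accuracy_from_confusion(toy_confusion)
print(f"accuracy = {toy_accuracy:.2f}")
```

With clusters of very unequal size, per-group recall (each diagonal count divided by its row total) is a useful complement, because overall accuracy is dominated by the largest group.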
The associations between the adjusted means of GY and Oil and Euclidean distance were not significant. The simple linear regression coefficient of Oil on Euclidean distance was 0.14 and the regression coefficient of GY on Euclidean distance was 16.27. Neither regression model explained the variability of the data; both R² values were close to 0. The adjusted means of Oil for the 70 hybrids evaluated in MET showed no differences associated with the parental group classification obtained by HCA. The highest values were obtained by crosses between lines belonging to the S3 group. For GY, on the other hand, crosses between A lines of group S2 and R lines of group S3 yielded more than crosses between lines belonging to the same group. However, crosses of A lines of group S3 with R lines of group S2 showed the lowest yields (Fig. 4).
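A simple linear regression of hybrid performance on parental distance, together with its R², can be sketched with ordinary least squares as below. The distances and yields are invented toy values chosen to give a weak fit; they are not the study's data and the function name is hypothetical.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x, with R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - residual sum of squares / total sum of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Toy data: parental Euclidean distance and hybrid grain yield (kg/ha)
toy_distance = [1.0, 2.0, 3.0, 4.0]
toy_yield = [2500.0, 2400.0, 2600.0, 2500.0]
slope, intercept, r_squared = simple_linear_regression(toy_distance, toy_yield)
print(slope, intercept, r_squared)
```

The toy fit shows how a positive slope can coexist with a low R²: the coefficient alone says little about predictive value unless the explained variance is also reported.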

Discussion
Germplasm collections are valuable resources for crop breeding. To exploit their potential, it is essential to characterize them phenotypically and genotypically to understand the genetic diversity available in order to maximize the genetic gain in the breeding process.
PCA identified the traits that contributed most to explaining the variability in the data. In this study, the variables related to achene size played an important role in the arrangement of the accessions and had the greatest capacity to discriminate between them. Terzić et al. (2020), in a phenotypic study of the UGA-SAM1 population, also reported achene size among the descriptors with the greatest power of genotype discrimination. Dudhe et al. (2019), in a diversity study on a germplasm bank collection, reported that oil content was one of the traits that contributed most to the variability of the data, which was not the case in our work. This difference can be explained by the limited range of variation of this trait in the set of lines selected for this study, since most of them were selected for increased oil content and they do not constitute a germplasm bank collection. DTF was one of the variables with the greatest contribution to the variability explained by the PCs. Similarly, Ayaz et al. (2014) reported that DTF was one of the variables with the greatest influence on the variability of the studied germplasm. In addition, the biplot clearly placed the restorer and maintainer lines in opposite quadrants, defined by their different characteristics in terms of oil content, flowering time and achene size. Similar results were reported by Terzić et al. (2020), who also took the branching trait into account.
The PCA allowed us to interpret in a simple form the information collected through the different descriptors used, to identify groups of materials with certain phenotypic characteristics to direct future crosses, to identify the traits with less variability and to facilitate the search for specific characteristics in the materials available from the breeding program.
The distribution of germplasm in the dendrogram did not follow a clear pattern with regard to the groups defined at the outset of the work. The largest cluster included most of the restorer lines. Other population studies in sunflower germplasm collections reported the preponderance of this characteristic in cluster delimitation (Mandel et al., 2011; Cadic et al., 2013; Filippi et al., 2015; Terzić et al., 2020). This distinction between maintainer and restorer lines would be expected from the breeding strategy of keeping these groups as separate heterotic pools to maximize heterosis in hybrid crosses (Fick and Miller, 1997). The close relationship between the restorer lines can be understood from their common ancestor, the RHA 274 line, which carries the Rf1 gene, one of the main sources of fertility restoration (Cadic et al., 2013). In addition, the similarity between the restorer lines studied is consistent with the fact that all of them are branched. The smallest group of accessions is composed of confectionary lines, which are characterized by their achene size. These lines represent a small group in this study, and their characteristics differ from those of the rest of the accessions because the objectives of confectionary sunflower breeding differ from those of oil sunflower breeding. These results revealed the variability of the lines that form the INTA Pergamino sunflower breeding program despite the strong selection applied to obtain inbred lines for the production of superior hybrids.
The discriminant analysis provided functions that allow us to classify the accessions into three groups, which will permit a more efficient use of the genetic resources and the preservation of the genetic variability of the germplasm. In addition, it is a low-cost tool for making better use of the phenotypic information generated in each growing season. In the same sense, Terzić et al. (2020) highlight the importance of using regularly collected phenotypic data in breeding programs. They propose including traits related to leaf variation, seed coloration and certain flower traits among the germplasm selection criteria for breeding, in order to preserve the variability of the populations used in breeding. The clusters defined by the HCA and used in the LDA differed from the groups described at the beginning of the paper; this might be because both analyses were performed using only the information provided by the morphological descriptors.
Developing hybrids is a costly long-term process; therefore, being able to predict heterosis is desirable. There are numerous papers in the literature on the relationship between genetic distance and heterosis or the response of traits such as yield and oil content. In sunflower, Cheres et al. (2000) found no satisfactory results in predicting grain yield as a function of genetic distance using molecular markers and coefficients of coancestry. Similarly, Reif et al. (2013) reported no correlation between the genetic distance obtained from a genomic matrix of 572 AFLP markers and grain yield, oil yield and oil content for a set of A × R hybrids. In contrast, Darvishzadeh (2012) reported high correlations between distances obtained from morphological and molecular data and the yield of sunflower F1 hybrids. Hladni et al. (2018), in a study carried out on interspecific crosses of sunflower, demonstrated positive associations between genetic distance (obtained from a matrix of 37 SSR and 1 STS markers) and grain yield. The results obtained in our work indicate no relationship between the distances obtained and the performance of the hybrids we evaluated. In other crops such as chickpea, cotton and corn, low correlations have also been found between the distance of the parental genotypes and hybrid performance (Ajmone Marsan et al., 1998; Meredith and Brown, 1998; Sant et al., 1999). In addition, Dias dos Santos et al. (2004) suggested that the inconsistent results obtained for the relationship between genetic distance and heterosis could be explained by the narrow genetic base of the germplasm used. Considering the genomic tools currently available, there is a clear need to deepen research on the relationship between genetic distance and heterosis in order to make parental selection in breeding programs more efficient.
The higher oil content obtained by the hybrids whose parental lines belong to the S3 group can be explained by the additive action of the genes that control this trait. Similar results were obtained by Reif et al. (2013), who evaluated a group of hybrids formed by crosses between and within heterotic groups (A × B, A × R and R × R) and found no heterosis for oil content in any of the crosses. For grain yield, on the other hand, hybrids obtained from crosses between lines of groups S2 and S3 had a higher grain yield. This tendency can be explained by heterosis, which is enhanced by the greater genetic distance between the lines of these groups. The fact that the reciprocal crosses (A lines of S3 × R lines of S2) showed a lower yield suggests the need for a further definition of the germplasm groups. The inclusion of molecular data in future studies on this set of lines, together with more extensive testing of the hybrids, would probably allow a more precise differentiation of the groups.
In summary, this work points out the need for more in-depth study of genetic variability to be used as a predictor of heterosis. Future genotypic characterizations of the sunflower accessions of the EEA Pergamino germplasm are proposed in order to develop predictive models that allow greater efficiency in the selection of parental lines for the production of superior hybrids exploiting heterosis. The results obtained also confirm the importance of INTA EEA Pergamino's sunflower breeding program in the development of improved, locally adapted and diverse materials as valuable resources for sunflower research.
