Genome-wide meta-analyses of restless legs syndrome yield insights into genetic architecture, disease biology and … – Nature.com

Ethics statement

All studies were approved by the respective local ethical committees, and all participants provided informed consent. The EU-RLS-GENE study was approved by an institutional review board at the University Hospital of the Technical University of Munich (2488/09). The INTERVAL dataset was approved by the National Research Ethics Service Committee East of EnglandCambridge East (REC 11/EE/0538). Participants of 23andMe provided informed consent under a protocol approved by the external AAHRPP-accredited IRB, Ethical and Independent (E&I) Review Services. As of 2022, E&I Review Services is part of Salus IRB (https://www.versiticlinicaltrials.org/salusirb). The deCODE dataset was approved by the National Bioethics Committee of Iceland. The Danish Blood Donor Study (DBDS) dataset was approved by the Scientific Ethical Committee of Central Denmark (M-20090237) and by the Danish Data Protection agency (30-0444). GWAS studies in the DBDS were approved by the National Ethical Committee (NVK-1700407). The Emory dataset was approved by an institutional review board at Emory University, Atlanta, GA, USA (HIC ID 133-98).

Some of the samples were included already in our previous GWAS meta-analysis3. The reported sample numbers are the final sample numbers after quality control. Additional details are provided in the Supplementary Note.

RLS cases were recruited in specialized outpatient clinics for movement disorders and in sleep clinics in European countries (Austria, Czech Republic, Estonia, Finland, France, Germany and Greece), Canada (Quebec) and the USA. RLS was diagnosed in a face-to-face interview by an expert neurologist or sleep specialist based on IRLSSG diagnostic criteria1. Controls were either population-based unscreened controls (Austria, Estonia, Finland, France, Germany) or healthy individuals recruited in hospitals (Canada, Czech Republic, Greece, USA). A total of 6,228 cases and 10,992 ancestry-matched controls had been genotyped on the Axiom array and were the study sample used in our previous meta-analysis. For the current study, 1,020 cases and 8,810 ancestry-matched controls were added who were genotyped on the Infinium Global Screening Array-24 version 1.0. Genotype calling was performed in GenomeStudio 2.0 according to the GenomeStudio Framework User Guide, and identical quality-control criteria were used for both datasets. Imputation was performed on the UK10K haplotype and 1000 Genomes Phase 3 reference panel using the EAGLE2 (version 2.0.5) and PBWT (version 3.1) imputation tools as implemented in the Sanger imputation server. Imputed SNPs with pHWE1105 or an INFO score <0.5 were filtered out.

The INTERVAL study includes whole-blood donors recruited in England between 2012 and 2014. The Cambridge-Hopkins Restless Legs questionnaire was used to define RLS cases, and probable and definite cases were combined to form a binary phenotype as described previously3. A detailed description of Axiom Biobank array genotyping and the imputation procedure plus related quality control in the INTERVAL trial can be found elsewhere34. Briefly, imputation was performed using a joint UK10K and 1,000 Genomes Phase 3 (May 2013 release) reference panel via the Sanger imputation server, and variants with MAF0.1% and INFO score0.4 were retained for analysis.

This study includes research participants of 23andMe who agreed to participate in research studies. The RLS phenotype was defined by self-reported responses to survey questions that assessed whether someone had ever been diagnosed with RLS or had ever received treatment for RLS as described previously3. Participants were genotyped on one of five platforms, all using Illumina arrays with added custom content (HumanHap550+ BeadChip, OmniExpress+ BeadChip, Infinium Global Screening Array). Participant genotype data were imputed in a two-step procedure using a reference panel created by combining the May 2015 release of the 1000 Genomes Phase 3 haplotypes with the UK10K imputation reference panel. Pre-phasing was carried out using either the internally developed tool Finch, which implements the Beagle algorithm, or EAGLE2. Imputation was performed with Minimac3.

This cohort includes only individuals who had not been part of the 23andMe GWAS used in the discovery meta-analysis. Cases and controls were defined as described above.

Individuals in this cohort do not overlap with samples included in the INTERVAL GWAS used in the discovery meta-analysis. RLS status was assessed with a single question on having received a diagnosis of RLS.

For 23andMe and INTERVAL, genotyping and imputation was carried out as described for the discovery stage.

This dataset included the DBDS, a cohort from deCODE Genetics, Iceland, the Emory Hospital Atlanta, USA and the Donor InSight-III study. Phenotyping and genotyping procedures have been described in detail previously4.

First, the Axiom- and the GSA-genotyped datasets were analyzed separately using SNPTEST version 2.5.4 with genotype dosages and assuming an additive model. Age, sex and the first ten PCs from the MDS analysis in PLINK were included as covariates. These summary statistics of the two datasets were then combined by fixed-effect inverse-variance meta-analysis (STERR scheme) using METAL (release 2011-03-25)35. One round of genomic control was performed in each dataset before meta-analysis.

Assuming an additive genetic model, genotype dosages were analyzed in SAIGE (0.35.8.8) using a linear mixed model to account for cryptic relatedness and saddle point approximation to account for casecontrol imbalance36. Age, sex and the first ten PCs of ancestry were included as potential genomic confounders. The analysis was restricted to genetic variants with MAF0.001, INFO0.4 and a minor allele count of 10.

Association analysis was conducted by logistic regression (LRT) assuming additive allelic effects and imputed dosages. Age, sex, genotyping platform and the first ten PCs were included as covariates.

In all individual GWAS, sex-specific analyses were performed using the same pipelines as those for the pooled analyses minus adjustment for sex as a covariate.

We applied the same methods for both the pooled and the sex-specific GWAS. The three independent datasets were combined in a multivariate GWAS meta-analysis using the N-weighted-GWAMA R function (version 1.2.6)37. To assess the possibility of heterogeneity of SNP effects between the studies, Cochrans Q-test was applied as described in METAL.

Data for the X chromosome were available in two of the discovery-stage datasets: EU-RLS-GENE and 23andMe.

For the pooled association analysis, male genotypes were coded as 0/2 (assuming no dosage compensation in males). All other methods were identical to those of the autosomal analyses. In sex-stratified analyses, males were coded as 0/1 and females as 0/1/2.

In both pooled and sex-stratified analyses, males were coded as 0/2 and females as 0/1/2.

Pooled and sex-specific meta-analyses were performed using the N-GWAMA R function as in the autosomal analysis. Because N-GWAMA operates with Z scores, the type of male allele coding did not affect the results.

We performed sex-specific (male-only and female-only) meta-analyses of the corresponding GWAS using the N-GWAMA approach as described above. The results were used to estimate sex-specific heritability and genetic correlation between the sexes.

To detect sex-specific effects, we tested all independent (r2<0.2) genome-wide significant SNPs of the pooled and sex-specific meta-analyses for heterogeneity of effect sizes between the two sexes using Cochrans Q-test (one-sided) and a Bonferroni-corrected significance threshold of Padj0.05/221.

For 23andMe and INTERVAL, quality control and statistical analysis were performed as described for the discovery stage. Statistical analysis for the DBDS, deCODEEmory and Donor Insight studies has been described previously4. Meta-analysis was performed using Han and Eskins random-effects model in METASOFT (RE2, METASOFT version 2.0.1)38.

To define independent risk loci, we first used the --clump command in PLINK (version 1.90b6.7)39 to collapse multiple genome-wide significant association signals based on linkage disequilibrium (LD) and distance (clump-r2>0.05, clump-kb<500kb clump-p1<5108, clump-p2p-value<105). We then performed conditional analyses to identify secondary independent signals in risk loci using GCTA (version 1.93.0beta) with the -cojo-slct option, the P-value threshold for genome-wide significance set at 5108, the distance window set at 10Mb and the colinearity cutoff set at 0.9 (ref. 40). LD was derived from EU-RLS-GENE genotype data. Independent genome-wide significant signals were merged into one genomic risk locus if either their LD block distance was <500kb or their clumped regions were overlapping.

Heritability is reported on the liability scale unless otherwise indicated. Prevalence estimates were derived from the population cohorts INTERVAL and 23andMe themselves. For the EU-RLS-GENE casecontrol dataset and for the meta-analysis, prevalence estimates were derived from previous publications on European ancestries.

We estimated SNP-based heritability under several different heritability models. LDSC (version 1.0.1) was used with standard settings, invoking a model where SNPs with different MAFs are expected to contribute equally to heritability41. LDAK (version 5.0) was used with standard settings to implement the LDAK model, where SNP contributions depend on LD structure and MAF as well as the BLD-LDAK and BLD-LDAK+Alpha models, which incorporate additional annotation-based features42. All analyses were based on summary statistics and filtering according to LDSC default settings, that is, HapMap3 non-HLA SNPs with MAF>0.01 and INFO0.9. The Akaike information criterion of each of these models was reported for model comparison. Further details are provided in the Supplementary Note.

For X chromosome heritability estimation, we followed the approach described by Lee et al. and used the summary statistics of the N-GWAMA meta-analysis43. For sex k, the SNP heritability ({h}_{k}^{2}) relates to the expected 2 statistics as ({mathbb{E}}({chi }_{k}^{2})approx 1+{N}_{k}{h}_{k}^{2}/{M}_{{rm{eff}}}), where Nk is the GWAS sample size, and Meff is the effective number of loci within the examined genomic region (assumed to be the same in males and females). For calculation of the (sex-specific) relative heritability contribution of the X chromosome, 2 statistic-based h2 was also calculated for the autosomes.

For autosomal data, genetic correlations were calculated using LDSC (version 1.0.1) using the same SNP filtering criteria and the two-step estimation option as in the heritability estimation. Because the LDSC framework is not applicable for chromosome X, the genetic correlation coefficient ({hat{r}}_{rm{g}}) was estimated as ({hat{r}}_{rm{g}}=,frac{widehat{{Z}_{rm{m}}{Z}_{rm{f}}}}{sqrt{(;{hat{chi }}_{rm{f},}^{2}-,1)(;{hat{chi }}_{rm{m},}^{2}-,1)}}), where Z and 2 are the Z scores and mean 2 estimates from the female (f) and male (m)-specific studies.

In addition to between-study and between-sex genetic correlations, we performed a large-scale genetic correlation screen for RLS (represented by the pooled autosomal meta-analysis data) and other traits using LDSC as described above. Sources and filtering criteria for summary statistics included in this screen are provided in the Supplementary Note.

Traits significantly correlated with RLS (FDR<0.05, one-sample two-sided Z-test) were taken forward to a bi-serial genetic correlation analysis. Here, we computed the pairwise ({hat{r}}_{rm{g}}) between all traits.

An unsigned weighted correlation matrix was built using the pairwise ({hat{r}}_{rm{g}}) and used as input for a weighted correlation matrix analysis to perform hierarchical clustering and to detect modules with the WGCNA package (version 1.69)44. The following settings were applied in WGCNA: softPower, 6; network type, unsigned; TOMDenom, min; Dynamic-cutree, method=hybrid; deepSplit, 2; minModuleSize, 30; pamStage, TRUE; pamRespectsDendro, FALSE; useMedoids, FALSE. The defining trait categories in each module were determined by consensus through independent review of the within-module cluster structure by visual inspection of network plots at two sites (Helmholtz and Cambridge).

To select traits for MR, we defined two to eight clusters in a module based on its complexity. In each cluster, the traits were ranked according to the significance of their correlation with RLS, and we selected the most significantly correlated medical conditions or potentially modifiable lifestyle factors. We supplemented this list with traits for which an association with RLS has been described in the literature.

Using R version 4.0.4, we filtered GWAS datasets to uncorrelated SNPs (r2<0.01 in the European 1000 Genomes Phase 3 data), aligned them to GRCh37 and mapped them to dbSNP 153 with the gwasvcf package (version 0.1.0). We harmonized effect alleles across studies using the TwoSampleMR package (version 0.5.6)45. Palindromic variants with ambiguous allele frequencies and those with unresolved strand issues were excluded from analysis.

To avoid violations of the classical MR assumptions when studying correlated and likely pleiotropic traits, we used a robust method for bidirectional MR, LHC-MR (version 0.0.0.9000)32. Traits with low heritability (h2<2.5%, ({P_{h^2}})>0.05) were excluded from the analysis. Significance of directionality and confounding effect were tested by comparing the goodness of fit of six degenerate LHC-MR models (only latent effect, only causal effect, only causal effect to RLS, only causal effect from RLS, no causal effect to RLS and no causal effect from RLS) to the full model. We supplemented these analyses with those based on the IVW and MR-Egger methods.

All analyses were performed on the N-GWAMA results of the pooled meta-analysis. We applied several complementary approaches to prioritize candidate genes in the genome-wide significant risk loci. These included the gene-prioritization pipeline of DEPICT (version 1.rel194), three prioritization workflows (positional, eQTL-based and topology-based mapping) provided on the FUMA platform (https://fuma.ctglab.nl/, version 1.3.6a), a gene-level GWAS using MAGMA version 1.08, a transcriptome-wide association study using S-PrediXcan and S-MultiXcan (MetaXcan package version 0.7.4), a colocalization analysis with eCAVIAR (version 2.2) and statistical fine-mapping with CAVIARBF (version 0.2.1)46,47,48,49,50,51,52. In the DEPICT, FUMA eQTL-based mapping, MAGMA and transcriptome-wide association study analyses, a gene was considered prioritized if it had an FDR <0.05; in FUMA topology-based mapping, if it had an FDR <1105; and in eCAVIAR, if it had a colocalization posterior probability >0.1. In FUMA positional mapping, a gene was considered prioritized if genome-wide significant SNPs physically mapped to it. In statistical fine-mapping, a gene was considered prioritized if an SNP in the 95% credible set of the risk locus could be linked to it by either eQTL, chromatin interaction or positional mapping. In addition, we checked whether a gene contained genome-wide significant coding variants (the gene was considered prioritized if it did) and whether a gene mapped to a gene set that was significant in our enrichment analyses (the gene was considered prioritized if it did). We combined the results of all approaches per gene in a prioritization score by summing up the individual results, counting not prioritized as 0 and prioritized as 1. Further details are provided in the Supplementary Note.

We ran DEPICT to detect enrichment of gene sets across risk loci as well as to identify tissue and cell types where expression is enriched for genes across risk loci. We set the significance thresholds for lead SNPs at 1105 and at 5104 for null GWAS; all other settings were the same as those used for gene prioritization (see above). DEPICT was run with all built-in datasets. eQTL mapping and functional prioritization were evaluated in DEPICTs built-in eQTL and reconstituted gene sets.

Excluding 12 SNPs not reaching genome-wide significance in the joint analysis of discovery and validation did not change the main results (Supplementary Table 25).

MAGMA (version 1.08) was used to perform gene set enrichment testing for pathway identification. MAGMA conducts competitive gene set tests with correction for gene size, variant density and LD structure. A total of 7,522 gene sets representing the GO biological process ontology (MSigDB version 7.1, C5 collection, GO:BP subset) were tested for association. We adopted a significance threshold of FDR<0.05 (one-sided t-test).

Using the settings described above, we tested enrichment of RLS heritability with DEPICT across 209 different tissue types covered in the built-in dataset. For an independent validation on the tissue level as well as for the analyses on the cell type level, we mainly used the CELLEX and CELLECT tools53. CELLECT provides two different gene-prioritization approaches for heritability enrichment testing, S-LDSC and MAGMA covariate analysis54,55. For compatibility of the results, the summary statistics of the pooled N-GWAMA analysis were filtered using settings identical to those in our LDSC heritability analyses. Following the recommendations by Timshel et al.53, we applied a tiered approach by starting with body-wide datasets and then focusing on CNS-centric datasets. We used CELLECT software (version 1.3.0) with default settings but updated to MAGMA version 1.08 to test enrichment of RLS heritability in cell type- or tissue-specific genes for datasets with publicly available RNA-seq data. These analyses require a measure of expression specificity for each gene in a cell or tissue type. We either used CELLEX (version 1.2.1) to compute expression specificity or relied on precomputed CELLEX expression specificity scores. Human adult datasets without publicly available raw RNA-seq data were analyzed using MAGMA_Celltyping (version 2.0.0) in top10 mode. The list of input datasets is provided in the Supplementary Note, and results of our evaluation of both approaches showing high correlation are presented in Supplementary Fig. 1 and Supplementary Table 26.

We applied three types of models for genetic risk evaluation and RLS risk prediction: GLM with and without interaction terms, RF models and DNN models. These were implemented as binary classifiers as well as time-to-event classifiers.

Training of the models and evaluation by tenfold cross-validation were based on the EU-RLS-GENE Axiom subset. Therefore, we first conducted a meta-analysis excluding this dataset to generate unbiased summary statistics to be used in all models. Because GWAS have an ascertainment bias, we constructed a simulation cohort dataset by resampling of the EU-RLS-GENE Axiom subset based on the year of birth of the sampled individuals, their ages at onset and the demographic composition of the German population (Supplementary Note). We calculated the PRS using dosages of 216 independent lead SNPs of our discovery meta-analyses.

For a baseline comparison of the predictive power of this score to a PRS based on genome-wide data, we calculated a genome-wide PRS using the LDpred2-auto option of LDpred2 (R package bigsnpr version 1.12.2)56. Variants and the LD reference panel were based on the HapMap3 EUR dataset, and window size for calculating SNP correlation was set to 3cM.

Binary classification models were evaluated by Nagelkerkes pseudo-R2, receiver operator characteristic AUC and precisionrecall AUC. A 5-year binary classifier was constructed for each of the time-to-event models by predicting the label until the next 5 years and evaluated by the metrics for binary classification.

To evaluate the contribution of the interaction effects to model performance, we estimated the effect sizes of interaction terms such as PRSage by logistic regression:

$$begin{array}{l}P({rm{RLS}}=1|{rm{PRS}},{rm{sex}},{rm{age}},{bf{PC}})\=displaystylefrac{1}{1+{e}^{-left({beta }_{0}+{beta }_{1}{rm{PRS}}+{beta }_{2}{rm{sex}}+{{beta }}_{3}{rm{age}}+{beta }_{4}{rm{PRS}}times {rm{sex}}+{{beta }}_{5}{ {rm{PRS}timesrm{age}}}+{{beta }}_{6}{{rm{sex}}timesrm{age}} +{{beta }}_{7}{{rm{PRS}}times {rm{sex}timesrm{age}}}+{boldsymbol{gamma }}cdot{bf{PC}}right)}},end{array}$$

where age is the dummy variable of age in bins of 20 years, PC indicates the first ten PCs from the MDS analysis in PLINK, is a vector of effect sizes of PCs and the PRS=jwjgj, where wj and gj are the per-allele effect size and dosage of the j-th SNP, respectively.

For the DNN and RF models, we used these logistic regression estimates as the baseline and then further estimated the interaction effect sizes indirectly by calculating the incremental gain in explained variance (Nagelkerkes pseudo-R2) from model0 to model1 as:

$${R}^{2}=left(1-left(Lleft(rm{model}_{0}right)/{it{L}}(rm{model}_{1})right)^{frac{2}{it{N}}}right)left(1-{it{L}}(rm{model}_{0})^{frac{2}{it{N}}}right)^{-1},$$

where L is the likelihood function for a logistic regression model with the first ten PCs included as covariates.

Binary classification models, GLMs and RF and DNN models were built, optimized and trained by H2O AutoML (version 3.36.0.2) in R (version 4.0.2)57. Time-to-event models were implemented with randomForestSRC (version 3.0.1) in R (version 4.0.2) and PyTorch58 (pycox version 0.2.1 and PyTorch version 1.6.0). Cross-validation-based Nagelkerkes pseudo-R2 was calculated in R version 4.0.2.

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Original post:

Genome-wide meta-analyses of restless legs syndrome yield insights into genetic architecture, disease biology and ... - Nature.com

Related Posts