Title: | Removal of Unwanted Technical Variation from UK Biobank NMR Metabolomics Biomarker Data |
---|---|
Description: | A suite of utilities for working with the UK Biobank <https://www.ukbiobank.ac.uk/> Nuclear Magnetic Resonance spectroscopy (NMR) metabolomics data <https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220>. Includes functions for extracting biomarkers from decoded UK Biobank field data, removing unwanted technical variation from biomarker concentrations, computing an extended set of lipid, fatty acid, and cholesterol fractions, and for re-deriving composite biomarkers and ratios after adjusting data for unwanted biological variation. For further details on methods see Ritchie SC et al. Sci Data (2023) <doi:10.1038/s41597-023-01949-y>. |
Authors: | Scott C Ritchie [aut, cre] (0000-0002-8454-9548) |
Maintainer: | Scott C Ritchie <[email protected]> |
License: | MIT + file LICENSE |
Version: | 3.2.0 |
Built: | 2025-01-21 05:41:27 UTC |
Source: | https://github.com/sritchie73/ukbnmr |
For the 76 biomarker ratios computed by compute_extended_ratios()
,
aggregates the biomarker QC flags from the biomarkers composing each ratio
(see nmr_info
), which can be useful for determining the reason
underlying missing values in the biomarker ratios.
compute_extended_ratio_qc_flags(x)
compute_extended_ratio_qc_flags(x)
x |
|
If your UK Biobank project only has access to a subset of biomarkers, then this function will only return the subset of ratios that can be computed from the biomarker data provided.
Biomarker QC Flags in the input data are also returned alongside those aggregated by this function for the computed biomarker ratios.
a data.frame
with QC flags aggregated for all computed
biomarker ratios.
nmr_info
for list of computed biomarker ratios and
extract_biomarkers()
for details on how raw data from
ukbconv
is processed.
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package biomarker_qc_flags <- compute_extended_ratio_qc_flags(ukb_data)
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package biomarker_qc_flags <- compute_extended_ratio_qc_flags(ukb_data)
Computes 76 additional ratios not provided by the Nightingale platform. These
include lipid fractions in HDL, LDL, VLDL, and total serum, as well as
cholesterol fractions, and omega to polyunsaturated fatty acid ratios. See
nmr_info
for details.
compute_extended_ratios(x)
compute_extended_ratios(x)
x |
|
If your UK Biobank project only has access to a subset of biomarkers, then this function will only return the subset of ratios that can be computed from the biomarker data provided.
All biomarkers in the input data are also returned alongside the ratios computed by this function.
a data.frame
with the additional computed biomarker ratios.
nmr_info
for list of computed biomarker ratios,
compute_extended_ratio_qc_flags()
for obtaining an
aggregate of the biomarker QC flags from the biomarkers underlying each
computed ratio, and extract_biomarkers()
for details on how
raw data from
ukbconv
is processed.
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package nmr <- compute_extended_ratios(ukb_data)
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package nmr <- compute_extended_ratios(ukb_data)
Given an input data.frame
loaded from a dasaset of
NMR metabolomics QC indicator fields
extracted by the Table Exporter tool on the
UK Biobank Research Analysis Platform,
this function extracts the
quality control (QC) flags for the NMR metaolomics biomarkers
giving them short variable names as listed in the nmr_info
information data sheet
available in this package. QC Flags are separated by "; " in each column where
there are multiple QC Flags for a single measurement.
extract_biomarker_qc_flags(x)
extract_biomarker_qc_flags(x)
x |
|
#' Data sets extracted on the UK Biobank Research Analysis Platform have one row per UK Biobank participant, whose project specific sample identifier is given in the first column named "eid". Columns following this follow a naming scheme based on the unique identifier of each field, assessment visit, and (optionally if relevant) repeated measurement of "p<field_id>_i<visit_index>_a<repeat_index>". For example, the QC flags for the measurement of 3-Hydroxybutyrate at baseline assessment has the column names "p23774_i0_a0", "p23774_i0_a1", and "p23774_i0_a2"; indicating that the 3-Hydroxybutyrate measurement can have up to three QC flags per sample at baseline assessment. Measurements for blood samples collected at the first repeat assessment have 1 in the visit index, e.g. for 3-Hydroxybutyrate at the first repeat assessment there are three columns "p23774_i1_a0", "p23774_i1_a1", "p23774_i1_a2".
The data.frame
returned by this function gives each field a unique
recognizable name, with measurements from baseline and repeat assessment
given in separate rows. The "visit_index" column immediately after the "eid"
column indicates whether the biomarker measurement was quantified from the
blood samples taken at baseline assessment (visit_index == 0) or first repeat
assessment (visit_index == 1). Where multiple QC flags were present at the
same measurement, these are collated into a single entry with the multiple QC
flags separated by a "; ". Rows are uniquely identifiable by the combination
of entries in columns "eid" and "visit_index".
This function will also work with data predating the Research Analysis Platform,
including data sets extracted by the ukbconv
tool and/or the ukbtools
R package.
If your UK Biobank project only has access to a subset of biomarkers, then this function will only return the subset of ratios that can be computed from the biomarker data provided.
A data.table
will be returned instead of a data.frame
if the
the user has loaded the package into their R session.
a data.frame
or data.table
with column names "eid",
and "visit_index" followed by columns for each biomarker
e.g. "bOHbutyrate", ..., "Valine".
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package biomarker_qc_flags <- extract_biomarker_qc_flags(ukb_data)
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package biomarker_qc_flags <- extract_biomarker_qc_flags(ukb_data)
Given an input data.frame
loaded from a dasaset of
NMR metabolomics fields
extracted by the Table Exporter tool on the
UK Biobank Research Analysis Platform,
this function extracts the NMR metabolomics biomarkers giving them short variable
names as listed in the nmr_info
information data sheet
available in this package.
extract_biomarkers(x)
extract_biomarkers(x)
x |
|
Data sets extracted on the UK Biobank Research Analysis Platform
have one row per UK Biobank participant, whose project specific sample
identifier is given in the first column named "eid". Columns following this
follow a naming scheme
based on the unique identifier of each field, assessment visit, and (optionally if relevant)
repeated measurement of "p<field_id>_i<visit_index>_a<repeat_index>". For example,
the measurement of 3-Hydroxybutyrate at baseline assessment has the column name
"p23474_i0". For the UKB NMR data, measurements are available at baseline assessment
and the first repeat assessment (e.g. "p23474_i1"). For the UKB NMR data, the
<repeat_index> is reserved for cases where biomarker measurements have more than
one QC Flag (see extract_biomarker_qc_flags()
).
The data.frame
returned by this function gives each field a unique
recognizable name, with measurements from baseline and repeat assessment
given in separate rows. The "visit_index" column immediately after the "eid"
column indicates whether the biomarker measurement was quantified from the
blood samples taken at baseline assessment (visit_index == 0) or first repeat
assessment (visit_index == 1). Rows are uniquely identifiable
by the combination of entries in columns "eid" and "visit_index".
This function will also work with data predating the Research Analysis Platform,
including data sets extracted by the ukbconv
tool and/or the ukbtools
R package.
If your UK Biobank project only has access to a subset of biomarkers, then this function will only return the subset of ratios that can be computed from the biomarker data provided.
A data.table
will be returned instead of a data.frame
if the
the user has loaded the package into their R session.
a data.frame
or data.table
with column names "eid",
and "visit_index", followed by columns for each biomarker
e.g. "bOHbutyrate", ..., "Valine".
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package nmr <- extract_biomarkers(ukb_data)
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package nmr <- extract_biomarkers(ukb_data)
Given an input data.frame
loaded from a dasaset of
NMR metabolomics processing fields
extracted by the Table Exporter tool on the
UK Biobank Research Analysis Platform, this function
extracts the
sample quality control flags for the NMR metabolomics biomarker data giving them short variable names as listed in the
sample_qc_info
information data sheet available in this package.
extract_sample_qc_flags(x)
extract_sample_qc_flags(x)
x |
|
Data sets extracted on the UK Biobank Research Analysis Platform
have one row per UK Biobank participant, whose project specific sample
identifier is given in the first column named "eid". Columns following this
follow a naming scheme
based on the unique identifier of each field, assessment visit, and (optionally if relevant)
repeated measurement of "p<field_id>_i<visit_index>_a<repeat_index>". For example,
the Shipment Plate for each sample collected at baseline assessment has the
column name "p23649_i0". For the UKB NMR data, measurements are available at
baseline assessment and the first repeat assessment (e.g. "p23649_i1"). For the
UKB NMR data, the <repeat_index> is reserved for cases where biomarker
measurements have more than one QC Flag (see extract_biomarker_qc_flags()
).
The data.frame
returned by this function gives each field a unique
recognizable name, with measurements from baseline and repeat assessment
given in separate rows. The "visit_index" column immediately after the "eid"
column indicates whether the biomarker measurement was quantified from the
blood samples taken at baseline assessment (visit_index == 0) or first repeat
assessment (visit_index == 1). Rows are uniquely identifiable
by the combination of entries in columns "eid" and "visit_index".
This function will also work with data predating the Research Analysis Platform,
including data sets extracted by the ukbconv
tool and/or the ukbtools
R package.
If your UK Biobank project only has access to a subset of biomarkers, then this function will only return the subset of ratios that can be computed from the biomarker data provided.
A data.table
will be returned instead of a data.frame
if the
the user has loaded the package into their R session.
a data.frame
or data.table
with column names "eid"
and "visit_index", followed by columns for each sample
QC tag, e.g. "Shipment.Plate", ..., "Low.Protein".
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package sample_qc_flags <- extract_sample_qc_flags(ukb_data)
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package sample_qc_flags <- extract_sample_qc_flags(ukb_data)
Contains details on the Nightingale biomarkers available in UK Biobank and computed by this package.
nmr_info
nmr_info
A data table with 325 rows and 8 columns:
Short column name assigned to the biomarker
Biomarker description, matching the description field provided by UK Biobank and Nightingale Health
Units of measurement for the biomarker ("mmol/L", "g/L", "nm", "degree", "ratio", or "%")
Biomarker type ("Non-derived", "Composite", "Ratio", or "Percentage")
Biomarker group as provided by Nightingale Health
Biomarker sub-group as provided by Nightingale Health
Logical. Indicates biomarker is quantified by the Nightingale Health platform
Field ID in UK Biobank, see https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220
Field ID in UK Biobank for the biomarker QC Flags, see https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220
For composite biomarkers and ratios, details formula through which the biomarker can be derived from the 107 non-derived biomarkers
Simplified form of the full formula most clearly expressing how each composite biomarker and ratio can be rederived from other biomarkers
For the 61 composite biomarkers, 81 Nightingale biomarker ratios, and 76
extended biomarker ratios computed by recompute_derived_biomarkers()
,
aggregates the biomarker QC flags from the underlying biomarkers
(see nmr_info
).
recompute_derived_biomarker_qc_flags(x)
recompute_derived_biomarker_qc_flags(x)
x |
|
If your UK Biobank project only has access to a subset of biomarkers, then this function will only return the subset of ratios that can be computed from the biomarker data provided.
Biomarker QC Flags in the input data are also returned alongside those aggregated by this function for the computed biomarker ratios.
a data.frame
with QC flags aggregated for all computed
biomarkers and ratios.
nmr_info
for list of computed biomarker ratios and
extract_biomarkers()
for details on how raw data from
ukbconv
is processed.
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package biomarker_qc_flags <- recompute_derived_biomarker_qc_flags(ukb_data)
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package biomarker_qc_flags <- recompute_derived_biomarker_qc_flags(ukb_data)
When adjusting biomarkers for unwanted biological covariates, it is desirable
to recompute composite biomarkers and ratios to ensure consistency in the
adjusted dataset. This function will compute all composite biomarkers and
ratios from their parts (see nmr_info
for biomarker details).
recompute_derived_biomarkers(x)
recompute_derived_biomarkers(x)
x |
|
If your UK Biobank project only has access to a subset of biomarkers, then this function will only return the subset of ratios that can be computed from the biomarker data provided.
All biomarkers in the input data are also returned alongside those computed by this function.
a data.frame
with all composite biomarkers and ratios
(re)computed from the 107 non-derived biomarkers (see nmr_info
for details).
nmr_info
for list of computed biomarker ratios,
recompute_derived_biomarker_qc_flags()
for obtaining an
aggregate of the biomarker QC flags from the biomarkers underlying each
computed biomarker, and extract_biomarkers()
for details on
how the raw field data extracted by the Table Exporter tool is processed.
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package bio_qc <- recompute_derived_biomarkers(ukb_data)
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package bio_qc <- recompute_derived_biomarkers(ukb_data)
Remove technical variation from NMR biomarker data in UK Biobank.
remove_technical_variation( x, remove.outlier.plates = TRUE, skip.biomarker.qc.flags = FALSE, version = 3L )
remove_technical_variation( x, remove.outlier.plates = TRUE, skip.biomarker.qc.flags = FALSE, version = 3L )
x |
|
remove.outlier.plates |
logical, when set to |
skip.biomarker.qc.flags |
logical, when set to |
version |
version of the QC algorithm to apply. Defaults to 3, the latest version of the algorithm (see details). |
Three versions of the QC algorithm have been developed. Version 1 was designed based on the first phase of data released to the public covering ~120,000 UK Biobank participants. Version 2 made several improvements to the algorithm based on the subsequent second public release of data covering an additional ~150,000 participants. Version 3, the default, makes some further minor tweaks primarily so that the algorithm is compatible with the full public data release covering all ~500,000 participants.
Details on the impact of these algorithms on technical variation in the latest UK Biobank data are provided in the package vignette and on the github readme.
## Version 1 Setting version to 1 applies the algorithm as exactly described in Ritchie S. C. *et al.* _Sci Data_ 2023. In brief, this multi-step procedure applies the following steps in sequence:
1. First biomarker data is filtered to the 107 biomarkers that cannot be derived from any combination of other biomarkers. 2. Absolute concentrations are log transformed, with a small offset applied to biomarkers with concentrations of 0. 3. Each biomarker is adjusted for the time between sample preparation and sample measurement (hours). 4. Each biomarker is adjusted for systematic differences between rows (A-H) on the 96-well shipment plates. 5. Each biomarker is adjusted for remaining systematic differences between columns (1-12) on the 96-well shipment plates. 6. Each biomarker is adjusted for drift over time within each of the six spectrometers. To do so, samples are grouped into 10 bins, within each spectrometer, by the date the majority of samples on their respective 96-well plates were measured. 7. Regression residuals after the sequential adjustments are transformed back to absolute concentrations. 8. Samples belonging to shipment plates that are outliers of non-biological origin are identified and set to missing. 9. The 61 composite biomarkers and 81 biomarker ratios are recomputed from their adjusted parts. 10. An additional 76 biomarker ratios of potential biological significance are computed.
At each step, adjustment for technical covariates is performed using robust linear regression. Plate row, plate column, and sample measurement date bin are treated as factors, using the group with the largest sample size as reference in the regression.
## Version 2 Version 2 of the algorithm modifies the above procedure to (1) adjust for well and column within each processing batch separately in steps 4 and 5, and (2) in step 6 instead of splitting samples into 10 bins per spectrometer uses a fixed bin size of approximately 2,000 samples per bin, ensuring samples measured on the same plate and plates measured on the same day are grouped into the same bin.
The first modification was made as applying version 1 of the algorithm to the combined data from the first and second tranche of measurements revealed introduced stratification by well position when examining the corrected concentrations in each data release separately.
The second modification was made to ensure consistent bin sizes across data releases when correcting for drift over time. Otherwise, spectrometers used in multiple data releases would have different bin sizes when adjusting different releases. A bin split is also hard coded on spectrometer 5 between plates 0490000006726 and 0490000006714 which correspond to a large change in concentrations akin to a spectrometer recalibration event most strongly observed for alanine concentrations.
## Version 3 Version 3 of the algorithm makes two further minor changes:
1. Imputation of missing sample preparation times has been improved. Previously, any samples missing time of measurement (N=3 in the phase 2 public release) had their time of measurement set to 00:00. In version 3, the time of measurement is set to the median time of measurement for that spectrometer on that day, which is between 12:00-13:00, instead of 00:00. 2. Underlying code for adjusting drift over time has been modified to accommodate the phase 3 public release, which includes one spectrometer with ~2,500 samples. Version 2 of the algorithm would split this into two bins, whereas version 3 keeps this as a single bin to better match the bin sizes of the rest of the spectrometers.
a list
containing three data.frames
:
A data.frame
with column names "eid",
and "visit_index", containing project-specific sample identifier and
UK Biobank visit index (0 for baseline assessment, 1 for first repeat
assessment), followed by columns for each biomarker containing their
absolute concentrations (or ratios thereof) adjusted for technical
variation. See nmr_info
for information on each biomarker.
A data.frame
with the same format as
biomarkers
with entries corresponding to the
quality
control indicators for each sample. "High plate outlier" and "Low
plate outlier" indicate the value was set to missing due to systematic
abnormalities in the biomarker's concentration on the sample's shipment
plate compared to all other shipment plates (see Details). For
composite and derived biomarkers, quality control flags are aggregates
of any quality control flags for the underlying biomarkers from which
the composite biomarker or ratio is derived.
A data.frame
containing the
processing
information and quality control indicators for each sample, including
those derived for removal of unwanted technical variation by this
function. See sample_qc_info
for details.
A data.frame
containing diagnostic information on
the offset applied so that biomarkers with concentrations of 0 could
be log transformed, and any right shift applied to prevent negative
concentrations after rescaling adjusted residuals back to absolute
concentrations. Should contain only biomarkers with minimum
concentrations of 0 (in the "Minimum" column). "Minimum.Non.Zero"
gives the smallest non-zero concentration for the biomarker.
"Log.Offset" the small offset added to all samples prior to
log transformation: half the mininum non-zero concentration.
"Right.Shift" gives the small offset added to prevent negative
concentrations that arise after rescaling residuals to log
concentrations: this should be at least one order of magnitude
smaller than the smallest non-zero value (i.e. the offset added
should amount to noise in numeric precision for all samples). See
publication for more details.
A data.frame
containing diagnostic
information and details of outlier plate detection. For each of the
107 non-derived biomarkers, the median concentration on each of the
1,352 plates was calculated, then plates were flagged as outliers if
their median value deviated more than expected from the mean of plate
medians. "Mean.Plate.Medians" gives the mean of the plate medians for
each biomarker. "Lower.Limit" and "Upper.Limit" give the values below
and above which plates are flagged as outliers based on their plate
median. See publication for more details.
Version of the algorithm run, either 1, 2, or 3 (default).
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package processed <- remove_technical_variation(ukb_data)
ukb_data <- ukbnmr::test_data # Toy example dataset for testing package processed <- remove_technical_variation(ukb_data)
Contains details on the sample processing and quality control information for the NMR biomarker data in UK Biobank.
sample_qc_info
sample_qc_info
A data table with 18 rows and 3 columns:
Column name assigned to the sample information field
Brief description of the field contents. Further details on the Nightingale sample QC columns can be found in the accompanying resource on the UK Biobank showcase.
Field ID in UK Biobank, see https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=222.
Rows missing UKB.Field.ID
entries correspond to additional sample
processing information derived from these fields and returned by remove_technical_variation
.
Dataset mimicking structure of decoded UK Biobank dataset of NMR metabolomics biomarker concentrations and associated processing variables for testing package functions.
test_data
test_data
A data table with 50 rows and 735 columns with column names "eid" followed by extracted UK Biobank field data of the format "p23649_i0", "p23649_i1", ..., "p23655_i1".
Data in each column has been randomly drawn from the distribution present in the UK Biobank dataset. Importantly, random sampling was performed for each column separately, thus no rows represent real observations or participants in UK Biobank.
@description
This package provides utilities for working with the
UK Biobank MR metabolomics data.
Details are provided below, and in the package vignette (type vignette("ukbnmr")
to view).
There are three groups of functions in this package: (1) data extraction, (2) removal of technical variation, and (3) recomputing derived biomarkers and biomarker ratios.
All functions can be applied directly to raw data extracted from UK Biobank.
This package also provides a data.frame
of biomarker information, loaded
as nmr_info
, and data.frame
of sample processing information,
loaded as sample_qc_info
.
The extract_biomarkers()
function will take a phenotype dataset extracted on
the UK Biobank Research Analysis Platform by the Table Exporter tool, extract the
NMR biomarker fields
and give them short comprehensible column names as described in nmr_info
.
Measurements are also split into multiple rows where a participant has measurements
at both baseline and first repeat assessment.
The extract_biomarker_qc_flags()
function will take a phenotype
dataset extracted on the UK Biobank Research Analysis Platform by the Table
Exporter tool, extract the Nightingale quality control flags
for each biomarker measurement, returning a single column per biomarker
(corresponding to respective columns output by extract_biomarkers()
).
The extract_sample_qc_flags()
function will take a phenotype
dataset extracted on the UK Biobank Research Analysis Platform by the Table
Exporter tool and extract the sample quality control tags
for the Nightingale NMR metabolomics data.
These functions will also work with older datasets predating the UK Biobank Research Analysis Platform, e.g. those extracted by ukbconv, and/or by the ukbtools R package.
The remove_technical_variation()
function will take a phenotype
dataset extracted on the UK Biobank Research Analysis Platform by the Table
Exporter tool, extract all the biomarkers and QC flags, remove the effects of
technical variation on biomarker concentrations, and return a list
containing the adjusted NMR biomarker data, biomarker QC flags, and sample quality control
and processing information.
This applies a multistep process as described in Ritchie et al. 2023:
First biomarker data is filtered to the 107 biomarkers that cannot be derived from any combination of other biomarkers.
Absolute concentrations are log transformed, with a small offset applied to biomarkers with concentrations of 0.
Each biomarker is adjusted for the time between sample preparation and sample measurement (hours) on a log scale.
Each biomarker is adjusted for systematic differences between rows (A-H) on the 96-well shipment plates.
Each biomarker is adjusted for remaining systematic differences between columns (1-12) on the 96-well shipment plates.
Each biomarker is adjusted for drift over time within each of the six spectrometers. To do so, samples are grouped into 10 bins, within each spectrometer, by the date the majority of samples on their respective 96-well plates were measured.
Regression residuals after the sequential adjustments are transformed back to absolute concentrations.
Samples belonging to shipment plates that are outliers of non-biological origin are identified and set to missing.
The 61 composite biomarkers and 81 biomarker ratios are recomputed from their adjusted parts.
An additional 76 biomarker ratios of potential biological significance are computed.
Further details can be found in Ritchie S. C. et al. Quality control and removal of technical variation of NMR metabolic biomarker data in ~120,000 UK Biobank participants, Sci Data 10, 64 (2023). doi: 10.1038/s41597-023-01949-y
The compute_extended_ratios()
function will compute an extended
set of biomarker ratios expanding on the biomarkers available directly from
the Nightingale platform. A companion function, compute_extended_ratio_qc_flags()
,
will aggregate the QC flags for the biomarkers underlying each ratio.
The recompute_derived_biomarkers()
function will recompute all
composite biomarkers and ratios from 107 non-derived biomarkers, which is
useful for ensuring data consistency when adjusting for unwanted biological
variation. This includes the extended biomarker rations computed by the
compute_extended_ratios()
function. A companion function,
recompute_derived_biomarker_qc_flags()
will aggregate the QC
flags for the biomarkers underlying each composite biomarker and ratio.
Maintainer: Scott C Ritchie [email protected] (0000-0002-8454-9548)
Useful links:
Report bugs at https://github.com/sritchie73/ukbnmr/issues