Skip to content

add AMALIA algorithm for lead isotope provenance matching#51

Open
Abagna123 wants to merge 3 commits into
mainfrom
amalia
Open

add AMALIA algorithm for lead isotope provenance matching#51
Abagna123 wants to merge 3 commits into
mainfrom
amalia

Conversation

@Abagna123

Copy link
Copy Markdown
Collaborator

Hi @archaeothommy, I replaced the nested loops with expand.grid and mapply. The function is working and all tests pass (I haven't pushed the test function yet). Since the Argentina database has no measurement errors, I derived absolute 2SD uncertainties for the reference data using the SRM-981 interlaboratory ranges from the paper (0.10%, 0.14%, 0.20%, 0.13%, 0.20%) multiplied by each ratio value.

@archaeothommy archaeothommy left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks fine overall, thank you! The idea of using expand.grid and mapply is great!

I think it the function could be quite significantly simplified, as explained in the comments.

Comment thread R/Amalia_algorithm.R
#' "207Pb/206Pb_err2SD", "208Pb/206Pb_err2SD")`.
#' @param id_sample String with the column name of the sample IDs in `df`.
#' Default is `"ID"`.
#' @param id_ref String with the column name of the reference groups in `ref`.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be better referred to as "group" rather than individual IDs. That's how it must be handled at least (this is admittedly not well documented in the pointcloud-function). We cannot expect that reference data have individual IDs. A table with e.g. the regions as only information except the isotope ratios and analytical values must be sufficient.

Comment thread R/Amalia_algorithm.R Outdated
Comment on lines +161 to +166
diff_1 = abs(as.numeric(df[matched$i, iso[1]]) -
as.numeric(ref[matched$j, iso[1]])),
diff_2 = abs(as.numeric(df[matched$i, iso[2]]) -
as.numeric(ref[matched$j, iso[2]])),
diff_3 = abs(as.numeric(df[matched$i, iso[3]]) -
as.numeric(ref[matched$j, iso[3]])),

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? I thought it is a TRUE/FALSE type of matching. If it is required, why don't you recycle the values calculated in lines 142 and 143?

Comment thread R/Amalia_algorithm.R
Comment on lines +23 to +38
#' @param ratios_204 Character vector of length 3 with column names of the
#' 204Pb-normalised isotope ratios. Must be identical in `df` and `ref`.
#' Default to `c("206Pb/204Pb", "207Pb/204Pb", "208Pb/204Pb")`.
#' @param ratios_206 Character vector of length 3 with column names of the
#' 206Pb-normalised isotope ratios. Must be identical in `df` and `ref`. Only
#' used when `triplet = "206"` or `triplet = "both"`. Default to
#' `c("206Pb/204Pb", "207Pb/206Pb", "208Pb/206Pb")`.
#' @param error_204 Character vector of length 3 with column names of the
#' analytical uncertainty in 2SD for the 204Pb-normalised ratios. Must be
#' identical in both `df` and `ref`. Default to `c("206Pb/204Pb_err2SD",
#' "207Pb/204Pb_err2SD", "208Pb/204Pb_err2SD")`.
#' @param error_206 Character vector of length 3 with column names of the
#' analytical uncertainty (2SD) for the 206Pb-normalised ratios. Must be
#' identical in both `df` and `ref`. Only used when `triplet = "206"` or
#' `triplet = "both"`. Default to `c("206Pb/204Pb_err2SD",
#' "207Pb/206Pb_err2SD", "208Pb/206Pb_err2SD")`.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure that the function requires only the input of the triplet that should be used. I think this applies only to the checks. The input for the other triplet should just be ignored.
It is probably best to make this check based on the value of triplet. I did not saw a check that the column names and the value of triplet are matching.

Comment thread R/Amalia_algorithm.R Outdated
Comment on lines +177 to +209
if (triplet == "204Pb" || triplet == "both") {
matches_204 <- run_amalia(ratios_204, error_204)
}

if (triplet == "206Pb" || triplet == "both") {
matches_206 <- run_amalia(ratios_206, error_206)
}

# Combine results
if (triplet == "204Pb") {
matches <- matches_204
}

if (triplet == "206Pb") {
matches <- matches_206
}

if (triplet == "both") {
if (nrow(matches_204) == 0 || nrow(matches_206) == 0) {
matches <- data.frame()
} else {
pair_204 <- paste(matches_204$sample_id, matches_204$ref_id)
pair_206 <- paste(matches_206$sample_id, matches_206$ref_id)
# Strict intersection — must pass in both triplets
keep_204 <- pair_204 %in% pair_206
keep_206 <- pair_206 %in% pair_204
# Combine diff columns from both triplets for matched pairs
matches <- cbind(
matches_204[keep_204, ],
matches_206[keep_206, 3:5]
)
}
}

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this all can be simplified by having the matching running over all isotope ratios at once if triplet = "both". The algorithm returns FALSE when one value in a triplet doesn't match. Consequently, running the matching of all six ratios at the same time will give the same result as running the matching for each triplet separately and figuring out which matches both triples have in common.
Implementing this would require to use triplet only for determining which isotope ratios should be supplied to run_amalia() and all output of this function can then be handled the same.

Comment thread R/Amalia_algorithm.R Outdated
stop("Reference ID column '", id_ref, "' not found in reference data.")
}

run_amalia <- function(iso, err) {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid undocumented function nesting. See e.g. the ASTR parsing functions how they should be arranged and documented.

Comment thread R/Amalia_algorithm.R Outdated
Comment on lines +214 to +230
if (nrow(matches) > 0) {
n_matches <- as.data.frame(table(matches$sample_id))
names(n_matches) <- c("sample_id", "n_matches")
missing <- setdiff(all_ids, n_matches$sample_id)
if (length(missing) > 0) {
n_matches <- rbind(
n_matches,
data.frame(sample_id = missing, n_matches = 0L)
)
}
n_matches <- n_matches[match(all_ids, n_matches$sample_id), ]
} else {
n_matches <- data.frame(
sample_id = all_ids,
n_matches = 0L
)
}

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to simplify this with ifelse(), probably something similar to:

all_ids$n_matches <- ifelse(all_ids `%in%` matches$sample_id, <get summarised values from table() call>, 0L)

Comment thread R/Amalia_algorithm.R Outdated

if (length(unmatched) > 0) {
message(
length(unmatched), " sample(s) had no matches: ",

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
length(unmatched), " sample(s) had no matches: ",
length(unmatched), " sample(s) without matches: ",

Comment thread R/Amalia_algorithm.R
Comment on lines +66 to +94
#' df <- data.frame(
#' ID = c("Art1", "Art2", "Art3"),
#' `206Pb/204Pb` = c(18.244, 18.419, 18.050),
#' `207Pb/204Pb` = c(15.634, 15.658, 15.620),
#' `208Pb/204Pb` = c(38.407, 38.638, 38.157),
#' `206Pb/204Pb_err2SD` = c(0.001, 0.001, 0.001),
#' `207Pb/204Pb_err2SD` = c(0.001, 0.001, 0.001),
#' `208Pb/204Pb_err2SD` = c(0.002, 0.002, 0.002),
#' `207Pb/206Pb` = c(0.857, 0.850, 0.865),
#' `208Pb/206Pb` = c(2.105, 2.098, 2.114),
#' `207Pb/206Pb_err2SD` = c(0.00001, 0.00001, 0.00001),
#' `208Pb/206Pb_err2SD` = c(0.00004, 0.00004, 0.00004),
#' check.names = FALSE
#' )
#'
#' ref <- data.frame(
#' ID = c("Ore_A", "Ore_B", "Ore_C"),
#' `206Pb/204Pb` = c(18.242, 18.500, 18.048),
#' `207Pb/204Pb` = c(15.633, 15.700, 15.619),
#' `208Pb/204Pb` = c(38.405, 38.800, 38.155),
#' `206Pb/204Pb_err2SD` = c(0.001, 0.001, 0.001),
#' `207Pb/204Pb_err2SD` = c(0.001, 0.001, 0.001),
#' `208Pb/204Pb_err2SD` = c(0.002, 0.002, 0.002),
#' `207Pb/206Pb` = c(0.857, 0.850, 0.865),
#' `208Pb/206Pb` = c(2.105, 2.098, 2.114),
#' `207Pb/206Pb_err2SD` = c(0.00001, 0.00001, 0.00001),
#' `208Pb/206Pb_err2SD` = c(0.00004, 0.00004, 0.00004),
#' check.names = FALSE
#' )

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Columns for 204Pb/206Pb and respective analytical uncertainties are missing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants