_fdrx

The _fdrx module contains experimental functionality for false discovery rate control.

This module implements a base class for semi-supervised FDR estimation using targets and decoys. It is flexible with regard to the features, the type of classifier, and the type of identification (precursors, peptides, proteins).

class alphadia.fdr._fdrx.base.TargetDecoyFDR(classifier: BaseEstimator, feature_columns: list, decoy_column: str = 'decoy', competition_columns: list | None = None)

Bases: object

__init__(classifier: BaseEstimator, feature_columns: list, decoy_column: str = 'decoy', competition_columns: list | None = None)

Target-decoy FDR estimation using a classifier.

This class supports target-decoy competition as well as fragment competition.

Parameters:

classifier (sklearn.base.BaseEstimator) – The classifier to use for target decoy estimation.
feature_columns (list) – The columns to use as features for the classifier.
decoy_column (str, default='decoy') – The column to use as decoy information.
competition_columns (list | None, default=None) – Perform target-decoy competition on these columns. Only the best PSM in each group is kept.

fit_classifier(psm_df: DataFrame)

Fit the classifier on the PSMs.

Parameters: psm_df (pd.DataFrame) – The dataframe containing the PSMs.

fit_predict_qval(psm_df: DataFrame, fragments_df: DataFrame | None = None, cycle: ndarray | None = None)

Fit the classifier, predict the decoy probabilities and calculate q-values.

Parameters:

psm_df (pd.DataFrame) – The dataframe containing the PSMs.
fragments_df (pd.DataFrame, default=None) – The dataframe containing the fragments.
cycle (np.ndarray, default=None) – The DIA cycle for the fragments.

Returns:

The input dataframe with q-values and PEPs added.

Return type:

pd.DataFrame

predict_classifier(psm_df: DataFrame)

Predict the decoy probability for the PSMs.

Parameters: psm_df (pd.DataFrame) – The dataframe containing the PSMs.
Returns: The decoy probabilities for the PSMs, with the same shape and order as the input dataframe.
Return type: np.ndarray
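The fit and predict steps can be illustrated with a small sketch. It does not import alphadia and is not the library's code: it assumes the classifier is trained on feature_columns to separate targets (0) from decoys (1) and that the positive-class probability is used as the decoy probability. The feature names score_a and score_b, the synthetic data, and the choice of LogisticRegression are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic PSMs: targets (decoy == 0) score higher on average than decoys.
rng = np.random.default_rng(42)
n = 200
psm_df = pd.DataFrame(
    {
        "score_a": np.concatenate([rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n)]),
        "score_b": np.concatenate([rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n)]),
        "decoy": np.repeat([0, 1], n),
    }
)
feature_columns = ["score_a", "score_b"]  # hypothetical feature names

# "Fit" step: learn to separate decoys from targets using the chosen features.
classifier = LogisticRegression()
classifier.fit(psm_df[feature_columns].to_numpy(), psm_df["decoy"].to_numpy())

# "Predict" step: column 1 of predict_proba is the probability of class 1 (decoy).
decoy_proba = classifier.predict_proba(psm_df[feature_columns].to_numpy())[:, 1]
```

Any estimator that exposes fit and predict_proba could stand in for LogisticRegression in this sketch.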

predict_qval(psm_df: DataFrame, fragments_df: DataFrame | None = None, dia_cycle: ndarray | None = None, competition_heuristic: float = 0.1) → DataFrame

Calculate q-values for scored identifications.

Parameters:

psm_df (pd.DataFrame) – The dataframe containing the PSMs.
fragments_df (pd.DataFrame, default=None) – The dataframe containing the fragments.
dia_cycle (np.ndarray, default=None) – The DIA cycle for the fragments.
competition_heuristic (float, default=0.10) – The q-value threshold for fragment competition. Only precursors with q-values below this threshold will be considered for fragment competition.

Returns:

The input dataframe with q-values and PEPs added.

Return type:

pd.DataFrame

alphadia.fdr._fdrx.stats.add_q_values(df: DataFrame, decoy_proba_column: str = 'decoy_proba', decoy_column: str = 'decoy', qval_column: str = 'qval', r_target_decoy: float = 1.0)

Calculates q-values for a dataframe containing PSMs.

Parameters:

df (pd.DataFrame) – The dataframe containing the PSMs.
decoy_proba_column (str, default='decoy_proba') – The name of the column containing the probability of being a decoy. Values should lie between 0 and 1, with 1 indicating a decoy.
decoy_column (str, default='decoy') – The name of the column containing the decoy information. Decoys are expected to be 1 and targets 0.
qval_column (str, default='qval') – The name of the column to store the q-values in.

Returns:

The dataframe with the q-values stored in the column named by qval_column.

Return type:

pd.DataFrame
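The q-value calculation can be sketched with NumPy and pandas. This is a minimal illustration under stated assumptions rather than the library's implementation: rows are ranked by ascending decoy probability, the FDR at each rank is estimated as accepted decoys divided by accepted targets, and r_target_decoy is left out. The name add_q_values_sketch is hypothetical.

```python
import numpy as np
import pandas as pd


def add_q_values_sketch(
    df: pd.DataFrame,
    decoy_proba_column: str = "decoy_proba",
    decoy_column: str = "decoy",
    qval_column: str = "qval",
) -> pd.DataFrame:
    """Hypothetical helper: target-decoy q-values from decoy probabilities."""
    # Best identifications (lowest decoy probability) come first.
    out = df.sort_values(decoy_proba_column, ascending=True, kind="stable").copy()
    is_decoy = out[decoy_column].to_numpy()
    decoy_cumsum = np.cumsum(is_decoy)
    target_cumsum = np.cumsum(1 - is_decoy)
    # FDR estimate at each rank; the denominator is floored at 1 to avoid
    # dividing by zero when the ranking starts with decoys.
    fdr = decoy_cumsum / np.maximum(target_cumsum, 1)
    # q-value: the lowest FDR at which the row would still be accepted.
    out[qval_column] = np.minimum.accumulate(fdr[::-1])[::-1]
    return out


example = pd.DataFrame(
    {"decoy_proba": [0.05, 0.10, 0.40, 0.60, 0.90], "decoy": [0, 0, 1, 0, 1]}
)
scored = add_q_values_sketch(example)
```

In this sketch the returned dataframe is sorted by decoy probability, so its row order can differ from the input.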

alphadia.fdr._fdrx.stats.fdr_to_q_values(fdr_values: ndarray)

Converts FDR values to q-values. Takes an ascending-sorted array of FDR values. For every element, the lowest FDR at which it would be accepted is used as the q-value.

Parameters: fdr_values (np.ndarray) – The FDR values to convert.
Returns: The q-values.
Return type: np.ndarray
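The conversion amounts to a running minimum taken from the end of the array. The sketch below assumes the FDR values are ordered from the best-scoring identification to the worst; fdr_to_q_values_sketch is a hypothetical name used for illustration.

```python
import numpy as np


def fdr_to_q_values_sketch(fdr_values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: q-value = minimum FDR at this rank or any later rank."""
    fdr_values = np.asarray(fdr_values, dtype=float)
    # Reverse, take the running minimum, and reverse back.
    return np.minimum.accumulate(fdr_values[::-1])[::-1]


fdr = np.array([0.0, 0.5, 0.33, 0.25, 0.4])
q_values = fdr_to_q_values_sketch(fdr)
# q_values is [0.0, 0.25, 0.25, 0.25, 0.4]
```

Unlike the raw FDR estimates, the resulting q-values never decrease along the ranking, so a single threshold selects a contiguous block of top-ranked identifications.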

alphadia.fdr._fdrx.stats.get_pep(psm_df: DataFrame, score_column: str = 'decoy_proba', decoy_column: str = 'decoy', score_std: float = 0.01, pep_granularity: int = 1000, kernel_size: int = 20)

A very simple nonparametric PEP estimation using a Gaussian kernel.

Parameters:

psm_df (pd.DataFrame) – The dataframe containing the PSMs.
score_column (str, default='decoy_proba') – The name of the column containing the score to use for the selection.
decoy_column (str, default='decoy') – The name of the column containing the decoy information.
score_std (float, default=0.01) – The standard deviation of the Gaussian kernel.
pep_granularity (int, default=1000) – The number of bins to use for the score histogram.
kernel_size (int, default=20) – The size of the kernel to use for the convolution.

Returns:

The PEP values with same shape and order as the input dataframe.

Return type:

np.ndarray
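One way to read these parameters is as a histogram-and-smooth estimator. The sketch below is an assumption about the general approach, not the library's implementation: target and decoy scores are binned into pep_granularity bins, both histograms are smoothed with a Gaussian kernel spanning kernel_size bins on each side, and the PEP is taken as the ratio of smoothed decoy counts to smoothed target counts, clipped to [0, 1]. It also assumes similar numbers of targets and decoys. The name get_pep_sketch and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def get_pep_sketch(
    psm_df: pd.DataFrame,
    score_column: str = "decoy_proba",
    decoy_column: str = "decoy",
    score_std: float = 0.01,
    pep_granularity: int = 1000,
    kernel_size: int = 20,
) -> np.ndarray:
    """Hypothetical helper: histogram-based PEP with Gaussian smoothing."""
    scores = psm_df[score_column].to_numpy(dtype=float)
    is_decoy = psm_df[decoy_column].to_numpy().astype(bool)

    # Bin targets and decoys on a shared score grid.
    edges = np.linspace(scores.min(), scores.max() + 1e-12, pep_granularity + 1)
    target_hist, _ = np.histogram(scores[~is_decoy], bins=edges)
    decoy_hist, _ = np.histogram(scores[is_decoy], bins=edges)

    # Gaussian kernel sampled at kernel_size bin offsets on each side.
    # Assumes pep_granularity >= 2 * kernel_size + 1 so that mode="same"
    # returns one value per bin.
    bin_width = edges[1] - edges[0]
    offsets = np.arange(-kernel_size, kernel_size + 1) * bin_width
    kernel = np.exp(-0.5 * (offsets / score_std) ** 2)
    kernel /= kernel.sum()

    target_density = np.convolve(target_hist, kernel, mode="same")
    decoy_density = np.convolve(decoy_hist, kernel, mode="same")

    # Ratio of smoothed decoy to smoothed target counts, clipped to [0, 1].
    pep_per_bin = np.clip(decoy_density / np.maximum(target_density, 1e-12), 0.0, 1.0)

    # Look up the PEP of the bin each row falls into, preserving input order.
    bin_idx = np.clip(np.digitize(scores, edges) - 1, 0, pep_granularity - 1)
    return pep_per_bin[bin_idx]


rng = np.random.default_rng(0)
example = pd.DataFrame(
    {
        "decoy_proba": np.concatenate([rng.beta(2, 8, 500), rng.beta(8, 2, 500)]),
        "decoy": np.repeat([0, 1], 500),
    }
)
pep = get_pep_sketch(example)
```

With a wider score_std the estimate becomes smoother but resolves less of the transition region between target-dominated and decoy-dominated scores.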

alphadia.fdr._fdrx.stats.keep_best(df: DataFrame, score_column: str = 'decoy_proba', group_columns: list[str] | None = None)

Keep the best PSM in each group of PSMs, by default PSMs sharing the same channel and precursor_idx. This selects the best candidate PSM for each precursor. If group_columns is set to ['channel', 'elution_group_idx'], the function performs target-decoy competition instead.

Parameters:

df (pd.DataFrame) – The dataframe containing the PSMs.
score_column (str, default='decoy_proba') – The name of the column containing the score to use for the selection.
group_columns (list[str] | None, default=None) – The columns to use for the grouping. If None, ['channel', 'precursor_idx'] is used.

Returns:

The dataframe containing the best PSM for each group.

Return type:

pd.DataFrame