outputtransform

The outputtransform package implements the consensus workflow: the outputs of the individual search workflows that make up a joint search plan are combined into a single result.

This includes:

  • protein inference

  • false-discovery-rate correction

  • quantification

  • spectral library generation.

alphadia.outputtransform.df_builders

alphadia.outputtransform.outputaccumulator


alphadia.outputtransform.protein_fdr

alphadia.outputtransform.search_plan_output

alphadia.outputtransform.utils

alphadia.outputtransform.df_builders.build_run_internal_df(folder_path: str)

Build the internal dataframe for a single run.

Parameters:

folder_path (str) – Path to the directory containing the raw file and the managers, relative to the base directory given by the output_folder attribute of the SearchStep class

Returns:

Dataframe containing the statistics

Return type:

pd.DataFrame

alphadia.outputtransform.df_builders.build_run_stat_df(folder: str, raw_name: str, run_df: DataFrame, channels: list[int] | None = None)

Build stat dataframe for a single run.

Parameters:
  • folder (str) – Directory containing the raw file and the managers

  • raw_name (str) – Name of the raw file

  • run_df (pd.DataFrame) – Dataframe containing the precursor data

  • channels (list[int], optional) – List of channels to include in the output; if None, [0] is used

Returns:

Dataframe containing the statistics

Return type:

pd.DataFrame

alphadia.outputtransform.df_builders.log_stat_df(stat_df: DataFrame)

Log the statistics dataframe to the console.

Parameters:

stat_df (pd.DataFrame) – Statistics dataframe

alphadia.outputtransform.df_builders.transfer_library_stat_df(transfer_library: SpecLibBase) → DataFrame

Create a statistics dataframe for the transfer library.

Parameters:

transfer_library (SpecLibBase) – Transfer library

Returns:

Statistics dataframe

Return type:

pd.DataFrame

Output Accumulator

This module contains classes that accumulate information from the output folders of the alphadia pipeline one folder at a time. This is useful when there are many output folders whose information must be collected into a single object or library, which is hard to do in one pass because of memory constraints. The module follows a broadcast-subscriber pattern: the AccumulationBroadcaster class loops over the output folders, creates a SpecLibBase object from each one, and broadcasts it to its subscribers.

Classes

BaseAccumulator

Base class for accumulator classes, which subscribe to the linear accumulation of a list of output folders. It has two methods: update and post_process.

AccumulationBroadcaster

Class that loops over output folders in a linear fashion to prevent having all the output folders in memory at the same time.

TransferLearningAccumulator

Class that accumulates the information from the output folders for fine-tuning by selecting the top keep_top precursors and their fragments from all the output folders.
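The broadcast-subscriber flow described above can be illustrated with a minimal, self-contained sketch. It does not import alphadia: Subscriber, PrecursorCounter, Broadcaster, and the load_folder callable are hypothetical stand-ins, and the sketch runs sequentially rather than with number_of_processes workers. Only the method names subscribe, run, update, and post_process are taken from this page.

```python
from typing import Callable, Iterable


class Subscriber:
    """Hypothetical stand-in for BaseAccumulator."""

    def update(self, info: dict) -> None:
        raise NotImplementedError

    def post_process(self) -> None:
        pass


class PrecursorCounter(Subscriber):
    """Example subscriber: keeps a running total instead of the folder contents."""

    def __init__(self) -> None:
        self.total = 0
        self.finished = False

    def update(self, info: dict) -> None:
        self.total += len(info["precursors"])

    def post_process(self) -> None:
        self.finished = True


class Broadcaster:
    """Hypothetical stand-in for AccumulationBroadcaster (sequential only)."""

    def __init__(self, folder_list: Iterable[str], load_folder: Callable[[str], dict]) -> None:
        self._folder_list = list(folder_list)
        self._load_folder = load_folder
        self._subscribers: list[Subscriber] = []

    def subscribe(self, subscriber: Subscriber) -> None:
        self._subscribers.append(subscriber)

    def run(self) -> None:
        for folder in self._folder_list:
            # Only the data of the current folder is held at this point.
            info = self._load_folder(folder)
            for subscriber in self._subscribers:
                subscriber.update(info)
        # Called once, after every folder has been broadcast.
        for subscriber in self._subscribers:
            subscriber.post_process()


# In-memory dictionary standing in for output folders on disk.
fake_outputs = {
    "run_a": {"precursors": ["p1", "p2"]},
    "run_b": {"precursors": ["p3"]},
}
counter = PrecursorCounter()
broadcaster = Broadcaster(fake_outputs, fake_outputs.__getitem__)
broadcaster.subscribe(counter)
broadcaster.run()
```

Because each subscriber reduces a folder to whatever it needs inside update, memory use depends on the subscribers' state rather than on the number of folders.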

class alphadia.outputtransform.outputaccumulator.AccumulationBroadcaster(folder_list: list, number_of_processes: int, processing_kwargs: dict)

Bases: object

Class that loops over output folders in a linear fashion so that only one folder is in memory at a time, and broadcasts the output of each folder to the subscribers.

__init__(folder_list: list, number_of_processes: int, processing_kwargs: dict)

run()

subscribe(subscriber: BaseAccumulator)

class alphadia.outputtransform.outputaccumulator.BaseAccumulator

Bases: object

Base class for accumulator classes, which subscribe to the linear accumulation of a list of output folders.

post_process() → None

Called after all output folders have been processed.

update(info: SpecLibBase) → None

Called when a new output folder is obtained.

Parameters:

info (SpecLibBase) – The information from the output folder.

class alphadia.outputtransform.outputaccumulator.TransferLearningAccumulator(keep_top: int = 3, norm_delta_max: bool = True, precursor_correlation_cutoff: float = 0.5, fragment_correlation_ratio: float = 0.75)

Bases: BaseAccumulator

__init__(keep_top: int = 3, norm_delta_max: bool = True, precursor_correlation_cutoff: float = 0.5, fragment_correlation_ratio: float = 0.75)

TransferLearningAccumulator accumulates the information from the output folders for fine-tuning by selecting the top keep_top precursors and their fragments across all output folders. The score currently used for the ranking is probA.

Parameters:
  • keep_top (int, optional) – The number of top precursors to keep, by default 3

  • norm_delta_max (bool, optional) –

    If True, advanced normalization of retention times is performed: retention times are normalized using the calibrated deviation from the library at the start of the gradient and max normalization at the end of the gradient.

    If False, max normalization is performed. By default True

  • precursor_correlation_cutoff (float, optional) – Only precursors with a median fragment correlation above this cutoff will be used for MS2 learning, by default 0.5

  • fragment_correlation_ratio (float, optional) – The cutoff for the fragment correlation relative to the median fragment correlation for a precursor, by default 0.75
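The keep_top selection can be sketched on plain Python records. This is an illustration under stated assumptions, not the accumulator's code: the record layout, the function name keep_top_observations, and the higher_is_better flag are hypothetical. This page does not state whether a higher or lower probA is better, so the direction is left as a parameter.

```python
from collections import defaultdict


def keep_top_observations(observations, keep_top=3, higher_is_better=True):
    """Keep at most `keep_top` best-scoring observations per precursor."""
    by_precursor = defaultdict(list)
    for obs in observations:
        by_precursor[obs["precursor"]].append(obs)
    return {
        precursor: sorted(
            group, key=lambda o: o["score"], reverse=higher_is_better
        )[:keep_top]
        for precursor, group in by_precursor.items()
    }


# Hypothetical observations of the same precursors in several runs.
observations = [
    {"precursor": "PEPTIDEK_2", "run": "run_a", "score": 0.9},
    {"precursor": "PEPTIDEK_2", "run": "run_b", "score": 0.4},
    {"precursor": "PEPTIDEK_2", "run": "run_c", "score": 0.7},
    {"precursor": "ELVISK_2", "run": "run_a", "score": 0.2},
]
top = keep_top_observations(observations, keep_top=2)
```

The sketch ranks all observations in one pass. The accumulator receives one output folder per update call, so an incremental version would merge each new folder into the current selection and trim it back to keep_top.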

post_process()

Post-process the consensus_speclibase by normalizing retention times.

update(speclibase: SpecLibBase)

Update the consensus_speclibase with the information from the speclibase.

Parameters:

speclibase (SpecLibBase) – The information from the output folder.

alphadia.outputtransform.outputaccumulator.build_speclibflat_from_quant(folder: str, mandatory_precursor_columns: list[str] | None = None, optional_precursor_columns: list[str] | None = None, charged_frag_types: list[str] | None = None) → SpecLibFlat

Build a SpecLibFlat object from quantification output data stored in a folder for transfer learning.

Parameters:
  • folder (str) – The output folder to be parsed.

  • mandatory_precursor_columns (list[str], optional) – The columns to be selected from the precursor dataframe

  • optional_precursor_columns (list[str], optional) – Additional optional columns to include if present

Returns:

A spectral library object containing the parsed data

Return type:

SpecLibFlat

alphadia.outputtransform.outputaccumulator.error_callback(e)

alphadia.outputtransform.outputaccumulator.ms2_quality_control(spec_lib_base: SpecLibBase, precursor_correlation_cutoff: float = 0.5, fragment_correlation_ratio: float = 0.75)

Perform quality control for transfer learning by filtering out precursors with low median fragment correlation and fragments with low correlation.

Parameters:
  • spec_lib_base (SpecLibBase) – The SpecLibBase object to be filtered.

  • precursor_correlation_cutoff (float) – Only precursors with a median fragment correlation above this cutoff will be used for MS2 learning. Default is 0.5.

  • fragment_correlation_ratio (float) – The cutoff for the fragment correlation relative to the median fragment correlation for a precursor. Default is 0.75.

Returns:

The SpecLibBase object with the precursors and fragments that pass the quality control filters.

Return type:

SpecLibBase
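The two filters described for ms2_quality_control can be sketched on a plain dictionary of fragment correlations. This is one reading of the parameter descriptions rather than the library's implementation: it assumes a fragment is kept when its correlation exceeds fragment_correlation_ratio times the median correlation of its precursor. The function name ms2_quality_filter and the data layout are hypothetical.

```python
from statistics import median


def ms2_quality_filter(
    fragment_correlations,
    precursor_correlation_cutoff=0.5,
    fragment_correlation_ratio=0.75,
):
    """Drop weak precursors first, then weak fragments of the remaining ones."""
    kept = {}
    for precursor, correlations in fragment_correlations.items():
        if not correlations:
            continue
        precursor_median = median(correlations.values())
        # Precursor-level filter: median fragment correlation must exceed the cutoff.
        if precursor_median <= precursor_correlation_cutoff:
            continue
        # Fragment-level filter: threshold is relative to the precursor's median.
        threshold = fragment_correlation_ratio * precursor_median
        kept[precursor] = {
            fragment: corr
            for fragment, corr in correlations.items()
            if corr > threshold
        }
    return kept


# Hypothetical fragment correlations for two precursors.
example = {
    "good": {"y3": 0.9, "y4": 0.8, "b2": 0.3},
    "poor": {"y3": 0.2, "y4": 0.4, "b2": 0.1},
}
filtered = ms2_quality_filter(example)
```

With the default values, the precursor "poor" (median 0.2) is removed entirely, while "good" (median 0.8) keeps only the fragments above 0.75 × 0.8.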

alphadia.outputtransform.outputaccumulator.normalize_rt_delta_max(spec_lib_base: SpecLibBase) → SpecLibBase

Normalize the retention times of the precursors in the SpecLibBase object using delta max normalization.

Parameters:

spec_lib_base (SpecLibBase) – The SpecLibBase object to be normalized.

Returns:

The SpecLibBase object with the retention times normalized using delta max normalization.

Return type:

SpecLibBase

alphadia.outputtransform.outputaccumulator.normalize_rt_max(spec_lib_base: SpecLibBase) → SpecLibBase

Normalize the retention times of the precursors in the SpecLibBase object using max normalization.

Parameters:

spec_lib_base (SpecLibBase) – The SpecLibBase object to be normalized.

Returns:

The SpecLibBase object with the retention times normalized using max normalization.

Return type:

SpecLibBase
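As a rough illustration of max normalization, the sketch below divides each retention time by the largest one, so that values fall between 0 and 1. It assumes that this is what "max normalization" means here. The function normalize_max is hypothetical and works on a plain list instead of the precursor table of a SpecLibBase object.

```python
def normalize_max(retention_times):
    """Scale retention times so that the largest value becomes 1.0."""
    rt_max = max(retention_times)
    if rt_max <= 0:
        raise ValueError("the maximum retention time must be positive")
    return [rt / rt_max for rt in retention_times]


# Hypothetical retention times in seconds.
normalized = normalize_max([120.0, 600.0, 1200.0])
```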

alphadia.outputtransform.protein_fdr.perform_protein_fdr(psm_df: DataFrame, figure_path: str) → DataFrame

Perform protein FDR on the PSM dataframe.

class alphadia.outputtransform.search_plan_output.SearchPlanOutput(config: Config, output_folder: str)

Bases: object

INTERNAL_OUTPUT = 'internal'
LIBRARY_OUTPUT = 'speclib.mbr'
PG_OUTPUT = 'protein_groups'
PRECURSOR_OUTPUT = 'precursors'
PSM_INPUT = 'psm'
STAT_OUTPUT = 'stat'
TRANSFER_MODEL = 'peptdeep.transfer'
TRANSFER_OUTPUT = 'speclib.transfer'
TRANSFER_STATS_OUTPUT = 'stats.transfer'
__init__(config: Config, output_folder: str)

Combine individual searches and build combined outputs.

In alphaDIA the search plan orchestrates the library building preparation, schedules the individual searches and combines the individual outputs into a single output.

The SearchPlanOutput class is responsible for combining the individual search outputs into a single output.

This includes:

  • combining the individual precursor tables

  • building the output stat table

  • performing protein grouping

  • performing protein FDR

  • performing label-free quantification

  • building the spectral library

Parameters:
  • config (Config) – Configuration object

  • output_folder (str) – Output folder

build(folder_list: list[str], base_spec_lib: SpecLibBase | None)

Build output from a list of search outputs.

The following files are written to the output folder:

  • precursor.tsv

  • protein_groups.tsv

  • stat.tsv

  • speclib.mbr.hdf

Parameters:
  • folder_list (List[str]) – List of folders containing the search outputs

  • base_spec_lib (base.SpecLibBase, optional) – Base spectral library

alphadia.outputtransform.utils.apply_output_column_names(df: DataFrame) → DataFrame

Convert internal column names to output names and filter to only mapped columns.

Only columns that are present in INTERNAL_TO_OUTPUT_MAPPING are kept in the output. This ensures that output files only contain the defined output columns.

Parameters:

df (pd.DataFrame) – Dataframe with internal column names

Returns:

Dataframe with output column names applied, containing only mapped columns

Return type:

pd.DataFrame
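The rename-and-filter behaviour can be sketched without pandas by treating a table as a dictionary of columns. The column names and the mapping below are invented for illustration and are not the contents of INTERNAL_TO_OUTPUT_MAPPING. The real function operates on a DataFrame.

```python
def rename_and_filter(table, mapping):
    """Rename mapped columns and drop every column without a mapping entry."""
    return {
        mapping[name]: values
        for name, values in table.items()
        if name in mapping
    }


# Hypothetical internal table with one column that has no output name.
table = {
    "score_internal": [0.9, 0.8],
    "rt_internal": [10.5, 11.2],
    "debug_flag": [0, 1],
}
mapping = {"score_internal": "score", "rt_internal": "rt"}
output = rename_and_filter(table, mapping)
```

Filtering by the mapping means that a new internal column reaches the output files only after it has been given an output name.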

alphadia.outputtransform.utils.apply_protein_inference(psm_df: DataFrame, inference_strategy: Literal['library', 'maximum_parsimony', 'heuristic'], group_level: str) → DataFrame

Apply protein inference strategy to PSM dataframe.

Parameters:
  • psm_df (pd.DataFrame) – PSM dataframe

  • inference_strategy (Literal["library", "maximum_parsimony", "heuristic"]) – Inference strategy: ‘library’, ‘maximum_parsimony’, or ‘heuristic’

  • group_level (str) – Grouping level: ‘proteins’ or ‘genes’

Returns:

PSM dataframe with protein grouping applied

Return type:

pd.DataFrame
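For readers unfamiliar with the 'maximum_parsimony' strategy, the idea is to explain all identified peptides with as few proteins as possible. The sketch below uses a greedy set-cover approximation, which is one common way to approach the problem. Whether alphadia's implementation works this way is not stated on this page, and the function greedy_parsimony and its input layout are hypothetical.

```python
from collections import defaultdict


def greedy_parsimony(peptide_to_proteins):
    """Assign each peptide to one protein, picking proteins greedily by coverage."""
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Peptides without any candidate protein can never be explained, so skip them.
    unexplained = {pep for pep, prots in peptide_to_proteins.items() if prots}
    assignment = {}
    while unexplained:
        # sorted() makes ties deterministic: the alphabetically first protein wins.
        best = max(
            sorted(protein_to_peptides),
            key=lambda protein: len(protein_to_peptides[protein] & unexplained),
        )
        explained = protein_to_peptides[best] & unexplained
        for peptide in explained:
            assignment[peptide] = best
        unexplained -= explained
    return assignment


# Hypothetical peptide-to-protein candidates; P2 is redundant given P1 and P3.
peptides = {
    "pep1": ["P1"],
    "pep2": ["P1", "P2"],
    "pep3": ["P2", "P3"],
    "pep4": ["P3"],
}
assignment = greedy_parsimony(peptides)
```

In this example two proteins are enough to explain all four peptides, so P2 receives no peptides.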

alphadia.outputtransform.utils.get_channels_from_config(config: dict) → list[int]

Extract and compute channel list from configuration.

Parameters:

config (dict) – Configuration dictionary containing search and multiplexing settings

Returns:

Sorted list of channel integers

Return type:

list[int]

alphadia.outputtransform.utils.load_psm_files_from_folders(folder_list: list[str], psm_file_name: str) → list[DataFrame]

Load PSM files from multiple folders.

Parameters:
  • folder_list (list[str]) – List of folders containing PSM files

  • psm_file_name (str) – Name of the PSM file (without extension)

Returns:

List of PSM dataframes from all folders

Return type:

list[pd.DataFrame]

alphadia.outputtransform.utils.log_protein_fdr_summary(psm_df: DataFrame) → None

Log summary statistics for protein FDR results.

Parameters:

psm_df (pd.DataFrame) – Precursor table with protein grouping and FDR filtering applied

alphadia.outputtransform.utils.merge_quant_levels_to_psm(psm_df: DataFrame, lfq_results: dict[str, DataFrame], quantlevel_configs: list) → DataFrame

Merge quantification results from all levels back to the precursor table.

Parameters:
  • psm_df (pd.DataFrame) – Precursor table to merge quantification data into

  • lfq_results (dict[str, pd.DataFrame]) – Dictionary containing quantification results for each level

  • quantlevel_configs (list) – List of LFQOutputConfig objects defining quantification levels

Returns:

Updated precursor table with merged quantification data

Return type:

pd.DataFrame

alphadia.outputtransform.utils.prepare_psm_dataframe(psm_df: DataFrame) → DataFrame

Prepare PSM dataframe by cleaning modification columns and hashing precursors.

Parameters:

psm_df (pd.DataFrame) – Raw PSM dataframe

Returns:

Prepared PSM dataframe with hashed precursor information

Return type:

pd.DataFrame

alphadia.outputtransform.utils.read_df(path_no_format, file_format='parquet')

Read a dataframe from disk with the chosen file format.

Parameters:
  • path_no_format (str) – File to read from disk without file format

  • file_format (str, default = 'parquet') – File format for loading the file. Available options: [‘parquet’, ‘tsv’]

Returns:

loaded dataframe from disk

Return type:

pd.DataFrame

alphadia.outputtransform.utils.write_df(df: DataFrame, path_no_format: str, file_format: str = 'parquet') → None

Write a dataframe to disk with the chosen file format.

Parameters:
  • df (pd.DataFrame) – Dataframe to save to disk

  • path_no_format (str) – Path for file without format

  • file_format (str, default = 'parquet') – File format for writing the file. Available options: [‘parquet’, ‘tsv’]
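Both read_df and write_df take a path without its extension plus a separate file_format. The sketch below shows how such a pair could be resolved to a file name, assuming the extension is appended after a dot and that unsupported formats are rejected. The helper path_with_format is hypothetical and does not read or write any data.

```python
SUPPORTED_FORMATS = ("parquet", "tsv")


def path_with_format(path_no_format, file_format="parquet"):
    """Build the full file path from a path without extension and a format name."""
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported file format {file_format!r}, "
            f"expected one of {SUPPORTED_FORMATS}"
        )
    return f"{path_no_format}.{file_format}"


# Hypothetical output location.
tsv_path = path_with_format("output/precursors", "tsv")
parquet_path = path_with_format("output/precursors")
```

Keeping the format out of the path lets a caller switch between parquet and tsv output by changing a single setting.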