outputtransform¶
The outputtransform package contains functionality to perform the consensus workflow. Individual search workflows, part of a joined search plan, are combined.
This includes:
protein inference
false-discovery-rate correction
quantification
spectral library generation.
- alphadia.outputtransform.df_builders.build_run_internal_df(folder_path: str)[source]¶
Build stat dataframe for a single run.
- Parameters:
folder_path (str) – Path (from the base directory of the output_folder attribute of the SearchStep class) to the directory containing the raw file and the managers
- Returns:
Dataframe containing the statistics
- Return type:
pd.DataFrame
- alphadia.outputtransform.df_builders.build_run_stat_df(folder: str, raw_name: str, run_df: DataFrame, channels: list[int] | None = None)[source]¶
Build stat dataframe for a single run.
- Parameters:
folder (str) – Directory containing the raw file and the managers
raw_name (str) – Name of the raw file
run_df (pd.DataFrame) – Dataframe containing the precursor data
channels (List[int], optional) – List of channels to include in the output, default=[0]
- Returns:
Dataframe containing the statistics
- Return type:
pd.DataFrame
- alphadia.outputtransform.df_builders.log_stat_df(stat_df: DataFrame)[source]¶
Log the statistics dataframe to the console.
- Parameters:
stat_df (pd.DataFrame) – statistics dataframe
- alphadia.outputtransform.df_builders.transfer_library_stat_df(transfer_library: SpecLibBase) DataFrame[source]¶
Create a statistics dataframe for the transfer library.
- Parameters:
transfer_library (SpecLibBase) – transfer library
- Returns:
statistics dataframe
- Return type:
pd.DataFrame
Output Accumulator¶
This module contains classes that accumulate information from the output folders of the alphadia pipeline one folder at a time. This is useful when there are many output folders whose information should be combined into a single object or library, because loading all of them at once can exceed the available memory. The module follows a broadcaster-subscriber pattern: the AccumulationBroadcaster class loops over the output folders, creates a SpecLibBase object from each one, and broadcasts it to the subscribers.
Classes¶
- BaseAccumulator
Base class for accumulator classes, which subscribe to the linear accumulation of a list of output folders. It has two methods: update and post_process.
- AccumulationBroadcaster
Class that loops over output folders in a linear fashion to prevent having all the output folders in memory at the same time.
- TransferLearningAccumulator
Class that accumulates the information from the output folders for fine-tuning by selecting the top keep_top precursors and their fragments from all the output folders.
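The broadcaster-subscriber pattern described above can be illustrated with a minimal, self-contained sketch. It is not the alphadia implementation: `SketchAccumulator`, `SketchBroadcaster`, `PrecursorCounter`, the `run` method, and the injected `parse_folder` loader are hypothetical names chosen for illustration, and the sketch runs sequentially instead of using several processes. Only the `update`, `post_process`, and `subscribe` method names come from the documentation on this page.

```python
from abc import ABC, abstractmethod


class SketchAccumulator(ABC):
    """Hypothetical stand-in for BaseAccumulator with its two documented methods."""

    @abstractmethod
    def update(self, info: dict) -> None:
        """Called once per output folder with that folder's parsed content."""

    @abstractmethod
    def post_process(self) -> None:
        """Called once after every folder has been broadcast."""


class PrecursorCounter(SketchAccumulator):
    """Toy subscriber that keeps a running total instead of the folder content."""

    def __init__(self) -> None:
        self.total = 0
        self.finished = False

    def update(self, info: dict) -> None:
        self.total += len(info["precursors"])

    def post_process(self) -> None:
        self.finished = True


class SketchBroadcaster:
    """Hypothetical stand-in for AccumulationBroadcaster (sequential only)."""

    def __init__(self, folder_list, parse_folder) -> None:
        self._folder_list = folder_list
        self._parse_folder = parse_folder  # assumed loader, injected for the sketch
        self._subscribers = []

    def subscribe(self, subscriber: SketchAccumulator) -> None:
        self._subscribers.append(subscriber)

    def run(self) -> None:
        for folder in self._folder_list:
            # Only the content of the current folder is held in memory here.
            info = self._parse_folder(folder)
            for subscriber in self._subscribers:
                subscriber.update(info)
        for subscriber in self._subscribers:
            subscriber.post_process()


# Fake "folders" so the sketch runs without any files on disk.
fake_outputs = {"run_a": ["p1", "p2"], "run_b": ["p3"]}

counter = PrecursorCounter()
broadcaster = SketchBroadcaster(
    list(fake_outputs), lambda folder: {"precursors": fake_outputs[folder]}
)
broadcaster.subscribe(counter)
broadcaster.run()
print(counter.total)
```

Because each subscriber stores only what it needs (here a single integer), memory use stays bounded by the size of one folder plus the accumulated state, regardless of how many folders are processed.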
- class alphadia.outputtransform.outputaccumulator.AccumulationBroadcaster(folder_list: list, number_of_processes: int, processing_kwargs: dict)[source]¶
Bases: object
Class that loops over output folders in a linear fashion so that only one folder is in memory at a time, and broadcasts the output of each folder to the subscribers.
- subscribe(subscriber: BaseAccumulator)[source]¶
- class alphadia.outputtransform.outputaccumulator.BaseAccumulator[source]¶
Bases: object
Base class for accumulator classes, which are used to subscribe on the linear accumulation of a list of output folders.
- class alphadia.outputtransform.outputaccumulator.TransferLearningAccumulator(keep_top: int = 3, norm_delta_max: bool = True, precursor_correlation_cutoff: float = 0.5, fragment_correlation_ratio: float = 0.75)[source]¶
Bases: BaseAccumulator
- __init__(keep_top: int = 3, norm_delta_max: bool = True, precursor_correlation_cutoff: float = 0.5, fragment_correlation_ratio: float = 0.75)[source]¶
TransferLearningAccumulator is used to accumulate the information from the output folders for fine-tuning by selecting the top keep_top precursors and their fragments from all the output folders. The current measure of score is the probA
- Parameters:
keep_top (int, optional) – The number of top precursors to keep, by default 3
norm_delta_max (bool, optional) –
If True, advanced normalization of retention times is performed: retention times are normalized using the calibrated deviation from the library at the start of the gradient and max normalization at the end of the gradient.
If False, max normalization is performed. By default True
precursor_correlation_cutoff (float, optional) – Only precursors with a median fragment correlation above this cutoff will be used for MS2 learning, by default 0.5
fragment_correlation_ratio (float, optional) – The cutoff for the fragment correlation relative to the median fragment correlation for a precursor, by default 0.75
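As a rough illustration of the keep_top selection, the sketch below keeps the best-scoring observations of each precursor across several runs. The record layout (`precursor`, `run`, `score` keys), the helper name `keep_top_observations`, and the assumption that a higher score is better are all hypothetical; the actual accumulator works on spectral library objects and may select differently.

```python
from collections import defaultdict


def keep_top_observations(observations, keep_top=3):
    """Keep the `keep_top` highest-scoring observations per precursor.

    `observations` is a list of dicts with the hypothetical keys
    'precursor', 'run' and 'score' (higher score assumed to be better).
    """
    grouped = defaultdict(list)
    for obs in observations:
        grouped[obs["precursor"]].append(obs)

    selected = []
    for group in grouped.values():
        # Sort a copy so the caller's records are left in their original order.
        ranked = sorted(group, key=lambda obs: obs["score"], reverse=True)
        selected.extend(ranked[:keep_top])
    return selected


observations = [
    {"precursor": "PEPTIDEK_2", "run": "run_a", "score": 0.91},
    {"precursor": "PEPTIDEK_2", "run": "run_b", "score": 0.99},
    {"precursor": "PEPTIDEK_2", "run": "run_c", "score": 0.75},
    {"precursor": "ELVISK_3", "run": "run_a", "score": 0.60},
]

top = keep_top_observations(observations, keep_top=2)
for obs in top:
    print(obs["precursor"], obs["run"], obs["score"])
```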
- alphadia.outputtransform.outputaccumulator.build_speclibflat_from_quant(folder: str, mandatory_precursor_columns: list[str] | None = None, optional_precursor_columns: list[str] | None = None, charged_frag_types: list[str] | None = None) SpecLibFlat[source]¶
Build a SpecLibFlat object from quantification output data stored in a folder for transfer learning.
- Parameters:
folder (str) – The output folder to be parsed.
mandatory_precursor_columns (list[str], optional) – The columns to be selected from the precursor dataframe
optional_precursor_columns (list[str], optional) – Additional optional columns to include if present
- Returns:
A spectral library object containing the parsed data
- Return type:
SpecLibFlat
- alphadia.outputtransform.outputaccumulator.ms2_quality_control(spec_lib_base: SpecLibBase, precursor_correlation_cutoff: float = 0.5, fragment_correlation_ratio: float = 0.75)[source]¶
Perform quality control for transfer learning by filtering out precursors with low median fragment correlation and fragments with low correlation.
- Parameters:
spec_lib_base (SpecLibBase) – The SpecLibBase object to be filtered.
precursor_correlation_cutoff (float) – Only precursors with a median fragment correlation above this cutoff will be used for MS2 learning. Default is 0.5.
fragment_correlation_ratio (float) – The cutoff for the fragment correlation relative to the median fragment correlation for a precursor. Default is 0.75.
- Returns:
The SpecLibBase object with the precursors and fragments that pass the quality control filters.
- Return type:
SpecLibBase
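The two filters can be sketched on plain Python data. This is one possible reading of the parameter descriptions, not the library code: it assumes that a fragment is kept when its correlation exceeds `fragment_correlation_ratio` times the median fragment correlation of its precursor. The function name `filter_by_fragment_correlation` and the dictionary input are hypothetical.

```python
from statistics import median


def filter_by_fragment_correlation(
    fragment_correlations,
    precursor_correlation_cutoff=0.5,
    fragment_correlation_ratio=0.75,
):
    """Filter precursors and fragments by correlation.

    `fragment_correlations` maps a precursor identifier to the list of
    correlations of its fragments (hypothetical input layout).
    """
    kept = {}
    for precursor, correlations in fragment_correlations.items():
        if not correlations:
            continue  # no fragments, nothing to learn from
        precursor_median = median(correlations)
        # Precursor-level filter: median fragment correlation must exceed the cutoff.
        if precursor_median <= precursor_correlation_cutoff:
            continue
        # Fragment-level filter (assumed): threshold relative to the precursor median.
        threshold = fragment_correlation_ratio * precursor_median
        kept[precursor] = [c for c in correlations if c > threshold]
    return kept


example = {
    "PEPTIDEK_2": [0.9, 0.8, 0.3],
    "NOISYPEPR_2": [0.4, 0.2, 0.1],
}
result = filter_by_fragment_correlation(example)
print(result)
```

In this example the second precursor is dropped entirely because its median fragment correlation (0.2) is below the cutoff, while the first precursor loses only its weakest fragment.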
- alphadia.outputtransform.outputaccumulator.normalize_rt_delta_max(spec_lib_base: SpecLibBase) SpecLibBase[source]¶
Normalize the retention times of the precursors in the SpecLibBase object using delta max normalization.
- Parameters:
spec_lib_base (SpecLibBase) – The SpecLibBase object to be normalized.
- Returns:
The SpecLibBase object with the retention times normalized using delta max normalization.
- Return type:
SpecLibBase
- alphadia.outputtransform.outputaccumulator.normalize_rt_max(spec_lib_base: SpecLibBase) SpecLibBase[source]¶
Normalize the retention times of the precursors in the SpecLibBase object using max normalization.
- Parameters:
spec_lib_base (SpecLibBase) – The SpecLibBase object to be normalized.
- Returns:
The SpecLibBase object with the retention times normalized using max normalization.
- Return type:
SpecLibBase
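Max normalization is commonly understood as dividing every retention time by the largest observed retention time, which maps the values into the range 0 to 1. The sketch below shows that reading on a plain list; whether `normalize_rt_max` uses exactly this formula is an assumption, and `normalize_rt_max_sketch` is a hypothetical helper that does not touch SpecLibBase objects. Delta max normalization is not sketched because this page does not give its formula.

```python
def normalize_rt_max_sketch(rt_values):
    """Scale retention times by their maximum (assumed meaning of max normalization)."""
    if not rt_values:
        return []
    rt_max = max(rt_values)
    if rt_max <= 0:
        raise ValueError("the largest retention time must be positive")
    return [rt / rt_max for rt in rt_values]


# Retention times in seconds for three precursors of one run.
normalized = normalize_rt_max_sketch([120.0, 600.0, 1200.0])
print(normalized)
```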
- alphadia.outputtransform.protein_fdr.perform_protein_fdr(psm_df: DataFrame, figure_path: str) DataFrame[source]¶
Perform protein FDR on PSM dataframe
- class alphadia.outputtransform.search_plan_output.SearchPlanOutput(config: Config, output_folder: str)[source]¶
Bases: object
- INTERNAL_OUTPUT = 'internal'¶
- LIBRARY_OUTPUT = 'speclib.mbr'¶
- PG_OUTPUT = 'protein_groups'¶
- PRECURSOR_OUTPUT = 'precursors'¶
- PSM_INPUT = 'psm'¶
- STAT_OUTPUT = 'stat'¶
- TRANSFER_MODEL = 'peptdeep.transfer'¶
- TRANSFER_OUTPUT = 'speclib.transfer'¶
- TRANSFER_STATS_OUTPUT = 'stats.transfer'¶
- __init__(config: Config, output_folder: str)[source]¶
Combine individual searches and build combined outputs.
In alphaDIA the search plan orchestrates the library building preparation, schedules the individual searches and combines the individual outputs into a single output.
The SearchPlanOutput class is responsible for combining the individual search outputs into a single output.
This includes:
- combining the individual precursor tables
- building the output stat table
- performing protein grouping
- performing protein FDR
- performing label-free quantification
- building the spectral library
- Parameters:
config (Config) – Configuration object
output_folder (str) – Output folder
- build(folder_list: list[str], base_spec_lib: SpecLibBase | None)[source]¶
Build output from a list of search outputs.
The following files are written to the output folder:
- precursor.tsv
- protein_groups.tsv
- stat.tsv
- speclib.mbr.hdf
- Parameters:
folder_list (List[str]) – List of folders containing the search outputs
base_spec_lib (base.SpecLibBase, optional) – Base spectral library
- alphadia.outputtransform.utils.apply_output_column_names(df: DataFrame) DataFrame[source]¶
Convert internal column names to output names and filter to only mapped columns.
Only columns that are present in INTERNAL_TO_OUTPUT_MAPPING are kept in the output. This ensures that output files only contain the defined output columns.
- Parameters:
df (pd.DataFrame) – Dataframe with internal column names
- Returns:
Dataframe with output column names applied, containing only mapped columns
- Return type:
pd.DataFrame
- alphadia.outputtransform.utils.apply_protein_inference(psm_df: DataFrame, inference_strategy: Literal['library', 'maximum_parsimony', 'heuristic'], group_level: str) DataFrame[source]¶
Apply protein inference strategy to PSM dataframe.
- Parameters:
psm_df (pd.DataFrame) – PSM dataframe
inference_strategy (Literal["library", "maximum_parsimony", "heuristic"]) – Inference strategy: ‘library’, ‘maximum_parsimony’, or ‘heuristic’
group_level (str) – Grouping level: ‘proteins’ or ‘genes’
- Returns:
PSM dataframe with protein grouping applied
- Return type:
pd.DataFrame
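To give an intuition for the 'maximum_parsimony' option, the sketch below uses a greedy set-cover heuristic: it repeatedly picks the protein that explains the most still-unassigned peptides. This is a generic textbook approach shown under stated assumptions, not necessarily what `apply_protein_inference` does internally. The function name `greedy_parsimony` and the dictionary input are hypothetical.

```python
def greedy_parsimony(peptides_by_protein):
    """Assign each peptide to one protein using a greedy set-cover heuristic.

    `peptides_by_protein` maps a protein identifier to the peptides that
    could originate from it (hypothetical input layout).
    """
    # Work on copies so the caller's sets are not modified.
    remaining = {prot: set(peps) for prot, peps in peptides_by_protein.items()}
    assignment = {}
    while True:
        # Protein explaining the most unassigned peptides; ties go to the
        # alphabetically first identifier because max() keeps the first maximum.
        best = max(sorted(remaining), key=lambda prot: len(remaining[prot]), default=None)
        if best is None or not remaining[best]:
            break
        claimed = remaining.pop(best)
        for peptide in claimed:
            assignment[peptide] = best
        # Claimed peptides no longer count as evidence for other proteins.
        for peptides in remaining.values():
            peptides -= claimed
    return assignment


candidates = {
    "P1": {"pep_a", "pep_b", "pep_c"},
    "P2": {"pep_b", "pep_c"},
    "P3": {"pep_d"},
}
assignment = greedy_parsimony(candidates)
for peptide in sorted(assignment):
    print(peptide, assignment[peptide])
```

Here P2 receives no peptides because everything it could explain is already covered by P1, which is the behaviour a parsimonious protein list aims for.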
- alphadia.outputtransform.utils.get_channels_from_config(config: dict) list[int][source]¶
Extract and compute channel list from configuration.
- Parameters:
config (dict) – Configuration dictionary containing search and multiplexing settings
- Returns:
Sorted list of channel integers
- Return type:
list[int]
- alphadia.outputtransform.utils.load_psm_files_from_folders(folder_list: list[str], psm_file_name: str) list[DataFrame][source]¶
Load PSM files from multiple folders.
- Parameters:
folder_list (list[str]) – List of folders containing PSM files
psm_file_name (str) – Name of the PSM file (without extension)
- Returns:
List of PSM dataframes from all folders
- Return type:
list[pd.DataFrame]
- alphadia.outputtransform.utils.log_protein_fdr_summary(psm_df: DataFrame) None[source]¶
Log summary statistics for protein FDR results.
- Parameters:
psm_df (pd.DataFrame) – Precursor table with protein grouping and FDR filtering applied
- alphadia.outputtransform.utils.merge_quant_levels_to_psm(psm_df: DataFrame, lfq_results: dict[str, DataFrame], quantlevel_configs: list) DataFrame[source]¶
Merge quantification results from all levels back to the precursor table.
- Parameters:
psm_df (pd.DataFrame) – Precursor table to merge quantification data into
lfq_results (dict[str, pd.DataFrame]) – Dictionary containing quantification results for each level
quantlevel_configs (list) – List of LFQOutputConfig objects defining quantification levels
- Returns:
Updated precursor table with merged quantification data
- Return type:
pd.DataFrame
- alphadia.outputtransform.utils.prepare_psm_dataframe(psm_df: DataFrame) DataFrame[source]¶
Prepare PSM dataframe by cleaning modification columns and hashing precursors.
- Parameters:
psm_df (pd.DataFrame) – Raw PSM dataframe
- Returns:
Prepared PSM dataframe with hashed precursor information
- Return type:
pd.DataFrame
- alphadia.outputtransform.utils.read_df(path_no_format, file_format='parquet')[source]¶
Read dataframe from disk with chosen file format.
- Parameters:
path_no_format (str) – File to read from disk without file format
file_format (str, default = 'parquet') – File format for loading the file. Available options: [‘parquet’, ‘tsv’]
- Returns:
loaded dataframe from disk
- Return type:
pd.DataFrame
- alphadia.outputtransform.utils.write_df(df: DataFrame, path_no_format: str, file_format: str = 'parquet') None[source]¶
Write dataframe to disk with chosen file format.
- Parameters:
df (pd.DataFrame) – Dataframe to save to disk
path_no_format (str) – Path for file without format
file_format (str, default = 'parquet') – File format for writing the file. Available options: [‘parquet’, ‘tsv’]