SomaData
Package from Somalogic, Inc.
This document accompanies the Python package somadata
, which loads the SomaLogic, Inc. structured text data file called an *.adat
. The somadata.Adat
object is an extension of the pandas.DataFrame
class. The package provides auxiliary functions for extracting relevant information from the ADAT object once in the Python environment. Basic familiarity with the Python environment is assumed, as is the ability to install contributed packages from the Python Package Installer (pip).
The easiest way to install SomaData
is to install directly from PyPI
PIP:
Alternatively one can install from the GitHub repository.
GitHub:
pip install git+https://github.com/SomaLogic/Canopy.git#egg=somadata
Alternatively, if you wish to develop or change the source code, you may clone the repository and install manually via:
git clone https://github.com/SomaLogic/Canopy.git pip install -e ./somadata
Python >=3.9
is required to install somadata
. The following package dependencies are installed on a pip install
:
pandas >= 1.1.2
numpy >= 1.19.1
Upon installation, load somadata
as normal:
help(somadata) # help(somadata.adat) ... etc
Help on package somadata:
NAME
somadata
PACKAGE CONTENTS
adat
annotations
base (package)
data (package)
errors
io (package)
tools (package)
FILE
/Users/tjohnson/code/repos/SomaData/somadata/__init__.py
The somadata
package comes with one internal object available to users to run canned examples (or analyses). It can be accessed by performing the import:
from somadata.data.example_data import example_data
*.adat
format into a Python
session as an adat
object.adat
object.adat
object as a *.adat
text file.Loading the sample file from within the somadata library via its path
adat = somadata.read_adat('./somadata/data/example_data.adat') type(adat)
MultiIndex([( '10000-28', '3', 'SL019233', ...),
( '10001-7', '3', 'SL002564', ...),
( '10003-15', '3', 'SL019245', ...),
( '10006-25', '3', 'SL019228', ...),
( '10008-43', '3', 'SL019234', ...),
( '10011-65', '3', 'SL019246', ...),
( '10012-5', '3', 'SL014669', ...),
( '10013-34', '3', 'SL025418', ...),
( '10014-31', '3', 'SL007803', ...),
('10015-119', '3', 'SL014924', ...),
...
( '9981-18', '3', 'SL018293', ...),
( '9983-97', '3', 'SL019202', ...),
( '9984-12', '3', 'SL019205', ...),
( '9986-14', '3', 'SL005356', ...),
( '9989-12', '3', 'SL019194', ...),
( '9993-11', '3', 'SL019212', ...),
( '9994-217', '3', 'SL019217', ...),
( '9995-6', '3', 'SL013164', ...),
( '9997-12', '3', 'SL019215', ...),
( '9999-1', '3', 'SL019231', ...)],
names=['SeqId', 'SeqIdVersion', 'SomaId', 'TargetFullName', 'Target', 'UniProt', 'EntrezGeneID', 'EntrezGeneSymbol', 'Organism', 'Units', 'Type', 'Dilution', 'PlateScale_Reference', 'CalReference', 'Cal_Example_Adat_Set001', 'ColCheck', 'CalQcRatio_Example_Adat_Set001_170255', 'QcReference_170255', 'Cal_Example_Adat_Set002', 'CalQcRatio_Example_Adat_Set002_170255'], length=5284)
from IPython.display import HTML #Display the first five rows and columns of the adat HTML(adat.iloc[:5,:5].to_html()) # Need to use HTML & to_html() here to display nicely for this README # Output is left-right scrollable in both this readme and Jupyter notebooksSeqId 10000-28 10001-7 10003-15 10006-25 10008-43 SeqIdVersion 3 3 3 3 3 SomaId SL019233 SL002564 SL019245 SL019228 SL019234 TargetFullName Beta-crystallin B2 RAF proto-oncogene serine/threonine-protein kinase Zinc finger protein 41 ETS domain-containing protein Elk-1 Guanylyl cyclase-activating protein 1 Target CRBB2 c-Raf ZNF41 ELK1 GUC1A UniProt P43320 P04049 P51814 P19419 P43080 EntrezGeneID 1415 5894 7592 2002 2978 EntrezGeneSymbol CRYBB2 RAF1 ZNF41 ELK1 GUCA1A Organism Human Human Human Human Human Units RFU RFU RFU RFU RFU Type Protein Protein Protein Protein Protein Dilution 20 20 0.5 20 20 PlateScale_Reference 687.4 227.8 126.9 634.2 585.0 CalReference 687.4 227.8 126.9 634.2 585.0 Cal_Example_Adat_Set001 1.01252025 1.01605709 0.95056180 0.99607350 0.94051447 ColCheck PASS PASS PASS PASS PASS CalQcRatio_Example_Adat_Set001_170255 1.008 0.970 1.046 1.042 1.036 QcReference_170255 505.4 223.9 119.6 667.2 587.5 Cal_Example_Adat_Set002 1.01476233 1.03686846 1.15258856 0.93581231 0.96201283 CalQcRatio_Example_Adat_Set002_170255 1.067 1.007 0.981 1.026 0.998 PlateId PlateRunDate ScannerID PlatePosition SlideId Subarray SampleId SampleType PercentDilution SampleMatrix Barcode Barcode2d SampleName SampleNotes AliquotingNotes SampleDescription AssayNotes TimePoint ExtIdentifier SsfExtId SampleGroup SiteId TubeUniqueID CLI HybControlNormScale RowCheck NormScale_20 NormScale_0_005 NormScale_0_5 ANMLFractionUsed_20 ANMLFractionUsed_0_005 ANMLFractionUsed_0_5 Age Sex Example Adat Set001 2020-06-18 SG15214400 H9 258495800012 3 1 Sample 20 Plasma-PPT 0.98185998 PASS 1.03693580 0.85701624 0.77717491 0.914 0.869 0.903 76 F 476.5 310.1 100.3 602.8 561.8 H8 258495800004 7 2 Sample 20 Plasma-PPT 0.96671829 PASS 0.96022505 0.84858420 0.85201953 0.937 0.956 0.973 55 F 474.4 293.5 101.8 561.9 541.9 H7 258495800010 8 3 Sample 20 Plasma-PPT 1.00193072 PASS 0.98411617 1.03270156 0.91519153 0.907 0.919 0.915 47 M 415.6 299.6 3030.1 563.9 423.9 H6 258495800003 4 4 Sample 20 Plasma-PPT 0.94017961 PASS 1.07839878 0.94626841 0.91246731 0.934 0.919 0.912 37 M 442.6 247.9 112.9 563.7 469.8 H5 258495800009 4 5 Sample 20 Plasma-PPT 0.94621098 PASS 0.84679446 0.92904553 0.77413056 0.707 0.894 0.708 71 F 465.7 710.7 95.9 791.0 443.5
You may also access the dict header metadata via:
{'!AdatId': 'GID-1234-56-789-abcdef',
'!Version': '1.2',
'!AssayType': 'PharmaServices',
'!AssayVersion': 'V4',
'!AssayRobot': 'Fluent 1 L-307',
'!Legal': 'Experiment details and data have been processed to protect Personally Identifiable Information (PII) and comply with existing privacy laws.',
'!CreatedBy': 'PharmaServices',
'!CreatedDate': '2020-07-24',
'!EnteredBy': 'Technician1',
'!ExpDate': '2020-06-18, 2020-07-20',
'!GeneratedBy': 'Px (Build: : ), Canopy_0.1.1',
'!RunNotes': "2 columns ('Age' and 'Sex') have been added to this ADAT. Age has been randomly increased or decreased by 1-2 years to protect patient information",
'!ProcessSteps': 'Raw RFU, Hyb Normalization, medNormInt (SampleId), plateScale, Calibration, anmlQC, qcCheck, anmlSMP',
'!ProteinEffectiveDate': '2019-08-06',
'!StudyMatrix': 'EDTA Plasma',
'!PlateType': '',
'!LabLocation': 'SLUS',
'!StudyOrganism': '',
'!Title': 'Example Adat Set001, Example Adat Set002',
'!AssaySite': 'SW',
'!CalibratorId': '170261',
'!ReportConfig': {'analysisSteps': [{'stepType': 'hybNorm',
'referenceSource': 'intraplate',
'includeSampleTypes': ['QC', 'Calibrator', 'Buffer']},
{'stepName': 'medNormInt',
'stepType': 'medNorm',
'includeSampleTypes': ['Calibrator', 'Buffer'],
'referenceSource': 'intraplate',
'referenceFields': ['SampleId']},
{'stepType': 'plateScale',
'referenceSource': 'Reference_v4_Plasma_Calibrator_170261'},
{'stepType': 'calibrate',
'referenceSource': 'Reference_v4_Plasma_Calibrator_170261'},
{'stepName': 'anmlQC',
'stepType': 'ANML',
'effectSizeCutoff': 2.0,
'minFractionUsed': 0.3,
'includeSampleTypes': ['QC'],
'referenceSource': 'Reference_v4_Plasma_ANML'},
{'stepType': 'qcCheck',
'QCReferenceSource': 'Reference_v4_Plasma_QC_ANML_170255',
'tailsCriteriaLower': 0.8,
'tailsCriteriaUpper': 1.2,
'tailThreshold': 15.0,
'QCAdditionalReferenceSources': ['Reference_v4_Plasma_QC_ANML_170259',
'Reference_v4_Plasma_QC_ANML_170260'],
'prenormalized': True},
{'stepName': 'anmlSMP',
'stepType': 'ANML',
'effectSizeCutoff': 2.0,
'minFractionUsed': 0.3,
'includeSampleTypes': ['Sample'],
'referenceSource': 'Reference_v4_Plasma_ANML'}],
'qualityReports': ['SQS Report'],
'filter': {'proteinEffectiveDate': '2019-08-06'}},
'HybNormReference': 'intraplate',
'MedNormReference': 'intraplate',
'NormalizationAlgorithm': 'ANML',
'PlateScale_ReferenceSource': 'Reference_v4_Plasma_Calibrator_170261',
'PlateScale_Scalar_Example_Adat_Set001': '1.08091554',
'PlateScale_PassFlag_Example_Adat_Set001': 'PASS',
'CalibrationReference': 'Reference_v4_Plasma_Calibrator_170261',
'CalPlateTailPercent_Example_Adat_Set001': '0.1',
'PlateTailPercent_Example_Adat_Set001': '1.2',
'PlateTailTest_Example_Adat_Set001': 'PASS',
'PlateScale_Scalar_Example_Adat_Set002': '1.09915270',
'PlateScale_PassFlag_Example_Adat_Set002': 'PASS',
'CalPlateTailPercent_Example_Adat_Set002': '2.6',
'PlateTailPercent_Example_Adat_Set002': '4.2',
'PlateTailTest_Example_Adat_Set002': 'PASS'}
SomaData's Adat object inherits the pandas printing methods which displays nicely in Jupyter Notebooks when using IPython.display.display()
.
Dataframe columns
Contain Feature Information
# Using the `adat` loaded from above aptamer_df = adat.columns.to_frame(index=False) type(aptamer_df)
pandas.core.frame.DataFrame
HTML(aptamer_df.head(5).to_html())SeqId SeqIdVersion SomaId TargetFullName Target UniProt EntrezGeneID EntrezGeneSymbol Organism Units Type Dilution PlateScale_Reference CalReference Cal_Example_Adat_Set001 ColCheck CalQcRatio_Example_Adat_Set001_170255 QcReference_170255 Cal_Example_Adat_Set002 CalQcRatio_Example_Adat_Set002_170255 0 10000-28 3 SL019233 Beta-crystallin B2 CRBB2 P43320 1415 CRYBB2 Human RFU Protein 20 687.4 687.4 1.01252025 PASS 1.008 505.4 1.01476233 1.067 1 10001-7 3 SL002564 RAF proto-oncogene serine/threonine-protein kinase c-Raf P04049 5894 RAF1 Human RFU Protein 20 227.8 227.8 1.01605709 PASS 0.970 223.9 1.03686846 1.007 2 10003-15 3 SL019245 Zinc finger protein 41 ZNF41 P51814 7592 ZNF41 Human RFU Protein 0.5 126.9 126.9 0.95056180 PASS 1.046 119.6 1.15258856 0.981 3 10006-25 3 SL019228 ETS domain-containing protein Elk-1 ELK1 P19419 2002 ELK1 Human RFU Protein 20 634.2 634.2 0.99607350 PASS 1.042 667.2 0.93581231 1.026 4 10008-43 3 SL019234 Guanylyl cyclase-activating protein 1 GUC1A P43080 2978 GUCA1A Human RFU Protein 20 585.0 585.0 0.94051447 PASS 1.036 587.5 0.96201283 0.998
The .to_frame()
method creates a lookup table that links the feature names in the adat
object to the annotation data in columns
:
col_df = adat.columns.to_frame(index=False) type(col_df)
pandas.core.frame.DataFrame
HTML(col_df.head(5).to_html())SeqId SeqIdVersion SomaId TargetFullName Target UniProt EntrezGeneID EntrezGeneSymbol Organism Units Type Dilution PlateScale_Reference CalReference Cal_Example_Adat_Set001 ColCheck CalQcRatio_Example_Adat_Set001_170255 QcReference_170255 Cal_Example_Adat_Set002 CalQcRatio_Example_Adat_Set002_170255 0 10000-28 3 SL019233 Beta-crystallin B2 CRBB2 P43320 1415 CRYBB2 Human RFU Protein 20 687.4 687.4 1.01252025 PASS 1.008 505.4 1.01476233 1.067 1 10001-7 3 SL002564 RAF proto-oncogene serine/threonine-protein kinase c-Raf P04049 5894 RAF1 Human RFU Protein 20 227.8 227.8 1.01605709 PASS 0.970 223.9 1.03686846 1.007 2 10003-15 3 SL019245 Zinc finger protein 41 ZNF41 P51814 7592 ZNF41 Human RFU Protein 0.5 126.9 126.9 0.95056180 PASS 1.046 119.6 1.15258856 0.981 3 10006-25 3 SL019228 ETS domain-containing protein Elk-1 ELK1 P19419 2002 ELK1 Human RFU Protein 20 634.2 634.2 0.99607350 PASS 1.042 667.2 0.93581231 1.026 4 10008-43 3 SL019234 Guanylyl cyclase-activating protein 1 GUC1A P43080 2978 GUCA1A Human RFU Protein 20 585.0 585.0 0.94051447 PASS 1.036 587.5 0.96201283 0.998
adat.columns.get_level_values('SeqId')[:20] # first 20 features
Index(['10000-28', '10001-7', '10003-15', '10006-25', '10008-43', '10011-65',
'10012-5', '10013-34', '10014-31', '10015-119', '10021-1', '10022-207',
'10023-32', '10024-44', '10030-8', '10034-16', '10035-6', '10036-201',
'10037-98', '10040-63'],
dtype='object', name='SeqId')
Dataframe index
Contains Sample Information
# Using the `adat` loaded from above sample_df = adat.index.to_frame(index=False) type(sample_df)
pandas.core.frame.DataFrame
HTML(sample_df.head(5).to_html())PlateId PlateRunDate ScannerID PlatePosition SlideId Subarray SampleId SampleType PercentDilution SampleMatrix Barcode Barcode2d SampleName SampleNotes AliquotingNotes SampleDescription AssayNotes TimePoint ExtIdentifier SsfExtId SampleGroup SiteId TubeUniqueID CLI HybControlNormScale RowCheck NormScale_20 NormScale_0_005 NormScale_0_5 ANMLFractionUsed_20 ANMLFractionUsed_0_005 ANMLFractionUsed_0_5 Age Sex 0 Example Adat Set001 2020-06-18 SG15214400 H9 258495800012 3 1 Sample 20 Plasma-PPT 0.98185998 PASS 1.03693580 0.85701624 0.77717491 0.914 0.869 0.903 76 F 1 Example Adat Set001 2020-06-18 SG15214400 H8 258495800004 7 2 Sample 20 Plasma-PPT 0.96671829 PASS 0.96022505 0.84858420 0.85201953 0.937 0.956 0.973 55 F 2 Example Adat Set001 2020-06-18 SG15214400 H7 258495800010 8 3 Sample 20 Plasma-PPT 1.00193072 PASS 0.98411617 1.03270156 0.91519153 0.907 0.919 0.915 47 M 3 Example Adat Set001 2020-06-18 SG15214400 H6 258495800003 4 4 Sample 20 Plasma-PPT 0.94017961 PASS 1.07839878 0.94626841 0.91246731 0.934 0.919 0.912 37 M 4 Example Adat Set001 2020-06-18 SG15214400 H5 258495800009 4 5 Sample 20 Plasma-PPT 0.94621098 PASS 0.84679446 0.92904553 0.77413056 0.707 0.894 0.708 71 F
The Adat
index and columns are pandas.MultiIndex
objects. Several convenience functions exist to help you modify these objects. Typically, the row metadata (axis=0) represents data describing the sample or the individual from whom the sample was collected. The column metadata (axis=1) contains data regarding the SOMAmer reagent, the reagent's target and scalars applied to columns during normalization, these columns are not usually edited by the end user but can be using the same methods demonstrated on row metadata below.
Row metadata is sample level information which could include added clinical data like age, sex or clinical measurements. The Adat class facilitates managing this data.
# using ittertools to simulate some metadata: from itertools import cycle, islice import pandas as pd # for demonstration we will simulate two types of metadata. Metadata stored in a list and metadata stored with key-value pairs. metadata_list = list(islice(cycle(['A', 'B', 'X', 'Y']), adat.shape[0])) metadata_dictionary = {k:v for k, v in zip(adat.index.get_level_values('SampleId').to_list(), metadata_list)}
You might do this if you know your metadata and Adat
are ordered the same way but you are not using a shared key.
new_meta_adat = adat.insert_meta(0,'GroupData', metadata_list) # this will produce a new Adat file with group data in the right most column of the index new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'GroupData']]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
PlateId SampleId GroupData 0 Example Adat Set001 1 A 1 Example Adat Set001 2 B 2 Example Adat Set001 3 X 3 Example Adat Set001 4 Y 4 Example Adat Set001 5 A 5 Example Adat Set001 6 B 6 Example Adat Set001 7 XYou might have data coming in as key value pairs from another data source. In that case it is easier to insert metadata using those keys:
# The arguments are `axis` 0 for row metadata, 1 for column metadata, `name` the name of the new index, #`key_meta_name` the nameo of the axis used to match the keys. `values_dict` the dictionary containing the new data new_meta_adat = adat.insert_keyed_meta(0,'GroupData', 'SampleId', metadata_dictionary) # this will produce a new Adat file with group data in the right most column of the index new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'GroupData']]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
PlateId SampleId GroupData 0 Example Adat Set001 1 A 1 Example Adat Set001 2 B 2 Example Adat Set001 3 X 3 Example Adat Set001 4 Y 4 Example Adat Set001 5 A 5 Example Adat Set001 6 B 6 Example Adat Set001 7 X Replace Metadata with Unlabeled MetadataYou might do this if you know your metadata and Adat
are ordered the same way but you are not using a shared key.
new_meta_adat = adat.replace_meta(0,'SampleName', metadata_list) # this will produce a new Adat file with group data in the right most column of the index new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'SampleName']]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
PlateId SampleId SampleName 0 Example Adat Set001 1 A 1 Example Adat Set001 2 B 2 Example Adat Set001 3 X 3 Example Adat Set001 4 Y 4 Example Adat Set001 5 A 5 Example Adat Set001 6 B 6 Example Adat Set001 7 X Replace Metadata with Keyed MetadataYou might need to replace metadata based on another document using key-value pairs.
#Here we replace the values in `SampleName` with the values from `metadata_dictionary` new_meta_adat = adat.replace_keyed_meta(0,'SampleName', metadata_dictionary, 'SampleId') # this will produce a new Adat file with group data in the right most column of the index new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'SampleName']]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
PlateId SampleId SampleName 0 Example Adat Set001 1 A 1 Example Adat Set001 2 B 2 Example Adat Set001 3 X 3 Example Adat Set001 4 Y 4 Example Adat Set001 5 A 5 Example Adat Set001 6 B 6 Example Adat Set001 7 XYou may perform mathematical transformations on the feature data via apply or calling those functions and passing the entire dataframe.
import numpy as np # Using the `adat` loaded from above log10_adat = adat.apply(np.log10) # equivalent to `np.log10(adat)` rounded_adat = adat.apply(round, args=[5,]) # equivalent to `round(adat, 5)` sqrt_adat = adat.apply(np.sqrt) # equivlane to `np.sqrt(adat)`Subsetting/Slicing the Dataframe
You may extract certain subgroups of samples and/or features. SomaData augments the pandas dataframe with a number of helper functions to aid the user.
# Extract rows of only calibrator-type samples calibrator_adat = adat.pick_on_meta(axis=0, name='SampleType', values=['Calibrator']) # Exclude calibrator-type samples non_calibrator_adat = adat.exclude_on_meta(axis=0, name='SampleType', values=['Calibrator']) # Extract columns containing features that start with 'MMP' target_names = adat.columns.get_level_values('Target') mmp_names = [target for target in target_names if target.startswith('MMP')] mmp_adat = adat.pick_on_meta(axis=1, name='Target', values=mmp_names) mmp_adat
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
SeqId 15419-15 2579-17 2788-55 2789-26 2838-53 4160-49 4496-60 4924-32 4925-54 5002-76 5268-49 6425-87 8479-4 9172-69 9719-145 SeqIdVersion 3 5 1 2 1 1 2 1 2 1 3 3 3 3 3 SomaId SL012374 SL000527 SL000524 SL000525 SL003332 SL000124 SL000522 SL000521 SL000523 SL002646 SL003331 SL007616 SL000645 SL000526 SL003331 TargetFullName Matrix metalloproteinase-20 Matrix metalloproteinase-9 Stromelysin-1 Matrilysin Matrix metalloproteinase-17 72 kDa type IV collagenase Macrophage metalloelastase Interstitial collagenase Collagenase 3 Matrix metalloproteinase-14 Matrix metalloproteinase-16 Matrix metalloproteinase-19 Stromelysin-2 Neutrophil collagenase Matrix metalloproteinase-16 Target MMP20 MMP-9 MMP-3 MMP-7 MMP-17 MMP-2 MMP-12 MMP-1 MMP-13 MMP-14 MMP-16 MMP19 MMP-10 MMP-8 MMP-16 UniProt O60882 P14780 P08254 P09237 Q9ULZ9 P08253 P39900 P03956 P45452 P50281 P51512 Q99542 P09238 P22894 P51512 EntrezGeneID 9313 4318 4314 4316 4326 4313 4321 4312 4322 4323 4325 4327 4319 4317 4325 EntrezGeneSymbol MMP20 MMP9 MMP3 MMP7 MMP17 MMP2 MMP12 MMP1 MMP13 MMP14 MMP16 MMP19 MMP10 MMP8 MMP16 Organism Human Human Human Human Human Human Human Human Human Human Human Human Human Human Human Units RFU RFU RFU RFU RFU RFU RFU RFU RFU RFU RFU RFU RFU RFU RFU Type Protein Protein Protein Protein Protein Protein Protein Protein Protein Protein Protein Protein Protein Protein Protein Dilution 20 0.5 0.5 20 20 0.5 20 20 20 20 20 0.5 20 20 20 PlateScale_Reference 937.5 19472.3 117.2 2392.9 1520.6 14888.5 1014.9 7611.5 376.6 632.1 565.8 5063.4 1534.0 1088.5 1166.8 CalReference 937.5 19472.3 117.2 2392.9 1520.6 14888.5 1014.9 7611.5 376.6 632.1 565.8 5063.4 1534.0 1088.5 1166.8 Cal_Example_Adat_Set001 1.06947296 1.01957222 0.98404702 0.90131455 1.13783298 0.98961103 0.96180819 0.91162239 0.98689727 1.02497162 0.97906212 0.97843478 0.95061040 0.94709823 1.02243253 ColCheck PASS PASS PASS PASS PASS PASS PASS PASS PASS PASS PASS PASS PASS PASS PASS CalQcRatio_Example_Adat_Set001_170255 1.132 1.002 0.893 1.130 0.955 0.987 1.014 1.054 1.023 0.987 1.024 1.005 1.023 1.029 1.003 QcReference_170255 793.4 10420.6 124.2 9482.5 1252.0 16044.7 1115.3 13192.8 394.4 492.1 559.2 6162.7 1946.6 845.5 1171.0 Cal_Example_Adat_Set002 1.02380692 1.00117741 1.31243001 0.67812509 1.09317038 0.98499534 0.95953484 0.95154455 0.90594178 1.08143713 0.98143972 0.98821187 0.92700024 1.03983569 1.02055454 CalQcRatio_Example_Adat_Set002_170255 1.192 1.009 0.821 1.123 1.083 1.028 1.006 1.025 0.991 0.949 1.002 0.990 1.010 1.039 0.951 PlateId PlateRunDate ScannerID PlatePosition SlideId Subarray SampleId SampleType PercentDilution SampleMatrix Barcode Barcode2d SampleName SampleNotes AliquotingNotes SampleDescription AssayNotes TimePoint ExtIdentifier SsfExtId SampleGroup SiteId TubeUniqueID CLI HybControlNormScale RowCheck NormScale_20 NormScale_0_005 NormScale_0_5 ANMLFractionUsed_20 ANMLFractionUsed_0_005 ANMLFractionUsed_0_5 Age Sex Example Adat Set001 2020-06-18 SG15214400 H9 258495800012 3 1 Sample 20 Plasma-PPT 0.98185998 PASS 1.03693580 0.85701624 0.77717491 0.914 0.869 0.903 76 F 729.9 16230.2 177.9 11903.3 1378.1 11997.2 2455.9 20988.1 442.5 414.2 712.5 8965.5 2030.6 2181.5 1096.4 H8 258495800004 7 2 Sample 20 Plasma-PPT 0.96671829 PASS 0.96022505 0.84858420 0.85201953 0.937 0.956 0.973 55 F 873.3 17253.4 152.8 16508.8 1652.0 14574.6 1595.0 11498.5 501.1 505.9 629.9 8669.7 1301.6 1571.2 1149.0 H7 258495800010 8 3 Sample 20 Plasma-PPT 1.00193072 PASS 0.98411617 1.03270156 0.91519153 0.907 0.919 0.915 47 M 993.0 13094.1 127.5 7577.0 1711.7 14468.7 503.3 14697.7 2883.2 445.8 510.6 6728.9 1067.3 1528.9 1027.8 H6 258495800003 4 4 Sample 20 Plasma-PPT 0.94017961 PASS 1.07839878 0.94626841 0.91246731 0.934 0.919 0.912 37 M 906.5 20991.4 155.0 8673.7 1667.5 9913.1 438.4 20819.0 375.2 644.8 547.8 8629.0 1162.4 1173.7 1091.6 H5 258495800009 4 5 Sample 20 Plasma-PPT 0.94621098 PASS 0.84679446 0.92904553 0.77413056 0.707 0.894 0.708 71 F 747.0 8070.4 124.0 20423.7 1426.6 11345.0 1954.5 16168.9 356.9 446.4 541.7 8125.2 1667.0 1048.6 1132.6 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... Example Adat Set002 2020-07-20 SG15214400 A2 258495800108 3 188 Sample 20 Plasma-PPT 0.96699908 PASS 0.95993275 1.08910138 0.99491979 0.566 0.912 0.719 38 F 714.7 19877.3 114.5 15151.9 894.7 17266.2 682.8 27419.4 480.3 705.1 527.9 6638.2 3919.7 1787.5 1191.1 A12 258495800104 2 189 Sample 20 Plasma-PPT 0.91482584 PASS 1.21880129 1.01022697 0.99244374 0.918 0.919 0.926 40 F 865.8 25801.4 123.6 15711.6 1308.4 16598.3 1369.2 15153.5 474.2 655.0 592.2 8953.9 2494.9 2156.2 1391.5 A11 258495800108 5 190 Sample 20 Plasma-PPT 0.88282283 PASS 1.36699142 1.16271427 1.19673587 0.927 0.981 0.964 43 M 869.7 13728.5 140.9 11437.3 1353.5 17996.2 1344.2 10575.3 433.7 439.8 530.7 6193.9 2067.6 2466.6 1114.8 A10 258495800105 5 191 Sample 20 Plasma-PPT 0.95792282 PASS 1.30590374 0.98395166 0.97460119 0.835 0.963 0.944 55 M 529.2 13298.2 161.7 14210.8 1026.0 14549.0 1466.1 9683.8 291.6 435.8 676.1 9584.8 1799.5 1347.8 1194.2 A1 258495800110 5 192 Sample 20 Plasma-PPT 0.97384118 PASS 1.30710646 0.93230123 1.00804341 0.793 0.963 0.933 56 F 1934.4 7567.8 133.0 13614.3 1220.2 16223.8 1110.9 7737.4 364.2 3407.0 645.2 7867.1 1720.9 1888.2 1293.8192 rows × 15 columns
Lifting ADAT data between assay versions.Adat data can be lifted from one SomaScan Assay version's RFU space to another SomaScan Assay version's RFU space. This is achieved by scaling SOMAmer reagent measurement columns using scale factors available at menu.somalogic.com and built in to this tool in the Adat.lift()
method.
The example Adat exists in SomaScan Version V4.0 assay space (also called 5K in some literature). In this example we will lift to SomaScan V5.0 (11K) assay space. It is important to know that only SOMAmer reagent measurements in both assay versions can be lifted. When lifting from a smaller to a larger plex (as demonstrated) the resulting Adat
will remain in the smaller plex size. When lifting from a larger to smaller plex size reagents that don't exist in the small plex size will be scaled by 1.0. The end user might choose to redact the lifted Adat
to the smaller plex to better merge data.
The tool will raise in error if the end user attempts to lift an Adat
object to its current version or an unsupported assay version.
SomaScan data can be referred to by the assay version i.e. 'V5.0' or by the plex size i.e. '11K'. The tool can use either 'V5.0' or '11K' interchangeably in it's input. The mapping between these terms is shown in the table below:
SomaScan Assay Version SomaScan Plex SomaScan Menu Size V4 5K 5284 V4.1 7K 7596 V5.0 11K 11083lifted_adat = adat.lift('V5.0')
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
SeqId 10000-28 10001-7 10003-15 10006-25 10008-43 10011-65 10012-5 10013-34 10014-31 10015-119 ... 9981-18 9983-97 9984-12 9986-14 9989-12 9993-11 9994-217 9995-6 9997-12 9999-1 SeqIdVersion 3 3 3 3 3 3 3 3 3 3 ... 3 3 3 3 3 3 3 3 3 3 SomaId SL019233 SL002564 SL019245 SL019228 SL019234 SL019246 SL014669 SL025418 SL007803 SL014924 ... SL018293 SL019202 SL019205 SL005356 SL019194 SL019212 SL019217 SL013164 SL019215 SL019231 TargetFullName Beta-crystallin B2 RAF proto-oncogene serine/threonine-protein kinase Zinc finger protein 41 ETS domain-containing protein Elk-1 Guanylyl cyclase-activating protein 1 Inositol polyphosphate 5-phosphatase OCRL-1 SAM pointed domain-containing Ets transcription factor Fc_MOUSE Zinc finger protein SNAI2 Voltage-gated potassium channel subunit beta-2 ... Protein FAM234B Inactive serine protease 35 Protein YIPF6 Neuropeptide W Leucine-rich repeat-containing protein 24 Zinc finger protein 264 Potassium-transporting ATPase subunit beta Deoxyuridine 5'-triphosphate nucleotidohydrolase, mitochondrial UBX domain-containing protein 4 Interferon regulatory factor 6 Target CRBB2 c-Raf ZNF41 ELK1 GUC1A OCRL SPDEF Fc_MOUSE SLUG KCAB2 ... K1467 PRS35 YIPF6 Neuropeptide W LRC24 ZN264 ATP4B DUT UBXN4 IRF6 UniProt P43320 P04049 P51814 P19419 P43080 Q01968 O95238 Q99LC4 O43623 Q13303 ... A2RU67 Q8N3Z0 Q96EC8 Q8N729 Q50LG9 O43296 P51164 P33316 Q92575 O14896 EntrezGeneID 1415 5894 7592 2002 2978 4952 25803 6591 8514 ... 57613 167681 286451 283869 441381 9422 496 1854 23190 3664 EntrezGeneSymbol CRYBB2 RAF1 ZNF41 ELK1 GUCA1A OCRL SPDEF SNAI2 KCNAB2 ... KIAA1467 PRSS35 YIPF6 NPW LRRC24 ZNF264 ATP4B DUT UBXN4 IRF6 Organism Human Human Human Human Human Human Human Mouse Human Human ... Human Human Human Human Human Human Human Human Human Human Units RFU RFU RFU RFU RFU RFU RFU RFU RFU RFU ... RFU RFU RFU RFU RFU RFU RFU RFU RFU RFU Type Protein Protein Protein Protein Protein Protein Protein Protein Protein Protein ... Protein Protein Protein Protein Protein Protein Protein Protein Protein Protein Dilution 20 20 0.5 20 20 20 20 20 20 20 ... 20 20 20 20 20 20 20 20 20 20 PlateScale_Reference 687.4 227.8 126.9 634.2 585.0 2807.1 1623.3 499.6 857.2 443.3 ... 643.9 430.0 627.5 3644.5 449.4 953.3 1971.1 1275.6 4426.9 851.9 CalReference 687.4 227.8 126.9 634.2 585.0 2807.1 1623.3 499.6 857.2 443.3 ... 643.9 430.0 627.5 3644.5 449.4 953.3 1971.1 1275.6 4426.9 851.9 Cal_Example_Adat_Set001 1.01252025 1.01605709 0.95056180 0.99607350 0.94051447 1.05383489 1.17290462 1.07095391 1.03464092 1.07466667 ... 0.98035932 1.04878049 1.03513692 0.96341431 1.01444695 1.04551437 0.98299422 0.97426106 0.96896272 0.96042841 ColCheck PASS PASS PASS PASS PASS PASS PASS PASS PASS PASS ... PASS PASS PASS PASS PASS PASS PASS PASS PASS PASS CalQcRatio_Example_Adat_Set001_170255 1.008 0.970 1.046 1.042 1.036 0.975 1.010 0.953 0.978 0.975 ... 0.982 0.949 1.003 0.938 1.017 0.998 1.071 0.985 0.960 0.974 QcReference_170255 505.4 223.9 119.6 667.2 587.5 2617.6 1340.6 443.0 1289.4 441.5 ... 700.7 393.2 612.6 3089.2 455.1 885.6 1389.7 950.9 5560.7 1033.6 Cal_Example_Adat_Set002 1.01476233 1.03686846 1.15258856 0.93581231 0.96201283 1.03133955 1.21250373 1.18192572 0.98926717 1.13173347 ... 0.96075798 1.15250603 1.12013567 1.08296437 0.99314917 1.08268030 1.02784586 0.97351752 0.94828953 0.92900763 CalQcRatio_Example_Adat_Set002_170255 1.067 1.007 0.981 1.026 0.998 1.013 1.078 0.996 0.971 0.941 ... 0.982 0.993 0.990 0.929 0.978 0.961 1.022 0.970 1.027 0.997 PlateId PlateRunDate ScannerID PlatePosition SlideId Subarray SampleId SampleType PercentDilution SampleMatrix Barcode Barcode2d SampleName SampleNotes AliquotingNotes SampleDescription AssayNotes TimePoint ExtIdentifier SsfExtId SampleGroup SiteId TubeUniqueID CLI HybControlNormScale RowCheck NormScale_20 NormScale_0_005 NormScale_0_5 ANMLFractionUsed_20 ANMLFractionUsed_0_005 ANMLFractionUsed_0_5 Age Sex Example Adat Set001 2020-06-18 SG15214400 H9 258495800012 3 1 Sample 20 Plasma-PPT 0.98185998 PASS 1.03693580 0.85701624 0.77717491 0.914 0.869 0.903 76 F 386.0 309.5 97.6 449.1 396.6 4965.9 1106.7 274.9 786.3 567.2 ... 551.9 352.5 408.4 3027.5 538.8 686.3 5202.4 2188.4 12697.7 966.5 H8 258495800004 7 2 Sample 20 Plasma-PPT 0.96671829 PASS 0.96022505 0.84858420 0.85201953 0.937 0.956 0.973 55 F 384.3 292.9 99.1 418.6 382.6 2149.6 1307.8 324.1 779.4 371.8 ... 689.3 358.2 456.0 5724.5 470.3 663.0 1195.9 2302.7 13247.8 824.2 H7 258495800010 8 3 Sample 20 Plasma-PPT 1.00193072 PASS 0.98411617 1.03270156 0.91519153 0.907 0.919 0.915 47 M 336.6 299.0 2948.3 420.1 299.3 2306.6 1290.9 348.6 845.0 416.8 ... 547.0 382.0 503.9 3380.7 405.3 647.6 1552.4 1183.2 11072.1 1145.8 H6 258495800003 4 4 Sample 20 Plasma-PPT 0.94017961 PASS 1.07839878 0.94626841 0.91246731 0.934 0.919 0.912 37 M 358.5 247.4 109.9 420.0 331.7 2261.4 1184.1 362.0 4348.0 374.6 ... 561.9 384.4 477.7 1361.8 493.0 643.5 1005.3 1399.4 9082.4 804.6 H5 258495800009 4 5 Sample 20 Plasma-PPT 0.94621098 PASS 0.84679446 0.92904553 0.77413056 0.707 0.894 0.708 71 F 377.2 709.3 93.3 589.3 313.1 1949.4 990.0 272.8 787.7 723.8 ... 480.0 343.0 742.3 4846.0 487.7 701.4 1132.4 9852.9 38461.1 2865.9 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... Example Adat Set002 2020-07-20 SG15214400 A2 258495800108 3 188 Sample 20 Plasma-PPT 0.96699908 PASS 0.95993275 1.08910138 0.99491979 0.566 0.912 0.719 38 F 393.3 954.0 124.2 719.8 1284.4 1312.0 750.7 312.1 727.7 829.2 ... 454.3 301.8 531.0 1658.5 541.8 861.5 1352.3 11604.6 45580.6 3687.1 A12 258495800104 2 189 Sample 20 Plasma-PPT 0.91482584 PASS 1.21880129 1.01022697 0.99244374 0.918 0.919 0.926 40 F 337.4 281.6 82.5 625.5 26153.5 1856.4 1004.8 297.4 734.0 388.3 ... 636.7 292.7 410.7 9236.8 487.6 629.3 1910.9 1855.0 9778.6 1004.4 A11 258495800108 5 190 Sample 20 Plasma-PPT 0.88282283 PASS 1.36699142 1.16271427 1.19673587 0.927 0.981 0.964 43 M 372.9 270.8 204.2 472.6 446.1 1733.8 1067.7 354.3 745.7 410.7 ... 566.0 364.5 448.2 2597.7 515.6 621.4 1113.3 1302.7 8766.2 770.8 A10 258495800105 5 191 Sample 20 Plasma-PPT 0.95792282 PASS 1.30590374 0.98395166 0.97460119 0.835 0.963 0.944 55 M 320.4 319.1 105.9 527.1 370.8 1701.9 756.9 266.4 618.4 453.9 ... 536.3 309.1 434.0 5167.2 522.8 588.2 891.1 2466.6 15455.9 1190.4 A1 258495800110 5 192 Sample 20 Plasma-PPT 0.97384118 PASS 1.30710646 0.93230123 1.00804341 0.793 0.963 0.933 56 F 370.3 288.1 187.2 469.9 438.5 1777.9 787.3 249.4 930.8 423.7 ... 601.9 285.9 467.0 2564.2 590.0 673.2 1036.7 2142.1 7950.7 722.8192 rows × 5284 columns
# because some common documentation refers to plex size instead of assay version the tool also supports lifing by naming plex size. lifted_adat = adat.lift('11K')
# Observing the Lin's CCC of the lifting scale factors. from somadata.data import getSomaScanLiftCCC # the method returns a pandas dataframe containing the available lift Lins's CCC values: ccc = getSomaScanLiftCCC() ccc
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
Serum Lin's CCC v5.0 11K to v4.1 7K Plasma Lin's CCC v5.0 11K to v4.1 7K Serum Lin's CCC v5.0 11K to v4.0 5K Plasma Lin's CCC v5.0 11K to v4.0 5K Serum Lin's CCC v4.1 7K to v4.0 5K Plasma Lin's CCC v4.1 7K to v4.0 5K 10000-28 0.977 0.982 0.970 0.966 0.967 0.963 10001-7 0.857 0.961 0.819 0.860 0.875 0.875 10003-15 0.759 0.787 0.761 0.674 0.774 0.668 10006-25 0.937 0.927 0.903 0.864 0.937 0.877 10008-43 0.951 0.939 0.915 0.879 0.925 0.908 ... ... ... ... ... ... ... 9993-11 0.823 0.855 0.704 0.753 0.714 0.711 9994-217 0.492 0.964 0.502 0.767 0.809 0.778 9995-6 0.975 0.976 0.965 0.916 0.983 0.922 9997-12 0.877 0.955 0.857 0.892 0.926 0.885 9999-1 0.909 0.962 0.870 0.883 0.944 0.89811083 rows × 6 columns
Lin's CCC Between Lifted and Assay Space Native DataThe tool allows you to display Lin's concordance correlation coefficient (Lin 1989) derived during the calculation of the lifting scale factors. This metric allows you to see how well lifted data is expected to correlate with date collected originally in the target assay data signal space. A Lin's CCC close to 1.0 indicates strong correlation indicating the signal would be highly concordant with the lifted value if the sample data were collected in the target assay version space.
# your exact transformation's Lin's CCC can be selected by filtering the column that contains your matrix and versions # Lin's CCC are symmetrical v5.0 -> v4.0 == v4.0 -> v5.0. ccc["Plasma Lin's CCC v5.0 11K to v4.0 5K"]
10000-28 0.966
10001-7 0.860
10003-15 0.674
10006-25 0.864
10008-43 0.879
...
9993-11 0.753
9994-217 0.767
9995-6 0.916
9997-12 0.892
9999-1 0.883
Name: Plasma Lin's CCC v5.0 11K to v4.0 5K, Length: 11083, dtype: float64
In order to store or share analysis the user may need to write out an ADAT file. This utility supports writing to the file system.
adat.to_adat('/tmp/out_file.adat')
Although it is beyond the scope of the SomaData
package, below are 3 sample analyses that typical users/clients would perform on SomaLogic data. They are not intended to be a definitive guide in statistical analysis and existing packages do exist in the Python
universe that perform parts or extensions of these techniques. Many variations of the workflows below exist, however the framework highlights how one could perform standard preliminary analyses on SomaLogic data for:
from somadata.data.example_data import example_data # Example ADAT included with SomaData from scipy.stats import ttest_ind from collections import Counter import matplotlib.pyplot as plt import seaborn as sns import pandas as pd import numpy as np from io import StringIODisplay the shape of the adat (rows, columns) Describe the sample types within the adat and display their counts
Counter(example_data.index.get_level_values('SampleType'))
Counter({'Sample': 170, 'Calibrator': 10, 'Buffer': 6, 'QC': 6})
Prepare the adat for analysis
filtered_transformed_data = ( example_data .exclude_on_meta(axis=0, name='Sex', values=['']) # rm NAs if present .pick_on_meta(axis=0, name='SampleType', values=['Sample']) # rm control samples .apply(np.log10) # log10-transform ) clean_data = ( filtered_transformed_data .insert_keyed_meta( # map Sex -> 0/1 axis=0, key_meta_name='Sex', inserted_meta_name='Group', values_dict={'M': 1, 'F': 0} ) .apply(lambda x: x - x.mean(), axis=0) # center features .apply(lambda x: x / x.std(), axis=0) # scale features )Display the grouping counts
print(clean_data.index.to_frame()['Sex'].value_counts()) print(clean_data.index.to_frame()['Group'].value_counts())
Sex
F 85
M 85
Name: count, dtype: int64
Group
0 85
1 85
Name: count, dtype: int64
Split the adat based on Group
and perform t-test across all aptamers
tt_g0 = clean_data.pick_on_meta(axis=0, name='Group', values=[0]) tt_g1 = clean_data.pick_on_meta(axis=0, name='Group', values=[1]) tt_res = ttest_ind(tt_g0, tt_g1) t_tests = list(zip(clean_data.columns.get_level_values('TargetFullName'), tt_res.pvalue))Sort the results and display the 12 aptamers with the most significant p-values
t_tests_sorted = sorted(t_tests, key=lambda x: x[1]) tt_top_12_analytes = [name for name, p_value in t_tests_sorted[:12]] tt_top_12_analytes
['Prostate-specific antigen',
'Pregnancy zone protein',
'Kunitz-type protease inhibitor 3',
'Follicle stimulating hormone',
'Ectonucleotide pyrophosphatase/phosphodiesterase family member 2',
'Beta-defensin 104',
'Luteinizing hormone',
'Cysteine-rich secretory protein 2',
'Human Chorionic Gonadotropin',
'Serum amyloid P-component',
'SLIT and NTRK-like protein 4',
'Neurotrimin']
Plot the Group
log(RFU) for each aptamer
tt_df= ( filtered_transformed_data .pick_meta(axis=1, names=['TargetFullName']) .pick_on_meta(axis=1, name='TargetFullName', values=tt_top_12_analytes)[tt_top_12_analytes] .reset_index() ) tt_melted_df = pd.melt(tt_df, value_vars=tt_top_12_analytes, id_vars='Sex', value_name='log10(RFU)') tt_p = sns.catplot( x='Sex', y='log10(RFU)', col='TargetFullName', data=tt_melted_df, kind='box', col_wrap=3, sharey=False ) tt_p.set_titles(row_template='{row_name}', col_template='{col_name}') plt.show()Logistic Regression (Predict Sex)
# Import the libraries that we need for this analysis from sklearn.model_selection import train_test_split from sklearn import metrics from scipy.stats import pearsonr import statsmodels.api as sm from IPython.display import HTMLPrepare the data for LogisticRegression
# Wrangle `clean_data` into a simpler form logr_x_df = ( clean_data .pick_meta(axis=1, names=['SeqId', 'TargetFullName']) .reset_index(drop=True) ) logr_y_df = ( clean_data.index.get_level_values('Group') ) # Split the dataset into train and test, holding back 25 samples for testing logr_x_train, logr_x_test, logr_y_train, logr_y_test = train_test_split(logr_x_df, logr_y_df, test_size=25, random_state=0)Perform univariate logistic regression for each aptamer
logr_apt_perf = [] for seq_info in logr_x_train: x = sm.add_constant(logr_x_train[seq_info]) # Need to add the intercept term since sm.GLM does not automatically do it mod = sm.GLM(logr_y_train, x, family=sm.families.Binomial()) res = mod.fit() logr_apt_perf.append(res.summary2().tables[1].loc[[seq_info]])Wrangle the GLM results of each aptamer into a dataframe and sort them by p-value
logr_df = pd.concat(logr_apt_perf).reset_index() logr_df['SeqId'] = [x[0] for x in logr_df['index']] logr_df['TargetFullName'] = [x[1] for x in logr_df['index']] logr_df = logr_df.drop('index', axis=1) logr_df = logr_df[['SeqId', 'TargetFullName', 'Coef.', 'Std.Err.', 'z', 'P>|z|', '[0.025', '0.975]']].set_index('SeqId') logr_df_sorted = logr_df.sort_values('P>|z|') HTML(logr_df_sorted.head(20).to_html()) # Need to use HTML here to display nicely for this READMETargetFullName Coef. Std.Err. z P>|z| [0.025 0.975] SeqId 6580-29 Pregnancy zone protein -3.079818 0.489558 -6.291020 3.153866e-10 -4.039334 -2.120302 5763-67 Beta-defensin 104 2.974778 0.478400 6.218181 5.029509e-10 2.037131 3.912425 3032-11 Follicle stimulating hormone -1.505718 0.250398 -6.013292 1.817935e-09 -1.996490 -1.014946 7926-13 Kunitz-type protease inhibitor 3 2.887475 0.482526 5.984087 2.176067e-09 1.941742 3.833208 16892-23 Ectonucleotide pyrophosphatase/phosphodiesterase family member 2 -2.335113 0.396641 -5.887216 3.927542e-09 -3.112516 -1.557710 9282-12 Cysteine-rich secretory protein 2 1.768026 0.309050 5.720837 1.060006e-08 1.162299 2.373754 2953-31 Luteinizing hormone -1.319728 0.240323 -5.491466 3.986115e-08 -1.790753 -0.848702 4914-10 Human Chorionic Gonadotropin -1.244551 0.229781 -5.416240 6.086534e-08 -1.694914 -0.794188 8468-19 Prostate-specific antigen 5.841131 1.113030 5.247953 1.537986e-07 3.659632 8.022630 2474-54 Serum amyloid P-component 1.434929 0.279218 5.139108 2.760458e-07 0.887673 1.982185 8428-102 Neurotrimin -1.264317 0.246142 -5.136543 2.798380e-07 -1.746745 -0.781888 9002-36 Serpin A11 -1.087385 0.219035 -4.964434 6.890169e-07 -1.516685 -0.658084 3066-12 Galectin-3 -1.005615 0.206735 -4.864276 1.148764e-06 -1.410808 -0.600423 5116-62 Roundabout homolog 2 -1.291594 0.270447 -4.775767 1.790237e-06 -1.821661 -0.761527 7139-14 SLIT and NTRK-like protein 4 1.018520 0.218625 4.658761 3.181183e-06 0.590023 1.447016 8484-24 Leptin -0.991585 0.219260 -4.522415 6.113800e-06 -1.421326 -0.561843 5934-1 Ferritin 1.012300 0.227525 4.449188 8.619536e-06 0.566360 1.458240 15324-58 Ferritin light chain 1.002813 0.226429 4.428822 9.474919e-06 0.559021 1.446606 4234-8 Interleukin-1 receptor-like 1 1.134058 0.258632 4.384831 1.160761e-05 0.627149 1.640968 2696-87 Persephin 1.412833 0.323348 4.369389 1.245947e-05 0.779083 2.046584
logr_top_analytes = [(index, row['TargetFullName']) for index, row in logr_df_sorted.head(5).iterrows()] # Select the top 5 aptamers based on p-value x = sm.add_constant(logr_x_train[logr_top_analytes]) logr_mod = sm.GLM(logr_y_train, x, family=sm.families.Binomial()) logr_res = logr_mod.fit() logr_res.summary()Generalized Linear Model Regression Results Dep. Variable: y No. Observations: 145 Model: GLM Df Residuals: 139 Model Family: Binomial Df Model: 5 Link Function: Logit Scale: 1.0000 Method: IRLS Log-Likelihood: -8.4167 Date: Fri, 01 Mar 2024 Deviance: 16.833 Time: 13:19:34 Pearson chi2: 17.8 No. Iterations: 10 Pseudo R-squ. (CS): 0.7186 Covariance Type: nonrobust coef std err z P>|z| [0.025 0.975] const 1.6106 1.178 1.367 0.172 -0.699 3.920 ('6580-29', 'Pregnancy zone protein') -7.0008 3.112 -2.250 0.024 -13.099 -0.902 ('5763-67', 'Beta-defensin 104') 2.7821 1.255 2.217 0.027 0.322 5.242 ('3032-11', 'Follicle stimulating hormone') -1.1722 0.818 -1.432 0.152 -2.776 0.432 ('7926-13', 'Kunitz-type protease inhibitor 3') 2.2901 1.053 2.174 0.030 0.226 4.354 ('16892-23', 'Ectonucleotide pyrophosphatase/phosphodiesterase family member 2') -3.6045 1.498 -2.407 0.016 -6.540 -0.669
# Create confusion matrix x = sm.add_constant(logr_x_test[logr_top_analytes]) logr_predictions = [1 if val > 0.5 else 0 for val in logr_res.predict(x)] cm = metrics.confusion_matrix(logr_y_test.values, logr_predictions)
# Print out performance metrics via Pandas tp = cm[1, 1] tn = cm[0, 0] fp = cm[0, 1] fn = cm[1, 0] logr_perf_df = pd.DataFrame.from_records({ 'Sensitivity': tp / (tp + fn), 'Specificity': tn / (tn + fp), 'Accuracy': (tp + tn) / sum(sum(cm)), 'PPV': tp / (tp + fp), 'NPV': tn / (tn + fn) }, index=['Value']) HTML(logr_perf_df.to_html())Accuracy NPV PPV Sensitivity Specificity Value 1.0 1.0 1.0 1.0 1.0
# Display the confusion matrix plt.figure(figsize=(3,3)) sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square=True, cmap='Blues') plt.ylabel('Actual label') plt.xlabel('Predicted label') all_sample_title = 'Accuracy Score: {0}'.format(100 * logr_perf_df['Accuracy'].values[0]) plt.title(all_sample_title) plt.show()Linear Regression (Predict Age)
We use the same clean_data
as the logistic regression analysis above.
# Wrangle `clean_data` into a simpler form linr_x_df = ( clean_data .pick_meta(axis=1, names=['SeqId', 'TargetFullName']) .reset_index(drop=True) ) linr_y = ( [float(age) for age in clean_data.index.get_level_values('Age')] ) # Split the dataset into train and test, holding back 25 samples for testing linr_x_train, linr_x_test, linr_y_train, linr_y_test = train_test_split(linr_x_df, linr_y, test_size=25, random_state=5)Perform univariate linear regression for each aptamer
linr_apt_perf = [] for seq_info in linr_x_df: x = sm.add_constant(linr_x_train[seq_info]) mod = sm.OLS(linr_y_train, x) res = mod.fit() linr_apt_perf.append(res.summary2().tables[1].loc[[seq_info]])Wrangle the GLM results of each aptamer into a dataframe and sort them by p-value
linr_res_df = pd.concat(linr_apt_perf).reset_index() linr_res_df['SeqId'] = [x[0] for x in linr_res_df['index']] linr_res_df['TargetFullName'] = [x[1] for x in linr_res_df['index']] linr_res_df = linr_res_df.drop('index', axis=1) linr_res_df = linr_res_df[['SeqId', 'TargetFullName', 'Coef.', 'Std.Err.', 't', 'P>|t|', '[0.025', '0.975]']].set_index('SeqId') linr_sorted_res_df = linr_res_df.sort_values('P>|t|') HTML(linr_sorted_res_df.head(20).to_html())TargetFullName Coef. Std.Err. t P>|t| [0.025 0.975] SeqId 3045-72 Pleiotrophin 6.713339 0.865578 7.755906 1.506400e-12 5.002359 8.424320 4374-45 Growth/differentiation factor 15 6.766537 0.902926 7.494011 6.377086e-12 4.981730 8.551343 3024-18 Alpha-2-antiplasmin -6.258739 0.895850 -6.986373 9.854830e-11 -8.029558 -4.487920 6392-7 WNT1-inducible-signaling pathway protein 2 6.206203 0.895426 6.931007 1.321588e-10 4.436222 7.976185 8480-29 EGF-containing fibulin-like extracellular matrix protein 1 6.179473 0.900370 6.863260 1.889770e-10 4.399719 7.959227 15640-54 Transgelin 6.159769 0.905043 6.806048 2.552783e-10 4.370777 7.948761 15533-97 Macrophage scavenger receptor types I and II 5.986741 0.907615 6.596127 7.616175e-10 4.192666 7.780815 15386-7 Fatty acid-binding protein, adipocyte 6.130562 0.931954 6.578182 8.355679e-10 4.288376 7.972748 16818-200 CUB domain-containing protein 1 5.919909 0.902842 6.556970 9.321408e-10 4.135268 7.704550 4496-60 Macrophage metalloelastase 6.149946 0.940133 6.541570 1.009072e-09 4.291592 8.008299 3362-61 Chordin-like protein 1 5.765444 0.913703 6.309975 3.287540e-09 3.959334 7.571554 4541-49 Cell adhesion molecule-related/down-regulated by oncogenes -5.703166 0.906248 -6.293164 3.578780e-09 -7.494540 -3.911793 3600-2 Chitotriosidase-1 5.831590 0.951184 6.130871 8.071212e-09 3.951391 7.711789 2609-59 Cystatin-C 5.577894 0.934072 5.971588 1.773159e-08 3.731521 7.424267 3234-23 Coiled-coil domain-containing protein 80 5.647487 0.948795 5.952270 1.949244e-08 3.772010 7.522963 14133-93 Interleukin-1 receptor type 2 -5.489368 0.926319 -5.926004 2.216438e-08 -7.320415 -3.658321 19601-15 Ankyrin repeat and SOCS box protein 9 5.412074 0.930313 5.817474 3.755863e-08 3.573131 7.251018 9793-145 Immunoglobulin superfamily DCC subclass member 4 -5.292703 0.911239 -5.808247 3.927116e-08 -7.093942 -3.491463 2677-1 Epidermal growth factor receptor -5.341396 0.919656 -5.808039 3.931061e-08 -7.159272 -3.523520 4968-50 Macrophage-capping protein 5.345710 0.926458 5.770050 4.721157e-08 3.514387 7.177033 Feed top 8 SOMAmers into statsmodels OLS regression
linr_top_analytes = [(index, row['TargetFullName']) for index, row in linr_sorted_res_df.head(8).iterrows()] x = sm.add_constant(linr_x_train[linr_top_analytes]) mod = sm.OLS(linr_y_train, x).fit() mod.summary()OLS Regression Results Dep. Variable: y R-squared: 0.501 Model: OLS Adj. R-squared: 0.471 Method: Least Squares F-statistic: 17.05 Date: Fri, 01 Mar 2024 Prob (F-statistic): 2.29e-17 Time: 13:20:02 Log-Likelihood: -522.29 No. Observations: 145 AIC: 1063. Df Residuals: 136 BIC: 1089. Df Model: 8 Covariance Type: nonrobust coef std err t P>|t| [0.025 0.975] const 55.5436 0.765 72.602 0.000 54.031 57.057 ('3045-72', 'Pleiotrophin') 1.6913 1.197 1.413 0.160 -0.676 4.059 ('4374-45', 'Growth/differentiation factor 15') 1.2404 1.258 0.986 0.326 -1.247 3.728 ('3024-18', 'Alpha-2-antiplasmin') -2.5113 0.910 -2.758 0.007 -4.312 -0.711 ('6392-7', 'WNT1-inducible-signaling pathway protein 2') 1.5143 0.997 1.519 0.131 -0.457 3.486 ('8480-29', 'EGF-containing fibulin-like extracellular matrix protein 1') 2.1363 0.972 2.197 0.030 0.214 4.059 ('15640-54', 'Transgelin') 1.2006 1.010 1.189 0.237 -0.796 3.198 ('15533-97', 'Macrophage scavenger receptor types I and II') 0.8792 1.223 0.719 0.474 -1.540 3.298 ('15386-7', 'Fatty acid-binding protein, adipocyte') 1.1453 1.180 0.971 0.333 -1.188 3.479 Omnibus: 2.712 Durbin-Watson: 2.042 Prob(Omnibus): 0.258 Jarque-Bera (JB): 2.501 Skew: -0.322 Prob(JB): 0.286 Kurtosis: 3.008 Cond. No. 4.53
Notes:
x = sm.add_constant(linr_x_test[linr_top_analytes]) linr_predictions = mod.predict(x) linr_pred_df = pd.DataFrame({ 'Actual Age': linr_y_test, 'Predicted Age': linr_predictions }) linr_pred_df['Pred Error'] = linr_pred_df['Predicted Age'] - linr_pred_df['Actual Age']Compute model performance
# Lin's Concordance Correl. Coef. # Accounts for location + scale shifts def linCCC(x, y): if len(x) != len(y): raise Exception('Arrays are not the same length!') a = 2 * pearsonr(x, y)[0] * np.std(x, ddof=1) * np.std(y, ddof=1) b = np.var(x, ddof=1) + np.var(y, ddof=1) + (np.mean(x) - np.mean(y))**2 return a / b n = linr_x_test.shape[0] p = len(linr_top_analytes) # Regression metrics linr_metrics_df = pd.DataFrame({ 'rss': linr_pred_df['Pred Error'].apply(lambda x: x**2).sum(), # residual sum of squares 'tss': sum((linr_pred_df['Actual Age'] - linr_pred_df['Actual Age'].mean()) ** 2), # total sum of squares 'R2': pearsonr(linr_pred_df['Actual Age'], linr_pred_df['Predicted Age'])[0] ** 2, # R-squared Pearson approx. 'MAE': np.mean(np.abs(linr_pred_df['Pred Error'])), # Mean absolute error 'RMSE': np.sqrt(np.mean(linr_pred_df['Pred Error'] ** 2)), # Root mean squared error 'CCC': linCCC(linr_predictions, linr_y_test) # Lins concordance correlation coefficient }, index=['Value']) linr_metrics_df['rsq'] = 1 - (linr_metrics_df['rss'] / linr_metrics_df['tss']) # R-squared linr_metrics_df['rsqadj'] = max(0, 1 - (1 - linr_metrics_df['rsq'][0]) * (n - 1) / (n - p - 1)), # Adjusted R-squared HTML(linr_metrics_df.to_html())rss tss R2 MAE RMSE CCC rsq rsqadj Value 989.768231 2771.84 0.674484 5.214434 6.292116 0.752326 0.64292 0.46438 Visualize performance via concordance plot of predicted and actual values
f, ax = plt.subplots(1, figsize=(5, 5), dpi=150) plot_range = [linr_pred_df[['Actual Age', 'Predicted Age']].min().min() * 0.95, linr_pred_df[['Actual Age', 'Predicted Age']].max().max() * 1.05] ax.plot(plot_range, plot_range, c='g') ax.scatter(linr_pred_df['Actual Age'], linr_pred_df['Predicted Age'], alpha=0.5) ax.set( xlim=plot_range, xlabel='Actual Age', ylim=plot_range, ylabel='Predicted Age', title='Concordance in Predicted vs. Actual Age' ) plt.show()
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4