Version 3.3.1
Documentation updated February 14, 2000
A Macintosh-based expert system for LC/MS and
MS/MS analysis of protein digests
Sherpa Homepage: http://www.hairyfatguy.com/Sherpa/
Copyright © 1994-2000 J. Alex Taylor and the University of Washington
All Rights Reserved Worldwide.
1.1 Statement of Program Philosophy
Sherpa is designed to be a robust, easy to use aide-de-camp in LC/MS and MS/MS interpretation. By automating simple but tediously repetitive calculations it allows the user to quickly determine the obvious and spend time exploring interpretations for the not so obvious. It is not, however, a black box that takes in raw data at one end and returns a full interpretation at the other end. At least a basic understanding of the principles involved in mass spectral interpretation as well as data acquisition is necessary to obtain useful information from its searches.
In an ideal world users would not need to use manuals in order to understand how to run programs. To that end I have tried to keep simplicity and user-friendliness foremost in the design of Sherpa. Interpretation with Sherpa is set up to be a dynamic process in which the user can play with a search's settings, run the search, change the settings, run the search again, etc., as opposed to a linear, one-shot interpretation. This means that the user should feel free to experiment with optimizing the various settings without fear of not being able to easily get back to where they started.
This user guide primarily contains information about what the various user interface items of the program do. See Taylor et al. (1996) for a more applied discussion of the program.
1.2 System Requirements
To use Sherpa you must have:
MS/MS files in any of the above formats can be read natively while MS/MS files in any format not currently supported natively can be imported by saving the data in a text only format where each line of the file is an m/z value followed by a tab and then its corresponding intensity.
1.3 Quick Start
Sequences can be opened in one of three ways:
The only limit to the number of sequences that can be open at one time is the available memory.
Once a sequence is loaded, it will appear in a sequence window with the sequence name as its title. Each sequence loaded will be displayed in its own sequence window. The amino acid sequence is shown in a text editor and can be edited directly by the user.
The N-terminal and C-terminal groups of the protein can be set using the N-terminal Group... and C-terminal Group... buttons in the Sequence Window or by using the same commands in the Options menu. The amino acid set can be defined by using the Define Residues... button or its corresponding command in the Options menu. To quickly replace one residue with another in the sequence use the Search For... command in the Edit menu. Sherpa continuously updates the residue position and mass of the region selected in the sequence.
The Digest button performs a theoretical digest on the sequence and displays the fragments in a digest window. See the Digest Window topic below for more information.
To the right of the sequence display is a box with the sequence's (or selection's) amino acid composition. There are also additional buttons along the right side of the window. The CID button displays the predicted MS/MS fragments for the highlighted section of sequence in a new window. The Copy to Report button prints the sequence and information about the sequence in the Search Results window. The Set Seq Name button allows the user to change the name of the sequence. The Seq Notes button will display any notes which have been associated with the sequence.
When a sequence in a sequence window is digested, the theoretical fragments are displayed in a digest window. The type of enzymatic cleavage can be set by selecting the Digest Method... button. The cleavage agent(s) currently in use are displayed to the right of this button. The display format of the fragments can be either "MW", which gives the monoisotopic and average masses for each fragment, or "Charge-state" which displays the m/z for 1 to 6 charges for each peptide. The number of partial sites allowed, which is the maximum number of consensus cleavage sites tolerated inside a peptide, is set using a popup menu. A popup menu is also used to select whether the fragments are sorted by sequence position, mass, or HPLC index.
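The effect of the partial sites setting can be sketched in a few lines of Python. This is only an illustration, not Sherpa's own code: the function name digest, its parameters, and the simple rule of cleaving after K or R (a trypsin-like rule that ignores any proline exception) are assumptions made for the example.

```python
def digest(sequence, cleave_after="KR", max_partials=0):
    """Return (first_residue, last_residue, fragment) tuples, 1-based.

    cleave_after is a hypothetical cleavage rule (cut on the C-terminal
    side of these residues); max_partials is the maximum number of
    uncut consensus sites tolerated inside a fragment.
    """
    if not sequence:
        return []
    # Cut positions, expressed as the index of the residue following each cut.
    cuts = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in cleave_after:
            cuts.append(i + 1)
    cuts.append(len(sequence))

    fragments = []
    for start_idx in range(len(cuts) - 1):
        for skipped in range(max_partials + 1):
            end_idx = start_idx + 1 + skipped
            if end_idx >= len(cuts):
                break
            start, end = cuts[start_idx], cuts[end_idx]
            fragments.append((start + 1, end, sequence[start:end]))
    return fragments
```

With max_partials=0 only the complete-digest fragments are listed; raising it to 1 adds every fragment that spans one uncut site, and so on.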
An LC/MS file in either Sciex APIIII, Sciex API 100/300, or Finnigan SSQ/TSQ format can be opened by selecting Load LC/MS File 1... from the File menu. If an LC/MS file is already loaded, an additional LC/MS file can be loaded by selecting Load LC/MS File 2... from the File menu or the new LC/MS file can replace the first LC/MS file by selecting Load LC/MS File 1... again. Reloading an LC/MS file into either of the file 1 or file 2 slots will simply replace the LC/MS file currently loaded in that slot.
2.1 Opening a Protein Sequence
Sequences can be opened in one of three ways:
Selecting Save As... from the File menu will display a standard save dialog with the sequence name as the default file name. Selecting Save (command-S) from the File menu will save the sequence and its information back to the sequence file if one already exists and can be located. If Save is selected and a sequence file does not already exist or cannot be located then Sherpa will display a standard save dialog as though Save As... had been selected.
When Sherpa saves a sequence file it places the sequence in the data fork of the file and information about its termini and any custom amino acids used in the resource fork of the file. Hence, to any other program the sequence file will appear to be a normal text file containing just the sequence. But when a Sherpa-created sequence file is opened again in Sherpa, it will retrieve and reset the custom information saved with the sequence.
When saving a sequence that has been crosslinked to another sequence Sherpa gives the user the option to save the sequence as a regular sequence or as part of a linked sequence file which contains all the involved sequences and crosslinks. The "Insulin [human]" file in the sample sequences folder is an example of a linked sequence file.
Once a sequence is loaded, it appears in a sequence window with the sequence file's name as its title. Each sequence loaded is displayed in its own sequence window.
The N-terminal and C-terminal groups of the protein can be set using the N-terminal Group... and C-terminal Group... buttons in the sequence window or by using the same commands in the Options menu. The amino acid set can be defined by using the Define Residues... button or its corresponding command in the Options menu. (See the Defining Custom Amino Acids and Termini topic for more information.) Any non-standard amino acid assignments are displayed to the right of the Define Residues... button. To quickly replace one residue with another in the sequence use the Search For... command in the Edit menu.
An estimated pI value and molar extinction coefficient at 280 nm can be calculated and displayed in the sequence window by selecting More Options... from the Options menu. The isoelectric point is estimated by using pKa's from Bull (1971) in the Henderson-Hasselbalch equation to find the pH where the net charge on the protein is less than 0.01. The extinction coefficient is estimated using the coefficients derived by Mach et al. (1992).
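The pI estimate described above can be illustrated with a short bisection search. This is a sketch under stated assumptions, not Sherpa's implementation: the pKa values below are generic textbook-style numbers chosen for the example and are NOT the Bull (1971) values, and the names net_charge and estimate_pi are hypothetical.

```python
# Illustrative pKa values only; they are NOT the Bull (1971) values.
PKA_POSITIVE = {"N_term": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"C_term": 3.1, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}

def net_charge(sequence, ph):
    """Net charge of an unmodified chain at a given pH (Henderson-Hasselbalch)."""
    pos_counts = {"N_term": 1, **{aa: sequence.count(aa) for aa in "KRH"}}
    neg_counts = {"C_term": 1, **{aa: sequence.count(aa) for aa in "DECY"}}
    charge = 0.0
    for group, n in pos_counts.items():
        charge += n / (1.0 + 10 ** (ph - PKA_POSITIVE[group]))
    for group, n in neg_counts.items():
        charge -= n / (1.0 + 10 ** (PKA_NEGATIVE[group] - ph))
    return charge

def estimate_pi(sequence, tolerance=0.01):
    """Bisect between pH 0 and 14 until |net charge| drops below tolerance."""
    low, high = 0.0, 14.0
    while True:
        mid = (low + high) / 2.0
        charge = net_charge(sequence, mid)
        if abs(charge) < tolerance or high - low < 1e-6:
            return mid
        if charge > 0:
            low = mid   # still positive: the pI lies at higher pH
        else:
            high = mid  # negative: the pI lies at lower pH
```

Bisection works here because the net charge decreases monotonically as the pH rises, so there is a single crossing point to home in on.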
The Digest button performs a theoretical digest on the sequence and displays the fragments in a digest window. See the Digest Window topic below for more information.
To the right of the sequence display is a box with the sequence's (or selection's) amino acid composition. The display of the amino acid composition can be turned on and off by selecting More Options... from the Options menu. There are also additional buttons along the right side of the window. The CID button displays the predicted MS/MS fragments for the highlighted section of sequence in a new window. The Copy to Report button prints the sequence and information about the sequence in the Search Results window.
The Set Seq Name button allows the user to change the name of the sequence. The Seq Notes button will display any notes which have been associated with the sequence.
The Text to Speech button allows the user to play back the displayed sequence (or selection). The pause length after every third residue can be adjusted via More Options... from the Options menu. This can be useful in proofreading sequences entered by hand. Note that this button is only visible on systems which have the speech manager installed.
The elemental composition of the sequence (or selection) can be displayed below the sequence by selecting More Options... from the Options menu. If any crosslinks have been created, a toggle switch also appears below the sequence which allows the user to include crosslinks in the calculation of mass and elemental composition. See the Defining Crosslinks topic below for more information.
To dispose of a sequence window (and any associated digest or notes windows) use the Close Sequence submenu in the File menu. Clicking on the Go Away Box makes the sequence window disappear but does NOT dispose of it. (This is convenient for keeping down window clutter.) See the Window Behavior topic for more information.
To define a custom amino acid for a sequence, bring its sequence or digest window to the front and select Define Residues... from the Options menu. Alternatively, use the Define Residues... button in the sequence window. A scrolling list will appear with the 20 common amino acids and 6 undefined amino acids (B, J, O, U, X, and Z).
The libraries of custom amino acids and termini are stored in the Sherpa Preferences file located in the Preferences folder in the System folder. Each time that Sherpa is started, it opens the preferences file and loads in this custom information along with the user settings. If the preferences file cannot be found, the program creates a new file with default settings and modification libraries.
Termini and custom residue information for a sequence are stored with the sequence files. If the program opens a Sherpa-created sequence file and finds a custom residue or terminus that is not already in its custom library, it will automatically add the new modifications to the library in the preferences file.
When a sequence in a sequence window is digested, the theoretical fragments are displayed in a digest window. Any changes made in a sequence window are automatically reflected in the corresponding digest window.
At the far right is a Display Options button which displays the digest display options dialog. In the digest display options dialog you can toggle between the first column displaying the fragment number or the HPLC index for the fragment. There is an option that allows for the display of the full sequence of the predicted fragments instead of abbreviating fragments that are too long. The user can also choose the number of charge-states and decimal places to display and whether they are calculated as positive or negative ions.
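As a rough illustration of the charge-state display, the m/z values for a fragment can be derived from its neutral mass as follows. This is a minimal sketch, not Sherpa's code; the function name, its parameters, and the use of a proton (1.00728 Da) gained or lost per charge are assumptions made for the example.

```python
PROTON_MASS = 1.00728  # Da; assumed charge carrier

def charge_state_mz(neutral_mass, max_charge=6, decimals=2, positive=True):
    """Return (charge, m/z) pairs for charges 1..max_charge.

    Positive ions gain one proton per charge; negative ions lose one.
    """
    sign = 1 if positive else -1
    table = []
    for z in range(1, max_charge + 1):
        mz = (neutral_mass + sign * z * PROTON_MASS) / z
        table.append((z, round(mz, decimals)))
    return table
```

For example, charge_state_mz(3000.0, max_charge=3) lists the 1+, 2+ and 3+ ions of a 3000 Da peptide, and positive=False gives the corresponding deprotonated ions.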
The HPLC index uses the amino acid retention coefficients derived by Browne et al. (1982), to predict the percent acetonitrile at which the peptide should elute from a C18 reversed-phase column with 0.1% TFA as the ion-pairing agent. Some peptides, especially longer ones, may have values outside of the range of 0 - 100%; but this may still be informative as a relative measure of a peptide's hydrophobic character.
Crosslinks are created within the crosslink dialog which is accessed by selecting Crosslinks... from the Options menu. Crosslink types can be created or edited to suit the user's needs. Note that editing of sequences into which crosslinks have been introduced can cause crosslinks to become invalid, which will delete them. Crosslinks are not considered by the secondary searches or by the open-ended search.
3.1 Opening an LC/MS File
An LC/MS file in either Sciex APIIII, Sciex API 100/300, or Finnigan SSQ/TSQ format can be opened by selecting Load LC/MS File 1... from the File menu. Finnigan (".dat") files must be FTP'd to the Macintosh in binary mode and given a file type of '????' or 'BINA'. (The default binary file type can be set in the preferences of either Telnet or Fetch.) If an LC/MS file is already loaded, an additional LC/MS file can be loaded by selecting Load LC/MS File 2... from the File menu or the new LC/MS file can replace the first LC/MS file by selecting Load LC/MS File 1... again. Reloading an LC/MS file into either of the file 1 or file 2 slots will simply replace the LC/MS file currently loaded in that slot.
Once an LC/MS file has been selected, a dialog will appear in which the import parameters are set.
The bottom portion of the dialog contains the import settings which are initially set to the default values. (For information on changing the default import settings see the Customizing Settings topic.) The first settings box determines which portion of the LC/MS file will be imported for analysis. It will always default to importing the entire scan range and mass range of the file.
The second settings box contains the settings for processing each scan of the LC/MS file to find its peaks - the intensity threshold, peaktop minimum and minimum m/z peak width. These are the most critical import parameters and also the parameters that vary the most from LC/MS file to LC/MS file. The relationship of these parameters is illustrated in the figure below. If the data is centroided the minimum m/z peak width parameter is ignored. Each scan can also, optionally, be smoothed before processing.
Once the OK button in the LC/MS Import Dialog is pressed, a small progress dialog will appear while the LC/MS data is being imported and processed. The peaks found in each scan are compared to a list of ongoing chromatographic peaks. If an m/z matches that of an ongoing peak it is added to it. If an m/z does not match that of an ongoing peak it becomes a new peak. If a chromatographic peak does not extend for the minimum number of scans, as set in the import dialog, it is removed from the list. The importing process can be aborted by pressing command-period; any data processed up to that point will be saved.
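The bookkeeping described above can be sketched as follows. This is an illustrative reconstruction, not Sherpa's code: the function name, the data layout, the m/z matching tolerance mz_tol, and the choice to count only scans that contain signal toward the minimum are all assumptions. The min_scans and hole_tol parameters stand in for the chromatographic settings of the import dialog.

```python
def track_chromatographic_peaks(scans, mz_tol=0.5, min_scans=3, hole_tol=1):
    """scans: one list of (mz, intensity) tuples per scan, in scan order.

    Returns the surviving chromatographic peaks, sorted by m/z.
    """
    ongoing, finished = [], []
    for scan_no, peaks in enumerate(scans, start=1):
        for mz, intensity in peaks:
            # Look for an ongoing peak at this m/z not yet extended in this scan.
            match = next((p for p in ongoing
                          if abs(p["mz"] - mz) <= mz_tol
                          and p["last_scan"] < scan_no), None)
            if match is None:
                ongoing.append({"mz": mz, "first_scan": scan_no,
                                "last_scan": scan_no, "apex_scan": scan_no,
                                "apex_intensity": intensity, "n_scans": 1})
            else:
                match["last_scan"] = scan_no
                match["n_scans"] += 1
                if intensity > match["apex_intensity"]:
                    match["apex_scan"] = scan_no
                    match["apex_intensity"] = intensity
                    match["mz"] = mz  # report the m/z seen at the apex
        # Retire peaks whose run of empty scans exceeds the hole tolerance.
        still_ongoing = []
        for p in ongoing:
            if scan_no - p["last_scan"] > hole_tol:
                finished.append(p)
            else:
                still_ongoing.append(p)
        ongoing = still_ongoing
    finished.extend(ongoing)
    # Discard peaks that do not extend for the minimum number of scans.
    kept = [p for p in finished if p["n_scans"] >= min_scans]
    return sorted(kept, key=lambda p: p["mz"])
```

Raising min_scans or lowering hole_tol shortens the peak list, which is one reason the import settings have such a large effect on later searches.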
When Sherpa is finished processing the LC/MS data it will display the list of chromatographic peaks, sorted by mass, in an LC/MS Peaks window and give the total number of peaks at the bottom. This window is for display purposes only; the data cannot be edited and fed back into the program. Reloading an LC/MS file into either of the file 1 or file 2 slots will simply replace the LC/MS file currently loaded in that slot and replace the data in its LC/MS Peaks window.
It cannot be stressed enough that using proper thresholds and import ranges when importing LC/MS data is critical to all subsequent analyses. If the progress of the processing appears to be going very slowly or you get an exorbitant number of peaks (greater than a thousand or so), try reimporting the data with higher thresholds or with cropping of the import range to exclude as much salt, junk, noise, etc. as possible. It is often useful to examine the data imported under several different conditions.
Once an LC/MS file has been loaded, the Primary Search option in the Searches menu becomes active. The Primary Search consists of four search options: a MW grouping search which locates potential charge state families, a peptide matching search which compares one or more theoretically digested sequences with the LC/MS data, an unidentified ions option to report prominent peaks that were not MW grouped or matched to a peptide, and a data file comparison option which correlates search results by whether they appear in one or both of the loaded LC/MS files (if two LC/MS files are loaded). The results of a Primary Search are displayed in the Search Results Window.
The Primary Search Settings Dialog can be accessed by the Primary Search Settings... option in the Searches menu. The search settings can also be accessed individually in the Options menu.
The MW grouper combs the peak list data from the LC/MS file to find potential ion charge-state families derived from the same molecular species (assuming a proton as the charge agent). To be considered a group, the ions must 1) be consecutive charge-states of the same MW and 2) elute at the same time (have the same apex scan). In the MW Grouper Settings dialog the following parameters can be set: the minimum and maximum charge states to consider, the error tolerance method used, the apex scan tolerance, and the minimum number of items for a group. The MW Grouper Settings dialog can be accessed either through the Options menu or through the Primary Search Settings dialog. The results of the MW grouper are often used by other searches.
The default Sherpa method for the MW error tolerances is as follows: the error for each peak is equal to the charge-state times the step size. The minimum MW error tolerance for each peak is 0.5 Da.
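The default tolerance rule and the grouping criteria can be illustrated with a small sketch. It is not Sherpa's algorithm: the function names, the (m/z, apex scan) tuple layout, and the brute-force search strategy are assumptions; only the tolerance rule (charge-state times step size, minimum 0.5 Da), the proton as charge agent, and the requirement for consecutive charge states with matching apex scans come from the text.

```python
PROTON = 1.00728  # Da; the grouper assumes a proton as the charge agent

def mw_tolerance(charge, step_size, minimum=0.5):
    """Default rule: charge-state times step size, but at least 0.5 Da."""
    return max(charge * step_size, minimum)

def group_charge_states(peaks, min_charge=1, max_charge=6,
                        step_size=0.1, apex_tol=1, min_items=2):
    """peaks: list of (mz, apex_scan).

    Returns a list of (mw, members); members is a list of (charge, mz)
    covering consecutive charge states of the same MW.
    """
    groups = []
    for seed_mz, seed_scan in peaks:
        for z0 in range(min_charge, max_charge + 1):
            mw = (seed_mz - PROTON) * z0
            members = [(z0, seed_mz)]
            for z in range(z0 + 1, max_charge + 1):
                hit = next(
                    (mz for mz, scan in peaks
                     if abs(scan - seed_scan) <= apex_tol
                     and abs((mz - PROTON) * z - mw)
                     <= mw_tolerance(z, step_size)),
                    None)
                if hit is None:
                    break  # charge states must be consecutive
                members.append((z, hit))
            if len(members) >= min_items:
                groups.append((mw, members))
    # Drop groups that are strictly contained in a larger group.
    member_sets = [set(m) for _, m in groups]
    return [g for g, s in zip(groups, member_sets)
            if not any(s < other for other in member_sets)]
```

Note that such a search can return more than one candidate family for the same ions (for instance, the 2+ and 4+ ions of one species also look like the 1+ and 2+ ions of a species of half the MW), which is why the results are best read as potential groups.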
The peptide match search compares the ions from the LC/MS peak list and any MW groups to the predicted digest fragments for the sequence or sequences selected. MW groups that match a peptide are termed "probable peptide matches". Single ions that match a peptide are termed "possible peptide matches".
Parameters for peptide matching are set in the Peptide Match Settings dialog which can be accessed either through Seq Match Settings... in the Options menu or through the Primary Search Settings dialog. The peptides can be matched either using their average masses or by monoisotopic masses below, and average masses above, a specified switch mass. [This is the only location to set this parameter, which is used by any other search that matches to peptides.] The charge state search range for ungrouped ions and the error tolerance used in matching are other user-definable parameters. The error tolerance also has an option to boost the matching error by 0.25 Da for every 1000 Da of peptide MW.
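A minimal sketch of how these matching parameters could interact is given below. The function names and default values are hypothetical, and the boost is assumed to scale linearly with peptide MW (the text does not say whether it is applied continuously or in 1000 Da steps).

```python
def match_tolerance(peptide_mw, base_tol=0.5, boost=True):
    """Base error tolerance, optionally boosted by 0.25 Da per 1000 Da of MW."""
    extra = 0.25 * (peptide_mw / 1000.0) if boost else 0.0
    return base_tol + extra

def reference_mass(mono_mw, avg_mw, switch_mass=None):
    """Monoisotopic mass below the switch mass, average mass at or above it.

    With switch_mass=None the average mass is always used.
    """
    if switch_mass is not None and avg_mw < switch_mass:
        return mono_mw
    return avg_mw

def is_match(observed_mw, mono_mw, avg_mw,
             switch_mass=None, base_tol=0.5, boost=True):
    """True if the observed MW lies within tolerance of the reference mass."""
    target = reference_mass(mono_mw, avg_mw, switch_mass)
    return abs(observed_mw - target) <= match_tolerance(target, base_tol, boost)
```

For a 2000 Da peptide and a 0.5 Da base tolerance, the boosted window under this assumption is 1.0 Da.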
Unidentified ions are those ions which were not considered part of a group by the MW grouper or matched to peptides. Parameters for the reporting of unidentified ions are set in the Unidentified Ion Reporting dialog which can be accessed either through the Options menu or through the Primary Search Settings dialog. All unidentified ions above a specified intensity threshold can be reported; or, alternatively, the top 10, 25, 50, or 100 most intense unidentified ions can be reported. The list of unidentified ions that is displayed can be sorted by m/z, intensity, or scan.
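The two reporting modes and the sort options amount to a simple filter followed by a sort, as in this sketch. The function name and the dictionary keys are hypothetical and chosen only for illustration.

```python
def report_unidentified(ions, threshold=None, top_n=None, sort_by="mz"):
    """ions: list of dicts with keys 'mz', 'intensity' and 'scan'.

    Either keep every ion above an intensity threshold, or keep the
    top_n most intense ions; then sort the report by the chosen key.
    """
    if threshold is not None:
        selected = [ion for ion in ions if ion["intensity"] > threshold]
    elif top_n is not None:
        by_intensity = sorted(ions, key=lambda ion: ion["intensity"],
                              reverse=True)
        selected = by_intensity[:top_n]
    else:
        selected = list(ions)
    return sorted(selected, key=lambda ion: ion[sort_by])
```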
The secondary searches are searches to help identify modified peptides in the digest. At present the secondary searches available are glycosylation and phosphorylation. Once an LC/MS file has been loaded, the Secondary Search option in the Searches menu becomes active. The results of a Secondary Search are displayed in the Search Results Window.
The glycosylation search works by looking for staircases due to carbohydrate microheterogeneity on a given peptide. It is based on the observation that nearly every glycosylation site is heterogeneous with the differences being small increments of several carbohydrates. The isoform with more glycosylation is always more hydrophilic so the staircases appear to step down and to the left in conventional, increasing organic gradient, reversed-phase setups. And, at least in positive-ion data, the glycopeptide ions are usually quite high in m/z due to addition of substantial carbohydrate mass without a corresponding increase in possible charge sites. Glycosylation indicator ions of HexNAc+, HexHexNAc+, and NeuNAc+ are also indicative of glycopeptides when scanned in a stepped orifice voltage mode to produce collision-induced fragmentation while scanning at low m/z (Bean et al. 1995, Carr et al. 1993).
The glycosylation search algorithm scans through the MW groups and ungrouped LC/MS peak data looking for mass differences that correspond to the carbohydrates selected for the search. The apex scan difference of a match must be within the user-specified range; and if indicator ion verification is turned on, an indicator peak must be within a specified number of scans of each match component. The indicator ions used are m/z 204 (HexNAc+), m/z 366 (HexHexNAc+), and, if NeuNAc is included in the staircase search, m/z 292 (NeuNAc+) and m/z 274 (NeuNAc+ - H2O).
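The core of the staircase search, finding pairs of species separated by one sugar residue and by a plausible elution shift, can be sketched as below. This is an illustration rather than Sherpa's algorithm: the function name, the tuple layout, the tolerance, and the default scan range are assumptions, and indicator ion verification is left out. The table holds standard average residue masses for the sugars.

```python
# Average residue masses in Da (sugar minus water).
SUGAR_MASSES = {
    "Hex": 162.1406,
    "HexNAc": 203.1925,
    "Fuc": 146.1412,
    "NeuNAc": 291.2546,
}

def find_staircase_steps(species, sugars=("Hex", "HexNAc"),
                         mass_tol=0.5, scan_range=(0, 20)):
    """species: list of (mw, apex_scan) from MW groups or single ions.

    Returns (lighter, heavier, sugar) for every pair whose MW difference
    matches a selected sugar and whose apex scan difference is in range.
    The heavier glycoform is expected to elute at the same or an earlier
    scan, so the difference is taken as lighter scan minus heavier scan.
    """
    steps = []
    for light in species:
        for heavy in species:
            delta = heavy[0] - light[0]
            scan_diff = light[1] - heavy[1]
            if not scan_range[0] <= scan_diff <= scan_range[1]:
                continue
            for sugar in sugars:
                if abs(delta - SUGAR_MASSES[sugar]) <= mass_tol:
                    steps.append((light, heavy, sugar))
    return steps
```

Chaining the returned steps end to end reconstructs a staircase; the lowest step is then the natural starting point for matching back to a peptide.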
Once staircases are found in the data, if a sequence or sequences have been loaded and selected in the Secondary Search Settings dialog Sherpa attempts to jump from the lowest step of each staircase to the predicted peptides by carbohydrate increments. There is an N-link search which only tries to match to peptides containing an N-link consensus site, Asn-X-Ser/Thr-X where X can be any amino acid but proline. There is also an N&O-link search which tries to match the steps to peptides which contain Ser or Thr (as there is no real consensus O-link attachment site).
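Screening peptides for the consensus site is a simple pattern match, sketched here with Python's re module. The function names are hypothetical. The pattern follows the motif exactly as written above (Asn-X-Ser/Thr-X, X not proline), so a site whose Ser/Thr is the last residue of the peptide is not reported; that behavior is a consequence of this sketch and may differ from Sherpa's.

```python
import re

# Lookahead so that overlapping sites are all reported.
N_LINK_SITE = re.compile(r"(?=N[^P][ST][^P])")

def n_link_sites(peptide):
    """Return the 1-based positions of Asn residues in N-link consensus sites."""
    return [m.start() + 1 for m in N_LINK_SITE.finditer(peptide)]

def has_o_link_candidate(peptide):
    """True if the peptide contains Ser or Thr (no real O-link consensus)."""
    return "S" in peptide or "T" in peptide
```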
The settings for the glycosylation search are accessed via the Secondary Search Settings submenu in the Options menu or from the Secondary Search Settings dialog from the Searches menu. The origin of the protein can be specified as either yeast or mammalian. Selecting yeast as the origin will exclude NeuNAc and Fucose from being searched for and will also slightly alter the carbohydrate ranges used when attempting to match the staircases to the peptides. The apex scan difference range for each carbohydrate can be set by a pair of sliders. The upper (yellow) slider sets the minimum apex scan difference for ions differing by that sugar and the lower (red) slider sets the maximum apex scan difference.
The purpose of the open-ended search is to search a sequence or sequences for a given MW or m/z, regardless of consensus proteolytic sites. Hence it is useful for detecting spurious proteolytic cleavages and identifying MW groups that were not matched to the theoretical digest fragments by peptide matching. The open-ended search window can be accessed from the Searches menu. FASTA formatted protein databases (such as the OWL non-redundant protein database - downloadable from the NCBI at ftp://ncbi.nlm.nih.gov/repository/OWL/) can be used to perform a mass fingerprint search on a list of molecular weights input into the small text box on the right of the window. The database search algorithm is still being optimized and hence is rather slow unless the two termini option is utilized.
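The idea of searching without regard to cleavage sites can be illustrated by testing every contiguous stretch of a sequence against the target MW. This is a sketch, not Sherpa's algorithm: the function name and tolerance are assumptions, only unmodified residues with free termini are handled, and standard monoisotopic residue masses are used.

```python
# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H at the N-terminus plus OH at the C-terminus

def open_ended_search(sequence, target_mw, tol=0.5):
    """Return (first, last, subsequence) for every stretch matching target_mw."""
    hits = []
    for start in range(len(sequence)):
        mass = WATER
        for end in range(start, len(sequence)):
            mass += MONO[sequence[end]]
            if mass > target_mw + tol:
                break  # adding residues can only make the stretch heavier
            if abs(mass - target_mw) <= tol:
                hits.append((start + 1, end + 1, sequence[start:end + 1]))
    return hits
```

Fixing one or both termini of the stretch, as the two termini option presumably does, removes most of the inner loop, which suggests why it speeds up a database search so markedly.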
4.1 Opening an MS/MS File
MS/MS files in Sciex APIIII, Sciex API 100/300, or Finnigan SSQ/TSQ formats can be read natively. Finnigan (".dat") files must be FTP'd to the Macintosh in binary mode and given a file type of '????' or 'BINA'. (The default binary file type can be set in the preferences of either Telnet or Fetch.) MS/MS files in any format not currently supported natively can be imported by saving the raw data in a text only format where each line of the file is an m/z value followed by a tab and then its corresponding intensity. An MS/MS file is imported by selecting Load MS/MS File... from the File menu.
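The text-only import format is simple enough to produce or check with a few lines of Python. The parser below is a hypothetical helper for validating such a file before import, not part of Sherpa.

```python
def parse_peak_text(text):
    """Parse lines of 'm/z<TAB>intensity' into a list of (mz, intensity).

    str.splitlines() accepts classic Macintosh (CR), Unix (LF) and
    DOS (CRLF) line endings alike; blank lines are skipped.
    """
    peaks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        mz_field, intensity_field = line.split("\t")[:2]
        peaks.append((float(mz_field), float(intensity_field)))
    return peaks
```

A line without a tab raises ValueError, which is a convenient way to spot a file that was saved with spaces or commas instead of tabs.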
Once an MS/MS file has been selected, a dialog will appear in which the import parameters are set. The MS/MS Import Dialog is very similar to the LC/MS Import Dialog. The top portion of the MS/MS import dialog contains information about how the MS/MS file was acquired: the m/z of the parent ion, whether it was positive or negative ion data, whether the data is centroided or profile, the exact step size used in acquisition (including any mass defect), and the m/z range that was scanned.
The bottom portion of the dialog contains the import settings which are initially set to the default values. The first settings box determines which portion of the MS/MS file will be imported for analysis. It will always default to importing the entire mass range of the file.
The second settings box contains the settings for processing the scan to find its peaks - the intensity threshold, peaktop minimum and minimum m/z peak width. (See LC/MS Import Dialog for a discussion of these parameters.)
The third settings box contains a popup menu to set the number of times the scan should be smoothed before processing. The smoothing algorithm is identical to that used in MacSpec. This box also contains an option to limit the final number of imported ions to the X most intense ions, where X can be defined by the user.
Once the OK button in the MS/MS Import Dialog is pressed, the scan is processed and the MS/MS PepID Window will appear with the processed peaks listed in the small text editor on the left side of the window. If an MS/MS file is loaded after a previous MS/MS file had already been loaded its peak data will replace the previous peak data but the results of any PepID searches displayed in the main text editor at the bottom of the MS/MS PepID Window will remain.
PepID, an algorithm developed by Richard Johnson at the University of Washington, can be used to correlate an MS/MS spectrum with a protein sequence. The selected sequence(s) or a selected FASTA formatted protein database (such as the OWL non-redundant protein database - downloadable from the NCBI at ftp://ncbi.nlm.nih.gov/repository/OWL/) are searched to find all possible peptides with a mass within tolerance of the MW of the parent ion. If a database is chosen to be searched, the database's amino acid set and the digest method can be set via the Database Settings... button. Selected posttranslational modifications or single amino acid substitutions can also be considered when identifying peptides with the correct mass. (This initial portion is identical to an open-ended search, which was described in the preceding section.) The database search algorithm is still being optimized and hence is rather slow unless the two termini option is utilized.
The candidate sequences are then scored in two ways: The first score is the percentage of the total ion current from the MS/MS ion list the candidate sequence can account for by conventional low energy fragmentation (listed in the results in the "%" column). The second score, ISCOR, is a more sophisticated intensity-based scoring system which sums a fraction of each matched ion's intensity based on the likelihood of the ion type and charge-state. Bonuses and penalties are then added to this based on the continuity of b and y ion series, the percentage of unmatched b and y ions, the presence or absence of particular immonium ions which should be present if their corresponding amino acids are present in the candidate peptide, etc. The ISCOR values are normalized to 1.0 and the candidate peptides of each charge-state are sorted based on this value.
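The first score can be illustrated with a deliberately simplified sketch that considers only singly charged b and y ions of an unmodified peptide. PepID considers more ion types and charge-states than this, and ISCOR is not attempted here; the function names and the tolerance are assumptions, and standard monoisotopic residue masses are used.

```python
# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def fragment_ions(peptide):
    """Singly charged b and y ion m/z values for an unmodified peptide."""
    b_ions, y_ions = [], []
    running = 0.0
    for residue in peptide[:-1]:            # b1 .. b(n-1)
        running += MONO[residue]
        b_ions.append(running + PROTON)
    running = 0.0
    for residue in reversed(peptide[1:]):   # y1 .. y(n-1)
        running += MONO[residue]
        y_ions.append(running + WATER + PROTON)
    return b_ions + y_ions

def percent_ion_current(peptide, peaks, tol=0.5):
    """Percentage of total intensity in peaks explained by b and y ions.

    peaks: list of (mz, intensity) from the processed MS/MS scan.
    """
    theoretical = fragment_ions(peptide)
    total = sum(intensity for _, intensity in peaks)
    matched = sum(intensity for mz, intensity in peaks
                  if any(abs(mz - t) <= tol for t in theoretical))
    return 100.0 * matched / total if total else 0.0
```

A candidate that explains most of the ion current with few fragment types is generally more convincing than one that needs many, which is the kind of distinction the second, intensity-weighted score is designed to capture.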
5.1 Customizing Settings
To globally save the current settings as Sherpa's default settings for the next time you run the program simply select Save Preferences... from the Options menu and the current settings (program wide) will be saved as the default values. Preferences can also be set or reset to the program defaults on a more local basis. Windows and dialogs that have settings that can be saved to the preference file have two buttons: a Standard Settings button which will reset the values associated with that window (and ONLY the values associated with that window) back to the program's default values; and a Save Settings button which saves the current settings for the window as the default settings. Note that selecting only Standard Settings does not save the standard settings as the default settings.
The custom amino acids, termini, and modifications that you create in Sherpa are stored in its preferences file. If you install Sherpa on a new computer and wish to retain the custom AA's / termini that you have entered using Sherpa on another computer you can simply copy the Sherpa preferences from the Preferences folder in the System folder and place it in the Preferences folder of the new computer.
Clicking on the Go Away Box in the upper left hand corner of a window will hide the window but will not destroy the window or its contents. Any existing window can be brought to the front by selecting its name from the list of windows at the bottom of the Window menu. The active/frontmost window is denoted with a check in this list; and any hidden windows are denoted with a dash.
Some windows contain a shredder button above their text editor. These are temporary windows. They can be hidden like other windows by using the Go Away Box or they can be permanently disposed of by clicking the shredder button. Predicted CID fragment windows and sequence notes windows are examples of temporary windows.
This guide and the software described in it are copyrighted with all rights reserved. Neither the guide nor the software may be copied in whole or part without the written consent of the author, except as described in the license agreement provided with this product.
Internet: jataylor@nwlink.com
Macintosh, and Mac are trademarks or registered trademarks of Apple Computer, Inc. MacSpec and MacBioSpec are trademarks of Perkin-Elmer Sciex Instruments. All other trade names referenced are the service mark, trademark, or registered trademark of the respective manufacturer.
LC/MS files must be in either Sciex APIIII, Sciex API 100/300, or Finnigan SSQ/TSQ format.
1.3.1 Opening a Protein Sequence
1.3.2 The Sequence Window
1.3.3 The Digest Window
1.3.4 Opening an LC/MS File
2.0 Handling Protein Sequences
Any typed or opened sequence is stripped of any non-alphabetic characters and then capitalized before being displayed in the sequence window. The only limit to the number of sequences that can be open at one time is the available memory.
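The clean-up applied to a typed or opened sequence is equivalent to the following sketch. The function name is hypothetical, and restricting the filter to ASCII letters is an assumption.

```python
def clean_sequence(raw):
    """Drop every non-alphabetic character and capitalize what remains."""
    return "".join(ch for ch in raw if ch.isascii() and ch.isalpha()).upper()
```

This is why a sequence pasted with position numbers, spaces, or line breaks (for example from a database entry) can be loaded without manual editing.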
2.2 Saving a Sequence
2.3 The Sequence Window
2.4 Defining Custom Amino Acids and Termini
2.5 The Digest Window
2.6 Defining Crosslinks
3.0 LC/MS Interpretation
3.2 The LC/MS Import Dialog
The third settings box contains chromatographic parameters. The minimum number of scans for a "chromatographic peak" determines how many consecutive scans an ion must appear in to be considered a chromatographic peak. The chromatographic hole tolerance is the number of consecutive scans without signal that are tolerated within a chromatographic peak.
3.3 Running a Primary Search
3.3.1 MW Grouper
3.3.2 Peptide Match
3.3.3 Unidentified Ions
3.4 Running a Secondary Search
3.4.1 Glycosylation
3.5 Open-ended Search
Each modification must have: 1) a name of up to 29 characters, 2) a modification type - AA modification, N-terminal protein modification, N-terminal peptide modification, C-terminal protein modification, or C-terminal peptide modification, 3) a maximum number of modifications allowed per peptide, 4) a motif - which has a specified length with each position having a specified list of possible residues, and one position of the motif specified as the modified residue, 5) a net compositional change associated with the modification, which in turn is used to calculate the modification's delta mass, and 6) if the modification is an N-terminal or C-terminal modification, a unique short label of up to 6 characters to identify the modification in the output.
4.0 MS/MS Interpretation
4.2 The MS/MS Import Dialog
4.3 Using PepID to Interpret an MS/MS Spectrum
5.0 Miscellaneous
5.2 Window Behavior
5.3 References
5.4 Legalities
5.4.1 Contact Information:
Alex Taylor
10244 SE 16th
Bellevue, WA 98004 USA
WWW: http://www.hairyfatguy.com/Sherpa/