How to Use the ADDTREE/P Program for Fitting Additive Trees copyright 1981-1995 by James E. Corter All rights reserved. ADDTREE/P is a program in the PASCAL language for fitting additive trees to proximity data, using a variant (Corter, 1982) of the algorithm described by Sattath and Tversky (1977). Differences between the original Sattath & Tversky algorithm and the variant used here are described more fully in Corter (1992). The input data to ADDTREE/P may be either similarity or distance data. A listing of parameters and their estimates, drawing of the tree graph, and summary statistics are reported. Various other output options are available. A help file accompanies this source file. See it for more information on using the program. ADDTREE/P was written by James E. Corter. Questions or comments regarding the program may be addressed to him at: INTERNET: jec34@columbia.edu Box 41, Teachers College, Columbia University, New York, NY 10027 NOTICE REGARDING RETRIEVAL AND USE OF ADDTREE/P AND ALL ASSOCIATED MATERIAL: By retrieving or using any of the files in this directory pertaining to ADDTREE/P, you are agreeing to the following terms and conditions: 1. All the material you retrieved is and will remain the property of the author, James E. Corter. It is provided only for not-for- profit research and teaching purposes. The material to be supplied is not to be published or redistributed, with or without modifi- cation, or resold, or used for any business or commercial purpose without the express written permission of the author. 2. THIS MATERIAL IS BEING PROVIDED "AS IS", WITHOUT ANY EXPRESS OR IMPLIED WARRANTY. IN PARTICULAR, BUT WITHOUT LIMITATION, NEITHER THE AUTHOR NOR AT&T MAKE ANY REPRESENTATION OR WARRANTY OF ANY KIND CONCERNING THE MERCHANTABILITY OF THIS MATERIAL, ITS FITNESS FOR ANY PARTICULAR PURPOSE, ITS FREEDOM FROM INFRINGEMENT OF ANY PATENT OR COPYRIGHT, ITS ACCURACY, OR PROVISION OF TECHNICAL SUPPORT. (revised 4/95) How to Use the ADDTREE/P Program for Fitting Additive Trees James E. Corter Teachers College, Columbia Uinversity Box 41, Teachers College Columbia University New York, NY 10027 (INTERNET: jec34@columbia.edu) ADDTREE/P copyright 1981-1995 by James E. Corter All rights reserved. ================================================================== CONTENTS: I. Description of input file format II. Example input and output files III. Differences from the ADDTREE program IV. Installing ADDTREE/P on your system V. References ================================================================== I. Description of Input File Format Input files should contain the following lines: ***************************************************************** (1) first line: a list of analysis option commands, zero or more of the following keywords. Note that lowercase letters may be used, and that all keywords may be abbreviated to 3 characters. DATA Requests printing of the raw data matrix. TRANSFORMED Requests printing of the data, after transformation into distance-like numbers. MODELDISTANCES Requests printing of the model (tree) distances. RESIDUALS Requests printing of the residuals matrix. HEIGHTS Requests printing of the distance of each node from the root. NOTREE Suppresses printing of the tree graph. NONUMBER Suppresses numbering of objects in the tree graph. SPECIFYORDER If this is included, line 1 of the input file must be followed by a line containing a list of object numbers. The leaves of the tree graph are then ordered to conform as closely as possible to this list. (NOTE: Care must be taken not to omit or duplicate numbers in this list. Also, note that not all orderings of the leaves are possible, since the tree structure imposes constraints on the possible orderings.) NOSUBTRACTCONSTANT The default assumption is interval-scale data; this implies complete freedom in choosing an additive constant. Therefore the primary approach is to either add OR subtract an additive constant to exactly satisfy the triangle inequality: d(i,j) + d(i,k) <= d(j,k) for all i,j,k, AND d(i,j) + d(i,k) = d(j,k) for some i,j,k. But if "nosubtractconstant" is specified, strict inequality is allowed, i.e. if d(i,j) + d(i,k) > d(j,k) holds for all i,j,k in the data, no constant is subtracted. MINVARROOT The default rooting is to combine the last few remaining clusters into the root node; if "minvarroot" is specified, the program will search for the root that minimizes the variance of the distances from the root to the leaves. ROOTAT Specify where the root is to be placed. If this option is requested, this keyword must be followed by two node numbers, e.g. "rootat 12 13 ". Note that the two nodes specified must be contiguous in the tree structure. A previous run will generally be necessary to determine the correct node numbers. LINESIZE Sets the length of lines in the output file. The current default value is 80 characters (so that output may be conveniently read on a video terminal). Example: "linesize 66 ". Note that any of the keywords mentioned above may be abbreviated to three (or more) characters; also, capital or lower-case letters are ok. Keywords may be separated by either blanks or commas. However, when a keyword above specifies that a parameter is to be read in following the keyword (e.g. ("rootat 12 13 "), it is best to follow the numerical parameter with a (" ") instead of a (",") since some PASCALs will object to finding a comma while attempting to read a number. If the first line is blank, the standard analysis is done and the default output is obtained (estimates, tree graph, and statistics). ***************************************************************** (1.5) (optional line: see SPECIFYORDER above) ***************************************************************** (2) second line: a comment line for labeling of the output (up to 80 characters in length). ***************************************************************** (3) third line: the following parameters (separated by commas or blanks): (integer) SIMILARITIES or DISSIMILARITIES (specifies similarity or distance (dissimilarities) data) FULL or LOWERHALF (specifies shape of matrix) (integer; width of OUTPUT data fields) (integer; number of digits after the decimal place in the OUTPUT matrices) NOTE: the last two parameters here refer only to the OUTPUT. Also note that you should try to leave at least two spaces between numbers: e.g. if your data is numbers between 0 and 99, and you want to see one digit after the decimal place, you should specify 6 and 1 for these parameters. This is to leave room for possible minus signs and the decimal point itself. ***************************************************************** (4) fourth, etc. lines: the data matrix itself NOTE: some Pascals require real format, that is, the data must have a decimal point, and digits both before and after (e.g. 45.0, not 45 or 45. or .45) NOTE: the data matrix can be in a variety of formats, as long as the order of entries corresponds to the order of entries in a (full or lowerhalf) matrix. So for example, the input data file might contain only one number to a line, as long as the order of the lines corresponds to the order of entries in a lowerhalf (or full) matrix. However, one thing that must NOT occur is trailing blanks on a line, since the program then expects another number on that line. ***************************************************************** (5) (optional) lines immediately following the data: a list of object names, one to a line, maximum length 20 characters. II. Example Input and Output Files EXAMPLE INPUT FILE: nonumber,residuals,minvarroot numbers: abstract concept (Shepard, Kilpatric, & Cunningham; 1975) 10 dis low 7 1 421 584 284 709 346 354 684 646 059 413 804 588 671 429 409 788 758 421 300 388 396 909 630 796 592 742 400 417 821 791 367 804 246 671 350 400 850 625 808 263 683 592 296 459 392 zero one two three four five six seven eight nine EXAMPLE OUTPUT FILE: addtree analysis (ADDTREE/P version 1.5): numbers: abstract concept (Shepard, Kilpatric, & Cunningham; 1975) ( 0.0 needed for positivity of distances ) 303.0 added to exactly satisfy triangle inequality var. of dis. from root to leaves =61715.5041 node 11 score=43795.738 node 15 score=61831.331 node 16 score=116529.784 node 14 score=116529.784 changeroot( 17 => 11) node 1 score=147886.744 node 2 score=147886.744 node 15 score=61831.331 node 18 score=61715.504 var. of dis. from root to leaves =43795.7375 node length children 1 454.6 zero 2 269.4 one 3 185.6 two 4 235.6 three 5 176.4 four 6 327.4 five 7 264.1 six 8 375.6 seven 9 325.3 eight 10 330.4 nine 11 127.5 1 2 12 103.2 3 5 13 53.9 4 10 14 81.8 6 8 15 97.5 12 9 16 37.9 13 7 17 53.1 15 18 18 77.4 16 14 19 0.0 11 17 -------------------------------------------------- zero ------------| | ------------------------------ one | | -------------------- two | -----------| | ----------| ------------------- four | | | | | ---------------------------------- eight | | | --------------------------- three ------| ------| | -----| ---------------------------------- nine | | | -------| ---------------------------- six | | ---------------------------------- five ----------| ------------------------------------- seven stress formula 1 = 0.0926 stress formula 2 = 0.4958 r(monotonic) squared=0.7542 r-squared (p.v.a.f.)=0.6249 residual distances: 0.0 -134.5 -249.4 -28.1 -206.0 -134.2 -25.2 121.9 0.0 -66.0 -14.8 -45.7 101.0 -4.7 -151.7 76.4 231.5 -41.7 49.4 -65.5 -12.2 42.0 -51.8 177.9 110.2 133.1 0.0 -39.3 66.1 221.2 55.9 279.3 -55.9 64.6 -149.2 -254.6 18.2 -21.7 225.0 0.0 109.3 63.6 -49.4 -117.6 -227.4 III. Differences from the ADDTREE Program ADDTREE/P is based on the algorithm used by the ADDTREE program written by S. Sattath, Hebrew University (Sattath & Tversky, 1977). Differences exist in the method of resolving ties in the nearest- neighbor count matrix and in the choosing of a root for the tree. The tie-breaking rule used in ADDTREE/P results in slightly better solutions in a small percentage of cases. Whenever a tie occurs in the choosing of pairs of objects to be joined (that is, if objects y and u are both nearest- neighbors of x), or when the nearest-neighbor relationship is not reciprocal (e.g. x is y's nearest neighbor, but u is x's), the preferable grouping is chosen by what amounts to a check of the additive inequality for the quadruple x,y,u,V (where V refers to all the objects in the tree not = x,y,u). One earlier version of the current program attempted to find the "deepest" rooting of the tree; in recent versions the root is created out of last few nodes remaining from the clustering process. This tends to produce balanced trees, requires no extra processing time, and is a relatively non-metric criterion. Also available as options are (2) the root that minimizes the variance of the distances from the root to the leaves of the tree, and (3) any arbitrary user-specified root. Since no single criterion will choose the most satisfactory rooting for every application, this variety of options seems desirable. For additional description of differences from the earlier ADDTREE program, see Corter (1982). IV. Installing ADDTREE/P on your System The ADDTREE/P and EXTREE programs included have been compiled and tested on a number of systems using various PASCAL compilers. On IBM PC's, Turbo Pascal is recommended. Previous versions of the programs have been successfully compiled on UNIX systems and on IBM mainframes (using the IBM interactive PASCAL compiler). Because the PASCAL language as originally defined does not contain specifications for certain I/O and file definition operations, statements pertaining to these operations may vary from compiler to compiler. V. REFERENCES Corter, J.E. (1982). ADDTREE/P: A PASCAL program for fitting additive trees based on Sattath & Tversky's ADDTREE program. Behavior Research Methods and Instrumentation, 14, 353-354. Corter, J.E. (1992). An order-N**3 combinatoric algorithm for fitting additive trees. Paper presented at annual meeting of the Classification Society of North America, East Lansing MI. Sattath, S., & Tversky, A. (1977). Additive similarity trees. Psychometrika, 42, 319-345.