XSM User Guide
Note: This documentation applies to the current stable version. For older versions, start XSM without parameters to display the list of available options.
Summary
- Introduction to XSM
- Installing XSM
- Running XSM
- Parameter Statements
- Command line Syntax
- Sample Sort Jobs: the best way to get started!
- Destructive and non destructive sort
- Performance Issues
- Running XSM as a User Exit
- Messages & Return Codes
- CHANGELOG
- FAQ
1. Introduction to XSM
XSM is a fast sort program that reads one or more input files, sorts all the lines or records according to user-defined 'sort keys', and writes the final result to one or more files, according to the user's 'Include/Exclude' filters.
XSM can sort 2 kinds of files:
- Fixed length records files:
All records or 'lines' have the same length, including a trailing CR/LF (DOS, OS2, Windows, ...) or a trailing LF (UNIX), if any. Records may or may not end with such a terminator: CR and LF are treated like any other characters.
- Text files (Variable length records files), also called "flat files":
All lines are terminated by a CR/LF (DOS, OS2, NT ...), or by a single LF (UNIX text files).
The sorting information (sort keys) in those lines may be at fixed columns (FIELDS)
or in "variable length fields" (VFIELDS) separated by a given character (FIELDSEP).
XSM reads its parameters from an IBM-like 'sort parameters' file (equivalent to the SYSIN DD card in MVS JCL):
a simple ASCII text file that you can create and modify with your favorite text editor.
We call it "Parmfile". It is described below in the Parameter Statements section.
Parmfile is mainly used to specify static parameters, such as sort fields, sortworks and options.
You can also specify parameters on the command line instead of using a parameter file, as described below in the Command Line Syntax section.
This is mainly used to specify variable parameters, such as input and output file names. Then you can make the most of Environment Variables in batch scripts (UNIX shells, Windows .bat) to pass XSM input/output file names.
It is recommended to make use of Parmfile and command line parameters together:
- Parmfile contains static statements, such as sort fields
- Command line contains variables such as input/output file names
When starting, XSM will automatically compute the best memory and temporary disk files ("sortworks") usage for processing specified input files. So, for most usages, there is no need for XSM tuning.
The default XSM processing mode is "destructive sort" for optimal performance, as opposed to "stable". It can be switched to non-destructive or "stable" sort mode.
See below "destructive/non-destructive sort" discussion.
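To make the term concrete, here is a minimal Python sketch of what a non-destructive ("stable") sort guarantees. It only illustrates the concept with Python's own sorted() function and says nothing about how XSM is implemented; the sample records are invented.

```python
# Each record is (sort key, payload). Two records share the key "A".
records = [("B", "first"), ("A", "second"), ("A", "third")]

# Python's sorted() is guaranteed to be stable: records with equal keys
# keep their input order. This is the promise of a non-destructive sort.
stable = sorted(records, key=lambda rec: rec[0])

for key, payload in stable:
    print(key, payload)
```

With a destructive sort, the relative order of "second" and "third" would not be guaranteed, since both carry the same key.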
2. Installing XSM
Installing XSM is easy and takes 5 minutes.
XSM consists of a single binary program 'hxsm', along with a "use once and forget" license serial activation program 'hhnsinst'.
XSM has no special hardware or software requirements on your Operating System.
Just read the Installation Guide in its UNIX/Linux or Windows flavor.
Once XSM is activated, you can rename it and move it anywhere on the system.
Note that the XSM binary is named 'hxsm' to avoid confusion with the UNIX X11 Session Manager 'xsm'.
Tips:
- It is recommended to install XSM under a generic name, so that future XSM releases can be installed without modifying any batch scripts. In other words, leave the release number out of the XSM program name.
On UNIX, use symbolic links:
$ cd /usr/local/bin
$ # Let's use XSM release 6.71:
$ ln -s hxsm671_Linux2.6.9-1.667_i686_32 hxsm
$ ls -l hxsm*
lrwxrwxrwx 1 root root 7 Apr 28 11:28 hxsm -> hxsm671_Linux2.6.9-1.667_i686_32
-rwxrwxr-x 1 root root 109160 Feb 19 2005 hxsm662_Linux2.6.9-1.667_i686_32
-rwxr-xr-x 1 root root 107744 Oct 31 2007 hxsm668_Linux2.6.9-1.667_i686_32
-rwxr-xr-x 1 root root 107744 Oct 28 2008 hxsm669_Linux2.6.9-1.667_i686_32
-rwxr-xr-x 1 root root 100480 Feb 8 2010 hxsm671_Linux2.6.9-1.667_i686_32
- It is recommended to set PATH so that XSM calls do not need to specify the full pathname of the program.
An activated copy of XSM is bound to the Operating System it has been activated on. That is, if you copy the XSM binary program to another similar Operating System, you must activate it again on the target OS.
3. Running XSM
Running XSM is quite simple:
1) Create a parameter file that describes the job to do:
- which files to sort (if not standard input)
- what kind of files: fixed length (binary), or text file
- what are the sort keys
- where to put the final result (if not standard output)
- what resources are to be used: memory, directories for temporary sortworks, ...
- optional features (deduplicate, filtering, ...)
2) At the OS command prompt (a UNIX terminal or Windows cmd.exe), just type:
hxsm your-parmfile
3) Verbose switch:
You may follow your job running, with each step's details and duration by
using the -v switch:
hxsm -v your-parmfile
4) Check switch:
Once the sort job is ended, you may want to check that the output
file is correctly sorted according to the parameter file.
Just use the -c switch:
hxsm -v -c your-parmfile
Note that there is a command line with traditional switches described
below, which may replace the parameter file in most cases. It is useful to specify input and output file names that come from environment variables.
Next section is the Parameter Statements.
For an easy start, see some Sample Sort Jobs below.
4. Parameter Statements
XSM reads its parameters from a simple text file. We call it 'parameter file' or 'parmfile'.
If you prefer not to use a parmfile, most statements can be given on the command line.
The parameter file is a set of the following statements:
OPTION statement - optional
SORT/MERGE statement - mandatory if not OPTION COPY
RECORD statement - mandatory
INPFIL statement - optional
OUTFIL statement - optional
SORTWORKS statement - optional
IOERROR statement - optional
STORAGE statement - optional
INCLUDE statement - optional
EXCLUDE|OMIT statement - optional
Statements can start at any column, though it is recommended to begin in columns 1 to 8 for readability.
Comment marks are:
- a semicolon ';' anywhere in a line,
- or a star '*' at the beginning of a line,
- or a number/hash sign '#' at the beginning of a line.
# this is a comment
* this is a comment
; this is a comment
INPFIL /tmp/data/myinput.file ; this is a comment for my input file
Empty lines are permitted anywhere.
The OPTION statement
OPTION option,option,...
or
OPTIONS option,option,...
where options are:
Y2KSTART=19nn
(nn = separates Years 1900 from Years 2000, default 1970)
This option has meaning only for fields of type 'Y' or 'Y2K' (see below)
COLLATING=EBCDIC
or
COLLATE=EBCDIC
This option is primarily used for VM/MVS to UNIX/Windows migrations
It will keep ASCII characters keys in the following order:
- first, all lower-case letters,
- then all upper-case letters,
- and finally the digits 0..9
RECORD_SEPARATOR=byte
or
RSEP=byte
This option is primarily used for OPEN MVS and OS/400 for variable length files.
It defines a specific character (1 byte) at the end of each record to be considered as the Record Separator.
Defaults:
RSEP=0A (UNIX, all flavors)
RSEP=0D (Windows)
RSEP=15 (Open MVS)
RSEP=25 (OS/400)
Example:
OPTION RSEP=|
or
OPTION RSEP=7C
MULTI=n
or
PROCESSORS=n
or
PROCS=n
This option forces XSM to run in multi-threading (parallel) mode with a minimum of n threads.
For example, if you have an 8-processor machine, the following statement in your parmfile can reduce sort time:
OPTION PROCS=8
KEEP_ORDER
This option turns to non-destructive sort, also called "stable sort".
See explanation below.
COPY
This option is primarily used for compatibility with some mainframe sort programs, such as IBM DFSORT.
It tells XSM to selectively copy the input file(s) onto one or more output files.
It must be followed by one or more of the following statements:
OUTFIL FILE=n,INCLUDE=(condition1[,AND/OR,condition2...])
or
OUTFIL FILE=n,INCLUDE=ALL
(see the 'OUTFIL' statement below)
Example of OPTION statement usage:
OPTION PROCS=4,KEEP_ORDER ; force multithreading on 4 procs, force stable sort
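The effect of the COLLATING=EBCDIC option described above can be sketched in Python. This is only an illustration under one assumption: it uses Python's cp037 codec (one common EBCDIC code page) as the collating table, whereas the exact table used by XSM is not stated in this guide. The helper name ebcdic_key and the sample keys are hypothetical.

```python
def ebcdic_key(text: str) -> bytes:
    """Return the bytes the text would have in EBCDIC (code page 037, assumed)."""
    return text.encode("cp037")

keys = ["B2", "a1", "9z", "Aa", "b0"]

ascii_order = sorted(keys)                   # digits, upper-case, lower-case
ebcdic_order = sorted(keys, key=ebcdic_key)  # lower-case, upper-case, digits

print(ascii_order)
print(ebcdic_order)
```

Comparing the two lists shows why a migrated mainframe job may need this option to keep its former output order.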
The SORT/MERGE statement
SORT or MERGE FIELDS=(start,len,type,direction[,start,len,type,direction,..])
or
FIELDS=(start,len,direction[,start,len,direction,..]),FORMAT=type
or
FIELDS=ALL
or
VFIELDS=(start,len,type,dir.[,start,len,type,dir.,..]),FIELDSEP=car
start : start of sort field (1,2, ... n):
position if fixed field (FIELDS=)
field number if variable fields (VFIELDS=)
len : sort field length (FIELDS)
sort field max length (VFIELDS)
type : B or BI (Binary) - binary field
C or CH (Char) - normal ASCII characters (default)
I (Ignore case)
N or NU (Numeric) - only '0' .. '9' characters (VFIELDS only)
P or PD (Packed) - Packed decimal field
(Binary files only, last half byte = sign)
Y or Y2K (yy) - 2 bytes numeric field containing year of a date
(would contain '84' for year 1984)
direction : A (Ascending order), D (Descending order)
FORMAT : short form to use only when all sort keys have the same type
FIELDSEP : The FIELDSEP sub-parameter value can be a symbolic name, or a single
character, or an hexadecimal value.
Using a symbolic name is recommended for general punctuation chars to
avoid syntax parsing errors, especially with the comment markers ';', '#'.
Symbolic names accepted:
BAR = the '|' char
TAB = the Tabulation (X'09') char
COMMA = the ',' char
COLUMN = the ':' char
DIARESIS = the '#' char
SLASH = the '/' char
BACKSLASH = the '\' char
SEMICOLUMN or SEMI-COLUMN = the ';' char
SINGLEQUOTE or SINGLE-QUOTE = the "'" char
DOUBLEQUOTE or DOUBLE-QUOTE = the '"' char
Hexadecimal value: X'hh'
(Hex syntax : UPPERCASE X, Singlequote, 2 Hex digits, Singlequote)
Default value is TAB
Examples:
VFIELDS=(.....),FIELDSEP=SEMI-COLUMN
VFIELDS=(.....),FIELDSEP=@
VFIELDS=(.....),FIELDSEP= ; Field separator is space char
VFIELDS=(.....),FIELDSEP=X'7C'
VFIELDS=(.....) ; Default Field separator is TAB
SORT verb is used for one or more input files.
MERGE verb is used to merge at least two already sorted files.
To understand the difference between SORT and MERGE, see the SORT and MERGE discussion.
One SORT or MERGE statement is mandatory, unless using OPTION COPY.
The FIELDS parameter describes sort keys at fixed column in the line (RECFM=V) or the record (RECFM=F).
FIELDS=ALL : the sort key is the whole record.
The VFIELDS parameter describes variable length sort keys separated by a given char.
This parameter cannot be used for Binary Files (RECFM=F).
When using VFIELDS, specify field separator with FIELDSEP=
Examples:
# these 3 SORT statements are equivalent:
SORT FIELDS=(17,3,B,D,1,15,B,A)
or
SORT FIELDS=(17,3,BI,D,1,15,BI,A)
or
SORT FIELDS=(17,3,D,1,15,A),FORMAT=BI
This means:
- the 1st sort field starts at column 17 of each record, ends at col. 17 + 3 -1 = 19, type Binary, descending order,
- the 2nd sort field starts at col. 1 of each record, ends at col. 1 + 15 -1 = 15, type Binary, ascending order.
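The two-key ordering just described can be sketched in Python as follows. This is a hedged illustration of how the 1-based start/len pairs select bytes and how mixed directions combine; it is not XSM's algorithm. The helpers field, descending, sort_key and make, and the sample records, are hypothetical.

```python
def field(record: bytes, start: int, length: int) -> bytes:
    """Return the key at 1-based column `start`, `length` bytes long."""
    return record[start - 1:start - 1 + length]

def descending(key: bytes) -> bytes:
    """Invert every byte so that an ascending comparison sorts in descending order."""
    return bytes(255 - b for b in key)

def sort_key(record: bytes):
    # 1st key: columns 17-19, binary, descending
    # 2nd key: columns 1-15, binary, ascending
    return (descending(field(record, 17, 3)), field(record, 1, 15))

def make(name: str, code: str) -> bytes:
    """Build a sample record: name in columns 1-15, blank, code in columns 17-19."""
    return name.ljust(15).encode("ascii") + b" " + code.encode("ascii")

records = [make("SMITH", "001"), make("JONES", "002"), make("ADAMS", "002")]
result = sorted(records, key=sort_key)

for rec in result:
    print(rec.decode("ascii"))
```

Records with code 002 come first (descending), and within the same code the names are in ascending order.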
Tip: on a LITTLE-ENDIAN ("x86") machine, if you want to sort binary integers (shorts, longs), just invert the direction:
SORT FIELDS=(2,4,B,A) ; will sort a long (32 bits) in Descending order
SORT FIELDS=(6,2,B,D) ; will sort a short (16 bits) in Ascending order
Sorting a text file using a name at pos.12 for 20 bytes, ignore case:
SORT FIELDS=(12,20,I,A)
Sorting a Binary file using a packed decimal number at pos. 7 for 4 bytes, reverse order:
SORT FIELDS=(7,4,P,D) ; 7 BCD digits, with sign
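For readers unfamiliar with packed decimal, the following Python sketch decodes such a field: two BCD digits per byte, with the last half byte holding the sign. It assumes the usual IBM convention (sign nibble D means negative, anything else is treated as positive); the function name unpack_decimal is hypothetical and not part of XSM.

```python
def unpack_decimal(field: bytes) -> int:
    """Decode a packed decimal field: BCD digits, last half byte = sign."""
    nibbles = []
    for byte in field:
        nibbles.append(byte >> 4)     # high half byte
        nibbles.append(byte & 0x0F)   # low half byte
    sign = nibbles.pop()              # last half byte is the sign
    value = 0
    for digit in nibbles:
        value = value * 10 + digit
    # Assumed convention: 0xD = negative, other sign nibbles = positive
    return -value if sign == 0x0D else value

# A 4-byte field holds 7 digits plus the sign
print(unpack_decimal(bytes.fromhex("1234567C")))
print(unpack_decimal(bytes.fromhex("0000042D")))
```

This shows why such keys need type P: comparing the raw bytes would ignore the sign.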
Sorting a text file with variable fields, separated by ':', using:
- a name in field #12, max. length 20 bytes, ignore case:
- a number in field #7, at most 9 digits (chars '0' .. '9'), reverse order
SORT VFIELDS=(12,20,I,A,7,9,N,D),FIELDSEP=:
Same with a blank as field separator:
SORT VFIELDS=(12,20,I,A,7,9,N,D),FIELDSEP=
Same with the TAB char as field separator:
SORT VFIELDS=(12,20,I,A,7,9,N,D),FIELDSEP=TAB
or
SORT VFIELDS=(12,20,I,A,7,9,N,D) ; FIELDSEP=TAB implied (default)
Dates like 'mmddyy', pos. 21-26, descending order:
SORT FIELDS=(25,2,Y,D,21,2,B,D,23,2,B,D)
Sorting whole lines in a text file:
SORT FIELDS=ALL
The RECORD statement
RECORD RECFM=record_format,LRECL=record_length
record_format : F (Fixed), V (Variable) | T (Text), M ("MFCOBOL" or "MFVariable")
V and T are synonyms (equivalent)
record_length : exact record length for RECFM=F, including any separator CR/LF
max record length for RECFM=V, excluding line separator CR/LF
The RECORD statement tells XSM if record type is fixed length or variable length.
The RECORD statement is mandatory.
With RECFM=V, XSM internally adds 2 to LRECL to allow for CR/LF (Windows/UNIX flat text files).
RECFM=M is used for MicroFocus COBOL special variable format known as "MFCOBOL" or "MFVariable"
Examples:
RECORD RECFM=F,LRECL=400
means that all the records have the same (fixed) length of 400 bytes
RECORD RECFM=V,LRECL=133
means that this is a text file, and that the maximum line length is 133 (excluding CR/LF).
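The difference between the two record formats can be sketched in Python. This is an illustration of the LRECL rules stated above, not XSM code; the function names read_fixed and read_text are hypothetical.

```python
import io

def read_fixed(stream, lrecl: int):
    """RECFM=F: every record is exactly lrecl bytes, any CR/LF included."""
    while True:
        record = stream.read(lrecl)
        if not record:
            break
        if len(record) != lrecl:
            raise ValueError("incomplete record at end of file")
        yield record

def read_text(stream, lrecl: int):
    """RECFM=V: lines end with CR/LF or LF; lrecl is the maximum length without it."""
    for line in stream:
        body = line.rstrip(b"\r\n")
        if len(body) > lrecl:
            raise ValueError("line longer than LRECL")
        yield body

fixed = list(read_fixed(io.BytesIO(b"AAAABBBB"), 4))
text = list(read_text(io.BytesIO(b"one\r\ntwo\n"), 133))
print(fixed, text)
```

Note that in the fixed case a trailing CR/LF, if present, counts toward the record length, while in the text case it does not.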
The INPFIL statement
form 1:
INPFIL filename
filename : input file name
form 2:
INPFIL DD:varname
varname : Environment variable name holding input filename.
The INPFIL statement describes input file names.
The INPFIL statement is optional: if omitted, the standard input (stdin) will be used as input file (UNIX style redirection and pipe are allowed).
Use one INPFIL line per input file.
As of Versions 450/510, files may be specified 'a la MVS' by a ddname: DD:SORTIN
In that case, XSM will get the 'dsname' (file name) via the corresponding environment variable.
Examples:
INPFIL C:\Myjob\BIGF.INP ; Windows
INPFIL D:\TMP\Wrk.Dat ; Windows
INPFIL /home/hh/bigf.inp ; UNIX
INPFIL /home/hh/littlef.inp ; UNIX
INPFIL DD:SORTIN1 ; any systems
INPFIL DD:SORTIN2 ; any systems
INPFIL DD:JOHNNY ; any systems
# = actual file names via the environment variables SORTIN1, SORTIN2, and JOHNNY
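The DD:varname lookup described above amounts to an environment variable lookup. A minimal Python sketch, with a hypothetical function name resolve_file_name and an assumed behavior (an error) when the variable is not set:

```python
import os

def resolve_file_name(spec: str, env=os.environ) -> str:
    """Return the actual file name for an INPFIL/OUTFIL operand."""
    if spec.startswith("DD:"):
        varname = spec[3:]
        if varname not in env:
            # Assumption: a missing variable is treated as an error
            raise KeyError(f"environment variable {varname} is not set")
        return env[varname]
    return spec  # plain file name, used as-is

# Simulated environment for the example
fake_env = {"SORTIN1": "/home/hh/bigf.inp"}
print(resolve_file_name("DD:SORTIN1", fake_env))
print(resolve_file_name("/home/hh/littlef.inp", fake_env))
```

This is why the same parmfile can be reused unchanged by several batch scripts: only the environment variables differ.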
The OUTFIL statement
The OUTFIL statement describes output file names.
The OUTFIL statement is optional: if omitted, the standard output
will be taken as output file (redirection and pipe allowed).
Note: the output file may overwrite one of the input files,
but in SORT operations only.
form 1 :
OUTFIL filename
filename : output file name
form 2 :
OUTFIL DD:varname
varname : Environment variable name holding output filename.
form 3 :
OUTFIL FILE=n,INCLUDE/EXCLUDE=(condition1[,AND/OR,condition2...])
n : number of the SORTOFn environment variable which holds
the full path of the output file
condition : start,length,datatype,operator,value
start : start pos of zone to compare
length : length of zone to compare
type : 'CH' (Char) or 'BI' (Binary)
operator : EQ or NE or LT or LE or GT or GE
value : C'xxxx' where xxxx = string to compare
Length of string must equal length of zone to compare
Examples:
OUTFIL D:\TMP\BIGF.OUT ; OS2, WIN32 full pathname style
OUTFIL /home/hh/bigf.out ; UNIX full path name style
OUTFIL DD:FOO
; all systems, actual file name via the environment variable FOO
OUTFIL FILE=1,INCLUDE=(11,3,CH,EQ,C'MAR',OR,11,3,CH,EQ,C'GAS')
; output filename via environment variable SORTOF1
OUTFIL FILE=2,OMIT=(11,3,CH,EQ,C'POP')
; output filename via environment variable SORTOF2
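How a condition of the form start,length,type,operator,value selects records can be sketched in Python. This is an assumption-laden illustration for character ('CH') zones only: the names OPS, matches and goes_to_file1 are hypothetical, and the sample records are invented.

```python
import operator

# The six comparison operators listed above
OPS = {
    "EQ": operator.eq, "NE": operator.ne,
    "LT": operator.lt, "LE": operator.le,
    "GT": operator.gt, "GE": operator.ge,
}

def matches(record: str, start: int, length: int, op: str, value: str) -> bool:
    """Compare the zone at 1-based position `start` with the string `value`."""
    zone = record[start - 1:start - 1 + length]
    return OPS[op](zone, value)

def goes_to_file1(record: str) -> bool:
    # Columns 11-13 equal to 'MAR' or 'GAS', as in the FILE=1 example
    return matches(record, 11, 3, "EQ", "MAR") or matches(record, 11, 3, "EQ", "GAS")

print(goes_to_file1("ACCT000001MAR2024"))
print(goes_to_file1("ACCT000002POP2024"))
```

Each OUTFIL FILE=n line carries its own filter, so one record can be written to several output files.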
The SORTWORKS statement
SORTWORKS directory[,directory,...]
directory : one or more directory full path for temporary work files.
The SORTWORKS statement is used to control where to put work files on different drives, or UNIX file-systems (recommended).
Examples:
SORTWORKS C:\TMP,D:\TEMP (OS2, WIN32)
SORTWORKS /var/tmp,/tmp (UNIX)
The SORTWORKS statement is optional: if omitted, all work files
are created (then deleted after use) in the current directory.
*** Using SORTWORKS properly is essential for performance ***
When more than one sortworks directory is given, XSM will force multi-threading: one thread per directory.
When OPTION PROCS= is specified together with SORTWORKS dir1,dir2,...,dir_n, XSM uses the larger of the two values as the number of threads.
Putting temporary work files on a different physical disk from the one holding the input/output files can dramatically improve XSM performance, because it reduces concurrent I/O on a single disk.
Indeed, to simplify, consider XSM I/O phases are:
Phase 1 : reading input file + writing sortworks
Phase 2 : reading sortworks + writing output file
Using input + output on one disk and sortworks on another disk will produce the best results:
                                   (Disk1)   (Disk2)
Phase 1 : reading input file       Read
        + writing sortworks                  Write
Phase 2 : reading sortworks                  Read
        + writing output file      Write
Using input on the same disk as sortworks, and output on another disk, will produce poor results:
                                   (Disk1)   (Disk2)
Phase 1 : reading input file       Read
        + writing sortworks        Write               <== Concurrent I/O on same disk: BAD!
Phase 2 : reading sortworks        Read
        + writing output file                Write
[Chart: the same sort operation, with one disk and with two disks]
Please note that results depend heavily on factors such as file size, free memory,
disk speed and CPU speed. The chart only illustrates that using different physical
hard disks generally improves sort time.
Temporary files are named after the current XSM Process Id, srtw04d2.xxx for instance
for Pid 1234 (hexa 04d2). So running several XSM simultaneously using same SORTWORKS
is not a problem.
Temporary files are deleted upon job completion.
In case of a severe error they may remain on disks so it is wise to check XSM return code.
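The naming rule for temporary files can be checked with a short Python sketch. It only reproduces the documented example (Pid 1234 gives srtw04d2); the function name workfile_prefix is hypothetical, and the file extension is left out because this guide does not specify it.

```python
import os

def workfile_prefix(pid: int) -> str:
    """Return the work file name prefix: 'srtw' + Pid as 4 lower-case hex digits."""
    return f"srtw{pid:04x}"

print(workfile_prefix(1234))         # documented example: srtw04d2
print(workfile_prefix(os.getpid()))  # prefix for the current process
```

After a failed job, such a prefix can help locate leftover work files in the SORTWORKS directories.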
The IOERROR statement
IOERROR IGNORE
The IOERROR statement is valid for variable length text (RECFM=V) files only.
It tells XSM what to do when a record is shorter than a defined key:
- if omitted, the first line shorter than the sort key will cause
XSM to 'ABEND' (Abnormal end) immediately.
- if present, lines shorter than the sort key will be ignored.
The STORAGE statement
STORAGE size{K|M|G}
size : amount of main storage to use, in Kilobytes, Megabytes, or Gigabytes.
Examples:
STORAGE 450K
STORAGE 2150K
STORAGE 8M
The STORAGE statement is optional: if omitted, XSM allocates storage as best it can, depending on your Operating System and input files.
Nevertheless, it can be used to limit the paging/swapping that stresses the system when sorting huge files:
T = total input size in Megabytes
T ≤ 64517 MB : STORAGE = square root( T ) / 4
T > 64517 MB : STORAGE = T / 1020
For instance, 3 input files of 25000 MB
T = 3 x 25000 = 75000 (MB)
STORAGE = 75000 / 1020 = 74MB
in this case, the statement STORAGE 74M can improve performance
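The rule of thumb above can be written as a small Python function. This is a direct transcription of the two formulas, with the result taken to be in Megabytes; the function name suggested_storage_mb is hypothetical.

```python
import math

def suggested_storage_mb(total_input_mb: float) -> float:
    """Return the suggested STORAGE value in MB for T = total input size in MB."""
    if total_input_mb <= 64517:
        return math.sqrt(total_input_mb) / 4
    return total_input_mb / 1020

# The worked example above: 3 input files of 25000 MB each
total = 3 * 25000
print(round(suggested_storage_mb(total)))   # 74, hence STORAGE 74M
```

The two branches meet near T = 64517 MB, where both give roughly 63 MB.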
The INCLUDE/EXCLUDE/OMIT statements
form 1 : Deduplicating
        NONE                               ; do not suppress anything (default)
OMIT    DUPKEYS (or DUPLICATE KEYS)        ; suppress records with duplicate keys
        DUPRECORDS (or DUPLICATE RECORDS)  ; suppress duplicate records
form 2 : Filtering
INCLUDE
EXCLUDE COND=(col,len,pattern[,startcol,len,pattern ... ])
OMIT COND=(col,len,type,op,pattern[,AND|OR,col,len,type,op,pattern ...])
col : start position of the field to be examined in each record (1..n)
len : length of the field
type : always 'CH' (useless but for IBM SORT compatibility)
operator : one of 'EQ' 'NE' 'GT' 'GE' 'LT' 'LE'
pattern : C'...' string constant to be matched against the field
You cannot use EXCLUDE/INCLUDE operations with VFIELDS.
EXCLUDE and OMIT are synonyms.
Wild chars '*' and '+' MUST be escaped by a backslash '\' :
'\*' stands for 'Any String',
'\+' stands for 'Any Character',
thus '*' and '+' (not escaped) are normal characters.
If the pattern does not contain the wild char '\*', its length should match the field length; otherwise, it will be truncated or padded with blanks to match the field length.
Successive triplets <start,len,pattern> specified in the same statement are processed with an 'OR' boolean operator.
Successive statements (one statement per line) are processed with an 'AND' boolean operator.
Example:
INCLUDE COND=(15,5,CH,EQ,C'JONES',OR,15,5,CH,EQ,C'SMITH')
OMIT COND=(11,3,CH,EQ,C'000')
These 2 filter statements mean:
1) process records only if the field col.15-19 contains the names 'SMITH' or 'JONES'
2) in the remaining set of records, throw away records where col.11-13 are equal to '000'
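The wild char rules given above ('\*' for any string, '\+' for any character, padding or truncation to the field length) can be sketched in Python by translating a pattern into a regular expression. This is an interpretation of those rules, not XSM's matcher: the names pattern_to_regex and field_matches are hypothetical, and it assumes the pattern must match the whole field.

```python
import re

def pattern_to_regex(pattern: str, field_len: int) -> "re.Pattern":
    """Translate an XSM-style pattern into a compiled regular expression."""
    tokens = []
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if pair == "\\*":            # escaped star: any string
            tokens.append(".*")
            i += 2
        elif pair == "\\+":          # escaped plus: any single character
            tokens.append(".")
            i += 2
        else:                        # anything else is a normal character
            tokens.append(re.escape(pattern[i]))
            i += 1
    if ".*" not in tokens:
        # No 'any string' wild char: truncate or pad with blanks to the field length
        tokens = tokens[:field_len] + [re.escape(" ")] * (field_len - len(tokens))
    return re.compile("".join(tokens), re.DOTALL)

def field_matches(record: str, col: int, length: int, pattern: str) -> bool:
    """Match the field at 1-based column `col` against the pattern."""
    field = record[col - 1:col - 1 + length]
    return pattern_to_regex(pattern, length).fullmatch(field) is not None

print(field_matches("*ABC", 1, 1, "*"))      # unescaped star is a normal character
print(field_matches("XABC", 1, 1, "\\+"))    # escaped plus matches any character
```

An unescaped '*' matches only a literal star, which is the behavior that Job 5 in the samples relies on.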
The SKIP_HEADER statement
SKIP_HEADER nnn
The SKIP_HEADER statement will skip nnn first records (lines) from input file.
If more than one input file is specified, records are skipped from the first file.
Example:
SKIP_HEADER 4 ; throw away report header
5. Command Line Syntax
The XSM Sort/Merge program is invoked from the system command line.
You may call XSM with or without parameter file.
Once you have manually setup XSM parameters and options, just plug the command line in your batch program: UNIX Shell, Windows .bat or whatever is your favorite batch language.
The full syntax is:
hxsm [options]
Options:
-c, --check check result of a previous sort/merge operation and issues message
"XSM064I The file xxx correctly sorted" on success
-q, --quiet run in quiet mode with no information displayed on stderr, unless an
error occurs
Note that stdout is reserved for default output if no output specified
-v, --verbose display more details while processing
-v/--verbose and -q/--quiet are mutually exclusive
-h, --help display help and exit
--sort sort operation (default)
equivalent to SORT statement in a parameter file
-m, --merge merge operation
equivalent to MERGE statement in a parameter file
--copy copy operation
equivalent to OPTION COPY statement in a parameter file
--sort --merge --copy are mutually exclusive
-k, --key=start,len[,direction,[type] ]
sort key definition:
start : starting position (relative to 1) for that key.
length : key length, in bytes
direction : 'D' = descending, 'A' = ascending
type : 'C' = character
'B' = binary byte ("low value" in Cobol)
'I' = ignore Upper/lower case
'N' = numeric
'P' = Packed decimal field (RECFM=F or M files only)
'Y' = Year in a date
equivalent to FIELDS= part of SORT statement in a parameter file
-k all, --key=all sort using key = whole record
equivalent to FIELDS=ALL part of SORT statement in a parameter file
-r, --recfm=F|V|M specify Record Format, can be F (Fixed), V (variable), M (MFCobol)
default is V (variable)
equivalent to RECFM= part of RECORD statement in a parameter file
-l nnn, --lrecl=nnn specify Logical Record Length (LRECL):
nnn = exact record length for RECFM=F, including any separator CR/LF
nnn = max record length for RECFM=V, excluding line separator CR/LF
equivalent to LRECL= part of RECORD statement in a parameter file
-z nnn same as -l nnn (for UNIX compatibility)
--infile=file[,recfm=x[,lrecl=nnn]] specify input file(s)
RECFM and LRECL can be specified together with an input file
when reformatting.
See Reformatting
multiple input files can be specified, as follow:
--infile=/tmp/f1 --infile=/tmp/f2 ...
equivalent to INPFIL statement in a parameter file
-o, --outfile=fileout[,recfm=F|V|M[,lrecl=nnn]] specify output file
RECFM and LRECL can be specified together with an output file
when reformatting.
See Reformatting
--outrec=(in_pos1,len1,out_pos1,type1[,in_pos2,...]) specify output record
when reformatting.
See Reformatting
-uk, --unique-key drop next records with duplicate keys, the first one is kept
equivalent to OMIT DUPLICATE KEYS statement in a parameter file
-ur, --unique-record drop duplicate records, the first one is kept
equivalent to OMIT DUPLICATE RECORDS statement
--include=start,len,op,val[AND|OR,start,len,op,val...] specify include filter
equivalent to INCLUDE statement in a parameter file
--exclude=start,len,op,val[AND|OR,start,len,op,val...] specify exclude filter
equivalent to EXCLUDE statement in a parameter file
-t dir1,dir2 --sortwork[s]=dir1,dir2,... sortworks directory list
you can issue a list of directories, separated by comma (,) or
issue several --sortworks= or -t :
--sortworks=/tmp1/dir1 --sortworks=/tmp/dir2
equivalent to:
--sortworks=/tmp1/dir1,/tmp/dir2
equivalent to:
-t /tmp1/dir1 -t /tmp/dir2
equivalent to:
-t /tmp1/dir1,/tmp/dir2
equivalent to SORTWORKS statement in a parameter file
-y nnnK|M|G, --storage=nnnK|M|G force storage allocation to nnn Kilo/Mega/Gigabytes
in most cases, better to let XSM calculate by itself
to use only when slow processing on very large files
equivalent to STORAGE statement in a parameter file
--keep-order force non-destructive sort, also called "stable sort"
see explanation below.
equivalent to OPTION KEEP_ORDER in a parameter file
--record-separator=C|0xhh defines the characters or pair of characters at the end
of each record. This option is primarily used for OPEN MVS and
OS/400 for variable length files.
equivalent to OPTION RECORD_SEPARATOR in a parameter file
--collating-sequence=ebcdic force alphanumeric sequence to IBM's EBCDIC
equivalent to OPTION COLLATING in a parameter file
--skip-head=nnn ignore nnn first records of input file
equivalent to SKIP_HEADER statement in a parameter file
--throw-empty-records ignore empty records. Without this option, XSM will stop
with an error if it encounters empty (length=0) record on input
stream.
-i ignore lines shorter than sort keys (text file RECFM=V only)
equivalent to IOERROR IGNORE statement in a parameter file
-n n, --procs=n force multi-threading to n Threads
equivalent to OPTION PROCS statement in a parameter file
--norun test syntax of parmfile only. does not process
Short options and long options can be mixed together, as follow:
hxsm --infile=/tmp/f.in -o/tmp/f.out myparms.xsm
For an easy start, see the Command Line examples in the next chapter, "Sample sort jobs".
6. Sample sort jobs: the best way to get started!
Job 1. Single text file
- Method 1: input/output files are specified using UNIX style redirections
Parameter file job1.xsm:
SORT FIELDS=(14,7,B,A)
RECORD RECFM=V,LRECL=200
Command line:
hxsm job1.xsm < SAMPLE.INP > SAMPLE.OUT
- Method 2: input/output files are "hard coded" in parmfile
Parameter file job1.xsm:
SORT FIELDS=(14,7,B,A)
RECORD RECFM=V,LRECL=200
INPFIL SAMPLE.INP
OUTFIL SAMPLE.OUT
Command line:
hxsm job1.xsm
- Method 3: Full command line without parameter file
hxsm -k 14,7 -l 200 < SAMPLE.INP > SAMPLE.OUT
or
hxsm -k 14,7 -l 200 -oSAMPLE.OUT SAMPLE.INP
or
hxsm --key=14,7 --lrecl=200 --outfile=SAMPLE.OUT --infile=SAMPLE.INP
Job 2. Single text file, redirecting stdout stream
- Same as above, but the output will be redirected to a report program named 'MYREPORT.EXE'
Parameter file job2.xsm:
# Job 2 : OUTFIL is not specified, so XSM will output on stdout
# This is used to redirect output to another program
SORT FIELDS=(14,7,B,A)
RECORD RECFM=V,LRECL=200
INPFIL SAMPLE.INP
Command line:
hxsm job2.xsm | myreport
Full command line without parameter file:
hxsm -k 14,7 SAMPLE.INP | myreport
If you want all identical lines except first to be dropped, add -ur (or --unique-record) option:
hxsm -ur -k 14,7 SAMPLE.INP | myreport
Job 3. Single text file, redirecting stdin,stdout streams
- Same as above, but the input comes from an account program named 'ACCOUNT.EXE'
Parameter file job3.xsm:
# Job 3 : neither INPFIL nor OUTFIL are specified:
# UNIX style redirections are used instead
SORT FIELDS=(14,7,B,A)
RECORD RECFM=V,LRECL=200
Command line:
account | hxsm job3.xsm | myreport
Full command line without parameter file:
account | hxsm -k 14,7 | myreport
Job 4. Two Binary files, UNIX, two physical drives available
- All records are 180 char. long
- The sort key is an account number col 14 to 20, plus a date (yymmdd) col 2 to 7
- For each account, records are to be sorted on decreasing date
- The input files are '../s1' and '../s2', each about 4 Megabytes large
- The output file is '/usr/acct/sample.out'
- The second drive is mounted as /var
Parameter file job4.xsm:
SORT FIELDS=(14,7,B,A,2,6,B,D)
RECORD RECFM=F,LRECL=180
INPFIL ../s1 ; input file #1
INPFIL ../s2 ; input file #2
OUTFIL /usr/acct/sample.out ; output file
SORTWORKS /var/tmp,/tmp ; We use 2 disks for sortworks
STORAGE 4M ; Storage 4M to minimize I/O on sortworks
Command line:
hxsm job4.xsm
Full command line without parameter file:
hxsm -y 4200K -k 14,7 -k 2,6,d -r F -z 180 \
-t /var/tmp,/tmp \
-o /usr/acct/sample.out ../s1 ../s2
Job 5. Two Binary files, UNIX, two physical drives available
- Same as above, but only records which do not start with a '*' are to be processed
Parameter file job5.xsm:
SORT FIELDS=(14,7,B,A,2,6,B,D)
RECORD RECFM=F,LRECL=180
INPFIL ../s1
INPFIL ../s2
OUTFIL /usr/acct/sample.out
SORTWORKS /var/tmp,/tmp
OMIT 1,1,*
; Note:
; 1,1,* means pos 1, length 1, value: single star '*' char
; ==> gives expected results :)
; 1,1,\* would mean pos 1, length 1, value: any chars
; ==> would give strange results :(
Command line:
hxsm job5.xsm
No Full Command line available, because of the OMIT statement
Job 6. CSV Text File with variable fields separated by ':', UNIX, one disk drive
- Line size is maximum 100 chars, excluding UNIX LF or Windows CR/LF
- 1st key: a name, 3rd field, max length 40 bytes, ignore case, ascending
- 2nd key: a number, 2nd field, max length 10 digits (char '0' .. '9'), descending
- Drop lines with duplicate keys, except the 1st one
- OS UNIX/Linux
Parameter file job6.xsm:
SORT VFIELDS=(3,40,I,A,2,10,N,D),FIELDSEP=COLUMN
RECORD RECFM=T,LRECL=100
INPFIL ../s1
OUTFIL /usr/acct/sample.out
OMIT DUPKEYS ; short for OMIT DUPLICATE KEYS
Command line:
hxsm job6.xsm
No full command line, because of the variable fields (VFIELDS)
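What Job 6 asks for can be sketched in Python, as a way to check the expected result on a small sample. This is a hedged illustration, not XSM's implementation: the names job6_key and sort_and_dedupe are hypothetical, the sample lines are invented, and "the 1st one" is taken to mean the first in input order, which Python's stable sort preserves.

```python
def job6_key(line: str):
    """Build the sort key: 3rd field ignoring case, then 2nd field as a number."""
    fields = line.split(":")          # variable fields separated by ':'
    name = fields[2][:40].lower()     # max length 40, ignore case, ascending
    number = int(fields[1][:10])      # max 10 digits, descending (hence the minus)
    return (name, -number)

def sort_and_dedupe(lines):
    """Sort the lines and keep only the first line of each duplicate key."""
    result = []
    previous_key = None
    for line in sorted(lines, key=job6_key):
        key = job6_key(line)
        if key != previous_key:
            result.append(line)
            previous_key = key
    return result

sample = ["x:10:bob", "y:30:Alice", "z:20:alice", "w:30:ALICE"]
print(sort_and_dedupe(sample))
```

The line "w:30:ALICE" is dropped because its key (alice, 30) duplicates that of "y:30:Alice" once case is ignored.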
Variant: a CSV file (field separator is a semicolon), Windows
- Max Line size is 600 chars
- 1st key: a name, 3rd field, max length 40 bytes, Respect case, ascending
- 2nd key: a number, 2nd field, max length 10 digits (char '0' .. '9'), descending
- OS Windows
Parameter file job6b.xsm:
SORT VFIELDS=(3,40,C,A,2,10,N,D),FIELDSEP=SEMICOLUMN
RECORD RECFM=T,LRECL=600
INPFIL C:\data\myfile.csv
OUTFIL D:\data\myfile_sorted.csv
Note: using special FIELDSEP keywords is recommended for general punctuation chars to avoid syntax parsing errors, especially with the comment markers ';', '#'.
Command line:
hxsm job6b.xsm
No full command line, because of the variable fields (VFIELDS)
Job 7. Binary file with Packed Decimal zones, WIN32, one disk drive
- Record length is 110 bytes,
- 1st key from pos. 1 to 4, "Packed Decimal", descending order,
- 2nd key from pos. 21 to 40, Alphanumerical, ascending order.
Parameter file job7.xsm:
SORT FIELDS=(1,4,P,D,21,20,B,A)
RECORD RECFM=F,LRECL=110
INPFIL E:\tmp\s1.bin
OUTFIL E:\tmp\s2.bin
Command line:
hxsm job7.xsm
Full command line without parameter file:
hxsm -r F -l 110 -k 1,4,D,P -k 21,20 -o E:\tmp\s2.bin E:\tmp\s1.bin
Job 8. Text file with dates like 'mmddyy', UNIX, one disk drive
- text lines are at most 110 bytes long, excluding the trailing CR/LF,
- 1st key pos. 1-4, Alphanumeric, ascending order,
- 2nd key pos. 21-40 Alphanumeric, ascending order,
- 3rd key pos. 61-66 date like 'mmddyy', descending order
- the oldest date is past the year 1970 (Y2K pivot).
Parameter File job8.xsm:
SORT FIELDS=(1,4,BI,A,21,20,BI,A,65,2,Y2K,D,61,2,BI,D,63,2,BI,D)
RECORD RECFM=V,LRECL=110
INPFIL /tmp/s1.txt
OUTFIL /tmp/s2.txt
Command line:
hxsm job8.xsm
No full command line, because this 'Y2K' feature is supported only in parameter files.
To process dates before 1970 that belong to the 1900s, just add the statement:
OPTION Y2KSTART=19nn
Example:
OPTION Y2KSTART=1963 ; changing Y2K pivot from 1970 to 1963
64, 65, .... 99 will be processed as 19NN
00, 01, ... 63 will be processed as 20NN
See FAQ, Y2K for explanation on Y2K pivot.
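The pivot rule can be sketched in Python. The boundaries below follow the Y2KSTART=1963 example literally (64 to 99 become 19NN, 00 to 63 become 20NN); the function name expand_year is hypothetical, and XSM's internal handling may differ.

```python
def expand_year(yy: int, y2kstart: int = 1970) -> int:
    """Expand a 2-digit year to 4 digits around the Y2K pivot."""
    pivot = y2kstart % 100
    # Years above the pivot belong to the 1900s, the others to the 2000s
    return 1900 + yy if yy > pivot else 2000 + yy

# With OPTION Y2KSTART=1963
print(expand_year(64, 1963))
print(expand_year(99, 1963))
print(expand_year(0, 1963))

# Sorting 'mmddyy' dates in descending order: year, then month, then day
dates = ["123199", "010105", "063084"]
ordered = sorted(dates, key=lambda d: (expand_year(int(d[4:6])), d[0:2], d[2:4]), reverse=True)
print(ordered)
```

Without the pivot, '05' would sort before '84' and '99' although it stands for a later year.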
Job 9. Stripping duplicates lines in a text file
- text lines are at most 52 bytes long, excluding the trailing CR/LF,
- duplicate records are eliminated
Parameter File job9.xsm:
SORT FIELDS=all
RECORD RECFM=V,LRECL=52 ; excluding CR/LF
INPFIL /tmp/s1.txt
OUTFIL /tmp/s2.txt
OMIT DUPRECS ; short for OMIT DUPLICATE RECORDS
Command line:
hxsm job9.xsm
Full command line without parameter file:
hxsm -r V -l 52 -k all -ur -o /tmp/s2.txt /tmp/s1.txt
Job 10. Copying selectively a text file onto 3 others files
- input and output file names are set using environment variables
- text lines are at most 152 bytes long, excluding the trailing CR/LF
- all lines are copied onto file # 1
- lines beginning with name 'SCOTT' and containing 'TIGER' in cols. 16-20 are written onto output file # 2
- other lines are written onto output file # 3
- ddname for input file is SORTIN (could be any other variable name)
- ddnames for output files are SORTOF1, SORTOF2, SORTOF3 (syntax: SORTOF + a number, starting at 1).
Note the use of DD:varname, which requires the environment variable varname to be set.
The DD:varname form can be used in both INPFIL and OUTFIL statements.
Note the use of OUTFIL FILE=n, which requires the environment variable SORTOFn to be set.
Parameter File job10.xsm:
OPTION COPY
RECORD RECFM=V,LRECL=152 ; excluding CR/LF
INPFIL DD:SORTIN
OUTFIL FILE=1,INCLUDE=ALL
OUTFIL FILE=2,INCLUDE=(1,5,CH,EQ,'SCOTT',AND,16,5,CH,EQ,'TIGER')
OUTFIL FILE=3,INCLUDE=(1,5,CH,NE,'SCOTT',OR,16,5,CH,NE,'TIGER')
OMIT DUPRECS
Command line (UNIX):
SORTIN=/tmp/myappl/data_to_copy.text \
SORTOF1=/tmp/s1 \
SORTOF2=/tmp/s2 \
SORTOF3=/tmp/s3 \
hxsm job10.xsm
Command line (Windows)
setlocal
set SORTIN=/tmp/myappl/data_to_copy.text
set SORTOF1=/tmp/s1
rem (a backslash path such as \tmp\s1 also works)
set SORTOF2=/tmp/s2
set SORTOF3=/tmp/s3
hxsm job10.xsm
No full command line for this job, due to DD:varname and FILE=n forms.
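The two INCLUDE filters of this job are complementary: the condition for file 3 (not 'SCOTT' OR not 'TIGER') is the negation of the condition for file 2 ('SCOTT' AND 'TIGER'). The C sketch below shows that logic only. `field_eq` and `secondary_output` are hypothetical names, and treating a line too short to hold a field as "not equal" is an assumption, not documented XSM behaviour.

```c
#include <assert.h>
#include <string.h>

/* True when `line` holds `text` starting at 1-based column `col`.
 * Assumption: a line too short to contain the field does not match. */
static int field_eq(const char *line, size_t col, const char *text)
{
    size_t start = col - 1;
    size_t len = strlen(text);

    if (strlen(line) < start + len)
        return 0;
    return memcmp(line + start, text, len) == 0;
}

/* Output file # 1 receives every line (INCLUDE=ALL). This function says
 * which of files # 2 and # 3 also receives the line. */
int secondary_output(const char *line)
{
    int is_scott, is_tiger;

    assert(line != NULL);
    is_scott = field_eq(line, 1, "SCOTT");    /* 1,5,CH,EQ,'SCOTT'  */
    is_tiger = field_eq(line, 16, "TIGER");   /* 16,5,CH,EQ,'TIGER' */
    /* file 2: SCOTT AND TIGER; file 3: NOT SCOTT OR NOT TIGER */
    return (is_scott && is_tiger) ? 2 : 3;
}
```

Every line therefore lands in exactly one of the files 2 and 3, in addition to file 1.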
Job 11. Splitting a file selectively with filters
- Same as job 10 above, but using a sort operation (SORT FIELDS= statement) instead of a copy operation (OPTION COPY statement):
Parameter File job11.xsm:
SORT FIELDS=(1,5,C,A,16,5,C,A)
RECORD RECFM=V,LRECL=152 ; excluding CR/LF
INPFIL DD:SORTIN
OUTFIL FILE=1,INCLUDE=ALL
OUTFIL FILE=2,INCLUDE=(1,5,CH,EQ,'SCOTT',AND,16,5,CH,EQ,'TIGER')
OUTFIL FILE=3,INCLUDE=(1,5,CH,NE,'SCOTT',OR,16,5,CH,NE,'TIGER')
OMIT DUPRECS
Command line (UNIX):
SORTIN=/tmp/myappl/data_to_split.text \
SORTOF1=/tmp/s1 \
SORTOF2=/tmp/s2 \
SORTOF3=/tmp/s3 \
hxsm job11.xsm
No full command line for this job, due to DD:varname and FILE=n forms.
Job 12. Merging files and splitting selectively with filters
- Same as job 11 above but using 3 input files instead of one:
Parameter File job12.xsm:
SORT FIELDS=(1,5,C,A,16,5,C,A)
RECORD RECFM=V,LRECL=152 ; excluding CR/LF
INPFIL DD:MYINPUT1 ; variable name is free with DD: form
INPFIL DD:MYINPUT2 ; variable name is free with DD: form
INPFIL DD:MYINPUT3 ; variable name is free with DD: form
OUTFIL FILE=1,INCLUDE=ALL
OUTFIL FILE=2,INCLUDE=(1,5,CH,EQ,'SCOTT',AND,16,5,CH,EQ,'TIGER')
OUTFIL FILE=3,INCLUDE=(1,5,CH,NE,'SCOTT',OR,16,5,CH,NE,'TIGER')
OMIT DUPRECS
Command line (UNIX):
MYINPUT1=/tmp/myappl/data1 \
MYINPUT2=/tmp/myappl/data2 \
MYINPUT3=/tmp/myappl/data3 \
SORTOF1=/tmp/s1 \
SORTOF2=/tmp/s2 \
SORTOF3=/tmp/s3 \
hxsm job12.xsm
No full command line for this job, due to DD:varname and FILE=n forms.
7. Destructive and non destructive sort
XSM's default behavior is to run as a destructive sort: it does not guarantee that records having the same sort key(s) keep their original order.
For such records, sort programs can produce different and unpredictable results, depending on their own algorithms.
XSM becomes a non-destructive sort, also known as a "stable sort", with the following option:
OPTION KEEP_ORDER=YES (in a parameter file)
or
--keep-order (command line option)
Example: input file:
01234 SMITH John
01738 SMITH Robin
02284 ALLEN David
04264 ALLEN Bob
Destructive sort (default) result, using sort key = Name:
We get the ALLEN family, then the SMITH family; first names are not guaranteed to stay in the order they had in the input file.
04264 ALLEN Bob
02284 ALLEN David
01234 SMITH John
01738 SMITH Robin
Non-destructive (or "stable") sort result (sort key = Name):
Records with equal sort keys keep the order they had in the input file.
02284 ALLEN David
04264 ALLEN Bob
01234 SMITH John
01738 SMITH Robin
Note: forcing a stable sort may decrease performance, as the whole record has to be processed, not only the sort keys.
For best performance, it is recommended to add whatever supplementary sort keys are needed and avoid forcing a stable sort.
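The idea of a supplementary sort key can be illustrated in C with the standard `qsort`, which is not guaranteed to be stable either. In this sketch (hypothetical names `struct rec`, `cmp_rec`, `sort_keep_order`; not XSM code) the input sequence number is used as a tie-breaker, so records with equal names come out in input order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct rec {
    long seq;           /* input sequence number: the supplementary key */
    const char *name;   /* primary sort key */
    const char *line;   /* full record text */
};

/* Compare on name first, then on input sequence to break ties */
static int cmp_rec(const void *a, const void *b)
{
    const struct rec *x = a;
    const struct rec *y = b;
    int c = strcmp(x->name, y->name);

    if (c != 0)
        return c;
    return (x->seq > y->seq) - (x->seq < y->seq);
}

/* Number the records in input order, then sort them */
void sort_keep_order(struct rec *recs, size_t n)
{
    size_t i;

    if (n == 0)
        return;
    assert(recs != NULL);
    for (i = 0; i < n; i++)
        recs[i].seq = (long)i;
    qsort(recs, n, sizeof recs[0], cmp_rec);
}
```

In the example file above, the number in columns 1-5 already increases in input order within each family, so adding it as a second sort key would have the same effect as the sequence number used here.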
8. Performance Issues
See our Performance Table.
The XSM program is a multi-phase sort program: if it cannot sort
the whole set of input files internally in main storage, it writes
portions of sorted items to temporary files, then merges them onto
the final output file.
Performance depends on:
- the size of the files to be sorted
- the kind of records to be sorted (Fixed or Variable records)
- the amount of internal memory available
- the quality of the Operating System itself
- the number and speed of the physical disk drives available.
File size:
If the whole set of input files fits in internal memory, it is
read, then sorted, then written in one shot.
Otherwise, portions of sorted items are written to temporary files,
then merged onto other temporary files, and so on until the program
can merge all the temporary files onto the final output file.
In that case, performance is a matter of disk I/O speed.
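The merge step described above can be pictured with a small in-memory C sketch that merges two already sorted runs of lines. It shows the general technique only: XSM merges temporary files rather than arrays, and `merge_runs` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Merge two sorted runs of lines into out[], which must have room for
 * na + nb entries. Returns the number of lines written. */
size_t merge_runs(const char *const *a, size_t na,
                  const char *const *b, size_t nb,
                  const char **out)
{
    size_t i = 0, j = 0, k = 0;

    assert(out != NULL);
    /* repeatedly take the smaller head of the two runs */
    while (i < na && j < nb)
        out[k++] = (strcmp(a[i], b[j]) <= 0) ? a[i++] : b[j++];
    /* one run is exhausted: copy what is left of the other */
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
    return k;
}
```

Each line is read and written once per merge pass, which is why the speed of the disks holding the work files dominates when the input does not fit in memory.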
Fixed or Variable records:
Variable records (lines of text) are read into memory and written to
disk one line at a time, which slightly decreases performance.
Note that for such files, the VFIELDS option gives rather poor results.
Fixed records are read into memory and written to disk in big 'chunks',
which dramatically decreases the time spent reading and writing.
But as soon as 'INCLUDE/EXCLUDE/OMIT' statements are to be processed,
fixed records must be read one by one, just like variable records.
Memory available:
Depending on your Operating System, the XSM program will allocate
real or virtual memory to sort internally the largest number of
input records.
Although the amount of memory can be up to 512 Megabytes, care must be taken to avoid Operating System paging and swapping between disk and memory (Virtual storage management).
As of Version 5, XSM computes the amount of storage needed, generally:
1/4 * square root of total input size (in Megabytes).
In case of huge files, when this exceeds what the Operating System can provide,
you may specify a Storage limit to avoid paging and swapping.
Example: sorting a 50 Gigabyte file on a 128 MB machine.
XSM will compute its storage size for best performance: 57 MBytes (1/4 * square root of 51200).
You want to keep at least 80 MB for the Operating System:
STORAGE 48M
XSM will take more time in merge operations, but you will avoid the (in)famous paging/swapping 'System Stress'.
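The arithmetic of this example can be checked with a short C sketch. It assumes 1 Gigabyte = 1024 Megabytes and rounds the estimate up to a whole Megabyte, which reproduces the 57 MB figure. The function names are hypothetical, and the formula is only the general rule quoted above, not XSM's exact computation.

```c
#include <assert.h>

/* Estimate of the storage XSM would compute, in MB: the smallest whole
 * number s with s >= sqrt(input_mb) / 4, that is 16 * s * s >= input_mb.
 * Integer arithmetic is used so that no math library is needed. */
unsigned long long estimated_storage_mb(unsigned long long input_mb)
{
    unsigned long long s = 0;

    while (16ULL * s * s < input_mb)
        s++;
    return s;
}

/* Value for the STORAGE statement: the estimate, capped by the memory
 * left once reserved_mb has been set aside for the Operating System. */
unsigned long long storage_limit_mb(unsigned long long input_mb,
                                    unsigned long long ram_mb,
                                    unsigned long long reserved_mb)
{
    unsigned long long estimate, available;

    assert(ram_mb > reserved_mb);
    estimate = estimated_storage_mb(input_mb);
    available = ram_mb - reserved_mb;
    return estimate < available ? estimate : available;
}
```

For the case above, `estimated_storage_mb(50 * 1024)` gives 57 and `storage_limit_mb(50 * 1024, 128, 80)` gives 48, hence STORAGE 48M.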
Operating System:
The Operating System overhead is something difficult to manage.
Some hints:
UNIX: you may give your sort jobs a nice priority, though it might have little impact.
Windows: just playing with your mouse during a sort job can increase the duration by a factor of ten!
Best Practices:
- First set up and tune your XSM jobs on a system with no activity (no overload).
- Once you are happy with performance, move your XSM jobs to the target system (production activity).
- Then observe results. If XSM job performance has strongly decreased, it is time to audit your system and search for high resource consumers (CPU, memory, disk I/O).
Keep in mind that on today's machines, the bottleneck is physical disk I/O: CPU and memory are far faster than mechanical units such as disks.
Whenever you can, tune XSM jobs to use different physical disks that are not busy: it will make the difference.
Physical Disk drives
If the files to be sorted do not fit in internal memory,
work files are written to disk, as specified in the SORTWORKS statement,
or in the current directory.
If possible, have the input and output files (they can be the same)
on one physical drive, and the work files on another physical drive, as explained in the SORTWORKS section.
Be sure to have enough room on each drive (at least the output size
on one drive, and 1/4 of that size on the other).
On true multitasking systems (UNIX, OS/2, Windows NT), beware of
other programs (and the Operating System itself, for paging) using
those drives at the same time.
And, last but not least, please don't be confused:
Two logical partitions 'C:' and 'D:' on the same physical drive
are not equal to
Two physical drives named 'C:' and 'D:'
(same remark regarding the UNIX 'filesystems').
9. Running XSM as a User Exit
There are two ways of running XSM as an internal sort, using a User Exit :
- You provide your own I/O routines in a dynamic library described in the parameter file as indicated below, and XSM will load that library at initialization, then call the I/O routines as needed instead of doing the input/output itself.
- or, you write your own program that will call the XSM main routine with a parameter string and pointers to your own read/write functions, then compile and link-edit it with the library provided in the XSM package
Providing your own dynamic library
On Windows, the libraries are named 'SOMELIB.DLL'.
On UNIX/Linux, libs are named 'libsomething.a' for static libs or 'libsomething.so' for dynamic libs.
Example:
The following module is compiled, then linked to give 'libmysrt.a' (AIX), or 'libmysrt.so' (LINUX), or 'MYSRT.DLL' (Windows):
#include <stdio.h>
#include <string.h>
int myread(char * xsm_buf, int xsm_maxl) {
    /* this routine reads from stdin and arranges the lines to be sorted */
    /* lines beginning with a '*' are dropped */
    char mybuf[134];
    (void) xsm_maxl; /* not used in this example */
    do {
        memset(mybuf, 0, sizeof mybuf); /* short lines then give empty fields */
        if (fgets(mybuf, sizeof mybuf, stdin) == NULL) return EOF;
        mybuf[strcspn(mybuf, "\r\n")] = '\0'; /* fgets keeps the line end, drop it */
    } while (mybuf[0] == '*');
    /* set up the record to be sorted, return its length */
    return sprintf(xsm_buf, "%-5.5s%-20.20s%s", mybuf+20, mybuf, mybuf+25);
}
int mywrite(char * xsm_buf, int xsm_len) {
    /* this routine rearranges the sorted lines and writes them to stdout */
    if (xsm_len > 0)
        printf("%-20.20s%-5.5s%s\n", xsm_buf+5, xsm_buf, xsm_buf+25);
    return 0;
}
Then the library is moved to a directory listed in 'PATH' (Windows) or in 'LD_LIBRARY_PATH' (UNIX).
Finally, the hxsm module is run with the following parameter file:
SORT FIELDS=(25,5,B,D,1,20,B,A)
RECORD RECFM=V,LRECL=132
INPXIT LIBRARY=mysrt,ENTRY=myread
OUTXIT LIBRARY=mysrt,ENTRY=mywrite
Using the XSM routines
The XSM software includes a library with a public routine that may be called from your own programs written in any compiled language supporting link-edit, such as C, C++, C#, VB, ...:
the hxsm_begin routine
Synopsis: int hxsm_begin(parms_string, my_read_function, my_write_function)
where:
parms_string is either a parameter file name or the traditional
parameters, options, and flags string of the hxsm command line.
my_read_function is a pointer to your own input routine
with the following syntax:
int myread(char * buffer, int max_length)
and which, when called, should either provide the next record to be sorted
and return its length, or return -1 for the end of the input stream (EOF).
my_write_function is a pointer to your own output routine
with the following syntax:
int mywrite(char * buffer, int rec_length)
and which, when called, will receive either the next sorted record and its length,
or a NULL buffer and a length = -1 (EOF).
returns: 0 if OK, a negative number if a parameter is not correct.
Example:
The following program will prompt for a name and an amount, until the user hits the EOF (^Z or ^D) key, will sort and display all the input lines in decreasing amount order, with a total as the last display.
It is compiled and linked against the hxsm library:
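#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* declared here from the synopsis above; the exact parameter types are an assumption */
int hxsm_begin(const char *, int (*)(char *, int), int (*)(char *, int));
static double total = 0.0;
static int myread(char * buf, int maxlen) {
    char work_area_1[133], work_area_2[133];
    double amount;
    (void) maxlen; /* not used in this example */
    /* prompt for a name and an amount, returns -1 if EOF */
    printf("Name :" ); if (fgets(work_area_1, sizeof work_area_1, stdin) == NULL) return -1;
    printf("Amount :" ); if (fgets(work_area_2, sizeof work_area_2, stdin) == NULL) return -1;
    work_area_1[strcspn(work_area_1, "\r\n")] = '\0'; /* drop the line end kept by fgets */
    amount = atof(work_area_2);
    total += amount; /* accumulate the total displayed at the end */
    /* amount in columns 1-12 (the sort key), name from column 14 */
    return sprintf(buf, "%12.2f %-30.30s", amount, work_area_1);
}
static int mywrite(char * buf, int len) {
    if (len > 0)
        printf("%-30.30s %-12.12s\n", buf + 13, buf);
    return len; /* will be ignored by hxsm */
}
int main(void) {
    int rc = hxsm_begin("MYPRM.XSM", myread, mywrite);
    if (rc == 0)
        printf("%-30.30s %12.2f\n", "*** TOTAL ***", total);
    else printf("Failed, rc %d\n", rc);
    return rc;
}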
The program uses the following parameter file 'MYPRM.XSM':
SORT FIELDS=(1,12,B,D)
RECORD RECFM=V,LRECL=132
It can also be written with the following hxsm_begin statement:
int rc = hxsm_begin("-k1,12,d", myread, mywrite);
and then run straight without any parameter file.
For any further detail or help, please read our FAQ or contact our support.
|