





            


                       TTSS ---- AA SSiimmppllee TTookkeenn SSccaannnniinngg LLiibbrraarryy
                                    RReelleeaassee 11..0066


                                    _P_a_u_l _D_u_B_o_i_s
                              _d_u_b_o_i_s_@_p_r_i_m_a_t_e_._w_i_s_c_._e_d_u
                     Wisconsin Regional Primate Research Center
                          Revision date:  18 October 1993


            Applications  often wish to pull strings apart into individ-
            ual tokens.  This document describes TS, a library  consist-
            ing  of  an unsophisticated set of routines providing simple
            token scanning operations.

            String tokenizing can often  be  done  satisfactorily  using
            strtok() or an equivalent function from the C library.  When
            such routines are insufficient, the routines described  here
            may  be  useful.   They offer, for example, quote and escape
            character parsing, and configurability of  underlying  scan-
            ning  properties,  within the confines of a fixed interface.
            TS provides a simple built-in scanner, which may be replaced
            by  alternate  routines as desired.  Applications may switch
            back and forth between scanners on the fly.
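
            As an illustration of where strtok() falls short, the sketch
            below  splits  a buffer using only the standard C library.
            It is not part of TS, and the helper name split_with_strtok
            is  made up for this example.  A quoted phrase is broken at
            its embedded space, and empty fields between adjacent delim-
            iters are silently dropped.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of TS): split buf in place with the
 * standard strtok() and store up to max token pointers in tok[].
 * Returns the number of tokens stored.
 */
int
split_with_strtok (char *buf, const char *delims, char *tok[], int max)
{
     int  n = 0;
     char *p;

     assert (buf != NULL && delims != NULL && tok != NULL);
     for (p = strtok (buf, delims); p != NULL && n < max;
          p = strtok (NULL, delims))
     {
          tok[n++] = p;      /* strtok() knows nothing about quotes */
     }
     return (n);
}
```

            Given the line "This is" a line with space and tab as delim-
            iters, this helper produces four tokens, the first two being
            "This  and  is", where a quote-aware scanner would produce
            three.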

            11..  IInnssttaallllaattiioonn

            This release of TS is configured using imake and the  WRPRC2
            configuration  files,  so you also need to obtain the WRPRC2
            configuration distribution if you want to build it the usual
            way.   (If  you  want to avoid imake, the Makefile is simple
            enough that you should be able to tweak it by hand.)

            There is one library  to  be  built,  libtokenscan.a.   That
            library  should  be installed in a system library directory.
            The header file tokenscan.h should be installed in a  system
            header file directory.

            22..  EExxaammppllee

            The  canonical  method  of tokenizing a string with TS is as
            follows:












            Revision date:  18 October 1993   Printed:  19 October 19125







                 char buf[size], *p;

                 /* _._._._i_n_i_t_i_a_l_i_z_e _c_o_n_t_e_n_t_s _o_f _b_u_f _h_e_r_e_._._. */
                 TSScanInit (buf);   /* initialize scanner */
                 while ((p = TSScan ()) != (char *) NULL)
                 {
                      /* _._._._d_o _s_o_m_e_t_h_i_n_g _h_e_r_e _w_i_t_h _t_o_k_e_n _p_o_i_n_t_e_d _t_o _b_y _p _h_e_r_e_._._. */
                 }


            The scanner is initialized  by  passing  the  string  to  be
            scanned to TSScanInit() and TSScan() is called to get point-
            ers to successive tokens.  TSScan() returns NULL when  there
            are no more.

            33..  BBeehhaavviioorr ooff tthhee DDeeffaauulltt SSccaannnneerr

            The  default  scanner is destructive in that it modifies the
            string scanned (it writes nulls at the  end  of  each  token
            found),  so make a copy of the scanned string if you need to
            maintain an intact version.

            The scanner is controlled by delimiter, quote,  escape,  and
            end-of-string  (EOS)  characters.   The defaults for each of
            these are given below.


                 delimiter    space  tab
                 quote        "  '
                 escape       \
                 EOS          null  linefeed  carriage-return


            In  the  simplest  case,  tokens are sequences of characters
            between delimiters.  Since the default  delimiters  are  the
            whitespace characters space and tab, tokens are sequences of
            non-whitespace characters.


                 This is a line      ->   <This> <is> <a> <line>


            Quotes may be used to include  whitespace  within  a  token.
            Quotes  must match; hence one quote character may be used to
            quote another kind of quote character, if there is more than
            one.


                 "This is" a line    ->   <This is> <a> <line>
                 This" "is a line    ->   <This is> <a> <line>
                 "'" '"'             ->   <'> <">
                 "'"'"'              ->   <'">


            The  escape  character  turns off any special meaning of the
            next character, including another escape character.





                 What\'s up          ->   <What's> <up>
                 \\ is the escape    ->   <\> <is> <the> <escape>


            The EOS characters tell the scanner when to  quit  scanning.
            A null character always terminates the scan.  In the default
            case, linefeed and carriage return do as well.
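
            The rules above can be sketched as a  small  self-contained
            scanner.   This is an illustration only, not the TS source:
            the names SketchScanInit and SketchScan are hypothetical, and
            two points the text leaves open are decided by assumption
            here, namely that the escape character also works  inside
            quotes  and  that an unescaped EOS character ends the scan
            even inside quotes.  Like the default scanner, the sketch is
            destructive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Default character sets as described in the text. */
static const char *sk_delim  = " \t";
static const char *sk_quote  = "\"'";
static const char *sk_escape = "\\";
static const char *sk_eos    = "\n\r";   /* null is always EOS as well */

static char *sk_pos = NULL;              /* current scan position */

/* Non-zero if c is a non-null member of set. */
static int
in_set (const char *set, char c)
{
     return (c != '\0' && strchr (set, c) != NULL);
}

/* Hypothetical stand-in for scan initialization. */
void
SketchScanInit (char *s)
{
     assert (s != NULL);
     sk_pos = s;
}

/*
 * Hypothetical stand-in for a scan routine.  Returns the next token,
 * or NULL when none remain.  The token is compacted in place (quote
 * and escape characters are removed) and null-terminated.
 */
char *
SketchScan (void)
{
     char *in = sk_pos, *out, *start;
     char quote = '\0';       /* active quote character, if any */
     int  atEos = 0;          /* set when an EOS character is reached */

     if (in == NULL)
          return (NULL);
     while (in_set (sk_delim, *in))          /* skip leading delimiters */
          ++in;
     if (*in == '\0' || in_set (sk_eos, *in))
     {
          sk_pos = NULL;                     /* stay exhausted until re-init */
          return (NULL);
     }
     start = out = in;
     for (;;)
     {
          char c = *in;

          if (c == '\0' || in_set (sk_eos, c))
          {
               atEos = 1;
               break;
          }
          ++in;
          if (in_set (sk_escape, c) && *in != '\0')
               *out++ = *in++;               /* next character is literal */
          else if (quote != '\0')
          {
               if (c == quote)
                    quote = '\0';            /* matching quote closes */
               else
                    *out++ = c;
          }
          else if (in_set (sk_quote, c))
               quote = c;                    /* open a quoted section */
          else if (in_set (sk_delim, c))
               break;                        /* end of token */
          else
               *out++ = c;
     }
     *out = '\0';
     sk_pos = atEos ? NULL : in;
     return (start);
}
```

            Because the output pointer never runs ahead  of  the  input
            pointer,  the  token can be rewritten in the same buffer it
            was read from, which is what makes this style  of  scanner
            destructive.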

            You can replace the delimiter, quote, escape, or EOS charac-
            ter sets.  This changes the particular characters that trig-
            ger the  above  behaviors,  without  changing  the  way  the
            default  scan  algorithm works.  Or you can replace the scan
            routine to make the scanner  behave  in  entirely  different
            ways.

            By  default,  multiple  consecutive delimiter characters are
            treated as a single delimiter.  A flag may  be  set  in  the
            scanner  structure  to  suppress delimiter concatenation, so
            that every delimiter character is significant.  This is use-
            ful  for  tokenizing  strings  in  which  empty  fields  are
            allowed: two consecutive delimiters are considered  to  have
            an empty token between them, and a delimiter appearing at the
            beginning or end of a string signifies an empty token at the
            beginning or end of the string, respectively.

            The  difference  in treatment of strings when delimiters are
            concatenated versus when they are not is illustrated  below.
            Suppose the delimiter is colon (:) and the string to be tok-
            enized is:

                 :a:b::c:

            When delimiters are concatenated, the string contains  three
            tokens:


                 :a:b::c:       -> <a> <b> <c>


            When all delimiters are significant,  the  string  contains
            three empty tokens in addition:


                 :a:b::c:       -> <> <a> <b> <> <c> <>
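
            The two treatments can be compared with a small  standalone
            splitter.   This is a sketch, not TS code: the function name
            split_fields and its no_concat argument are hypothetical, and
            quotes, escapes, and EOS characters other than null are left
            out to keep the delimiter handling visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split s in place on a single delimiter
 * character.  If no_concat is zero, runs of delimiters count as one;
 * otherwise every delimiter is significant and empty tokens are
 * reported.  Stores up to max token pointers and returns the count.
 */
int
split_fields (char *s, char delim, int no_concat, char *tok[], int max)
{
     int  n = 0;
     char *p = s;

     assert (s != NULL && tok != NULL && delim != '\0');
     for (;;)
     {
          char *end;

          if (!no_concat)
          {
               while (*p == delim)      /* a run of delimiters counts once */
                    ++p;
               if (*p == '\0')
                    break;              /* nothing left but delimiters */
          }
          if (n == max)
               break;
          tok[n++] = p;
          end = strchr (p, delim);
          if (end == NULL)
               break;                   /* last token runs to the null */
          *end = '\0';
          p = end + 1;
     }
     return (n);
}
```

            For the string :a:b::c: this  returns  3  when  no_concat  is
            zero  and  6 when it is non-zero, matching the two listings
            above.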


            44..  PPrrooggrraammmmiinngg IInntteerrffaaccee

            Source files using TS routines  should  include  tokenscan.h
            and executables should be linked with -ltokenscan.

            A scanner is described by a data structure:







                 typedef struct TSScanner TSScanner;
                 struct TSScanner
                 {
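                      void (*scanInit) ();     /* extra initialization, or NULL */
                      char *(*scanScan) ();    /* routine that scans next token */
                      char *scanDelim;         /* delimiter character set */
                      char *scanQuote;         /* quote character set */
                      char *scanEscape;        /* escape character set */
                      char *scanEos;           /* end-of-string character set */
                      int  scanFlags;          /* flags modifying scan behavior */
                 };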


            Scanner   structures  may  be  obtained  or  installed  with
            TSGetScanner() and TSSetScanner().

            For each string to be  scanned,  the  application  passes  a
            pointer to it to TSScanInit(), which takes care of scan ini-
            tialization.  If the application requires initialization  to
            be  performed  in  addition to that done internally by TS, a
            pointer to a routine that does so should be installed in the
            scanInit field of the scanner data structure.  That  routine
            takes  one  argument,  a  pointer  to  the  string  to  be
            scanned.  The default scanInit is NULL, in  which  case  no
            additional initialization is performed.

            scanDelim,  scanQuote,  scanEscape, and scanEos are pointers
            to null-terminated strings consisting of the set of  charac-
            ters  to  be  considered  delimiter,  quote, escape, and EOS
            characters, respectively.  The default values were described
            previously.

            scanScan  points  to  the routine that does the actual scan-
            ning.  It is called by TSScan() and should  be  declared  to
            take no arguments and return a character pointer to the next
            token in the current scan buffer.   Normally,  this  routine
            does  the  following: call TSGetScanPos() to get the current
            scan position, scan the token, call TSSetScanPos() to update
            the scan position, then return a pointer to the beginning of
            the token.  If there are no more tokens in the scan  buffer,
            the routine should return NULL, and should continue to do so
            until TSScanInit() is called again.

            scanFlags contains flags that modify the scanner's behavior.
            For  the  default  scanner,  the  default is zero.  If the
            tsNoConcatDelims flag is set, the scanner  stops  on  every
            delimiter  rather  than  treating  sequences of contiguous
            delimiters as a single delimiter.

            The public routines in the TS library are described below.

            vvooiidd TTSSSSccaannIInniitt ((pp))
            cchhaarr    **pp;;

            Initializes the scanning  routines  to  make  the  character



            Revision date:  18 October 1993   Printed:  19 October 19125





                                        - 5 -     Token Scanning Library


            string pointed to by p the current scan buffer.

            cchhaarr **TTSSSSccaann (())

            Returns  a  pointer  to  the  next token in the current scan
            buffer, NULL if there are no more.  The token is  terminated
            by a null byte.  Scan behavior may be modified by substitut-
            ing alternate scan routines.

            Once TSScan() returns NULL, it continues to do so until  the
            scanner  is reinitialized with another call to TSScanInit().

            vvooiidd TTSSGGeettSSccaannnneerr ((pp))
            TTSSSSccaannnneerr**pp;;

            Gets the current  scanner  information  (initialization  and
            scan procedures; delimiter, quote, escape, and EOS character
            sets; and scanner flags) into the structure pointed to by p.

            vvooiidd TTSSSSeettSSccaannnneerr ((pp))
            TTSSSSccaannnneerr**pp;;

            Installs  a scanner.  If p itself if NULL, all default scan-
            ner values are reinstalled.  Otherwise, any pointer field in
            p  with a NULL value causes the corresponding value from the
            default scanner to be reinstalled, and  if  p->scanFlags  is
            zero,  the scanner flags are set to the default (also zero).
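
            The defaulting rule can be pictured  with  a  reduced  model.
            The  sketch  below  is not the library's code: DemoScanner,
            DemoSetScanner, and DemoGetScanner are hypothetical names, and
            the structure carries only the four character-set fields and
            the flags.  Default values are the  ones  listed  for  the
            default scanner.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the scanner structure. */
typedef struct DemoScanner
{
     const char *delim;
     const char *quote;
     const char *escape;
     const char *eos;
     int  flags;
} DemoScanner;

static const DemoScanner demoDefault = { " \t", "\"'", "\\", "\n\r", 0 };
static DemoScanner demoCurrent = { " \t", "\"'", "\\", "\n\r", 0 };

/* Install p; NULL (or a NULL pointer field) selects the default. */
void
DemoSetScanner (const DemoScanner *p)
{
     if (p == NULL)
     {
          demoCurrent = demoDefault;
          return;
     }
     demoCurrent.delim  = p->delim  ? p->delim  : demoDefault.delim;
     demoCurrent.quote  = p->quote  ? p->quote  : demoDefault.quote;
     demoCurrent.escape = p->escape ? p->escape : demoDefault.escape;
     demoCurrent.eos    = p->eos    ? p->eos    : demoDefault.eos;
     /* zero flags are the default, so the value is copied as is */
     demoCurrent.flags  = p->flags;
}

/* Copy the current settings into the caller's structure. */
void
DemoGetScanner (DemoScanner *p)
{
     assert (p != NULL);
     *p = demoCurrent;
}
```

            One consequence of this rule is that a caller can  reset  a
            single  character  set  to its default by storing NULL in the
            corresponding field before installing the structure.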

            vvooiidd TTSSGGeettSSccaannPPooss ((pp))
            cchhaarr    ****pp;;

            Puts the current position within  the  current  scan  buffer
            into  the argument, which should be passed as the address of
            a character pointer.  This is useful when you want  to  scan
            only enough of the buffer to partially classify it, then use
            the rest in some other way.

            vvooiidd TTSSSSeettSSccaannPPooss ((pp))
            cchhaarr    **pp;;

            Set the current scan position to p.

            iinntt TTSSIIssSSccaannDDeelliimm ((cc))
            cchhaarr    cc;;

            Returns non-zero if c is a member of the  current  delimiter
            character set, zero otherwise.

            iinntt TTSSIIssSSccaannQQuuoottee ((cc))
            cchhaarr    cc;;

            Returns non-zero if c is a member of the current quote char-
            acter set, zero otherwise.






            iinntt TTSSIIssSSccaannEEssccaappee ((cc))
            cchhaarr    cc;;

            Returns non-zero if c is a  member  of  the  current  escape
            character set, zero otherwise.

            iinntt TTSSIIssSSccaannEEooss ((cc))
            cchhaarr    cc;;

            Returns  non-zero  if  c is an end-of-string character, zero
            otherwise.

            iinntt TTSSTTeessttSSccaannFFllaaggss ((ffllaaggss))
            iinntt     ffllaaggss

            Returns non-zero if all bits in flags are set for  the  cur-
            rent scanner, zero otherwise.

            44..11..  OOvveerrrriiddiinngg SSccaannnniinngg RRoouuttiinneess

            It  is possible to switch back and forth between scan proce-
            dures on the fly, even in the middle of scanning  a  string.
            The  general  procedure  is to use TSGetScanner() to get the
            current scanner information, and TSSetScanner() to install a
            new  one  and  reinstall the old one when done with the new.
            If you switch between more than two scanners, another method
            may be necessary.

            It is possible to modify the default scanner without replac-
            ing it.  For instance, you could change the  default  delim-
            iter set but leave everything else the same, as follows:


                 TSScanner scanStruct;

                 TSGetScanner (&scanStruct);
                 scanStruct.scanDelim = " \t:;?,!";
                 TSSetScanner (&scanStruct);


            55..  MMiisscceellllaanneeoouuss

            A  scanner  can  be  nondestructive with respect to the line
            being scanned by using a scan routine that copies characters
            out of the scanned line into a second buffer and  returns  a
            pointer to the second buffer.  The  second  buffer  must  be
            large  enough to hold the largest possible token, of course.
            If the second buffer is a fixed area, the  host  application
            must  be careful not to call TSScan() again until it is done
            with the current token, or else make a copy of it first.  If
            the  second buffer is dynamically allocated, the application
            must be  ready  to  do  storage  management  of  the  tokens
            returned.
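
            A fixed-buffer version of this idea might look like the fol-
            lowing  sketch.  It is standalone rather than wired into TS:
            NdScanInit, NdScan, and ND_TOKMAX are  hypothetical  names,
            only space and tab are treated as delimiters, and a token too
            long for the buffer is truncated, which  is  an  assumption
            made  here  for safety.  A routine intended for the scanScan
            field would obtain and update the scan position with TSGet-
            ScanPos() and TSSetScanPos() instead of a private pointer.

```c
#include <assert.h>
#include <stddef.h>

#define ND_TOKMAX 64                 /* hypothetical token size limit */

static const char *nd_pos = NULL;    /* current scan position */
static char nd_tok[ND_TOKMAX];       /* fixed second buffer */

/* Begin scanning s; the string itself is never modified. */
void
NdScanInit (const char *s)
{
     assert (s != NULL);
     nd_pos = s;
}

/*
 * Copy the next whitespace-delimited token into the fixed buffer and
 * return a pointer to that buffer, or NULL when no tokens remain.
 * Each call overwrites the token returned by the previous call.
 */
char *
NdScan (void)
{
     size_t n = 0;

     if (nd_pos == NULL)
          return (NULL);
     while (*nd_pos == ' ' || *nd_pos == '\t')
          ++nd_pos;
     if (*nd_pos == '\0')
     {
          nd_pos = NULL;              /* stay exhausted until re-init */
          return (NULL);
     }
     while (*nd_pos != '\0' && *nd_pos != ' ' && *nd_pos != '\t')
     {
          if (n < ND_TOKMAX - 1)      /* drop characters that do not fit */
               nd_tok[n++] = *nd_pos;
          ++nd_pos;
     }
     nd_tok[n] = '\0';
     return (nd_tok);
}
```

            Because every call returns the same buffer, a pointer  saved
            from  one  call  refers  to the next token after the following
            call; the caller has to copy any token it wants to keep.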






            Some  scanners  might  not need delimiter, quote, escape, or
            EOS characters at all, particularly if token boundaries  are
            context sensitive.

            66..  DDiissttrriibbuuttiioonn aanndd UUppddaattee AAvvaaiillaabbiilliittyy

            The  TS  distribution may be freely circulated and is avail-
            able for anonymous FTP access in the  /pub/TS  directory  on
            host  ftp.primate.wisc.edu.   Updates  appear  there as they
            become available.

            The WRPRC2 imake configuration file distribution  is  avail-
            able on ftp.primate.wisc.edu as well, in /pub/imake-stuff.

            If  you  do   not   have   FTP   access,   send   requests   to
            software@primate.wisc.edu.  Bug reports, questions, sugges-
            tions and comments may be sent to this address as well.










































