Posted tagged ‘string’

Classifying Strings

April 23, 2012

Purpose

To determine ways to classify strings into two sets, as either contained by a data set or not contained by a data set.

Living on the fast track? Click for conclusion.

Obvious Solution (Ifs)

Well the most obvious solution is a chain of if‘s and else if‘s. It’s also the simplest… well, at first.

if (strcmp(input, "Hello") == 0) {
    puts("I like that word.");
} else if (strcmp(input, "World") == 0) {
    puts("I like that word.");
} else {
    puts("I don't know what you said.");
}

However, an understanding of how this works shows why it is a bad idea for a large number of strings. Let’s use the above program, for simplicity’s sake.

If the input is “Hello”, then the code is something like this:

Is the first letter ‘H’? Yes.

Is the next letter ‘e’? Yes.

Is the next letter ‘l’? Yes.

Is the next letter ‘l’? Yes.

Is the next letter ‘o’? Yes.

Is the last character a ‘0’? Yes. Return 0.

That is six comparisons to compare the string. If the input is “World”, then the code is:

Is the first letter ‘H’? No. Return non-zero -> if fails.

Is the first letter ‘W’? Yes.

Is the next letter ‘o’? Yes.

Is the next letter ‘r’? Yes.

Is the next letter ‘l’? Yes.

Is the next letter ‘d’? Yes.

Is the last character a ‘0’? Yes. Return 0.

So now, we have 7 comparisons. But what if the input is “foo” (neither “Hello” nor “World”)…

Is the first letter ‘H’? No. Return non-zero -> if fails.

Is the first letter ‘W’? No. Return non-zero -> else if fails.

We’re out of options. Execute the else.

The result is very slow for large amounts of strings with large lengths. Below is some personally crafted code that takes an input file and returns an if construct using strcmp. It is obviously very rudimentary, but it was created to illustrate a point.

/* Copyright (C) 2012 <https://dashmorethan.wordpress.com>
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included
 * in all copies or substantial portions of the Software.
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int read_string(FILE *fp, unsigned char **string);

int main(int argc, char **argv) {
    FILE *input;
    unsigned char *in;

    if (argc != 2) {
        printf("usage: %s [input file]\n", argv[0]);
        return 1;
    }

    input = fopen(argv[1], "r");
    if (input == NULL) {
        printf("File %s does not exist (input file).\n", argv[1]);
        return 2;
    }

    if (read_string(input, &in) != 0) {
        fclose(input);
        return 3;
    }
    printf("if (strcmp(input, \"%s\") == 0) {\n", in);

    while (read_string(input, &in) == 0) {
        printf("\telse if (strcmp(input, \"%s\") == 0) {\n\t\t\n}",
            in);
        free(in);
    }

    puts("\telse {\n\t\t\n}");

    fclose(input);

    return 0;
}

int read_string(FILE *fp, unsigned char **string) {
    int count = 0;
    int current = 0;
    *string = NULL;

    while ((unsigned char)current != '\n') {
        current = fgetc(fp);
        if (current == EOF)
            return -2;
        count++;
    }

    fseek(fp, -count, SEEK_CUR);
    *string = malloc(count);
    if (*string == NULL) {
        return -1;
    }

    for (current = 0; current < count; current++) {
        (*string)[current] = fgetc(fp);
    }
    (*string)[count - 1] = '\0';

    return 0;
}

Rolling our own Hash Function

It would be nice if we could do:

switch (input) {
    case "Hello":
        puts("I like that word.");
        break;
    case "World":
        puts("I like that word.");
        break;
    default:
        puts("I don't know what you said.");
        break;
}

But no such thing exists, because the switch construct only accepts integers for cases. The next best thing is a hash function, which would look something like this:

switch (hash(input)) {
    case hash("Hello"):
        puts("I like that word.");
        break;
    case hash("World"):
        puts("I like that word.");
        break;
    default:
        puts("I don't know what you said.");
        break;
}

Unfortunately, this is still unallowed. Even if hash is a function, C will refuse to call it and output the results. However, we could precompute the hashes at run-time, using a pre-determined hash function. Below is a program that uses the sdbm (actually, gawk’s implementation of the sdbm hash) and generates code that will hash a string and then create a switch construct for an input input.

/* Copyright (C) 2012 <https://dashmorethan.wordpress.com>
 *
 * Permission is hereby granted, free of charge, to any person obtaining a
 * copy of this software and associated documentation files (the "Software"),
 * to deal in the Software without restriction, including without limitation
 * the rights to use, copy, modify, merge, publish, distribute, sublicense,
 * and/or sell copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included
 * in all copies or substantial portions of the Software.
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

uint32_t hash(unsigned char *str);
int read_string(FILE *fp, unsigned char **string);

int main(int argc, char **argv) {
    FILE *input;
    unsigned char *in;

    if (argc != 2) {
        printf("usage: %s [input file]\n", argv[0]);
        return 1;
    }

    input = fopen(argv[1], "r");
    if (input == NULL) {
        printf("File %s does not exist (input file).\n", argv[1]);
        return 2;
    }

    puts("#include <stdint.h>\n\
/* sdbm / gawk hash */\n\
uint32_t hash(unsigned char *str) {\n\
    uint32_t hash = 0;\n\
    int c;\n\
\n\
    while ((c = *str++))\n\
        hash = c + (hash << 6) + (hash << 16) - hash;\n\
    return hash;\n\
}\n\
switch (hash(input)) {\n");

    while (read_string(input, &in) == 0) {
        printf("\tcase %lu: /*%s*/\n\t\t\n\t\tbreak;\n",
            (unsigned long)hash(in), in);
        free(in);
    }

    puts("\tdefault:\n\n\t\tbreak;\n}\n");

    fclose(input);

    return 0;
}

/* sdbm / gawk hash */
uint32_t hash(unsigned char *str) {
    uint32_t hash = 0;
    int c;

    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}

int read_string(FILE *fp, unsigned char **string) {
    int count = 0;
    int current = 0;
    *string = NULL;

    while ((unsigned char)current != '\n') {
        current = fgetc(fp);
        if (current == EOF)
            return -2;
        count++;
    }

    fseek(fp, -count, SEEK_CUR);
    *string = malloc(count);
    if (*string == NULL) {
        return -1;
    }

    for (current = 0; current < count; current++) {
        (*string)[current] = fgetc(fp);
    }
    (*string)[count - 1] = '\0';

    return 0;
}

And the output (when our own code is added to it):

#include <stdint.h>
/* sdbm / gawk hash */
uint32_t hash(unsigned char *str) {
    uint32_t hash = 0;
    int c;

    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;
    return hash;
}
switch (hash(input)) {
    /* fall through */
    case 2873473298: /*Hello*/
    case 2052574866: /*World*/
        puts("I like that word.");
        break;
    default:
        puts("I don't know what you said.");
        break;
}

However there is a major problem with this method, hash collisions. If a string could be generated such that it hashes to 2052574866 (for World), then it would appear to be right despite possibly being wrong! To decrease the chances of this happening, we could use a stronger hash function. Alternatively, we may check that the string is actually the one expected. But programs such as our hash creator have already been made, and they are far more professional than anything that you can roll at home.

gperf – the perfect hash generator

gperf is a program that generates a perfect hash function in C code. By perfect, we mean it has no collisions, and it only needs one string comparison. This article will only cover the bare minimum, but I highly recommend you read IBM’s excellent guide on the subject. You can produce a gperf for C by just running gperf -L C -C #input file#, where “#input file#” is a list of strings.

Here is our Hello World example in gperf:

/* C code produced by gperf version 3.0.3 */
/* Command-line: gperf -L C -C hello_world  */
/* Computed positions: -k'1' */

#if !((' ' == 32) && ('!' == 33) && ('"' == 34) && ('#' == 35) \
      && ('%' == 37) && ('&' == 38) && ('\'' == 39) && ('(' == 40) \
      && (')' == 41) && ('*' == 42) && ('+' == 43) && (',' == 44) \
      && ('-' == 45) && ('.' == 46) && ('/' == 47) && ('0' == 48) \
      && ('1' == 49) && ('2' == 50) && ('3' == 51) && ('4' == 52) \
      && ('5' == 53) && ('6' == 54) && ('7' == 55) && ('8' == 56) \
      && ('9' == 57) && (':' == 58) && (';' == 59) && ('<' == 60) \
      && ('=' == 61) && ('>' == 62) && ('?' == 63) && ('A' == 65) \
      && ('B' == 66) && ('C' == 67) && ('D' == 68) && ('E' == 69) \
      && ('F' == 70) && ('G' == 71) && ('H' == 72) && ('I' == 73) \
      && ('J' == 74) && ('K' == 75) && ('L' == 76) && ('M' == 77) \
      && ('N' == 78) && ('O' == 79) && ('P' == 80) && ('Q' == 81) \
      && ('R' == 82) && ('S' == 83) && ('T' == 84) && ('U' == 85) \
      && ('V' == 86) && ('W' == 87) && ('X' == 88) && ('Y' == 89) \
      && ('Z' == 90) && ('[' == 91) && ('\\' == 92) && (']' == 93) \
      && ('^' == 94) && ('_' == 95) && ('a' == 97) && ('b' == 98) \
      && ('c' == 99) && ('d' == 100) && ('e' == 101) && ('f' == 102) \
      && ('g' == 103) && ('h' == 104) && ('i' == 105) && ('j' == 106) \
      && ('k' == 107) && ('l' == 108) && ('m' == 109) && ('n' == 110) \
      && ('o' == 111) && ('p' == 112) && ('q' == 113) && ('r' == 114) \
      && ('s' == 115) && ('t' == 116) && ('u' == 117) && ('v' == 118) \
      && ('w' == 119) && ('x' == 120) && ('y' == 121) && ('z' == 122) \
      && ('{' == 123) && ('|' == 124) && ('}' == 125) && ('~' == 126))
/* The character set is not based on ISO-646.  */
error "gperf generated tables don't work with this execution character set. Please report a bug to <bug-gnu-gperf@gnu.org>."
#endif

#define TOTAL_KEYWORDS 2
#define MIN_WORD_LENGTH 5
#define MAX_WORD_LENGTH 5
#define MIN_HASH_VALUE 5
#define MAX_HASH_VALUE 6
/* maximum key range = 2, duplicates = 0 */

#ifdef __GNUC__
__inline
#else
#ifdef __cplusplus
inline
#endif
#endif
static unsigned int
hash (str, len)
     register const char *str;
     register unsigned int len;
{
  static const unsigned char asso_values[] =
    {
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 1, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 0, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
      7, 7, 7, 7, 7, 7
    };
  return len + asso_values[(unsigned char)str[0]];
}

#ifdef __GNUC__
__inline
#ifdef __GNUC_STDC_INLINE__
__attribute__ ((__gnu_inline__))
#endif
#endif
const char *
in_word_set (str, len)
     register const char *str;
     register unsigned int len;
{
  static const char * const wordlist[] =
    {
      "", "", "", "", "",
      "World",
      "Hello"
    };

  if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
    {
      register int key = hash (str, len);

      if (key <= MAX_HASH_VALUE && key >= 0)
        {
          register const char *s = wordlist[key];

          if (*str == *s && !strcmp (str + 1, s + 1))
            return s;
        }
    }
  return 0;
}

And what do we need to do?

if (in_word_set(input, strlen(input)) {
    puts("I like that word.");
} else {
    puts("I don't know what you said.");
}

Wonderful (we could also change in_word_set to another function with a better name).

Benchmarks

Objective

Given a list of states and territories in the United States of America determine if a certain input is a state or territory or is not a state or territory. (Total Count: 56).

States and Territories

Alabama
Alaska
American Samoa
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Guam
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Northern Marianas Islands
Ohio
Oklahoma
Oregon
Pennsylvania
Puerto Rico
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Vermont
Virginia
Virgin Islands
Washington
West Virginia
Wisconsin
Wyoming

Results

A file states, that contains 98496 test cases (including some random gibberish for default cases), was piped into three programs, one using if, the other using sdbm, and the last using gperf. They were compiled with the highest optimization settings (gcc -O3). The average for 10 tests are below in seconds:

IF: 0.046
SDBM: 0.017
GPERF: 0.015

The sizes of the programs in bytes (stripped) follows:

10320 state_if
 6216 state_hash
10320 state_gperf
26856 total

Downloads

You can download this as a neat .zip or .tar.gz file below:

Contents:

  • unique_states – a list of all states and territories.
  • states – a large file, that is a bunch of randomly ordered states. Use it as input for benchmarks.
  • if_maker.c – a program that creates an if structure based on a file (string1\nstring2\nstring3 …)
  • state_if.c – a program that uses if_maker’s generated structure as a benchmark.
  • hash_maker.c – a program that creates a switch structure based on a file (string1\nstring2\nstring3 …)
  • state_hash.c – a program that uses hash_maker’s generated structure as a benchmark.
  • gperf_hash.h – a gperf’s generated hash function that determines if a string is a state.
  • state_gperf.c – a program that uses “gperf_hash.h” as a benchmark.

Conclusion

Speed (seconds) Size (bytes)
If 0.046s 10320
SDBM 0.017s 6216
GPERF 0.015s 10320

There is no reason to use if, it is not only three times as slow as the other implementations, but it is also the largest resulting size. The home-made hash brings the smallest file, and is second place in speed, but users should beware of hash collisions. Ultimately, gperf is the best, removing collisions and also clocking in at the highest time, its only downfall being size.