Implementing IsASCII method. Benchmark.

Recently I needed an ASCII validator for strings. After a few Google searches, the only thing I found in the Go standard library was the constant unicode.MaxASCII, which could help solve the problem. To be sure, I asked on Stack Overflow.

Since the discussion turned to performance, user peterSO ran a benchmark, which produced some really interesting numbers. I decided to go further and write more implementations. Source code can be found here.

Implementations

Using Go RANGE

This method iterates over the string with range, which decodes it rune by rune. It’s one of the cleaner solutions to this problem.

func IsASCIIRange(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}

	return true
}
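
To make the behaviour of range concrete, here is a small sketch that prints what the loop actually sees. The sample strings are my own assumptions, not inputs from the original benchmark. Note that an invalid UTF-8 byte is decoded as U+FFFD, which is above unicode.MaxASCII, so the function rejects it as well.

```go
package main

import (
	"fmt"
	"unicode"
)

// IsASCIIRange is the range-based implementation from the article.
func IsASCIIRange(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}

	return true
}

func main() {
	// range yields (byte offset, rune); "é" occupies two bytes in UTF-8,
	// so the offsets printed are 0, 1 and 3.
	for i, c := range "aé!" {
		fmt.Printf("offset %d: %q (U+%04X)\n", i, c, c)
	}

	fmt.Println(IsASCIIRange("hello")) // true
	fmt.Println(IsASCIIRange("héllo")) // false
	fmt.Println(IsASCIIRange("\xff"))  // false: the invalid byte decodes to U+FFFD
}
```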

Using classic for-loop

This is a plain index-based loop over the bytes of the string. As the benchmark shows, it makes a really big difference.

func IsASCIIClassic(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] > unicode.MaxASCII {
			return false
		}
	}

	return true
}
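
Comparing bytes instead of runes is safe because in UTF-8 every byte of a multi-byte character has its high bit set, so any non-ASCII character shows up as at least one byte above 0x7F. The sketch below illustrates this. IsASCIIBytes is a hypothetical variant (not part of the benchmarked code) that uses the standard constant utf8.RuneSelf (0x80) instead of unicode.MaxASCII; the two checks should be equivalent.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// IsASCIIBytes is a hypothetical variant of IsASCIIClassic.
// Any byte >= utf8.RuneSelf (0x80) belongs to a non-ASCII character
// or is an invalid UTF-8 byte.
func IsASCIIBytes(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}

	return true
}

func main() {
	// "é" is encoded as two bytes, both above 0x7F.
	s := "é"
	for i := 0; i < len(s); i++ {
		fmt.Printf("byte %d: 0x%X\n", i, s[i]) // 0xC3, then 0xA9
	}

	fmt.Println(IsASCIIBytes("hello")) // true
	fmt.Println(IsASCIIBytes("héllo")) // false
}
```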

Using regex

Regex. Nothing more to say. Cool, but not always the right tool.

// asciiRegex matches strings made up only of bytes in the range 0x00-0x7F.
var asciiRegex = regexp.MustCompile(`^[\x00-\x7F]*$`)

func IsASCIIRegex(s string) bool {
	return asciiRegex.MatchString(s)
}

Using Rune

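Using strings.IndexFunc, we can find the first character that satisfies a predicate, here one that matches any rune above unicode.MaxASCII. If there is no such character, the string is ASCII.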

func IsASCIIRune(s string) bool {
	nonASCIIChar := func(r rune) bool {
		return r > unicode.MaxASCII
	}

	return strings.IndexFunc(s, nonASCIIChar) == -1
}
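
Since strings.IndexFunc returns a byte offset rather than a boolean, the same idea can also report where the first offending character is. FirstNonASCII below is a hypothetical helper of my own, shown only to illustrate the return value.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// FirstNonASCII is a hypothetical helper (not from the original code).
// It returns the byte offset of the first non-ASCII character in s,
// or -1 if s contains only ASCII.
func FirstNonASCII(s string) int {
	return strings.IndexFunc(s, func(r rune) bool {
		return r > unicode.MaxASCII
	})
}

func main() {
	fmt.Println(FirstNonASCII("hello")) // -1
	fmt.Println(FirstNonASCII("héllo")) // 1
	fmt.Println(FirstNonASCII("abc→"))  // 3
}
```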

Benchmark

goos: darwin
goarch: amd64
pkg: github.com/nicumaxian-ellation/benchmarks/isascii
BenchmarkIsASCIIRange-8     	50000000	        35.8 ns/op
BenchmarkIsASCIIClassic-8   	100000000	        19.2 ns/op
BenchmarkIsASCIIRegex-8     	 2000000	       987 ns/op
BenchmarkIsASCIIRune-8      	20000000	       104 ns/op
PASS
ok  	github.com/nicumaxian-ellation/benchmarks/isascii	8.965s
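
The benchmark functions themselves are not shown here, so the following is only a sketch of how two of the implementations could be measured. The input string, the sink variable and the benchRange/benchClassic names are my assumptions. It uses testing.Benchmark so it can run as a plain program; with go test you would write BenchmarkXxx functions instead. The absolute numbers will differ by machine and Go version.

```go
package main

import (
	"fmt"
	"testing"
	"unicode"
)

// input is an assumed test string; the original benchmark input is not shown.
const input = "The quick brown fox jumps over the lazy dog"

// sink stores results so the compiler cannot discard the benchmarked calls.
var sink bool

func IsASCIIRange(s string) bool {
	for _, c := range s {
		if c > unicode.MaxASCII {
			return false
		}
	}
	return true
}

func IsASCIIClassic(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] > unicode.MaxASCII {
			return false
		}
	}
	return true
}

// benchRange runs the range-based version b.N times.
func benchRange(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = IsASCIIRange(input)
	}
}

// benchClassic runs the index-based version b.N times.
func benchClassic(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = IsASCIIClassic(input)
	}
}

func main() {
	// testing.Benchmark picks b.N automatically and returns the timing result.
	fmt.Println("range:  ", testing.Benchmark(benchRange))
	fmt.Println("classic:", testing.Benchmark(benchClassic))
}
```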

Conclusion

We don’t usually think about such small things, but small things matter. There is a big performance difference between the range and classic for-loop solutions: the range version takes about 1.86 times as long (35.8 ns/op versus 19.2 ns/op), which, as you can see, is quite big.

If you’re working with a large amount of data, then you should obviously take this into account. For a small amount of data, I would choose the cleanest solution. For me that was solution #1 (using range).