393. UTF-8 Validation #

Problem #

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

For a 1-byte character, the first bit is 0, followed by its Unicode code.
For an n-byte character, the first n bits are all 1s, the (n+1)-th bit is 0, followed by n-1 bytes whose most significant 2 bits are 10.

This is how the UTF-8 encoding works:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
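
To make the table concrete, here is a minimal sketch that packs a code point's bits into the octet patterns listed above. The name `encodeCodePoint` is hypothetical and not part of the original solution. The sketch only illustrates the bit layout, so it does not reject surrogate code points.

```go
package main

// encodeCodePoint is a hypothetical helper (an assumption for illustration).
// It returns the UTF-8 octets for cp, one int per byte, following the table:
// the x bits of each pattern are filled with the bits of cp, high bits first.
// It returns nil when cp lies outside the ranges listed in the table.
func encodeCodePoint(cp int) []int {
	switch {
	case cp < 0 || cp > 0x10FFFF:
		return nil
	case cp <= 0x7F: // 0xxxxxxx
		return []int{cp}
	case cp <= 0x7FF: // 110xxxxx 10xxxxxx
		return []int{0xC0 | cp>>6, 0x80 | cp&0x3F}
	case cp <= 0xFFFF: // 1110xxxx 10xxxxxx 10xxxxxx
		return []int{0xE0 | cp>>12, 0x80 | (cp>>6)&0x3F, 0x80 | cp&0x3F}
	default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []int{0xF0 | cp>>18, 0x80 | (cp>>12)&0x3F, 0x80 | (cp>>6)&0x3F, 0x80 | cp&0x3F}
	}
}
```

For instance, `encodeCodePoint(0x142)` yields `[197 130]`, the 2-byte character that opens Example 1.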

Given an array of integers representing the data, return whether it is a valid UTF-8 encoding.

Note: The input is an array of integers. Only the least significant 8 bits of each integer are used to store the data. This means each integer represents only 1 byte of data.

Example 1:

data = [197, 130, 1], which represents the octet sequence: 11000101 10000010 00000001.

Return true.
It is a valid UTF-8 encoding for a 2-byte character followed by a 1-byte character.

Example 2:

data = [235, 140, 4], which represents the octet sequence: 11101011 10001100 00000100.

Return false.
The first 3 bits are all 1s and the 4th bit is 0, which means it is a 3-byte character.
The next byte is a continuation byte which starts with 10 and that's correct.
But the second continuation byte does not start with 10, so it is invalid.
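
When working through examples like these by hand, it can help to print each integer as an octet. The sketch below assumes a hypothetical helper named `toBinary`; it keeps only the low 8 bits of each value, as the note above describes.

```go
package main

import (
	"fmt"
	"strings"
)

// toBinary is a hypothetical helper (an assumption for illustration).
// It renders the low 8 bits of each integer as an 8-digit binary octet
// and joins the octets with single spaces.
func toBinary(data []int) string {
	octets := make([]string, len(data))
	for i, d := range data {
		octets[i] = fmt.Sprintf("%08b", d&0xFF)
	}
	return strings.Join(octets, " ")
}
```

Calling `toBinary([]int{235, 140, 4})` returns `11101011 10001100 00000100`, matching the octet sequence in Example 2.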

Problem Summary #

A character in UTF-8 may have a length of 1 to 4 bytes, following these rules:

For a 1-byte character, the first bit of the byte is set to 0, and the following 7 bits are the Unicode code of the symbol. For an n-byte character (n > 1), the first n bits of the first byte are all set to 1, the (n+1)-th bit is set to 0, and each of the following n-1 bytes begins with the bits 10. All remaining bits are the Unicode code of the symbol. This is how UTF-8 encoding works:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Given an array of integers representing data, return whether it is a valid UTF-8 encoding.

Note:

The input is an integer array. Only the least significant 8 bits of each integer are used to store data. This means each integer represents only 1 byte of data.

Solution Approach #

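This problem looks complicated, but it can be solved by simulating the UTF-8 definition directly. Scan the bytes from left to right while tracking how many continuation bytes the current character still needs. When none are pending, the byte must be a valid first byte, and its leading bits determine how many continuation bytes to expect; otherwise the byte must start with 10. The data is valid only if no continuation bytes are still pending when the input ends.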
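
The role of each byte depends only on how many 1 bits it starts with. The sketch below shows one way to express that; the names `leadingOnes` and `classify` are hypothetical and are not used by the solution in the Code section.

```go
package main

// leadingOnes is a hypothetical helper (an assumption for illustration).
// It counts the consecutive 1 bits at the top of the low 8 bits of b.
func leadingOnes(b int) int {
	n := 0
	for mask := 0x80; mask != 0 && b&mask != 0; mask >>= 1 {
		n++
	}
	return n
}

// classify (also hypothetical) maps that count onto the roles a byte can
// play according to the encoding table.
func classify(b int) string {
	switch leadingOnes(b) {
	case 0:
		return "single" // 0xxxxxxx
	case 1:
		return "continuation" // 10xxxxxx
	case 2, 3, 4:
		return "lead" // 110xxxxx, 1110xxxx, 11110xxx
	default:
		return "invalid" // 11111xxx can neither start nor continue a character
	}
}
```

For a lead byte, `leadingOnes(b) - 1` is the number of continuation bytes that must follow, which is the quantity the simulation has to track.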

Code #


package leetcode

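// validUtf8 reports whether data consists only of complete UTF-8 characters
// under the rules stated in the problem.
func validUtf8(data []int) bool {
	count := 0 // continuation bytes still expected for the current character
	for _, d := range data {
		d &= 0xFF // only the least significant 8 bits carry data
		if count == 0 {
			// Expecting the first byte of a character.
			if d >= 248 { // 11111000 = 248: five or more leading 1s
				return false
			} else if d >= 240 { // 11110000 = 240: 4-byte character
				count = 3
			} else if d >= 224 { // 11100000 = 224: 3-byte character
				count = 2
			} else if d >= 192 { // 11000000 = 192: 2-byte character
				count = 1
			} else if d > 127 { // 128..191 is 10xxxxxx: continuation byte with no lead byte
				return false
			}
		} else {
			// Expecting a continuation byte 10xxxxxx, i.e. a value in 128..191.
			if d <= 127 || d >= 192 {
				return false
			}
			count--
		}
	}
	// A nonzero count means the data ended in the middle of a character.
	return count == 0
}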
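
The same simulation can be written with bit masks in place of decimal thresholds, which some readers may find easier to compare against the encoding table. This is only a sketch of an equivalent variant under the problem's rules; the name `validUTF8Masks` is hypothetical.

```go
package main

// validUTF8Masks is a hypothetical variant (an assumption for illustration).
// It applies the problem's rules by matching each byte against the bit
// patterns from the encoding table.
func validUTF8Masks(data []int) bool {
	pending := 0 // continuation bytes still expected
	for _, d := range data {
		b := d & 0xFF // only the low 8 bits carry data
		switch {
		case pending > 0:
			if b&0xC0 != 0x80 { // must look like 10xxxxxx
				return false
			}
			pending--
		case b&0x80 == 0x00: // 0xxxxxxx: complete 1-byte character
		case b&0xE0 == 0xC0: // 110xxxxx
			pending = 1
		case b&0xF0 == 0xE0: // 1110xxxx
			pending = 2
		case b&0xF8 == 0xF0: // 11110xxx
			pending = 3
		default: // stray 10xxxxxx or 11111xxx
			return false
		}
	}
	return pending == 0 // false if the data stops mid-character
}
```

With the two examples from the problem, `validUTF8Masks([]int{197, 130, 1})` is true and `validUTF8Masks([]int{235, 140, 4})` is false. Note that Go's standard `unicode/utf8` package is stricter than this problem: `utf8.Valid` also rejects overlong encodings and surrogate code points, so a sequence such as [192, 128] passes the rules here but not `utf8.Valid`.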