187. Repeated DNA Sequences #
Problem #
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for Example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Example:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Problem Summary #
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, identifying repeated sequences in DNA can sometimes be very helpful for research. Write a function to find all 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Solution Ideas #
- This problem is easier to solve without bit manipulation: maintain a string of length 10, and output it if its occurrence count in the map is > 1.
- To solve this problem with bit manipulation, you need to dynamically maintain a hashkey of length 10. First calculate the hash of the initial length-9 prefix. During the subsequent scan, if the length exceeds 10, remove one character from the beginning of the hash and add one character at the end. The specific approach is to first encode ATCG as 00, 01, 10, 11. Then for a length of 10, the hashkey needs to be maintained within 20 bits. mask = 0xFFFFF is exactly 20 bits. After maintaining the hashkey, use it to deduplicate and count frequencies.
Code #
package leetcode
// Solution One
func findRepeatedDnaSequences(s string) []string {
if len(s) < 10 {
return nil
}
charMap, mp, result := map[uint8]uint32{'A': 0, 'C': 1, 'G': 2, 'T': 3}, make(map[uint32]int, 0), []string{}
var cur uint32
for i := 0; i < 9; i++ { // First 9 bits, ignore
cur = cur<<2 | charMap[s[i]]
}
for i := 9; i < len(s); i++ {
cur = ((cur << 2) & 0xFFFFF) | charMap[s[i]]
if mp[cur] == 0 {
mp[cur] = 1
} else if mp[cur] == 1 { // >2, repeated
mp[cur] = 2
result = append(result, s[i-9:i+1])
}
}
return result
}
// Solution Two
func findRepeatedDnaSequences1(s string) []string {
if len(s) < 10 {
return []string{}
}
ans, cache := make([]string, 0), make(map[string]int)
for i := 0; i <= len(s)-10; i++ {
curr := string(s[i : i+10])
if cache[curr] == 1 {
ans = append(ans, curr)
}
cache[curr]++
}
return ans
}