Chapter 1: Collections

知识解析

本文将 array, hash table, string 统称为collections。 array可以看作 "key是从0开始的连续整数的hash table"，string可以看作是character组成的array。注意这里说的都是抽象意义上的array和string概念，与具体语言中的class Array和 class String相区别。

1.1 Hash Table as ADT

Hash table的功能是基于"key"快速查找value ( $O(1)$ 时间) ，存储的基本元素是key-value pair。

Hash table 是一种“归类”的数据结构，由hash_func(key) 得到bucket number，把key, value pair 放入对应的bucket。整个hash table就是一组bucket。hash_func就是hash function，负责把 key 转换成尽可能少的整数，作为key对应的bucket index。【Cracking the Coding Interview】介绍了一种简单而普遍的实现方式。

如果不同的key出现在同一个bucket，就叫做hash table的collision。hash function的设计原则，就是collision越少越好。

Hash 函数的具体实现方法，可以参考open hashing 和 closed hashing的相关资料：Meaning of open hashing and closed hashing

如果hash table的key的范围与连续的整数对应，比如keys是从a 到z , 或者数字i 到j，那么可以把array当做hashtable，index与真正的key一一对应。比如用arr[0] 表示 key:a 或key:i对应的value。这种情况具体选择专门的hash数据结构还是array，取决于key的分布是密集 (array) 还是稀疏 (hash)。

特别地，如果value是只有两种状态，比如true 或 false ，那么可以把二进制数或bit-vector作为hash table来记录。

特别地，Hash table还可以用Balanced BST来实现, lookup的时间由 $O(1)$ 增长为 $O(logn)$ ，但可能节省space cost , 且可以按照key的顺序输出。

正因为Hash Table是一种“归类”的数据类型，往往也可以用排序来解决，取决于问题对时间和空间复杂度的trade-off。排序（sorting）即归类，只不过是有序的归类。

1.2 Reverse array / string

用头尾两个pointer，相向遍历，交换元素直到指针相遇 while l < r ，但不适用于数据流
顺序遍历元素，push入stack，再依次pop，也可以用递归

1.3 String Matching 的常见算法（在string中查找substring）

假定源string长度为m，需要查找的substring长度为n

Brute-Force算法：遍历string，将每个idx为起点, idx+n-1为终点的匹配子串与substring比较 $O(mn)$

Rabin-Karp算法：遍历string，把每个idx 为起点，idx+n-1 为终点的匹配子串当做一个sliding window，为他们设计一种线性的hash function，使得这些匹配子串的hash值在window移动时只需要做 $O(1)$ 的操作：比如把子串的每一位看作一个x进制数，每个字符对应每一位上的数字，这样每次window slide时，就可以快速增量计算hash的值（扣除最高位，增加最低位）。而hash值与匹配对象substring不同时，无需每个字符进行比较，因此在average情况下能做到 $O(m+n)$ 。最差情况需要 $O(mn)$ ，在hash function的collision非常严重时发生。

模式识别

1.4 在collection内检索和统计的问题

1.4.1 统计一个元素集中元素出现的次数

用Hash Table。只统计元素出现与否的问题，可以用bitvector即二进制数来记录。

e.g. 1 Implement an algorithm to determine if a string has all unique characters. （Hint: 也可以用排序） e.g. 2 Write a method to decide if one string is a permutation of the other. （Hint: 也可以用排序） e.g. 3 Given a newspaper and message as two strings, check if the message can be composed using letters in the newspaper.

1.4.2 检索某一类元素的问题

检索(归类)某一类元素的问题，如果允许修改原来的array并且追求更低的空间复杂度，那么总是可以用排序来归类进而达到待检索元素group到一起的效果。

e.g.1 Given an array of integers, find two numbers such that they add up to a specific number. (Leetcode: 2Sum)
e.g.2 Find all unique triplets in the array which gives the sum of zero.(Leetcode: 3Sum)

1.5 处理array或string问题的几点细节

1.5.1 数组或string内，问题的解与前驱子问题有递推关系

总是以[0, i]为范围，检索前驱节点的信息求解当前子问题的解。如果只是检索特定的value，则可以用将当前value作为key建立hash table（value取决于你需要什么前驱子问题的信息），使得后驱节点每次检索的复杂度降低到 $O(1)$ 。实际上类似于dynamic programming的思维方式，记录前驱节点的信息，用空间换取回头在范围内重新扫描需要的时间。

e.g. Given an unsorted array of integers, find the length of the longest consecutive elements sequence. (Leetcode: Longest Consecutive Sequence)

# 1. 前驱子问题的最大长度 + 当前节点的影响 => 到当前节点为止的最大长度
# 2. 要求当前节点对最大长度的影响，就要知道当前节点所在序列的长度，
#    就必须知道num-1和num+1所在序列的上界和下界

class Solution:
    def longestConsecutive(self, nums: List[int]) -> int:
        bounds = {}
        max_length = 0
        for num in nums:
            if num in bounds:
                continue

            # extend the bound via neighbors
            low, high = num, num

            if num - 1 in bounds:
                low = bounds[num-1][0]
            
            if num + 1 in bounds:
                high = bounds[num+1][1]

            # future extension can only be from low and high so only they need correct bounds
            bounds[high] = bounds[low] = bounds[num] = (low, high)

            # maintain max_length
            if high - low + 1 > max_length:
                max_length = high - low + 1
        
        return max_length

注意在处理每个节点时，首先检索得到当前子问题的解，其次再把用于当前节点检索的信息存入hash table。

1.5.2 求最值或总和的问题，其实就是1.5.1的最简单特例，只不过我们只需要了解之前紧邻的一个节点的解就能推导出当前解 :`sum(n) = sum(n-1) + curr` ，在update的过程中用 `sum` 一个变量记录即可

1.5.3 处理两个不等长的array或string，总是可以以while aIdx < aLength and bIdx < bLength 作为结束条件，再处理可能剩下的元素。

e.g Given two sorted arrays, merge B into A in sorted order. (Leetcode: Merge Sorted Array)

1.5.4 array或string修改后长度变化的问题

用两个pointer分别记录old_index 和 new_index ，如果改动后element增多，就先resize array，再倒序赋值；element减少，就先顺序赋值，再resize。这样就使得新数据不会覆盖旧数据。

e.g. 1 Given input "Mr John Smith", output "Mr%20John%20Smith" e.g. 2 Given input "Mr%20John%20Smith" , output "Mr John Smith"

1.6 Sliding Window 问题**

1.7 处理2D-Array的问题

通常可以把2D-Array当做array of arrays，访问每一行，在每一行内再按列访问。对于外圈向内圈遍历，可以记录row和col的边界，循环时将上下左右的进程分开描述。

e.g.1 Given an NxN matrix, rotate the matrix by 90 degrees. （从test case出发理解交换规律） e.g.2 If mat[i][j] is 0, then set the entire row and column to 0.

e.g.3 Given a matrix of m x n elements (m rows, n columns), return all elements of the matrix in spiral order. (LeetCode: Spiral Matrix)

def spiralOrder(self, matrix: List[List[int]]) -> List[int]:
    if len(matrix) == 0 or len(matrix[0]) == 0: return []

    seq = []
    row, col = 0, 0
    row_l, col_l = 0,0
    row_h, col_h = len(matrix), len(matrix[0])
    
    while True:
        # right
        for col in range(col_l, col_h):
            seq.append(matrix[row_l][col])
        row_l += 1
        if row_l == row_h: return seq
        
        # down
        for row in range(row_l, row_h):
            seq.append(matrix[row][col_h-1])
        col_h -= 1
        if col_l == col_h: return seq

        # left
        for col in reversed(range(col_l, col_h)):
            seq.append(matrix[row_h-1][col])
        row_h -= 1
        if row_l == row_h: return seq
        
        # up
        for row in reversed(range(row_l, row_h)):
            seq.append(matrix[row][col_l])
        col_l += 1
        if col_l == col_h: return seq

1.8 设计数字来映射/hash的情境

可以利用进制数的正交性，即：数字的高位和低位是互相正交、独立的（或者反过来说，一个数用m进制来表示，只有唯一一种可能），把每一个单元用一位数字来表示

e.g. 1 Design a hash function that is suitable for words in a dictionary.(EPI: 12.1) (问题的特征: word不会很长，每一位的可能性有限且密集，可以对应0-25的整数） e.g. 2 A hash function for the state of a chess game.(EPI: 12.2) (问题的特征: 对每一次transition需要很容易的变换hash）
e.g. 3 Implement a method rand7() given rand5().(CtCI:17.11)

Previous2. Array - Intervals NextChapter 2: Linked List

Last updated 4 years ago

Was this helpful?

知识解析

1.1 Hash Table as ADT

1.2 Reverse array / string

1.3 String Matching 的常见算法 （在string中查找substring）

模式识别

1.4 在collection内检索和统计的问题

1.4.1 统计一个元素集中元素出现的次数

1.4.2 检索某一类元素的问题

1.5 处理array或string问题的几点细节

1.5.1 数组或string内，问题的解与前驱子问题有递推关系

1.5.2 求最值或总和的问题，其实就是1.5.1的最简单特例，只不过我们只需要了解之前紧邻的一个节点的解就能推导出当前解 :sum(n) = sum(n-1) + curr ，在update的过程中用 sum 一个变量记录即可

1.6** Sliding Window 问题

1.7 处理2D-Array的问题

1.8 设计数字来映射/hash的情境

1.3 String Matching 的常见算法（在string中查找substring）

1.5.2 求最值或总和的问题，其实就是1.5.1的最简单特例，只不过我们只需要了解之前紧邻的一个节点的解就能推导出当前解 :`sum(n) = sum(n-1) + curr` ，在update的过程中用 `sum` 一个变量记录即可

1.6 Sliding Window 问题**