- Data structures can be categorized from two perspectives: logical structure and physical structure. Logical structure describes the logical relationships between data, while physical structure describes how data is stored in memory.
- Frequently used logical structures include linear structures, trees, and networks. We usually divide data structures into linear (arrays, linked lists, stacks, queues) and non-linear (trees, graphs, heaps) based on their logical structure. The implementation of hash tables may involve both linear and non-linear data structures.
- When a program is running, data is stored in memory. Each memory space has a corresponding address, and the program accesses data through these addresses.
- Physical structures can be divided into continuous space storage (arrays) and discrete space storage (linked lists). All data structures are implemented using arrays, linked lists, or a combination of both.
- The basic data types in computers include integers (`byte`, `short`, `int`, `long`), floating-point numbers (`float`, `double`), characters (`char`), and booleans (`bool`). The value range of a data type depends on its size and representation.
- Sign-magnitude, 1's complement, and 2's complement are three methods of encoding integers in computers, and they can be converted into each other. The most significant bit of the sign-magnitude is the sign bit, and the remaining bits represent the value of the number.
- Integers are encoded by 2's complement in computers. The benefits of this representation include (i) the computer can unify the addition of positive and negative integers, (ii) no need to design special hardware circuits for subtraction, and (iii) no ambiguity of positive and negative zero. (A code sketch after this list makes these encodings concrete.)
- The encoding of floating-point numbers consists of 1 sign bit, 8 exponent bits, and 23 fraction bits. Due to the exponent bit, the range of floating-point numbers is much greater than that of integers, but at the cost of precision.
- ASCII is the earliest English character set, with 1 byte in length and a total of 127 characters. GBK is a popular Chinese character set, which includes more than 20,000 Chinese characters. Unicode aims to provide a complete character set standard that includes characters from various languages in the world, thus solving the garbled character problem caused by inconsistent character encoding methods.
- UTF-8 is the most popular and general Unicode encoding method. It is a variable-length encoding method with good scalability and space efficiency. UTF-16 and UTF-32 are fixed-length encoding methods. When encoding Chinese characters, UTF-16 takes up less space than UTF-8. Programming languages like Java and C# use UTF-16 encoding by default.
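To make the points above about integer, floating-point, and character encodings concrete, here is a small illustrative Python sketch (the sample values, bit widths, and variable names are assumptions chosen for demonstration): it derives the 8-bit 2's complement of a negative integer, unpacks the sign, exponent, and fraction fields of a number stored as an IEEE 754 single-precision float, and compares the byte lengths of the UTF-8 and UTF-16 encodings of a mixed English/Chinese string.

```python
import struct

# 2's complement: the 8-bit pattern of -5 is "invert the bits of 5, then add 1",
# which is the same value as (-5) mod 2**8
print(f"{-5 & 0xFF:08b}")  # 11111011

# Floating-point layout: pack 3.14 as an IEEE 754 single-precision float
# (1 sign bit, 8 exponent bits, 23 fraction bits), then slice out the fields
bits = struct.unpack(">I", struct.pack(">f", 3.14))[0]
sign = bits >> 31                  # 0, i.e., positive
exponent = (bits >> 23) & 0xFF     # 128, i.e., biased exponent 127 + 1
fraction = bits & 0x7FFFFF         # the 23 fraction bits
print(sign, exponent, fraction)

# Character encodings: UTF-8 is variable-length, while UTF-16 uses 2 bytes
# for every character in the Basic Multilingual Plane
s = "Hi, 算法"
print(len(s.encode("utf-8")))      # ASCII characters take 1 byte, Chinese characters take 3
print(len(s.encode("utf-16-le")))  # every character in this string takes 2 bytes
```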
### 2. Q & A
**Q**: Why does a hash table contain both linear and non-linear data structures?
The underlying structure of a hash table is an array. To resolve hash collisions, we may use "chaining": each bucket in the array points to a linked list, which, when exceeding a certain threshold, might be transformed into a tree (usually a red-black tree).
From a storage perspective, the foundation of a hash table is an array, where each bucket slot might contain a value, a linked list, or a tree. Therefore, hash tables may contain both linear data structures (arrays, linked lists) and non-linear data structures (trees).
The underlying structure of a hash table is an array. To resolve hash collisions, we may use "chaining" (discussed in a later section, "Hash collision"): each bucket in the array points to a linked list, which may transform into a tree (usually a red-black tree) when its length is larger than a certain threshold.
From a storage perspective, the underlying structure of a hash table is an array, where each bucket might contain a value, a linked list, or a tree. Therefore, hash tables may contain both linear data structures (arrays, linked lists) and non-linear data structures (trees).
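As a minimal illustration of this layered structure, the sketch below (in Python; the class name, the capacity, and the omission of the list-to-tree conversion are simplifying assumptions) stores each bucket as a plain list of key-value pairs, i.e., separate chaining without the tree stage:

```python
class ChainingHashMap:
    """Minimal separate-chaining hash table: an array of buckets, each bucket a list of key-value pairs"""

    def __init__(self, capacity: int = 8):
        # Linear layer: an array in which every slot holds a (possibly empty) bucket
        self.buckets: list[list[tuple[int, str]]] = [[] for _ in range(capacity)]

    def put(self, key: int, val: str):
        """Add or update a key-value pair"""
        bucket = self.buckets[key % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: update its value
                bucket[i] = (key, val)
                return
        bucket.append((key, val))     # otherwise append to the chain
        # Production hash tables may convert a long chain into a tree
        # (usually a red-black tree); that non-linear layer is omitted here.

    def get(self, key: int) -> str | None:
        """Return the value mapped to key, or None if absent"""
        bucket = self.buckets[key % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        return None
```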
**Q**: Is the length of the `char` type 1 byte?
The length of the `char` type is determined by the encoding method of the programming language. For example, Java, JavaScript, TypeScript, and C# all use UTF-16 encoding (to save Unicode code points), so the length of the `char` type is 2 bytes.
**Q**: Is there any ambiguity when we refer to array-based data structures as "static data structures"? The stack can also perform "dynamic" operations such as popping and pushing.
The stack can implement dynamic data operations, but the data structure is still "static" (the length is fixed). Although array-based data structures can dynamically add or remove elements, their capacity is fixed. If the stack size exceeds the pre-allocated size, then the old array will be copied into a newly created and larger array.
**Q**: When building a stack (queue), its size is not specified, so why are they "static data structures"?
In high-level programming languages, we do not need to manually specify the initial capacity of stacks (queues); this task is automatically completed within the class. For example, the initial capacity of Java's `ArrayList` is usually 10. Furthermore, the expansion operation is also completed automatically. See the subsequent "List" chapter for details.
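The following Python sketch (the class name, the initial capacity of 10, and the doubling factor are illustrative assumptions, loosely mirroring the `ArrayList` behavior mentioned above) shows how a stack can expose "dynamic" push and pop operations while its underlying array stays fixed in size until an expansion copies everything into a larger array:

```python
class ArrayStack:
    """Array-based stack with automatic capacity expansion"""

    def __init__(self, capacity: int = 10):
        self._data = [None] * capacity   # fixed-size underlying array
        self._size = 0                   # number of elements actually stored

    def push(self, item: int):
        if self._size == len(self._data):
            self._extend()               # array is full: expand before pushing
        self._data[self._size] = item
        self._size += 1

    def pop(self) -> int:
        if self._size == 0:
            raise IndexError("pop from an empty stack")
        self._size -= 1
        item = self._data[self._size]
        self._data[self._size] = None    # release the reference
        return item

    def _extend(self):
        """Create a larger array and copy the old contents into it"""
        new_data = [None] * (len(self._data) * 2)
        for i in range(self._size):
            new_data[i] = self._data[i]
        self._data = new_data
```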
**Q**: The method of converting the sign-magnitude to the 2's complement is "first negate and then add 1", so converting the 2's complement to the sign-magnitude should be its inverse operation, "first subtract 1 and then negate".
However, the 2's complement can also be converted to the sign-magnitude through "first negate and then add 1". Why is this?
**A**: This is because the mutual conversion between the sign-magnitude and the 2's complement is equivalent to computing the "complement". We first define the complement: assuming $a + b = c$, then we say that $a$ is the complement of $b$ to $c$, and vice versa, $b$ is the complement of $a$ to $c$.
Given a binary number $0010$ with length $n = 4$, if this number is the sign-magnitude (ignoring the sign bit), then its 2's complement can be obtained by "first negating and then adding 1":
$$
0010 \rightarrow 1101 \rightarrow 1110
$$
Observe that the sum of the sign-magnitude and the 2's complement is $0010 + 1110 = 10000$, i.e., the 2's complement $1110$ is the "complement" of the sign-magnitude $0010$ to $10000$. **This means that the above "first negate and then add 1" is equivalent to computing the complement to $10000$**.
So, what is the "complement" of $1110$ to $10000$? We can still compute it by "negating first and then adding 1":
$$
1110 \rightarrow 0001 \rightarrow 0010
$$
In other words, the sign-magnitude and the 2's complement are each other's "complement" to $10000$, so "sign-magnitude to 2's complement" and "2's complement to sign-magnitude" can be implemented with the same operation (first negate and then add 1).
Of course, we can also use the inverse operation of "first negate and then add 1" to find the sign-magnitude of the 2's complement $1110$, that is, "first subtract 1 and then negate":
$$
1110 \rightarrow 1101 \rightarrow 0010
$$
To sum up, "first negate and then add 1" and "first subtract 1 and then negate" are both computing the complement to $10000$, and they are equivalent.
Essentially, the "negate" operation is actually to find the complement to $1111$ (because `sign-magnitude + 1's complement = 1111` always holds); and the 1's complement plus 1 is equal to the 2's complement to $10000$.
We used $n = 4$ as an example above; the argument generalizes to binary numbers of any length.
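As a quick sanity check of this argument, the following Python sketch (the 4-bit width and the function names are illustrative) verifies, for every 4-bit pattern, that "first negate and then add 1" and "first subtract 1 and then negate" produce the same result, namely the complement to $10000$:

```python
N = 4                    # the bit width used in the example above
MASK = (1 << N) - 1      # 1111

def negate_then_add_one(x: int) -> int:
    return ((x ^ MASK) + 1) & MASK

def subtract_one_then_negate(x: int) -> int:
    return ((x - 1) & MASK) ^ MASK

for x in range(1 << N):
    a = negate_then_add_one(x)
    b = subtract_one_then_negate(x)
    # both equal the "complement" of x to 10000, i.e., (2**N - x) mod 2**N
    assert a == b == ((1 << N) - x) & MASK
```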
A <u>hash table</u>, also known as a <u>hash map</u>, is a data structure that establishes a mapping between keys and values, enabling efficient element retrieval. Specifically, when we input a `key` into the hash table, we can retrieve the corresponding `value` in $O(1)$ time.
As shown in Figure 6-1, given $n$ students, each student has two data fields: "Name" and "Student ID". If we want to implement a query function that takes a student ID as input and returns the corresponding name, we can use the hash table shown in Figure 6-1.
![Abstract representation of a hash table](hash_map.assets/hash_table_lookup.png){ class="animation-figure" }
<palign="center"> Figure 6-1 Abstract representation of a hash table </p>
In addition to hash tables, arrays and linked lists can also be used to implement query functionality, but the time complexity is different. Their efficiency is compared in Table 6-1:
- **Inserting elements**: Simply append the element to the tail of the array (or linked list). The time complexity of this operation is $O(1)$.
- **Searching for elements**: As the array (or linked list) is unsorted, searching for an element requires traversing through all of the elements. The time complexity of this operation is $O(n)$.
- **Deleting elements**: To remove an element, we first need to locate it. Then, we delete it from the array (or linked list). The time complexity of this operation is $O(n)$.
<palign="center"> Table 6-1 Comparison of element query efficiency</p>
<palign="center"> Table 6-1 Comparison of time efficiency for common operations</p>
It can be seen that **the time complexity for operations (insertion, deletion, searching, and modification) in a hash table is $O(1)$**, which is highly efficient.
## 6.1.1 Common operations of hash table
Common operations of a hash table include: initialization, querying, adding key-value pairs, and deleting key-value pairs. Example code is as follows:
=== "Python"
## 6.1.2 Simple implementation of a hash table
First, let's consider the simplest case: **implementing a hash table using only one array**. In the hash table, each empty slot in the array is called a <u>bucket</u>, and each bucket can store a key-value pair. Therefore, the query operation involves finding the bucket corresponding to the `key` and retrieving the `value` from it.
So, how do we locate the corresponding bucket based on the `key`? This is achieved through a <u>hash function</u>. The role of the hash function is to map a larger input space to a smaller output space. In a hash table, the input space consists of all the keys, and the output space consists of all the buckets (array indices). In other words, given a `key`, **we can use the hash function to determine the storage location of the corresponding key-value pair in the array**.
When given a `key`, the calculation process of the hash function consists of the following two steps:
1. Calculate the hash value using a certain hash algorithm `hash()`.
2. Take the modulus of the hash value with the bucket count (array length) `capacity` to obtain the array `index` corresponding to that key.
```shell
index = hash(key) % capacity
```
Afterward, we can use the `index` to access the corresponding bucket in the hash table and thereby retrieve the `value`.
Let's assume that the array length is `capacity = 100` and the hash algorithm is defined as `hash(key) = key`, so the hash function can be expressed as `key % 100`. Figure 6-2 illustrates the working principle of the hash function, using `key` as the student ID and `value` as the name.
![Working principle of hash function](hash_map.assets/hash_function.png){ class="animation-figure" }
<p align="center"> Figure 6-2 Working principle of hash function </p>

The following code implements a simple hash table. Here, we encapsulate `key` and `value` into a class `Pair` to represent a key-value pair.
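A condensed Python sketch of such an array-based hash table is given below (the class and method names are illustrative; the capacity of 100 and the hash algorithm `key % 100` follow the running example above, and collision handling is deliberately left out until the next section):

```python
class Pair:
    """Key-value pair"""

    def __init__(self, key: int, val: str):
        self.key = key
        self.val = val

class ArrayHashMap:
    """Simple hash table based on a plain array"""

    def __init__(self):
        # Initialize an array with 100 buckets
        self.buckets: list[Pair | None] = [None] * 100

    def hash_func(self, key: int) -> int:
        """Hash function: map a key to a bucket index"""
        return key % 100

    def get(self, key: int) -> str | None:
        """Query operation"""
        pair = self.buckets[self.hash_func(key)]
        return None if pair is None else pair.val

    def put(self, key: int, val: str):
        """Add operation"""
        self.buckets[self.hash_func(key)] = Pair(key, val)

    def remove(self, key: int):
        """Delete operation"""
        self.buckets[self.hash_func(key)] = None
```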
## 6.1.3 Hash collision and resizing
Essentially, the role of the hash function is to map the entire input space of all keys to the output space of all array indices. However, the input space is often much larger than the output space. Therefore, **theoretically, there will always be cases where "multiple inputs correspond to the same output"**.
In the example above, with the given hash function, when the last two digits of the input `key` are the same, the hash function produces the same output. For instance, when querying two students with student IDs 12836 and 20336, we find:
```shell
12836 % 100 = 36
20336 % 100 = 36
```
As shown in Figure 6-3, both student IDs point to the same name, which is obviously incorrect. This situation where multiple inputs correspond to the same output is called <u>hash collision</u>.
![Example of hash collision](hash_map.assets/hash_collision.png){ class="animation-figure" }
<palign="center"> Figure 6-3 Example of hash collision </p>
It is easy to understand that as the capacity $n$ of the hash table increases, the probability of multiple keys being assigned to the same bucket decreases, resulting in fewer collisions. Therefore, **we can reduce hash collisions by resizing the hash table**.
As shown in Figure 6-4, before resizing, the key-value pairs `(136, A)` and `(236, D)` collide. However, after resizing, the collision is resolved.
Similar to array expansion, resizing a hash table requires migrating all key-value pairs from the original hash table to the new one, which is time-consuming. Furthermore, since the `capacity` of the hash table changes, we need to recalculate the storage positions of all key-value pairs using the hash function, further increasing the computational overhead of the resizing process. Therefore, programming languages often allocate a sufficiently large capacity for the hash table to prevent frequent resizing.
The <u>load factor</u> is an important concept in hash tables. It is defined as the ratio of the number of elements in the hash table to the number of buckets. It is used to measure the severity of hash collisions and **often serves as a trigger for hash table resizing**. For example, in Java, when the load factor exceeds $0.75$, the system will resize the hash table to twice its original size.
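To make the resizing trigger concrete, here is an illustrative Python sketch (the class name, the initial capacity, and the $0.75$ threshold are assumptions for demonstration, loosely mirroring the Java behavior described above): whenever the load factor exceeds the threshold, the bucket array is doubled and every existing key-value pair is re-hashed into the new array.

```python
class SimpleHashMap:
    """Illustrative hash table that doubles its capacity when the load factor exceeds a threshold"""

    def __init__(self, capacity: int = 4, load_thres: float = 0.75):
        self.size = 0                     # number of key-value pairs stored
        self.load_thres = load_thres      # load factor threshold that triggers resizing
        # one key-value pair per bucket (colliding keys simply overwrite each other in this sketch)
        self.buckets = [None] * capacity

    def load_factor(self) -> float:
        """Ratio of stored elements to the number of buckets"""
        return self.size / len(self.buckets)

    def put(self, key: int, val: str):
        if self.load_factor() > self.load_thres:
            self._extend()
        index = key % len(self.buckets)
        if self.buckets[index] is None:
            self.size += 1
        self.buckets[index] = (key, val)

    def _extend(self):
        """Double the capacity and re-hash all existing key-value pairs"""
        old_buckets = self.buckets
        self.buckets = [None] * (len(old_buckets) * 2)
        self.size = 0
        for pair in old_buckets:
            if pair is not None:
                self.put(*pair)
```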