查看Ruby Under a Microscope的源代码
←
Ruby Under a Microscope
跳转到:
导航
、
搜索
因为以下原因,你没有权限编辑本页:
您刚才请求的操作只有这个用户组中的用户才能使用:
用户
您可以查看并复制此页面的源代码:
== 散列表 == * ruby 的散列表 hash 解决哈希冲突还是经典的链表法,密度阈值设定为 5,超过就做 rehash。也就是 java hash 的所谓负载因子。 * rehash 扩容的大小不是翻倍之类的算法,而是基于素数,总是将容器数目设置为一个素数表里的素数大小。这个考虑也是基于对哈希函数不够分布的担忧。 * 比较元素通过 eq? 方法。 * 默认哈希算法采用 murmur hash,这跟 clojure 是一样的。自定义对象作为 key,同样也可以选择自定义实现 hash 函数。一般都推荐使用默认。 * Ruby 2.0 做了个优化,跟 clojure 一样,小于等于 6 个元素的 hash 直接组织成一个数组, clojure 里是少于等于8个就是 PersistentArrayMap,节省内存和提升效率。 * RHash 转移到 internal.h <pre> struct RHash { struct RBasic basic; struct st_table *ntbl; /* possibly 0 */ int iter_lev; const VALUE ifnone; }; </pre> * ruby 2.4 又有一个大的变化,使用开放地址法替换了链表法,参考 https://bugs.ruby-lang.org/issues/12142,这个讨论非常有价值。 <pre> /* The original package implemented classic bucket-based hash tables with entries doubly linked for an access by their insertion order. To decrease pointer chasing and as a consequence to improve a data locality the current implementation is based on storing entries in an array and using hash tables with open addressing. The current entries are more compact in comparison with the original ones and this also improves the data locality. The hash table has two arrays called *bins* and *entries*. bins: ------- | | entries array: |-------| -------------------------------- | index | | | entry: | | | |-------| | | | | | | ... | | ... | hash | ... | ... | |-------| | | key | | | | empty | | | record | | | |-------| -------------------------------- | ... | ^ ^ |-------| |_ entries start |_ entries bound |deleted| ------- o The entry array contains table entries in the same order as they were inserted. When the first entry is deleted, a variable containing index of the current first entry (*entries start*) is changed. In all other cases of the deletion, we just mark the entry as deleted by using a reserved hash value. Such organization of the entry storage makes operations of the table shift and the entries traversal very fast. o The bins provide access to the entries by their keys. The key hash is mapped to a bin containing *index* of the corresponding entry in the entry array. The bin array size is always power of two, it makes mapping very fast by using the corresponding lower bits of the hash. Generally it is not a good idea to ignore some part of the hash. But alternative approach is worse. For example, we could use a modulo operation for mapping and a prime number for the size of the bin array. Unfortunately, the modulo operation for big 64-bit numbers are extremely slow (it takes more than 100 cycles on modern Intel CPUs). Still other bits of the hash value are used when the mapping results in a collision. In this case we use a secondary hash value which is a result of a function of the collision bin index and the original hash value. The function choice guarantees that we can traverse all bins and finally find the corresponding bin as after several iterations the function becomes a full cycle linear congruential generator because it satisfies requirements of the Hull-Dobell theorem. When an entry is removed from the table besides marking the hash in the corresponding entry described above, we also mark the bin by a special value in order to find entries which had a collision with the removed entries. There are two reserved values for the bins. One denotes an empty bin, another one denotes a bin for a deleted entry. o The length of the bin array is at least two times more than the entry array length. This keeps the table load factor healthy. The trigger of rebuilding the table is always a case when we can not insert an entry anymore at the entries bound. We could change the entries bound too in case of deletion but than we need a special code to count bins with corresponding deleted entries and reset the bin values when there are too many bins corresponding deleted entries Table rebuilding is done by creation of a new entry array and bins of an appropriate size. We also try to reuse the arrays in some cases by compacting the array and removing deleted entries. o To save memory very small tables have no allocated arrays bins. We use a linear search for an access by a key. o To save more memory we use 8-, 16-, 32- and 64- bit indexes in bins depending on the current hash table size. This implementation speeds up the Ruby hash table benchmarks in average by more 40% on Intel Haswell CPU. */ </pre> === 2.4 st_table 阅读心得 === * st_features 定义了 table 的属性: ** entry_power entries数组的大小,2的幂指数。 ** bin_power bins数组大小,同样是2的幂指数 ** size_ind 根据 table 大小,选择 bins 对应的元素的大小,可能是 8-bit, 16-bit etc。 ** bins_words bins按照word计算的大小。 ** 根据 SIZEOF_ST_INDEX_T 枚举了一堆 table 属性。 * 因为 bins 的大小都是 2 的次幂,因此计算哈希值对应的 bin 可以直接用位运算: <pre> /* Return mask for a bin index in table TAB. */ static inline st_index_t bins_mask(const st_table *tab) { return get_bins_num(tab) - 1; } /* Return the index of table TAB bin corresponding to HASH_VALUE. */ static inline st_index_t hash_bin(st_hash_t hash_value, st_table *tab) { return hash_value & bins_mask(tab); } </pre> * 最小的 table 大小是 4, 由 MINIMAL_POWER2 决定。最大的 table 大小是 2 的 30 次方(非 8 位平台),8 位平台上是 2 的 62 次方。 * 对于小于等于 16 个元素的 table,不创建 bins 数组,直接存储在 entries 数组,线性探测,无需进行 hash 计算和查找。 * rebuild_table 可能有两种: compact 现有的,或者创建新的。当 entries_bound 到达上限的时候,开始 rebuild。 ** 当已有 entries 数组的大小在现有元素大小的 2 倍到 4 倍(REBUILD_THRESHOLD)之间,或者元素数量小于 16 个,进入压缩流程,直接使用原来 table 作为新 new_tab ** 否则,进入新建 table 作为 new_tab ** 小技巧 , prefetch 指令,预加载下个元素,在遍历 entries 的时候用到。 <pre> #define PREFETCH(addr, write_p) __builtin_prefetch(addr, write_p) PREFETCH(entries + i + 1, 0); </pre> ** rehash 其实很简单了,遍历已有的 entries 数组,跳过已经删除的,设置到新的 table 里,同时设置 bins: <pre> bins = new_tab->bins; size_ind = get_size_ind(new_tab); for (i = tab->entries_start; i < bound; i++) { curr_entry_ptr = &entries[i]; PREFETCH(entries + i + 1, 0); if (EXPECT(DELETED_ENTRY_P(curr_entry_ptr), 0)) continue; if (&new_entries[ni] != curr_entry_ptr) new_entries[ni] = *curr_entry_ptr; if (EXPECT(bins != NULL, 1)) { bin_ind = find_table_bin_ind_direct(new_tab, curr_entry_ptr->hash, curr_entry_ptr->key); st_assert(bin_ind != UNDEFINED_BIN_IND && (tab == new_tab || new_tab->rebuilds_num == 0) && IND_EMPTY_BIN_P(new_tab, bin_ind)); set_bin(bins, size_ind, bin_ind, ni + ENTRY_BASE); } new_tab->num_entries++; ni++; } </pre> * 开放地址法,遇到哈希冲突,采用二次哈希,次级哈希函数如下: <pre> /* Return the next secondary hash index for table TAB using previous index IND and PERTERB. Finally modulo of the function becomes a full *cycle linear congruential generator*, in other words it guarantees traversing all table bins in extreme case. According the Hull-Dobell theorem a generator "Xnext = (a*Xprev + c) mod m" is a full cycle generator iff o m and c are relatively prime o a-1 is divisible by all prime factors of m o a-1 is divisible by 4 if m is divisible by 4. For our case a is 5, c is 1, and m is a power of two. */ static inline st_index_t secondary_hash(st_index_t ind, st_table *tab, st_index_t *perterb) { *perterb >>= 11; ind = (ind << 2) + ind + *perterb + 1; return hash_bin(ind, tab); } </pre> * find_entry 线性探测,find_table_entry_ind 是二次哈希查找,但是可以选择是否启用二次探测: <pre> /* Use the quadratic probing. The method has a better data locality but more collisions than the current approach. In average it results in a bit slower search. */ /*#define QUADRATIC_PROBE*/ ind = hash_bin(hash_value, tab); #ifdef QUADRATIC_PROBE d = 1; #else peterb = hash_value; #endif FOUND_BIN; for (;;) { bin = get_bin(tab->bins, get_size_ind(tab), ind); //找到相等的。 if (! EMPTY_OR_DELETED_BIN_P(bin) && PTR_EQUAL(tab, &entries[bin - ENTRY_BASE], hash_value, key)) break; //或者找到空的。 else if (EMPTY_BIN_P(bin)) return UNDEFINED_ENTRY_IND; #ifdef QUADRATIC_PROBE //启用了二次探测,计算下一个探测位置。 ind = hash_bin(ind + d, tab); d++; #else //或者二次哈希 ind = secondary_hash(ind, tab, &peterb); #endif COLLISION; } return bin; </pre> 默认采用二次哈希。 * 查找过程简单明了: <pre> /* Find an entry with KEY in table TAB. Return non-zero if we found it. Set up *RECORD to the found entry record. */ int st_lookup(st_table *tab, st_data_t key, st_data_t *value) { st_index_t bin; //计算哈希 st_hash_t hash = do_hash(key, tab); if (tab->bins == NULL) { //线性查找,对于少于等于 16 个元素的table bin = find_entry(tab, hash, key); if (bin == UNDEFINED_ENTRY_IND) return 0; } else { //进入哈希查找。 bin = find_table_entry_ind(tab, hash, key); if (bin == UNDEFINED_ENTRY_IND) return 0; bin -= ENTRY_BASE; } //赋值,返回 if (value != 0) *value = tab->entries[bin].record; return 1; } </pre> * 删除是标记删除,开放地址法必须这么做,否则冲突的时候,前面的删除,就找不到后面的元素了: <pre> bin = get_bin(tab->bins, get_size_ind(tab), bin_ind) - ENTRY_BASE; MARK_BIN_DELETED(tab, bin_ind); } entry = &tab->entries[bin]; *key = entry->key; if (value != 0) *value = entry->record; MARK_ENTRY_DELETED(entry); </pre>
返回到
Ruby Under a Microscope
。
个人工具
登录
名字空间
页面
讨论
变换
查看
阅读
查看源代码
查看历史
操作
搜索
导航
首页
社区专页
新闻动态
最近更改
随机页面
帮助
工具箱
链入页面
相关更改
特殊页面