关于卡 STL 哈希表

panyf · 2021-11-05 19:12:18 · 个人记录

unordered_map 的实现和用法

首先介绍 unordered_map 的一些函数。

hash_function()：使用的哈希函数。

bucket(x)：x 所在的桶的编号。

bucket_size(i)：编号为 i 的桶的元素个数。

bucket_count()：桶的个数。

max_load_factor()：每个桶平均至多能容纳元素个数，默认为 1。

max_load_factor(f)：将 max_load_factor() 设为 f，f 是 float 类型。

load_factor()：每个桶平均元素个数。

size()：元素个数。

rehash(a)：重构哈希表，使 bucket_count() 变为某个不小于 \max(a,size()/max\_load\_factor()) 的值。

reserve(n)：相当于 rehash(ceil(n/max_load_factor()))。

注意调用 bucket(x) 或 count(x) 或 find(x) 不会视为插入 x，但是只要访问 mp[x] 就会视为插入了 x（对应的 second 为 0，size() 加 1）。

所以尽量用 count(x)（返回 mp 中 x 的个数，若存在则为 1，不存在则为 0），find(x)（返回 first 为 x 的位置的迭代器，若存在多个（multimap）则返回任意一个，若不存在则返回 mp.end()）等函数代替 mp[x]。

示例：

#include<bits/stdc++.h>
using namespace std;
unordered_map<int,int>mp;
int main(){ios::sync_with_stdio(0),cin.tie(0);
    auto f=mp.hash_function();
    cout<<f(123456)<<'\n';//123456
    cout<<mp.max_load_factor()<<'\n';//1
    for(int i=1;i<=100;++i)mp[i]=1;
    cout<<mp.bucket_count()<<'\n';//199
    cout<<mp.bucket(199)<<' '<<mp.bucket(200)<<'\n';//0 1
    cout<<mp.bucket_size(1)<<'\n';//1
    mp[200];
    cout<<mp.size()<<'\n';//101
    cout<<mp.bucket_size(1)<<'\n';//2
    mp.erase(200);
    cout<<mp.bucket_size(1)<<'\n';//1
    for(int i=1;i<=10000;++i)mp[i];
    cout<<mp.bucket_count()<<'\n';//15173
    for(int i=1;i<=9900;++i)mp.erase(i);
    cout<<mp.bucket_count()<<'\n';//15173
    mp.rehash(1234),cout<<mp.bucket_count()<<'\n';//1289
    cout<<mp.load_factor()<<'\n';//0.0775795
    mp.max_load_factor(0.01);
    mp.rehash(1234),cout<<mp.bucket_count()<<'\n';//10273
    mp.reserve(1234),cout<<mp.bucket_count()<<'\n';//126271
    return 0;
}

unordered_map 的实现方法是，插入元素 x 时，求出 x 所在桶的编号 i=hash\_function(x)\bmod bucket\_count()。

遍历桶 i，若 i 中已经有 x，则什么都不做（若是 unordered_multimap 则插入一个新的 x）。否则在桶 i 中插入 x。

若某一时刻 load\_factor() 大于 max\_load\_factor()，则进行一次重构，将新的 bucket\_count() 变为一个不小于当前 bucket\_count() 的两倍的数（相当于 rehash(bucket_count()*2)）。若删除若干元素，不会进行重构，只有手动 rehash(0) 才能让 bucket\_count() 变小。（和 vector/basic_string 的动态内存申请实现类似）

hack unordered_map

容易发现，hash_function() 的实现很蠢，对于一个整数 x，调用 hash_function()(x) 得到的结果就是 x 本身。

要 hack unordered_map，只需要多次向同一个桶中插入数并访问即可。也就是多次插入模 bucket\_count() 相同的数。

如何找到这些 bucket\_count()？

运行以下程序：

#include<bits/stdc++.h>
using namespace std;
unordered_map<int,int>mp;
set<int>st;
int main(){ios::sync_with_stdio(0),cin.tie(0);
    for(int i=1;i<=1000000;++i){
        mp[i];
        st.insert(mp.bucket_count());
    }
    for(auto o:st)cout<<o<<' ';
    return 0;
}

得到的结果即为 10^6 以内的 bucket\_count()，这样的数只有大约 20 个，从中随意取一个较大的数即可。注意不同编译器运行结果可能有差异。

比如取 15173：

#include<bits/stdc++.h>
using namespace std;
unordered_map<int,int>mp;

int main(){ios::sync_with_stdio(0),cin.tie(0);
    for(int i=0;i<10000;++i)mp[i*15173];
    for(int i=0;i<100000;i++)mp[(i%10000)*15173];
    cout<<mp.size()<<' '<<mp.bucket_count();
    return 0;
}

只进行了 10^4 次插入和 10^5 次查询，但本地开 O2 跑了接近两秒。

hack pbds hash_table

pbds 的哈希函数实现有所不同，之前的 hack 方法失效了。

不过 pbds 对 64 位整数的哈希函数是先强转成 32 位再哈希（unordered_map 是强转成 size_t 类型）。

所以只需要插入很多模 2^{32} 相同的数即可。

#include<bits/extc++.h>
using namespace std;
__gnu_pbds::gp_hash_table<long long,long long>mp;
int main(){ios::sync_with_stdio(0),cin.tie(0);
    for(int i=0;i<100000;++i)mp[4294967296ll*i];
    cout<<mp.size();
    return 0;
}

本地大约需要 10 秒。

如何防止 hack

1.手动 rehash

假设需要插入 n 个数，就先 rehash(n)。

这样可以大幅减小常数（不需要重构），但是最坏情况下复杂度仍然是 O(n^2)，不能避免被 hack。

所有 10^6 以内可能作为 bucket\_count() 的数：

#include<bits/stdc++.h>
using namespace std;
unordered_map<int,int>mp;
set<int>s;
int main(){ios::sync_with_stdio(0),cin.tie(0);
    for(int i=1;i<=1000000;++i)mp.rehash(i),s.insert(mp.bucket_count());
    for(auto o:s)cout<<o<<' ';
    return 0;
}

容易发现这样的数大约只有 100 个。

2.手写 hash 函数

容易发现默认的 hash_function() 实现很蠢，重新写一个就行了。

#include<bits/stdc++.h>
using namespace std;
mt19937 rd(chrono::steady_clock::now().time_since_epoch().count());
int RD=rd();
struct H{size_t operator()(int x)const{return x^RD;}};
unordered_map<int,int,H>mp;
int main(){
    for(int i=0;i<10000;++i)mp[i*15173];
    for(int i=0;i<100000;i++)mp[(i%10000)*15173];
    cout<<mp.size()<<' '<<mp.bucket_count();
    return 0;
}

修改后重新跑之前例子中的程序，只需要不到 0.1 秒。

注意在 32 位系统上如果类型为 long long，这样还是会被 hack，写一个更好的哈希函数即可：

#include<bits/stdc++.h>
using namespace std;
mt19937 rd(chrono::steady_clock::now().time_since_epoch().count());
int RD=rd();
struct H{size_t operator()(long long x)const{return(x>>32)^x^RD;}};
unordered_map<long long,long long,H>mp;
int main(){
    for(int i=0;i<100000;++i)mp[i*4294967296ll];
    return 0;
}

pbds 的 hash_table 也可以用同样的方式。

3.手写哈希表

unordered_map 本身还是太慢，3\times 10^6 以上就要超过 1 秒。

考虑手写哈希表，10^7 都能跑。

例题：CF1333C Eugene and an array（数据很强）