Count-min sketch

Count-min sketch 是一种概率型数据结构，用于估计数据流中元素的频率。

Count-Min Sketch 是 Redis 开源中的一种概率型数据结构，可用于估计数据流中事件/元素的频率。

它使用亚线性空间，但代价是由于冲突而导致某些事件的计数偏高。它消耗事件/元素的流，并保留其频率的估计计数。

了解 Count-Min sketch 得出的低于某个阈值（由 error_rate 确定）的结果应被忽略，甚至通常近似为零，这一点非常重要。因此，Count-Min sketch 确实是一种用于计算数据流中元素频率的数据结构，但仅对较高计数有用。非常低的计数应作为噪声忽略。

用例

产品（零售，在线商店）

这个应用场景回答了这个问题：某个产品（在特定日期）的销售量是多少？

每天（或每个时间段）创建一个 Count-Min sketch。每一次产品销售都计入 CMS。CMS 对销售贡献最大的产品提供相当准确的结果。销售占比很低的产品会被忽略。

示例

假设您选择 0.1% (0.001) 的错误率和 99.8% (0.998) 的确定性。这意味着您的错误概率为 0.02% (0.002)。您的 sketch 力求将误差控制在您添加的所有元素总数的 0.1% 以内。有 0.02% 的几率误差可能会超出此范围——例如当低于阈值的元素与高于阈值的元素发生重叠时。当您向 CMS 添加少量项目并评估其频率时，请记住，在如此小的样本中，冲突是罕见的，与其他概率数据结构一样。

>_ Redis CLI

> CMS.INITBYPROB bikes:profit 0.001 0.002
OK
> CMS.INCRBY bikes:profit "Smokey Mountain Striker" 100
(integer) 100
> CMS.INCRBY bikes:profit "Rocky Mountain Racer" 200 "Cloudy City Cruiser" 150
1) (integer) 200
2) (integer) 150
> CMS.QUERY bikes:profit "Smokey Mountain Striker" "Rocky Mountain Racer" "Cloudy City Cruiser" "Terrible Bike Name"
1) (integer) 100
2) (integer) 200
3) (integer) 150
4) (integer) 0
> CMS.INFO bikes:profit
1) width
2) (integer) 2000
3) depth
4) (integer) 9
5) count
6) (integer) 450

Python

"""
Code samples for Count-min sketch doc pages:
    https://redis.ac.cn/docs/latest/develop/data-types/probabilistic/count-min-sketch/
"""
import redis

r = redis.Redis(decode_responses=True)

res1 = r.cms().initbyprob("bikes:profit", 0.001, 0.002)
print(res1)  # >>> True

res2 = r.cms().incrby("bikes:profit", ["Smoky Mountain Striker"], [100])
print(res2)  # >>> [100]

res3 = r.cms().incrby(
    "bikes:profit", ["Rocky Mountain Racer", "Cloudy City Cruiser"], [200, 150]
)
print(res3)  # >>> [200, 150]

res4 = r.cms().query("bikes:profit", "Smoky Mountain Striker")
print(res4)  # >>> [100]

res5 = r.cms().info("bikes:profit")
print(res5.width, res5.depth, res5.count)  # >>> 2000 9 450

Node.js

import assert from 'assert';
import { createClient } from 'redis';

const client = createClient();
await client.connect();


const res1 = await client.cms.initByProb('bikes:profit', 0.001, 0.002);
console.log(res1);  // >>> OK

const res2 = await client.cms.incrBy('bikes:profit', {
    item: 'Smoky Mountain Striker',
    incrementBy: 100
});
console.log(res2);  // >>> [100]

const res3 = await client.cms.incrBy('bikes:profit', [
  {
    item: 'Rocky Mountain Racer',
    incrementBy: 200
  },
  {
    item: 'Cloudy City Cruiser',
    incrementBy: 150
  }
]);
console.log(res3);  // >>> [200, 150]

const res4 = await client.cms.query('bikes:profit', 'Smoky Mountain Striker');
console.log(res4);  // >>> [100]

const res5 = await client.cms.info('bikes:profit');
console.log(res5.width, res5.depth, res5.count);  // >>> 2000 9 450

Java-Sync

package io.redis.examples;


public class CMSExample {

  public void run() {
    UnifiedJedis jedis = new UnifiedJedis("redis://:6379");


    String res1 = jedis.cmsInitByProb("bikes:profit", 0.001d, 0.002d);
    System.out.println(res1); // >>> OK

    long res2 = jedis.cmsIncrBy("bikes:profit", "Smoky Mountain Striker", 100L);
    System.out.println(res2); // >>> 100

    List<Long> res3 = jedis.cmsIncrBy("bikes:profit", new HashMap<String, Long>() {{
      put("Rocky Mountain Racer", 200L);
      put("Cloudy City Cruiser", 150L);
    }});
    System.out.println(res3); // >>> [200, 150]

    List<Long> res4 = jedis.cmsQuery("bikes:profit", "Smoky Mountain Striker");
    System.out.println(res4); // >>> [100]

    Map<String, Object> res5 = jedis.cmsInfo("bikes:profit");
    System.out.println(res5.get("width") + " " + res5.get("depth") + " " + res5.get("count")); // >>> 2000 9 450

    jedis.close();
  }

}

package example_commands_test

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)


func ExampleClient_cms() {
	ctx := context.Background()

	rdb := redis.NewClient(&redis.Options{
		Addr:     "localhost:6379",
		Password: "", // no password docs
		DB:       0,  // use default DB
	})


	res1, err := rdb.CMSInitByProb(ctx, "bikes:profit", 0.001, 0.002).Result()

	if err != nil {
		panic(err)
	}

	fmt.Println(res1) // >>> OK

	res2, err := rdb.CMSIncrBy(ctx, "bikes:profit",
		"Smoky Mountain Striker", 100,
	).Result()

	if err != nil {
		panic(err)
	}

	fmt.Println(res2) // >>> [100]

	res3, err := rdb.CMSIncrBy(ctx, "bikes:profit",
		"Rocky Mountain Racer", 200,
		"Cloudy City Cruiser", 150,
	).Result()

	if err != nil {
		panic(err)
	}

	fmt.Println(res3) // >>> [200 150]

	res4, err := rdb.CMSQuery(ctx, "bikes:profit",
		"Smoky Mountain Striker",
	).Result()

	if err != nil {
		panic(err)
	}

	fmt.Println(res4) // >>> [100]

	res5, err := rdb.CMSInfo(ctx, "bikes:profit").Result()

	if err != nil {
		panic(err)
	}

	fmt.Printf("Width: %v, Depth: %v, Count: %v",
		res5.Width, res5.Depth, res5.Count)
	// >>> Width: 2000, Depth: 9, Count: 450

}


using NRedisStack.CountMinSketch.DataTypes;
using NRedisStack.RedisStackCommands;
using NRedisStack.Tests;
using StackExchange.Redis;



public class Cms_tutorial
{
    public void run()
    {
        var muxer = ConnectionMultiplexer.Connect("localhost:6379");
        var db = muxer.GetDatabase();


        bool res1 = db.CMS().InitByProb("bikes:profit", 0.001, 0.002);
        Console.WriteLine(res1);    // >>> True

        long res2 = db.CMS().IncrBy("bikes:profit", "Smoky Mountain Striker", 100);
        Console.WriteLine(res2);    // >>> 100

        long[] res3 = db.CMS().IncrBy("bikes:profit",
            new Tuple<RedisValue, long>[]{
                new Tuple<RedisValue, long>("Rocky Mountain Racer", 200),
                new Tuple<RedisValue, long>("Cloudy City Cruiser", 150)
            }
        );
        Console.WriteLine(string.Join(", ", res3)); // >>> 200, 150

        long[] res4 = db.CMS().Query("bikes:profit", new RedisValue[] { "Smoky Mountain Striker" });
        Console.WriteLine(string.Join(", ", res4)); // >>> 100

        CmsInformation res5 = db.CMS().Info("bikes:profit");
        Console.WriteLine($"Width: {res5.Width}, Depth: {res5.Depth}, Count: {res5.Count}");
        // >>> Width: 2000, Depth: 9, Count: 450

        // Tests for 'cms' step.


    }
}

示例 1

如果我们有 1000 个元素的均匀分布，每个元素的计数约为 500，那么阈值将是 500。

threshold = error * total_count  = 0.001 * (1000*500) = 500

这表明 CMS 可能不是统计均匀分布流频率的最佳数据结构。让我们尝试将误差减小到 0.01%。

threshold = error * total_count  = 0.0001 * (1000*500) = 100

这个阈值看起来已经更容易接受了，但这意味我们需要更大的 sketch 宽度 w = 2/error = 20 000，因此也会占用更多内存。

示例 2

在另一个示例中，假设存在正态（高斯）分布，其中有 1000 个元素，其中 800 个元素的总计数为 40 万（平均计数为 500），200 个元素的总计数更高，为 160 万（平均计数为 8000），这使得它们成为“重击者”（或称“大象流”）。使用所有 1000 个元素“填充”sketch 后的阈值为

threshold = error * total_count = 0.001 * 2M = 2000

这个阈值似乎恰好落在两个平均计数 500 和 8000 之间，因此最初选择的错误率应该适用于这种情况。

尺寸确定

尽管 Count-Min sketch 在许多方面与布隆过滤器相似，但其尺寸确定要复杂得多。初始化命令只接收两个尺寸参数，但如果您想获得一个可用的 sketch，就必须彻底理解它们。

CMS.INITBYPROB key error probability

1. 误差

error 参数将决定 sketch 的宽度 w，而 probability 将决定哈希函数的数量（深度 d）。我们选择的错误率将决定我们可以信任 sketch 结果的阈值。相关性如下

threshold = error * total_count

或

error = threshold/total_count

其中 total_count 是所有元素计数的总和，可以通过 CMS.INFO 命令结果的 count 键获取，并且显然是动态的——它会随着 sketch 中每一次新的增量而变化。在创建时，您可以将 total_count 比率近似为 sketch 中预期的平均计数和平均元素数量的乘积。

由于阈值是过滤器中总计数的函数，因此非常重要的一点是，它会随着计数的增长而增长，但知道总计数，我们可以随时动态计算阈值。如果结果低于阈值，则可以丢弃。

2. 概率

在此数据结构中，probability 表示计数低于阈值的元素在所有 sketches/ depths 上与计数高于阈值的元素发生冲突的可能性，从而返回高频元素的最小计数而不是其自身的计数。

性能

在 CMS 中添加、更新和查询元素的时间复杂度为 O(1)。

学术资源

改进的数据流摘要：Count-Min Sketch 及其应用

参考

Count-Min Sketch：估计事物的艺术与科学

产品

工具

获取 Redis

连接

学习

最新

查看工作原理