Elasticsearch Aggregations

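What is aggregation analysis
A search engine answers questions such as:

  • Which orders have an address in Beijing?
  • Which orders were created within the last day but have not been paid?

Aggregations answer questions such as:

  • How many orders were completed on each day of the last week?
  • What was the average order amount on each day of the last month?
  • Which 5 products sold best over the last half year?

Aggregation is the feature that Elasticsearch (es) provides, besides search, for running statistical analysis over the data stored in es.

It is feature-rich: Bucket, Metric, Pipeline and other aggregation types cover most analysis needs.
It is close to real time: results are computed and returned with the response, whereas big-data systems such as Hadoop usually work on a T+1 (next-day) basis.

An aggregation is part of a _search request. size is generally set to 0 so that only the aggregation result is returned, without hits.
Aggregation syntax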

Example: count the movies for each year

request
GET movies/_search
{
"size": 0,
"aggs": {
"aggre_year": {
"terms": {
"field":"year"
}
}
}
}

Response:

{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 9743,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"aggre_year" : {
"doc_count_error_upper_bound" : 172,
"sum_other_doc_count" : 6333,
"buckets" : [
{
"key" : 0,
"doc_count" : 1078
},
{
"key" : 2015,
"doc_count" : 274
},
{
"key" : 2014,
"doc_count" : 271
},
{
"key" : 2002,
"doc_count" : 268
},
{
"key" : 2006,
"doc_count" : 265
},
{
"key" : 2007,
"doc_count" : 259
},
{
"key" : 1996,
"doc_count" : 252
},
{
"key" : 2000,
"doc_count" : 249
},
{
"key" : 2009,
"doc_count" : 248
},
{
"key" : 2005,
"doc_count" : 246
}
]
}
}
}
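
The response lists only the top buckets; the rest are summarized in sum_other_doc_count. A small sketch that checks this against the numbers above, assuming every movie carries exactly one year value (the helper name other_doc_count is made up for illustration):

```python
# Bucket counts copied from the response above (key -> doc_count).
buckets = {0: 1078, 2015: 274, 2014: 271, 2002: 268, 2006: 265,
           2007: 259, 1996: 252, 2000: 249, 2009: 248, 2005: 246}
total_hits = 9743            # hits.total.value in the response
sum_other_doc_count = 6333   # reported by the terms aggregation


def other_doc_count(total, returned_buckets):
    """Documents that fall outside the returned buckets."""
    return total - sum(returned_buckets.values())


print(other_doc_count(total_hits, buckets))  # 6333
```

The 10 returned buckets hold 3410 documents, so the remaining 6333 belong to years that did not make the top 10.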

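Several named aggregations can sit side by side under aggs, but a key placed directly under aggs is read as an aggregation name, so a sub-aggregation block cannot be declared there. It has to be nested inside the aggregation it belongs to; the filter section further down shows the correct form.
Incorrect example (the inner aggs is a sibling of age_filter instead of a child of it):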

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"age_filter": {
"filter": {
"range": {
"age": {
"gte": 10,
"lte": 20
}
}
}
},
"aggs": {
"max_age": {
"max": {
"field": "age"
}
}
}
}
}

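To make them easier to understand, es groups aggregations into the following 4 categories:

  • Bucket: bucketing, similar to GROUP BY in SQL
  • Metric: metric analysis, such as computing the maximum, minimum or average
  • Pipeline: pipeline analysis, which analyzes the results of a previous aggregation again
  • Matrix: matrix analysis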

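Metric aggregations

There are two main kinds:

Single-value analysis, which outputs one result

  • min, max, avg, sum
  • cardinality

Multi-value analysis, which outputs several results

  • stats, extended stats
  • percentile, percentile rank
  • top hits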

min, max, avg, sum

Several aggregations in one request:

request
GET movies/_search
{
"size": 0,
"aggs": {
"max_year": {
"max": {
"field":"year"
}
},
"min_year": {
"min": {
"field":"year"
}
},
"sum_year": {
"sum": {
"field":"year"
}
},
"avg_year": {
"avg": {
"field":"year"
}
}
}
}

Cardinality

cardinality is the cardinality of a set, i.e. the number of distinct values, similar to COUNT(DISTINCT ...) in SQL.

request
GET movies/_search
{
"size": 0,
"aggs": {
"count_of_year": {
"cardinality": {
"field": "year"
}
}
}
}
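
As a rough mental model, the value being estimated is simply the size of the set of values. The sketch below computes it exactly on a made-up sample; es itself uses an approximate algorithm for this aggregation (HyperLogLog++ according to the reference documentation), so on large data its result can differ slightly from the exact count.

```python
def exact_cardinality(values):
    """Exact distinct count, like SQL COUNT(DISTINCT ...)."""
    return len(set(values))


years = [1995, 1995, 2002, 2015, 2015, 2015]  # made-up sample values
print(exact_cardinality(years))  # 3
```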

Stats

stats returns a set of statistics for a numeric field: min, max, avg, sum and count.

request
GET movies/_search
{
"size": 0,
"aggs": {
"stats_of_year": {
"stats": {
"field": "year"
}
}
}
}

Response:

{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 9743,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"stats_of_year" : {
"count" : 9743,
"min" : 0.0,
"max" : 2018.0,
"avg" : 1772.895925279688,
"sum" : 1.7273325E7
}
}
}
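
The five numbers are related in the obvious way, which can be sketched in a few lines. The function below is an illustration of what stats reports, not how es computes it; the sample ages are the ones that appear in the test_search_index responses later in this article.

```python
def stats(values):
    """Return the same five numbers the stats aggregation reports."""
    if not values:
        raise ValueError("stats needs at least one value")
    total = float(sum(values))
    return {
        "count": len(values),
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": total / len(values),
        "sum": total,
    }


ages = [18, 22, 23, 27, 28, 28]
print(stats(ages))

# Cross-check against the response above: avg equals sum / count.
print(1.7273325E7 / 9743)  # about 1772.8959
```

The min of 0.0 lines up with the 1078 documents in the year 0 bucket of the earlier terms response, which is what pulls avg down to about 1772.9.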

extended_stats returns more statistics and is used in the same way as stats.

Percentile

percentiles computes percentile statistics.

request
GET movies/_search
{
"size": 0,
"aggs": {
"per_of_year": {
"percentiles": {
"field": "year",
"percents": [
1,
5,
25,
50,
75,
95,
99
]
}
}
}
}

percentile_ranks works in the opposite direction: for each given value it reports the percentage of values at or below it.

request
GET movies/_search
{
"size": 0,
"aggs": {
"per_of_year": {
"percentile_ranks": {
"field": "year",
"values": [2017,2018]
}
}
}
}
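
To make the two directions concrete, here is an exact version of both calculations on a made-up sample. This is a sketch under assumptions: it uses linear interpolation between ranks, while es computes percentiles with an approximate algorithm, so real results on large data may differ slightly.

```python
def percentile(values, p):
    """p-th percentile (0-100), linear interpolation between ranks."""
    data = sorted(values)
    if not data:
        raise ValueError("no values")
    pos = (len(data) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (data[upper] - data[lower]) * (pos - lower)


def percentile_rank(values, v):
    """Percentage of values that are less than or equal to v."""
    return 100.0 * sum(1 for x in values if x <= v) / len(values)


years = [1990, 1995, 2000, 2005, 2010]  # made-up sample
print(percentile(years, 50))         # 2000.0
print(percentile_rank(years, 2000))  # 60.0
```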

Top Hits

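top_hits is generally used after bucketing to fetch the top matching documents inside each bucket, i.e. the detail records.

The example below returns the first 10 documents sorted by year in ascending order: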

request
GET movies/_search
{
"size": 0,
"aggs": {
"top_year": {
"top_hits": {
"size": 10,
"sort": {
"year": "asc"
}
}
}
}
}

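Bucket aggregations

A bucket aggregation assigns documents to different buckets according to some rule, so that they can be analyzed by category.
Depending on the bucketing strategy, the common bucket aggregations are: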

  • Terms
  • Range
  • Date Range
  • Histogram
  • Date Histogram

Terms

This is the simplest strategy: documents are bucketed directly by term. For a text field, the buckets are the terms produced by analysis.

request
GET movies/_search
{
"size": 0,
"aggs": {
"aggre_title": {
"terms": {
"field":"title.keyword"
}
}
}
}

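The request above uses the title.keyword sub-field. To run terms directly on a text field, fielddata has to be set to true in that field's mapping first.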

Range

Buckets are defined by numeric ranges.

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"age_range": {
"range": {
"field": "age",
"ranges": [
{
"to": 18
},
{
"from": 18,
"to": 20
},
{
"from": 20
}
]
}
}
}
}
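
Each range includes its from value and excludes its to value, so a document with age 18 lands in the second bucket, not the first. A minimal sketch of that rule (the helper names are made up; the key format follows the "*-18.0" style that es prints in range responses):

```python
def key_for(r):
    """Build a bucket key such as '*-18.0' or '18.0-20.0'."""
    lo = "*" if "from" not in r else str(float(r["from"]))
    hi = "*" if "to" not in r else str(float(r["to"]))
    return f"{lo}-{hi}"


def matching_ranges(value, ranges):
    """Keys of all ranges containing value: from inclusive, to exclusive."""
    return [
        key_for(r)
        for r in ranges
        if r.get("from", float("-inf")) <= value < r.get("to", float("inf"))
    ]


ranges = [{"to": 18}, {"from": 18, "to": 20}, {"from": 20}]
print(matching_ranges(18, ranges))  # ['18.0-20.0']
```

The function returns a list because ranges are allowed to overlap, in which case one document is counted in several buckets.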

Date Range

Buckets are defined by date ranges.

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"birth_range": {
"date_range": {
"field": "birth",
"format": "yyyy",
"ranges": [
{
"to": "1980"
},
{
"from": "1980",
"to": "1990"
},
{
"from": "1990"
}
]
}
}
}
}

Setting key inside an entry of ranges customizes the key shown for that bucket in the response.

Histogram

request
GET movies/_search
{
"size": 0,
"aggs": {
"year_hist": {
"histogram": {
"field": "year",
"interval": 100,
"extended_bounds": {
"min": 0,
"max": 2019
}
}
}
}
}

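interval sets the bucket width; extended_bounds sets the range the histogram has to cover, so empty buckets inside that range are returned as well.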
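
The bucket a value falls into follows the rounding formula given in the histogram reference documentation. The sketch below applies it to the request above; bucket_key and keys_for_bounds are names made up for this illustration.

```python
import math


def bucket_key(value, interval, offset=0):
    """Round value down to the start of its histogram bucket."""
    return math.floor((value - offset) / interval) * interval + offset


def keys_for_bounds(min_bound, max_bound, interval):
    """All bucket keys needed to cover the extended bounds."""
    first = bucket_key(min_bound, interval)
    last = bucket_key(max_bound, interval)
    return list(range(first, last + 1, interval))


print(bucket_key(2015, 100))                # 2000
print(len(keys_for_bounds(0, 2019, 100)))   # 21
```

With interval 100 and bounds 0 to 2019, the histogram therefore spans 21 buckets, from key 0 to key 2000.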

Date Histogram

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"year_hist": {
"date_histogram": {
"field": "birth",
"calendar_interval": "1y"
}
}
}
}

The interval is given as either calendar_interval or fixed_interval.
See the Date Histogram reference documentation for details.

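Combining bucket and metric aggregations

A bucket aggregation can carry sub-aggregations that analyze each bucket further. A sub-aggregation can be either a Bucket or a Metric aggregation, which is what makes es aggregations so powerful.

Bucketing inside buckets: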

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword",
"size": 10
},
"aggs": {
"age_range": {
"range": {
"field": "age",
"ranges": [
{
"to": 18
},
{
"from": 18,
"to": 25
},
{
"from": 25
}
]
}
}
}
}
}
}

Response:

"aggregations" : {
"jobs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "java and ruby engineer",
"doc_count" : 1,
"age_range" : {
"buckets" : [
{
"key" : "*-18.0",
"to" : 18.0,
"doc_count" : 0
},
{
"key" : "18.0-25.0",
"from" : 18.0,
"to" : 25.0,
"doc_count" : 1
},
{
"key" : "25.0-*",
"from" : 25.0,
"doc_count" : 0
}
]
}
},
{
"key" : "java engineer",
"doc_count" : 1,
"age_range" : {
"buckets" : [
{
"key" : "*-18.0",
"to" : 18.0,
"doc_count" : 0
},
{
"key" : "18.0-25.0",
"from" : 18.0,
"to" : 25.0,
"doc_count" : 1
},
{
"key" : "25.0-*",
"from" : 25.0,
"doc_count" : 0
}
]
}
},
{
"key" : "java senior engineer and java specialist",
"doc_count" : 1,
"age_range" : {
"buckets" : [
{
"key" : "*-18.0",
"to" : 18.0,
"doc_count" : 0
},
{
"key" : "18.0-25.0",
"from" : 18.0,
"to" : 25.0,
"doc_count" : 0
},
{
"key" : "25.0-*",
"from" : 25.0,
"doc_count" : 1
}
]
}
},
{
"key" : "php engineer",
"doc_count" : 1,
"age_range" : {
"buckets" : [
{
"key" : "*-18.0",
"to" : 18.0,
"doc_count" : 0
},
{
"key" : "18.0-25.0",
"from" : 18.0,
"to" : 25.0,
"doc_count" : 0
},
{
"key" : "25.0-*",
"from" : 25.0,
"doc_count" : 1
}
]
}
},
{
"key" : "ruby engineer",
"doc_count" : 1,
"age_range" : {
"buckets" : [
{
"key" : "*-18.0",
"to" : 18.0,
"doc_count" : 0
},
{
"key" : "18.0-25.0",
"from" : 18.0,
"to" : 25.0,
"doc_count" : 1
},
{
"key" : "25.0-*",
"from" : 25.0,
"doc_count" : 0
}
]
}
}
]
}
}

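Pipeline aggregations

Concept: a pipeline aggregation runs on the results of other aggregations.

Its result is written into the existing aggregation result. There are two kinds, depending on where the result is placed:

  • Sibling: the result sits at the same level as the existing aggregation result
    • max, min, avg, sum bucket
    • stats, extended_stats bucket
    • percentiles bucket
  • Parent: the result is embedded inside the existing aggregation result
    • Derivative
    • Moving fn (moving window function)
    • Cumulative Sum

Sibling Demo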

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword",
"size": 10
},
"aggs": {
"age_avg": {
"avg": {
"field": "age"
}
}
}
},
"min_age": {
"min_bucket": {
"buckets_path": "jobs>age_avg"
}
}
}
}

Response:

"aggregations" : {
"jobs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "php engineer",
"doc_count" : 2,
"age_avg" : {
"value" : 27.5
}
},
{
"key" : "java and ruby engineer",
"doc_count" : 1,
"age_avg" : {
"value" : 22.0
}
},
{
"key" : "java engineer",
"doc_count" : 1,
"age_avg" : {
"value" : 18.0
}
},
{
"key" : "java senior engineer and java specialist",
"doc_count" : 1,
"age_avg" : {
"value" : 28.0
}
},
{
"key" : "ruby engineer",
"doc_count" : 1,
"age_avg" : {
"value" : 23.0
}
}
]
},
"min_age" : {
"value" : 18.0,
"keys" : [
"java engineer"
]
}
}
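
What min_bucket does with the path jobs>age_avg can be reproduced from the bucket values in the response. A minimal sketch (the function is an illustration, not es code):

```python
# age_avg per jobs bucket, copied from the response above.
avg_by_job = {
    "php engineer": 27.5,
    "java and ruby engineer": 22.0,
    "java engineer": 18.0,
    "java senior engineer and java specialist": 28.0,
    "ruby engineer": 23.0,
}


def min_bucket(values_by_key):
    """Smallest metric value across buckets, plus the keys that hold it."""
    lowest = min(values_by_key.values())
    keys = [key for key, value in values_by_key.items() if value == lowest]
    return {"value": lowest, "keys": keys}


print(min_bucket(avg_by_job))  # {'value': 18.0, 'keys': ['java engineer']}
```

The result sits next to jobs in the response rather than inside it, which is what makes this a sibling pipeline aggregation.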

Parent Demo

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"births": {
"date_histogram": {
"field": "birth",
"calendar_interval": "1y"
},
"aggs": {
"age_avg": {
"avg": {
"field": "age"
}
},
"mavg_age": {
"moving_fn": {
"buckets_path": "age_avg",
"window": 10,
"script": "MovingFunctions.min(values)"
}
}
}
}
}
}

Response:

"aggregations" : {
"births" : {
"buckets" : [
{
"key_as_string" : "1980-01-01T00:00:00.000Z",
"key" : 315532800000,
"doc_count" : 1,
"age_avg" : {
"value" : 28.0
},
"mavg_age" : {
"value" : null
}
},
{
"key_as_string" : "1981-01-01T00:00:00.000Z",
"key" : 347155200000,
"doc_count" : 0,
"age_avg" : {
"value" : null
}
},
{
"key_as_string" : "1982-01-01T00:00:00.000Z",
"key" : 378691200000,
"doc_count" : 0,
"age_avg" : {
"value" : null
}
},
{
"key_as_string" : "1983-01-01T00:00:00.000Z",
"key" : 410227200000,
"doc_count" : 0,
"age_avg" : {
"value" : null
}
},
{
"key_as_string" : "1984-01-01T00:00:00.000Z",
"key" : 441763200000,
"doc_count" : 0,
"age_avg" : {
"value" : null
}
},
{
"key_as_string" : "1985-01-01T00:00:00.000Z",
"key" : 473385600000,
"doc_count" : 1,
"age_avg" : {
"value" : 22.0
},
"mavg_age" : {
"value" : 28.0
}
},
{
"key_as_string" : "1986-01-01T00:00:00.000Z",
"key" : 504921600000,
"doc_count" : 0,
"age_avg" : {
"value" : null
}
},
{
"key_as_string" : "1987-01-01T00:00:00.000Z",
"key" : 536457600000,
"doc_count" : 0,
"age_avg" : {
"value" : null
}
},
{
"key_as_string" : "1988-01-01T00:00:00.000Z",
"key" : 567993600000,
"doc_count" : 0,
"age_avg" : {
"value" : null
}
},
{
"key_as_string" : "1989-01-01T00:00:00.000Z",
"key" : 599616000000,
"doc_count" : 1,
"age_avg" : {
"value" : 23.0
},
"mavg_age" : {
"value" : 22.0
}
},
{
"key_as_string" : "1990-01-01T00:00:00.000Z",
"key" : 631152000000,
"doc_count" : 1,
"age_avg" : {
"value" : 18.0
},
"mavg_age" : {
"value" : 22.0
}
},
{
"key_as_string" : "1991-01-01T00:00:00.000Z",
"key" : 662688000000,
"doc_count" : 0,
"age_avg" : {
"value" : null
}
},
{
"key_as_string" : "1992-01-01T00:00:00.000Z",
"key" : 694224000000,
"doc_count" : 1,
"age_avg" : {
"value" : 28.0
},
"mavg_age" : {
"value" : 18.0
}
},
{
"key_as_string" : "1993-01-01T00:00:00.000Z",
"key" : 725846400000,
"doc_count" : 1,
"age_avg" : {
"value" : 27.0
},
"mavg_age" : {
"value" : 18.0
}
}
]
}
}
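
The mavg_age values can be reproduced with a short simulation. It rests on two assumptions read off the response rather than taken from the es source: the window holds only earlier buckets (never the current one), and empty buckets are skipped, producing no output and adding nothing to the window.

```python
from collections import deque

# age_avg per yearly bucket, copied from the response (None = empty bucket).
age_avg = {
    1980: 28.0, 1981: None, 1982: None, 1983: None, 1984: None,
    1985: 22.0, 1986: None, 1987: None, 1988: None, 1989: 23.0,
    1990: 18.0, 1991: None, 1992: 28.0, 1993: 27.0,
}


def moving_min(series, window):
    """Minimum over the previous `window` non-empty values of the series."""
    recent = deque(maxlen=window)  # earlier non-empty values only
    result = {}
    for key, value in series.items():
        if value is None:          # empty bucket: skipped entirely
            continue
        result[key] = min(recent) if recent else None
        recent.append(value)
    return result


print(moving_min(age_avg, window=10))
# {1980: None, 1985: 28.0, 1989: 22.0, 1990: 22.0, 1992: 18.0, 1993: 18.0}
```

The first bucket gets null because its window is still empty, which matches the response.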

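Scope

By default an es aggregation runs on the result set of the query. The scope can be changed in the following ways:

  • filter
  • post_filter
  • global

By default, aggs only sees the documents matched by query: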

request
GET test_search_index/_search
{
"query": {
"match": {
"username": {
"query": "alfred way",
"operator": "and"
}
}
},
"aggs": {
"max_year": {
"max": {
"field": "age"
}
}
}
}

filter

Sets a filter for one specific aggregation, which changes its scope without modifying the overall query.

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"filter_age": {
"filter": {
"range": {
"age": {
"gte": 10,
"lte": 20
}
}
},
"aggs": {
"max_age": {
"max": {
"field": "age"
}
}
}
}
}
}

Response:

"aggregations" : {
"filter_age" : {
"doc_count" : 1,
"max_age" : {
"value" : 18.0
}
}
}

post_filter

Filters the documents, but takes effect only after the aggregations have been computed.

request
GET test_search_index/_search
{
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword"
}
}
},
"post_filter": {
"match": {
"job.keyword": "php engineer"
}
}
}

Response:

{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test_search_index",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"username" : "way",
"job" : "php engineer",
"age" : 27,
"birth" : "1993-08-07",
"isMarried" : false
}
},
{
"_index" : "test_search_index",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.0,
"_source" : {
"username" : "jun way",
"job" : "php engineer",
"age" : 28,
"birth" : "1992-08-07",
"isMarried" : false
}
}
]
},
"aggregations" : {
"jobs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "php engineer",
"doc_count" : 2
},
{
"key" : "java and ruby engineer",
"doc_count" : 1
},
{
"key" : "java engineer",
"doc_count" : 1
},
{
"key" : "java senior engineer and java specialist",
"doc_count" : 1
},
{
"key" : "ruby engineer",
"doc_count" : 1
}
]
}
}
}

post_filter only filters the hits; the aggregation still covers every job.
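
The order of operations can be sketched as follows. The six documents are reduced to the job and age values visible in the responses of this article; the search function is a made-up stand-in for the request, not an es API.

```python
from collections import Counter

docs = [
    {"job": "php engineer", "age": 27},
    {"job": "php engineer", "age": 28},
    {"job": "java and ruby engineer", "age": 22},
    {"job": "java engineer", "age": 18},
    {"job": "java senior engineer and java specialist", "age": 28},
    {"job": "ruby engineer", "age": 23},
]


def search(documents, post_filter):
    """Aggregate first, then apply post_filter to the hits only."""
    jobs = Counter(doc["job"] for doc in documents)          # sees all docs
    hits = [doc for doc in documents if post_filter(doc)]    # filtered view
    return {"hits": hits, "aggregations": {"jobs": dict(jobs)}}


result = search(docs, lambda doc: doc["job"] == "php engineer")
print(len(result["hits"]))                   # 2
print(len(result["aggregations"]["jobs"]))   # 5
```

This is the pattern behind faceted navigation: the result list narrows to the selected job while the facet counts for the other jobs stay visible.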

global

Ignores the query conditions and analyzes all documents.

request
GET test_search_index/_search
{
"query": {
"match": {
"username": {
"query": "alfred way",
"operator": "and"
}
}
},
"aggs": {
"max_year": {
"max": {
"field": "age"
}
},
"all": {
"global": {},
"aggs": {
"all_max_year": {
"max": {
"field": "age"
}
}
}
}
}
}

Response:

"aggregations" : {
"all" : {
"doc_count" : 6,
"all_max_year" : {
"value" : 28.0
}
},
"max_year" : {
"value" : 23.0
}
}

Use case: comparing a subset against the whole.

Sorting

Buckets can be sorted by built-in keys, for example:

  • _count: the number of documents
  • _key: the bucket key

Sorting by _count, then by _key

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"job_count": {
"terms": {
"size": 10,
"field": "job.keyword",
"order": [
{"_count": "asc"},
{"_key": "desc"}
]
}
}
}
}
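
The order array acts as a multi-level sort: _count ascending first, with _key descending breaking ties. A sketch of the resulting order, using the job buckets that appear in the responses of this article (the function is illustrative only):

```python
buckets = [
    ("php engineer", 2),
    ("java and ruby engineer", 1),
    ("java engineer", 1),
    ("java senior engineer and java specialist", 1),
    ("ruby engineer", 1),
]


def order_buckets(items):
    """Sort by doc count ascending, ties broken by key descending."""
    # Python's sort is stable, so sort by the secondary criterion first.
    by_key_desc = sorted(items, key=lambda b: b[0], reverse=True)
    return sorted(by_key_desc, key=lambda b: b[1])


for key, count in order_buckets(buckets):
    print(count, key)
```

The four single-document buckets come first in reverse alphabetical order, and php engineer with 2 documents comes last.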

Sorting by a JSON object value

For a multi-value aggregation such as stats, use . to select which value to sort by.
Demo1:

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"job_count": {
"terms": {
"size": 10,
"field": "job.keyword",
"order": [
{"age_stats.sum": "asc"}
]
},
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}
}
}

Response:

"aggregations" : {
"job_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "java engineer",
"doc_count" : 1,
"age_stats" : {
"count" : 1,
"min" : 18.0,
"max" : 18.0,
"avg" : 18.0,
"sum" : 18.0
}
},
{
"key" : "java and ruby engineer",
"doc_count" : 1,
"age_stats" : {
"count" : 1,
"min" : 22.0,
"max" : 22.0,
"avg" : 22.0,
"sum" : 22.0
}
},
{
"key" : "ruby engineer",
"doc_count" : 1,
"age_stats" : {
"count" : 1,
"min" : 23.0,
"max" : 23.0,
"avg" : 23.0,
"sum" : 23.0
}
},
{
"key" : "java senior engineer and java specialist",
"doc_count" : 1,
"age_stats" : {
"count" : 1,
"min" : 28.0,
"max" : 28.0,
"avg" : 28.0,
"sum" : 28.0
}
},
{
"key" : "php engineer",
"doc_count" : 2,
"age_stats" : {
"count" : 2,
"min" : 27.0,
"max" : 28.0,
"avg" : 27.5,
"sum" : 55.0
}
}
]
}
}

Demo2:

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"job_count": {
"terms": {
"size": 10,
"field": "job.keyword",
"order": [
{"age_avg.value": "asc"}
]
},
"aggs": {
"age_avg": {
"avg": {
"field": "age"
}
}
}
}
}
}

Response:

"aggregations" : {
"job_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "java engineer",
"doc_count" : 1,
"age_avg" : {
"value" : 18.0
}
},
{
"key" : "java and ruby engineer",
"doc_count" : 1,
"age_avg" : {
"value" : 22.0
}
},
{
"key" : "ruby engineer",
"doc_count" : 1,
"age_avg" : {
"value" : 23.0
}
},
{
"key" : "php engineer",
"doc_count" : 2,
"age_avg" : {
"value" : 27.5
}
},
{
"key" : "java senior engineer and java specialist",
"doc_count" : 1,
"age_avg" : {
"value" : 28.0
}
}
]
}
}

Nested aggregations (the > in the order path steps into a sub-aggregation):

request
GET test_search_index/_search
{
"size": 0,
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword",
"size": 10,
"order": {
"age_filter>age_avg": "desc"
}
},
"aggs": {
"age_filter": {
"filter": {
"range": {
"age": {
"gte": 10,
"lte": 20
}
}
},
"aggs": {
"age_avg": {
"avg": {
"field": "age"
}
}
}
}
}
}
}
}

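Accuracy of aggregations

How a min aggregation executes

Each shard returns its own minimum and the coordinating node takes the minimum of those, so min has no accuracy problem.

How a terms aggregation executes

The data is spread over several shards and each shard returns only its own top terms, so the Coordinating Node never sees the full picture and the merged result can be inaccurate.

Fixing inaccurate terms results

  • Set the number of shards to 1. This removes the data distribution problem but cannot handle large data volumes.
  • Choose a sensible shard_size, i.e. fetch extra terms from each shard, to improve accuracy.

request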
GET test_search_index/_search
{
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword",
"size": 10,
"shard_size": 10
}
}
}
}
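
The effect of shard_size is easiest to see in a toy simulation. The two shards and their term counts below are invented for illustration, and terms_agg is a simplified model of the merge step, not the es implementation.

```python
from collections import Counter


def top_terms(counts, n):
    """Top n (term, count) pairs; ties broken by term for determinism."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def terms_agg(shards, size, shard_size):
    """Each shard reports its top shard_size terms; the coordinator merges."""
    merged = Counter()
    for shard in shards:
        for term, count in top_terms(shard, shard_size):
            merged[term] += count
    return top_terms(merged, size)


shards = [
    {"A": 6, "B": 4, "C": 4, "D": 3},
    {"A": 6, "B": 2, "C": 1, "D": 3},
]
print(terms_agg(shards, size=3, shard_size=3))  # [('A', 12), ('B', 6), ('C', 4)]
print(terms_agg(shards, size=3, shard_size=4))  # [('A', 12), ('B', 6), ('D', 6)]
```

With shard_size 3, D (true count 6) is dropped because neither shard ranks it high enough for its full count to arrive, and C is reported as 4 instead of 5. Fetching one more term per shard gives the exact top 3.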

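Choosing shard_size
The response of a terms aggregation contains two statistics:

  • doc_count_error_upper_bound: the largest possible doc count of a term that was left out
  • sum_other_doc_count: the total number of documents belonging to terms outside the returned buckets

Setting show_term_doc_count_error shows the maximum possible counting error for each bucket: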

request
GET test_search_index/_search
{
"aggs": {
"jobs": {
"terms": {
"field": "job.keyword",
"size": 10,
"show_term_doc_count_error": true
}
}
}
}

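The default shard_size is:

  • shard_size = (size x 1.5) + 10
  • Increasing shard_size lowers doc_count_error_upper_bound and so improves accuracy
    • It also increases the total amount of computation, which lengthens the response time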
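
The two quantities can be written down as small helpers. The default follows the formula above. The error bound follows my reading of the terms aggregation reference (the sum, over shards, of the doc count of the last term each shard returned), so treat that part as an assumption; the per-shard counts in the example are made up.

```python
def default_shard_size(size):
    """Default number of terms requested from each shard."""
    return int(size * 1.5 + 10)


def doc_count_error_upper_bound(last_returned_counts):
    """Worst case doc count of a term that no shard reported.

    last_returned_counts: doc count of the last (smallest) term that
    each shard included in its reply.
    """
    return sum(last_returned_counts)


print(default_shard_size(10))               # 25
print(doc_count_error_upper_bound([4, 2]))  # 6
```

For size 10 the default is already 25, so the earlier request that sets shard_size to 10 asks each shard for fewer terms than the default would.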

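In es aggregations, Cardinality and Percentile use approximate algorithms:

  • The results are close to exact but not guaranteed to be exact
  • Parameters can be tuned to make the results more precise, at the cost of more computation time and higher resource usage

Official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-aggregations.html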
