Why is sampling important and what do you need to be aware of?
When dealing with very large amounts of data, you probably want to run your
queries against a smaller subset of your current tables, especially if your dataset
does not fit in RAM.
MergeTree is the first and most advanced engine in ClickHouse that you will want to try.
It supports indexing by primary key, and it is mandatory to have a column of Date
type (used for automatic partitioning).
It is the only engine that supports sampling, and only if the sampling expression was defined
at table creation. So, the rule of thumb is: if the dataset does not fit in RAM, you will prefer to
create the table with sampling support. Otherwise, there is no performance gain from using sampling
on relatively small tables that fit in RAM.
The sampling expression applies a hash function to a chosen column in order to generate
pseudo-random data over the columns defined in the primary key. You can then use this
feature by accessing the data with the SAMPLE clause in the query.
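As a quick sketch of the clause (using the distributed table that appears later in this article), reading an approximate 10% of the data looks like this:

```sql
-- Read roughly 10% of the rows; this only works if the table was
-- created with a SAMPLE BY expression.
SELECT count(*)
FROM database_report.stats_table_distributed
SAMPLE 1 / 10
```

The coefficient is a fraction between 0 and 1; ClickHouse translates it into a range over the hashed sampling column, so repeated queries with the same coefficient read the same deterministic subset.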
Values of aggregate functions are not corrected automatically, so to get an approximate result
the value of count() has to be multiplied manually by the sample factor. For instance, a sample
of 0.1 (10%) needs to be multiplied by 10, and 0.2 by 5.
Suppose we have 96MM rows in a distributed table, split into 2 shards:
If you use SAMPLE with an absolute row count (a value greater than 1), you'll probably get
inaccurate results, especially when executing over a distributed table. In the example below
you can see that the SAMPLE is applied to each local table and the results are aggregated
afterwards (there are 2 shards):
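A sketch of this absolute-count form, reusing the distributed table from this article (the row count is illustrative):

```sql
-- SAMPLE n with n > 1 asks for at least ~n rows, but the clause is
-- applied independently on each shard's local table, so with 2 shards
-- the merged total is roughly 2 * n and hard to extrapolate reliably.
SELECT count(*)
FROM database_report.stats_table_distributed
SAMPLE 10000000
```

This is why the relative coefficient form shown next behaves more predictably on distributed setups.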
Instead, by using the relative coefficient format, the aggregations are more accurate and consistent in terms of total rows
gathered, although you'll need to correct the estimate according to the coefficient:
SELECT count(*) AS count_over_sample,      -- Without fixing, we have x10 less rows
       count(*) * 10 AS count_estimated    -- By 10 as we are sampling 10% of the table
FROM database_report.stats_table_distributed
SAMPLE 1 / 10

┌─count_over_sample─┬─count_estimated─┐
│           9641965 │        96419650 │
└───────────────────┴─────────────────┘

1 rows in set. Elapsed: 1.442 sec. Processed 96.41 million rows, 4.15 GB (66.84 million rows/s., 2.87 GB/s.)
The execution path of a sampled query can be seen in the following animation:
Hashing functions for sampling Ints and Strings
Several hashing functions are available (intHash32 for integers and cityHash64 for strings),
but you should stick with non-cryptographic ones so as not to hurt performance.
Example without sampling support: MergeTree(EventDate, (CounterID, EventDate), 8192)
Example with sampling support: MergeTree(EventDate, intHash32(UserID), (CounterID, EventDate, intHash32(UserID)), 8192)
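The two legacy declarations above can also be written with the current MergeTree syntax; a sketch, assuming the same column names and types typical of the classic hits example:

```sql
-- Without sampling support
CREATE TABLE hits
(
    EventDate Date,
    CounterID UInt32,
    UserID    UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate)
SETTINGS index_granularity = 8192;

-- With sampling support: the sampling expression must appear
-- in the primary key (here, the ORDER BY) as well.
CREATE TABLE hits_sampled
(
    EventDate Date,
    CounterID UInt32,
    UserID    UInt64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192;
```

Note that in the modern syntax the Date column is no longer mandatory; partitioning is explicit via PARTITION BY.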
The examples in this article use cityHash64, as the id is a String. The distribution
is also random, in order to guarantee parallelization of the queries:
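A minimal sketch of a table sampled over a String id (the table and column names here are assumptions for illustration):

```sql
-- cityHash64 maps the String id to a UInt64, which SAMPLE BY requires;
-- the expression must also be part of the sorting key.
CREATE TABLE events
(
    date Date,
    id   String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (date, cityHash64(id))
SAMPLE BY cityHash64(id);
```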
If the dataset is smaller than the amount of RAM, sampling won't help in terms of performance.
Below is an example of a bigger result set, with and without sampling.