Skip to main content
Skip to main content

JSON Data Type

Beta feature. Learn more.

The JSON type stores JavaScript Object Notation (JSON) documents in a single column.

Note

This feature is in beta and is not yet production-ready. If you need to work with JSON documents, consider using this guide instead.

If you want to use the JSON type, and for the examples on this page, please use:

SET enable_json_type = 1

To declare a column of JSON type, you can use the following syntax:

<column_name> JSON
(
max_dynamic_paths=N,
max_dynamic_types=M,
some.path TypeName,
SKIP path.to.skip,
SKIP REGEXP 'paths_regexp'
)

Where the parameters in the syntax above are defined as:

ParameterDescriptionDefault Value
max_dynamic_pathsAn optional parameter indicating how many paths can be stored separately as sub-columns across single block of data that is stored separately (for example across single data part for MergeTree table).

If this limit is exceeded, all other paths will be stored together in a single structure.
1024
max_dynamic_typesAn optional parameter between 1 and 255 indicating how many different data types can be stored inside a single path column with type Dynamic across single block of data that is stored separately (for example across single data part for MergeTree table).

If this limit is exceeded, all new types will be converted to type String.
32
some.path TypeNameAn optional type hint for particular path in the JSON. Such paths will be always stored as sub-columns with specified type.
SKIP path.to.skipAn optional hint for particular path that should be skipped during JSON parsing. Such paths will never be stored in the JSON column. If specified path is a nested JSON object, the whole nested object will be skipped.
SKIP REGEXP 'path_regexp'An optional hint with a regular expression that is used to skip paths during JSON parsing. All paths that match this regular expression will never be stored in the JSON column.

Creating JSON

In this section we'll take a look at the various ways that you can create JSON.

Using JSON in a table column definition

Query (Example 1)
CREATE TABLE test (json JSON) ENGINE = Memory;
INSERT INTO test VALUES ('{"a" : {"b" : 42}, "c" : [1, 2, 3]}'), ('{"f" : "Hello, World!"}'), ('{"a" : {"b" : 43, "e" : 10}, "c" : [4, 5, 6]}');
SELECT json FROM test;
Response (Example 1)
┌─json────────────────────────────────────────┐
│ {"a":{"b":"42"},"c":["1","2","3"]} │
│ {"f":"Hello, World!"} │
│ {"a":{"b":"43","e":"10"},"c":["4","5","6"]} │
└─────────────────────────────────────────────┘
Query (Example 2)
CREATE TABLE test (json JSON(a.b UInt32, SKIP a.e)) ENGINE = Memory;
INSERT INTO test VALUES ('{"a" : {"b" : 42}, "c" : [1, 2, 3]}'), ('{"f" : "Hello, World!"}'), ('{"a" : {"b" : 43, "e" : 10}, "c" : [4, 5, 6]}');
SELECT json FROM test;
Response (Example 2)
┌─json──────────────────────────────┐
│ {"a":{"b":42},"c":[1,2,3]} │
│ {"a":{"b":0},"f":"Hello, World!"} │
│ {"a":{"b":43},"c":[4,5,6]} │
└───────────────────────────────────┘

Using CAST with ::JSON

It is possible to cast various types using the special syntax ::JSON.

CAST from String to JSON

Query
SELECT '{"a" : {"b" : 42},"c" : [1, 2, 3], "d" : "Hello, World!"}'::JSON AS json;
Response
┌─json───────────────────────────────────────────┐
│ {"a":{"b":42},"c":[1,2,3],"d":"Hello, World!"} │
└────────────────────────────────────────────────┘

CAST from Tuple to JSON

Query
SET enable_named_columns_in_function_tuple = 1;
SELECT (tuple(42 AS b) AS a, [1, 2, 3] AS c, 'Hello, World!' AS d)::JSON AS json;
Response
┌─json───────────────────────────────────────────┐
│ {"a":{"b":42},"c":[1,2,3],"d":"Hello, World!"} │
└────────────────────────────────────────────────┘

CAST from Map to JSON

Query
SET enable_variant_type=1, use_variant_as_common_type=1;
SELECT map('a', map('b', 42), 'c', [1,2,3], 'd', 'Hello, World!')::JSON AS json;
Response
┌─json───────────────────────────────────────────┐
│ {"a":{"b":42},"c":[1,2,3],"d":"Hello, World!"} │
└────────────────────────────────────────────────┘

CAST from deprecated Object('json') to JSON

Query
SET allow_experimental_object_type = 1;
SELECT '{"a" : {"b" : 42},"c" : [1, 2, 3], "d" : "Hello, World!"}'::Object('json')::JSON AS json;
Response
┌─json───────────────────────────────────────────┐
│ {"a":{"b":42},"c":[1,2,3],"d":"Hello, World!"} │
└────────────────────────────────────────────────┘
Note

JSON paths are stored flattened. This means that when a JSON object is formatted from a path like a.b.c it is not possible to know whether the object should be constructed as { "a.b.c" : ... } or { "a" " {"b" : {"c" : ... }}}. Our implementation will always assume the latter.

For example:

SELECT CAST('{"a.b.c" : 42}', 'JSON') as json

will return:

   ┌─json───────────────────┐
1. │ {"a":{"b":{"c":"42"}}} │
└────────────────────────┘

and not:

   ┌─json───────────┐
1. │ {"a.b.c":"42"} │
└────────────────┘

Reading JSON paths as sub-columns

The JSON type supports reading every path as a separate sub-column. If the type of the requested path is not specified in the JSON type declaration, then the sub column of the path will always have type Dynamic.

For example:

Query
CREATE TABLE test (json JSON(a.b UInt32, SKIP a.e)) ENGINE = Memory;
INSERT INTO test VALUES ('{"a" : {"b" : 42, "g" : 42.42}, "c" : [1, 2, 3], "d" : "2020-01-01"}'), ('{"f" : "Hello, World!", "d" : "2020-01-02"}'), ('{"a" : {"b" : 43, "e" : 10, "g" : 43.43}, "c" : [4, 5, 6]}');
SELECT json FROM test;
Response
┌─json──────────────────────────────────────────────────┐
│ {"a":{"b":42,"g":42.42},"c":[1,2,3],"d":"2020-01-01"} │
│ {"a":{"b":0},"d":"2020-01-02","f":"Hello, World!"} │
│ {"a":{"b":43,"g":43.43},"c":[4,5,6]} │
└───────────────────────────────────────────────────────┘
Query (Reading JSON paths as sub-columns)
SELECT json.a.b, json.a.g, json.c, json.d FROM test;
Response (Reading JSON paths as sub-columns)
┌─json.a.b─┬─json.a.g─┬─json.c──┬─json.d─────┐
│ 42 │ 42.42 │ [1,2,3] │ 2020-01-01 │
│ 0 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 2020-01-02 │
│ 43 │ 43.43 │ [4,5,6] │ ᴺᵁᴸᴸ │
└──────────┴──────────┴─────────┴────────────┘

If the requested path wasn't found in the data, it will be filled with NULL values:

Query
SELECT json.non.existing.path FROM test;
Response
┌─json.non.existing.path─┐
│ ᴺᵁᴸᴸ │
│ ᴺᵁᴸᴸ │
│ ᴺᵁᴸᴸ │
└────────────────────────┘

Let's check the data types of the returned sub-columns:

Query
SELECT toTypeName(json.a.b), toTypeName(json.a.g), toTypeName(json.c), toTypeName(json.d) FROM test;
Response
┌─toTypeName(json.a.b)─┬─toTypeName(json.a.g)─┬─toTypeName(json.c)─┬─toTypeName(json.d)─┐
│ UInt32 │ Dynamic │ Dynamic │ Dynamic │
│ UInt32 │ Dynamic │ Dynamic │ Dynamic │
│ UInt32 │ Dynamic │ Dynamic │ Dynamic │
└──────────────────────┴──────────────────────┴────────────────────┴────────────────────┘

As we can see, for a.b, the type is UInt32 as we specified it to be in the JSON type declaration, and for all other sub-columns the type is Dynamic.

It is also possible to read sub-columns of a Dynamic type using the special syntax json.some.path.:TypeName:

Query
SELECT
json.a.g.:Float64,
dynamicType(json.a.g),
json.d.:Date,
dynamicType(json.d)
FROM test
Response
┌─json.a.g.:`Float64`─┬─dynamicType(json.a.g)─┬─json.d.:`Date`─┬─dynamicType(json.d)─┐
│ 42.42 │ Float64 │ 2020-01-01 │ Date │
│ ᴺᵁᴸᴸ │ None │ 2020-01-02 │ Date │
│ 43.43 │ Float64 │ ᴺᵁᴸᴸ │ None │
└─────────────────────┴───────────────────────┴────────────────┴─────────────────────┘

Dynamic sub-columns can be cast to any data type. In this case an exception will be thrown if the internal type inside Dynamic cannot be cast to the requested type:

Query
SELECT json.a.g::UInt64 AS uint 
FROM test;
Response
┌─uint─┐
│ 42 │
│ 0 │
│ 43 │
└──────┘
Query
SELECT json.a.g::UUID AS float 
FROM test;
Response
Received exception from server:
Code: 48. DB::Exception: Received from localhost:9000. DB::Exception:
Conversion between numeric types and UUID is not supported.
Probably the passed UUID is unquoted:
while executing 'FUNCTION CAST(__table1.json.a.g :: 2, 'UUID'_String :: 1) -> CAST(__table1.json.a.g, 'UUID'_String) UUID : 0'.
(NOT_IMPLEMENTED)

Reading JSON sub-objects as sub-columns

The JSON type supports reading nested objects as sub-columns with type JSON using the special syntax json.^some.path:

Query
CREATE TABLE test (json JSON) ENGINE = Memory;
INSERT INTO test VALUES ('{"a" : {"b" : {"c" : 42, "g" : 42.42}}, "c" : [1, 2, 3], "d" : {"e" : {"f" : {"g" : "Hello, World", "h" : [1, 2, 3]}}}}'), ('{"f" : "Hello, World!", "d" : {"e" : {"f" : {"h" : [4, 5, 6]}}}}'), ('{"a" : {"b" : {"c" : 43, "e" : 10, "g" : 43.43}}, "c" : [4, 5, 6]}');
SELECT json FROM test;
Response
┌─json────────────────────────────────────────────────────────────────────────────────────────┐
│ {"a":{"b":{"c":42,"g":42.42}},"c":[1,2,3],"d":{"e":{"f":{"g":"Hello, World","h":[1,2,3]}}}} │
│ {"d":{"e":{"f":{"h":[4,5,6]}}},"f":"Hello, World!"} │
│ {"a":{"b":{"c":43,"e":10,"g":43.43}},"c":[4,5,6]} │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
Query
SELECT json.^a.b, json.^d.e.f FROM test;
Response
┌─json.^`a`.b───────────────┬─json.^`d`.e.f────────────────────┐
│ {"c":42,"g":42.42} │ {"g":"Hello, World","h":[1,2,3]} │
│ {} │ {"h":[4,5,6]} │
│ {"c":43,"e":10,"g":43.43} │ {} │
└───────────────────────────┴──────────────────────────────────┘
Note

Reading sub-objects as sub-columns may be inefficient, as this may require a near full scan of the JSON data.

Type inference for paths

During parsing of JSON, ClickHouse tries to detect the most appropriate data type for each JSON path. It works similarly to automatic schema inference from input data, and is controlled by the same settings:

Let's take a look at some examples:

Query
SELECT JSONAllPathsWithTypes('{"a" : "2020-01-01", "b" : "2020-01-01 10:00:00"}'::JSON) AS paths_with_types settings input_format_try_infer_dates=1, input_format_try_infer_datetimes=1;
Response
┌─paths_with_types─────────────────┐
│ {'a':'Date','b':'DateTime64(9)'} │
└──────────────────────────────────┘
Query
SELECT JSONAllPathsWithTypes('{"a" : "2020-01-01", "b" : "2020-01-01 10:00:00"}'::JSON) AS paths_with_types settings input_format_try_infer_dates=0, input_format_try_infer_datetimes=0;
Response
┌─paths_with_types────────────┐
│ {'a':'String','b':'String'} │
└─────────────────────────────┘
Query
SELECT JSONAllPathsWithTypes('{"a" : [1, 2, 3]}'::JSON) AS paths_with_types settings schema_inference_make_columns_nullable=1;
Response
┌─paths_with_types───────────────┐
│ {'a':'Array(Nullable(Int64))'} │
└────────────────────────────────┘
Query
SELECT JSONAllPathsWithTypes('{"a" : [1, 2, 3]}'::JSON) AS paths_with_types settings schema_inference_make_columns_nullable=0;
Response
┌─paths_with_types─────┐
│ {'a':'Array(Int64)'} │
└──────────────────────┘

Handling arrays of JSON objects

JSON paths that contain an array of objects are parsed as type Array(JSON) and inserted into a Dynamic column for the path. To read an array of objects, you can extract it from the Dynamic column as a sub-column:

Query
CREATE TABLE test (json JSON) ENGINE = Memory;
INSERT INTO test VALUES
('{"a" : {"b" : [{"c" : 42, "d" : "Hello", "f" : [[{"g" : 42.42}]], "k" : {"j" : 1000}}, {"c" : 43}, {"e" : [1, 2, 3], "d" : "My", "f" : [[{"g" : 43.43, "h" : "2020-01-01"}]], "k" : {"j" : 2000}}]}}'),
('{"a" : {"b" : [1, 2, 3]}}'),
('{"a" : {"b" : [{"c" : 44, "f" : [[{"h" : "2020-01-02"}]]}, {"e" : [4, 5, 6], "d" : "World", "f" : [[{"g" : 44.44}]], "k" : {"j" : 3000}}]}}');
SELECT json FROM test;
Response
┌─json────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ {"a":{"b":[{"c":"42","d":"Hello","f":[[{"g":42.42}]],"k":{"j":"1000"}},{"c":"43"},{"d":"My","e":["1","2","3"],"f":[[{"g":43.43,"h":"2020-01-01"}]],"k":{"j":"2000"}}]}} │
│ {"a":{"b":["1","2","3"]}} │
│ {"a":{"b":[{"c":"44","f":[[{"h":"2020-01-02"}]]},{"d":"World","e":["4","5","6"],"f":[[{"g":44.44}]],"k":{"j":"3000"}}]}} │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Query
SELECT json.a.b, dynamicType(json.a.b) FROM test;
Response
┌─json.a.b──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─dynamicType(json.a.b)────────────────────────────────────┐
│ ['{"c":"42","d":"Hello","f":[[{"g":42.42}]],"k":{"j":"1000"}}','{"c":"43"}','{"d":"My","e":["1","2","3"],"f":[[{"g":43.43,"h":"2020-01-01"}]],"k":{"j":"2000"}}'] │ Array(JSON(max_dynamic_types=16, max_dynamic_paths=256)) │
│ [1,2,3] │ Array(Nullable(Int64)) │
│ ['{"c":"44","f":[[{"h":"2020-01-02"}]]}','{"d":"World","e":["4","5","6"],"f":[[{"g":44.44}]],"k":{"j":"3000"}}'] │ Array(JSON(max_dynamic_types=16, max_dynamic_paths=256)) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────┘

As you may have noticed, the max_dynamic_types/max_dynamic_paths parameters of the nested JSON type got reduced compared to the default values. This is needed to avoid the number of sub-columns growing uncontrollably on nested arrays of JSON objects.

Let's try to read sub-columns from a nested JSON column:

Query
SELECT json.a.b.:`Array(JSON)`.c, json.a.b.:`Array(JSON)`.f, json.a.b.:`Array(JSON)`.d FROM test; 
Response
┌─json.a.b.:`Array(JSON)`.c─┬─json.a.b.:`Array(JSON)`.f───────────────────────────────────┬─json.a.b.:`Array(JSON)`.d─┐
│ [42,43,NULL] │ [[['{"g":42.42}']],NULL,[['{"g":43.43,"h":"2020-01-01"}']]] │ ['Hello',NULL,'My'] │
│ [] │ [] │ [] │
│ [44,NULL] │ [[['{"h":"2020-01-02"}']],[['{"g":44.44}']]] │ [NULL,'World'] │
└───────────────────────────┴─────────────────────────────────────────────────────────────┴───────────────────────────┘

We can avoid writing Array(JSON) sub-column names using a special syntax:

Query
SELECT json.a.b[].c, json.a.b[].f, json.a.b[].d FROM test;
Response
┌─json.a.b.:`Array(JSON)`.c─┬─json.a.b.:`Array(JSON)`.f───────────────────────────────────┬─json.a.b.:`Array(JSON)`.d─┐
│ [42,43,NULL] │ [[['{"g":42.42}']],NULL,[['{"g":43.43,"h":"2020-01-01"}']]] │ ['Hello',NULL,'My'] │
│ [] │ [] │ [] │
│ [44,NULL] │ [[['{"h":"2020-01-02"}']],[['{"g":44.44}']]] │ [NULL,'World'] │
└───────────────────────────┴─────────────────────────────────────────────────────────────┴───────────────────────────┘

The number of [] after the path indicates the array level. For example, json.path[][] will be transformed to json.path.:Array(Array(JSON))

Let's check the paths and types inside our Array(JSON):

Query
SELECT DISTINCT arrayJoin(JSONAllPathsWithTypes(arrayJoin(json.a.b[]))) FROM test;
Response
┌─arrayJoin(JSONAllPathsWithTypes(arrayJoin(json.a.b.:`Array(JSON)`)))──┐
│ ('c','Int64') │
│ ('d','String') │
│ ('f','Array(Array(JSON(max_dynamic_types=8, max_dynamic_paths=64)))') │
│ ('k.j','Int64') │
│ ('e','Array(Nullable(Int64))') │
└───────────────────────────────────────────────────────────────────────┘

Let's read sub-columns from an Array(JSON) column:

Query
SELECT json.a.b[].c.:Int64, json.a.b[].f[][].g.:Float64, json.a.b[].f[][].h.:Date FROM test;
Response
┌─json.a.b.:`Array(JSON)`.c.:`Int64`─┬─json.a.b.:`Array(JSON)`.f.:`Array(Array(JSON))`.g.:`Float64`─┬─json.a.b.:`Array(JSON)`.f.:`Array(Array(JSON))`.h.:`Date`─┐
│ [42,43,NULL] │ [[[42.42]],[],[[43.43]]] │ [[[NULL]],[],[['2020-01-01']]] │
│ [] │ [] │ [] │
│ [44,NULL] │ [[[NULL]],[[44.44]]] │ [[['2020-01-02']],[[NULL]]] │
└────────────────────────────────────┴──────────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘

We can also read sub-object sub-columns from a nested JSON column:

Query
SELECT json.a.b[].^k FROM test
Response
┌─json.a.b.:`Array(JSON)`.^`k`─────────┐
│ ['{"j":"1000"}','{}','{"j":"2000"}'] │
│ [] │
│ ['{}','{"j":"3000"}'] │
└──────────────────────────────────────┘

Reading JSON type from data

All text formats (JSONEachRow, TSV, CSV, CustomSeparated, Values, etc.) support reading the JSON type.

Examples:

Query
SELECT json FROM format(JSONEachRow, 'json JSON(a.b.c UInt32, SKIP a.b.d, SKIP d.e, SKIP REGEXP \'b.*\')', '
{"json" : {"a" : {"b" : {"c" : 1, "d" : [0, 1]}}, "b" : "2020-01-01", "c" : 42, "d" : {"e" : {"f" : ["s1", "s2"]}, "i" : [1, 2, 3]}}}
{"json" : {"a" : {"b" : {"c" : 2, "d" : [2, 3]}}, "b" : [1, 2, 3], "c" : null, "d" : {"e" : {"g" : 43}, "i" : [4, 5, 6]}}}
{"json" : {"a" : {"b" : {"c" : 3, "d" : [4, 5]}}, "b" : {"c" : 10}, "e" : "Hello, World!"}}
{"json" : {"a" : {"b" : {"c" : 4, "d" : [6, 7]}}, "c" : 43}}
{"json" : {"a" : {"b" : {"c" : 5, "d" : [8, 9]}}, "b" : {"c" : 11, "j" : [1, 2, 3]}, "d" : {"e" : {"f" : ["s3", "s4"], "g" : 44}, "h" : "2020-02-02 10:00:00"}}}
')
Response
┌─json──────────────────────────────────────────────────────────┐
│ {"a":{"b":{"c":1}},"c":"42","d":{"i":["1","2","3"]}} │
│ {"a":{"b":{"c":2}},"d":{"i":["4","5","6"]}} │
│ {"a":{"b":{"c":3}},"e":"Hello, World!"} │
│ {"a":{"b":{"c":4}},"c":"43"} │
│ {"a":{"b":{"c":5}},"d":{"h":"2020-02-02 10:00:00.000000000"}} │
└───────────────────────────────────────────────────────────────┘

For text formats like CSV/TSV/etc, JSON is parsed from a string containing the JSON object:

Query
SELECT json FROM format(TSV, 'json JSON(a.b.c UInt32, SKIP a.b.d, SKIP REGEXP \'b.*\')',
'{"a" : {"b" : {"c" : 1, "d" : [0, 1]}}, "b" : "2020-01-01", "c" : 42, "d" : {"e" : {"f" : ["s1", "s2"]}, "i" : [1, 2, 3]}}
{"a" : {"b" : {"c" : 2, "d" : [2, 3]}}, "b" : [1, 2, 3], "c" : null, "d" : {"e" : {"g" : 43}, "i" : [4, 5, 6]}}
{"a" : {"b" : {"c" : 3, "d" : [4, 5]}}, "b" : {"c" : 10}, "e" : "Hello, World!"}
{"a" : {"b" : {"c" : 4, "d" : [6, 7]}}, "c" : 43}
{"a" : {"b" : {"c" : 5, "d" : [8, 9]}}, "b" : {"c" : 11, "j" : [1, 2, 3]}, "d" : {"e" : {"f" : ["s3", "s4"], "g" : 44}, "h" : "2020-02-02 10:00:00"}}')
Response
┌─json──────────────────────────────────────────────────────────┐
│ {"a":{"b":{"c":1}},"c":"42","d":{"i":["1","2","3"]}} │
│ {"a":{"b":{"c":2}},"d":{"i":["4","5","6"]}} │
│ {"a":{"b":{"c":3}},"e":"Hello, World!"} │
│ {"a":{"b":{"c":4}},"c":"43"} │
│ {"a":{"b":{"c":5}},"d":{"h":"2020-02-02 10:00:00.000000000"}} │
└───────────────────────────────────────────────────────────────┘

Reaching the limit of dynamic paths inside JSON

The JSON data type can store only a limited number of paths as separate sub-columns internally. By default, this limit is 1024, but you can change it in the type declaration using parameter max_dynamic_paths.

When the limit is reached, all new paths inserted to a JSON column will be stored in a single shared data structure. It's still possible to read such paths as sub-columns, but it will require reading the entire shared data structure to extract the values of this path. This limit is needed to avoid having an enormous number of different sub-columns that can make the table unusable.

Let's see what happens when the limit is reached in a few different scenarios.

Reaching the limit during data parsing

During parsing of JSON objects from data, when the limit is reached for the current block of data, all new paths will be stored in a shared data structure. We can use the following two introspection functions JSONDynamicPaths, JSONSharedDataPaths:

Query
SELECT json, JSONDynamicPaths(json), JSONSharedDataPaths(json) FROM format(JSONEachRow, 'json JSON(max_dynamic_paths=3)', '
{"json" : {"a" : {"b" : 42}, "c" : [1, 2, 3]}}
{"json" : {"a" : {"b" : 43}, "d" : "2020-01-01"}}
{"json" : {"a" : {"b" : 44}, "c" : [4, 5, 6]}}
{"json" : {"a" : {"b" : 43}, "d" : "2020-01-02", "e" : "Hello", "f" : {"g" : 42.42}}}
{"json" : {"a" : {"b" : 43}, "c" : [7, 8, 9], "f" : {"g" : 43.43}, "h" : "World"}}
')
Response
┌─json───────────────────────────────────────────────────────────┬─JSONDynamicPaths(json)─┬─JSONSharedDataPaths(json)─┐
│ {"a":{"b":"42"},"c":["1","2","3"]} │ ['a.b','c','d'] │ [] │
│ {"a":{"b":"43"},"d":"2020-01-01"} │ ['a.b','c','d'] │ [] │
│ {"a":{"b":"44"},"c":["4","5","6"]} │ ['a.b','c','d'] │ [] │
│ {"a":{"b":"43"},"d":"2020-01-02","e":"Hello","f":{"g":42.42}} │ ['a.b','c','d'] │ ['e','f.g'] │
│ {"a":{"b":"43"},"c":["7","8","9"],"f":{"g":43.43},"h":"World"} │ ['a.b','c','d'] │ ['f.g','h'] │
└────────────────────────────────────────────────────────────────┴────────────────────────┴───────────────────────────┘

As we can see, after inserting paths e and f.g the limit was reached, and they got inserted into a shared data structure.

During merges of data parts in MergeTree table engines

During a merge of several data parts in a MergeTree table the JSON column in the resulting data part can reach the limit of dynamic paths and won't be able to store all paths from source parts as sub-columns. In this case, ClickHouse chooses what paths will remain as sub-columns after merge and what paths will be stored in the shared data structure. In most cases, ClickHouse tries to keep paths that contain the largest number of non-null values and move the rarest paths to the shared data structure. This does, however, depend on the implementation.

Let's see an example of such a merge. First, let's create a table with a JSON column, set the limit of dynamic paths to 3 and then insert values with 5 different paths:

Query
CREATE TABLE test (id UInt64, json JSON(max_dynamic_paths=3)) engine=MergeTree ORDER BY id;
SYSTEM STOP MERGES test;
INSERT INTO test SELECT number, formatRow('JSONEachRow', number as a) FROM numbers(5);
INSERT INTO test SELECT number, formatRow('JSONEachRow', number as b) FROM numbers(4);
INSERT INTO test SELECT number, formatRow('JSONEachRow', number as c) FROM numbers(3);
INSERT INTO test SELECT number, formatRow('JSONEachRow', number as d) FROM numbers(2);
INSERT INTO test SELECT number, formatRow('JSONEachRow', number as e) FROM numbers(1);

Each insert will create a separate data part with the JSON column containing a single path:

Query
SELECT
count(),
groupArrayArrayDistinct(JSONDynamicPaths(json)) AS dynamic_paths,
groupArrayArrayDistinct(JSONSharedDataPaths(json)) AS shared_data_paths,
_part
FROM test
GROUP BY _part
ORDER BY _part ASC
Response
┌─count()─┬─dynamic_paths─┬─shared_data_paths─┬─_part─────┐
│ 5 │ ['a'] │ [] │ all_1_1_0 │
│ 4 │ ['b'] │ [] │ all_2_2_0 │
│ 3 │ ['c'] │ [] │ all_3_3_0 │
│ 2 │ ['d'] │ [] │ all_4_4_0 │
│ 1 │ ['e'] │ [] │ all_5_5_0 │
└─────────┴───────────────┴───────────────────┴───────────┘

Now, let's merge all parts into one and see what will happen:

Query
SELECT
count(),
groupArrayArrayDistinct(JSONDynamicPaths(json)) AS dynamic_paths,
groupArrayArrayDistinct(JSONSharedDataPaths(json)) AS shared_data_paths,
_part
FROM test
GROUP BY _part
ORDER BY _part ASC
Response
┌─count()─┬─dynamic_paths─┬─shared_data_paths─┬─_part─────┐
│ 15 │ ['a','b','c'] │ ['d','e'] │ all_1_5_2 │
└─────────┴───────────────┴───────────────────┴───────────┘

As we can see, ClickHouse kept the most frequent paths a, b and c and moved paths d and e to a shared data structure.

Introspection functions

There are several functions that can help to inspect the content of the JSON column:

Examples

Let's investigate the content of the GH Archive dataset for the date 2020-01-01:

Query
SELECT arrayJoin(distinctJSONPaths(json))
FROM s3('s3://clickhouse-public-datasets/gharchive/original/2020-01-01-*.json.gz', JSONAsObject)
Response
┌─arrayJoin(distinctJSONPaths(json))─────────────────────────┐
│ actor.avatar_url │
│ actor.display_login │
│ actor.gravatar_id │
│ actor.id │
│ actor.login │
│ actor.url │
│ created_at │
│ id │
│ org.avatar_url │
│ org.gravatar_id │
│ org.id │
│ org.login │
│ org.url │
│ payload.action │
│ payload.before │
│ payload.comment._links.html.href │
│ payload.comment._links.pull_request.href │
│ payload.comment._links.self.href │
│ payload.comment.author_association │
│ payload.comment.body │
│ payload.comment.commit_id │
│ payload.comment.created_at │
│ payload.comment.diff_hunk │
│ payload.comment.html_url │
│ payload.comment.id │
│ payload.comment.in_reply_to_id │
│ payload.comment.issue_url │
│ payload.comment.line │
│ payload.comment.node_id │
│ payload.comment.original_commit_id │
│ payload.comment.original_position │
│ payload.comment.path │
│ payload.comment.position │
│ payload.comment.pull_request_review_id │
...
│ payload.release.node_id │
│ payload.release.prerelease │
│ payload.release.published_at │
│ payload.release.tag_name │
│ payload.release.tarball_url │
│ payload.release.target_commitish │
│ payload.release.upload_url │
│ payload.release.url │
│ payload.release.zipball_url │
│ payload.size │
│ public │
│ repo.id │
│ repo.name │
│ repo.url │
│ type │
└─arrayJoin(distinctJSONPaths(json))─────────────────────────┘
SELECT arrayJoin(distinctJSONPathsAndTypes(json))
FROM s3('s3://clickhouse-public-datasets/gharchive/original/2020-01-01-*.json.gz', JSONAsObject)
SETTINGS date_time_input_format = 'best_effort'
┌─arrayJoin(distinctJSONPathsAndTypes(json))──────────────────┐
│ ('actor.avatar_url',['String']) │
│ ('actor.display_login',['String']) │
│ ('actor.gravatar_id',['String']) │
│ ('actor.id',['Int64']) │
│ ('actor.login',['String']) │
│ ('actor.url',['String']) │
│ ('created_at',['DateTime']) │
│ ('id',['String']) │
│ ('org.avatar_url',['String']) │
│ ('org.gravatar_id',['String']) │
│ ('org.id',['Int64']) │
│ ('org.login',['String']) │
│ ('org.url',['String']) │
│ ('payload.action',['String']) │
│ ('payload.before',['String']) │
│ ('payload.comment._links.html.href',['String']) │
│ ('payload.comment._links.pull_request.href',['String']) │
│ ('payload.comment._links.self.href',['String']) │
│ ('payload.comment.author_association',['String']) │
│ ('payload.comment.body',['String']) │
│ ('payload.comment.commit_id',['String']) │
│ ('payload.comment.created_at',['DateTime']) │
│ ('payload.comment.diff_hunk',['String']) │
│ ('payload.comment.html_url',['String']) │
│ ('payload.comment.id',['Int64']) │
│ ('payload.comment.in_reply_to_id',['Int64']) │
│ ('payload.comment.issue_url',['String']) │
│ ('payload.comment.line',['Int64']) │
│ ('payload.comment.node_id',['String']) │
│ ('payload.comment.original_commit_id',['String']) │
│ ('payload.comment.original_position',['Int64']) │
│ ('payload.comment.path',['String']) │
│ ('payload.comment.position',['Int64']) │
│ ('payload.comment.pull_request_review_id',['Int64']) │
...
│ ('payload.release.node_id',['String']) │
│ ('payload.release.prerelease',['Bool']) │
│ ('payload.release.published_at',['DateTime']) │
│ ('payload.release.tag_name',['String']) │
│ ('payload.release.tarball_url',['String']) │
│ ('payload.release.target_commitish',['String']) │
│ ('payload.release.upload_url',['String']) │
│ ('payload.release.url',['String']) │
│ ('payload.release.zipball_url',['String']) │
│ ('payload.size',['Int64']) │
│ ('public',['Bool']) │
│ ('repo.id',['Int64']) │
│ ('repo.name',['String']) │
│ ('repo.url',['String']) │
│ ('type',['String']) │
└─arrayJoin(distinctJSONPathsAndTypes(json))──────────────────┘

ALTER MODIFY COLUMN to JSON type

It's possible to alter an existing table and change the type of the column to the new JSON type. Right now only ALTER from a String type is supported.

Example

Query
CREATE TABLE test (json String) ENGINE=MergeTree ORDeR BY tuple();
INSERT INTO test VALUES ('{"a" : 42}'), ('{"a" : 43, "b" : "Hello"}'), ('{"a" : 44, "b" : [1, 2, 3]}'), ('{"c" : "2020-01-01"}');
ALTER TABLE test MODIFY COLUMN json JSON;
SELECT json, json.a, json.b, json.c FROM test;
Response
┌─json─────────────────────────┬─json.a─┬─json.b──┬─json.c─────┐
│ {"a":"42"} │ 42 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
│ {"a":"43","b":"Hello"} │ 43 │ Hello │ ᴺᵁᴸᴸ │
│ {"a":"44","b":["1","2","3"]} │ 44 │ [1,2,3] │ ᴺᵁᴸᴸ │
│ {"c":"2020-01-01"} │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 2020-01-01 │
└──────────────────────────────┴────────┴─────────┴────────────┘

Comparison between values of the JSON type

Values of a JSON column cannot be compared with the less/greater functions, but can be compared using the equal function.

Two JSON objects are considered equal when they have the same set of paths and each of these paths has the same type and value in both objects.

For example:

Query
CREATE TABLE test (json1 JSON(a UInt32), json2 JSON(a UInt32)) ENGINE=Memory;
INSERT INTO test FORMAT JSONEachRow
{"json1" : {"a" : 42, "b" : 42, "c" : "Hello"}, "json2" : {"a" : 42, "b" : 42, "c" : "Hello"}}
{"json1" : {"a" : 42, "b" : 42, "c" : "Hello"}, "json2" : {"a" : 43, "b" : 42, "c" : "Hello"}}
{"json1" : {"a" : 42, "b" : 42, "c" : "Hello"}, "json2" : {"a" : 43, "b" : 42, "c" : "Hello"}}
{"json1" : {"a" : 42, "b" : 42, "c" : "Hello"}, "json2" : {"a" : 42, "b" : 42, "c" : "World"}}
{"json1" : {"a" : 42, "b" : [1, 2, 3], "c" : "Hello"}, "json2" : {"a" : 42, "b" : 42, "c" : "Hello"}}
{"json1" : {"a" : 42, "b" : 42.0, "c" : "Hello"}, "json2" : {"a" : 42, "b" : 42, "c" : "Hello"}}
{"json1" : {"a" : 42, "b" : "42", "c" : "Hello"}, "json2" : {"a" : 42, "b" : 42, "c" : "Hello"}};

SELECT json1, json2, json1 == json2 FROM test;
Response
┌─json1──────────────────────────────────┬─json2─────────────────────────┬─equals(json1, json2)─┐
│ {"a":42,"b":"42","c":"Hello"} │ {"a":42,"b":"42","c":"Hello"} │ 1 │
│ {"a":42,"b":"42","c":"Hello"} │ {"a":43,"b":"42","c":"Hello"} │ 0 │
│ {"a":42,"b":"42","c":"Hello"} │ {"a":43,"b":"42","c":"Hello"} │ 0 │
│ {"a":42,"b":"42","c":"Hello"} │ {"a":42,"b":"42","c":"World"} │ 0 │
│ {"a":42,"b":["1","2","3"],"c":"Hello"} │ {"a":42,"b":"42","c":"Hello"} │ 0 │
│ {"a":42,"b":42,"c":"Hello"} │ {"a":42,"b":"42","c":"Hello"} │ 0 │
│ {"a":42,"b":"42","c":"Hello"} │ {"a":42,"b":"42","c":"Hello"} │ 0 │
└────────────────────────────────────────┴───────────────────────────────┴──────────────────────┘

Tips for better usage of the JSON type

Before creating JSON column and loading data into it, consider the following tips:

  • Investigate your data and specify as many path hints with types as you can. It will make storage and reading much more efficient.
  • Think about what paths you will need and what paths you will never need. Specify paths that you won't need in the SKIP section, and SKIP REGEXP section if needed. This will improve the storage.
  • Don't set the max_dynamic_paths parameter to very high values, as it can make storage and reading less efficient. While highly dependent on system parameters such as memory, CPU, etc., a general rule of thumb would be to not set max_dynamic_paths > 10 000.

Further Reading