Explained: Linter stage

Behaviour

Each time the Linter is executed, all files in the input directory are examined against a set of rules. A rule is either a default rule (e.g. is the syntax correct?) or a rule associated with a template (e.g. are table names capitalized?). The Linter keeps track of all exceptions it finds during the traversal. If no exceptions are found, the Linter reports a successful traversal. Otherwise, it returns the exceptions per file.
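
To make this behaviour concrete, here is a minimal sketch of such a traversal in Python. It is an illustration, not the actual implementation: the lint_file callable, which produces the violations for a single file, is a hypothetical helper supplied by the caller.

from pathlib import Path
from typing import Callable

def lint_directory(input_dir: str,
                   lint_file: Callable[[Path], list[str]]) -> dict[str, list[str]]:
    """Examine every configuration in the input directory and collect violations per file."""
    exceptions: dict[str, list[str]] = {}
    for path in sorted(Path(input_dir).glob("*.yml")):
        # lint_file applies the default rules and any template-specific rules.
        violations = lint_file(path)
        if violations:
            exceptions[str(path)] = violations
    # An empty dict signals a successful traversal; otherwise the exceptions
    # are reported grouped per file.
    return exceptions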

General validation rules

Consider the following example of a template configuration, which is validated by the Linter:

Listing 1. view 'input/example.yml'
{ (1)
  "template": "my-template.sql.j2", (2)
  "vars": { (3)
    "var1": "val1",
    "var2": ["val1", "val2"]
  },
  "<optional_key>": "", (4)
} (5)

Validate the following:

1 The start of the root element is present
2 The template-key references an existing file located at 'templates/my-template.sql.j2'
3 The vars-key holds a mapping of variables. An example of such a variable is "var1" with the value "val1"
4 Additional keys may exist, but they are ignored
5 The whole file is valid JSON or YAML syntax

These rules ensure that each template-configuration has valid syntax and that the referenced template exists.
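
The sketch below illustrates these general checks, assuming a Python implementation with PyYAML (yaml.safe_load also accepts plain JSON) and templates located under 'templates/'; the function name is illustrative.

from pathlib import Path
import yaml

def check_general_rules(config_path: Path,
                        templates_dir: Path = Path("templates")) -> list[str]:
    violations: list[str] = []
    try:
        config = yaml.safe_load(config_path.read_text())   # (5) valid JSON or YAML syntax
    except yaml.YAMLError as exc:
        return [f"invalid syntax: {exc}"]
    if not isinstance(config, dict):                        # (1) the root element is present
        return ["the root element must be a mapping"]
    template = config.get("template")
    if not template or not (templates_dir / str(template)).is_file():   # (2) template exists
        violations.append(f"template '{template}' does not exist in '{templates_dir}/'")
    if not isinstance(config.get("vars"), dict):            # (3) vars holds the variables
        violations.append("'vars' must map variable names to values")
    # (4) additional keys are simply ignored
    return violations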

Template specific validation rules

Given the code below that specifies a set of rules:

Listing 2. view 'templates/create_schema.rules.yml'
[
  {
    "identifier": "var_db", (1)
    "rules": [
      "^[A-Z_0-9]+$", (2)
      "^DB_" (3)
    ],
    "message": "A database should start with the DB_prefix. Its name has to be capitalized." (4)
  }
]

With the above rule, the Linter will apply it as follows:

1 Search for the identifier "var_db"
2 Check if its value contains only capital letters, digits, and underscores
3 Check if its value has the prefix "DB_"
4 Return the error message to the user if any of the above checks fail
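
The sketch below shows how a single rule entry could be evaluated, assuming the rules are ordinary Python regular expressions; the function name and the in-memory shape of the rule are illustrative.

import re

def apply_rule(rule: dict, variables: dict) -> str | None:
    value = variables.get(rule["identifier"])      # (1) search for the identifier
    if value is None:
        return None                                # identifier not used in this configuration
    for pattern in rule["rules"]:                  # (2) and (3): evaluate every regex
        if not re.search(pattern, str(value)):
            return rule["message"]                 # (4) report the configured message
    return None

rule = {
    "identifier": "var_db",
    "rules": ["^[A-Z_0-9]+$", "^DB_"],
    "message": "A database should start with the DB_ prefix. Its name has to be capitalized.",
}
print(apply_rule(rule, {"var_db": "DB_CDWH_T"}))   # None: both checks pass
print(apply_rule(rule, {"var_db": "db_cdwh_t"}))   # the configured error message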

The above rules apply to all configurations of the template 'create_schema.sql.j2'. One example of such a configuration is shown below:

Listing 3. view 'input/META_SCHEMA.yml'
{
  "template": "create_schema.sql.j2",
  "vars": {
    "var_env_list": ["T"],
    "var_context": "DWH",
    "var_rl_adm":  "RL_ADMIN",
    "var_db":      "DB_CDWH_T", (1)
    "var_sch":     "META_SCHEMA",
    "var_sch_comment": "Meta-schema contains meta information to guide the ELT process",
    "var_enable_managed_access": "TRUE",
    "var_data_retention_days": "14"
  }
}
1 The value of the field "var_db" is validated against the template-specific rules we saw earlier. The value "DB_CDWH_T" is capitalized (1st validation rule) and starts with the prefix "DB_" (2nd validation rule).

In more general terms, when the Linter examines a file-reference to a template, it also looks for another file with the same stem and the extension 'rules.yml'. If such a file exists, the Linter validates the variables in the template configuration against the regex rules it contains.
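
Sketched in Python under the same assumptions as before, the stem-based lookup and the validation of a configuration against its rules file could look as follows; paths and names are illustrative.

import re
from pathlib import Path
import yaml

def check_template_rules(config: dict,
                         templates_dir: Path = Path("templates")) -> list[str]:
    # 'create_schema.sql.j2' -> 'templates/create_schema.rules.yml'
    stem = config["template"].split(".")[0]
    rules_file = templates_dir / f"{stem}.rules.yml"
    if not rules_file.is_file():
        return []                                  # no template-specific rules exist
    violations: list[str] = []
    variables = config.get("vars", {})
    for rule in yaml.safe_load(rules_file.read_text()):
        value = variables.get(rule["identifier"])
        if value is not None and not all(re.search(p, str(value)) for p in rule["rules"]):
            violations.append(rule["message"])
    return violations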

Why do you want automatic rule validation?

Many data engineering teams agree on a set of naming conventions. These naming conventions allow them to build robust and generic ELT processes on top of them.

In my early Snowflake days, we agreed to capitalize basically all names in Snowflake (e.g. databases, schemas, tables and columns). Because of this simple rule, we were able to build generic ELT-flows, with here and there an upper()-function written around the identifier of a Snowflake object to make sure it is always capitalized.

New engineers don’t always know about all conventions, but they are required to deliver. In the early days of the data platform, ELT-flows were developed on AWS Redshift. Half a year later, the preference shifted from AWS Redshift towards Snowflake. As a consequence, the already created ELT-flows needed to be migrated. One new engineer was quick to migrate the Redshift ELT-flows to a similar set of Snowflake ELT-flows. However, his approach required him to recreate the same Redshift tables in Snowflake, and many of those tables used a mixture of capitals and lowercase letters. The migration took place in an isolated environment with its own ELT-setup and separate database, where the new developer could work on his own. He was not engaging with other developers, so he never became familiar with their way of working.

After the migration was completed, Redshift could be decommissioned. However, the migrated ELT-flows could not be integrated into the existing Snowflake landscape. The main reason was the non-capitalized names, which caused the default Snowflake ELT-flows to fail. A big refactoring was needed to integrate the isolated ex-Redshift tables into the Snowflake landscape. The whole process took 1.5 years.

In the above example, the big refactoring could have been avoided if the new engineer had worked with his fellow engineers more frequently, instead of on his own island. But even then, it would have taken precious review and fixing time to make all the ELT-flows conform to the naming conventions.

— A story from a data engineer
by Paul Stapersma