Variable coding

When preparing data for a multidimensional analysis, nominal and ordinal variables have to be coded appropriately. The coding matters because it determines how the coefficients of the model are interpreted. Nominal and ordinal variables divide the analyzed objects into two or more categories. Dichotomous variables (two categories, $k=2$) only need to be coded appropriately, whereas variables with more categories ($k>2$) have to be split into two-category dummy variables, which are then coded.

Dummy coding is used to answer, with the use of multidimensional models, the question: how do the results ($Y$) in each analyzed category differ from the results in the reference category? The coding consists in assigning the value 0 or 1 to each category of the given variable. The category coded as 0 is the reference category.

  • [$k=2$] If the coded variable is dichotomous, then placing it in a regression model yields a coefficient ($b_i$) for it. The coefficient compares the value of the dependent variable $Y$ for category 1 with the reference category (adjusted for the remaining variables in the model).
  • [$k>2$] If the analyzed variable has more than two categories, then the $k$ categories are represented by $k-1$ dummy variables with dummy coding. When creating variables with dummy coding, one selects a category for which no dummy variable is created. That category is treated as the reference category (as it has the value 0 in each variable coded with dummy coding).

When the $X_1, X_2, ..., X_{k-1}$ variables obtained in that way, with dummy coding, are placed in a regression model, then their $b_1, b_2, ..., b_{k-1}$ coefficients will be calculated.

  • [$b_1$] compares the $Y$ results (for code 1 in $X_1$) with the reference category (adjusted for the remaining variables in the model);
  • [$b_2$] compares the $Y$ results (for code 1 in $X_2$) with the reference category (adjusted for the remaining variables in the model);
  • […]
  • [$b_{k-1}$] compares the $Y$ results (for code 1 in $X_{k-1}$) with the reference category (adjusted for the remaining variables in the model).
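
The interpretation above can be made explicit for a linear model. As a sketch, assume a linear regression with an intercept $b_0$ (a symbol introduced here, not used in the text above) and, for simplicity, no other predictors. The expected value of $Y$ is then

\[
E(Y) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_{k-1} X_{k-1}.
\]

In the reference category all $X_j$ equal 0, so $E(Y) = b_0$. In the category coded 1 in $X_j$ all other dummy variables equal 0, so $E(Y) = b_0 + b_j$. Hence

\[
b_j = E(Y \mid X_j = 1) - E(Y \mid \text{reference category}), \qquad j = 1, \dots, k-1.
\]

When further variables are present in the model, the same difference is read at fixed values of those variables.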


Using dummy coding, we code the sex variable, which has two categories (male is selected as the reference category), and the education variable, which has 4 categories (elementary education is selected as the reference category).


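\begin{tabular}{|l|c|c|c|}
\hline
& \multicolumn{3}{c|}{\textbf{Coded education}}\\
\textbf{Education}&\textbf{vocational}&\textbf{secondary}&\textbf{tertiary}\\\hline
elementary&0&0&0\\
vocational&1&0&0\\
secondary&0&1&0\\
tertiary&0&0&1\\\hline
\end{tabular}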
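
The coding described above (male and elementary education as reference categories) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular statistical package: the function name `dummy_code` and the four sample observations are hypothetical and serve only to show the resulting 0/1 columns.

```python
def dummy_code(values, categories, reference):
    """Dummy-code `values`: one 0/1 indicator per non-reference category.

    The reference category gets 0 in every indicator column.
    """
    if reference not in categories:
        raise ValueError(f"unknown reference category: {reference!r}")
    coded_levels = [c for c in categories if c != reference]
    rows = []
    for value in values:
        if value not in categories:
            raise ValueError(f"unknown category: {value!r}")
        rows.append({level: int(value == level) for level in coded_levels})
    return rows


# Hypothetical sample of four people (not data from the text).
sample_sex = ["female", "male", "female", "male"]
sample_education = ["secondary", "elementary", "tertiary", "vocational"]

sex_coded = dummy_code(sample_sex, ["male", "female"], reference="male")
education_coded = dummy_code(
    sample_education,
    ["elementary", "vocational", "secondary", "tertiary"],
    reference="elementary",
)

for sex, edu, s_row, e_row in zip(sample_sex, sample_education, sex_coded, education_coded):
    print(sex, edu, s_row, e_row)
```

The four education categories yield three indicator columns, and a person with elementary education has 0 in all of them, which is what makes that category the reference.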

Having built a multiple regression model on the basis of the dummy variables, we might want to check what impact the variables have on a dependent variable, e.g. $Y$ = the amount of earnings (in thousands of PLN). Such an analysis yields a sample coefficient for each dummy variable:

- for sex the statistically significant coefficient $b_{i}=-0.5$, which means that women's average wages are 0.5 thousand PLN lower than men's average wages, assuming that all other variables in the model remain unchanged;

- for vocational education the statistically significant coefficient $b_{i}=0.6$, which means that the average wages of people with vocational education are 0.6 thousand PLN higher than those of people with elementary education, assuming that all other variables in the model remain unchanged;

- for secondary education the statistically significant coefficient $b_{i}=1$, which means that the average wages of people with secondary education are a thousand PLN higher than those of people with elementary education, assuming that all other variables in the model remain unchanged;

- for tertiary-level education the statistically significant coefficient $b_{i}=1.5$, which means that the average wages of people with tertiary-level education are 1.5 thousand PLN higher than those of people with elementary education, assuming that all other variables in the model remain unchanged.
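
Read together, the coefficients above describe one fitted equation. As a sketch, assume the model is a linear regression with an intercept $b_0$ whose value is not given in the example; the subscripted variable names are introduced here only for illustration, and the trailing dots stand for any other variables in the model:

\[
\hat{Y} = b_0 - 0.5\,X_{\text{female}} + 0.6\,X_{\text{vocational}} + 1\,X_{\text{secondary}} + 1.5\,X_{\text{tertiary}} + \dots
\]

For instance, a woman with tertiary-level education differs from a man with elementary education (the profile in which all dummy variables equal 0) by $-0.5 + 1.5 = 1$ thousand PLN, other variables being equal.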

Effect coding is used to answer, with the use of multidimensional models, the question: how do the results ($Y$) in each analyzed category differ from the (unweighted) mean obtained from the sample? The coding consists in assigning the value -1 or 1 to each category of the given variable (for variables with more than two categories the value 0 is also used). The category coded as -1 is the base category.

  • [$k=2$] If the coded variable is dichotomous, then placing it in a regression model yields a coefficient ($b_i$) for it. The coefficient compares $Y$ for category 1 with the unweighted general mean (adjusted for the remaining variables in the model).
  • [$k>2$] If the analyzed variable has more than two categories, then the $k$ categories are represented by $k-1$ variables with effect coding. When creating variables with effect coding, one selects a category for which no separate variable is made. That category is treated in the models as the base category (as it has the value -1 in each variable made by effect coding).

When the $X_1, X_2, ..., X_{k-1}$ variables obtained in that way, with effect coding, are placed in a regression model, then their $b_1, b_2, ..., b_{k-1}$ coefficients will be calculated.

  • [$b_1$] compares the $Y$ results (for code 1 in $X_1$) with the unweighted general mean (adjusted for the remaining variables in the model);
  • [$b_2$] compares the $Y$ results (for code 1 in $X_2$) with the unweighted general mean (adjusted for the remaining variables in the model);
  • […]
  • [$b_{k-1}$] compares the $Y$ results (for code 1 in $X_{k-1}$) with the unweighted general mean (adjusted for the remaining variables in the model).
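
Why the comparison is with the unweighted mean can be shown in a short derivation. As a sketch, assume a linear regression with an intercept $b_0$ and no other predictors, and write $\mu_j$ for the expected value of $Y$ in category $j$, with $\mu_k$ denoting the base category (the symbols $b_0$ and $\mu_j$ are introduced here for illustration). Since category $j<k$ is coded 1 in $X_j$ and 0 elsewhere, and the base category is coded -1 in every variable,

\[
\mu_j = b_0 + b_j \quad (j = 1, \dots, k-1), \qquad \mu_k = b_0 - (b_1 + b_2 + \dots + b_{k-1}).
\]

Adding the $k$ equations, the coefficients $b_1, \dots, b_{k-1}$ cancel, which gives

\[
b_0 = \frac{1}{k}\sum_{j=1}^{k} \mu_j, \qquad b_j = \mu_j - b_0 .
\]

The intercept is therefore the mean of the category means, with every category counted once regardless of its size, and each $b_j$ is the deviation of category $j$ from that mean.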


Using effect coding, we code the sex variable, which has two categories (male is the base category), and a variable describing the region of residence in the analyzed country. 5 regions were selected: northern, southern, eastern, western, and central. The central region is the base category.


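\begin{tabular}{|l|c|c|c|c|}
\hline
\textbf{Regions}& \multicolumn{4}{c|}{\textbf{Coded regions}}\\
\textbf{of residence}&\textcolor[rgb]{0,0,1}{\textbf{western}}&\textcolor[rgb]{1,0,0}{\textbf{eastern}}&\textcolor[rgb]{0,0.58,0}{\textbf{northern}}&\textcolor[rgb]{0.55,0,0}{\textbf{southern}}\\\hline
central&$-1$&$-1$&$-1$&$-1$\\
western&1&0&0&0\\
eastern&0&1&0&0\\
northern&0&0&1&0\\
southern&0&0&0&1\\\hline
\end{tabular}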
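
The effect coding of the two variables (male and the central region as base categories) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the routine of any specific package: the function name `effect_code` and the sample observations are hypothetical.

```python
def effect_code(values, categories, base):
    """Effect-code `values`: one column per non-base category.

    A non-base category gets 1 in its own column and 0 elsewhere;
    the base category gets -1 in every column.
    """
    if base not in categories:
        raise ValueError(f"unknown base category: {base!r}")
    coded_levels = [c for c in categories if c != base]
    rows = []
    for value in values:
        if value not in categories:
            raise ValueError(f"unknown category: {value!r}")
        if value == base:
            rows.append({level: -1 for level in coded_levels})
        else:
            rows.append({level: int(value == level) for level in coded_levels})
    return rows


regions = ["central", "western", "eastern", "northern", "southern"]

# Hypothetical sample of three people (not data from the text).
sample_region = ["central", "eastern", "southern"]
sample_sex = ["male", "female", "female"]

region_coded = effect_code(sample_region, regions, base="central")
sex_coded = effect_code(sample_sex, ["male", "female"], base="male")

for region, row in zip(sample_region, region_coded):
    print(region, row)
```

The only difference from dummy coding is the row for the base category: it carries -1 in every column instead of 0, which is what shifts the comparison from a reference category to the unweighted mean.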

Having built a multiple regression model on the basis of the effect-coded variables, we might want to check what impact the variables have on a dependent variable, e.g. $Y$ = the amount of earnings (expressed in thousands of PLN). Such an analysis yields a sample coefficient for each coded variable:

- for sex the statistically significant coefficient $b_{i}=-0.5$, which means that women's average wages are 0.5 thousand PLN lower than the average wages in the country, assuming that the other variables in the model remain unchanged;

- for the western region the statistically significant coefficient $b_{i}=0.6$, which means that the average wages of people living in the western region of the country are 0.6 thousand PLN higher than the average wages in the country, assuming that the other variables in the model remain unchanged;

- for the eastern region the statistically significant coefficient $b_{i}=-1$, which means that the average wages of people living in the eastern region of the country are a thousand PLN lower than the average wages in the country, assuming that the other variables in the model remain unchanged;

- for the northern region the statistically significant coefficient $b_{i}=0.4$, which means that the average wages of people living in the northern region of the country are 0.4 thousand PLN higher than the average wages in the country, assuming that the other variables in the model remain unchanged;

- for the southern region the statistically insignificant coefficient $b_{i}=0.1$, which means that the average wages of people living in the southern region of the country do not differ in a statistically significant manner from the average wages in the country, assuming that the other variables in the model remain unchanged.
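
The base categories (men, the central region) have no coefficient of their own, but under effect coding their deviation from the unweighted mean is implied by the others: the deviations of all categories of one variable sum to zero, so the base category's deviation is the negative sum of the estimated coefficients. The sketch below applies this to the example coefficients; the function name `implied_base_effect` is hypothetical, and the result is only a point estimate that comes without a significance test (it also uses the southern coefficient, which is not statistically significant).

```python
# Effect-coding coefficients from the example (thousands of PLN).
region_coefficients = {"western": 0.6, "eastern": -1.0, "northern": 0.4, "southern": 0.1}
sex_coefficients = {"female": -0.5}


def implied_base_effect(coefficients):
    """Deviation of the base category from the unweighted mean.

    Under effect coding the category deviations sum to zero, so the
    base category's deviation is minus the sum of the other ones.
    """
    return -sum(coefficients.values())


central_effect = implied_base_effect(region_coefficients)
male_effect = implied_base_effect(sex_coefficients)

print(f"central region: {central_effect:+.1f} thousand PLN")
print(f"men: {male_effect:+.1f} thousand PLN")
```

Under these assumptions the central region lies about 0.1 thousand PLN below the unweighted mean and men lie 0.5 thousand PLN above it, mirroring the coefficient for women.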