Principal component analysis of financial statements. A compositional approach

Financial ratios are often used in principal component analysis and related techniques for the purposes of data reduction and visualization. Besides the dependence of results on ratio choice, ratios themselves pose a number of problems when subjected to a principal component analysis, such as skewed distributions. In this work, we put forward an alternative method drawn from compositional data analysis (CoDa), a standard statistical toolbox for use when data convey information about relative magnitudes, as financial ratios do. The method, referred to as the CoDa biplot, does not rely on any particular choice of financial ratio but allows researchers to visually order firms along the pairwise financial ratios for any two accounts. Non-financial magnitudes and time evolution can be added to the visualization as desired. We show an example of its application to the top chains in the Spanish grocery retail sector and show how the technique can be used to depict strategic management differences in financial structure or performance, and their evolution over time.


Introduction.
Financial ratios have been used for a variety of purposes, ranging from bankruptcy prediction and option pricing to management performance assessment and strategic assessment (e.g., Barnes, 1987; Blanco-Oliver et al., 2016; Caro et al., 2017; Chen & Shimerda, 1981; Horrigan, 1968; McGee & Thomas, 1986). As regards the latter purposes, financial ratio analysis has been a common methodology for assessing firm management since the second half of the nineteenth century (Horrigan, 1968). The original DuPont model, developed in 1918 and based on financial ratios, has become a widely used approach to analyse the relationship between management and firm performance. Since then, scholars and educators alike have acknowledged the role that financial ratios play in the formulation and implementation of realistic and ultimately effective strategies (Norman, 2018). Ross et al. (2003) considered that one of the main advantages of financial ratios is their ability to evaluate the company's position compared to its main competitors. This is an essential aspect to address in the strategic management of a company. Strategic management is a process that establishes the courses of action a company will follow to achieve its objectives (Grant, 2008; Smith, 2005). Techniques such as strengths, weaknesses, opportunities and threats (SWOT) analysis and the balanced scorecard, and conceptual approaches such as competitive advantage, resource-based analysis or strategic groups, are widely acknowledged and used in strategic management and business analysis. Their application requires quantitative information that makes it possible to assess the situation of a company and its competitors, as well as to project future strategic scenarios. Financial ratios applied to general-purpose financial statements and related data are the core of useful estimates and inferences in business analysis.
In fact, financial ratios make it possible to assess business strategy and relate it to performance measures (Allen & Helms, 2006; Banker et al., 2014; Hoque, 2004). Despite their widespread and frequent use as a business analysis tool, financial ratios have been hampered by several long-known shortcomings related to asymmetry and redundancy. Most ratios are distributed between zero and infinity, which makes fully symmetric distributions impossible to achieve (Deakin, 1976). Besides its statistical implications, asymmetry arises from an uneven treatment of differences in the numerator and the denominator of the ratio (Frecka & Hopwood, 1983), meaning that permuting them changes the results of statistical analyses (Linares-Mustarós et al., 2018).
Redundancy necessarily arises from the fact that there are many more financial ratios in common use than there are accounts from which these ratios are computed. Often, redundancy occurs to such an extent that "there is no absolute test for the importance of variables" (Barnes, 1987, p. 455) and "to identify those ratios which contain complete information about a firm while minimising duplication cannot be achieved purely by logic" (Barnes, 1987, p. 456). In extreme cases, there is an exact dependency between ratios. Chen & Shimerda (1981) give an example of four exactly redundant ratios: net worth to total debt, total debt to net worth, net worth to total assets and total debt to total assets. Any one of them carries the same information as the whole set of four.
Financial ratio analysis is a case in which researchers and professionals are genuinely interested in the relative rather than absolute magnitudes of accounts in financial statements. In other scientific fields, there is a well-developed toolbox for analysing the relative importance of magnitudes, known as Compositional Data Analysis (CoDa; Aitchison, 1982, 1986; van den Boogaart & Tolosana-Delgado, 2013; Pawlowsky-Glahn et al., 2015). Among other features, CoDa treats magnitudes (i.e., account values) in a symmetric fashion, in such a way that results depend only on the selected accounts of interest but not on any particular set of ratios, let alone on numerator and denominator permutation. CoDa also tends to reduce redundancy and hence increase parsimony by acknowledging the fact that no analysis will require more variables than there are account magnitudes to be compared.
CoDa has already been successfully applied with the purpose of clustering firms with similar financial statement structures (Linares-Mustarós et al., 2018). To the best of our knowledge, that is the only CoDa application to financial statement analysis to date. The purpose of the present article is to introduce a data visualization and data reduction tool for financial statement analysis based on CoDa. With this tool firms can be mapped in a two-dimensional space, which allows ready appraisal of their ordering according to the ratios between any two accounts, thus providing a powerful visual aid for strategic assessment.
In this article, we first provide an overview of the basics in CoDa. Then, we present a common data visualization and reduction tool called CoDa biplot, which can be used to map the main accounts in order to visually appraise the strategic or managerial differences among firms. Some interpretational cues specific to financial ratio analysis are highlighted. Finally, we provide an illustration characterizing the business model evolution of the largest chains in the Spanish grocery retailing sector, from the onset of the financial crisis in 2008 up to 2015.

Definition and purpose.
Compositional Data Analysis (CoDa) is the standard statistical method used when data carry only information about the relative importance of non-negative parts of a whole. The CoDa tradition started with Aitchison's seminal work (1982, 1986) on chemical and geological compositions, in which only the proportion of each part or component is of interest, since absolute amounts are irrelevant and only informative about the size of the chemical or soil sample (e.g., Buccianti et al., 2006). Nowadays, CoDa spans almost all of the hard sciences and has started to be used in several management fields. Besides Linares-Mustarós et al. (2018), examples include crowdfunding (Davis et al., 2017), financial markets (Ortells et al., 2016; Wang et al., 2019) and investment portfolios (Belles-Sampera et al., 2016; Boonen et al., 2019; Glassman & Riddick, 1996). In the last three decades, CoDa has provided a standardized toolbox for statistical analyses whose research questions concern the relative importance of magnitudes. Dedicated user-friendly software is available to this end (Van den Boogaart & Tolosana-Delgado, 2013; Greenacre, 2018; Palarea-Albaladejo & Martín-Fernández, 2015; Templ et al., 2011; Thió-Henestrosa & Martín-Fernández, 2005), as well as accessible handbooks (Van den Boogaart & Tolosana-Delgado, 2013; Filzmoser et al., 2018; Greenacre, 2018; Pawlowsky-Glahn & Buccianti, 2011; Pawlowsky-Glahn et al., 2015).
Compositional analysis (Barceló-Vidal & Martín-Fernández, 2016) has recently been coined as a term to emphasize the fact that what is ultimately compositional are not the data but the analysis and research objectives centred on the relative importance of non-negative magnitudes. Interesting applications of CoDa to non-negative data that do not represent parts of any whole can be found in Ortells et al. (2016) and Azevedo-Rodrigues et al. (2011). This is the case with financial statement analysis, in which, for instance, sales and assets are not parts of any whole and the asset turnover ratio compares the magnitudes of both in relative terms. Even non-negative non-financial magnitudes, such as number of employees, can be and indeed are included. Compositional data are represented by a positive vector in a D-dimensional real space, which conveys information about the relative size of its components:
$$\mathbf{x} = (x_1, x_2, \ldots, x_D) \in \mathbb{R}_+^D, \quad x_j > 0 \ \text{for all } j = 1, 2, \ldots, D, \qquad (1)$$
where D is the number of components, also referred to as parts. In our case, parts are non-negative accounts in financial statements or even other non-negative management magnitudes. This means, for instance, that one should use the non-negative constituents of working capital (current assets and current liabilities) rather than working capital itself, that one should use revenues and costs rather than profit, that one should use assets and liabilities rather than net worth, and so on.

Transformations, variance and association.
The most common CoDa approach is to transform an original compositional vector of D parts into logarithms of ratios (Aitchison, 1986; Egozcue et al., 2003). Log ratios are unbounded and tend to meet the distributional assumptions of classical statistical models, such as normality (Aitchison, 1982; Pawlowsky-Glahn et al., 2015). This notwithstanding, for the purposes of this article the main arguments for log ratios are that they constitute a natural way of distilling information about the relative size of parts, they form the basis for defining compositional association and variance in a meaningful way, and, involving ratios, they are coherent with financial statement analysis practice. Logs of financial ratios had already been suggested in the financial ratio literature as a means of reducing asymmetry (e.g., Cowen & Hoffer, 1982; Deakin, 1976; Sudarsanam & Taffler, 1995).
The simplest log ratio is the pairwise log ratio between two parts xj and xk:
$$\ln\left(\frac{x_j}{x_k}\right) = \ln x_j - \ln x_k. \qquad (2)$$
Positive values mean that xj is larger than xk. Negative values show the opposite. A zero log ratio implies equality of both magnitudes, in exactly the same way as a unit standard ratio. If xj contains quick assets and xk contains current liabilities, interpretation parallels that of the quick ratio. A log ratio is symmetric in the sense that its range is from minus infinity to plus infinity. It is also symmetric in the sense that permuting the numerator and denominator only affects the sign of the log ratio, but not its absolute value. Furthermore, if one of the parts being compared is close to zero, it may lead to an outlying standard financial ratio when placed in the denominator and to a typical ratio when placed in the numerator. For log ratios placement makes no difference.
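The symmetry argument can be illustrated with a small numerical sketch. The account values below are hypothetical and the helper name `log_ratio` is introduced only for this illustration; it is not part of any CoDa package.

```python
import math

def log_ratio(xj: float, xk: float) -> float:
    """Pairwise log ratio ln(xj / xk) of two strictly positive magnitudes."""
    if xj <= 0 or xk <= 0:
        raise ValueError("both magnitudes must be strictly positive")
    return math.log(xj) - math.log(xk)

# Hypothetical firm with very small quick assets relative to current liabilities.
quick_assets, current_liabilities = 5.0, 400.0

# Standard ratios: swapping numerator and denominator changes the scale completely.
standard = quick_assets / current_liabilities          # squeezed towards zero
standard_swapped = current_liabilities / quick_assets  # potentially outlying value

# Log ratios: swapping only flips the sign.
lr = log_ratio(quick_assets, current_liabilities)
lr_swapped = log_ratio(current_liabilities, quick_assets)

print(standard, standard_swapped)
print(lr, lr_swapped)
```

The two standard ratios lie on very different scales, whereas the two log ratios are mirror images of each other around zero.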
Log ratios may also be computed between each part and the geometric mean of the remaining D-1 parts, in the so-called centred log ratios, where (D-1)/D is a scaling constant:
$$\mathrm{clr}_j = \frac{D-1}{D}\,\ln\left(\frac{x_j}{\left(\prod_{k \neq j} x_k\right)^{1/(D-1)}}\right) = \ln\left(\frac{x_j}{\left(\prod_{k=1}^{D} x_k\right)^{1/D}}\right), \quad j = 1, 2, \ldots, D. \qquad (3)$$
They thus indicate the relative dominance of an account in the overall financial statement structure. There are alternative interpretations and expressions of centred log ratios (Filzmoser et al., 2018; Pawlowsky-Glahn et al., 2015). Other ways of computing log ratios in financial analysis are described in Linares-Mustarós et al. (2018).
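As a minimal sketch, the centred log ratios of one firm can be computed in two equivalent ways: as the log of each part over the geometric mean of all D parts (the usual textbook expression), or as (D-1)/D times the log ratio of each part to the geometric mean of the remaining D-1 parts. The account values and function names are hypothetical.

```python
import math

def clr(x):
    """Centred log ratios: log of each part over the geometric mean of all D parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def scaled_log_ratio_vs_rest(x, j):
    """(D-1)/D times the log ratio of part j to the geometric mean of the other D-1 parts."""
    D = len(x)
    rest = [math.log(v) for k, v in enumerate(x) if k != j]
    return (D - 1) / D * (math.log(x[j]) - sum(rest) / (D - 1))

firm = [120.0, 30.0, 45.0, 80.0]  # hypothetical values of four accounts
values = clr(firm)

for j, value in enumerate(values):
    # Both expressions give the same number for every account.
    print(j, value, scaled_log_ratio_vs_rest(firm, j))
```

Centred log ratios sum to zero for each firm and do not change when all accounts are multiplied by the same constant, which is why they carry only relative information.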
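Total variance in a compositional data set is expressed by the sum of the variances of all centred log ratios:
$$\mathrm{totvar} = \sum_{j=1}^{D} \mathrm{Var}(\mathrm{clr}_j), \qquad (4)$$
where clr_j denotes the j-th centred log ratio (3). Association is understood as proportionality between pairs of accounts (Lovell et al., 2015). The same pairwise log ratios (2) and their variances are computed:
$$\mathrm{Var}\left(\ln\frac{x_j}{x_k}\right), \quad \text{with } j, k = 1, 2, \ldots, D;\ j \neq k. \qquad (5)$$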
These variances can be arranged in a symmetric matrix with components (i.e., accounts) defining both D rows and D columns, with the same layout as a correlation matrix. This is the so-called variation matrix and has zero elements in the diagonal. Var(ln(xj/xk)) is zero when xj and xk behave perfectly proportionally (e.g., firms having one account of double size also have the other of double size), which corresponds to perfect positive association (Lovell et al., 2015). The further Var(ln(xj/xk)) is from zero, the lower the association. There is no clearly defined threshold representing no association, just as there is no upper bound representing perfect negative association, so that values in the matrix are assessed comparatively. It can be shown that the sum of elements in the variation matrix is 2D times the total variance (4).
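The variation matrix and its link to total variance can be checked numerically. The sketch below uses a hypothetical data set of five firms and three accounts in which the second account is always twice the first, so that their log ratio has zero variance. Population variances are used throughout; the identity holds for any variance definition as long as it is applied consistently.

```python
import math
from statistics import pvariance

def clr(x):
    """Centred log ratios of one composition."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def variation_matrix(data):
    """D x D matrix of variances of all pairwise log ratios ln(xj / xk)."""
    D = len(data[0])
    logs = [[math.log(v) for v in row] for row in data]
    return [[pvariance([r[j] - r[k] for r in logs]) for k in range(D)]
            for j in range(D)]

def total_variance(data):
    """Sum of the variances of the D centred log ratios."""
    rows = [clr(row) for row in data]
    return sum(pvariance([r[j] for r in rows]) for j in range(len(data[0])))

# Hypothetical firms (rows) and accounts (columns); column 2 = 2 * column 1.
firms = [
    [100.0, 200.0, 50.0],
    [150.0, 300.0, 90.0],
    [80.0, 160.0, 20.0],
    [200.0, 400.0, 70.0],
    [120.0, 240.0, 110.0],
]

T = variation_matrix(firms)
totvar = total_variance(firms)
D = len(firms[0])

print(T[0][1])                               # essentially zero: perfect proportionality
print(sum(map(sum, T)), 2 * D * totvar)      # the two quantities coincide
```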

Zero and other irregular components.
As with other types of data, compositions require some data preprocessing to deal with problems such as missing information and outliers.
To begin with, the accounts of interest must contain no zero values in order for log ratios to be computed (e.g., Martín-Fernández et al., 2011). The same holds for standard financial ratio analysis regarding the account in the denominator. Unlike standard financial ratio analysis, CoDa includes an advanced toolbox for zero imputation prior to log ratio computation (Martín-Fernández et al., 2011).
CoDa also has implications for outlier detection. Given that components cannot be considered in isolation, multivariate outlier detection methods are called for. Once compositions have been transformed into log ratios, squared Mahalanobis distances between each composition and the overall mean can be computed. Under multivariate normality, these squared distances follow a $\chi^2$ distribution with D-1 degrees of freedom. An appropriate percentile of this distribution (e.g., the 99.9 percentile) can be used as the cut-off criterion for outlier detection.
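The outlier rule can be sketched as follows, under several assumptions made only for this illustration: the data are simulated, compositions are expressed as D-1 log ratios against the last part (Mahalanobis distances do not depend on which full-rank set of D-1 log ratios is chosen), and the $\chi^2$ percentile is obtained with the Wilson-Hilferty approximation in order to rely only on NumPy and the standard library. An exact quantile function, such as `scipy.stats.chi2.ppf`, would normally be preferred.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile_wh(p: float, dof: int) -> float:
    """Wilson-Hilferty approximation to the chi-square quantile (not exact)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return float(dof * (1.0 - c + z * np.sqrt(c)) ** 3)

def squared_mahalanobis(X: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each composition to the mean, in log-ratio space."""
    Z = np.log(X[:, :-1]) - np.log(X[:, [-1]])   # D-1 log ratios against the last part
    centred = Z - Z.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(Z, rowvar=False))
    return np.einsum("ij,jk,ik->i", centred, cov_inv, centred)

# Simulated data: 200 hypothetical firms, 4 accounts, plus one gross anomaly.
rng = np.random.default_rng(0)
X = np.exp(rng.normal(loc=[5.0, 3.0, 4.0, 6.0], scale=0.3, size=(200, 4)))
X[17, 0] *= 1000.0                               # firm 17 gets an aberrant first account

d2 = squared_mahalanobis(X)
cutoff = chi2_quantile_wh(0.999, X.shape[1] - 1)  # 99.9 percentile, D-1 degrees of freedom
outliers = np.flatnonzero(d2 > cutoff)

print(cutoff)
print(outliers)
```

Because the mean and covariance are themselves affected by outliers, robust estimates could be substituted in `squared_mahalanobis` without changing the rest of the sketch.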
CoDa also has its own multivariate statistical methods, which are in many cases closely related to standard methods used for log ratio transformations. In this article, we deal with compositional principal component analysis. Aitchison (1983) was the first to extend principal component analysis to the compositional case. The extension boils down to submitting centred log ratios (3) to an otherwise standard principal component analysis based on a covariance matrix, and adapting interpretation to take into account the fact that the information carried by the data refers to relative rather than absolute importance of magnitudes. Together with Gabriel's (1971) biplot, which jointly represents cases and variables in a principal component analysis, Aitchison's developments served as a basis for Aitchison & Greenacre (2002) introducing CoDa biplots.
A CoDa biplot can be understood as the most accurate data visualization of a compositional dataset in two dimensions. As in standard principal component analysis, overall biplot accuracy can be assessed from the percentage of variance (4) explained by the first two principal components. There are several types of biplots. The most interesting type for financial statement analysis purposes is the so-called covariance biplot, which optimizes the representation of the variation matrix (5) among the selected accounts and non-financial magnitudes. Accounts and other magnitudes appear as rays emanating from a common origin, which represents a firm with log ratios equal to the sample average. Firms appear as points. The interpretation of the covariance biplot is explained below (see Aitchison & Greenacre, 2002; Van den Boogaart & Tolosana-Delgado, 2013; Pawlowsky-Glahn et al., 2015 for further details). The two key interpretational elements are the rays representing accounts and other magnitudes, and the links between the vertices of a pair of rays:
• Lengths of the links between the vertices of the rays of two accounts or magnitudes are approximately proportional to the square root of the variance of their corresponding pairwise log ratio (5). Accounts which behave proportionally for all firms appear close together. It must be noted that, unlike in standard principal component analysis, distances between vertices are used rather than angles between rays.
• The orthogonal projection of all firms along the direction defined by a ray shows an approximate ordering of the relative size of that account or magnitude for all firms. This coincides with the interpretation in standard principal component analysis, the only difference being that relative rather than absolute size is considered, as implied in (3).
• The orthogonal projection of all firms along the direction defined by the link between the vertices of a pair of rays shows an approximate ordering of firms according to the standard financial ratio between the corresponding two accounts. In this way, the CoDa biplot is also a visual representation of any of the D(D-1)/2 possible financial ratios computed from any two accounts or magnitudes, although only long links showing high-variance pairwise log ratios tend to lead to informative directions. The ability to visually interpret ratios between any two magnitudes is of great interest in financial statement analysis. This interpretation is specific to CoDa biplots, and differs from that of standard principal component analysis. Unlike principal component analysis of standard financial ratios, the choice of ratios of interest does not need to be made in advance and hence does not influence the analysis outcome.
• The cosine of the angle between two links corresponding to the financial ratios between two pairs of accounts shows the approximate correlation between the two corresponding log ratios. For example, parallel links show log ratios with approximate correlations equal to 1 or -1. Orthogonal links show log ratios with approximate 0 correlation.
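The construction behind these interpretation rules can be sketched with a singular value decomposition of the column-centred matrix of centred log ratios. The scaling used below (firm points as standardized scores, ray vertices as loadings scaled by the singular values) is one common way of obtaining a covariance biplot and is assumed here; the data matrix and all names are hypothetical.

```python
import numpy as np

def covariance_biplot(X: np.ndarray):
    """Return 2-D firm points, account ray vertices and the share of variance explained."""
    n = X.shape[0]
    logX = np.log(X)
    clr = logX - logX.mean(axis=1, keepdims=True)   # centred log ratios, firm by firm
    Z = clr - clr.mean(axis=0)                      # origin = firm with average log ratios
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    firms = np.sqrt(n - 1) * U[:, :2]               # standardized principal component scores
    rays = Vt[:2].T * s[:2] / np.sqrt(n - 1)        # covariance-scaled loadings
    explained = (s[:2] ** 2).sum() / (s ** 2).sum()
    return firms, rays, explained

# Hypothetical accounts for six firms: fixed assets, inventory, net sales, cost of goods sold.
X = np.array([
    [120.0, 30.0, 400.0, 300.0],
    [80.0, 25.0, 390.0, 295.0],
    [200.0, 60.0, 500.0, 370.0],
    [50.0, 10.0, 150.0, 115.0],
    [300.0, 40.0, 600.0, 460.0],
    [90.0, 35.0, 310.0, 230.0],
])

firms, rays, explained = covariance_biplot(X)

# Link between the vertices of net sales (column 2) and fixed assets (column 0),
# compared with the standard deviation of the corresponding pairwise log ratio.
link_length = np.linalg.norm(rays[2] - rays[0])
log_ratio_sd = np.std(np.log(X[:, 2] / X[:, 0]), ddof=1)

print(explained)
print(link_length, log_ratio_sd)
```

The link length can never exceed the standard deviation of the log ratio, and the closer the explained share is to one, the closer the two numbers are.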
The use of compositional principal components and the CoDa biplot is not limited to visualizing the composition but is also appropriate for data reduction and summarization purposes. The first few principal component coordinates explaining most of the variance can be used as variables in further statistical analyses, exactly in the same way as in standard principal component analysis.

A case study: the largest retail chains in the Spanish grocery sector (2008-2015).
This example focuses on the largest retail chains in the Spanish grocery sector. Data are taken from the largest chains in 2015 (a total of 28 chains with a net income of over €200 million) using the SABI (Iberian Balance sheet Analysis System) database, developed by INFORMA D&B in collaboration with Bureau Van Dijk. These data correspond to annual accounts presented in official records and other non-financial information from 2008 to 2015. Four chains were ruled out due to missing or faulty information for the total period considered. The final sample size was therefore 24 retail chains, i.e., 192 observations over 8 years: Ahorramás, AlCampo, Alimerka, BonPreu, Caprabo, Carrefour, Cecosa Supermercados, Condis, Consum, Dia, Distribuciones Froiz, Eroski, HiperCor, Hiper Usera, JuanFornes, Lidl, Mercadona, Semark, Supermercados Champion, Supercor, Supeco Maxor, Supermercados Sabeco, Unión Detallistas Españoles, Vego Supermercados. The database was accessed on 28/7/2017. This sector is especially interesting for analysis due to the pressures and changes it has had to face during the last economic crisis. On the one hand, we must consider a very adverse international context at the beginning of the crisis, with an inflationary process in the food and energy markets (OECD, 2013), and on the other, the factors characteristic of the Spanish economy: demographic stagnation, a fall in the purchasing power of domestic economies, a decrease in household size, changes in purchasing habits (Pablos et al., 2013) and an increase in VAT. Faced with this complex scenario, large grocery distribution chains have been forced to react and modify or intensify their strategic approaches, and in so doing they have contributed to shaping the sector's recent evolution (OECD, 2013).
The observed trends for this sector in Spain are not isolated, coinciding with those observed for the vast majority of EU countries (European Union, 2016), although they have presented greater intensity in the Spanish case.
The non-negative financial and non-financial magnitudes used in this case study include:
x1 = Fixed assets (FA)
x2 = Inventory (I)
x3 = Quick assets (QA)
x4 = Long term liability (LTL)
x5 = Short term debt (STD)
x6 = Accounts payable (AP)
x7 = Number of employees (E)
x8 = Net sales (NS)
x9 = Costs of goods sold (CGS)
x10 = Labour costs (CL)
x11 = Costs of external services (CES)
x12 = Asset depreciation and amortization (AAD)
x13 = Other non-finance costs (CONF)
x14 = Finance costs (CF)
These magnitudes have been selected because they constitute the basis for a wide array of common financial and management ratios frequently used in the retail sector (Evans & Mathur, 2014), which belong to three main categories: solvency and liquidity, operating efficiency and profitability. Table 1 shows examples of relevant financial ratios that might be computed from x1 to x14. The use of CoDa does not favour one ratio over any other, as only the data transformed as centred log ratios (3) are used in the analysis. This ratio list is not closed; it could incorporate any possible ratio computed from x1 to x14. Thus, it is provided merely as an example in order to show which type of information is carried by x1 to x14.
[...] & Martín-Fernández, 2008) as implemented in the lrEM command with default options and the minimum non-zero observed value as detection limit. Eight outliers with squared Mahalanobis distance to the centre above the 99.9 percentile of the $\chi^2$ distribution were removed. The final sample size was thus n=184. The acomp, clr and princomp commands were used in the principal component analysis. However, once centred log ratios (3) are computed, any standard software handling covariance-based principal component analysis could be used with identical results. Subsequently, the biplot command was used, whose default options return the covariance biplot.

Results.
The variation matrix in Table 2 shows some cells with very low values, in other words, pairs of magnitudes displaying strong proportionality. Proportionality between number of employees (E) and labour costs (CL) suggests that average wages are very similar across chains and over time. Proportionality among net sales (NS), cost of goods sold (CGS), labour costs (CL) and costs of external services (CES) suggests that chains have very similar operating margins and cost structures.
Table 3 and Figure 1 show that two dimensions are appropriate for a compositional principal component analysis and thus that the two-dimensional biplot accurately represents centred log ratio variances (4) and the variation matrix (5).
Figure 2 shows the covariance biplot with only part rays (only 2005 rays are shown). As expected, magnitudes related to net sales and direct costs (NS, AAD, CES, CGS, CL), and thus to operating margins, appear close together. Heterogeneity among chains mostly lies in solvency, liquidity and turnover. The most leveraged chains appear to the right of the graph, with high finance costs (CF) and long term liabilities (LTL) relative to their activity volume (net sales, NS) and non-financial operating costs (AAD, CES, CGS, CL). The upper part of the biplot represents chains with high fixed assets (FA) and overheads (CONF) relative to the quick assets (QA) and liabilities (AP) related to core activity (NS).
Figure 4 shows the covariance biplot with the directions defined by some informative financial ratios between two magnitudes (only 2015 data shown). Note that we have not represented the ratios of net sales over inventory (NS/I) or net sales per employee (NS/E), which have very low variances given the proximity of their pairs of vertices. The represented ratios are:
• Fixed assets turnover (NS/FA).
• Share of overheads in the margin structure (CONF/NS).
• Share of financial costs in the margin structure (CF/NS).
• Asset depreciation and amortization over fixed assets (AAD/FA).
The angles between the links show the log fixed assets turnover (NS/FA) to be negatively correlated with the log share of overheads in the margin structure (CONF/NS) but to have almost no correlation with the log share of financial costs in the margin structure (CF/NS). Log fixed asset per employee (FA/E) also has a very low correlation with log accounts payable turnover (AP/CGS). Any other ratio of interest to the researcher, and its correlation with any other ratio, could be represented and interpreted in the same manner.
Although all observations have been used for its computation, only 2015 values are shown in Figure 4 for the sake of simplicity. The orthogonal projection of all chains along the direction defined by a ratio shows an approximate ordering of chains according to the ratio value. For instance, Hiper Usera (HU) has the highest accounts payable turnover (AP/CGS) and Supermercados Champion (CH) the lowest. In the same vein, Eroski (EK) and HiperCor (HC) have the highest fixed asset per employee (FA/E), and Vego Supermercados (VS) the lowest. Vego Supermercados (VS) has the highest fixed assets turnover (NS/FA) and HiperCor (HC) the lowest. HiperCor (HC) has the largest share of overheads in the margin structure (CONF/NS), Eroski (EK) the largest share of financial costs in the margin structure (CF/NS), and Vego Supermercados (VS) the largest depreciation and amortization over fixed assets (AAD/FA).
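The ordering of chains along a ratio direction can be sketched as a projection of the firm points onto the link between two ray vertices. The example below uses five hypothetical chains (A to E) and only three magnitudes (fixed assets, employees and net sales). With three parts the centred log ratios have at most two dimensions, so the biplot is exact and the projected ordering reproduces the ratio ordering; with more parts the ordering is only approximate. The biplot scaling is the same assumption as in a standard covariance biplot built from a singular value decomposition.

```python
import numpy as np

def biplot_coordinates(X: np.ndarray):
    """2-D covariance biplot coordinates: firm points and account ray vertices."""
    n = X.shape[0]
    logX = np.log(X)
    clr = logX - logX.mean(axis=1, keepdims=True)
    Z = clr - clr.mean(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return np.sqrt(n - 1) * U[:, :2], Vt[:2].T * s[:2] / np.sqrt(n - 1)

def order_along_link(firms: np.ndarray, rays: np.ndarray, j: int, k: int) -> np.ndarray:
    """Firm indices sorted by increasing projection on the link from ray k to ray j."""
    link = rays[j] - rays[k]
    scores = firms @ link / np.linalg.norm(link)
    return np.argsort(scores)

names = ["A", "B", "C", "D", "E"]                 # hypothetical chains
FA, E, NS = 0, 1, 2                               # column positions of the magnitudes
X = np.array([
    [100.0, 50.0, 400.0],
    [200.0, 40.0, 300.0],
    [50.0, 30.0, 450.0],
    [150.0, 80.0, 330.0],
    [80.0, 20.0, 480.0],
])

firms, rays = biplot_coordinates(X)
order = order_along_link(firms, rays, NS, FA)     # direction of fixed assets turnover NS/FA
ranked = [names[i] for i in order]

print(ranked)                                     # from lowest to highest NS/FA
print([names[i] for i in np.argsort(X[:, NS] / X[:, FA])])
```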
Although formal CoDa methods for clustering firms according to financial statement structure have been developed (Linares-Mustarós et al., 2018), the biplot makes for an appealing visual classification of strategic groups. For instance, firms close to the origin of the rays would fit the so-called stuck-in-the-middle strategy (e.g., AlCampo (AC), Supermercados Sabeco (SS) and Juan Fornes (JF)). Alimerka (AK), Mercadona (MD), Vego Supermercados (VS) and Carrefour (CF) seem to form a strategic group characterized by low fixed asset per employee and low share of financial costs in the margin structure. Supercor (SC) and HiperCor (HC) are characterized by low fixed asset turnover and large overhead share in the margin structure, and so on.

Discussion.
This work has presented a data analysis and visualization tool for supporting strategic management decisions. This approach uses the same type of financial statement information used by financial ratios but does not depend on the often arbitrary decision about which financial ratios to compute. Instead of drawing from any particular ratio, the approach maps accounts and other management magnitudes in a two-dimensional plot, on which firms can be ordered with respect to the ratios of any two magnitudes. Thus, the analysis supplies a global perspective on the financial and economic situation of firms, which is more enriching than that offered by a collection of ratios considered in an isolated way. In this way, the global characteristics of firms can be readily visualized. This tool is especially valuable in the field of strategic management. Analytical techniques such as SWOT analysis or strategic groups are applied more rigorously when based on an adequate metric. Moreover, even the use of conceptual paradigms such as the theory of resource-based analysis (Grant, 1991) or Michael Porter's analysis of industry structure and competitive positioning (Porter, 1980) is enhanced when a metric analysis can be performed. Firms with similar financial patterns and performance can be easily identified, as well as the key ratios according to which they behave similarly or differently. Thus, strategic groups can be considered in terms of the relative position of firms. Particular firms are represented and differences between them are derived by taking into account the multiple dimensions represented by each financial magnitude and all pairwise ratios. This tool visualizes the strengths and weaknesses of each firm in a relative way compared to competitors. Moreover, it helps to identify resource-based position by firm. In a global perspective, this tool is useful for describing industrial structure and the competitive positioning of each firm. 
Therefore, it can be a powerful methodology for assisting in internal strategic analyses. It can also help define the strategic objectives of the company and the subsequent formulation of strategies to be executed, as well as assessing the performance obtained from their implementation. If panel data are available, the individual time evolution of each firm can also be plotted and interpreted in terms of key ratios.
This method draws on the CoDa tradition and is based on logarithms of ratios, which have been shown to lead to less asymmetry and redundancy than standard financial ratios (Linares-Mustarós et al., 2018) and do not rely on any particular choice of financial ratios. Another advantage of CoDa in financial analysis is the existence of a formal and well-proven method for imputing zero values in financial statements. This makes it possible to perform financial analysis of firms with some zero account values, for which standard financial ratio analysis has to date only provided ad-hoc solutions or has merely dropped them from the analyses.
Several extensions are possible. In this article, mostly financial information has been used to develop and show the proposed methodological approach. However, this tool can deal with any type of quantitative variables, be they financial or not. For example, in the analysis of large distribution chains, we might have considered aspects as relevant as the number of establishments or the commercial area of each retail chain, the volume of Internet sales or other relevant data. All this information can be analysed using this methodological tool if it is in a quantitative metric and non-negative. The method can also be used as a data reduction tool and the first few coordinates explaining most of the variance can be used as variables in further statistical analyses, as explanatory variables in models predicting bankruptcy or belonging to previously known strategic groups, for example. The analysis can also be extended by using weights (Greenacre, 2018) or robust methods (Templ et al., 2011).