I will continue the common topic in SQL, string concatenation, and show how it can be handled in SQL Server. The application developer will obviously be tempted to see things in a loop and try to divide the logic row by row. After all these years with so many nice features and enhancements to allow a more suitable approach in the database, many developers are still tempted to go down at row level and open loops, transfer data in variables, take things slowly, slowly.  String concatenation is still a burden for SQL, in the sense that programmers are still tempted to loop through tables and to gather strings one each other according to the required logic. These kinds of solutions are still, unfortunately, the default solutions to be used by most of the developers.

I explained how it works in Oracle and illustrate the two alternates approaches, the first one  where the programmer is opening a cursor (or two) and transfer data in a row by row approach, finally getting the desired results. This was the atomic approach. I also explained the opposite approach, by which the programmer  see things holistically and try to handle the string concatenation directly in a set based manner. In Oracle, there is a function named LISTAGG that allows this manipulation of string concatenation directly, without the need to follow the traditional approach of cursor, loop, transfer, loop again etc.

I want to explain the same exercise as before but in SQL Server.  Despite the fact that Transact SQL is a declarative language while PL SQL is a procedural language, the style of development can be very similar. Looking at the same exercise with countries and  continents we can easily see how the programmer can solve the problem trying to think non SQL and going down to the row or trying to think SQL (and holistically) and using the set based functionalities that the system offer.

Let’s take the same example that will illustrate the problem of string concatenation. For that, we imagine a simple database called geo with two tables: continents and countries. Apart from the geo tables, imagine a result table called List_Of_Countries.

Here is the the design below:

IF OBJECT_ID(N'Geo_Continents', N'U') IS NULL
BEGIN
	CREATE TABLE Geo_Continents
	(
		Id                      INT NOT NULL,
		Continent               NVARCHAR(50) NOT NULL    
	);


	ALTER TABLE Geo_Continents ADD CONSTRAINT Pk_Geo_Continents_Id PRIMARY KEY (Id);
END	

IF OBJECT_ID(N'Geo_Countries', N'U') IS NULL
BEGIN
	CREATE TABLE Geo_Countries
	(
		Id                      INT NOT NULL, 
		Code                    NVARCHAR(2) NOT NULL,
		Name                    NVARCHAR(50) NOT NULL,
		Continent_Id            INT NOT NULL   
	);

	ALTER TABLE Geo_Countries ADD CONSTRAINT Pk_Geo_Countries_Id PRIMARY KEY (Id);

	ALTER TABLE Geo_Countries ADD CONSTRAINT Uq_Geo_Countries_Id UNIQUE (Code);

	ALTER TABLE Geo_Countries ADD CONSTRAINT Fk_Countries_Continents FOREIGN KEY (Continent_Id) REFERENCES Geo_Continents (Id);
END	

IF OBJECT_ID(N'List_Of_Countries', N'U') IS NULL
BEGIN
	CREATE TABLE List_Of_Countries (Continent NVARCHAR(50), List_Of_Countries NVARCHAR(1000));
END	

The tables contain the list of continents as well as some of the countries. Add the continents and countries here (see below).

INSERT INTO Geo_Continents (Id, Continent) VALUES (1, 'Europe');
INSERT INTO Geo_Continents (Id, Continent) VALUES (2, 'North America');
INSERT INTO Geo_Continents (Id, Continent) VALUES (3, 'South America');
INSERT INTO Geo_Continents (Id, Continent) VALUES (4, 'Central America');
INSERT INTO Geo_Continents (Id, Continent) VALUES (5, 'Africa');
INSERT INTO Geo_Continents (Id, Continent) VALUES (6, 'Asia');
INSERT INTO Geo_Continents (Id, Continent) VALUES (7, 'Australia');

INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (1, 'AR', 'Argentina', '3');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (2, 'AU', 'Australia', '7');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (3, 'BE', 'Belgium', '1');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (4, 'BR', 'Brazil', '3');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (5, 'CA', 'Canada', '2');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (6, 'CH', 'Switzerland', '1');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (7, 'CN', 'China', '6');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (8, 'DE', 'Germany', '1');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (9, 'DK', 'Denmark', '1');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (10, 'EG', 'Egypt', '5');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (11, 'FR', 'France', '1');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (12, 'IN', 'India', '6');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (13, 'IT', 'Italy', '1');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (14, 'JP', 'Japan', '6');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (15, 'RO', 'Romania', '1');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (16, 'ML', 'Malaysia', '3');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (17, 'MX', 'Mexico', '4');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (18, 'NG', 'Nigeria', '5');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (19, 'NL', 'Netherlands', '1');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (20, 'SG', 'Singapore', '6');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (21, 'UK', 'United Kingdom', '1');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (22, 'US', 'United States of America', '2');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (23, 'ZM', 'Zambia', '5');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (24, 'LU', 'Luxembourg', '1');
INSERT INTO Geo_Countries (Id, Code, Name, Continent_Id) VALUES (25, 'KR', 'South Korea', '6');

Now we have our data! Let’s take a look at our data, running a simple query. See here the data

SELECT co.Continent, cn.Code AS Country_Code, cn.Name AS Country_Name, cn.Continent_id
FROM Geo_Continents co INNER JOIN Geo_Countries cn
    ON co.Id = cn.Continent_Id
ORDER BY co.Continent,cn.Code;  

See the geo dataset below:

The goal of the exercise is to concatenate the countries codes for a continent and move every continent with the list of countries separated by a comma into the result table List_Of_Countries.

So, for example, for Africa, we should have EG, NG, ZM. For Asia we should have CN, IN, JP, KR, SG etc.

Let’s try the classic manner of work, mainly specific to application developers. Look at Oracle version and check the SQL Server version. See the similarities between them. Please also see the style of development, and see the fact that the developer simply ignores the fact that he/she is in a relational database and tries to go down at the row, because of the lack of tools, or better said, due to the lack of knowledge  of the tools that an engine like Oracle or SQL Server can offer. The programmer follows the classical approach.

  1. The classic (row, atomic, procedural approach)

The same tendency as in Oracle can be found in SQL Server , the tendency to think procedurally and atomically. The programmer is picking every continent in a cursor and transfer the continent, eventually the continent identifier. Being inside the cursor with continents, the programmer opens the second cursor, and chooses every country code for the selected continent. Afterwards, the developer concatenates one by one the country codes in a result string. Finally, the list of concatenated country codes are inserted along with the continent in the result table.

So, to conclude before viewing the traditional, atomic (row based) solution.

  1. Loop through continents, firstly, and take the continent into a variable.
  2. Loop second time through the countries, per selected continents, and concatenate the country codes.   We also add commas and exclude the last one!
  3. Inserting the continent and the list of country codes into the result table (for every continent).

Here is the exercise:

DECLARE 
	@v_Continent_Id					INT,
	@v_Country_Code					NVARCHAR(2),
	@v_Continent					NVARCHAR(50),
	@v_Count						INT,

	@v_Sum_Of_Countries 			NVARCHAR(1000);

	DECLARE C_Continent CURSOR FOR 
	SELECT DISTINCT cn.Continent_Id, co.Continent
	FROM Geo_Continents co INNER JOIN Geo_Countries cn
		ON co.Id = cn.Continent_Id
	ORDER BY 1;
BEGIN
	TRUNCATE TABLE List_Of_Countries;
	
	OPEN C_Continent;

	FETCH NEXT FROM C_Continent INTO @v_Continent_Id, @v_Continent;
	
	WHILE @@FETCH_STATUS = 0
	BEGIN
		SELECT @v_Sum_Of_Countries = '';

		SELECT @v_Count = COUNT(*) FROM Geo_Countries
		WHERE Continent_Id = @v_Continent_Id;
		
		IF @v_Count = 1 
			SELECT @v_Sum_Of_Countries = Code FROM Geo_Countries
			WHERE Continent_Id = @v_Continent_Id;
		ELSE	
		BEGIN
			DECLARE C_Countries CURSOR FOR
			SELECT Code AS Country_Code
			FROM Geo_Countries
			WHERE Continent_Id = @v_Continent_Id
			ORDER BY 1;
		
			OPEN C_Countries;

			FETCH NEXT FROM C_Countries INTO @v_Country_Code;
			
			WHILE @@FETCH_STATUS = 0 
			BEGIN
				SELECT @v_Sum_Of_Countries = @v_Sum_Of_Countries + @v_Country_Code + ', ';
				
				FETCH NEXT FROM C_Countries INTO @v_Country_Code;
			END;

			CLOSE C_Countries;
			
			DEALLOCATE C_Countries;

			SELECT @v_Sum_Of_Countries = SUBSTRING(@v_Sum_Of_Countries, 1, LEN(@v_Sum_Of_Countries) - 1);
		END;	

		INSERT INTO List_Of_Countries (Continent, List_Of_Countries)
		VALUES (@v_Continent, @v_Sum_Of_Countries);
		
		FETCH NEXT FROM C_Continent INTO @v_Continent_Id, @v_Continent;
	END;

	CLOSE C_Continent;
	
	DEALLOCATE C_Continent;

	SELECT * FROM List_Of_Countries ORDER BY Continent;
END;
GO


om the differences in the  syntax and particularities of PL SQL against Transact SQL, it is easy to see the style of development. It is the same. The programmer can be inside a relational database  and totally ignore that! This is what is happening now, for this exercise.

The flow is quite straight forward: take the continents, move the content  in variables, one by one. Lucky us we have just a limited number of continents, so the loop is small! Afterwards, per every continent, take the countries, again one by one, and start the process of concatenation. Dedicate a result string for the result (per continent), populate that result with the countries and add a comma. The same principle of work, very intuitive  and classic, but very expensive in the same time and large in terms of lines of code. However, if is to consider a serious issue, performance is clearly the first one and readability it follows, but at a great distance. I will come back to this later.

Let’s see the result after the first execution, using the atomic and procedural approach.

We obtained the desired result but with what cost? Of course, having such few data like in our example will not impact performance but imagine an average quantity of data and looping twice through it!

2. The set based approach for SQL Server, two possible solutions

As discussed earlier when showing the Oracle way of concatenate strings, by using the  dedicated function LISTAGG for this purpose (of string concatenation), in SQL Server things are more or less similar. The same tendencies, for allowing more and more functionalities that allow the set based way, SQL Server is not an exception.

There is an equivalent of LISTAGG in Oracle, the function STRING_AGG. Even etymologically, the meaning is obvious. But, before showing this facility, I want to explain that  this function is quite new. IT started with SQL Server 2017 (https://docs.microsoft.com/en-us/sql/t-sql/functions/string-agg-transact-sql?view=sql-server-2017).

For string concatenation, the function STRING_AGG is perfect. The definition is the same as  LISTAGG function in Oracle. It is used to aggregate strings (eventually from data in columns in a database table). Same as LISTAGG, this function allows direct strings concatenation without the need for any loop or cursor or other procedural facility. Let’s see the second solution, that I call the holistic solution offered by this new SQL Server function.

DELETE List_Of_Countries;

INSERT INTO List_Of_Countries (Continent, List_Of_Countries)
SELECT co.Continent, STRING_AGG(cn.Code, ', ') WITHIN GROUP (ORDER BY cn.Code ASC) AS List_Of_Countries
FROM Geo_Continents co INNER JOIN Geo_Countries cn
	ON co.Id = cn.Continent_Id
GROUP BY co.Continent;

SELECT * FROM List_Of_Countries ORDER BY Continent;

First of all, after executing the script above, you can easily see that the result table is the same for both approaches, so the string_agg function is able to do in one simple SQL statement  what the complex logic above was doing in a quite complex logic (opening two cursors and associated loops, transferring variables, concatenating etc.).

Of course, there are a lot of examples  over the Internet with STRING_AGG. We can easily go and check for others examples. What it really matters is the will to write in a certain way in a relational database and to unknowledge the paradigm. A 7 years old boy knows how to google nowadays, so trying to find out one function or another is the easy part of the problem. Being aware of a certain style of writing in the relational database is something beyond a simple search. Trying to gain a better performance by writing holistically and trying to adapt the paradigm in a relational database is something that can be achieved by any developer, if he/she  realizes the importance of this paradigm.

However, the STRING_AGG is a brand new function. What about systems written in 2014, or 2012 or even earlier? Is there any option  for a set based approach? There is another way of concatenating strings, the use of FOR XML PATH (https://docs.microsoft.com/en-us/sql/relational-databases/xml/for-xml-sql-server?view=sql-server-2017)

The same exercise can be done using this syntax. See the logic below.

DELETE List_Of_Countries;

WITH Countries (Continent, Country_Code)
AS
(
	SELECT co.Continent, cn.code AS Country_Code
	FROM Geo_Continents co INNER JOIN Geo_Countries cn
		ON co.Id = cn.Continent_Id
),
Continents (Continent, List_Of_Countries)
AS
(
	SELECT cc.Continent, (SELECT cn.Country_Code + ', ' 
	FROM Countries cn
	WHERE cn.Continent = cc.Continent ORDER BY cn.Country_Code FOR XML PATH('')) AS List_Of_Countries
	FROM Countries cc
	GROUP BY cc.Continent
)
INSERT INTO List_Of_Countries (Continent, List_Of_Countries)
SELECT Continent, SUBSTRING(List_Of_Countries, 1, LEN(List_Of_Countries) - 1) AS List_Of_Countries
FROM Continents;

SELECT * FROM List_Of_Countries ORDER BY Continent;

DELETE List_Of_Countries;

WITH Countries (Continent, Country_Code)
AS
(
	SELECT co.Continent, cn.code AS Country_Code
	FROM Geo_Continents co INNER JOIN Geo_Countries cn
		ON co.Id = cn.Continent_Id
)
INSERT INTO List_Of_Countries (Continent, List_Of_Countries)
SELECT cc.Continent, STUFF((SELECT ', ' + cn.Country_Code  
FROM Countries cn
WHERE cn.Continent = cc.Continent ORDER BY cn.Country_Code FOR XML PATH('')), 1, 1,'') AS List_Of_Countries
FROM Countries cc
GROUP BY cc.Continent;

SELECT * FROM List_Of_Countries ORDER BY Continent;

Obviously, the use of STRING_AGG seems much simpler as it actually is. However, the use of FOR XML PATH is also holistic and allow the same results. This solution is available in older versions of SQL Server and there are many examples over the internet that illustrate the string concatenation. I showed two ways of filling the result table, one with two with blocks and one with one and with the use of stuff function  just to add a bit of variety, but this is not necessarily very relevant to the topic.

3. Conclusions

Nowadays, the vendors are increasing the number of feature like this. However, there is not enough argumentation in the favor of that style of programming that is using these features and is using the strength of SQL. The features exists and their number is increasing. However,  programmers needs to be aware of their existence and need to know how to use these.

See the simplicity of the code in the set based approach, how simple the SQL  statement is compared with the complex procedural code in the first example! Increase the quantity of data and see the difference in performance, it is huge! As I mentioned earlier, if the simplicity of the code in the set based approach is, let’s say, an esthetic and subjective argument (I saw application developers  that really prefer the first approach as being intelligible for themselves); we cannot say the same about the performance! There is a significant difference in performance between the two approaches.

To conclude, the same problem, of string concatenation, can be solved in many ways. There are now more and more facilities that allow set based and holistic approach, the SQL ways. And this is very good for performance, firstly. However,  readability is also a relevant matter. Apart from being aware of these facilities, we also need to be aware to adapt our style of programming and use the correct manner when developing in a relational database.