Aggregation Precedence in Power BI: How does Power BI choose between candidate aggregation tables with the same Precedence?

Precedence is one of the aggregation properties that you can define on an aggregation table in Power BI. In a model with multiple aggregation tables, you can use Precedence to define the order in which aggregation tables are considered by Power BI to answer queries. The higher the Precedence number, the sooner the aggregation table is considered. Very simple and easy to understand, but the question is:

What does Power BI do when multiple aggregation tables are configured with the same Precedence value and they can all answer the same query? Which one is considered first? Does it choose the smallest one? Or is there another rule in place?

Spoiler Alert: It does not seem to choose the smaller aggregation table.

Background

In my previous post on Aggregations in Power BI, I used an example where the model had two aggregation tables:

One table at Product and Year level with 288 rows:

CREATE TABLE [dbo].[Sales_Agg1_By_Product_Year](
	[ProductKey] [int] NOT NULL,
	[Calendar Year] [smallint] NOT NULL,
	[Sales Amount] [money] NULL,
	[Unit Price Sum] [money] NULL,
	[Unit Price Count] [int] NULL,
	[InternetSalesRowCount] [int] NULL
) ON [PRIMARY]


INSERT INTO dbo.Sales_Agg1_By_Product_Year
SELECT	fs.ProductKey,
		dd.[Calendar Year],
		SUM([Sales Amount]) AS [Sales Amount],
		SUM([Unit Price]) AS [Unit Price Sum],
		COUNT([Unit Price]) AS [Unit Price Count],
		COUNT(*) AS InternetSalesRowCount

FROM [dbo].[vFactInternetSales] fs
JOIN [dbo].[vDimDate] dd
ON fs.DateKey = dd.DateKey
GROUP BY fs.ProductKey, dd.[Calendar Year]

One table at Year level with 5 rows:

CREATE TABLE [dbo].[Sales_Agg2_By_Year](
	[Calendar Year] [smallint] NOT NULL,
	[Sales Amount] [money] NULL
) ON [PRIMARY]

INSERT INTO dbo.Sales_Agg2_By_Year
SELECT	dd.[Calendar Year],
		SUM([Sales Amount]) AS [Sales Amount]
FROM [dbo].[vFactInternetSales] fs
JOIN [dbo].[vDimDate] dd
ON fs.DateKey = dd.DateKey
GROUP BY dd.[Calendar Year]

Obviously, before configuring the second aggregation table, Power BI used the first and only aggregation table available to answer the test query.

Before configuring the second Aggregation table.

After configuring the second aggregation table, Power BI picked the second one.

After configuring the second aggregation table.
Before setting the Precedence property.

Since the second table was smaller, this seemed like the desired behavior, but the question is: Did Power BI pick it because it was smaller, or was there another factor in play? Unfortunately, as you will see shortly, there seems to be something else going on.

Before I show you that, for reference: when I set the Precedence property of the larger aggregation table to 10, Power BI chose the larger aggregation table to answer the respective queries.

After configuring the second aggregation table.
After setting the Precedence property of the first aggregation table to 10.

This last example was only used to demonstrate that the Precedence property can direct Power BI to consider one aggregation table over another one.

In a real case, if there is more than one aggregation table configured and they can all be used to answer the same query, you do want the smallest aggregation table to be considered first.

What does Power BI do when aggregation tables have the exact same Precedence?

To find out why Power BI picked the second aggregation table when both had the default Precedence (0), I added a third aggregation table to the mix.

CREATE TABLE [dbo].[Sales_Agg3_By_ProductCategory_Year](
	[ProductCategoryKey] [int] NOT NULL,
	[Calendar Year] [smallint] NOT NULL,
	[Sales Amount] [money] NULL
) ON [PRIMARY]

INSERT INTO dbo.Sales_Agg3_By_ProductCategory_Year --49 rows
SELECT	dp.[ProductSubcategoryKey], -- note: DimProduct exposes ProductSubcategoryKey; it is loaded into the column named ProductCategoryKey above
		dd.[Calendar Year],
		SUM([Sales Amount]) AS [Sales Amount]
FROM [dbo].[vFactInternetSales] fs
JOIN [dbo].[vDimDate] dd
ON fs.DateKey = dd.DateKey
JOIN [dbo].[DimProduct] dp
ON fs.ProductKey = dp.ProductKey
GROUP BY dp.[ProductSubcategoryKey],
	 dd.[Calendar Year]

This table has 49 rows. It is smaller than the first aggregation table with 288 rows but larger than the second one with 5 rows.

I configured the third aggregation table similarly to the others.

Since the query in question is by Calendar Year, I did not configure ProductCategoryKey and did not connect the ProductCategoryKey column between the Sales_Agg3_By_ProductCategory_Year and Product Category tables in the model.

EVALUATE 
SUMMARIZECOLUMNS(
	'Date'[Calendar Year],
	"Sales Amount", SUM ('Internet Sales'[Sales Amount])
)

Multiple Aggregation tables with the same Precedence

After adding and configuring the third aggregation table in the model, the test query was answered from the third aggregation table:

It seems like Power BI chose the last configured Aggregation table to answer the query.

(Note: As expected, connecting ProductCategoryKey between the Sales_Agg3_By_ProductCategory_Year and Product Category tables in the model does not make any difference in this example. This is because that column is not used by the query in question.)

I repeated the same test as before, but this time I changed the order of the operations. I added Sales_Agg3_By_ProductCategory_Year to the model first, configured it, then added Sales_Agg2_By_Year and configured it. As I suspected at this point, Power BI chose Sales_Agg2_By_Year, which was the last table added to the model.

At this point I had yet another question in mind: In the absence of different Precedence values, is it the order in which aggregation tables are added to the model, or the order in which they are configured, that affects the order in which they are considered to answer queries?

To find out, I tested the following combinations:

|   | Agg2 configured 1st, Agg3 configured 2nd | Agg2 configured 2nd, Agg3 configured 1st |
| Agg2 added first (Close and Apply), Agg3 added second (Close and Apply) | Query uses Agg3 | Query uses Agg3 |
| Agg3 added first (Close and Apply), Agg2 added second (Close and Apply) | Query uses Agg2 | Query uses Agg2 |

The aggregation table last added to the model wins in all combinations.

A couple of notes:

(1) Aggregation table one (Sales_Agg1_By_Product_Year) was not considered in either combination.

(2) I repeated the same test with the Precedence set to 10 instead of 0 and got the same results.

Conclusion

It seems like Power BI favors the last aggregation table added to a model when there is a tie in the Precedence property. At the moment, there are not many details available about the inner workings of how Power BI decides which aggregation table to use to answer a query when there is more than one candidate table in the model. It is important to note that the dataset I worked with in this blog post was very small. It is not clear whether the same behavior would occur in an actual large dataset.

SQL Server Analysis Services (SSAS) Multidimensional (the other system in which I used aggregations extensively) has a complex set of rules for considering aggregations to answer queries. For example, one of the rules was the 1/3 rule: if an aggregation was larger than 1/3 of the main (fact) table, it would not be created at all. I wonder if Power BI uses a similar set of rules based on either the number of rows in a table or the table’s compressed size. Hopefully the Power BI team will provide more details over time.

Regardless of which rules Power BI uses to choose aggregation tables, always check the important queries you intend to answer from the aggregation tables to make sure the aggregation table you intended is the one being used. You can use either DAX Studio, as I showed in my previous post, or Extended Events for this.

Aggregations in Power BI

Aggregations became generally available in Power BI in July 2019. Aggregations are a modeling optimization technique that can improve user experience when dealing with large datasets. Combined with Power BI composite models, they make it possible to analyze big datasets efficiently by allowing high-level analytical queries to be answered from memory while sending more detailed queries back to the source database. The trillion-row demo by Christian Wade is one example of this scenario that shows how aggregations can make it possible for big datasets to be analyzed in Power BI. Even though that demo is based on a trillion rows (a quarter petabyte) of data, any dataset that is expensive to cache in memory because of space, CPU, or time can benefit from aggregations. Additionally, even Power BI reports built on smaller datasets can sometimes benefit from faster load times if they can use aggregations.

The concept of aggregations

The idea of aggregations is not a new concept in the world of data. In fact, you may have used them in the past without knowing it. Essentially, an aggregation is a summary object created from a detail object; it can answer some summary queries in place of the detail object.

Summary tables in SQL Server

Consider the following example. Let’s assume you have a Sales table containing sales amounts for 5,000 products sold at 2,000 stores over 10 years, and that every product is sold at every store on every day. This results in a table with roughly 36.5 billion rows. If you were to write a SQL query against this data that returned Sales for each Store for each year, the query would take a long time to run. Instead, you could create a smaller summary (aggregation) table that already stores the data at a higher level than the original table and query that. For example, assuming there are 200 Product Groups, a summary table by Product Group, Year, and Store would have 4,000,000 rows (200 × 10 × 2,000), a small fraction of the main table. Querying this table is much faster than querying the main table.
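For illustration, here is a minimal T-SQL sketch of such a summary table. The detail table dbo.Sales and its columns (ProductGroupKey, StoreKey, SalesDate) are hypothetical names for this sketch, not objects from the AdventureWorks model used later:

-- Hypothetical daily detail table summarized to Product Group/Store/Year
CREATE TABLE [dbo].[Sales_Summary_By_ProductGroup_Store_Year](
	[ProductGroupKey] [int] NOT NULL,
	[StoreKey] [int] NOT NULL,
	[Calendar Year] [smallint] NOT NULL,
	[Sales Amount] [money] NULL
) ON [PRIMARY]

INSERT INTO dbo.Sales_Summary_By_ProductGroup_Store_Year
SELECT	s.ProductGroupKey,
		s.StoreKey,
		YEAR(s.SalesDate),
		SUM(s.[Sales Amount])
FROM [dbo].[Sales] s
GROUP BY s.ProductGroupKey, s.StoreKey, YEAR(s.SalesDate)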

Even though this design improves query performance, it makes report development more complex. The report developer must decide which table to use for each report. High-level queries can use the summary table, whereas more detailed queries must go back to the detail table.

In this design, the aggregation table needs to be maintained as part of the overall database maintenance. When the main table is updated, the aggregation table needs to be updated as well.

Aggregations in SQL Server Analysis Services multidimensional (SSAS MD)

In SSAS MD, the predecessor to SSAS Tabular, aggregations are explicitly defined as actual objects that become part of the database. When designed properly, aggregations can significantly improve end user query performance; for example, in larger cubes you could see the difference between a 1-minute and a 5-second query response time. Even though SSAS MD’s storage engine is disk-based (vs. memory-based in Tabular), the aggregation files on disk are a fraction of the size of the measure group (fact table) files, so they can be loaded into memory and searched faster.

Like the summary tables in the previous SQL Server example, these aggregations have a cost in terms of disk storage space and the time it takes to process them to keep them updated. The more aggregations you create, the more queries they cover, but they take more space and need more time to process. There is also a point where adding more aggregations can have a negative impact on performance, much like having too many indexes on a SQL table. I had a case where removing one aggregation that was automatically created by the wizard improved performance, because the queries could then hit another, manually designed aggregation.

Ultimately the SSAS developer must come up with the right balance of aggregations and make sure that they are being used. In SSAS, this is an advanced task that many developers did not want to get into. If you used the wizard, you had to have all attribute counts specified correctly, the correct relationships in place, … and even then, you might not get what you wanted. There were many rules in place such that even if you requested an aggregation using hints in the wizard, the engine could refuse to create it. Usually there was a good reason for this decision, as the engine weighed the cost of using the potential aggregation vs. going directly to the fact objects; however, if this was not the case, you could manually create the aggregation. For complex cubes (lots of dimensions), it was sometimes better to skip the wizard entirely and manually create a custom set of aggregations that covered a certain set of queries.

Once aggregations are created, report developers or end users do not need to know about their existence. The engine automatically uses them to answer queries when possible.

At the time of writing this post, SSAS Tabular does not offer aggregations; however, since SSAS Tabular is the engine behind Power BI, it would make sense to expect aggregations to make their way into SSAS Tabular and Azure Analysis Services at some point in the future.

Aggregations in Power BI

In Power BI, aggregations start as tables just like any other table in a model. They can be based on a view or table in the source database, or created in Power BI with Power Query. They can be in Import or Direct Query storage mode.

Once in the model, these tables can be configured so that the engine can use them instead of a detail table to answer queries when possible. The process of creating and configuring aggregations in Power BI is significantly easier than the process of creating aggregations in SSAS multidimensional.

Once an aggregation table is configured, it becomes hidden from end users. Report developers and end users don’t know that it exists and don’t need to change anything in how they query the dataset.

At this time, aggregation tables can only be used with detail tables that are in Direct Query storage mode. It is not possible to configure aggregation tables to work with detail tables that are in Import storage mode.

Like SQL Server Summary tables and aggregation objects in SSAS multidimensional, aggregations in Power BI have a maintenance cost. Once they are configured, it is important to make sure they are actually being used.

Comparing Aggregations in different products

The following table compares aggregations as described in previous examples.

|   | Power BI | SSAS MD | SQL Server |
| Summary object explicitly called an Aggregation | Yes | Yes | No |
| Maintenance Cost | Yes | Yes | Yes |
| Reports must be aware of aggregations to use them. | No | No | Yes |
| Modeler has full control over their design. | Yes | No if designed by using the Wizard; Yes if designed manually | Yes |
| Easily configured | Yes | No | N/A |
| Modeler can control the order in which aggregations are considered by the engine to answer queries | Yes | No | N/A |
| Modeler can create a set of aggregations to approximately cover a set of reports at once | No | Yes | N/A |
| Modeler can create a set of aggregations based on a set of input queries to be covered (Usage Based Optimization) | No | Yes | N/A |

Setting up aggregations in Power BI

For the rest of this blog post, I use the AdventureWorks sample database to review a couple of examples that show how to configure aggregation tables in Power BI, and I use DAX Studio to see how they are being used.

I created a few views on top of the fact/dimension tables to simplify the model and take out the clutter. I also created two summary tables in the SQL database to be used as aggregation tables. You can download this version of AdventureWorks, the queries I use, and the pbix file from here.

This is what the model looks like:

The Internet Sales table has 60,398 rows, the Sales_Agg1_By_Product_Year has 288 rows, and the Sales_Agg2_By_Year has 5 rows.

Here is the SQL code I used to create the two aggregation tables:

CREATE TABLE [dbo].[Sales_Agg1_By_Product_Year](
	[ProductKey] [int] NOT NULL,
	[Calendar Year] [smallint] NOT NULL,
	[Sales Amount] [money] NULL,
	[Unit Price Sum] [money] NULL,
	[Unit Price Count] [int] NULL,
	[InternetSalesRowCount] [int] NULL
) ON [PRIMARY]


INSERT INTO dbo.Sales_Agg1_By_Product_Year
SELECT	fs.ProductKey,
		dd.[Calendar Year],
		SUM([Sales Amount]) AS [Sales Amount],
		SUM([Unit Price]) AS [Unit Price Sum],
		COUNT([Unit Price]) AS [Unit Price Count],
		COUNT(*) AS InternetSalesRowCount

FROM [dbo].[vFactInternetSales] fs
JOIN [dbo].[vDimDate] dd
ON fs.DateKey = dd.DateKey
GROUP BY fs.ProductKey, dd.[Calendar Year]

CREATE TABLE [dbo].[Sales_Agg2_By_Year](
	[Calendar Year] [smallint] NOT NULL,
	[Sales Amount] [money] NULL
) ON [PRIMARY]


INSERT INTO dbo.Sales_Agg2_By_Year
SELECT	dd.[Calendar Year],
		SUM([Sales Amount]) AS [Sales Amount]
FROM [dbo].[vFactInternetSales] fs
JOIN [dbo].[vDimDate] dd
ON fs.DateKey = dd.DateKey
GROUP BY dd.[Calendar Year]

For the rest of this blog post, I will start with all the tables in Direct Query storage mode and configure the two aggregation tables. I will use DAX Studio to see if the queries I plan for the aggregations to cover get answered from them. Then I will move the aggregation tables into memory by changing their storage mode to Import. You can read more about storage modes and composite models here. In short, Direct Query storage mode keeps the actual table data in the data source, while Import storage mode loads the table data into Power BI’s in-memory engine, called VertiPaq. Finally, there is the Dual storage mode, where a table can behave either as a Direct Query table or as an Import table depending on the query. In Dual mode, a copy of the table data is loaded into VertiPaq but is not always used.

Example 1: No Aggregations defined

To establish a baseline, let’s first take a look at what happens when a query runs against the model that is entirely in Direct Query and no Aggregation tables have been configured yet.

Here is a sample query that returns Sales Amount for each ProductKey.

EVALUATE 
 SUMMARIZECOLUMNS(
     'Product'[ProductKey],
     "Sales Amount", SUM ('Internet Sales'[Sales Amount])
 ) 

If you run this query in DAX Studio (make sure the Server Timings option is selected in the top ribbon), you will see the following:

This query generates a SQL event, which sends a SQL query back to SQL Server. This is because the model is in Direct Query mode. If you look at the details of the query, you can see that it joins the views that Internet Sales and Product are based on.
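The SQL that is sent is roughly of the following shape (a simplified sketch; the actual query Power BI generates is more verbose):

SELECT	p.[ProductKey],
		SUM(fs.[Sales Amount]) AS [Sales Amount]
FROM [dbo].[vFactInternetSales] fs
JOIN [dbo].[vDimProduct] p
ON fs.ProductKey = p.ProductKey
GROUP BY p.[ProductKey]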

Example 2: Detail and Aggregation tables in Direct Query mode.

Even though the ultimate use case for an aggregation table is when it is in Import mode, queries can benefit from an aggregation table even when it is itself in Direct Query mode, as you will see.

Configuring Sales_Agg1_By_Product_Year

You can configure an aggregation table by right-clicking on any table in the model and choosing Manage Aggregations. In the dialog that pops up, you will see one row for each column of the aggregation table. Not all rows need to be filled out for an aggregation table to work.

Example of an Aggregation based on relationships

To cover the query mentioned before, the only row that needs to be filled out is Sales Amount. Here you are telling Power BI Desktop that if a query comes in asking for the Sales Amount column from the detail table, Internet Sales, it can use the sum of the Sales Amount column from the aggregation table, Sales_Agg1_By_Product_Year.

The reason this is all that is needed to cover the previous query is that the aggregation table is connected via a relationship to the Product table. In the absence of relationships, you have to use a different method, which I will explain later in another example.

Now if you run the same query in DAX Studio, you will see that it runs in 27 ms vs. 49 ms. Even in such a small dataset, being able to use a smaller table to answer the query made a difference in total run time. In this example, the detail table had 60,398 rows and the aggregation table had 288 rows. You can imagine what a difference in run time it would make if the detail table had millions of rows and the aggregation table had thousands of rows.

This time a RewriteAttempt event is generated before the SQL event. In this case, the Rewrite is successful.

In the details of the Rewrite Attempt you can see how Power BI was able to make the match.

If you look at the SQL event now, you will see that this time it used Sales_Agg1_By_Product_Year instead of the detail table. You can also see how it is able to join the aggregation table with vDimProduct based on ProductKey.

At this point, with this simple configuration, the aggregation table is able to answer any query asking for Sales Amount by any of the Product, Product Subcategory, and Product Category dimensions. This is possible because of the existing relationship between the aggregation table and Product.

Example of an Aggregation Miss

If you are wondering what an aggregation miss looks like, run the following query; it will result in a miss, since Order Quantity is not included in the aggregation table. Note that I did this only to show what a miss looks like. In real life, you would typically include all quantity columns from the detail table in the aggregation table.

EVALUATE 
 SUMMARIZECOLUMNS(
     'Product'[ProductKey],
     "Order Quantity", SUM ('Internet Sales'[Order Quantity])
 ) 

If you run this query in DAX Studio, you will see the following:

This time the RewriteAttempt failed, as expected. If you look in the Details, you will see the following message telling you it could not find any information about Order Quantity in the aggregation table:

Obviously, the SQL event reverts back to using the detail table in this case.

Example of covering an Average calculation with an aggregation

If you look at the drop-down for the aggregation summarization options, it does not include an option for Average. It only includes Sum, Min, Max, Count, and Count table rows. (Ignore GroupBy for now.)

So how would you cover the following query?

EVALUATE
SUMMARIZECOLUMNS(
	'Product'[Product Name],
	"Avg Unit Price", AVERAGE('Internet Sales'[Unit Price])
)

Thankfully, Power BI is able to use these basic summarization functions to cover far more queries than you may think. If it can break a calculation down into these basic calculations, then it can use an aggregation to answer the more complex calculation. In the case of Average, it is able to break it down into Sum/Count. As long as you provide a Sum and a Count entry in the aggregation table, it is able to use them.
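Conceptually, the rewritten query is equivalent to the following SQL against the aggregation table (a sketch for illustration only; the actual query Power BI generates may look different):

-- Average reconstructed from the pre-aggregated Sum and Count columns
SELECT	dp.[Product Name],
		SUM(agg.[Unit Price Sum]) / SUM(agg.[Unit Price Count]) AS [Avg Unit Price]
FROM [dbo].[Sales_Agg1_By_Product_Year] agg
JOIN [dbo].[vDimProduct] dp
ON agg.ProductKey = dp.ProductKey
GROUP BY dp.[Product Name]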

Consider the following two queries:

Query 1:

EVALUATE 
SUMMARIZECOLUMNS(
	'Product'[Product Name],
	"Unit Price Sum", SUM ('Internet Sales'[Unit Price])
)

Query 2:

EVALUATE 
 SUMMARIZECOLUMNS(
     'Product'[Product Name],
     "Unit Price Count", COUNT('Internet Sales'[Unit Price])
 ) 

The aggregation table can easily be configured to cover each of the above queries. All you have to do is fill out the rows for Unit Price Sum and Unit Price Count in the aggregation properties as shown below:

Once this is done, both of the queries are able to benefit from the aggregation as shown in the following pictures:

Now if you run the Average query, you will see how Power BI is able to break it down into Sum/Count.

If you look in the Details, you will see the following:

Finally, note that you cannot use “Count table rows” in place of defining a Count summarization for Unit Price Count. Even though the two are semantically the same in this example, Power BI looks for a Count summarization to calculate an Average.

Example of an aggregation not based on relationships

So far, the queries we have looked at have used an attribute of the Product table. Now consider this query:

EVALUATE 
SUMMARIZECOLUMNS(
	'Date'[Calendar Year],
	"Sales Amount", SUM ('Internet Sales'[Sales Amount])
)

If you run this query at this point, it will not use Sales_Agg1_By_Product_Year, since this aggregation table does not have a relationship with the Date table. The aggregation table is at Year level whereas the Date table is at Day level, so it is not possible to connect the two. This is the design situation you would run into with some big data systems.

In order to define aggregations that are not based on relationships, you must use the GroupBy summarization. Note that the Detail Table here is set to the Date table, not Internet Sales. The GroupBy summarization has another use case later that will use Internet Sales as the Detail Table.

If you run the previous query after you configure the GroupBy for Calendar Year, you will see the following:

Example of covering a Distinct Count calculation with an aggregation

The Distinct Count summarization is not available as a built-in summarization; however, you can use the GroupBy summarization to implement it. Remember that GROUP BY in SQL is equivalent to DISTINCT.
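For example, the following two T-SQL queries return the same number, which is why a GroupBy entry can stand in for a distinct count:

-- Distinct count over the detail table
SELECT COUNT(DISTINCT ProductKey) FROM [dbo].[vFactInternetSales]

-- The same count from the grouped (distinct) keys
SELECT COUNT(*)
FROM (SELECT ProductKey FROM [dbo].[vFactInternetSales] GROUP BY ProductKey) t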

Consider the following query:

EVALUATE 
SUMMARIZECOLUMNS(
	'Product'[Color],
	"Distinct Products Sold", DISTINCTCOUNT( 'Internet Sales'[ProductKey])
)

To cover this query, you must add a Group By summarization for ProductKey to Sales_Agg1_By_Product_Year:

Note how the GroupBy for Calendar Year has the Date table as its Detail Table, whereas the GroupBy for ProductKey has Internet Sales as its Detail Table. This is because the GroupBy serves a different purpose in each scenario. In the latter case, its purpose is to provide a distinct count of ProductKey.

If you run the previous query in DAX Studio, you will get a match. The following shows the data request:

If you look at the SQL query submitted, you will see that it uses the ProductKey column from Sales_Agg1_By_Product_Year.

Performance consideration: This method can be useful in scenarios where there are billions of rows in the detail table and two to five million distinct values in the column to be distinctly counted. If there are more than two to five million distinct values, query performance can suffer, meaning the aggregation will not help the query run faster.

Example 3: Adding a second Aggregation table, all tables still in Direct Query mode.

You can have more than one aggregation table in a model based on the same detail table. For example, consider the second aggregation table, Sales_Agg2_By_Year. This table has only 5 rows compared to the 288 rows of Sales_Agg1_By_Product_Year. For a query asking for Sales Amount by Year, it would be much faster if Power BI could use this smaller table. Keep in mind that this is just an example; in real life you will not see a tangible difference between using a table with 5 rows versus a table with 288 rows. You would see a difference if Sales_Agg1_By_Product_Year had 288,000 rows.

Configuring Sales_Agg2_By_Year

Before configuring Sales_Agg2_By_Year, if you run the following query, you will see that it is answered by Sales_Agg1_By_Product_Year:

EVALUATE 
SUMMARIZECOLUMNS(
	'Date'[Calendar Year],
	"Sales Amount", SUM ('Internet Sales'[Sales Amount])
)

Now let’s configure Sales_Agg2_By_Year. This aggregation table does not have a relationship with the Date table in the model, so the Calendar Year column must be configured using the GroupBy summarization as shown below:

Now if you run the same query as before, you will see that it will be answered from Sales_Agg2_By_Year.
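Since everything is still in Direct Query mode, the SQL that reaches the server is now trivially small, roughly like this (a sketch; the exact query Power BI generates may differ):

SELECT	[Calendar Year],
		SUM([Sales Amount]) AS [Sales Amount]
FROM [dbo].[Sales_Agg2_By_Year]
GROUP BY [Calendar Year]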

Aggregation Precedence

You can control the order in which aggregation tables are considered by Power BI to answer queries with the Precedence property on each aggregation table.

For example, let’s say that for some reason you wanted the previous query to be answered from Sales_Agg1_By_Product_Year. Note that this does not make sense, since it is a larger table, but I am only using it as an example.

You can set the Precedence property on Sales_Agg1_By_Product_Year. The higher the number, the earlier it will be considered by Power BI.

Now if you run the previous query, it will be answered from Sales_Agg1_By_Product_Year.

Example 4: Detail table in Direct Query mode and Aggregation table in Import mode.

So far, all tables have been in Direct Query mode. In most cases, you get the most benefit from aggregation tables when you can load them into memory.

You can change the storage mode of a table by right-clicking on the table in the Model view and opening its properties. You will see Storage mode under Advanced.

You can change the storage mode from Direct Query to Import. Note that this is not a reversible change.

Example of an Aggregation Hit

Let’s change the storage mode of both aggregation tables to Import. Power BI gives a set of warnings when you do this. It warns you that this is not a reversible change and that it may take some time for the data to load. It also suggests that you change the storage mode of the tables connected to Sales_Agg1_By_Product_Year to Dual. If they are not set to Dual, the relationships between them and the aggregation table are considered weak, and the aggregation table will not be used to answer queries by the set of Product tables.

In order for an aggregation to be considered to answer a query based on relationships, the aggregation table and the related tables involved in the query must have the same storage mode, or one side has to be set to Dual.

The following shows all the combinations in which a relationship is considered strong and an aggregation table is considered:

| Many Side (Aggregation table) | One Side (Dimension table) |
| Dual | Dual |
| Import | Dual or Import |
| Direct Query * | Dual or Direct Query * |

* Both sides in Direct Query have to be from the same data source.

Finally, note that many-to-many relationships are always considered weak, regardless of the storage mode of the tables.

Once you change the storage mode of Sales_Agg1_By_Product_Year to Import, run the following query:

EVALUATE 
SUMMARIZECOLUMNS(
	'Product'[Color],
	"Sales Amount", SUM ('Internet Sales'[Sales Amount])
)

You will see that this time the RewriteAttempt succeeds, similar to before, as expected. Also note how this query ran in just 5 ms.

This event is followed by a Scan event instead of a SQL event. This is because the aggregation table is in Import mode. If you look at the Details, you will see the following code, which is in the internal language Power BI uses to query its in-memory data.
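The query in the Details pane looks roughly like the following xmSQL (an approximation for illustration; the exact table and column lineage in your trace will differ):

SELECT
	'Product'[Color],
	SUM ( 'Sales_Agg1_By_Product_Year'[Sales Amount] )
FROM 'Sales_Agg1_By_Product_Year'
	LEFT OUTER JOIN 'Product'
		ON 'Sales_Agg1_By_Product_Year'[ProductKey] = 'Product'[ProductKey]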

Example of an Aggregation Miss

To see what an Aggregation Miss looks like, run the following query:

EVALUATE 
SUMMARIZECOLUMNS(
	'Product'[Color],
	"Order Qty", SUM ('Internet Sales'[Order Quantity])
)

Order Quantity does not exist in the aggregation tables, so this results in a RewriteAttempt failure.

Note how this query took 101 ms versus 5 ms for the previous query that was answered from the aggregation table in memory.

The RewriteAttempt is followed by a SQL event.

Summary

Aggregations are a great optimization technique that can improve end user experience significantly even for datasets that are way smaller than a trillion rows.

Currently there are some limitations. The main one is that aggregations can only be used on top of detail tables that are in Direct Query mode. Hopefully this restriction will be lifted in the future.

Aggregations are not useful for all calculations. For example, they cannot answer queries with calculations that iterate over the data row by row. In these cases, a potential workaround is to see if the row-by-row calculation can be done with a calculated column instead.

Aggregations have a cost no matter how or where they are created, so their actual usage should be monitored. This can be done with DAX Studio for individual queries, or by monitoring the ‘Query Processing\Aggregate Table Rewrite Query’ event with SQL Server Profiler/Extended Events for a larger set of queries over a period of time.

Finally, even though Aggregations are a great technique for improving query performance, following best practices for a good model design should be the first step in optimizing a model.

Video

Here you can watch a video of a presentation I did for PASS BA Virtual Chapter on the same topic covered in this post.

A simple Power BI report to help you choose which PASS Summit 2019 activities to attend.

PASS Summit 2019 is about to start. There are many social and educational events during the week. You can find a complete list here. It is somewhat difficult to get an overall look at events per day, so I put together this simple Power BI report to get a list of events by day/time/sponsor.

Here is a screenshot of the report:

See the report in Power BI Service here.
The PBIX file can be downloaded here.

Note: There are a few more community events that are not listed on PASS’s website. The best way to keep up with these events is to follow #sqlfamily and #PASSSummit on Twitter.

Another look at PASS Summit 2019 sessions using Power BI Desktop and “Get Data From Web”

Last week I published a post on how I used Power BI Desktop to import the PASS Summit 2019 session schedule and build a few reports for myself to help me sort my sessions of interest by day/time/room and decide which ones to attend.

In that post, I had initially tried to import the schedule with “Get Data From Web” within Power BI Desktop but was not able to do so because I ran into the following error and did not know how to get around it:

Unable to Connect: We encountered an error while trying to connect. Details: "The server committed a protocol violation. Section=ResponseHeader Detail=CR must be followed by LF"

It looks like something in this page’s HTTP response does not follow the expected pattern, so “Get Data From Web” is not able to pull its data directly.

Thankfully, Reza Rad showed me an alternative way to get data from a web site at SQL Saturday Atlanta. I was able to use this method, which I explain later in this post, to pull the session schedule directly from PASS’s website and bypass the manual export-to-PDF/Excel and import-into-Power BI Desktop steps. This makes it a lot easier to keep the report up to date with changes in the schedule, as they can sometimes happen at the last minute.

You can find the new pbix file here: “PASS Summit 2019 Sessions From Web Live.pbix“.

Sessions Summary

In this Power BI Desktop report, in addition to the session categories I had used previously, I added some Word Cloud and Q&A visuals.

The Q&A feature is pretty cool. You can ask questions such as “sessions on Power BI” or any other topic and get a list of the sessions. This works out of the box without any further configuration. You can also ask questions such as “which speaker has the most session count?” (Hint: The answer is Hamish Watson, as seen in the Speaker Word Cloud visual. Check out all the Speaker Idol sessions.)

The following shows the “Power BI” report section which is one of the session categories I am interested in.

One of my favorite reports is the “Room Schedule” report. It looks like if you don’t like walking and want to stay at one room/building, any of the TCC rooms will have plenty of Power BI/Azure/AI sessions.

Get Data From Web and PASS 2019 schedule

If you try “Get Data From Web” to pull data from the PASS Summit 2019 schedule page, you get an error. This method works for a lot of pages, but for some reason it does not work for this one.

You get this error:

Unable to Connect: We encountered an error while trying to connect. Details: "The server committed a protocol violation. Section=ResponseHeader Detail=CR must be followed by LF"

One way to get around this is to start with “Edit Queries” from scratch and use the M language function Web.Page(Web.Contents()) to pull the data. Instead of doing this, you can connect to a different web page (Rotten Tomatoes Top Movies) that Power BI is able to read from and edit the web address later in the Query Editor. That’s what I did.

Start with Get Data From Web:

Type in the web address of a page that has any kind of a list in it, for example: https://www.rottentomatoes.com/top/

Once the page contents show in the Navigator window, choose any Table from the page and click on Transform Data.

When the Query Editor opens, under Applied Steps, remove “Changed Type” and “Navigation” steps. Click on the “Source” step:

This step shows the M function Web.Page(Web.Contents()) that is used to pull the page data:

= Web.Page(Web.Contents("https://www.rottentomatoes.com/top/"))

Change the web address to https://www.pass.org/summit/2019/Learn/Schedule.aspx

Once the contents load, you can see that each day of the conference has its own Table. Click on each Table to add it to the model.

You can download the pbix file shown above from here.

Once you have this file, it is easy to do some cleanup and end up with the pbix file that I used to look at the sessions.

Hint: You can get one of the days cleaned up to the form you want, duplicate it, and change the navigation step to a different day by changing the number in “Source{xx}[Data]” as highlighted in the picture above.
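For example, after duplicating a query, the navigation step in the formula bar looks like this in M (a sketch; the index you need depends on which day’s table you want):

= Source{2}[Data] // change the 2 to navigate to a different day's table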

The final pbix file is available here:

PASS Summit 2019 Sessions From Web Live.pbix

PASS Summit 2019 sessions review: What sessions to attend?

PASS Summit is only 2 weeks away. It is time to review the sessions and decide which ones to attend. You can find the published schedule here. This is a great at-a-glance look. There is also a searchable version of the schedule published here. You can use this one to search by keyword or track and you can add sessions to your own schedule if you are logged in. There is also an app that you can use. Either way, you will see that there are plenty of great sessions available to choose from.

While I find both links useful, I was looking for a better list for my own use that would make it easier for me to choose sessions. I wanted to come up with a list of sessions I am interested in based on topic and sorted by day/time along with the room number. To do this, I built a Power BI report!

I downloaded the PDF version of the schedule from the bottom of the schedule page. Then I exported that to Excel using Adobe. Once I had the schedule in Excel, I imported it into Power BI Desktop.

Note: I have posted an updated version of this post where I was able to use Get Data From Web to directly pull the data into Power BI Desktop. See the newer post here. The live report is also available from here.

The rest was easy. Power BI is amazing at quickly cleaning data and shaping it into a usable format. I parsed the session names and came up with categories that I was interested in, such as Power BI, PowerShell, PowerApps, Power Platform, Azure, Speaker Idol, and Others. These categories are Power Platform/cloud focused.

You can download the Power BI file from here and create other categories by editing the M expression that creates the Type column.

Disclaimer: The session schedule is subject to change, and if that happens, this list may not be up to date. For the latest information, always check the main published schedule. Rooms/times can change even on the day of a session, so make sure you download the conference’s app on your phone or check the web page for changes.

Session Categories

Having a BI background, I am interested in Power BI and Azure sessions. I was happy to find a good number of sessions in these categories. The following shows the other categories that I am interested in. These are my own categories that I came up with by parsing the session names. They are not official PASS categories, so they may not correspond completely with what is on the official schedule.


Session Levels

As expected, most sessions are level 200 followed by 300. I think this is a good distribution.

Power BI Sessions

There are 17 Power BI sessions; I wish I could attend them all. Unfortunately, some are in the same time slot. At least there are 11 time slots.

PowerShell sessions

I am also very interested in PowerShell as I use it on a regular basis to manage servers. There are 3 PowerShell sessions.

Azure Sessions

An amazing selection of Azure sessions are available this year. It is going to be hard to choose from this list!

Power Platform/PowerApps Sessions

These 3 sessions all sound very interesting. Again, hard choice here.

Speaker Idol Sessions

Speaker Idol is a competition between community speakers for the coveted prize of a session at the next PASS Summit. Each session will have 4 or 5 speakers with different topics, each speaking for 5 to 10 minutes. You can learn a great deal from each speaker. Aside from learning cool short topics, these sessions are both entertaining and educational if you are interested in public speaking or even making presentations for work. See my blog post from 2016 when I competed.

Room Schedule

Let’s say you like a room so much that you decide to spend your entire day/conference in that room. You can use the Room tab in the PBI file to find out what kind of education you will receive if you stay put. This is only for fun!

Really high-level sessions (Level 400 and 500)

There are 13 high-level sessions available. Great sessions, potentially tough on the brain.

So many choices…so little time.

To get the best possible learning experience, you should have an idea of what sessions to attend before you get to the conference. In addition to the regular sessions, make sure to leave enough time to explore the community zone, exhibit hall, Microsoft clinics, and many more learning/networking opportunities that will be available.

A great way to keep up with all the events is to follow @sqlpass on Twitter.

Connecting live from Power BI Desktop to SSAS generates this error: The Server you’re trying to connect to doesn’t have any models or you don’t have permissions to access them.

I recently ran into an error while connecting live from Power BI Desktop to a SSAS server. Everything was on-premises; there was no cloud component involved. I had full admin rights to the SSAS server and could see all databases and models from SSMS and other tools, but was getting this error from Power BI Desktop.

The Server you’re trying to connect to doesn’t have any models or you don’t have permissions to access them.

If I explicitly specified the name of the database I wanted to connect to, then it would connect and show me the database contents; but if I did not specify the database name, I would get this error. The question is why? Obviously it wasn’t a permissions problem, because I could connect to that database if I specified its name, I was admin on the SSAS server, and I could see that database and other databases from other front-end tools such as Excel and SSMS. You can get this exact same error because of permission issues, and in those cases it is a valid error. Here I am only exploring the case where all the correct permissions are in place and I still get this error.

The problem was …

The default database on that server had been created with a script as an empty shell database for someone to use later. It had stayed empty and was never deployed to. This empty database was causing the problem. Here is an example of a simple JSON (TMSL) script that creates an empty SSAS database:

{
  "create": {
    "database": {
      "name": "test 3",
      "compatibilityLevel": 1200
    }
  }
}

Note: This exact same error can happen with SSAS Multidimensional if the first database created on the server has a cube that is in an unprocessed state.

Default Database on SSAS

The first database created on a SSAS server becomes the default database. There is no server property that you can set to modify this later; at least, I could not find one through SSMS. I suspect it is in one of the metadata files, but I did not find it readily. On the other hand, as I will explain later, there is a SSAS property that can disable automatically setting the first database as the default database, but that does not give you a way to set the default database explicitly. Changing this setting has its own implications that I will get into later in this post.

How I found out the cause of the error

I started a SQL Server Profiler trace on the SSAS server (you can use Extended Events as well) and attempted the same live connection from Power BI Desktop with different databases on the server. I noticed something very interesting. The query that Power BI Desktop submits to SSAS to discover the list of databases seemed to differ depending on:

  1. Whether the SSAS server is completely empty (no databases exist on the server).
  2. The state of the default database (empty, or unprocessed for Multidimensional).
  3. The AutoSetDefaultInitialCatalog SSAS server setting.

Let’s explore these.

1) The server is completely empty.

In this case, there are no databases on the server. Power BI Desktop can connect live to this server successfully; I got an empty list of databases, as expected.


In this case, SQL Server Profiler captured the following query from Power BI Desktop:

Select [CATALOG_NAME], [DESCRIPTION], [DATE_MODIFIED], [COMPATIBILITY_LEVEL]
from $System.DBSCHEMA_CATALOGS 
Query Profiler With Empty Server

Notice that this query ran against the server and that the DatabaseName column is empty in the trace shown above. This is possible because it is a server-level query. It uses one of the SSAS DMVs (Dynamic Management Views) to get metadata, and it returns a list of all databases on the server. You can run this query in SSMS in a DMX query window. Here is an example of what this query would have returned if there were databases on this server:

DMX Query 1 Results

To summarize, Power BI Desktop worked successfully in this case. There were no databases to show and no errors were generated. The reason I included this case is so you can see how the query changes in the next case.

2) The default database is empty.

Continuing with the same empty server, I scripted an empty database using the script shown previously. The first database created on a server becomes the default database on the server. This is what I could see at this point from SSMS on this server:

Empty Scripted Database From SSMS

I went to Power BI Desktop and attempted to live connect without specifying the database name. I got this error:

Main Error

The question is: What did adding an empty scripted database do that changed Power BI Desktop’s behavior?

Here is the DMX query that Power BI Desktop submitted this time, which is different from the one in case 1:

Select [CUBE_NAME] from $System.MDSCHEMA_CUBES where [CUBE_SOURCE] = 1
Query Profiler With One Empty Scripted

Notice how in the Profiler trace, the Database is set to “test 3”.

If I run this query in SSMS, it returns no results because that database is empty. There are no models defined in it yet. If there were models in the database, I would have seen something like the following:

DMX Query 2 Results

Note that this DMV has existed since SSAS Multidimensional, hence the name (Cube), but it works against both Tabular and Multidimensional SSAS databases, as long as they have a model/processed cube.

3) An empty default database and a non-empty database

To test this further, I added another database from a backup file of AdventureWorks. This is what SSMS was showing:

Added AdventureWorksFromBackup_SSMS

I went back to Power BI Desktop and attempted to live connect as before, without specifying a database name. Power BI Desktop still generated the same error, even though AdventureWorks was fully browsable from SSMS.

This is the query that Power BI Desktop submitted to SSAS, still just one query and different from the query in case 1:

Select [CUBE_NAME] from $System.MDSCHEMA_CUBES where [CUBE_SOURCE] = 1

Query Profiler With One Empty Scripted database and AdventureWorks added

Notice how this query still ran against the default database, “test 3”, the first database created on the server.

This is the exact case I was running into in the beginning. I had several databases with models in them that were fully browsable from SSMS and Excel. There was one empty database that happened to be the default database, and it was causing the error because the query ran against it and returned no results!

When the above query returns no values, Power BI Desktop shows the error message. This seems like a bug that will hopefully get fixed at some point.

Obviously, one way to get around this error is to either specify the database name you are trying to connect to or to remove the empty database. In the latter case, another database becomes the default database. You can also delete the empty database and script it again; it will look exactly the same as before, except that it will no longer be the default database. I tried this to make sure the problem really was a default-database issue and not simply the existence of an empty database.
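If you go the removal route, a TMSL script like the following deletes the empty database (a sketch using the database name from earlier); you can then re-create it with the same create script shown before:

{
  "delete": {
    "object": {
      "database": "test 3"
    }
  }
}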

4) A non-empty default database and an empty database

I deleted “test 3” and created it again with the same script as before. This time it was not the default database anymore. SSMS looked exactly as before:

Added AdventureWorksFromBackup_SSMS

I went to Power BI Desktop and tried to live connect. This time it worked! This proved that it was not the empty database causing the issue because of its emptiness. The problem was that it was empty and the default database at the same time. This is what the traces showed this time:

One Non Empty Database As Default And One Empty Database Trace Limited

Power BI Desktop submitted 3 DMX queries, 2 of which are identical and ran against Adventure Works (the default database):

Select [CUBE_NAME] from $System.MDSCHEMA_CUBES where [CUBE_SOURCE] = 1

This was followed by this query:

Select [CATALOG_NAME], [DESCRIPTION], [DATE_MODIFIED], [COMPATIBILITY_LEVEL]
from $System.DBSCHEMA_CATALOGS 

This is exactly the same query that ran in case 1 (the empty server). In case 2, this query did not run because the first query had returned no values and Power BI Desktop stopped and displayed the error message.

5) AutoSetDefaultInitialCatalog

How does Power BI Desktop know which database is the default database? As mentioned before, the DMX queries were all submitted in the context of the default database, but how? It is not that Power BI Desktop knows which database is the default database; it creates a connection to SSAS without specifying a database name (since I did not explicitly specify one) and lands on the default database.

This behavior of creating a connection and landing on the default database is similar to what happens in SSMS when you connect to a SSAS server: you always get connected to the default database, and you can change it by picking a different database from a drop-down box. This behavior is controlled by a SSAS setting called AutoSetDefaultInitialCatalog.

AutoSetDefaultInitialCatalog is one of the Advanced server properties, and it is set to True by default.

SSAS Server Properties AutoSetDefaultInitialCatalog

Warning!

Make sure you understand the implications of changing this setting, especially on a production database, before you change it!

When this setting is set to True and you connect to the server using SSMS, you get connected to the default database right away. You can see the default database in the drop-down to the left of the Execute query button.

New Query With AutoDetectDefaultDatabase

On the other hand, when AutoSetDefaultInitialCatalog is set to False and you connect to the server using SSMS, you get a query window that is not connected to any database. You can see this in the drop-down to the left of the Execute query button.

SSAS Server Properties AutoSetDefaultInitialCatalog Set To False
New Query With AutoDetectDefaultDatabase Set To False

With this setting set to False, I repeated cases 2 and 3. This time everything worked, even with the empty default database.

AutoDetectDefaultInitialCatalog Set To False One Empty Database PBI Desktop

How did this work? Power BI Desktop only submitted the following query:

Select [CATALOG_NAME], [DESCRIPTION], [DATE_MODIFIED], [COMPATIBILITY_LEVEL]
from $System.DBSCHEMA_CATALOGS 
Query Profiler With One Empty Scripted database and AdventureWorks added and AutoDetectDefaultCatalogSetToFalse

Note how this query ran against the server and not against any particular database. In this case, Power BI Desktop was able to show the list of all databases, even though the first database created (“test 3”) was still empty.

Please remember that changing this setting can have other implications, so I don’t recommend changing it permanently; I only did this for testing. The problem you may run into if you set it to False is that it can break any code that depends on the connection landing on the default database. I have seen this in the past at one client, where custom .NET code was getting metadata from a SSAS database without realizing that it was only getting results because its database happened to be the default database. Of course, this was not best practice; the point is that you should change a production server’s settings very carefully and be ready to work through unforeseen consequences.

Summary

Power BI Desktop, as of September 2019, has a small bug where you get an error if you try to live connect to a SSAS server and explore the databases on it without specifying a particular database.

The Server you’re trying to connect to doesn’t have any models or you don’t have permissions to access them.

The error happens when the default database is empty on SSAS Tabular, or when the default database is unprocessed on SSAS Multidimensional. You can get around this by removing and re-adding the empty/unprocessed database so that it is no longer the default database (assuming you have other good databases on that server). The other workaround is to set AutoSetDefaultInitialCatalog to False, but this can have other unforeseen implications for custom code connecting to the server, as explained previously.

Note: You can get this error under other, valid circumstances as well. For example, if you truly do not have access to a server, the error is valid.

Why attend PASS Summit?

I have attended PASS Summit for the past three years in a row and I am planning to attend this year as well. I often get asked if it is worth attending PASS Summit. The schedule for this year has been released. The sessions cover a wide range of topics at different levels (see the schedule here), and I have always learned from them in the past, but some may think that the material presented will eventually make it to the internet or local conferences in some shape or form, so they question the value of attending PASS Summit.

First, the main value of attending PASS Summit for me has been the networking opportunities it brings and the connections made with members of the SQL and data community. Over the years, I have been fortunate to get to know many of the speakers, MVPs, members of Microsoft product teams, and many data professionals from all over the world. If I am really stuck solving a technical problem or if I just need overall architecture advice, I know who I can reach out to for guidance in each area. This by itself is extremely valuable to me.

Second, there is something to be said about being able to focus on learning for several days in a row without being interrupted by work or life.

Finally, here are some of the things I like about attending PASS Summit:

  • There are so many in-depth technical sessions to choose from. You can learn everything from the fundamentals to in-depth topics on the newest advances in SQL Server, Business Intelligence, Azure cloud services, Power BI, and more. One concern I hear from some people who have not attended in the last couple of years is that they worry they won’t be able to see the sessions they are interested in because rooms might be full. This has not happened to me in the last 3 years I have attended. I have always been able to attend any session I was interested in. PASS has done a great job of managing room sizes; they sometimes move rooms during the conference to accommodate more interest, based on how many people sign up for sessions on their personal schedules.
  • You will have direct access to speakers and Microsoft MVPs. There is a great community of speakers and field specialists who love to share their experiences and provide guidance.
  • Talk to Microsoft Product teams. These are the engineers behind different products. You can take your technical questions or even a mockup of a problem to them and get answers on the spot. I was able to mock up a DAX problem I was working on for a client using the AdventureWorks sample database and get answers from the Power BI/SSAS team on site. Make sure you leave room between sessions to visit the product teams.
  • You can network and learn from people in your field in the Community Zone, at breakfast and lunch tables, and at after-conference events. Here are some of the great networking events from previous Summits:

1) Birds of a Feather lunch tables: You can join speakers, MVPs, community members, and SQL/data professionals in your area of interest. Often you can hear from other data professionals about some of the real-life challenges they face as they implement new technologies and the solutions they have come up with.

2) Women in Technology lunch: Provides a wealth of information about the challenges that women in technology face and the practices that have made a difference in closing the gender gap in the workplace. This is a very educational presentation, not just for women but for anyone interested in learning about increasing diversity in the workplace and how to support women at work.

3) Community events such as Game Night. It is usually held at the conference center at the end of one of the conference days. If you are into board games, this is a fun event. There has also been a SQL Run group, Photography group, hiking group, and many more. These have previously been posted on PASS’s website as it gets closer to the event.

4) Vendor dinners: Many vendors have dinner events. These have been listed on PASS’s web site in previous years. You must register for these to get a ticket. These have been free to attend in the past. Of course, by signing up you are agreeing to share your email with them so they can follow up about their products. Register early as some do fill up.

5) Welcome reception and vendors reception: Held at the conference center where you can network and have some light food.

  • Volunteer and make a ton of connections while having fun. In your PASS profile, you can indicate that you are interested in volunteering. There are plenty of opportunities available from holding a Welcome sign to hosting one of the Birds of a Feather tables. A fun perk of volunteering is that you get to attend a volunteer only reception.
  • Speaker Idol: Community speakers who have not spoken at PASS Summit before get to compete for the coveted prize of a guaranteed session at next year’s PASS Summit. It is a very fun event to watch these speakers compete by presenting short sessions. Judges can be “judgy” at times (think Simon from American Idol), but you will learn a lot about making great presentations, even if you are not a public speaker and only intend to present at work. See this post from my experience competing in 2016.

I hope that you can attend PASS Summit. You can register from here: https://www.pass.org/summit/2019/RegisterNow.asp

Finally follow me on Twitter at @shbWatson for more news about PASS Summit 2019.
