Use Explain Data as an incremental, jumping-off point for further exploration of your data. The possible explanations that it generates help you to see the different values that make up or relate to an analyzed mark in a view. It can tell you about the characteristics of the data points in the data source, and how the data might be related (correlations) using statistical modeling. These explanations give you another tool for inspecting your data and finding interesting clues about what to explore next.
What Explain Data is (and isn't)

Explain Data is:
Explain Data is not:
When running Explain Data on marks, keep the following points in mind:
Use granular data that can be aggregated. This feature is designed explicitly for the analysis of aggregated data. This means that your data must be granular, but the marks that you select for Explain Data must be aggregated or summarized at a higher level of detail. Explain Data can't be run on disaggregated marks (row-level data) at the most granular level of detail. For more information about aggregation, see Data Aggregation in Tableau.
Consider the shape, size, and cardinality of your data. While Explain Data can be used with smaller data sets, it requires data that is sufficiently wide and contains enough marks (granularity) to be able to create a model.
Don't assume causality. Correlation is not causation. Explanations are based on models of the data, but are not causal explanations.
A correlation means that a relationship exists between some data variables, say A and B. You can't tell just from seeing that relationship in the data that A is causing B, or B is causing A, or if something more complicated is actually going on. The data patterns are exactly the same in each of those cases and an algorithm can't tell the difference between each case. Just because two variables seem to change together doesn't necessarily mean that one causes the other to change. A third factor could be causing them both to change, or it may be a coincidence and there might not be any causal relationship at all.
However, you might have outside knowledge that is not in the data that helps you to identify what's going on. A common type of outside knowledge would be a situation where the data was gathered in an experiment. If you know that B was chosen by flipping a coin, any consistent pattern of difference in A (that isn't just random noise) must be caused by B. For a longer, more in-depth description of these concepts, see the article Causal inference in economics and marketing by Hal Varian.
Explain Data runs a statistical analysis on a dashboard or sheet to find marks that are outliers, or specifically on a mark you select. The analysis also considers possibly related data points from the data source that aren't represented in the current view.
Explain Data first predicts the value of a mark using only the data that is present in the visualization. Next, data that is in the data source (but not in the current view) is considered and added to the model. The model determines the range of predicted mark values, which is within one standard deviation of the predicted value.
What is an expected range?

The expected value for a mark is the median value in the expected range of values in the underlying data in your viz. The expected range is the range of values between the 15th and 85th percentile that the statistical model predicts for the analyzed mark. Tableau determines the expected range each time it runs a statistical analysis on a selected mark.
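As a rough illustration of the percentile bands described above (the predicted values below are hypothetical; Tableau does not expose its model), the expected range is simply the band between the 15th and 85th percentiles of the model's predicted values, with the expected value at the median:

```python
from statistics import quantiles

# Hypothetical predicted values for the analyzed mark from a model.
predicted = [88, 92, 95, 97, 99, 101, 103, 105, 108, 114]

# quantiles(..., n=100) returns the 99 percentile cut points.
pct = quantiles(predicted, n=100)
low, high = pct[14], pct[84]  # 15th and 85th percentiles: the expected range
expected_value = pct[49]      # the median (50th percentile)
```

A mark whose actual aggregated value falls outside `[low, high]` would be flagged as higher or lower than expected.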
Possible explanations are evaluated on their explanatory power using statistical modeling. For each explanation, Tableau compares the expected value with the actual value.
Higher than expected / Lower than expected
If an expected value summary says the mark is lower than expected or higher than expected, it means the aggregated mark value is outside the range of values that a statistical model is predicting for the mark. If an expected value summary says the mark is slightly lower or slightly higher than expected, or within the range of natural variation, it means the aggregated mark value is within the range of predicted mark values, but is lower or higher than the median.

Expected value
If a mark has an expected value, it means its value falls within the expected range of values that a statistical model is predicting for the mark.

Random variation
When the analyzed mark has a low number of records, there may not be enough data available for Explain Data to form a statistically significant explanation. If the mark's value is outside the expected range, Explain Data can't determine whether this unexpected value is being caused by random variation or by a meaningful difference in the underlying records.

No explanation
When the analyzed mark value is outside of the expected range and it does not fit a statistical model used for Explain Data, no explanations are generated.

Models used for analysis
Explain Data builds models of the data in a view to predict the value of a mark and then determines whether a mark is higher or lower than expected given the model. Next, it considers additional information, like adding additional columns from the data source to the view, or flagging record-level outliers, as potential explanations. For each potential explanation, Explain Data fits a new model and evaluates how unexpected the mark is given the new information. Explanations are scored by trading off complexity (how much information is added from the data source) against the amount of variability that needs to be explained. Better explanations are simpler than the variation they explain.
Explanation types and how they are evaluated

Extreme values
Extreme values are aggregated marks that are outliers, based on a model of the visualized marks. The selected mark is considered to contain an extreme value if a record value is in the tails of the distribution of the expected values for the data.
An extreme value is determined by comparing the aggregate mark with and without the extreme value. If the mark becomes less surprising by removing a value, then it receives a higher score.
When a mark has extreme values, it doesn't automatically mean it has outliers, or that you should exclude those records from the view. That choice is up to you depending on your analysis. The explanation is simply pointing out an interesting extreme value in the mark. For example, it could reveal a mistyped value in a record where a banana cost 10 dollars instead of 10 cents. Or, it could reveal that a particular sales person had a great quarter.
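The with/without comparison described above can be sketched as follows. This is a simplified toy illustration, not Tableau's actual model: the record values are hypothetical, and "surprise" is approximated here as the distance of the mark's sum from the other marks' sums, in standard deviations:

```python
from statistics import mean, stdev

def surprise(value, others):
    """How many standard deviations `value` sits from the mean of `others`."""
    return abs(value - mean(others)) / stdev(others)

def extreme_value_score(records, other_mark_sums, candidate_index):
    """Score a candidate record by how much its removal makes the
    mark's aggregate sum less surprising. Higher score = removing the
    value makes the mark look more like its peers."""
    with_value = surprise(sum(records), other_mark_sums)
    reduced = [v for i, v in enumerate(records) if i != candidate_index]
    without_value = surprise(sum(reduced), other_mark_sums)
    return with_value - without_value

# A mark whose sum is inflated by one mistyped record (10.00 vs 0.10).
records = [0.12, 0.10, 10.00, 0.11]
other_marks = [0.31, 0.29, 0.35, 0.33, 0.30]
score = extreme_value_score(records, other_marks, candidate_index=2)
```

Here the mistyped banana price dominates the mark's sum, so removing it yields a large positive score, mirroring how such a record would surface as an extreme-value explanation.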
Number of records
The number of records explanation models the aggregate sum in terms of the aggregate count; the average value of records explanation models it in terms of the aggregate average. The better the model explains the sum, the higher the score.
This explanation describes whether the sum is interesting because the count is high or low, or because the average is high or low.
Average value of the mark
This type of explanation is used for aggregate marks that are sums. It explains whether the mark is consistent with the other marks in terms of its aggregate count or average, noting the relation SUM(X) = COUNT(X) * AVG(X).
This explanation describes whether the sum is interesting because the count is high or low, or because the average is high or low.
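The identity SUM(X) = COUNT(X) * AVG(X) is what lets these two explanation types split an unusual sum into a count effect and an average effect. A minimal sketch with hypothetical marks (region names and values are made up for illustration):

```python
# SUM(X) = COUNT(X) * AVG(X): the same high sum can come from many
# records, from high values per record, or both.
# Hypothetical marks as (record count, average record value).
marks = {
    "West":  (200, 50.0),   # high sum driven by a high count
    "East":  (40, 250.0),   # same sum driven by a high average
    "South": (40, 50.0),    # baseline
}

for region, (count, avg) in marks.items():
    print(region, count * avg)
```

West and East produce identical sums for very different reasons; Explain Data's count and average explanations distinguish exactly these two cases.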
Contributing Dimensions
This explanation models the target measure of the analyzed mark in terms of the breakdown among categories of the unvisualized dimension. The analysis balances the complexity of the model with how well the mark is explained.
An unvisualized dimension is a dimension that exists in the data source, but isn't currently being used in the view. This type of explanation is used for sums, counts and averages.
The model for unvisualized dimensions is created by splitting out marks according to the categorical values of the explaining column, and then building a model with the value that includes all of the data points in the source visualization. For each row, the model attempts to recover each of the individual components that made each mark. The analysis indicates whether the model predicts the mark better when components corresponding to the unvisualized dimension are modeled and then added up, versus using a model where the values of the unvisualized dimension are not known.
Aggregate dimension explanations explore how well mark values can be explained without any conditioning. Then, the model conditions on values for each column that is a potential explanation. Conditioning on the distribution of an explanatory column should result in a better prediction.
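The breakdown idea above can be sketched with a toy example (hypothetical rows, not Tableau's model): splitting each mark's value across the categories of an unvisualized dimension shows which category drives a mark's deviation from what the other marks suggest.

```python
# Rows of (mark, category, value); "category" is an unvisualized
# dimension that exists in the data source but not in the view.
rows = [
    ("A", "web", 10), ("A", "store", 30),
    ("B", "web", 12), ("B", "store", 28),
    ("C", "web", 11), ("C", "store", 90),  # C's sum is unusual
]

def cell(mark, category):
    """Sum of the records for one mark within one category."""
    return sum(v for m, c, v in rows if m == mark and c == category)

# For the analyzed mark C, compare each category component against the
# mean of the same component across the other marks.
others = ["A", "B"]
deviations = {}
for category in ["web", "store"]:
    expected = sum(cell(m, category) for m in others) / len(others)
    deviations[category] = cell("C", category) - expected
```

The "web" component of C matches the other marks while the "store" component carries the entire excess, which is the kind of per-category signal a contributing-dimension explanation surfaces.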
Contributing Measures
This explanation models the mark in terms of this unvisualized measure, aggregated to its mean across the visualized dimensions. An unvisualized measure is a measure that exists in the data source, but isn't currently being used in the view.
A Contributing Measures explanation can reveal a linear or quadratic relationship between the unvisualized measure and the target measure.
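As a rough sketch of detecting such a relationship (the measures and values below are hypothetical, and the simple least-squares fit here stands in for whatever model Tableau actually uses):

```python
# Toy check for a linear relationship between an unvisualized measure x
# (say, average discount) and the target measure y (say, sales),
# using an ordinary least-squares line fit.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# R^2: the share of variance in y that the linear model explains.
ss_tot = sum((y - mean_y) ** 2 for y in ys)
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / ss_tot
```

A high R^2 for the linear (or, with a squared term, quadratic) fit is the kind of evidence that would make the unvisualized measure a candidate contributing-measure explanation.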