Data Warehouse and Information Asset Experts

Freeform Comments – Text Fact or Dimension?

By on January 20, 2011 in Tips, Tricks and Technology with 0 Comments
Freeform comment fields are challenging.  Should they be modeled in a text fact or in a separate dimension?  The topic has been addressed in several books as well as Kimball Group’s Design Tip #55, yet remains a persistent area of debate.  Rightfully so.  This entry presents a gray area example and some considerations in choosing a design.
The client has designed a transaction grain fact table based on a typical header/detail source.  Each transaction contains a detail level freeform comment field that the business line wishes to include on 50% of their standard reports.  They are not used analytically, but are deemed critical information on the identified reports.  An analysis finds:
  • Average Width of Comment – 70 bytes
  • Distinct Comments – 1 million
  • Stable Fact Rows – 2 million
Do the comments belong in their own dimension or on the fact table?
The confusion is understandable.  A 70 byte field on a fact table gives one pause.  A dimension record that approaches 1:1 with the fact grain isn’t a comfortable situation.  Is the dimension approach a good choice if you must do the join on 50% of all reports?
These are all valid considerations.  Look to the details of your specific situation to find the best design.
In our example, if all comments are converted to uppercase, and we properly cleanse to combine nonsensical and punctuation only comments, the number of distinct comments drop to about 400,000.  Still not order of magnitude relative to the fact table, but closer.
Is disk space a concern?  Placed on the fact table, we have (70 bytes * 2,000,000 rows) = 135 MB of space.  In a dimension with a number(6) 4 byte surrogate key, we are looking at (70 + 4 bytes) * 400,000 rows + (4 bytes * 2,000,000) = 35 MB.  From a space standpoint, the dimension looks better.
Speed of query?  It depends.  If the freeform comment is truly just along for the ride, eliminating the join in half the reports is a good option.  What if users decide to constrain on the freeform text?  There are some indexing techniques for text columns, but in a true freeform text field, these are probably irrelevant.  Users could constrain on any character pattern within the field.  The result – full table scan.
Which brings us around to environment.  Two million rows is a small fact table by today’s standards.  In a large organization with top of the line hardware, a full scan in nested loops may be of little concern.  But if this is the largest fact table in a small company running a single core machine, it’s a valid one.
Take this example and apply it to your specific situation.  Careful analysis of data cleansing, disk space, hardware, reporting requirements and future growth will lead you to a solid design decision.

Tags: ,

About the Author

About the Author:

John Frank is the Pricipal Consultant and Founder of DataFusion Technologies.

John holds a Master of Science in Computer Science from Rensselaer Polytechnic Institute and a Bachelor of Science in Mechanical Engineering from The Pennsylvania State University.

.

Subscribe

If you enjoyed this article, subscribe now to receive more just like it.

Post a Comment

Top