How to plan and build a data dictionary Why do you need a data dictionary? Before you can start analysing your data you need it to be useful and compatible. It is very easy for two people to understand the same field in a form in very different ways. In this example, both people have made sensible interpretations of what to record but it’s going to be very difficult to compare funds raised over the year if the dates are entered differently and people calculate the funds differently. Creating data dictionary is a good starting point to try and avoid these problems. It doesn’t necessarily have to involve any technical knowledge or new tools. Instead, you can focus on digging out the existing knowledge in your team or organisation and inevitably the conversations you have when doing this will bring up both existing problems and all sorts of useful information and ideas. Once you’ve put in the work to create a data dictionary, you can use it to avoid miscommunication and to ensure you have data you can combine, analyse and continue using in the future to deliver value. What is a data dictionary? A data dictionary means different things to different people but at its heart it works very much like a traditional dictionary. A dictionary tells you the meaning of a word, how you should use it, whether it’s still in use extra information about sources, examples and synonyms Take the example of the OED definition of the word “anatiferous”, a wonderfully specific and bizarre word. The dictionary gives us a definition “producing ducks or geese”, which tells us what it means. The OED also tells us that this is an adjective, which lets us know how we can use this word and which grammar rules to apply to it. We also find out, unsurprisingly, that the word is obsolete and no longer in common use, as well as the origins, examples of use and even a guide to pronunciation. Data dictionaries provide similar information, split generally into two categories: Logical data dictionary: this is about the business/organisation meaning of the data. When you say “Funds raised” and someone else says “Donations” do you mean the same things or not? This is mainly about the logic and ideas behind the terms you use. Physical data dictionary: this is all about how you store the data, whether that’s storing dates in a certain way – “21 Feb 18” or “02/21/18” – or whether you store donations down to the penny (£20.01), in pennies (2001p) or just round up to pounds (£20). For technical users, a data dictionary is often synonymous with a physical data dictionary and focuses on the key definitions a database table will need: name, data type, data length, mandatory/optional. But the most important thing is to make your data dictionary work for you. Maybe that means combining a bit of both physical and logical into one dictionary. You might have rapid changes and want to store whether data is still in use or not, or you might just want to focus on the names and definitions and leave the rest for later. How to start your data dictionary To start your data dictionary, you don’t need to worry about exactly what it will look like at the end, whether it’s physical or logical, or how to store it. The main thing is to start having conversations with the people in the know. Pick an area you want to know more about and start asking questions. Try to narrow down your first “data entity”, the thing you are storing data about. In our example it is an “Event” but this could be something like a customer, order, volunteer, or product. You want to get a definition of that data entity and also define its attributes; for an event these are things like date of event, location of event and event organiser. If you are finding it hard to create a definition, try one (even if you think it’s no good) as it’s much easier for people to point out what you need to change once they have something to work with. This does tend to get answers faster as it is hard for people to identify what is right but everyone is quick to point out if you’ve got it wrong! Try to get the people who know about this data entity or area into a room together, as discussion brings out lots of hidden assumptions. If there is existing documentation or standards, bring those along but be ready to ask people if they still apply and if they’re helpful. If you already have a database, you might be able to get a data dictionary generated by that to add to the discussion too. Questions to start asking are: Can you give an example of how this is used? What do we mean by this term? Do we all agree? What are we going to do with this data? How are we going to store it? If you look at the example of the event data from above, it turns out that different people would give different responses when asked to give an example of the event name. This depends on where their focus is: the marketing team might use the name from the posters, while digital use the name they put on Meetup and finance are happy to just use a short, generic name. Developing the data dictionary A mind-map or spider diagram is a good way to start capturing the variety of responses without being tempted to organise them into a final form. At this point you want to collect information and get all the conflicting and incompatible definitions down. Try not to be too tied to existing definitions as they might not be helping you achieve what you need to do with the data. Once you’ve gathered the information, try to group up different uses into helpful units. Key points to bear in mind here are: You can group things as long as they are the same type of data. You shouldn’t put dates, numbers and text all into one field as this limits your analysis. Don’t make lots of definitions if everyone could live with sharing one. We could have 5 different types of event name but that’s hard for people to work with. Consider the restrictions you can’t change. For example, there is an event name character limit on Meetup which all the other uses would fit within, so we could make that our limit too, so our even names will always be ok on Meetup. Don’t force uses into one definition. We decided that event names needed to be longer to suit SEO requirements, but that it’s handy to also have a short name for finance reports and to use on emails, which could also act as a unique ID for the event. Think about the future. If you want to report year after year, you need to make sure your dates include the year. Avoid forcing people to use terms they don’t like. You can define business aliases (synonyms) for terms so that different teams can continue using their own jargon but you know which official term they actually relate to. Once you’ve narrowed down the terms you want to use and got agreement about their definitions, start putting them into a table or list. This could be in a text document or spreadsheet, depending on how complicated your data dictionary is going to be. You may decide to separate the logical and physical parts at this point and you should start to see which bit of information are going to be required in your final data dictionary. The finished data dictionary The data dictionary may be made up of several parts; in this example we’ve got a very high-level part which defines the data entity and then a more detailed part with all the attributes of this data entity. They could be two separate worksheets in a spreadsheet or you might just want to work on the data entities in a text document to begin with. You could even do them on a card catalogue if you want this to be non-digital. Your data dictionary may only involve a few of those columns and might only have two or three attributes. You might want to consider including logical definitions about GDPR information, such as is that attribute is personal or sensitive data. What to do next with your data dictionary Once you have a data dictionary, you can use this to help define your spreadsheets and input forms for collecting data, to ensure they are completed in the right ways. This might be by adding comments onto form to remind people what to do or using data validation settings in spreadsheets. If you make sure data comes in according to the standards laid out in the data dictionary, you’ll find it much easier to work with in the future. You can use your data dictionary to help with GDPR compliance, as it is a good place to identify what data is stored where and how. If you also store in the data dictionary what data is personal or sensitive and the purposes you are collecting it for, this makes it much easier to ensure you continue to remain GDPR compliant and don’t have to delete data because no one remembers if it is compliant. Most importantly – keep it up to date. It’s easier to maintain this than have to start all over again next year. And you can expand it further by developing a data model or, if you are not using a database, using your data dictionary to help set up your database. If you want a hand designing your data model or you want to discuss the options for data warehousing then please get in touch with Helen Woodcock.