Data types, assumptions and how a spacecraft crashed
In maths and physics classes in school I was taught to always keep track of the units when doing calculations. If you do some calculation and the result is just a number - that is not good enough. How fast is a train travelling? 80? 80 what? 80 bananas? If you keep track of the units you can do all kinds of calculations and have an easier time figuring out if it is 80 meters, 80 meters per second, 80 km/h, or 80 miles per hour. Throughout the calculations the units are always there. And at the end your result will include the unit.
However, in programming, developers do not always keep track of units so rigorously. In this article I will go through some advantages of doing so.
Missing and assumed information versus explicit data
If you have “10.00 US Dollars” written down, both the number “10.00” and the “US Dollars” part are data. Data can be very useful to both computers and humans. If some data is needed but not present, humans or computer programs can make assumptions or fill in made-up data based on a default or an algorithm. However, if this made-up data or assumption is incorrect, it can lead to problems.
One way of representing a price of 10.00 US Dollars is simply using an integer or a decimal number:
-
price: #Decimal<10.00>
The information about which currency the decimal represents is not present in the data itself. To find out what currency it is, you have to look at the context outside of the data shown here.
-
A way to improve this a bit is to add the unit to the name of a database column or variable name. So instead of calling the column simply “price”, we can call it “price_usd”:
price_usd: #Decimal<10.00>
This is better than nothing. But if that value is taken out of context and passed to a function as just a decimal number, the information about the currency is not there anymore.
-
Another way is to represent this with both the raw number value and a separate part representing the currency code:
price_amount: #Decimal<10.00>
price_currency: "USD"
-
To take it further we can have a type representing money:
price: %Money{amount: #Decimal<10.00>, currency_code: "USD"}
This money-specific custom type can be combined with the previous structure: the two fields are read from a database and put into a “Money” struct. Compared to the first example of just having a decimal, the code now has a type that tells us both that it is money - not just any Decimal - and what currency it is.
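To make this concrete, here is a minimal sketch in Python of such a type being built from the two stored fields. The names `Money` and `money_from_row` are my own illustration and not taken from any particular library, and the row is assumed to be a plain dictionary with the two column names used above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    """An amount that always carries its currency (hypothetical type)."""
    amount: Decimal
    currency_code: str


def money_from_row(row: dict) -> Money:
    # Combine the two stored fields into one value, so that the currency
    # travels together with the amount from here on.
    return Money(Decimal(row["price_amount"]), row["price_currency"])


price = money_from_row({"price_amount": "10.00", "price_currency": "USD"})
print(price)
```

Once the value is a `Money`, a function that receives it cannot lose the currency the way it could with a bare decimal.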
Computer checking units vs. assumptions about someone having read something
If all of the money amounts in a system are handled with such a money type, this can prevent certain problems. For instance, if you try to add two amounts of money and they have different currencies, this must be taken into account. Adding 10 USD and 10 EUR is easy to do if they are just simple numbers. The result is 20. But 20 what? It doesn’t make sense. With a type that includes the “unit” (in this case the currency), functions can include checks that prevent mixing currencies incorrectly, or perhaps convert currencies before adding them together.
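A sketch of what such a check could look like, again in Python. The `Money` type and the `CurrencyMismatchError` name are assumptions for illustration. This version simply refuses to add different currencies; converting between them would be another valid design.

```python
from dataclasses import dataclass
from decimal import Decimal


class CurrencyMismatchError(ValueError):
    """Raised when two amounts in different currencies are combined."""


@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency_code: str

    def __add__(self, other):
        if not isinstance(other, Money):
            return NotImplemented
        if self.currency_code != other.currency_code:
            # Refuse to produce a meaningless "20 of something".
            raise CurrencyMismatchError(
                f"cannot add {self.currency_code} to {other.currency_code}"
            )
        return Money(self.amount + other.amount, self.currency_code)


usd_total = Money(Decimal("10.00"), "USD") + Money(Decimal("5.00"), "USD")

try:
    Money(Decimal("10.00"), "USD") + Money(Decimal("10.00"), "EUR")
    mixed_currencies_allowed = True
except CurrencyMismatchError:
    mixed_currencies_allowed = False
```

With plain numbers the second addition would quietly give 20. Here it fails loudly at the point where the mistake is made.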
Getting data into the system
Do not make up fake data based on loose assumptions. If you have a parsing function that turns strings into Money structs, don’t assume some default currency if a currency is missing. E.g. when parsing “$10.00 USD” it is fine to make that into a Money struct with 10.00 USD. But if a string only contains “10.00”, the information about the currency is not present in the data. It might be tempting to have a “default currency” and let a general parsing function assume that “10.00” means USD. This is dangerous. Instead, the parsing function should return an error saying that the currency code is missing. It is better to raise errors in the software than to get bad data into your system. An example of what not to do is Ruby’s Date.parse method, which will parse “23 dogs” as the 23rd of November 2018, or whatever the current month and year is (“3 dogs” results in an error though).
Parsing should not make too many assumptions or add fake data if some data is missing, like ignoring “dogs” and adding a year and month that were nowhere to be found in the string being “parsed”. In that case it is not just parsing. It is parsing and some complicated random data generation combined into one confusing method.
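Here is a rough sketch in Python of a strict money parser along these lines. The function name, the error class and the accepted input format (an optional “$”, a number, a space and a three-letter code) are all assumptions made for the example. A real parser would need to handle more formats.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency_code: str


class MissingCurrencyError(ValueError):
    """The amount could be read, but no currency code was present."""


# Accepts e.g. "$10.00 USD" or "10.00 USD" (assumed format for this sketch).
_WITH_CURRENCY = re.compile(r"^\$?(\d+(?:\.\d+)?)\s+([A-Z]{3})$")
_NUMBER_ONLY = re.compile(r"^\$?\d+(?:\.\d+)?$")


def parse_money(text: str) -> Money:
    text = text.strip()
    match = _WITH_CURRENCY.match(text)
    if match:
        return Money(Decimal(match.group(1)), match.group(2))
    if _NUMBER_ONLY.match(text):
        # The number is readable but the currency is absent: refuse to guess.
        raise MissingCurrencyError(f"no currency code in {text!r}")
    # Anything else (such as "23 dogs") is rejected, not partly ignored.
    raise ValueError(f"cannot parse {text!r} as money")
```

Calling `parse_money("$10.00 USD")` gives a `Money` value, while `parse_money("10.00")` raises `MissingCurrencyError` instead of inventing a currency.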
In some cases you have data coming in that does not include all the information you want. Imagine a CSV file with amounts that represent dollar amounts. The people who gave you this data may have told you in person or over the phone that the numbers represent USD amounts, even if the USD currency code is nowhere to be found in the CSV file. It would be good if the currency code was present in the CSV file. However, if it is not possible to convince the producer of the data to include the currency code, you can resort to adding the currency code just after parsing it.
If a programmer knows that the string in fact represents an amount of “USD” even though USD is not present in the data being parsed, the amount “10.00” can be parsed as a number. The programmer can then create a new Money struct from the parsed number and an explicit “USD”. This way the parsing function is simply parsing the data that is actually there (10.00). The information about the currency (USD) is then separately and explicitly defined in the code for that specific data source. So when someone later reads the code, they can see that this assumption is there and that this is where the currency part comes from, rather than from the parsed input.
The important thing to be clear about is that if you have “10.00” going in and a Money struct with “USD” coming out at the other end, you are not just parsing data. You are parsing and also adding data about currency based on an assumption.
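As a sketch of how that separation could look in Python: the number is parsed on its own, and the currency is added in a clearly named step that belongs to one data source. The “supplier feed” and all names here are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency_code: str


# Assumption, stated once and in the open: the producer of this
# (hypothetical) CSV feed told us that all amounts are US dollars.
SUPPLIER_FEED_CURRENCY = "USD"


def parse_amount(text: str) -> Decimal:
    # Parses only what is actually in the data: a number.
    return Decimal(text.strip())


def money_from_supplier_feed(cell: str) -> Money:
    # Two visible steps: parse the number, then add the assumed currency.
    return Money(parse_amount(cell), SUPPLIER_FEED_CURRENCY)


price = money_from_supplier_feed("10.00")
```

If the assumption ever turns out to be wrong, there is exactly one place in the code to look at, and it is not hidden inside a general parsing function.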
I think that in general it is best if data is not silently created by default. A money amount parsing function that defaults to a certain currency (USD) might seem innocent enough. It can be convenient in some cases. But convenience in one situation can be silent creation of bad data in another situation.
Getting data out of the system
Whether data is being passed around inside the boundaries of a system or to an entirely different system, keeping track of the units is useful in both cases. If the units are properly determined upon entry into the system and kept all along, they are right there and available when the data is exported or sent to other systems, for instance through a JSON HTTP API. So instead of an API having a field for a price containing just a decimal value, we can provide both the decimal and the currency together in the API. This means that other systems will also have this currency information. Just like we would want other people to provide this kind of information to our system, we can provide it to others, if we keep track of it.
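A small Python sketch of what such an API field could look like when serialized. The field names `amount` and `currency` are my own choice for the example and not a standard. Sending the amount as a string is one way to avoid floating point rounding in JSON.

```python
import json
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency_code: str


def money_to_json_fields(money: Money) -> dict:
    # The amount is sent as a string so that no precision is lost to
    # floating point; the currency code travels right next to it.
    return {"amount": str(money.amount), "currency": money.currency_code}


payload = json.dumps(
    {"price": money_to_json_fields(Money(Decimal("10.00"), "USD"))}
)
print(payload)
```

A consumer of this payload does not have to guess or be told over the phone which currency the price is in.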
Communication across systems
If everyone makes sure to keep track of units as soon as the data is either created in their system or read from another system, that is a good start. Then if everyone also includes the units when providing data to other systems, this in turn makes it easier for those systems to get the units right as data goes from one system to another.
Spacecraft destroyed by incorrect assumption of unit
The Mars Climate Orbiter was launched into space in 1998. In September 1999, before the mission was completed, the spacecraft burned up in the atmosphere of Mars.
Since then, many people have used this failure as an example of not following sound engineering practices, and have suggested different solutions that could have prevented the failure. I am not the first one to sit in an armchair and write about how it could have been done differently. But I will still use this as an example because it seems like a fun example to use. I used the same example for a talk in 2016 where I briefly touched on the same subject about being explicit about data.
The reason that the spacecraft burned up was that one part of the system was receiving numbers representing pound-force seconds, but using them as if they were newton-seconds. Imagine that you have an API that provides this information. Instead of simply using an integer or a float, the API could include a unit.
Imagine that the consumer of the API was expecting a value in newton-seconds tagged with a unit. So instead of just sending a decimal such as “2.345”, the producer would send “2.345 lbf s”. The consumer of the API would read and verify the unit. It would receive the unit as “lbf s” and verify that it is newton-seconds as expected… Hold on a minute. “lbf s” is not newton-seconds. The software on the consumer side would raise an error because “lbf s” would not match the expected unit. This error would be seen in tests before sending the system into production. Perhaps this would have prevented the mission from failing. Keeping the units around allows software to be written in an assertive way that makes sure the right units are sent to it.
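A minimal Python sketch of such an assertive consumer. It is purely illustrative and has nothing to do with the actual flight or ground software. The text format (“number, space, unit”), the unit strings and the function names are all assumptions. Whichever unit the consumer expects is passed in explicitly.

```python
from dataclasses import dataclass


class UnitMismatchError(ValueError):
    """The value arrived tagged with a unit other than the expected one."""


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str


def parse_quantity(text: str) -> Quantity:
    # Splits e.g. "2.345 N s" into the number and everything after it.
    number, _, unit = text.strip().partition(" ")
    if not unit:
        raise ValueError(f"no unit in {text!r}")
    return Quantity(float(number), unit.strip())


def require_unit(quantity: Quantity, expected_unit: str) -> float:
    # Hand out the bare number only after the unit has been checked.
    if quantity.unit != expected_unit:
        raise UnitMismatchError(
            f"expected {expected_unit!r}, got {quantity.unit!r}"
        )
    return quantity.value


impulse = require_unit(parse_quantity("2.345 N s"), "N s")

try:
    require_unit(parse_quantity("2.345 lbf s"), "N s")
    wrong_unit_accepted = True
except UnitMismatchError:
    wrong_unit_accepted = False
```

A bare “2.345” with no unit is rejected as well, so the consumer never has to guess what the number means.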
Using incorrect units
A unit is information. And wrong information is worse than a lack of information. Using wrong units is worse than not using any units. For instance, representing 10 meters as 10 feet is worse than just using the integer 10. A cake recipe that specifies 1 kg of salt instead of 1 teaspoon of salt is worse than just specifying “salt to taste”. Have you ever tasted a cake made with too much salt? I once stored both salt and sugar in unlabelled containers and used salt instead of sugar. Let’s just say that I would not recommend doing that.
Other kinds of units
Besides currency codes, these principles apply to other “units”, including lengths, temperatures, weights, time zone identifiers and UTC offsets in combination with datetimes, and more.
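As one illustration of the datetime case, Python’s standard library already behaves this way: a datetime without an offset cannot be subtracted from one that has an offset. This sketch uses fixed UTC offsets only. Real time zone handling is more involved.

```python
from datetime import datetime, timedelta, timezone

# A datetime with no offset is like a price with no currency.
naive = datetime(2018, 11, 23, 12, 0)

# The same wall-clock time, tagged with explicit fixed UTC offsets.
in_utc = datetime(2018, 11, 23, 12, 0, tzinfo=timezone.utc)
in_plus_one = datetime(2018, 11, 23, 12, 0, tzinfo=timezone(timedelta(hours=1)))

# With the offsets attached, the difference between the two is well defined.
gap = in_utc - in_plus_one

try:
    naive - in_utc
    mixing_allowed = True
except TypeError:
    # Python refuses to subtract offset-naive and offset-aware datetimes.
    mixing_allowed = False
```

The two tagged values look identical on a wall clock, but because each one carries its offset, the one hour between them is not lost.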
Communication and application interfaces
Writing computer programs involves communicating with both humans and computers. The source code should be readable by current programmers and future programmers, and at the same time be understood by computers. Data is no different from source code in that regard. A human can read some data and make better decisions and better use of the data if it is correct and sufficient. The same goes for computer systems.
Having good data can be useful to both programmers and computer systems. It could be the programmer that reads the code a month after writing it. Or other programmers that read it later. It could be a function in a library that can make use of the data. Or it can be code in other systems that read the data produced by your system. Good data helps, be it by preventing a baker from adding too much salt, by keeping a spacecraft from burning up, or by helping a programmer read code more easily and add features to it.
P.S. In a future post I will expand on these ideas and how they relate to the date and time types in Elixir.
If you liked this post you might want to follow me on Twitter for updates on new posts and more. Twitter handle: @laut