April 10, 2019
What Do Obscure Soviet Math Papers Have to do with IoT Security?
Adapted from unpublished research conducted in Dr. Xiaojiang Du's Security and Networking (SAN) Lab at Temple University, between August and December of 2018. Given the unpublished state of this research, the following should be considered more opinion than fact: none of the following research has been peer-reviewed as of this writing.
I was watching TV the other day and they had on this commercial for Ring doorbells: two of the dumbest criminals in the world were attempting to rob a house in broad daylight for some reason. Oh but they can't because the house has a Ring. Rats! Foiled again! That pesky marvelous little whizkid internet doorbell transmitted real time voice and picture data right back to the trendy homeowner, alerting him to the not-quite-Anglin morons trying to case his place. A stern yellin' by the remote homeowner sets them straight and they scurry off.
But who else is that information being transmitted to, do you think?
By their nature, IoT devices collect a lot of information about their users. All of that data gets packaged up and phoned home to the company that built the device. Shocking, right? Yeah, not really. We live in the post-privacy age now. Nothing to hide? Doesn't matter. The big, bad, pale, slightly awkward tech weirdos from California are very, very interested in knowing ... what time you put your trousers on every morning?
In academic circles, the response to this situation is a growing body of research into IoT device privacy. I had the opportunity to work with such a research group this past autumn. Our given task was:
"Implement [...] existing data mining algorithms [...] to perform activity recognition as well as learn the living habits of homeowners."
"Well wait a minute!" you may be wondering, "if the goal of the research is privacy preservation, then why are you trying to build your own surveillance apparatus?" The answer is simple: know thy enemy! If you are going to study privacy-preserving schemes, then you are going to need a benchmark to test it against, right?
What is Normal, Anyway?
However, what seems like a simple task at first - "learn the daily living habits of IoT device owners" - quickly becomes muddled once you go down the rabbit hole. Developing ML models to learn the daily living habits of homeowners via their IoT devices is a challenge because we lack a clearly defined goal: what exactly are we trying to find out?
Think of it this way: in a face recognition system, we are looking for human faces to compare. Our goal is very clearly defined. It is easy to train a model to look for a face: we just give it lots of pictures of human faces to look at!
But what is a living habit, exactly? What are we expecting people to do that is normal? What is defined as not normal? Is normal the same thing for all people? I have a feeling it isn't.
How are we going to train a model to look for something when we can't even describe what, specifically, we are looking for?
This predicament of not knowing exactly what we are looking for means we can't choose from the vast families of ML models that require training data. Pity.
But that makes sense; our goals are fuzzy. We are interested in looking for events that are interesting to us, but without any regard as to what those interesting events specifically are.
We need models that generate quantifiable "interesting" points without the need for any prerequisite training data: points that are interesting based on some kind of time-frequency criteria, or maybe even something else entirely...
Compression Algos as ML Models?
An IoT device transmits information on a periodic basis to a cloud-based monitoring application. This data could be the ambient temperature of a space, an indication of movement from a motion sensor, or any kind of signal generated by any kind of sensor. That data is easily used to predict the daily habits of the device's owner with the simplest of analysis. Almost too easy.
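As a toy illustration of that "simplest of analysis" (the readings and bucket boundaries below are invented, not from any real device), a thermostat's numeric stream can be collapsed into a string of symbols that makes daily rhythms legible:

```python
# Hypothetical example: discretize a stream of thermostat readings into
# symbols. The readings and bucket boundaries are made up for illustration.

def discretize(readings, buckets):
    """Map each numeric reading to a single-character symbol by bucket."""
    symbols = []
    for r in readings:
        for i, upper in enumerate(buckets):
            if r < upper:
                symbols.append(chr(ord('a') + i))
                break
        else:
            # Reading exceeded every bucket boundary: use the top band.
            symbols.append(chr(ord('a') + len(buckets)))
    return ''.join(symbols)

# One reading per hour: cool overnight, warm during the day.
readings = [16, 16, 17, 18, 21, 22, 22, 21, 18, 17, 16, 16]
print(discretize(readings, buckets=[17, 20]))  # → aabbccccbbaa
```

The resulting string already telegraphs a daily routine, and it is exactly the kind of symbol stream that the compression-based models discussed below can digest.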
But one issue you may find in this problem space is over-relying on models that use time as their primary domain. We studied well-known clustering-type models during the course of our research, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
With models like these, you find yourself falling into the trap of looking for x over time, limiting your predictive capabilities to, "Oh I bet this user is gonna do an action at this time of day!"
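To make that trap concrete, here is a toy, DBSCAN-flavored density pass over one-dimensional event timestamps. This is a simplified sketch, not the actual DBSCAN implementation from the research; `eps`, `min_pts`, and the event times are all invented:

```python
# A toy 1-D density clustering pass in the spirit of DBSCAN, applied to
# event timestamps (hours of the day). Parameters are illustrative only.

def cluster_1d(points, eps, min_pts):
    """Group sorted points into runs where consecutive neighbors lie
    within eps of each other; runs smaller than min_pts are noise."""
    clusters, current = [], []
    for p in sorted(points):
        if current and p - current[-1] > eps:
            if len(current) >= min_pts:
                clusters.append(current)
            current = []
        current.append(p)
    if len(current) >= min_pts:
        clusters.append(current)
    return clusters

# Motion events around a morning and an evening routine, plus one outlier.
events = [6.9, 7.0, 7.2, 7.3, 13.0, 18.5, 18.6, 18.9]
print(cluster_1d(events, eps=0.5, min_pts=2))
# → [[6.9, 7.0, 7.2, 7.3], [18.5, 18.6, 18.9]]
```

Notice what the output tells you: "this user does something around 7 AM and around 6:30 PM." Useful, but the prediction is welded to the clock.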
Perhaps that is not what we are truly interested in. Maybe we are interested only in identifying segments of data that are interesting, without giving any specific regard to the time or the frequency at which they occur. We care about what it is, not when it happens.
Contemporary IoT privacy research is going in an interesting direction: we are now beginning to see compression algorithms being used in an off-label manner as a way to gain semantic information from IoT sensor data streams.
For example, the well known and widely used Lempel-Ziv-Welch (LZW) compression algorithm works by generating a non-repeating dictionary of phrases, constructed as the algorithm processes the input data stream. If, say, we were to feed IoT sensor data into an LZW program, could this dictionary surface interesting segments of the sensor stream that might yield some useful insight?
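A minimal sketch of that dictionary-building behavior (the input string and alphabet are invented; a real LZW encoder would also emit output codes, which is omitted here):

```python
# Minimal LZW dictionary builder over a symbolized sensor stream.
# Returns the phrases learned beyond the initial alphabet; the input
# string is invented for illustration.

def lzw_phrases(stream, alphabet):
    """Run the LZW dictionary pass over `stream`, returning new phrases."""
    dictionary = set(alphabet)
    learned = []
    w = ""
    for c in stream:
        if w + c in dictionary:
            w += c  # keep extending the current known phrase
        else:
            dictionary.add(w + c)  # learn the new, longer phrase
            learned.append(w + c)
            w = c
    return learned

# Repeated "abc" patterns mimic a recurring daily routine.
print(lzw_phrases("abcabcabcabc", alphabet="abc"))
# → ['ab', 'bc', 'ca', 'abc', 'cab', 'bca']
```

The learned phrases grow longer as the same routine repeats, which is exactly why the dictionary is a tempting place to look for recurring behavior.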
Well, not quite. There is a caveat: LZW in its original form is a lossless compression algorithm. It cannot generalize over the subtle variations that occur in the data stream, variations that separate what would otherwise be two essentially identical events. We would want a way to filter out the inevitable environmental jitter from the input.
The Levenshtein Distance Between Words
Levenshtein edit distance is a pretty neat mathematical concept. Originally appearing in an issue of Soviet Physics Doklady, it is a non-negative number describing how much work you would need to do to change one string into another. A change is defined as one of three operations: a substitution, where one character is changed to another; an insertion, where a character is added; or a deletion, where a character is removed.
Two strings that are exactly the same have a Levenshtein distance of zero. The largest possible Levenshtein distance between any two arbitrary strings is equal to the length of the longer string.
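Computing the distance is a textbook dynamic-programming exercise; the sketch below is the standard Wagner-Fischer recurrence, nothing specific to our study:

```python
# Standard Wagner-Fischer dynamic programming for Levenshtein distance,
# keeping only one previous row of the DP table in memory.

def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
print(levenshtein("abc", "abc"))         # → 0
```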
The original application of Levenshtein distance was for the construction of error-correcting codes. But Levenshtein distance is also a potential avenue for introducing some "lossyness" into what might normally be a lossless compression algorithm. Levenshtein distance can be used as a selective filter for determining if a new word should be added to the LZW dictionary, or discarded due to being considered too similar to an already existing word in the database.
In our study, a new candidate word for insertion into the LZW dictionary has its Levenshtein distance calculated against every word already stored in the dictionary. If any of those distances falls below some threshold value L, the candidate word is not added to the dictionary.
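A sketch of that gating rule under illustrative assumptions (the words, threshold, and helper below are hypothetical stand-ins, not the lab's actual code):

```python
# Sketch of Levenshtein-gated dictionary insertion: a candidate phrase is
# kept only if no existing phrase lies within the threshold distance L.
# The example words and threshold are invented for illustration.

def levenshtein(a, b):
    """Wagner-Fischer edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def maybe_insert(dictionary, candidate, threshold):
    """Add candidate unless an existing word is within `threshold` of it."""
    if any(levenshtein(candidate, w) < threshold for w in dictionary):
        return False  # too similar to a known word: treat as jitter
    dictionary.append(candidate)
    return True

words = ["abba", "abab"]
print(maybe_insert(words, "abbb", threshold=2))  # one edit from "abba": rejected
print(maybe_insert(words, "ccca", threshold=2))  # distinct enough: added
```

The rejected candidate is the "lossy" part: two near-identical events now map to the same dictionary word instead of cluttering the dictionary with jittery variants.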
Using this Levenshtein-LZW filtering scheme has an immediate impact on the total count of words in the LZW dictionary. As L increases, the number of words in the dictionary decreases at a predictable rate. As L goes to infinity, we can predict that the number of words in the dictionary will ultimately approach 1: eventually, every possible candidate word falls within distance L of the very first word that was added to the dictionary.
Studying privacy leakage in IoT systems requires defining your own benchmark to test against. Do you think Ring is going to tell you what exactly they're doing with all of that data they collect? If you get them to tell you, then please give me a call.
We can safely assume that there is a strong economic incentive for commercial IoT companies to keep their models under close wraps. So we need to take an educated guess as to how and what they are doing.
We can postulate that the ML models used commercially probably do not use any pre-collected training data. They learn about their target users solely by relying on patterns they have already seen in the past. These models are probably looking for points in the data that are "interesting" - points that represent some kind of unique marker of behavior that personally identifies the user.
The goals of studying privacy leakage in IoT systems, therefore, should focus on reducing the number of these interesting segments that the ML model can detect. A perfect theoretical system would eliminate all of them.
But how do you do that without harming the functionality of the device? People are still going to want their smart thermostat to work. What good is a privacy-preserving scheme that obfuscates all of the data to guarantee privacy, but also causes the device to stop working correctly?
What good is a smart thermostat that fails to trip on the furnace at the right time, because its privacy-preserving model refuses to send any sensor data back to the IoT company? How do you explain away that little bug to the user?
We have a long way to go and a lot of questions left to answer in this problem space.