Experiment: pulling personal information out of thin air, at scale
Recently, the Washington State Department of Licensing closed for a number of days to upgrade its systems. The reasoning behind the upgrade is noble enough: to “better protect your information”. Both the information that ties vehicle license plates to registered owners and the personally identifiable information found on a driver’s license are obviously very sensitive, and should be protected accordingly. However, the system used to store and process the data is only one part of the puzzle, and this specific closure rekindled an idea for an experiment I’d wanted to try for a while. With extra time thanks to the long weekend, an AWS account and a police scanner, I had all the ingredients needed to attempt to pull sensitive DOL data out of thin air.
The police (in whichever state they operate) are obviously big consumers of DOL data. Traffic stops happen multiple times an hour, and each time they do, an officer will run a plate using their police radio. An operator in the dispatch center will then respond with registered owner information, including in some cases the address of said owner. The same is true of license information: in some cases the officer will ask the operator to check that a license is still valid, and that the person associated with said license doesn’t have any active warrants.
Trunked Radio Systems
The vast majority of police radio chatter in the USA happens over an unencrypted trunked radio system, meaning anyone with a scanner can hear the broadcast. This is well known. Most of the time, that radio traffic is heard for a fleeting second by the police and any scanner enthusiast who happens to be listening at the time. Depending on the location, it’s also possible that the scanner traffic is recorded, archived and made accessible via enthusiast websites. Those same websites also allow remote streaming of monitored scanner feeds. Therefore, the data included in these radio exchanges, including license plate lookups, has been available for years to anyone who cares to listen to it and gain insight. Nothing new here.
Of course, if you’re looking to build up a repository of sensitive vehicle licensing data, sitting and listening to scanner feeds and manually picking out sensitive information isn’t really a practical way of doing it: the time and effort required doesn’t really justify what the person listening in would get out of it. To put it another way, and to continue the overuse of an already overused tech term, it “doesn’t scale”.
This brings us to the crux of my experiment: automating and scaling the process of data collection from scanner traffic. Earlier this year, Amazon Web Services launched Amazon Transcribe, a service that creates text from speech. You can probably guess where this is going.
The setup was relatively simple: I set my Uniden police scanner to my local police dispatch feed, connected the audio out from the scanner to my computer, and used a piece of software called FreeSCAN (http://www.sixspotsoftware.com/products/freescan) to manage the captured audio MP3 files.
Scripting was then used to upload the MP3 files to Amazon’s Simple Storage Service (S3), where they were processed through Amazon Transcribe using the service’s API. The result: automated listening and processing of scanner traffic to produce JSON files containing text. After about an hour, I decided to review the content of the JSON files. As expected, there was a lot of garbage chatter in there, but it didn’t take long to find a read back from a plate, as shown below.
This may not look like much at first glance, but this is actually a phonetic read back of a name, specifically the owner of a vehicle, Catherine N. Sully. The police use a phonetic alphabet (A = Adam, C = Charles, etc.) when spelling to reduce errors, and it turns out this works well for Transcribe too! Just prior to this chunk of the message you could also see the license plate, which of course I’ve omitted for Catherine’s sake.
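To back up a step, the upload-and-transcribe stage described above could be sketched roughly as follows. This is not the script I ran, just a minimal illustration using boto3. The bucket name, the `recordings/` prefix, the region and every function name here are hypothetical, and the clients are passed in as arguments so the logic can be exercised without touching AWS.

```python
import re
import time
from pathlib import Path


def make_clients(region="us-west-2"):
    """Create real AWS clients. Needs boto3 and configured credentials.

    The region is an arbitrary placeholder.
    """
    import boto3  # imported lazily so the rest of the sketch runs without it
    return (boto3.client("s3", region_name=region),
            boto3.client("transcribe", region_name=region))


def job_name_for(path):
    # Transcribe job names may only contain letters, digits, '.', '_' and '-'.
    return re.sub(r"[^0-9A-Za-z._-]", "_", Path(path).stem)


def submit_recording(s3, transcribe, path, bucket):
    """Upload one captured MP3 to S3 and start a transcription job for it."""
    name = job_name_for(path)
    key = f"recordings/{name}.mp3"  # hypothetical key layout
    s3.upload_file(str(path), bucket, key)
    transcribe.start_transcription_job(
        TranscriptionJobName=name,  # must be unique within the account
        LanguageCode="en-US",
        MediaFormat="mp3",
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        OutputBucketName=bucket,  # JSON transcript is written back here
    )
    return name


def wait_for_job(transcribe, name, poll_seconds=10.0, max_polls=60):
    """Poll until the job completes; return the transcript file URI."""
    for _ in range(max_polls):
        job = transcribe.get_transcription_job(
            TranscriptionJobName=name)["TranscriptionJob"]
        status = job["TranscriptionJobStatus"]
        if status == "COMPLETED":
            return job["Transcript"]["TranscriptFileUri"]
        if status == "FAILED":
            raise RuntimeError(job.get("FailureReason", "transcription failed"))
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {name} did not finish in time")
```

In use, something like `s3, tr = make_clients()` followed by `submit_recording(s3, tr, path, "my-scanner-bucket")` for each new MP3 would do the job, with `wait_for_job` (or an S3 event trigger, if you prefer not to poll) picking up the finished JSON.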
I then set about scripting a quick parser to extract more instances of this using the phonetic words. After about 6 hours of recording on a single channel, I had about a dozen plate-to-name pairs. Amazon Transcribe costs a fraction of a cent per second, and with some free tier allowances, I think I spent a total of 6 cents on processing. Not a bad rate of return.
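A parser along those lines might look something like the sketch below. It reads the transcript text out of a Transcribe output file (the `results.transcripts[0].transcript` field) and collapses runs of phonetic words into the letters they spell. Only Adam and Charles come from the example above; the rest of the table is the commonly cited APCO-style police alphabet, which I’m assuming matches what dispatchers use here. The function names and the three-letter minimum are illustrative choices of mine.

```python
import re
from typing import List

# APCO-style police phonetic alphabet. Only "adam" and "charles" are confirmed
# by the transcript discussed above; the other entries are an assumption.
PHONETIC = {
    "adam": "A", "boy": "B", "charles": "C", "david": "D", "edward": "E",
    "frank": "F", "george": "G", "henry": "H", "ida": "I", "john": "J",
    "king": "K", "lincoln": "L", "mary": "M", "nora": "N", "ocean": "O",
    "paul": "P", "queen": "Q", "robert": "R", "sam": "S", "tom": "T",
    "union": "U", "victor": "V", "william": "W", "xray": "X", "young": "Y",
    "zebra": "Z",
}


def transcript_text(job_output: dict) -> str:
    """Pull the full transcript string out of a Transcribe JSON result."""
    return job_output["results"]["transcripts"][0]["transcript"]


def spelled_words(text: str, min_letters: int = 3) -> List[str]:
    """Return words spelled out phonetically in the text.

    A run of consecutive phonetic words is decoded to its initials. Runs
    shorter than min_letters are dropped, since a lone "John" or "Mary" in
    ordinary speech is more likely a name than a spelled letter.
    """
    words: List[str] = []
    letters: List[str] = []
    normalized = text.lower().replace("x-ray", "xray")
    for token in re.findall(r"[a-z]+", normalized):
        letter = PHONETIC.get(token)
        if letter:
            letters.append(letter)
            continue
        # A non-phonetic word ends the current run.
        if len(letters) >= min_letters:
            words.append("".join(letters))
        letters = []
    if len(letters) >= min_letters:
        words.append("".join(letters))
    return words
```

Feeding each JSON file through `spelled_words(transcript_text(data))` gives a list of candidate names per recording. Pairing them with the plate read just before them is a separate step, not shown here.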
It’s a good start, but this is just one channel in one location. I mentioned earlier that it’s possible to stream feeds remotely, so I used multiple internet scanner feeds from various parts of the country to test the service with a different kind of audio. After three more hours of audio, I again was able to extract a few more identifiers, but the quality of the audio in some cases definitely made it harder for Transcribe to do its stuff. My own feed, which I had control over, was much more reliable.
I was pretty excited by the results here: it was very possible to automatically, reliably and repeatedly extract PII as found in license records using a radio and cloud services. Do I think this illustrates a significant risk to privacy? Probably not. A malicious actor could very well set up a scanner rig for automated collection of vehicle license data, but the relatively slow trickle of records would still limit the practical applications of that data.
What I do think is that this experiment reinforces a couple of truths regarding sensitive data.
- Data is only as secure as the weakest link in the chain. In Washington, the DOL may well have just upgraded their systems in the name of security, but if that same data is transmitted in the clear, across a public channel by a consumer of that data, then it’s there for everyone to grab.
- As cloud offerings reduce the cost and barriers to running tasks like transcription at scale, we will continue to see creative uses for those services, not all of which will be benign.
Finally, this experiment amounts to an exercise in Open Source Intelligence (OSINT), and I’m pretty sure that others are already doing similar work using scanner feeds. I found myself wondering about other applications of scanner transcriptions, and I think that commercial uses for keyword-based alerting (e.g. breaking news, health and safety) are there. Given this, I would hope that in the name of privacy and security, such sensitive lookups are, in more cases, either moved to encrypted channels or performed directly by the officer on scene using a laptop to query the data.
I will continue to experiment and play with scanner feeds and the Transcribe service as time permits, and will drop an update, including any code, if I think it’s valuable enough!