Separating data cleaning from data analyzing
I have a particular way of separating data-cleaning from data-analyzing, which is that I don't separate data-cleaning from data-analyzing. The present article goes like this.
- Conceptual discussion of the relevant aspects of my general approach to data-processing
- Some data-cleaning problems that don't happen to me because of my approach
- Specific examples of how I implement things differently from how other people might
- Why people approach things in other ways
Relevant aspects of my approach to data-processing
Here are some general principles I follow when writing data processing systems. I should merge this with the aforelinked article on Tom-style data-processing.
Let's start with some design-type stuff.
- I accomplish my goal with as little work as possible.
- I design computer software with relevant users in mind. In most cases, the only user is me.
Here are some things on the specific code style.
- Software should be packaged in standard ways.
- Software should be easy to install.
- Software should follow relevant user interface conventions.
And here are some things on documentation.
- Every interaction with the computer should be scripted; that is, anything that isn't encoded in the software itself should be well documented.
- The rest of the software should be well documented too.
The concept of data cleaning is particularly non-Tom-like and thus doesn't fit anywhere in my model. The present article is all about why I don't need such a concept.
Problems I avoid
These are things that other people have to deal with that are either easy for me to deal with or are completely irrelevant.
- Cleaning data unnecessarily
- Deciding whether something is part of the cleaning step or the analysis step
- Excessively verbose code
- Difficulty reusing code through standard programming language semantics
- Copying data unnecessarily
- Determining how to handle edge cases in data cleaning
I'll go through some examples of where that happened.
Cleaning data unnecessarily
One reason why people separate cleaning from analysis is that they haven't figured out what analysis they want to do yet. It might save you some time to figure out what analysis you want before you get very far.
One time I was on a team that was working with money figures in different currencies. Half of the people on the team didn't talk much, and the other half was very insistent on a process that went something like data cleaning, data something, data something else, data visualization, modeling, and maybe some other things. My way is better, as you will see.
Early on, someone decided that we should convert the currencies to one currency (like inflation-adjusted US dollars) so we could compare them, without any particular plan as to how we would compare them. This kind of makes sense, and pretty much everyone went along with it.
Fortunately, I was paying attention. There are so many sorts of manipulations that we could have done with the currency figures in order to model the particular phenomenon that concerned us, and we had several options that did not require us to convert the figures in the obvious way.
Rather than converting currencies, I simply standardized the figures within each currency. We had data about several different things, and we had several money figures for each particular thing. (Each thing corresponded to a particular year, or a small range of years, so it was pretty okay to ignore inflation within each thing.) This produced several distributions, one per currency, rather than one raw distribution across all currencies, and that was all we needed for the analysis of interest.
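Here is a minimal sketch of what standardizing within each currency could look like. It assumes that "standardizing" means computing z-scores (subtract the currency's mean, divide by its standard deviation); the function name and the sample figures are made up for illustration and are not from the original project.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_currency(records):
    """Return (currency, z_score) pairs, standardizing within each currency.

    `records` is a list of (currency, amount) pairs. This is a hypothetical
    helper; z-scoring is an assumption about what "standardized" means.
    """
    by_currency = defaultdict(list)
    for currency, amount in records:
        by_currency[currency].append(amount)

    # One mean and one (population) standard deviation per currency.
    stats = {c: (mean(v), pstdev(v)) for c, v in by_currency.items()}

    result = []
    for currency, amount in records:
        mu, sigma = stats[currency]
        # A currency with no spread gets 0.0 rather than a division by zero
        # (an arbitrary choice made for this sketch).
        z = (amount - mu) / sigma if sigma else 0.0
        result.append((currency, z))
    return result


# Invented sample figures: no exchange rates are needed anywhere.
records = [
    ("USD", 10.0), ("USD", 20.0), ("USD", 30.0),
    ("JPY", 1000.0), ("JPY", 3000.0),
]
standardized = standardize_within_currency(records)
```

The output has one distribution per currency on a common unitless scale, so figures can be compared across currencies without ever choosing a conversion rate.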
The loud half of the team was very eager to produce a "clean dataset" so they could finish the cleaning and start on the analysis. I do not distinguish between cleaning and analysis; I made something that I called a "report", but you could also call it a "clean dataset" if you want.
Deciding whether something is part of the cleaning or the analysis
If you want to "clean" dates that start out like "yesterday at 3 o'clock", you might produce something that looks like "1990-03-30 03:30:00". For your analysis, you might be concerned with some other representation of the date, such as the day of the week, the phase of the moon, or the minute within the hour. Do you create these other representations in the "clean" step or the "analysis" step?
It gets confusing when you have to guess whether things are part of cleaning or part of analysis. I avoid this confusion by writing a date-cleaning function that I call wherever I need it, rather than having separate cleaning and analysis steps.
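As a rough sketch, such a function might go straight from the raw text to whatever representation the analysis wants. This version assumes the text already looks like "1990-03-30 03:30:00" (parsing phrases like "yesterday at 3 o'clock" is left out), and the names `parse_timestamp` and `date_features` are hypothetical.

```python
from datetime import datetime


def parse_timestamp(text):
    """Parse text like '1990-03-30 03:30:00' into a datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def date_features(text):
    """Return only the representations this particular analysis needs."""
    moment = parse_timestamp(text)
    return {
        "weekday": moment.weekday(),      # Monday is 0, Sunday is 6
        "minute_of_hour": moment.minute,
    }


features = date_features("1990-03-30 03:30:00")
```

Nothing here is labelled as cleaning or analysis; a caller that needs the phase of the moon would add one more key rather than another pipeline stage.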
Excessively verbose code
Whenever you separate something into two different components that communicate with each other, you have to write the interface between the two components. If you are passing complicated stuff between the two things, this becomes significant.
Once I wound up rewriting a data processing system that some other people had written. It was a rather pleasant experience, I must say; they had already figured out all the hard stuff, and now I just had to refactor their messy code. Their system looked more messy than it really was because they were passing data around as temporary files of various different file formats. One job would save stuff to a file, and the next job was expected to read data back from that file. Thus, each job included the same few lines of code for serializing or deserializing the data.
After I rewrote it, data mostly stayed in memory, as Python objects. When data were written to disk, it was through a custom module that abstracted all of the reading and writing and serializing and deserializing.
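A minimal sketch of that kind of module follows. It is not the module from the rewrite; the JSON format and the `save` and `load` names are assumptions chosen to keep the example short. The point is that every job calls the same two functions instead of repeating its own serialization code.

```python
import json
import tempfile
from pathlib import Path


def save(path, data):
    """Write data to disk. This is the only place that knows the format."""
    Path(path).write_text(json.dumps(data), encoding="utf-8")


def load(path):
    """Read back data that was written with save()."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Usage: a round trip through a temporary directory.
with tempfile.TemporaryDirectory() as directory:
    target = Path(directory) / "figures.json"
    save(target, {"USD": [10.0, 20.0]})
    restored = load(target)
```

If the storage format ever changes, only these two functions need to change.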
If the different components are just different functions, you probably won't have this problem, but you might have the next problem instead.
Difficulty reusing code through standard programming language semantics
If you start processing data in multiple steps, you may find yourself inventing your own step-based programming language to replace function calls. I have seen this in many different pieces of software.
In this paradigm, one reuses code by sending data through the same steps rather than by making separate modules. In order to interact with a system, one must now learn the implicit system-specific language rather than using the equivalent features of the underlying language, which are usually better in all ways.
In the case of the aforementioned refactoring, the original system had its own implicit language that was neither really tested nor documented. I gradually replaced it with idiomatic Python, which has very good tests and documentation.
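To illustrate what reuse through ordinary language semantics can look like, here is a small sketch with invented names (`parse_amount`, `total`, `largest`). Two different computations share the same parsing logic simply by calling the same function, with no step registry or pipeline definition in between.

```python
def parse_amount(text):
    """Turn text like '1,200.50' into a float."""
    return float(text.replace(",", ""))


def total(raw_amounts):
    """One computation that needs parsed amounts."""
    return sum(parse_amount(raw) for raw in raw_amounts)


def largest(raw_amounts):
    """Another computation that reuses the same parser by calling it."""
    return max(parse_amount(raw) for raw in raw_amounts)


raw = ["1,200.50", "300", "45.25"]
grand_total = total(raw)
biggest = largest(raw)
```

In a real project `parse_amount` would live in a module and be imported wherever it is needed, which is the reuse mechanism Python already provides and documents.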
Determining how to handle edge cases in data cleaning
The prompt for my writing this article was hearing that someone wasn't sure how to handle edge cases in data cleaning. This difficulty had been foreign to me!
Here are some precise things that the someone had been wondering.
- How do I handle missing values?
- What do I do if I see "aoeu" in a numeric field?
- How do we represent our degree of confidence in our parse?
If you don't do things my way, you might wind up answering these questions based on theory, on standards, or on the difficulty of explaining your decision. Most likely, you'll get it wrong, and in the best case, you'll waste a lot of time.
I have no difficulty deciding what to do with edge cases because I'm always thinking about the final product of my data. In many (most?) cases I conclude that it doesn't really matter how I handle the edge case and that I can just skip those records or do something similarly stupid. In other cases, I tailor my approach such that it can be very little work for me while still being appropriate for whatever I want to do with the data.
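For instance, when the final product only needs the numbers that are actually there, skipping bad records can be a few lines. This is a sketch under that assumption; the function name `numeric_values` and the sample rows are hypothetical.

```python
def numeric_values(rows, field):
    """Collect the values of `field` that parse as numbers; skip the rest."""
    values = []
    for row in rows:
        try:
            values.append(float(row[field]))
        except (KeyError, TypeError, ValueError):
            # Missing field, None, or junk like "aoeu": skip the record,
            # because for this analysis it does not matter.
            continue
    return values


rows = [{"price": "3.5"}, {"price": "aoeu"}, {"price": None}, {}, {"price": "4"}]
prices = numeric_values(rows, "price")
```

Whether skipping is appropriate depends entirely on what the numbers are for, which is why the decision sits next to the code that uses them.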
In some cases I come up with particularly strange approaches that make sense only for my specific situation. Fortunately, I script and document everything, so I can figure out how my stuff works after I forget what I had been doing.
Why people approach things in other ways
We all know that my way is best, so why would anyone approach data stuff differently and wind up with a separation between data cleaning and data analysis?
Sometimes it's fun to categorize things and to have specific names for everything. Categories make you think that things are orderly; that you understand what's going on; that there is a clear path in your career; or that you have a clear role in your organization or overall society.
Categories are especially relevant to managers, in certain cases, because they help managers believe that they are orchestrating the work, and they can make managers feel very smart and skilled and powerful.
If you don't know how to do things correctly but you want to think that you're doing things correctly, it can be nice to agree on doing things the same way as a bunch of other people. This is still likely to be wrong, but you might feel like it must be right.
Some people make it look like doing data science involves seven specific steps that should be done in order.
Here are some articles and a diagram that could give you this impression.
- A Taxonomy of Data Science
- Cross Industry Standard Process for Data Mining
- The Age of the Data Product
- Building a Data Science Platform in Scala
I can never tell whether these people sound like this because they actually think that or because they're presenting their thoughts badly. Either way, their articles can convey the unnecessary grouping of data processing into steps.
If you hear some big fancy people talking about their big fancy data-cleaning step, it's easy to want to copy them.
Other ways might be better
The simplest explanation, of course, is that dividing data processing into steps is sometimes helpful.
Dividing complicated projects into separate components makes a lot of sense when different people are working on the different components, and it is sometimes especially important that people work on the different components in order to maintain privacy of data. For example, the data-cleaning step may also include anonymization, so it may be important that the people doing the analysis not have access to the original data.
Dividing projects like this can also make them seem more complicated and fancy. This can help contributors feel smart, and this can also impress potential users.
When projects are divided like this, managers may feel more knowledgeable or feel a greater sense of purpose. If everyone is doing separate analyses and organically reusing their code through a module system, it might seem like the manager isn't doing anything. On the other hand, if the manager is the one assembling the individual components, he or she feels like he or she understands what is going on and like he or she is the one orchestrating the data science.