Surveying files

Sometimes when testing a program I want to run it on just a small part of some files, because that can be faster than running it on the files in their entirety.

cat chainsaws.csv |
  sort -R |
  head -n 80 |
  my-program
# Note: -R is a pretty new feature; some operating systems don't support it.

But this is a slow way of selecting a random subset of lines because sort -R reads the whole file and then sorts randomly.

Other times, I want to see how common a particular phrase is in a particular file, usually because I want to know how important it is to handle a specific data format.

# zip+4 codes
grep -c '[0-9]\{5\}-[0-9]\{4\}' \
  chainsaws.csv

This too takes a long time, because we need to read the whole file, and the file might be very big if we have a lot of chainsaws.

Here are some tools to support these endeavors!

Estimate counts in a file

If we are okay with an estimate rather than an exact number, we don't have to read the entire file; instead, we can seek to random positions in the file and count occurrences in just those sections. This is what svygrepc.file does.

svygrepc.file(file,
              pattern = '\n',
              n = 200,
              page.size = 2^14)

It treats the entire file as a population of bytes. Modern storage media read entire "pages" of data at once, even if you request just one byte from a page, so svygrepc.file divides the file into pages and treats each page as a cluster: it takes a simple random sample of pages and reads all of the data within each sampled page. The data are accordingly analyzed as a cluster sample.

Beginning of file
|--------------- 2^14 bytes ---------------|
|--------------- 2^14 bytes ---------------|
|- 5 kb -|
End of file

The last page in a file is likely to be smaller than the other pages, so we assign this page a correspondingly high sampling weight.
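
For intuition, here is a rough Python sketch of the same page-cluster idea. It is not the svygrepc.file implementation (that is R, built on the survey package); the file handling and the plain standard-error formula here are only stand-ins.

import os
import random
import re
import statistics

def estimate_count(path, pattern=rb'\n', n=200, page_size=2**14):
    """Estimate occurrences of `pattern` in `path` from a simple
    random sample of fixed-size pages (clusters)."""
    size = os.path.getsize(path)
    n_pages = -(-size // page_size)            # ceiling division
    sampled = random.sample(range(n_pages), min(n, n_pages))
    counts = []
    with open(path, 'rb') as f:
        for page in sampled:
            f.seek(page * page_size)
            chunk = f.read(page_size)          # the last page may come up short
            counts.append(len(re.findall(pattern, chunk)))
    total = statistics.mean(counts) * n_pages  # scale the per-page mean up to the file
    # Crude standard error of the total: no finite-population correction,
    # and matches that straddle a page boundary are missed.
    se = statistics.stdev(counts) / len(counts) ** 0.5 * n_pages
    return total, se

The point is that the running time depends on n and page.size, not on the size of the file.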

svygrepc.file should adjust its population size figure according to the size of the pattern that you are matching. I don't remember whether I did this though.

svygrepc.file uses R's survey package to calculate confidence intervals for its estimates of the number of occurrences of the pattern (newline by default) in the file. This is what happens when you run it.

> svygrepc.file('~/big-file.csv')
        total    SE
count 2451437 19221
        2.5 %  97.5 %
count 2413765 2489110

Sample lines from a file

I found it quite hard to sample full lines from a file both efficiently and representatively, so I came up with something that is hopefully decent in many cases.

# Instead of this,
cat chainsaws.csv |
  sort -R |
  head -n 80 |
  my-program

# do this.
sample-lines chainsaws.csv | my-program

sample-lines randomly samples full lines from a file. Samples are drawn with replacement and are weighted by line length: the probability of selecting a line is proportional to the length of the previous line. You can configure whether to take a simple random sample or repeated systematic samples.

This allows us to sample very quickly, but it makes the approach appropriate only if your file has reasonably consistent line lengths, or at least no periodic variation in line length.
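
To see where that length weighting comes from, here is a toy Python version of the seek-and-scan trick (an illustration of the approach, not the sample-lines source): land on a uniformly random byte, throw away the rest of that line, and keep the next one. A long line covers many byte offsets, so the line after it gets picked often.

import os
import random

def sample_line(path):
    """Return one line, chosen by seeking to a random byte offset
    and taking the next complete line.  A line's chance of being
    picked is proportional to the length of the line before it."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        f.seek(random.randrange(size))
        f.readline()             # discard the (probably partial) line we landed in
        line = f.readline()      # the next full line is the sample
        if not line:             # ran off the end of the file: wrap to the first line
            f.seek(0)
            line = f.readline()
    return line

# Sampling with replacement is just repeated calls:
#     for _ in range(40):
#         print(sample_line('chainsaws.csv').decode().rstrip('\n'))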

How fast

Consider this 1-gigabyte CSV file.

$ wc big-file.csv
 2388430 27673790 1071895374 big-file.csv

Running wc took almost four seconds.

$ time wc big-file.csv
 2388430 27673790 1071895374 big-file.csv

real    0m3.789s
user    0m3.560s
sys     0m0.190s

Here's how long it takes just to read the whole file, line by line, in Python.

$ time python3 -c 'for line in open("big-file.csv"): pass'

real    0m2.892s
user    0m2.641s
sys     0m0.245s

sample-lines is much faster. Here's a simple random sample of 40 lines,

$ time sample-lines -n 40 -m simple-random big-file.csv > /dev/null

real    0m0.136s
user    0m0.113s
sys     0m0.018s

a systematic sample of 40 lines,

$ time sample-lines -n 40 -m systematic -r 4 big-file.csv > /dev/null

real    0m0.148s
user    0m0.122s
sys     0m0.019s

and a repeated systematic sample, with 4 repeats of 10 lines each, for a total of 40 lines.

$ time sample-lines -n 10 -m systematic -r 4 big-file.csv > /dev/null

real    0m0.175s
user    0m0.140s
sys     0m0.025s

Most of the time in the above examples was spent loading Python and the various modules; printing the help takes almost as long as running the sample.

$ time sample-lines -h > /dev/null

real    0m0.157s
user    0m0.129s
sys     0m0.021s

So even a pretty big sample is still fast to run.

$ time sample-lines -n 2000 -m systematic -r 50 big-file.csv > /dev/null

real    0m2.695s
user    0m2.435s
sys     0m0.231s

Alternatives

Use sample if you want to sample from a stream.

Estimate counts in many files

I think the really cool thing would be to combine these file-level samplers with something that samples across files.

Cluster sampling
    Select random groups, and get everything from each group.
Systematic sampling
    Select at an interval along a variable.
Stratified sampling
    Select based on some grouping.
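
To make those definitions concrete, here is a toy Python snippet that picks indices from a hypothetical population of 100 items under each scheme.

import random

items = list(range(100))        # stand-ins for files or lines

# Cluster sampling: split into groups of 10 and keep 2 whole groups.
clusters = [items[i:i + 10] for i in range(0, len(items), 10)]
cluster_sample = [x for group in random.sample(clusters, 2) for x in group]

# Systematic sampling: random start, then every 20th item.
start = random.randrange(20)
systematic_sample = items[start::20]

# Stratified sampling: group by some property, then sample within each group.
strata = {parity: [x for x in items if x % 2 == parity] for parity in (0, 1)}
stratified_sample = [x for group in strata.values() for x in random.sample(group, 5)]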

Stratify by directory/filename

If files are grouped into directories, there's a good chance that files within a directory are sort of similar; maybe they're all from the same system or month or person or data format. So we can stratify by directory and eventually select only a subset of files from each directory to read fully.

Consider the following files.

logs/example.com/2014/log1
logs/example.com/2014/log2
logs/example.com/2014/...
logs/example.com/2015/log1
logs/example.com/2015/log2
logs/example.com/2015/...
logs/thomaslevine.com/access-log1
logs/thomaslevine.com/access-log2
logs/thomaslevine.com/...
logs/thomaslevine.com/error-log1
logs/thomaslevine.com/error-log2
logs/thomaslevine.com/...

If I had to guess, I'd say that the logs/example.com files are more similar to each other than they are to the logs/thomaslevine.com files, that the logs/example.com/2014 files are older than the logs/example.com/2015 files, that log1 is older than log2, and that the logs/thomaslevine.com/error-* files are more similar to each other than to the logs/thomaslevine.com/access-* files.

Strata

  • logs/example.com
  • logs/thomaslevine.com
  • logs/example.com/2014
  • logs/example.com/2015
  • logs/thomaslevine.com/error-*
  • logs/thomaslevine.com/access-*

Component for systematic sampling

  • log1
  • log2

We can use this knowledge to better estimate how many lines are in this entire directory. Moreover, we probably also want to know the line count for each stratum.
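
Here is a loose Python sketch of that stratified estimate. Everything in it is hypothetical: the stratum rule (first two path components), the sample size per stratum, and the shortcut of reading each selected file in full rather than estimating its line count with something like svygrepc.file.

import random

def stratum_of(path):
    # Hypothetical rule: the first two path components form the stratum,
    # e.g. 'logs/example.com/2014/log1' -> 'logs/example.com'.
    return '/'.join(path.split('/')[:2])

def count_lines(path):
    with open(path, 'rb') as f:
        return sum(1 for _ in f)

def estimate_lines(all_files, files_per_stratum=3):
    """Estimate lines per stratum by fully counting a few randomly
    chosen files from each stratum and scaling up."""
    strata = {}
    for path in all_files:
        strata.setdefault(stratum_of(path), []).append(path)
    estimates = {}
    for name, files in strata.items():
        chosen = random.sample(files, min(files_per_stratum, len(files)))
        mean = sum(count_lines(p) for p in chosen) / len(chosen)
        estimates[name] = mean * len(files)
    return estimates    # per-stratum totals; sum them for the whole directory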

Systematic sampling within a file

Systematic sampling is when we order our data by some variable, pick a random starting point, and then select every 20th (or whatever interval) item. I think this is very appropriate for text files because data are likely to be correlated with position within the file. For example, data later in the file might have been written later or might come later in alphabetical order. Data in the following file are approximately ordered by decreasing expirationDate.

permitApplicationNumber,expirationDate
MVN-2012-1762-CU,2012-08-22
MVN-2011-02043-ETT,2012-08-20
MVN-2012-1846-WII,2012-08-14
MVN-2012-00199-CM,2012-08-20
MVN-2011-00989-CL,2012-08-19
MVN-2010-0024,2012-08-19
MVN-2005-1630-CYA,2012-08-19
MVN-2006-0335-CYA,2012-08-19
MVN-2012-1879-EII,2012-08-14
MVN-2012-1732-WJJ,2012-08-12
MVN-2012-1660-WII,2012-08-12
MVN-2012-1504-CQ,2012-08-13
MVN-2012-1608-CQ,2012-08-13
MVN-2008-3300-WJJ,2012-08-12
MVN-2010-1270-WII,2012-08-12
MVN-2012-01027-WMM,2012-08-07
MVN-2012-00926,2012-08-13
MVN-2012-0740-EBB,2012-08-12
MVN-2009-0862-EBB,2012-08-22
MVN-1997-0884-EPP,2012-08-22
MVN-2012-1598-EFF,2012-08-13
MVN-2012-0887-CO,2012-08-11
MVN-2011-0639-EQ,2012-08-21
MVN-2012-1174-WII,2012-08-05
MVN-2004-5061-EOO,2012-07-30
MVN-2012-01496-WMM,2012-08-05
MVN-2012-00786-WMM,2012-08-05
MVN-2012-1489-EMM,2012-08-05
MVN-2012-1644-WJJ,2012-08-05
MVN-2012-01146-CJ,2012-08-05
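
As a sketch of what systematic line sampling looks like on data like this, here is a readable but deliberately naive Python version that walks every line (sample-lines clearly does not, given the timings above); it only shows the pattern of a random start plus a fixed interval, repeated a few times.

import random

def systematic_samples(lines, n, repeats=1):
    """Take `repeats` systematic samples of about `n` lines each:
    each repeat picks a random start, then takes every k-th line."""
    k = max(1, len(lines) // n)     # sampling interval
    out = []
    for _ in range(repeats):
        start = random.randrange(k)
        out.append(lines[start::k][:n])
    return out

# Hypothetical usage on the permit data above:
#     with open('permits.csv') as f:
#         header, *rows = f.read().splitlines()
#     for sample in systematic_samples(rows, n=5, repeats=2):
#         print(sample)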

Present support

sample-lines
    Both simple random and repeated systematic
svygrepc.file
    Only simple random

Multistage

And if we want to go crazy with efficient sampling, we can do a two-stage survey: take a small sample to estimate how varied the data are within particular directories or files, and then choose the sample size for the second stage of the survey based on the first stage's results. This would all happen within the span of a second, of course; the goal is to get confident estimates as quickly as we can.

  1. Take a small sample from a large directory.
  2. Estimate how many lines are in the subdirectories.
  3. Cluster sample with random files, allocating the number of files sampled from each directory based on the variance estimates.

This would mainly be a good way to quickly sample full lines from a whole directory, by choosing representative files and selecting lines from these files, as selecting full lines representatively takes much longer than counting occurrences.

  1. Take a small sample from a large directory.
  2. Estimate how many lines are in the subdirectories.
  3. Cluster sample with random files, allocating the number of files sampled from each directory based on the variance estimates.
  4. Scan each selected file, and emit a full line from each.
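
One way the allocation in step 3 could work, sketched very loosely in Python (a Neyman-style rule: directories whose first-stage counts varied more get more of the second-stage budget; all of the names and numbers are made up):

def allocate(variances, budget):
    """Split a fixed budget of files across directories in proportion
    to the standard deviation seen in the first-stage sample."""
    sds = {d: v ** 0.5 for d, v in variances.items()}
    total = sum(sds.values()) or 1.0
    return {d: max(1, round(budget * sd / total)) for d, sd in sds.items()}

# Made-up first-stage variances of lines-per-file, by directory:
first_stage = {'logs/example.com': 1.2e6, 'logs/thomaslevine.com': 4.0e4}
print(allocate(first_stage, budget=20))
# {'logs/example.com': 17, 'logs/thomaslevine.com': 3}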

Two things I'm wondering

  1. Is this at all useful?
  2. Sleep