Update README
Freaky committed Apr 23, 2018
1 parent 66a4178 commit 0ca589e
Showing 1 changed file with 65 additions and 42 deletions.
107 changes: 65 additions & 42 deletions README.md
@@ -1,74 +1,97 @@
# LineReader [![Build Status](https://travis-ci.org/Freaky/rust-linereader.svg?branch=master)](https://travis-ci.org/Freaky/rust-linereader)
A fast line-oriented reader for Rust.

## Summary
## Synopsis

In my tests LineReader is 1.1x-1.6x faster than the typically-recommended fastest
stdlib option: `BufReader::read_until()`. It achieves this by avoiding copying
from its own internal buffer, instead returning immutable slices of its own.
The `LineReader` struct is a byte-delimiter-focused buffered reader meant as a
faster, less error-prone alternative to `BufRead::read_until`.

Like `read_until`, it does *not* perform UTF-8 processing - you get a slice of
raw u8's, including the delimiter, and nothing more.
It provides two main functions:

Lines are limited to the size of the internal buffer (default 1MB).

### `next_line()`

Returns `Option<io::Result<&[u8]>>` - `None` on end-of-file, an IO error from the
wrapped reader, or an immutable byte slice ending on and including any delimiter.

Line length is limited to the size of the internal buffer.

In contrast with `read_until`, detecting end-of-file is more natural with the
use of `Option`; line length is naturally limited to some sensible value without
the use of `by_ref().take(limit)`; copying is minimised by returning borrowed
slices; you'll never forget to call `buf.clear()`.
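For comparison, the stdlib pattern this paragraph contrasts against looks roughly like the sketch below. It uses only `std`; the function name `count_lines_read_until` and the 1024-byte limit are illustrative assumptions, not part of the README or the crate.

```rust
use std::io::{self, BufRead, Cursor, Read};

/// Count lines using `BufRead::read_until`, capping each call at `limit` bytes.
/// Hypothetical helper for illustration only.
fn count_lines_read_until<R: BufRead>(mut reader: R, limit: u64) -> io::Result<usize> {
    let mut buf = Vec::new();
    let mut lines = 0;
    loop {
        // read_until appends to the Vec, so it must be cleared on every pass.
        buf.clear();
        // by_ref().take(limit) bounds how many bytes a single call may consume;
        // a line longer than `limit` would come back in several pieces.
        let n = reader.by_ref().take(limit).read_until(b'\n', &mut buf)?;
        if n == 0 {
            break; // end-of-file is signalled by a zero-length read
        }
        // buf now holds one line, including the delimiter if one was found.
        lines += 1;
    }
    Ok(lines)
}

fn main() -> io::Result<()> {
    let input = Cursor::new(b"one\ntwo\nthree".to_vec());
    let lines = count_lines_read_until(input, 1024)?;
    println!("{} lines", lines);
    Ok(())
}
```

Each of the three chores in that loop (clearing the buffer, bounding the read, checking for a zero-length result) is one that the paragraph above says `next_line()` makes unnecessary.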


### `next_batch()`

Behaves identically to `next_line()`, except it returns a slice of *all* the complete
lines in the buffer.
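One way to picture "all the complete lines in the buffer" is a split at the last delimiter: everything up to and including it is the batch, and the remainder is an incomplete line. The sketch below is a standalone illustration under that assumption; `split_complete_lines` is a hypothetical name, and this is not taken from the crate's source.

```rust
/// Split `buf` into (complete lines, trailing fragment) at the last `delim`.
/// Hypothetical helper, not part of the linereader API.
fn split_complete_lines(buf: &[u8], delim: u8) -> (&[u8], &[u8]) {
    match buf.iter().rposition(|&b| b == delim) {
        // Include the delimiter itself in the batch.
        Some(i) => buf.split_at(i + 1),
        // No delimiter at all: nothing is complete yet.
        None => (&buf[..0], buf),
    }
}

fn main() {
    let buf = b"alpha\nbeta\ngam";
    let (batch, rest) = split_complete_lines(buf, b'\n');
    assert_eq!(batch, &b"alpha\nbeta\n"[..]);
    assert_eq!(rest, &b"gam"[..]);
}
```

Only one backwards scan is needed per buffer, which is why a batch can be cheaper to produce than the same lines found one at a time.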


## Example

extern crate linereader;
use linereader::LineReader;

// Note BufReader will result in unnecessary copying, so, er, don't do that.
let mut file = File::open(myfile).expect("open");

// or LineReader::with_capacity(usize);
// or LineReader::with_delimiter(u8);
// or LineReader::with_delimiter_and_capacity(u8, usize)
let reader = LineReader::new(file);
// Defaults to a 1 MiB buffer and b'\n' delimiter; change with one of:
// * LineReader::with_capacity(usize);
// * LineReader::with_delimiter(u8);
// * LineReader::with_delimiter_and_capacity(u8, usize)
let mut reader = LineReader::new(file);

while let Some(line) = reader.next_line() {
let line = line.expect("oh noes, an IO error");
let line = line.expect("read error");
// line is a &[u8] owned by reader.
}

Lines can also be read in batches for group processing - e.g. in threads:

while let Some(lines) = reader.next_batch() {
send(&chan, lines.unwrap().to_vec());
send(&chan, lines.expect("read error").to_vec());
}

This should be more efficient than finding each intermediate delimiter in the main
thread, and allocating and sending each individual line.
thread, and allocating and sending each individual line. Any line fragments at
the end of the internal buffer will be copied to the start in the next call.
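The fragment handling described above can be sketched with a fixed buffer and `copy_within`. This is a minimal illustration using only `std`, assuming a `b'\n'` delimiter; `for_each_batch` is a hypothetical function, it moves the fragment immediately rather than on the next call, and it should not be read as the crate's actual implementation.

```rust
use std::io::{self, Cursor, Read};

/// Feed `on_batch` with slices of complete lines read through a fixed buffer.
/// Hypothetical helper for illustration only.
fn for_each_batch<R: Read, F: FnMut(&[u8])>(
    mut reader: R,
    capacity: usize,
    mut on_batch: F,
) -> io::Result<()> {
    let mut buf = vec![0u8; capacity];
    let mut filled = 0; // number of valid bytes at the start of `buf`
    loop {
        let n = reader.read(&mut buf[filled..])?;
        if n == 0 {
            // End of input: anything left is a final, unterminated line.
            if filled > 0 {
                on_batch(&buf[..filled]);
            }
            return Ok(());
        }
        filled += n;
        let end = match buf[..filled].iter().rposition(|&b| b == b'\n') {
            Some(i) => i + 1,
            // Buffer is full with no delimiter: emit the over-long line as-is.
            None if filled == capacity => filled,
            // Partial line only: keep reading.
            None => continue,
        };
        on_batch(&buf[..end]);
        // Move the trailing fragment to the start so the next read extends it.
        buf.copy_within(end..filled, 0);
        filled -= end;
    }
}

fn main() -> io::Result<()> {
    let input = Cursor::new(b"a\nbb\nccc\ndd".to_vec());
    let mut batches: Vec<Vec<u8>> = Vec::new();
    // A deliberately tiny buffer forces fragments to be carried over.
    for_each_batch(input, 4, |b| batches.push(b.to_vec()))?;
    println!("{} batches", batches.len());
    Ok(())
}
```

The `to_vec()` in the callback mirrors the `lines.expect("read error").to_vec()` call in the example above: the slice borrows the internal buffer, so it has to be copied before it can outlive the next read.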


## Performance

Comparison with using typical BufReader methods against pwned-passwords-2.0.txt:
Tests performed using ['Dickens_Charles_Pickwick_Papers.xml'](http://hur.st/Dickens_Charles_Pickwick_Papers.xml.xz),
concatenated to itself 480 times. The resulting file is 976 MB and 10.3 million lines long.

Westmere Xeon 2.1GHz, FreeBSD/ZFS, 29GB, 501.6 million lines:
Buffers in each test are set to 1 MiB.

| Method | Time | Lines/sec | Bandwidth |
|----------|------:|----------:|----------:|
|128k read | 36.85s| 13,612,940|817.92 MB/s|
|LineReader| 73.96s| 6,782,542|407.52 MB/s|
|read_until|119.30s| 4,204,835|252.62 MB/s|
|read_line |139.14s| 3,605,267|216.61 MB/s|
|lines() |167.17s| 3,000,759|174.57 MB/s|
### Westmere Xeon 2.1GHz, FreeBSD/ZFS.

Haswell Xeon 3.4GHz, Windows 10 Subsystem for Linux, 5.9GB, 100 million lines:
| Method | Time | Lines/sec | Bandwidth |
|------------------|--------:|------------:|--------------:|
| read() | 1.82s | 5,674,452/s | 535.21 MB/s |
| LR::next_batch() | 1.83s | 5,650,387/s | 532.94 MB/s |
| LR::next_line() | 3.10s | 3,341,796/s | 315.20 MB/s |
| read_until() | 3.62s | 2,861,864/s | 269.93 MB/s |
| read_line() | 4.25s | 2,432,505/s | 229.43 MB/s |
| lines() | 4.88s | 2,119,837/s | 199.94 MB/s |

| Method | Time | Lines/sec | Bandwidth |
|----------|-----:|----------:|------------:|
|128k read | 1.83s| 54,644,809|3,282.17 MB/s|
|LineReader| 2.98s| 33,557,047|2,016.28 MB/s|
|read_until| 3.43s| 29,154,519|1,752.24 MB/s|
|read_line | 5.17s| 19,342,360|1,162.51 MB/s|
|lines() | 7.83s| 12,771,392| 742.52 MB/s|
### Haswell Xeon 3.4GHz, Windows 10 Subsystem for Linux.

It's also surprisingly fast on debug builds (or stdlib is surprisingly slow):
| Method | Time | Lines/sec | Bandwidth |
|------------------|--------:|------------:|--------------:|
| read() | 0.26s | 39,253,494/s | 3702.36 MB/s |
| LR::next_batch() | 0.26s | 39,477,365/s | 3723.47 MB/s |
| LR::next_line() | 0.50s | 20,672,784/s | 1949.84 MB/s |
| read_until() | 0.60s | 17,303,147/s | 1632.02 MB/s |
| read_line() | 0.84s | 12,293,247/s | 1159.49 MB/s |
| lines() | 1.53s | 6,783,849/s | 639.85 MB/s |

| Method | Time | Lines/sec | Bandwidth |
|----------|-------:|----------:|------------:|
|128k read | 1.82s| 54,945,055|3,296.37 MB/s|
|LineReader| 29.17s| 3,428,180| 205.98 MB/s|
|read_until| 368.02s| 271,724| 16.33 MB/s|
|read_line | 383.00s| 261,097| 15.69 MB/s|
|lines() | 220.28s| 453,968| 26.41 MB/s|
It's also surprisingly fast on debug builds (or stdlib is surprisingly slow):

Hmmm.
| Method | Time | Lines/sec | Bandwidth |
|------------------|--------:|------------:|--------------:|
| read() | 0.27s | 38,258,105/s | 3608.47 MB/s |
| LR::next_batch() | 0.28s | 36,896,353/s | 3480.04 MB/s |
| LR::next_line() | 2.99s | 3,463,911/s | 326.71 MB/s |
| read_until() | 57.01s | 181,505/s | 17.12 MB/s |
| read_line() | 58.36s | 177,322/s | 16.72 MB/s |
| lines() | 21.06s | 491,320/s | 46.34 MB/s |
