<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/xhtml; charset=utf-8">
</head>
<body><div style="font-family: sans-serif;"><div class="markdown" style="white-space: normal;">
<p dir="auto">I took a file of about 450MB of characters. Using SBCL, when I read it like this:</p>
<pre style="margin-left: 15px; margin-right: 15px; padding: 5px; background-color: #F7F7F7; border-radius: 5px 5px 5px 5px; overflow-x: auto; max-width: 90vw;"><code style="margin: 0 0; border-radius: 3px; background-color: #F7F7F7; padding: 0px;">(defun do-test2 ()
  (with-open-file (stream *text-file*)
    (let ((buffer-size (* 16 1024 1024))) ; 16M
      (time
       (loop with buffer = (make-array buffer-size :element-type 'character)
             for n-characters = (read-sequence buffer stream)
             while (&lt; 0 n-characters))))))
</code></pre>
<p dir="auto">It took an average of 1.08125s to read (4 trials).</p>
<p dir="auto">When I read it as raw bytes instead, with this procedure:</p>
<pre style="margin-left: 15px; margin-right: 15px; padding: 5px; background-color: #F7F7F7; border-radius: 5px 5px 5px 5px; overflow-x: auto; max-width: 90vw;"><code style="margin: 0 0; border-radius: 3px; background-color: #F7F7F7; padding: 0px;">(defun do-test3 ()
  (with-open-file (stream *text-file* :element-type '(unsigned-byte 8))
    (let ((buffer-size (* 16 1024 1024))) ; 16M
      (time
       (loop with buffer = (make-array buffer-size :element-type '(unsigned-byte 8))
             for n-characters = (read-sequence buffer stream)
             while (&lt; 0 n-characters))))))
</code></pre>
<p dir="auto">It took an average of 0.07s.</p>
<p dir="auto">Modifying this to set the <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">:external-format</code> to <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">:iso8859-1</code> and reading into an array of <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">:element-type 'character</code>, it took an average of 0.8095s.</p>
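<p dir="auto">For reference, a minimal sketch of that third variant, assuming it mirrors <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">do-test2</code> with an explicit external format. The function name <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">count-latin-1-characters</code> and its <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">path</code> and <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">buffer-size</code> parameters are hypothetical, introduced only for illustration. It returns the number of characters read so the result can be checked:</p>
<pre style="margin-left: 15px; margin-right: 15px; padding: 5px; background-color: #F7F7F7; border-radius: 5px 5px 5px 5px; overflow-x: auto; max-width: 90vw;"><code style="margin: 0 0; border-radius: 3px; background-color: #F7F7F7; padding: 0px;">;; Hypothetical helper, not from the tests above: the do-test2 loop
;; with an explicit single-byte external format.
(defun count-latin-1-characters (path &amp;optional (buffer-size (* 16 1024 1024)))
  "Read PATH as ISO-8859-1 text in BUFFER-SIZE chunks; return the character count."
  (with-open-file (stream path :external-format :iso-8859-1)
    (loop with buffer = (make-array buffer-size :element-type 'character)
          for n-characters = (read-sequence buffer stream)
          while (&lt; 0 n-characters)
          ;; Accumulate the chunk sizes so the caller can sanity-check the read.
          sum n-characters)))
</code></pre>
<p dir="auto">Wrapping a call in <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">time</code>, as in <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">(time (count-latin-1-characters *text-file*))</code>, gives a rough timing; unlike the tests above, it also includes the cost of opening the file and allocating the buffer.</p>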
<p dir="auto">So there seems to be <em>some</em> overhead to the Unicode handling. Note that I didn't have a file at hand that was actually encoded in ISO-8859-1, so I don't know if that would have complicated matters.</p>
<p dir="auto">This suggests that just moving around bits without worrying about their interpretation <em>may</em> be faster than treating them as characters. So you could see if that changes your results at all.</p>
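<p dir="auto">One way to try this, sketched under assumptions: read raw octets and decode a chunk only when its text is actually needed. Both function names below are hypothetical, and the decoder assumes that character codes are Unicode code points (true in SBCL), so that each ISO-8859-1 byte maps to the character with the same code:</p>
<pre style="margin-left: 15px; margin-right: 15px; padding: 5px; background-color: #F7F7F7; border-radius: 5px 5px 5px 5px; overflow-x: auto; max-width: 90vw;"><code style="margin: 0 0; border-radius: 3px; background-color: #F7F7F7; padding: 0px;">;; Hypothetical helper: read PATH as raw octets and hand each chunk to CHUNK-FN.
(defun read-file-octets-in-chunks (path chunk-fn
                                   &amp;optional (buffer-size (* 16 1024 1024)))
  "Call CHUNK-FN with the buffer and the number of valid bytes; return total bytes."
  (with-open-file (stream path :element-type '(unsigned-byte 8))
    (loop with buffer = (make-array buffer-size :element-type '(unsigned-byte 8))
          for n-bytes = (read-sequence buffer stream)
          while (&lt; 0 n-bytes)
          ;; The buffer is reused, so CHUNK-FN must copy anything it keeps.
          do (funcall chunk-fn buffer n-bytes)
          sum n-bytes)))

;; Hypothetical helper: decode the first END octets as ISO-8859-1.
;; Assumes CODE-CHAR maps 0-255 to the matching Latin-1 characters.
(defun latin-1-octets-to-string (octets end)
  (let ((string (make-string end)))
    (dotimes (i end string)
      (setf (char string i) (code-char (aref octets i))))))
</code></pre>
<p dir="auto">Passing a <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">chunk-fn</code> that ignores its arguments measures the raw read alone, and one that calls <code style="margin: 0 0; padding: 0 0.25em; border-radius: 3px; background-color: #F7F7F7;">latin-1-octets-to-string</code> shows how much the decoding step adds.</p>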
<p dir="auto">I'm not a real expert in CL file I/O, so it's likely that this could be done better.</p>
<p dir="auto">On 21 Oct 2022, at 16:18, Garrett Dangerfield wrote:</p>
</div><blockquote class="embedded" style="margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #777777; color: #777777;"><div id="190B356E-24D5-45BF-B5DF-47CE3F4C5597">
<div dir="ltr">
<div>I tried changing (make-array buffer-size :element-type 'character)</div>
<div>to</div>
<div>(make-array buffer-size :element-type 'byte)</div>
<div>and I got additional warnings and it took 70 seconds instead of 20.</div>
<div><br></div>
<div>Thanks,</div>
<div>Garrett.<br></div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Oct 21, 2022 at 1:47 PM Robert Goldman &lt;<a href="mailto:rpgoldman@sift.net">rpgoldman@sift.net</a>&gt; wrote:<br></div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div style="font-family:sans-serif">
<div style="white-space:normal">
<p dir="auto">I don't know what data you are reading, but is there any chance that the difference is that when you read text in Lisp as ISO-8859-1, Lisp is actually processing the text as Unicode, but when you read it in Java you are just slamming raw bytes into memory?</p>
<p dir="auto">Maybe this is relevant? <a href="https://stackoverflow.com/questions/979932/read-unicode-text-files-with-java" style="color:rgb(57,131,196)" target="_blank">https://stackoverflow.com/questions/979932/read-unicode-text-files-with-java</a></p>
<p dir="auto">I don't use Java myself, so I can't say, and I don't have access to your data, but it does seem like the Java code is doing something simpler than the Lisp code.</p>
<p dir="auto">What happens if you change your Lisp code to <code style="margin:0px;padding:0px 0.25em;border-radius:3px;background-color:rgb(247,247,247)">read-sequence</code> of type <code style="margin:0px;padding:0px 0.25em;border-radius:3px;background-color:rgb(247,247,247)">byte</code> instead of <code style="margin:0px;padding:0px 0.25em;border-radius:3px;background-color:rgb(247,247,247)">character</code>?</p>
<p dir="auto">On 21 Oct 2022, at 13:43, Garrett Dangerfield wrote:</p>
</div>
<div style="white-space:normal">
<blockquote style="margin:0px 0px 5px;padding-left:5px;border-left:2px solid rgb(119,119,119);color:rgb(119,119,119)">
<p dir="auto">I don't want to cause a firestorm here but I was doing some simple<br>
benchmarks on file i/o between Java, ABCL, and SBCL and I'm a bit shocked,<br>
honestly.</p>
<p dir="auto">Reading a 2.5M file in 16M chunks (using iso-8859-1):<br>
- abcl takes a tad over 1 second<br>
- sbcl takes 0.04 seconds</p>
<p dir="auto">Reading a 5.8G file in 16M chunks (using iso-8859-1 for Lisp, for Java<br>
it's just bytes):<br>
- abcl takes...too long, I gave up<br>
- sbcl takes between 20 and 21 seconds<br>
- Java takes 1.5 seconds</p>
<p dir="auto">These are all run on the same computer using the same files, etc.</p>
<p dir="auto">What's up with this? Thoughts? I'd heard that SBCL should be as fast as C<br>
under at least some circumstances. I'd wager that C is at least as fast as<br>
Java (probably faster).</p>
<p dir="auto">Thanks,<br>
Garrett Dangerfield. (he/him/his)</p>
<p dir="auto">P.S. Don't get me wrong, I *LOVE* Lisp, I'm trying to get away from Java as<br>
fast as I can (the syntax is killing me slowly). I've used ABCL in<br>
projects before (it was wonderful, Java doesn't handle XML well).</p>
<p dir="auto">Lisp code:<br>
(with-open-file (stream "/media/danger/OS/temp/jars.txt" :external-format<br>
:iso-8859-1) ; great_expectations.iso<br>
(let ((size (file-length stream))<br>
(buffer-size (* 16 1024 1024)) ; 16M<br>
)<br>
(time<br>
(loop with buffer = (make-array buffer-size :element-type 'character)<br>
for n-characters = (read-sequence buffer stream)<br>
while (&lt; 0 n-characters)))<br>
)))</p>
<p dir="auto">Java code:<br>
private static final int BUFFER_SIZE = 16 * 1024 * 1024;<br>
try (InputStream in = new<br>
FileInputStream("/media/danger/OS/temp/great_expectations.iso"); ) {<br>
byte[] buff = new byte[BUFFER_SIZE];<br>
int chunkLen = -1;<br>
long start = System.currentTimeMillis();<br>
while ((chunkLen = in.read(buff)) != -1) {<br>
System.out.println("chunkLen = " + chunkLen);<br>
}<br>
double duration = System.currentTimeMillis() - start;<br>
duration /= 1000;<br>
System.out.println(String.format("it took %,2f secs", duration));<br>
} catch (Exception e) {<br>
e.printStackTrace(System.out);<br>
} finally {<br>
System.out.println("Done.");<br>
}</p>
</blockquote>
</div>
<div style="white-space:normal">
<p dir="auto">Robert P. Goldman<br>
Research Fellow<br>
Smart Information Flow Technologies (d/b/a SIFT, LLC)</p>
<p dir="auto">319 N. First Ave., Suite 400<br>
Minneapolis, MN 55401</p>
<p dir="auto">Voice: (612) 326-3934<br>
Email: <a href="mailto:rpgoldman@SIFT.net" style="color:rgb(57,131,196)" target="_blank">rpgoldman@SIFT.net</a></p>
</div>
</div>
</div>
</blockquote>
</div></div></blockquote>
</div>
</body>
</html>