Skip to content
Merged
Changes from 5 commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
ea46579
Cache UTF-8 partials of ConhostConnection output pipe
german-one Jun 30, 2019
7c83c38
Revert "Cache UTF-8 partials of ConhostConnection output pipe"
german-one Jul 6, 2019
7a4a814
Fix for UTF-8 partials in functions `ConhostConnection::_OutputThread…
german-one Jul 6, 2019
a424a16
Revert "Fix for UTF-8 partials in functions `ConhostConnection::_Outp…
german-one Jul 6, 2019
85b5509
Fix for UTF-8 partials in function `ConhostConnection::_OutputThread`
german-one Jul 6, 2019
5780770
Fix for UTF-8 partials in function `ConhostConnection::_OutputThread`
german-one Jul 6, 2019
d0e6e82
Fix for UTF-8 partials in function `ConhostConnection::_OutputThread`
german-one Jul 6, 2019
d98c589
Fix for UTF-8 partials in function `ConhostConnection::_OutputThread`
german-one Jul 7, 2019
7df6609
Fix for UTF-8 partials in function `ApiRoutines::WriteConsoleOutputCh…
german-one Jul 7, 2019
507f526
Fix for UTF-8 partials in function `ConhostConnection::_OutputThread`
german-one Jul 7, 2019
7d8c4b1
Fix for UTF-8 partials in function `ConhostConnection::_OutputThread`
german-one Jul 9, 2019
f08f31c
Fix for UTF-8 partials in function `ConhostConnection::_OutputThread`
german-one Jul 9, 2019
bb299d7
Fix for UTF-8 partials in function `ConhostConnection::_OutputThread`
german-one Jul 9, 2019
c395f3b
Fix for UTF-8 partials in function `ConhostConnection::_OutputThread`
german-one Jul 9, 2019
23c4286
Utf8OutPipeReader class added
german-one Jul 11, 2019
99d0713
Utf8OutPipeReader class added
german-one Jul 12, 2019
c877edd
Utf8OutPipeReader class added
german-one Jul 12, 2019
e35a78f
Utf8OutPipeReader class added
german-one Jul 12, 2019
31db29b
Unit Test added
german-one Jul 14, 2019
d09ac87
use specific macros and WIL classes
german-one Jul 15, 2019
f23bcd9
avoid possible deadlock caused by unclosed pipe handle
german-one Jul 15, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
98 changes: 82 additions & 16 deletions src/cascadia/TerminalConnection/ConhostConnection.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,9 @@
#include "ConhostConnection.h"
#include "windows.h"
#include <sstream>
#include <string_view>
#include <algorithm>
#include <type_traits>
// STARTF_USESTDHANDLES is only defined in WINAPI_PARTITION_DESKTOP
// We're just gonna manually define it for this prototyping code
#ifndef STARTF_USESTDHANDLES
Expand Down Expand Up @@ -190,38 +193,101 @@ namespace winrt::Microsoft::Terminal::TerminalConnection::implementation
DWORD ConhostConnection::_OutputThread()
{
const size_t bufferSize = 4096;
BYTE buffer[bufferSize];
DWORD dwRead;
BYTE buffer[bufferSize]{ 0 };
BYTE utf8Partials[4]{ 0 }; // buffer for code units of a partial UTF-8 code point that have to be cached
DWORD dwRead{}; // length of a chunk read
DWORD dwPartialsLen{}; // number of cached UTF-8 code units
bool fSuccess{};
bool fStreamBegin{}; // indicates whether text is left in utf8Partials after the BOM check
const BYTE bitmasks[]{ 0, 0b11000000, 0b11100000, 0b11110000 }; // for comparisons after the Lead Byte was found

// check whether the UTF-8 stream has a Byte Order Mark
fSuccess = !!ReadFile(_outPipe.get(), utf8Partials, 3UL, &dwPartialsLen, nullptr);
if (!fSuccess)
{
if (_closing.load())
{
// This is okay, break out to kill the thread
return 0;
}

_disconnectHandlers();
return (DWORD)-1;
}

if (utf8Partials[0] == 0xEF && utf8Partials[1] == 0xBB && utf8Partials[2] == 0xBF)
{
dwPartialsLen = 0; // discard the BOM
}
else
{
fStreamBegin = true;
}

// process the data of the standard input in a loop
while (true)
{
dwRead = 0;
bool fSuccess = false;
// copy UTF-8 code units that were remaining from the previously read chunk (if any)
if (dwPartialsLen != 0)
{
std::move(utf8Partials, utf8Partials + dwPartialsLen, buffer);
}

fSuccess = !!ReadFile(_outPipe.get(), buffer, bufferSize, &dwRead, nullptr);
if (!fSuccess)
// try to read data
fSuccess = !!ReadFile(_outPipe.get(), &buffer[dwPartialsLen], std::extent<decltype(buffer)>::value - dwPartialsLen, &dwRead, nullptr);

dwRead += dwPartialsLen;
dwPartialsLen = 0;

if (dwRead == 0) // quit if no data has been read and no cached data was left over
{
return 0;
}
else if (!fSuccess && !fStreamBegin) // cached data without bytes for completion in the buffer
{
if (_closing.load())
{
// This is okay, break out to kill the thread
return 0;
}
else
{
_disconnectHandlers();
return (DWORD)-1;
}

_disconnectHandlers();
return (DWORD)-1;
}
if (dwRead == 0)

const BYTE* const endPtr{ buffer + dwRead };
const BYTE* backIter{ endPtr - 1 };
// If the last byte in the buffer was a byte belonging to a UTF-8 multi-byte character
if ((*backIter & 0b10000000) == 0b10000000)
{
continue;
// Check only up to 3 last bytes, if no Lead Byte was found then the byte before must be the Lead Byte and no partials are in the buffer
for (DWORD dwSequenceLen{ 1UL }, stop{ dwRead < 4UL ? dwRead : 4UL }; dwSequenceLen < stop; ++dwSequenceLen, --backIter)
{
// If Lead Byte found
if ((*backIter & 0b11000000) == 0b11000000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get some names for these bit masks? I'd be surprised if they're not somewhere else in the codebase already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least I'm not aware of any common names. Wikipedia description for the meaning of these bits.
But I'll search the code base for these values. Maybe you already defined names that could be re-used here. (Perhaps in an enum?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const byte NonAsciiBytePrefix = 0x80;
const byte ContinuationByteMask = 0xC0;
const byte ContinuationBytePrefix = 0x80;
const byte MostSignificantBitMask = 0x80;

Different names for the same value. Doesn't include all values used here.

VERIFY_IS_TRUE(parser._IsLeadByte(0xC0)); // 2 byte sequence
VERIFY_IS_TRUE(parser._IsLeadByte(0xE0)); // 3 byte sequence
VERIFY_IS_TRUE(parser._IsLeadByte(0xF0)); // 4 byte sequence

Better. But names are only in the comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm personally okay without the names 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't agree more.
Yesterday I fiddled with an enum and member names like LeadByteTwoByteSequence. Then I used these names in the array and noticed that they are ambiguous. E.g. we use value 0b11100000 as bitmask in the AND operation and compare the result with 0b11000000. In the end each of these values would need two names. For example something like MaskLeadByteTwoByteSequence and IsLeadByteThreeByteSequence for value 0b11100000. Since we can't use two names for an array element, we would need a second array to disambiguate the element names. I don't like it because it's a mess and it doesn't improve the readability imo. When I tried it yesterday I actually confused myself...
In contrast, the binary literals are quite self-explanatory.

{
// If the Lead Byte indicates that the last bytes in the buffer is a partial UTF-8 code point then cache them
if ((*backIter & bitmasks[dwSequenceLen]) != bitmasks[dwSequenceLen - 1])
{
std::move(backIter, endPtr, utf8Partials);
dwRead -= dwSequenceLen;
dwPartialsLen = dwSequenceLen;
}

break;
}
}
}
// Convert buffer to hstring
char* pchStr = (char*)(buffer);
std::string str{ pchStr, dwRead };
auto hstr = winrt::to_hstring(str);
const std::string_view strView{ reinterpret_cast<char*>(buffer), dwRead };
auto hstr{ winrt::to_hstring(strView) };

// Pass the output to our registered event handlers
_outputHandlers(hstr);

fStreamBegin = false;
}

return 0;
}
}