GH-43266: [C#] Add LargeBinary, LargeString and LargeList array types #43269

Merged
10 commits merged on Jul 19, 2024
Add LargeString, LargeBinary and LargeList array types for .NET
adamreeve committed Jul 16, 2024
commit 1280dcce6569cfcbbf77cc069f695cb28e12d6dd
156 changes: 156 additions & 0 deletions csharp/src/Apache.Arrow/Arrays/LargeBinaryArray.cs
@@ -0,0 +1,156 @@
// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

using Apache.Arrow.Types;
using System;
using System.Collections;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

namespace Apache.Arrow;

public class LargeBinaryArray : Array, IReadOnlyList<byte[]>, ICollection<byte[]>
{
public LargeBinaryArray(ArrayData data)
: base(data)
{
data.EnsureDataType(ArrowTypeId.LargeBinary);
data.EnsureBufferCount(3);
}

public LargeBinaryArray(ArrowTypeId typeId, ArrayData data)
: base(data)
{
data.EnsureDataType(typeId);
data.EnsureBufferCount(3);
}

public LargeBinaryArray(IArrowType dataType, int length,
ArrowBuffer valueOffsetsBuffer,
ArrowBuffer dataBuffer,
ArrowBuffer nullBitmapBuffer,
int nullCount = 0, int offset = 0)
: this(new ArrayData(dataType, length, nullCount, offset,
new[] { nullBitmapBuffer, valueOffsetsBuffer, dataBuffer }))
{ }

public override void Accept(IArrowArrayVisitor visitor) => Accept(this, visitor);

public ArrowBuffer ValueOffsetsBuffer => Data.Buffers[1];

public ArrowBuffer ValueBuffer => Data.Buffers[2];

public ReadOnlySpan<long> ValueOffsets => ValueOffsetsBuffer.Span.CastTo<long>().Slice(Offset, Length + 1);

public ReadOnlySpan<byte> Values => ValueBuffer.Span.CastTo<byte>();
Contributor Author

I'm not sure if we want to consider naming this something different like SmallValues to help with backwards compatibility when we eventually have something like a LargeReadOnlySpan<T> type.

This problem isn't specific to these new array types though. For PrimitiveArray for example we'll probably also want to introduce a new "Large" version of the Values ReadOnlySpan, so for consistency I think it's fine to keep calling this Values. Then we can later add something named like LargeValues to all applicable array types.

Contributor

I have to admit that I like the consistency of having the same members on both classes, but I also wonder at the value (ha ha) of exposing this at all. Someone who needs to get at the underlying buffer can already access ValueBuffer, and this span doesn't have any clear uses.

Contributor Author

Good point. Not adding this member solves the backwards compatibility problem nicely, and I don't see a great need for it either. I'll remove this.
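
The concern in this thread comes from spans being indexed by int while the large array types store 64-bit offsets. A minimal sketch of that mismatch follows; ToSpanIndex is a hypothetical helper for illustration only and is not part of Apache.Arrow. It mirrors the checked((int)...) conversions used in the diff.

```csharp
using System;

// Hypothetical helper (not part of Apache.Arrow): converts a 64-bit offset to an
// int index, throwing OverflowException instead of silently wrapping around.
static int ToSpanIndex(long offset) => checked((int)offset);

// A small offset converts without loss.
Console.WriteLine(ToSpanIndex(1_000));

try
{
    // One past int.MaxValue cannot be addressed through a ReadOnlySpan<T>.
    ToSpanIndex((long)int.MaxValue + 1);
}
catch (OverflowException)
{
    Console.WriteLine("offset does not fit in an int");
}
```

Failing loudly here seems preferable to an unchecked cast, which would wrap to a negative index and read the wrong bytes.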


[MethodImpl(MethodImplOptions.AggressiveInlining)]
public int GetValueLength(int index)
{
if (index < 0 || index >= Length)
{
throw new ArgumentOutOfRangeException(nameof(index));
}
if (!IsValid(index))
{
return 0;
}

ReadOnlySpan<long> offsets = ValueOffsets;
return checked((int)(offsets[index + 1] - offsets[index]));
}

/// <summary>
/// Get the collection of bytes, as a read-only span, at a given index in the array.
/// </summary>
/// <remarks>
/// Note that this method cannot reliably identify null values, which are indistinguishable from empty byte
/// collection values when seen in the context of this method's return type of <see cref="ReadOnlySpan{Byte}"/>.
/// Use the <see cref="Array.IsNull"/> method or the <see cref="GetBytes(int, out bool)"/> overload instead
/// to reliably determine null values.
/// </remarks>
/// <param name="index">Index at which to get bytes.</param>
/// <returns>Returns a <see cref="ReadOnlySpan{Byte}"/> object.</returns>
/// <exception cref="ArgumentOutOfRangeException">If the index is negative or beyond the length of the array.
/// </exception>
public ReadOnlySpan<byte> GetBytes(int index) => GetBytes(index, out _);

/// <summary>
/// Get the collection of bytes, as a read-only span, at a given index in the array.
/// </summary>
/// <param name="index">Index at which to get bytes.</param>
/// <param name="isNull">Set to <see langword="true"/> if the value at the given index is null.</param>
/// <returns>Returns a <see cref="ReadOnlySpan{Byte}"/> object.</returns>
/// <exception cref="ArgumentOutOfRangeException">If the index is negative or beyond the length of the array.
/// </exception>
public ReadOnlySpan<byte> GetBytes(int index, out bool isNull)
{
if (index < 0 || index >= Length)
{
throw new ArgumentOutOfRangeException(nameof(index));
}

isNull = IsNull(index);

if (isNull)
{
// Note that `return null;` is valid syntax, but would be misleading as `null` in the context of a span
// is actually returned as an empty span.
return ReadOnlySpan<byte>.Empty;
}

var offset = checked((int)ValueOffsets[index]);
return ValueBuffer.Span.Slice(offset, GetValueLength(index));
}

int IReadOnlyCollection<byte[]>.Count => Length;

byte[] IReadOnlyList<byte[]>.this[int index] => GetBytes(index).ToArray();

IEnumerator<byte[]> IEnumerable<byte[]>.GetEnumerator()
{
for (int index = 0; index < Length; index++)
{
yield return GetBytes(index).ToArray();
}
}

IEnumerator IEnumerable.GetEnumerator() => ((IEnumerable<byte[]>)this).GetEnumerator();

int ICollection<byte[]>.Count => Length;
bool ICollection<byte[]>.IsReadOnly => true;
void ICollection<byte[]>.Add(byte[] item) => throw new NotSupportedException("Collection is read-only.");
bool ICollection<byte[]>.Remove(byte[] item) => throw new NotSupportedException("Collection is read-only.");
void ICollection<byte[]>.Clear() => throw new NotSupportedException("Collection is read-only.");

bool ICollection<byte[]>.Contains(byte[] item)
{
for (int index = 0; index < Length; index++)
{
if (GetBytes(index).SequenceEqual(item))
return true;
}

return false;
}

void ICollection<byte[]>.CopyTo(byte[][] array, int arrayIndex)
{
for (int srcIndex = 0, destIndex = arrayIndex; srcIndex < Length; srcIndex++, destIndex++)
{
array[destIndex] = GetBytes(srcIndex).ToArray();
}
}
}
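
The lookup performed by GetBytes above can be sketched without Arrow at all. The example below is only an illustration under stated assumptions: the valid, offsets and data arrays are hypothetical stand-ins for the validity bitmap, the 64-bit offsets buffer and the value buffer, and this local GetBytes function is not the library method.

```csharp
using System;

// Hypothetical stand-ins for the three buffers of a large binary array:
// one validity flag per slot, Length + 1 64-bit offsets, and the concatenated bytes.
bool[] valid = { true, false, true };
long[] offsets = { 0, 3, 3, 5 };
byte[] data = { 1, 2, 3, 4, 5 };

ReadOnlySpan<byte> GetBytes(int index, out bool isNull)
{
    if (index < 0 || index >= valid.Length)
    {
        throw new ArgumentOutOfRangeException(nameof(index));
    }

    isNull = !valid[index];
    if (isNull)
    {
        // A null slot is reported through isNull; the span itself is just empty.
        return ReadOnlySpan<byte>.Empty;
    }

    // Offsets are 64-bit, but a span is indexed by int, so convert with overflow checks.
    int start = checked((int)offsets[index]);
    int length = checked((int)(offsets[index + 1] - offsets[index]));
    return data.AsSpan(start, length);
}

ReadOnlySpan<byte> first = GetBytes(0, out bool firstIsNull);
Console.WriteLine($"{first.Length} {firstIsNull}");   // 3 False

ReadOnlySpan<byte> second = GetBytes(1, out bool secondIsNull);
Console.WriteLine($"{second.Length} {secondIsNull}"); // 0 True
```

The out parameter is what lets a caller tell a null slot from an empty value, since both come back as an empty span.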
97 changes: 97 additions & 0 deletions csharp/src/Apache.Arrow/Arrays/LargeListArray.cs
@@ -0,0 +1,97 @@
// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

using System;
using Apache.Arrow.Types;

namespace Apache.Arrow
{
public class LargeListArray : Array
{
public IArrowArray Values { get; }

public ArrowBuffer ValueOffsetsBuffer => Data.Buffers[1];

public ReadOnlySpan<long> ValueOffsets => ValueOffsetsBuffer.Span.CastTo<long>().Slice(Offset, Length + 1);

public LargeListArray(IArrowType dataType, int length,
ArrowBuffer valueOffsetsBuffer, IArrowArray values,
ArrowBuffer nullBitmapBuffer, int nullCount = 0, int offset = 0)
: this(new ArrayData(dataType, length, nullCount, offset,
new[] { nullBitmapBuffer, valueOffsetsBuffer }, new[] { values.Data }),
values)
{
}

public LargeListArray(ArrayData data)
: this(data, ArrowArrayFactory.BuildArray(data.Children[0]))
{
}

private LargeListArray(ArrayData data, IArrowArray values) : base(data)
{
data.EnsureBufferCount(2);
data.EnsureDataType(ArrowTypeId.LargeList);
Values = values;
}

public override void Accept(IArrowArrayVisitor visitor) => Accept(this, visitor);

public int GetValueLength(int index)
{
if (index < 0 || index >= Length)
{
throw new ArgumentOutOfRangeException(nameof(index));
}

if (IsNull(index))
{
return 0;
}

ReadOnlySpan<long> offsets = ValueOffsets;
return checked((int)(offsets[index + 1] - offsets[index]));
}

public IArrowArray GetSlicedValues(int index)
{
if (index < 0 || index >= Length)
{
throw new ArgumentOutOfRangeException(nameof(index));
}

if (IsNull(index))
{
return null;
}

if (!(Values is Array array))
{
return default;
}

return array.Slice(checked((int)ValueOffsets[index]), GetValueLength(index));
}

protected override void Dispose(bool disposing)
{
if (disposing)
{
Values?.Dispose();
}
base.Dispose(disposing);
}
}
}
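
GetSlicedValues above relies on the same offsets convention: list i covers the child values from offsets[i] up to, but not including, offsets[i + 1]. The sketch below shows that idea with plain arrays; values, offsets and this local GetSlicedValues function are hypothetical names used for illustration, and an ArraySegment stands in for the sliced child array.

```csharp
using System;

// Hypothetical flattened child values and 64-bit list offsets:
// list 0 = [10, 20], list 1 = [], list 2 = [30, 40, 50].
int[] values = { 10, 20, 30, 40, 50 };
long[] offsets = { 0, 2, 2, 5 };

ArraySegment<int> GetSlicedValues(int index)
{
    if (index < 0 || index >= offsets.Length - 1)
    {
        throw new ArgumentOutOfRangeException(nameof(index));
    }

    // Convert the 64-bit offsets with overflow checks before slicing.
    int start = checked((int)offsets[index]);
    int length = checked((int)(offsets[index + 1] - offsets[index]));
    return new ArraySegment<int>(values, start, length);
}

Console.WriteLine(string.Join(", ", GetSlicedValues(2))); // 30, 40, 50
Console.WriteLine(GetSlicedValues(1).Count);              // 0
```

The slice is a view over the shared child values rather than a copy, which is also why the list array in the diff disposes its Values when it is disposed.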
113 changes: 113 additions & 0 deletions csharp/src/Apache.Arrow/Arrays/LargeStringArray.cs
@@ -0,0 +1,113 @@
// Licensed to the Apache Software Foundation (ASF) under one or more
// contributor license agreements. See the NOTICE file distributed with
// this work for additional information regarding copyright ownership.
// The ASF licenses this file to You under the Apache License, Version 2.0
// (the "License"); you may not use this file except in compliance with
// the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;
using Apache.Arrow.Types;

namespace Apache.Arrow;

public class LargeStringArray : LargeBinaryArray, IReadOnlyList<string>, ICollection<string>
{
public static readonly Encoding DefaultEncoding = StringArray.DefaultEncoding;

public LargeStringArray(ArrayData data)
: base(ArrowTypeId.LargeString, data) { }

public LargeStringArray(int length,
ArrowBuffer valueOffsetsBuffer,
ArrowBuffer dataBuffer,
ArrowBuffer nullBitmapBuffer,
int nullCount = 0, int offset = 0)
: this(new ArrayData(LargeStringType.Default, length, nullCount, offset,
new[] { nullBitmapBuffer, valueOffsetsBuffer, dataBuffer }))
{ }

public override void Accept(IArrowArrayVisitor visitor) => Accept(this, visitor);

/// <summary>
/// Get the string value at the given index
/// </summary>
/// <param name="index">Input index</param>
/// <param name="encoding">Optional: the string encoding; defaults to <see cref="DefaultEncoding"/> (UTF-8)</param>
/// <returns>The string object at the given index</returns>
public string GetString(int index, Encoding encoding = default)
{
encoding ??= DefaultEncoding;

ReadOnlySpan<byte> bytes = GetBytes(index, out bool isNull);

if (isNull)
{
return null;
}

if (bytes.Length == 0)
{
return string.Empty;
}

unsafe
{
fixed (byte* data = &MemoryMarshal.GetReference(bytes))
{
return encoding.GetString(data, bytes.Length);
}
}
}


int IReadOnlyCollection<string>.Count => Length;

string IReadOnlyList<string>.this[int index] => GetString(index);

IEnumerator<string> IEnumerable<string>.GetEnumerator()
{
for (int index = 0; index < Length; index++)
{
yield return GetString(index);
}
}

IEnumerator IEnumerable.GetEnumerator() => ((IEnumerable<string>)this).GetEnumerator();

int ICollection<string>.Count => Length;
bool ICollection<string>.IsReadOnly => true;
void ICollection<string>.Add(string item) => throw new NotSupportedException("Collection is read-only.");
bool ICollection<string>.Remove(string item) => throw new NotSupportedException("Collection is read-only.");
void ICollection<string>.Clear() => throw new NotSupportedException("Collection is read-only.");

bool ICollection<string>.Contains(string item)
{
for (int index = 0; index < Length; index++)
{
if (GetString(index) == item)
return true;
}

return false;
}

void ICollection<string>.CopyTo(string[] array, int arrayIndex)
{
for (int srcIndex = 0, destIndex = arrayIndex; srcIndex < Length; srcIndex++, destIndex++)
{
array[destIndex] = GetString(srcIndex);
}
}
}
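
GetString above keeps three cases apart: a null slot returns null, a zero-length value returns string.Empty, and anything else is decoded with the chosen encoding. A minimal sketch of that branching is shown below; DecodeSlot and its parameters are hypothetical, and it uses the array-based Encoding.GetString overload rather than the pointer-based call in the diff.

```csharp
#nullable enable
using System;
using System.Text;

// Hypothetical helper (not part of Apache.Arrow): decodes one slot while keeping
// null and empty values distinct.
static string? DecodeSlot(byte[] buffer, int start, int length, bool isNull, Encoding? encoding = null)
{
    // Fall back to UTF-8 when no encoding is supplied.
    encoding ??= Encoding.UTF8;

    if (isNull)
    {
        return null;
    }

    if (length == 0)
    {
        return string.Empty;
    }

    return encoding.GetString(buffer, start, length);
}

byte[] utf8 = Encoding.UTF8.GetBytes("large");

Console.WriteLine(DecodeSlot(utf8, 0, utf8.Length, isNull: false));       // large
Console.WriteLine(DecodeSlot(utf8, 0, 0, isNull: false) == string.Empty); // True
Console.WriteLine(DecodeSlot(utf8, 0, 0, isNull: true) is null);          // True
```

Checking isNull before the length matters because a null slot also has length zero.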
3 changes: 3 additions & 0 deletions csharp/src/Apache.Arrow/Types/IArrowType.cs
@@ -53,6 +53,9 @@ public enum ArrowTypeId
BinaryView,
StringView,
ListView,
LargeList,
LargeBinary,
LargeString,
}

public interface IArrowType