Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add parse(Type, str) method for parsing Types: DataTypes, Unions and UnionAlls. #52668

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

NHDaly
Copy link
Member

@NHDaly NHDaly commented Dec 29, 2023

Description

Add a method to parse: parse(::Type{<:Type}, ::AbstractString), which parses a string into a julia type object.

This replaces the only alternative previously available, which was to eval(Meta.parse(str)).

This is beneficial for two major reasons:

  1. It's much faster to directly construct the type object than to call eval(), which is a very general mechanism.
    • Around 5x-15x faster based on microbenchmarks.
  2. It's much safer for application code to construct a type than to call eval.
    • If the type string comes from external data, eval can expose the application to an injection attack. For example, given the string s = "precompile(Tuple{typeof(+), dump_all_your_secrets()})", eval(Meta.parse(s)) would invoke unwanted code.

Some microbenchmark results:

julia> VERSION
v"1.11.0-DEV.1167"

julia> using BenchmarkTools

julia> @btime eval(Meta.parse($("Main.Main.Base.Checked.Vector{Dict{Int, Base.Float64}}")))
  130.000 μs (219 allocations: 10.06 KiB)
Vector{Dict{Int64, Float64}} (alias for Array{Dict{Int64, Float64}, 1})

julia> @btime parse($(Type), $("Main.Main.Base.Checked.Vector{Dict{Int, Base.Float64}}"))
  9.083 μs (138 allocations: 6.95 KiB)
Vector{Dict{Int64, Float64}} (alias for Array{Dict{Int64, Float64}, 1})

julia> @btime eval(Meta.parse($("Main.Vector{S} where {Int<:T<:Number, S<:T}")))
  184.250 μs (220 allocations: 9.85 KiB)
Vector{S} where {Int64<:T<:Number, S<:T} (alias for Array{S, 1} where {Int64<:T<:Number, S<:T})

julia> @btime parse($Type, $("Main.Vector{S} where {Int<:T<:Number, S<:T}"))
  8.486 μs (141 allocations: 7.05 KiB)
Vector{S} where {Int64<:T<:Number, S<:T} (alias for Array{S, 1} where {Int64<:T<:Number, S<:T})

Notes:

Is this really parsing?

This capability has been discussed / requested by members of the community in the past. The answer has always been to parse+eval. For example in this thread, the OP was told to eval the string: https://discourse.julialang.org/t/parse-string-to-datatype/7118/8.

In that thread, the argument is made that this isn't "parsing," since Type objects are not literals from the parser's perspective. This really is more like parse+eval. In which case, I'm happy to use a different name for this if people feel strongly about it. But to me, it feels like parsing, so parse(Type, str) does feel like the right API for this.

I'm open to suggestions on that!

Remaining Work

I first wrote this with only DataTypes in mind. You can see the version that parses only DataTypes in the first commit, 28ee0f3. But then I realized that we should also support UnionAlls, which requires a bit more work. I'm still working through how best to do that... I could probably use some help with this, TBH.

  • The current implementation is still not correct for all UnionAll types: EDIT: I think it's correct now!
    • Vector{<:Number}
    • Main.Vector{S} where {Int<:T<:Number, S<:T}
    • typeof(+)
    • Constants in types? Array{<:Any,2}, Val{X(Y(2))}?
      • These will only be isbits types, but there's no guarantee that the constructors don't have side-effects... :/ Maybe we can use reinterpret? That seems safe.
      • ✔️ I tackled this via reinterpret, which seems it should support all the valid cases of a type string.
  • The current implementation for UnionAlls support is awkward, and an after-thought, and could be much improved.
  • I need to implement tryparse, and use that, to allow parse failures without throwing an exception.
  • more tests
  • Is it okay that the code is currently relying on Meta.parse and then walking the parsed AST?
    • That's where the bulk of the remaining CPU time goes now.. Would it be preferential to write custom parsing code straight from the string to the Types?
    • It doesn't really make sense to me to write custom code for this - what we need to do is a strict subset of what Meta.parse needs to do, and I think it's probably best to share that work as much as possible. I think that to support UnionAlls, we need the later parts of the parse-tree (e.g. the where-clause) before we can interpret the earlier parts anyway.
    • Maybe we could directly call into a subset of Meta.parse's implementation though?
  • News
  • Docs: Should this have a separate docstring, or just extend the main top-level one?

CC: @c42f - this seems very up your alley 😊 Happy new year! 🌟

@NHDaly NHDaly marked this pull request as draft December 29, 2023 21:57
@NHDaly NHDaly added needs tests Unit tests are required for this change needs docs Documentation for this change is required needs news A NEWS entry is required for this change stdlib Julia's standard library labels Dec 29, 2023
@jakobnissen
Copy link
Contributor

jakobnissen commented Dec 30, 2023

This is awesome! ♥️
What's it's semantics though? How does it handle ambiguity? What is the result of parse(Union{Int32, Int64}, "15")? And given that unions are symmetrical, what if the two types in the union were switched?

@NHDaly
Copy link
Member Author

NHDaly commented Dec 30, 2023

@jakobnissen this is about parsing type objects themselves from strings. So in your example, it would be an error, since you've provided a number, not a type.

You could do parse(Type, "Union{Int, Int32}") and you'd get the union type as the result.

Does that make sense?

EDIT: Could I adjust anything in the PR description to make this clearer?

@jakobnissen
Copy link
Contributor

No this is clear enough, it was just me who didn't read it carefully enough :)

@NHDaly NHDaly force-pushed the nhd-parse-type branch 2 times, most recently from 15100e9 to 7bab22d Compare December 31, 2023 03:35
base/parse-type.jl Outdated Show resolved Hide resolved
base/parse-type.jl Outdated Show resolved Hide resolved
end
if ast isa Expr && ast.head === :incomplete
throw(ArgumentError("invalid type expression: $str"))
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these checks for a valid ast go into _try_parse_type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed thanks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, wait, no: I left this out of that since _try_parse_type is called recursively, but this only needs to be checked once. Lemme add a comment to explain that.

base/parse-type.jl Outdated Show resolved Hide resolved
v === nothing && return nothing
v
end)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this accomplish compared to just return v?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because I don't want to return if v is not nothing; i want to continue. I should have called this @_bail_if_nothing, since if the value is nothing, it will return early, rather than continuing the function.

I'll rename it, and then can you let me know if this seems acceptable?

@tecosaur
Copy link
Contributor

tecosaur commented Mar 1, 2024

You might be interested in (for reference if nothing else) the approach I've taken in DataToolkitBase where I have a QualifiedType that represents a Type that may or may not be able to be realised at parse-time (e.g. relies on unloaded modules), that I may then construct later.

It is defined like so:

struct QualifiedType
    root::Symbol
    parents::Vector{Symbol}
    name::Symbol
    parameters::Tuple
end

And here is the parsing code (less capable than yours I suspect) https://github.com/tecosaur/DataToolkitBase.jl/blob/main/src/model/parser.jl#L5-L59, and the "make it an actual type" code
https://github.com/tecosaur/DataToolkitBase.jl/blob/main/src/model/qualifiedtype.jl#L33-L78.

This is how performance compares:

julia> @btime eval(Meta.parse("Main.Main.Base.Checked.Vector{Dict{Int, Base.Float64}}"))
  99.360 μs (176 allocations: 10.11 KiB)
Vector{Dict{Int64, Float64}} (alias for Array{Dict{Int64, Float64}, 1})

julia> @btime typeify(parse(QualifiedType, "Main.Main.Base.Checked.Vector{Dict{Int, Base.Float64}}"))
  11.224 μs (190 allocations: 10.59 KiB)
Vector{Dict{Int64, Float64}} (alias for Array{Dict{Int64, Float64}, 1})

Parse all types, including DataTypes, UnionAlls and Unions.

- Support function types: `typeof(+)`
- Support constant isbits type constructors in parse type
- Support `Vector{<:Number}`
- Fix `Vector{S} where {Int<:T<:Number, S<:T}`
- Add a bunch more automated test cases for better coverage
- Add tryparse
- Add support for names qualified by non-imported top-level packages.

Mark tests for Union{} and Tuple{} broken

- Add test case for new loaded_modules support
@NHDaly NHDaly force-pushed the nhd-parse-type branch 2 times, most recently from aa19464 to c35ad1f Compare March 26, 2024 23:09
@NHDaly NHDaly requested review from stevengj and MasonProtter March 26, 2024 23:13
@NHDaly NHDaly removed needs tests Unit tests are required for this change needs docs Documentation for this change is required labels Mar 26, 2024
@NHDaly NHDaly removed the needs news A NEWS entry is required for this change label Mar 26, 2024
@NHDaly NHDaly marked this pull request as ready for review March 26, 2024 23:42
@NHDaly
Copy link
Member Author

NHDaly commented Mar 26, 2024

Okay, I think this is ready for a real review! I would super appreciate another pass, @stevengj and @MasonProtter, and maybe a review from @topolarity, @c42f, or @LilithHafner?

The remaining open questions are I think:

  • Is it okay that this is using Meta.parse under the hood, or should it have custom parsing code? I think I'd like to stick with Meta.parse, so that it's guaranteed to match what julia would parse as. So unless there is a very good reason to do otherwise, I'd like to stick with this.
  • Should this automatically look into Base.loaded_modules to find top-level names? If you try to call parse(Type, "Random.UInt10Raw"), but Random is only an indirect dependency (e.g. you imported Test), should this parse?
    • I have currently decided no, but it's very easy to add this back in! I deleted the support for this in 883673d, so we could always add that back in if people disagree.
    • Instead, I currently require you to pass parse(Type, "Random.UInt10Raw"; module_context=Test), which will work. But you'd have to know a priori where to find it, or do something like import all the names from loaded_modules or something.
    • I'm mainly planning to use this PR in the PackageCompiler context (and a similar context for running precompile statements for PkgImages), where we will have already imported all names from loaded_modules, so we don't need this support for that. It's just about what others think is the friendliest API.
  • Is the implementation for UnionAlls reasonable? Anything you'd change?

Thanks!!

CC: @kpamnany

@LilithHafner
Copy link
Member

tryparse throws sometimes

julia> Base.tryparse(Int, "5+4")

julia> Base.tryparse(Type, "5+4")
ERROR: MethodError: no method matching reinterpret(::typeof(+), ::Tuple{Int64, Int64})
The function `reinterpret` exists, but no method is defined for this combination of argument types.
...

base/parse-type.jl Show resolved Hide resolved
@_bail_if_nothing ast.args[i] = _try_parse_type(module_context, ast.args[i], raise, type_vars)
end
# We use reinterpret to avoid evaluating code, which may have side effects.
return reinterpret(typ, Tuple(ast.args))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes the default constructor. It can be wrong, though

julia> struct Point
           x::Int32
           y::Int32
           Point(x, y=x) = new(x, y)
       end

julia> str = "Val{Point(-7)}"
"Val{Point(-7)}"

julia> eval(Meta.parse(str))
Val{Point(-7, -7)}

julia> parse(Type, str)
Val{Point(-7, -1)}

The task of constructing arbitrary values at runtime without evaluating code is tricky.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, interesting... 🤔 ..... so maybe this whole idea is not valid? :(
Maybe we could get away with it by dropping values, and documenting a restriction that we don't support struct values in the type domain, and for those, we recommend falling back to eval?

:( Sadbeans! Thanks for catching this. Do you have any other good ideas / suggestions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can't get this right, maybe this ultimately doesn't belong in base, and it should be a library similar to what @tecosaur shared.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But also also, i was mainly interested in using this for improving PackageCompiler, and it's worth noting that structs in types are currently broken in precompile statements anyway, since they're printed as kwargs for some reason!?:

precompile(Tuple{Type{Base.Val{Main.Point(x=-7, y=-7)}}})
julia> str = "precompile(Tuple{Type{Base.Val{Main.Point(x=-7, y=-7)}}})"
"precompile(Tuple{Type{Base.Val{Main.Point(x=-7, y=-7)}}})"

julia> eval(Meta.parse(str))
ERROR: MethodError: no method matching Point(; x::Int64, y::Int64)

So these cases are all broken anyway, so maybe we don't need to support them? (though i'd like to fix the way they're printed in precompiles, too...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a parse method, I think it's reasonable to only support values in type parameters if the values are embedded directly in the syntax tree (e.g. Val{5} or Val{"a"} but not Val{1+1} or Val{(1,1)}.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK if that will be enough for your use-case, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely agree I wouldn't support Val{1+1}, since that is definitely not a value embedded directly in the tree.

But isn't Val{(1,1)}? I do think we see constant tuples in our types fairly frequently, so i would want that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe special case to allow this iff the default inner constructor is accessible and the parsed string would dispatch to it?

Copy link
Member

@vtjnash vtjnash Mar 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, these print with kwarg-like syntax because it would be wrong to run the constructor to recreate them, so that is intended to dissuade the user from attempting to construct the object by doing so (as well as being a bit more more readable). Doing a ccall to the generic construction method (as done in Serialization) is probably the best approach

Co-authored-by: Lilith Orion Hafner <lilithhafner@gmail.com>
@topolarity
Copy link
Member

The biggest thing that stands out to me is that this feels to me like a deserialization and not a parse.

In particular, it has to worry about a lot of classic "serialization" problems:

  • Where are type names resolved?
  • Is the de-serializer responsible for loading unavailable types?
  • Is the serialization (i.e. string()) an invertible representation for all objects?

These questions make the meaning of a type string like this contextual, even for very "primitive" seeming type names:

module Test
   struct Union{A,B} end
   struct Float32 end
   struct Float64 end
   struct Foo
       x::Union{Float32,Float64}
   end
end

The Julia syntax here works double-duty as a serialization format (very similar to how raw Expr is stored and used in the deserializer), so that I think tying this to Base.parse(Type, ...) is not quite right.

@NHDaly
Copy link
Member Author

NHDaly commented Apr 3, 2024

@topolarity: I agree with you, but for better or worse, julia is already using string() and parse (via eval(Meta.parse())) as a serialization round-trip format for types, in the PackageCompiler precompilation process. That Meta.parse+eval is slow and unsafe, and so i'm trying to provide a safer faster route to support this.

The current process is that PackageCompiler runs your snoop workload with --trace-compile, and julia produces a bunch of precompile statements:

precompile(Tuple{typeof(Base.iterate), Base.Set{Char}})
precompile(Tuple{typeof(Base.copy), Array{ArgParse.ArgParseGroup, 1}})
precompile(Tuple{typeof(Base.pairs), NamedTuple{(:help, :arg_type), Tuple{String, DataType}}})
precompile(Tuple{typeof(Unrolled.type_length), RAICode.QueryEvaluator.var"#169#170"{String, ((), (1,))}})

then PackageCompiler parses and evals each of those types, and precompiles them, as in something like this:

ps = Meta.parse(statement)
@assert isexpr(ps, :call) && ps.args[1] == :precompile
ps = Core.eval(PrecompileStagingArea, ps)
precompile(ps...)

(We want to do this in our production servers, by having them run a set of precompile statements at startup, but I want to prevent the risk of someone having an arbitrary code-execution exploit by mutating the file that contains those statement - and also to make it faster.)

So I'm open to picking a different name for this mechanism, but i think the use of strings as a round-tripping format for types is already out there, and I'm mainly hoping to make that safer and faster. It does feel like parsing to me though: you're taking a string representation of a value, and constructing that julia value, according to julia's parse-tree.

@topolarity
Copy link
Member

Yeah, I agree with the motivation!

Better safety/performance for eval(Meta.parse(...))-style deserialization in PackageCompiler is a good thing.

My complaint was specifically about tying this to Base.parse. That API should return a valid interpretation of Julia code for all contexts where that code might appear - Even something innocent like evaluating what Float32 means breaks that rule.

On the other hand, doing eval(Meta.parse(...))-style deserialization isn't illegal on its own. Although it can be a bit fragile, I agree that ship is out of the gate. But the eval in eval(Meta.parse(...)) makes it very clear that its result depends on the Module it is evaluated in. For the round trip to work out, you have to make sure you think about that context and "match it up" between serialization/deserialization time.

I'd prefer this had a similar API that is either very clear about doing contextual evaluation (albeit much safer and more limited than eval()) or taking in the relevant context explicitly (similar to two-argument Core.eval)

@NHDaly
Copy link
Member Author

NHDaly commented Apr 16, 2024

Thanks cody. I'll have to think about what an alternative API should look like. --- Do you have any suggestions?

(FWIW, I did add a module_context=Main kwarg in my last push a few weeks ago. I agree that is a necessary part of something like this. 👍)

But yeah, i think your greater point stands that this probably isn't exactly "parsing." It's really close - it is at least parsing-adjacent. But indeed, maybe type-deserializing is a better term?

..... 😅 I keep coming back to this question in my head though: is deserializing a value from a string representation of julia source code anything other than parsing? 😅
Anyway, please help me pick a new name! 🙏 And let me know if you think this should live in a package instead.

@c42f
Copy link
Member

c42f commented Jul 17, 2024

I'm late to the party here (as usual sorry 😬)

But I totally agree with Cody. @topolarity I appreciate how well you explained why this "doesn't quite feel right as parse". I agree.

It seems that what you've implemented here is actually a restricted Julia interpreter:

  • An interpreter where only some forms are allowed
  • It operates on surface syntax, not lowered syntax
  • Some extended cases are still allowed. For example bits types.

All this makes it surprisingly complex!

Amusingly we already have other interpreters a bit like this. For example, the runtime's jl_toplevel_eval_flex() evaluates some forms without lowering them (presumably for efficiency). But unlike the code here, it defers other forms to the full lowering code.

Should we really keep a zoo of special purpose interpreters? I don't know. If it matters performance-wise I guess we have to.

That's where the bulk of the remaining CPU time goes now.. Would it be preferential to write custom parsing code straight from the string to the Types?

This seems like a lot of work, I don't recommend it 😅
Does it help performance if you use JuliaSyntax.parsestmt() directly? That should be quite fast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Julia's standard library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants