-
Notifications
You must be signed in to change notification settings - Fork 274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improves macho loader #799
Conversation
Previously, only little endian binaries were expected in our macho loader, so this caused unexpected behaviour with big endian binaries. This PR fix it. Also, this PR fixes problem with entry point for old macho binaries: they just don't contain so called 'LC_MAIN' command, and entry point as part of binary structure is absent. But we can workaround this problem by emiting elntry point as an address of a first executable section.
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
return; | ||
} | ||
} | ||
std::vector<uint64_t> addrs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big binaries can have quite a few sections, so I don't think that we should use such inefficient algorithm. You can use std::optional
if you're afraid of using explicit sentinel values, such as max_int
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like that optional
was added only in std+17, so you can't use it. So either use a sentinel or a boolean flag, or you can use error_or, or anything suitable in the LLVM support libraries.
|
||
void image_info(const macho &obj, ogre_doc &s) { | ||
if (!is_exec(obj)) | ||
s.raw_entry("(entry 0)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it 0, but not the lowest address? If it is a library then will this take into account the rebasing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it equals to zero because we use relative addresses. so after rebasing entry point will be equal to the lowest address
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
return; | ||
} | ||
} | ||
uint64_t entry = INT_MAX; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The INT_MAX
is not the maximum value of the uint64_t
type, it is
#include <limit>
std::numeric_limits<uint64_t>::max();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please gather all byteswapping routines in one function and call it more explicitly. Otherwise, there are lots of byte order checks, that makes it hard to understand whether you use right byte order in all places and in all fields of all structures. Let's make it a little bit more organized.
|
||
void image_info(const macho &obj, ogre_doc &s) { | ||
if (!is_exec(obj)) | ||
s.raw_entry("(entry 0)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha
So, finally, I wrote |
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
return; | ||
} | ||
} | ||
uint64_t max = std::numeric_limits<uint64_t>::max(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although you're trying to introduce a shorter alias to the max value of the int64_t
domain, you're actually introducing a variable. That makes reasoning about the code harder, since now a reader needs to thing whether you're going to change the max
value. Especially, since in the max_element
and min_element
algorithm implementation the searched value is usually called max
or min
.
Solution, use the using
declaration, e.g., using std::numeric_limits<uint64_t>::max
or mark max as const
or constexpr
.
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
uint64_t entry = max; | ||
for (auto sec : prim::sections(obj)) { | ||
auto addr = prim::section_address(sec); | ||
if (!addr) continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an unnecessary convolution of the control flow,
if (addr && (section_flags(obj, sec) & Mach)::S_ATTR_PURE_INSTRUCTIONS)) {
entry = std::min(*addr, entry)
in general, please try not to use the continue
statement, it is a reminiscent of the foul past of C++.
auto it = get_symbol(obj, sym_num); | ||
if (it == prim::end_symbols(obj)) return; | ||
if (is_external(obj, *it)) { | ||
if (is_external(obj, *it)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it
dereferencable here? My intuition tells no.
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
MachO::relocation_info rel; | ||
if (obj.isLittleEndian() != sys::IsLittleEndianHost) { | ||
MachO::any_relocation_info any; | ||
memcpy(&any, ptr, sizeof(MachO::any_relocation_info)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please, don't mix C and C++, use std::copy
or std::copy_n
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
const char *ptr = prim::get_raw_data(obj); | ||
std::size_t next = off; | ||
MachO::relocation_info get_rel(const macho &obj, uint32_t off) { | ||
const char *ptr = prim::get_raw_data(obj) + off; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we have a truncated object?
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
const MachO::relocation_info *rel = reinterpret_cast<const MachO::relocation_info *>(ptr + next); | ||
if (rel->r_address & MachO::R_SCATTERED) { | ||
MachO::relocation_info rel = get_rel(obj, next); | ||
if (rel.r_address & MachO::R_SCATTERED) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rel.r_address & MachO::R_SCATTERED
doesn't imply that the relocation rel
is scattered, it also depends on the architecture, e.g., for x86_64 and arm64 the relocation would be an immediate value, i.e., 64 bit, so this iteration will go off track in case of 64 bit system. If needed I can give the same testing binary for the x86_64 arch.
There is a isRelocationScattered
member-function of the MachOObjectFile
class that can tell you for sure if it scattered or now. Is there any good reason why you are not using this function? Also there are plenty of relocation related member functions that apparently takes the endianness into account. Why aren't you using them?
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
|
||
uint64_t index = n + j; | ||
uint64_t offset = dlc.indirectsymoff + index * sizeof(uint32_t); | ||
if (offset > obj_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if offset
is equal to obj_size
?
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
@@ -351,13 +387,19 @@ void indirect_symbols(const macho &obj, const MachO::dysymtab_command &dlc, ogre | |||
} // sections | |||
} | |||
|
|||
// This part is mainly for bundles, since it's not enough just to iterate over | |||
// sections in order to get relocations, e.g. like we do it for MH_OBJECT files. | |||
// We are going to take a look only to indirect symbols and external relocations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/to/for
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
// sections in order to get relocations, e.g. like we do it for MH_OBJECT files. | ||
// We are going to take a look only to indirect symbols and external relocations. | ||
// Speaking about local relocations references from dyn sym tab, even modern (6.0) | ||
// does not perform a lot of job about them, so let's do the same. One cat take a look at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/cat/can
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
next = next + sizeof(MachO::relocation_info); | ||
} | ||
} | ||
} | ||
|
||
void indirect_symbols(const macho &obj, const MachO::dysymtab_command &dlc, ogre_doc &s) { | ||
|
||
auto base = image_base(obj); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from here and below there is a lot of voluntarism with the handling of the raw base
pointer. Please, use some smart pointer, that will bound check every dereference. If you don't find any in C++ stdlib or LLVM, then implement your own.
I will try to avoid to use the word SummaryI added more comments, and really hope they will be helpful. Also added more guards and checks, so bug Relocations structures and endianess.It turned out, that we should not use that In the end, I checked everything on different files - little and big endian, and now I can say that it works like we expect. |
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
const char *ptr = prim::get_raw_data(obj); | ||
std::size_t next = off; | ||
template <typename T> | ||
error_or<T> get_data(const macho &obj, uint64_t off) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generic function is extremely unsafe and error prone. First of all, it implies that the constructed object is POD, and that you can default construct it, and then just fill in the bytes with its size. It is very easy to use this function incorrectly, so please, remove this function.
Fortunately, you're using it only once to get the relocation info, so you can do in place, probably without doing this ugly copies with a simple aggregate initialization, after checking that you are not going out of bounds.
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
template <typename T> | ||
error_or<T> get_data(const macho &obj, uint64_t off) { | ||
uint64_t obj_size = obj.getData().size(); | ||
if (off + sizeof(T) > obj_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the future, this is a dimension error - you're comparing offsets with sizes. Given that offset is a difference between two points and size is an absolute coordinate it is usually an error.
In your case it works, because there is an implicit obj_size + 0
that makes your comparison valid.
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
// like getPlainRelocationSymbolNum | ||
error_or<MachO::any_relocation_info> get_rel(const macho &obj, uint32_t off) { | ||
auto any = get_data<MachO::any_relocation_info>(obj, off); | ||
if (!any) return any; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please try to write functions that are single entry single exit whenever it is possible. It is much easier to understand.
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
if (!any) return any; | ||
if (obj.isLittleEndian() != sys::IsLittleEndianHost) | ||
MachO::swapStruct(*any); | ||
if (!obj.is64Bit() && obj.isRelocationScattered(*any)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like that the isRelocationScattered
doesn't take into account the endianness, though many other member functions do, e.g., getPlainRelocatoonSymbolNum
, getPlainRelocationExternal
, etc. That means, that the isRelocationScattered
function is buggy, and we're relying on its buggy behavior, that might be fixed in later versions, that will break our code. Moreover, you're already do not trust this function, since the CPU check inside of it is wrong, so you basically need to put your own check here. Given that out of 3 operations in the isRelocationScattered
function, 2 are wrong, it would be better no to use this function at all.
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
|
||
// Size in bytes of relocation entry. It coulbe either size of | ||
// relocation_info, scattered_relocation_info, any_relocation_info | ||
#define RELOCATION_ENTRY_SIZE 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use define in C++ to define constants, especially in headers. Use C++ constants, consexpressions and make sure that they are in the anonymous namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, you're using two source of information about the relocation size, this constant, and still sizeof(MachO::any_relocation_info)
, when you're extracting it with the get_data
function.
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
// runtime LLVM_ERROR | ||
uint64_t offset = dlc.indirectsymoff + index * sizeof(uint32_t); | ||
symbol_iterator s = prim::end_symbols(obj); | ||
if (offset < obj_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, you can't compare offsets and sizes.
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
uint64_t offset = dlc.indirectsymoff + index * sizeof(uint32_t); | ||
symbol_iterator s = prim::end_symbols(obj); | ||
if (offset < obj_size) | ||
s = get_symbol(obj, obj.getIndirectSymbolTableEntry(dlc, index)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of the pattern:
x = invalid;
if (valid(s)) {
x = make(s);
return x;
}
return x;
you can use much more safe and readable pattern:
if (valid(s))
return make(s);
else
return invalid;
This will make your code easier to understand (you do not need to track variables and their flow) and sustainable to changes (the invalid value doesn't roam in the scope waiting to be dereferenced).
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
} | ||
|
||
// returns stride for indirect symbols entries | ||
uint32_t indirect_symbols_stride(const macho &obj, const SectionRef &sec) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an easier to understand and less error prone version:
uint32_t indirect_symbols_stride(const macho &obj, const SectionRef &sec) {
if (section_type(obj, sec) == MachO::S_SYMBOL_STUBS)
return section_reserved2(obj, sec);
else
return obj.is64Bit() ? 8 : 4;
}
lib/bap_llvm/llvm_macho_loader.hpp
Outdated
// true for symbol stubs or symbol pointers sections | ||
bool contains_sym_stubs(const macho &obj, const SectionRef &sec) { | ||
auto typ = section_type(obj, sec); | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you do not need to put parentheses around the return expression. You can also use switch/case
switch (section_type(obj, sec)) {
case X:
case Y:
case Z: return true;
default: return false;
}
That's is very close to OCaml's match.
@@ -376,7 +461,13 @@ void symbols(const macho &obj, ogre_doc &s) { | |||
symbols(obj, sizes, s); | |||
} | |||
|
|||
// It's safe to call getSymtabLoadCommand without any checks, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it true for all versions? I remember that somewhere in LLVM 3.4 we were getting a segmentation fault when we were asking for a symtab. It might be in ELF backend though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's true for llvm versions >= 3.8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, for the record, in case of 3.4 this function will not be used
|
||
bool is_rel_extern(const macho &obj, const MachO::any_relocation_info &rel) { | ||
if (obj.isLittleEndian()) | ||
return (rel.r_word1 >> 27) & 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bless me father for I have sinned :)
Previously, only little endian binaries were expected in our macho loader, so this caused unexpected behavior with big endian binaries. This PR fix it, as well adds some more guards for reliability.
Also, this PR fixes problem with an entry point for old macho binaries: they just don't contain so-called
LC_MAIN
command, and entry point as part of the binary structure is absent.But we can workaround this problem by emitting entry point as an address of a first executable section.