Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improves macho loader #799

Merged
merged 24 commits into from
Apr 4, 2018
Merged

Conversation

gitoleg
Copy link
Contributor

@gitoleg gitoleg commented Mar 21, 2018

Previously, only little endian binaries were expected in our macho loader, so this caused unexpected behavior with big endian binaries. This PR fix it, as well adds some more guards for reliability.

Also, this PR fixes problem with an entry point for old macho binaries: they just don't contain so-called LC_MAIN command, and entry point as part of the binary structure is absent.
But we can workaround this problem by emitting entry point as an address of a first executable section.

gitoleg added 3 commits March 21, 2018 10:11
Previously, only little endian binaries were expected
in our macho loader, so this caused unexpected behaviour
with big endian binaries. This PR fix it.

Also, this PR fixes problem with entry point for old macho
binaries: they just don't contain so called 'LC_MAIN'
command, and entry point as part of binary structure is absent.
But we can workaround this problem by emiting elntry point as
an address of a first executable section.
return;
}
}
std::vector<uint64_t> addrs;
Copy link
Member

@ivg ivg Mar 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big binaries can have quite a few sections, so I don't think that we should use such inefficient algorithm. You can use std::optional if you're afraid of using explicit sentinel values, such as max_int.

Copy link
Member

@ivg ivg Mar 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like that optional was added only in std+17, so you can't use it. So either use a sentinel or a boolean flag, or you can use error_or, or anything suitable in the LLVM support libraries.


void image_info(const macho &obj, ogre_doc &s) {
if (!is_exec(obj))
s.raw_entry("(entry 0)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it 0, but not the lowest address? If it is a library then will this take into account the rebasing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it equals to zero because we use relative addresses. so after rebasing entry point will be equal to the lowest address

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha

return;
}
}
uint64_t entry = INT_MAX;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The INT_MAX is not the maximum value of the uint64_t type, it is

#include <limit>
std::numeric_limits<uint64_t>::max();

Copy link
Member

@ivg ivg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please gather all byteswapping routines in one function and call it more explicitly. Otherwise, there are lots of byte order checks, that makes it hard to understand whether you use right byte order in all places and in all fields of all structures. Let's make it a little bit more organized.


void image_info(const macho &obj, ogre_doc &s) {
if (!is_exec(obj))
s.raw_entry("(entry 0)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha

@gitoleg
Copy link
Contributor Author

gitoleg commented Mar 26, 2018

So, finally, I wrote swap for the relocation_info struct and it is really better (and correct) than it was earler. Also added few comments for future readers.

return;
}
}
uint64_t max = std::numeric_limits<uint64_t>::max();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although you're trying to introduce a shorter alias to the max value of the int64_t domain, you're actually introducing a variable. That makes reasoning about the code harder, since now a reader needs to thing whether you're going to change the max value. Especially, since in the max_element and min_element algorithm implementation the searched value is usually called max or min.

Solution, use the using declaration, e.g., using std::numeric_limits<uint64_t>::max or mark max as const or constexpr.

uint64_t entry = max;
for (auto sec : prim::sections(obj)) {
auto addr = prim::section_address(sec);
if (!addr) continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an unnecessary convolution of the control flow,

if (addr && (section_flags(obj, sec) & Mach)::S_ATTR_PURE_INSTRUCTIONS)) {
   entry = std::min(*addr, entry)

in general, please try not to use the continue statement, it is a reminiscent of the foul past of C++.

auto it = get_symbol(obj, sym_num);
if (it == prim::end_symbols(obj)) return;
if (is_external(obj, *it)) {
if (is_external(obj, *it))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it dereferencable here? My intuition tells no.

MachO::relocation_info rel;
if (obj.isLittleEndian() != sys::IsLittleEndianHost) {
MachO::any_relocation_info any;
memcpy(&any, ptr, sizeof(MachO::any_relocation_info));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please, don't mix C and C++, use std::copy or std::copy_n

const char *ptr = prim::get_raw_data(obj);
std::size_t next = off;
MachO::relocation_info get_rel(const macho &obj, uint32_t off) {
const char *ptr = prim::get_raw_data(obj) + off;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we have a truncated object?

const MachO::relocation_info *rel = reinterpret_cast<const MachO::relocation_info *>(ptr + next);
if (rel->r_address & MachO::R_SCATTERED) {
MachO::relocation_info rel = get_rel(obj, next);
if (rel.r_address & MachO::R_SCATTERED) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rel.r_address & MachO::R_SCATTERED doesn't imply that the relocation rel is scattered, it also depends on the architecture, e.g., for x86_64 and arm64 the relocation would be an immediate value, i.e., 64 bit, so this iteration will go off track in case of 64 bit system. If needed I can give the same testing binary for the x86_64 arch.

There is a isRelocationScattered member-function of the MachOObjectFile class that can tell you for sure if it scattered or now. Is there any good reason why you are not using this function? Also there are plenty of relocation related member functions that apparently takes the endianness into account. Why aren't you using them?


uint64_t index = n + j;
uint64_t offset = dlc.indirectsymoff + index * sizeof(uint32_t);
if (offset > obj_size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if offset is equal to obj_size?

@@ -351,13 +387,19 @@ void indirect_symbols(const macho &obj, const MachO::dysymtab_command &dlc, ogre
} // sections
}

// This part is mainly for bundles, since it's not enough just to iterate over
// sections in order to get relocations, e.g. like we do it for MH_OBJECT files.
// We are going to take a look only to indirect symbols and external relocations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/to/for

// sections in order to get relocations, e.g. like we do it for MH_OBJECT files.
// We are going to take a look only to indirect symbols and external relocations.
// Speaking about local relocations references from dyn sym tab, even modern (6.0)
// does not perform a lot of job about them, so let's do the same. One cat take a look at
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/cat/can

next = next + sizeof(MachO::relocation_info);
}
}
}

void indirect_symbols(const macho &obj, const MachO::dysymtab_command &dlc, ogre_doc &s) {

auto base = image_base(obj);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from here and below there is a lot of voluntarism with the handling of the raw base pointer. Please, use some smart pointer, that will bound check every dereference. If you don't find any in C++ stdlib or LLVM, then implement your own.

@gitoleg
Copy link
Contributor Author

gitoleg commented Mar 29, 2018

I will try to avoid to use the word finally) but anyway I like this implementation.

Summary

I added more comments, and really hope they will be helpful. Also added more guards and checks, so bug shall not pass, and refactored and simplified the code. There is only one place when we work directly with a raw pointer to object data, and I guarded it with checks too.

Relocations structures and endianess.

It turned out, that we should not use that relocation_info structure at all. I wrote a detailed explanation directly in code, but briefly - there are not any easy ways to read bit fields from it. We have to use masks, swaps, shifts. And actually, everything is already done in llvm, so we just need to use it.

In the end, I checked everything on different files - little and big endian, and now I can say that it works like we expect.

const char *ptr = prim::get_raw_data(obj);
std::size_t next = off;
template <typename T>
error_or<T> get_data(const macho &obj, uint64_t off) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generic function is extremely unsafe and error prone. First of all, it implies that the constructed object is POD, and that you can default construct it, and then just fill in the bytes with its size. It is very easy to use this function incorrectly, so please, remove this function.

Fortunately, you're using it only once to get the relocation info, so you can do in place, probably without doing this ugly copies with a simple aggregate initialization, after checking that you are not going out of bounds.

template <typename T>
error_or<T> get_data(const macho &obj, uint64_t off) {
uint64_t obj_size = obj.getData().size();
if (off + sizeof(T) > obj_size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the future, this is a dimension error - you're comparing offsets with sizes. Given that offset is a difference between two points and size is an absolute coordinate it is usually an error.

In your case it works, because there is an implicit obj_size + 0 that makes your comparison valid.

// like getPlainRelocationSymbolNum
error_or<MachO::any_relocation_info> get_rel(const macho &obj, uint32_t off) {
auto any = get_data<MachO::any_relocation_info>(obj, off);
if (!any) return any;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please try to write functions that are single entry single exit whenever it is possible. It is much easier to understand.

if (!any) return any;
if (obj.isLittleEndian() != sys::IsLittleEndianHost)
MachO::swapStruct(*any);
if (!obj.is64Bit() && obj.isRelocationScattered(*any)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like that the isRelocationScattered doesn't take into account the endianness, though many other member functions do, e.g., getPlainRelocatoonSymbolNum, getPlainRelocationExternal, etc. That means, that the isRelocationScattered function is buggy, and we're relying on its buggy behavior, that might be fixed in later versions, that will break our code. Moreover, you're already do not trust this function, since the CPU check inside of it is wrong, so you basically need to put your own check here. Given that out of 3 operations in the isRelocationScattered function, 2 are wrong, it would be better no to use this function at all.


// Size in bytes of relocation entry. It coulbe either size of
// relocation_info, scattered_relocation_info, any_relocation_info
#define RELOCATION_ENTRY_SIZE 8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use define in C++ to define constants, especially in headers. Use C++ constants, consexpressions and make sure that they are in the anonymous namespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, you're using two source of information about the relocation size, this constant, and still sizeof(MachO::any_relocation_info), when you're extracting it with the get_data function.

// runtime LLVM_ERROR
uint64_t offset = dlc.indirectsymoff + index * sizeof(uint32_t);
symbol_iterator s = prim::end_symbols(obj);
if (offset < obj_size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, you can't compare offsets and sizes.

uint64_t offset = dlc.indirectsymoff + index * sizeof(uint32_t);
symbol_iterator s = prim::end_symbols(obj);
if (offset < obj_size)
s = get_symbol(obj, obj.getIndirectSymbolTableEntry(dlc, index));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of the pattern:

x = invalid;
if (valid(s)) {
  x = make(s);
  return x;
}
return x;

you can use much more safe and readable pattern:

if (valid(s)) 
    return make(s);
else
   return invalid;

This will make your code easier to understand (you do not need to track variables and their flow) and sustainable to changes (the invalid value doesn't roam in the scope waiting to be dereferenced).

}

// returns stride for indirect symbols entries
uint32_t indirect_symbols_stride(const macho &obj, const SectionRef &sec) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an easier to understand and less error prone version:

uint32_t indirect_symbols_stride(const macho &obj, const SectionRef &sec) {
    if (section_type(obj, sec) == MachO::S_SYMBOL_STUBS)
        return section_reserved2(obj, sec);
    else
        return obj.is64Bit() ? 8 : 4;
}

// true for symbol stubs or symbol pointers sections
bool contains_sym_stubs(const macho &obj, const SectionRef &sec) {
auto typ = section_type(obj, sec);
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you do not need to put parentheses around the return expression. You can also use switch/case

switch (section_type(obj, sec)) {
  case X:
  case Y:  
  case Z:   return true;
  default:  return false;
}

That's is very close to OCaml's match.

@@ -376,7 +461,13 @@ void symbols(const macho &obj, ogre_doc &s) {
symbols(obj, sizes, s);
}

// It's safe to call getSymtabLoadCommand without any checks,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it true for all versions? I remember that somewhere in LLVM 3.4 we were getting a segmentation fault when we were asking for a symtab. It might be in ELF backend though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's true for llvm versions >= 3.8

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, for the record, in case of 3.4 this function will not be used


bool is_rel_extern(const macho &obj, const MachO::any_relocation_info &rel) {
if (obj.isLittleEndian())
return (rel.r_word1 >> 27) & 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bless me father for I have sinned :)

@ivg ivg merged commit 21c47fa into BinaryAnalysisPlatform:master Apr 4, 2018
@gitoleg gitoleg deleted the fix-macho-loader branch April 17, 2018 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants