How to distinguish 'syscall' from 'int 80h' when using ptrace

Question

As far as I know, ptrace can only obtain the syscall number via PTRACE_SYSCALL, but syscall numbers differ between x86 and x64. Is there any way to find out which of the two a syscall actually originated from?

I am writing a program that restricts other programs' syscalls by syscall number. I know the syscall numbers on both x86 and x64, but some programs use 'int 80h' instead of 'syscall', which lets them do dangerous things that I have restricted on x64. For example, I banned fork() on x64; they can issue 'int 80h' with number 2 (fork()), and I assume they are issuing 'syscall' with number 2 (open()), so they can bypass the limit. Although ptrace can trace both of them and report the syscall number, I cannot tell which mechanism the syscall actually came from.

nh2 · Answer 1 · 2019-02-09T01:55:35.087


Looks like as of writing (2019-02-08), this is impossible.

And even strace gets it wrong.

Edit: Linus Torvalds talks about it here, also analysing possible (but commented-out) workarounds in the strace code that look directly at the instructions in the binary. This code was removed here as part of the patchset I mention below. It says "It works, but is too complicated, and strictly speaking, unreliable", but it is unclear to me in which cases the "strictly speaking, unreliable" part applies: only when a multi-threaded executable rewrites itself at runtime (which would make the approach unsuitable for forbidding certain syscalls in security use cases), or in other cases as well.

Edit: The "unreliable" part was added in this commit.

Edit: I have now tried out strace's opcode-peeking implementation (version v4.25), and suspect that it was buggy: when activating that code path by changing this line to #if 0 and this line to #elif 1, no syscalls are printed because scno is not set at all. I added scno = x86_64_regs.orig_rax; after this line to make it work.
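The opcode-peeking idea can be sketched roughly as follows. This is not strace's code; the names `classify_entry_insn` and `peek_entry_insn` are made up for illustration. It assumes the tracee is stopped at a syscall-entry stop, that `ip` is the instruction pointer read with PTRACE_GETREGS, and that the entry instruction is the two bytes immediately before it. The race described in the comments below (another thread rewriting or unmapping those bytes) still applies, so this is only suitable for debugging, not for enforcing a security policy.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

enum entry_insn { INSN_UNKNOWN, INSN_SYSCALL, INSN_INT80, INSN_SYSENTER };

/* Classify the two opcode bytes that sit immediately before the
 * instruction pointer reported at a syscall-entry stop. */
enum entry_insn classify_entry_insn(const unsigned char op[2])
{
    if (op[0] == 0x0f && op[1] == 0x05)
        return INSN_SYSCALL;   /* syscall  (64-bit ABI) */
    if (op[0] == 0xcd && op[1] == 0x80)
        return INSN_INT80;     /* int 0x80 (32-bit ABI) */
    if (op[0] == 0x0f && op[1] == 0x34)
        return INSN_SYSENTER;  /* sysenter (32-bit ABI) */
    return INSN_UNKNOWN;
}

/* Hypothetical helper: fetch the two bytes before `ip` from the stopped
 * tracee and classify them. PTRACE_PEEKTEXT reads one whole word; on
 * little-endian x86 the lowest byte of the word is the first byte in
 * memory. */
enum entry_insn peek_entry_insn(pid_t pid, unsigned long long ip)
{
    errno = 0;
    long word = ptrace(PTRACE_PEEKTEXT, pid,
                       (void *)(unsigned long)(ip - 2), NULL);
    if (word == -1 && errno != 0)
        return INSN_UNKNOWN;   /* the read failed, e.g. page unmapped */

    unsigned char op[2] = {
        (unsigned char)(word & 0xff),
        (unsigned char)((word >> 8) & 0xff),
    };
    return classify_entry_insn(op);
}
```

A tracer would call `peek_entry_insn(pid, regs.rip)` right after reading the registers and then pick the x86 or x64 syscall table based on the result.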

See the presentation How to make strace happy, slide 2, problem 2:

There is no reliable way to distinguish between x86_64 and x86 syscalls.

Details shown on slides 4-6. There is a proposed solution to be added to the kernel:

Extend the ptrace API with PTRACE_GET_SYSCALL_INFO request

But this solution has not been merged into the kernel yet.

The patchset is called ptrace: add PTRACE_GET_SYSCALL_INFO request and it's still being worked on in January 2019. Hopefully it will soon be merged.

strace already has support for it since release 4.26 (but it shouldn't work unless you apply the kernel patch manually):

Implemented obtainment of system call information using PTRACE_GET_SYSCALL_INFO ptrace API.
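For readers on a kernel that has the request (it was merged in Linux 5.3, after this answer was written), a minimal sketch of how a tracer might use it is shown below. It assumes a glibc recent enough that `<sys/ptrace.h>` declares `PTRACE_GET_SYSCALL_INFO` and `struct __ptrace_syscall_info`; the names `abi_from_audit_arch` and `tracee_syscall_abi` are made up for illustration. The key point is the `arch` field, which the kernel fills with an `AUDIT_ARCH_*` value describing the ABI the syscall was made through.

```c
#include <linux/audit.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

enum syscall_abi { ABI_UNKNOWN, ABI_X86_64, ABI_I386 };

/* Map the AUDIT_ARCH_* value reported by the kernel to an ABI tag. */
enum syscall_abi abi_from_audit_arch(uint32_t arch)
{
    switch (arch) {
    case AUDIT_ARCH_X86_64: return ABI_X86_64;  /* syscall instruction */
    case AUDIT_ARCH_I386:   return ABI_I386;    /* int 0x80 / sysenter */
    default:                return ABI_UNKNOWN;
    }
}

/* Hypothetical helper: call while the tracee is in a syscall-entry stop.
 * Returns 0 and fills *abi and *nr on success, -1 otherwise. */
int tracee_syscall_abi(pid_t pid, enum syscall_abi *abi, uint64_t *nr)
{
    struct __ptrace_syscall_info info;

    /* For this request, addr carries the buffer size and data the buffer. */
    long n = ptrace(PTRACE_GET_SYSCALL_INFO, pid,
                    (void *)sizeof(info), &info);
    if (n < 0 || info.op != PTRACE_SYSCALL_INFO_ENTRY)
        return -1;

    *abi = abi_from_audit_arch(info.arch);
    *nr = info.entry.nr;
    return 0;
}
```

With the ABI known, the filter can look up `nr` in the matching syscall table, so number 2 is treated as fork() for `ABI_I386` and as open() for `ABI_X86_64`.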

edited Feb 09 '19 at 01:55

answered Feb 08 '19 at 04:38


Related: In [Can ptrace tell if an x86 system call used the 64-bit or 32-bit ABI?](//stackoverflow.com/q/53456266) I suggested that you could disassemble the code at RIP and check for the `0f 05` `syscall` instruction. IDK if that would really work, and of course it would be slower to use an extra `ptrace` system call or two to fetch registers and peek at the process memory. (And for security use-cases like this there'd be a race condition where another thread could rewrite those bytes after they execute, fooling the filter.) – Peter Cordes Feb 08 '19 at 04:42

@PeterCordes Looks like code that does that actually exists/existed in `strace` (commented out); Linus Torvalds analysed it in a thread I've now linked (edited). I've added to the edit a new question about the "unreliability" of this method -- do you know more? – nh2 Feb 08 '19 at 05:06

I think Linus is just talking about the same race condition I pointed out: another thread in the process that made the syscall could modify the instruction or unmap the page before `strace` can read it. Most of his message is proposing mechanisms to sneak in extra signalling from the kernel to the tracing process without breaking old user-space. (e.g. upper bits of the 8-byte space for CS, or of RFLAGS.) Oh, the [parent message](https://lore.kernel.org/lkml/fd4ccb42a25876131e411299d24d9151.squirrel@webmail.greenhost.nl/) shows that a single thread can bypass SMC flush with another mapping. – Peter Cordes Feb 08 '19 at 05:48

Normally modern x86 CPUs snoop stores anywhere near any instruction address that's in the pipeline ([Observing stale instruction fetching on x86 with self-modifying code](//stackoverflow.com/q/17395557)), but apparently using a different virtual page bypasses that on at least some CPUs. The answer on that SO question already mentions that it's not guaranteed in that case, but that Intel has a patent on a mechanism for snooping based on physical address (with finer than 1 page granularity), so on most actual CPUs you maybe couldn't work around this with a single thread. – Peter Cordes Feb 08 '19 at 05:49
@PeterCordes OK, so this sounds to me like the (now deleted) strace approach should work reliably for normal debugging tasks, unless the program self-modifies multithreadedly or unmaps its own memory. – nh2 Feb 08 '19 at 20:41

yup, that's definitely true. The discussion leading up to questions about reliability was all about obfuscation attempts. Even if a single thread could create stale instruction fetch (e.g. on Atom or Silvermont?), that would only be a problem for intentional obfuscation. And if you're worried about obfuscation, there are ways that can work on mainstream Intel (e.g. cross-modifying instead of self-modifying, on a multicore), so it turns out that self-modifying code is probably only relevant on a single-core machine. (Where only *very* lucky preemption could let another thread in then.) – Peter Cordes Feb 09 '19 at 02:45

chaos · Answer 2 · 2014-10-31T07:19:49.087

It's the system call sys_rt_sigtimedwait (since kernel 2.2). See its man page with:

man 2 rt_sigtimedwait

That syscall suspends execution until a signal (or one of a set of signals) specified by the argument is delivered. A timeout can also be given.

To be 100% sure, there is a file called unistd_64.h. Search your system for that file. It is usually in the include folder (/usr/include/x86_64-linux-gnu/asm/unistd_64.h). The numbers are defined in there. Here are the relevant lines in my case (it's also a 64-bit system, kernel 3.2.0-58):

#define __NR_rt_sigtimedwait                    128
__SYSCALL(__NR_rt_sigtimedwait, sys_rt_sigtimedwait)

Note 128 is decimal for 80 in hex.

This does not answer the question how to distinguish, when `ptrace()`ing, whether `int 0x80` or `syscall` was used. — nh2, Feb 08 '19 at 04:40
