
I have a folder with around 5000 .ts files that I want to concatenate into one. They follow the naming format chunk_n, for example chunk_50, chunk_51, chunk_52, etc. When I run the Windows copy command like this:

copy /b *.ts final.ts

It doesn't iterate through them in normal numeric order. It starts concatenating them in this order:

chunk_100, chunk_1000, chunk_1001, chunk_1002, chunk_1003, chunk_1004, chunk_1005, chunk_1006, chunk_1007, chunk_1008, chunk_1009, chunk_101

and so on.

It seems it doesn't look at the entire filename when it iterates over them. Even if I removed the chunk_ prefix and just named them all 1 through 5000, the issue would still persist. Is this intended behaviour? And is there a better way for me to concatenate all of these files?

Dongliu

5 Answers


In most programs, numeric order has never been the normal order. Only fairly recently did file managers (such as Windows Explorer) begin to sort files using "natural sort", but the copy command and * wildcard expansion have not been changed accordingly.

Instead, * expansion sorts filenames by their ASCII byte order, comparing each character separately. For example, with "chunk_1000.ts" vs "chunk_101.ts" (the extension is part of the name), the first 8 bytes (chunk_10) are equal, and then you have 0 < 1 (zero is ASCII 48, one is ASCII 49), and therefore chunk_100(0.ts) < chunk_101(.ts) – even if more digits follow, they're not considered to be part of a whole "number" at all.
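The comparison described above can be reproduced in a few lines of Python. This is only an illustrative sketch: the helper name `first_difference` is made up for this example, and Python's own string comparison stands in for whatever the shell does internally.

```python
def first_difference(a: str, b: str):
    """Return (index, char_a, char_b) at the first position where a and b differ.

    Hypothetical helper for illustration; returns None if one string is a
    prefix of the other (or the two are equal).
    """
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    return None


a, b = "chunk_1000.ts", "chunk_101.ts"
index, ca, cb = first_difference(a, b)

# Show where the names first differ and the character codes that decide the order.
print(index, repr(ca), ord(ca), repr(cb), ord(cb))

# Plain string comparison reaches its verdict at that same position.
print(a < b)
```

The digits after the first differing position never get a chance to influence the result, which is why chunk_1000 lands before chunk_101.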

(Byte order is the most straightforward way to sort names: all that's needed is directly comparing two memory locations. This simplicity was important when RAM was measured in kilobytes. It has been done this way for a good 50 years, so it's pretty much impossible to change now, because many scripts rely on the existing behavior.)

The simplest workaround is to pad all numbers to a fixed width, e.g. chunk_0050, chunk_0100.
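As a rough sketch of the padding idea, the following Python computes the padded names. The helper `pad_number` and the width of 4 are assumptions for this example (4 digits is enough for numbers up to 9999); nothing is renamed on disk here.

```python
import re


def pad_number(name: str, width: int = 4) -> str:
    """Zero-pad the first run of digits in name to the given width.

    Hypothetical helper; the default width of 4 is an assumption that
    fits numbers up to 9999.
    """
    return re.sub(r"\d+", lambda m: m.group(0).zfill(width), name, count=1)


names = ["chunk_50.ts", "chunk_100.ts", "chunk_1000.ts", "chunk_7.ts"]

# Once every number has the same width, ordinary string sorting is numeric.
padded = sorted(pad_number(n) for n in names)
print(padded)
```

To apply it for real, each original name and its padded counterpart could be passed to `os.rename`; that step is left out of the sketch.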

A different workaround, available in more advanced shells such as PowerShell or Bash, is to get and sort the list of inputs as a separate step. For example, on Linux you would use an external command such as 'sort' to get the desired order (it's not part of Bash, but generally present wherever Bash is present; the -V "version sort" option used here is a GNU extension):

(printf '%s\n' chunk*.ts | sort -V | xargs -d'\n' cat) > allchunks.ts

In PowerShell, Sort-Object can be given a script block that computes a custom sort key; there isn't a pre-defined one for "natural" (numeric) order, but a one-liner can be borrowed for it:

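# Pad every run of digits to 20 characters so that string order matches numeric order
$ToNatural = { [regex]::Replace($_.Name, '\d+', { $args[0].Value.PadLeft(20) }) }
$filenames = (Get-ChildItem chunk*.ts | Sort-Object $ToNatural).Name -join "+"
# /b copies in binary mode; note that cmd.exe limits a command line to 8191 characters
cmd /c "copy /b $filenames allchunks.ts"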
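For comparison, the same padding trick from the PowerShell snippet can be written as a Python sort key. This is a sketch under assumptions: `natural_key` is a made-up name, and the width of 20 is simply carried over from the snippet (it assumes no number has more than 20 digits).

```python
import re


def natural_key(name: str) -> str:
    """Build a sort key by left-padding every run of digits with spaces.

    Mirrors the padding idea of the PowerShell snippet; the width of 20
    is an assumption and must be at least as long as the longest number.
    """
    return re.sub(r"\d+", lambda m: m.group(0).rjust(20), name)


names = ["chunk_100.ts", "chunk_1000.ts", "chunk_101.ts", "chunk_99.ts"]

# The file names themselves are unchanged; only the comparison uses the padded form.
ordered = sorted(names, key=natural_key)
print(ordered)
```

Padding with spaces works because a space sorts before every digit, so a shorter number always compares lower than a longer one.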
u1686_grawity
  • Thanks for the answer. I just wrote a bash script to add the necessary padding to the filenames that needed it. Everything worked perfectly after that. – Dongliu Dec 23 '21 at 09:45
  • The reason byte ordering sticks around is because it's *predictable*: one program's smart sort might not be the same as another's, but byte ordering is always the same. This makes it much easier to figure out what a command or a program is going to do. – Mark Dec 23 '21 at 21:27
  • @user1652994 if you had bash available you could have also done `cat chunk_{1..5000}.ts > out.ts` :) – hobbs Dec 24 '21 at 01:53
  • @Mark: Ehh, in case of e.g. Bash, "predictable" largely means "the POSIX shell specifications say do it this way". If the specification said "implement numeric sort using this method" it'd still be just as predictable. (Actually I'm not 100% sure whether the spec says to use byte order or locale order. I could've been wrong in my post – it could actually be that POSIX wildcard expansion is *not* byte-ordered at all.) – u1686_grawity Dec 24 '21 at 10:12
  • Shell wildcard expansion will use the current locale (specifically, `LC_COLLATE`) to determine the sort ordering. It'll only be strictly byte-order if `LC_COLLATE` or `LC_ALL` is set to either `C` or `POSIX`. Not that it would affect numeric sorting of filenames like the ones presented in the question, but for alpha filenames it will definitely not use simple byte ordering. When a regional locale is set, it will use the locale sort order. Compare `LC_COLLATE="C" ls *` to `LC_COLLATE="en_US" ls *` some time; en_US sorts case-insensitively, unlike POSIX. – FeRD Dec 24 '21 at 20:16

File names are names, not numbers, so they're compared as text strings, using lexicographic order. chunk_1 comes before chunk_10 which comes before chunk_10a which comes before chunk_1z which comes before chunk_2.
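A quick way to check that ordering is to sort the same names as plain strings. The snippet below is only an illustration in Python; its string comparison is by code point, which for these ASCII names is the lexicographic order described above.

```python
names = ["chunk_2", "chunk_10a", "chunk_1z", "chunk_1", "chunk_10"]

# A shorter name that is a prefix of a longer one sorts first;
# otherwise the first differing character decides.
print(sorted(names))
```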

If zsh is available, you can use the following command:

<*.ts(n) >final.ts

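This leverages a combination of zsh features:

  • The glob qualifier (n) sorts the matches in numeric order, so chunk_9 comes before chunk_10.
  • With the multios option (on by default), a glob is allowed in a redirection, and the matching input files are read one after another.
  • A command line that consists only of redirections runs the command named by $NULLCMD, which is cat by default.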

If basic Unix-like utilities are also available, you can use the cat command to concatenate files:

cat *.ts(n) >final.ts

Zsh is available through WSL (sudo apt-get install zsh), Cygwin and probably other distributions of free software for Windows.

Gilles 'SO- stop being evil'

The Windows copy command is a CMD.EXE builtin command. It does no sorting at all on its wildcard expansions. This behaviour dates back to at least MS-DOS 6.22 COMMAND.COM (I don't have anything earlier handy to test with). It just so happens that your NTFS file system stores file names in a B-tree structure which has the effect of enumerating them in an order that approximates lexical sorted order.

More information is available on the Old New Thing blog:

Why do NTFS and Explorer disagree on filename sorting?

What order does the DIR command arrange files if no sort order is specified?

Neil

TBH I think all of these digressions into things like byte ordering etc. are muddying the waters somewhat. Here's the most important, most basic point everyone should understand about sorting in computers:

By default (meaning, unless a more advanced algorithm is applied that does more complex interpretation of the inputs), a computer will sort string data (like filenames) by comparing them character-by-character, starting from the first. (Which is probably the leftmost, although not necessarily because RTL languages exist.)

That's correct for sorting of text strings, even if they're different lengths. Most of us would expect a list of names to be sorted like this, for example:

  • Aaron
  • Alexander
  • Anna
  • Annabelle
  • Brian

The most significant letter in a string, when it comes to sorting, is the first one, and length is irrelevant except that strings with shorter total length come before longer ones that contain the same initial text.

Problem is, that's exactly wrong for numbers, which are supposed to be compared with their last digits aligned. When we're talking about numbers, length is the most critical factor: longer values are ALWAYS greater than shorter ones, because they have more significant digits. (Assume we're talking about integers here; decimal points complicate things further.) The fact that length is more significant than value is why we typically right-align lists of numbers.
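To make the "length first" rule concrete, here is a small Python sketch. The helper `numeric_string_key` is hypothetical, and it assumes non-negative integers written without leading zeros.

```python
def numeric_string_key(digits: str):
    """Sort key for non-negative integers written without leading zeros.

    Length is compared first, then the characters; hypothetical helper
    for illustration only.
    """
    return (len(digits), digits)


values = ["100", "1000", "101", "99", "5"]

# Character-by-character comparison, as used for ordinary text.
print(sorted(values))

# Length first, then characters: this agrees with real numeric order.
print(sorted(values, key=numeric_string_key))
```

Comparing lengths first has the same effect as right-aligning the numbers before comparing them.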

Alphabetical sorting can sometimes produce the same results as numeric sorting, but only when the numbers are represented as strings of equal length. That's why padding with 0s fixed your issue. The sorting still isn't doing proper numeric comparison, though. (It's comparing the numbers as strings, character-by-character from left to right. But it turns out that comparing digit-by-digit, from left to right, is exactly how you would compare two n-digit numbers. The most significant digit is on the left, and the least significant is on the right.)

The sorting issue is also why so many in the computing industry (including myself) are big proponents of the ISO 8601 standard for representing dates as YYYY-MM-DD. As with numbers, dates in that form will happen to be correctly sorted by "dumb" alphabetical sorting, because the components are ordered left-to-right from most to least significant, which is exactly how alphabetical sorting works. The sorting isn't interpreting the string YYYY-MM-DD as a date, nor is it interpreting it as a number... but that's OK because it'll still sort correctly anyway.
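The date claim is easy to check with a short Python sketch. The sample dates are made up, and the day-first DD/MM/YYYY list is added here purely as a contrast; it is not something the answer itself discusses.

```python
from datetime import date, datetime

iso = ["2021-12-23", "2021-02-03", "2020-11-30", "2021-12-04"]

# "Dumb" text sort versus a sort that actually parses the dates.
iso_as_text = sorted(iso)
iso_as_dates = sorted(iso, key=date.fromisoformat)
print(iso_as_text == iso_as_dates)

# The same four dates written day-first (illustrative contrast only).
day_first = ["23/12/2021", "03/02/2021", "30/11/2020", "04/12/2021"]
day_first_as_text = sorted(day_first)
day_first_as_dates = sorted(day_first, key=lambda s: datetime.strptime(s, "%d/%m/%Y"))
print(day_first_as_text == day_first_as_dates)
```

With the year first, the text sort and the chronological sort agree; with the day first, the least significant component leads and the two orders diverge.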

FeRD

Here's another PowerShell approach:

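# Sort by the number in each base name, then append the contents in that order.
# Get-Content and Add-Content process files as text by default, so this suits text files.
Get-ChildItem *.cs |
    Sort-Object { [Int]( $_.BaseName -replace ('\D+','') ) } |
        Get-Content |
            Add-Content ConCat.cs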

Aliased:

gci *.cs | Sort {[Int]($_.BaseName -replace ('\D+',''))} | gc | ac ConCat.cs
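The same "strip the non-digits and sort by the integer" idea can be sketched in Python, here copying raw bytes so that binary files survive intact. The names `chunk_number` and `concat_chunks` are hypothetical, and the sketch assumes every matched file name contains at least one digit.

```python
import re
import shutil
from pathlib import Path


def chunk_number(path: Path) -> int:
    """Return the digits in the file's base name as an integer.

    Strips every non-digit character and converts what is left.
    Assumes the name contains at least one digit.
    """
    return int(re.sub(r"\D+", "", path.stem))


def concat_chunks(folder: Path, pattern: str, output: Path) -> None:
    """Append all files in folder matching pattern to output, in numeric order.

    Hypothetical helper; data is copied as raw bytes, and the output file
    itself is skipped if it happens to match the pattern.
    """
    parts = sorted(
        (p for p in folder.glob(pattern) if p != output),
        key=chunk_number,
    )
    with output.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)


# Example call (paths are placeholders):
# concat_chunks(Path("."), "chunk_*.ts", Path("final.ts"))
```

The list of inputs is built and sorted before the output file is opened, so a freshly created output is never picked up as one of its own inputs.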
Keith Miller