Super fast search by file content using PHP + Windows

I struggled for a while to build a fast way to search the contents of tens of thousands of files (30,000 to 60,000), and I tried several approaches before finding what I believe is the fastest one.

I will go through the techniques I tried and finish with the fastest solution.

Using PHP with DirectoryIterator

The first attempt used DirectoryIterator in combination with file_get_contents and strpos.

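public function searchDirectoryIterator($path, $string){
    $dir = new DirectoryIterator($path);
    $files = array();

    // number of files actually read
    $cont = 0;
    foreach ($dir as $file){
        // skip ".", ".." and subdirectories
        if ($file->isFile()){
            $cont++;
            $content = file_get_contents($file->getPathname());
            if ($content !== false && strpos($content, $string) !== false) {
                // key by name: keying by modification time would drop files sharing a timestamp
                $files[$file->getBasename()] = $file->getMTime();
            }
        }
    }

    // oldest first, as with the original sort by timestamp
    asort($files);

    return array('files' => $files, 'totalFiles' => $cont);
}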

I improved this version by calling the function asynchronously with AJAX, reading groups of 1,000 files at a time so I could show a progress bar. I also cached the list of files in the path returned by DirectoryIterator.

It was quite slow: it could take 5-10 minutes or even more. The second run was faster, I guess because Windows (or PHP) had somehow cached the file contents.
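The batching and caching described above could be sketched roughly as follows. This is a minimal illustration, not the code I actually used: the function names `cachedFileList` and `searchBatch`, the cache file location and the serialized cache format are all assumptions.

```php
<?php
// Hypothetical helpers; names and the cache format are assumptions.

/** Build (or reuse) a cached list of file paths found directly in $path. */
function cachedFileList(string $path, string $cacheFile): array
{
    if (is_file($cacheFile)) {
        return unserialize(file_get_contents($cacheFile));
    }
    $list = array();
    foreach (new DirectoryIterator($path) as $file) {
        if ($file->isFile()) {
            $list[] = $file->getPathname();
        }
    }
    sort($list); // fixed order so consecutive batches never overlap
    file_put_contents($cacheFile, serialize($list));
    return $list;
}

/** Search one batch of the list; call again with a larger $offset for the next batch. */
function searchBatch(array $list, string $needle, int $offset, int $size = 1000): array
{
    $matches = array();
    foreach (array_slice($list, $offset, $size) as $pathname) {
        $content = file_get_contents($pathname);
        if ($content !== false && strpos($content, $needle) !== false) {
            $matches[] = basename($pathname);
        }
    }
    $processed = min($offset + $size, count($list));
    return array('matches' => $matches, 'processed' => $processed, 'total' => count($list));
}
```

Each AJAX request would call `searchBatch` with the next offset, and the progress bar can be driven by `processed / total`.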

Using PHP with readdir

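public function method3($path, $string, $limitFrom, $limitTo){
    // $limitFrom and $limitTo are not used in this version
    $files = array();

    // number of files actually read
    $cont = 0;

    // make sure the path ends with a separator before appending file names
    $path = rtrim($path, '/\\').DIRECTORY_SEPARATOR;

    if ($handle = opendir($path)) {
        while (false !== ($entry = readdir($handle))) {
            // is_file() also skips ".", ".." and subdirectories
            if (is_file($path.$entry)) {
                $cont++;
                $content = file_get_contents($path.$entry);
                if ($content !== false && strpos($content, $string) !== false) {
                    $files[] = $path.$entry;
                }
            }
        }
        closedir($handle);
    }

    return array('files' => $files, 'totalFiles' => $cont);
}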

No difference from the DirectoryIterator option in terms of speed.

Using Windows Batch Scripting

I thought it could be faster than PHP, as it deals directly with OS instructions. I created a file called “search.bat” with the following content:

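@echo off
rem %~1 and %~2 strip any surrounding quotes so each argument is quoted exactly once
pushd "%~1"
findstr /m /C:"%~2" *
popd

I got the results from PHP by using exec:

public function batchSearch($path, $string){
    $searchScript = public_path().'/files/search.bat';

    // escapeshellarg quotes each argument so spaces or shell characters cannot break the command
    exec('cmd /c '.$searchScript.' '.escapeshellarg($path).' '.escapeshellarg($string), $result);

    return $result;
}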

It was also faster the second time it was run, but still not fast enough: quite similar to the PHP DirectoryIterator solution. This option also offered fewer possibilities for getting more information about the files, such as the modified date, or for filtering by date, etc.
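The missing file information could in principle be added back on the PHP side once findstr has returned the matching names. Here is a rough sketch of that idea, assuming the names are relative to the searched folder; `withModifiedDates` is a hypothetical helper, not something I used at the time.

```php
<?php
// Hypothetical helper: $names is the list of file names returned by the
// batch search, $path is the directory that was searched.
function withModifiedDates(string $path, array $names): array
{
    $base = rtrim($path, '/\\') . DIRECTORY_SEPARATOR;
    $files = array();
    foreach ($names as $name) {
        // @ silences the warning for names that no longer exist on disk
        $mtime = @filemtime($base . $name);
        if ($mtime !== false) {
            $files[] = array('filename' => $name, 'mtime' => $mtime);
        }
    }
    // newest first
    usort($files, function ($a, $b) {
        return $b['mtime'] <=> $a['mtime'];
    });
    return $files;
}
```

This only touches the files that matched, so it adds little to the total time, but it does nothing for the speed of the search itself.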

Using Windows Search Indexer with PHP COM

This is where the magic happens.
Windows has an option to enable the indexing of files. Once you enable it, a new option called “Indexing Options” appears in the Control Panel.
There you can add the content you want Windows to index. Searches over the files in the indexed folders then become considerably faster.


Depending on the content to index, this can take hours. (Better to wait now than at execution time!)

To access the Windows Search Indexer with PHP, we have to use the PHP COM class together with ADODB.
To use COM, make sure your version of PHP has the file php_com_dotnet.dll in the extensions folder and that it is enabled in php.ini (extension=php_com_dotnet.dll).
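A quick way to check that the extension is really loaded before going any further might look like this. It is a small sketch; `comIsAvailable` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true when the com_dotnet extension is loaded
// and the COM class is available to scripts.
function comIsAvailable(): bool
{
    return extension_loaded('com_dotnet') && class_exists('COM');
}

if (!comIsAvailable()) {
    echo "COM support is missing: enable php_com_dotnet.dll in php.ini\n";
}
```

On anything other than Windows this will report that COM is missing, which is expected.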

Here’s what we need:

public function com($path, $string){
    // new COM() throws a com_exception on failure, so an "or die" here would never run
    $conn = new COM("ADODB.Connection");
    $recordset = new COM("ADODB.Recordset");

    //setting the limit for the query
    $recordset->MaxRecords = 150;

    $conn->Open("Provider=Search.CollatorDSO;Extended Properties='Application=Windows';");

    //creating the query against the Windows Search index.
    //we specify the path and the string we are looking for,
    //doubling single quotes so they cannot break out of the SQL literals
    $path = str_replace("'", "''", $path);
    $string = str_replace("'", "''", $string);
    $recordset->Open("SELECT System.ItemName, System.DateModified FROM SYSTEMINDEX WHERE DIRECTORY='".$path."' AND CONTAINS('".$string."') ORDER BY System.DateModified DESC", $conn);

    if(!$recordset->EOF){
        $recordset->MoveFirst();
    }
    $files = array();
    while(!$recordset->EOF) {
        $filename = $recordset->Fields->Item("System.ItemName")->value;

        //obtaining the date and formatting it.
        $date = $recordset->Fields->Item("System.DateModified")->Value;
        $timestamp = variant_date_to_timestamp($date);

        //getting the filename and the modified date
        $files[]=  array('filename' => $filename,  'date' => date('d-M-Y H:i:s', $timestamp));

        $recordset->MoveNext();
    }

    //releasing the recordset and the connection
    $recordset->Close();
    $conn->Close();

    return $files;
}
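Building the query string is the fiddly part, so it may be worth isolating it in a function of its own. The sketch below is only an illustration under a few assumptions: `buildIndexQuery` is a hypothetical name, the search term is always treated as one exact phrase (wrapped in double quotes inside CONTAINS), and the optional SCOPE predicate is used to include subfolders where DIRECTORY only matches the folder itself.

```php
<?php
// Hypothetical helper; the name and parameters are illustrative only.
function buildIndexQuery(string $path, string $phrase, bool $recursive = false): string
{
    // Double single quotes so the input cannot terminate the SQL literal.
    $path = str_replace("'", "''", $path);
    // Drop double quotes from the term, then wrap it in double quotes below
    // so that a multi-word input is searched as one phrase.
    $phrase = str_replace("'", "''", str_replace('"', '', $phrase));

    // DIRECTORY matches only the folder itself; SCOPE also covers subfolders.
    $location = $recursive ? 'SCOPE' : 'DIRECTORY';

    return "SELECT System.ItemName, System.DateModified FROM SYSTEMINDEX"
         . " WHERE $location='$path' AND CONTAINS('\"$phrase\"')"
         . " ORDER BY System.DateModified DESC";
}

// The "file:" form of the path is an assumption about how the folder is passed in.
echo buildIndexQuery('file:C:/docs', "O'Brien report"), "\n";
```

The returned string can then be passed to `$recordset->Open()` in place of the inline query.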

Using `microtime` I measured the speed of the script and got results of 0.004 to 0.019 seconds when looking for a string through 50,000 files. Isn’t it amazing? 🙂
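For anyone who wants to repeat the measurement, a timing wrapper around `microtime(true)` can be as small as the sketch below. `timeCall` is a hypothetical helper, and the closure being timed is just a stand-in for a call to one of the search functions above.

```php
<?php
// Hypothetical helper: runs $callback once and returns its result
// together with the elapsed wall-clock time in seconds.
function timeCall(callable $callback): array
{
    $start = microtime(true);
    $result = $callback();
    $elapsed = microtime(true) - $start;
    return array('result' => $result, 'seconds' => $elapsed);
}

// Stand-in workload; replace the closure body with the search call to measure.
$timing = timeCall(function () {
    return strpos(str_repeat('a', 100000) . 'needle', 'needle');
});
printf("Found at offset %d in %.4f seconds\n", $timing['result'], $timing['seconds']);
```

Running each search more than once is worthwhile, since the first run and the cached runs can differ a lot.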

I really wanted to post it here as I couldn’t find a PHP solution for this problem anywhere else. I hope this helps someone in the near future if they have to deal with Windows and PHP together, as I have to.

  • Apphut

    If you’re running on Windows, check out Everything http://www.voidtools.com/ it is amazing and will index everything in under a minute. It also has wildcard searching, just try it out. By the way, fullpage.js is awesome.

    • Alvaro

      That’s a desktop application, which cannot be connected to a web application. Thanks anyway! 🙂

  • Jacob

    Search can be done with regular expressions, include or exclude certain search patterns. http://www.googbox.com is capable of searching within folders and subfolders.

    • Alvaro

      Sounds like an advert, huh? 😀
      I wasn’t looking for an application but for a piece of code to do it within PHP (or Java).

  • Adi

    Thank you for the step-by-step tip. Are there any special settings to enable Windows Search? Does it only come with certain operating systems? I am getting “Provider cannot be found. It may not be properly installed” on the following line: $conn->Open("Provider=Search.CollatorDSO;Extended Properties='Application=Windows';");
    Thank you.