I struggled for a while to build a fast way to search through tens of thousands of files (30,000 to 60,000), and I tried several approaches before finding what I believe is the fastest one.
I will go through the techniques I tried and finish with the fastest solution.
Using PHP with DirectoryIterator
My first attempt used DirectoryIterator in combination with file_get_contents and strpos.
public function searchDirectoryIterator($path, $string){
    $dir = new DirectoryIterator($path);
    $files = array();
    $cont = 0; // number of files scanned
    foreach ($dir as $file){
        // skip "." and ".." as well as subdirectories
        if ($file->isFile()){
            $cont++;
            $content = file_get_contents($file->getPathname());
            if ($content !== false && strpos($content, $string) !== false) {
                // note: files sharing the same modification time overwrite each other
                $files[$file->getMTime()] = $file->getBasename();
            }
        }
    }
    ksort($files); // oldest first
    return array('files' => $files, 'totalFiles' => $cont);
}
I improved this version by calling the function asynchronously through AJAX on groups of 1,000 files, so I could show a progress bar. I also cached the list of files that DirectoryIterator returned for the path.
It was still quite slow: a search could take 5 to 10 minutes or even more. The second run was faster, presumably because Windows (or PHP) had cached the file contents in some way.
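The batching idea can be sketched roughly as follows. This is only an illustration, not the code I actually ran: the function name searchBatch, its parameters and the shape of the returned array are all assumptions made for the example. It takes the cached file list and searches one slice of it per call, so each AJAX request can report progress.

```php
<?php
// Hypothetical helper: searches one slice of a pre-built (cached) file list,
// so that each AJAX request only has to handle $limit files.
function searchBatch(array $fileList, string $needle, int $offset, int $limit = 1000): array
{
    $matches = [];
    $batch = array_slice($fileList, $offset, $limit);

    foreach ($batch as $pathname) {
        // Skip files that cannot be read instead of aborting the whole batch.
        $content = @file_get_contents($pathname);
        if ($content !== false && strpos($content, $needle) !== false) {
            $matches[] = $pathname;
        }
    }

    $processed = $offset + count($batch);

    return [
        'matches'   => $matches,                      // files in this slice that contain $needle
        'processed' => $processed,                    // offset to send with the next request
        'total'     => count($fileList),              // lets the client draw a progress bar
        'done'      => $processed >= count($fileList),
    ];
}
```

The client would keep calling it with the returned 'processed' value as the next offset until 'done' is true.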
Using PHP with readdir
public function method3($path, $string, $limitFrom, $limitTo){
    // $limitFrom and $limitTo are not used in this version
    // $path must end with a directory separator
    $files = array();
    $cont = 0; // number of entries scanned
    if ($handle = opendir($path)) {
        while (false !== ($entry = readdir($handle))) {
            if ($entry != "." && $entry != "..") {
                $content = file_get_contents($path.$entry);
                if ($content !== false && strpos($content, $string) !== false) {
                    $files[] = $path.$entry;
                }
                // count only real entries, not "." and ".."
                $cont++;
            }
        }
        closedir($handle);
    }
    return array('files' => $files, 'totalFiles' => $cont);
}
There was no difference in speed compared with the DirectoryIterator option.
Using Windows Batch Scripting
I thought a batch script could be faster than PHP, as it works directly with OS commands. I created a file called “search.bat” with the following content:
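@echo off
rem %1 = directory to search, %2 = quoted search string
rem %~1 strips surrounding quotes so the path is quoted exactly once
pushd "%~1"
rem /m prints only the names of files that contain a match
findstr /m /C:%2 *
popd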
I got the results from PHP by using exec:
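public function batchSearch($path, $string){
    $searchScript = public_path().'/files/search.bat';
    // escapeshellarg() quotes each argument to guard against command injection
    // (on Windows it also replaces %, ! and " inside the value with spaces)
    exec('cmd /c '.$searchScript.' '.escapeshellarg($path).' '.escapeshellarg($string), $result);
    return $result;
}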
The second run was faster here as well, but still not fast enough; it was quite similar to the PHP DirectoryIterator solution. This option also offered fewer possibilities for getting more information about the files, such as the modified date, or for filtering by date.
Using Windows Search Indexer with PHP COM
This is where the magic happens.
Windows has an option to enable file indexing. Once it is enabled, a new entry called “Indexing Options” appears in the Control Panel.
There you can add the folders you want Windows to index. Searches over files in the indexed folders then take considerably less time.
Depending on the amount of content, indexing can take hours. (Better to wait now than at execution time!)
To access the Windows Search Indexer with PHP, we have to use the PHP COM Class together with ADODB.
To use COM, make sure your PHP installation has the file php_com_dotnet.dll in its extensions folder and that the extension is enabled in php.ini.
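A quick way to check this before going any further is to ask PHP whether the extension is loaded. This is a small sketch; the function name comIsAvailable is made up for the example, while extension_loaded and class_exists are standard PHP functions.

```php
<?php
// Hypothetical helper: returns true when the COM class is available in this PHP build.
function comIsAvailable(): bool
{
    // 'com_dotnet' is the extension that php_com_dotnet.dll provides.
    return extension_loaded('com_dotnet') && class_exists('COM', false);
}

if (!comIsAvailable()) {
    echo "COM is not available: enable php_com_dotnet.dll in php.ini\n";
}
```

On anything other than Windows this will simply report that COM is not available.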
Here’s what we need:
public function com($path, $string){
    $conn = new COM("ADODB.Connection") or die("Cannot start ADO");
    $recordset = new COM("ADODB.Recordset");
    //setting the limit for the query
    $recordset->MaxRecords = 150;
    $conn->Open("Provider=Search.CollatorDSO;Extended Properties='Application=Windows';");
    //doubling single quotes so the values cannot break out of the SQL string literals
    $path = str_replace("'", "''", $path);
    $string = str_replace("'", "''", $string);
    //creating the query against windows search indexer database.
    //we specify the path and the string we are looking for
    $recordset->Open("SELECT System.ItemName, System.DateModified FROM SYSTEMINDEX WHERE DIRECTORY='".$path."' AND CONTAINS('".$string."') ORDER BY System.DateModified DESC", $conn);
    if(!$recordset->EOF){
        $recordset->MoveFirst();
    }
    $files = array();
    while(!$recordset->EOF) {
        //getting the filename
        $filename = $recordset->Fields->Item("System.ItemName")->Value;
        //obtaining the modified date and formatting it.
        $date = $recordset->Fields->Item("System.DateModified")->Value;
        $timestamp = variant_date_to_timestamp($date);
        $files[]= array('filename' => $filename, 'date' => date('d-M-Y H:i:s', $timestamp));
        $recordset->MoveNext();
    }
    //releasing the recordset and the connection
    $recordset->Close();
    $conn->Close();
    return $files;
}
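Since the query is just a string, it can be built and inspected separately from the COM calls, which also makes it testable on a machine without Windows Search. The helper below is a sketch under assumptions: the name buildIndexQuery is made up, and wrapping the search text in double quotes so that CONTAINS treats several words as one phrase is my reading of the Windows Search SQL syntax, so check it against your own index before relying on it.

```php
<?php
// Hypothetical helper: builds the query text sent to SYSTEMINDEX.
function buildIndexQuery(string $path, string $needle): string
{
    // Double single quotes so they cannot end the SQL string literal early.
    $safePath = str_replace("'", "''", $path);

    // Assumption: a phrase with spaces must be wrapped in double quotes
    // inside CONTAINS, so drop any double quotes from the input first.
    $phrase = '"' . str_replace('"', '', $needle) . '"';
    $safePhrase = str_replace("'", "''", $phrase);

    return "SELECT System.ItemName, System.DateModified FROM SYSTEMINDEX"
        . " WHERE DIRECTORY='" . $safePath . "'"
        . " AND CONTAINS('" . $safePhrase . "')"
        . " ORDER BY System.DateModified DESC";
}

echo buildIndexQuery('C:\\docs', "O'Brien report"), "\n";
```

The resulting string can be passed as the first argument of $recordset->Open in place of the concatenated query.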
Using microtime to measure the script, I got times of 0.004 to 0.019 seconds when looking for a string across 50,000 files. Isn’t it amazing? 🙂
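For reference, timing with microtime can look something like this. It is a minimal sketch: the helper name timeCall is made up, and the closure is only a stand-in for whichever search function you want to measure.

```php
<?php
// Hypothetical helper: runs a callable and returns its result plus the elapsed seconds.
function timeCall(callable $fn): array
{
    $start = microtime(true);           // current time in seconds, as a float
    $result = $fn();
    $elapsed = microtime(true) - $start;

    return ['result' => $result, 'seconds' => $elapsed];
}

// Stand-in workload: find a needle at the end of a 100,000 character string.
$timing = timeCall(function () {
    return strpos(str_repeat('a', 100000) . 'needle', 'needle');
});

printf("Found at offset %d in %.4f seconds\n", $timing['result'], $timing['seconds']);
```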
I really wanted to post it here, as I couldn’t find a PHP solution for this problem anywhere else. I hope it helps someone in the near future who has to deal with Windows and PHP together, as I do.