Repost: Reading Large Files in PHP

Date: 2019-11-14

Tags: repost, php, reading files. Category: PHP


This article is reposted from https://www.luyuqiang.com/how-php-read-a-large-file

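Why I Reposted This

I came across this article by chance and got a lot out of it, so I am reposting it here. It comes from Lu Yuqiang's blog.

The Article Itself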

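As PHP developers, we rarely need to worry about memory management. The PHP engine does a very good job of cleaning up after our scripts, and the web server model of short-lived execution means that even sloppy code has no long-lasting effects.

Only occasionally do we need to step outside this comfortable boundary, for example when we try to run Composer for a large project on a small VPS, or when we need to read a large file on an equally small server.

The latter is the problem we'll look at in this tutorial.

The code for this tutorial can be found on GitHub.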

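Measuring Success

The only way to be sure we are improving our code is to measure the bad version and compare it with the fixed one. In other words, unless we know how much a "solution" helps (if at all), we can't know whether it really is a solution.

There are two metrics we care about. The first is CPU usage: how fast or slow is the process? The second is memory usage: how much memory does the script need to run? The two are often inversely proportional, meaning we can reduce memory usage at the cost of CPU usage, and vice versa.

In an asynchronous execution model (such as multi-process or multi-threaded PHP applications), both CPU and memory usage need careful consideration. In a traditional PHP architecture, they generally become a problem when either one reaches the limits of the server.

It's impractical to measure CPU usage from inside PHP. If that's what you want to focus on, consider using the top command on Ubuntu or macOS. On Windows, consider installing the Linux subsystem so that you can use top in Ubuntu.

For the purposes of this tutorial, we're going to measure memory usage. We'll look at how much memory a "traditional" script uses, then apply a couple of optimizations and measure those too. In the end, I hope you'll be able to make an educated choice.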

// formatBytes is taken from the php.net documentation
memory_get_peak_usage();

function formatBytes($bytes, $precision = 2) {
 $units = array('b', 'kb', 'mb', 'gb', 'tb');

 $bytes = max($bytes, 0);
 $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
 $pow = min($pow, count($units) - 1);

 $bytes /= (1 << (10 * $pow));

 return round($bytes, $precision) . ' ' . $units[$pow];
}

We'll call these functions at the end of each script, so we can see at a glance which script uses the most memory.
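To make the measurement itself less mysterious, here is a minimal sketch of the difference between current and peak memory. The helper name measureMemory is hypothetical (it is not part of the original tutorial code); it only relies on memory_get_usage and memory_get_peak_usage.

```php
<?php
// Hypothetical helper (not from the original article): runs a callable and
// reports how much the current and the peak memory figures grew while it ran.
function measureMemory(callable $work): array
{
    $currentBefore = memory_get_usage();
    $peakBefore = memory_get_peak_usage();

    $work();

    return [
        'current_delta' => memory_get_usage() - $currentBefore,
        'peak_delta' => memory_get_peak_usage() - $peakBefore,
    ];
}

$result = measureMemory(function () {
    $buffer = str_repeat('x', 5 * 1024 * 1024); // allocate roughly 5 MB
    unset($buffer);                              // and release it again
});
```

The peak figure keeps the high-water mark even though the buffer has been released by the time we measure, which is why the tutorial reports memory_get_peak_usage rather than memory_get_usage.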

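What Are Our Options?

There are many approaches to reading files efficiently, but there are two likely scenarios in which we'd use them. We may want to read and process the data, then output the processed data or perform some other action based on it. Or we may want to transform a stream of data without ever needing access to the data itself.

For the first scenario, imagine we want to read a file and create a separate queued processing job for every 10,000 lines. We'd need to keep at least 10,000 lines in memory and pass them along to the queue manager.

For the second scenario, imagine we want to compress the contents of a particularly large API response. We don't care what it says, but we need to make sure it's backed up in compressed form.

In both scenarios we need to read large files. In the first we care about the data; in the second we don't. Let's explore these options...

Reading Files, Line By Line

There are many functions for working with files. Let's combine a few into a naive file reader: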

// from memory.php

function formatBytes($bytes, $precision = 2) {
 $units = array('b', 'kb', 'mb', 'gb', 'tb');

 $bytes = max($bytes, 0);
 $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
 $pow = min($pow, count($units) - 1);

 $bytes /= (1 << (10 * $pow));

 return round($bytes, $precision) . ' ' . $units[$pow];
}

print formatBytes(memory_get_peak_usage());

// from reading-files-line-by-line-1.php

function readTheFile($path) {
    $lines = [];
    $handle = fopen($path, 'r');

    while(!feof($handle)) {
        $lines[] = trim(fgets($handle));
    }

    fclose($handle);
    return $lines;
}

readTheFile('shakespeare.txt');

require 'memory.php';

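We're reading a text file containing the complete works of Shakespeare. The file is about 5.5MB, and the script's peak memory usage is 12.8MB. Now, let's use a generator to read each line: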

// from reading-files-line-by-line-2.php

function readTheFile($path) {
    $handle = fopen($path, 'r');

    while(!feof($handle)) {
        yield trim(fgets($handle));
    }

    fclose($handle);
}

readTheFile('shakespeare.txt');

require 'memory.php';

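The text file is the same size, but the peak memory usage is now 393KB. That doesn't tell us much until we actually do something with the data we read. Suppose we split the document into chunks wherever we see two blank lines. Something like this: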

// from reading-files-line-by-line-3.php

$iterator = readTheFile('shakespeare.txt');

$buffer = '';

foreach ($iterator as $iteration) {
    preg_match('/\n{3}/', $buffer, $matches);

    if (count($matches)) {
        print '.';
        $buffer = '';
    } else {
        $buffer .= $iteration . PHP_EOL;
    }
}

require 'memory.php';

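Any guesses how much memory we're using now? Would it surprise you that, even though we split the document into 1,216 chunks, we still only use 458KB of memory? Given the nature of generators, the most memory we'll use is what is needed to store the largest text chunk in an iteration. In this case, the largest chunk is 101,985 characters.

I've already written about the performance boosts of using generators and about Nikita Popov's Iterator library, so go and check those out if you'd like to know more.

Generators have other uses, but they are demonstrably good for reading large files efficiently. If we need to work on the data, generators are probably the best way.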
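Returning to the first scenario described earlier (one queued job per 10,000 lines), a generator-based reader can be combined with a small batching generator. This is only a sketch: readLines and inBatches are hypothetical names, and the reader loops on the return value of fgets rather than on feof, which avoids the extra empty line that a feof-driven loop can produce at the end of a file.

```php
<?php
// Yields one line at a time, without its line ending.
function readLines(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Unable to open $path");
    }

    try {
        while (($line = fgets($handle)) !== false) {
            yield rtrim($line, "\r\n");
        }
    } finally {
        fclose($handle); // also runs if the consumer stops iterating early
    }
}

// Hypothetical helper: groups any iterable into arrays of at most $size items.
function inBatches(iterable $items, int $size): Generator
{
    $batch = [];

    foreach ($items as $item) {
        $batch[] = $item;

        if (count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }

    if ($batch !== []) {
        yield $batch; // final, possibly smaller, batch
    }
}
```

With these in place, a loop such as foreach (inBatches(readLines('shakespeare.txt'), 10000) as $lines) would hold at most 10,000 lines in memory at once, and each $lines array could be handed to whatever queue manager the application uses.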

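Piping Between Files

In situations where we don't need to operate on the data, we can pass it straight from one file to another. This is commonly called piping (presumably because we only see what goes in and comes out at each end, not what is inside the pipe... as long as it's opaque, of course). We can achieve this with stream functions. Let's first write a script that transfers one file to another, so that we can measure the memory usage: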

// from piping-files-1.php

file_put_contents(
 'piping-files-1.txt', file_get_contents('shakespeare.txt')
);

require 'memory.php';

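Unsurprisingly, this script uses more memory than the size of the file it copies. That's because it has to read the whole file, keep its contents in memory, and then write them to the new file. For small files that may be fine. When we start handling bigger files, not so much...

Let's try streaming (or piping) from one file to another: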

// from piping-files-2.php

$handle1 = fopen('shakespeare.txt', 'r');
$handle2 = fopen('piping-files-2.txt', 'w');

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require 'memory.php';

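This code is slightly strange. We open handles to both files, the first in read mode and the second in write mode. Then we copy from the first into the second, and finish by closing both handles. It may surprise you that the memory used is only 393KB.

That figure seems familiar. Isn't it what the generator code used when reading line by line? Both approaches avoid holding the whole file in memory: fgets returns at most one line per call (its optional second argument caps how many bytes of the line are read), so only the current line has to be kept.

stream_copy_to_stream takes a similar optional third argument, a maximum number of bytes to copy, and by default copies everything that remains in the source. It reads the source a chunk at a time and writes each chunk to the destination before reading the next. It skips the part where the generator yields a value, since we don't need to work with that value.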
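As a conceptual model of what a streamed copy does, the loop below moves data in fixed-size chunks with fread and fwrite. It is an illustration only, not PHP's internal implementation; copyInChunks and the 8192-byte default are assumptions made for this sketch.

```php
<?php
// Copies $source to $destination in fixed-size chunks, so at most $chunkSize
// bytes of file data are held in a PHP string at any moment.
// Returns the number of bytes written.
function copyInChunks(string $source, string $destination, int $chunkSize = 8192): int
{
    $in = fopen($source, 'rb');
    $out = fopen($destination, 'wb');

    if ($in === false || $out === false) {
        throw new RuntimeException('Unable to open source or destination');
    }

    $copied = 0;

    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);

        if ($chunk === false || $chunk === '') {
            break; // nothing more to read
        }

        $copied += (int) fwrite($out, $chunk);
    }

    fclose($in);
    fclose($out);

    return $copied;
}
```

In real code stream_copy_to_stream is the better choice, since it does the same job without the loop living in userland.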

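Piping this text isn't very useful to us, so let's think of other examples that might be. Suppose we want to output an image from our CDN, as a sort of redirected application route. The code could look like this: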

// from piping-files-3.php

file_put_contents(
 'piping-files-3.jpeg', file_get_contents(
 'https://github.com/assertchris/uploads/raw/master/rick.jpg'
 )
);

// ...or write this straight to stdout, if we don't need the memory info

require 'memory.php';

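Imagine an application route brought us to this code. Instead of serving a file from the local file system, we want to fetch it from a CDN. We might replace file_get_contents with something more elegant (like Guzzle), but under the hood the effect is much the same.

The memory usage for this image is around 581KB. Now, how about we try to stream it instead?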

// from piping-files-4.php

$handle1 = fopen(
 'https://github.com/assertchris/uploads/raw/master/rick.jpg', 'r'
);

$handle2 = fopen(
 'piping-files-4.jpeg', 'w'
);

// ...or write this straight to stdout, if we don't need the memory info

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require 'memory.php';

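The memory usage is slightly lower (400KB), but the result is the same. If we didn't need the memory information, we could just as well print to standard output. In fact, PHP provides a simple way to do this: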

$handle1 = fopen(
 'https://github.com/assertchris/uploads/raw/master/rick.jpg', 'r'
);

$handle2 = fopen(
 'php://stdout', 'w'
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

// require 'memory.php';
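In a web request we would usually target php://output rather than php://stdout, because php://output goes through PHP's normal output layer in the same way as print and echo. The following is a minimal sketch under that assumption; sendFile is a hypothetical helper name, and it uses a local path so that it can be tried without network access.

```php
<?php
// Hypothetical helper: streams a file to PHP's output layer without loading
// it into memory first. Returns the number of bytes copied.
function sendFile(string $path): int
{
    $in = fopen($path, 'rb');
    $out = fopen('php://output', 'wb');

    if ($in === false || $out === false) {
        throw new RuntimeException("Unable to stream $path");
    }

    $bytes = stream_copy_to_stream($in, $out);

    fclose($in);
    fclose($out);

    return (int) $bytes;
}
```

Any headers (such as Content-Type) would need to be sent with header() before calling a helper like this, since the body starts flowing immediately.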

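Other Streams

There are a few other streams we can pipe to, write to, or read from:

php://stdin (read-only)
php://stderr (write-only, like php://stdout)
php://input (read-only), which gives access to the raw request body
php://output (write-only), which writes to the output buffer
php://memory and php://temp (read-write), places to store temporary data. The difference is that php://temp moves the data to a temporary file once it grows past a limit (2MB by default), whereas php://memory always keeps it in memory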
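As a short sketch of php://temp in use, the helper below buffers lines into a temporary stream and hands back the rewound handle. The name bufferLines is hypothetical; the maxmemory suffix on the stream URL sets how many bytes are kept in memory before PHP switches to a temporary file.

```php
<?php
// Hypothetical helper: writes each line to a php://temp stream that stays in
// memory up to $maxMemory bytes and spills to a temporary file beyond that.
// Returns the stream handle, rewound and ready for reading.
function bufferLines(iterable $lines, int $maxMemory = 1048576)
{
    $buffer = fopen("php://temp/maxmemory:$maxMemory", 'w+b');

    foreach ($lines as $line) {
        fwrite($buffer, $line . PHP_EOL);
    }

    rewind($buffer);

    return $buffer;
}
```

The returned handle can be passed to stream_copy_to_stream like any file handle, and the temporary data disappears when the handle is closed.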

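Filters

There's another trick we can use with streams, called filters. They are a kind of in-between step that gives us a little control over the stream data without exposing it to us. Imagine we want to compress shakespeare.txt. We might use the Zip extension: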

// from filters-1.php

$zip = new ZipArchive();
$filename = 'filters-1.zip';

$zip->open($filename, ZipArchive::CREATE);
$zip->addFromString('shakespeare.txt', file_get_contents('shakespeare.txt'));
$zip->close();

require 'memory.php';

This is a neat bit of code, but it uses around 10.75MB. We can do better with filters:

// from filters-2.php

$handle1 = fopen(
 'php://filter/zlib.deflate/resource=shakespeare.txt', 'r'
);

$handle2 = fopen(
 'filters-2.deflated', 'w'
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require 'memory.php';

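Here we use the php://filter/zlib.deflate filter, which reads and compresses the contents of a resource. We can then pipe the compressed data into another file. This uses only 896KB.

I know this is not the same format, and that there are upsides to making a zip archive. You have to wonder, though: if choosing a different format saves 12 times the memory, why wouldn't you?

To decompress the data, we can run the deflated file back through another zlib filter: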

// from filters-2.php

file_get_contents(
 'php://filter/zlib.inflate/resource=filters-2.deflated'
);

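Streams have been covered extensively in "Understanding Streams in PHP" and "Using PHP Streams Effectively". If you'd like to know more, check those out.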
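Filters can also be attached to a handle that is already open, using stream_filter_append, instead of being written into a php://filter URL. The sketch below compresses on the write side and reads the result back through zlib.inflate. It assumes the zlib extension is loaded; deflateFile and inflateFile are hypothetical helper names.

```php
<?php
// Compresses $source into $destination by attaching zlib.deflate to the
// write side of the destination handle.
function deflateFile(string $source, string $destination): void
{
    $in = fopen($source, 'rb');
    $out = fopen($destination, 'wb');

    if ($in === false || $out === false) {
        throw new RuntimeException('Unable to open source or destination');
    }

    stream_filter_append($out, 'zlib.deflate', STREAM_FILTER_WRITE);
    stream_copy_to_stream($in, $out);

    fclose($in);
    fclose($out); // closing flushes whatever the filter still buffers
}

// Reads a deflated file back through the matching decompression filter.
// Note that this loads the whole decompressed content into memory.
function inflateFile(string $path): string
{
    return file_get_contents('php://filter/read=zlib.inflate/resource=' . $path);
}
```

For genuinely large files, the inflate step would be streamed to another handle with stream_copy_to_stream as well, rather than returned as one string.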

Customizing Streams

fopen and file_get_contents have their own default options, but these are completely customizable. To define them, we create a new stream context:

// from creating-contexts-1.php

$data = join('&', [
    'twitter=assertchris',
]);

// Header lines must be separated by a real CRLF, hence the double quotes
$headers = join("\r\n", [
    'Content-type: application/x-www-form-urlencoded',
    'Content-length: ' . strlen($data),
]);

$options = [
    'http' => [
        'method' => 'POST',
        'header' => $headers,
        'content' => $data,
    ],
];

$context = stream_context_create($options);

$handle = fopen('https://example.com/register', 'r', false, $context);
$response = stream_get_contents($handle);

fclose($handle);

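In this example, we're making a POST request to an API. The API endpoint is secure, but we still use the http context property (it is used for both http and https). We set a few headers and open a file handle to the API. We can open the handle as read-only, since the context takes care of the writing.

There are lots of things we can customize, so it's best to check the documentation if you want to know more.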
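As one illustration of further customization, the sketch below wraps context creation in a helper and adds two more http options: timeout and ignore_errors. The helper name makePostContext and the chosen values are assumptions for this example. No request is sent here; the context is only built and inspected.

```php
<?php
// Hypothetical helper: builds an HTTP stream context for a form-encoded POST.
function makePostContext(array $fields, float $timeout = 5.0)
{
    $body = http_build_query($fields);

    $headers = implode("\r\n", [
        'Content-Type: application/x-www-form-urlencoded',
        'Content-Length: ' . strlen($body),
    ]);

    return stream_context_create([
        'http' => [
            'method' => 'POST',
            'header' => $headers,
            'content' => $body,
            'timeout' => $timeout,   // read timeout, in seconds
            'ignore_errors' => true, // still return the body on 4xx/5xx
        ],
    ]);
}

$context = makePostContext(['twitter' => 'assertchris']);
$options = stream_context_get_options($context);
```

The resulting $context can be passed as the fourth argument to fopen or the third argument to file_get_contents, exactly as in the example above.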

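Making Custom Protocols and Filters

Before we wrap up, let's talk about making custom protocols. If you look at the documentation, you'll find an example class to implement: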

Protocol {
 public resource $context;
 public __construct ( void )
 public __destruct ( void )
 public bool dir_closedir ( void )
 public bool dir_opendir ( string $path , int $options )
 public string dir_readdir ( void )
 public bool dir_rewinddir ( void )
 public bool mkdir ( string $path , int $mode , int $options )
 public bool rename ( string $path_from , string $path_to )
 public bool rmdir ( string $path , int $options )
 public resource stream_cast ( int $cast_as )
 public void stream_close ( void )
 public bool stream_eof ( void )
 public bool stream_flush ( void )
 public bool stream_lock ( int $operation )
 public bool stream_metadata ( string $path , int $option , mixed $value )
 public bool stream_open ( string $path , string $mode , int $options ,
 string &$opened_path )
 public string stream_read ( int $count )
 public bool stream_seek ( int $offset , int $whence = SEEK_SET )
 public bool stream_set_option ( int $option , int $arg1 , int $arg2 )
 public array stream_stat ( void )
 public int stream_tell ( void )
 public bool stream_truncate ( int $new_size )
 public int stream_write ( string $data )
 public bool unlink ( string $path )
 public array url_stat ( string $path , int $flags )
}

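We're not going to implement one of these here, since I think it deserves its own tutorial. There's a lot of work to be done. But once that work is done, we can register our stream wrapper quite easily: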

if (in_array('highlight-names', stream_get_wrappers())) {
 stream_wrapper_unregister('highlight-names');
}

stream_wrapper_register('highlight-names', 'HighlightNamesProtocol');

$highlighted = file_get_contents('highlight-names://story.txt');
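To give a feel for the shape of such a class without implementing the whole interface, here is a deliberately tiny read-only wrapper. It is a sketch only: GreetingProtocol and the greet scheme are hypothetical, and just the handful of methods needed for file_get_contents are provided.

```php
<?php
// Minimal read-only stream wrapper. GreetingProtocol and the "greet" scheme
// are hypothetical names used only for this illustration.
class GreetingProtocol
{
    public $context; // set by PHP when a context is passed to fopen()

    private $data = '';
    private $position = 0;

    public function stream_open($path, $mode, $options, &$opened_path)
    {
        // For "greet://world", the host part of the URL is "world".
        $name = (string) parse_url($path, PHP_URL_HOST);
        $this->data = "Hello, $name!";
        $this->position = 0;

        return true;
    }

    public function stream_read($count)
    {
        $chunk = (string) substr($this->data, $this->position, $count);
        $this->position += strlen($chunk);

        return $chunk;
    }

    public function stream_eof()
    {
        return $this->position >= strlen($this->data);
    }

    public function stream_stat()
    {
        return []; // no size or timestamps to report
    }
}

if (!in_array('greet', stream_get_wrappers())) {
    stream_wrapper_register('greet', GreetingProtocol::class);
}

$greeting = file_get_contents('greet://world');
```

A wrapper that supports writing, seeking, or directory operations needs the corresponding methods from the class synopsis above; this one would fail on anything other than a plain read.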

Similarly, it's possible to create custom stream filters. The documentation has an example filter class:

Filter {
 public $filtername;
 public $params;
 public int filter ( resource $in , resource $out , int &$consumed ,
 bool $closing )
 public void onClose ( void )
 public bool onCreate ( void )
}

It can be attached just as easily:

$handle = fopen('story.txt', 'w+');
stream_filter_append($handle, 'highlight-names', STREAM_FILTER_READ);

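The name highlight-names needs to match the filtername property of the new filter class. It's also possible to use custom filters in a php://filter/highlight-names/resource=story.txt string. Defining filters is much easier than defining protocols. One reason is that protocols need to handle directory operations, whereas filters only need to handle each chunk of data.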
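A custom filter has to be registered with stream_filter_register before it can be appended by name. The sketch below shows the whole round trip with a trivial filter that upper-cases data; UppercaseFilter, the example.uppercase name, and the shout helper are all hypothetical stand-ins for the highlight-names filter mentioned above.

```php
<?php
// Hypothetical filter for illustration: upper-cases everything passing through.
class UppercaseFilter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing): int
    {
        // Data arrives in "buckets"; move each one from $in to $out.
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = strtoupper($bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }
}

// The name registered here is the one later passed to stream_filter_append().
if (!in_array('example.uppercase', stream_get_filters())) {
    stream_filter_register('example.uppercase', UppercaseFilter::class);
}

// Runs a string through the filter using an in-memory stream.
function shout(string $text): string
{
    $handle = fopen('php://memory', 'w+');
    fwrite($handle, $text);
    rewind($handle);

    stream_filter_append($handle, 'example.uppercase', STREAM_FILTER_READ);
    $result = stream_get_contents($handle);
    fclose($handle);

    return $result;
}
```

Because the filter only ever sees one bucket at a time, the same class works unchanged on a multi-gigabyte file handle.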

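If you have the gumption, I strongly encourage you to experiment with writing custom protocols and filters. If you can apply filters to stream_copy_to_stream operations, your applications will use next to no memory even when working with extremely large files. Imagine writing a filter that resizes images, or one that encrypts data for your application.

Summary

Though this isn't a problem we run into often, it's easy to get into trouble when reading large files. In asynchronous applications, it's just as easy to bring the whole server down when we're not careful about memory usage.

This tutorial has hopefully introduced you to a few new ideas (or refreshed your memory), so that you can think more carefully about reading and writing large files. Once we become familiar with streams and generators, and stop reaching for functions like file_get_contents, a whole category of puzzling errors disappears from our applications. That seems like a good thing to aim for!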
