Administering Impala

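As an administrator, you monitor Impala's use of resources and take action when necessary to keep Impala running smoothly and to avoid conflicts with other Hadoop components running on the same cluster. When you detect that a problem has happened or is about to happen, you reconfigure Impala or other components, such as HDFS or even the hardware of the cluster itself, to resolve or avoid it.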

Continue reading:

  • Using Resource Management with Impala [CDH 5 Only]
  • Managing Disk Space for Impala Data
  • Setting Timeout Periods for Queries and Sessions

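As an administrator, you can expect to perform installation, upgrade, and configuration tasks for Impala on all machines in a cluster. See Installing Cloudera Impala, Upgrading Impala, and Configuring Impala for details.

For additional security tasks typically performed by administrators, see Impala Security.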

Using Resource Management with Impala [CDH 5 Only]

You can limit the CPU and memory resources used by Impala, to manage and prioritize workloads on clusters that run jobs from many Hadoop components. (Currently, there is no limit or throttling on the I/O for Impala queries.) Impala uses the underlying Apache Hadoop YARN resource management framework, which allocates the required resources for each Impala query. Impala estimates the resources required by the query on each node of the cluster, and requests the resources from YARN. Requests from Impala to YARN go through an intermediary service, Llama (Low Latency Application Master). When the resource requests are granted, Impala starts the query, places all relevant execution threads into the CGroup containers, and sets up the memory limit on each node. If sufficient resources are not available, the Impala query waits until other jobs complete and the resources are freed.

Waiting for resources might make individual queries seem less responsive on a heavily loaded cluster, but the resource management feature makes the overall performance of the cluster smoother and more predictable, without sudden spikes in utilization due to memory paging, saturated I/O channels, CPUs pegged at 100%, and so on.

Checking Resource Estimates and Actual Usage

To make resource usage easier to verify, the output of the EXPLAIN SQL statement now includes information about estimated memory usage, whether table and column statistics are available for each table, and the number of virtual cores that a query will use. You can get this information through the EXPLAIN statement without actually running the query. The extra information requires setting the query option EXPLAIN_LEVEL=verbose; see EXPLAIN Statement for details. The same extended information is shown at the start of the output from the PROFILE statement in impala-shell. The detailed profile information is only available after running the query. You can take appropriate actions (gathering statistics, adjusting query options) if you find that queries fail or run with suboptimal performance when resource management is enabled.
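As a minimal sketch of the check described above, the following script writes the two statements to a temporary file and passes the file to impala-shell with its -f option. The table name sales is hypothetical, and the script only prints the statements when impala-shell is not on the PATH.

```shell
# Hypothetical table name (sales); replace it with a table in your cluster.
explain_sql="$(mktemp)"
cat > "$explain_sql" <<'EOF'
SET EXPLAIN_LEVEL=verbose;
EXPLAIN SELECT COUNT(*) FROM sales;
EOF

# Run the statements only when impala-shell is installed;
# otherwise just show what would be submitted.
if command -v impala-shell >/dev/null 2>&1; then
  impala-shell -f "$explain_sql"
else
  cat "$explain_sql"
fi
```

Because EXPLAIN does not run the query, this is a cheap way to confirm that statistics are present before submitting an expensive statement.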

How Resource Limits Are Enforced

  • CPU limits are enforced by the Linux CGroups mechanism. YARN grants resources in the form of containers that correspond to CGroups on the respective machines.
  • Memory is enforced by Impala's query memory limits. Once a reservation request has been granted, Impala sets the query memory limit according to the granted amount of memory before executing the query.

Enabling Resource Management for Impala

To enable resource management for Impala, first you set up the YARN and Llama services for your CDH cluster. Then you add startup options and customize resource management settings for the Impala services.

Required CDH Setup for Resource Management with Impala

YARN is the general-purpose service that manages resources for many Hadoop components within a CDH cluster. Llama is a specialized service that acts as an intermediary between Impala and YARN, translating Impala resource requests to YARN and coordinating with Impala so that queries only begin executing when all needed resources have been granted by YARN.

For information about setting up the YARN and Llama services, see the instructions for YARN and Llama in the CDH 5 Installation Guide.

impalad Startup Options for Resource Management

The following startup options for impalad enable resource management and customize its parameters for your cluster configuration:
  • -enable_rm: Whether to enable resource management or not, either true or false. The default is false. None of the other resource management options have any effect unless -enable_rm is turned on.
  • -llama_host: Hostname or IP address of the Llama service that Impala should connect to. The default is 127.0.0.1.
  • -llama_port: Port of the Llama service that Impala should connect to. The default is 15000.
  • -llama_callback_port: Port that Impala should start its Llama callback service on. Llama reports when resources are granted or preempted through that service.
  • -cgroup_hierarchy_path: Path where YARN and Llama will create CGroups for granted resources. Impala assumes that the CGroup for an allocated container is created in the path 'cgroup_hierarchy_path + container_id'.
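The options above could be combined along the following lines. This is only a sketch: the Llama host name, the callback port, and the CGroup path are placeholder values chosen for illustration, not defaults taken from the Impala documentation, and the script prints the resulting command line rather than starting the daemon.

```shell
# Placeholder values: llama.example.com, 28000, and /cgroup/cpu are assumptions.
# Only -llama_port=15000 matches a default stated in the option list.
IMPALAD_RM_ARGS="-enable_rm=true \
-llama_host=llama.example.com \
-llama_port=15000 \
-llama_callback_port=28000 \
-cgroup_hierarchy_path=/cgroup/cpu"

# Print the command line that would be used to start impalad.
echo "impalad ${IMPALAD_RM_ARGS}"
```

Keeping the flags in one variable makes it easy to paste them into whatever mechanism your installation uses for impalad startup options.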

impala-shell Query Options for Resource Management

Before issuing SQL statements through the impala-shell interpreter, you can use the SET command to configure the following parameters related to resource management:

EXPLAIN_LEVEL

Setting this option to verbose or 1 enables extra information in the output of the EXPLAIN command. Setting the option to normal or 0 suppresses the extra information. The extended information is especially useful during performance tuning, when you need to confirm if table and column statistics are available for a query. The extended information also helps to check estimated resource usage when you use the resource management feature in CDH 5. See EXPLAIN Statement for details about the extended information and how to use it.

MEM_LIMIT

When resource management is not enabled, defines the maximum amount of memory a query can allocate on each node. If query processing exceeds the specified memory limit on any node, Impala cancels the query automatically. Memory limits are checked periodically during query processing, so the actual memory in use might briefly exceed the limit without the query being cancelled.

When resource management is enabled in CDH 5, the mechanism for this option changes. If set, it overrides the automatic memory estimate from Impala. Impala requests this amount of memory from YARN on each node, and the query does not proceed until that much memory is available. The actual memory used by the query could be lower, since some queries use much less memory than others. With resource management, the MEM_LIMIT setting acts both as a hard limit on the amount of memory a query can use on any node (enforced by YARN) and as a guarantee that that much memory will be available on each node while the query is being executed.

When resource management is enabled but no MEM_LIMIT setting is specified, Impala estimates the amount of memory needed on each node for each query, requests that much memory from YARN before starting the query, and then internally sets the MEM_LIMIT on each node to the requested amount of memory during the query. Thus, if the query takes more memory than was originally estimated, Impala detects that the MEM_LIMIT is exceeded and cancels the query itself.

Default: 0
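A small sketch of building the SET statement for this option: the 2 GiB figure is an arbitrary illustration, and the value is expressed in bytes on the assumption that a plain byte count is accepted by your Impala version.

```shell
# Assumption: a 2 GiB per-node limit, chosen only for illustration.
mem_limit_bytes=$((2 * 1024 * 1024 * 1024))

# Statement to issue in impala-shell before the query it should apply to.
mem_limit_stmt="SET MEM_LIMIT=${mem_limit_bytes};"
echo "$mem_limit_stmt"
```

If a query is cancelled for exceeding its limit, one approach is to raise mem_limit_bytes using the usage reported by PROFILE as a guide and then submit the query again.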

RESERVATION_REQUEST_TIMEOUT [CDH 5 only]

Maximum number of milliseconds Impala will wait for a reservation to be completely granted or denied. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher with CDH 5.

Default: 300000 (5 minutes)

V_CPU_CORES [CDH 5 only]

The number of per-host virtual CPU cores to request from YARN. If set, the query option overrides the automatic estimate from Impala. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5.

Default: 0 (use automatic estimates)

YARN_POOL [CDH 5 only]

The YARN pool/queue name that queries should be submitted to. Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5. Specifies the name of the pool used by resource requests from Impala to the YARN resource management framework.

Default: empty (use the user-to-pool mapping defined by an impalad startup option in the Impala configuration file)
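The three CDH 5 options above might be set together before a query, roughly as follows. The pool name analytics, the table name sales, and the numeric values are all hypothetical; the script writes the statements to a temporary file that could be passed to impala-shell with -f.

```shell
# Hypothetical values: a 60-second reservation timeout, 2 virtual cores per host,
# and a YARN pool named "analytics". The table name (sales) is also hypothetical.
rm_options_sql="$(mktemp)"
cat > "$rm_options_sql" <<'EOF'
SET RESERVATION_REQUEST_TIMEOUT=60000;
SET V_CPU_CORES=2;
SET YARN_POOL=analytics;
SELECT COUNT(*) FROM sales;
EOF

# Show the statements; submit them with: impala-shell -f "$rm_options_sql"
cat "$rm_options_sql"
```

Leaving V_CPU_CORES and YARN_POOL unset keeps the automatic estimate and the user-to-pool mapping described above.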

Limitations of Resource Management for Impala

Currently, the beta versions of CDH 5 and Impala have the following limitations for resource management of Impala queries:

  • The resource management feature is not available for a cluster that uses Kerberos authentication.
  • Table statistics are required, and column statistics are highly valuable, for Impala to produce accurate estimates of how much memory to request from YARN. See Table Statistics and Column Statistics for instructions on gathering both kinds of statistics, and EXPLAIN Statement for the extended EXPLAIN output where you can check that statistics are available for a specific table and set of columns.
  • If the Impala estimate of required memory is lower than is actually required for a query, Impala will cancel the query when it exceeds the requested memory size. This could happen in some cases with complex queries, even when table and column statistics are available. You can see the actual memory usage after a failed query by issuing a PROFILE command in impala-shell. Specify a larger memory figure with the MEM_LIMIT query option and re-try the query.

    Currently, there are known bugs that could cause the maximum memory usage reported by the PROFILE command to be lower than the actual value.

  • The MEM_LIMIT query option, and the other resource-related query options, are not currently settable through the ODBC or JDBC interfaces.

Managing Disk Space for Impala Data

Although Impala typically works with many large files in an HDFS storage system with plenty of capacity, there are times when you might perform some file cleanup to reclaim space, or advise developers on techniques to minimize space consumption and file duplication.

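  • Use the DROP TABLE statement to remove the data files for Impala-managed ("internal") tables.
  • Be aware of the HDFS trashcan.
  • Use the DESCRIBE FORMATTED statement to check the physical location in HDFS of the data files for a table.
  • Use external tables to reference HDFS data files in their original location. With this technique, you avoid copying the files, and you can map more than one Impala table to the same set of data files. When you drop the table, the data files are left undisturbed.
  • Use the LOAD DATA statement to move HDFS files under Impala control.
  • Drop all tables in a database before dropping the database itself.
  • Use compact binary file formats where practical.
  • Clean up temporary files after failed INSERT statements.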
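Several of the statements named in the list can be sketched as follows. All table names, column definitions, and HDFS paths here are hypothetical, and the script only writes the statements to a temporary file for review.

```shell
# Hypothetical tables (sales, sales_raw) and HDFS paths (/user/etl/...).
# Statement 1 shows where a table's data files live in HDFS.
# Statement 2 maps an external table onto existing files without copying them.
# Statement 3 moves HDFS files into a table that Impala manages.
# Statement 4 drops the external table; its data files stay in place.
space_sql="$(mktemp)"
cat > "$space_sql" <<'EOF'
DESCRIBE FORMATTED sales;
CREATE EXTERNAL TABLE sales_raw (id BIGINT, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/etl/sales_raw';
LOAD DATA INPATH '/user/etl/incoming' INTO TABLE sales;
DROP TABLE sales_raw;
EOF

# Review the statements; submit them with: impala-shell -f "$space_sql"
cat "$space_sql"
```

Running DROP TABLE against an internal table instead would remove its data files, which is the cleanup path the first bullet describes.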

Setting Timeout Periods for Queries and Sessions

To keep long-running queries or idle sessions from tying up cluster resources, you can set timeout intervals for both individual queries and entire sessions. Specify the following startup options for the impalad daemon:

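  • The --idle_query_timeout option specifies the time in seconds after which an idle query is cancelled. This could be a query whose results were all fetched but was never closed, or one whose results were partially fetched and then the client program stopped requesting further results. This condition is most likely to occur in a client program using the JDBC or ODBC interfaces, rather than in the interactive impala-shell interpreter. Once the query is cancelled, the client program cannot retrieve any further results.
  • The --idle_session_timeout option specifies the time in seconds after which an idle session expires. A session is idle when no activity is occurring for any of the queries in that session, and the session has not started any new queries. Once a session has expired, you cannot issue any new query requests to it. The session remains open, but the only operation you can perform is to close it. The default value of 0 means that sessions never expire.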
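As an illustration of the two options, the following sketch cancels idle queries after 10 minutes and expires idle sessions after 1 hour. Both values are arbitrary examples, and the script prints the command line instead of starting the daemon.

```shell
# Example values only: 600 seconds for idle queries, 3600 seconds for idle sessions.
IMPALAD_TIMEOUT_ARGS="--idle_query_timeout=600 --idle_session_timeout=3600"

# Print the command line that would be used to start impalad.
echo "impalad ${IMPALAD_TIMEOUT_ARGS}"
```

A session timeout longer than the query timeout lets abandoned queries be cleaned up first while still giving interactive users time between statements.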

For instructions on changing impalad startup options, see Modifying Impala Startup Options.
