JVM Bug:多個線程持有一把鎖?

JVM線程dump Bug描述

  在JAVA語言中,當同步塊(Synchronized)被多個線程併發訪問時,JVM中會採用基於互斥實現的重量級鎖。JVM最多隻容許一個線程持有這把鎖,若是其它線程想要得到這把鎖就必須處於等待狀態,也就是說在同步塊被併發訪問時,最多隻會有一個處於RUNNABLE狀態的線程持有某把鎖,而另外的線程由於競爭不到這把鎖而都處於BLOCKED狀態。然而有些時候咱們會發現處於BLOCKED狀態的線程,它的最上面那一幀在打印其正在等待的鎖對象時,竟然也會出現-locked的信息,這個信息和持有該鎖的線程打印出來的結果是同樣的(請看下圖),可是對比其餘BLOCKED態的線程卻並無都出現這種狀況。當咱們再次dump線程時又可能出現不同的結果。測試代表這多是一個偶發的狀況,本文就是針對這種狀況對JVM內部的實現作了一個研究以尋找其根源。image.pngjava

JStack命令的整個過程

  上面提到了線程dump,那麼就不得不提執行線程dump的工具—jstack,這個工具是Java自帶的工具,和Java處於同一個目錄下,主要是用來dump線程的,或許你們也有使用kill -3的命令來dump線程,但這二者最明顯的一個區別是,前者的dump內容是由jstack這個進程來輸出的,目標JVM進程將dump內容發給jstack進程(注意這是沒有加-m參數的場景,指定-m參數就有點不同了,它使用的是serviceability agent的api來實現的,底層經過ptrace的方式來獲取目標進程的內容,執行過程可能會比正常模式更長點),這意味着能夠作文件重定向,將線程dump內容輸出到指定文件裏;然後者是由目標進程輸出的,只會產生在目標進程的標準輸出文件裏,若是正巧標準輸出裏自己就有內容的話,看起來會比較亂,好比想經過一些分析工具去分析的話,要是該工具沒有作過濾操做,極可能沒法分析。所以通常狀況咱們儘可能使用jstack,另外jstack還有不少實用的參數,好比jstack pid >thread_dump.log,該命令會將指定pid的進程的線程dump到當前目錄的thread_dump.log文件裏。api

  jstack是使用Java實現的,它經過給目標JVM進程發送一個threaddump的命令,目標JVM的監聽線程(attachListener)會實時監聽傳過來的命令(其實attachListener線程並非一啓動就建立的,它是lazy建立啓動的),當attachListener收到threaddump命令時會調用thread_dump的方法來處理dump操做(方法在attachListener.cpp裏)。安全

static jint thread_dump(AttachOperation* op, outputStream* out) {
  bool print_concurrent_locks = false;
  if (op->arg(0) != NULL && strcmp(op->arg(0), "-l") == 0) {
    print_concurrent_locks = true;
  }

  // thread stacks
  VM_PrintThreads op1(out, print_concurrent_locks);
  VMThread::execute(&op1);

  // JNI global handles
  VM_PrintJNI op2(out);
  VMThread::execute(&op2);

  // Deadlock detection
  VM_FindDeadlocks op3(out);
  VMThread::execute(&op3);

  return JNI_OK;
}複製代碼

從上面的方法能夠看到,jstack命令執行了三個操做:數據結構

  • VM_PrintThreads:打印線程棧
  • VM_PrintJNI:打印JNI
  • VM_FindDeadlocks:打印死鎖

三個操做都是交給VMThread線程去執行的,VMThread線程在整個JAVA進程有且只會有一個。能夠想象一下VMThread線程的簡單執行過程:不斷地輪詢某個任務列表並在有任務時依次執行任務。任務執行時,它會根據具體的任務決定是否會暫停整個應用,也就是stop the world,這是否是讓咱們聯想到了咱們熟悉的GC過程?是的,咱們的ygc以及cmsgc的兩個暫停應用的階段(init_mark和remark)都是由這個線程來執行的,而且都要求暫停整個應用。其實上面的三個操做都是要求暫停整個應用的,也就是說jstack觸發的線程dump過程也是會暫停應用的,只是這個過程通常很快就結束,不會有明顯的感受。另外內存dump的jmap命令,也是會暫停整個應用的,若是使用了-F的參數,其底層也是使用serviceability agent的api來dump的,可是dump內存的速度會明顯慢不少。多線程

VMThread執行任務的過程

  VMThread執行的任務稱爲vmopration,在JVM中存在兩種vmopration,一種是須要在安全點內執行的(所謂安全點,就是系統處於一個安全的狀態,除了VMThread這個線程能夠正常運行以外,其餘的線程都必須暫停執行,在這種狀況下就能夠放心執行當前的一系列vm_opration了),另一種是不須要在安全點內執行的。而此次咱們討論的線程dump是須要在安全點內執行的。併發

  如下是VMThread輪詢的邏輯:jvm

void VMThread::loop() {
  assert(_cur_vm_operation == NULL, "no current one should be executing");

  while(true) {
    ...
    //已經獲取了一個vm_operation
    if (_cur_vm_operation->evaluate_at_safepoint()) {
        //若是該vm_operation須要在安全點內執行
        _vm_queue->set_drain_list(safepoint_ops); 
        SafepointSynchronize::begin();//進入安全點
        evaluate_operation(_cur_vm_operation);
        do {
          _cur_vm_operation = safepoint_ops;
          if (_cur_vm_operation != NULL) {
            do {
              VM_Operation* next = _cur_vm_operation->next();
              _vm_queue->set_drain_list(next);
              evaluate_operation(_cur_vm_operation);
              _cur_vm_operation = next;
              if (PrintSafepointStatistics) {
                SafepointSynchronize::inc_vmop_coalesced_count();
              }
            } while (_cur_vm_operation != NULL);
          }
          if (_vm_queue->peek_at_safepoint_priority()) {
            MutexLockerEx mu_queue(VMOperationQueue_lock,
                                     Mutex::_no_safepoint_check_flag);
            safepoint_ops = _vm_queue->drain_at_safepoint_priority();
          } else {
            safepoint_ops = NULL;
          }
        } while(safepoint_ops != NULL);
        _vm_queue->set_drain_list(NULL);
        SafepointSynchronize::end();//退出安全點
      } else {  // not a safepoint operation
        if (TraceLongCompiles) {
          elapsedTimer t;
          t.start();
          evaluate_operation(_cur_vm_operation);
          t.stop();
          double secs = t.seconds();
          if (secs * 1e3 > LongCompileThreshold) {
            tty->print_cr("vm %s: %3.7f secs]", _cur_vm_operation->name(), secs);
          }
        } else {
            evaluate_operation(_cur_vm_operation);
        }
        _cur_vm_operation = NULL;
      }
    }
    ...
  }複製代碼

在這裏重點解釋下在安全點內執行的vmopration的過程,VMThread經過不斷循環從vmqueue中獲取一個或者幾個須要在安全點內執行的vmopertion,而後在準備執行這些vmopration以前先經過調用SafepointSynchronize::begin()進入到安全點狀態,在執行完這些vmopration以後,調用SafepointSynchronize::end(),退出安全點模式,恢復以前暫停的全部線程讓他們繼續運行。對於安全點這塊的邏輯挺複雜的,僅僅須要記住在進入安全點模式的時候會持有Threadslock這把線程互斥鎖,對線程的操做都須要獲取到這把鎖才能繼續執行,而且還會設置安全點的狀態,若是正在進入安全點過程當中設置state爲synchronizing,當全部線程都徹底進入了安全點以後設置state爲synchronized狀態,退出的時候設置爲not_synchronized狀態。ide

void SafepointSynchronize::begin() {
  ...
  Threads_lock->lock();
  ...
  _state            = _synchronizing;
  ...
   _state = _synchronized;
...
}

void SafepointSynchronize::end() {
    assert(Threads_lock->owned_by_self(), "must hold Threads_lock");
    ...
    _state = _not_synchronized;
    ...
    Threads_lock->unlock();
}複製代碼

線程Dump中的VM_PrintThreads過程

  回到開頭提到的JVM線程Dump時的Bug,從咱們打印的結果來看也基本猜到了這個過程:遍歷每一個Java線程,而後再遍歷每一幀,打印該幀的一些信息(包括類,方法名,行數等),在打印完每一幀以後而後打印這幀已經關聯了的鎖信息,下面代碼就是打印每一個線程的過程:工具

void JavaThread::print_stack_on(outputStream* st) {
  if (!has_last_Java_frame()) return;
  ResourceMark rm;
  HandleMark   hm;

  RegisterMap reg_map(this);
  vframe* start_vf = last_java_vframe(®_map);
  int count = 0;
  for (vframe* f = start_vf; f; f = f->sender() ) {
    if (f->is_java_frame()) {
      javaVFrame* jvf = javaVFrame::cast(f);
      java_lang_Throwable::print_stack_element(st, jvf->method(), jvf->bci());
      if (JavaMonitorsInStackTrace) {
        jvf->print_lock_info_on(st, count);
      }
    } else {
      // Ignore non-Java frames
    }
    count++;
    if (MaxJavaStackTraceDepth == count) return;
  }
}複製代碼

和咱們此次問題相關的邏輯,也就是打印"-locked"的信息是正好是在jvf->print_lock_info_on(st, count)這行裏面,請看具體實現:oop

void javaVFrame::print_lock_info_on(outputStream* st, int frame_count) {
  ResourceMark rm;
  if (frame_count == 0) {
    if (method()->name() == vmSymbols::wait_name() &&
        instanceKlass::cast(method()->method_holder())->name() == vmSymbols::java_lang_Object()) {
      StackValueCollection* locs = locals();
      if (!locs->is_empty()) {
        StackValue* sv = locs->at(0);
        if (sv->type() == T_OBJECT) {
          Handle o = locs->at(0)->get_obj();
          print_locked_object_class_name(st, o, "waiting on");
        }
      }
    } else if (thread()->current_park_blocker() != NULL) {
      oop obj = thread()->current_park_blocker();
      Klass* k = Klass::cast(obj->klass());
      st->print_cr("\t- %s <" INTPTR_FORMAT "> (a %s)", "parking to wait for ", (address)obj, k->external_name());
    }
  }

  GrowableArray* mons = monitors();
  if (!mons->is_empty()) {
    bool found_first_monitor = false;
    for (int index = (mons->length()-1); index >= 0; index--) {
      MonitorInfo* monitor = mons->at(index);
      if (monitor->eliminated() && is_compiled_frame()) {
        if (monitor->owner_is_scalar_replaced()) {
          Klass* k = Klass::cast(monitor->owner_klass());
          st->print("\t- eliminated  (a %s)", k->external_name());
        } else {
          oop obj = monitor->owner();
          if (obj != NULL) {
            print_locked_object_class_name(st, obj, "eliminated");
          }
        }
        continue;
      }
      if (monitor->owner() != NULL) {
        const char *lock_state = "locked";
        if (!found_first_monitor && frame_count == 0) {
          markOop mark = monitor->owner()->mark();
          if (mark->has_monitor() &&
              mark->monitor() == thread()->current_pending_monitor()) {
            lock_state = "waiting to lock";
          }
        }
        found_first_monitor = true;
        print_locked_object_class_name(st, monitor->owner(), lock_state);
      }
    }
  }
}複製代碼

看到上面的方法,再對比線程dump的結果,咱們會發現不少熟悉的東西,好比waiting onparking to wait forlocked,waiting to lock,並且也清楚了它們分別是在什麼狀況下會打印的。

  那爲何咱們的例子中BLOCKED狀態的線程本應該打印waiting to lock,可是爲何卻打印了locked呢,那說明if (mark->has_monitor() && mark->monitor() == thread()->current_pending_monitor()) 這個條件確定不成立,那這個在什麼狀況下不成立呢?在驗證此問題前,有必要先了解下markOop是什麼東西,它是用來幹什麼的?

markOop是什麼

  markOop描述了一個對象(也包括了Class)的狀態信息,Java語法層面的每一個對象或者Class在JVM的結構表示中都會包含一個markOop做爲Header,固然還有一些其餘的JVM數據結構也用它作Header。markOop由32位或者64位構成,具體位數根據運行環境而定。

  下面的結構圖包含markOop每一位所表明的含義,markOop的值根據所描述的對象的類型(好比是鎖對象仍是正常的對象)以及做用的不一樣而不一樣。就算在同一個對象裏,它的值也是可能會不斷變化的,好比鎖對象,在一開始建立的時候其實並不知道是鎖對象,會當成一個正常對象來建立(在對象的類型並無設置偏向鎖的狀況下,其markOop值多是0x1),可是隨着咱們執行到synchronized的代碼邏輯時,就知道其實它是一個鎖對象了,它的值就再也不是0x1了,而是一個新的值,該值是對應棧幀結構裏的監控對象列表裏的某一個內存地址。

//  32 bits:
//  --------
//             hash:25 ------------>| age:4    biased_lock:1 lock:2 (normal object)
//             JavaThread*:23 epoch:2 age:4    biased_lock:1 lock:2 (biased object)
//             size:32 ------------------------------------------>| (CMS free block)
//             PromotedObject*:29 ---------->| promo_bits:3 ----->| (CMS promoted object)
//
//  64 bits:
//  --------
//  unused:25 hash:31 -->| unused:1   age:4    biased_lock:1 lock:2 (normal object)
//  JavaThread*:54 epoch:2 unused:1   age:4    biased_lock:1 lock:2 (biased object)
//  PromotedObject*:61 --------------------->| promo_bits:3 ----->| (CMS promoted object)
//  size:64 ----------------------------------------------------->| (CMS free block)
//
//  unused:25 hash:31 -->| cms_free:1 age:4    biased_lock:1 lock:2 (COOPs && normal object)
//  JavaThread*:54 epoch:2 cms_free:1 age:4    biased_lock:1 lock:2 (COOPs && biased object)
//  narrowOop:32 unused:24 cms_free:1 unused:4 promo_bits:3 ----->| (COOPs && CMS promoted object)
//  unused:21 size:35 -->| cms_free:1 unused:7 ------------------>| (COOPs && CMS free block)複製代碼

就最後的3位而言,其不一樣的值表明不一樣的含義:

enum { locked_value             = 0,//00
         unlocked_value           = 1,//01
         monitor_value            = 2,//10
         marked_value             = 3,//11      
         biased_lock_pattern      = 5 //101
  };複製代碼

上面的判斷條件「mark->has_monitor()」其實就是判斷最後的2位是否是10,若是是,則說明這個對象是一個監控對象,能夠經過mark->monitor()方法獲取到對應的結構體:

bool has_monitor() const {
    return ((value() & monitor_value) != 0);
  }
  ObjectMonitor* monitor() const {
    assert(has_monitor(), "check");
    // Use xor instead of &~ to provide one extra tag-bit check.
    return (ObjectMonitor*) (value() ^ monitor_value);
  }複製代碼

將一個普通對象轉換爲一個monitor對象的過程(就是替換markOop的值)請參考爲ObjectSynchronizer::inflate方法,能進入到該方法說明該鎖爲重量級鎖,也就是說這把鎖實際上是被多個線程競爭的。

  瞭解了markOop以後,還要了解下上面那個條件裏的thread()->current_pending_monitor(),也就是這個值是何時設置進去的呢?

線程設置等待的監控對象的時機

  設置的邏輯在ObjectMonitor::enter裏,關鍵代碼以下:

...
{
    JavaThreadBlockedOnMonitorEnterState jtbmes(jt, this);
    DTRACE_MONITOR_PROBE(contended__enter, this, object(), jt);
    if (JvmtiExport::should_post_monitor_contended_enter()) {
      JvmtiExport::post_monitor_contended_enter(jt, this);
    }
    OSThreadContendState osts(Self->osthread());
    ThreadBlockInVM tbivm(jt);
    Self->set_current_pending_monitor(this);//設置當前monitor對象爲當前線程等待的monitor對象
    for (;;) {
      jt->set_suspend_equivalent();
      EnterI (THREAD) ;
      if (!ExitSuspendEquivalent(jt)) break ;
          _recursions = 0 ;
      _succ = NULL ;
      exit (false, Self) ;

      jt->java_suspend_self();
    }
    Self->set_current_pending_monitor(NULL);
  }
 ... 複製代碼

設置當前線程等待的monitorObject是在有中文註釋的那一行設置的,那麼出現Bug的緣由是否是正好在設置以前進行了線程dump呢?

水落石出

  在JVM中只會有一個處於RUNNBALE狀態的線程,也就是說另一個打印"-locked"信息的線程是處於BLOCKED狀態的。上面的第一行代碼:

JavaThreadBlockedOnMonitorEnterState jtbmes(jt, this);複製代碼

找到其實現位置:

JavaThreadBlockedOnMonitorEnterState(JavaThread *java_thread, ObjectMonitor *obj_m) :
    JavaThreadStatusChanger(java_thread) {
    assert((java_thread != NULL), "Java thread should not be null here");
    _active = false;
    if (is_alive() && ServiceUtil::visible_oop((oop)obj_m->object()) && obj_m->contentions() > 0) {
      _stat = java_thread->get_thread_stat();
      _active = contended_enter_begin(java_thread);//關鍵處
    }
  }

 static bool contended_enter_begin(JavaThread *java_thread) {
    set_thread_status(java_thread, java_lang_Thread::BLOCKED_ON_MONITOR_ENTER);//關鍵處
    ThreadStatistics* stat = java_thread->get_thread_stat();
    stat->contended_enter();
    bool active = ThreadService::is_thread_monitoring_contention();
    if (active) {
      stat->contended_enter_begin();
    }
    return active;
  } 複製代碼

上面的contendedenterbegin方法會設置java線程的狀態爲java_lang_Thread::BLOCKED_ON_MONITOR_ENTER,而線程dump時根據這個狀態打印的結果以下:

const char* java_lang_Thread::thread_status_name(oop java_thread) {
  assert(JDK_Version::is_gte_jdk15x_version() && _thread_status_offset != 0, "Must have thread status");
  ThreadStatus status = (java_lang_Thread::ThreadStatus)java_thread->int_field(_thread_status_offset);
  switch (status) {
    case NEW                      : return "NEW";
    case RUNNABLE                 : return "RUNNABLE";
    case SLEEPING                 : return "TIMED_WAITING (sleeping)";
    case IN_OBJECT_WAIT           : return "WAITING (on object monitor)";
    case IN_OBJECT_WAIT_TIMED     : return "TIMED_WAITING (on object monitor)";
    case PARKED                   : return "WAITING (parking)";
    case PARKED_TIMED             : return "TIMED_WAITING (parking)";
    case BLOCKED_ON_MONITOR_ENTER : return "BLOCKED (on object monitor)";
    case TERMINATED               : return "TERMINATED";
    default                       : return "UNKNOWN";
  };
}複製代碼

正好對應咱們dump日誌中的信息"BLOCKED (on object monitor)"也就是說這行代碼被正常執行了,那問題就可能出在JavaThreadBlockedOnMonitorEnterState jtbmes(jt, this)Self->set_current_pending_monitor(this)這兩行代碼之間的邏輯裏了:

JavaThreadBlockedOnMonitorEnterState jtbmes(jt, this);
    DTRACE_MONITOR_PROBE(contended__enter, this, object(), jt);
    if (JvmtiExport::should_post_monitor_contended_enter()) {
      JvmtiExport::post_monitor_contended_enter(jt, this);
    }
    OSThreadContendState osts(Self->osthread());
    ThreadBlockInVM tbivm(jt);
    Self->set_current_pending_monitor(this);//設置當前monitor對象爲當前線程等待的monitor對象複製代碼

因而檢查每一行的實現,前面幾行都基本能夠排除了,由於它們都是很簡單的操做,下面來分析下ThreadBlockInVM tbivm(jt)這一行的實現:

ThreadBlockInVM(JavaThread *thread)
  : ThreadStateTransition(thread) {
    thread->frame_anchor()->make_walkable(thread);
    trans_and_fence(_thread_in_vm, _thread_blocked);
  }

 void trans_and_fence(JavaThreadState from, JavaThreadState to) { 
    transition_and_fence(_thread, from, to); 
 }

 static inline void transition_and_fence(JavaThread *thread, JavaThreadState from, JavaThreadState to) {
    assert(thread->thread_state() == from, "coming from wrong thread state");
    assert((from & 1) == 0 && (to & 1) == 0, "odd numbers are transitions states");
    thread->set_thread_state((JavaThreadState)(from + 1));
    if (os::is_MP()) {
      if (UseMembar) {
        OrderAccess::fence();
      } else {
        InterfaceSupport::serialize_memory(thread);
      }
    }

    if (SafepointSynchronize::do_call_back()) {
      SafepointSynchronize::block(thread);
    }
    thread->set_thread_state(to);
    CHECK_UNHANDLED_OOPS_ONLY(thread->clear_unhandled_oops();)
  }
 ...
 } 
 複製代碼

也許咱們看到可能形成問題的代碼了:

if (SafepointSynchronize::do_call_back()) {
      SafepointSynchronize::block(thread);
}複製代碼

想象一下,當這個線程正好執行到這個條件判斷,而後進去了,從方法名上來講是否是意味着這個線程會block住,而且不日後走了呢?這樣一來設置當前線程的pending_monitor對象的操做就不會被執行了,從而在打印這個線程棧的時候就會打印"-locked"信息了,那麼糾結是否正如咱們想的那樣呢?

首先來看條件SafepointSynchronize::do_call_back()是否必定會成立:

inline static bool do_call_back() {
    return (_state != _not_synchronized);
}複製代碼

  上面的VMThread執行任務的過程當中說到了這個狀態,當vmThread執行完了SafepointSynchronize::begin()以後,這個狀態是設置爲synchronized的。若是正在執行,那麼狀態是synchronizing,所以,當咱們觸發了jvm的線程dump以後,VMThread執行該操做,並且還在執行線程dump過程前,可是還只是synchronizing的狀態,那麼docall_back()將會返回true,那麼將執行接下來的SafepointSynchronize::block(thread)方法:

void SafepointSynchronize::block(JavaThread *thread) {
  assert(thread != NULL, "thread must be set");
  assert(thread->is_Java_thread(), "not a Java thread");

  ttyLocker::break_tty_lock_for_safepoint(os::current_thread_id());

  if (thread->is_terminated()) {
     thread->block_if_vm_exited();
     return;
  }

  JavaThreadState state = thread->thread_state();
  thread->frame_anchor()->make_walkable(thread);

  switch(state) {
    case _thread_in_vm_trans:
    case _thread_in_Java:        // From compiled code
      thread->set_thread_state(_thread_in_vm);

      if (is_synchronizing()) {
         Atomic::inc (&TryingToBlock) ;
      }
      Safepoint_lock->lock_without_safepoint_check();
      if (is_synchronizing()) {
        assert(_waiting_to_block > 0, "sanity check");
        _waiting_to_block--;
        thread->safepoint_state()->set_has_called_back(true);

        DEBUG_ONLY(thread->set_visited_for_critical_count(true));
        if (thread->in_critical()) {
          increment_jni_active_count();
        }
        if (_waiting_to_block == 0) {
          Safepoint_lock->notify_all();
        }
      }
      thread->set_thread_state(_thread_blocked);
      Safepoint_lock->unlock();
      Threads_lock->lock_without_safepoint_check();//關鍵代碼
      thread->set_thread_state(state);
      Threads_lock->unlock();
      break;
   ...
  }
  if (state != _thread_blocked_trans &&
      state != _thread_in_vm_trans &&
      thread->has_special_runtime_exit_condition()) {
    thread->handle_special_runtime_exit_condition(
      !thread->is_at_poll_safepoint() && (state != _thread_in_native_trans));
  }
}

void Monitor::lock_without_safepoint_check (Thread * Self) {
  assert (_owner != Self, "invariant") ;
  ILock (Self) ;
  assert (_owner == NULL, "invariant");
  set_owner (Self);
}

void Monitor::lock_without_safepoint_check () {
  lock_without_safepoint_check (Thread::current()) ;
}複製代碼

看到上面的實現能夠肯定,Java線程執行時會調用Threads_lock->lock_without_safepoint_check(),而Threadslock由於被VMThread持有,將一直卡死在ILock (Self)這個邏輯裏,從而沒有設置currentmonitor屬性,由此驗證了咱們的想法。

Bug修復

  在瞭解了緣由以後,咱們能夠簡單的修復這個Bug。將下面兩行代碼調換下位置便可:

ThreadBlockInVM tbivm(jt);
 Self->set_current_pending_monitor(this);//設置當前monitor對象爲當前線程等待的monitor對象複製代碼

該Bug不會對生產環境產生影響,本文主要是和你們分享分析問題的過程,但願你們碰到疑惑都能有一查到底的勁兒,帶着問題,不斷提出本身的猜測,而後不斷驗證本身的猜測,最終解決問題。

歡迎關注 PerfMa 社區,推薦閱讀:
Java多線程——併發測試
JVM源碼分析之自定義類加載器如何拉長YGC

相關文章
相關標籤/搜索