The Hao AI Lab research team at the University of California, San Diego, working at the forefront of pioneering AI model innovation, recently received an NVIDIA DGX B200 system to elevate its critical work in large language model inference.
Many LLM inference platforms in production today, such as NVIDIA Dynamo, use research ideas that originated in the Hao AI Lab, including DistServe.
How Is Hao AI Lab Using the DGX B200?

With the DGX B200 now fully accessible to the Hao AI Lab and the broader UC San Diego community at the School of Computing, Information and Data Sciences’ San Diego Supercomputer Center, the research opportunities are boundless.
“DGX B200 is one of the most powerful AI systems from NVIDIA to date, which means that its performance is among the best in the world,” said Hao Zhang, assistant professor in the Halıcıoğlu Data Science Institute and the department of computer science and engineering at UC San Diego. “It enables us to prototype and experiment much faster than using previous-generation hardware.”
Two Hao AI Lab projects the DGX B200 is accelerating are FastVideo and the Lmgame benchmark.
FastVideo focuses on training a family of video generation models to produce a five-second video based on a given text prompt, in just five seconds.
The research phase of FastVideo taps into NVIDIA H200 GPUs in addition to the DGX B200 system.
Lmgame-Bench is a benchmarking suite that puts LLMs to the test using popular video games, including Tetris and Super Mario Bros. Users can test one model at a time or pit two models against each other to measure their performance.

Other ongoing projects at the Hao AI Lab explore new ways to achieve low-latency LLM serving, pushing large language models toward real-time responsiveness.
“Our current research uses the DGX B200 to explore the next frontier of low-latency LLM serving on the advanced hardware specs the system gives us,” said Junda Chen, a doctoral candidate in computer science at UC San Diego.
How DistServe Influenced Disaggregated Serving
Disaggregated inference is a way to ensure large-scale LLM-serving engines can achieve optimal aggregate system throughput while maintaining acceptably low latency for user requests.
The benefit of disaggregated inference lies in optimizing what DistServe calls “goodput” instead of “throughput” in the LLM-serving engine.
Here’s the difference:
Throughput is measured by the number of tokens per second that the entire system can generate. Higher throughput means a lower cost to generate each token served to the user. For a long time, throughput was the only metric LLM-serving engines used to measure their performance against one another.
While throughput measures the aggregate performance of the system, it doesn’t directly correlate with the latency a user perceives. If a user demands lower latency for token generation, the system has to sacrifice throughput.
This natural trade-off between throughput and latency is what led the DistServe team to propose a new metric, “goodput”: the measure of throughput while satisfying user-specified latency objectives, also known as service-level objectives. In other words, goodput represents the overall health of a system while satisfying the user experience.
DistServe shows that goodput is a much better metric for LLM-serving systems, since it factors in both cost and service quality. Goodput leads to optimal efficiency and the best possible output from a model.
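To make the distinction concrete, here is a minimal Python sketch of how the two metrics could be computed over a batch of completed requests. The request records, SLO thresholds, and measurement window are illustrative assumptions for this article, not numbers or code from DistServe.

```python
# Minimal sketch: throughput counts every generated token, while goodput
# counts only tokens from requests that met their latency service-level
# objectives (SLOs). All values below are illustrative placeholders.

requests = [
    # (tokens_generated, time_to_first_token_s, time_per_output_token_s)
    (200, 0.3, 0.04),
    (150, 1.2, 0.05),   # slow prefill: misses the TTFT SLO
    (300, 0.4, 0.09),   # slow decode: misses the TPOT SLO
]

TTFT_SLO = 0.5   # max acceptable time to first token, in seconds
TPOT_SLO = 0.06  # max acceptable time per output token, in seconds
WINDOW_S = 10.0  # measurement window over which the requests completed

throughput = sum(tokens for tokens, _, _ in requests) / WINDOW_S

goodput = sum(
    tokens
    for tokens, ttft, tpot in requests
    if ttft <= TTFT_SLO and tpot <= TPOT_SLO
) / WINDOW_S

print(f"throughput: {throughput:.0f} tokens/s")  # counts all tokens
print(f"goodput:    {goodput:.0f} tokens/s")     # counts only SLO-compliant requests
```

In this toy example the system generates 65 tokens per second overall, but only the 20 tokens per second coming from the one SLO-compliant request count toward goodput.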
How Can Developers Achieve Optimal Goodput?
When a user makes a request in an LLM system, the system takes the user input and generates the first token, a step known as prefill. Then the system creates numerous output tokens one after another, predicting each token based on the tokens generated before it. This process is known as decode.
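As a rough illustration of those two phases, here is a minimal sketch in Python, assuming hypothetical model.prefill and model.decode_step helpers rather than any particular serving engine’s API:

```python
# Minimal sketch of the two inference phases, assuming hypothetical
# model.prefill() and model.decode_step() helpers (not a real serving API).

def generate(model, prompt_tokens, max_new_tokens, eos_token):
    # Prefill: process the whole prompt at once (compute intensive) and
    # produce the first output token plus the KV cache.
    first_token, kv_cache = model.prefill(prompt_tokens)
    output = [first_token]

    # Decode: generate the remaining tokens one at a time (memory intensive),
    # each step conditioned on the KV cache built from all previous tokens.
    while len(output) < max_new_tokens and output[-1] != eos_token:
        next_token, kv_cache = model.decode_step(output[-1], kv_cache)
        output.append(next_token)

    return output
```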
Prefill and decode have traditionally run on the same GPU, but the researchers behind DistServe found that splitting them onto different GPUs maximizes goodput.
“Previously, if you put these two jobs on a GPU, they would compete with each other for resources, which could make things slow from a user perspective,” Chen said. “Now, if I split the jobs onto two different sets of GPUs, one doing prefill, which is compute intensive, and the other doing decode, which is more memory intensive, we can essentially eliminate the interference between the two jobs, making both jobs run faster.”
This process is called prefill/decode disaggregation: separating prefill from decode to achieve better goodput.
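A bare-bones view of that split follows, with hypothetical worker pools, request objects, and the same model helpers as the sketch above; production frameworks such as NVIDIA Dynamo handle the KV-cache transfer, batching, and scheduling that this sketch glosses over.

```python
# Minimal sketch of prefill/decode disaggregation with hypothetical worker
# pools and request objects; not the architecture of any specific framework.

from queue import Queue

prefill_queue = Queue()  # new requests wait here for the prefill GPU pool
decode_queue = Queue()   # prefilled requests wait here for the decode GPU pool

def prefill_worker(model):
    # Runs on the compute-heavy prefill GPUs.
    while True:
        request = prefill_queue.get()
        first_token, kv_cache = model.prefill(request.prompt_tokens)
        # Hand off the KV cache so a decode GPU can continue the request.
        decode_queue.put((request, first_token, kv_cache))

def decode_worker(model):
    # Runs on the memory-heavy decode GPUs.
    while True:
        request, token, kv_cache = decode_queue.get()
        output = [token]
        while len(output) < request.max_new_tokens:
            token, kv_cache = model.decode_step(token, kv_cache)
            output.append(token)
        request.respond(output)
```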
Increasing goodput and using the disaggregated inference method enables the continuous scaling of workloads without compromising low-latency or high-quality model responses.
NVIDIA Dynamo, an open-source framework designed to accelerate and scale generative AI models at the highest efficiency levels and the lowest cost, enables scaling disaggregated inference.
In addition to these projects, cross-departmental collaborations, such as in healthcare and biology, are underway at UC San Diego to further optimize an array of research initiatives using the NVIDIA DGX B200, as researchers continue exploring how AI platforms can accelerate innovation.
Learn more about the NVIDIA DGX B200 system.
