Shutdown handler not registered because Python interpreter is not running in the main thread
run pipeline %s
run pipeline stage %s
Running pipeline stage MKMLizer
Starting job with name chaiml-llama-8b-multih-78780-v32-mkmlizer
Waiting for job on chaiml-llama-8b-multih-78780-v32-mkmlizer to finish
chaiml-llama-8b-multih-78780-v32-mkmlizer: ╔═════════════════════════════════════════════════════════════════════╗
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ _____ __ __ ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ / _/ /_ ___ __/ / ___ ___ / / ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ / _/ / // / |/|/ / _ \/ -_) -_) / ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ /_//_/\_, /|__,__/_//_/\__/\__/_/ ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ /___/ ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ Version: 0.12.8 ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ Copyright 2023 MK ONE TECHNOLOGIES Inc. ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ https://mk1.ai ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ The license key for the current software has been verified as ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ belonging to: ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ Chai Research Corp. ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ Account ID: 7997a29f-0ceb-4cc7-9adf-840c57b4ae6f ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ Expiration: 2025-04-15 23:59:59 ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ║ ║
chaiml-llama-8b-multih-78780-v32-mkmlizer: ╚═════════════════════════════════════════════════════════════════════╝
chaiml-llama-8b-multih-78780-v32-mkmlizer: Downloaded to shared memory in 24.498s
chaiml-llama-8b-multih-78780-v32-mkmlizer: quantizing model to /dev/shm/model_cache, profile:s0, folder:/tmp/tmpdbhsnr31, device:0
chaiml-llama-8b-multih-78780-v32-mkmlizer: Saving flywheel model at /dev/shm/model_cache
chaiml-llama-8b-multih-78780-v32-mkmlizer: quantized model in 16.129s
chaiml-llama-8b-multih-78780-v32-mkmlizer: Processed model ChaiML/llama_8b_multihead_204m_512_v3_tokens_step_398208 in 40.627s
chaiml-llama-8b-multih-78780-v32-mkmlizer: creating bucket guanaco-mkml-models
chaiml-llama-8b-multih-78780-v32-mkmlizer: cp /dev/shm/model_cache/special_tokens_map.json s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32/special_tokens_map.json
chaiml-llama-8b-multih-78780-v32-mkmlizer: cp /dev/shm/model_cache/config.json s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32/config.json
chaiml-llama-8b-multih-78780-v32-mkmlizer: cp /dev/shm/model_cache/tokenizer_config.json s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32/tokenizer_config.json
chaiml-llama-8b-multih-78780-v32-mkmlizer: cp /dev/shm/model_cache/tokenizer.json s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32/tokenizer.json
chaiml-llama-8b-multih-78780-v32-mkmlizer: cp /dev/shm/model_cache/flywheel_model.0.safetensors s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32/flywheel_model.0.safetensors
chaiml-llama-8b-multih-78780-v32-mkmlizer:
Loading 0: 0%| | 0/294 [00:00<?, ?it/s]
Loading 0: 98%|█████████▊| 289/294 [00:06<00:00, 39.03it/s]
Job chaiml-llama-8b-multih-78780-v32-mkmlizer completed after 73.52s with status: succeeded
Stopping job with name chaiml-llama-8b-multih-78780-v32-mkmlizer
Pipeline stage MKMLizer completed in 73.97s
run pipeline stage %s
Running pipeline stage MKMLTemplater
Pipeline stage MKMLTemplater completed in 0.15s
run pipeline stage %s
Running pipeline stage MKMLDeployer
Creating inference service chaiml-llama-8b-multih-78780-v32
Waiting for inference service chaiml-llama-8b-multih-78780-v32 to be ready
Inference service chaiml-llama-8b-multih-78780-v32 ready after 201.15743136405945s
Pipeline stage MKMLDeployer completed in 201.68s
run pipeline stage %s
Running pipeline stage StressChecker
Received healthy response to inference request in 3.5090925693511963s
Received healthy response to inference request in 2.509389638900757s
Received healthy response to inference request in 3.19208025932312s
Received healthy response to inference request in 4.429242849349976s
Received healthy response to inference request in 3.1971027851104736s
5 requests
0 failed requests
5th percentile: 2.6459277629852296
10th percentile: 2.7824658870697023
20th percentile: 3.0555421352386474
30th percentile: 3.1930847644805906
40th percentile: 3.195093774795532
50th percentile: 3.1971027851104736
60th percentile: 3.3218986988067627
70th percentile: 3.4466946125030518
80th percentile: 3.693122625350952
90th percentile: 4.061182737350464
95th percentile: 4.24521279335022
99th percentile: 4.392436838150024
mean time: 3.3673816204071043
Pipeline stage StressChecker completed in 18.36s
run pipeline stage %s
Running pipeline stage OfflineFamilyFriendlyTriggerPipeline
run_pipeline:run_in_cloud %s
starting trigger_guanaco_pipeline args=%s
triggered trigger_guanaco_pipeline args=%s
Pipeline stage OfflineFamilyFriendlyTriggerPipeline completed in 0.62s
run pipeline stage %s
Running pipeline stage TriggerMKMLProfilingPipeline
run_pipeline:run_in_cloud %s
starting trigger_guanaco_pipeline args=%s
triggered trigger_guanaco_pipeline args=%s
Pipeline stage TriggerMKMLProfilingPipeline completed in 0.67s
Shutdown handler de-registered
chaiml-llama-8b-multih_78780_v32 status is now deployed due to DeploymentManager action
Shutdown handler registered
run pipeline %s
run pipeline stage %s
Running pipeline stage MKMLProfilerDeleter
Skipping teardown as no inference service was successfully deployed
Pipeline stage MKMLProfilerDeleter completed in 0.09s
run pipeline stage %s
Running pipeline stage MKMLProfilerTemplater
Pipeline stage MKMLProfilerTemplater completed in 0.09s
run pipeline stage %s
Running pipeline stage MKMLProfilerDeployer
Creating inference service chaiml-llama-8b-multih-78780-v32-profiler
Waiting for inference service chaiml-llama-8b-multih-78780-v32-profiler to be ready
Shutdown handler registered
run pipeline %s
run pipeline stage %s
Running pipeline stage OfflineFamilyFriendlyScorer
Evaluating %s Family Friendly Score with %s threads
%s, retrying in %s seconds...
%s, retrying in %s seconds...
%s, retrying in %s seconds...
%s, retrying in %s seconds...
%s, retrying in %s seconds...
%s, retrying in %s seconds...
Shutdown handler registered
run pipeline %s
run pipeline stage %s
Running pipeline stage MKMLProfilerDeleter
Skipping teardown as no inference service was successfully deployed
Pipeline stage MKMLProfilerDeleter completed in 0.12s
run pipeline stage %s
Running pipeline stage MKMLProfilerTemplater
Pipeline stage MKMLProfilerTemplater completed in 0.10s
run pipeline stage %s
Running pipeline stage MKMLProfilerDeployer
Creating inference service chaiml-llama-8b-multih-78780-v32-profiler
Waiting for inference service chaiml-llama-8b-multih-78780-v32-profiler to be ready
Tearing down inference service chaiml-llama-8b-multih-78780-v32-profiler
%s, retrying in %s seconds...
Creating inference service chaiml-llama-8b-multih-78780-v32-profiler
Waiting for inference service chaiml-llama-8b-multih-78780-v32-profiler to be ready
Tearing down inference service chaiml-llama-8b-multih-78780-v32-profiler
%s, retrying in %s seconds...
Creating inference service chaiml-llama-8b-multih-78780-v32-profiler
Waiting for inference service chaiml-llama-8b-multih-78780-v32-profiler to be ready
Tearing down inference service chaiml-llama-8b-multih-78780-v32-profiler
clean up pipeline due to error=DeploymentError('Timeout to start the InferenceService chaiml-llama-8b-multih-78780-v32-profiler. The InferenceService is as following: {\'apiVersion\': \'serving.kserve.io/v1beta1\', \'kind\': \'InferenceService\', \'metadata\': {\'annotations\': {\'autoscaling.knative.dev/class\': \'hpa.autoscaling.knative.dev\', \'autoscaling.knative.dev/container-concurrency-target-percentage\': \'70\', \'autoscaling.knative.dev/initial-scale\': \'1\', \'autoscaling.knative.dev/max-scale-down-rate\': \'1.1\', \'autoscaling.knative.dev/max-scale-up-rate\': \'2\', \'autoscaling.knative.dev/metric\': \'mean_pod_latency_ms_v2\', \'autoscaling.knative.dev/panic-threshold-percentage\': \'650\', \'autoscaling.knative.dev/panic-window-percentage\': \'35\', \'autoscaling.knative.dev/scale-down-delay\': \'30s\', \'autoscaling.knative.dev/scale-to-zero-grace-period\': \'10m\', \'autoscaling.knative.dev/stable-window\': \'180s\', \'autoscaling.knative.dev/target\': \'3700\', \'autoscaling.knative.dev/target-burst-capacity\': \'-1\', \'autoscaling.knative.dev/tick-interval\': \'15s\', \'features.knative.dev/http-full-duplex\': \'Enabled\', \'networking.knative.dev/ingress-class\': \'istio.ingress.networking.knative.dev\'}, \'creationTimestamp\': \'2025-02-14T20:58:56Z\', \'finalizers\': [\'inferenceservice.finalizers\'], \'generation\': 1, \'labels\': {\'knative.coreweave.cloud/ingress\': \'istio.ingress.networking.knative.dev\', \'prometheus.k.chaiverse.com\': \'true\', \'qos.coreweave.cloud/latency\': \'low\'}, \'managedFields\': [{\'apiVersion\': \'serving.kserve.io/v1beta1\', \'fieldsType\': \'FieldsV1\', \'fieldsV1\': {\'f:metadata\': {\'f:annotations\': {\'.\': {}, \'f:autoscaling.knative.dev/class\': {}, \'f:autoscaling.knative.dev/container-concurrency-target-percentage\': {}, \'f:autoscaling.knative.dev/initial-scale\': {}, \'f:autoscaling.knative.dev/max-scale-down-rate\': {}, \'f:autoscaling.knative.dev/max-scale-up-rate\': {}, 
\'f:autoscaling.knative.dev/metric\': {}, \'f:autoscaling.knative.dev/panic-threshold-percentage\': {}, \'f:autoscaling.knative.dev/panic-window-percentage\': {}, \'f:autoscaling.knative.dev/scale-down-delay\': {}, \'f:autoscaling.knative.dev/scale-to-zero-grace-period\': {}, \'f:autoscaling.knative.dev/stable-window\': {}, \'f:autoscaling.knative.dev/target\': {}, \'f:autoscaling.knative.dev/target-burst-capacity\': {}, \'f:autoscaling.knative.dev/tick-interval\': {}, \'f:features.knative.dev/http-full-duplex\': {}, \'f:networking.knative.dev/ingress-class\': {}}, \'f:labels\': {\'.\': {}, \'f:knative.coreweave.cloud/ingress\': {}, \'f:prometheus.k.chaiverse.com\': {}, \'f:qos.coreweave.cloud/latency\': {}}}, \'f:spec\': {\'.\': {}, \'f:predictor\': {\'.\': {}, \'f:affinity\': {\'.\': {}, \'f:nodeAffinity\': {\'.\': {}, \'f:tion\': {}, \'f:requiredDuringSchedulingIgnoredDuringExecution\': {}}}, \'f:containerConcurrency\': {}, \'f:containers\': {}, \'f:imagePullSecrets\': {}, \'f:maxReplicas\': {}, \'f:minReplicas\': {}, \'f:timeout\': {}, \'f:volumes\': {}}}}, \'manager\': \'OpenAPI-Generator\', \'operation\': \'Update\', \'time\': \'2025-02-14T20:58:56Z\'}, {\'apiVersion\': \'serving.kserve.io/v1beta1\', \'fieldsType\': \'FieldsV1\', \'fieldsV1\': {\'f:metadata\': {\'f:finalizers\': {\'.\': {}, \'v:"inferenceservice.finalizers"\': {}}}}, \'manager\': \'manager\', \'operation\': \'Update\', \'time\': \'2025-02-14T20:58:56Z\'}, {\'apiVersion\': \'serving.kserve.io/v1beta1\', \'fieldsType\': \'FieldsV1\', \'fieldsV1\': {\'f:status\': {\'.\': {}, \'f:components\': {\'.\': {}, \'f:predictor\': {\'.\': {}, \'f:latestCreatedRevision\': {}}}, \'f:conditions\': {}, \'f:modelStatus\': {\'.\': {}, \'f:lastFailureInfo\': {\'.\': {}, \'f:exitCode\': {}, \'f:message\': {}, \'f:reason\': {}}, \'f:states\': {\'.\': {}, \'f:activeModelState\': {}, \'f:targetModelState\': {}}, \'f:transitionStatus\': {}}, \'f:observedGeneration\': {}}}, \'manager\': \'manager\', \'operation\': 
\'Update\', \'subresource\': \'status\', \'time\': \'2025-02-14T21:00:11Z\'}], \'name\': \'chaiml-llama-8b-multih-78780-v32-profiler\', \'namespace\': \'tenant-chaiml-guanaco\', \'resourceVersion\': \'276914497\', \'uid\': \'6e823aa9-1464-4ba3-8aec-cac370b37b0d\'}, \'spec\': {\'predictor\': {\'affinity\': {\'nodeAffinity\': {\'tion\': [{\'preference\': {\'matchExpressions\': [{\'key\': \'topology.kubernetes.io/region\', \'operator\': \'In\', \'values\': [\'ORD1\']}]}, \'weight\': 5}], \'requiredDuringSchedulingIgnoredDuringExecution\': {\'nodeSelectorTerms\': [{\'matchExpressions\': [{\'key\': \'gpu.nvidia.com/class\', \'operator\': \'In\', \'values\': [\'RTX_A5000\']}]}]}}}, \'containerConcurrency\': 0, \'containers\': [{\'env\': [{\'name\': \'MAX_TOKEN_INPUT\', \'value\': \'1024\'}, {\'name\': \'BEST_OF\', \'value\': \'1\'}, {\'name\': \'TEMPERATURE\', \'value\': \'1.0\'}, {\'name\': \'PRESENCE_PENALTY\', \'value\': \'0.0\'}, {\'name\': \'FREQUENCY_PENALTY\', \'value\': \'0.0\'}, {\'name\': \'TOP_P\', \'value\': \'1.0\'}, {\'name\': \'MIN_P\', \'value\': \'0.0\'}, {\'name\': \'TOP_K\', \'value\': \'40\'}, {\'name\': \'STOPPING_WORDS\', \'value\': \'["\\\\\\\\n"]\'}, {\'name\': \'MAX_TOKENS\', \'value\': \'1\'}, {\'name\': \'MAX_BATCH_SIZE\', \'value\': \'128\'}, {\'name\': \'URL_ROUTE\', \'value\': \'GPT-J-6B-lit-v2\'}, {\'name\': \'OBJ_ACCESS_KEY_ID\', \'value\': \'LETMTTRMLFFAMTBK\'}, {\'name\': \'OBJ_SECRET_ACCESS_KEY\', \'value\': \'VwwZaqefOOoaouNxUk03oUmK9pVEfruJhjBHPGdgycK\'}, {\'name\': \'OBJ_ENDPOINT\', \'value\': \'https://accel-object.ord1.coreweave.com\'}, {\'name\': \'TENSORIZER_URI\', \'value\': \'s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32\'}, {\'name\': \'RESERVE_MEMORY\', \'value\': \'2048\'}, {\'name\': \'DOWNLOAD_TO_LOCAL\', \'value\': \'/dev/shm/model_cache\'}, {\'name\': \'NUM_GPUS\', \'value\': \'1\'}, {\'name\': \'MK1_MKML_LICENSE_KEY\', \'valueFrom\': {\'secretKeyRef\': {\'key\': \'key\', \'name\': \'mkml-license-key\'}}}], 
\'image\': \'gcr.io/chai-959f8/chai-guanaco/mkml:mkml_v0.11.12_dg\', \'imagePullPolicy\': \'IfNotPresent\', \'name\': \'kserve-container\', \'readinessProbe\': {\'exec\': {\'command\': [\'cat\', \'/tmp/ready\']}, \'failureThreshold\': 1, \'initialDelaySeconds\': 10, \'periodSeconds\': 10, \'successThreshold\': 1, \'timeoutSeconds\': 5}, \'resources\': {\'limits\': {\'cpu\': \'2\', \'memory\': \'12Gi\', \'nvidia.com/gpu\': \'1\'}, \'requests\': {\'cpu\': \'2\', \'memory\': \'12Gi\', \'nvidia.com/gpu\': \'1\'}}, \'volumeMounts\': [{\'mountPath\': \'/dev/shm\', \'name\': \'shared-memory-cache\'}]}], \'imagePullSecrets\': [{\'name\': \'docker-creds\'}], \'maxReplicas\': 1, \'minReplicas\': 1, \'timeout\': 60, \'volumes\': [{\'emptyDir\': {\'medium\': \'Memory\'}, \'name\': \'shared-memory-cache\'}]}}, \'status\': {\'components\': {\'predictor\': {\'latestCreatedRevision\': \'chaiml-llama-8b-multih-78780-v32-profiler-predictor-00001\'}}, \'conditions\': [{\'lastTransitionTime\': \'2025-02-14T21:00:11Z\', \'reason\': \'PredictorConfigurationReady not ready\', \'severity\': \'Info\', \'status\': \'False\', \'type\': \'LatestDeploymentReady\'}, {\'lastTransitionTime\': \'2025-02-14T21:00:11Z\', \'message\': \'Revision "chaiml-llama-8b-multih-78780-v32-profiler-predictor-00001" failed with message: Container failed with: uantization_profile=s0, all_reduce_profile=None, kv_cache_profile=None, calibration_samples=-1, sampling=SamplingParameters(temperature=1.0, top_p=1.0, min_p=0.0, top_k=40, max_input_tokens=1024, max_tokens=1, stop=[\\\'\\\\n\\\'], eos_token_ids=[], frequency_penalty=0.0, presence_penalty=0.0, reward_enabled=True, num_samples=1, reward_max_token_input=256, drop_incomplete_sentences=True, profile=False), url_route=GPT-J-6B-lit-v2, tensorizer_uri=s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32, s3_creds=S3Credentials(s3_access_key_id=\\\'LETMTTRMLFFAMTBK\\\', s3_secret_access_key=\\\'VwwZaqefOOoaouNxUk03oUmK9pVEfruJhjBHPGdgycK\\\', 
s3_endpoint=\\\'https://accel-object.ord1.coreweave.com\\\', s3_uncached_endpoint=\\\'https://object.ord1.coreweave.com\\\'), local_folder=/dev/shm/model_cache)\\n[INFO] Initialized device rank 0\\nTraceback (most recent call last):\\n File "/code/mkml_inference_service/main.py", line 95, in <module>\\n model.load()\\n File "/code/mkml_inference_service/main.py", line 31, in load\\n self.engine = mkml_backend.AsyncInferenceService.from_folder(settings, settings.local_folder)\\n File "/code/mkml_inference_service/mkml_backend.py", line 49, in from_folder\\n return service._from_folder(settings, folder)\\n File "/code/mkml_inference_service/mkml_backend.py", line 71, in _from_folder\\n engine = mkml.ModelForInference.from_pretrained(\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/inference.py", line 66, in from_pretrained\\n manifold = TensorManifold(model_path, tensor_parallel_size, batching_config, profile, s3_config)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/manifold.py", line 152, in __init__\\n self.model_actor.load(model_path, profile)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/manifold.py", line 63, in load\\n Factory = get_model_factory(self.config)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/instrument.py", line 65, in get_model_factory\\n raise NotImplementedError(config.architectures)\\nNotImplementedError: [\\\'MultiHeadLlamaClassifier\\\']\\n.\', \'reason\': \'RevisionFailed\', \'severity\': \'Info\', \'status\': \'False\', \'type\': \'PredictorConfigurationReady\'}, {\'lastTransitionTime\': \'2025-02-14T21:00:11Z\', \'message\': \'Configuration "chaiml-llama-8b-multih-78780-v32-profiler-predictor" does not have any ready Revision.\', \'reason\': \'RevisionMissing\', \'status\': \'False\', \'type\': \'PredictorReady\'}, {\'lastTransitionTime\': \'2025-02-14T21:00:11Z\', \'message\': \'Configuration "chaiml-llama-8b-multih-78780-v32-profiler-predictor" does not have any ready 
Revision.\', \'reason\': \'RevisionMissing\', \'severity\': \'Info\', \'status\': \'False\', \'type\': \'PredictorRouteReady\'}, {\'lastTransitionTime\': \'2025-02-14T21:00:11Z\', \'message\': \'Configuration "chaiml-llama-8b-multih-78780-v32-profiler-predictor" does not have any ready Revision.\', \'reason\': \'RevisionMissing\', \'status\': \'False\', \'type\': \'Ready\'}, {\'lastTransitionTime\': \'2025-02-14T21:00:11Z\', \'reason\': \'PredictorRouteReady not ready\', \'severity\': \'Info\', \'status\': \'False\', \'type\': \'RoutesReady\'}], \'modelStatus\': {\'lastFailureInfo\': {\'exitCode\': 1, \'message\': \'uantization_profile=s0, all_reduce_profile=None, kv_cache_profile=None, calibration_samples=-1, sampling=SamplingParameters(temperature=1.0, top_p=1.0, min_p=0.0, top_k=40, max_input_tokens=1024, max_tokens=1, stop=[\\\'\\\\n\\\'], eos_token_ids=[], frequency_penalty=0.0, presence_penalty=0.0, reward_enabled=True, num_samples=1, reward_max_token_input=256, drop_incomplete_sentences=True, profile=False), url_route=GPT-J-6B-lit-v2, tensorizer_uri=s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32, s3_creds=S3Credentials(s3_access_key_id=\\\'LETMTTRMLFFAMTBK\\\', s3_secret_access_key=\\\'VwwZaqefOOoaouNxUk03oUmK9pVEfruJhjBHPGdgycK\\\', s3_endpoint=\\\'https://accel-object.ord1.coreweave.com\\\', s3_uncached_endpoint=\\\'https://object.ord1.coreweave.com\\\'), local_folder=/dev/shm/model_cache)\\n[INFO] Initialized device rank 0\\nTraceback (most recent call last):\\n File "/code/mkml_inference_service/main.py", line 95, in <module>\\n model.load()\\n File "/code/mkml_inference_service/main.py", line 31, in load\\n self.engine = mkml_backend.AsyncInferenceService.from_folder(settings, settings.local_folder)\\n File "/code/mkml_inference_service/mkml_backend.py", line 49, in from_folder\\n return service._from_folder(settings, folder)\\n File "/code/mkml_inference_service/mkml_backend.py", line 71, in _from_folder\\n engine = 
mkml.ModelForInference.from_pretrained(\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/inference.py", line 66, in from_pretrained\\n manifold = TensorManifold(model_path, tensor_parallel_size, batching_config, profile, s3_config)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/manifold.py", line 152, in __init__\\n self.model_actor.load(model_path, profile)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/manifold.py", line 63, in load\\n Factory = get_model_factory(self.config)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/instrument.py", line 65, in get_model_factory\\n raise NotImplementedError(config.architectures)\\nNotImplementedError: [\\\'MultiHeadLlamaClassifier\\\']\\n\', \'reason\': \'ModelLoadFailed\'}, \'states\': {\'activeModelState\': \'\', \'targetModelState\': \'FailedToLoad\'}, \'transitionStatus\': \'BlockedByFailedLoad\'}, \'observedGeneration\': 1}}')
run pipeline stage %s
Running pipeline stage MKMLProfilerDeleter
Skipping teardown as no inference service was successfully deployed
Pipeline stage MKMLProfilerDeleter completed in 0.12s
Shutdown handler de-registered
Shutdown handler registered
run pipeline %s
run pipeline stage %s
Running pipeline stage MKMLProfilerDeleter
Skipping teardown as no inference service was successfully deployed
Pipeline stage MKMLProfilerDeleter completed in 0.15s
run pipeline stage %s
Running pipeline stage MKMLProfilerTemplater
Pipeline stage MKMLProfilerTemplater completed in 0.11s
run pipeline stage %s
Running pipeline stage MKMLProfilerDeployer
Creating inference service chaiml-llama-8b-multih-78780-v32-profiler
Waiting for inference service chaiml-llama-8b-multih-78780-v32-profiler to be ready
Tearing down inference service chaiml-llama-8b-multih-78780-v32-profiler
%s, retrying in %s seconds...
Creating inference service chaiml-llama-8b-multih-78780-v32-profiler
Waiting for inference service chaiml-llama-8b-multih-78780-v32-profiler to be ready
Tearing down inference service chaiml-llama-8b-multih-78780-v32-profiler
%s, retrying in %s seconds...
Creating inference service chaiml-llama-8b-multih-78780-v32-profiler
Waiting for inference service chaiml-llama-8b-multih-78780-v32-profiler to be ready
Tearing down inference service chaiml-llama-8b-multih-78780-v32-profiler
clean up pipeline due to error=DeploymentError('Timeout to start the InferenceService chaiml-llama-8b-multih-78780-v32-profiler. The InferenceService is as following: {\'apiVersion\': \'serving.kserve.io/v1beta1\', \'kind\': \'InferenceService\', \'metadata\': {\'annotations\': {\'autoscaling.knative.dev/class\': \'hpa.autoscaling.knative.dev\', \'autoscaling.knative.dev/container-concurrency-target-percentage\': \'70\', \'autoscaling.knative.dev/initial-scale\': \'1\', \'autoscaling.knative.dev/max-scale-down-rate\': \'1.1\', \'autoscaling.knative.dev/max-scale-up-rate\': \'2\', \'autoscaling.knative.dev/metric\': \'mean_pod_latency_ms_v2\', \'autoscaling.knative.dev/panic-threshold-percentage\': \'650\', \'autoscaling.knative.dev/panic-window-percentage\': \'35\', \'autoscaling.knative.dev/scale-down-delay\': \'30s\', \'autoscaling.knative.dev/scale-to-zero-grace-period\': \'10m\', \'autoscaling.knative.dev/stable-window\': \'180s\', \'autoscaling.knative.dev/target\': \'3700\', \'autoscaling.knative.dev/target-burst-capacity\': \'-1\', \'autoscaling.knative.dev/tick-interval\': \'15s\', \'features.knative.dev/http-full-duplex\': \'Enabled\', \'networking.knative.dev/ingress-class\': \'istio.ingress.networking.knative.dev\'}, \'creationTimestamp\': \'2025-02-14T21:29:57Z\', \'finalizers\': [\'inferenceservice.finalizers\'], \'generation\': 1, \'labels\': {\'knative.coreweave.cloud/ingress\': \'istio.ingress.networking.knative.dev\', \'prometheus.k.chaiverse.com\': \'true\', \'qos.coreweave.cloud/latency\': \'low\'}, \'managedFields\': [{\'apiVersion\': \'serving.kserve.io/v1beta1\', \'fieldsType\': \'FieldsV1\', \'fieldsV1\': {\'f:metadata\': {\'f:annotations\': {\'.\': {}, \'f:autoscaling.knative.dev/class\': {}, \'f:autoscaling.knative.dev/container-concurrency-target-percentage\': {}, \'f:autoscaling.knative.dev/initial-scale\': {}, \'f:autoscaling.knative.dev/max-scale-down-rate\': {}, \'f:autoscaling.knative.dev/max-scale-up-rate\': {}, 
\'f:autoscaling.knative.dev/metric\': {}, \'f:autoscaling.knative.dev/panic-threshold-percentage\': {}, \'f:autoscaling.knative.dev/panic-window-percentage\': {}, \'f:autoscaling.knative.dev/scale-down-delay\': {}, \'f:autoscaling.knative.dev/scale-to-zero-grace-period\': {}, \'f:autoscaling.knative.dev/stable-window\': {}, \'f:autoscaling.knative.dev/target\': {}, \'f:autoscaling.knative.dev/target-burst-capacity\': {}, \'f:autoscaling.knative.dev/tick-interval\': {}, \'f:features.knative.dev/http-full-duplex\': {}, \'f:networking.knative.dev/ingress-class\': {}}, \'f:labels\': {\'.\': {}, \'f:knative.coreweave.cloud/ingress\': {}, \'f:prometheus.k.chaiverse.com\': {}, \'f:qos.coreweave.cloud/latency\': {}}}, \'f:spec\': {\'.\': {}, \'f:predictor\': {\'.\': {}, \'f:affinity\': {\'.\': {}, \'f:nodeAffinity\': {\'.\': {}, \'f:tion\': {}, \'f:requiredDuringSchedulingIgnoredDuringExecution\': {}}}, \'f:containerConcurrency\': {}, \'f:containers\': {}, \'f:imagePullSecrets\': {}, \'f:maxReplicas\': {}, \'f:minReplicas\': {}, \'f:timeout\': {}, \'f:volumes\': {}}}}, \'manager\': \'OpenAPI-Generator\', \'operation\': \'Update\', \'time\': \'2025-02-14T21:29:57Z\'}, {\'apiVersion\': \'serving.kserve.io/v1beta1\', \'fieldsType\': \'FieldsV1\', \'fieldsV1\': {\'f:metadata\': {\'f:finalizers\': {\'.\': {}, \'v:"inferenceservice.finalizers"\': {}}}}, \'manager\': \'manager\', \'operation\': \'Update\', \'time\': \'2025-02-14T21:29:57Z\'}, {\'apiVersion\': \'serving.kserve.io/v1beta1\', \'fieldsType\': \'FieldsV1\', \'fieldsV1\': {\'f:status\': {\'.\': {}, \'f:components\': {\'.\': {}, \'f:predictor\': {\'.\': {}, \'f:latestCreatedRevision\': {}}}, \'f:conditions\': {}, \'f:modelStatus\': {\'.\': {}, \'f:lastFailureInfo\': {\'.\': {}, \'f:exitCode\': {}, \'f:message\': {}, \'f:reason\': {}}, \'f:states\': {\'.\': {}, \'f:activeModelState\': {}, \'f:targetModelState\': {}}, \'f:transitionStatus\': {}}, \'f:observedGeneration\': {}}}, \'manager\': \'manager\', \'operation\': 
\'Update\', \'subresource\': \'status\', \'time\': \'2025-02-14T21:31:37Z\'}], \'name\': \'chaiml-llama-8b-multih-78780-v32-profiler\', \'namespace\': \'tenant-chaiml-guanaco\', \'resourceVersion\': \'276943751\', \'uid\': \'b1a2fa84-16aa-4e6a-be43-736b0e9bb227\'}, \'spec\': {\'predictor\': {\'affinity\': {\'nodeAffinity\': {\'tion\': [{\'preference\': {\'matchExpressions\': [{\'key\': \'topology.kubernetes.io/region\', \'operator\': \'In\', \'values\': [\'ORD1\']}]}, \'weight\': 5}], \'requiredDuringSchedulingIgnoredDuringExecution\': {\'nodeSelectorTerms\': [{\'matchExpressions\': [{\'key\': \'gpu.nvidia.com/class\', \'operator\': \'In\', \'values\': [\'RTX_A5000\']}]}]}}}, \'containerConcurrency\': 0, \'containers\': [{\'env\': [{\'name\': \'MAX_TOKEN_INPUT\', \'value\': \'1024\'}, {\'name\': \'BEST_OF\', \'value\': \'1\'}, {\'name\': \'TEMPERATURE\', \'value\': \'1.0\'}, {\'name\': \'PRESENCE_PENALTY\', \'value\': \'0.0\'}, {\'name\': \'FREQUENCY_PENALTY\', \'value\': \'0.0\'}, {\'name\': \'TOP_P\', \'value\': \'1.0\'}, {\'name\': \'MIN_P\', \'value\': \'0.0\'}, {\'name\': \'TOP_K\', \'value\': \'40\'}, {\'name\': \'STOPPING_WORDS\', \'value\': \'["\\\\\\\\n"]\'}, {\'name\': \'MAX_TOKENS\', \'value\': \'1\'}, {\'name\': \'MAX_BATCH_SIZE\', \'value\': \'128\'}, {\'name\': \'URL_ROUTE\', \'value\': \'GPT-J-6B-lit-v2\'}, {\'name\': \'OBJ_ACCESS_KEY_ID\', \'value\': \'LETMTTRMLFFAMTBK\'}, {\'name\': \'OBJ_SECRET_ACCESS_KEY\', \'value\': \'VwwZaqefOOoaouNxUk03oUmK9pVEfruJhjBHPGdgycK\'}, {\'name\': \'OBJ_ENDPOINT\', \'value\': \'https://accel-object.ord1.coreweave.com\'}, {\'name\': \'TENSORIZER_URI\', \'value\': \'s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32\'}, {\'name\': \'RESERVE_MEMORY\', \'value\': \'2048\'}, {\'name\': \'DOWNLOAD_TO_LOCAL\', \'value\': \'/dev/shm/model_cache\'}, {\'name\': \'NUM_GPUS\', \'value\': \'1\'}, {\'name\': \'MK1_MKML_LICENSE_KEY\', \'valueFrom\': {\'secretKeyRef\': {\'key\': \'key\', \'name\': \'mkml-license-key\'}}}], 
\'image\': \'gcr.io/chai-959f8/chai-guanaco/mkml:mkml_v0.11.12_dg\', \'imagePullPolicy\': \'IfNotPresent\', \'name\': \'kserve-container\', \'readinessProbe\': {\'exec\': {\'command\': [\'cat\', \'/tmp/ready\']}, \'failureThreshold\': 1, \'initialDelaySeconds\': 10, \'periodSeconds\': 10, \'successThreshold\': 1, \'timeoutSeconds\': 5}, \'resources\': {\'limits\': {\'cpu\': \'2\', \'memory\': \'12Gi\', \'nvidia.com/gpu\': \'1\'}, \'requests\': {\'cpu\': \'2\', \'memory\': \'12Gi\', \'nvidia.com/gpu\': \'1\'}}, \'volumeMounts\': [{\'mountPath\': \'/dev/shm\', \'name\': \'shared-memory-cache\'}]}], \'imagePullSecrets\': [{\'name\': \'docker-creds\'}], \'maxReplicas\': 1, \'minReplicas\': 1, \'timeout\': 60, \'volumes\': [{\'emptyDir\': {\'medium\': \'Memory\'}, \'name\': \'shared-memory-cache\'}]}}, \'status\': {\'components\': {\'predictor\': {\'latestCreatedRevision\': \'chaiml-llama-8b-multih-78780-v32-profiler-predictor-00001\'}}, \'conditions\': [{\'lastTransitionTime\': \'2025-02-14T21:31:37Z\', \'reason\': \'PredictorConfigurationReady not ready\', \'severity\': \'Info\', \'status\': \'False\', \'type\': \'LatestDeploymentReady\'}, {\'lastTransitionTime\': \'2025-02-14T21:31:37Z\', \'message\': \'Revision "chaiml-llama-8b-multih-78780-v32-profiler-predictor-00001" failed with message: Container failed with: uantization_profile=s0, all_reduce_profile=None, kv_cache_profile=None, calibration_samples=-1, sampling=SamplingParameters(temperature=1.0, top_p=1.0, min_p=0.0, top_k=40, max_input_tokens=1024, max_tokens=1, stop=[\\\'\\\\n\\\'], eos_token_ids=[], frequency_penalty=0.0, presence_penalty=0.0, reward_enabled=True, num_samples=1, reward_max_token_input=256, drop_incomplete_sentences=True, profile=False), url_route=GPT-J-6B-lit-v2, tensorizer_uri=s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32, s3_creds=S3Credentials(s3_access_key_id=\\\'LETMTTRMLFFAMTBK\\\', s3_secret_access_key=\\\'VwwZaqefOOoaouNxUk03oUmK9pVEfruJhjBHPGdgycK\\\', 
s3_endpoint=\\\'https://accel-object.ord1.coreweave.com\\\', s3_uncached_endpoint=\\\'https://object.ord1.coreweave.com\\\'), local_folder=/dev/shm/model_cache)\\n[INFO] Initialized device rank 0\\nTraceback (most recent call last):\\n File "/code/mkml_inference_service/main.py", line 95, in <module>\\n model.load()\\n File "/code/mkml_inference_service/main.py", line 31, in load\\n self.engine = mkml_backend.AsyncInferenceService.from_folder(settings, settings.local_folder)\\n File "/code/mkml_inference_service/mkml_backend.py", line 49, in from_folder\\n return service._from_folder(settings, folder)\\n File "/code/mkml_inference_service/mkml_backend.py", line 71, in _from_folder\\n engine = mkml.ModelForInference.from_pretrained(\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/inference.py", line 66, in from_pretrained\\n manifold = TensorManifold(model_path, tensor_parallel_size, batching_config, profile, s3_config)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/manifold.py", line 152, in __init__\\n self.model_actor.load(model_path, profile)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/manifold.py", line 63, in load\\n Factory = get_model_factory(self.config)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/instrument.py", line 65, in get_model_factory\\n raise NotImplementedError(config.architectures)\\nNotImplementedError: [\\\'MultiHeadLlamaClassifier\\\']\\n.\', \'reason\': \'RevisionFailed\', \'severity\': \'Info\', \'status\': \'False\', \'type\': \'PredictorConfigurationReady\'}, {\'lastTransitionTime\': \'2025-02-14T21:31:37Z\', \'message\': \'Configuration "chaiml-llama-8b-multih-78780-v32-profiler-predictor" does not have any ready Revision.\', \'reason\': \'RevisionMissing\', \'status\': \'False\', \'type\': \'PredictorReady\'}, {\'lastTransitionTime\': \'2025-02-14T21:31:37Z\', \'message\': \'Configuration "chaiml-llama-8b-multih-78780-v32-profiler-predictor" does not have any ready 
Revision.\', \'reason\': \'RevisionMissing\', \'severity\': \'Info\', \'status\': \'False\', \'type\': \'PredictorRouteReady\'}, {\'lastTransitionTime\': \'2025-02-14T21:31:37Z\', \'message\': \'Configuration "chaiml-llama-8b-multih-78780-v32-profiler-predictor" does not have any ready Revision.\', \'reason\': \'RevisionMissing\', \'status\': \'False\', \'type\': \'Ready\'}, {\'lastTransitionTime\': \'2025-02-14T21:31:37Z\', \'reason\': \'PredictorRouteReady not ready\', \'severity\': \'Info\', \'status\': \'False\', \'type\': \'RoutesReady\'}], \'modelStatus\': {\'lastFailureInfo\': {\'exitCode\': 1, \'message\': \'uantization_profile=s0, all_reduce_profile=None, kv_cache_profile=None, calibration_samples=-1, sampling=SamplingParameters(temperature=1.0, top_p=1.0, min_p=0.0, top_k=40, max_input_tokens=1024, max_tokens=1, stop=[\\\'\\\\n\\\'], eos_token_ids=[], frequency_penalty=0.0, presence_penalty=0.0, reward_enabled=True, num_samples=1, reward_max_token_input=256, drop_incomplete_sentences=True, profile=False), url_route=GPT-J-6B-lit-v2, tensorizer_uri=s3://guanaco-mkml-models/chaiml-llama-8b-multih-78780-v32, s3_creds=S3Credentials(s3_access_key_id=\\\'LETMTTRMLFFAMTBK\\\', s3_secret_access_key=\\\'VwwZaqefOOoaouNxUk03oUmK9pVEfruJhjBHPGdgycK\\\', s3_endpoint=\\\'https://accel-object.ord1.coreweave.com\\\', s3_uncached_endpoint=\\\'https://object.ord1.coreweave.com\\\'), local_folder=/dev/shm/model_cache)\\n[INFO] Initialized device rank 0\\nTraceback (most recent call last):\\n File "/code/mkml_inference_service/main.py", line 95, in <module>\\n model.load()\\n File "/code/mkml_inference_service/main.py", line 31, in load\\n self.engine = mkml_backend.AsyncInferenceService.from_folder(settings, settings.local_folder)\\n File "/code/mkml_inference_service/mkml_backend.py", line 49, in from_folder\\n return service._from_folder(settings, folder)\\n File "/code/mkml_inference_service/mkml_backend.py", line 71, in _from_folder\\n engine = 
mkml.ModelForInference.from_pretrained(\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/inference.py", line 66, in from_pretrained\\n manifold = TensorManifold(model_path, tensor_parallel_size, batching_config, profile, s3_config)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/manifold.py", line 152, in __init__\\n self.model_actor.load(model_path, profile)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/manifold.py", line 63, in load\\n Factory = get_model_factory(self.config)\\n File "/opt/conda/lib/python3.10/site-packages/mk1/flywheel/instrument.py", line 65, in get_model_factory\\n raise NotImplementedError(config.architectures)\\nNotImplementedError: [\\\'MultiHeadLlamaClassifier\\\']\\n\', \'reason\': \'ModelLoadFailed\'}, \'states\': {\'activeModelState\': \'\', \'targetModelState\': \'Pending\'}, \'transitionStatus\': \'InProgress\'}, \'observedGeneration\': 1}}')
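The `NotImplementedError: ['MultiHeadLlamaClassifier']` in the traceback above is raised by an architecture-dispatch step: the serving engine looks up `config.architectures` in a registry of model factories and fails loudly when no factory is registered for that architecture. A minimal sketch of that dispatch pattern follows; all names here are hypothetical stand-ins, since the actual `mk1.flywheel` registry is not visible in this log.

```python
# Hypothetical sketch of an architecture -> factory dispatch table, mirroring
# the failure mode in the traceback: an unregistered architecture raises
# NotImplementedError carrying the offending architecture list.

class LlamaFactory:
    """Stand-in factory for a supported architecture (illustrative only)."""


# Registry mapping architecture names (as found in a model's config) to
# factory classes. Only architectures listed here can be served.
_MODEL_FACTORIES = {
    "LlamaForCausalLM": LlamaFactory,
}


def get_model_factory(architectures):
    """Return the factory for the first supported architecture in the list."""
    for arch in architectures:
        factory = _MODEL_FACTORIES.get(arch)
        if factory is not None:
            return factory
    # No registered factory: fail with the architecture list, which is the
    # shape of the error seen above for ['MultiHeadLlamaClassifier'].
    raise NotImplementedError(architectures)
```

Under this reading, the deployment failed not because the weights were corrupt but because the inference image (`mkml_v0.11.12_dg`) simply has no factory registered for the `MultiHeadLlamaClassifier` architecture.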
run pipeline stage %s
Running pipeline stage MKMLProfilerDeleter
Skipping teardown as no inference service was successfully deployed
Pipeline stage MKMLProfilerDeleter completed in 0.13s
Shutdown handler de-registered
chaiml-llama-8b-multih_78780_v32 status is now inactive due to auto deactivation: removed underperforming models
chaiml-llama-8b-multih_78780_v32 status is now torndown due to DeploymentManager action