Commit graph

17 commits

Author SHA1 Message Date
PanQiWei
e5f874e5af add fused attention injection logic to llama 2023-08-07 13:45:37 +08:00
PanQiWei
07e06fa08c make compatible with older transformers version 2023-05-15 13:26:18 +08:00
PanQiWei
2273f9ef39 refactor file structure for triton kernels 2023-05-14 11:49:10 +08:00
PanQiWei
fef1a4fe4b make code clean and extendable 2023-05-12 20:11:55 +08:00
PanQiWei
f159aeabb6 refactor .from_quantized api and improve model loading strategy 2023-05-12 18:09:50 +08:00
TheBloke
1b3329b399 Fix 'groupsize' -> 'group_size' in all other .py files. I haven't touched any CUDA kernels in case there's any complexity there I don't understand 2023-05-05 14:44:16 +01:00
qwopqwop200
208d660920 fix bug 2023-05-04 10:04:00 +09:00
qwopqwop200
f51a92ed79 support faster and model load strict 2023-05-04 09:53:28 +09:00
qwopqwop200
d8707f92a9 support fused_attn 2023-05-02 21:54:15 +09:00
qwopqwop200
f47322f073 fix bug 2023-05-02 21:14:27 +09:00
qwopqwop200
a6d4f5c091 fix bug 2023-05-02 19:19:04 +09:00
qwopqwop200
1388acac94 fix bug 2023-05-02 19:13:13 +09:00
qwopqwop200
f51f763fde fused attn, fused mlp apply 2023-05-02 18:51:04 +09:00
PanQiWei
b490ab004e remove override of _resize_attention_mask for llama and opt 2023-04-28 23:08:42 +08:00
PanQiWei
a2abff983e support dispatch layers to different devices when loading pretrained model before quantization 2023-04-27 02:24:08 +08:00
PanQiWei
a830a62bc3 fix bugs for attention_mask and position_ids 2023-04-20 18:32:21 +08:00
PanQiWei
229b61e20e first init 2023-04-14 01:09:40 +08:00