00:00:00 --- log: started forth/19.07.14
00:03:07 --- quit: rdrop-exit (Quit: Lost terminal)
00:04:50 the problem is more the generation of VM code
00:05:40 as on a small 32-bit system, the VM code consists of 8 or 16-bit tokens interspersed with 8, 16, and 32-bit data fields
00:05:59 I would have to add extra NOP tokens to make sure that the data fields are aligned
00:06:09 that shouldn't be too hard to implement
00:06:36 the real problem is all the "userspace" code I have written which is not designed with alignment in mind
00:09:29 Couldn't unaligned accesses be split up into 2 aligned accesses plus some bit twiddling? Wouldn't be fast, but it'd work (unless you're doing concurrent stuff).
00:09:49 I'm already doing that for 16-bit tokens
00:10:03 because I don't know whether they're 16-bit until I read the first byte
00:10:24 only afterwards do I read the second byte
00:10:55 but for data field reading, I'd rather just use the nop padding approach
00:11:07 it would still have a performance hit
00:11:23 but only for fields that otherwise would be unaligned
00:12:01 actually
00:12:07 I'm wrong
00:12:11 it'd probably actually be faster
00:12:27 because for the nops it'd have to test each one for whether it was 8-bit or 16-bit
00:12:54 mind you for 16-bit fields there would be at most one nop
00:13:27 but it would be at least one conditional branch (for the test) and one function call (for calling the nop primitive function)
00:14:07 whereas data field access your way could be accessed without any branches at ll
00:14:10 *all
00:15:01 know what
00:15:07 screw the whole unaligned access thing
00:15:41 I'm just going to make all multibyte memory accesses be bytes bitshifted and ored together
00:15:50 it'll slow things down
00:16:01 but I won't have to worry about unaligned access
00:18:32 okay, I'm gonna hit the sack now
00:18:33 bbl
00:22:13 --- join: proteusguy (~proteusgu@cm-58-10-208-146.revip7.asianet.co.th) joined #forth
00:22:13 --- mode: ChanServ set +v 
proteusguy
01:22:09 tabemann: Hmm, RETRO forth uses the Nga VM, which requires 32 bit-aligned fields. It might work pretty well with M0.
01:23:01 --- join: dys (~dys@tmo-114-89.customers.d1-online.com) joined #forth
03:59:18 --- quit: jedb (Remote host closed the connection)
03:59:31 --- join: jedb (~jedb@103.254.153.113) joined #forth
05:02:15 --- join: dddddd (~dddddd@unaffiliated/dddddd) joined #forth
05:08:44 --- quit: jedb (Ping timeout: 246 seconds)
05:10:32 --- join: jedb (~jedb@103.254.153.113) joined #forth
10:16:44 --- quit: dys (Ping timeout: 246 seconds)
10:16:52 --- join: dys (~dys@tmo-080-166.customers.d1-online.com) joined #forth
12:00:25 --- quit: gravicappa (Ping timeout: 245 seconds)
13:38:20 --- quit: xek (Ping timeout: 245 seconds)
14:10:47 --- quit: tabemann (Ping timeout: 276 seconds)
14:25:35 --- join: tabemann (~tabemann@71-13-2-250.static.ftbg.wi.charter.com) joined #forth
15:05:00 --- quit: tabemann (Ping timeout: 245 seconds)
15:50:24 --- quit: jedb (Ping timeout: 245 seconds)
15:56:08 --- join: jedb (~jedb@103.254.153.113) joined #forth
16:07:00 --- quit: john_cephalopoda (Ping timeout: 252 seconds)
16:20:44 --- join: john_cephalopoda (~john@unaffiliated/john-cephalopoda/x-6407167) joined #forth
16:21:31 --- join: rdrop-exit (~markwilli@112.201.174.189) joined #forth
16:34:27 c[] Good morning Forthers
16:45:31 --- join: tabemann (~tabemann@rrcs-162-155-170-75.central.biz.rr.com) joined #forth
17:13:34 good evening rdrop-exit
17:13:49 Hi crc :)
17:18:27 hey guys
17:18:36 hi tabemann
17:18:37 I've thought of something
17:18:45 oh oh
17:18:49 ;-)
17:19:09 with the VM, I make it alignment-independent, with the only cost being that I load cells and shit one byte at a time on the M0
17:19:46 it'll make it slower on M0 without making other platforms have to change just to make M0 and the like happy
17:22:49 the performance impact will be identical to the performance impact of running on a big-endian system
17:28:22 --- join: dave0 
(~dave0@069.d.003.ncl.iprimus.net.au) joined #forth
17:32:29 What are the alignment requirements of the M0?
17:32:51 * alignment constraints
17:32:55 they're absolute - no unaligned access is allowed at ll
17:33:13 *all
17:34:04 Ok but a byte load on a byte address is an aligned access
17:34:13 yes
17:34:52 I just checked - M3 only allows unaligned access on certain instructions, such as LDR and STR, but that shouldn't be too hard to work around
17:36:01 it appears M4 is the same
17:36:27 with restriction that unaligned access across memory map boundaries is "unpredictable"
17:36:36 which I assume also applies to M3
17:39:55 Does it also have 16-bit loads and stores?
17:41:58 If it does that gives you 4 combinations for an alignment-neutral 32-bit cell load and store
17:43:30 actually 3 combinations
17:45:28 back
17:45:32 it's weird
17:46:03 the M3 docs I see clearly indicate that unaligned 16-bit accesses are allowed, if the instructions they specify are halfword ones as they seem to be
17:46:27 but the M4 docs I see refer solely to LDR and STR, which are 32-bit accesses
17:46:58 but it makes no sense as to why 32-bit unaligned access would be allowed but 16-bit would not be
17:48:09 okay, it just omitted them
17:48:36 because now from looking at some different M4 docs I see it referring to "LDR, LDRH, STR, STRH"
17:50:57 Yes, 3 combinations
17:54:04 I figure that just breaking up memory accesses by default on systems like M0 makes the most sense because when I thought of it adding padding tokens would actually have much more of a performance hit
17:54:20 because for each padding token it needs to decide whether the token's 8-bit or 16-bit
17:54:31 and it needs to call a no-op function
17:54:45 and then it has to loop through the instrunctions
17:55:03 so you've got at least four extra branches per no-op
17:55:22 whereas simply breaking up memory accesses can be done with no extra branches at all
17:55:42 s/instrunctions/tokens
17:59:48 Is it cheaper to check the 
alignment and branch to one of the 3 combinations or to always assume unaligned?
18:00:09 checking the alignment means at least one branch
18:00:18 which means at least one chance of a branch prediction miss
18:00:53 whereas assuming unaligned means one can access memory with no branch prediction misses at all
18:01:31 If you check and branch it would be something like this (in C):
18:01:43 switch(a&3){
18:01:43 case 0: return _ld32(a);
18:01:43 case 2: return _ld16(a)|_ld16(a+2)<<16;
18:01:43 default: /* 1,3 */ return ld8(a)|_ld16(a+1)<<8|ld8(a+3)<<24;
18:01:43 }
18:02:27 Sorry about the formatting
18:02:38 that kind of thing can probably be done most efficiently with a jump table
18:02:56 If you go branchless you'd only do the default
18:03:03 yes
18:03:31 I'd have to study what is faster before I commit to any design
18:03:59 a simple conditional branch approach would almost certainly be too slow
18:04:21 because then you'd have two conditional branches, so two opportunities for a prediction miss
18:05:03 Best to just test and see
18:05:04 I don't know what the performance impact of doing a jump table approach would be though
18:05:09 you could do byte access and combine them to larger types
18:05:24 dave0: that's what I was exactly thinking of originally
18:05:25 just always do #3
18:07:11 it sounds slower, but branch prediction misses are likely even slower
18:07:35 Actually the default case will only work for 1 or 3, you need byte access for all 4 bytes if you don't have the branch
18:07:51 yes
18:09:13 I need to get a Cortex M0 board
18:09:21 so I can test this in its natural habitat
18:09:35 an emulator obviously isn't gonna work
18:09:55 Since variables are cell aligned, the branching version might be worthwhile
18:11:46 You'll need to test
18:12:10 Case 0 should be by far the most common
18:12:43 I should ask Matthias what a good M0 board to try out is, and then load Mecrisp-Stellaris on it so I can do some profiling
18:14:04 Your misaligned cell loads 
and stores should be relatively rare
18:15:36 cell align your header and code addresses as well
18:15:45 actually, I just thought of something
18:16:00 oh oh
18:16:02 after tokens that take parameters, it automatically should add padding
18:16:23 so it would have switch statement like you mention
18:16:38 but it would simply be ip += foo
18:16:54 to force data fields into alignment
18:17:13 there would be a branch, of course
18:17:18 actually
18:17:21 there might not be
18:17:50 I'm thinking I could push it into alignment without a branch
18:17:58 which would be excellent
18:18:04 Your tokens are 16 bit?
18:18:14 or 8 bit?
18:18:30 on small systems they're 8/16-bit - it loads the first byte, then decides whether to load the second byte
18:18:30 or both?
18:18:48 so the tokens need not be aligned
18:19:07 it's just the data fields which have alignment issues
18:20:28 It's invitable
18:20:35 inevitable
18:22:05 but if I can figure out a way to calculate padding without branches I should be gold
18:23:40 You don't need branches to calculate alignment
18:23:58 I know, I'm thinking of using MUX
18:24:13 in my Forth test implementation
18:25:10 9
18:25:10 a aligned Round up to a multiple of the given power-of-2, i.e.
18:25:10 b align to the given power-of-2.
18:25:10 c
18:25:23 : aligned ( u pwr2 -- u' ) `1- mask or 1+ ;inline
18:25:52 mask is just an alias for 1-
18:26:30 what's `1- ?
18:26:31 `1- is ( x1 x2 -- x1-1 x2 )
18:27:52 if your is a literal, you can fold the constant
18:28:15 thanks!
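A minimal C sketch of the two branchless ideas discussed above: building a 32-bit cell from four byte loads (a byte load is always aligned, so it is safe on the M0 at any address), and padding the instruction pointer up to a cell boundary in the style of `ip += foo`. The names `load32_bytes`, `pad_to`, and `fetch_field32` are hypothetical, a little-endian layout is assumed, and none of this is taken from the actual VM source.

```c
#include <stddef.h>
#include <stdint.h>

/* Alignment-neutral little-endian 32-bit load: four byte loads, no branches. */
static uint32_t load32_bytes(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Number of padding bytes needed to bring ip up to a multiple of pwr2
   (pwr2 must be a power of two). Unsigned wraparound keeps it branchless. */
static size_t pad_to(size_t ip, size_t pwr2)
{
    return (0 - ip) & (pwr2 - 1);
}

/* Hypothetical data-field fetch: skip padding, read the cell, advance ip. */
static uint32_t fetch_field32(const uint8_t *code, size_t *ip)
{
    *ip += pad_to(*ip, 4);
    uint32_t value = load32_bytes(code + *ip);
    *ip += 4;
    return value;
}
```

Once the field is padded into alignment, the byte-wise load could in principle be swapped for a single aligned 32-bit load, provided the code area itself starts on a cell boundary; which is faster on real M0 hardware would need measuring.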
18:28:37 Sure
18:28:47 that is really helpful actually
18:29:23 so I can add automatic padding to data fields in the whole VM arch
18:30:16 I'll need to update the rest of my code that specifies data fields to also pad, but that shouldn't be too much trouble
18:31:39 bbiab
18:37:21 BTW, my "aligned" is unrelated to the ANS "aligned", you might want to rename to something else if you want to avoid confusion with the standard "aligned"
18:38:05 I doubt my alignment word matches any standard either.
18:38:07 It might.
18:38:41 Mine's just ALIGN, and it's ( old mask -- new).
18:38:52 back
18:39:17 yeah, I'm calling it in my code ALIGNED-TO
18:39:21 You give it your value and the power of 2 you want it aligned to.
18:39:38 My align is different
18:39:57 0 shadow Host VM RAM - Dictionary - Alignment
18:39:57 1
18:39:57 2 realign Align the dictionary pointer up to a given power-of-2
18:39:57 3 boundary.
18:39:57 4
18:39:59 5 align Align the dictionary pointer up to a cell boundary.
18:40:02 6
18:40:07 Assembly primitive. Basically adds mask-1 and ands with ~(mask-1).
18:40:12 0 source Host VM RAM - Dictionary - Alignment
18:40:12 1
18:40:12 2 : realign ( pwr2 -- ) here swap aligned dp ! ;
18:40:12 3
18:40:12 4 : align () cell realign ;
18:40:14 5
18:40:19 5 cell align
18:40:36 Ah, I see.
18:41:02 Yours is probably better - I probably use CELL over 90% of the time.
18:41:12 So having the dedicated word would likely pay off.
18:43:05 --- quit: reepca (Read error: Connection reset by peer)
18:44:23 At least CELL is a constant and not a literal.
18:45:39 Since I use SRT/NCI there's no difference between a compiled constant and a compiled literal
18:46:05 They both end up being |lit|xx|
18:47:05 Although I have a few variants of the lit opcode for size optimization
18:47:26 (this is on the host side Forth)
18:48:03 bbiab
18:53:49 Yes, I do too.
18:53:58 I have 4-byte and 8 byte.
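The two alignment words described above appear to compute the same result in different ways. A hedged C sketch, assuming the second argument is the power of two itself (as in `( u pwr2 -- u' )`); the function names and the `dp` variable are hypothetical stand-ins, not either author's actual code.

```c
#include <stdint.h>

/* Or-based round-up, mirroring `1- mask or 1+`: ((u - 1) | (pwr2 - 1)) + 1. */
static uintptr_t aligned_or(uintptr_t u, uintptr_t pwr2)
{
    return ((u - 1) | (pwr2 - 1)) + 1;
}

/* Add-and-mask round-up: add pwr2 - 1, then clear the low bits. */
static uintptr_t aligned_mask(uintptr_t u, uintptr_t pwr2)
{
    return (u + (pwr2 - 1)) & ~(pwr2 - 1);
}

/* Hypothetical dictionary pointer with a realign/align pair built on it. */
static uintptr_t dp;

static void realign(uintptr_t pwr2)
{
    dp = aligned_or(dp, pwr2);
}

static void align_cell(void)
{
    realign(sizeof(uintptr_t)); /* cell size assumed to be pointer-sized */
}
```

Both forms are branchless and leave an already aligned value unchanged; when the power of two is a compile-time constant, the `pwr2 - 1` term folds away, which matches the remark about folding the constant.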
18:54:19 And one for 8-byte floats, since those go directly to the float stack instead of the normal stack.
18:54:32 I'm still wondering if I'm going to pay a price for that one at some point.
18:55:27 I do have standard constants too, though.
18:55:51 That just take one cell (32-bit instruction cell) like any other word.
18:59:00 back
18:59:02 On the host I have 8 literal opcodes, -1, (4,8,16,32,64) bit, and inverted (8,16,32) bit
18:59:32 Too many really, I'll remove some eventually
19:00:01 in hashforth I have 8-bit, 16-bit, 32-bit, and 64-bit literals - it by default uses the smallest possible
19:00:07 (oops that's 9 not 8 different ones)
19:00:13 (note that all of these have sign extension, so -1 is 8-bit)
19:01:09 -1 is frequent enough that it's just an opcode
19:01:46 maybe I should just make opcodes for 1, 0, and -1
19:02:10 I have opcodes for all nibble literals 0..f
19:03:14 So my 9 count is really 25 counting all the nibble literals individually
19:04:05 Since my host SRT/NCI runs on a bytecoded VM I have 256 opcodes to play with
19:04:34 okay, I'm gonna head out - coffee shop is closing
19:04:36 I'll bbiab
19:05:06 Eventually I'll eliminate some of the literal opcodes when I want to reuse their slots for something else
19:06:03 (correction 24 counting the nibble literals)
19:06:55 -1, 0..f, lit8, lit16, lit32, lit64, ~lit8, ~lit16, ~lit32
19:08:23 I don't really need that many since my host assumes 64-bit POSIX, so space isn't tight
19:08:44 --- quit: tabemann (Ping timeout: 246 seconds)
19:10:11 --- quit: dddddd (Remote host closed the connection)
19:33:22 --- join: tabemann (~tabemann@2600:1700:7990:24e0:3163:3257:92d5:a5e2) joined #forth
19:54:10 tabemann, want a std M0 board or a M0+ for your testing ?
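The "smallest possible literal, with sign extension" rule mentioned at 19:00 could be sketched in C roughly as follows. `literal_bits` is a hypothetical helper; the 8/16/32/64-bit widths are the ones listed for hashforth, while the actual opcode encoding is not shown in the log and is not assumed here.

```c
#include <stdint.h>

/* Return the narrowest sign-extended literal width, in bits, that can
   represent v. Because the literals are sign-extended, -1 fits in 8 bits. */
static int literal_bits(int64_t v)
{
    if (v >= INT8_MIN && v <= INT8_MAX)
        return 8;
    if (v >= INT16_MIN && v <= INT16_MAX)
        return 16;
    if (v >= INT32_MIN && v <= INT32_MAX)
        return 32;
    return 64;
}
```

A compiler built this way would emit, say, an 8-bit literal for -1 and a 16-bit literal for 128, trading a couple of comparisons at compile time for smaller VM code.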
19:54:16 g'day all
19:54:48 Hi tp
19:55:41 tabemann, a STM F0-discovery, or a STM F0-Nucleo board are both safe bets, the Discovery is older but all the pins are labelled unlike the Nucleo
19:55:48 g'day rdrop-exit
19:57:01 tabemann, the Nucleo is newer and has a 'usb bulk storage' drive that one is supposed to be able to 'drop' the binary into and it flashes the target M0, but it doesn't work with the Mecrisp-Stellaris binary
19:58:33 thanks
19:59:10 I've heard better things about Discovery boards than Nucleo boards in general, quality-wise
19:59:22 tabemann, yes, I agree totally
19:59:31 like the Discovery board I have here, the only complaint I've heard about it is that the pins are kinda short
19:59:49 the nucleo schematic is the worst schematic I have ever seen in 40 years of being a technician
20:00:26 okay, I'll try out the STM F0-Discovery board then
20:00:55 tabemann, absolutely, I was about to say the same thing. The nucleo pins are only a TINY bit longer tho
20:01:54 tabemann, there are two types of M0, the M0 and the M0+, the latter having all the low power stuff, a slightly slower clock, but a more modern chip
20:03:04 tabemann, examples of the M0+ are stm32L072
20:03:30 tabemann, if you ever get the low power bug, a M0 will only frustrate you
20:03:44 you want a M0+ for low power
20:03:50 which is the F0-Discovery?
20:03:54 --- quit: dave0 (Quit: dave's not here)
20:04:06 it's a stm32F030 or STM32F051
20:04:19 a standard M0
20:05:07 if you're after speed and not low power a F model is 48Mhz and overclockable to 96Mhz
20:05:28 the M0+ is 32Mhz std, I haven't tried to overclock one yet
20:07:13 note: the M0 can still do quite low power, ie 0.8mA at 48Mhz in standby, no delays when coming out of standby
20:07:50 but the M0+ stuff can do 0.4uA in best power saving mode etc
20:08:35 I'll just stick with the F0-Discovery
20:08:39 the very, very low power modes are all complex, ttmrichter will testify to that
20:08:48 always a safe bet
20:09:20 the STM32F051 is more capable than the STM32F030 which I believe is in some F0-Discoveries
20:14:37 https://www.st.com/en/evaluation-tools/stm32f0discovery.html < this has the stm32f051
20:18:07 tabemann, that's it, I have 6 of those units
20:19:04 I've worn out the reset button on two of them as I hit those switches thousands of times during a project
20:19:41 easily replaced but as I have plenty of boards I just grab another one
20:21:48 tabemann: There's also an 32F030DISCOVERY board.
20:22:07 And yes, low-power modes are a serious pain in the ass.
20:22:21 A perfect example of a cross-cutting concern that will touch literally every other piece of code you use.
20:22:35 --- quit: proteusguy (Ping timeout: 258 seconds)
20:22:44 ttmrichter, ah yes the 32F030DISCOVERY board is the one with the 32F030! the f0 disco always has a f051
20:22:57 Yep.
20:23:12 They changed their naming conventions mid-stream to help end the confusion of which chip is in which.
20:23:29 aha, I was still confused
20:25:05 I used to think that ST gave their nucleo board schematics to Interns, but now I think they just invite homeless druggies in off the street for a meal if they will do the schematic
20:25:59 why? would interns result in too high of a quality?
20:26:35 my guess would be that they outsourced it to people in bangladesh
20:26:37 yeah, or maybe the interns all left and went to work for NXP ?
20:27:35 whoever it was, they should be horse whipped and banned from a schematic capture system for the rest of their lives
20:30:21 they should be condemned to a purgatory of writing code in brainfuck for the rest of their lives
20:31:03 and not the easy kind of brainfuck where each cell contains a full machine word
20:31:15 but rather the kind where each cell contains a single byte
20:31:24 hahah
20:31:59 oh, and they must be forbidden from writing compilers that use brainfuck as a compilation target
20:32:06 as that'd make it too easy
20:33:06 you're cruel tabemann, cruel but fair!
20:34:05 if I truly wanted to be cruel, I'd make them write code in unlambda
20:34:54 making someone write code that solely used the S, K, and I combinators would surely drive them insane
20:36:39 I'm a technician, Forth is enough challenge for me :)
20:41:05 * tabemann can't wrap his brain around unlambda, for one
21:30:32 --- join: gravicappa (~gravicapp@h109-187-44-163.dyn.bashtel.ru) joined #forth
21:40:30 tabemann: If I was really cruel, I'd make people write code in PHP.
21:41:54 Unlambda and brainfuck are obviously unusable. PHP gives you the delusion of being useful until you're in too deep and can't pull yourself out.
21:49:55 --- join: reepca (~user@208.89.170.37) joined #forth
22:11:49 --- quit: kori (Ping timeout: 264 seconds)
22:20:07 --- quit: dys (Ping timeout: 268 seconds)
22:21:50 --- join: dys (~dys@tmo-122-231.customers.d1-online.com) joined #forth
22:25:39 --- join: kori (~kori@arrowheads/kori) joined #forth
22:32:26 are there some experimental forths out there that use a tree instead of a stack?
22:41:04 --- join: dave0 (~dave0@069.d.003.ncl.iprimus.net.au) joined #forth
22:42:02 re
22:46:25 --- join: proteusguy (~proteusgu@2403:6200:89a6:8231:e1ac:8d85:348c:762f) joined #forth
22:46:25 --- mode: ChanServ set +v proteusguy
22:50:04 dys: How would you picture that working?
22:51:43 --- quit: kori (Ping timeout: 250 seconds)
23:08:13 that's what I'm curious about :-)
23:08:13 in firmforth, words are compiled by building a hybrid data-flow control-flow graph
23:09:12 i intend to expose more of constructing this "intermediate language" to forth, but it doesn't fit the stack world very well :-/
23:11:15 just afraid of constructing trees explicitly why there is some elegant treeish forth already out there
23:11:44 s/why/while/
23:12:12 --- join: kori (~kori@arrowheads/kori) joined #forth
23:17:29 --- quit: kori (Read error: Connection reset by peer)
23:34:00 --- join: ttmrichter_ (~ttmrichte@185.94.228.156) joined #forth
23:36:37 --- quit: ttmrichter_ (Client Quit)
23:36:52 --- join: ttmrichter_ (~ttmrichte@2a05:1500:501:1:1c00:34ff:fe00:10c) joined #forth
23:39:29 --- join: kori (~kori@arrowheads/kori) joined #forth
23:45:07 --- quit: kori (Read error: Connection reset by peer)
23:50:41 --- join: kori (~kori@arrowheads/kori) joined #forth
23:50:50 --- quit: ttmrichter_ (*.net *.split)
23:50:51 --- quit: sigjuice (*.net *.split)
23:50:51 --- quit: dbucklin (*.net *.split)
23:50:57 --- quit: proteusguy (*.net *.split)
23:50:57 --- quit: john_cephalopoda (*.net *.split)
23:50:58 --- quit: nonlinear (*.net *.split)
23:50:59 --- quit: remexre (*.net *.split)
23:56:38 --- quit: dys (Ping timeout: 245 seconds)
23:59:59 --- log: ended forth/19.07.14